www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - interpolation proposals and safety

reply Bruce Carneal <bcarneal gmail.com> writes:
Are both of the interpolation proposals (1027 and 1036e) equally 
helpful when seeking to avoid SQL injection attacks, DOM XSS 
vulnerabilities and the like?

Is it really easy, trivial even, to button things up with either 
proposal or is one easier to use correctly than the other?

I am neither an SQL savant nor a Web programmer, but it seems 
like safety in this area would be a real plus for D going forward.
Dec 23 2023
parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Saturday, 23 December 2023 at 22:55:34 UTC, Bruce Carneal 
wrote:
 Is it really easy, trivial even, to button things up with 
 either proposal or is one easier to use correctly than the 
 other?
1027 makes it possible to do some cases correctly, but difficult to trust in the general case since it makes no attempt at type safety and its string cannot differentiate between user-injected strings and format string literals. So, when you process a 1027 style format string, and see a %, was that part of the string or was that injected by the compiler to indicate a param placeholder? What if the user forgets to escape something, or passes the wrong syntax as a custom specifier? These are all unforced errors in the design of 1027, that led to its DIP being rejected by community review. On the other hand, 1036e corrects these flaws, while adding the possibility for CTFE manipulation, aggregation, and verification of all string literals passed. I encourage everyone to look at the sample repository here: https://github.com/adamdruppe/interpolation-examples/ Several of the use cases selected for that specifically demonstrate how it gives the users the convenient syntax they expect from string interpolation, yet actually lowers to the correct semantics for each specialized problem domain. to do, it works quite easily for it. and processed in library code, including compile-time verification associated with the data types passed. provides to be compatible with legacy functions in a zero-runtime-cost manner. in the previous examples to use the industry-standard GNU gettext library, coupled with automatic aggregation of translatable strings at compile time, to provide full context to non-developers to add new language packs at run time. The next three examples are directly relevant to your question, and address common problems web developers face, where security problems are often introduced where strings are convenient, but no longer appropriate for correctness. demonstrated techniques, to make a directly-manipulable high-level object out of what looks to be a simple, familiar string. Since it works at a high level, aware of the surrounding context, it ensures each injected component is encoded appropriately for that context. separating code and data - delegating the recombination of them to the database engine to do it safely and correctly, yet appearing to the user to be a convenient mixture of the two! Notice how the usage example, at the top level of the repository, *looks like* string interpolation, yet the implementation, in the `lib` folder, actually binds the data to a prepared statement in a structured way, like the guides say you are supposed to! again, separating HTML structure from added data and ensuring correct encodings and valid data positioning is done in all contexts. With CTFE validation, it prevents common mistakes that can manifest as bugs or exploitable holes in production, and by working on a high level, using object representations instead of raw strings, it ensures all semantic invariants are maintained from creation to consumption. It goes beyond just bringing web best practices to the D programming language - it also enables innovation by allowing coupling of these security guidelines and development best practices with D's unique features for static analysis and compile-time processing. Similar examples could be written for shell scripting, json, and more, but I thought this was enough to make the point and demonstrate the relevant patterns. By the end of this year, when this new feature is merged, D will cement its position as an innovating pioneer, learning the lessons from the past and applying their best libraries in a whole new way.
Dec 23 2023
parent reply kdevel <kdevel vogtner.de> writes:
On Saturday, 23 December 2023 at 23:33:31 UTC, Adam D Ruppe wrote:

 1027 makes it possible to do some cases correctly, but 
 difficult to trust in the general case since it makes no 
 attempt at type safety and its string cannot differentiate 
 between user-injected strings and format string literals.
As I will point out below, the current implementation (DMD v2.109.1) doesn't do either. At least not in the HTML case.
 [...]

 On the other hand, 1036e corrects these flaws, while adding the 
 possibility for CTFE manipulation, aggregation, and 
 verification of all string literals passed.

 I encourage everyone to look at the sample repository here:

 https://github.com/adamdruppe/interpolation-examples/
 [...]

 again, separating HTML structure from added data and ensuring 
 correct encodings and valid data positioning is done in all 
 contexts. [...]
One way to "commit" a mistake is by omitting necessary parts. In A CGI context the webserver is reading the stdout of the CGI application. The original example (with comments stripped) is: ``` import lib.html; void main() { string name = "<bar>"; auto element = i"<foo>$(name)</foo>".html; assert(element.tagName == "foo"); import std.stdio; writeln(element.toString()); } ``` Now i forget to `import lib.html` and to call `html` on the IES: ``` void main() { string name = "<script>alert(-1)</script>"; auto element = i"<foo>$(name)</foo>"; import std.stdio; writeln(element); } ``` ``` $ dmd htmli.d $ ./htmli <foo><script>alert(-1)</script></foo> ``` `name` may have been a URL parameter or may be part of the POST body. The important part is that it is attacker supplied and controlled. `writeln` should not print unadorned interpolated string expressions.
Aug 22 2024
next sibling parent Grandoz <mofarky0 gmail.com> writes:
On Thursday, 22 August 2024 at 19:34:32 UTC, kdevel wrote:
 On Saturday, 23 December 2023 at 23:33:31 UTC, Adam D Ruppe 
 wrote:

 [...]
As I will point out below, the current implementation (DMD v2.109.1) doesn't do either. At least not in the HTML case. [...]
Merci
Aug 29 2024
prev sibling next sibling parent reply Paul Backus <snarwin gmail.com> writes:
On Thursday, 22 August 2024 at 19:34:32 UTC, kdevel wrote:
 Now i forget to `import lib.html` and to call `html` on the IES:

 ```
 void main() {
 	string name = "<script>alert(-1)</script>";
 	auto element = i"<foo>$(name)</foo>";

 	import std.stdio;
 	writeln(element);
 }
 ```

 ```
 $ dmd htmli.d
 $ ./htmli
 <foo><script>alert(-1)</script></foo>
 ```

 `name` may have been a URL parameter or may be part of the POST 
 body. The important part is that it is attacker supplied and 
 controlled.

 `writeln` should not print unadorned interpolated string 
 expressions.
The real problem here is that the type system does not distinguish between strings that are controlled by the user (and thus may contain malicious data) and strings that are controlled by the programmer. If you define a separate type for user-controlled strings, the mistake is easily caught at compile time: ```d struct UserString { string unwrap; disable string toString(); } void main() { auto name = UserString("<script>alert(-1)</script>"); auto element = i"<foo>$(name)</foo>"; import std.stdio; writeln(element); // Error: static assert: "UserString cannot be formatted // because its `toString` is marked with ` disable`" } ```
Aug 29 2024
parent reply kdevel <kdevel vogtner.de> writes:
On Thursday, 29 August 2024 at 14:18:48 UTC, Paul Backus wrote:
 [...]

 `writeln` should not print unadorned interpolated string 
 expressions.
The real problem here is that the type system does not distinguish between strings that are controlled by the user (and thus may contain malicious data) and strings that are controlled by the programmer. If you define a separate type for user-controlled strings, the mistake is easily caught at compile time:
Sure. But if you forget to do so, you have a "typesafe" implementation of XSS. Using the facilities of 1036e in a careless way is actually unsafe. Ideally compilation of such unadorned writes would fail.
Aug 30 2024
next sibling parent Paul Backus <snarwin gmail.com> writes:
On Friday, 30 August 2024 at 11:18:10 UTC, kdevel wrote:
 On Thursday, 29 August 2024 at 14:18:48 UTC, Paul Backus wrote:
 The real problem here is that the type system does not 
 distinguish between strings that are controlled by the user 
 (and thus may contain malicious data) and strings that are 
 controlled by the programmer. If you define a separate type 
 for user-controlled strings, the mistake is easily caught at 
 compile time:
Sure. But if you forget to do so, you have a "typesafe" implementation of XSS. Using the facilities of 1036e in a careless way is actually unsafe. Ideally compilation of such unadorned writes would fail.
Interpolation is just syntax sugar. You can use it to build safe APIs, or unsafe ones. `writeln` is not a safe API (w.r.t. XSS), and will not magically become one just because you used interpolation to pass your arguments to it. If you were lead to believe that interpolation would somehow make existing unsafe APIs safe (from injection attacks, XSS, etc.), then you were misled.
Aug 30 2024
prev sibling parent Adam D Ruppe <destructionator gmail.com> writes:
On Friday, 30 August 2024 at 11:18:10 UTC, kdevel wrote:
 Ideally compilation of such unadorned writes would fail.
for what its worth, after starting to use this more, I'm actually inclined to agree with you. The idea behind 1036e is that it allows the language to provide the pieces that library authors can use to make things that do the right thing. Two existing library functions work with it out of the box in a reasonably good way: std.conv.text and std.stdio.writeln. I've seen people in the wild using std.conv.text to do things like make sql queries - exactly what you're not supposed to do. We want some way to do a toString, but it should be a bit of a pain: doing it the right thing should be the easy way. Some of this is just because library support is still a bit immature - I still haven't even put it in place in all the places arsd wants it - but part of it means committing to the change too. So like we currently have `db.query(string, args...)` and... maybe that should be deprecated, so people stop trying to do `db.query(text(istring))`. Maybe text(istring) itself should be deprecated. It is some breakage though that requires users update their compiler so that makes it more complicated. Also, you sometimes DO want a runtime string, so it probably more of a rename than a remove. I didn't think about using writeln directly in a cgi context. There's MUCH better ways to make cgi programs! And much better ways to construct html too. And those libraries tend to make you be more careful with these cases. But indeed, writeln bypasses all that. I do think this is a harder case though... outright banning writeln in these cases isn't super realistic so the question is where you balance people's capability of doing it wrong with the flexibility to build anything on top at all. A lot of it comes back to making the better way the easy way: use my cgi library, and `cgi.write(i"istring")` is a compile error. But it still doesn't prevent you from bypassing the library and doing it wrong. Still, my feeling is if you make the right way just as easy as the wrong way, at least it helps point people the right way.
Aug 30 2024
prev sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On Thursday, 22 August 2024 at 19:34:32 UTC, kdevel wrote:
 `writeln` should not print unadorned interpolated string 
 expressions.
I find this argument unconvincing. You can print anything with `writeln`. Even if an IES was something else that doesn't print nicely, *`writeln` will still print it*. The point of making IES play nice with `writeln` is that it is a major expectation of any kind of interpolation setup. People just expect to log interpolated sequences that have their stuff in it. These concerns don't translate to other domains. In most SQL libraries, there is not a best effort function that just attempts to translate all data to strings and executes it. Assigning the thing to a string doesn't work either. Basically, you found just a very narrow example that is unlikely to exist, but indeed might be confusing if an exact series of mistakes are made. Even without IES, a user is equally likely to use `writef` to make the same mistake. -Steve
Aug 29 2024
parent reply kdevel <kdevel vogtner.de> writes:
On Thursday, 29 August 2024 at 14:21:24 UTC, Steven Schveighoffer 
wrote:
 On Thursday, 22 August 2024 at 19:34:32 UTC, kdevel wrote:
 `writeln` should not print unadorned interpolated string 
 expressions.
I find this argument unconvincing. You can print anything with `writeln`. [...]
Not really, e.g. in the case of an object the class name will be printed instead of the potentially dangerous content: ``` import std.stdio; class C { string s; this (string s) { this.s = s; } } void main () { auto c = new C ("<script>alert(-1)</script>"); writeln (c); } ``` ``` $ dmd classprint.d $ ./classprint classprint.C ``` In a superior implementation of write(ln) this would simply also not compile. I mean there is a difference between printing the data payload to the output channel and OTOH dumping debug information to the developer.
 The point of making IES play nice with `writeln` is that it is 
 a major expectation of any kind of interpolation setup. People 
 just expect to log interpolated sequences that have their stuff 
 in it.
I don't know if you noticed your own wording: We are expecting to "log" IES data but not to "print" them to the output channel.
 [SQL]

 Basically, you found just a very narrow example that is 
 unlikely to exist, but indeed might be confusing if an exact 
 series of mistakes are made. Even without IES, a user is 
 equally likely to use `writef` to make the same mistake.
With post-1036e D the user has now three equally potent ways to shoot theirself in the foot: 1. ``` data = "alert (-1)"; writeln ("<script>" ~ data ~ "</script>"); ``` 2. ``` data = "alert (-1)"; writeln (format!"<script>%s</script>" (data)); ``` 3. ``` data = "alert (-1)"; writeln (i"<script>$(data)</script>"); ```
Aug 30 2024
next sibling parent kdevel <kdevel vogtner.de> writes:
On Friday, 30 August 2024 at 12:07:47 UTC, kdevel wrote:
 [...]
The examples should read like ``` auto data = "<script>alert (-1)</script>"; writeln ("<div>" ~ data ~ "</div>"); // 1. writeln (format!"<div>%s</div>" (data)); // 2. writeln (i"<div>$(data)</div>"); // 3. ```
Aug 30 2024
prev sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On Friday, 30 August 2024 at 12:07:47 UTC, kdevel wrote:
 On Thursday, 29 August 2024 at 14:21:24 UTC, Steven 
 Schveighoffer wrote:
 On Thursday, 22 August 2024 at 19:34:32 UTC, kdevel wrote:
 `writeln` should not print unadorned interpolated string 
 expressions.
I find this argument unconvincing. You can print anything with `writeln`. [...]
Not really, e.g. in the case of an object the class name will be printed instead of the potentially dangerous content: ``` import std.stdio; class C { string s; this (string s) { this.s = s; } } void main () { auto c = new C ("<script>alert(-1)</script>"); writeln (c); } ```
But it still prints something. I guarantee if someone could have figured out how to make it print all the fields via a runtime interface, this would have happened. In other words, the *expectation* is that the `toString` value is useful for string representation logging by default. The class name is the best you can do with a runtime-only interface by default, so that's what it is. It wasn't a conscious choice with HTML injection in mind.
 In a superior implementation of write(ln) this would simply 
 also not compile. I mean there is a difference between printing 
 the data payload to the output channel and OTOH dumping debug 
 information to the developer.
If `Object.toString()` didn't exist, then `writeln(someObject)` would by default print the class name via `typeid(someObject).toString()`. There is no chance this would be an error by default.
 The point of making IES play nice with `writeln` is that it is 
 a major expectation of any kind of interpolation setup. People 
 just expect to log interpolated sequences that have their 
 stuff in it.
I don't know if you noticed your own wording: We are expecting to "log" IES data but not to "print" them to the output channel.
Yes, I purposely said log, because that's the intention of `writeln`. It's logging the most useful data to the console. For structs, this is all the fields, and the struct name. For IES, this is what it does for string representation, because that's what people would expect. Use any other language with interpolation, and you will find it does the same thing when logging. Can you imagine what the blowback would be if `writeln(i"hello $(name)!");` was an error? If anything, this draws attention to the pitfalls of CGI in general. For instance, what if you call a function that logs something to the console? That gets included in your html.
 [SQL]

 Basically, you found just a very narrow example that is 
 unlikely to exist, but indeed might be confusing if an exact 
 series of mistakes are made. Even without IES, a user is 
 equally likely to use `writef` to make the same mistake.
With post-1036e D the user has now three equally potent ways to shoot theirself in the foot: 1. ``` data = "alert (-1)"; writeln ("<script>" ~ data ~ "</script>"); ``` 2. ``` data = "alert (-1)"; writeln (format!"<script>%s</script>" (data)); ``` 3. ``` data = "alert (-1)"; writeln (i"<script>$(data)</script>"); ```
4. ```d writefln("<script>%s</script>", data); ``` When you have a footgun cabinet that is almost full, I'm not sure adding one more is something to worry about. -Steve
Aug 30 2024
next sibling parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Fri, Aug 30, 2024 at 02:36:56PM +0000, Steven Schveighoffer via
Digitalmars-d wrote:
 On Friday, 30 August 2024 at 12:07:47 UTC, kdevel wrote:
[...]
 With post-1036e D the user has now three equally potent ways to
 shoot theirself in the foot:
 
 1.
 ```
      data = "alert (-1)";
      writeln ("<script>" ~ data ~ "</script>");
 ```
 
 2.
 ```
      data = "alert (-1)";
      writeln (format!"<script>%s</script>" (data));
 ```
 
 3.
 ```
      data = "alert (-1)";
      writeln (i"<script>$(data)</script>");
 ```
4. ```d writefln("<script>%s</script>", data); ```
[...] Actually, the footgun here is just one: writeln + CGI. The rest are merely specific instances of the same thing: writing unvetted data to your output stream. No amount of cleverness is going to prevent this; you, as the programmer, are responsible for making sure the program logic properly vets all data before sending them to the output stream. If some coder wannabe can't figure out how to properly segregate their input data from their output data, no amount of programming language features will be able to help them. The program logic has to be structured in such a way that *all* input data is properly escaped, or *all* output data is properly encoded. The latter is much harder; recoding input data is recommended. If the fundamental program logic is flawed, no magic feature will be able to fix the resulting problems. T -- Study gravitation, it's a field with a lot of potential.
Aug 30 2024
parent kdevel <kdevel vogtner.de> writes:
On Friday, 30 August 2024 at 15:09:39 UTC, H. S. Teoh wrote:
 The program logic has to be structured in such a way that *all* 
 input data is properly escaped,
What does that mean? Can, after such input escaping, Bobby Tables first name still be stored in the database?
 or *all* output data is properly encoded.  The latter is much 
 harder; recoding input data is recommended.
By whom? And "recode" (re-encode) into which code? The web is full of content recommending the opposite: - https://benhoyt.com/writings/dont-sanitize-do-escape/ "Every so often developers talk about “sanitizing user input” to prevent cross-site scripting attacks. This is well-intentioned, but leads to a false sense of security, and sometimes mangles perfectly good input." - https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html - https://security.stackexchange.com/questions/32394/when-to-escape-user-input (2013) "In general, you want to keep strings as strings, and delegate any encoding or escaping to specialized functions which do that well. For instance, for SQL, you use prepared statements. With HTML from a PHP context, you would use htmlspecialchars()." - https://lukeplant.me.uk/blog/posts/why-escape-on-input-is-a-bad-idea/ "The right way to handle issues with untrusted data is: Filter on input, escape on output This means that you validate or limit data that comes in (filter), but only transform (escape or encode) it at the point you are sending it as output to another system that requires the encoding. It has been standard best practice since just about forever [citation required]. [...] First of all, escape-on-input is just wrong – you've taken some input and applied some transformation that is totally irrelevant to that data. If, taking our example, you have some data collected by HTTP POST or GET parameters, applying HTML escaping to it is a layering violation – it mixes an output formatting concern into input handling. Layering violations make your code much harder to understand and maintain, because you have to take into account other layers instead of letting each component and layer do its own job. Doing things ‘right’ is very important, even if doing them ‘wrong’ seems to work and you are tempted to be dismissive of ‘theoretical’ concerns about purity etc. When you have to maintain code, you will be very glad if things are in the right place, and not full of hacks and surprises."
Aug 31 2024
prev sibling parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Friday, 30 August 2024 at 14:36:56 UTC, Steven Schveighoffer 
wrote:
 Yes, I purposely said log, because that's the intention of 
 `writeln`.
Gonna disagree with this: writeln's intention is to write a line to the stdout stream. It has no more semantic knowledge; it is fundamentally a low level function with a wide variety of high level uses. You can use it for logging to the console, but it might also go to a file, but it might also go to a user for interactive use ("Enter your name: "), and it might also be structured output intended to be consumed by another program: ./yourapp | whatever. CGI, on the low level, works as one of those unix pipelines. You ought to be more careful with that, but writeln has no idea how it is actually going to be used.
 For instance, what if you call a function that logs something 
 to the console? That gets included in your html.
They're supposed to use the stderr stream for that kind of thing, which is traditionally redirected to the server's log file. But yeah, writeln is a low level function and you have to treat it as such - limiting what it is allowed to do can help one use case but hurt another, so that's hard to say... but at the same time, users really shouldn't be encouraged to use it. There should be high level intent libs that do those jobs all better and easier.
Aug 30 2024
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On Friday, 30 August 2024 at 15:21:49 UTC, Adam D Ruppe wrote:
 On Friday, 30 August 2024 at 14:36:56 UTC, Steven Schveighoffer 
 wrote:
 Yes, I purposely said log, because that's the intention of 
 `writeln`.
Gonna disagree with this: writeln's intention is to write a line to the stdout stream. It has no more semantic knowledge; it is fundamentally a low level function with a wide variety of high level uses.
`writeln` has a *default mechanism* for writing out data. What does this default mechanism do? It dumps the contents of a struct. This screams "logging". If `writeln` only supported strings, or required opt-in to string conversion (i.e. erroring without explicit `toString` function), then it would be clearer. But I can't see any purpose for this default other than logging data when it's useful. But technically, this job belongs to `format`, so `writeln` indeed is a mechanism to output formatted data to a stream. It's just that the combination of formatting data with the default format being "contents of object" is best suited to logging. If one is worried about accidentally writing badly-formatted or accidentally revealing data, one shouldn't use `writeln`. This reminds me of javascript when you get a field on the screen and the text is `[Object object]` -Steve
Aug 30 2024