digitalmars.D - interpolation proposals and safety

Bruce Carneal (7/7) Dec 23 2023 Are both of the interpolation proposals (1027 and 1036e) equally

Adam D Ruppe (71/74) Dec 23 2023 1027 makes it possible to do some cases correctly, but difficult

kdevel (36/51) Aug 22 2024 As I will point out below, the current implementation (DMD

Grandoz (2/9) Aug 29 2024 Merci
Paul Backus (22/41) Aug 29 2024 The real problem here is that the type system does not

kdevel (5/16) Aug 30 2024 Sure. But if you forget to do so, you have a "typesafe"

Paul Backus (8/19) Aug 30 2024 Interpolation is just syntax sugar. You can use it to build safe
Adam D Ruppe (36/37) Aug 30 2024 for what its worth, after starting to use this more, I'm actually

Steven Schveighoffer (17/19) Aug 29 2024 I find this argument unconvincing.

kdevel (47/61) Aug 30 2024 Not really, e.g. in the case of an object the class name will be

kdevel (8/9) Aug 30 2024 The examples should read like
Steven Schveighoffer (32/91) Aug 30 2024 But it still prints something. I guarantee if someone could have

H. S. Teoh (18/44) Aug 30 2024 [...]

kdevel (42/46) Aug 31 2024 What does that mean? Can, after such input escaping, Bobby Tables

Adam D Ruppe (21/25) Aug 30 2024 Gonna disagree with this: writeln's intention is to write a line

Steven Schveighoffer (17/25) Aug 30 2024 `writeln` has a *default mechanism* for writing out data. What

Bruce Carneal <bcarneal gmail.com> writes:

Are both of the interpolation proposals (1027 and 1036e) equally 
helpful when seeking to avoid SQL injection attacks, DOM XSS 
vulnerabilities and the like?

Is it really easy, trivial even, to button things up with either 
proposal or is one easier to use correctly than the other?

I am neither an SQL savant nor a Web programmer, but it seems 
like safety in this area would be a real plus for D going forward.

Dec 23 2023

Adam D Ruppe <destructionator gmail.com> writes:

On Saturday, 23 December 2023 at 22:55:34 UTC, Bruce Carneal 
wrote:
 Is it really easy, trivial even, to button things up with 
 either proposal or is one easier to use correctly than the 
 other?

1027 makes it possible to do some cases correctly, but difficult 
to trust in the general case since it makes no attempt at type 
safety and its string cannot differentiate between user-injected 
strings and format string literals.

So, when you process a 1027 style format string, and see a %, was 
that part of the string or was that injected by the compiler to 
indicate a param placeholder? What if the user forgets to escape 
something, or passes the wrong syntax as a custom specifier? 
These are all unforced errors in the design of 1027, that led to 
its DIP being rejected by community review.

On the other hand, 1036e corrects these flaws, while adding the 
possibility for CTFE manipulation, aggregation, and verification 
of all string literals passed.

I encourage everyone to look at the sample repository here:

https://github.com/adamdruppe/interpolation-examples/

Several of the use cases selected for that specifically 
demonstrate how it gives the users the convenient syntax they 
expect from string interpolation, yet actually lowers to the 
correct semantics for each specialized problem domain.


to do, it works quite easily for it.


and processed in library code, including compile-time 
verification associated with the data types passed.


provides to be compatible with legacy functions in a 
zero-runtime-cost manner.


in the previous examples to use the industry-standard GNU gettext 
library, coupled with automatic aggregation of translatable 
strings at compile time, to provide full context to 
non-developers to add new language packs at run time.

The next three examples are directly relevant to your question, 
and address common problems web developers face, where security 
problems are often introduced where strings are convenient, but 
no longer appropriate for correctness.


demonstrated techniques, to make a directly-manipulable 
high-level object out of what looks to be a simple, familiar 
string. Since it works at a high level, aware of the surrounding 
context, it ensures each injected component is encoded 
appropriately for that context.


separating code and data - delegating the recombination of them 
to the database engine to do it safely and correctly, yet 
appearing to the user to be a convenient mixture of the two! 
Notice how the usage example, at the top level of the repository, 
*looks like* string interpolation, yet the implementation, in the 
`lib` folder, actually binds the data to a prepared statement in 
a structured way, like the guides say you are supposed to!


again, separating HTML structure from added data and ensuring 
correct encodings and valid data positioning is done in all 
contexts. With CTFE validation, it prevents common mistakes that 
can manifest as bugs or exploitable holes in production, and by 
working on a high level, using object representations instead of 
raw strings, it ensures all semantic invariants are maintained 
from creation to consumption. It goes beyond just bringing web 
best practices to the D programming language - it also enables 
innovation by allowing coupling of these security guidelines and 
development best practices with D's unique features for static 
analysis and compile-time processing.


Similar examples could be written for shell scripting, json, and 
more, but I thought this was enough to make the point and 
demonstrate the relevant patterns.

By the end of this year, when this new feature is merged, D will 
cement its position as an innovating pioneer, learning the 
lessons from the past and applying their best libraries in a 
whole new way.

Dec 23 2023

kdevel <kdevel vogtner.de> writes:

On Saturday, 23 December 2023 at 23:33:31 UTC, Adam D Ruppe wrote:

 1027 makes it possible to do some cases correctly, but 
 difficult to trust in the general case since it makes no 
 attempt at type safety and its string cannot differentiate 
 between user-injected strings and format string literals.

As I will point out below, the current implementation (DMD 
v2.109.1)
doesn't do either. At least not in the HTML case.

 [...]

 On the other hand, 1036e corrects these flaws, while adding the 
 possibility for CTFE manipulation, aggregation, and 
 verification of all string literals passed.

 I encourage everyone to look at the sample repository here:

 https://github.com/adamdruppe/interpolation-examples/

 [...]


 again, separating HTML structure from added data and ensuring 
 correct encodings and valid data positioning is done in all 
 contexts. [...]

One way to "commit" a mistake is by omitting necessary parts. In 
A CGI context the webserver is reading the stdout of the CGI 
application. The original example (with comments stripped) is:

```
import lib.html;

void main() {
    string name = "<bar>";
    auto element = i"<foo>$(name)</foo>".html;
    assert(element.tagName == "foo");

    import std.stdio;
    writeln(element.toString());

}
```

Now i forget to `import lib.html` and to call `html` on the IES:

```
void main() {
	string name = "<script>alert(-1)</script>";
	auto element = i"<foo>$(name)</foo>";

	import std.stdio;
	writeln(element);
}
```

```
$ dmd htmli.d
$ ./htmli
<foo><script>alert(-1)</script></foo>
```

`name` may have been a URL parameter or may be part of the POST 
body. The important part is that it is attacker supplied and 
controlled.

`writeln` should not print unadorned interpolated string 
expressions.

Aug 22 2024

Grandoz <mofarky0 gmail.com> writes:

On Thursday, 22 August 2024 at 19:34:32 UTC, kdevel wrote:
 On Saturday, 23 December 2023 at 23:33:31 UTC, Adam D Ruppe 
 wrote:

 [...]

 As I will point out below, the current implementation (DMD 
 v2.109.1)
 doesn't do either. At least not in the HTML case.

 [...]

Merci

Aug 29 2024

Paul Backus <snarwin gmail.com> writes:

On Thursday, 22 August 2024 at 19:34:32 UTC, kdevel wrote:
 Now i forget to `import lib.html` and to call `html` on the IES:

 ```
 void main() {
 	string name = "<script>alert(-1)</script>";
 	auto element = i"<foo>$(name)</foo>";

 	import std.stdio;
 	writeln(element);
 }
 ```

 ```
 $ dmd htmli.d
 $ ./htmli
 <foo><script>alert(-1)</script></foo>
 ```

 `name` may have been a URL parameter or may be part of the POST 
 body. The important part is that it is attacker supplied and 
 controlled.

 `writeln` should not print unadorned interpolated string 
 expressions.

The real problem here is that the type system does not 
distinguish between strings that are controlled by the user (and 
thus may contain malicious data) and strings that are controlled 
by the programmer. If you define a separate type for 
user-controlled strings, the mistake is easily caught at compile 
time:

```d
struct UserString
{
	string unwrap;
	 disable string toString();
}

void main() {
	auto name = UserString("<script>alert(-1)</script>");
	auto element = i"<foo>$(name)</foo>";

	import std.stdio;
	writeln(element);
	// Error: static assert:  "UserString cannot be formatted
	// because its `toString` is marked with ` disable`"
}
```

Aug 29 2024

kdevel <kdevel vogtner.de> writes:

On Thursday, 29 August 2024 at 14:18:48 UTC, Paul Backus wrote:
 [...]

 `writeln` should not print unadorned interpolated string 
 expressions.

 The real problem here is that the type system does not 
 distinguish between strings that are controlled by the user 
 (and thus may contain malicious data) and strings that are 
 controlled by the programmer. If you define a separate type for 
 user-controlled strings, the mistake is easily caught at 
 compile time:

Sure. But if you forget to do so, you have a "typesafe" 
implementation of XSS. Using the facilities of 1036e in a 
careless way is actually unsafe.

Ideally compilation of such unadorned writes would fail.

Aug 30 2024

Paul Backus <snarwin gmail.com> writes:

On Friday, 30 August 2024 at 11:18:10 UTC, kdevel wrote:
 On Thursday, 29 August 2024 at 14:18:48 UTC, Paul Backus wrote:
 The real problem here is that the type system does not 
 distinguish between strings that are controlled by the user 
 (and thus may contain malicious data) and strings that are 
 controlled by the programmer. If you define a separate type 
 for user-controlled strings, the mistake is easily caught at 
 compile time:

 Sure. But if you forget to do so, you have a "typesafe" 
 implementation of XSS. Using the facilities of 1036e in a 
 careless way is actually unsafe.

 Ideally compilation of such unadorned writes would fail.

Interpolation is just syntax sugar. You can use it to build safe 
APIs, or unsafe ones. `writeln` is not a safe API (w.r.t. XSS), 
and will not magically become one just because you used 
interpolation to pass your arguments to it.

If you were lead to believe that interpolation would somehow make 
existing unsafe APIs safe (from injection attacks, XSS, etc.), 
then you were misled.

Aug 30 2024

Adam D Ruppe <destructionator gmail.com> writes:

On Friday, 30 August 2024 at 11:18:10 UTC, kdevel wrote:
 Ideally compilation of such unadorned writes would fail.

for what its worth, after starting to use this more, I'm actually 
inclined to agree with you.

The idea behind 1036e is that it allows the language to provide 
the pieces that library authors can use to make things that do 
the right thing. Two existing library functions work with it out 
of the box in a reasonably good way: std.conv.text and 
std.stdio.writeln.

I've seen people in the wild using std.conv.text to do things 
like make sql queries - exactly what you're not supposed to do. 
We want some way to do a toString, but it should be a bit of a 
pain: doing it the right thing should be the easy way. Some of 
this is just because library support is still a bit immature - I 
still haven't even put it in place in all the places arsd wants 
it - but part of it means committing to the change too.

So like we currently have `db.query(string, args...)` and... 
maybe that should be deprecated, so people stop trying to do 
`db.query(text(istring))`. Maybe text(istring) itself should be 
deprecated. It is some breakage though that requires users update 
their compiler so that makes it more complicated. Also, you 
sometimes DO want a runtime string, so it probably more of a 
rename than a remove.

I didn't think about using writeln directly in a cgi context. 
There's MUCH better ways to make cgi programs! And much better 
ways to construct html too. And those libraries tend to make you 
be more careful with these cases. But indeed, writeln bypasses 
all that. I do think this is a harder case though... outright 
banning writeln in these cases isn't super realistic so the 
question is where you balance people's capability of doing it 
wrong with the flexibility to build anything on top at all.

A lot of it comes back to making the better way the easy way: use 
my cgi library, and `cgi.write(i"istring")` is a compile error. 
But it still doesn't prevent you from bypassing the library and 
doing it wrong. Still, my feeling is if you make the right way 
just as easy as the wrong way, at least it helps point people the 
right way.

Aug 30 2024

Steven Schveighoffer <schveiguy gmail.com> writes:

On Thursday, 22 August 2024 at 19:34:32 UTC, kdevel wrote:
 `writeln` should not print unadorned interpolated string 
 expressions.

I find this argument unconvincing.

You can print anything with `writeln`. Even if an IES was 
something else that doesn't print nicely, *`writeln` will still 
print it*.

The point of making IES play nice with `writeln` is that it is a 
major expectation of any kind of interpolation setup. People just 
expect to log interpolated sequences that have their stuff in it.

These concerns don't translate to other domains. In most SQL 
libraries, there is not a best effort function that just attempts 
to translate all data to strings and executes it.

Assigning the thing to a string doesn't work either.

Basically, you found just a very narrow example that is unlikely 
to exist, but indeed might be confusing if an exact series of 
mistakes are made. Even without IES, a user is equally likely to 
use `writef` to make the same mistake.

-Steve

Aug 29 2024

kdevel <kdevel vogtner.de> writes:

On Thursday, 29 August 2024 at 14:21:24 UTC, Steven Schveighoffer 
wrote:
 On Thursday, 22 August 2024 at 19:34:32 UTC, kdevel wrote:
 `writeln` should not print unadorned interpolated string 
 expressions.

 I find this argument unconvincing.

 You can print anything with `writeln`. [...]

Not really, e.g. in the case of an object the class name will be 
printed instead of the potentially dangerous content:

```
import std.stdio;

class C {
    string s;
    this (string s)
    {
       this.s = s;
    }
}

void main ()
{
    auto c = new C ("<script>alert(-1)</script>");
    writeln (c);
}
```

```
$ dmd classprint.d
$ ./classprint
classprint.C
```

In a superior implementation of write(ln) this would simply also 
not compile. I mean there is a difference between printing the 
data payload to the output channel and OTOH dumping debug 
information to the developer.

 The point of making IES play nice with `writeln` is that it is 
 a major expectation of any kind of interpolation setup. People 
 just expect to log interpolated sequences that have their stuff 
 in it.

I don't know if you noticed your own wording: We are expecting to 
"log" IES data but not to "print" them to the output channel.

 [SQL]

 Basically, you found just a very narrow example that is 
 unlikely to exist, but indeed might be confusing if an exact 
 series of mistakes are made. Even without IES, a user is 
 equally likely to use `writef` to make the same mistake.

With post-1036e D the user has now three equally potent ways to 
shoot theirself in the foot:

1.
```
      data = "alert (-1)";
      writeln ("<script>" ~ data ~ "</script>");
```

2.
```
      data = "alert (-1)";
      writeln (format!"<script>%s</script>" (data));
```

3.
```
      data = "alert (-1)";
      writeln (i"<script>$(data)</script>");
```

Aug 30 2024

kdevel <kdevel vogtner.de> writes:

On Friday, 30 August 2024 at 12:07:47 UTC, kdevel wrote:
 [...]

The examples should read like

```
      auto data = "<script>alert (-1)</script>";
      writeln ("<div>" ~ data ~ "</div>"); // 1.
      writeln (format!"<div>%s</div>" (data)); // 2.
      writeln (i"<div>$(data)</div>"); // 3.
```

Aug 30 2024

Steven Schveighoffer <schveiguy gmail.com> writes:

On Friday, 30 August 2024 at 12:07:47 UTC, kdevel wrote:
 On Thursday, 29 August 2024 at 14:21:24 UTC, Steven 
 Schveighoffer wrote:
 On Thursday, 22 August 2024 at 19:34:32 UTC, kdevel wrote:
 `writeln` should not print unadorned interpolated string 
 expressions.

 I find this argument unconvincing.

 You can print anything with `writeln`. [...]

 Not really, e.g. in the case of an object the class name will 
 be printed instead of the potentially dangerous content:

 ```
 import std.stdio;

 class C {
    string s;
    this (string s)
    {
       this.s = s;
    }
 }

 void main ()
 {
    auto c = new C ("<script>alert(-1)</script>");
    writeln (c);
 }
 ```

But it still prints something. I guarantee if someone could have 
figured out how to make it print all the fields via a runtime 
interface, this would have happened.

In other words, the *expectation* is that the `toString` value is 
useful for string representation logging by default. The class 
name is the best you can do with a runtime-only interface by 
default, so that's what it is. It wasn't a conscious choice with 
HTML injection in mind.

 In a superior implementation of write(ln) this would simply 
 also not compile. I mean there is a difference between printing 
 the data payload to the output channel and OTOH dumping debug 
 information to the developer.

If `Object.toString()` didn't exist, then `writeln(someObject)` 
would by default print the class name via 
`typeid(someObject).toString()`. There is no chance this would be 
an error by default.

 The point of making IES play nice with `writeln` is that it is 
 a major expectation of any kind of interpolation setup. People 
 just expect to log interpolated sequences that have their 
 stuff in it.

 I don't know if you noticed your own wording: We are expecting 
 to "log" IES data but not to "print" them to the output channel.

Yes, I purposely said log, because that's the intention of 
`writeln`. It's logging the most useful data to the console. For 
structs, this is all the fields, and the struct name.

For IES, this is what it does for string representation, because 
that's what people would expect. Use any other language with 
interpolation, and you will find it does the same thing when 
logging. Can you imagine what the blowback would be if 
`writeln(i"hello $(name)!");` was an error?

If anything, this draws attention to the pitfalls of CGI in 
general.

For instance, what if you call a function that logs something to 
the console? That gets included in your html.

 [SQL]

 Basically, you found just a very narrow example that is 
 unlikely to exist, but indeed might be confusing if an exact 
 series of mistakes are made. Even without IES, a user is 
 equally likely to use `writef` to make the same mistake.

 With post-1036e D the user has now three equally potent ways to 
 shoot theirself in the foot:

 1.
 ```
      data = "alert (-1)";
      writeln ("<script>" ~ data ~ "</script>");
 ```

 2.
 ```
      data = "alert (-1)";
      writeln (format!"<script>%s</script>" (data));
 ```

 3.
 ```
      data = "alert (-1)";
      writeln (i"<script>$(data)</script>");
 ```

4.
```d
writefln("<script>%s</script>", data);
```

When you have a footgun cabinet that is almost full, I'm not sure 
adding one more is something to worry about.

-Steve

Aug 30 2024

"H. S. Teoh" <hsteoh qfbox.info> writes:

On Fri, Aug 30, 2024 at 02:36:56PM +0000, Steven Schveighoffer via
Digitalmars-d wrote:
 On Friday, 30 August 2024 at 12:07:47 UTC, kdevel wrote:

[...]
 With post-1036e D the user has now three equally potent ways to
 shoot theirself in the foot:
 
 1.
 ```
      data = "alert (-1)";
      writeln ("<script>" ~ data ~ "</script>");
 ```
 
 2.
 ```
      data = "alert (-1)";
      writeln (format!"<script>%s</script>" (data));
 ```
 
 3.
 ```
      data = "alert (-1)";
      writeln (i"<script>$(data)</script>");
 ```

 
 4.
 ```d
 writefln("<script>%s</script>", data);
 ```

[...]

Actually, the footgun here is just one: writeln + CGI.  The rest are
merely specific instances of the same thing: writing unvetted data to
your output stream.  No amount of cleverness is going to prevent this;
you, as the programmer, are responsible for making sure the program
logic properly vets all data before sending them to the output stream.
If some coder wannabe can't figure out how to properly segregate their
input data from their output data, no amount of programming language
features will be able to help them.  The program logic has to be
structured in such a way that *all* input data is properly escaped, or
*all* output data is properly encoded.  The latter is much harder;
recoding input data is recommended.  If the fundamental program logic is
flawed, no magic feature will be able to fix the resulting problems.


T

-- 
Study gravitation, it's a field with a lot of potential.

Aug 30 2024

kdevel <kdevel vogtner.de> writes:

On Friday, 30 August 2024 at 15:09:39 UTC, H. S. Teoh wrote:
The program logic has to be structured in such a way that *all*
input data is properly escaped,

What does that mean? Can, after such input escaping, Bobby Tables
first name still be stored in the database?

or *all* output data is properly encoded. The latter is much
harder; recoding input data is recommended.

By whom? And "recode" (re-encode) into which code?

The web is full of content recommending the opposite:

- https://benhoyt.com/writings/dont-sanitize-do-escape/

"Every so often developers talk about “sanitizing user input” to
prevent cross-site scripting attacks. This is well-intentioned,
but leads to a false sense of security, and sometimes mangles
perfectly good input."

-
https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html

-
https://security.stackexchange.com/questions/32394/when-to-escape-user-input
(2013)

"In general, you want to keep strings as strings, and delegate
any encoding or escaping to specialized functions which do that
well. For instance, for SQL, you use prepared statements. With
HTML from a PHP context, you would use htmlspecialchars()."

-
https://lukeplant.me.uk/blog/posts/why-escape-on-input-is-a-bad-idea/

"The right way to handle issues with untrusted data is:

Filter on input, escape on output

This means that you validate or limit data that comes in
(filter), but only transform (escape or encode) it at the point
you are sending it as output to another system that requires the
encoding. It has been standard best practice since just about
forever [citation required].

[...]

First of all, escape-on-input is just wrong – you've taken some
input and applied some transformation that is totally irrelevant
to that data. If, taking our example, you have some data
collected by HTTP POST or GET parameters, applying HTML escaping
to it is a layering violation – it mixes an output formatting
concern into input handling. Layering violations make your code
much harder to understand and maintain, because you have to take
into account other layers instead of letting each component and
layer do its own job.

Doing things ‘right’ is very important, even if doing them
‘wrong’ seems to work and you are tempted to be dismissive of
‘theoretical’ concerns about purity etc. When you have to
maintain code, you will be very glad if things are in the right
place, and not full of hacks and surprises."

Aug 31 2024

Adam D Ruppe <destructionator gmail.com> writes:

On Friday, 30 August 2024 at 14:36:56 UTC, Steven Schveighoffer 
wrote:
 Yes, I purposely said log, because that's the intention of 
 `writeln`.

Gonna disagree with this: writeln's intention is to write a line 
to the stdout stream. It has no more semantic knowledge; it is 
fundamentally a low level function with a wide variety of high 
level uses.

You can use it for logging to the console, but it might also go 
to a file, but it might also go to a user for interactive use 
("Enter your name: "), and it might also be structured output 
intended to be consumed by another program: ./yourapp | whatever.

CGI, on the low level, works as one of those unix pipelines. You 
ought to be more careful with that, but writeln has no idea how 
it is actually going to be used.

 For instance, what if you call a function that logs something 
 to the console? That gets included in your html.

They're supposed to use the stderr stream for that kind of thing, 
which is traditionally redirected to the server's log file.

But yeah, writeln is a low level function and you have to treat 
it as such - limiting what it is allowed to do can help one use 
case but hurt another, so that's hard to say... but at the same 
time, users really shouldn't be encouraged to use it. There 
should be high level intent libs that do those jobs all better 
and easier.

Aug 30 2024

Steven Schveighoffer <schveiguy gmail.com> writes:

On Friday, 30 August 2024 at 15:21:49 UTC, Adam D Ruppe wrote:
 On Friday, 30 August 2024 at 14:36:56 UTC, Steven Schveighoffer 
 wrote:
 Yes, I purposely said log, because that's the intention of 
 `writeln`.

 Gonna disagree with this: writeln's intention is to write a 
 line to the stdout stream. It has no more semantic knowledge; 
 it is fundamentally a low level function with a wide variety of 
 high level uses.

`writeln` has a *default mechanism* for writing out data. What 
does this default mechanism do? It dumps the contents of a 
struct. This screams "logging".

If `writeln` only supported strings, or required opt-in to string 
conversion (i.e. erroring without explicit `toString` function), 
then it would be clearer. But I can't see any purpose for this 
default other than logging data when it's useful.

But technically, this job belongs to `format`, so `writeln` 
indeed is a mechanism to output formatted data to a stream. It's 
just that the combination of formatting data with the default 
format being "contents of object" is best suited to logging.

If one is worried about accidentally writing badly-formatted or 
accidentally revealing data, one shouldn't use `writeln`.

This reminds me of javascript when you get a field on the screen 
and the text is `[Object object]`

-Steve

Aug 30 2024

D Programming

C/C++ Programming

Other

digitalmars.D - interpolation proposals and safety