digitalmars.D - toStringz and toUTFz potentially unsafe

Johann MacDonagh (59/59) Jul 24 2011 Both toStringz and toUTFz do something potentially unsafe. Both check

Andrei Alexandrescu (9/68) Jul 24 2011 I'm not too worried. I think it is fair to guarantee the pointer
Jonathan M Davis (22/107) Jul 24 2011 The _only_ time that this matters is if you actually keep the result of
Brad Roberts (13/125) Jul 24 2011 alling toStringz will

Andrei Alexandrescu (4/8) Jul 24 2011 Good point. In fact the implementation already takes care of that.

Jonathan M Davis (12/125) Jul 24 2011 The real question is what to do with to!(char*)(str). The plan is to mak...

Johann MacDonagh (7/19) Jul 24 2011 In that case, maybe we should implement @schveiguy's suggestion.

Jonathan M Davis (13/37) Jul 24 2011 If you always append a '\0', then there's no point to toStringz at all. ...

Vladimir Panteleev (6/7) Jul 25 2011 Semantics? toStringz(s) is much more readable than (s~'\0').ptr

Steven Schveighoffer (40/76) Jul 25 2011 It is of critical importance to note that this *ONLY* happens for

Jonathan M Davis (16/26) Jul 25 2011 That's not quite true actually. It _is_ true for toStringz, because it

Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:

Both toStringz and toUTFz do something potentially unsafe. Both check 
whether the character after the end of the string is NULL. If so, then 
it simply returns a pointer to the original string. This is a good 
optimization in theory because this code:

string s = "abc";

will be a slice to a read-only section of the executable. The compiler 
will insert a NULL after the string in the read-only section. So this:

auto x = toStringz("abc");

is efficient. No relocations.

As  AndrejMitrovic commented in Phobos pull request 123 
https://github.com/D-Programming-Language/phobos/pull/123, this has 
potential issues:

import std.string;
import std.stdio;

struct A
{
     immutable char[2] foo;
     char[2] bar;
}

void main()
{
     auto a = A("aa", "\0b");
     auto charptr = toStringz(a.foo[]);

     a.bar = "bo";
     printf(charptr);   // two chars, then garbage
}

Another issue not mentioned is with slices. If I do...

string s = "abc";
string y = s[];
string z = y[];

z ~= '\0';

auto c = toStringz(y);

assert(c.ptr == y.ptr);

... what happens if I change that last character of z before I pass c to 
the C routine? Bad news. I think this optimization is great, but doesn't 
it go against D's motto of doing "the right thing by default"?

The question is, how can we keep this optimization so that:

toStringz("abc");

remains efficient?

The capacity field is 0 if the string is in a read-only section *or* if 
the string is on the stack:

auto x = "abc";
assert(x.capacity == 0);
char[3] y = "abc";
assert(x.capacity == 0);

So, this isn't safe either. This code:

     char[3] x = "abc";
     char y = '\0';

will put y right after x, so changing y after calling toStringz will 
cause issues.

In reality, the only time it's safe to do the "peek after end" is if the 
string is in the read-only section. Otherwise, there are potential 
issues (even if they are edge cases).

Do we care about this? Is there something we can add to druntime arrays 
that will tell whether or not the data backing a slice is in read-only 
memory (or perhaps an enum: read-only, stack, heap, other)? In reality, 
the only time this changes is when a read-only / stack heap is appended 
to, so performance issues are minimal.

Comments? Ideas?

Jul 24 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/24/11 7:41 PM, Johann MacDonagh wrote:
 Both toStringz and toUTFz do something potentially unsafe. Both check
 whether the character after the end of the string is NULL. If so, then
 it simply returns a pointer to the original string. This is a good
 optimization in theory because this code:

 string s = "abc";

 will be a slice to a read-only section of the executable. The compiler
 will insert a NULL after the string in the read-only section. So this:

 auto x = toStringz("abc");

 is efficient. No relocations.

 As  AndrejMitrovic commented in Phobos pull request 123
 https://github.com/D-Programming-Language/phobos/pull/123, this has
 potential issues:

 import std.string;
 import std.stdio;

 struct A
 {
 immutable char[2] foo;
 char[2] bar;
 }

 void main()
 {
 auto a = A("aa", "\0b");
 auto charptr = toStringz(a.foo[]);

 a.bar = "bo";
 printf(charptr); // two chars, then garbage
 }

 Another issue not mentioned is with slices. If I do...

 string s = "abc";
 string y = s[];
 string z = y[];

 z ~= '\0';

 auto c = toStringz(y);

 assert(c.ptr == y.ptr);

 ... what happens if I change that last character of z before I pass c to
 the C routine? Bad news. I think this optimization is great, but doesn't
 it go against D's motto of doing "the right thing by default"?

 The question is, how can we keep this optimization so that:

 toStringz("abc");

 remains efficient?

 The capacity field is 0 if the string is in a read-only section *or* if
 the string is on the stack:

 auto x = "abc";
 assert(x.capacity == 0);
 char[3] y = "abc";
 assert(x.capacity == 0);

 So, this isn't safe either. This code:

 char[3] x = "abc";
 char y = '\0';

 will put y right after x, so changing y after calling toStringz will
 cause issues.

 In reality, the only time it's safe to do the "peek after end" is if the
 string is in the read-only section. Otherwise, there are potential
 issues (even if they are edge cases).

 Do we care about this? Is there something we can add to druntime arrays
 that will tell whether or not the data backing a slice is in read-only
 memory (or perhaps an enum: read-only, stack, heap, other)? In reality,
 the only time this changes is when a read-only / stack heap is appended
 to, so performance issues are minimal.

 Comments? Ideas?

I'm not too worried. I think it is fair to guarantee the pointer 
returned from toStringz is guaranteed to point to a zero-terminated 
string only up until the first relevant change. It is difficult to 
define what a relevant change is, but practically I think it is 
understood what's going on. If there's a need for a persistent stringz, 
creating a private copy immediately after the call to toStringz is 
always an option.


Andrei

Jul 24 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday 24 July 2011 20:41:47 Johann MacDonagh wrote:
 Both toStringz and toUTFz do something potentially unsafe. Both check
 whether the character after the end of the string is NULL. If so, then
 it simply returns a pointer to the original string. This is a good
 optimization in theory because this code:
 
 string s = "abc";
 
 will be a slice to a read-only section of the executable. The compiler
 will insert a NULL after the string in the read-only section. So this:
 
 auto x = toStringz("abc");
 
 is efficient. No relocations.
 
 As  AndrejMitrovic commented in Phobos pull request 123
 https://github.com/D-Programming-Language/phobos/pull/123, this has
 potential issues:
 
 import std.string;
 import std.stdio;
 
 struct A
 {
      immutable char[2] foo;
      char[2] bar;
 }
 
 void main()
 {
      auto a = A("aa", "\0b");
      auto charptr = toStringz(a.foo[]);
 
      a.bar = "bo";
      printf(charptr);   // two chars, then garbage
 }
 
 Another issue not mentioned is with slices. If I do...
 
 string s = "abc";
 string y = s[];
 string z = y[];
 
 z ~= '\0';
 
 auto c = toStringz(y);
 
 assert(c.ptr == y.ptr);
 
 ... what happens if I change that last character of z before I pass c to
 the C routine? Bad news. I think this optimization is great, but doesn't
 it go against D's motto of doing "the right thing by default"?
 
 The question is, how can we keep this optimization so that:
 
 toStringz("abc");
 
 remains efficient?
 
 The capacity field is 0 if the string is in a read-only section *or* if
 the string is on the stack:
 
 auto x = "abc";
 assert(x.capacity == 0);
 char[3] y = "abc";
 assert(x.capacity == 0);
 
 So, this isn't safe either. This code:
 
      char[3] x = "abc";
      char y = '\0';
 
 will put y right after x, so changing y after calling toStringz will
 cause issues.
 
 In reality, the only time it's safe to do the "peek after end" is if the
 string is in the read-only section. Otherwise, there are potential
 issues (even if they are edge cases).
 
 Do we care about this? Is there something we can add to druntime arrays
 that will tell whether or not the data backing a slice is in read-only
 memory (or perhaps an enum: read-only, stack, heap, other)? In reality,
 the only time this changes is when a read-only / stack heap is appended
 to, so performance issues are minimal.
 
 Comments? Ideas?

The _only_ time that this matters is if you actually keep the result of 
toStringz or toUTFz around. If you do what is almost always done - that is 
pass the the result of toStringz/toUTFz directly to a function and not save it 
at all - then there is _zero_ problem.

If you want to save the char* instead, then you can simply choose to append 
'\0' instead of calling toStringz/toUTFz. toStringz has pretty much 0 value if 
it's forced to copy the string in all cases, and unless checking past the end 
of the string _and_ guaranteeing that the character one past the end won't 
change can be done and done efficiently, then the only way to guarantee that 
then value one past the end won't change is to make a copy, in which case 
toStringz is essentially pointless.

The documentation for toUTFz has an appropriate warning on it, so the issue 
should be clear. If you want to avoid it, simply append '\0' instead of 
calling toStringz or toUTFz. In 99.99% of cases, the char* is passed directly 
to a C function anyway, in which case there is not problem.

So, I really don't think that this is really an issue. If we can find an 
efficient way to make better guarantees, then that would be great, but the risk 
at this point is _very_ minimal, and there's an easy way to get around it in 
the _rare_ case where it could be a problem (simply append '\0' instead of 
calling toStringz/toUTFz).

- Jonathan M Davis

Jul 24 2011

Brad Roberts <braddr puremagic.com> writes:

On Sunday, July 24, 2011 5:56:04 PM, Jonathan M Davis wrote:
 On Sunday 24 July 2011 20:41:47 Johann MacDonagh wrote:
 Both toStringz and toUTFz do something potentially unsafe. Both check
 whether the character after the end of the string is NULL. If so, then
 it simply returns a pointer to the original string. This is a good
 optimization in theory because this code:

 string s = "abc";

 will be a slice to a read-only section of the executable. The compiler
 will insert a NULL after the string in the read-only section. So this:

 auto x = toStringz("abc");

 is efficient. No relocations.

 As  AndrejMitrovic commented in Phobos pull request 123
 https://github.com/D-Programming-Language/phobos/pull/123, this has
 potential issues:

 import std.string;
 import std.stdio;

 struct A
 {
      immutable char[2] foo;
      char[2] bar;
 }

 void main()
 {
      auto a = A("aa", "\0b");
      auto charptr = toStringz


(a.foo[]);
      a.bar = "bo";
      printf(charptr);   // two chars, then garbage
 }

 Another issue not mentioned is with slices. If I do...

 string s = "abc";
 string y = s[];
 string z = y[];

 z ~= '\0';

 auto c = toStringz(y);

 assert(c.ptr == y.ptr);

 ... what happens if I change that last character of z before I pass c to
 the C routine? Bad news. I think this optimization is great, but doesn't
 it go against D's motto of doing "the right thing by default"?

 The question is, how can we keep this optimization so that:

 toStringz("abc");

 remains efficient?

 The capacity field is 0 if the string is in a read-only section *or* if
 the string is on the stack:

 auto x = "abc";
 assert(x.capacity == 0);
 char[3] y = "abc";
 assert(x.capacity == 0);

 So, this isn't safe either. This code:

      char[3] x = "abc";
      char y = '\0';

 will put y right after x, so changing y after c


alling toStringz will
 cause issues.

 In reality, the only time it's safe to do the "peek after end" is if the
 string is in the read-only section. Otherwise, there are potential
 issues (even if they are edge cases).

 Do we care about this? Is there something we can add to druntime arrays
 that will tell whether or not the data backing a slice is in read-only
 memory (or perhaps an enum: read-only, stack, heap, other)? In reality,
 the only time this changes is when a read-only / stack heap is appended
 to, so performance issues are minimal.

 Comments? Ideas?

 
 The _only_ time that this matters is if you actually keep the result of 
 toStringz or toUTFz around. If you do what is almost always done - that is 
 pass the the result of toStringz/toUTFz directly to a function and not save it 
 at all - then there is _zero_ problem.
 
 If you want to save the char* instead, then you can simply choose to append 
 '\0' instead of calling toStrin

gz/toUTFz. toStringz has pretty much 0 value if 
 it's forced to copy the string in all cases, and unless checking past the end 
 of the string _and_ guaranteeing that the character one past the end won't 
 change can be done and done efficiently, then the only way to guarantee that 
 then value one past the end won't change is to make a copy, in which case 
 toStringz is essentially pointless.
 
 The documentation for toUTFz has an appropriate warning on it, so the issue 
 should be clear. If you want to avoid it, simply append '\0' instead of 
 calling toStringz or toUTFz. In 99.99% of cases, the char* is passed directly 
 to a C function anyway, in which case there is not problem.
 
 So, I really don't think that this is really an issue. If we can find an 
 efficient way to make better guarantees, then that would be great, but the
risk 
 at this point is _very_ minimal, and there's an easy way to get around it in 
 the _rare_ case where it could be a problem

 (simply append '\0' instead of 
 calling toStringz/toUTFz).
 
 - Jonathan M Davis

Not entire true.  There's one more very important way this can cause a 
serious problem:  when the string in question is at the end of a block 
of memory and the +1 location is outside that.  The result will be a 
segv.

While I agree the chances of problem is low, it's certainly not 0 and 
is the sort of thing that if we saw another language do it we'd laugh 
and point.  This kind of code produces hard to debug problems and 
doesn't belong in the standard library.

Jul 24 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/24/11 8:05 PM, Brad Roberts wrote:
 Not entire true.  There's one more very important way this can cause a
 serious problem:  when the string in question is at the end of a block
 of memory and the +1 location is outside that.  The result will be a
 segv.

Good point. In fact the implementation already takes care of that.

https://github.com/D-Programming-Language/phobos/blob/master/std/string.d#L403

Andrei

Jul 24 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday 24 July 2011 17:56:04 Jonathan M Davis wrote:
 On Sunday 24 July 2011 20:41:47 Johann MacDonagh wrote:
 Both toStringz and toUTFz do something potentially unsafe. Both check
 whether the character after the end of the string is NULL. If so, then
 it simply returns a pointer to the original string. This is a good
 optimization in theory because this code:
 
 string s = "abc";
 
 will be a slice to a read-only section of the executable. The compiler
 will insert a NULL after the string in the read-only section. So this:
 
 auto x = toStringz("abc");
 
 is efficient. No relocations.
 
 As  AndrejMitrovic commented in Phobos pull request 123
 https://github.com/D-Programming-Language/phobos/pull/123, this has
 potential issues:
 
 import std.string;
 import std.stdio;
 
 struct A
 {
 
      immutable char[2] foo;
      char[2] bar;
 
 }
 
 void main()
 {
 
      auto a = A("aa", "\0b");
      auto charptr = toStringz(a.foo[]);
      
      a.bar = "bo";
      printf(charptr);   // two chars, then garbage
 
 }
 
 Another issue not mentioned is with slices. If I do...
 
 string s = "abc";
 string y = s[];
 string z = y[];
 
 z ~= '\0';
 
 auto c = toStringz(y);
 
 assert(c.ptr == y.ptr);
 
 ... what happens if I change that last character of z before I pass c to
 the C routine? Bad news. I think this optimization is great, but doesn't
 it go against D's motto of doing "the right thing by default"?
 
 The question is, how can we keep this optimization so that:
 
 toStringz("abc");
 
 remains efficient?
 
 The capacity field is 0 if the string is in a read-only section *or* if
 the string is on the stack:
 
 auto x = "abc";
 assert(x.capacity == 0);
 char[3] y = "abc";
 assert(x.capacity == 0);
 
 So, this isn't safe either. This code:
      char[3] x = "abc";
      char y = '\0';
 
 will put y right after x, so changing y after calling toStringz will
 cause issues.
 
 In reality, the only time it's safe to do the "peek after end" is if the
 string is in the read-only section. Otherwise, there are potential
 issues (even if they are edge cases).
 
 Do we care about this? Is there something we can add to druntime arrays
 that will tell whether or not the data backing a slice is in read-only
 memory (or perhaps an enum: read-only, stack, heap, other)? In reality,
 the only time this changes is when a read-only / stack heap is appended
 to, so performance issues are minimal.
 
 Comments? Ideas?

 
 The _only_ time that this matters is if you actually keep the result of
 toStringz or toUTFz around. If you do what is almost always done - that is
 pass the the result of toStringz/toUTFz directly to a function and not save
 it at all - then there is _zero_ problem.
 
 If you want to save the char* instead, then you can simply choose to append
 '\0' instead of calling toStringz/toUTFz. toStringz has pretty much 0 value
 if it's forced to copy the string in all cases, and unless checking past
 the end of the string _and_ guaranteeing that the character one past the
 end won't change can be done and done efficiently, then the only way to
 guarantee that then value one past the end won't change is to make a copy,
 in which case toStringz is essentially pointless.
 
 The documentation for toUTFz has an appropriate warning on it, so the issue
 should be clear. If you want to avoid it, simply append '\0' instead of
 calling toStringz or toUTFz. In 99.99% of cases, the char* is passed
 directly to a C function anyway, in which case there is not problem.
 
 So, I really don't think that this is really an issue. If we can find an
 efficient way to make better guarantees, then that would be great, but the
 risk at this point is _very_ minimal, and there's an easy way to get around
 it in the _rare_ case where it could be a problem (simply append '\0'
 instead of calling toStringz/toUTFz).

The real question is what to do with to!(char*)(str). The plan is to make it 
call toUTFz, but at that point, the warning about toUTFz is not as obvious 
(though it can re-iterate the warning or point you to the toUTFz documentation 
to read it). Also, since you already have toUTFz, calling to!(char*) is kind 
of pointless. So, I think that there's a good argument for forcing to!(char*) 
to append '\0' instead of checking one past the end. Then when you want a 
guarantee that the '\0' isn't going to change, you can use to!(char*), and if 
you want the efficiency, you can call toUTFz. But it is debatable whether we 
should do that or just have to!(char*) call toUTFz in all cases. I'm leaning 
towards making it always copy though.

- Jonathan M Davis

Jul 24 2011

Johann MacDonagh <johann.macdonagh.no spam.gmail.com> writes:

On 7/24/2011 9:06 PM, Jonathan M Davis wrote:
 On Sunday 24 July 2011 17:56:04 Jonathan M Davis wrote:
 The real question is what to do with to!(char*)(str). The plan is to make it
 call toUTFz, but at that point, the warning about toUTFz is not as obvious
 (though it can re-iterate the warning or point you to the toUTFz documentation
 to read it). Also, since you already have toUTFz, calling to!(char*) is kind
 of pointless. So, I think that there's a good argument for forcing to!(char*)
 to append '\0' instead of checking one past the end. Then when you want a
 guarantee that the '\0' isn't going to change, you can use to!(char*), and if
 you want the efficiency, you can call toUTFz. But it is debatable whether we
 should do that or just have to!(char*) call toUTFz in all cases. I'm leaning
 towards making it always copy though.

 - Jonathan M Davis

In that case, maybe we should implement  schveiguy's suggestion.

immutable(char)* toStringz(string s, bool unsafe = true) pure nothrow

That way the user can decide whether to take the optimization risk (or 
if they know the string is on the stack, etc...). In addition, always 
copying is wasteful. We're usually able to append a NULL to a dynamic 
array without relocation / copying.

Jul 24 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday 24 July 2011 21:57:34 Johann MacDonagh wrote:
 On 7/24/2011 9:06 PM, Jonathan M Davis wrote:
 On Sunday 24 July 2011 17:56:04 Jonathan M Davis wrote:
 The real question is what to do with to!(char*)(str). The plan is to
 make it call toUTFz, but at that point, the warning about toUTFz is not
 as obvious (though it can re-iterate the warning or point you to the
 toUTFz documentation to read it). Also, since you already have toUTFz,
 calling to!(char*) is kind of pointless. So, I think that there's a
 good argument for forcing to!(char*) to append '\0' instead of checking
 one past the end. Then when you want a guarantee that the '\0' isn't
 going to change, you can use to!(char*), and if you want the
 efficiency, you can call toUTFz. But it is debatable whether we should
 do that or just have to!(char*) call toUTFz in all cases. I'm leaning
 towards making it always copy though.
 
 - Jonathan M Davis

 
 In that case, maybe we should implement  schveiguy's suggestion.
 
 immutable(char)* toStringz(string s, bool unsafe = true) pure nothrow
 
 That way the user can decide whether to take the optimization risk (or
 if they know the string is on the stack, etc...). In addition, always
 copying is wasteful. We're usually able to append a NULL to a dynamic
 array without relocation / copying.

If you always append a '\0', then there's no point to toStringz at all. So, 
sure we _could_ add the unsafe parameter like that, but I seriously question 
the value of it. The primary value in toStringz is to give you the 
optimization of looking one past the end of the string and attempting to 
completely avoid any chance of reallocation.

And yes, you'd use ~= which _might_ copy rather than forcing a copy every 
time, but the cases where you could have just checked one past the end of the 
array and done nothing if it were '\0' are generally going to be the cases 
where ~= has to reallocate. So, in reality, you're pretty much going to copy 
every time that you could have avoided the copy if you had gone with true for 
unsafe rather than false.

- Jonathan M Davis

Jul 24 2011

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Mon, 25 Jul 2011 05:09:17 +0300, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 If you always append a '\0', then there's no point to toStringz at all.

Semantics? toStringz(s) is much more readable than (s~'\0').ptr

-- 
Best regards,
  Vladimir                            mailto:vladimir thecybershadow.net

Jul 25 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sun, 24 Jul 2011 20:41:47 -0400, Johann MacDonagh  
<johann.macdonagh.no spam.gmail.com> wrote:

 Both toStringz and toUTFz do something potentially unsafe. Both check  
 whether the character after the end of the string is NULL. If so, then  
 it simply returns a pointer to the original string.

It is of critical importance to note that this *ONLY* happens for  
immutable data.  It is not done for const or mutable strings (instead a  
zero is always appended).

 Another issue not mentioned is with slices. If I do...

 string s = "abc";
 string y = s[];
 string z = y[];

 z ~= '\0';

 auto c = toStringz(y);

 assert(c.ptr == y.ptr);

 ... what happens if I change that last character of z before I pass c to  
 the C routine?

Not possible.  There are two important errors in your assumptions.  First,  
z is an immutable(char)[], so you cannot change the last character without  
a cast.  And if you do that, it's undefined behavior.

Second, z.ptr != y.ptr.  You cannot append to a ROM string, it will  
reallocate.

However, it's quite possible that *unallocated* data may change.  That is:

string s = "abc".dup;  // it's possible that the data after "abc" is a 0.   
The GC does not default initialize the unallocated portion of a data block  
if you are allocating a NO_SCAN block (for efficiency).

auto c = toStringz(s); // might just return s.ptr.

s ~= 'a'; // now if c == s.ptr, c is affected.

However, I still am not concerned about this (see my statement at the end  
of this message).

 Bad news. I think this optimization is great, but doesn't it go against  
 D's motto of doing "the right thing by default"?

 The question is, how can we keep this optimization so that:

 toStringz("abc");

 remains efficient?

 The capacity field is 0 if the string is in a read-only section *or* if  
 the string is on the stack:

 auto x = "abc";
 assert(x.capacity == 0);
 char[3] y = "abc";
 assert(x.capacity == 0);

 So, this isn't safe either. This code:

      char[3] x = "abc";
      char y = '\0';

 will put y right after x, so changing y after calling toStringz will  
 cause issues.

A not-so-small flaw in this argument is char[3] x is not immutable -- as I  
stated above, only an immutable string causes the optimization to occur.

It's highly unlikely anyone would ever write:

immutable char[3] y = "abc";

 In reality, the only time it's safe to do the "peek after end" is if the  
 string is in the read-only section. Otherwise, there are potential  
 issues (even if they are edge cases).

 Do we care about this? Is there something we can add to druntime arrays  
 that will tell whether or not the data backing a slice is in read-only  
 memory (or perhaps an enum: read-only, stack, heap, other)? In reality,  
 the only time this changes is when a read-only / stack heap is appended  
 to, so performance issues are minimal.

I think there could be a compromise solution:

If the slice is for immutable data (i.e. string), AND the slice is of a  
heap-allocated array (the runtime can determine this via a block lookup),  
AND the characer after the slice is part of the used block data, it would  
be safe to assume that data is also immutable.  There are remote  
possibilities that this could not be the case, but this involves  
improbable casting scenarios.

Note that the lookup cost is not trivial -- it's O(lgn), and it requires a  
GC lock.

This, of course, would have to be in addition to checking if the string is  
in ROM.  I don't know if this is possible...

BTW, I agree with Jonathan that this is a very rare corner-case (it might  
even be non-existent in real code).

I've only ever used toStringz or seen it used as an expression for a  
parameter -- I've never actually saved the result and did other stuff  
before passing it.

I think the warning he states in documentation is sufficient.

-Steve

Jul 25 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Monday 25 July 2011 09:27:34 Steven Schveighoffer wrote:
 On Sun, 24 Jul 2011 20:41:47 -0400, Johann MacDonagh
 
 <johann.macdonagh.no spam.gmail.com> wrote:
 Both toStringz and toUTFz do something potentially unsafe. Both check
 whether the character after the end of the string is NULL. If so, then
 it simply returns a pointer to the original string.

 
 It is of critical importance to note that this *ONLY* happens for
 immutable data.  It is not done for const or mutable strings (instead a
 zero is always appended).

That's not quite true actually. It _is_ true for toStringz, because it 
_always_ returns immutable char*, but because toUTFz has more control over the 
return type, it does it in every case where it could conceivably avoid a copy. 
So, C[] -> C*, C[] -> const C*, immutable (C)[] -> const(C)[], and 
immutable(C)[] -> immutable C* all check one past the end of the array. Now, 
the non-immutable strings probably end up appending anyway, because they don't 
get the embedded '\0' insert one past their end by the compiler, but they 
still try and avoid it. But the non-immutable strings aren't really any more 
likely to have the problem than immutable ones, since it all hinges on saving 
the returned pointer, which just isn't done normally. The only real increased 
risk is if you slice a mutable array where the first element after the slice is 
'\0', since then you could easily change it, where you couldn't if it were 
immutable, but again, you have to save the pointer, and the warning makes it 
clear that that's a situation which would be a problem.

- Jonathan M Davis

Jul 25 2011

D Programming

C/C++ Programming

Other

digitalmars.D - toStringz and toUTFz potentially unsafe