digitalmars.D - toStringz and toUTFz potentially unsafe
- Johann MacDonagh (59/59) Jul 24 2011 Both toStringz and toUTFz do something potentially unsafe. Both check
- Andrei Alexandrescu (9/68) Jul 24 2011 I'm not too worried. I think it is fair to guarantee the pointer
- Jonathan M Davis (22/107) Jul 24 2011 The _only_ time that this matters is if you actually keep the result of
- Brad Roberts (13/125) Jul 24 2011 alling toStringz will
- Andrei Alexandrescu (4/8) Jul 24 2011 Good point. In fact the implementation already takes care of that.
- Jonathan M Davis (12/125) Jul 24 2011 The real question is what to do with to!(char*)(str). The plan is to mak...
- Johann MacDonagh (7/19) Jul 24 2011 In that case, maybe we should implement @schveiguy's suggestion.
- Jonathan M Davis (13/37) Jul 24 2011 If you always append a '\0', then there's no point to toStringz at all. ...
- Vladimir Panteleev (6/7) Jul 25 2011 Semantics? toStringz(s) is much more readable than (s~'\0').ptr
- Steven Schveighoffer (40/76) Jul 25 2011 It is of critical importance to note that this *ONLY* happens for
- Jonathan M Davis (16/26) Jul 25 2011 That's not quite true actually. It _is_ true for toStringz, because it
Both toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string. This is a good optimization in theory because this code: string s = "abc"; will be a slice to a read-only section of the executable. The compiler will insert a NULL after the string in the read-only section. So this: auto x = toStringz("abc"); is efficient. No relocations. As AndrejMitrovic commented in Phobos pull request 123 https://github.com/D-Programming-Language/phobos/pull/123, this has potential issues: import std.string; import std.stdio; struct A { immutable char[2] foo; char[2] bar; } void main() { auto a = A("aa", "\0b"); auto charptr = toStringz(a.foo[]); a.bar = "bo"; printf(charptr); // two chars, then garbage } Another issue not mentioned is with slices. If I do... string s = "abc"; string y = s[]; string z = y[]; z ~= '\0'; auto c = toStringz(y); assert(c.ptr == y.ptr); ... what happens if I change that last character of z before I pass c to the C routine? Bad news. I think this optimization is great, but doesn't it go against D's motto of doing "the right thing by default"? The question is, how can we keep this optimization so that: toStringz("abc"); remains efficient? The capacity field is 0 if the string is in a read-only section *or* if the string is on the stack: auto x = "abc"; assert(x.capacity == 0); char[3] y = "abc"; assert(x.capacity == 0); So, this isn't safe either. This code: char[3] x = "abc"; char y = '\0'; will put y right after x, so changing y after calling toStringz will cause issues. In reality, the only time it's safe to do the "peek after end" is if the string is in the read-only section. Otherwise, there are potential issues (even if they are edge cases). Do we care about this? Is there something we can add to druntime arrays that will tell whether or not the data backing a slice is in read-only memory (or perhaps an enum: read-only, stack, heap, other)? In reality, the only time this changes is when a read-only / stack heap is appended to, so performance issues are minimal. Comments? Ideas?
Jul 24 2011
On 7/24/11 7:41 PM, Johann MacDonagh wrote:Both toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string. This is a good optimization in theory because this code: string s = "abc"; will be a slice to a read-only section of the executable. The compiler will insert a NULL after the string in the read-only section. So this: auto x = toStringz("abc"); is efficient. No relocations. As AndrejMitrovic commented in Phobos pull request 123 https://github.com/D-Programming-Language/phobos/pull/123, this has potential issues: import std.string; import std.stdio; struct A { immutable char[2] foo; char[2] bar; } void main() { auto a = A("aa", "\0b"); auto charptr = toStringz(a.foo[]); a.bar = "bo"; printf(charptr); // two chars, then garbage } Another issue not mentioned is with slices. If I do... string s = "abc"; string y = s[]; string z = y[]; z ~= '\0'; auto c = toStringz(y); assert(c.ptr == y.ptr); ... what happens if I change that last character of z before I pass c to the C routine? Bad news. I think this optimization is great, but doesn't it go against D's motto of doing "the right thing by default"? The question is, how can we keep this optimization so that: toStringz("abc"); remains efficient? The capacity field is 0 if the string is in a read-only section *or* if the string is on the stack: auto x = "abc"; assert(x.capacity == 0); char[3] y = "abc"; assert(x.capacity == 0); So, this isn't safe either. This code: char[3] x = "abc"; char y = '\0'; will put y right after x, so changing y after calling toStringz will cause issues. In reality, the only time it's safe to do the "peek after end" is if the string is in the read-only section. Otherwise, there are potential issues (even if they are edge cases). Do we care about this? Is there something we can add to druntime arrays that will tell whether or not the data backing a slice is in read-only memory (or perhaps an enum: read-only, stack, heap, other)? In reality, the only time this changes is when a read-only / stack heap is appended to, so performance issues are minimal. Comments? Ideas?I'm not too worried. I think it is fair to guarantee the pointer returned from toStringz is guaranteed to point to a zero-terminated string only up until the first relevant change. It is difficult to define what a relevant change is, but practically I think it is understood what's going on. If there's a need for a persistent stringz, creating a private copy immediately after the call to toStringz is always an option. Andrei
Jul 24 2011
On Sunday 24 July 2011 20:41:47 Johann MacDonagh wrote:Both toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string. This is a good optimization in theory because this code: string s = "abc"; will be a slice to a read-only section of the executable. The compiler will insert a NULL after the string in the read-only section. So this: auto x = toStringz("abc"); is efficient. No relocations. As AndrejMitrovic commented in Phobos pull request 123 https://github.com/D-Programming-Language/phobos/pull/123, this has potential issues: import std.string; import std.stdio; struct A { immutable char[2] foo; char[2] bar; } void main() { auto a = A("aa", "\0b"); auto charptr = toStringz(a.foo[]); a.bar = "bo"; printf(charptr); // two chars, then garbage } Another issue not mentioned is with slices. If I do... string s = "abc"; string y = s[]; string z = y[]; z ~= '\0'; auto c = toStringz(y); assert(c.ptr == y.ptr); ... what happens if I change that last character of z before I pass c to the C routine? Bad news. I think this optimization is great, but doesn't it go against D's motto of doing "the right thing by default"? The question is, how can we keep this optimization so that: toStringz("abc"); remains efficient? The capacity field is 0 if the string is in a read-only section *or* if the string is on the stack: auto x = "abc"; assert(x.capacity == 0); char[3] y = "abc"; assert(x.capacity == 0); So, this isn't safe either. This code: char[3] x = "abc"; char y = '\0'; will put y right after x, so changing y after calling toStringz will cause issues. In reality, the only time it's safe to do the "peek after end" is if the string is in the read-only section. Otherwise, there are potential issues (even if they are edge cases). Do we care about this? Is there something we can add to druntime arrays that will tell whether or not the data backing a slice is in read-only memory (or perhaps an enum: read-only, stack, heap, other)? In reality, the only time this changes is when a read-only / stack heap is appended to, so performance issues are minimal. Comments? Ideas?The _only_ time that this matters is if you actually keep the result of toStringz or toUTFz around. If you do what is almost always done - that is pass the the result of toStringz/toUTFz directly to a function and not save it at all - then there is _zero_ problem. If you want to save the char* instead, then you can simply choose to append '\0' instead of calling toStringz/toUTFz. toStringz has pretty much 0 value if it's forced to copy the string in all cases, and unless checking past the end of the string _and_ guaranteeing that the character one past the end won't change can be done and done efficiently, then the only way to guarantee that then value one past the end won't change is to make a copy, in which case toStringz is essentially pointless. The documentation for toUTFz has an appropriate warning on it, so the issue should be clear. If you want to avoid it, simply append '\0' instead of calling toStringz or toUTFz. In 99.99% of cases, the char* is passed directly to a C function anyway, in which case there is not problem. So, I really don't think that this is really an issue. If we can find an efficient way to make better guarantees, then that would be great, but the risk at this point is _very_ minimal, and there's an easy way to get around it in the _rare_ case where it could be a problem (simply append '\0' instead of calling toStringz/toUTFz). - Jonathan M Davis
Jul 24 2011
On Sunday, July 24, 2011 5:56:04 PM, Jonathan M Davis wrote:On Sunday 24 July 2011 20:41:47 Johann MacDonagh wrote:(a.foo[]);Both toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string. This is a good optimization in theory because this code: string s = "abc"; will be a slice to a read-only section of the executable. The compiler will insert a NULL after the string in the read-only section. So this: auto x = toStringz("abc"); is efficient. No relocations. As AndrejMitrovic commented in Phobos pull request 123 https://github.com/D-Programming-Language/phobos/pull/123, this has potential issues: import std.string; import std.stdio; struct A { immutable char[2] foo; char[2] bar; } void main() { auto a = A("aa", "\0b"); auto charptr = toStringzalling toStringz willa.bar = "bo"; printf(charptr); // two chars, then garbage } Another issue not mentioned is with slices. If I do... string s = "abc"; string y = s[]; string z = y[]; z ~= '\0'; auto c = toStringz(y); assert(c.ptr == y.ptr); ... what happens if I change that last character of z before I pass c to the C routine? Bad news. I think this optimization is great, but doesn't it go against D's motto of doing "the right thing by default"? The question is, how can we keep this optimization so that: toStringz("abc"); remains efficient? The capacity field is 0 if the string is in a read-only section *or* if the string is on the stack: auto x = "abc"; assert(x.capacity == 0); char[3] y = "abc"; assert(x.capacity == 0); So, this isn't safe either. This code: char[3] x = "abc"; char y = '\0'; will put y right after x, so changing y after cgz/toUTFz. toStringz has pretty much 0 value ifcause issues. In reality, the only time it's safe to do the "peek after end" is if the string is in the read-only section. Otherwise, there are potential issues (even if they are edge cases). Do we care about this? Is there something we can add to druntime arrays that will tell whether or not the data backing a slice is in read-only memory (or perhaps an enum: read-only, stack, heap, other)? In reality, the only time this changes is when a read-only / stack heap is appended to, so performance issues are minimal. Comments? Ideas?The _only_ time that this matters is if you actually keep the result of toStringz or toUTFz around. If you do what is almost always done - that is pass the the result of toStringz/toUTFz directly to a function and not save it at all - then there is _zero_ problem. If you want to save the char* instead, then you can simply choose to append '\0' instead of calling toStrinit's forced to copy the string in all cases, and unless checking past the end of the string _and_ guaranteeing that the character one past the end won't change can be done and done efficiently, then the only way to guarantee that then value one past the end won't change is to make a copy, in which case toStringz is essentially pointless. The documentation for toUTFz has an appropriate warning on it, so the issue should be clear. If you want to avoid it, simply append '\0' instead of calling toStringz or toUTFz. In 99.99% of cases, the char* is passed directly to a C function anyway, in which case there is not problem. So, I really don't think that this is really an issue. If we can find an efficient way to make better guarantees, then that would be great, but the risk at this point is _very_ minimal, and there's an easy way to get around it in the _rare_ case where it could be a problem(simply append '\0' instead ofcalling toStringz/toUTFz). - Jonathan M DavisNot entire true. There's one more very important way this can cause a serious problem: when the string in question is at the end of a block of memory and the +1 location is outside that. The result will be a segv. While I agree the chances of problem is low, it's certainly not 0 and is the sort of thing that if we saw another language do it we'd laugh and point. This kind of code produces hard to debug problems and doesn't belong in the standard library.
Jul 24 2011
On 7/24/11 8:05 PM, Brad Roberts wrote:Not entire true. There's one more very important way this can cause a serious problem: when the string in question is at the end of a block of memory and the +1 location is outside that. The result will be a segv.Good point. In fact the implementation already takes care of that. https://github.com/D-Programming-Language/phobos/blob/master/std/string.d#L403 Andrei
Jul 24 2011
On Sunday 24 July 2011 17:56:04 Jonathan M Davis wrote:On Sunday 24 July 2011 20:41:47 Johann MacDonagh wrote:The real question is what to do with to!(char*)(str). The plan is to make it call toUTFz, but at that point, the warning about toUTFz is not as obvious (though it can re-iterate the warning or point you to the toUTFz documentation to read it). Also, since you already have toUTFz, calling to!(char*) is kind of pointless. So, I think that there's a good argument for forcing to!(char*) to append '\0' instead of checking one past the end. Then when you want a guarantee that the '\0' isn't going to change, you can use to!(char*), and if you want the efficiency, you can call toUTFz. But it is debatable whether we should do that or just have to!(char*) call toUTFz in all cases. I'm leaning towards making it always copy though. - Jonathan M DavisBoth toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string. This is a good optimization in theory because this code: string s = "abc"; will be a slice to a read-only section of the executable. The compiler will insert a NULL after the string in the read-only section. So this: auto x = toStringz("abc"); is efficient. No relocations. As AndrejMitrovic commented in Phobos pull request 123 https://github.com/D-Programming-Language/phobos/pull/123, this has potential issues: import std.string; import std.stdio; struct A { immutable char[2] foo; char[2] bar; } void main() { auto a = A("aa", "\0b"); auto charptr = toStringz(a.foo[]); a.bar = "bo"; printf(charptr); // two chars, then garbage } Another issue not mentioned is with slices. If I do... string s = "abc"; string y = s[]; string z = y[]; z ~= '\0'; auto c = toStringz(y); assert(c.ptr == y.ptr); ... what happens if I change that last character of z before I pass c to the C routine? Bad news. I think this optimization is great, but doesn't it go against D's motto of doing "the right thing by default"? The question is, how can we keep this optimization so that: toStringz("abc"); remains efficient? The capacity field is 0 if the string is in a read-only section *or* if the string is on the stack: auto x = "abc"; assert(x.capacity == 0); char[3] y = "abc"; assert(x.capacity == 0); So, this isn't safe either. This code: char[3] x = "abc"; char y = '\0'; will put y right after x, so changing y after calling toStringz will cause issues. In reality, the only time it's safe to do the "peek after end" is if the string is in the read-only section. Otherwise, there are potential issues (even if they are edge cases). Do we care about this? Is there something we can add to druntime arrays that will tell whether or not the data backing a slice is in read-only memory (or perhaps an enum: read-only, stack, heap, other)? In reality, the only time this changes is when a read-only / stack heap is appended to, so performance issues are minimal. Comments? Ideas?The _only_ time that this matters is if you actually keep the result of toStringz or toUTFz around. If you do what is almost always done - that is pass the the result of toStringz/toUTFz directly to a function and not save it at all - then there is _zero_ problem. If you want to save the char* instead, then you can simply choose to append '\0' instead of calling toStringz/toUTFz. toStringz has pretty much 0 value if it's forced to copy the string in all cases, and unless checking past the end of the string _and_ guaranteeing that the character one past the end won't change can be done and done efficiently, then the only way to guarantee that then value one past the end won't change is to make a copy, in which case toStringz is essentially pointless. The documentation for toUTFz has an appropriate warning on it, so the issue should be clear. If you want to avoid it, simply append '\0' instead of calling toStringz or toUTFz. In 99.99% of cases, the char* is passed directly to a C function anyway, in which case there is not problem. So, I really don't think that this is really an issue. If we can find an efficient way to make better guarantees, then that would be great, but the risk at this point is _very_ minimal, and there's an easy way to get around it in the _rare_ case where it could be a problem (simply append '\0' instead of calling toStringz/toUTFz).
Jul 24 2011
On 7/24/2011 9:06 PM, Jonathan M Davis wrote:On Sunday 24 July 2011 17:56:04 Jonathan M Davis wrote: The real question is what to do with to!(char*)(str). The plan is to make it call toUTFz, but at that point, the warning about toUTFz is not as obvious (though it can re-iterate the warning or point you to the toUTFz documentation to read it). Also, since you already have toUTFz, calling to!(char*) is kind of pointless. So, I think that there's a good argument for forcing to!(char*) to append '\0' instead of checking one past the end. Then when you want a guarantee that the '\0' isn't going to change, you can use to!(char*), and if you want the efficiency, you can call toUTFz. But it is debatable whether we should do that or just have to!(char*) call toUTFz in all cases. I'm leaning towards making it always copy though. - Jonathan M DavisIn that case, maybe we should implement schveiguy's suggestion. immutable(char)* toStringz(string s, bool unsafe = true) pure nothrow That way the user can decide whether to take the optimization risk (or if they know the string is on the stack, etc...). In addition, always copying is wasteful. We're usually able to append a NULL to a dynamic array without relocation / copying.
Jul 24 2011
On Sunday 24 July 2011 21:57:34 Johann MacDonagh wrote:On 7/24/2011 9:06 PM, Jonathan M Davis wrote:If you always append a '\0', then there's no point to toStringz at all. So, sure we _could_ add the unsafe parameter like that, but I seriously question the value of it. The primary value in toStringz is to give you the optimization of looking one past the end of the string and attempting to completely avoid any chance of reallocation. And yes, you'd use ~= which _might_ copy rather than forcing a copy every time, but the cases where you could have just checked one past the end of the array and done nothing if it were '\0' are generally going to be the cases where ~= has to reallocate. So, in reality, you're pretty much going to copy every time that you could have avoided the copy if you had gone with true for unsafe rather than false. - Jonathan M DavisOn Sunday 24 July 2011 17:56:04 Jonathan M Davis wrote: The real question is what to do with to!(char*)(str). The plan is to make it call toUTFz, but at that point, the warning about toUTFz is not as obvious (though it can re-iterate the warning or point you to the toUTFz documentation to read it). Also, since you already have toUTFz, calling to!(char*) is kind of pointless. So, I think that there's a good argument for forcing to!(char*) to append '\0' instead of checking one past the end. Then when you want a guarantee that the '\0' isn't going to change, you can use to!(char*), and if you want the efficiency, you can call toUTFz. But it is debatable whether we should do that or just have to!(char*) call toUTFz in all cases. I'm leaning towards making it always copy though. - Jonathan M DavisIn that case, maybe we should implement schveiguy's suggestion. immutable(char)* toStringz(string s, bool unsafe = true) pure nothrow That way the user can decide whether to take the optimization risk (or if they know the string is on the stack, etc...). In addition, always copying is wasteful. We're usually able to append a NULL to a dynamic array without relocation / copying.
Jul 24 2011
On Mon, 25 Jul 2011 05:09:17 +0300, Jonathan M Davis <jmdavisProg gmx.com> wrote:If you always append a '\0', then there's no point to toStringz at all.Semantics? toStringz(s) is much more readable than (s~'\0').ptr -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Jul 25 2011
On Sun, 24 Jul 2011 20:41:47 -0400, Johann MacDonagh <johann.macdonagh.no spam.gmail.com> wrote:Both toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string.It is of critical importance to note that this *ONLY* happens for immutable data. It is not done for const or mutable strings (instead a zero is always appended).Another issue not mentioned is with slices. If I do... string s = "abc"; string y = s[]; string z = y[]; z ~= '\0'; auto c = toStringz(y); assert(c.ptr == y.ptr); ... what happens if I change that last character of z before I pass c to the C routine?Not possible. There are two important errors in your assumptions. First, z is an immutable(char)[], so you cannot change the last character without a cast. And if you do that, it's undefined behavior. Second, z.ptr != y.ptr. You cannot append to a ROM string, it will reallocate. However, it's quite possible that *unallocated* data may change. That is: string s = "abc".dup; // it's possible that the data after "abc" is a 0. The GC does not default initialize the unallocated portion of a data block if you are allocating a NO_SCAN block (for efficiency). auto c = toStringz(s); // might just return s.ptr. s ~= 'a'; // now if c == s.ptr, c is affected. However, I still am not concerned about this (see my statement at the end of this message).Bad news. I think this optimization is great, but doesn't it go against D's motto of doing "the right thing by default"? The question is, how can we keep this optimization so that: toStringz("abc"); remains efficient? The capacity field is 0 if the string is in a read-only section *or* if the string is on the stack: auto x = "abc"; assert(x.capacity == 0); char[3] y = "abc"; assert(x.capacity == 0); So, this isn't safe either. This code: char[3] x = "abc"; char y = '\0'; will put y right after x, so changing y after calling toStringz will cause issues.A not-so-small flaw in this argument is char[3] x is not immutable -- as I stated above, only an immutable string causes the optimization to occur. It's highly unlikely anyone would ever write: immutable char[3] y = "abc";In reality, the only time it's safe to do the "peek after end" is if the string is in the read-only section. Otherwise, there are potential issues (even if they are edge cases). Do we care about this? Is there something we can add to druntime arrays that will tell whether or not the data backing a slice is in read-only memory (or perhaps an enum: read-only, stack, heap, other)? In reality, the only time this changes is when a read-only / stack heap is appended to, so performance issues are minimal.I think there could be a compromise solution: If the slice is for immutable data (i.e. string), AND the slice is of a heap-allocated array (the runtime can determine this via a block lookup), AND the characer after the slice is part of the used block data, it would be safe to assume that data is also immutable. There are remote possibilities that this could not be the case, but this involves improbable casting scenarios. Note that the lookup cost is not trivial -- it's O(lgn), and it requires a GC lock. This, of course, would have to be in addition to checking if the string is in ROM. I don't know if this is possible... BTW, I agree with Jonathan that this is a very rare corner-case (it might even be non-existent in real code). I've only ever used toStringz or seen it used as an expression for a parameter -- I've never actually saved the result and did other stuff before passing it. I think the warning he states in documentation is sufficient. -Steve
Jul 25 2011
On Monday 25 July 2011 09:27:34 Steven Schveighoffer wrote:On Sun, 24 Jul 2011 20:41:47 -0400, Johann MacDonagh <johann.macdonagh.no spam.gmail.com> wrote:That's not quite true actually. It _is_ true for toStringz, because it _always_ returns immutable char*, but because toUTFz has more control over the return type, it does it in every case where it could conceivably avoid a copy. So, C[] -> C*, C[] -> const C*, immutable (C)[] -> const(C)[], and immutable(C)[] -> immutable C* all check one past the end of the array. Now, the non-immutable strings probably end up appending anyway, because they don't get the embedded '\0' insert one past their end by the compiler, but they still try and avoid it. But the non-immutable strings aren't really any more likely to have the problem than immutable ones, since it all hinges on saving the returned pointer, which just isn't done normally. The only real increased risk is if you slice a mutable array where the first element after the slice is '\0', since then you could easily change it, where you couldn't if it were immutable, but again, you have to save the pointer, and the warning makes it clear that that's a situation which would be a problem. - Jonathan M DavisBoth toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string.It is of critical importance to note that this *ONLY* happens for immutable data. It is not done for const or mutable strings (instead a zero is always appended).
Jul 25 2011