www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Fixing "toStringz"

reply "monarch_dodra" <monarchdodra gmail.com> writes:
I'm looking for advice on fixing "toStringz". The current 
implementation that accepts "immutable(char)[]" is:

immutable(char)* toStringz(string s) pure nothrow
     if (s.empty) return "".ptr;
     /* Peek past end of s[], if it's 0, no conversion necessary.
      * Note that the compiler will put a 0 past the end of static
      * strings, and the storage allocator will put a 0 past the 
      * of newly allocated char[]'s.
     immutable p = s.ptr + s.length;
     // Is p dereferenceable? A simple test: if the p points to an
     // address multiple of 4, then conservatively assume the 
     // might be pointing to a new block of memory, which might be
     // unreadable. Otherwise, it's definitely pointing to valid
     // memory.
     if ((cast(size_t) p & 3) && *p == 0)
         return s.ptr;
     return toStringz(cast(const char[]) s);

The explanation is clear enough (I think). The problem with this 
approach is that nothing guarantees that, say, the current string 
is allocator allocated from a very large buffer, and the "past 
the end" \0 is not actually immutable (and will be changed when 
allocator allocates a new string).

Long story short, the function operates under wrong assumptions.

Since "being correct" > "performance", this needs to be 
corrected. I'm looking for ideas though on being able to detect 
if a string *is* a manifest constant string?

The solution doesn't necessarily have to work on all platforms, 
but if we can get something to work on some platforms, that would 
already be good.

Aug 20 2013
next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 8/21/13, monarch_dodra <monarchdodra gmail.com> wrote:
 I'm looking for ideas though on being able to detect
 if a string *is* a manifest constant string?
I've proposed once to allow declaring an enum parameter, e.g.: void foo(enum string input) { } This would only allow foo to be called with a manifest constant (e.g. a string literal). This way you could overload functions and perform optimizations in some overloads. Note that we already have a special way to e.g. mark a function to only take a null literal, by making the parameter "typeof(null)". But using typeof("") wouldn't quite work here, so if there's a language enhancement to be made I'm in favor of the enum approach. Btw this thread has already popped up before, e.g.: http://forum.dlang.org/thread/j0ie3e$18mh$1 digitalmars.com where I showed the problem of toStringz: ----- import std.string; import std.stdio; struct A { immutable char[2] foo; char[2] bar; } void main() { auto a = A("aa", "\0b"); // embed \0 in the second field // toStringz picks up \0 from the 'bar' field, does not allocate auto charptr = toStringz(a.foo[]); a.bar = "bo"; // remove embedded \0 from second field printf(charptr); // prints two chars and then garbage } ----- Couldn't this also have security issues? I think we definitely have to fix toStringz, the look-ahead for \0 is an over-reaching optimization, and at best it should be selectable behavior in the same way as we have assumeUnique for arrays.
Aug 21 2013
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 21, 2013 08:18:47 monarch_dodra wrote:
 I'm looking for advice on fixing "toStringz".
Well, I suspect that we have one of two options. Either 1. Do something with the type system as Andrej suggested. 2. Check whether the string refers to the section of memory that the string literals are put in. Anything else would probably involve making assumptions like toStringz is currently doing. Augmenting the type system clearly isn't a short term solution, even if it were determined to be a good idea (and I'm not convinced that it's at all worth it), so that's out at least for the moment. Checking whether the string is in the memory range that string literals may or may not be feasible. It is my understanding that on Linux, string literals are put into a section of the binary which is treated as read-only, so I would think that it would be possible to get the start and end points of that memory range and check it, but I don't know how you'd do that (you'd also have to do it in a static constructor or somethnig if you wanted to maintain purity, which isn't entirely ideal). I also have no idea whether that's possible on Windows, since, as I recall, the string literals get put into writable memory on Windows, so there may be no special memory range to check. Other than that, unfortunately, I don't know how you'd check for string literals unless there's some piece of information that druntime has that I don't know about or can't think of at the moment. Now, another optimization that could be added which would be of less benefit and would not affect string literals would be to check whether the last character in the string is '\0', and that would at least catch the cases where someone had already appended a null character or had a null-terminated string for one reason or another. So, as a first-pass fix, I'd suggest simply doing that check and dropping the attempt to catch string literals. That way, we at least get the correctness issue fixed. Once that's done, we can look further into how we might determine whether a string is a literal or not. - Jonathan M Davis
Aug 21 2013