digitalmars.D - toStringz and predictability

Ben Hinkle <Ben_member pathlink.com> writes:

There's something about toStringz that has me uncomfortable. Consider this code:

import std.string;

int main() {
char* x;
uint b1;
char[4] y;
uint b2;
y[0] = 'a';
y[1] = 'b';
y[2] = 'c';
y[3] = 'd';
x = toStringz(y);
printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
b1 = 0x11223344;
b2 = 0x11223344;
printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
return 0;
}

Here's what it prints when I run it:
x length is 4, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874
x length is 17, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874

The reason why the length changed is that toStringz looks at one past the length
of the string to see if it is 0 and does nothing to the string if it is. But the
sample code then changes the byte past the string by touching a completely
different variable and so the toStringz result is "corrupted". I have toStringz
calls sprinkled through my code when I call C functions and now I'm starting to
get nervous about the lifespans of those strings and how to figure out if they
are valid or not. Thoughts? Walter, is there a guideline I should follow? The
most extreme one that comes to mind is "only call toStringz for strings that get
immediately copied".

-Ben

Jan 18 2005

"Walter" <newshound digitalmars.com> writes:

"Ben Hinkle" <Ben_member pathlink.com> wrote in message
news:csj4hq$1cvi$1 digitaldaemon.com...
 The reason why the length changed is that toStringz looks at one past the
 length of the string to see if it is 0 and does nothing to the string if it
 is. But the sample code then changes the byte past the string by touching a
 completely different variable and so the toStringz result is "corrupted". I
 have toStringz calls sprinkled through my code when I call C functions and
 now I'm starting to get nervous about the lifespans of those strings and how
 to figure out if they are valid or not. Thoughts? Walter, is there a
 guideline I should follow? The most extreme one that comes to mind is "only
 call toStringz for strings that get immediately copied".

It's "COW" (Copy On Write) to the rescue. The idea is only modify a string
that you know is unique. If you don't know it is unique, make a copy of it
before modifying it. After the toStringz(), you're modifying the argument to
toStringz() but there's another reference to that string that expects it to
not change.
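
A minimal sketch of the copy-on-write rule described here, written against present-day D (an assumption; the thread predates it). The helper name upperFirst is hypothetical and only serves to show "copy before you write":

```d
import std.stdio : writeln;

// Hypothetical helper: never modifies the caller's data in place.
// It duplicates the input first and writes only to the private copy.
char[] upperFirst(const(char)[] s)
{
    char[] copy = s.dup;                       // copy before writing
    if (copy.length > 0 && copy[0] >= 'a' && copy[0] <= 'z')
        copy[0] = cast(char)(copy[0] - 'a' + 'A');
    return copy;
}

void main()
{
    char[] original = "abc".dup;
    char[] changed = upperFirst(original);
    writeln(original);                         // the original is untouched
    writeln(changed);
}
```

Note that this rule only protects bytes that belong to the string; it says nothing about the byte one past the end, which is what the rest of the thread is about.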

Jan 18 2005

Ben Hinkle <Ben_member pathlink.com> writes:

In article <csjffu$1qtp$1 digitaldaemon.com>, Walter says...
"Ben Hinkle" <Ben_member pathlink.com> wrote in message
news:csj4hq$1cvi$1 digitaldaemon.com...
 The reason why the length changed is that toStringz looks at one past the
 length of the string to see if it is 0 and does nothing to the string if it
 is. But the sample code then changes the byte past the string by touching a
 completely different variable and so the toStringz result is "corrupted". I
 have toStringz calls sprinkled through my code when I call C functions and
 now I'm starting to get nervous about the lifespans of those strings and how
 to figure out if they are valid or not. Thoughts? Walter, is there a
 guideline I should follow? The most extreme one that comes to mind is "only
 call toStringz for strings that get immediately copied".

It's "COW" (Copy On Write) to the rescue. The idea is only modify a string
that you know is unique. If you don't know it is unique, make a copy of it
before modifying it. 

But the string doesn't necessarily own the byte after the string. It's a random
piece of memory. Even if the string is living on the heap the byte one past the
array can be changed at pretty much any time by anything. Modifying the byte
following a string is different than modifying a string.

 After the toStringz(), you're modifying the argument to
 toStringz() [...]

actually I'm not. I'm modifying another variable.

Jan 18 2005

Anders F Björklund <afb algonet.se> writes:

Ben Hinkle wrote:

 But the string doesn't necessarily own the byte after the string. It's a random
 piece of memory. Even if the string is living on the heap the byte one past the
 array can be changed at pretty much any time by anything. Modifying the byte
 following a string is different than modifying a string.

The bug is in std/string.d :

 	p = &string[0] + string.length;
 
 	// Peek past end of string[], if it's 0, no conversion necessary.
 	// Note that the compiler will put a 0 past the end of static
 	// strings, and the storage allocator will put a 0 past the end
 	// of newly allocated char[]'s.
 	if (*p == 0)
 	    return string;

Yes, it does work for string literals and for dynamic arrays...
But it doesn't work for slices of pointers, or static arrays ?

Unless there is a way to separate them, it should be avoided.
(since with the pointers/statics, the byte after is off-limits)

--anders

Jan 18 2005

Ben Hinkle <Ben_member pathlink.com> writes:

Yes, it does work for string literals and for dynamic arrays...

Actually it doesn't even work for dynamic arrays:

import std.string;
int main() {
char* x;
char[] y = new char[32];
y[] = 0;
char[] z = new char[32];
z[] = 32;
x = toStringz(z);
printf("x length is %d\n",strlen(x));
y[] = 32;
printf("x length is %d\n",strlen(x));
return 0;
}

outputs
x length is 32
x length is 67

This is due to how the memory manager allocates memory.
-Ben

Jan 18 2005

Anders F Björklund <afb algonet.se> writes:

Ben Hinkle wrote:

Yes, it does work for string literals and for dynamic arrays...

 
 Actually it doesn't even work for dynamic arrays:

Funny, I was just writing that :-)

It breaks down for certain multiples of two.
(16, 32, 64, 128, 256, 512, 1024, and so on)

Sample test program:

 import std.string;
 void main()
 {
   for (int x = 15; x <= 17; x++)
   {
     char[] a = new char[x];
     char[] b = new char[x];
     char[] c = new char[x];
     a[0] = 0;
     b[0] = 0;
     c[0] = 0;
     printf("%d %p\n",a);
     printf("%d %p\n",b);
     printf("%d %p\n",c);
     char *p = &a[0] + a.length;
     if(*p != 0) printf("not 0\n"); else printf("is 0\n");
     for(int i = 0; i < b.length; i++)
       b[i] = 'A' + i;
     char *z = toStringz(b);
     for(int i = 0; i < a.length; i++)
       a[i] = 'X';
     for(int i = 0; i < c.length; i++)
       c[i] = 'X';
     printf("%s\n",z);
   }
 }

Prints:

 15 0xbf498fe0
 15 0xbf498fd0
 15 0xbf498fc0
 is 0
 ABCDEFGHIJKLMNO
 16 0xbf498fb0
 16 0xbf498fa0
 16 0xbf498f90
 not 0
 ABCDEFGHIJKLMNOPXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 17 0xbf497fa0
 17 0xbf497f80
 17 0xbf497f60
 is 0
 ABCDEFGHIJKLMNOPQ

Perhaps a bit contrived, but shows how it works...

std.string.toStringz is broken.

--anders

Jan 18 2005

"Ben Hinkle" <bhinkle mathworks.com> writes:

"Walter" <newshound digitalmars.com> wrote in message
news:csjffu$1qtp$1 digitaldaemon.com...
 "Ben Hinkle" <Ben_member pathlink.com> wrote in message
 news:csj4hq$1cvi$1 digitaldaemon.com...
 The reason why the length changed is that toStringz looks at one past the
 length of the string to see if it is 0 and does nothing to the string if it
 is. But the sample code then changes the byte past the string by touching a
 completely different variable and so the toStringz result is "corrupted". I
 have toStringz calls sprinkled through my code when I call C functions and
 now I'm starting to get nervous about the lifespans of those strings and how
 to figure out if they are valid or not. Thoughts? Walter, is there a
 guideline I should follow? The most extreme one that comes to mind is "only
 call toStringz for strings that get immediately copied".

 It's "COW" (Copy On Write) to the rescue. The idea is only modify a string
 that you know is unique. If you don't know it is unique, make a copy of it
 before modifying it. After the toStringz(), you're modifying the argument to
 toStringz() but there's another reference to that string that expects it to
 not change.

In case you need another example, I can imagine that just calling a function
could corrupt a toStringz result. Suppose the char[] is stored on the stack,
the last element of the array is at the very top of the stack, the next byte
past the top of the stack is zero, and the stack grows up in memory. Then
calling toStringz (suppose that call was inlined, just for simplicity)
wouldn't make a copy, but calling a function after that would push another
stack frame, which could set a non-zero byte immediately following the array.
That would corrupt the result of toStringz. I couldn't get this to happen on
any machine I have around here, since it depends on the stack architecture
and how function calls work, but the problem is still there for some
architectures.

So I have a suggestion. Have toStringz always copy if the array is on the
stack. Have it never copy if the array is in the data segment (so literals
behave as they do today) and have it check the GC capacity to ask the GC for
control over the byte following the array (though the length of the array
would be unchanged). To implement this toStringz would probably have to be
moved out of std.string and into internal. If it copied everything except
literals then I can see keeping it in std.string. Anyhow, I agree with Anders
that something should be done.

-Ben

Jan 19 2005

parabolis <parabolis softhome.net> writes:

Ben Hinkle wrote:

 So I have a suggestion. Have toStringz always copy if the array is on the
 stack. Have it never copy if the array is in the data segment (so literals
 behave as they do today) and have it check the GC capacity to ask the GC for
 control over the byte following the array (though the length of the array
 would be unchanged). To implement this toStringz would probably have to be
 moved out of std.string and into internal. If it copied everything except
 literals then I can see keeping it in std.string. Anyhow, I agree with Anders
 that something should be done.
 

Would this implementation work?

----------------------------------------------------------------
char* toStringzz(char[] str) {
   str.length++;
   str[length-1] = cast(char)0x00;
   return cast(char*)&str;
}
----------------------------------------------------------------

That is to say is the array resizing implementation sufficient to 
determine whether str is dynamic or static on its own and if it is 
dynamic deal wisely with cases where incrementing length might be 
sufficient? Can you break toStringzz in any of the cases that 
toStringz breaks?

Jan 19 2005

"Ben Hinkle" <bhinkle mathworks.com> writes:

"parabolis" <parabolis softhome.net> wrote in message
news:csmbbh$444$1 digitaldaemon.com...
 Ben Hinkle wrote:

 So I have a suggestion. Have toStringz always copy if the array is on the
 stack. Have it never copy if the array is in the data segment (so literals
 behave as they do today) and have it check the GC capacity to ask the GC for
 control over the byte following the array (though the length of the array
 would be unchanged). To implement this toStringz would probably have to be
 moved out of std.string and into internal. If it copied everything except
 literals then I can see keeping it in std.string. Anyhow, I agree with Anders
 that something should be done.

 Would this implementation work?

 ----------------------------------------------------------------
 char* toStringzz(char[] str) {
    str.length++;
    str[length-1] = cast(char)0x00;
    return cast(char*)&str;
 }
 ----------------------------------------------------------------

 That is to say is the array resizing implementation sufficient to
 determine whether str is dynamic or static on its own and if it is
 dynamic deal wisely with cases where incrementing length might be
 sufficient? Can you break toStringzz in any of the cases that
 toStringz breaks?

Nice idea. I think it's on the right track. I've cleaned it up a bit:
char* toStringzz(char[] str) {
    str.length = str.length+1;
    str[length-1] = 0;
    return str.ptr;
}

Also it copies string literals. If there is an easy way to check if
something is a string literal we can add that to your code and have a good
solution, I think.

Jan 19 2005

Lukas Pinkowski <Lukas.Pinkowski web.de> writes:

Ben Hinkle wrote:
 Nice idea. I think it's on the right track. I've cleaned it up a bit:
 char* toStringzz(char[] str) {
     str.length = str.length+1;
     str[length-1] = 0;
     return str.ptr;
 }
 
 Also it copies string literals. If there is an easy way to check if
 something is a string literal we can add that to your code and have a good
 solution, I think.

Hm, doesn't D initialize uninitialized chars to 0 (here str[length-1]), so
you can leave out the str[length-1] = 0; part?

Thus better:

char* toStringzz(char[] str) {
    str.length = str.length+1;
    return str.ptr;
}

But this actually alters the parameter (is this intended?)

My version would be:

char* toStringz( in char[] str )
{
  char[] new_str;
  new_str.length = str.length + 1;
  new_str[0 .. length-2] = str[0 .. length-1];
  return &new_str[0];
}

Creating a copy of the parameter, thus not changing it as you would think
for in-parameters. I checked and it works for string literals, too.

Jan 19 2005

"Ben Hinkle" <bhinkle mathworks.com> writes:

"Lukas Pinkowski" <Lukas.Pinkowski web.de> wrote in message
news:csmfl4$a4c$1 digitaldaemon.com...
 Ben Hinkle wrote:
 Nice idea. I think it's on the right track. I've cleaned it up a bit:
 char* toStringzz(char[] str) {
     str.length = str.length+1;
     str[length-1] = 0;
     return str.ptr;
 }

 Also it copies string literals. If there is an easy way to check if
 something is a string literal we can add that to your code and have a good
 solution, I think.

 Hm, doesn't D initialize uninitialized chars to 0 (here str[length-1]), so
 you can leave out the str[length-1] = 0; part?

the initializer for char is 0xFF.
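
A small check of that statement, as a sketch in present-day D (an assumption; the thread predates it). It shows that a default-initialized char, the elements of a freshly allocated char[], and the element added by growing .length all hold 0xFF rather than 0, so the explicit terminator write cannot be dropped:

```d
import std.stdio : writefln;

void main()
{
    char c;                            // default-initialized, no explicit value
    char[] buf = new char[3];          // elements are default-initialized too

    char[] s = "abc".dup;
    s.length = s.length + 1;           // the new last element gets char.init

    writefln("c = 0x%02X, buf[0] = 0x%02X, s[$-1] = 0x%02X",
             cast(ubyte) c, cast(ubyte) buf[0], cast(ubyte) s[$ - 1]);

    assert(char.init == 0xFF);
    assert(c == 0xFF && buf[2] == 0xFF && s[$ - 1] == 0xFF);
}
```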

 Thus better:

 char* toStringzz(char[] str) {
     str.length = str.length+1;
     return str.ptr;
 }

 But this actually alters the parameter (is this intended?)

an array is a pointer to data and a length. Those are passed by value, so
changing the length does not change the original string passed to the
function.
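
To illustrate the point about pass-by-value, here is a minimal sketch in present-day D (an assumption; the thread predates it). The function name grow is hypothetical. The callee receives its own pointer/length pair, so resizing it leaves the caller's array as it was:

```d
// Hypothetical helper: resizes only its local copy of the pointer/length pair.
void grow(char[] s)
{
    s.length = s.length + 1;   // may extend in place or reallocate
    s[$ - 1] = '\0';           // writes past the caller's length, not into it
}

void main()
{
    char[] a = "abc".dup;
    grow(a);
    assert(a.length == 3);     // the caller's length is unchanged
    assert(a == "abc");        // and so are the caller's three characters
}
```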

 My version would be:

 char* toStringz( in char[] str )
 {
   char[] new_str;
   new_str.length = str.length + 1;
   new_str[0 .. length-2] = str[0 .. length-1];
   return &new_str[0];
 }

 Creating a copy of the parameter, thus not changing it as you would think
 for in-parameters. I checked and it works for string literals, too.

watch out for the case when new_str.ptr is str.ptr since I expect the array
copy will error if you try to copy overlapping arrays.

Jan 19 2005

Georg Wrede <georg.wrede nospam.org> writes:

(Actually, I refer here to several examples in this thread.)

char* toStringzz(char[] str) {
    str.length = str.length+1;
    str[length-1] = 0;
    return str.ptr;
}



What bothers me is, if a string gets repeatedly passed, say, between a 
library and the main program, and the library functions pass the string 
on to the OS or another library, every time using toStringz -- then what 
keeps the string from growing at each iteration? Finally we end up with 
a (possibly short) string with a lot of zeros at the end.

It seems harmless at first glance, but what if later this kind of 
strings are concatenated (in D code) and passed on to a C-written 
parser? It would see a lot of "empty strings" between real data.

Or am I missing something?

In the same manner, should toStringz guarantee a valid C string? I.e. no 
internal zeros? At the _very least_ in the non-release build!

----

The name toStringz is misleading. Since the only use for it is to make 
strings edible for C code, it should be renamed toStringC. Normally, if 
a programmer _wants_ to slap a zero at the end, he'd use ~, wouldn't he.

Misnomers like this introduce parallax, and in this case so subtle that 
we don't even notice. And that's where it _really_ counts!

Jan 24 2005

Anders F Björklund <afb algonet.se> writes:

Georg Wrede wrote:

 It seems harmless at first glance, but what if later this kind of 
 strings are concatenated (in D code) and passed on to a C-written 
 parser? It would see a lot of "empty strings" between real data.
 
 Or am I missing something?

It would probably be easier to remove the hack altogether and just copy?

     body
     {
 	if (string.length == 0)
 	    return "";
 
 	// Need to make a copy
 	char[] copy = new char[string.length + 1];
 	copy[0..string.length] = string;
 	copy[string.length] = 0;
 	return copy;
     }

Isn't that just what "string.length = string.length + 1" does, anyway ?

It would be neat if it could be optimized for string literals, but not
at the expense of making the whole function unstable? (like it is now)

 In the same manner, should toStringz guarantee a valid C string? I.e. no 
 internal zeros? At the _very least_ in the non-release build!

The contract for toStringz specifies that the char[] is *without* '\0':

     in
     {
 	if (string)
 	{
 	    // No embedded 0's
 	    for (uint i = 0; i < string.length; i++)
 		assert(string[i] != 0);
 	}
     }
     out (result)
     {
 	if (result)
 	{   assert(strlen(result) == string.length);
 	    assert(memcmp(result, string, string.length) == 0);
 	}
     }

It also (implicitly) returns a "" string, for an input param of null.

 The name toStringz is misleading. Since the only use for it is to make 
 strings edible for C code, it should be renamed toStringC. Normally, if 
 a programmer _wants_ to slap a zero at the end, he'd use ~, wouldn't he.

It converts a char[], to a zero-terminated char*. No "C" about that ??
(I'm not sure why it doesn't just 'return (string ~ "\0");', anyone ?)
==> body { return ((string.length == 0) ? "" : string ~ "\0"); }

Besides, most of the C functions do not accept UTF-8 input anyway...
To be usable from regular C, it would need to be converted to byte* ?
(and that would most likely involve charset encoding conversion too)
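
The "just copy" variant discussed above could look roughly like this in present-day D (an assumption; the thread predates it). The name toStringzCopy is hypothetical and this is not the Phobos implementation. It never peeks past the end of the input, so it behaves the same for literals, slices, static arrays and heap arrays:

```d
import core.stdc.string : strlen;

// Hypothetical always-copy conversion: allocates length + 1 bytes,
// copies the characters and writes the terminator explicitly.
char* toStringzCopy(const(char)[] s)
{
    char[] copy = new char[s.length + 1];
    copy[0 .. s.length] = s[];
    copy[s.length] = '\0';
    return copy.ptr;
}

void main()
{
    char[4] y = "abcd";                // static array, no terminator of its own
    char* z = toStringzCopy(y[]);
    assert(strlen(z) == 4);            // independent of what follows y in memory
    assert(strlen(toStringzCopy("")) == 0);
}
```

The cost is one allocation per call, including for string literals, which is the trade-off the thread is weighing against the peek-past-the-end shortcut.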

--anders

Jan 24 2005

"Ben Hinkle" <bhinkle mathworks.com> writes:

another version:

char* toStringzz(char[] str) {
    str ~= 0;
    return str.ptr;
}

Jan 19 2005

Ben Hinkle <Ben_member pathlink.com> writes:

In article <csjffu$1qtp$1 digitaldaemon.com>, Walter says...
"Ben Hinkle" <Ben_member pathlink.com> wrote in message
news:csj4hq$1cvi$1 digitaldaemon.com...
 The reason why the length changed is that toStringz looks at one past the
 length of the string to see if it is 0 and does nothing to the string if it
 is. But the sample code then changes the byte past the string by touching a
 completely different variable and so the toStringz result is "corrupted". I
 have toStringz calls sprinkled through my code when I call C functions and
 now I'm starting to get nervous about the lifespans of those strings and how
 to figure out if they are valid or not. Thoughts? Walter, is there a
 guideline I should follow? The most extreme one that comes to mind is "only
 call toStringz for strings that get immediately copied".

It's "COW" (Copy On Write) to the rescue. The idea is only modify a string
that you know is unique. If you don't know it is unique, make a copy of it
before modifying it. After the toStringz(), you're modifying the argument to
toStringz() but there's another reference to that string that expects it to
not change.

ok, one last try. Walter, I can't tell if you still think this counts as COW. So
let me boil it down to a question. Given the code
char[1] str;
char* cstr = toStringz(str);
ubyte x = 1;
what is strlen(cstr)?
I claim the answer is compiler dependent: it depends on whether the compiler
put the storage location for x immediately following str. Sure, running the
code doesn't show a problem, due to word alignment etc., but going by the
language definition and the definition of toStringz, the strlen is unknown.

-Ben

Jan 20 2005

Anders F Björklund <afb algonet.se> writes:

Ben Hinkle wrote:

 There's something about toStringz that has me uncomfortable. Consider this
 code:
 
 import std.string;
 
 int main() {
 char* x;
 uint b1;
 char[4] y;
 uint b2;
 y[0] = 'a';
 y[1] = 'b';
 y[2] = 'c';
 y[3] = 'd';
 x = toStringz(y);
 printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
 b1 = 0x11223344;
 b2 = 0x11223344;
 printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
 return 0;
 }
 
 Here's what it prints when I run it:
 x length is 4, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874
 x length is 17, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874

That's dependent on the compiler, and the alignment:

GDC Linux:
x length is 4, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0
x length is 25, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0

GDC Mac OS X:
x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8
x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8


But why are you calling toStringz on a simple (char*),
without having it properly NUL-terminated at the end ?
If you change the code to : char[] y = new char[4];

Then it prints:
x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c


A more interesting question is why : x = toStringz(y[0..4]);
does *not* make a copy of the converted pointer-to-characters,
just because the next byte in memory happens to be a NUL char?
(ie. it works if first byte of "b1" is 42, but not if it's 0)

Having to use x = toStringz(y[0..4].dup); just because of
this little "optimization" feature is not exactly a given...
There should probably be a small warning printed about using
toStringz on slices (since it works with literals and arrays)

But that it fails on pointers and static arrays is not surprising?

--anders

PS. If you add a -O on Mac OS X, then it prints "12" instead.
     So just because it printed 4 above doesn't mean it works.

Jan 18 2005

Ben Hinkle <Ben_member pathlink.com> writes:

That's dependent on the compiler, and the alignment:

GDC Linux:
x length is 4, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0
x length is 25, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0

GDC Mac OS X:
x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8
x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8

even more interesting...

But why are you calling toStringz on a simple (char*),
without having it properly NUL-terminated at the end ?

The point of toStringz is to make a D string null terminated.

If you change the code to : char[] y = new char[4];

Then it prints:
x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c

That's because the "new" allocates space on the heap and so it has nothing to do
with b1 and b2 after that. To corrupt the string on the heap you'll have to wait
until something else gets allocated right after that string and then assign
something to the first byte.

A more interesting question is why : x = toStringz(y[0..4]);
does *not* make a copy of the converted pointer-to-characters,
just because the next byte in memory happens to be a NUL char?
(ie. it works if first byte of "b1" is 42, but not if it's 0)

Having to use x = toStringz(y[0..4].dup); just because of
this little "optimization" feature is not exactly a given...
There should probably be a small warning printed about using
toStringz on slices (since it works with literals and arrays)

I'm starting to think the only safe usage of toStringz is on arrays where you
can guarantee the byte after the string is owned by the string - which includes
literals and maybe some other special cases.

But that it fails on pointers and static arrays is not surprising?

--anders


PS. If you add a -O on Mac OS X, then it prints "12" instead.
     So just because it printed 4 above doesn't mean it works.

ok.

Jan 18 2005

Anders F Björklund <afb algonet.se> writes:

Ben Hinkle wrote:

But why are you calling toStringz on a simple (char*),
without having it properly NUL-terminated at the end ?

 
 The point of toStringz is to make a D string null terminated.

Never mind, I was thinking in C (just because it is implemented
that way), forget that D treats static arrays as having lengths...

--anders

Jan 18 2005

Anders F Björklund <afb algonet.se> writes:

Ben Hinkle wrote:

If you change the code to : char[] y = new char[4];

Then it prints:
x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c

 
 That's because the "new" allocates space on the heap and so it has nothing to
 do with b1 and b2 after that. To corrupt the string on the heap you'll have to
 wait until something else gets allocated right after that string and then
 assign something to the first byte.

Right, I think I only got lucky because how it allocates memory...

I couldn't find any traces of "the storage allocator will put a 0
past the end of newly allocated char[]'s", so that must be just DMC.

In fact, I'm not sure that even DMD does it ? This test program:

 void main()
 {
   for (int i = 1; i <= 1024; i++)
   {
     char[] a = new char[i];
     char *p = &a[0] + a.length;
     if(*p != 0) printf("%d\n",i);
   }
 }

Prints out 16,32,64,128,256,512,1024 for *all* the various D compilers.

So that toStringz peeks beyond the length of the array is clearly a bug!
Perhaps if it could tell that the argument is a string literal ? Naah...

--anders

Jan 18 2005

parabolis <parabolis softhome.net> writes:

Ben Hinkle wrote:
 There's something about toStringz that has me uncomfortable. Consider this
 code:

There is something else that you should be uncomfortable about - the 
domains of C strings and D strings are not the same. The toStringz 
function is so named because C strings are 'Z'ero (or null) 
terminated. That implies they cannot contain a null character yet D 
strings have no such silly limitations. So the toStringz function 
should probably look like this:

----------------------------------------------------------------
char* toStringz(char[] dStr) {
   char[] cStr = new char[dStr.length+1];
   foreach(int i, char dChar; dStr) {
     if(!(cStr[i] = dChar)) throw new Exception("Null char");
   }
   return &cStr;
----------------------------------------------------------------

Now seems like a great time for plugging the unless/until feature of 
Perl as being nice in this context allowing:

   unless(cStr[i] = dChar) throw new Exception("Null char");
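
For reference, a compilable sketch of the checked conversion in present-day D (an assumption; the thread predates it). The name toStringzChecked is hypothetical. Compared with the outline above it returns the buffer's pointer rather than the address of the array variable, and it writes the terminator explicitly, since new char elements start out as 0xFF:

```d
import core.stdc.string : strlen;

// Hypothetical checked conversion: rejects input with an embedded zero,
// otherwise returns a freshly allocated, zero-terminated copy.
char* toStringzChecked(const(char)[] dStr)
{
    char[] cStr = new char[dStr.length + 1];
    foreach (i, c; dStr)
    {
        if (c == '\0')
            throw new Exception("Null char");
        cStr[i] = c;
    }
    cStr[dStr.length] = '\0';          // explicit terminator
    return cStr.ptr;
}

void main()
{
    assert(strlen(toStringzChecked("abc")) == 3);

    bool threw = false;
    try
        toStringzChecked("a\0b");      // embedded zero is rejected
    catch (Exception e)
        threw = true;
    assert(threw);
}
```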

Jan 19 2005

"Ben Hinkle" <ben.hinkle gmail.com> writes:

"parabolis" <parabolis softhome.net> wrote in message 
news:csmiqa$edp$1 digitaldaemon.com...
 Ben Hinkle wrote:
 There's something about toStringz that has me uncomfortable. Consider 
 this code:

 There is something else that you should be uncomfortable about - the 
 domains of C strings and D strings are not the same. The toStringz 
 function is so named because C strings are 'Z'ero (or null) terminated. 
 That implies they cannot contain a null character yet D strings have no 
 such silly limitations.

I hate to disagree but.. that doesn't bother me. I don't see anything wrong 
with ignoring interior zeros. toStringz just makes sure it is 
zero-terminated - not that there aren't any internal zeros.
[snip]

Jan 19 2005

parabolis <parabolis softhome.net> writes:

Ben Hinkle wrote:
 "parabolis" <parabolis softhome.net> wrote in message 
 news:csmiqa$edp$1 digitaldaemon.com...
 
Ben Hinkle wrote:

There's something about toStringz that has me uncomfortable. Consider 
this code:

There is something else that you should be uncomfortable about - the 
domains of C strings and D strings are not the same. The toStringz 
function is so named because C strings are 'Z'ero (or null) terminated. 
That implies they cannot contain a null character yet D strings have no 
such silly limitations.

 
 
 I hate to disagree but.. that doesn't bother me. I don't see anything wrong 
 with ignoring interior zeros. toStringz just makes sure it is 
 zero-terminated - not that there aren't any internal zeros.
 [snip]
 
 

----------------------------------------------------------------
char* toStringz(char[] dStr, bit ignoreNullsInString = true)
----------------------------------------------------------------

Jan 19 2005

"Matthew" <admin.hat stlsoft.dot.org> writes:

"parabolis" <parabolis softhome.net> wrote in message
news:csmiqa$edp$1 digitaldaemon.com...
 Ben Hinkle wrote:
 There's something about toStringz that has me uncomfortable. Consider this
 code:

 There is something else that you should be uncomfortable about - the domains
 of C strings and D strings are not the same. The toStringz function is so
 named because C strings are 'Z'ero (or null) terminated. That implies they
 cannot contain a null character yet D strings have no such silly limitations.
 So the toStringz function should probably look like this:

 ----------------------------------------------------------------
 char* toStringz(char[] dStr) {
   char[] cStr = new char[dStr.length+1];
   foreach(int i, char dChar; dStr) {
     if(!(cStr[i] = dChar)) throw new Exception("Null char");
   }
   return &cStr;
 ----------------------------------------------------------------

 Now seems like a great time for plugging the unless/until feature of Perl as
 being nice in this context allowing:

   unless(cStr[i] = dChar) throw new Exception("Null char");

Has there been debate about unless/until? If so, count me on the list of
'wanting'. :-)

Jan 21 2005

parabolis <parabolis softhome.net> writes:

Matthew wrote:
 "parabolis" <parabolis softhome.net> wrote in message
news:csmiqa$edp$1 digitaldaemon.com...
 
----------------------------------------------------------------
char* toStringz(char[] dStr) {
  char[] cStr = new char[dStr.length+1];
  foreach(int i, char dChar; dStr) {
    if(!(cStr[i] = dChar)) throw new Exception("Null char");
  }
  return &cStr;
----------------------------------------------------------------

Now seems like a great time for plugging the unless/until feature of Perl as
being nice in this context allowing:

  unless(cStr[i] = dChar) throw new Exception("Null char");

 
 
 Has there been debate about unless/until? If so, count me on the list of
 'wanting'. :-)
 

Yes back around the time the digitalmars.d newsgroup started:

http://www.digitalmars.com/d/archives/digitalmars/D/1714.html

Walter wrote:
"Brian Hammond" <d at brianhammond dot comBrian_member xx 
pathlink.com> wrote
in message news:c8lmu2$vdm$1 xx digitaldaemon.com...
 I really like the unless because it reads so well.

 "do this unless this is true"

 That just seems backwards to me <g>. I like things to execute
 forwards, not backwards.

However, Walter's response was long before "is" replaced "===", so I
think it at least deserves another consideration, as Perl's unless
construct would give us "unless(A is null)" instead of the awkward and
much maligned "if(!(A is null))".

Jan 21 2005
