digitalmars.D - toStringz or not toStringz
- Regan Heath (16/16) Jul 08 2011 Sorry if this has been asked/answered before but I've been out of the lo...
- Walter Bright (2/5) Jul 08 2011 Because char* in C does not necessarily mean "zero terminated string".
- Regan Heath (20/26) Jul 08 2011 Sure, but in many (most?) cases it does. And in those cases where it
- Steven Schveighoffer (21/46) Jul 08 2011 What about a template function that does this automatically? I'm thinkin...
- SimonM (12/15) Jul 08 2011 This is kind of off-topic, and I don't know if it's just me, but I've
- Mike Parker (6/13) Jul 08 2011 import std.utf;
- Jonathan M Davis (3/21) Jul 08 2011 https://github.com/D-Programming-Language/phobos/pull/123
- Walter Bright (5/21) Jul 08 2011 In the worst case, you're adding an extra memory allocation and function...
- Regan Heath (17/47) Jul 12 2011 This worst case only happens when:
- Steven Schveighoffer (5/43) Jul 12 2011 What if you expect the function is expecting to write to the buffer, and...
- Regan Heath (16/62) Jul 12 2011 Assuming a C function in this form:
- Steven Schveighoffer (12/75) Jul 12 2011 No, assuming C function in this form:
- Steven Schveighoffer (6/15) Jul 12 2011 And, actually, the cost penalty of checking if you are going to segfault...
- Regan Heath (9/25) Jul 12 2011 I wouldn't know anything about this. I was assuming when toStringz was ...
- Regan Heath (36/70) Jul 12 2011 Ok, that's an even better example for my case.
- Steven Schveighoffer (22/79) Jul 12 2011 No, it wouldn't compile. char[] does not cast implicitly to char *. (i...
- Regan Heath (12/100) Jul 12 2011 Replace foo with foo.ptr, it makes no difference to the point I was maki...
- Regan Heath (7/31) Jul 12 2011 Gah.. bad grammar.. 1/2 baked sentences..
- Steven Schveighoffer (44/153) Jul 12 2011 You fix does not help in that case, foo.ptr will be passed as a non-null...
- Regan Heath (58/189) Jul 13 2011 Ok, it's clear there has been some confusion over what exactly I am
- Steven Schveighoffer (57/149) Jul 13 2011 This is an O(n) operation -- too much overhead. Especially if you alrea...
- Jonathan M Davis (9/208) Jul 13 2011 Andrej Mitrovic found a rather annoying issue (which is fortunately high...
- Regan Heath (82/247) Jul 13 2011 On 2nd thought, this step is unnecessary unless the array length matches...
- Steven Schveighoffer (5/21) Jul 13 2011 Who said the char * points into GC memory? It could point at stack
- Regan Heath (24/47) Jul 14 2011 Ok. What would toStringz do in this case? .. because that's what I'm
- Steven Schveighoffer (19/67) Jul 14 2011 Nothing, you don't call toStringz on a char *, you call it on a string. ...
- Regan Heath (20/92) Jul 14 2011 Good points all. So, the idea should be limited to cases where D's char...
Sorry if this has been asked/answered before but I've been out of the loop for a while..

I was just thinking about the recent discussion on renaming toStringz and I wondered why we need to explicitly call it at all. Why can't we have the compiler call it automatically whenever we pass a string, or char[] to an extern "C" function, where the parameter is defined as char*?

I believe some extern "C" functions are defined as taking ubyte* or byte* instead of char*, but in those cases I believe they are 'buffers' and have a supplied length as well, meaning there is no need for the trailing \0 in any case.

I am probably missing something obvious, but it seems like it might work.

Side note.. It bothers me a little that 'char' means utf-8 codepoint in D, and means unsigned byte in extern "C" definitions, but I can live with that.

--
Using Opera's revolutionary email client: http://www.opera.com/mail/
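For readers following along, the explicit call being discussed looks roughly like the sketch below. It assumes current Phobos (std.string.toStringz) and uses C's strlen purely as a convenient zero-terminated consumer; the variable names are illustrative only.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    string s = "hello";
    // A slice carries no terminator of its own: the byte after "hell" is 'o'.
    string part = s[0 .. 4];

    // toStringz hands back a pointer to zero-terminated data
    // (copying when it cannot prove a terminator is already there).
    const(char)* cstr = toStringz(part);
    assert(strlen(cstr) == 4);
}
```

The proposal in this thread is for the compiler to insert the toStringz call itself whenever a string or char[] is passed where the extern(C) parameter is char*.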
Jul 08 2011
On 7/8/2011 2:26 AM, Regan Heath wrote:
> Why can't we have the compiler call it automatically whenever we pass a string, or char[] to an extern "C" function, where the parameter is defined as char*?

Because char* in C does not necessarily mean "zero terminated string".
Jul 08 2011
On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright <newshound2 digitalmars.com> wrote:
> On 7/8/2011 2:26 AM, Regan Heath wrote:
>> Why can't we have the compiler call it automatically whenever we pass a string, or char[] to an extern "C" function, where the parameter is defined as char*?
>
> Because char* in C does not necessarily mean "zero terminated string".

Sure, but in many (most?) cases it does. And in those cases where it doesn't you could argue ubyte* or byte* should have been used in the D extern "C" declaration instead.

Plus, in those cases, worst case scenario, D passes an extra \0 byte to those functions which either ignore it because they were also passed a length, or expect a fixed sized structure, or .. I don't know what as I can't imagine another case where char* would be used without it being a "zero terminated string", or passing/knowing the length ahead of time.

D is already allocating an extra \0 byte for string constants right? And, I assume, toStringz is already clever enough to detect cases where there is already a \0 in the correct position, or utilises the existing preallocated space remaining in a dynamic array, making it almost a no-op. The only case it actually does any work is a dynamic or static array which is full. In the former case the array is resized, and I'm not sure about the latter but I suspect it's more expensive.

So, it seems the cost of this is very low.
Jul 08 2011
On Fri, 08 Jul 2011 07:53:20 -0400, Regan Heath <regan netmail.co.nz> wrote:
> Sure, but in many (most?) cases it does. And in those cases where it doesn't you could argue ubyte* or byte* should have been used in the D extern "C" declaration instead.
>
> Plus, in those cases, worst case scenario, D passes an extra \0 byte to those functions which either ignore it because they were also passed a length, or expect a fixed sized structure, or .. I don't know what as I can't imagine another case where char* would be used without it being a "zero terminated string", or passing/knowing the length ahead of time.
>
> D is already allocating an extra \0 byte for string constants right? And, I assume, toStringz is already clever enough to detect cases where there is already a \0 in the correct position, or utilises the existing preallocated space remaining in a dynamic array, making it almost a no-op. The only case it actually does any work is a dynamic or static array which is full. In the former case the array is resized, and I'm not sure about the latter but I suspect it's more expensive.
>
> So, it seems the cost of this is very low.

What about a template function that does this automatically? I'm thinking something like opDispatch:

    extern(C) void foo(const(char)* c);

    struct CCall
    {
        auto opDispatch(string call, S...)(S args) if(call is a C function (can check this somehow?) )
        {
            /* determine which args of S are char[], and translate them to zero-terminated */
            ...
        }
    }

usage:

    string s;
    CCall.foo(s);

I personally think, barring this idea, the best path is simply to wrap C functions you want to call with toStringz'd versions.

-Steve
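One way the wrapper idea above might be sketched, for illustration only: instead of opDispatch it takes the C function as an alias parameter, which sidesteps the "is this a C function?" check. The names cCall and CArg are hypothetical, not part of Phobos, and the helper is an assumption about how such a wrapper could look rather than what was proposed verbatim.

```d
import core.stdc.string : strlen;
import std.meta : staticMap;
import std.string : toStringz;

// Maps a D argument type to the type handed to the C function:
// char slices become const(char)*, everything else passes through.
template CArg(T)
{
    static if (is(T : const(char)[]))
        alias CArg = const(char)*;
    else
        alias CArg = T;
}

// Hypothetical helper: calls fn, converting string arguments with toStringz.
auto cCall(alias fn, Args...)(Args args)
{
    staticMap!(CArg, Args) cargs;      // one converted slot per argument
    foreach (i, arg; args)             // unrolled at compile time
    {
        static if (is(Args[i] : const(char)[]))
            cargs[i] = toStringz(arg); // may allocate a terminated copy
        else
            cargs[i] = arg;
    }
    return fn(cargs);
}

void main()
{
    string s = "hello world"[0 .. 5];  // slice with no terminator of its own
    assert(cCall!strlen(s) == 5);
}
```

Because the conversion is spelled out at the call site (cCall!strlen rather than strlen), the possible allocation stays visible, which is the property Walter is asking for. The converted pointers refer to GC memory, so this only suits C functions that do not keep the pointer after returning.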
Jul 08 2011
This is kind of off-topic, and I don't know if it's just me, but I've barely been able to use toStringz() where it's supposed to be useful:

I tried using it with a C function whose parameters are not const(char)*, but just char*, but because it returns immutable(char)'s I had to write my own one.

I tried using it with a C function that's unicode, but it won't take wstrings as arguments... so I had to write my own one.

Maybe it's because I'm not really experienced with interfacing to C code from D, or maybe it's because I couldn't write the extern(C) code myself as I'm using someone else's C interface, but out of the 3 times I tried using it in the last day, it only helped once.

On 2011/07/08 15:48 PM, Steven Schveighoffer wrote:
> I personally think, barring this idea, the best path is simply to wrap C functions you want to call with toStringz'd versions.
>
> -Steve
Jul 08 2011
On 7/8/2011 11:03 PM, SimonM wrote:
> This is kind of off-topic, and I don't know if it's just me, but I've barely been able to use toStringz() where it's supposed to be useful:
>
> I tried using it with a C function whose parameters are not const(char)*, but just char*, but because it returns immutable(char)'s I had to write my own one.

    someCFunc(cast(char*)myString.toStringz());

> I tried using it with a C function that's unicode, but it won't take wstring's as arguments... so I had to write my own one.

    import std.utf;
    some_wchar_func(myString.toUTF16z());
    /* For non-const */
    some_wchar_func_2(cast(wchar*)myString.toUTF16z());
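A small self-contained check of the std.utf.toUTF16z call suggested above, assuming current Phobos. The loop only counts UTF-16 code units up to the terminator; no real wide-character C function is called.

```d
import std.utf : toUTF16z;

void main()
{
    string s = "héllo";             // 6 UTF-8 code units, 5 UTF-16 code units
    const(wchar)* w = toUTF16z(s);  // zero-terminated UTF-16 copy

    size_t n = 0;
    while (w[n] != 0)               // walk to the terminator, as C code would
        ++n;
    assert(n == 5);
}
```

Note that the casts in the post (cast(char*), cast(wchar*)) strip immutable or const, so they are only safe when the C function does not actually write through the pointer; for functions that do write, a mutable copy is the safer route.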
Jul 08 2011
On 2011-07-08 07:03, SimonM wrote:
> This is kind of off-topic, and I don't know if it's just me, but I've barely been able to use toStringz() where it's supposed to be useful:
>
> I tried using it with a C function whose parameters are not const(char)*, but just char*, but because it returns immutable(char)'s I had to write my own one.
>
> I tried using it with a C function that's unicode, but it won't take wstring's as arguments... so I had to write my own one.
>
> Maybe it's because I'm not really experienced with interfacing to C code from D, or maybe it's because I couldn't write the extern(C) code myself as I'm using someone else's C interface, but out of the 3 times I tried using it in the last day, it only helped once.
>
> On 2011/07/08 15:48 PM, Steven Schveighoffer wrote:
>> I personally think, barring this idea, the best path is simply to wrap C functions you want to call with toStringz'd versions.

https://github.com/D-Programming-Language/phobos/pull/123

- Jonathan M Davis
Jul 08 2011
On 7/8/2011 4:53 AM, Regan Heath wrote:
> Sure, but in many (most?) cases it does. And in those cases where it doesn't you could argue ubyte* or byte* should have been used in the D extern "C" declaration instead. Plus, in those cases, worst case scenario, D passes an extra \0 byte to those functions which either ignore it because they were also passed a length, or expect a fixed sized structure, or .. I don't know what as I can't imagine another case where char* would be used without it being a "zero terminated string", or passing/knowing the length ahead of time.

In the worst case, you're adding an extra memory allocation and function call overhead (that is hidden to the user, and not turn-off-able). This is not acceptable when interfacing to C.

> D is already allocating an extra \0 byte for string constants right?

Yes, but in a way that is essentially free.
Jul 08 2011
On Fri, 08 Jul 2011 18:59:47 +0100, Walter Bright <newshound2 digitalmars.com> wrote:
> In the worst case, you're adding an extra memory allocation and function call overhead (that is hidden to the user, and not turn-off-able). This is not acceptable when interfacing to C.

This worst case only happens when:

1. The extern "C" function takes a char* and is NOT expecting a "zero terminated string".
2. The char[], string, etc being passed is a fixed length array, or a slice which has no available space left for the \0.

So, it's rare. I would guess less than 1% of cases for general programming.

And, it *is* turn-off-able. You simply change the extern "C" to use ubyte*, byte*, or void* (instead of char*). This is arguably a better definition for this sort of function in the first place.

>> D is already allocating an extra \0 byte for string constants right?
>
> Yes, but in a way that is essentially free.

Yep, this is essentially free, and calling toStringz automatically would be almost as free, for 99% of cases. Plus it would "just work" which is a big deal when you're talking about first impressions etc.
Jul 12 2011
On Tue, 12 Jul 2011 09:54:15 -0400, Regan Heath <regan netmail.co.nz> wrote:
> This worst case only happens when:
>
> 1. The extern "C" function takes a char* and is NOT expecting a "zero terminated string".
> 2. The char[], string, etc being passed is a fixed length array, or a slice which has no available space left for the \0.
>
> So, it's rare. I would guess a less than 1% of cases for general programming.

What if the function is expecting to write to the buffer, and the compiler just made a copy of it? Won't that be pretty surprising?

-Steve
Jul 12 2011
On Tue, 12 Jul 2011 15:18:04 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:
> What if the function is expecting to write to the buffer, and the compiler just made a copy of it? Won't that be pretty surprising?

Assuming a C function in this form:

    void write_to_buffer(char *buffer, int length);

You might initially extern it as:

    extern "C" void write_to_buffer(char *buffer, int length);

And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

and in both cases, toStringz would do nothing as foo is zero terminated already (in both cases), or am I wrong about that?
Jul 12 2011
On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz> wrote:
> Assuming a C function in this form:
>
>     void write_to_buffer(char *buffer, int length);

No, assuming C function in this form:

    void ucase(char* str);

Essentially, a C function which takes a writable already-null-terminated string, and writes to it.

> You might initially extern it as:
>
>     extern "C" void write_to_buffer(char *buffer, int length);
>
> And, you could call it one of 2 ways (legitimately):
>
>     char[] foo = new char[100];
>     write_to_buffer(foo, foo.length);
>
> or:
>
>     char[100] foo;
>     write_to_buffer(foo, foo.length);
>
> and in both cases, toStringz would do nothing as foo is zero terminated already (in both cases), or am I wrong about that?

In neither case are they required to be null terminated. The only thing that guarantees null termination is a string literal. Even "abc".dup is not going to be guaranteed to be null terminated. For an actual example, try "012345678901234".dup. This should have a 0x0f right after the last character.

-Steve
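To make the "writable, already-null-terminated string" case concrete, here is a minimal sketch. The ucase below is a stand-in written in D so the example is self-contained (the thread only gives its C signature), and mutableZ is a hypothetical helper, not a Phobos function.

```d
import core.stdc.ctype : toupper;

// Stand-in for the C function from the post: upper-cases in place
// and relies on the caller having supplied a terminating \0.
extern(C) void ucase(char* str)
{
    for (; *str != '\0'; ++str)
        *str = cast(char) toupper(*str);
}

// Hypothetical helper: returns a mutable, zero-terminated copy of s.
char[] mutableZ(const(char)[] s)
{
    auto buf = new char[s.length + 1];
    buf[0 .. s.length] = s[];
    buf[s.length] = '\0';
    return buf;
}

void main()
{
    char[] foo = "hello".dup;   // no terminator is guaranteed after a .dup
    char[] z = mutableZ(foo);   // explicit, visible copy
    ucase(z.ptr);

    assert(z[0 .. $ - 1] == "HELLO");
    assert(foo == "hello");     // the original is untouched: only the copy was written
}
```

The second assert is the surprise Steve describes: if the copy were made implicitly by the compiler, the caller would expect foo to have been upper-cased and it would not be.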
Jul 12 2011
On Tue, 12 Jul 2011 10:59:58 -0400, Steven Schveighoffer <schveiguy yahoo.com> wrote:
> On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz> wrote:
>> and in both cases, toStringz would do nothing as foo is zero terminated already (in both cases), or am I wrong about that?
>
> In neither case are they required to be null terminated. The only thing that guarantees null termination is a string literal. Even "abc".dup is not going to be guaranteed to be null terminated. For an actual example, try "012345678901234".dup. This should have a 0x0f right after the last character.

And, actually, the cost penalty of checking if you are going to segfault (i.e. checking if the ptr is into heap data, and then getting the length) is quite costly. You must take the GC lock.

-Steve
Jul 12 2011
On Tue, 12 Jul 2011 16:04:15 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:
> And, actually, the cost penalty of checking if you are going to segfault (i.e. checking if the ptr is into heap data, and then getting the length) is quite costly. You must take the GC lock.

I wouldn't know anything about this. I was assuming when toStringz was called on a slice it would use the array capacity and length to figure out where the \0 needed to be, and do as little work as possible to achieve it. Meaning in most cases that \0 is written to 1 past the length, inside already allocated capacity.
Jul 12 2011
On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:
> No, assuming C function in this form:
>
>     void ucase(char* str);
>
> Essentially, a C function which takes a writable already-null-terminated string, and writes to it.

Ok, that's an even better example for my case. It would be used/called like...

    char[] foo;
    .. code which populates foo with something ..
    ucase(foo);

and in D today this would corrupt memory. Unless the programmer remembered to write:

    ucase(toStringz(foo));

So, +1 for compiler called toStringz.

I am assuming also that if this idea were implemented it would handle things intelligently, like for example if when toStringz is called the underlying array is out of room and needs to be reallocated, the compiler would update the slice/reference 'foo' in the same way as it already does for an append which triggers a reallocation.

>> and in both cases, toStringz would do nothing as foo is zero terminated already (in both cases), or am I wrong about that?
>
> In neither case are they required to be null terminated.

True, but I was outlining the worst case scenario for my suggestion, not describing the real C function requirements. In this particular case the extern "C" declaration (IMO) for this style of function should be one of:

    extern "C" void write_to_buffer(ubyte *buffer, int length);
    extern "C" void write_to_buffer(byte *buffer, int length);
    extern "C" void write_to_buffer(void *buffer, int length);

which would all be ignored by my suggestion.

> The only thing that guarantees null termination is a string literal.

string literals /and/ calling toStringz.

> Even "abc".dup is not going to be guaranteed to be null terminated. For an actual example, try "012345678901234".dup. This should have a 0x0f right after the last character.

Why 0x0f? Does the allocator initialise array memory to its offset from the start of the block or something?

I have just realised that char is initialised to 0xFF. That is a problem as my two examples above would be arrays full of 0xFF, not \0.. meaning toStringz would have to reallocate to append \0 to them, drat. That is yet another reason to use ubyte or byte when interfacing with C.

Ok, how about going the other way. Can we have something to decorate extern "C" function parameters to trigger an implicit call of toStringz on them?
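The two facts this post leans on can be checked with a short sketch: char defaults to 0xFF while ubyte defaults to 0, and a length-carrying buffer function can be declared with ubyte* so no terminator is involved at all. The write_to_buffer body here is a made-up stand-in (the thread only gives a signature), and extern(C) is the D spelling of the linkage attribute written as extern "C" in the posts.

```d
// Hypothetical stand-in for a C routine that fills a buffer of known length.
extern(C) void write_to_buffer(ubyte* buffer, int length)
{
    foreach (i; 0 .. length)
        buffer[i] = cast(ubyte) ('a' + i % 26);
}

void main()
{
    // Default initialisers: a fresh char array is full of 0xFF, not \0.
    static assert(char.init == 0xFF);
    static assert(ubyte.init == 0);

    ubyte[8] buf;
    // .ptr and an explicit length are passed; no zero terminator is needed.
    write_to_buffer(buf.ptr, cast(int) buf.length);
    assert(cast(const(char)[]) buf[] == "abcdefgh");
}
```

Declaring buffer-style parameters this way keeps them distinct from char* parameters that really do mean "zero terminated string", which is the distinction the proposal depends on.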
Jul 12 2011
On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan netmail.co.nz> wrote:
> Ok, that's an even better example for my case. It would be used/called like...
>
>     char[] foo;
>     .. code which populates foo with something ..
>     ucase(foo);
>
> and in D today this would corrupt memory. Unless the programmer remembered to write:

No, it wouldn't compile. char[] does not cast implicitly to char *. (if it does, that needs to change).

> I am assuming also that if this idea were implemented it would handle things intelligently, like for example if when toStringz is called the underlying array is out of room and needs to be reallocated, the compiler would update the slice/reference 'foo' in the same way as it already does for an append which triggers a reallocation.

OK, but what if it's like this:

    char[] foo = new char[100];
    auto bar = foo;
    ucase(foo);

In most cases, bar is also written to, but in some cases only foo is written to. Granted, we're getting further out on the hypothetical limb here :) But my point is, making it require explicit calling of toStringz instead of implicit makes the code less confusing, because you understand "oh, toStringz may reallocate, so I can't expect bar to also get updated" vs. simply calling a function with a buffer.

>> In neither case are they required to be null terminated.
>
> True, but I was outlining the worst case scenario for my suggestion, not describing the real C function requirements.

No, I mean you were wrong, D does not guarantee either of those (stack allocated or heap allocated) is null terminated. So toStringz must add a '\0' at the end (which is mildly expensive for heap data, and very expensive for stack data).

>> The only thing that guarantees null termination is a string literal.
>
> string literals /and/ calling toStringz.

>> Even "abc".dup is not going to be guaranteed to be null terminated. For an actual example, try "012345678901234".dup. This should have a 0x0f right after the last character.
>
> Why 0x0f? Does the allocator initialise array memory to it's offset from the start of the block or something?

The final byte of the block is used as the hidden array length (in this case 15).

-Steve
Jul 12 2011
On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:
> No, it wouldn't compile. char[] does not cast implicitly to char *. (if it does, that needs to change).

Replace foo with foo.ptr, it makes no difference to the point I was making.

> OK, but what if it's like this:
>
>     char[] foo = new char[100];
>     auto bar = foo;
>     ucase(foo);
>
> In most cases, bar is also written to, but in some cases only foo is written to. Granted, we're getting further out on the hypothetical limb here :) But my point is, making it require explicit calling of toStringz instead of implicit makes the code less confusing, because you understand "oh, toStringz may reallocate, so I can't expect bar to also get updated" vs. simply calling a function with a buffer.

This is not a 'new' problem introduced the idea, it's a general problem for D/arrays/slices and the same happens with an append, right? In which case it's not a reason against the idea.

> No, I mean you were wrong, D does not guarantee either of those (stack allocated or heap allocated) is null terminated. So toStringz must add a '\0' at the end (which is mildly expensive for heap data, and very expensive for stack data).

Ah, ok, this was because I had forgotten char is initialised to 0xFF. If it was initialised to \0 then both arrays would have been full of null terminators. The default value of char is the killing blow to the idea.

> The final byte of the block is used as the hidden array length (in this case 15).

Good to know.
Jul 12 2011
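The point about default initialisation is easy to check. What follows is a minimal sketch, not code from the thread: `hasTerminator` is a hypothetical helper, and `memset` from `core.stdc.string` stands in for the hypothetical `write_to_buffer`, since both take a pointer plus a length and so need no terminator.

```d
import core.stdc.string : memset;

// Hypothetical helper: true if the slice contains a '\0' anywhere.
bool hasTerminator(const(char)[] s)
{
    foreach (c; s)          // iterates code units, no UTF decoding
        if (c == '\0')
            return true;
    return false;
}

void main()
{
    // char.init is 0xFF, not '\0', so fresh char buffers carry no terminator.
    static assert(char.init == 0xFF);

    char[] heapBuf = new char[100];   // heap allocated, filled with 0xFF
    char[100] stackBuf;               // stack allocated, filled with 0xFF
    assert(!hasTerminator(heapBuf));
    assert(!hasTerminator(stackBuf[]));

    // A C function taking pointer + length does not need a terminator.
    // memset is only a stand-in for the hypothetical write_to_buffer.
    memset(heapBuf.ptr, 'x', heapBuf.length);
    assert(heapBuf[0] == 'x' && heapBuf[$ - 1] == 'x');
}
```

Note that the pointer is passed as `heapBuf.ptr`; as pointed out in the thread, a `char[]` does not convert implicitly to `char*`.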
Gah.. bad grammar.. 1/2 baked sentences..

On Tue, 12 Jul 2011 18:00:41 +0100, Regan Heath <regan netmail.co.nz> wrote:
> On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:
>> No, it wouldn't compile. char[] does not cast implicitly to char *. (if it does, that needs to change).
> Replace foo with foo.ptr, it makes no difference to the point I was making.

Which was that a new D user would pass foo.ptr rather than go looking for, and find, toStringz. We've had a number of cases on the learn NG in the past.

>> OK, but what if it's like this: char[] foo = new char[100]; auto bar = foo; ucase(foo); In most cases, bar is also written to, but in some cases only foo is written to. Granted, we're getting further out on the hypothetical limb here :) But my point is, making it require explicit calling of toStringz instead of implicit makes the code less confusing, because you understand "oh, toStringz may reallocate, so I can't expect bar to also get updated" vs. simply calling a function with a buffer.
> This is not a 'new' problem introduced the idea, it's a general problem for D/arrays/slices and the same happens with an append, right? In which case it's not a reason against the idea.

--> "introduced ^by the idea"
Jul 12 2011
On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz> wrote:On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:You fix does not help in that case, foo.ptr will be passed as a non-null terminated string. So, your proposal fixes the case: 1. The user tries to pass a string/char[] to a C function. Fails to compile. 2. Instead of trying to understand the issue, realizes the .ptr member is the right type, and switches to that. It does not fix or help with cases where: * a programmer notices the type of the parameter is char * and uses foo.ptr without trying foo first. (crash) * a programmer calls toStringz without going through the compile/fix cycle above. * a programmer tries to pass string/char[], fails to compile, then looks up how to interface with C and finds toStringz I think this fix really doesn't solve a very common problem.On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan netmail.co.nz> wrote:Replace foo with foo.ptr, it makes no difference to the point I was making.On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:No, it wouldn't compile. char[] does not cast implicitly to char *. (if it does, that needs to change).On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz> wrote:Ok, that's an even better example for my case. It would be used/called like... char[] foo; .. code which populates foo with something .. ucase(foo); and in D today this would corrupt memory. Unless the programmer remembered to write:No, assuming C function in this form: void ucase(char* str); Essentially, a C function which takes a writable already-null-terminated string, and writes to it.What if you expect the function is expecting to write to the buffer, and the compiler just made a copy of it? Won't that be pretty surprising?Assuming a C function in this form: void write_to_buffer(char *buffer, int length);It's new to the features of the C function being called. 
If you look up the man page for such a hypothetical function, it might claim that it alters the data passed in through the argument, but it seems to not be the case! So there's no way for someone (who arguably is not well versed in C functions if they didn't know to use toStringz) to figure out why the code seems not to do what it says it should. Such a programmer may blame either the implementation of the C function, or blame the D compiler for not calling the function properly.This is not a 'new' problem introduced the idea, it's a general problem for D/arrays/slices and the same happens with an append, right? In which case it's not a reason against the idea.I am assuming also that if this idea were implemented it would handle things intelligently, like for example if when toStringz is called the underlying array is out of room and needs to be reallocated, the compiler would update the slice/reference 'foo' in the same way as it already does for an append which triggers a reallocation.OK, but what if it's like this: char[] foo = new char[100]; auto bar = foo; ucase(foo); In most cases, bar is also written to, but in some cases only foo is written to. Granted, we're getting further out on the hypothetical limb here :) But my point is, making it require explicit calling of toStringz instead of implicit makes the code less confusing, because you understand "oh, toStringz may reallocate, so I can't expect bar to also get updated" vs. simply calling a function with a buffer.toStringz does not currently check for '\0' anywhere in the existing string. It simply appends '\0' to the end of the passed string. If you want it to check for '\0', how far should it go? Doesn't this also add to the overhead (looping over all chars looking for '\0')? 
Note also, that toStringz has old code that used to check for "one byte beyond" the array, but this is commented out, because it's unreliable (could cause a segfault).Ah, ok, this was because I had forgotten char is initialised to 0xFF. If it was initialised to \0 then both arrays would have been full of null terminators. The default value of char is the killing blow to the idea.No, I mean you were wrong, D does not guarantee either of those (stack allocated or heap allocated) is null terminated. So toStringz must add a '\0' at the end (which is mildly expensive for heap data, and very expensive for stack data).True, but I was outlining the worst case scenario for my suggestion, not describing the real C function requirements.You might initially extern it as: extern "C" void write_to_buffer(char *buffer, int length); And, you could call it one of 2 ways (legitimately): char[] foo = new char[100]; write_to_buffer(foo, foo.length); or: char[100] foo; write_to_buffer(foo, foo.length); and in both cases, toStringz would do nothing as foo is zero terminated already (in both cases), or am I wrong about that?In neither case are they required to be null terminated.Just for history trivia, it used to be there as an unallocated byte. Which means it likely had random data in it. It was there to prevent cross-block pointers. If the byte was part of the array, then it would be possible to do: arr1 = arr[$..$]; and now, arr1 points at the *next* block! arr1 ~= 5; and now, arr1 may have stomped over possibly unallocated data, or possibly some already allocated data! So it was a nice bonus that the byte I commandeered for storing the array length was already unused :) -SteveGood to know.The final byte of the block is used as the hidden array length (in this case 15).The only thing that guarantees null termination is a string literal.string literals /and/ calling toStringz.Even "abc".dup is not going to be guaranteed to be null terminated. 
For an actual example, try "012345678901234".dup. This should have a 0x0f right after the last character.
Why 0x0f? Does the allocator initialise array memory to its offset from the start of the block or something?
Jul 12 2011
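The "surprise" discussed above can be shown with a small sketch. The body of `ucase` here is an assumption (the thread only gives its C prototype), and `withTerminator` is a hypothetical helper that makes the copy explicit. It always allocates, which is the simple append-a-terminator behaviour described in this message, without claiming to match the Phobos implementation.

```d
import std.ascii : toUpper;

// Assumed stand-in for the C function `void ucase(char* str)`:
// upper-cases a zero-terminated string in place.
extern (C) void ucase(char* str)
{
    for (; *str != '\0'; ++str)
        *str = cast(char) toUpper(*str);
}

// Hypothetical helper: always returns a fresh, zero-terminated copy.
char[] withTerminator(const(char)[] s)
{
    auto z = new char[s.length + 1];
    z[0 .. s.length] = s[];
    z[$ - 1] = '\0';
    return z;
}

void main()
{
    char[] foo = "hello".dup;
    char[] bar = foo;              // second slice of the same memory

    char[] z = withTerminator(foo);
    ucase(z.ptr);                  // the C function writes to the copy

    assert(z[0 .. $ - 1] == "HELLO");
    assert(foo == "hello");        // the original is untouched
    assert(bar == "hello");        // and so is every other slice of it
}
```

Because the copy is spelled out at the call site, it is at least visible that `foo` and `bar` will not see the result; an implicit conversion would hide exactly this step.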
Ok, it's clear there has been some confusion over what exactly I am suggesting. I am not suggesting the compiler simply insert calls to the existing toStringz function as it appears the function does not, or cannot do what I am imagining. I am suggesting the compiler will perform a special operation on all char* parameters passed to extern "C" functions. The operation is a toStringz like operation which is (more or less) as follows: 1. If there is a \0 character inside foo[0..$], do nothing. 2. If the array allocated memory is > the array length, place a \0 at foo[$] 3. Reallocate the array memory, updating foo, place a \0 at foo[$] 4. Call the C function passing foo.ptr So, it will handle all the following cases: char[] foo; .. code to populate foo .. ucase(foo); ucase(foo.ptr); ucase(toStringz(foo)); The problem cases are the buffer cases I mentioned earlier, and they wouldn't be a problem if char was initialised to \0 as I first imagined. Other replies inline below.. On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz> wrote:No, see above.On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:You fix does not help in that case, foo.ptr will be passed as a non-null terminated string.On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan netmail.co.nz> wrote:Replace foo with foo.ptr, it makes no difference to the point I was making.On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:No, it wouldn't compile. char[] does not cast implicitly to char *. (if it does, that needs to change).On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz> wrote:Ok, that's an even better example for my case. It would be used/called like... char[] foo; .. code which populates foo with something .. ucase(foo); and in D today this would corrupt memory. 
Unless the programmer remembered to write:No, assuming C function in this form: void ucase(char* str); Essentially, a C function which takes a writable already-null-terminated string, and writes to it.What if you expect the function is expecting to write to the buffer, and the compiler just made a copy of it? Won't that be pretty surprising?Assuming a C function in this form: void write_to_buffer(char *buffer, int length);So, your proposal fixes the case: 1. The user tries to pass a string/char[] to a C function. Fails to compile. 2. Instead of trying to understand the issue, realizes the .ptr member is the right type, and switches to that. It does not fix or help with cases where: * a programmer notices the type of the parameter is char * and uses foo.ptr without trying foo first. (crash) * a programmer calls toStringz without going through the compile/fix cycle above. * a programmer tries to pass string/char[], fails to compile, then looks up how to interface with C and finds toStringz I think this fix really doesn't solve a very common problem.See above, my intention was to solve all the cases listed here as I suspect the compiler can detect them all, and just 'do the right thing'. In these cases.. 1. If the programmer writes foo.ptr, the compiler detects that, calls toStringz on 'foo' (not foo.ptr) and updates foo as required (if reallocation occurs). toStringz returns foo.ptr (I assume). 3. If the programmer passes 'foo', the compiler calls toStringz etc.None of this is relevant, let me explain.. My idea is for the compiler to detect a char* parameter to an extern "C" function and to call toStringz. When it does so it will correctly update the slice/array being passed if reallocation occurs. The C function will write to the slice/array being passed. 
So, it's not relevant if there was another slice referencing the array before it was reallocated, because that case is no different to calling a D function which does something similar, like appending to the passed slice/array. In short, the end result will ALWAYS be that the passed slice/array will contain the output of the C function. The goal is to make a call to an extern "C" function "just work" in the has it's own string type.It's new to the features of the C function being called. If you look up the man page for such a hypothetical function, it might claim that it alters the data passed in through the argument, but it seems to not be the case! So there's no way for someone (who arguably is not well versed in C functions if they didn't know to use toStringz) to figure out why the code seems not to do what it says it should. Such a programmer may blame either the implementation of the C function, or blame the D compiler for not calling the function properly.This is not a 'new' problem introduced the idea, it's a general problem for D/arrays/slices and the same happens with an append, right? In which case it's not a reason against the idea.I am assuming also that if this idea were implemented it would handle things intelligently, like for example if when toStringz is called the underlying array is out of room and needs to be reallocated, the compiler would update the slice/reference 'foo' in the same way as it already does for an append which triggers a reallocation.OK, but what if it's like this: char[] foo = new char[100]; auto bar = foo; ucase(foo); In most cases, bar is also written to, but in some cases only foo is written to. Granted, we're getting further out on the hypothetical limb here :) But my point is, making it require explicit calling of toStringz instead of implicit makes the code less confusing, because you understand "oh, toStringz may reallocate, so I can't expect bar to also get updated" vs. 
simply calling a function with a buffer.So, toStringz is not as clever as I imagined. I thought it would intelligently detect cases where a \0 was already present in the slice (from 0 to $) and if not, put one at $+1 (inside pre-allocated array memory). I was assuming toStringz had access to the underlying array allocation size and would know how far it can 'look' without causing a segfault. In the case where the slice length equaled the array reserved memory area, it would re-allocate and place the \0 at $+1 (inside the newly allocated memory). -- Using Opera's revolutionary email client: http://www.opera.com/mail/toStringz does not currently check for '\0' anywhere in the existing string. It simply appends '\0' to the end of the passed string. If you want it to check for '\0', how far should it go? Doesn't this also add to the overhead (looping over all chars looking for '\0')? Note also, that toStringz has old code that used to check for "one byte beyond" the array, but this is commented out, because it's unreliable (could cause a segfault).Ah, ok, this was because I had forgotten char is initialised to 0xFF. If it was initialised to \0 then both arrays would have been full of null terminators. The default value of char is the killing blow to the idea.No, I mean you were wrong, D does not guarantee either of those (stack allocated or heap allocated) is null terminated. 
So toStringz must add a '\0' at the end (which is mildly expensive for heap data, and very expensive for stack data).True, but I was outlining the worst case scenario for my suggestion, not describing the real C function requirements.You might initially extern it as: extern "C" void write_to_buffer(char *buffer, int length); And, you could call it one of 2 ways (legitimately): char[] foo = new char[100]; write_to_buffer(foo, foo.length); or: char[100] foo; write_to_buffer(foo, foo.length); and in both cases, toStringz would do nothing as foo is zero terminated already (in both cases), or am I wrong about that?In neither case are they required to be null terminated.
Jul 13 2011
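Steps 1 to 4 of the proposal can be sketched as an ordinary function. This is only an illustration under stated assumptions: `terminatedPtr` is a hypothetical name, it is not how `toStringz` works, and it leans on the runtime's normal append behaviour (extend in place when the block has spare capacity, reallocate otherwise) rather than on any special compiler support.

```d
import core.stdc.string : strlen;

// Hypothetical helper sketching the proposed steps 1-4.
char* terminatedPtr(ref char[] foo)
{
    // Step 1: O(n) scan for an existing '\0' inside foo[0 .. $].
    foreach (c; foo)
        if (c == '\0')
            return foo.ptr;

    // Steps 2 and 3: appending writes the '\0' in place when the runtime
    // reports spare capacity, and otherwise reallocates and updates foo.
    immutable len = foo.length;
    foo ~= '\0';
    char* p = foo.ptr;      // step 4: the pointer to hand to the C function
    foo = foo[0 .. len];    // keep the terminator just past the slice
    return p;
}

void main()
{
    char[] foo = "abc".dup;
    char* p = terminatedPtr(foo);

    assert(foo == "abc");       // the slice itself is unchanged in content
    assert(p == foo.ptr);       // foo was updated if a reallocation happened
    assert(strlen(p) == 3);     // C now sees a properly terminated string
}
```

The sketch only updates the slice passed by `ref`. Any other slice of the old block (the `bar` case raised earlier in the thread) keeps pointing at the old memory if a reallocation occurs, and a bare `char*` of unknown origin cannot be handled this way at all.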
On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan netmail.co.nz> wrote:I am suggesting the compiler will perform a special operation on all char* parameters passed to extern "C" functions. The operation is a toStringz like operation which is (more or less) as follows: 1. If there is a \0 character inside foo[0..$], do nothing.This is an O(n) operation -- too much overhead. Especially if you already know foo has a 0 in it. Note that toStringz does not have this overhead.2. If the array allocated memory is > the array length, place a \0 at foo[$]The check to see if the array has allocated length requires a GC lock, and O(lgn) search for the block info in the GC. Not that it doesn't already happen in toStringz, but I just want to point out that it's not a small cost.3. Reallocate the array memory, updating foo, place a \0 at foo[$] 4. Call the C function passing foo.ptr So, it will handle all the following cases: char[] foo; .. code to populate foo .. ucase(foo); ucase(foo.ptr);I read in your responses below, this is due to you making this equivalent to ucase(foo)? This still has the same problems I listed above. What about char * foo; .. code to populate foo .. ucase(foo); Is there still anything special done by the compiler?ucase(toStringz(foo)); The problem cases are the buffer cases I mentioned earlier, and they wouldn't be a problem if char was initialised to \0 as I first imagined.The largest problem I've had with all this is there is a necessary overhead of conversion. Not only that, but due to the way reallocation works, there may be a move of data. I think it's better to require explicit calls incurring such overhead vs. hiding the overhead calls from the developer. Especially if the overhead calls are unnecessary.Other replies inline below.. On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:How does your proposal know that a char * is part of a heap-allocated array? 
If you are assuming the only case where char * is passed will be arr.ptr, then that doesn't cut it. What if the compiler doesn't know where the char * came from? The inherent problem of zero-terminated strings is that you don't know how long it is until you search for a zero. If it's not properly terminated, then you are screwed. That problem cannot be "solved", even with compiler help -- you can get situations where there is no more information other than the pointer.On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz> wrote:No, see above.Replace foo with foo.ptr, it makes no difference to the point I was making.You fix does not help in that case, foo.ptr will be passed as a non-null terminated string.What if it's not foo.ptr? What if it's some random char * whose origin the compiler isn't aware of?So, your proposal fixes the case: 1. The user tries to pass a string/char[] to a C function. Fails to compile. 2. Instead of trying to understand the issue, realizes the .ptr member is the right type, and switches to that. It does not fix or help with cases where: * a programmer notices the type of the parameter is char * and uses foo.ptr without trying foo first. (crash) * a programmer calls toStringz without going through the compile/fix cycle above. * a programmer tries to pass string/char[], fails to compile, then looks up how to interface with C and finds toStringz I think this fix really doesn't solve a very common problem.See above, my intention was to solve all the cases listed here as I suspect the compiler can detect them all, and just 'do the right thing'. In these cases.. 1. If the programmer writes foo.ptr, the compiler detects that, calls toStringz on 'foo' (not foo.ptr) and updates foo as required (if reallocation occurs).toStringz returns foo.ptr (I assume).Huh? Why should it do anything with toStringz? I'm not getting this one, toStringz already has done the work your proposal wants to do.What about this case? 
char buffer[12]; buffer[] = "hello, world"; ucase(buffer[]); // does nothing to buffer! I'm saying, the charter of the function is to update a string in place, and your proposal is making that not true in some cases.None of this is relevant, let me explain.. My idea is for the compiler to detect a char* parameter to an extern "C" function and to call toStringz. When it does so it will correctly update the slice/array being passed if reallocation occurs. The C function will write to the slice/array being passed. So, it's not relevant if there was another slice referencing the array before it was reallocated, because that case is no different to calling a D function which does something similar, like appending to the passed slice/array.This is not a 'new' problem introduced the idea, it's a general problem for D/arrays/slices and the same happens with an append, right? In which case it's not a reason against the idea.It's new to the features of the C function being called. If you look up the man page for such a hypothetical function, it might claim that it alters the data passed in through the argument, but it seems to not be the case! So there's no way for someone (who arguably is not well versed in C functions if they didn't know to use toStringz) to figure out why the code seems not to do what it says it should. Such a programmer may blame either the implementation of the C function, or blame the D compiler for not calling the function properly.The goal is to make a call to an extern "C" function "just work" in the has it's own string type.a '0' at the end affects all references to that string, reallocation or not.s/clever/slow/ The only "intelligent" way to check for a 0 is a linear search. Without knowing where the data came from, there is no way to look past the slice without possibly calling a segfault. If you know it's a heap allocation, you can look at the block information to see if you can look past it. 
This might be possible to do for toStringz, but the linear check for 0 is just unacceptable for a simple function call. Appending a 0 is at least amortized. One thing though, it could make some smarter decisions as to whether to reallocate depending on the type of the array, since it is already doing a lookup of block info. But I still always come back to the fact that I should be able to circumvent some auto-intelligent decision that isn't aware of things that a developer can be aware of (such as knowing an array already contains a 0). The compiler shouldn't be too intrusive here. -StevetoStringz does not currently check for '\0' anywhere in the existing string. It simply appends '\0' to the end of the passed string. If you want it to check for '\0', how far should it go? Doesn't this also add to the overhead (looping over all chars looking for '\0')? Note also, that toStringz has old code that used to check for "one byte beyond" the array, but this is commented out, because it's unreliable (could cause a segfault).So, toStringz is not as clever as I imagined. I thought it would intelligently detect cases where a \0 was already present in the slice (from 0 to $) and if not, put one at $+1 (inside pre-allocated array memory). I was assuming toStringz had access to the underlying array allocation size and would know how far it can 'look' without causing a segfault. In the case where the slice length equaled the array reserved memory area, it would re-allocate and place the \0 at $+1 (inside the newly allocated memory).
Jul 13 2011
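The fixed-size buffer case above can be handled explicitly today. The sketch below is an assumption-laden illustration: the body of `ucase` is invented for the example, `ucaseInPlace` is a hypothetical wrapper, and the declaration uses D's `char[12] buffer` form rather than the C-style `char buffer[12]` quoted above. The buffer is completely full, so there is no room for a terminator and a temporary copy is unavoidable; the wrapper copies the result back so the "updates in place" contract still holds.

```d
import std.ascii : toUpper;

// Assumed stand-in for the C function `void ucase(char* str)`.
extern (C) void ucase(char* str)
{
    for (; *str != '\0'; ++str)
        *str = cast(char) toUpper(*str);
}

// Hypothetical wrapper: terminate a temporary copy, call the C function,
// then copy the result back into the caller's memory.
void ucaseInPlace(char[] s)
{
    auto tmp = new char[s.length + 1];
    tmp[0 .. s.length] = s[];
    tmp[$ - 1] = '\0';
    ucase(tmp.ptr);
    s[] = tmp[0 .. s.length];   // explicit copy back
}

void main()
{
    char[12] buffer = "hello, world";   // exactly 12 chars, no spare byte
    ucaseInPlace(buffer[]);
    assert(buffer[] == "HELLO, WORLD");
}
```

The extra allocation and two copies are the overhead being argued about; writing the wrapper by hand keeps that cost visible at the call site.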
On 2011-07-13 09:00, Steven Schveighoffer wrote:On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan netmail.co.nz> wrote:Andrej Mitrovic found a rather annoying issue (which is fortunately highly unlikely and therefore almost certainly rare) with toStringz and toUTFz with checking for a terminating '\0' one past the end of the string (which both functions do under some circumstances). You might want to have a look at it: https://github.com/D-Programming-Language/phobos/pull/123 Given what you know about the GC and arrays, your thoughts on the matter would be welcome. - Jonathan M DavisI am suggesting the compiler will perform a special operation on all char* parameters passed to extern "C" functions. The operation is a toStringz like operation which is (more or less) as follows: 1. If there is a \0 character inside foo[0..$], do nothing.This is an O(n) operation -- too much overhead. Especially if you already know foo has a 0 in it. Note that toStringz does not have this overhead.2. If the array allocated memory is > the array length, place a \0 at foo[$]The check to see if the array has allocated length requires a GC lock, and O(lgn) search for the block info in the GC. Not that it doesn't already happen in toStringz, but I just want to point out that it's not a small cost.3. Reallocate the array memory, updating foo, place a \0 at foo[$] 4. Call the C function passing foo.ptr So, it will handle all the following cases: char[] foo; .. code to populate foo .. ucase(foo); ucase(foo.ptr);I read in your responses below, this is due to you making this equivalent to ucase(foo)? This still has the same problems I listed above. What about char * foo; .. code to populate foo .. 
ucase(foo); Is there still anything special done by the compiler?ucase(toStringz(foo)); The problem cases are the buffer cases I mentioned earlier, and they wouldn't be a problem if char was initialised to \0 as I first imagined.The largest problem I've had with all this is there is a necessary overhead of conversion. Not only that, but due to the way reallocation works, there may be a move of data. I think it's better to require explicit calls incurring such overhead vs. hiding the overhead calls from the developer. Especially if the overhead calls are unnecessary.Other replies inline below.. On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:How does your proposal know that a char * is part of a heap-allocated array? If you are assuming the only case where char * is passed will be arr.ptr, then that doesn't cut it. What if the compiler doesn't know where the char * came from? The inherent problem of zero-terminated strings is that you don't know how long it is until you search for a zero. If it's not properly terminated, then you are screwed. That problem cannot be "solved", even with compiler help -- you can get situations where there is no more information other than the pointer.On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz> wrote:No, see above.Replace foo with foo.ptr, it makes no difference to the point I was making.You fix does not help in that case, foo.ptr will be passed as a non-null terminated string.What if it's not foo.ptr? What if it's some random char * whose origin the compiler isn't aware of?So, your proposal fixes the case: 1. The user tries to pass a string/char[] to a C function. Fails to compile. 2. Instead of trying to understand the issue, realizes the .ptr member is the right type, and switches to that. It does not fix or help with cases where: * a programmer notices the type of the parameter is char * and uses foo.ptr without trying foo first. 
(crash) * a programmer calls toStringz without going through the compile/fix cycle above. * a programmer tries to pass string/char[], fails to compile, then looks up how to interface with C and finds toStringz I think this fix really doesn't solve a very common problem.See above, my intention was to solve all the cases listed here as I suspect the compiler can detect them all, and just 'do the right thing'. In these cases.. 1. If the programmer writes foo.ptr, the compiler detects that, calls toStringz on 'foo' (not foo.ptr) and updates foo as required (if reallocation occurs).toStringz returns foo.ptr (I assume).Huh? Why should it do anything with toStringz? I'm not getting this one, toStringz already has done the work your proposal wants to do.What about this case? char buffer[12]; buffer[] = "hello, world"; ucase(buffer[]); // does nothing to buffer! I'm saying, the charter of the function is to update a string in place, and your proposal is making that not true in some cases.None of this is relevant, let me explain.. My idea is for the compiler to detect a char* parameter to an extern "C" function and to call toStringz. When it does so it will correctly update the slice/array being passed if reallocation occurs. The C function will write to the slice/array being passed. So, it's not relevant if there was another slice referencing the array before it was reallocated, because that case is no different to calling a D function which does something similar, like appending to the passed slice/array.This is not a 'new' problem introduced the idea, it's a general problem for D/arrays/slices and the same happens with an append, right? In which case it's not a reason against the idea.It's new to the features of the C function being called. If you look up the man page for such a hypothetical function, it might claim that it alters the data passed in through the argument, but it seems to not be the case! 
So there's no way for someone (who arguably is not well versed in C functions if they didn't know to use toStringz) to figure out why the code seems not to do what it says it should. Such a programmer may blame either the implementation of the C function, or blame the D compiler for not calling the function properly.The goal is to make a call to an extern "C" function "just work" in the has it's own string type.a '0' at the end affects all references to that string, reallocation or not.s/clever/slow/ The only "intelligent" way to check for a 0 is a linear search. Without knowing where the data came from, there is no way to look past the slice without possibly calling a segfault. If you know it's a heap allocation, you can look at the block information to see if you can look past it. This might be possible to do for toStringz, but the linear check for 0 is just unacceptable for a simple function call. Appending a 0 is at least amortized. One thing though, it could make some smarter decisions as to whether to reallocate depending on the type of the array, since it is already doing a lookup of block info. But I still always come back to the fact that I should be able to circumvent some auto-intelligent decision that isn't aware of things that a developer can be aware of (such as knowing an array already contains a 0). The compiler shouldn't be too intrusive here.toStringz does not currently check for '\0' anywhere in the existing string. It simply appends '\0' to the end of the passed string. If you want it to check for '\0', how far should it go? Doesn't this also add to the overhead (looping over all chars looking for '\0')? Note also, that toStringz has old code that used to check for "one byte beyond" the array, but this is commented out, because it's unreliable (could cause a segfault).So, toStringz is not as clever as I imagined. 
I thought it would intelligently detect cases where a \0 was already present in the slice (from 0 to $) and if not, put one at $+1 (inside pre-allocated array memory). I was assuming toStringz had access to the underlying array allocation size and would know how far it can 'look' without causing a segfault. In the case where the slice length equaled the array reserved memory area, it would re-allocate and place the \0 at $+1 (inside the newly allocated memory).
Jul 13 2011
On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:
On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan netmail.co.nz> wrote:

On 2nd thought, this step is unnecessary unless the array length matches the memory block length .. it was intended to detect an existing \0 and avoid the reallocation. But, this case is rare so this step could be skipped for the general case, or only carried out when the lengths match and reallocation is a possibility we want to avoid, or not if the cost is too high even for that.

I am suggesting the compiler will perform a special operation on all char* parameters passed to extern "C" functions. The operation is a toStringz like operation which is (more or less) as follows:
1. If there is a \0 character inside foo[0..$], do nothing.

This is an O(n) operation -- too much overhead. Especially if you already know foo has a 0 in it. Note that toStringz does not have this overhead.

This is the cost Walter mentioned earlier. Does this mean that heap allocated arrays do not know how much memory they have allocated? I was assuming they held that information, and that a slice to them would also know. How else does an array append operation know whether to reallocate? Does it have to obtain the GC lock and perform an O(lgn) search on every append?

2. If the array allocated memory is > the array length, place a \0 at foo[$]

The check to see if the array has allocated length requires a GC lock, and O(lgn) search for the block info in the GC. Not that it doesn't already happen in toStringz, but I just want to point out that it's not a small cost.

Problems above? You mean the cost? Yes, there is a cost to pay, but it's a cost which has to be paid (and is already paid by calling toStringz) to avoid corrupting memory whether it's done explicitly or implicitly. And the cost is only paid for extern "C" functions with char* parameters. In the rare case where the string already contains \0 and the programmer can guarantee that, we can have some way to indicate it, or in some cases changing the function parameter to ubyte* or byte* may be the correct solution.

3. Reallocate the array memory, updating foo, place a \0 at foo[$]
4. Call the C function passing foo.ptr

So, it will handle all the following cases:
    char[] foo;
    .. code to populate foo ..
    ucase(foo);
    ucase(foo.ptr);

I read in your responses below, this is due to you making this equivalent to ucase(foo)? This still has the same problems I listed above.

What about
    char * foo;
    .. code to populate foo ..
    ucase(foo);
Is there still anything special done by the compiler?

Assuming foo is allocated by the GC toStringz can still find the length of we can handle this case as well (for no extra cost than incurred by toStringz already).

But, the overhead is something we already pay calling toStringz explicitly, and the reallocation is no different to an append operation. Generally speaking I would normally agree that it's better to require explicit calls incurring overhead etc, but this specific case is something new D programmers stumble on all the time, and it makes D look less slick different for each, but if we can achieve something similar for no extra cost (other than we already pay calling toStringz explicitly), then it's well worth considering. As far as I can see the only problem cases are those where we incur more cost than toStringz when it's not required, and those cases seem rare to me, and could be handled by an opt-out decoration/keyword or similar.

    ucase(toStringz(foo));
The problem cases are the buffer cases I mentioned earlier, and they wouldn't be a problem if char was initialised to \0 as I first imagined.

The largest problem I've had with all this is there is a necessary overhead of conversion. Not only that, but due to the way reallocation works, there may be a move of data. I think it's better to require explicit calls incurring such overhead vs. hiding the overhead calls from the developer. Especially if the overhead calls are unnecessary.

See your Q and my A above ("char * foo" example).

Other replies inline below..

On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:

How does your proposal know that a char * is part of a heap-allocated array? If you are assuming the only case where char * is passed will be arr.ptr, then that doesn't cut it. What if the compiler doesn't know where the char * came from?

On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz> wrote:

No, see above.

Replace foo with foo.ptr, it makes no difference to the point I was making.

Your fix does not help in that case, foo.ptr will be passed as a non-null terminated string.

The inherent problem of zero-terminated strings is that you don't know how long it is until you search for a zero. If it's not properly terminated, then you are screwed. That problem cannot be "solved", even with compiler help -- you can get situations where there is no more information other than the pointer.

Really? But can't we obtain the GC lock and look them up, as mentioned above? And isn't this exactly what toStringz will do when the programmer first of all curses because it has crashed, and then adds an explicit toStringz call?

See above.

What if it's not foo.ptr? What if it's some random char * whose origin the compiler isn't aware of?

So, your proposal fixes the case:
1. The user tries to pass a string/char[] to a C function. Fails to compile.
2. Instead of trying to understand the issue, realizes the .ptr member is the right type, and switches to that.
It does not fix or help with cases where:
* a programmer notices the type of the parameter is char * and uses foo.ptr without trying foo first. (crash)
* a programmer calls toStringz without going through the compile/fix cycle above.
* a programmer tries to pass string/char[], fails to compile, then looks up how to interface with C and finds toStringz
I think this fix really doesn't solve a very common problem.

See above, my intention was to solve all the cases listed here as I suspect the compiler can detect them all, and just 'do the right thing'. In these cases..
1. If the programmer writes foo.ptr, the compiler detects that, calls toStringz on 'foo' (not foo.ptr) and updates foo as required (if reallocation occurs).

I was assuming the compiler could not detect the case where the programmer is explicitly calling toStringz i.e. what would be legacy code assuming this proposal came into effect.

toStringz returns foo.ptr (I assume).

Huh? Why should it do anything with toStringz? I'm not getting this one, toStringz already has done the work your proposal wants to do.

Sure, but how is that different to this:
    char buffer[12];
    buffer[] = "hello, world";
    ucase(buffer ~ "a"); // does nothing to buffer!
or in fact this:
    char buffer[12];
    buffer[] = "hello, world";
    ucase(cast(char*)toStringz(buffer)); // does nothing to buffer!
in both cases buffer remains unchanged.

What about this case?
    char buffer[12];
    buffer[] = "hello, world";
    ucase(buffer[]); // does nothing to buffer!
I'm saying, the charter of the function is to update a string in place, and your proposal is making that not true in some cases.

None of this is relevant, let me explain.. My idea is for the compiler to detect a char* parameter to an extern "C" function and to call toStringz. When it does so it will correctly update the slice/array being passed if reallocation occurs. The C function will write to the slice/array being passed. So, it's not relevant if there was another slice referencing the array before it was reallocated, because that case is no different to calling a D function which does something similar, like appending to the passed slice/array.

This is not a 'new' problem introduced by the idea, it's a general problem for D/arrays/slices and the same happens with an append, right? In which case it's not a reason against the idea.

It's new to the features of the C function being called. If you look up the man page for such a hypothetical function, it might claim that it alters the data passed in through the argument, but it seems to not be the case! So there's no way for someone (who arguably is not well versed in C functions if they didn't know to use toStringz) to figure out why the code seems not to do what it says it should. Such a programmer may blame either the implementation of the C function, or blame the D compiler for not calling the function properly.

\0 to the string. For all I know they're making a completely new copy, the goal is.

The goal is to make a call to an extern "C" function "just work" in the has its own string type.

adding a '0' at the end affects all references to that string, reallocation or not.

Fair enough.

s/clever/slow/ The only "intelligent" way to check for a 0 is a linear search.

toStringz does not currently check for '\0' anywhere in the existing string. It simply appends '\0' to the end of the passed string. If you want it to check for '\0', how far should it go? Doesn't this also add to the overhead (looping over all chars looking for '\0')? Note also, that toStringz has old code that used to check for "one byte beyond" the array, but this is commented out, because it's unreliable (could cause a segfault).

So, toStringz is not as clever as I imagined. I thought it would intelligently detect cases where a \0 was already present in the slice (from 0 to $) and if not, put one at $+1 (inside pre-allocated array memory). I was assuming toStringz had access to the underlying array allocation size and would know how far it can 'look' without causing a segfault. In the case where the slice length equaled the array reserved memory area, it would re-allocate and place the \0 at $+1 (inside the newly allocated memory).

Without knowing where the data came from, there is no way to look past the slice without possibly causing a segfault. If you know it's a heap allocation, you can look at the block information to see if you can look past it. This might be possible to do for toStringz, but the linear check for 0 is just unacceptable for a simple function call. Appending a 0 is at least amortized. One thing though, it could make some smarter decisions as to whether to reallocate depending on the type of the array, since it is already doing a lookup of block info.

Ok, scrap the linear search, or only perform it when a reallocation may be required.

But I still always come back to the fact that I should be able to circumvent some auto-intelligent decision that isn't aware of things that a developer can be aware of (such as knowing an array already contains a 0). The compiler shouldn't be too intrusive here.

Sure, we want to keep everyone happy, the Q is, to my mind, which is the more general case. It would be nice to have your cake and eat it too, or in other words for the general case (as I see it):
    char[] foo;
    .. code which populates foo ..
    ucase(foo);
to "just work" as a new D programmer might expect, at the same time I agree that cases where speed is of the essence, or the data is guaranteed to contain \0 we need to be able to avoid the cost. As most things it comes down to cost/benefit and I think D would benefit from this default behaviour, provided there is a way to avoid it as well. Perhaps restricting the idea to cases like the one above where the compiler has the information for the slice/array, and doing nothing for raw char* cases is a good compromise, it would allow people to avoid the behaviour just by adding .ptr or similar.

--
Using Opera's revolutionary email client: http://www.opera.com/mail/
Jul 13 2011
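The steps listed in the post above (minus the linear search for an existing \0, which Regan offers to drop) can be approximated in library code today. What follows is only a minimal sketch under stated assumptions: `ucase` is a hypothetical stand-in for the C function used as an example in the thread, implemented here in D with C linkage so the example is self-contained, and `zeroTerminate` is a hypothetical helper, not anything in Phobos or the compiler. It relies on an ordinary append, which reuses spare capacity when the runtime allows and reallocates otherwise.

```d
import core.stdc.ctype : toupper;

// Hypothetical stand-in for the thread's `ucase`: upper-cases a
// zero-terminated string in place.
extern (C) void ucase(char* s)
{
    for (; *s != '\0'; ++s)
        *s = cast(char) toupper(*s);
}

// Hypothetical helper: guarantees a '\0' directly after foo's last element.
// `ref` lets the caller's slice follow the data if the append reallocates.
char* zeroTerminate(ref char[] foo)
{
    foo ~= '\0';           // steps 2 and 3: in place or reallocating
    foo = foo[0 .. $ - 1]; // keep the terminator outside the slice
    return foo.ptr;        // step 4: the pointer handed to C
}

void main()
{
    char[] heap = "hello".dup;
    ucase(zeroTerminate(heap));
    assert(heap == "HELLO");

    // Steven's buffer case: a slice of stack memory cannot grow in place,
    // so the append copies and the C function writes to the copy.
    char[5] buffer = "hello";
    char[] slice = buffer[];
    ucase(zeroTerminate(slice));
    assert(slice == "HELLO");
    assert(buffer[] == "hello"); // the original buffer is untouched
}
```

The second half of `main` shows the behaviour both sides describe: the slice that was passed is updated, but any other view of the original memory, here the stack buffer itself, does not see the change.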
On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz> wrote:
On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:

Who said the char * points into GC memory? It could point at stack memory, or static data in ROM.

-Steve

How does your proposal know that a char * is part of a heap-allocated array? If you are assuming the only case where char * is passed will be arr.ptr, then that doesn't cut it. What if the compiler doesn't know where the char * came from?

See your Q and my A above ("char * foo" example).

The inherent problem of zero-terminated strings is that you don't know how long it is until you search for a zero. If it's not properly terminated, then you are screwed. That problem cannot be "solved", even with compiler help -- you can get situations where there is no more information other than the pointer.

Really? But can't we obtain the GC lock and look them up, as mentioned above? And isn't this exactly what toStringz will do when the programmer first of all curses because it has crashed, and then adds an explicit toStringz call?
Jul 13 2011
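Steven's objection can be checked directly with the runtime's own lookup. This is a small sketch, assuming only `GC.addrOf` from `core.memory`, which returns null for pointers that do not belong to a GC-managed block; `isGCPointer` is a hypothetical helper name introduced for illustration.

```d
import core.memory : GC;

// Hypothetical helper: true when p points into a block owned by the D GC.
bool isGCPointer(const(void)* p)
{
    return GC.addrOf(p) !is null;
}

void main()
{
    char[] heap = new char[](12);    // GC heap allocation
    char[12] onStack;                // stack memory
    string literal = "hello, world"; // static data

    assert(isGCPointer(heap.ptr));
    assert(!isGCPointer(onStack.ptr));
    assert(!isGCPointer(literal.ptr));
}
```

For the last two pointers the GC has no block information at all, so a lookup cannot say how far it is safe to read past the pointer.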
On Wed, 13 Jul 2011 19:31:42 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:
On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz> wrote:

Ok. What would toStringz do in this case? .. because that's what I'm proposing we do here. The goal here is to pick some low hanging fruit, the general case mentioned earlier, and make it work as a new D programmer would expect. In that case there is no technical difficulty implementing it (toStringz already exists), there is no extra cost (you already have to call toStringz), and the only disagreement seems to be whether it should be implicit or explicit. In this particular case I cannot see any harm in making it implicit. Yes, there are some edge cases, but they either already exist (as shown by the explicit toStringz example I gave where the passed char[] remained unchanged, and your example passing buffer[]), or they may be detectable by the compiler, or they are rare - in which case requiring some manual intervention is not too much to ask. So, on balance I reckon the implicit call would be "better" for more people more of the time, and at no extra cost. It seems like a win/win to me. Yes, there are edge cases, yes there are wrinkles to iron out, no it's not a "general/covers everything perfectly" kind of idea - which I agree we'd all prefer, but it makes D look slicker, and removes one more stumbling block for new D programmers.

--
Using Opera's revolutionary email client: http://www.opera.com/mail/

On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:

Who said the char * points into GC memory? It could point at stack memory, or static data in ROM.

How does your proposal know that a char * is part of a heap-allocated array? If you are assuming the only case where char * is passed will be arr.ptr, then that doesn't cut it. What if the compiler doesn't know where the char * came from?

See your Q and my A above ("char * foo" example).

The inherent problem of zero-terminated strings is that you don't know how long it is until you search for a zero. If it's not properly terminated, then you are screwed. That problem cannot be "solved", even with compiler help -- you can get situations where there is no more information other than the pointer.

Really? But can't we obtain the GC lock and look them up, as mentioned above? And isn't this exactly what toStringz will do when the programmer first of all curses because it has crashed, and then adds an explicit toStringz call?
Jul 14 2011
On Thu, 14 Jul 2011 05:53:47 -0400, Regan Heath <regan netmail.co.nz> wrote:
On Wed, 13 Jul 2011 19:31:42 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:

Nothing, you don't call toStringz on a char *, you call it on a string. The point is, for those who have already guaranteed a char * has a 0 in it, they should not have to have the compiler injecting useless code for a simple function call. A really really good example is if you use a char * you got from a C function to call another C function.

On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz> wrote:

Ok. What would toStringz do in this case? .. because that's what I'm proposing we do here.

On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:

Who said the char * points into GC memory? It could point at stack memory, or static data in ROM.

How does your proposal know that a char * is part of a heap-allocated array? If you are assuming the only case where char * is passed will be arr.ptr, then that doesn't cut it. What if the compiler doesn't know where the char * came from?

See your Q and my A above ("char * foo" example).

The inherent problem of zero-terminated strings is that you don't know how long it is until you search for a zero. If it's not properly terminated, then you are screwed. That problem cannot be "solved", even with compiler help -- you can get situations where there is no more information other than the pointer.

Really? But can't we obtain the GC lock and look them up, as mentioned above? And isn't this exactly what toStringz will do when the programmer first of all curses because it has crashed, and then adds an explicit toStringz call?

The goal here is to pick some low hanging fruit, the general case mentioned earlier, and make it work as a new D programmer would expect. In that case there is no technical difficulty implementing it (toStringz already exists), there is no extra cost (you already have to call toStringz), and the only disagreement seems to be whether it should be implicit or explicit.

There is an extra cost where you wouldn't have to call toStringz currently.

In this particular case I cannot see any harm in making it implicit. Yes, there are some edge cases, but they either already exist (as shown by the explicit toStringz example I gave where the passed char[] remained unchanged, and your example passing buffer[]), or they may be detectable by the compiler, or they are rare - in which case requiring some manual intervention is not too much to ask. So, on balance I reckon the implicit call would be "better" for more people more of the time, and at no extra cost. It seems like a win/win to me. Yes, there are edge cases, yes there are wrinkles to iron out, no it's not a "general/covers everything perfectly" kind of idea - which I agree we'd all prefer, but it makes D look slicker, and removes one more stumbling block for new D programmers.

We also have to weigh this against two things:
1. How will existing code (that already calls toStringz) be affected?
2. This is *not* a trivial compiler change. So all other options should be considered, there's a *lot* of C calls that exist from D today that could possibly be affected.

If C strings were their own type (and not conflated with "buffer pointer"), and verifying a C string was valid without segfaulting and in O(1) time, I'd agree that a compiler change would be warranted. There's just too many cases (note, these aren't the majority, but they are enough) where the injected calls will be either performance drags or unnecessary.
Jul 14 2011
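The case Steven calls a "really really good example", a char * obtained from one C function and handed to another, looks like this in practice. The sketch uses only `strchr` and `strlen` from `core.stdc.string`; the sample text is an arbitrary choice.

```d
import core.stdc.string : strchr, strlen;

void main()
{
    // D string literals are zero-terminated and convert implicitly to
    // const(char)*, so no conversion is needed to get started.
    const(char)* text = "hello, world";

    // strchr returns a pointer into the same zero-terminated string...
    const(char)* comma = strchr(text, ',');
    assert(comma !is null);

    // ...which can go straight to another C function.
    assert(strlen(comma) == 7); // ", world"
}
```

Nothing here needs toStringz, so an implicit conversion applied to every char * argument would add a copy and a lookup that this code has no use for.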
On Thu, 14 Jul 2011 12:30:24 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:
On Thu, 14 Jul 2011 05:53:47 -0400, Regan Heath <regan netmail.co.nz> wrote:

Good points all. So, the idea should be limited to cases where D's char[] and string are passed to extern "C" functions expecting char*, and should not affect cases where D's char* is passed directly. Sounds good.

On Wed, 13 Jul 2011 19:31:42 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:

Nothing, you don't call toStringz on a char *, you call it on a string. The point is, for those who have already guaranteed a char * has a 0 in it, they should not have to have the compiler injecting useless code for a simple function call. A really really good example is if you use a char * you got from a C function to call another C function.

On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz> wrote:

Ok. What would toStringz do in this case? .. because that's what I'm proposing we do here.

On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:

Who said the char * points into GC memory? It could point at stack memory, or static data in ROM.

How does your proposal know that a char * is part of a heap-allocated array? If you are assuming the only case where char * is passed will be arr.ptr, then that doesn't cut it. What if the compiler doesn't know where the char * came from?

See your Q and my A above ("char * foo" example).

The inherent problem of zero-terminated strings is that you don't know how long it is until you search for a zero. If it's not properly terminated, then you are screwed. That problem cannot be "solved", even with compiler help -- you can get situations where there is no more information other than the pointer.

Really? But can't we obtain the GC lock and look them up, as mentioned above? And isn't this exactly what toStringz will do when the programmer first of all curses because it has crashed, and then adds an explicit toStringz call?

The point I've tried to make all along is that this is a rare situation, and not the general case. In the general case you're going to need to call toStringz. Especially if you restrict this idea to D's char[] and string and not D's char* as mentioned above.

The goal here is to pick some low hanging fruit, the general case mentioned earlier, and make it work as a new D programmer would expect. In that case there is no technical difficulty implementing it (toStringz already exists), there is no extra cost (you already have to call toStringz), and the only disagreement seems to be whether it should be implicit or explicit.

There is an extra cost where you wouldn't have to call toStringz currently.

Assuming the above mentioned restriction (char[] and string, not char*)...

In this particular case I cannot see any harm in making it implicit. Yes, there are some edge cases, but they either already exist (as shown by the explicit toStringz example I gave where the passed char[] remained unchanged, and your example passing buffer[]), or they may be detectable by the compiler, or they are rare - in which case requiring some manual intervention is not too much to ask. So, on balance I reckon the implicit call would be "better" for more people more of the time, and at no extra cost. It seems like a win/win to me. Yes, there are edge cases, yes there are wrinkles to iron out, no it's not a "general/covers everything perfectly" kind of idea - which I agree we'd all prefer, but it makes D look slicker, and removes one more stumbling block for new D programmers.

We also have to weigh this against two things:

1. How will existing code (that already calls toStringz) be affected?

Not at all.

2. This is *not* a trivial compiler change. So all other options should be considered, there's a *lot* of C calls that exist from D today that could possibly be affected.

It will affect none of these.

If C strings were their own type (and not conflated with "buffer pointer"), and verifying a C string was valid without segfaulting and in O(1) time, I'd agree that a compiler change would be warranted. There's just too many cases (note, these aren't the majority, but they are enough) where the injected calls will be either performance drags or unnecessary.

I disagree about the number of cases being too many, but this is a gut feeling and I have no evidence to support it. I think with the restriction I mentioned above the situation changes however, as all those edge cases are unaffected, old code is unaffected and only new code will allow char[] and string to be passed as extern "C" char* parameters.

--
Using Opera's revolutionary email client: http://www.opera.com/mail/
Jul 14 2011
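The restricted form of the idea (convert char[] and string, leave char * alone) can be imitated without a compiler change by a pair of overloads. This is a rough sketch, not the proposal itself: `cstr` is a hypothetical name, and the caller still has to write it at each call site, which is exactly the explicit step the proposal wants to remove.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: slices (char[] and string) go through toStringz...
const(char)* cstr(const(char)[] s)
{
    return toStringz(s);
}

// ...while raw pointers pass through untouched, at no cost.
const(char)* cstr(const(char)* p)
{
    return p;
}

void main()
{
    string s = "hello, world";

    // A slice with no terminator of its own: toStringz supplies one.
    assert(strlen(cstr(s[0 .. 5])) == 5);

    // A pointer the caller already vouches for is returned as-is.
    const(char)* p = "hello";
    assert(cstr(p) is p);
}
```

The pointer overload corresponds to the opt-out Regan suggests at the end of the thread: passing `.ptr` or an existing char * skips the conversion entirely.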