digitalmars.D - toStringz or not toStringz

reply "Regan Heath" <regan netmail.co.nz> writes:
Sorry if this has been asked/answered before but I've been out of the loop  
for a while..

I was just thinking about the recent discussion on renaming toStringz and  
I wondered why we need to explicitly call it at all.  Why can't we have  
the compiler call it automatically whenever we pass a string, or char[] to  
an extern "C" function, where the parameter is defined as char*?

I believe some extern "C" functions are defined as taking ubyte* or byte*  
instead of char*, but in those cases I believe they are 'buffers' and have  
a supplied length as well, meaning there is no need for the trailing \0 in  
any case.

I am probably missing something obvious, but it seems like it might work.

Side note.. It bothers me a little that 'char' means utf-8 codepoint in D,  
and means unsigned byte in extern "C" definitions, but I can live with  
that.

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
Jul 08 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[] to an
extern
 "C" function, where the parameter is defined as char*?
Because char* in C does not necessarily mean "zero terminated string".
Jul 08 2011
parent reply "Regan Heath" <regan netmail.co.nz> writes:
On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[] to  
 an extern
 "C" function, where the parameter is defined as char*?
Because char* in C does not necessarily mean "zero terminated string".
Sure, but in many (most?) cases it does. And in those cases where it doesn't you could argue ubyte* or byte* should have been used in the D extern "C" declaration instead.

Plus, in those cases, worst case scenario, D passes an extra \0 byte to those functions which either ignore it because they were also passed a length, or expect a fixed sized structure, or .. I don't know what as I can't imagine another case where char* would be used without it being a "zero terminated string", or passing/knowing the length ahead of time.

D is already allocating an extra \0 byte for string constants right?

And, I assume, toStringz is already clever enough to detect cases where there is already a \0 in the correct position, or utilises the existing preallocated space remaining in a dynamic array, making it almost a no-op. The only case it actually does any work is a dynamic or static array which is full. In the former case the array is resized, and I'm not sure about the latter but I suspect it's more expensive.

So, it seems the cost of this is very low.

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
Jul 08 2011
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 08 Jul 2011 07:53:20 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
 <newshound2 digitalmars.com> wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[] to  
 an extern
 "C" function, where the parameter is defined as char*?
Because char* in C does not necessarily mean "zero terminated string".
Sure, but in many (most?) cases it does. And in those cases where it doesn't you could argue ubyte* or byte* should have been used in the D extern "C" declaration instead. Plus, in those cases, worst case scenario, D passes an extra \0 byte to those functions which either ignore it because they were also passed a length, or expect a fixed sized structure, or .. I don't know what as I can't imagine another case where char* would be used without it being a "zero terminated string", or passing/knowing the length ahead of time. D is already allocating an extra \0 byte for string constants right? And, I assume, toStringz is already clever enough to detect cases where there is already a \0 in the correct position, or utilises the existing preallocated space remaining in a dynamic array, making it almost a no-op. The only case it actually does any work is a dynamic or static array which is full. In the former case the array is resized, and I'm not sure about the latter but I suspect it's more expensive. So, it seems the cost of this is very low.
What about a template function that does this automatically? I'm thinking something like opDispatch:

extern(C) foo(const(char)* c);

struct CCall
{
    auto opDispatch(string call, S...)(S args) if(call is a C function (can check this somehow?) )
    {
        /* determine which args of S are char[], and translate them to zero-terminated */
        ...
    }
}

usage:

string s;
CCall.foo(s);

I personally think, barring this idea, the best path is simply to wrap C functions you want to call with toStringz'd versions.

-Steve
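
The fallback suggested at the end of this post (wrapping each C function in a D function that calls toStringz) could look roughly like the sketch below. It is only an illustration: strlen is the real C function from core.stdc.string, while the wrapper name cStrlen is made up for the example.

```d
import core.stdc.string : strlen;   // real C function: size_t strlen(const(char)*)
import std.string : toStringz;

// Hypothetical D-side wrapper: it accepts a D slice and performs the
// zero-termination explicitly, in one visible place.
size_t cStrlen(const(char)[] s)
{
    return strlen(s.toStringz());
}

void main()
{
    string s = "hello";
    assert(cStrlen(s) == 5);

    // A slice of a larger string has no '\0' directly after it,
    // so the terminated copy made by toStringz is what C sees.
    assert(cStrlen(s[0 .. 2]) == 2);
}
```

The cost of the copy stays visible at the wrapper, which is the trade-off being argued about in this thread.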
Jul 08 2011
parent reply SimonM <user example.net> writes:
This is kind of off-topic, and I don't know if it's just me, but I've 
barely been able to use toStringz() where it's supposed to be useful:

I tried using it with a C function whose parameters are not 
const(char)*, but just char*, but because it returns immutable(char)'s I 
had to write my own one.

I tried using it with a C function that's unicode, but it won't take 
wstring's as arguments... so I had to write my own one.

Maybe it's because I'm not really experienced with interfacing to C code 
from D, or maybe it's because I couldn't write the extern(C) code myself 
as I'm using someone else's C interface, but out of the 3 times I tried 
using it in the last day, it only helped once.

On 2011/07/08 15:48 PM, Steven Schveighoffer wrote:
 I personally think, barring this idea, the best path is simply to wrap C
 functions you want to call with toStringz'd versions.

 -Steve
Jul 08 2011
next sibling parent Mike Parker <aldacron gmail.com> writes:
On 7/8/2011 11:03 PM, SimonM wrote:
 This is kind of off-topic, and I don't know if it's just me, but I've
 barely been able to use toStringz() where it's supposed to be useful:

 I tried using it with a C function whose parameters are not
 const(char)*, but just char*, but because it returns immutable(char)'s I
 had to write my own one.
someCFunc(cast(char*)myString.toStringz());
 I tried using it with a C function that's unicode, but it won't take
 wstring's as arguments... so I had to write my own one.
import std.utf;
some_wchar_func(myString.toUTF16z());

/* For non-const */
some_wchar_func_2(cast(wchar*)myString.toUTF16z());
Jul 08 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-07-08 07:03, SimonM wrote:
 This is kind of off-topic, and I don't know if it's just me, but I've
 barely been able to use toStringz() where it's supposed to be useful:
 
 I tried using it with a C function whose parameters are not
 const(char)*, but just char*, but because it returns immutable(char)'s I
 had to write my own one.
 
 I tried using it with a C function that's unicode, but it won't take
 wstring's as arguments... so I had to write my own one.
 
 Maybe it's because I'm not really experienced with interfacing to C code
 from D, or maybe it's because I couldn't write the extern(C) code myself
 as I'm using someone else's C interface, but out of the 3 times I tried
 using it in the last day, it only helped once.
 
 On 2011/07/08 15:48 PM, Steven Schveighoffer wrote:
 I personally think, barring this idea, the best path is simply to wrap C
 functions you want to call with toStringz'd versions.
https://github.com/D-Programming-Language/phobos/pull/123

- Jonathan M Davis
Jul 08 2011
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 7/8/2011 4:53 AM, Regan Heath wrote:
 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright <newshound2 digitalmars.com>
 wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[] to an
extern
 "C" function, where the parameter is defined as char*?
Because char* in C does not necessarily mean "zero terminated string".
Sure, but in many (most?) cases it does. And in those cases where it doesn't you could argue ubyte* or byte* should have been used in the D extern "C" declaration instead. Plus, in those cases, worst case scenario, D passes an extra \0 byte to those functions which either ignore it because they were also passed a length, or expect a fixed sized structure, or .. I don't know what as I can't imagine another case where char* would be used without it being a "zero terminated string", or passing/knowing the length ahead of time.
In the worst case, you're adding an extra memory allocation and function call overhead (that is hidden to the user, and not turn-off-able). This is not acceptable when interfacing to C.
 D is already allocating an extra \0 byte for string constants right?
Yes, but in a way that is essentially free.
Jul 08 2011
parent reply "Regan Heath" <regan netmail.co.nz> writes:
On Fri, 08 Jul 2011 18:59:47 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 7/8/2011 4:53 AM, Regan Heath wrote:
 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
 <newshound2 digitalmars.com>
 wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[]  
 to an extern
 "C" function, where the parameter is defined as char*?
Because char* in C does not necessarily mean "zero terminated string".
Sure, but in many (most?) cases it does. And in those cases where it doesn't you could argue ubyte* or byte* should have been used in the D extern "C" declaration instead. Plus, in those cases, worst case scenario, D passes an extra \0 byte to those functions which either ignore it because they were also passed a length, or expect a fixed sized structure, or .. I don't know what as I can't imagine another case where char* would be used without it being a "zero terminated string", or passing/knowing the length ahead of time.
In the worst case, you're adding an extra memory allocation and function call overhead (that is hidden to the user, and not turn-off-able). This is not acceptable when interfacing to C.
This worst case only happens when:

1. The extern "C" function takes a char* and is NOT expecting a "zero terminated string".
2. The char[], string, etc being passed is a fixed length array, or a slice which has no available space left for the \0.

So, it's rare. I would guess less than 1% of cases for general programming.

And, it *is* turn-off-able. You simply change the extern "C" to use ubyte*, byte*, or void* (instead of char*). This is arguably a better definition for this sort of function in the first place.
 D is already allocating an extra \0 byte for string constants right?
Yes, but in a way that is essentially free.
Yep, this is essentially free, and calling toStringz automatically would be almost as free, for 99% of cases. Plus it would "just work" which is a big deal when you're talking about first impressions etc.

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
Jul 12 2011
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 12 Jul 2011 09:54:15 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Fri, 08 Jul 2011 18:59:47 +0100, Walter Bright  
 <newshound2 digitalmars.com> wrote:

 On 7/8/2011 4:53 AM, Regan Heath wrote:
 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
 <newshound2 digitalmars.com>
 wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[]  
 to an extern
 "C" function, where the parameter is defined as char*?
Because char* in C does not necessarily mean "zero terminated string".
Sure, but in many (most?) cases it does. And in those cases where it doesn't you could argue ubyte* or byte* should have been used in the D extern "C" declaration instead. Plus, in those cases, worst case scenario, D passes an extra \0 byte to those functions which either ignore it because they were also passed a length, or expect a fixed sized structure, or .. I don't know what as I can't imagine another case where char* would be used without it being a "zero terminated string", or passing/knowing the length ahead of time.
In the worst case, you're adding an extra memory allocation and function call overhead (that is hidden to the user, and not turn-off-able). This is not acceptable when interfacing to C.
This worst case only happens when: 1. The extern "C" function takes a char* and is NOT expecting a "zero terminated string". 2. The char[], string, etc being passed is a fixed length array, or a slice which has no available space left for the \0. So, it's rare. I would guess a less than 1% of cases for general programming.
What if you expect the function to write to the buffer, and the compiler just made a copy of it? Won't that be pretty surprising?

-Steve
Jul 12 2011
parent reply "Regan Heath" <regan netmail.co.nz> writes:
On Tue, 12 Jul 2011 15:18:04 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 09:54:15 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Fri, 08 Jul 2011 18:59:47 +0100, Walter Bright  
 <newshound2 digitalmars.com> wrote:

 On 7/8/2011 4:53 AM, Regan Heath wrote:
 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
 <newshound2 digitalmars.com>
 wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[]  
 to an extern
 "C" function, where the parameter is defined as char*?
Because char* in C does not necessarily mean "zero terminated string".
Sure, but in many (most?) cases it does. And in those cases where it doesn't you could argue ubyte* or byte* should have been used in the D extern "C" declaration instead. Plus, in those cases, worst case scenario, D passes an extra \0 byte to those functions which either ignore it because they were also passed a length, or expect a fixed sized structure, or .. I don't know what as I can't imagine another case where char* would be used without it being a "zero terminated string", or passing/knowing the length ahead of time.
In the worst case, you're adding an extra memory allocation and function call overhead (that is hidden to the user, and not turn-off-able). This is not acceptable when interfacing to C.
This worst case only happens when: 1. The extern "C" function takes a char* and is NOT expecting a "zero terminated string". 2. The char[], string, etc being passed is a fixed length array, or a slice which has no available space left for the \0. So, it's rare. I would guess a less than 1% of cases for general programming.
What if you expect the function is expecting to write to the buffer, and the compiler just made a copy of it? Won't that be pretty surprising?
Assuming a C function in this form:

    void write_to_buffer(char *buffer, int length);

You might initially extern it as:

    extern "C" void write_to_buffer(char *buffer, int length);

And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

and in both cases, toStringz would do nothing as foo is zero terminated already (in both cases), or am I wrong about that?

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
Jul 12 2011
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Tue, 12 Jul 2011 15:18:04 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 09:54:15 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Fri, 08 Jul 2011 18:59:47 +0100, Walter Bright  
 <newshound2 digitalmars.com> wrote:

 On 7/8/2011 4:53 AM, Regan Heath wrote:
 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
 <newshound2 digitalmars.com>
 wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or  
 char[] to an extern
 "C" function, where the parameter is defined as char*?
Because char* in C does not necessarily mean "zero terminated string".
Sure, but in many (most?) cases it does. And in those cases where it doesn't you could argue ubyte* or byte* should have been used in the D extern "C" declaration instead. Plus, in those cases, worst case scenario, D passes an extra \0 byte to those functions which either ignore it because they were also passed a length, or expect a fixed sized structure, or .. I don't know what as I can't imagine another case where char* would be used without it being a "zero terminated string", or passing/knowing the length ahead of time.
In the worst case, you're adding an extra memory allocation and function call overhead (that is hidden to the user, and not turn-off-able). This is not acceptable when interfacing to C.
This worst case only happens when: 1. The extern "C" function takes a char* and is NOT expecting a "zero terminated string". 2. The char[], string, etc being passed is a fixed length array, or a slice which has no available space left for the \0. So, it's rare. I would guess a less than 1% of cases for general programming.
What if you expect the function is expecting to write to the buffer, and the compiler just made a copy of it? Won't that be pretty surprising?
Assuming a C function in this form: void write_to_buffer(char *buffer, int length);
No, assuming C function in this form:

    void ucase(char* str);

Essentially, a C function which takes a writable already-null-terminated string, and writes to it.
 You might initially extern it as:

    extern "C" void write_to_buffer(char *buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero terminated  
 already (in both cases), or am I wrong about that?
In neither case are they required to be null terminated.

The only thing that guarantees null termination is a string literal. Even "abc".dup is not going to be guaranteed to be null terminated. For an actual example, try "012345678901234".dup. This should have a 0x0f right after the last character.

-Steve
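
As a rough illustration of the distinction drawn in this post, the sketch below checks the one guarantee D does give (a string literal is followed by a '\0' that is not counted in its length) and then terminates a run-time copy explicitly with toStringz. It deliberately does not read past the end of the .dup'ed array, since what sits there is an implementation detail.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    // A string literal is stored with a trailing '\0' that .length does not count.
    string lit = "abc";
    assert(lit.length == 3);
    assert(strlen(lit.ptr) == 3);

    // A run-time copy carries no such guarantee, so terminate it explicitly.
    char[] copy = "012345678901234".dup;
    const(char)* z = copy.toStringz();
    assert(strlen(z) == copy.length);
    assert(z[copy.length] == '\0');
}
```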
Jul 12 2011
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 12 Jul 2011 10:59:58 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 and in both cases, toStringz would do nothing as foo is zero terminated  
 already (in both cases), or am I wrong about that?
In neither case are they required to be null terminated. The only thing that guarantees null termination is a string literal. Even "abc".dup is not going to be guaranteed to be null terminated. For an actual example, try "012345678901234".dup. This should have a 0x0f right after the last character.
And, actually, the cost penalty of checking if you are going to segfault (i.e. checking if the ptr is into heap data, and then getting the length) is quite costly. You must take the GC lock. -Steve
Jul 12 2011
parent "Regan Heath" <regan netmail.co.nz> writes:
On Tue, 12 Jul 2011 16:04:15 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:59:58 -0400, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 and in both cases, toStringz would do nothing as foo is zero  
 terminated already (in both cases), or am I wrong about that?
In neither case are they required to be null terminated. The only thing that guarantees null termination is a string literal. Even "abc".dup is not going to be guaranteed to be null terminated. For an actual example, try "012345678901234".dup. This should have a 0x0f right after the last character.
And, actually, the cost penalty of checking if you are going to segfault (i.e. checking if the ptr is into heap data, and then getting the length) is quite costly. You must take the GC lock.
I wouldn't know anything about this. I was assuming when toStringz was called on a slice it would use the array capacity and length to figure out where the \0 needed to be, and do as little work as possible to achieve it. Meaning in most cases that \0 is written to 1 past the length, inside already allocated capacity.

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
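
For reference, the behaviour assumed in this post can be imitated by hand with the built-in capacity property and an append. This is only a sketch of the assumption, not a description of what toStringz actually does, and the capacity query is itself the kind of GC lookup whose cost was raised earlier in the thread.

```d
void main()
{
    char[] foo = "hello".dup;          // GC-allocated, so it may have spare capacity
    const before = foo.ptr;
    const hadRoom = foo.capacity > foo.length;

    foo ~= '\0';                       // in place when there was room, otherwise reallocates
    if (hadRoom)
        assert(foo.ptr is before);     // no move when the block had space left

    char[] text = foo[0 .. $ - 1];     // the logical string, without the terminator
    assert(text == "hello");
    assert(foo[$ - 1] == '\0');
}
```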
Jul 12 2011
prev sibling parent reply "Regan Heath" <regan netmail.co.nz> writes:
On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:
 What if you expect the function is expecting to write to the buffer,  
 and the compiler just made a copy of it?  Won't that be pretty  
 surprising?
Assuming a C function in this form: void write_to_buffer(char *buffer, int length);
No, assuming C function in this form: void ucase(char* str); Essentially, a C function which takes a writable already-null-terminated string, and writes to it.
Ok, that's an even better example for my case. It would be used/called like...

    char[] foo;
    .. code which populates foo with something ..
    ucase(foo);

and in D today this would corrupt memory. Unless the programmer remembered to write:

    ucase(toStringz(foo));

So, +1 for compiler called toStringz.

I am assuming also that if this idea were implemented it would handle things intelligently, like for example if when toStringz is called the underlying array is out of room and needs to be reallocated, the compiler would update the slice/reference 'foo' in the same way as it already does for an append which triggers a reallocation.
 You might initially extern it as:

    extern "C" void write_to_buffer(char *buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero terminated  
 already (in both cases), or am I wrong about that?
In neither case are they required to be null terminated.
True, but I was outlining the worst case scenario for my suggestion, not describing the real C function requirements.

In this particular case the extern "C" declaration (IMO) for this style of function should be one of:

    extern "C" void write_to_buffer(ubyte *buffer, int length);
    extern "C" void write_to_buffer(byte *buffer, int length);
    extern "C" void write_to_buffer(void *buffer, int length);

which would all be ignored by my suggestion.
 The only thing that guarantees null termination is a string literal.
string literals /and/ calling toStringz.
 Even "abc".dup is not going to be guaranteed to be null terminated.  For  
 an actual example, try "012345678901234".dup.  This should have a 0x0f  
 right after the last character.
Why 0x0f? Does the allocator initialise array memory to its offset from the start of the block or something?

I have just realised that char is initialised to 0xFF. That is a problem as my two examples above would be arrays full of 0xFF, not \0.. meaning toStringz would have to reallocate to append \0 to them, drat. That is yet another reason to use ubyte or byte when interfacing with C.

Ok, how about going the other way. Can we have something to decorate extern "C" function parameters to trigger an implicit call of toStringz on them?

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
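
The default-initialisation point made in this post is easy to check. The following short sketch shows that both a static and a dynamic char array start out filled with 0xFF, whereas a ubyte array starts out zeroed:

```d
void main()
{
    static assert(char.init == 0xFF);    // the default char value is not '\0'

    char[100] onStack;                   // default-initialised static array
    auto onHeap = new char[100];         // default-initialised dynamic array
    foreach (c; onStack) assert(c == 0xFF);
    foreach (c; onHeap)  assert(c == 0xFF);

    ubyte[100] raw;                      // ubyte.init is 0
    foreach (b; raw) assert(b == 0);
}
```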
Jul 12 2011
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:
 What if you expect the function is expecting to write to the buffer,  
 and the compiler just made a copy of it?  Won't that be pretty  
 surprising?
Assuming a C function in this form: void write_to_buffer(char *buffer, int length);
No, assuming C function in this form: void ucase(char* str); Essentially, a C function which takes a writable already-null-terminated string, and writes to it.
Ok, that's an even better example for my case. It would be used/called like... char[] foo; .. code which populates foo with something .. ucase(foo); and in D today this would corrupt memory. Unless the programmer remembered to write:
No, it wouldn't compile. char[] does not cast implicitly to char *. (if it does, that needs to change).
 I am assuming also that if this idea were implemented it would handle  
 things intelligently, like for example if when toStringz is called the  
 underlying array is out of room and needs to be reallocated, the  
 compiler would update the slice/reference 'foo' in the same way as it  
 already does for an append which triggers a reallocation.
OK, but what if it's like this:

char[] foo = new char[100];
auto bar = foo;

ucase(foo);

In most cases, bar is also written to, but in some cases only foo is written to.

Granted, we're getting further out on the hypothetical limb here :) But my point is, making it require explicit calling of toStringz instead of implicit makes the code less confusing, because you understand "oh, toStringz may reallocate, so I can't expect bar to also get updated" vs. simply calling a function with a buffer.
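
A small sketch of the surprise described here, under stated assumptions: ucase is written in D as a stand-in for the C function under discussion, and the terminated copy is made with binary ~ (which always builds a new array) rather than toStringz, because toStringz returns an immutable pointer that should not be written through.

```d
import core.stdc.ctype : toupper;

// Stand-in for the C function under discussion, written in D for this sketch.
extern (C) void ucase(char* str)
{
    for (; *str != '\0'; ++str)
        *str = cast(char) toupper(*str);
}

void main()
{
    char[] foo = "hello".dup;
    char[] bar = foo;                  // second slice over the same memory

    char[] tmp = foo ~ '\0';           // binary ~ always allocates a new array
    ucase(tmp.ptr);                    // the callee writes to the copy only

    assert(tmp[0 .. $ - 1] == "HELLO");
    assert(foo == "hello");            // the original buffer was never touched
    assert(bar == "hello");
}
```

When the copy is spelled out like this, the caller can see that foo and bar are left alone; a copy made implicitly by the compiler would hide that.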
 You might initially extern it as:

    extern "C" void write_to_buffer(char *buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero  
 terminated already (in both cases), or am I wrong about that?
In neither case are they required to be null terminated.
True, but I was outlining the worst case scenario for my suggestion, not describing the real C function requirements.
No, I mean you were wrong, D does not guarantee either of those (stack allocated or heap allocated) is null terminated. So toStringz must add a '\0' at the end (which is mildly expensive for heap data, and very expensive for stack data).
 The only thing that guarantees null termination is a string literal.
string literals /and/ calling toStringz.
 Even "abc".dup is not going to be guaranteed to be null terminated.   
 For an actual example, try "012345678901234".dup.  This should have a  
 0x0f right after the last character.
Why 0x0f? Does the allocator initialise array memory to it's offset from the start of the block or something?
The final byte of the block is used as the hidden array length (in this case 15). -Steve
Jul 12 2011
parent reply "Regan Heath" <regan netmail.co.nz> writes:
On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:
 What if you expect the function is expecting to write to the buffer,  
 and the compiler just made a copy of it?  Won't that be pretty  
 surprising?
Assuming a C function in this form: void write_to_buffer(char *buffer, int length);
No, assuming C function in this form: void ucase(char* str); Essentially, a C function which takes a writable already-null-terminated string, and writes to it.
Ok, that's an even better example for my case. It would be used/called like... char[] foo; .. code which populates foo with something .. ucase(foo); and in D today this would corrupt memory. Unless the programmer remembered to write:
No, it wouldn't compile. char[] does not cast implicitly to char *. (if it does, that needs to change).
Replace foo with foo.ptr, it makes no difference to the point I was making.
 I am assuming also that if this idea were implemented it would handle  
 things intelligently, like for example if when toStringz is called the  
 underlying array is out of room and needs to be reallocated, the  
 compiler would update the slice/reference 'foo' in the same way as it  
 already does for an append which triggers a reallocation.
OK, but what if it's like this: char[] foo = new char[100]; auto bar = foo; ucase(foo); In most cases, bar is also written to, but in some cases only foo is written to. Granted, we're getting further out on the hypothetical limb here :) But my point is, making it require explicit calling of toStringz instead of implicit makes the code less confusing, because you understand "oh, toStringz may reallocate, so I can't expect bar to also get updated" vs. simply calling a function with a buffer.
This is not a 'new' problem introduced the idea, it's a general problem for D/arrays/slices and the same happens with an append, right? In which case it's not a reason against the idea.
 You might initially extern it as:

    extern "C" void write_to_buffer(char *buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero  
 terminated already (in both cases), or am I wrong about that?
In neither case are they required to be null terminated.
True, but I was outlining the worst case scenario for my suggestion, not describing the real C function requirements.
No, I mean you were wrong, D does not guarantee either of those (stack allocated or heap allocated) is null terminated. So toStringz must add a '\0' at the end (which is mildly expensive for heap data, and very expensive for stack data).
Ah, ok, this was because I had forgotten char is initialised to 0xFF. If it was initialised to \0 then both arrays would have been full of null terminators. The default value of char is the killing blow to the idea.
 The only thing that guarantees null termination is a string literal.
string literals /and/ calling toStringz.
 Even "abc".dup is not going to be guaranteed to be null terminated.   
 For an actual example, try "012345678901234".dup.  This should have a  
 0x0f right after the last character.
Why 0x0f? Does the allocator initialise array memory to it's offset from the start of the block or something?
The final byte of the block is used as the hidden array length (in this case 15).
Good to know.

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
Jul 12 2011
next sibling parent "Regan Heath" <regan netmail.co.nz> writes:
Gah.. bad grammar.. 1/2 baked sentences..

On Tue, 12 Jul 2011 18:00:41 +0100, Regan Heath <regan netmail.co.nz>  
wrote:
 On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 No, it wouldn't compile.  char[] does not cast implicitly to char *.   
 (if it does, that needs to change).
Replace foo with foo.ptr, it makes no difference to the point I was making.
Which was that a new D user would pass foo.ptr rather than go looking for, and find toStringz. We've had a number of cases on the learn NG in the past.
 OK, but what if it's like this:

 char[] foo = new char[100];
 auto bar = foo;

 ucase(foo);

 In most cases, bar is also written to, but in some cases only foo is  
 written to.

 Granted, we're getting further out on the hypothetical limb here :)   
 But my point is, making it require explicit calling of toStringz  
 instead of implicit makes the code less confusing, because you  
 understand "oh, toStringz may reallocate, so I can't expect bar to also  
 get updated" vs. simply calling a function with a buffer.
This is not a 'new' problem introduced the idea, it's a general problem
--> ^by
 for D/arrays/slices and the same happens with an append, right?  In  
 which case it's not a reason against the idea.
Jul 12 2011
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:
 What if you expect the function is expecting to write to the  
 buffer, and the compiler just made a copy of it?  Won't that be  
 pretty surprising?
Assuming a C function in this form: void write_to_buffer(char *buffer, int length);
No, assuming C function in this form: void ucase(char* str); Essentially, a C function which takes a writable already-null-terminated string, and writes to it.
Ok, that's an even better example for my case. It would be used/called like... char[] foo; .. code which populates foo with something .. ucase(foo); and in D today this would corrupt memory. Unless the programmer remembered to write:
No, it wouldn't compile. char[] does not cast implicitly to char *. (if it does, that needs to change).
Replace foo with foo.ptr, it makes no difference to the point I was making.
Your fix does not help in that case, foo.ptr will be passed as a non-null terminated string.

So, your proposal fixes the case:

1. The user tries to pass a string/char[] to a C function. Fails to compile.
2. Instead of trying to understand the issue, realizes the .ptr member is the right type, and switches to that.

It does not fix or help with cases where:

  * a programmer notices the type of the parameter is char * and uses foo.ptr without trying foo first. (crash)
  * a programmer calls toStringz without going through the compile/fix cycle above.
  * a programmer tries to pass string/char[], fails to compile, then looks up how to interface with C and finds toStringz

I think this fix really doesn't solve a very common problem.
 I am assuming also that if this idea were implemented it would handle  
 things intelligently, like for example if when toStringz is called the  
 underlying array is out of room and needs to be reallocated, the  
 compiler would update the slice/reference 'foo' in the same way as it  
 already does for an append which triggers a reallocation.
OK, but what if it's like this: char[] foo = new char[100]; auto bar = foo; ucase(foo); In most cases, bar is also written to, but in some cases only foo is written to. Granted, we're getting further out on the hypothetical limb here :) But my point is, making it require explicit calling of toStringz instead of implicit makes the code less confusing, because you understand "oh, toStringz may reallocate, so I can't expect bar to also get updated" vs. simply calling a function with a buffer.
This is not a 'new' problem introduced by the idea, it's a general problem for D/arrays/slices and the same happens with an append, right? In which case it's not a reason against the idea.
It's new to the features of the C function being called. If you look up the man page for such a hypothetical function, it might claim that it alters the data passed in through the argument, but it seems to not be the case! So there's no way for someone (who arguably is not well versed in C functions if they didn't know to use toStringz) to figure out why the code seems not to do what it says it should. Such a programmer may blame either the implementation of the C function, or blame the D compiler for not calling the function properly.
 You might initially extern it as:

    extern "C" void write_to_buffer(char *buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero  
 terminated already (in both cases), or am I wrong about that?
In neither case are they required to be null terminated.
True, but I was outlining the worst case scenario for my suggestion, not describing the real C function requirements.
No, I mean you were wrong, D does not guarantee either of those (stack allocated or heap allocated) is null terminated. So toStringz must add a '\0' at the end (which is mildly expensive for heap data, and very expensive for stack data).
Ah, ok, this was because I had forgotten char is initialised to 0xFF. If it was initialised to \0 then both arrays would have been full of null terminators. The default value of char is the killing blow to the idea.
toStringz does not currently check for '\0' anywhere in the existing string. It simply appends '\0' to the end of the passed string. If you want it to check for '\0', how far should it go? Doesn't this also add to the overhead (looping over all chars looking for '\0')?

Note also, that toStringz has old code that used to check for "one byte beyond" the array, but this is commented out, because it's unreliable (could cause a segfault).
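For illustration, here is a minimal sketch of the explicit call being discussed: a D slice handed to a C function through toStringz. The wrapper name `cLength` is hypothetical, and `strlen` merely stands in for any C function that expects a zero-terminated string.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical wrapper: passes a D slice to a C function that expects
// a zero-terminated string.
size_t cLength(const(char)[] s)
{
    // toStringz returns a pointer to zero-terminated data, copying the
    // slice when that is needed to add the terminator.
    return strlen(s.toStringz);
}
```

The cost of the conversion is visible at the call site, which is the property the explicit form is meant to preserve.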
 The only thing that guarantees null termination is a string literal.
string literals /and/ calling toStringz.
 Even "abc".dup is not going to be guaranteed to be null terminated.   
 For an actual example, try "012345678901234".dup.  This should have a  
 0x0f right after the last character.
Why 0x0f? Does the allocator initialise array memory to its offset from the start of the block or something?
The final byte of the block is used as the hidden array length (in this case 15).
Good to know.
Just for history trivia, it used to be there as an unallocated byte. Which means it likely had random data in it. It was there to prevent cross-block pointers. If the byte was part of the array, then it would be possible to do:

arr1 = arr[$..$];

and now, arr1 points at the *next* block!

arr1 ~= 5;

and now, arr1 may have stomped over possibly unallocated data, or possibly some already allocated data!

So it was a nice bonus that the byte I commandeered for storing the array length was already unused :)

-Steve
Jul 12 2011
parent reply "Regan Heath" <regan netmail.co.nz> writes:
Ok, it's clear there has been some confusion over what exactly I am  
suggesting.

I am not suggesting the compiler simply insert calls to the existing  
toStringz function as it appears the function does not, or cannot do what  
I am imagining.

I am suggesting the compiler will perform a special operation on all char*  
parameters passed to extern "C" functions.

The operation is a toStringz like operation which is (more or less) as  
follows:

1. If there is a \0 character inside foo[0..$], do nothing.
2. If the array allocated memory is > the array length, place a \0 at  
foo[$]
3. Reallocate the array memory, updating foo, place a \0 at foo[$]
4. Call the C function passing foo.ptr
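The four steps above could be approximated in library code roughly as follows. This is only a sketch under stated assumptions: the helper name `terminateInPlace` is hypothetical (it is not part of Phobos), it assumes a GC-managed char[], and it relies on ordinary append semantics rather than any compiler support.

```d
// Hypothetical helper (not part of Phobos) approximating steps 1-4 for a
// GC-managed char[]; the name and signature are assumptions.
char* terminateInPlace(ref char[] foo)
{
    // Step 1: an existing terminator inside the slice means nothing to do.
    foreach (c; foo)
        if (c == '\0')
            return foo.ptr;

    // Steps 2 and 3: appending extends in place when the runtime allows it
    // and otherwise reallocates; foo is updated because it is passed by ref.
    foo ~= '\0';

    // Hide the terminator from D code; it stays in memory at foo.ptr[foo.length].
    foo = foo[0 .. $ - 1];

    // Step 4: this pointer is what would be handed to the C function.
    return foo.ptr;
}
```

Note that step 1 is a linear scan, and any other slice that referred to the old memory is left behind if the append reallocates.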

So, it will handle all the following cases:

char[] foo;
.. code to populate foo ..

ucase(foo);
ucase(foo.ptr);
ucase(toStringz(foo));

The problem cases are the buffer cases I mentioned earlier, and they  
wouldn't be a problem if char was initialised to \0 as I first imagined.

Other replies inline below..

On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:
 On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath  
 <regan netmail.co.nz> wrote:
 What if you expect the function is expecting to write to the  
 buffer, and the compiler just made a copy of it?  Won't that be  
 pretty surprising?
Assuming a C function in this form: void write_to_buffer(char *buffer, int length);
No, assuming C function in this form: void ucase(char* str); Essentially, a C function which takes a writable already-null-terminated string, and writes to it.
Ok, that's an even better example for my case. It would be used/called like... char[] foo; .. code which populates foo with something .. ucase(foo); and in D today this would corrupt memory. Unless the programmer remembered to write:
No, it wouldn't compile. char[] does not cast implicitly to char *. (if it does, that needs to change).
Replace foo with foo.ptr, it makes no difference to the point I was making.
Your fix does not help in that case, foo.ptr will be passed as a non-null terminated string.
No, see above.
 So, your proposal fixes the case:

 1. The user tries to pass a string/char[] to a C function.  Fails to  
 compile.
 2. Instead of trying to understand the issue, realizes the .ptr member  
 is the right type, and switches to that.

 It does not fix or help with cases where:

   * a programmer notices the type of the parameter is char * and uses  
 foo.ptr without trying foo first. (crash)
   * a programmer calls toStringz without going through the compile/fix  
 cycle above.
   * a programmer tries to pass string/char[], fails to compile, then  
 looks up how to interface with C and finds toStringz

 I think this fix really doesn't solve a very common problem.
See above, my intention was to solve all the cases listed here as I suspect the compiler can detect them all, and just 'do the right thing'. In these cases.. 1. If the programmer writes foo.ptr, the compiler detects that, calls toStringz on 'foo' (not foo.ptr) and updates foo as required (if reallocation occurs). toStringz returns foo.ptr (I assume). 3. If the programmer passes 'foo', the compiler calls toStringz etc.
 I am assuming also that if this idea were implemented it would handle  
 things intelligently, like for example if when toStringz is called  
 the underlying array is out of room and needs to be reallocated, the  
 compiler would update the slice/reference 'foo' in the same way as it  
 already does for an append which triggers a reallocation.
OK, but what if it's like this: char[] foo = new char[100]; auto bar = foo; ucase(foo); In most cases, bar is also written to, but in some cases only foo is written to. Granted, we're getting further out on the hypothetical limb here :) But my point is, making it require explicit calling of toStringz instead of implicit makes the code less confusing, because you understand "oh, toStringz may reallocate, so I can't expect bar to also get updated" vs. simply calling a function with a buffer.
This is not a 'new' problem introduced by the idea, it's a general problem for D/arrays/slices and the same happens with an append, right? In which case it's not a reason against the idea.
It's new to the features of the C function being called. If you look up the man page for such a hypothetical function, it might claim that it alters the data passed in through the argument, but it seems to not be the case! So there's no way for someone (who arguably is not well versed in C functions if they didn't know to use toStringz) to figure out why the code seems not to do what it says it should. Such a programmer may blame either the implementation of the C function, or blame the D compiler for not calling the function properly.
None of this is relevant, let me explain.. My idea is for the compiler to detect a char* parameter to an extern "C" function and to call toStringz. When it does so it will correctly update the slice/array being passed if reallocation occurs. The C function will write to the slice/array being passed. So, it's not relevant if there was another slice referencing the array before it was reallocated, because that case is no different to calling a D function which does something similar, like appending to the passed slice/array. In short, the end result will ALWAYS be that the passed slice/array will contain the output of the C function. The goal is to make a call to an extern "C" function "just work" in the has it's own string type.
 You might initially extern it as:

    extern "C" void write_to_buffer(char *buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero  
 terminated already (in both cases), or am I wrong about that?
In neither case are they required to be null terminated.
True, but I was outlining the worst case scenario for my suggestion, not describing the real C function requirements.
No, I mean you were wrong, D does not guarantee either of those (stack allocated or heap allocated) is null terminated. So toStringz must add a '\0' at the end (which is mildly expensive for heap data, and very expensive for stack data).
Ah, ok, this was because I had forgotten char is initialised to 0xFF. If it was initialised to \0 then both arrays would have been full of null terminators. The default value of char is the killing blow to the idea.
toStringz does not currently check for '\0' anywhere in the existing string. It simply appends '\0' to the end of the passed string. If you want it to check for '\0', how far should it go? Doesn't this also add to the overhead (looping over all chars looking for '\0')? Note also, that toStringz has old code that used to check for "one byte beyond" the array, but this is commented out, because it's unreliable (could cause a segfault).
So, toStringz is not as clever as I imagined. I thought it would intelligently detect cases where a \0 was already present in the slice (from 0 to $) and if not, put one at $+1 (inside pre-allocated array memory). I was assuming toStringz had access to the underlying array allocation size and would know how far it can 'look' without causing a segfault. In the case where the slice length equaled the array reserved memory area, it would re-allocate and place the \0 at $+1 (inside the newly allocated memory).

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
Jul 13 2011
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 I am suggesting the compiler will perform a special operation on all  
 char* parameters passed to extern "C" functions.

 The operation is a toStringz like operation which is (more or less) as  
 follows:

 1. If there is a \0 character inside foo[0..$], do nothing.
This is an O(n) operation -- too much overhead. Especially if you already know foo has a 0 in it. Note that toStringz does not have this overhead.
 2. If the array allocated memory is > the array length, place a \0 at  
 foo[$]
The check to see if the array has allocated length requires a GC lock, and O(lgn) search for the block info in the GC. Not that it doesn't already happen in toStringz, but I just want to point out that it's not a small cost.
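The allocated-length check mentioned here is visible from D code through the built-in `capacity` property of slices, which asks the runtime about the memory block behind a slice. A minimal sketch, with hypothetical function names, showing that only a GC-backed slice reports room to grow:

```d
// Hypothetical helpers illustrating the `capacity` property of slices.
size_t heapCapacity()
{
    char[] heap = "abc".dup;   // GC-allocated copy
    return heap.capacity;      // elements the slice can hold before reallocating
}

size_t stackCapacity()
{
    char[3] stack = "abc";     // lives on the stack, unknown to the GC
    return stack[].capacity;   // 0: an append must always reallocate
}
```

Because the slice itself carries only a pointer and a length, the answer has to come from a runtime lookup rather than from the slice.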
 3. Reallocate the array memory, updating foo, place a \0 at foo[$]
 4. Call the C function passing foo.ptr

 So, it will handle all the following cases:

 char[] foo;
 .. code to populate foo ..

 ucase(foo);
 ucase(foo.ptr);
I read in your responses below, this is due to you making this equivalent to ucase(foo)? This still has the same problems I listed above.

What about

char * foo;
.. code to populate foo ..
ucase(foo);

Is there still anything special done by the compiler?
 ucase(toStringz(foo));

 The problem cases are the buffer cases I mentioned earlier, and they  
 wouldn't be a problem if char was initialised to \0 as I first imagined.
The largest problem I've had with all this is there is a necessary overhead of conversion. Not only that, but due to the way reallocation works, there may be a move of data. I think it's better to require explicit calls incurring such overhead vs. hiding the overhead calls from the developer. Especially if the overhead calls are unnecessary.
 Other replies inline below..

 On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:
 Replace foo with foo.ptr, it makes no difference to the point I was  
 making.
Your fix does not help in that case, foo.ptr will be passed as a non-null terminated string.
No, see above.
How does your proposal know that a char * is part of a heap-allocated array? If you are assuming the only case where char * is passed will be arr.ptr, then that doesn't cut it. What if the compiler doesn't know where the char * came from? The inherent problem of zero-terminated strings is that you don't know how long it is until you search for a zero. If it's not properly terminated, then you are screwed. That problem cannot be "solved", even with compiler help -- you can get situations where there is no more information other than the pointer.
 So, your proposal fixes the case:

 1. The user tries to pass a string/char[] to a C function.  Fails to  
 compile.
 2. Instead of trying to understand the issue, realizes the .ptr member  
 is the right type, and switches to that.

 It does not fix or help with cases where:

   * a programmer notices the type of the parameter is char * and uses  
 foo.ptr without trying foo first. (crash)
   * a programmer calls toStringz without going through the compile/fix  
 cycle above.
   * a programmer tries to pass string/char[], fails to compile, then  
 looks up how to interface with C and finds toStringz

 I think this fix really doesn't solve a very common problem.
See above, my intention was to solve all the cases listed here as I suspect the compiler can detect them all, and just 'do the right thing'. In these cases.. 1. If the programmer writes foo.ptr, the compiler detects that, calls toStringz on 'foo' (not foo.ptr) and updates foo as required (if reallocation occurs).
What if it's not foo.ptr? What if it's some random char * whose origin the compiler isn't aware of?

 toStringz returns foo.ptr (I assume).
Huh? Why should it do anything with toStringz? I'm not getting this one, toStringz already has done the work your proposal wants to do.
 This is not a 'new' problem introduced by the idea, it's a general  
 problem for D/arrays/slices and the same happens with an append,  
 right?  In which case it's not a reason against the idea.
It's new to the features of the C function being called. If you look up the man page for such a hypothetical function, it might claim that it alters the data passed in through the argument, but it seems to not be the case! So there's no way for someone (who arguably is not well versed in C functions if they didn't know to use toStringz) to figure out why the code seems not to do what it says it should. Such a programmer may blame either the implementation of the C function, or blame the D compiler for not calling the function properly.
None of this is relevant, let me explain.. My idea is for the compiler to detect a char* parameter to an extern "C" function and to call toStringz. When it does so it will correctly update the slice/array being passed if reallocation occurs. The C function will write to the slice/array being passed. So, it's not relevant if there was another slice referencing the array before it was reallocated, because that case is no different to calling a D function which does something similar, like appending to the passed slice/array.
What about this case?

char buffer[12];
buffer[] = "hello, world";
ucase(buffer[]); // does nothing to buffer!

I'm saying, the charter of the function is to update a string in place, and your proposal is making that not true in some cases.
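The effect can be reproduced without any C function. The sketch below is only an illustration under assumptions: the function name is hypothetical, the append stands in for the implicit termination step, and the write to `s[0]` stands in for what ucase would do. It uses the D-style declaration `char[12] buffer`.

```d
// Hypothetical demonstration: returns true when a write made after the
// append did not reach the original stack buffer.
bool appendDetachesFromStack()
{
    char[12] buffer = "hello, world";
    char[] s = buffer[];        // slice of stack memory, no spare capacity

    s ~= '\0';                  // must reallocate onto the GC heap
    s[0] = 'H';                 // stands in for the C function's write

    return s.ptr != buffer.ptr && buffer[0] == 'h';
}
```

The buffer is exactly full, so the terminator cannot be added in place and the callee ends up writing to a copy.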
 The goal is to make a call to an extern "C" function "just work" in the  

 has it's own string type.
a '0' at the end affects all references to that string, reallocation or not.
 toStringz does not currently check for '\0' anywhere in the existing  
 string.  It simply appends '\0' to the end of the passed string.  If  
 you want it to check for '\0', how far should it go?  Doesn't this also  
 add to the overhead (looping over all chars looking for '\0')?

 Note also, that toStringz has old code that used to check for "one byte  
 beyond" the array, but this is commented out, because it's unreliable  
 (could cause a segfault).
So, toStringz is not as clever as I imagined. I thought it would intelligently detect cases where a \0 was already present in the slice (from 0 to $) and if not, put one at $+1 (inside pre-allocated array memory). I was assuming toStringz had access to the underlying array allocation size and would know how far it can 'look' without causing a segfault. In the case where the slice length equaled the array reserved memory area, it would re-allocate and place the \0 at $+1 (inside the newly allocated memory).
s/clever/slow/

The only "intelligent" way to check for a 0 is a linear search. Without knowing where the data came from, there is no way to look past the slice without possibly causing a segfault. If you know it's a heap allocation, you can look at the block information to see if you can look past it. This might be possible to do for toStringz, but the linear check for 0 is just unacceptable for a simple function call. Appending a 0 is at least amortized.

One thing though, it could make some smarter decisions as to whether to reallocate depending on the type of the array, since it is already doing a lookup of block info.

But I still always come back to the fact that I should be able to circumvent some auto-intelligent decision that isn't aware of things that a developer can be aware of (such as knowing an array already contains a 0). The compiler shouldn't be too intrusive here.

-Steve
Jul 13 2011
next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On 2011-07-13 09:00, Steven Schveighoffer wrote:
 On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan netmail.co.nz>
 
 wrote:
 I am suggesting the compiler will perform a special operation on all
 char* parameters passed to extern "C" functions.
 
 The operation is a toStringz like operation which is (more or less) as
 follows:
 
 1. If there is a \0 character inside foo[0..$], do nothing.
This is an O(n) operation -- too much overhead. Especially if you already know foo has a 0 in it. Note that toStringz does not have this overhead.
 2. If the array allocated memory is > the array length, place a \0 at
 foo[$]
The check to see if the array has allocated length requires a GC lock, and O(lgn) search for the block info in the GC. Not that it doesn't already happen in toStringz, but I just want to point out that it's not a small cost.
 3. Reallocate the array memory, updating foo, place a \0 at foo[$]
 4. Call the C function passing foo.ptr
 
 So, it will handle all the following cases:
 
 char[] foo;
 .. code to populate foo ..
 
 ucase(foo);
 ucase(foo.ptr);
I read in your responses below, this is due to you making this equivalent to ucase(foo)? This still has the same problems I listed above. What about char * foo; .. code to populate foo .. ucase(foo); Is there still anything special done by the compiler?
 ucase(toStringz(foo));
 
 The problem cases are the buffer cases I mentioned earlier, and they
 wouldn't be a problem if char was initialised to \0 as I first imagined.
The largest problem I've had with all this is there is a necessary overhead of conversion. Not only that, but due to the way reallocation works, there may be a move of data. I think it's better to require explicit calls incurring such overhead vs. hiding the overhead calls from the developer. Especially if the overhead calls are unnecessary.
 Other replies inline below..
 
 On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer
 
 <schveiguy yahoo.com> wrote:
 On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz>
 
 wrote:
 Replace foo with foo.ptr, it makes no difference to the point I was
 making.
Your fix does not help in that case, foo.ptr will be passed as a non-null terminated string.
No, see above.
How does your proposal know that a char * is part of a heap-allocated array? If you are assuming the only case where char * is passed will be arr.ptr, then that doesn't cut it. What if the compiler doesn't know where the char * came from? The inherent problem of zero-terminated strings is that you don't know how long it is until you search for a zero. If it's not properly terminated, then you are screwed. That problem cannot be "solved", even with compiler help -- you can get situations where there is no more information other than the pointer.
 So, your proposal fixes the case:
 
 1. The user tries to pass a string/char[] to a C function. Fails to
 compile.
 2. Instead of trying to understand the issue, realizes the .ptr member
 is the right type, and switches to that.
 
 It does not fix or help with cases where:
 * a programmer notices the type of the parameter is char * and uses
 
 foo.ptr without trying foo first. (crash)
 
 * a programmer calls toStringz without going through the compile/fix
 
 cycle above.
 
 * a programmer tries to pass string/char[], fails to compile, then
 
 looks up how to interface with C and finds toStringz
 
 I think this fix really doesn't solve a very common problem.
See above, my intention was to solve all the cases listed here as I suspect the compiler can detect them all, and just 'do the right thing'. In these cases.. 1. If the programmer writes foo.ptr, the compiler detects that, calls toStringz on 'foo' (not foo.ptr) and updates foo as required (if reallocation occurs).
What if it's not foo.ptr? What if it's some random char * whose origin the compiler isn't aware of?

 toStringz returns foo.ptr (I assume).
Huh? Why should it do anything with toStringz? I'm not getting this one, toStringz already has done the work your proposal wants to do.
 This is not a 'new' problem introduced by the idea, it's a general
 problem for D/arrays/slices and the same happens with an append,
 right? In which case it's not a reason against the idea.
It's new to the features of the C function being called. If you look up the man page for such a hypothetical function, it might claim that it alters the data passed in through the argument, but it seems to not be the case! So there's no way for someone (who arguably is not well versed in C functions if they didn't know to use toStringz) to figure out why the code seems not to do what it says it should. Such a programmer may blame either the implementation of the C function, or blame the D compiler for not calling the function properly.
None of this is relevant, let me explain.. My idea is for the compiler to detect a char* parameter to an extern "C" function and to call toStringz. When it does so it will correctly update the slice/array being passed if reallocation occurs. The C function will write to the slice/array being passed. So, it's not relevant if there was another slice referencing the array before it was reallocated, because that case is no different to calling a D function which does something similar, like appending to the passed slice/array.
What about this case? char buffer[12]; buffer[] = "hello, world"; ucase(buffer[]); // does nothing to buffer! I'm saying, the charter of the function is to update a string in place, and your proposal is making that not true in some cases.
 The goal is to make a call to an extern "C" function "just work" in the

 has it's own string type.
a '0' at the end affects all references to that string, reallocation or not.
 toStringz does not currently check for '\0' anywhere in the existing
 string. It simply appends '\0' to the end of the passed string. If
 you want it to check for '\0', how far should it go? Doesn't this also
 add to the overhead (looping over all chars looking for '\0')?
 
 Note also, that toStringz has old code that used to check for "one byte
 beyond" the array, but this is commented out, because it's unreliable
 (could cause a segfault).
So, toStringz is not as clever as I imagined. I thought it would intelligently detect cases where a \0 was already present in the slice (from 0 to $) and if not, put one at $+1 (inside pre-allocated array memory). I was assuming toStringz had access to the underlying array allocation size and would know how far it can 'look' without causing a segfault. In the case where the slice length equaled the array reserved memory area, it would re-allocate and place the \0 at $+1 (inside the newly allocated memory).
s/clever/slow/ The only "intelligent" way to check for a 0 is a linear search. Without knowing where the data came from, there is no way to look past the slice without possibly calling a segfault. If you know it's a heap allocation, you can look at the block information to see if you can look past it. This might be possible to do for toStringz, but the linear check for 0 is just unacceptable for a simple function call. Appending a 0 is at least amortized. One thing though, it could make some smarter decisions as to whether to reallocate depending on the type of the array, since it is already doing a lookup of block info. But I still always come back to the fact that I should be able to circumvent some auto-intelligent decision that isn't aware of things that a developer can be aware of (such as knowing an array already contains a 0). The compiler shouldn't be too intrusive here.
Andrej Mitrovic found a rather annoying issue (which is fortunately highly unlikely and therefore almost certainly rare) with toStringz and toUTFz with checking for a terminating '\0' one past the end of the string (which both functions do under some circumstances). You might want to have a look at it:

https://github.com/D-Programming-Language/phobos/pull/123

Given what you know about the GC and arrays, your thoughts on the matter would be welcome.

- Jonathan M Davis
Jul 13 2011
prev sibling parent reply "Regan Heath" <regan netmail.co.nz> writes:
On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:
 On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 I am suggesting the compiler will perform a special operation on all  
 char* parameters passed to extern "C" functions.

 The operation is a toStringz like operation which is (more or less) as  
 follows:

 1. If there is a \0 character inside foo[0..$], do nothing.
This is an O(n) operation -- too much overhead. Especially if you already know foo has a 0 in it. Note that toStringz does not have this overhead.
On 2nd thought, this step is unnecessary unless the array length matches the memory block length .. it was intended to detect an existing \0 and avoid the reallocation. But, this case is rare so this step could be skipped for the general case, or only carried out when the lengths match and reallocation is a possibility we want to avoid, or not if the cost is too high even for that.
 2. If the array allocated memory is > the array length, place a \0 at  
 foo[$]
The check to see if the array has allocated length requires a GC lock, and O(lgn) search for the block info in the GC. Not that it doesn't already happen in toStringz, but I just want to point out that it's not a small cost.
This is the cost Walter mentioned earlier. Does this mean that heap allocated arrays do not know how much memory they have allocated? I was assuming they held that information, and that a slice to them would also know. How else does an array append operation know whether to reallocate? Does it have to obtain the GC lock and perform an O(lgn) search on every append?
 3. Reallocate the array memory, updating foo, place a \0 at foo[$]
 4. Call the C function passing foo.ptr

 So, it will handle all the following cases:

 char[] foo;
 .. code to populate foo ..

 ucase(foo);
 ucase(foo.ptr);
I read in your responses below, this is due to you making this equivalent to ucase(foo)? This still has the same problems I listed above.
Problems above? You mean the cost? Yes, there is a cost to pay, but it's a cost which has to be paid (and is already paid by calling toStringz) to avoid corrupting memory whether it's done explicitly or implicitly. And the cost is only paid for extern "C" functions with char* parameters. In the rare case where the string already contains \0 and the programmer can guarantee that, we can have some way to indicate it, or in some cases changing the function parameter to ubyte* or byte* may be the correct solution.
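For readers following along, steps 2 to 4 of the proposal can be sketched in today's D. This is only a rough illustration: `withTerminator` is a hypothetical helper name, not part of Phobos or the compiler, and it relies on the ordinary append operator to decide between writing in place and reallocating.

```d
import core.stdc.string : strlen;

// Hypothetical helper (not a Phobos function): makes sure a '\0' sits
// directly after the last element of foo and returns a C-style pointer.
char* withTerminator(ref char[] foo)
{
    // Append writes the '\0' in place when the block has spare capacity,
    // otherwise it reallocates and rebinds foo (steps 2 and 3).
    foo ~= '\0';
    char* p = foo.ptr;
    // Shrink the slice again so the terminator is one past the end.
    foo = foo[0 .. $ - 1];
    return p; // step 4: this is what would be handed to the C function
}

void main()
{
    char[] foo = "hello".dup;
    char* p = withTerminator(foo);
    assert(foo == "hello");   // contents and length are unchanged
    assert(p is foo.ptr);     // foo tracks the (possibly new) block
    assert(strlen(p) == 5);   // C sees a properly terminated string
}
```

Note that the caller's slice is updated through `ref`, which is the "updating foo" part of step 3; any other slice of the old block is left pointing at the old data if a reallocation happened.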
 What about

 char * foo;
 .. code to populate foo ..
 ucase(foo);

 Is there still anything special done by the compiler?
Assuming foo is allocated by the GC, toStringz can still find the length of the memory block, so we can handle this case as well (for no extra cost beyond that already incurred by toStringz).
 ucase(toStringz(foo));

 The problem cases are the buffer cases I mentioned earlier, and they  
 wouldn't be a problem if char was initialised to \0 as I first imagined.
The largest problem I've had with all this is there is a necessary overhead of conversion. Not only that, but due to the way reallocation works, there may be a move of data. I think it's better to require explicit calls incurring such overhead vs. hiding the overhead calls from the developer. Especially if the overhead calls are unnecessary.
But, the overhead is something we already pay calling toStringz explicitly, and the reallocation is no different to an append operation. Generally speaking I would normally agree that it's better to require explicit calls incurring overhead etc, but this specific case is something new D programmers stumble on all the time, and it makes D look less slick different for each, but if we can achieve something similar for no extra cost (other than we already pay calling toStringz explicitly), then it's well worth considering. As far as I can see the only problem cases are those where we incur more cost than toStringz when it's not required, and those cases seem rare to me, and could be handled by an opt-out decoration/keyword or similar.
 Other replies inline below..

 On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:
 Replace foo with foo.ptr, it makes no difference to the point I was  
 making.
Your fix does not help in that case; foo.ptr will be passed as a non-null-terminated string.
No, see above.
How does your proposal know that a char * is part of a heap-allocated array? If you are assuming the only case where char * is passed will be arr.ptr, then that doesn't cut it. What if the compiler doesn't know where the char * came from?
See your Q and my A above ("char * foo" example).
 The inherent problem of zero-terminated strings is that you don't know  
 how long it is until you search for a zero.  If it's not properly  
 terminated, then you are screwed.  That problem cannot be "solved", even  
 with compiler help -- you can get situations where there is no more  
 information other than the pointer.
Really? But can't we obtain the GC lock and look them up, as mentioned above? And isn't this exactly what toStringz will do when the programmer first of all curses because it has crashed, and then adds an explicit toStringz call?
 So, your proposal fixes the case:

 1. The user tries to pass a string/char[] to a C function.  Fails to  
 compile.
 2. Instead of trying to understand the issue, realizes the .ptr member  
 is the right type, and switches to that.

 It does not fix or help with cases where:

   * a programmer notices the type of the parameter is char * and uses  
 foo.ptr without trying foo first. (crash)
   * a programmer calls toStringz without going through the compile/fix  
 cycle above.
   * a programmer tries to pass string/char[], fails to compile, then  
 looks up how to interface with C and finds toStringz

 I think this fix really doesn't solve a very common problem.
See above, my intention was to solve all the cases listed here as I suspect the compiler can detect them all, and just 'do the right thing'. In these cases.. 1. If the programmer writes foo.ptr, the compiler detects that, calls toStringz on 'foo' (not foo.ptr) and updates foo as required (if reallocation occurs).
What if it's not foo.ptr? What if it's some random char * whose origin the compiler isn't aware of?
See above.

 toStringz returns foo.ptr (I assume).
Huh? Why should it do anything with toStringz? I'm not getting this one, toStringz already has done the work your proposal wants to do.
I was assuming the compiler could not detect the case where the programmer is explicitly calling toStringz i.e. what would be legacy code assuming this proposal came into effect.
 This is not a 'new' problem introduced by the idea, it's a general  
 problem for D/arrays/slices and the same happens with an append,  
 right?  In which case it's not a reason against the idea.
It's new to the features of the C function being called. If you look up the man page for such a hypothetical function, it might claim that it alters the data passed in through the argument, but it seems to not be the case! So there's no way for someone (who arguably is not well versed in C functions if they didn't know to use toStringz) to figure out why the code seems not to do what it says it should. Such a programmer may blame either the implementation of the C function, or blame the D compiler for not calling the function properly.
None of this is relevant, let me explain.. My idea is for the compiler to detect a char* parameter to an extern "C" function and to call toStringz. When it does so it will correctly update the slice/array being passed if reallocation occurs. The C function will write to the slice/array being passed. So, it's not relevant if there was another slice referencing the array before it was reallocated, because that case is no different to calling a D function which does something similar, like appending to the passed slice/array.
What about this case?

    char buffer[12];
    buffer[] = "hello, world";
    ucase(buffer[]); // does nothing to buffer!

I'm saying, the charter of the function is to update a string in place, and your proposal is making that not true in some cases.
Sure, but how is that different to this:

    char buffer[12];
    buffer[] = "hello, world";
    ucase(buffer ~ "a"); // does nothing to buffer!

or in fact this:

    char buffer[12];
    buffer[] = "hello, world";
    ucase(cast(char*)toStringz(buffer)); // does nothing to buffer!

In both cases buffer remains unchanged.
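The behaviour both posters are describing can be reproduced with a small self-contained program. The `ucase` below is a hypothetical pure-D stand-in for the C function discussed in the thread (the real one is never shown), written so the example runs without any external library.

```d
import std.ascii : toUpper;

// Hypothetical stand-in for the C `ucase`: upper-cases a zero-terminated
// string in place.
void ucase(char* s)
{
    for (; *s != '\0'; ++s)
        *s = cast(char) toUpper(*s);
}

void main()
{
    char[12] buffer = "hello, world";   // full: no room for a '\0'

    // Appending the terminator must allocate a new array, so the C-style
    // function works on a copy rather than on buffer itself.
    char[] copy = buffer[] ~ '\0';
    ucase(copy.ptr);

    assert(buffer[] == "hello, world");         // original untouched
    assert(copy[0 .. $ - 1] == "HELLO, WORLD"); // only the copy changed
}
```

Whether the copy is made by an explicit call or by an implicit compiler-inserted one, the original fixed-size buffer is not modified, which is the point of disagreement here.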
 The goal is to make a call to an extern "C" function "just work" in the  

 has its own string type.
adding a '0' at the end affects all references to that string, reallocation or not.
\0 to the string. For all I know they're making a completely new copy, the goal is.
 toStringz does not currently check for '\0' anywhere in the existing  
 string.  It simply appends '\0' to the end of the passed string.  If  
 you want it to check for '\0', how far should it go?  Doesn't this  
 also add to the overhead (looping over all chars looking for '\0')?

 Note also, that toStringz has old code that used to check for "one  
 byte beyond" the array, but this is commented out, because it's  
 unreliable (could cause a segfault).
So, toStringz is not as clever as I imagined. I thought it would intelligently detect cases where a \0 was already present in the slice (from 0 to $) and if not, put one at $+1 (inside pre-allocated array memory). I was assuming toStringz had access to the underlying array allocation size and would know how far it can 'look' without causing a segfault. In the case where the slice length equaled the array reserved memory area, it would re-allocate and place the \0 at $+1 (inside the newly allocated memory).
s/clever/slow/ The only "intelligent" way to check for a 0 is a linear search.
Fair enough.
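To make the cost concrete, here is what the "check for an existing \0" step amounts to. `containsNul` is an illustrative name, not an existing library function; the point is only that every character has to be visited in the worst case.

```d
// Illustrative helper: linear scan for an embedded terminator.
// Worst case it touches every element, i.e. O(n) in the slice length.
bool containsNul(const(char)[] s)
{
    foreach (c; s)
    {
        if (c == '\0')
            return true;
    }
    return false;
}

void main()
{
    assert(containsNul("abc\0def"));
    assert(!containsNul("abc"));
}
```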
 Without knowing where the data came from, there is no way to look past  
 the slice without possibly calling a segfault.  If you know it's a heap  
 allocation, you can look at the block information to see if you can look  
 past it.  This might be possible to do for toStringz, but the linear  
 check for 0 is just unacceptable for a simple function call.  Appending  
 a 0 is at least amortized.  One thing though, it could make some smarter  
 decisions as to whether to reallocate depending on the type of the  
 array, since it is already doing a lookup of block info.
Ok, scrap the linear search, or only perform it when a reallocation may be required.
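As a rough illustration of the block-info lookup being discussed, the built-in `capacity` property reports how far a slice may grow in place, and returns 0 when an append would have to reallocate. The variable names below are made up for the example.

```d
void main()
{
    char[] heap = "hello".dup;   // GC-allocated, appendable
    char[5] fixed = "hello";     // lives on the stack

    // A freshly allocated array can grow in place up to its capacity.
    assert(heap.capacity >= heap.length);

    // A slice that does not end at the used end of the block cannot be
    // extended in place, so its capacity is 0.
    assert(heap[0 .. 2].capacity == 0);

    // Memory the GC does not own never has append capacity.
    assert(fixed[].capacity == 0);
}
```

Each `capacity` query goes through the runtime to find the block, which is the non-trivial cost mentioned earlier in the thread.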
 But I still always come back to the fact that I should be able to  
 circumvent some auto-intelligent decision that isn't aware of things  
 that a developer can be aware of (such as knowing an array already  
 contains a 0).  The compiler shouldn't be too intrusive here.
Sure, we want to keep everyone happy; the Q is, to my mind, which is the more general case. It would be nice to have your cake and eat it too, or in other words for the general case (as I see it):

    char[] foo;
    .. code which populates foo ..
    ucase(foo);

to "just work" as a new D programmer might expect. At the same time I agree that in cases where speed is of the essence, or the data is guaranteed to contain \0, we need to be able to avoid the cost.

As with most things it comes down to cost/benefit, and I think D would benefit from this default behaviour, provided there is a way to avoid it as well. Perhaps restricting the idea to cases like the one above, where the compiler has the information for the slice/array, and doing nothing for raw char* cases is a good compromise; it would allow people to avoid the behaviour just by adding .ptr or similar.
Jul 13 2011
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 How does your proposal know that a char * is part of a heap-allocated  
 array?  If you are assuming the only case where char * is passed will  
 be arr.ptr, then that doesn't cut it.  What if the compiler doesn't  
 know where the char * came from?
See your Q and my A above ("char * foo" example).
 The inherent problem of zero-terminated strings is that you don't know  
 how long it is until you search for a zero.  If it's not properly  
 terminated, then you are screwed.  That problem cannot be "solved",  
 even with compiler help -- you can get situations where there is no  
 more information other than the pointer.
Really? But can't we obtain the GC lock and look them up, as mentioned above? And isn't this exactly what toStringz will do when the programmer first of all curses because it has crashed, and then adds an explicit toStringz call?
Who said the char * points into GC memory? It could point at stack memory, or static data in ROM. -Steve
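A small sketch of this point, using `core.memory.GC.addrOf`, which returns the base address of the GC block containing a pointer, or null when the pointer is not into GC-managed memory:

```d
import core.memory : GC;

void main()
{
    char[] heap = "hello".dup;   // GC heap
    char[5] onStack = "hello";   // stack memory

    // The GC knows about the heap block...
    assert(GC.addrOf(heap.ptr) !is null);

    // ...but has no information at all about stack memory, so nothing
    // can be looked up for such a pointer.
    assert(GC.addrOf(onStack.ptr) is null);
}
```

So a GC lookup can only help for pointers that happen to be GC-allocated; for anything else there is no length or capacity to recover.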
Jul 13 2011
parent reply "Regan Heath" <regan netmail.co.nz> writes:
On Wed, 13 Jul 2011 19:31:42 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 How does your proposal know that a char * is part of a heap-allocated  
 array?  If you are assuming the only case where char * is passed will  
 be arr.ptr, then that doesn't cut it.  What if the compiler doesn't  
 know where the char * came from?
See your Q and my A above ("char * foo" example).
 The inherent problem of zero-terminated strings is that you don't know  
 how long it is until you search for a zero.  If it's not properly  
 terminated, then you are screwed.  That problem cannot be "solved",  
 even with compiler help -- you can get situations where there is no  
 more information other than the pointer.
Really? But can't we obtain the GC lock and look them up, as mentioned above? And isn't this exactly what toStringz will do when the programmer first of all curses because it has crashed, and then adds an explicit toStringz call?
Who said the char * points into GC memory? It could point at stack memory, or static data in ROM.
Ok. What would toStringz do in this case? .. because that's what I'm proposing we do here.

The goal here is to pick some low hanging fruit, the general case mentioned earlier, and make it work as a new D programmer would expect. In that case there is no technical difficulty implementing it (toStringz already exists), there is no extra cost (you already have to call toStringz), and the only disagreement seems to be whether it should be implicit or explicit.

In this particular case I cannot see any harm in making it implicit. Yes, there are some edge cases, but they either already exist (as shown by the explicit toStringz example I gave where the passed char[] remained unchanged, and your example passing buffer[]), or they may be detectable by the compiler, or they are rare - in which case requiring some manual intervention is not too much to ask.

So, on balance I reckon the implicit call would be "better" for more people more of the time, and at no extra cost. It seems like a win/win to me. Yes, there are edge cases, yes there are wrinkles to iron out, no it's not a "general/covers everything perfectly" kind of idea - which I agree we'd all prefer, but it makes D look slicker, and removes one more stumbling block for new D programmers.
Jul 14 2011
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 14 Jul 2011 05:53:47 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Wed, 13 Jul 2011 19:31:42 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 How does your proposal know that a char * is part of a heap-allocated  
 array?  If you are assuming the only case where char * is passed will  
 be arr.ptr, then that doesn't cut it.  What if the compiler doesn't  
 know where the char * came from?
See your Q and my A above ("char * foo" example).
 The inherent problem of zero-terminated strings is that you don't  
 know how long it is until you search for a zero.  If it's not  
 properly terminated, then you are screwed.  That problem cannot be  
 "solved", even with compiler help -- you can get situations where  
 there is no more information other than the pointer.
Really? But can't we obtain the GC lock and look them up, as mentioned above? And isn't this exactly what toStringz will do when the programmer first of all curses because it has crashed, and then adds an explicit toStringz call?
Who said the char * points into GC memory? It could point at stack memory, or static data in ROM.
Ok. What would toStringz do in this case? .. because that's what I'm proposing we do here.
Nothing, you don't call toStringz on a char *, you call it on a string. The point is, for those who have already guaranteed a char * has a 0 in it, they should not have to have the compiler injecting useless code for a simple function call. A really really good example is if you use a char * you got from a C function to call another C function.
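A minimal example of the "char * from one C function passed to another" case, using the C standard library bindings in `core.stdc.string`. The buffer contents are made up for illustration.

```d
import core.stdc.string : strchr, strlen;

void main()
{
    // Explicitly terminated buffer: 12 characters plus the '\0'.
    char[13] buf = "hello, world\0";

    // strchr hands back a raw char* into buf; it is already terminated.
    char* comma = strchr(buf.ptr, ',');
    assert(comma !is null);

    // Passing it straight to another C function needs no conversion,
    // and any injected toStringz-style work here would be pure overhead.
    assert(strlen(comma) == 7); // ", world"
}
```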
 The goal here is to pick some low hanging fruit, the general case  
 mentioned earlier, and make it work as a new D programmer would expect.   
 In that case there is no technical difficulty implementing it (toStringz  
 already exists), there is no extra cost (you already have to call  
 toStringz), and the only disagreement seems to be whether it should be  
 implicit or explicit.
There is an extra cost where you wouldn't have to call toStringz currently.
 In this particular case I cannot see any harm in making it implicit.   
 Yes, there are some edge cases, but they either already exist (as shown  
 by the explicit toStringz example I gave where the passed char[]  
 remained unchanged, and your example passing buffer[]), or they may be  
 detectable by the compiler, or they are rare - in which case requiring  
 some manual intervention is not too much to ask.

 So, on balance I reckon the implicit call would be "better" for more  
 people more of the time, and at no extra cost.  It seems like a win/win  
 to me.  Yes, there are edge cases, yes there are wrinkles to iron out,  
 no it's not a "general/covers everything perfectly" kind of idea - which  
 I agree we'd all prefer, but it makes D look slicker, and removes one  
 more stumbling block for new D programmers.
We also have to weigh this against two things:

1. How will existing code (that already calls toStringz) be affected?
2. This is *not* a trivial compiler change. So all other options should be considered, there's a *lot* of C calls that exist from D today that could possibly be affected.

If C strings were their own type (and not conflated with "buffer pointer"), and verifying a C string was valid without segfaulting and in O(1) time, I'd agree that a compiler change would be warranted. There's just too many cases (note, these aren't the majority, but they are enough) where the injected calls will be either performance drags or unnecessary.
Jul 14 2011
parent "Regan Heath" <regan netmail.co.nz> writes:
On Thu, 14 Jul 2011 12:30:24 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:
 On Thu, 14 Jul 2011 05:53:47 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Wed, 13 Jul 2011 19:31:42 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 How does your proposal know that a char * is part of a  
 heap-allocated array?  If you are assuming the only case where char  
 * is passed will be arr.ptr, then that doesn't cut it.  What if the  
 compiler doesn't know where the char * came from?
See your Q and my A above ("char * foo" example).
 The inherent problem of zero-terminated strings is that you don't  
 know how long it is until you search for a zero.  If it's not  
 properly terminated, then you are screwed.  That problem cannot be  
 "solved", even with compiler help -- you can get situations where  
 there is no more information other than the pointer.
Really? But can't we obtain the GC lock and look them up, as mentioned above? And isn't this exactly what toStringz will do when the programmer first of all curses because it has crashed, and then adds an explicit toStringz call?
Who said the char * points into GC memory? It could point at stack memory, or static data in ROM.
Ok. What would toStringz do in this case? .. because that's what I'm proposing we do here.
Nothing, you don't call toStringz on a char *, you call it on a string. The point is, for those who have already guaranteed a char * has a 0 in it, they should not have to have the compiler injecting useless code for a simple function call. A really really good example is if you use a char * you got from a C function to call another C function.
Good points all. So, the idea should be limited to cases where D's char[] and string are passed to extern "C" functions expecting char*, and should not affect cases where D's char* is passed directly. Sounds good.
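For reference, this is what the restricted case looks like in current D, where the conversion has to be spelled out. The example assumes only `std.string.toStringz` and the `strlen` binding from `core.stdc.string`.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    string s = "hello";
    char[] m = "world!".dup;

    // A string or char[] variable does not implicitly convert to a
    // C-style pointer, so this call is rejected at compile time.
    static assert(!__traits(compiles, strlen(s)));

    // Today the conversion is explicit for both types.
    assert(strlen(s.toStringz) == 5);
    assert(strlen(m.toStringz) == 6);
}
```

Under the restricted proposal, only the rejected `strlen(s)` form would change meaning; calls that already pass a char* or call toStringz explicitly would compile exactly as before.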
 The goal here is to pick some low hanging fruit, the general case  
 mentioned earlier, and make it work as a new D programmer would  
 expect.  In that case there is no technical difficulty implementing it  
 (toStringz already exists), there is no extra cost (you already have to  
 call toStringz), and the only disagreement seems to be whether it  
 should be implicit or explicit.
There is an extra cost where you wouldn't have to call toStringz currently.
The point I've tried to make all along is that this is a rare situation, and not the general case. In the general case you're going to need to call toStringz. Especially if you restrict this idea to D's char[] and string and not D's char* as mentioned above.
 In this particular case I cannot see any harm in making it implicit.   
 Yes, there are some edge cases, but they either already exist (as shown  
 by the explicit toStringz example I gave where the passed char[]  
 remained unchanged, and your example passing buffer[]), or they may be  
 detectable by the compiler, or they are rare - in which case requiring  
 some manual intervention is not too much to ask.

 So, on balance I reckon the implicit call would be "better" for more  
 people more of the time, and at no extra cost.  It seems like a win/win  
 to me.  Yes, there are edge cases, yes there are wrinkles to iron out,  
 no it's not a "general/covers everything perfectly" kind of idea -  
 which I agree we'd all prefer, but it makes D look slicker, and removes  
 one more stumbling block for new D programmers.
We also have to weigh this against two things:
Assuming the above mentioned restriction (char[] and string, not char*)...
 1. How will existing code (that already calls toStringz) be affected?
Not at all.
 2. This is *not* a trivial compiler change.  So all other options should  
 be considered, there's a *lot* of C calls that exist from D today that  
 could possibly be affected.
It will affect none of these.
 If C strings were their own type (and not conflated with "buffer  
 pointer"), and verifying a C string was valid without segfaulting and in  
 O(1) time, I'd agree that a compiler change would be warranted.  There's  
 just too many cases (note, these aren't the majority, but they are  
 enough) where the injected calls will be either performance drags or  
 unnecessary.
I disagree about the number of cases being too many, but this is a gut feeling and I have no evidence to support it. I think with the restriction I mentioned above the situation changes however, as all those edge cases are unaffected, old code is unaffected and only new code will allow char[] and string to be passed as extern "C" char* parameters.
Jul 14 2011