digitalmars.D - toStringz or not toStringz

Regan Heath (16/16) Jul 08 2011 Sorry if this has been asked/answered before but I've been out of the lo...

Walter Bright (2/5) Jul 08 2011 Because char* in C does not necessarily mean "zero terminated string".

Regan Heath (20/26) Jul 08 2011 Sure, but in many (most?) cases it does. And in those cases where it

Steven Schveighoffer (21/46) Jul 08 2011 What about a template function that does this automatically? I'm thinkin...

SimonM (12/15) Jul 08 2011 This is kind of off-topic, and I don't know if it's just me, but I've

Mike Parker (6/13) Jul 08 2011 import std.utf;
Jonathan M Davis (3/21) Jul 08 2011 https://github.com/D-Programming-Language/phobos/pull/123

Walter Bright (5/21) Jul 08 2011 In the worst case, you're adding an extra memory allocation and function...

Regan Heath (17/47) Jul 12 2011 This worst case only happens when:

Steven Schveighoffer (5/43) Jul 12 2011 What if you expect the function is expecting to write to the buffer, and...

Regan Heath (16/62) Jul 12 2011 Assuming a C function in this form:

Steven Schveighoffer (12/75) Jul 12 2011 No, assuming C function in this form:

Steven Schveighoffer (6/15) Jul 12 2011 And, actually, the cost penalty of checking if you are going to segfault...

Regan Heath (9/25) Jul 12 2011 I wouldn't know anything about this. I was assuming when toStringz was ...

Regan Heath (36/70) Jul 12 2011 Ok, that's an even better example for my case.

Steven Schveighoffer (22/79) Jul 12 2011 No, it wouldn't compile. char[] does not cast implicitly to char *. (i...

Regan Heath (12/100) Jul 12 2011 Replace foo with foo.ptr, it makes no difference to the point I was maki...

Regan Heath (7/31) Jul 12 2011 Gah.. bad grammar.. 1/2 baked sentences..
Steven Schveighoffer (44/153) Jul 12 2011 You fix does not help in that case, foo.ptr will be passed as a non-null...

Regan Heath (58/189) Jul 13 2011 Ok, it's clear there has been some confusion over what exactly I am

Steven Schveighoffer (57/149) Jul 13 2011 This is an O(n) operation -- too much overhead. Especially if you alrea...

Jonathan M Davis (9/208) Jul 13 2011 Andrej Mitrovic found a rather annoying issue (which is fortunately high...
Regan Heath (82/247) Jul 13 2011 On 2nd thought, this step is unnecessary unless the array length matches...

Steven Schveighoffer (5/21) Jul 13 2011 Who said the char * points into GC memory? It could point at stack

Regan Heath (24/47) Jul 14 2011 Ok. What would toStringz do in this case? .. because that's what I'm

Steven Schveighoffer (19/67) Jul 14 2011 Nothing, you don't call toStringz on a char *, you call it on a string. ...

Regan Heath (20/92) Jul 14 2011 Good points all. So, the idea should be limited to cases where D's char...

"Regan Heath" <regan netmail.co.nz> writes:

Sorry if this has been asked/answered before but I've been out of the loop  
for a while..

I was just thinking about the recent discussion on renaming toStringz and  
I wondered why we need to explicitly call it at all.  Why can't we have  
the compiler call it automatically whenever we pass a string, or char[] to  
an extern "C" function, where the parameter is defined as char*?

I believe some extern "C" functions are defined as taking ubyte* or byte*  
instead of char*, but in those cases I believe they are 'buffers' and have  
a supplied length as well, meaning there is no need for the trailing \0 in  
any case.

I am probably missing something obvious, but it seems like it might work.

Side note.. It bothers me a little that 'char' means utf-8 codepoint in D,  
and means unsigned byte in extern "C" definitions, but I can live with  
that.

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/

Jul 08 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[] to an
 extern "C" function, where the parameter is defined as char*?

Because char* in C does not necessarily mean "zero terminated string".

Jul 08 2011

"Regan Heath" <regan netmail.co.nz> writes:

On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[] to  
 an extern
 "C" function, where the parameter is defined as char*?

 Because char* in C does not necessarily mean "zero terminated string".

Sure, but in many (most?) cases it does.  And in those cases where it  
doesn't you could argue ubyte* or byte* should have been used in the D  
extern "C" declaration instead.  Plus, in those cases, worst case  
scenario, D passes an extra \0 byte to those functions which either ignore  
it because they were also passed a length, or expect a fixed sized  
structure, or .. I don't know what as I can't imagine another case where  
char* would be used without it being a "zero terminated string", or  
passing/knowing the length ahead of time.

D is already allocating an extra \0 byte for string constants right?  And,  
I assume, toStringz is already clever enough to detect cases where there  
is already a \0 in the correct position, or utilises the existing  
preallocated space remaining in a dynamic array, making it almost a  
no-op.  The only case it actually does any work is a dynamic or static  
array which is full.  In the former case the array is resized, and I'm not  
sure about the latter but I suspect it's more expensive.  So, it seems the  
cost of this is very low.

Jul 08 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 08 Jul 2011 07:53:20 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
 <newshound2 digitalmars.com> wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[] to  
 an extern
 "C" function, where the parameter is defined as char*?

 Because char* in C does not necessarily mean "zero terminated string".

 Sure, but in many (most?) cases it does.  And in those cases where it  
 doesn't you could argue ubyte* or byte* should have been used in the D  
 extern "C" declaration instead.  Plus, in those cases, worst case  
 scenario, D passes an extra \0 byte to those functions which either  
 ignore it because they were also passed a length, or expect a fixed  
 sized structure, or .. I don't know what as I can't imagine another case  
 where char* would be used without it being a "zero terminated string",  
 or passing/knowing the length ahead of time.

 D is already allocating an extra \0 byte for string constants right?   
 And, I assume, toStringz is already clever enough to detect cases where  
 there is already a \0 in the correct position, or utilises the existing  
 preallocated space remaining in a dynamic array, making it almost a  
 no-op.  The only case it actually does any work is a dynamic or static  
 array which is full.  In the former case the array is resized, and I'm  
 not sure about the latter but I suspect it's more expensive.  So, it  
 seems the cost of this is very low.

What about a template function that does this automatically? I'm thinking  
something like opDispatch:

extern(C) void foo(const(char)* c);

struct CCall
{
    // constraint idea: `call` names a C function (can check this somehow?)
    static auto opDispatch(string call, S...)(S args)
    {
        /* determine which args of S are char[], and translate them to
           zero-terminated */
        ...
    }
}

usage:

string s;
CCall.foo(s);

I personally think, barring this idea, the best path is simply to wrap C  
functions you want to call with toStringz'd versions.

-Steve

Jul 08 2011
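
A minimal sketch of the hand-written wrapper approach suggested above, assuming the C function only reads its argument. The C library's strlen stands in for "some C function", and the name cStrlen is hypothetical, chosen only for this illustration:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// D-facing wrapper: accepts any char slice and hands C a zero-terminated
// string. `cStrlen` is a hypothetical name used only for this example.
size_t cStrlen(const(char)[] s)
{
    // toStringz may allocate a copy; the cost is visible at this one place.
    return strlen(s.toStringz());
}

void main()
{
    char[] buf = "hello world".dup;
    // The slice buf[0 .. 5] is followed by ' ', not '\0', so passing
    // its .ptr straight to strlen would read too far.
    assert(cStrlen(buf[0 .. 5]) == 5);
    assert(cStrlen("abc") == 3);
}
```

The wrapper keeps the conversion explicit in one spot, so callers pass ordinary D strings and the possible allocation is not hidden inside the compiler.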

SimonM <user example.net> writes:

This is kind of off-topic, and I don't know if it's just me, but I've 
barely been able to use toStringz() where it's supposed to be useful:

I tried using it with a C function whose parameters are not 
const(char)*, but just char*, but because it returns immutable(char)'s I 
had to write my own one.

I tried using it with a C function that's unicode, but it won't take 
wstring's as arguments... so I had to write my own one.

Maybe it's because I'm not really experienced with interfacing to C code 
from D, or maybe it's because I couldn't write the extern(C) code myself 
as I'm using someone else's C interface, but out of the 3 times I tried 
using it in the last day, it only helped once.

On 2011/07/08 15:48 PM, Steven Schveighoffer wrote:
 I personally think, barring this idea, the best path is simply to wrap C
 functions you want to call with toStringz'd versions.

 -Steve

Jul 08 2011

Mike Parker <aldacron gmail.com> writes:

On 7/8/2011 11:03 PM, SimonM wrote:
 This is kind of off-topic, and I don't know if it's just me, but I've
 barely been able to use toStringz() where it's supposed to be useful:

 I tried using it with a C function whose parameters are not
 const(char)*, but just char*, but because it returns immutable(char)'s I
 had to write my own one.

someCFunc(cast(char*)myString.toStringz());

 I tried using it with a C function that's unicode, but it won't take
 wstring's as arguments... so I had to write my own one.

import std.utf;

some_wchar_func(myString.toUTF16z());

/* For non-const */
some_wchar_func_2(cast(wchar*)myString.toUTF16z());

Jul 08 2011
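
If the C function may actually write through its char* parameter, casting away immutable is risky, since the data may be shared or live in read-only memory. A rough sketch of an alternative, assuming a hypothetical helper named toMutableStringz (not part of Phobos) that returns a writable, zero-terminated copy:

```d
import core.stdc.string : strlen;

// Hypothetical helper (not a Phobos function): returns a mutable,
// zero-terminated copy that a C function may write to.
char* toMutableStringz(const(char)[] s)
{
    auto buf = new char[s.length + 1];
    buf[0 .. s.length] = s[];
    buf[s.length] = '\0';
    return buf.ptr;
}

void main()
{
    string s = "abc";
    char* p = toMutableStringz(s);
    p[0] = 'X';               // writes to the copy, not to s
    assert(strlen(p) == 3);
    assert(s == "abc");       // the original string is untouched
}
```

The copy is GC-allocated, so this assumes the C side does not keep the pointer after the call returns.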

Jonathan M Davis <jmdavisProg gmx.com> writes:

On 2011-07-08 07:03, SimonM wrote:
 This is kind of off-topic, and I don't know if it's just me, but I've
 barely been able to use toStringz() where it's supposed to be useful:
 
 I tried using it with a C function whose parameters are not
 const(char)*, but just char*, but because it returns immutable(char)'s I
 had to write my own one.
 
 I tried using it with a C function that's unicode, but it won't take
 wstring's as arguments... so I had to write my own one.
 
 Maybe it's because I'm not really experienced with interfacing to C code
 from D, or maybe it's because I couldn't write the extern(C) code myself
 as I'm using someone else's C interface, but out of the 3 times I tried
 using it in the last day, it only helped once.
 
 On 2011/07/08 15:48 PM, Steven Schveighoffer wrote:
 I personally think, barring this idea, the best path is simply to wrap C
 functions you want to call with toStringz'd versions.


https://github.com/D-Programming-Language/phobos/pull/123

- Jonathan M Davis

Jul 08 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 7/8/2011 4:53 AM, Regan Heath wrote:
 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright <newshound2 digitalmars.com>
 wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[] to an
 extern "C" function, where the parameter is defined as char*?

 Because char* in C does not necessarily mean "zero terminated string".

 Sure, but in many (most?) cases it does. And in those cases where it doesn't
 you could argue ubyte* or byte* should have been used in the D extern "C"
 declaration instead. Plus, in those cases, worst case scenario, D passes an
 extra \0 byte to those functions which either ignore it because they were also
 passed a length, or expect a fixed sized structure, or .. I don't know what as
 I can't imagine another case where char* would be used without it being a "zero
 terminated string", or passing/knowing the length ahead of time.

In the worst case, you're adding an extra memory allocation and function call 
overhead (that is hidden to the user, and not turn-off-able). This is not 
acceptable when interfacing to C.


 D is already allocating an extra \0 byte for string constants right?

Yes, but in a way that is essentially free.

Jul 08 2011

"Regan Heath" <regan netmail.co.nz> writes:

On Fri, 08 Jul 2011 18:59:47 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 7/8/2011 4:53 AM, Regan Heath wrote:
 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
 <newshound2 digitalmars.com>
 wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[]  
 to an extern
 "C" function, where the parameter is defined as char*?

 Because char* in C does not necessarily mean "zero terminated string".

 Sure, but in many (most?) cases it does. And in those cases where it  
 doesn't you
 could argue ubyte* or byte* should have been used in the D extern "C"
 declaration instead. Plus, in those cases, worst case scenario, D  
 passes an
 extra \0 byte to those functions which either ignore it because they  
 were also
 passed a length, or expect a fixed sized structure, or .. I don't know  
 what as I
 can't imagine another case where char* would be used without it being a  
 "zero
 terminated string", or passing/knowing the length ahead of time.

 In the worst case, you're adding an extra memory allocation and function  
 call overhead (that is hidden to the user, and not turn-off-able). This  
 is not acceptable when interfacing to C.

This worst case only happens when:
1. The extern "C" function takes a char* and is NOT expecting a "zero  
terminated string".
2. The char[], string, etc being passed is a fixed length array, or a  
slice which has no available space left for the \0.

So, it's rare.  I would guess less than 1% of cases for general  
programming.

And, it *is* turn-off-able.  You simply change the extern "C" to use  
ubyte*, byte*, or void* (instead of char*).  This is arguably a better  
definition for this sort of function in the first place.

 D is already allocating an extra \0 byte for string constants right?

 Yes, but in a way that is essentially free.

Yep, this is essentially free, and calling toStringz automatically would  
be almost as free, for 99% of cases.  Plus it would "just work" which is a  
big deal when you're talking about first impressions etc.

Jul 12 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 12 Jul 2011 09:54:15 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Fri, 08 Jul 2011 18:59:47 +0100, Walter Bright  
 <newshound2 digitalmars.com> wrote:

 On 7/8/2011 4:53 AM, Regan Heath wrote:
 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
 <newshound2 digitalmars.com>
 wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[]  
 to an extern
 "C" function, where the parameter is defined as char*?

 Because char* in C does not necessarily mean "zero terminated string".

 Sure, but in many (most?) cases it does. And in those cases where it  
 doesn't you
 could argue ubyte* or byte* should have been used in the D extern "C"
 declaration instead. Plus, in those cases, worst case scenario, D  
 passes an
 extra \0 byte to those functions which either ignore it because they  
 were also
 passed a length, or expect a fixed sized structure, or .. I don't know  
 what as I
 can't imagine another case where char* would be used without it being  
 a "zero
 terminated string", or passing/knowing the length ahead of time.

 In the worst case, you're adding an extra memory allocation and  
 function call overhead (that is hidden to the user, and not  
 turn-off-able). This is not acceptable when interfacing to C.

 This worst case only happens when:
 1. The extern "C" function takes a char* and is NOT expecting a "zero  
 terminated string".
 2. The char[], string, etc being passed is a fixed length array, or a  
 slice which has no available space left for the \0.

 So, it's rare.  I would guess a less than 1% of cases for general  
 programming.

What if you expect the function to write to the buffer, and  
the compiler just made a copy of it?  Won't that be pretty surprising?

-Steve

Jul 12 2011

"Regan Heath" <regan netmail.co.nz> writes:

On Tue, 12 Jul 2011 15:18:04 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 09:54:15 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Fri, 08 Jul 2011 18:59:47 +0100, Walter Bright  
 <newshound2 digitalmars.com> wrote:

 On 7/8/2011 4:53 AM, Regan Heath wrote:
 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
 <newshound2 digitalmars.com>
 wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or char[]  
 to an extern
 "C" function, where the parameter is defined as char*?

 Because char* in C does not necessarily mean "zero terminated  
 string".

 Sure, but in many (most?) cases it does. And in those cases where it  
 doesn't you
 could argue ubyte* or byte* should have been used in the D extern "C"
 declaration instead. Plus, in those cases, worst case scenario, D  
 passes an
 extra \0 byte to those functions which either ignore it because they  
 were also
 passed a length, or expect a fixed sized structure, or .. I don't  
 know what as I
 can't imagine another case where char* would be used without it being  
 a "zero
 terminated string", or passing/knowing the length ahead of time.

 In the worst case, you're adding an extra memory allocation and  
 function call overhead (that is hidden to the user, and not  
 turn-off-able). This is not acceptable when interfacing to C.

 This worst case only happens when:
 1. The extern "C" function takes a char* and is NOT expecting a "zero  
 terminated string".
 2. The char[], string, etc being passed is a fixed length array, or a  
 slice which has no available space left for the \0.

 So, it's rare.  I would guess a less than 1% of cases for general  
 programming.

 What if you expect the function is expecting to write to the buffer, and  
 the compiler just made a copy of it?  Won't that be pretty surprising?

Assuming a C function in this form:

   void write_to_buffer(char *buffer, int length);

You might initially extern it as:

   extern (C) void write_to_buffer(char* buffer, int length);

And, you could call it one of 2 ways (legitimately):

   char[] foo = new char[100];
   write_to_buffer(foo, foo.length);

or:

   char[100] foo;
   write_to_buffer(foo, foo.length);

and in both cases, toStringz would do nothing as foo is zero terminated  
already (in both cases), or am I wrong about that?

Jul 12 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Tue, 12 Jul 2011 15:18:04 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 09:54:15 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Fri, 08 Jul 2011 18:59:47 +0100, Walter Bright  
 <newshound2 digitalmars.com> wrote:

 On 7/8/2011 4:53 AM, Regan Heath wrote:
 On Fri, 08 Jul 2011 10:49:08 +0100, Walter Bright  
 <newshound2 digitalmars.com>
 wrote:

 On 7/8/2011 2:26 AM, Regan Heath wrote:
 Why can't we have the
 compiler call it automatically whenever we pass a string, or  
 char[] to an extern
 "C" function, where the parameter is defined as char*?

 Because char* in C does not necessarily mean "zero terminated  
 string".

 Sure, but in many (most?) cases it does. And in those cases where it  
 doesn't you
 could argue ubyte* or byte* should have been used in the D extern "C"
 declaration instead. Plus, in those cases, worst case scenario, D  
 passes an
 extra \0 byte to those functions which either ignore it because they  
 were also
 passed a length, or expect a fixed sized structure, or .. I don't  
 know what as I
 can't imagine another case where char* would be used without it  
 being a "zero
 terminated string", or passing/knowing the length ahead of time.

 In the worst case, you're adding an extra memory allocation and  
 function call overhead (that is hidden to the user, and not  
 turn-off-able). This is not acceptable when interfacing to C.

 This worst case only happens when:
 1. The extern "C" function takes a char* and is NOT expecting a "zero  
 terminated string".
 2. The char[], string, etc being passed is a fixed length array, or a  
 slice which has no available space left for the \0.

 So, it's rare.  I would guess a less than 1% of cases for general  
 programming.

 What if you expect the function is expecting to write to the buffer,  
 and the compiler just made a copy of it?  Won't that be pretty  
 surprising?

 Assuming a C function in this form:

    void write_to_buffer(char *buffer, int length);

No, assuming a C function in this form:

void ucase(char* str);

Essentially, a C function which takes a writable already-null-terminated  
string, and writes to it.

 You might initially extern it as:

    extern (C) void write_to_buffer(char* buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero terminated  
 already (in both cases), or am I wrong about that?

In neither case are they required to be null terminated.  The only thing  
that guarantees null termination is a string literal.  Even "abc".dup is  
not going to be guaranteed to be null terminated.  For an actual example,  
try "012345678901234".dup.  This should have a 0x0f right after the last  
character.

-Steve

Jul 12 2011
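
A small sketch for checking the claim above on your own setup. It assumes a hypothetical helper byteAfter that peeks at the byte just past a slice, which is only meaningful when that byte lies inside the same allocation. String literals are documented to carry a trailing '\0'; what follows a .dup'd copy is a runtime implementation detail, so the second value is printed rather than asserted:

```d
import std.stdio : writefln;

// Hypothetical helper: reads the byte just past the end of a slice.
// Only meaningful when that byte lies inside the same allocation.
ubyte byteAfter(const(char)[] s)
{
    return cast(ubyte) s.ptr[s.length];
}

void main()
{
    // Literals are stored with a trailing '\0' that is not counted in .length.
    assert(byteAfter("abc") == 0);

    // A 15-char heap copy: the next byte is whatever the runtime keeps there.
    char[] copy = "012345678901234".dup;
    writefln("byte after the dup'd array: 0x%02x", byteAfter(copy));
}
```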

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 12 Jul 2011 10:59:58 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 and in both cases, toStringz would do nothing as foo is zero terminated  
 already (in both cases), or am I wrong about that?

 In neither case are they required to be null terminated.  The only thing  
 that guarantees null termination is a string literal.  Even "abc".dup is  
 not going to be guaranteed to be null terminated.  For an actual  
 example, try "012345678901234".dup.  This should have a 0x0f right after  
 the last character.

And, actually, the penalty of checking whether you are going to segfault  
(i.e. checking if the ptr is into heap data, and then getting the length)  
is quite high.  You must take the GC lock.

-Steve

Jul 12 2011

"Regan Heath" <regan netmail.co.nz> writes:

On Tue, 12 Jul 2011 16:04:15 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:59:58 -0400, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 and in both cases, toStringz would do nothing as foo is zero  
 terminated already (in both cases), or am I wrong about that?

 In neither case are they required to be null terminated.  The only  
 thing that guarantees null termination is a string literal.  Even  
 "abc".dup is not going to be guaranteed to be null terminated.  For an  
 actual example, try "012345678901234".dup.  This should have a 0x0f  
 right after the last character.

 And, actually, the cost penalty of checking if you are going to segfault  
 (i.e. checking if the ptr is into heap data, and then getting the  
 length) is quite costly.  You must take the GC lock.

I wouldn't know anything about this.  I was assuming when toStringz was  
called on a slice it would use the array capacity and length to figure out  
where the \0 needed to be, and do as little work as possible to achieve  
it.  Meaning in most cases that \0 is written to 1 past the length, inside  
already allocated capacity.

Jul 12 2011

"Regan Heath" <regan netmail.co.nz> writes:

On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 What if you expect the function is expecting to write to the buffer,  
 and the compiler just made a copy of it?  Won't that be pretty  
 surprising?

 Assuming a C function in this form:

    void write_to_buffer(char *buffer, int length);

 No, assuming C function in this form:

 void ucase(char* str);

 Essentially, a C function which takes a writable already-null-terminated  
 string, and writes to it.

Ok, that's an even better example for my case.

It would be used/called like...

   char[] foo;
   .. code which populates foo with something ..
   ucase(foo);

and in D today this would corrupt memory.  Unless the programmer  
remembered to write:

   ucase(toStringz(foo));

So, +1 for compiler called toStringz.

I am assuming also that if this idea were implemented it would handle  
things intelligently.  For example, if the underlying array is out of room  
when toStringz is called and needs to be reallocated, the compiler would  
update the slice/reference 'foo' in the same way as it already does for an  
append which triggers a reallocation.

 You might initially extern it as:

    extern (C) void write_to_buffer(char* buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero terminated  
 already (in both cases), or am I wrong about that?

 In neither case are they required to be null terminated.

True, but I was outlining the worst case scenario for my suggestion, not  
describing the real C function requirements.

In this particular case the extern "C" declaration (IMO) for this style of  
function should be one of:

   extern (C) void write_to_buffer(ubyte* buffer, int length);
   extern (C) void write_to_buffer(byte* buffer, int length);
   extern (C) void write_to_buffer(void* buffer, int length);

which would all be ignored by my suggestion.

 The only thing that guarantees null termination is a string literal.

string literals /and/ calling toStringz.

 Even "abc".dup is not going to be guaranteed to be null terminated.  For  
 an actual example, try "012345678901234".dup.  This should have a 0x0f  
 right after the last character.

Why 0x0f?  Does the allocator initialise array memory to its offset from  
the start of the block or something?

I have just realised that char is initialised to 0xFF.  That is a problem  
as my two examples above would be arrays full of 0xFF, not \0.. meaning  
toStringz would have to reallocate to append \0 to them, drat.  That is  
yet another reason to use ubyte or byte when interfacing with C.

Ok, how about going the other way.  Can we have something to decorate  
extern "C" function parameters to trigger an implicit call of toStringz on  
them?

Jul 12 2011
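
The point about default initialisation above can be checked with a short sketch. The helper name zerosInDefault is hypothetical; it counts how many elements of a default-initialised char[N] are '\0':

```d
// Hypothetical helper: counts the '\0' elements in a default-initialised
// static char array of length N.
size_t zerosInDefault(size_t N)()
{
    char[N] buf;               // every element starts as char.init
    size_t n;
    foreach (c; buf)
        if (c == '\0')
            ++n;
    return n;
}

void main()
{
    assert(char.init == 0xFF); // an invalid UTF-8 code unit, not '\0'
    assert(ubyte.init == 0);   // ubyte, by contrast, starts at zero
    assert(zerosInDefault!100() == 0);

    auto heap = new char[100]; // heap arrays get the same initial value
    assert(heap[0] == 0xFF);
}
```

So neither char[100] foo nor new char[100] contains a terminator anywhere until something writes one.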

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 What if you expect the function is expecting to write to the buffer,  
 and the compiler just made a copy of it?  Won't that be pretty  
 surprising?

 Assuming a C function in this form:

    void write_to_buffer(char *buffer, int length);

 No, assuming C function in this form:

 void ucase(char* str);

 Essentially, a C function which takes a writable  
 already-null-terminated string, and writes to it.

 Ok, that's an even better example for my case.

 It would be used/called like...

    char[] foo;
    .. code which populates foo with something ..
    ucase(foo);

 and in D today this would corrupt memory.  Unless the programmer  
 remembered to write:

No, it wouldn't compile.  char[] does not cast implicitly to char *.  (if  
it does, that needs to change).

 I am assuming also that if this idea were implemented it would handle  
 things intelligently, like for example if when toStringz is called the  
 underlying array is out of room and needs to be reallocated, the  
 compiler would update the slice/reference 'foo' in the same way as it  
 already does for an append which triggers a reallocation.

OK, but what if it's like this:

char[] foo = new char[100];
auto bar = foo;

ucase(foo);

In most cases, bar is also written to, but in some cases only foo is  
written to.

Granted, we're getting further out on the hypothetical limb here :)  But  
my point is, making it require explicit calling of toStringz instead of  
implicit makes the code less confusing, because you understand "oh,  
toStringz may reallocate, so I can't expect bar to also get updated" vs.  
simply calling a function with a buffer.

 You might initially extern it as:

    extern (C) void write_to_buffer(char* buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero  
 terminated already (in both cases), or am I wrong about that?

 In neither case are they required to be null terminated.

 True, but I was outlining the worst case scenario for my suggestion, not  
 describing the real C function requirements.

No, I mean you were wrong, D does not guarantee either of those (stack  
allocated or heap allocated) is null terminated.  So toStringz must add a  
'\0' at the end (which is mildly expensive for heap data, and very  
expensive for stack data).

 The only thing that guarantees null termination is a string literal.

 string literals /and/ calling toStringz.

 Even "abc".dup is not going to be guaranteed to be null terminated.   
 For an actual example, try "012345678901234".dup.  This should have a  
 0x0f right after the last character.

 Why 0x0f?  Does the allocator initialise array memory to its offset  
 from the start of the block or something?

The final byte of the block is used as the hidden array length (in this  
case 15).

-Steve

Jul 12 2011

"Regan Heath" <regan netmail.co.nz> writes:

On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 What if you expect the function is expecting to write to the buffer,  
 and the compiler just made a copy of it?  Won't that be pretty  
 surprising?

 Assuming a C function in this form:

    void write_to_buffer(char *buffer, int length);

 No, assuming C function in this form:

 void ucase(char* str);

 Essentially, a C function which takes a writable  
 already-null-terminated string, and writes to it.

 Ok, that's an even better example for my case.

 It would be used/called like...

    char[] foo;
    .. code which populates foo with something ..
    ucase(foo);

 and in D today this would corrupt memory.  Unless the programmer  
 remembered to write:

 No, it wouldn't compile.  char[] does not cast implicitly to char *.   
 (if it does, that needs to change).

Replace foo with foo.ptr, it makes no difference to the point I was making.

 I am assuming also that if this idea were implemented it would handle  
 things intelligently, like for example if when toStringz is called the  
 underlying array is out of room and needs to be reallocated, the  
 compiler would update the slice/reference 'foo' in the same way as it  
 already does for an append which triggers a reallocation.

 OK, but what if it's like this:

 char[] foo = new char[100];
 auto bar = foo;

 ucase(foo);

 In most cases, bar is also written to, but in some cases only foo is  
 written to.

 Granted, we're getting further out on the hypothetical limb here :)  But  
 my point is, making it require explicit calling of toStringz instead of  
 implicit makes the code less confusing, because you understand "oh,  
 toStringz may reallocate, so I can't expect bar to also get updated" vs.  
 simply calling a function with a buffer.

This is not a 'new' problem introduced the idea, it's a general problem  
for D/arrays/slices and the same happens with an append, right?  In which  
case it's not a reason against the idea.

 You might initially extern it as:

    extern "C" void write_to_buffer(char *buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero  
 terminated already (in both cases), or am I wrong about that?

 In neither case are they required to be null terminated.

 True, but I was outlining the worst case scenario for my suggestion,  
 not describing the real C function requirements.

 No, I mean you were wrong, D does not guarantee either of those (stack  
 allocated or heap allocated) is null terminated.  So toStringz must add  
 a '\0' at the end (which is mildly expensive for heap data, and very  
 expensive for stack data).

Ah, ok, this was because I had forgotten char is initialised to 0xFF.  If  
it was initialised to \0 then both arrays would have been full of null  
terminators.  The default value of char is the killing blow to the idea.
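
A small sketch of that point, assuming nothing beyond the language itself: a default-initialised char buffer contains no zero bytes at all, because `char.init` is 0xFF. The function name `zerosInFresh` is invented for the example.

```d
// Counts the zero bytes in a default-initialised heap buffer of n chars.
size_t zerosInFresh(size_t n)
{
    auto buf = new char[n];      // every element starts as char.init
    size_t zeros = 0;
    foreach (char c; buf)
        if (c == '\0')
            ++zeros;
    return zeros;
}

void main()
{
    static assert(char.init == 0xFF);
    assert(zerosInFresh(100) == 0);   // no accidental terminators anywhere
}
```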

 The only thing that guarantees null termination is a string literal.

 string literals /and/ calling toStringz.

 Even "abc".dup is not going to be guaranteed to be null terminated.   
 For an actual example, try "012345678901234".dup.  This should have a  
 0x0f right after the last character.

 Why 0x0f?  Does the allocator initialise array memory to its offset  
 from the start of the block or something?

 The final byte of the block is used as the hidden array length (in this  
 case 15).

Good to know.

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/

Jul 12 2011

"Regan Heath" <regan netmail.co.nz> writes:

Gah.. bad grammar.. 1/2 baked sentences..

On Tue, 12 Jul 2011 18:00:41 +0100, Regan Heath <regan netmail.co.nz>  
wrote:
 On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 No, it wouldn't compile.  char[] does not cast implicitly to char *.   
 (if it does, that needs to change).

 Replace foo with foo.ptr, it makes no difference to the point I was  
 making.

Which was that a new D user would pass foo.ptr rather than go looking for,  
and finding, toStringz.  We've had a number of such cases on the learn  
newsgroup in the past.

 OK, but what if it's like this:

 char[] foo = new char[100];
 auto bar = foo;

 ucase(foo);

 In most cases, bar is also written to, but in some cases only foo is  
 written to.

 Granted, we're getting further out on the hypothetical limb here :)   
 But my point is, making it require explicit calling of toStringz  
 instead of implicit makes the code less confusing, because you  
 understand "oh, toStringz may reallocate, so I can't expect bar to also  
 get updated" vs. simply calling a function with a buffer.

 This is not a 'new' problem introduced the idea, it's a general problem

-->                                     ^by
 for D/arrays/slices and the same happens with an append, right?  In  
 which case it's not a reason against the idea.

Jul 12 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 What if you expect the function is expecting to write to the  
 buffer, and the compiler just made a copy of it?  Won't that be  
 pretty surprising?

 Assuming a C function in this form:

    void write_to_buffer(char *buffer, int length);

 No, assuming C function in this form:

 void ucase(char* str);

 Essentially, a C function which takes a writable  
 already-null-terminated string, and writes to it.

 Ok, that's an even better example for my case.

 It would be used/called like...

    char[] foo;
    .. code which populates foo with something ..
    ucase(foo);

 and in D today this would corrupt memory.  Unless the programmer  
 remembered to write:

 No, it wouldn't compile.  char[] does not cast implicitly to char *.   
 (if it does, that needs to change).

 Replace foo with foo.ptr, it makes no difference to the point I was  
 making.

Your fix does not help in that case, foo.ptr will be passed as a  
non-null-terminated string.

So, your proposal fixes the case:

1. The user tries to pass a string/char[] to a C function.  Fails to  
compile.
2. Instead of trying to understand the issue, realizes the .ptr member is  
the right type, and switches to that.

It does not fix or help with cases where:

  * a programmer notices the type of the parameter is char * and uses  
foo.ptr without trying foo first. (crash)
  * a programmer calls toStringz without going through the compile/fix  
cycle above.
  * a programmer tries to pass string/char[], fails to compile, then looks  
up how to interface with C and finds toStringz

I think this fix really doesn't solve a very common problem.
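
The cases above depend on what the compiler accepts today. As a rough sketch (the enum and variable names are just for illustration): neither char[] nor string converts implicitly to a pointer, a string literal does, and .ptr always compiles but promises nothing about a terminator.

```d
// Compile-time checks of the implicit conversions discussed above.
enum sliceToPointer  = is(char[] : char*);          // slice -> pointer
enum stringToPointer = is(string : const(char)*);   // string -> C pointer
static assert(!sliceToPointer);
static assert(!stringToPointer);

void main()
{
    const(char)* lit = "abc";   // a literal converts and is zero-terminated
    char[] foo = "abc".dup;
    char* raw = foo.ptr;        // compiles, but no '\0' is guaranteed after foo
    assert(lit[3] == '\0');     // the literal's terminator
}
```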

 I am assuming also that if this idea were implemented it would handle  
 things intelligently, like for example if when toStringz is called the  
 underlying array is out of room and needs to be reallocated, the  
 compiler would update the slice/reference 'foo' in the same way as it  
 already does for an append which triggers a reallocation.

 OK, but what if it's like this:

 char[] foo = new char[100];
 auto bar = foo;

 ucase(foo);

 In most cases, bar is also written to, but in some cases only foo is  
 written to.

 Granted, we're getting further out on the hypothetical limb here :)   
 But my point is, making it require explicit calling of toStringz  
 instead of implicit makes the code less confusing, because you  
 understand "oh, toStringz may reallocate, so I can't expect bar to also  
 get updated" vs. simply calling a function with a buffer.

 This is not a 'new' problem introduced the idea, it's a general problem  
 for D/arrays/slices and the same happens with an append, right?  In  
 which case it's not a reason against the idea.

It's new to the features of the C function being called.  If you look up  
the man page for such a hypothetical function, it might claim that it  
alters the data passed in through the argument, but it seems to not be the  
case!  So there's no way for someone (who arguably is not well versed in C  
functions if they didn't know to use toStringz) to figure out why the code  
seems not to do what it says it should.  Such a programmer may blame  
either the implementation of the C function, or blame the D compiler for  
not calling the function properly.
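
To make the foo/bar scenario concrete, here is a hedged sketch. The append stands in for whatever an implicit termination step would do, and the write to foo[0] stands in for the C function. `Outcome` and `appendThenWrite` are made-up names. Whether the append moves the data depends on the runtime, which is exactly the surprise being described, so the program prints the result rather than assuming one.

```d
import std.stdio : writeln;

struct Outcome { bool sameBlock; bool barUpdated; }

Outcome appendThenWrite()
{
    char[] foo = new char[100];
    foo[] = 'a';
    char[] bar = foo;      // second reference to the same data
    foo ~= '\0';           // extends in place, or reallocates and moves foo
    foo[0] = 'X';          // the kind of write the C function would perform
    return Outcome(foo.ptr is bar.ptr, bar[0] == 'X');
}

void main()
{
    auto r = appendThenWrite();
    // bar only sees the write when the append did not move foo
    writeln("same block: ", r.sameBlock, ", bar updated: ", r.barUpdated);
}
```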

 You might initially extern it as:

    extern "C" void write_to_buffer(char *buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero  
 terminated already (in both cases), or am I wrong about that?

 In neither case are they required to be null terminated.

 True, but I was outlining the worst case scenario for my suggestion,  
 not describing the real C function requirements.

 No, I mean you were wrong, D does not guarantee either of those (stack  
 allocated or heap allocated) is null terminated.  So toStringz must add  
 a '\0' at the end (which is mildly expensive for heap data, and very  
 expensive for stack data).

 Ah, ok, this was because I had forgotten char is initialised to 0xFF.   
 If it was initialised to \0 then both arrays would have been full of  
 null terminators.  The default value of char is the killing blow to the  
 idea.

toStringz does not currently check for '\0' anywhere in the existing  
string.  It simply appends '\0' to the end of the passed string.  If you  
want it to check for '\0', how far should it go?  Doesn't this also add to  
the overhead (looping over all chars looking for '\0')?

Note also, that toStringz has old code that used to check for "one byte  
beyond" the array, but this is commented out, because it's unreliable  
(could cause a segfault).
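
For reference, a minimal sketch of the explicit route as it stands: `std.string.toStringz` hands back a zero-terminated pointer, which can then go to a C function such as `strlen`. The wrapper name `cLength` is an assumption for the example.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Length as seen by C, after explicit termination with toStringz.
size_t cLength(const(char)[] s)
{
    return strlen(toStringz(s));
}

void main()
{
    char[] foo = "hello".dup;        // heap copy, no guaranteed terminator
    assert(cLength(foo) == foo.length);
}
```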

 The only thing that guarantees null termination is a string literal.

 string literals /and/ calling toStringz.

 Even "abc".dup is not going to be guaranteed to be null terminated.   
 For an actual example, try "012345678901234".dup.  This should have a  
 0x0f right after the last character.

 Why 0x0f?  Does the allocator initialise array memory to its offset  
 from the start of the block or something?

 The final byte of the block is used as the hidden array length (in this  
 case 15).

 Good to know.

Just for history trivia, it used to be there as an unallocated byte.   
Which means it likely had random data in it.  It was there to prevent  
cross-block pointers.  If the byte was part of the array, then it would be  
possible to do:

arr1 = arr[$..$];

and now, arr1 points at the *next* block!

arr1 ~= 5;

and now, arr1 may have stomped over possibly unallocated data, or possibly  
some already allocated data!

So it was a nice bonus that the byte I commandeered for storing the array  
length was already unused :)

-Steve

Jul 12 2011

"Regan Heath" <regan netmail.co.nz> writes:

Ok, it's clear there has been some confusion over what exactly I am  
suggesting.

I am not suggesting the compiler simply insert calls to the existing  
toStringz function as it appears the function does not, or cannot do what  
I am imagining.

I am suggesting the compiler will perform a special operation on all char*  
parameters passed to extern "C" functions.

The operation is a toStringz like operation which is (more or less) as  
follows:

1. If there is a \0 character inside foo[0..$], do nothing.
2. If the array allocated memory is > the array length, place a \0 at  
foo[$]
3. Reallocate the array memory, updating foo, place a \0 at foo[$]
4. Call the C function passing foo.ptr
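
The steps above could be sketched as an ordinary library function rather than compiler magic. This is only an illustration under stated assumptions: `zeroTerminate` is a made-up name, and the append operator is used to cover steps 2 and 3, since it already extends in place when there is room and reallocates otherwise.

```d
// Hypothetical helper following the four steps listed above.
char* zeroTerminate(ref char[] foo)
{
    // Step 1: linear scan for an existing terminator.
    foreach (char c; foo)
        if (c == '\0')
            return foo.ptr;

    // Steps 2 and 3: the append either uses spare capacity or
    // reallocates, and in both cases foo is updated.
    immutable len = foo.length;
    foo ~= '\0';
    foo = foo[0 .. len];   // keep the logical length; '\0' sits just past the end

    // Step 4: the caller passes this pointer to the C function.
    return foo.ptr;
}

void main()
{
    char[] s = "abc".dup;
    char* p = zeroTerminate(s);
    assert(s == "abc");
    assert(p[3] == '\0');
}
```

Note that step 1 costs a full pass over the data on every call, whether or not the caller already knows the answer.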

So, it will handle all the following cases:

char[] foo;
.. code to populate foo ..

ucase(foo);
ucase(foo.ptr);
ucase(toStringz(foo));

The problem cases are the buffer cases I mentioned earlier, and they  
wouldn't be a problem if char was initialised to \0 as I first imagined.

Other replies inline below..

On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:
 On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath  
 <regan netmail.co.nz> wrote:

 What if you expect the function is expecting to write to the  
 buffer, and the compiler just made a copy of it?  Won't that be  
 pretty surprising?

 Assuming a C function in this form:

    void write_to_buffer(char *buffer, int length);

 No, assuming C function in this form:

 void ucase(char* str);

 Essentially, a C function which takes a writable  
 already-null-terminated string, and writes to it.

 Ok, that's an even better example for my case.

 It would be used/called like...

    char[] foo;
    .. code which populates foo with something ..
    ucase(foo);

 and in D today this would corrupt memory.  Unless the programmer  
 remembered to write:

 No, it wouldn't compile.  char[] does not cast implicitly to char *.   
 (if it does, that needs to change).

 Replace foo with foo.ptr, it makes no difference to the point I was  
 making.

 Your fix does not help in that case, foo.ptr will be passed as a  
 non-null-terminated string.

No, see above.

 So, your proposal fixes the case:

 1. The user tries to pass a string/char[] to a C function.  Fails to  
 compile.
 2. Instead of trying to understand the issue, realizes the .ptr member  
 is the right type, and switches to that.

 It does not fix or help with cases where:

   * a programmer notices the type of the parameter is char * and uses  
 foo.ptr without trying foo first. (crash)
   * a programmer calls toStringz without going through the compile/fix  
 cycle above.
   * a programmer tries to pass string/char[], fails to compile, then  
 looks up how to interface with C and finds toStringz

 I think this fix really doesn't solve a very common problem.

See above, my intention was to solve all the cases listed here as I  
suspect the compiler can detect them all, and just 'do the right thing'.

In these cases..

1. If the programmer writes foo.ptr, the compiler detects that, calls  
toStringz on 'foo' (not foo.ptr) and updates foo as required (if  
reallocation occurs).

toStringz returns foo.ptr (I assume).
3. If the programmer passes 'foo', the compiler calls toStringz etc.

 I am assuming also that if this idea were implemented it would handle  
 things intelligently, like for example if when toStringz is called  
 the underlying array is out of room and needs to be reallocated, the  
 compiler would update the slice/reference 'foo' in the same way as it  
 already does for an append which triggers a reallocation.

 OK, but what if it's like this:

 char[] foo = new char[100];
 auto bar = foo;

 ucase(foo);

 In most cases, bar is also written to, but in some cases only foo is  
 written to.

 Granted, we're getting further out on the hypothetical limb here :)   
 But my point is, making it require explicit calling of toStringz  
 instead of implicit makes the code less confusing, because you  
 understand "oh, toStringz may reallocate, so I can't expect bar to  
 also get updated" vs. simply calling a function with a buffer.

 This is not a 'new' problem introduced the idea, it's a general problem  
 for D/arrays/slices and the same happens with an append, right?  In  
 which case it's not a reason against the idea.

 It's new to the features of the C function being called.  If you look up  
 the man page for such a hypothetical function, it might claim that it  
 alters the data passed in through the argument, but it seems to not be  
 the case!  So there's no way for someone (who arguably is not well  
 versed in C functions if they didn't know to use toStringz) to figure  
 out why the code seems not to do what it says it should.  Such a  
 programmer may blame either the implementation of the C function, or  
 blame the D compiler for not calling the function properly.

None of this is relevant, let me explain..

My idea is for the compiler to detect a char* parameter to an extern "C"  
function and to call toStringz.  When it does so it will correctly update  
the slice/array being passed if reallocation occurs.  The C function will  
write to the slice/array being passed.  So, it's not relevant if there was  
another slice referencing the array before it was reallocated, because  
that case is no different to calling a D function which does something  
similar, like appending to the passed slice/array.

In short, the end result will ALWAYS be that the passed slice/array will  
contain the output of the C function.

The goal is to make a call to an extern "C" function "just work" in the  

has it's own string type.

 You might initially extern it as:

    extern "C" void write_to_buffer(char *buffer, int length);

 And, you could call it one of 2 ways (legitimately):

    char[] foo = new char[100];
    write_to_buffer(foo, foo.length);

 or:

    char[100] foo;
    write_to_buffer(foo, foo.length);

 and in both cases, toStringz would do nothing as foo is zero  
 terminated already (in both cases), or am I wrong about that?

 In neither case are they required to be null terminated.

 True, but I was outlining the worst case scenario for my suggestion,  
 not describing the real C function requirements.

 No, I mean you were wrong, D does not guarantee either of those (stack  
 allocated or heap allocated) is null terminated.  So toStringz must  
 add a '\0' at the end (which is mildly expensive for heap data, and  
 very expensive for stack data).

 Ah, ok, this was because I had forgotten char is initialised to 0xFF.   
 If it was initialised to \0 then both arrays would have been full of  
 null terminators.  The default value of char is the killing blow to the  
 idea.

 toStringz does not currently check for '\0' anywhere in the existing  
 string.  It simply appends '\0' to the end of the passed string.  If you  
 want it to check for '\0', how far should it go?  Doesn't this also add  
 to the overhead (looping over all chars looking for '\0')?

 Note also, that toStringz has old code that used to check for "one byte  
 beyond" the array, but this is commented out, because it's unreliable  
 (could cause a segfault).

So, toStringz is not as clever as I imagined.  I thought it would  
intelligently detect cases where a \0 was already present in the slice  
(from 0 to $) and if not, put one at $+1 (inside pre-allocated array  
memory).  I was assuming toStringz had access to the underlying array  
allocation size and would know how far it can 'look' without causing a  
segfault.  In the case where the slice length equaled the array reserved  
memory area, it would re-allocate and place the \0 at $+1 (inside the  
newly allocated memory).


Jul 13 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 I am suggesting the compiler will perform a special operation on all  
 char* parameters passed to extern "C" functions.

 The operation is a toStringz like operation which is (more or less) as  
 follows:

 1. If there is a \0 character inside foo[0..$], do nothing.

This is an O(n) operation -- too much overhead.  Especially if you already  
know foo has a 0 in it.  Note that toStringz does not have this overhead.

 2. If the array allocated memory is > the array length, place a \0 at  
 foo[$]

The check to see if the array has allocated length requires a GC lock, and  
an O(log n) search for the block info in the GC.

Not that it doesn't already happen in toStringz, but I just want to point  
out that it's not a small cost.

 3. Reallocate the array memory, updating foo, place a \0 at foo[$]
 4. Call the C function passing foo.ptr

 So, it will handle all the following cases:

 char[] foo;
 .. code to populate foo ..

 ucase(foo);
 ucase(foo.ptr);

I read in your responses below, this is due to you making this equivalent  
to ucase(foo)?  This still has the same problems I listed above.

What about

char * foo;
.. code to populate foo ..
ucase(foo);

Is there still anything special done by the compiler?

 ucase(toStringz(foo));

 The problem cases are the buffer cases I mentioned earlier, and they  
 wouldn't be a problem if char was initialised to \0 as I first imagined.

The largest problem I've had with all this is there is a necessary  
overhead of conversion.  Not only that, but due to the way reallocation  
works, there may be a move of data.  I think it's better to require  
explicit calls incurring such overhead vs. hiding the overhead calls from  
the developer.  Especially if the overhead calls are unnecessary.

 Other replies inline below..

 On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:
 Replace foo with foo.ptr, it makes no difference to the point I was  
 making.

 Your fix does not help in that case, foo.ptr will be passed as a  
 non-null-terminated string.

 No, see above.

How does your proposal know that a char * is part of a heap-allocated  
array?  If you are assuming the only case where char * is passed will be  
arr.ptr, then that doesn't cut it.  What if the compiler doesn't know  
where the char * came from?

The inherent problem of zero-terminated strings is that you don't know how  
long it is until you search for a zero.  If it's not properly terminated,  
then you are screwed.  That problem cannot be "solved", even with compiler  
help -- you can get situations where there is no more information other  
than the pointer.
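
A short sketch of that limitation: given only a pointer, the length has to be recovered by scanning for the zero, which is what `std.string.fromStringz` does. The function name `scannedLength` is made up for the example, and the scan is only safe because the literal passed in is known to be terminated.

```d
import std.string : fromStringz;

// The only way to learn the length from a bare pointer is to scan for '\0'.
size_t scannedLength(const(char)* p)
{
    return fromStringz(p).length;
}

void main()
{
    // String literals are zero-terminated, so this scan stays in bounds.
    assert(scannedLength("abc") == 3);
}
```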

 So, your proposal fixes the case:

 1. The user tries to pass a string/char[] to a C function.  Fails to  
 compile.
 2. Instead of trying to understand the issue, realizes the .ptr member  
 is the right type, and switches to that.

 It does not fix or help with cases where:

   * a programmer notices the type of the parameter is char * and uses  
 foo.ptr without trying foo first. (crash)
   * a programmer calls toStringz without going through the compile/fix  
 cycle above.
   * a programmer tries to pass string/char[], fails to compile, then  
 looks up how to interface with C and finds toStringz

 I think this fix really doesn't solve a very common problem.

 See above, my intention was to solve all the cases listed here as I  
 suspect the compiler can detect them all, and just 'do the right thing'.

 In these cases..

 1. If the programmer writes foo.ptr, the compiler detects that, calls  
 toStringz on 'foo' (not foo.ptr) and updates foo as required (if  
 reallocation occurs).

What if it's not foo.ptr?  What if it's some random char * whose origin  
the compiler isn't aware of?


 toStringz returns foo.ptr (I assume).

Huh?  Why should it do anything with toStringz?  I'm not getting this one,  
toStringz already has done the work your proposal wants to do.

 This is not a 'new' problem introduced the idea, it's a general  
 problem for D/arrays/slices and the same happens with an append,  
 right?  In which case it's not a reason against the idea.

 It's new to the features of the C function being called.  If you look  
 up the man page for such a hypothetical function, it might claim that  
 it alters the data passed in through the argument, but it seems to not  
 be the case!  So there's no way for someone (who arguably is not well  
 versed in C functions if they didn't know to use toStringz) to figure  
 out why the code seems not to do what it says it should.  Such a  
 programmer may blame either the implementation of the C function, or  
 blame the D compiler for not calling the function properly.

 None of this is relevant, let me explain..

 My idea is for the compiler to detect a char* parameter to an extern "C"  
 function and to call toStringz.  When it does so it will correctly  
 update the slice/array being passed if reallocation occurs.  The C  
 function will write to the slice/array being passed.  So, it's not  
 relevant if there was another slice referencing the array before it was  
 reallocated, because that case is no different to calling a D function  
 which does something similar, like appending to the passed slice/array.

What about this case?

char[12] buffer;
buffer[] = "hello, world";

ucase(buffer[]); // does nothing to buffer!

I'm saying, the charter of the function is to update a string in place,  
and your proposal is making that not true in some cases.
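
Spelled out as a sketch (with a D stand-in for ucase, since the real function is hypothetical, and an explicit copy standing in for what an implicit termination step would have to do with a full stack buffer):

```d
// Hypothetical stand-in for the C function discussed in this thread.
extern (C) void ucase(char* str)
{
    for (; *str != '\0'; ++str)
        if (*str >= 'a' && *str <= 'z')
            *str -= 'a' - 'A';
}

bool originalUntouched()
{
    char[12] buffer = "hello, world";   // exactly full, no room for '\0'
    char[] copy = buffer[] ~ '\0';      // termination forces a new heap block
    ucase(copy.ptr);                    // the in-place update lands in the copy
    return buffer[] == "hello, world" && copy[0 .. 12] == "HELLO, WORLD";
}

void main()
{
    assert(originalUntouched());
}
```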

 The goal is to make a call to an extern "C" function "just work" in the  

 has it's own string type.


a '0' at the end affects all references to that string, reallocation or  
not.

 toStringz does not currently check for '\0' anywhere in the existing  
 string.  It simply appends '\0' to the end of the passed string.  If  
 you want it to check for '\0', how far should it go?  Doesn't this also  
 add to the overhead (looping over all chars looking for '\0')?

 Note also, that toStringz has old code that used to check for "one byte  
 beyond" the array, but this is commented out, because it's unreliable  
 (could cause a segfault).

 So, toStringz is not as clever as I imagined.  I thought it would  
 intelligently detect cases where a \0 was already present in the slice  
 (from 0 to $) and if not, put one at $+1 (inside pre-allocated array  
 memory).  I was assuming toStringz had access to the underlying array  
 allocation size and would know how far it can 'look' without causing a  
 segfault.  In the case where the slice length equaled the array reserved  
 memory area, it would re-allocate and place the \0 at $+1 (inside the  
 newly allocated memory).

s/clever/slow/

The only "intelligent" way to check for a 0 is a linear search.

Without knowing where the data came from, there is no way to look past the  
slice without possibly causing a segfault.  If you know it's a heap  
allocation, you can look at the block information to see if you can look  
past it.  This might be possible to do for toStringz, but the linear check  
for 0 is just unacceptable for a simple function call.  Appending a 0 is  
at least amortized.  One thing though, it could make some smarter  
decisions as to whether to reallocate depending on the type of the array,  
since it is already doing a lookup of block info.

But I still always come back to the fact that I should be able to  
circumvent some auto-intelligent decision that isn't aware of things that  
a developer can be aware of (such as knowing an array already contains a  
0).  The compiler shouldn't be too intrusive here.

-Steve

Jul 13 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On 2011-07-13 09:00, Steven Schveighoffer wrote:
 On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan netmail.co.nz>
 
 wrote:
 I am suggesting the compiler will perform a special operation on all
 char* parameters passed to extern "C" functions.
 
 The operation is a toStringz like operation which is (more or less) as
 follows:
 
 1. If there is a \0 character inside foo[0..$], do nothing.

 
 This is an O(n) operation -- too much overhead. Especially if you already
 know foo has a 0 in it. Note that toStringz does not have this overhead.
 
 2. If the array allocated memory is > the array length, place a \0 at
 foo[$]

 
 The check to see if the array has allocated length requires a GC lock, and
 an O(log n) search for the block info in the GC.
 
 Not that it doesn't already happen in toStringz, but I just want to point
 out that it's not a small cost.
 
 3. Reallocate the array memory, updating foo, place a \0 at foo[$]
 4. Call the C function passing foo.ptr
 
 So, it will handle all the following cases:
 
 char[] foo;
 .. code to populate foo ..
 
 ucase(foo);
 ucase(foo.ptr);

 
 I read in your responses below, this is due to you making this equivalent
 to ucase(foo)? This still has the same problems I listed above.
 
 What about
 
 char * foo;
 .. code to populate foo ..
 ucase(foo);
 
 Is there still anything special done by the compiler?
 
 ucase(toStringz(foo));
 
 The problem cases are the buffer cases I mentioned earlier, and they
 wouldn't be a problem if char was initialised to \0 as I first imagined.

 
 The largest problem I've had with all this is there is a necessary
 overhead of conversion. Not only that, but due to the way reallocation
 works, there may be a move of data. I think it's better to require
 explicit calls incurring such overhead vs. hiding the overhead calls from
 the developer. Especially if the overhead calls are unnecessary.
 
 Other replies inline below..
 
 On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer
 
 <schveiguy yahoo.com> wrote:
 On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz>
 
 wrote:
 Replace foo with foo.ptr, it makes no difference to the point I was
 making.

 
 Your fix does not help in that case, foo.ptr will be passed as a
 non-null-terminated string.

 
 No, see above.

 
 How does your proposal know that a char * is part of a heap-allocated
 array? If you are assuming the only case where char * is passed will be
 arr.ptr, then that doesn't cut it. What if the compiler doesn't know
 where the char * came from?
 
 The inherent problem of zero-terminated strings is that you don't know how
 long it is until you search for a zero. If it's not properly terminated,
 then you are screwed. That problem cannot be "solved", even with compiler
 help -- you can get situations where there is no more information other
 than the pointer.
 
 So, your proposal fixes the case:
 
 1. The user tries to pass a string/char[] to a C function. Fails to
 compile.
 2. Instead of trying to understand the issue, realizes the .ptr member
 is the right type, and switches to that.
 
 It does not fix or help with cases where:
 
 * a programmer notices the type of the parameter is char * and uses
 foo.ptr without trying foo first. (crash)
 * a programmer calls toStringz without going through the compile/fix
 cycle above.
 * a programmer tries to pass string/char[], fails to compile, then
 looks up how to interface with C and finds toStringz
 
 I think this fix really doesn't solve a very common problem.

 
 See above, my intention was to solve all the cases listed here as I
 suspect the compiler can detect them all, and just 'do the right thing'.
 
 In these cases..
 
 1. If the programmer writes foo.ptr, the compiler detects that, calls
 toStringz on 'foo' (not foo.ptr) and updates foo as required (if
 reallocation occurs).

 
 What if it's not foo.ptr? What if it's some random char * whose origin
 the compiler isn't aware of?
 

 toStringz returns foo.ptr (I assume).

 
 Huh? Why should it do anything with toStringz? I'm not getting this one,
 toStringz already has done the work your proposal wants to do.
 
 This is not a 'new' problem introduced the idea, it's a general
 problem for D/arrays/slices and the same happens with an append,
 right? In which case it's not a reason against the idea.

 
 It's new to the features of the C function being called. If you look
 up the man page for such a hypothetical function, it might claim that
 it alters the data passed in through the argument, but it seems to not
 be the case! So there's no way for someone (who arguably is not well
 versed in C functions if they didn't know to use toStringz) to figure
 out why the code seems not to do what it says it should. Such a
 programmer may blame either the implementation of the C function, or
 blame the D compiler for not calling the function properly.

 
 None of this is relevant, let me explain..
 
 My idea is for the compiler to detect a char* parameter to an extern "C"
 function and to call toStringz. When it does so it will correctly
 update the slice/array being passed if reallocation occurs. The C
 function will write to the slice/array being passed. So, it's not
 relevant if there was another slice referencing the array before it was
 reallocated, because that case is no different to calling a D function
 which does something similar, like appending to the passed slice/array.

 
 What about this case?
 
 char[12] buffer;
 buffer[] = "hello, world";
 
 ucase(buffer[]); // does nothing to buffer!
 
 I'm saying, the charter of the function is to update a string in place,
 and your proposal is making that not true in some cases.
 
 The goal is to make a call to an extern "C" function "just work" in the

 has it's own string type.

 

 a '0' at the end affects all references to that string, reallocation or
 not.
 
 toStringz does not currently check for '\0' anywhere in the existing
 string. It simply appends '\0' to the end of the passed string. If
 you want it to check for '\0', how far should it go? Doesn't this also
 add to the overhead (looping over all chars looking for '\0')?
 
 Note also, that toStringz has old code that used to check for "one byte
 beyond" the array, but this is commented out, because it's unreliable
 (could cause a segfault).

 
 So, toStringz is not as clever as I imagined. I thought it would
 intelligently detect cases where a \0 was already present in the slice
 (from 0 to $) and if not, put one at $+1 (inside pre-allocated array
 memory). I was assuming toStringz had access to the underlying array
 allocation size and would know how far it can 'look' without causing a
 segfault. In the case where the slice length equaled the array reserved
 memory area, it would re-allocate and place the \0 at $+1 (inside the
 newly allocated memory).

 
 s/clever/slow/
 
 The only "intelligent" way to check for a 0 is a linear search.
 
 Without knowing where the data came from, there is no way to look past the
 slice without possibly causing a segfault. If you know it's a heap
 allocation, you can look at the block information to see if you can look
 past it. This might be possible to do for toStringz, but the linear check
 for 0 is just unacceptable for a simple function call. Appending a 0 is
 at least amortized. One thing though, it could make some smarter
 decisions as to whether to reallocate depending on the type of the array,
 since it is already doing a lookup of block info.
 
 But I still always come back to the fact that I should be able to
 circumvent some auto-intelligent decision that isn't aware of things that
 a developer can be aware of (such as knowing an array already contains a
 0). The compiler shouldn't be too intrusive here.

Andrej Mitrovic found a rather annoying issue (fortunately highly unlikely,
and therefore almost certainly rare) with toStringz and toUTFz: under some
circumstances both functions check for a terminating '\0' one past the end
of the string. You might want to have a look at it:

https://github.com/D-Programming-Language/phobos/pull/123

Given what you know about the GC and arrays, your thoughts on the matter would 
be welcome.

- Jonathan M Davis

Jul 13 2011

"Regan Heath" <regan netmail.co.nz> writes:

On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:
 On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 I am suggesting the compiler will perform a special operation on all  
 char* parameters passed to extern "C" functions.

 The operation is a toStringz like operation which is (more or less) as  
 follows:

 1. If there is a \0 character inside foo[0..$], do nothing.

 This is an O(n) operation -- too much overhead.  Especially if you  
 already know foo has a 0 in it.  Note that toStringz does not have this  
 overhead.

On second thought, this step is unnecessary unless the array length matches
the memory block length; it was intended to detect an existing \0 and
avoid the reallocation.  But this case is rare, so the step could be
skipped for the general case, or carried out only when the lengths match
and reallocation is a possibility we want to avoid, or dropped entirely if
the cost is too high even for that.

 2. If the array allocated memory is > the array length, place a \0 at  
 foo[$]

 The check to see if the array has allocated length requires a GC lock,  
 and O(lgn) search for the block info in the GC.

 Not that it doesn't already happen in toStringz, but I just want to  
 point out that it's not a small cost.

This is the cost Walter mentioned earlier.  Does this mean that heap  
allocated arrays do not know how much memory they have allocated?  I was  
assuming they held that information, and that a slice to them would also  
know.  How else does an array append operation know whether to  
reallocate?  Does it have to obtain the GC lock and perform an O(lgn)  
search on every append?

 3. Reallocate the array memory, updating foo, place a \0 at foo[$]
 4. Call the C function passing foo.ptr

 So, it will handle all the following cases:

 char[] foo;
 .. code to populate foo ..

 ucase(foo);
 ucase(foo.ptr);

 I read in your responses below, this is due to you making this  
 equivalent to ucase(foo)?  This still has the same problems I listed  
 above.

Problems above?  You mean the cost?  Yes, there is a cost to pay, but it's  
a cost which has to be paid (and is already paid by calling toStringz) to  
avoid corrupting memory whether it's done explicitly or implicitly.  And  
the cost is only paid for extern "C" functions with char* parameters.  In  
the rare case where the string already contains \0 and the programmer can  
guarantee that, we can have some way to indicate it, or in some cases  
changing the function parameter to ubyte* or byte* may be the correct  
solution.

 What about

 char * foo;
 .. code to populate foo ..
 ucase(foo);

 Is there still anything special done by the compiler?

Assuming foo is allocated by the GC toStringz can still find the length of  

we can handle this case as well (at no extra cost beyond that already
incurred by toStringz).

 ucase(toStringz(foo));

 The problem cases are the buffer cases I mentioned earlier, and they  
 wouldn't be a problem if char was initialised to \0 as I first imagined.

 The largest problem I've had with all this is there is a necessary  
 overhead of conversion.  Not only that, but due to the way reallocation  
 works, there may be a move of data.  I think it's better to require  
 explicit calls incurring such overhead vs. hiding the overhead calls  
 from the developer. Especially if the overhead calls are unnecessary.

But, the overhead is something we already pay calling toStringz  
explicitly, and the reallocation is no different to an append operation.

Generally speaking I would normally agree that it's better to require  
explicit calls incurring overhead etc, but this specific case is something  
new D programmers stumble on all the time, and it makes D look less slick  

different for each, but if we can achieve something similar for no extra
cost (other than what we already pay calling toStringz explicitly), then it's
well worth considering.

As far as I can see the only problem cases are those where we incur more  
cost than toStringz when it's not required, and those cases seem rare to  
me, and could be handled by an opt-out decoration/keyword or similar.

 Other replies inline below..

 On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:
 Replace foo with foo.ptr, it makes no difference to the point I was  
 making.

 Your fix does not help in that case, foo.ptr will be passed as a
 non-null terminated string.

 No, see above.

 How does your proposal know that a char * is part of a heap-allocated  
 array?  If you are assuming the only case where char * is passed will be  
 arr.ptr, then that doesn't cut it.  What if the compiler doesn't know  
 where the char * came from?

See your Q and my A above ("char * foo" example).

 The inherent problem of zero-terminated strings is that you don't know  
 how long it is until you search for a zero.  If it's not properly  
 terminated, then you are screwed.  That problem cannot be "solved", even  
 with compiler help -- you can get situations where there is no more  
 information other than the pointer.

Really?  But can't we obtain the GC lock and look them up, as mentioned
above?  And isn't this exactly what toStringz will do when the programmer  
first of all curses because it has crashed, and then adds an explicit  
toStringz call?

 So, your proposal fixes the case:

 1. The user tries to pass a string/char[] to a C function.  Fails to  
 compile.
 2. Instead of trying to understand the issue, realizes the .ptr member  
 is the right type, and switches to that.

 It does not fix or help with cases where:

   * a programmer notices the type of the parameter is char * and uses  
 foo.ptr without trying foo first. (crash)
   * a programmer calls toStringz without going through the compile/fix  
 cycle above.
   * a programmer tries to pass string/char[], fails to compile, then  
 looks up how to interface with C and finds toStringz

 I think this fix really doesn't solve a very common problem.

 See above, my intention was to solve all the cases listed here as I  
 suspect the compiler can detect them all, and just 'do the right thing'.

 In these cases..

 1. If the programmer writes foo.ptr, the compiler detects that, calls  
 toStringz on 'foo' (not foo.ptr) and updates foo as required (if  
 reallocation occurs).

 What if it's not foo.ptr?  What if it's some random char * whose origin  
 the compiler isn't aware of?

See above.


 toStringz returns foo.ptr (I assume).

 Huh?  Why should it do anything with toStringz?  I'm not getting this  
 one, toStringz already has done the work your proposal wants to do.

I was assuming the compiler could not detect the case where the programmer  
is explicitly calling toStringz i.e. what would be legacy code assuming  
this proposal came into effect.

 This is not a 'new' problem introduced by the idea, it's a general
 problem for D/arrays/slices and the same happens with an append,  
 right?  In which case it's not a reason against the idea.

 It's new to the features of the C function being called.  If you look  
 up the man page for such a hypothetical function, it might claim that  
 it alters the data passed in through the argument, but it seems to not  
 be the case!  So there's no way for someone (who arguably is not well  
 versed in C functions if they didn't know to use toStringz) to figure  
 out why the code seems not to do what it says it should.  Such a  
 programmer may blame either the implementation of the C function, or  
 blame the D compiler for not calling the function properly.

 None of this is relevant, let me explain..

 My idea is for the compiler to detect a char* parameter to an extern  
 "C" function and to call toStringz.  When it does so it will correctly  
 update the slice/array being passed if reallocation occurs.  The C  
 function will write to the slice/array being passed.  So, it's not  
 relevant if there was another slice referencing the array before it was  
 reallocated, because that case is no different to calling a D function  
 which does something similar, like appending to the passed slice/array.

 What about this case?

 char[12] buffer;
 buffer[] = "hello, world";

 ucase(buffer[]); // does nothing to buffer!

 I'm saying, the charter of the function is to update a string in place,  
 and your proposal is making that not true in some cases.

Sure, but how is that different to this:

   char[12] buffer;
   buffer[] = "hello, world";

   ucase(buffer ~ "a"); // does nothing to buffer!

or in fact this:

   char[12] buffer;
   buffer[] = "hello, world";

   ucase(cast(char*)toStringz(buffer)); // does nothing to buffer!

in both cases buffer remains unchanged.
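
As a runnable illustration of the "does nothing to buffer" effect, here is a minimal sketch. It assumes a stand-in `ucase` written in D with C linkage and a hypothetical `terminatedCopy` helper that builds the zero-terminated copy by hand, which is the kind of result a terminate-by-copy conversion produces when the source array has no spare room for the terminator.

```d
import core.stdc.ctype : toupper;

// Hypothetical stand-in for the thread's ucase: upper-cases a
// zero-terminated string in place. Given C linkage but written in D
// so the example is self-contained.
extern (C) void ucase(char* s)
{
    for (; *s != '\0'; ++s)
        *s = cast(char) toupper(*s);
}

// Hypothetical helper: returns a freshly allocated, zero-terminated
// copy of s. The original storage is never touched.
char[] terminatedCopy(const(char)[] s)
{
    auto copy = new char[](s.length + 1);
    copy[0 .. s.length] = s[];
    copy[s.length] = '\0';
    return copy;
}

void main()
{
    char[12] buffer = "hello, world"; // exactly full, no room for '\0'

    char[] copy = terminatedCopy(buffer[]);
    ucase(copy.ptr);

    assert(copy[0 .. $ - 1] == "HELLO, WORLD"); // the copy was changed
    assert(buffer[] == "hello, world");         // the original was not
}
```

The C function did its job; it simply worked on memory the caller never looks at again, which is the behaviour both sides of the thread are describing.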

 The goal is to make a call to an extern "C" function "just work" in the  

 has its own string type.


 adding a '0' at the end affects all references to that string,  
 reallocation or not.


\0 to the string.  For all I know they're making a completely new copy,  


the goal is.

 toStringz does not currently check for '\0' anywhere in the existing  
 string.  It simply appends '\0' to the end of the passed string.  If  
 you want it to check for '\0', how far should it go?  Doesn't this  
 also add to the overhead (looping over all chars looking for '\0')?

 Note also, that toStringz has old code that used to check for "one  
 byte beyond" the array, but this is commented out, because it's  
 unreliable (could cause a segfault).

 So, toStringz is not as clever as I imagined.  I thought it would  
 intelligently detect cases where a \0 was already present in the slice  
 (from 0 to $) and if not, put one at $+1 (inside pre-allocated array  
 memory).  I was assuming toStringz had access to the underlying array  
 allocation size and would know how far it can 'look' without causing a  
 segfault.  In the case where the slice length equaled the array  
 reserved memory area, it would re-allocate and place the \0 at $+1  
 (inside the newly allocated memory).

 s/clever/slow/

 The only "intelligent" way to check for a 0 is a linear search.

Fair enough.

 Without knowing where the data came from, there is no way to look past
 the slice without possibly causing a segfault.  If you know it's a heap
 allocation, you can look at the block information to see if you can look  
 past it.  This might be possible to do for toStringz, but the linear  
 check for 0 is just unacceptable for a simple function call.  Appending  
 a 0 is at least amortized.  One thing though, it could make some smarter  
 decisions as to whether to reallocate depending on the type of the  
 array, since it is already doing a lookup of block info.

Ok, scrap the linear search, or only perform it when a reallocation may be  
required.
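
For reference, the block lookup discussed here is visible from user code through the built-in `capacity` property of D arrays. The sketch below assumes a GC-allocated array created with `.dup`; the variable names are illustrative only. A slice stores just a pointer and a length, so `capacity` has to ask the runtime about the underlying block, and it reports 0 when an in-place append is not possible.

```d
void main()
{
    // GC-allocated array; this slice ends at the used end of its block.
    char[] whole = "hello".dup;
    assert(whole.capacity >= whole.length);

    // A slice that stops short of the used end cannot grow in place,
    // so anything written at head[$] would clobber whole[3].
    char[] head = whole[0 .. 3];
    assert(head.capacity == 0);

    // A string literal lives in static data, not in a GC block.
    string lit = "hello";
    assert(lit.capacity == 0);
}
```

A capacity of 0 is the case where placing a \0 just past the slice is not safe and a copy is needed instead.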

 But I still always come back to the fact that I should be able to  
 circumvent some auto-intelligent decision that isn't aware of things  
 that a developer can be aware of (such as knowing an array already  
 contains a 0).  The compiler shouldn't be too intrusive here.

Sure, we want to keep everyone happy, the Q is, to my mind, which is the  
more general case.

It would be nice to have your cake and eat it too, or in other words for  
the general case (as I see it):

char[] foo;
.. code which populates foo ..
ucase(foo);

to "just work" as a new D programmer might expect.  At the same time, I
agree that in cases where speed is of the essence, or the data is guaranteed
to contain \0, we need to be able to avoid the cost.  As with most things, it
comes down to cost/benefit, and I think D would benefit from this default
behaviour, provided there is a way to avoid it as well.

Perhaps restricting the idea to cases like the one above where the  
compiler has the information for the slice/array, and doing nothing for  
raw char* cases is a good compromise, it would allow people to avoid the  
behaviour just by adding .ptr or similar.
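
Written out by hand, the restricted form of the idea might look like the sketch below. `lengthViaC` is a hypothetical helper, not part of Phobos: it takes a slice, terminates it with `std.string.toStringz`, and hands the result to C's `strlen`. A raw `char*` bypasses the helper and goes to the C function untouched, which is the opt-out described above.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: the slice case, where the length is known and a
// terminator can be supplied before the C call.
size_t lengthViaC(const(char)[] s)
{
    return strlen(toStringz(s));
}

void main()
{
    char[] foo = "hello, world".dup;

    // A slice of a larger array has no '\0' at its end; the helper
    // supplies one before calling into C.
    assert(lengthViaC(foo[0 .. 5]) == 5);

    // The opt-out: a pointer the programmer already knows is terminated
    // is passed directly, with no conversion.
    const(char)* raw = "already terminated".ptr; // literals end in '\0'
    assert(strlen(raw) == 18);
}
```

This only covers C functions that read the string; a function that writes through the pointer raises the copy question discussed earlier in the thread.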

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/

Jul 13 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 How does your proposal know that a char * is part of a heap-allocated  
 array?  If you are assuming the only case where char * is passed will  
 be arr.ptr, then that doesn't cut it.  What if the compiler doesn't  
 know where the char * came from?

 See your Q and my A above ("char * foo" example).

 The inherent problem of zero-terminated strings is that you don't know  
 how long it is until you search for a zero.  If it's not properly  
 terminated, then you are screwed.  That problem cannot be "solved",  
 even with compiler help -- you can get situations where there is no  
 more information other than the pointer.

 Really?  But can't we obtain the GC lock and look them up, as mentioned
 above?  And isn't this exactly what toStringz will do when the  
 programmer first of all curses because it has crashed, and then adds an  
 explicit toStringz call?

Who said the char * points into GC memory?  It could point at stack  
memory, or static data in ROM.

-Steve

Jul 13 2011

"Regan Heath" <regan netmail.co.nz> writes:

On Wed, 13 Jul 2011 19:31:42 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 How does your proposal know that a char * is part of a heap-allocated  
 array?  If you are assuming the only case where char * is passed will  
 be arr.ptr, then that doesn't cut it.  What if the compiler doesn't  
 know where the char * came from?

 See your Q and my A above ("char * foo" example).

 The inherent problem of zero-terminated strings is that you don't know  
 how long it is until you search for a zero.  If it's not properly  
 terminated, then you are screwed.  That problem cannot be "solved",  
 even with compiler help -- you can get situations where there is no  
 more information other than the pointer.

 Really?  But can't we obtain the GC lock and look them up, as mentioned
 above?  And isn't this exactly what toStringz will do when the  
 programmer first of all curses because it has crashed, and then adds an  
 explicit toStringz call?

 Who said the char * points into GC memory?  It could point at stack  
 memory, or static data in ROM.

Ok.  What would toStringz do in this case? .. because that's what I'm  
proposing we do here.

The goal here is to pick some low hanging fruit, the general case  
mentioned earlier, and make it work as a new D programmer would expect.   
In that case there is no technical difficulty implementing it (toStringz  
already exists), there is no extra cost (you already have to call  
toStringz), and the only disagreement seems to be whether it should be  
implicit or explicit.

In this particular case I cannot see any harm in making it implicit.  Yes,  
there are some edge cases, but they either already exist (as shown by the  
explicit toStringz example I gave where the passed char[] remained  
unchanged, and your example passing buffer[]), or they may be detectable  
by the compiler, or they are rare - in which case requiring some manual  
intervention is not too much to ask.

So, on balance I reckon the implicit call would be "better" for more  
people more of the time, and at no extra cost.  It seems like a win/win to  
me.  Yes, there are edge cases, yes there are wrinkles to iron out, no  
it's not a "general/covers everything perfectly" kind of idea - which I  
agree we'd all prefer, but it makes D look slicker, and removes one more  
stumbling block for new D programmers.



Jul 14 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 14 Jul 2011 05:53:47 -0400, Regan Heath <regan netmail.co.nz>  
wrote:

 On Wed, 13 Jul 2011 19:31:42 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 How does your proposal know that a char * is part of a heap-allocated  
 array?  If you are assuming the only case where char * is passed will  
 be arr.ptr, then that doesn't cut it.  What if the compiler doesn't  
 know where the char * came from?

 See your Q and my A above ("char * foo" example).

 The inherent problem of zero-terminated strings is that you don't  
 know how long it is until you search for a zero.  If it's not  
 properly terminated, then you are screwed.  That problem cannot be  
 "solved", even with compiler help -- you can get situations where  
 there is no more information other than the pointer.

 Really?  But can't we obtain the GC lock and look them up, as mentioned
 above?  And isn't this exactly what toStringz will do when the  
 programmer first of all curses because it has crashed, and then adds  
 an explicit toStringz call?

 Who said the char * points into GC memory?  It could point at stack  
 memory, or static data in ROM.

 Ok.  What would toStringz do in this case? .. because that's what I'm  
 proposing we do here.

Nothing, you don't call toStringz on a char *, you call it on a string.   
The point is, for those who have already guaranteed a char * has a 0 in  
it, they should not have to have the compiler injecting useless code for a  
simple function call.

A really really good example is if you use a char * you got from a C  
function to call another C function.
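
A minimal sketch of that situation, using only `strchr` and `strlen` from `core.stdc.string`; the sample text is made up for illustration. The pointer returned by one C function is already zero-terminated, so it can be passed to the next one as-is, and any injected conversion would be wasted work.

```d
import core.stdc.string : strchr, strlen;

void main()
{
    // D string literals are zero-terminated, so .ptr is a valid C string.
    const(char)* text = "key=value".ptr;

    // strchr returns a pointer into the same terminated string...
    const(char)* eq = strchr(text, '=');
    assert(eq !is null);

    // ...so it can go straight to another C function with no conversion.
    assert(strlen(eq + 1) == 5);
}
```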

 The goal here is to pick some low hanging fruit, the general case  
 mentioned earlier, and make it work as a new D programmer would expect.   
 In that case there is no technical difficulty implementing it (toStringz  
 already exists), there is no extra cost (you already have to call  
 toStringz), and the only disagreement seems to be whether it should be  
 implicit or explicit.

There is an extra cost where you wouldn't have to call toStringz currently.

 In this particular case I cannot see any harm in making it implicit.   
 Yes, there are some edge cases, but they either already exist (as shown  
 by the explicit toStringz example I gave where the passed char[]  
 remained unchanged, and your example passing buffer[]), or they may be  
 detectable by the compiler, or they are rare - in which case requiring  
 some manual intervention is not too much to ask.

 So, on balance I reckon the implicit call would be "better" for more  
 people more of the time, and at no extra cost.  It seems like a win/win  
 to me.  Yes, there are edge cases, yes there are wrinkles to iron out,  
 no it's not a "general/covers everything perfectly" kind of idea - which  
 I agree we'd all prefer, but it makes D look slicker, and removes one  
 more stumbling block for new D programmers.

We also have to weigh this against two things:

1. How will existing code (that already calls toStringz) be affected?
2. This is *not* a trivial compiler change.  So all other options should  
be considered, there's a *lot* of C calls that exist from D today that  
could possibly be affected.

If C strings were their own type (and not conflated with "buffer
pointer"), and if a C string could be verified as valid without segfaulting
and in O(1) time, I'd agree that a compiler change would be warranted.  There
are just too many cases (note, these aren't the majority, but they are enough)
where the injected calls will be either performance drags or unnecessary.

Jul 14 2011

"Regan Heath" <regan netmail.co.nz> writes:

On Thu, 14 Jul 2011 12:30:24 +0100, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:
 On Thu, 14 Jul 2011 05:53:47 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Wed, 13 Jul 2011 19:31:42 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 On Wed, 13 Jul 2011 13:32:56 -0400, Regan Heath <regan netmail.co.nz>  
 wrote:

 On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:

 How does your proposal know that a char * is part of a  
 heap-allocated array?  If you are assuming the only case where char  
 * is passed will be arr.ptr, then that doesn't cut it.  What if the  
 compiler doesn't know where the char * came from?

 See your Q and my A above ("char * foo" example).

 The inherent problem of zero-terminated strings is that you don't  
 know how long it is until you search for a zero.  If it's not  
 properly terminated, then you are screwed.  That problem cannot be  
 "solved", even with compiler help -- you can get situations where  
 there is no more information other than the pointer.

 Really?  But can't we obtain the GC lock and look them up, as
 mentioned above?  And isn't this exactly what toStringz will do when  
 the programmer first of all curses because it has crashed, and then  
 adds an explicit toStringz call?

 Who said the char * points into GC memory?  It could point at stack  
 memory, or static data in ROM.

 Ok.  What would toStringz do in this case? .. because that's what I'm  
 proposing we do here.

 Nothing, you don't call toStringz on a char *, you call it on a string.   
 The point is, for those who have already guaranteed a char * has a 0 in  
 it, they should not have to have the compiler injecting useless code for  
 a simple function call.

 A really really good example is if you use a char * you got from a C  
 function to call another C function.

Good points all.  So, the idea should be limited to cases where D's char[]  
and string are passed to extern "C" functions expecting char*, and should  
not affect cases where D's char* is passed directly.  Sounds good.

 The goal here is to pick some low hanging fruit, the general case  
 mentioned earlier, and make it work as a new D programmer would  
 expect.  In that case there is no technical difficulty implementing it  
 (toStringz already exists), there is no extra cost (you already have to  
 call toStringz), and the only disagreement seems to be whether it  
 should be implicit or explicit.

 There is an extra cost where you wouldn't have to call toStringz  
 currently.

The point I've tried to make all along is that this is a rare situation,  
and not the general case.  In the general case you're going to need to  
call toStringz.  Especially if you restrict this idea to D's char[] and  
string and not D's char* as mentioned above.

 In this particular case I cannot see any harm in making it implicit.   
 Yes, there are some edge cases, but they either already exist (as shown  
 by the explicit toStringz example I gave where the passed char[]  
 remained unchanged, and your example passing buffer[]), or they may be  
 detectable by the compiler, or they are rare - in which case requiring  
 some manual intervention is not too much to ask.

 So, on balance I reckon the implicit call would be "better" for more  
 people more of the time, and at no extra cost.  It seems like a win/win  
 to me.  Yes, there are edge cases, yes there are wrinkles to iron out,  
 no it's not a "general/covers everything perfectly" kind of idea -  
 which I agree we'd all prefer, but it makes D look slicker, and removes  
 one more stumbling block for new D programmers.

 We also have to weigh this against two things:

Assuming the above mentioned restriction (char[] and string, not char*)...

 1. How will existing code (that already calls toStringz) be affected?

Not at all.

 2. This is *not* a trivial compiler change.  So all other options should  
 be considered, there's a *lot* of C calls that exist from D today that  
 could possibly be affected.

It will affect none of these.

 If C strings were their own type (and not conflated with "buffer
 pointer"), and if a C string could be verified as valid without
 segfaulting and in O(1) time, I'd agree that a compiler change would be
 warranted.  There are just too many cases (note, these aren't the
 majority, but they are enough) where the injected calls will be either
 performance drags or unnecessary.

I disagree about the number of cases being too many, but this is a gut  
feeling and I have no evidence to support it.

I think with the restriction I mentioned above the situation changes  
however, as all those edge cases are unaffected, old code is unaffected  
and only new code will allow char[] and string to be passed as extern "C"  
char* parameters.


Jul 14 2011
