digitalmars.D - Proposal: clean up semantics of array literals vs string literals

reply Don Clugston <dac nospam.com> writes:
The problem
-----------

String literals in D are a little bit magical; they have a trailing \0. 
This means that it is possible to write,

printf("Hello, World!\n");

without including a trailing \0. This is important for compatibility 
with C. This trailing \0 is mentioned in the spec but only incidentally, 
and generally in connection with printf.

But the semantics are not well defined.

printf("Hello, W" ~ "orld!\n");

Does this have a trailing \0 ? I think it should, because it improves 
readability of string literals that are longer than one line. Currently 
DMD adds a \0, but it is not in the spec.

Now consider array literals.

printf(['H','e', 'l', 'l','o','\n']);

Does this have a trailing \0 ? Currently DMD does not put one in.
How about ['H','e', 'l', 'l','o'] ~ " World!\n"  ?

And "Hello " ~ ['W','o','r','l','d','\n']   ?

And "Hello World!" ~ '\n' ?
And  null ~ "Hello World!\n" ?

Currently DMD puts \0 in some cases but not others, and it's rather random.

The root cause is that this trailing zero is not part of the type, it's 
part of the literal. There are no rules for how literals are propagated 
inside expressions, they are just literals. This is a mess.

There is a second difference.
Array literals of char type, have completely different semantics from 
string literals. In module scope:

char[] x = ['a'];  // OK -- array literals can have an implicit .dup
char[] y = "b";    // illegal

This is a big problem for CTFE, because for CTFE, a string is just a 
compile-time value, it's neither string literal nor array literal!

See bug 8660 for further details of the problems this causes.
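
A minimal sketch of the second difference, for readers who want to try it. It assumes a reasonably recent D2 compiler and probes the conversions with __traits(compiles) inside a function body instead of relying on a compile error; the variable names are only illustrative.

```d
// Module scope, as in the example above: the array literal is accepted.
char[] x = ['a'];

void main()
{
    // An array literal of chars converts implicitly to mutable char[] ...
    static assert( __traits(compiles, { char[] a = ['a']; }));
    // ... but a string literal is immutable(char)[] and does not.
    static assert(!__traits(compiles, { char[] b = "b"; }));

    // With a string literal the copy has to be spelled out.
    char[] c = "b".dup;
    c[0] = 'B';
    assert(c == "B");

    // The module-scope array is ordinary mutable data.
    x[0] = 'z';
    assert(x == "z");
}
```

The explicit .dup is the visible counterpart of the implicit copy that array literals get.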


A proposal to clean up this mess
--------------------------------

Any compile-time value of type immutable(char)[] or const(char)[] 
behaves as string literals currently do, and will have a \0 appended when 
it is stored in the executable.

ie,

enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
printf(hello);

will work.

Any value of type char[], which is generated at compile time, will not 
have the trailing \0, and it will do an implicit dup (as current array 
literals do).

char [] foo()
{
     return "abc";
}

char [] x = foo();

// x does not have a trailing \0, and it is implicitly duped, even 
// though it was not declared with an array literal.

-------------------
So that the difference between string literals and char array literals 
would simply be that the latter are polysemous. There would be no 
semantics associated with the form of the literal itself.


We still have this oddity:


void foo(char qqq = 'b') {

    string x = "abc";            // trailing \0
    string y = ['a', 'b', 'c'];  // trailing \0
    string z = ['a', qqq, 'c'];  // no trailing \0
}

This is because we made the (IMHO mistaken) decision to allow variables 
inside array literals.
This is the reason why I listed _compile time value_ in the requirement 
for having a \0, rather than entirely basing it on the type.

We could fix that with a language change: an array literal which 
contains a variable should not be of immutable type. It should be of 
mutable type (or const, in the case where it contains other, immutable 
values).

So char [] w = ['a', qqq, 'c']; should compile (it currently doesn't, 
even though w is allocated on the heap).

But that's a separate proposal from the one I'm making here. I just need 
a decision on the main proposal so that I can fix a pile of CTFE bugs.
Oct 02 2012
next sibling parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
 The problem
 -----------

 String literals in D are a little bit magical; they have a 
 trailing \0. This means that is possible to write,

 printf("Hello, World!\n");

 without including a trailing \0. This is important for 
 compatibility with C. This trailing \0 is mentioned in the spec 
 but only incidentally, and generally in connection with printf.

 But the semantics are not well defined.

 printf("Hello, W" ~ "orld!\n");
If every string literal is \0-terminated, then there should be two \0 in the final string. I guess that's not the case and that's actually my preferred behaviour, but the spec should make it crystal clear in which situations a string literal gets a terminator and in which not.
Oct 02 2012
parent Don Clugston <dac nospam.com> writes:
On 02/10/12 13:18, Tobias Pankrath wrote:
 On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
 The problem
 -----------

 String literals in D are a little bit magical; they have a trailing
 \0. This means that is possible to write,

 printf("Hello, World!\n");

 without including a trailing \0. This is important for compatibility
 with C. This trailing \0 is mentioned in the spec but only
 incidentally, and generally in connection with printf.

 But the semantics are not well defined.

 printf("Hello, W" ~ "orld!\n");
 If every string literal is \0-terminated, then there should be two \0
 in the final string. I guess that's not the case and that's actually
 my preferred behaviour, but the spec should make it crystal clear in
 which situations a string literal gets a terminator and in which not.

The \0 is *not* part of the string, it lies after the string. It's as if all memory is cleared, then the string literals are copied into it, with a gap of at least one byte between each. The 'trailing 0' is not part of the literal, it's the underlying cleared memory. At least, that's how I understand it. The spec is very vague.
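
A small sketch of that reading, assuming a plain string literal (the case the spec does mention). byteAfter is a hypothetical helper, not a library function, and it is only meaningful when the byte past the end is known to be readable, as it is for a literal.

```d
import core.stdc.string : strlen;

// Hypothetical helper: reads the byte just past the end of the slice.
// Only valid when the caller knows that byte exists, e.g. for a literal.
char byteAfter(string s)
{
    return s.ptr[s.length];
}

void main()
{
    string s = "Hello";

    // The terminator is not counted in the length ...
    assert(s.length == 5);
    // ... it sits in the byte after the last character.
    assert(byteAfter(s) == '\0');
    // So C functions such as strlen see the same five characters.
    assert(strlen(s.ptr) == s.length);
}
```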
Oct 02 2012
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
Well, the whole mess comes from the fact that D conflates C strings and D 
strings.

The first problem comes from the fact that D arrays are implicitly 
convertible to pointers. So calling a D function that expects a char* is 
possible with a D string, even if it is unsafe and will not work in the 
general case.

The fact that D provides tricks that make it work in special cases 
is harmful, as previous discussions have shown (many D programmers assume 
that this will always work because of toy tests they have made, when in 
fact it won't and toStringz must be used).

The only sane solution I can think of is to:
  - Disallow implicit conversion of slices to pointers. .ptr is made for that.
  - Do not put any trailing 0 in string literals, unless it is specified 
explicitly ( "foobar\0" ).
  - Except if a const(char)* is expected from the string literal. In that 
case it becomes a Cstring literal, with a trailing 0. This is made to 
allow uses like printf("foobar");

In other words, the receiver type is used to decide whether the compiler 
generates a string literal or a Cstring literal.

Other additions of 0 are just confusing, and will make incorrect code 
work in special cases, which is something you usually don't want. Code 
that works by accident often backfires in spectacular ways at the least 
expected moment.
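
To illustrate the "works in toy tests only" point, here is a hedged sketch using a slice of a literal. cLength is a hypothetical helper; toStringz is the existing function from std.string.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: length of s as a C function would see it,
// after going through toStringz.
size_t cLength(string s)
{
    return strlen(toStringz(s));
}

void main()
{
    string full = "Hello, World!";
    string part = full[0 .. 5];          // "Hello", sliced out of the literal

    // The byte after the slice is the comma, not a terminator,
    // so passing part.ptr to a C function would read too far.
    assert(part.ptr[part.length] == ',');

    // toStringz produces a properly terminated copy.
    assert(cLength(part) == 5);
}
```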
Oct 02 2012
parent reply Don Clugston <dac nospam.com> writes:
On 02/10/12 13:26, deadalnix wrote:
 Well the whole mess come from the fact that D conflate C string and D
 string.

 The first problem come from the fact that D array are implicitly
 convertible to pointer. So calling D function that expect a char* is
 possible with D string even if it is unsafe and will not work in the
 general case.

 The fact that D provide tricks that will make it work in special cases
 is armful as previous discussion have shown (many D programmer assume
 that this will always work because of toy tests they have made, where in
 case it won't and toStringz must be used).

 The only sane solution I can think of is to :
   - disallow slice to convert implicitly to pointer. .ptr is made for that.
   - Do not put any trailing 0 in string literal, unless it is specified
 explicitly ( "foobar\0" ).
   - Except if a const(char)* is expected from the string literal. In
 case it becomes a Cstring literal, with a trailing 0. This is made to
 allow uses like printf("foobar");

 In other terms, the receiver type is used to decide if the compiler
 generate a string literal or a Cstring literal.
This still doesn't solve the problem of the difference between array literals and string literals (the magical implicit .dup), which is the key problem I'm trying to solve.
Oct 02 2012
parent deadalnix <deadalnix gmail.com> writes:
Le 02/10/2012 15:12, Don Clugston a écrit :
 On 02/10/12 13:26, deadalnix wrote:
 Well the whole mess come from the fact that D conflate C string and D
 string.

 The first problem come from the fact that D array are implicitly
 convertible to pointer. So calling D function that expect a char* is
 possible with D string even if it is unsafe and will not work in the
 general case.

 The fact that D provide tricks that will make it work in special cases
 is armful as previous discussion have shown (many D programmer assume
 that this will always work because of toy tests they have made, where in
 case it won't and toStringz must be used).

 The only sane solution I can think of is to :
 - disallow slice to convert implicitly to pointer. .ptr is made for that.
 - Do not put any trailing 0 in string literal, unless it is specified
 explicitly ( "foobar\0" ).
 - Except if a const(char)* is expected from the string literal. In
 case it becomes a Cstring literal, with a trailing 0. This is made to
 allow uses like printf("foobar");

 In other terms, the receiver type is used to decide if the compiler
 generate a string literal or a Cstring literal.
 This still doesn't solve the problem of the difference between array
 literals and string literals (the magical implicit .dup), which is the
 key problem I'm trying to solve.

OK, in fact we have 2 different and unrelated problems here. I have to say I have no idea for the second one.
Oct 02 2012
prev sibling next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 10/2/12, Don Clugston <dac nospam.com> wrote:
 A proposal to clean up this mess
 --------------------------------

 Any compile-time value of type immutable(char)[] or const(char)[],
 behaves a string literals currently do, and will have a \0 appended when
 it is stored in the executable.

 ie,

 enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
 printf(hello);

 will work.
What about these, will these pass?:

enum string x = "foo";
assert(x.length == 3);

void test(string x) { assert(x.length == 3); }
test(x);

If these don't pass the proposal will break code.
Oct 02 2012
parent Don Clugston <dac nospam.com> writes:
On 02/10/12 14:02, Andrej Mitrovic wrote:
 On 10/2/12, Don Clugston <dac nospam.com> wrote:
 A proposal to clean up this mess
 --------------------------------

 Any compile-time value of type immutable(char)[] or const(char)[],
 behaves a string literals currently do, and will have a \0 appended when
 it is stored in the executable.

 ie,

 enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
 printf(hello);

 will work.
 What about these, will these pass?:

 enum string x = "foo";
 assert(x.length == 3);

 void test(string x) { assert(x.length == 3); }
 test(x);

 If these don't pass the proposal will break code.

Yes, they pass. The \0 is not included in the string length. It's effectively in the data segment, not in the string.
Oct 02 2012
prev sibling next sibling parent kenji hara <k.hara.pg gmail.com> writes:
2012/10/2 Don Clugston <dac nospam.com>:
 The problem
 -----------

 String literals in D are a little bit magical; they have a trailing \0. This
 means that is possible to write,

 printf("Hello, World!\n");

 without including a trailing \0. This is important for compatibility with C.
 This trailing \0 is mentioned in the spec but only incidentally, and
 generally in connection with printf.

 But the semantics are not well defined.

 printf("Hello, W" ~ "orld!\n");

 Does this have a trailing \0 ? I think it should, because it improves
 readability of string literals that are longer than one line. Currently DMD
 adds a \0, but it is not in the spec.

 Now consider array literals.

 printf(['H','e', 'l', 'l','o','\n']);

 Does this have a trailing \0 ? Currently DMD does not put one in.
 How about ['H','e', 'l', 'l','o'] ~ " World!\n"  ?

 And "Hello " ~ ['W','o','r','l','d','\n']   ?

 And "Hello World!" ~ '\n' ?
 And  null ~ "Hello World!\n" ?

 Currently DMD puts \0 in some cases but not others, and it's rather random.

 The root cause is that this trailing zero is not part of the type, it's part
 of the literal. There are no rules for how literals are propagated inside
 expressions, they are just literals. This is a mess.

 There is a second difference.
 Array literals of char type, have completely different semantics from string
 literals. In module scope:

 char[] x = ['a'];  // OK -- array literals can have an implicit .dup
 char[] y = "b";    // illegal

 This is a big problem for CTFE, because for CTFE, a string is just a
 compile-time value, it's neither string literal nor array literal!

 See bug 8660 for further details of the problems this causes.


 A proposal to clean up this mess
 --------------------------------

 Any compile-time value of type immutable(char)[] or const(char)[], behaves a
 string literals currently do, and will have a \0 appended when it is stored
 in the executable.

 ie,

 enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
 printf(hello);

 will work.

 Any value of type char[], which is generated at compile time, will not have
 the trailing \0, and it will do an implicit dup (as current array literals
 do).

 char [] foo()
 {
     return "abc";
 }

 char [] x = foo();

 // x does not have a trailing \0, and it is implicitly duped, even though it
 was not declared with an array literal.

 -------------------
 So that the difference between string literals and char array literals would
 simply be that the latter are polysemous. There would be no semantics
 associated with the form of the literal itself.


 We still have this oddity:


 void foo(char qqq = 'b') {

    string x = "abc";            // trailing \0
    string y = ['a', 'b', 'c'];  // trailing \0
    string z = ['a', qqq, 'c'];  // no trailing \0
 }

 This is because we made the (IMHO mistaken) decision to allow variables
 inside array literals.
 This is the reason why I listed _compile time value_ in the requirement for
 having a \0, rather than entirely basing it on the type.

 We could fix that with a language change: an array literal which contains a
 variable should not be of immutable type. It should be of mutable type (or
 const, in the case where it contains other, immutable values).

 So char [] w = ['a', qqq, 'c']; should compile (it currently doesn't, even
 though w is allocated on the heap).

 But that's a separate proposal from the one I'm making here. I just need a
 decision on the main proposal so that I can fix a pile of CTFE bugs.
Maybe your proposal is correct. I think the key idea is the *polysemously typed string literal*. Based on the Ideal D Interpreter in my brain, the organized rules would be as follows.

1-1) At the semantic level, D should have just one polysemous string literal, which is "an array of char".
1-2) At the token level, D has two representations of the polysemous string literal: "str" and ['s','t','r'].

2) The polysemous string literal is implicitly convertible to [wd]?char[] and immutable([wd]?char)[] (I think const([wd]?char)[] is not needed, because immutable([wd]?char)[] is implicitly convertible to it).

3) The result of concatenating polysemous literals is still polysemous, but its representation depends on both sides of the operator.

"str" ~ "str";           // "strstr"
"str" ~ ['s','t','r'];   // ['s','t','r','s','t','r']
"str" ~ 's';             // "strs"
['s','t','r'] ~ 's';     // ['s','t','r','s']
"str" ~ null;            // "str"
['s','t','r'] ~ null;    // ['s','t','r']

4) After semantic analysis _and_ optimization, a polysemous string literal represented as
4-1) "str" is typed as immutable([wd]?char)[] (the char type depends on the literal suffix).
4-2) ['s','t','r'] is typed as ([wd]?char)[] (the char type depends on the common type of its elements).

5) In the object file generation phase, a string literal typed as
5-1) immutable([wd]?char)[] is stored in the executable and implicitly terminated with \0.
5-2) [wd]?char[] is stored in the executable as the original image and implicitly 'dup'ed at run time.

----
Additionally, in the following cases, both concatenations should generate polysemous string literals at compile time and at run time, because after concatenating chars and char arrays, the newly allocated strings are *purely immutable* values and implicitly convertible to mutable.

immutable char ic = 'a';
pragma(msg, typeof(['s', 't', ic, 'r']));    // prints const(char)[]
immutable(char)[] s = ['s', 't', ic, 'r'];   // BUT, should be allowed

char mc = 'a';
pragma(msg, typeof("st"~mc~"r"));   // prints const(char)[]
char[] s = "st"~mc~"r";             // BUT, should be allowed

Kenji Hara
Oct 02 2012
prev sibling next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
 [SNIP]
 A proposal to clean up this mess
 [SNIP]
While I think it is convenient to be able to write 'printf("world");', as you point out, I think that the fact that it works "inconsistently" (and by that, I mean there are rules and exceptions) is even more dangerous.

If at all possible, I'd rather side with consistency than with the "we got your back... except when we don't" approach. IE: strings are NEVER null terminated.

In theory, how often do you *really* need null terminated strings? And when you do, wouldn't it be safer to just write 'printf("world\0")' or 'printf(str ~ "world" ~ '\0');' rather than "Am I in a case where it is null terminated? Yeah... 90% confident I am..."

If you want 0 termination, then make it explicit, that's my opinion.

Besides, as you said, the null termination is not documented, so anything relying on it is a bug really. Just an observation of an implementation detail.
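
A minimal sketch of explicit termination, to show one consequence worth keeping in mind: the terminator then becomes part of the D string and is counted in its length. withTerminator is a hypothetical helper used only for this illustration.

```d
import core.stdc.string : strlen;

// Hypothetical helper: returns a copy of s with an explicit terminator.
string withTerminator(string s)
{
    return s ~ '\0';
}

void main()
{
    string z = withTerminator("hello " ~ "world");

    // The explicit \0 is now an element of the string: 11 characters + 1.
    assert(z.length == 12);
    assert(z[$ - 1] == '\0');

    // A C function still sees 11 characters.
    assert(strlen(z.ptr) == 11);
}
```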
Oct 02 2012
parent "Bernard Helyer" <b.helyer gmail.com> writes:
On Tuesday, 2 October 2012 at 14:03:36 UTC, monarch_dodra wrote:
 If you want 0 termination, then make it explicit, that's my 
 opinion.
That ship has long since sailed. You'll break code in an incredibly dangerous way if you were to change it now.
Oct 04 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/2/12 7:11 AM, Don Clugston wrote:
 The problem
 -----------

 String literals in D are a little bit magical; they have a trailing \0.
[snip]

I don't mean to be Debbie Downer on this because I reckon it addresses an issue that some have, although I never do. With that warning, a few candid opinions follow.

First, I think zero-terminated strings shouldn't be needed frequently enough in D code to make this necessary.

Second, a simple and workable solution to this would be to address the matter dynamically: make toStringz opportunistically look whether there's a \0 beyond the end of the string, EXCEPT when the string happens to end exactly at a page boundary (in which case accessing memory beyond the end of the string may produce a page fault). With this simple dynamic test we don't need precise and stringent rules for the implementation.

Third, the complex set of rules proposed pushes the number of cases in which the \0 is guaranteed, but doesn't make for a clear and easy to remember boundary. Therefore people will need to remember some more rules to make sure they can, well, avoid a call to toStringz.

On 10/2/12 10:55 AM, Regan Heath wrote:
 Recent discussions on the zero terminated string problems and
 inconsistency of string literals has me, again, wondering why D
 doesn't have a 'type' to represent C's zero terminated strings.  It
 seems to me that having a type, and typing C functions with it would
 solve a lot of problems.
[snip]
 I am probably missing something obvious, or I have forgotten one of
 the array/slice complexities which makes this a nightmare.
You're not missing anything and defining a zero-terminated type is something I considered doing and have been highly interested in. My interest is motivated by the fact that sentinel-terminated structures are a very interesting example of forward ranges that are also contiguous. That sets them apart from both singly-linked lists and simple arrays, and gives them interesting properties.

I'd be interested in defining the more general:

struct SentinelTerminatedSlice(T, T terminator)
{
    private T* data;
    ...
}

That would be a forward range and the instantiation SentinelTerminatedSlice!(char, 0) would be CString.

However, so far I held off of defining such a range because C-strings are seldom useful in D code and there are not many other compelling examples of sentinel-terminated ranges. Maybe it's time to dust off that idea, I'd love it if we gathered enough motivation for it.

Andrei
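
One way such a range could be filled in, purely as a sketch: the member functions, the constructor and the const(T)* storage below are assumptions made for this illustration, since the post elides the body with "...". The terminator is written '\0' rather than 0 to keep the example unambiguous.

```d
import std.range.primitives : isForwardRange, walkLength;

struct SentinelTerminatedSlice(T, T terminator)
{
    // Assumption: read-only view, so the pointer is to const data.
    private const(T)* data;

    this(const(T)* p) { data = p; }

    // The range ends when the sentinel is reached; no length is stored.
    @property bool empty() const { return *data == terminator; }
    @property T front() const { return *data; }
    void popFront() { ++data; }

    // Copying the pointer is enough to save the position.
    @property SentinelTerminatedSlice save() { return this; }
}

alias CString = SentinelTerminatedSlice!(char, '\0');

static assert(isForwardRange!CString);

void main()
{
    // A string literal is zero-terminated, so its .ptr is a valid source.
    auto r = CString("abc".ptr);
    auto saved = r.save;

    assert(r.front == 'a');
    r.popFront();
    assert(r.front == 'b');

    // The saved copy still starts at the beginning.
    assert(walkLength(saved) == 3);
}
```

Because the length is discovered by walking to the sentinel, walkLength is O(n) here, which is the trade-off against ordinary D slices.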
Oct 02 2012
next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu 
wrote:
 However, so far I held off of defining such a range because 
 C-strings are seldom useful in D code [...]
I think your view of what is common in D code is not representative. You are primarily a library writer, which means you rarely have to interface with other code. Please correct me if I'm wrong, but I don't believe you've written much application-level D code.

For people that write applications, we have the unfortunate chore of having to call lots of C APIs to get things done. There's a long list of things for which there is no D interface (graphics, audio, input, GUI, database, platform APIs, various 3rd party libs). Invariably these interfaces require C strings.

In short, if you write applications in D, you need C strings. I don't know what the right decision is here, but please do not say that C-strings are seldom useful in D code.
Oct 02 2012
prev sibling next sibling parent Don Clugston <dac nospam.com> writes:
On 02/10/12 17:14, Andrei Alexandrescu wrote:
 On 10/2/12 7:11 AM, Don Clugston wrote:
 The problem
 -----------

 String literals in D are a little bit magical; they have a trailing \0.
 [snip]
 I don't mean to be Debbie Downer on this because I reckon it addresses
 an issue that some have, although I never do. With that warning, a few
 candid opinions follow.

 First, I think zero-terminated strings shouldn't be needed frequently
 enough in D code to make this necessary.
 [snip]

You're missing the point, a bit. The zero-terminator is only one symptom of the underlying problem: string literals and array literals have the same type but different semantics.

The other symptoms are:

* the implicit .dup that happens with array literals, but not string literals. This is a silent performance killer. It's probably the most common performance bug we find in our code, and it's completely ungreppable.

* string literals are polysemous with width (c, w, d) but array literals are not (they are polysemous with constness). For example, "abc" ~ 'ü' is legal, but ['a', 'b', 'c'] ~ 'ü' is not. This has nothing to do with the zero terminator.
Oct 04 2012
prev sibling parent reply "Bernard Helyer" <b.helyer gmail.com> writes:
On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu 
wrote:
 First, I think zero-terminated strings shouldn't be needed 
 frequently enough in D code to make this necessary.
My experience has been much different. Interfacing with C occurs in nearly every D program I write, and I usually end up passing a string literal. Anecdotes!
Oct 04 2012
parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 4 October 2012 at 07:57:16 UTC, Bernard Helyer wrote:
 On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu 
 wrote:
 First, I think zero-terminated strings shouldn't be needed 
 frequently enough in D code to make this necessary.
 My experience has been much different. Interfacing with C occurs in
 nearly every D program I write, and I usually end up passing a string
 literal. Anecdotes!

Agreed. I'm always happy when I find that the particular C API I am working with supports passing strings as a pointer/length pair :)

Anyway, toStringz (and the wchar and dchar equivalents in std.utf) need to be fixed regardless: it currently does a dangerous optimization if the string is immutable, otherwise it unconditionally concatenates. We cannot rely on strings being GC allocated based on mutability. Memory is outside the scope of the D type system, so we cannot make assumptions about memory based on types.
Oct 04 2012