
digitalmars.D.learn - Stream and File understanding.

reply jicman <jicman_member pathlink.com> writes:
So, I have this complicated piece of code:

|import std.file;
|import std.stream;
|int main()
|{
|  File log = new File("myfile.txt",FileMode.Out);
|  log.write("this is a test");
|  log.close();
|  return 1;
|}

and when I try to compile it, I get,

|ftest.d(6): function std.stream.Stream.write called with argument types:
|        (char[14])
|matches both:
|        std.stream.Stream.write(char[])
|and:
|        std.stream.Stream.write(wchar[])

Shouldn't it just match "std.stream.Stream.write(char[])"?

thanks,

josé
Nov 10 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
jicman wrote:
 So, I have this complicated piece of code:
 
 |import std.file;
 |import std.stream;
 |int main()
 |{
 |  File log = new File("myfile.txt",FileMode.Out);
 |  log.write("this is a test");
 |  log.close();
 |  return 1;
 |}
 
 and I try to compile it, I get,
 
 |ftest.d(6): function std.stream.Stream.write called with argument types:
 |        (char[14])
 |matches both:
 |        std.stream.Stream.write(char[])
 |and:
 |        std.stream.Stream.write(wchar[])
 
 Shouldn't it just match "std.stream.Stream.write(char[])"?
The problem is that string literals can be implicitly converted to char, wchar, and dchar strings. To fix the overload resolution problem, try this:

log.write( "this is a test"c );

The 'c' suffix indicates that the above is a char string.

Sean
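For reference, here's the original program with the suffix applied (a minimal sketch, reusing the std.stream File API from the post above):

import std.file;
import std.stream;
int main()
{
  File log = new File("myfile.txt", FileMode.Out);
  log.write("this is a test"c);   // 'c' commits the literal to char[]
  log.close();
  return 1;
}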
Nov 10 2005
next sibling parent "Kris" <fu bar.com> writes:
This produces a compile error:

void write (char[] x){}
void write (wchar[] x){}

void main()
{
    write ("part 1"
           "part 2" c);
}

The compiler complains about the two literal types not matching. This also 
fails:

void main()
{
    write ("part 1" c
           "part 2" c);
}

This strange looking suffixing is present due to unwarranted & unwanted 
automatic type conversion, is it not? Wouldn't it be better to explicitly 
request conversion when it's actually wanted instead? Isn't that what the 
cast() operator is for?
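
For comparison, the cast() form might look like this (a sketch against the same two overloads; nothing here is required by the current compiler):

void write (char[] x) {}
void write (wchar[] x) {}

void main()
{
    // an explicit conversion fixes the literal's type, so the call
    // resolves to the char[] overload without any suffix
    write (cast(char[]) "part 1 part 2");
}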

- Kris




"Sean Kelly" <sean f4.ca> wrote in message 
news:dl0in9$2bet$1 digitaldaemon.com...
 jicman wrote:
 So, I have this complicated piece of code:

 |import std.file;
 |import std.stream;
 |int main()
 |{
 |  File log = new File("myfile.txt",FileMode.Out);
 |  log.write("this is a test");
 |  log.close();
 |  return 1;
 |}

 and I try to compile it, I get,

 |ftest.d(6): function std.stream.Stream.write called with argument types:
 |        (char[14])
 |matches both:
 |        std.stream.Stream.write(char[])
 |and:
 |        std.stream.Stream.write(wchar[])

 Shouldn't it just match "std.stream.Stream.write(char[])"?
The problem is that literal string literals can be implicitly converted to char, wchar, and dchar strings. To fix the overload resolution problem, try this: log.write( "this is a test"c ); the 'c' indicates that the above is a char string. Sean
Nov 10 2005
prev sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Sean Kelly wrote:
 jicman wrote:
 
 So, I have this complicated piece of code:

 |import std.file;
 |import std.stream;
 |int main()
 |{
 |  File log = new File("myfile.txt",FileMode.Out);
 |  log.write("this is a test");
 |  log.close();
 |  return 1;
 |}

 and I try to compile it, I get,

 |ftest.d(6): function std.stream.Stream.write called with argument types:
 |        (char[14])
 |matches both:
 |        std.stream.Stream.write(char[])
 |and:
 |        std.stream.Stream.write(wchar[])

 Shouldn't it just match "std.stream.Stream.write(char[])"?
The problem is that literal string literals can be implicitly converted to char, wchar, and dchar strings. To fix the overload resolution problem, try this: log.write( "this is a test"c ); the 'c' indicates that the above is a char string.
I just posted a "nice" fix on this thread. But it seems overkill (and brittle), if one assumes this is just a problem with string literals!

_If_ it is true that this "problem" exists only with string literals, then it should be even easier to fix!

The compiler knows (or at least _should_ know) the character width of the source code file. Now, if there's an undecorated string literal in it, then _simply_assume_ that is the _intended_ type of the string!

(( At this time opponents will say "what if the source code file gets converted into another character width?" -- My answer: "Tough, ain't it!", since there's a law against gratuitous mucking with source code. ))

So, implicitly just assume the source code literal character width. The '"c' does _not_ exist so the compiler can force you to state the obvious. It's there so you _can_ be explicit _when_ it really matters to you.

---

Oh, and if we want to be real fancy, we could also have a pragma stating the default for character literals! And when the pragma is not used, then assume based on the source.
Nov 11 2005
next sibling parent reply bert <bert_member pathlink.com> writes:
In article <4374598B.30604 nospam.org>, Georg Wrede says... 
 
 
The compiler knows (or at least _should_ know) the character width of  
the source code file. Now, if there's an undecorated string literal in  
it, then _simply_assume_ that is the _intended_ type of the string! 
 
The *programmer* assumes so *anyway*. Why on earth should the compiler assume anything else!

BTW, D is really cool!
Nov 11 2005
parent jicman <jicman_member pathlink.com> writes:
bert says...
In article <4374598B.30604 nospam.org>, Georg Wrede says... 
 
 
The compiler knows (or at least _should_ know) the character width of  
the source code file. Now, if there's an undecorated string literal in  
it, then _simply_assume_ that is the _intended_ type of the string! 
 
The *programmer* assumes so *anyway*. Why on earth should the copiler assume anything else! BTW, D is really cool!
It is really cool. :-)
Nov 11 2005
prev sibling next sibling parent "Kris" <fu bar.com> writes:
"Georg Wrede" <georg.wrede nospam.org> wrote ...
 The compiler knows (or at least _should_ know) the character width of the 
 source code file. Now, if there's an undecorated string literal in it, 
 then _simply_assume_ that is the _intended_ type of the string!
That sounds like a good idea; it would set the /default/ type for literals. But the compiler should still inspect the literal content to determine if it has explicit wchar or dchar characters within. The compiler apparently does this, but doesn't use it to infer literal type? This combination would very likely resolve all such problems, assuming the auto-casting were removed also?
Nov 11 2005
prev sibling parent reply Nick <Nick_member pathlink.com> writes:
In article <4374598B.30604 nospam.org>, Georg Wrede says...
The compiler knows (or at least _should_ know) the character width of 
the source code file. Now, if there's an undecorated string literal in 
it, then _simply_assume_ that is the _intended_ type of the string!

(( At this time opponents will say "what if the source code file gets 
converted into another character width?" -- My answer: "Tough, ain't 
it!", since there's a law against gratuituous mucking with source code.  ))
Well that's a nice attitude. Makes copy-and-paste impossible, and makes writing code off html, plain text, and books impossible too, since the code's behaviour now depends on your language environment. I'm sure that won't cause any bugs at all ;-)

Nick
Nov 14 2005
parent Georg Wrede <georg.wrede nospam.org> writes:
Nick wrote:
 In article <4374598B.30604 nospam.org>, Georg Wrede says...
 
 The compiler knows (or at least _should_ know) the character width
 of the source code file. Now, if there's an undecorated string
 literal in it, then _simply_assume_ that is the _intended_ type of
 the string!
 
 (( At this time opponents will say "what if the source code file
 gets converted into another character width?" -- My answer: "Tough,
 ain't it!", since there's a law against gratuituous mucking with
 source code.  ))
Well that's a nice attitude. Makes copy-and-paste impossible, and makes writing code off html, plain text, and books impossible too, since the code's behaviour now dependens on your language environment. I'm sure that won't cause any bugs at all ;-)
:-) there are actually 2 separate issues involved.

First of all, the copy-and-paste issue: To be able to paste into the string, the text editor (or whatever) has to know the character width of the file to begin with, since pasting is done differently with the various UTF widths. Further, one cannot paste anything "in the wrong UTF width" as such, so the editor has to convert it into the width of the entire file first. (This _should_ be handled by the operating system (not the text editor), but I wouldn't bet on it, at least before 2010 or something. Not with at least _some_ "operating systems".)

Second, the width the undecorated literal is to be stored as: What makes this issue interesting is, is it feasible to assume something or declare the literal as "of unspecified" width. There's lately been some research into the issue (in the D newsgroup). The jury is still out.
Nov 14 2005
prev sibling next sibling parent reply "Kris" <fu bar.com> writes:
This is the long standing mishmash between character literal arguments and 
parameters of type char[], wchar[], and/or dchar[]. Character literals don't 
really have a "solid" type ~ the compiler can, and will, convert between 
wide and narrow representations on the fly.

Suppose you have the following methods:

void write (char[] x);
void write (wchar[] x);
void write (dchar[] x);

Given a literal argument:

write ("what am I?");

D doesn't know whether to invoke the char[] or wchar[] signature, since the 
literal is treated as though it's possibly any of the three types. This is 
the kind of non-determinism you get when the compiler becomes too 'smart' 
(unwarranted automatic conversion, in this case).

To /patch/ around this problem, literals may now be suffixed with a 
type-identifier, including 'c', 'w', and 'd'. Thus, the above example will 
compile when you do the following:

write ( "I am a char[], dammit!" c );

I, for one, think this is silly. To skirt the issue, APIs end up being 
written as follows:

void write (char[]);
void writeW (wchar[]);
void writeD (dchar[]);

Is that redundant, or what? Well, it's what Phobos is forced to do in the 
Stream code (take a look). The error you ran into appears to be a situation 
where Walter's own code (std.file) bumps into this ~ wish that were enough 
to justify a real fix for this long-running concern.

BTW; the correct thing happens when not using literals. For example, the 
following operates intuitively:

char[]  msg = "I am a char[], dammit!";
write (msg);


- Kris






"jicman" <jicman_member pathlink.com> wrote in message 
news:dl0hja$2aal$1 digitaldaemon.com...
 So, I have this complicated piece of code:

 |import std.file;
 |import std.stream;
 |int main()
 |{
 |  File log = new File("myfile.txt",FileMode.Out);
 |  log.write("this is a test");
 |  log.close();
 |  return 1;
 |}

 and I try to compile it, I get,

 |ftest.d(6): function std.stream.Stream.write called with argument types:
 |        (char[14])
 |matches both:
 |        std.stream.Stream.write(char[])
 |and:
 |        std.stream.Stream.write(wchar[])

 Shouldn't it just match "std.stream.Stream.write(char[])"?

 thanks,

 josé

 
Nov 10 2005
next sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Kris wrote:
 This is the long standing mishmash between character literal
 arguments and parameters of type char[], wchar[], and/or dchar[].
 Character literals don't really have a "solid" type ~ the compiler
 can, and will, convert between wide and narrow representations on the
 fly.
Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)

It is a problem for small example programs. Larger programs tend to (and IMHO should) have wrappers anyhow:

void logwrite(char[] logfile, char[] entry)
{
    std.stream.Stream.write(logfile, entry)
}
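A compilable version of such a wrapper might look like this (a sketch; the logwrite name and the Stream parameter are illustrative, not from the post):

import std.stream;

// the char[] parameter commits the literal's type before it ever
// reaches the overloaded Stream.write
void logwrite (Stream log, char[] entry)
{
    log.write (entry);   // entry is already char[], so no ambiguity
}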
 BTW; the correct thing happens when not using literals.
 For example, the following operates intuitively:

 char[]  msg = "I am a char[], dammit!"; write (msg);
Hmm, Kris's comment above gives me an idea for a _very_ easy fix for this in Phobos:

Why not change Phobos

void write ( char[] s) {.....};
void write (wchar[] s) {.....};
void write (dchar[] s) {.....};

into

void _write ( char[] s) {.....};
void _write (wchar[] s) {.....};
void _write (dchar[] s) {.....};
void write (char[] s) {_write(s)};

I think this would solve the issue with string literals as discussed in this thread. Also, overloading would not be hampered. And, those who really _need_ types other than the 8 bit chars, could still have their types work as usual.

(( I also had 2 more lines

void writeW (wchar[] s) {_write(s)};
void writeD (dchar[] s) {_write(s)};

above, but they're actually not needed, based on the assumption that the compiler is smart enough to not make redundant char type conversions, which I believe it is. -- And if not, then the 2 lines should be included. ))
 To /patch/ around this problem, literals may be now be suffixed with
 a type-identifier, including 'c', 'w', and 'd'. Thus, the above
 example will compile when you do the following:
 
 write ( "I am a char[], dammit!" c );
 
 I, for one, think this is silly. To skirt the issue, APIs end up
 being written as follows:
 
 void write (char[]); void writeW (wchar[]); void writeD (dchar[]);
 
 Is that redundant, or what? Well, it's what Phobos is forced to do in
 the Stream code (take a look). The error you ran into appears to be a
 situation where Walter's own code (std.file) bumps into this
Nov 11 2005
parent reply "Kris" <fu bar.com> writes:
"Georg Wrede" <georg.wrede nospam.org> wrote ...
 Kris wrote:
 This is the long standing mishmash between character literal
 arguments and parameters of type char[], wchar[], and/or dchar[].
 Character literals don't really have a "solid" type ~ the compiler
 can, and will, convert between wide and narrow representations on the
 fly.
Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)
That doesn't make it any less problematic :-)
 It is a problem for small example programs. Larger programs tend to
 (and IMHO should) have wrappers anyhow:
Not so. You'd see people complaining about this constantly if Stream.write() was not decorated to distinguish between the three relevant methods. Generally speaking, any code that deals with all three array types will bump into this. Mango.io has the same problem, since it exposes write() methods for every D type plus their array counterparts.
 Why not change Phobos

 void write ( char[] s) {.....};
 void write (wchar[] s) {.....};
 void write (dchar[] s) {.....};

 into

 void _write ( char[] s) {.....};
 void _write (wchar[] s) {.....};
 void _write (dchar[] s) {.....};
 void write (char[] s) {_write(s)};

 I think this would solve the issue with string literals as discussed in 
 this thread.
Then, how would one write a dchar[] literal? You just moved the problem to the _write() method instead. I think there needs to be a general resolution instead. One might infer the literal type from the content therein?
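To illustrate the point with dummy bodies (a sketch, not Phobos code):

void _write ( char[] s) {}
void _write (wchar[] s) {}
void _write (dchar[] s) {}
void write (char[] s) { _write(s); }

void main()
{
    write ("plain");     // fine: the literal commits to char[]
    // write ("wide"d);  // error: dchar[] does not implicitly convert to char[]
}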
 Also, overloading would not be hampered.

 And, those who really _need_ types other than the 8 bit chars, could still 
 have their types work as usual.
Ahh. I think non-ASCII folks would be troubled by this bias <g>
Nov 11 2005
parent reply James Dunne <james.jdunne gmail.com> writes:
Kris wrote:
 "Georg Wrede" <georg.wrede nospam.org> wrote ...
 
Kris wrote:

This is the long standing mishmash between character literal
arguments and parameters of type char[], wchar[], and/or dchar[].
Character literals don't really have a "solid" type ~ the compiler
can, and will, convert between wide and narrow representations on the
fly.
Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)
That doesn't make it any less problematic :-)
It is a problem for small example programs. Larger programs tend to
(and IMHO should) have wrappers anyhow:
Not so. You'd see people complaining about this constantly if Stream.write() was not decorated to distinguish between the three relevant methods. Generally speaking, any code that deals with all three array types will bump into this. Mango.io has the same problem, since it exposes write() methods for every D type plus their array counterparts.
Why not change Phobos

void write ( char[] s) {.....};
void write (wchar[] s) {.....};
void write (dchar[] s) {.....};

into

void _write ( char[] s) {.....};
void _write (wchar[] s) {.....};
void _write (dchar[] s) {.....};
void write (char[] s) {_write(s)};

I think this would solve the issue with string literals as discussed in 
this thread.
Then, how would one write a dchar[] literal? You just moved the problem to the _write() method instead. I think there needs to be a general resolution instead. One might infer the literal type from the content therein?
Also, overloading would not be hampered.

And, those who really _need_ types other than the 8 bit chars, could still 
have their types work as usual.
Ahh. I think non-ASCII folks would be troubled by this bias <g>
char[] does NOT NECESSARILY MEAN an ASCII-only string in D. char[] can be a collection of UTF-8 code points, which further confuses the matter.

So long as you can process each variant of Unicode encodings (UTF-8, UTF-16, and UTF-32), it should NOT matter which you choose as your default encoding for your project's strings. The only effect of the choice is the efficiency with which your project processes strings. You should not lose any data, unless you make incorrect assumptions in your code.

I think it was a very wise decision to make char type separate from byte and ubyte, but I don't think it has separated far enough. There should be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32). Then, there should be a single ASCII character type called 'char'. This would allow strings to be defined to hold ASCII characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points.

String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character. The default encoding should be modifiable with either commandline options or with pragmas, preferably pragmas.

For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').

Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.
Nov 21 2005
next sibling parent reply Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:

 Kris wrote:
 "Georg Wrede" <georg.wrede nospam.org> wrote ...
 
Kris wrote:

This is the long standing mishmash between character literal
arguments and parameters of type char[], wchar[], and/or dchar[].
Character literals don't really have a "solid" type ~ the compiler
can, and will, convert between wide and narrow representations on the
fly.
Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)
That doesn't make it any less problematic :-)
It is a problem for small example programs. Larger programs tend to
(and IMHO should) have wrappers anyhow:
Not so. You'd see people complaining about this constantly if Stream.write() was not decorated to distinguish between the three relevant methods. Generally speaking, any code that deals with all three array types will bump into this. Mango.io has the same problem, since it exposes write() methods for every D type plus their array counterparts.
Why not change Phobos

void write ( char[] s) {.....};
void write (wchar[] s) {.....};
void write (dchar[] s) {.....};

into

void _write ( char[] s) {.....};
void _write (wchar[] s) {.....};
void _write (dchar[] s) {.....};
void write (char[] s) {_write(s)};

I think this would solve the issue with string literals as discussed in 
this thread.
Then, how would one write a dchar[] literal? You just moved the problem to the _write() method instead. I think there needs to be a general resolution instead. One might infer the literal type from the content therein?
Also, overloading would not be hampered.

And, those who really _need_ types other than the 8 bit chars, could still 
have their types work as usual.
Ahh. I think non-ASCII folks would be troubled by this bias <g>
char[] does NOT NECESSARILY MEAN an ASCII-only string in D. char[] can be a collection of UTF-8 code points, which further confuses the matter. So long as you can process each variant of Unicode encodings (UTF-8, UTF-16, and UTF-32), it should NOT matter which you choose as your default encoding for your project's strings. The only effect of the choice is the efficiency with which your project processes strings. You should not lose any data, unless you make incorrect assumptions in your code. I think it was a very wise decision to make char type separate from byte and ubyte, but I don't think it has separated far enough. There should be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32). Then, there should be a single ASCII character type called 'char'. This would allow strings to be defined to hold ASCII characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points. String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character. The default encoding should be modifiable with either commandline options or with pragmas, preferrably pragmas. For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]'). Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.
Very nice. Well said James. It makes so much sense when laid out like this. D is only half way there to supporting international character sets. -- Derek (skype: derek.j.parnell) Melbourne, Australia 22/11/2005 10:51:23 AM
Nov 21 2005
parent reply Sean Kelly <sean f4.ca> writes:
Derek Parnell wrote:
 On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:
 I think it was a very wise decision to make char type separate from byte 
 and ubyte, but I don't think it has separated far enough.  There should 
 be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, 
 cdpt32).  Then, there should be a single ASCII character type called 
 'char'.  This would allow strings to be defined to hold ASCII 
 characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points.

 String literals created from the D compiler should be stored as a 
 specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and 
 should be represented as the corresponding static array of the type of 
 character.  The default encoding should be modifiable with either 
 commandline options or with pragmas, preferrably pragmas.

 For instance, if the default encoding were to be UTF-8, then a string 
 literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').

 Also, it should be possible to explicitly specify the encoding for each 
 string literal on a case-by-case basis.
Very nice. Well said James. It makes so much sense when laid out like this. D is only half way there to supporting international character sets.
I agree, but there must be a way to improve internationalization without this degree of complexity. If D ends up with 6+ character types I think I might scream. Is there any reason to support C-style code pages in-language in D? I would like to think not.

As it stands, D supports three compatible encodings (char, wchar, dchar) that the programmer may choose between for reasons of data size and algorithm complexity. The ASCII-compatible subset of UTF-8 works fine with the char-based C functions, and the full UTF-16 or UTF-32 character sets are compatible with the wchar-based C functions (depending on platform)... so far as I know at any rate. I grant that the variable size of wchar in C is an irritating problem, but it's not insurmountable. Why bother with all that old C code page nonsense?

Sean
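A tiny sketch of that ASCII-subset point (assuming Phobos's std.string.toStringz and the std.c.stdio binding for printf):

import std.string;   // toStringz
import std.c.stdio;  // printf

void main()
{
    char[] msg = "hello from D";     // plain ASCII, so also valid UTF-8
    printf("%s\n", toStringz(msg));  // safe to hand to a char-based C function
}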
Nov 21 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 16:43:50 -0800, Sean Kelly wrote:

 Derek Parnell wrote:
 On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:
 I think it was a very wise decision to make char type separate from byte 
 and ubyte, but I don't think it has separated far enough.  There should 
 be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, 
 cdpt32).  Then, there should be a single ASCII character type called 
 'char'.  This would allow strings to be defined to hold ASCII 
 characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points.

 String literals created from the D compiler should be stored as a 
 specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and 
 should be represented as the corresponding static array of the type of 
 character.  The default encoding should be modifiable with either 
 commandline options or with pragmas, preferrably pragmas.

 For instance, if the default encoding were to be UTF-8, then a string 
 literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').

 Also, it should be possible to explicitly specify the encoding for each 
 string literal on a case-by-case basis.
Very nice. Well said James. It makes so much sense when laid out like this. D is only half way there to supporting international character sets.
I agree, but there must be a way to improve internationalization without this degree of complexity. If D ends up with 6+ character types I think I might scream.
Where did you get "6+ character types" from? James is (at worst) only adding one, ASCII. So we would end up with

  utf8  <==> schar[]  (Short? chars)
  utf16 <==> wchar[]  (Wide chars)
  utf32 <==> dchar[]  (Double-wide chars)
  ascii <==> char[]   (byte size chars)

But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now.

Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represent a fixed size array of 4 code points.

In this scheme, the old 'char' would be directly compatible with C/C++ legacy code.
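For contrast, here's how today's D handles that distinction (a sketch, assuming std.utf.decode):

import std.utf;

void main()
{
    char[] s = "\u00e9x";    // the accented character occupies two UTF-8 code units
    assert(s.length == 3);   // .length counts code units, not characters

    size_t i = 0;
    dchar c = decode(s, i);  // decode yields one code point, advancing i past both units
    assert(c == '\u00e9' && i == 2);
}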
 Is there any reason to support C-style code pages 
 in-language in D? 
Huh? What code pages? This is nowhere near anything James was talking about.
 I would like to think not.  As it stands, D supports 
 three compatible encodings (char, wchar, dchar) that the programmer may 
 choose between for reasons of data size and algorithm complexity.  The 
 ASCII-compatible subset of UTF-8 works fine with the char-based C 
 functions, and the full UTF-16 or UTF-32 character sets are compatible 
 with the wchar-based C functions (depending on platform)... so far as I 
 know at any rate.  I grant that the variable size of wchar in C is an 
 irritating problem, but it's not insurmountable.  Why bother with all 
 that old C code page nonsense?
Sure the current system can work, but only if the coder does a lot of mundane, error-prone work, to make it happen. The compiler is a tool to help coders do better, so it should help us take care of incidental housekeeping so we can concentrate on algorithms rather than data representations in RAM. -- Derek (skype: derek.j.parnell) Melbourne, Australia 22/11/2005 12:15:38 PM
Nov 21 2005
next sibling parent "Kris" <fu bar.com> writes:
"Derek Parnell" <derek psych.ward> wrote ...
 On Mon, 21 Nov 2005 16:43:50 -0800, Sean Kelly wrote:
[snip]
 Derek Parnell wrote:
 On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:
 Very nice. Well said James. It makes so much sense when laid out like 
 this.
 D is only half way there to supporting international character sets.
I agree, but there must be a way to improve internationalization without this degree of complexity. If D ends up with 6+ character types I think I might scream.
Where did you get "6+ character types" from? James is (at worst) only adding one, ASCII. So we would end up with utf8 <==> schar[] (Short? chars) utf16 <==> wchar[] (Wide chars) utf32 <==> dchar[] (Double-wide chars) ascii <==> char[] (byte size chars) But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now. Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represents a fixed size array of 4 code points.
Maybe. To maintain array indexing semantics, the compiler might implement such things as an array of pointers to byte arrays?

Then, there's at least this problem :: dchar is always self-contained. It does not have surrogates, ever. Given that it's more efficient to store as a one-dimensional array, surely this would cause inconsistencies in usage? And what about BMP utf16? It doesn't need such treatment either (though extended utf16 would do).

But I agree in principle ~ the semantics of indexing (as in arrays) don't work well with multi code-unit encodings. Packages to deal with such things typically offer iterators as a supplement. Take a look at ICU for examples?
 Sure the current system can work, but only if the coder does a lot of
 mundane, error-prone work, to make it happen. The compiler is a tool to
 help coders do better, so it should help us take care of incidental
 housekeeping so we can concentrate of algorithms rather than data
 representations in RAM.
I suspect it's a tall order to build such things into the compiler; especially when the issues are not clear-cut, and when there are heavy-duty libraries to take up the slack? Don't those libraries take care of data representation and incidental housekeeping on behalf of the developer?
Nov 21 2005
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Derek Parnell wrote:
 
 Where did you get "6+ character types" from?
I misunderstood and thought his cdpt8 would be added in addition to the existing character types.
 James is (at worst) only adding one, ASCII. So we would end up with
 
   utf8  <==> schar[]  (Short? chars)
   utf16 <==> wchar[]  (Wide chars)
   utf32 <==> dchar[]  (Double-wide chars)
   ascii <==> char[]   (byte size chars)
 
 But the key point is that each element in these arrays would be a
 *character* (a.k.a. Code Point) rather than Code Units as they are now.
 
 Thus a schar is an atomic value that represents a single character even if
 that character takes up one, two, or four bytes in RAM. And 'schar[4]'
 would represents a fixed size array of 4 code points. 
This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again? And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?
 Sure the current system can work, but only if the coder does a lot of
 mundane, error-prone work, to make it happen. The compiler is a tool to
 help coders do better, so it should help us take care of incidental
 housekeeping so we can concentrate of algorithms rather than data
 representations in RAM.
The only somewhat confusing issue to me is that the symbol names "char" and "wchar" imply that the data stored therein is a complete character, when this is only sometimes true. I agree that this is a problem, but I'm not sure that variable width characters are the solution. It makes array manipulations oddly inconsistent, for one thing. Should the length property return the number of characters in the array? Would a size property be needed to determine the memory footprint of this array? What if I try something like this:

utf8[] myString = "multiwidth";
utf8[] slice = myString[0..1];
slice[0] = '\U00000001';

Would the sliced array resize to fit the potentially different-sized character being inserted, or would myString end up corrupted?

Sean
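For comparison, slicing today's char[] already has a related hazard (a sketch, assuming std.utf.validate):

import std.utf;

void main()
{
    char[] s = "a\u00e9b";    // the middle character occupies two code units
    char[] bad = s[0 .. 2];   // slices through the middle of that character
    // validate(bad);         // would throw: not a valid UTF-8 sequence
}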
Nov 21 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 19:05:53 -0800, Sean Kelly wrote:

 Derek Parnell wrote:
 
 Where did you get "6+ character types" from?
I misunderstood and thought his cdpt8 would be added in addition to the existing character types.
 James is (at worst) only adding one, ASCII. So we would end up with
 
   utf8  <==> schar[]  (Short? chars)
   utf16 <==> wchar[]  (Wide chars)
   utf32 <==> dchar[]  (Double-wide chars)
   ascii <==> char[]   (byte size chars)
 
 But the key point is that each element in these arrays would be a
 *character* (a.k.a. Code Point) rather than Code Units as they are now.
 
 Thus a schar is an atomic value that represents a single character even if
 that character takes up one, two, or four bytes in RAM. And 'schar[4]'
 would represents a fixed size array of 4 code points. 
This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again? And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?
That is what I'm doing now to Build. Internally, all strings will be dchar[], but what I'm finding out is the huge lack of support for dchar[] in phobos. I've now coded my own routine to read text files in UTF formats, but store them as dchar[] in the application. Then I've had to code appropriate routines for all the other support functions: split(), strip(), find(), etc ...
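For the simplest case (a UTF-8 input file) the conversions themselves can lean on std.utf; a sketch, with placeholder file names:

import std.file;
import std.utf;

void main()
{
    char[] raw = cast(char[]) std.file.read("input.txt");  // assuming UTF-8 on disk
    dchar[] text = toUTF32(raw);      // work on whole code points internally

    // ... split(), strip(), find() etc. operating on dchar[] ...

    std.file.write("output.txt", toUTF8(text));
}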
 Sure the current system can work, but only if the coder does a lot of
 mundane, error-prone work, to make it happen. The compiler is a tool to
 help coders do better, so it should help us take care of incidental
 housekeeping so we can concentrate of algorithms rather than data
 representations in RAM.
The only somewhat confusing issue to me is that the symbol names "char" and "wchar" imply that the data stored therein is a complete character, when this is only sometimes true. I agree that this is a problem, but I'm not sure that variable width characters is the solution. It makes array manipulations oddly inconsistent, for one thing. Should the length property return the number of characters in the array?
Yes.
  Would a 
 size property be needed to determine the memory footprint of this array? 
Yes.
   What if I try something like this:
 
 utf8[] myString = "multiwidth";
 utf8[] slice = myString[0..1];
 slice[0] = '\U00000001';
 
 Would the sliced array resize to fit the potentially different-sized 
 character being inserted, or would myString end up corrputed?
Yes, it would be complex. No, the myString would not be corrupted. It would just be the same as doing it 'manually', only the compiler will do the hack work for you.

char[] myString = "multiwidth";
char[] slice = myString[0..1];

// modify base string.
myString = "\U00000001" ~ myString[1..$];

// reslice it because its address might have changed.
slice = myString[0..1];

Messy doing it manually, so that's why a code-point array would be better than a byte/short/int array for strings. -- Derek (skype: derek.j.parnell) Melbourne, Australia 22/11/2005 2:17:40 PM
Nov 21 2005
parent Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Derek Parnell wrote:
 On Mon, 21 Nov 2005 19:05:53 -0800, Sean Kelly wrote:
 
 
Derek Parnell wrote:

Where did you get "6+ character types" from?
I misunderstood and thought his cdpt8 would be added in addition to the existing character types.
James is (at worst) only adding one, ASCII. So we would end up with

  utf8  <==> schar[]  (Short? chars)
  utf16 <==> wchar[]  (Wide chars)
  utf32 <==> dchar[]  (Double-wide chars)
  ascii <==> char[]   (byte size chars)

But the key point is that each element in these arrays would be a
*character* (a.k.a. Code Point) rather than Code Units as they are now.

Thus a schar is an atomic value that represents a single character even if
that character takes up one, two, or four bytes in RAM. And 'schar[4]'
would represents a fixed size array of 4 code points. 
This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again? And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?
That is what I'm doing now to Build. Internally, all strings will be dchar[], but what I'm finding out is the huge lack of support for dchar[] in phobos. I've now coded my own routine to read text files in UTF formats, but store them as dchar[] in the application. Then I've had to code appropriate routines for all the other support functions: split(), strip(), find(), etc ...
Then, wouldn't having good dchar[] support in Phobos be a better solution than having to introduce another type in the language to do the same thing that dchar[] does?

The only difference I see between such a type (a codepoint string) and a dchar string is in better storage size for the codepoint string, but is that difference worth it? (not to mention a codepoint string would have (in certain cases) much worse modification performance than a dchar string)

Also, what is Phobos lacking in dchar[] support?

-- Bruno Medeiros - CS/E student "Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
Nov 23 2005
prev sibling parent reply "Kris" <fu bar.com> writes:
"James Dunne" <james.jdunne gmail.com> wrote ...
 Kris wrote:
[snip]
 Ahh. I think non-ASCII folks would be troubled by this bias <g>
char[] does NOT NECESSARILY MEAN an ASCII-only string in D. char[] can be a collection of UTF-8 code points, which further confuses the matter.
Indeed. I was alluding to encoding multi-byte-utf8 literals by hand; but it was a piss-poor attempt at humour.
 So long as you can process each variant of Unicode encodings (UTF-8, 
 UTF-16, and UTF-32), it should NOT matter which you choose as your default 
 encoding for your project's strings.  The only effect of the choice is the 
 efficiency with which your project processes strings.  You should not lose 
 any data, unless you make incorrect assumptions in your code.
Right.
 String literals created from the D compiler should be stored as a specific 
 encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be 
 represented as the corresponding static array of the type of character.
They are. The 'c', 'w', and 'd' suffix provides the fine control. Auto instances map implicitly to 'c'. Explicitly typed instances (e.g. wchar[] s = "a wide string";) also provide fine control. The minor concern I have with this aspect is that the literal content does not play a role, whereas it does with char literals (such as '?', '\x0001', and '\X00000001'). No big deal there, although perhaps it's food for another topic?
 The default encoding should be modifiable with either commandline options 
 or with pragmas, preferrably pragmas.
I wondered about that also. Walter pointed out it would be similar to the signed/unsigned char-type switch prevalent in C compilers, which can cause grief. Perhaps D does need defaults like that, but some consistency in the interpretation of string literals would have to happen first. This requires a subtle change:

That change is to assign a resolvable type to 'undecorated' string-literal arguments in the same way as the "auto" keyword does. This would also make it consistent with undecorated integer-literals (as noted elsewhere). In short, an undecorated argument "literal" would be treated as a decorated "literal"c (that 'c' suffix makes it utf8), just like auto does. This would mean all uses of string literals are treated consistently, and all undecorated literals (string, char, numeric) have consistent rules when it comes to overload resolution (currently they do not).

To elaborate, here's the undecorated string literal asymmetry:

auto s = "literal";   // effectively adds an implicit 'c' suffix
myFunc ("literal");   // Should be changed to behave as above

What I hear you asking for is a way to alter that implicit suffix? I'd be really happy to just get the consistency first :-)

These instances are all (clearly) explicitly typed:

char[]  s = "literal";   // utf8
wchar[] s = "literal";   // utf16
dchar[] s = "literal";   // utf32

auto s = "literal"c;     // utf8
auto s = "literal"w;     // utf16
auto s = "literal"d;     // utf32

myFunc ("literal"c);     // utf8
myFunc ("literal"w);     // utf16
myFunc ("literal"d);     // utf32
 For instance, if the default encoding were to be UTF-8, then a string 
 literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').

 Also, it should be possible to explicitly specify the encoding for each 
 string literal on a case-by-case basis.
If I understand correctly, you can. See above.
Nov 21 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:
 The minor concern I have with
 this aspect is that the literal content does not play a role, whereas it
 does with char literals (such as '?', '\x0001', and '\X00000001').
But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?)

If this were to change would it make this an error:

foo(wchar[] foo) {}
foo("\U00000040");
 No big
 deal there, although perhaps it's food for another topic?
Here seems like as good a place as any. Regan
Nov 22 2005
parent reply "Kris" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote...
 On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:
 The minor concern I have with
 this aspect is that the literal content does not play a role, whereas it
 does with char literals (such as '?', '\x0001', and '\X00000001').
But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");
 No big
 deal there, although perhaps it's food for another topic?
Here seems like as good a place as any.
Oh, that minor concern was in regard to consistency here also. I have no quibble with the character type being implied by content (consistent with numeric literals):

1) The type for literal chars is implied by their content ('?', '\u0001', '\U00000001')

2) The type of a numeric literal is implied by the content (0xFF, 0xFFFFFFFF, 1.234)

3) The type for literal strings is not influenced at all by the content (as far as I'm aware).

These two inconsistencies are small, but they may influence concerns elsewhere ...
Nov 22 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Tue, 22 Nov 2005 15:01:11 -0800, Kris <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote...
 On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:
 The minor concern I have with
 this aspect is that the literal content does not play a role, whereas  
 it
 does with char literals (such as '?', '\x0001', and '\X00000001').
But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");
 No big
 deal there, although perhaps it's food for another topic?
Here seems like as good a place as any.
Oh, that minor concern was in regard to consistency here also.
I realise that. I'm just trying to explore whether they _should_ behave the same, or not, are they both apples or are they apples and oranges. I agree things should behave consistently, provided it makes sense for them to do so.
 I have no quibble with the character type being implied by content
I didn't think you did. My example above is a string literal, not a character literal. If the string literal type was implied by content would my example above be an error?

"\U00000040" is a dchar (sized) character in a string literal. "abc \U00000040 def" could be used also. foo requires a wchar[]. If the type of the literal is taken to be dchar, based on contents, then it does not match wchar and you need the 'w' suffix or similar to resolve it.

It seems the real question is, what did the programmer intend? Did they intend for the character to be represented exactly as they typed it? In this case, if it was passed exactly as written it would become 2 wchar code units, did they want that? Or did they simply want the equivalent character in the resulting encoding?

I think the latter is more likely. The former can create illegal UTF sequences.

What do you think? The facts:
 1) The type for literal chars is implied by their content ('?', '\u0001',
 '\U00000001')

 2) The type of a numeric literal is implied by the content (0xFF,
 0xFFFFFFFF, 1.234)

 3) The type for literal strings is not influenced at all by the content.
smaller types. similar enough? or is it in fact different?

 (as far as I'm aware).
I'm not aware of any either. Regan
Nov 23 2005
parent reply "Kris" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote
 On Tue, 22 Nov 2005 15:01:11 -0800, Kris <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote...
 On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:
 The minor concern I have with
 this aspect is that the literal content does not play a role, whereas 
 it
 does with char literals (such as '?', '\x0001', and '\X00000001').
But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");
 No big
 deal there, although perhaps it's food for another topic?
Here seems like as good a place as any.
Oh, that minor concern was in regard to consistency here also.
I realise that. I'm just trying to explore whether they _should_ behave the same, or not, are they both apples or are they apples and oranges. I agree things should behave consistently, provided it makes sense for them to do so.
 I have no quibble with the character type being implied by content
I didn't think you did. My example above is a string literal, not a character literal. If the string literal type was implied by content would my example above be an error?
To clarify: I'm already making the assumption that the compiler changes to eliminate the uncommitted aspect of argument literals. That presupposes the "default" type will be char[] (like auto literals).

This is a further, and probably minor, question as to whether it might be useful (and consistent) that the "default" type be implied by the literal content. Suffix 'typing' and compile-time transcoding are still present and able. I'm not at all sure it would be terribly useful, given that the literal will potentially be transcoded at compile-time anyway.

[snip]
 I think the latter is more likely. The former can create illegal UTF 
 sequences.

 What do you think?
I think I'd be perfectly content once argument-literals lose their uncommitted status, and thus behave like auto literals <g>
Nov 23 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Wed, 23 Nov 2005 13:58:20 -0800, Kris <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote
 On Tue, 22 Nov 2005 15:01:11 -0800, Kris <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote...
 On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:
 The minor concern I have with
 this aspect is that the literal content does not play a role, whereas
 it
 does with char literals (such as '?', '\x0001', and '\X00000001').
But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");
 No big
 deal there, although perhaps it's food for another topic?
Here seems like as good a place as any.
Oh, that minor concern was in regard to consistency here also.
I realise that. I'm just trying to explore whether they _should_ behave the same, or not, are they both apples or are they apples and oranges. I agree things should behave consistently, provided it makes sense for them to do so.
 I have no quibble with the character type being implied by content
I didn't think you did. My example above is a string literal, not a character literal. If the string literal type was implied by content would my example above be an error?
To clarify: I'm already making the assumption that the compiler changes to eliminate the uncommited aspect of argument literals. That presupposes the "default" type will be char[] (like auto literals).
Same.
 This is a further, and probably minor, question as to whether it might be
 useful (and consistent) that "default" type be implied by the literal
 content.
Yes, that is what I thought we were doing, questioning whether it would be useful. My current feeling is that it's not, but we'll see...
 Suffix 'typing' and compile-time transcoding are still present and able.
Yep.
 I'm not at all sure it would be terribly useful, given that the
 literal will potentially be transcoded at compile-time anyway.
Like in my first example:

foo(wchar[] foo) {}
foo("\U00000040");

the string containing the dchar content would in fact be transcoded to wchar at compile time to match the one available overload. So, when wouldn't it be transcoded at compile time? All I can think of is "auto", eg.

auto test = "abc \U00000040 def";

So, if this is the only case where the string contents make a difference I would call that inconsistent, and would instead opt for using the string literal suffix to specify an encoding where required, eg.

auto test = "abc \U00000040 def"d;

Then the statement "all string literals default to char[] unless the required encoding can be determined at compile time" would be true.

Regan
Nov 23 2005
parent "Kris" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message news
[snip]
 Then the statement "all string literals default to char[] unless a the 
 required encoding can be determined at compile time" would be true.
That would be great. Now, will this truly come to pass? <g>
Nov 23 2005
prev sibling parent reply Don Clugston <dac nospam.com.au> writes:
Kris wrote:
 This is the long standing mishmash between character literal arguments and 
 parameters of type char[], wchar[], and/or dchar[]. Character literals don't 
 really have a "solid" type ~ the compiler can, and will, convert between 
 wide and narrow representations on the fly.
 
 Suppose you have the following methods:
 
 void write (char[] x);
 void write (wchar[] x);
 void write (dchar[] x);
 
 Given a literal argument:
 
 write ("what am I?");
 
 D doesn't know whether to invoke the char[] or wchar[] signature, since the 
 literal is treated as though it's possibly any of the three types. This is 
 the kind of non-determinism you get when the compiler becomes too 'smart' 
 (unwarranted automatic conversion, in this case).
I agree, except that I think the problem in this case is that it's not converting "from" anything! There's no "exact match" which it tries first.

A parallel case is that a floating point literal can be implicitly converted to float, double, real, cfloat, cdouble, creal. For fp literals, the default is double. It's a bit odd that with

dchar[] q = "abc";
wchar[] w = "abc";

"abc" is a dchar literal the first time, but a wchar literal the second, whereas with

real q = 2.5;
double w = 2.5;

2.5 is a double literal in both cases. No wonder array literals are such a problem...
Nov 11 2005
parent reply "Kris" <fu bar.com> writes:
"Don Clugston" <dac nospam.com.au> wrote ..
 Kris wrote:
 D doesn't know whether to invoke the char[] or wchar[] signature, since 
 the literal is treated as though it's possibly any of the three types. 
 This is the kind of non-determinism you get when the compiler becomes too 
 'smart' (unwarranted automatic conversion, in this case).
I agree, except that I think the problem in this case is that it's not converting "from" anything! There's no "exact match" which it tries first.
There would be if the auto-casting were disabled, and the type were determined via the literal content, in conjunction with the /default/ literal type suggested by GW. Yes?
Nov 11 2005
parent jicman <jicman_member pathlink.com> writes:
Kris says...
"Don Clugston" <dac nospam.com.au> wrote ..
 Kris wrote:
 D doesn't know whether to invoke the char[] or wchar[] signature, since 
 the literal is treated as though it's possibly any of the three types. 
 This is the kind of non-determinism you get when the compiler becomes too 
 'smart' (unwarranted automatic conversion, in this case).
I agree, except that I think the problem in this case is that it's not converting "from" anything! There's no "exact match" which it tries first.
There would be if the auto-casting were disabled, and the type were determined via the literal content, in conjunction with the /default/ literal type suggested by GW. Yes?
Gosh, all I wanted was a simple explanation. :-) (kidding)

I used writeString and it works,

|17:24:22.68>type ftest.d
|import std.file;
|import std.stream;
|int main()
|{
|  File log = new File("myfile.txt",FileMode.Out);
|  log.writeString("this is a test");
|  log.close();
|  return 1;
|}

thanks. Please, continue with your discussion. :-)

josé
Nov 11 2005
prev sibling parent reply Nick <Nick_member pathlink.com> writes:
In article <dl0hja$2aal$1 digitaldaemon.com>, jicman says...
So, I have this complicated piece of code:

|import std.file;
|import std.stream;
|int main()
|{
|  File log = new File("myfile.txt",FileMode.Out);
|  log.write("this is a test");
|  log.close();
|  return 1;
|}
Also note one thing though: Stream.write() will write the string in binary format, ie. it will write a binary int with the length, and then the string. If you want a plain ASCII file, which is probably what you want in a log file, you should use Stream.writeString(), or Stream.writeLine() which inserts a line break. Or you can use writef/writefln for more advanced formatting.

If you already knew this then disregard this post ;-)

Nick
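A minimal sketch of those alternatives on the same File (assuming the std.stream API used earlier in the thread, including Stream's writefln overload):

import std.stream;

void main()
{
    File log = new File("myfile.txt", FileMode.Out);
    log.writeString("raw text, no length prefix");
    log.writeLine("one line, with a line break appended");
    log.writefln("formatted: %s = %d", "answer", 42);
    log.close();
}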
Nov 11 2005
parent jicman <jicman_member pathlink.com> writes:
Nick says...
In article <dl0hja$2aal$1 digitaldaemon.com>, jicman says...
So, I have this complicated piece of code:

|import std.file;
|import std.stream;
|int main()
|{
|  File log = new File("myfile.txt",FileMode.Out);
|  log.write("this is a test");
|  log.close();
|  return 1;
|}
Also note one thing though: Stream.write() will write the string in binary format, ie. it will write a binary int with the length, and then the string. If you want a plain ASCII file, which is probably what you want in a log file, you should use Stream.writeString(), or Stream.writeLine() which inserts a line break. Or you can use writef/writefln for more advanced formatting. If you already knew this then disregard this post ;-)
Disregard this post? Oh, no! My friend, you wrote it, I am going to read it. (Yes, I knew that. I was trying to quickly write some debugging code for something at work, and I found that compiler error and asked.) Thanks. josé
Nov 11 2005