
digitalmars.D.learn - Stream and File understanding.

reply jicman <jicman_member pathlink.com> writes:
So, I have this complicated piece of code:

|import std.file;
|import std.stream;
|int main()
|{
|  File log = new File("myfile.txt",FileMode.Out);
|  log.write("this is a test");
|  log.close();
|  return 1;
|}

and when I try to compile it, I get,

|ftest.d(6): function std.stream.Stream.write called with argument types:
|        (char[14])
|matches both:
|        std.stream.Stream.write(char[])
|and:
|        std.stream.Stream.write(wchar[])

Shouldn't it just match "std.stream.Stream.write(char[])"?

thanks,

josé
Nov 10 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
jicman wrote:
 So, I have this complicated piece of code:
 
 |import std.file;
 |import std.stream;
 |int main()
 |{
 |  File log = new File("myfile.txt",FileMode.Out);
 |  log.write("this is a test");
 |  log.close();
 |  return 1;
 |}
 
 and I try to compile it, I get,
 
 |ftest.d(6): function std.stream.Stream.write called with argument types:
 |        (char[14])
 |matches both:
 |        std.stream.Stream.write(char[])
 |and:
 |        std.stream.Stream.write(wchar[])
 
 Shouldn't it just match "std.stream.Stream.write(char[])"?
The problem is that string literals can be implicitly converted to char, wchar, and dchar strings. To fix the overload resolution problem, try this:

log.write( "this is a test"c );

The 'c' suffix indicates that the above is a char string.

Sean
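For reference, here's the original program with the suffix applied (a minimal sketch, reusing the std.stream File API from the post above):

import std.file;
import std.stream;
int main()
{
  File log = new File("myfile.txt", FileMode.Out);
  log.write("this is a test"c);   // 'c' commits the literal to char[]
  log.close();
  return 1;
}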
Nov 10 2005
next sibling parent "Kris" <fu bar.com> writes:
This produces a compile error:

void write (char[] x){}
void write (wchar[] x){}

void main()
{
    write ("part 1"
           "part 2" c);
}

The compiler complains about the two literal types not matching. This also 
fails:

void main()
{
    write ("part 1" c
           "part 2" c);
}

This strange looking suffixing is present due to unwarranted & unwanted 
automatic type conversion, is it not? Wouldn't it be better to explicitly 
request conversion when it's actually wanted instead? Isn't that what the 
cast() operator is for?
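
For comparison, the cast() form might look like this (a sketch against the same two overloads; nothing here is required by the current compiler):

void write (char[] x) {}
void write (wchar[] x) {}

void main()
{
    // an explicit conversion fixes the literal's type, so the call
    // resolves to the char[] overload without any suffix
    write (cast(char[]) "part 1 part 2");
}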

- Kris




"Sean Kelly" <sean f4.ca> wrote in message 
news:dl0in9$2bet$1 digitaldaemon.com...
 jicman wrote:
 So, I have this complicated piece of code:

 |import std.file;
 |import std.stream;
 |int main()
 |{
 |  File log = new File("myfile.txt",FileMode.Out);
 |  log.write("this is a test");
 |  log.close();
 |  return 1;
 |}

 and I try to compile it, I get,

 |ftest.d(6): function std.stream.Stream.write called with argument types:
 |        (char[14])
 |matches both:
 |        std.stream.Stream.write(char[])
 |and:
 |        std.stream.Stream.write(wchar[])

 Shouldn't it just match "std.stream.Stream.write(char[])"?
The problem is that literal string literals can be implicitly converted to char, wchar, and dchar strings. To fix the overload resolution problem, try this: log.write( "this is a test"c ); the 'c' indicates that the above is a char string. Sean
Nov 10 2005
prev sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Sean Kelly wrote:
 jicman wrote:
 
 So, I have this complicated piece of code:

 |import std.file;
 |import std.stream;
 |int main()
 |{
 |  File log = new File("myfile.txt",FileMode.Out);
 |  log.write("this is a test");
 |  log.close();
 |  return 1;
 |}

 and I try to compile it, I get,

 |ftest.d(6): function std.stream.Stream.write called with argument types:
 |        (char[14])
 |matches both:
 |        std.stream.Stream.write(char[])
 |and:
 |        std.stream.Stream.write(wchar[])

 Shouldn't it just match "std.stream.Stream.write(char[])"?
The problem is that literal string literals can be implicitly converted to char, wchar, and dchar strings. To fix the overload resolution problem, try this: log.write( "this is a test"c ); the 'c' indicates that the above is a char string.
I just posted a "nice" fix on this thread. But it seems overkill (and brittle), if one assumes this is just a problem with string literals!

_If_ it is true that this "problem" exists only with string literals, then it should be even easier to fix!

The compiler knows (or at least _should_ know) the character width of the source code file. Now, if there's an undecorated string literal in it, then _simply_assume_ that is the _intended_ type of the string!

(( At this time opponents will say "what if the source code file gets converted into another character width?" -- My answer: "Tough, ain't it!", since there's a law against gratuitous mucking with source code. ))

So, implicitly just assume the source code literal character width. The '"c' does _not_ exist so the compiler can force you to state the obvious. It's there so you _can_ be explicit _when_ it really matters to you.

---

Oh, and if we want to be real fancy, we could also have a pragma stating the default for character literals! And when the pragma is not used, then assume based on the source.
Nov 11 2005
next sibling parent reply bert <bert_member pathlink.com> writes:
In article <4374598B.30604 nospam.org>, Georg Wrede says... 
 
 
The compiler knows (or at least _should_ know) the character width of  
the source code file. Now, if there's an undecorated string literal in  
it, then _simply_assume_ that is the _intended_ type of the string! 
 
The *programmer* assumes so *anyway*. Why on earth should the compiler assume anything else!

BTW, D is really cool!
Nov 11 2005
parent jicman <jicman_member pathlink.com> writes:
bert says...
In article <4374598B.30604 nospam.org>, Georg Wrede says... 
 
 
The compiler knows (or at least _should_ know) the character width of  
the source code file. Now, if there's an undecorated string literal in  
it, then _simply_assume_ that is the _intended_ type of the string! 
 
The *programmer* assumes so *anyway*. Why on earth should the copiler assume anything else! BTW, D is really cool!
It is really cool. :-)
Nov 11 2005
prev sibling next sibling parent "Kris" <fu bar.com> writes:
"Georg Wrede" <georg.wrede nospam.org> wrote ...
 The compiler knows (or at least _should_ know) the character width of the 
 source code file. Now, if there's an undecorated string literal in it, 
 then _simply_assume_ that is the _intended_ type of the string!
That sounds like a good idea; it would set the /default/ type for literals. But the compiler should still inspect the literal content to determine if it has explicit wchar or dchar characters within. The compiler apparently does this, but doesn't use it to infer literal type? This combination would very likely resolve all such problems, assuming the auto-casting were removed also?
Nov 11 2005
prev sibling parent reply Nick <Nick_member pathlink.com> writes:
In article <4374598B.30604 nospam.org>, Georg Wrede says...
The compiler knows (or at least _should_ know) the character width of 
the source code file. Now, if there's an undecorated string literal in 
it, then _simply_assume_ that is the _intended_ type of the string!

(( At this time opponents will say "what if the source code file gets 
converted into another character width?" -- My answer: "Tough, ain't 
it!", since there's a law against gratuituous mucking with source code.  ))
Well that's a nice attitude. Makes copy-and-paste impossible, and makes writing code off html, plain text, and books impossible too, since the code's behaviour now depends on your language environment. I'm sure that won't cause any bugs at all ;-)

Nick
Nov 14 2005
parent Georg Wrede <georg.wrede nospam.org> writes:
Nick wrote:
 In article <4374598B.30604 nospam.org>, Georg Wrede says...
 
 The compiler knows (or at least _should_ know) the character width
 of the source code file. Now, if there's an undecorated string
 literal in it, then _simply_assume_ that is the _intended_ type of
 the string!
 
 (( At this time opponents will say "what if the source code file
 gets converted into another character width?" -- My answer: "Tough,
 ain't it!", since there's a law against gratuituous mucking with
 source code.  ))
Well that's a nice attitude. Makes copy-and-paste impossible, and makes writing code off html, plain text, and books impossible too, since the code's behaviour now dependens on your language environment. I'm sure that won't cause any bugs at all ;-)
:-) there are actually 2 separate issues involved.

First of all, the copy-and-paste issue: To be able to paste into the string, the text editor (or whatever) has to know the character width of the file to begin with, since pasting is done differently with the various UTF widths. Further, one cannot paste anything "in the wrong UTF width" as such, so the editor has to convert it into the width of the entire file first. (This _should_ be handled by the operating system (not the text editor), but I wouldn't bet on it, at least before 2010 or something. Not with at least _some_ "operating systems".)

Second, the width the undecorated literal is to be stored as: What makes this issue interesting is, is it feasible to assume something or declare the literal as "of unspecified" width. There's lately been some research into the issue (in the D newsgroup). The jury is still out.
Nov 14 2005
prev sibling next sibling parent reply "Kris" <fu bar.com> writes:
This is the long standing mishmash between character literal arguments and 
parameters of type char[], wchar[], and/or dchar[]. Character literals don't 
really have a "solid" type ~ the compiler can, and will, convert between 
wide and narrow representations on the fly.

Suppose you have the following methods:

void write (char[] x);
void write (wchar[] x);
void write (dchar[] x);

Given a literal argument:

write ("what am I?");

D doesn't know whether to invoke the char[] or wchar[] signature, since the 
literal is treated as though it's possibly any of the three types. This is 
the kind of non-determinism you get when the compiler becomes too 'smart' 
(unwarranted automatic conversion, in this case).

To /patch/ around this problem, literals may now be suffixed with a 
type-identifier, including 'c', 'w', and 'd'. Thus, the above example will 
compile when you do the following:

write ( "I am a char[], dammit!" c );

I, for one, think this is silly. To skirt the issue, APIs end up being 
written as follows:

void write (char[]);
void writeW (wchar[]);
void writeD (dchar[]);

Is that redundant, or what? Well, it's what Phobos is forced to do in the 
Stream code (take a look). The error you ran into appears to be a situation 
where Walter's own code (std.file) bumps into this ~ wish that were enough 
to justify a real fix for this long-running concern.

BTW; the correct thing happens when not using literals. For example, the 
following operates intuitively:

char[]  msg = "I am a char[], dammit!";
write (msg);


- Kris






"jicman" <jicman_member pathlink.com> wrote in message 
news:dl0hja$2aal$1 digitaldaemon.com...
 So, I have this complicated piece of code:

 |import std.file;
 |import std.stream;
 |int main()
 |{
 |  File log = new File("myfile.txt",FileMode.Out);
 |  log.write("this is a test");
 |  log.close();
 |  return 1;
 |}

 and I try to compile it, I get,

 |ftest.d(6): function std.stream.Stream.write called with argument types:
 |        (char[14])
 |matches both:
 |        std.stream.Stream.write(char[])
 |and:
 |        std.stream.Stream.write(wchar[])

 Shouldn't it just match "std.stream.Stream.write(char[])"?

 thanks,

 josé

 
Nov 10 2005
next sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Kris wrote:
 This is the long standing mishmash between character literal
 arguments and parameters of type char[], wchar[], and/or dchar[].
 Character literals don't really have a "solid" type ~ the compiler
 can, and will, convert between wide and narrow representations on the
 fly.
Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)

It is a problem for small example programs. Larger programs tend to (and IMHO should) have wrappers anyhow:

void logwrite(char[] logfile, char[] entry)
{
    std.stream.Stream.write(logfile, entry)
}
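A compilable version of such a wrapper might look like this (a sketch; the logwrite name and the Stream parameter are illustrative, not from the post):

import std.stream;

// the char[] parameter commits the literal's type before it ever
// reaches the overloaded Stream.write
void logwrite (Stream log, char[] entry)
{
    log.write (entry);   // entry is already char[], so no ambiguity
}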
 BTW; the correct thing happens when not using literals.
 For example, the following operates intuitively:

 char[]  msg = "I am a char[], dammit!"; write (msg);
Hmm, Kris's comment above gives me an idea for a _very_ easy fix for this in Phobos:

Why not change Phobos

void write ( char[] s) {.....};
void write (wchar[] s) {.....};
void write (dchar[] s) {.....};

into

void _write ( char[] s) {.....};
void _write (wchar[] s) {.....};
void _write (dchar[] s) {.....};
void write (char[] s) {_write(s)};

I think this would solve the issue with string literals as discussed in this thread. Also, overloading would not be hampered. And, those who really _need_ types other than the 8 bit chars, could still have their types work as usual.

(( I also had 2 more lines

void writeW (wchar[] s) {_write(s)};
void writeD (dchar[] s) {_write(s)};

above, but they're actually not needed, based on the assumption that the compiler is smart enough to not make redundant char type conversions, which I believe it is. -- And if not, then the 2 lines should be included. ))
 To /patch/ around this problem, literals may be now be suffixed with
 a type-identifier, including 'c', 'w', and 'd'. Thus, the above
 example will compile when you do the following:
 
 write ( "I am a char[], dammit!" c );
 
 I, for one, think this is silly. To skirt the issue, APIs end up
 being written as follows:
 
 void write (char[]); void writeW (wchar[]); void writeD (dchar[]);
 
 Is that redundant, or what? Well, it's what Phobos is forced to do in
 the Stream code (take a look). The error you ran into appears to be a
 situation where Walter's own code (std.file) bumps into this
Nov 11 2005
parent reply "Kris" <fu bar.com> writes:
"Georg Wrede" <georg.wrede nospam.org> wrote ...
 Kris wrote:
 This is the long standing mishmash between character literal
 arguments and parameters of type char[], wchar[], and/or dchar[].
 Character literals don't really have a "solid" type ~ the compiler
 can, and will, convert between wide and narrow representations on the
 fly.
Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)
That doesn't make it any less problematic :-)
 It is a problem for small example programs. Larger programs tend to
 (and IMHO should) have wrappers anyhow:
Not so. You'd see people complaining about this constantly if Stream.write() was not decorated to distinguish between the three relevant methods. Generally speaking, any code that deals with all three array types will bump into this. Mango.io has the same problem, since it exposes write() methods for every D type plus their array counterparts.
 Why not change Phobos

 void write ( char[] s) {.....};
 void write (wchar[] s) {.....};
 void write (dchar[] s) {.....};

 into

 void _write ( char[] s) {.....};
 void _write (wchar[] s) {.....};
 void _write (dchar[] s) {.....};
 void write (char[] s) {_write(s)};

 I think this would solve the issue with string literals as discussed in 
 this thread.
Then, how would one write a dchar[] literal? You just moved the problem to the _write() method instead. I think there needs to be a general resolution instead. One might infer the literal type from the content therein?
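To illustrate the point with dummy bodies (a sketch, not Phobos code):

void _write ( char[] s) {}
void _write (wchar[] s) {}
void _write (dchar[] s) {}
void write (char[] s) { _write(s); }

void main()
{
    write ("plain");     // fine: the literal commits to char[]
    // write ("wide"d);  // error: dchar[] does not implicitly convert to char[]
}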
 Also, overloading would not be hampered.

 And, those who really _need_ types other than the 8 bit chars, could still 
 have their types work as usual.
Ahh. I think non-ASCII folks would be troubled by this bias <g>
Nov 11 2005
parent reply James Dunne <james.jdunne gmail.com> writes:
Kris wrote:
 "Georg Wrede" <georg.wrede nospam.org> wrote ...
 
Kris wrote:

This is the long standing mishmash between character literal
arguments and parameters of type char[], wchar[], and/or dchar[].
Character literals don't really have a "solid" type ~ the compiler
can, and will, convert between wide and narrow representations on the
fly.
Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)
That doesn't make it any less problematic :-)
It is a problem for small example programs. Larger programs tend to
(and IMHO should) have wrappers anyhow:
Not so. You'd see people complaining about this constantly if Stream.write() was not decorated to distinguish between the three relevant methods. Generally speaking, any code that deals with all three array types will bump into this. Mango.io has the same problem, since it exposes write() methods for every D type plus their array counterparts.
Why not change Phobos

void write ( char[] s) {.....};
void write (wchar[] s) {.....};
void write (dchar[] s) {.....};

into

void _write ( char[] s) {.....};
void _write (wchar[] s) {.....};
void _write (dchar[] s) {.....};
void write (char[] s) {_write(s)};

I think this would solve the issue with string literals as discussed in 
this thread.
Then, how would one write a dchar[] literal? You just moved the problem to the _write() method instead. I think there needs to be a general resolution instead. One might infer the literal type from the content therein?
Also, overloading would not be hampered.

And, those who really _need_ types other than the 8 bit chars, could still 
have their types work as usual.
Ahh. I think non-ASCII folks would be troubled by this bias <g>
char[] does NOT NECESSARILY MEAN an ASCII-only string in D. char[] can be a collection of UTF-8 code points, which further confuses the matter.

So long as you can process each variant of Unicode encodings (UTF-8, UTF-16, and UTF-32), it should NOT matter which you choose as your default encoding for your project's strings. The only effect of the choice is the efficiency with which your project processes strings. You should not lose any data, unless you make incorrect assumptions in your code.

I think it was a very wise decision to make char type separate from byte and ubyte, but I don't think it has separated far enough. There should be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32). Then, there should be a single ASCII character type called 'char'. This would allow strings to be defined to hold ASCII characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points.

String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character. The default encoding should be modifiable with either commandline options or with pragmas, preferably pragmas.

For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').

Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.
Nov 21 2005
next sibling parent reply Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:

 Kris wrote:
 "Georg Wrede" <georg.wrede nospam.org> wrote ...
 
Kris wrote:

This is the long standing mishmash between character literal
arguments and parameters of type char[], wchar[], and/or dchar[].
Character literals don't really have a "solid" type ~ the compiler
can, and will, convert between wide and narrow representations on the
fly.
Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)
That doesn't make it any less problematic :-)
It is a problem for small example programs. Larger programs tend to
(and IMHO should) have wrappers anyhow:
Not so. You'd see people complaining about this constantly if Stream.write() was not decorated to distinguish between the three relevant methods. Generally speaking, any code that deals with all three array types will bump into this. Mango.io has the same problem, since it exposes write() methods for every D type plus their array counterparts.
Why not change Phobos

void write ( char[] s) {.....};
void write (wchar[] s) {.....};
void write (dchar[] s) {.....};

into

void _write ( char[] s) {.....};
void _write (wchar[] s) {.....};
void _write (dchar[] s) {.....};
void write (char[] s) {_write(s)};

I think this would solve the issue with string literals as discussed in 
this thread.
Then, how would one write a dchar[] literal? You just moved the problem to the _write() method instead. I think there needs to be a general resolution instead. One might infer the literal type from the content therein?
Also, overloading would not be hampered.

And, those who really _need_ types other than the 8 bit chars, could still 
have their types work as usual.
Ahh. I think non-ASCII folks would be troubled by this bias <g>
char[] does NOT NECESSARILY MEAN an ASCII-only string in D. char[] can be a collection of UTF-8 code points, which further confuses the matter. So long as you can process each variant of Unicode encodings (UTF-8, UTF-16, and UTF-32), it should NOT matter which you choose as your default encoding for your project's strings. The only effect of the choice is the efficiency with which your project processes strings. You should not lose any data, unless you make incorrect assumptions in your code. I think it was a very wise decision to make char type separate from byte and ubyte, but I don't think it has separated far enough. There should be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32). Then, there should be a single ASCII character type called 'char'. This would allow strings to be defined to hold ASCII characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points. String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character. The default encoding should be modifiable with either commandline options or with pragmas, preferrably pragmas. For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]'). Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.
Very nice. Well said James. It makes so much sense when laid out like this. D is only half way there to supporting international character sets. -- Derek (skype: derek.j.parnell) Melbourne, Australia 22/11/2005 10:51:23 AM
Nov 21 2005
parent reply Sean Kelly <sean f4.ca> writes:
Derek Parnell wrote:
 On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:
 I think it was a very wise decision to make char type separate from byte 
 and ubyte, but I don't think it has separated far enough.  There should 
 be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, 
 cdpt32).  Then, there should be a single ASCII character type called 
 'char'.  This would allow strings to be defined to hold ASCII 
 characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points.

 String literals created from the D compiler should be stored as a 
 specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and 
 should be represented as the corresponding static array of the type of 
 character.  The default encoding should be modifiable with either 
 commandline options or with pragmas, preferrably pragmas.

 For instance, if the default encoding were to be UTF-8, then a string 
 literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').

 Also, it should be possible to explicitly specify the encoding for each 
 string literal on a case-by-case basis.
Very nice. Well said James. It makes so much sense when laid out like this. D is only half way there to supporting international character sets.
I agree, but there must be a way to improve internationalization without this degree of complexity. If D ends up with 6+ character types I think I might scream. Is there any reason to support C-style code pages in-language in D? I would like to think not.

As it stands, D supports three compatible encodings (char, wchar, dchar) that the programmer may choose between for reasons of data size and algorithm complexity. The ASCII-compatible subset of UTF-8 works fine with the char-based C functions, and the full UTF-16 or UTF-32 character sets are compatible with the wchar-based C functions (depending on platform)... so far as I know at any rate. I grant that the variable size of wchar in C is an irritating problem, but it's not insurmountable. Why bother with all that old C code page nonsense?

Sean
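A tiny sketch of that ASCII-subset point (assuming Phobos's std.string.toStringz and the std.c.stdio binding for printf):

import std.string;   // toStringz
import std.c.stdio;  // printf

void main()
{
    char[] msg = "hello from D";     // plain ASCII, so also valid UTF-8
    printf("%s\n", toStringz(msg));  // safe to hand to a char-based C function
}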
Nov 21 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 16:43:50 -0800, Sean Kelly wrote:

 Derek Parnell wrote:
 On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:
 I think it was a very wise decision to make char type separate from byte 
 and ubyte, but I don't think it has separated far enough.  There should 
 be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, 
 cdpt32).  Then, there should be a single ASCII character type called 
 'char'.  This would allow strings to be defined to hold ASCII 
 characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points.

 String literals created from the D compiler should be stored as a 
 specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and 
 should be represented as the corresponding static array of the type of 
 character.  The default encoding should be modifiable with either 
 commandline options or with pragmas, preferrably pragmas.

 For instance, if the default encoding were to be UTF-8, then a string 
 literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').

 Also, it should be possible to explicitly specify the encoding for each 
 string literal on a case-by-case basis.
Very nice. Well said James. It makes so much sense when laid out like this. D is only half way there to supporting international character sets.
I agree, but there must be a way to improve internationalization without this degree of complexity. If D ends up with 6+ character types I think I might scream.
Where did you get "6+ character types" from? James is (at worst) only adding one, ASCII. So we would end up with

  utf8  <==> schar[]  (Short? chars)
  utf16 <==> wchar[]  (Wide chars)
  utf32 <==> dchar[]  (Double-wide chars)
  ascii <==> char[]   (byte size chars)

But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now.

Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represent a fixed size array of 4 code points.

In this scheme, the old 'char' would be directly compatible with C/C++ legacy code.
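For contrast, here's how today's D handles that distinction (a sketch, assuming std.utf.decode):

import std.utf;

void main()
{
    char[] s = "\u00e9x";    // the accented character occupies two UTF-8 code units
    assert(s.length == 3);   // .length counts code units, not characters

    size_t i = 0;
    dchar c = decode(s, i);  // decode yields one code point, advancing i past both units
    assert(c == '\u00e9' && i == 2);
}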
 Is there any reason to support C-style code pages 
 in-language in D? 
Huh? What code pages? This is nowhere near anything James was talking about.
 I would like to think not.  As it stands, D supports 
 three compatible encodings (char, wchar, dchar) that the programmer may 
 choose between for reasons of data size and algorithm complexity.  The 
 ASCII-compatible subset of UTF-8 works fine with the char-based C 
 functions, and the full UTF-16 or UTF-32 character sets are compatible 
 with the wchar-based C functions (depending on platform)... so far as I 
 know at any rate.  I grant that the variable size of wchar in C is an 
 irritating problem, but it's not insurmountable.  Why bother with all 
 that old C code page nonsense?
Sure the current system can work, but only if the coder does a lot of mundane, error-prone work, to make it happen. The compiler is a tool to help coders do better, so it should help us take care of incidental housekeeping so we can concentrate on algorithms rather than data representations in RAM. -- Derek (skype: derek.j.parnell) Melbourne, Australia 22/11/2005 12:15:38 PM
Nov 21 2005
next sibling parent "Kris" <fu bar.com> writes:
"Derek Parnell" <derek psych.ward> wrote ...
 On Mon, 21 Nov 2005 16:43:50 -0800, Sean Kelly wrote:
[snip]
 Derek Parnell wrote:
 On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:
 Very nice. Well said James. It makes so much sense when laid out like 
 this.
 D is only half way there to supporting international character sets.
I agree, but there must be a way to improve internationalization without this degree of complexity. If D ends up with 6+ character types I think I might scream.
Where did you get "6+ character types" from? James is (at worst) only adding one, ASCII. So we would end up with utf8 <==> schar[] (Short? chars) utf16 <==> wchar[] (Wide chars) utf32 <==> dchar[] (Double-wide chars) ascii <==> char[] (byte size chars) But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now. Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represents a fixed size array of 4 code points.
Maybe. To maintain array indexing semantics, the compiler might implement such things as an array of pointers to byte arrays?

Then, there's at least this problem :: dchar is always self-contained. It does not have surrogates, ever. Given that it's more efficient to store as a one-dimensional array, surely this would cause inconsistencies in usage? And what about BMP utf16? It doesn't need such treatment either (though extended utf16 would do).

But I agree in principle ~ the semantics of indexing (as in arrays) don't work well with multi code-unit encodings. Packages to deal with such things typically offer iterators as a supplement. Take a look at ICU for examples?
 Sure the current system can work, but only if the coder does a lot of
 mundane, error-prone work, to make it happen. The compiler is a tool to
 help coders do better, so it should help us take care of incidental
 housekeeping so we can concentrate of algorithms rather than data
 representations in RAM.
I suspect it's a tall order to build such things into the compiler; especially when the issues are not clear-cut, and when there are heavy-duty libraries to take up the slack? Don't those libraries take care of data representation and incidental housekeeping on behalf of the developer?
Nov 21 2005
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Derek Parnell wrote:
 
 Where did you get "6+ character types" from?
I misunderstood and thought his cdpt8 would be added in addition to the existing character types.
 James is (at worst) only adding one, ASCII. So we would end up with
 
   utf8  <==> schar[]  (Short? chars)
   utf16 <==> wchar[]  (Wide chars)
   utf32 <==> dchar[]  (Double-wide chars)
   ascii <==> char[]   (byte size chars)
 
 But the key point is that each element in these arrays would be a
 *character* (a.k.a. Code Point) rather than Code Units as they are now.
 
 Thus a schar is an atomic value that represents a single character even if
 that character takes up one, two, or four bytes in RAM. And 'schar[4]'
 would represents a fixed size array of 4 code points. 
This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again? And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?
 Sure the current system can work, but only if the coder does a lot of
 mundane, error-prone work, to make it happen. The compiler is a tool to
 help coders do better, so it should help us take care of incidental
 housekeeping so we can concentrate of algorithms rather than data
 representations in RAM.
The only somewhat confusing issue to me is that the symbol names "char" and "wchar" imply that the data stored therein is a complete character, when this is only sometimes true. I agree that this is a problem, but I'm not sure that variable width characters are the solution. It makes array manipulations oddly inconsistent, for one thing. Should the length property return the number of characters in the array? Would a size property be needed to determine the memory footprint of this array? What if I try something like this:

utf8[] myString = "multiwidth";
utf8[] slice = myString[0..1];
slice[0] = '\U00000001';

Would the sliced array resize to fit the potentially different-sized character being inserted, or would myString end up corrupted?

Sean
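For comparison, slicing today's char[] already has a related hazard (a sketch, assuming std.utf.validate):

import std.utf;

void main()
{
    char[] s = "a\u00e9b";    // the middle character occupies two code units
    char[] bad = s[0 .. 2];   // slices through the middle of that character
    // validate(bad);         // would throw: not a valid UTF-8 sequence
}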
Nov 21 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 19:05:53 -0800, Sean Kelly wrote:

 Derek Parnell wrote:
 
 Where did you get "6+ character types" from?
I misunderstood and thought his cdpt8 would be added in addition to the existing character types.
 James is (at worst) only adding one, ASCII. So we would end up with
 
   utf8  <==> schar[]  (Short? chars)
   utf16 <==> wchar[]  (Wide chars)
   utf32 <==> dchar[]  (Double-wide chars)
   ascii <==> char[]   (byte size chars)
 
 But the key point is that each element in these arrays would be a
 *character* (a.k.a. Code Point) rather than Code Units as they are now.
 
 Thus a schar is an atomic value that represents a single character even if
 that character takes up one, two, or four bytes in RAM. And 'schar[4]'
 would represents a fixed size array of 4 code points. 
This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again? And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?
That is what I'm doing now to Build. Internally, all strings will be dchar[], but what I'm finding out is the huge lack of support for dchar[] in phobos. I've now coded my own routine to read text files in UTF formats, but store them as dchar[] in the application. Then I've had to code appropriate routines for all the other support functions: split(), strip(), find(), etc ...
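For the simplest case (a UTF-8 input file) the conversions themselves can lean on std.utf; a sketch, with placeholder file names:

import std.file;
import std.utf;

void main()
{
    char[] raw = cast(char[]) std.file.read("input.txt");  // assuming UTF-8 on disk
    dchar[] text = toUTF32(raw);      // work on whole code points internally

    // ... split(), strip(), find() etc. operating on dchar[] ...

    std.file.write("output.txt", toUTF8(text));
}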
 Sure the current system can work, but only if the coder does a lot of
 mundane, error-prone work, to make it happen. The compiler is a tool to
 help coders do better, so it should help us take care of incidental
 housekeeping so we can concentrate of algorithms rather than data
 representations in RAM.
The only somewhat confusing issue to me is that the symbol names "char" and "wchar" imply that the data stored therein is a complete character, when this is only sometimes true. I agree that this is a problem, but I'm not sure that variable width characters is the solution. It makes array manipulations oddly inconsistent, for one thing. Should the length property return the number of characters in the array?
Yes.
  Would a 
 size property be needed to determine the memory footprint of this array? 
Yes.
   What if I try something like this:
 
 utf8[] myString = "multiwidth";
 utf8[] slice = myString[0..1];
 slice[0] = '\U00000001';
 
 Would the sliced array resize to fit the potentially different-sized 
 character being inserted, or would myString end up corrputed?
Yes, it would be complex. No, the myString would not be corrupted. It would just be the same as doing it 'manually', only the compiler will do the hack work for you.

char[] myString = "multiwidth";
char[] slice = myString[0..1];

// modify base string.
myString = "\U00000001" ~ myString[1..$];

// reslice it because its address might have changed.
slice = myString[0..1];

Messy doing it manually, so that's why a code-point array would be better than a byte/short/int array for strings. -- Derek (skype: derek.j.parnell) Melbourne, Australia 22/11/2005 2:17:40 PM
Nov 21 2005
parent Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Derek Parnell wrote:
 On Mon, 21 Nov 2005 19:05:53 -0800, Sean Kelly wrote:
 
 
Derek Parnell wrote:

Where did you get "6+ character types" from?
I misunderstood and thought his cdpt8 would be added in addition to the existing character types.
James is (at worst) only adding one, ASCII. So we would end up with

  utf8  <==> schar[]  (Short? chars)
  utf16 <==> wchar[]  (Wide chars)
  utf32 <==> dchar[]  (Double-wide chars)
  ascii <==> char[]   (byte size chars)

But the key point is that each element in these arrays would be a
*character* (a.k.a. Code Point) rather than Code Units as they are now.

Thus a schar is an atomic value that represents a single character even if
that character takes up one, two, or four bytes in RAM. And 'schar[4]'
would represents a fixed size array of 4 code points. 
This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again? And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?
That is what I'm doing now to Build. Internally, all strings will be dchar[], but what I'm finding out is the huge lack of support for dchar[] in phobos. I've now coded my own routine to read text files in UTF formats, but store them as dchar[] in the application. Then I've had to code appropriate routines for all the other support functions: split(), strip(), find(), etc ...
Then, wouldn't having good dchar[] support in Phobos be a better solution than having to introduce another type in the language to do the same thing that dchar[] does?

The only difference I see between such a type (a codepoint string) and a dchar string is in better storage size for the codepoint string, but is that difference worth it? (not to mention a codepoint string would have (in certain cases) much worse modification performance than a dchar string)

Also, what is Phobos lacking in dchar[] support?

-- Bruno Medeiros - CS/E student "Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
Nov 23 2005
prev sibling parent reply "Kris" <fu bar.com> writes:
"James Dunne" <james.jdunne gmail.com> wrote ...
 Kris wrote:
[snip]
 Ahh. I think non-ASCII folks would be troubled by this bias <g>
char[] does NOT NECESSARILY MEAN an ASCII-only string in D. char[] can be a collection of UTF-8 code points, which further confuses the matter.
Indeed. I was alluding to encoding multi-byte-utf8 literals by hand; but it was a piss-poor attempt at humour.
 So long as you can process each variant of Unicode encodings (UTF-8, 
 UTF-16, and UTF-32), it should NOT matter which you choose as your default 
 encoding for your project's strings.  The only effect of the choice is the 
 efficiency with which your project processes strings.  You should not lose 
 any data, unless you make incorrect assumptions in your code.
Right.
 String literals created from the D compiler should be stored as a specific 
 encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be 
 represented as the corresponding static array of the type of character.
They are. The 'c', 'w', and 'd' suffix provides the fine control. Auto instances map implicitly to 'c'. Explicitly typed instances (e.g. wchar[] s = "a wide string";) also provide fine control. The minor concern I have with this aspect is that the literal content does not play a role, whereas it does with char literals (such as '?', '\x0001', and '\X00000001'). No big deal there, although perhaps it's food for another topic?
 The default encoding should be modifiable with either commandline options 
 or with pragmas, preferrably pragmas.
I wondered about that also. Walter pointed out it would be similar to the signed/unsigned char-type switch prevalent in C compilers, which can cause grief. Perhaps D does need defaults like that, but some consistency in the interpretation of string literals would have to happen first. This requires a subtle change:

That change is to assign a resolvable type to 'undecorated' string-literal arguments in the same way as the "auto" keyword does. This would also make it consistent with undecorated integer-literals (as noted elsewhere). In short, an undecorated argument "literal" would be treated as a decorated "literal"c (that 'c' suffix makes it utf8), just like auto does. This would mean all uses of string literals are treated consistently, and all undecorated literals (string, char, numeric) have consistent rules when it comes to overload resolution (currently they do not).

To elaborate, here's the undecorated string literal asymmetry:

auto s = "literal";   // effectively adds an implicit 'c' suffix
myFunc ("literal");   // Should be changed to behave as above

What I hear you asking for is a way to alter that implicit suffix? I'd be really happy to just get the consistency first :-)

These instances are all (clearly) explicitly typed:

char[]  s = "literal";   // utf8
wchar[] s = "literal";   // utf16
dchar[] s = "literal";   // utf32

auto s = "literal"c;     // utf8
auto s = "literal"w;     // utf16
auto s = "literal"d;     // utf32

myFunc ("literal"c);     // utf8
myFunc ("literal"w);     // utf16
myFunc ("literal"d);     // utf32
 For instance, if the default encoding were to be UTF-8, then a string 
 literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').

 Also, it should be possible to explicitly specify the encoding for each 
 string literal on a case-by-case basis.
If I understand correctly, you can. See above.
Nov 21 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:
 The minor concern I have with
 this aspect is that the literal content does not play a role, whereas it
 does with char literals (such as '?', '\x0001', and '\X00000001').
But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?)

If this were to change would it make this an error:

foo(wchar[] foo) {}
foo("\U00000040");
 No big
 deal there, although perhaps it's food for another topic?
Here seems like as good a place as any. Regan
Nov 22 2005
parent reply "Kris" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote...
 On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:
 The minor concern I have with
 this aspect is that the literal content does not play a role, whereas it
 does with char literals (such as '?', '\x0001', and '\X00000001').
But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");
 No big
 deal there, although perhaps it's food for another topic?
Here seems like as good a place as any.
Oh, that minor concern was in regard to consistency here also. I have no quibble with the character type being implied by content (consistent with numeric literals):

1) The type for literal chars is implied by their content ('?', '\u0001', '\U00000001')

2) The type of a numeric literal is implied by the content (0xFF, 0xFFFFFFFF, 1.234)

3) The type for literal strings is not influenced at all by the content (as far as I'm aware).

These two inconsistencies are small, but they may influence concerns elsewhere ...
Nov 22 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Tue, 22 Nov 2005 15:01:11 -0800, Kris <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote...
 On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:
 The minor concern I have with
 this aspect is that the literal content does not play a role, whereas  
 it
 does with char literals (such as '?', '\x0001', and '\X00000001').
But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");
 No big
 deal there, although perhaps it's food for another topic?
Here seems like as good a place as any.
Oh, that minor concern was in regard to consistency here also.
I realise that. I'm just trying to explore whether they _should_ behave the same, or not, are they both apples or are they apples and oranges. I agree things should behave consistently, provided it makes sense for them to do so.
 I have no quibble with the character type being implied by content
I didn't think you did. My example above is a string literal, not a character literal. If the string literal type was implied by content would my example above be an error?

"\U00000040" is a dchar (sized) character in a string literal. "abc \U00000040 def" could be used also. foo requires a wchar[]. If the type of the literal is taken to be dchar, based on contents, then it does not match wchar and you need the 'w' suffix or similar to resolve it.

It seems the real question is, what did the programmer intend? Did they intend for the character to be represented exactly as they typed it? In this case, if it was passed exactly as written it would become 2 wchar code units, did they want that? Or did they simply want the equivalent character in the resulting encoding?

I think the latter is more likely. The former can create illegal UTF sequences.

What do you think? The facts:
 1) The type for literal chars is implied by their content ('?', '\u0001',
 '\U00000001')

 2) The type of a numeric literal is implied by the content (0xFF,
 0xFFFFFFFF, 1.234)

 3) The type for literal strings is not influenced at all by the content.
smaller types. similar enough? or is it in fact different?

 (as far as I'm aware).
I'm not aware of any either. Regan
Nov 23 2005
parent reply "Kris" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote
 On Tue, 22 Nov 2005 15:01:11 -0800, Kris <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote...
 On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:
 The minor concern I have with
 this aspect is that the literal content does not play a role, whereas 
 it
 does with char literals (such as '?', '\x0001', and '\X00000001').
But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");
 No big
 deal there, although perhaps it's food for another topic?
Here seems like as good a place as any.
Oh, that minor concern was in regard to consistency here also.
I realise that. I'm just trying to explore whether they _should_ behave the same, or not, are they both apples or are they apples and oranges. I agree things should behave consistently, provided it makes sense for them to do so.
 I have no quibble with the character type being implied by content
I didn't think you did. My example above is a string literal, not a character literal. If the string literal type was implied by content would my example above be an error?
To clarify: I'm already making the assumption that the compiler changes to eliminate the uncommitted aspect of argument literals. That presupposes the "default" type will be char[] (like auto literals).

This is a further, and probably minor, question as to whether it might be useful (and consistent) that the "default" type be implied by the literal content. Suffix 'typing' and compile-time transcoding are still present and able. I'm not at all sure it would be terribly useful, given that the literal will potentially be transcoded at compile-time anyway.

[snip]
 I think the latter is more likely. The former can create illegal UTF 
 sequences.

 What do you think?
I think I'd be perfectly content once argument-literals lose their uncommitted status, and thus behave like auto literals <g>
Nov 23 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Wed, 23 Nov 2005 13:58:20 -0800, Kris <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote
 On Tue, 22 Nov 2005 15:01:11 -0800, Kris <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote...
 On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:
 The minor concern I have with
 this aspect is that the literal content does not play a role, whereas
 it
 does with char literals (such as '?', '\x0001', and '\X00000001').
But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");
 No big
 deal there, although perhaps it's food for another topic?
Here seems like as good a place as any.
Oh, that minor concern was in regard to consistency here also.
I realise that. I'm just trying to explore whether they _should_ behave the same, or not, are they both apples or are they apples and oranges. I agree things should behave consistently, provided it makes sense for them to do so.
 I have no quibble with the character type being implied by content
I didn't think you did. My example above is a string literal, not a character literal. If the string literal type was implied by content would my example above be an error?
To clarify: I'm already making the assumption that the compiler changes to eliminate the uncommited aspect of argument literals. That presupposes the "default" type will be char[] (like auto literals).
Same.
 This is a further, and probably minor, question as to whether it might be
 useful (and consistent) that "default" type be implied by the literal
 content.
Yes, that is what I thought we were doing, questioning whether it would be useful. My current feeling is that it's not, but we'll see...
 Suffix 'typing' and compile-time transcoding are still present and able.
Yep.
 I'm not at all sure it would be terribly useful, given that the
 literal will potentially be transcoded at compile-time anyway.
Like in my first example:

foo(wchar[] foo) {}
foo("\U00000040");

the string containing the dchar content would in fact be transcoded to wchar at compile time to match the one available overload. So, when wouldn't it be transcoded at compile time? All I can think of is "auto", eg.

auto test = "abc \U00000040 def";

So, if this is the only case where the string contents make a difference I would call that inconsistent, and would instead opt for using the string literal suffix to specify an encoding where required, eg.

auto test = "abc \U00000040 def"d;

Then the statement "all string literals default to char[] unless the required encoding can be determined at compile time" would be true.

Regan
Nov 23 2005
parent "Kris" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message news
[snip]
 Then the statement "all string literals default to char[] unless a the 
 required encoding can be determined at compile time" would be true.
That would be great. Now, will this truly come to pass? <g>
Nov 23 2005
prev sibling parent reply Don Clugston <dac nospam.com.au> writes:
Kris wrote:
 This is the long standing mishmash between character literal arguments and 
 parameters of type char[], wchar[], and/or dchar[]. Character literals don't 
 really have a "solid" type ~ the compiler can, and will, convert between 
 wide and narrow representations on the fly.
 
 Suppose you have the following methods:
 
 void write (char[] x);
 void write (wchar[] x);
 void write (dchar[] x);
 
 Given a literal argument:
 
 write ("what am I?");
 
 D doesn't know whether to invoke the char[] or wchar[] signature, since the 
 literal is treated as though it's possibly any of the three types. This is 
 the kind of non-determinism you get when the compiler becomes too 'smart' 
 (unwarranted automatic conversion, in this case).
I agree, except that I think the problem in this case is that it's not converting "from" anything! There's no "exact match" which it tries first.

A parallel case is that a floating point literal can be implicitly converted to float, double, real, cfloat, cdouble, creal. For fp literals, the default is double. It's a bit odd that with

dchar[] q = "abc";
wchar[] w = "abc";

"abc" is a dchar literal the first time, but a wchar literal the second, whereas with

real q = 2.5;
double w = 2.5;

2.5 is a double literal in both cases. No wonder array literals are such a problem...
Nov 11 2005
parent reply "Kris" <fu bar.com> writes:
"Don Clugston" <dac nospam.com.au> wrote ..
 Kris wrote:
 D doesn't know whether to invoke the char[] or wchar[] signature, since 
 the literal is treated as though it's possibly any of the three types. 
 This is the kind of non-determinism you get when the compiler becomes too 
 'smart' (unwarranted automatic conversion, in this case).
I agree, except that I think the problem in this case is that it's not converting "from" anything! There's no "exact match" which it tries first.
There would be if the auto-casting were disabled, and the type were determined via the literal content, in conjunction with the /default/ literal type suggested by GW. Yes?
Nov 11 2005
parent jicman <jicman_member pathlink.com> writes:
Kris says...
"Don Clugston" <dac nospam.com.au> wrote ..
 Kris wrote:
 D doesn't know whether to invoke the char[] or wchar[] signature, since 
 the literal is treated as though it's possibly any of the three types. 
 This is the kind of non-determinism you get when the compiler becomes too 
 'smart' (unwarranted automatic conversion, in this case).
I agree, except that I think the problem in this case is that it's not converting "from" anything! There's no "exact match" which it tries first.
There would be if the auto-casting were disabled, and the type were determined via the literal content, in conjunction with the /default/ literal type suggested by GW. Yes?
Gosh, all I wanted was a simple explanation. :-) (kidding)

I used writeString and it works,

|17:24:22.68>type ftest.d
|import std.file;
|import std.stream;
|int main()
|{
|  File log = new File("myfile.txt",FileMode.Out);
|  log.writeString("this is a test");
|  log.close();
|  return 1;
|}

thanks. Please, continue with your discussion. :-)

josé
Nov 11 2005
prev sibling parent reply Nick <Nick_member pathlink.com> writes:
In article <dl0hja$2aal$1 digitaldaemon.com>, jicman says...
So, I have this complicated piece of code:

|import std.file;
|import std.stream;
|int main()
|{
|  File log = new File("myfile.txt",FileMode.Out);
|  log.write("this is a test");
|  log.close();
|  return 1;
|}
Also note one thing though: Stream.write() will write the string in binary format, ie. it will write a binary int with the length, and then the string. If you want a plain ASCII file, which is probably what you want in a log file, you should use Stream.writeString(), or Stream.writeLine() which inserts a line break. Or you can use writef/writefln for more advanced formatting.

If you already knew this then disregard this post ;-)

Nick
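A minimal sketch of those alternatives on the same File (assuming the std.stream API used earlier in the thread, including Stream's writefln overload):

import std.stream;

void main()
{
    File log = new File("myfile.txt", FileMode.Out);
    log.writeString("raw text, no length prefix");
    log.writeLine("one line, with a line break appended");
    log.writefln("formatted: %s = %d", "answer", 42);
    log.close();
}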
Nov 11 2005
parent jicman <jicman_member pathlink.com> writes:
Nick says...
In article <dl0hja$2aal$1 digitaldaemon.com>, jicman says...
So, I have this complicated piece of code:

|import std.file;
|import std.stream;
|int main()
|{
|  File log = new File("myfile.txt",FileMode.Out);
|  log.write("this is a test");
|  log.close();
|  return 1;
|}
Also note one thing though: Stream.write() will write the string in binary format, ie. it will write a binary int with the length, and then the string. If you want a plain ASCII file, which is probably what you want in a log file, you should use Stream.writeString(), or Stream.writeLine() which inserts a line break. Or you can use writef/writefln for more advanced formatting. If you already knew this then disregard this post ;-)
Disregard this post? Oh, no! My friend, you wrote it, I am going to read it. (Yes, I knew that. I was trying to quickly write some debugging code for something at work, and I found that compiler error and asked.) Thanks. josé
Nov 11 2005