digitalmars.D.learn - Stream and File understanding.
- jicman (20/20) Nov 10 2005 So, I have this complicated piece of code:
- Sean Kelly (7/29) Nov 10 2005 The problem is that string literals can be implicitly converted
- Kris (22/51) Nov 10 2005 This produces a compile error:
- Georg Wrede (19/52) Nov 11 2005 I just posted a "nice" fix on this thread. But it seems overkill (and
- bert (4/10) Nov 11 2005 The *programmer* assumes so *anyway*.
- jicman (2/12) Nov 11 2005 It is really cool. :-)
- Kris (7/10) Nov 11 2005 That sounds like a good idea; it would set the /default/ type for litera...
- Nick (6/12) Nov 14 2005 Well that's a nice attitude. Makes copy-and-paste impossible, and makes ...
- Georg Wrede (16/32) Nov 14 2005 :-) there are actually 2 separate issues involved.
- Kris (34/55) Nov 10 2005 This is the long standing mishmash between character literal arguments a...
- Georg Wrede (32/54) Nov 11 2005 Compared to the bit thing I recently "bitched" about, this, IMHO, is an
- Kris (12/36) Nov 11 2005 Not so. You'd see people complaining about this constantly if Stream.wri...
- James Dunne (25/88) Nov 21 2005 char[] does NOT NECESSARILY MEAN an ASCII-only string in D.
- Derek Parnell (8/93) Nov 21 2005 Very nice. Well said James. It makes so much sense when laid out like th...
- Sean Kelly (14/37) Nov 21 2005 I agree, but there must be a way to improve internationalization without...
- Derek Parnell (26/65) Nov 21 2005 Where did you get "6+ character types" from?
- Kris (17/43) Nov 21 2005 Maybe. To maintain array indexing semantics, the compiler might implemen...
- Sean Kelly (21/41) Nov 21 2005 I misunderstood and thought his cdpt8 would be added in addition to the
- Derek Parnell (25/72) Nov 21 2005 That is what I'm doing now to Build. Internally, all strings will be
- Bruno Medeiros (13/52) Nov 23 2005 Then, wouldn't having good dchar[] support in Phobos be a better
- Kris (40/59) Nov 21 2005 Indeed. I was alluding to encoding multi-byte-utf8 literals by hand; but...
- Regan Heath (10/15) Nov 22 2005 But that makes sense, right? Character literals i.e. '\X00000001' will
- Kris (12/26) Nov 22 2005 Oh, that minor concern was in regard to consistency here also. I have no...
- Regan Heath (29/59) Nov 23 2005 I realise that. I'm just trying to explore whether they _should_ behave ...
- Kris (12/48) Nov 23 2005 To clarify: I'm already making the assumption that the compiler changes ...
- Regan Heath (20/68) Nov 23 2005 Yes, that is what I thought we were doing, questioning whether it would ...
- Kris (4/6) Nov 23 2005 That would be great. Now, will this truly come to pass?
- Don Clugston (13/32) Nov 11 2005 I agree, except that I think the problem in this case is that it's not
- Nick (8/18) Nov 11 2005 Also note one thing though: Stream.write() will write the string in bina...
- jicman (5/24) Nov 11 2005 Disregard this post? Oh, no! My friend, you wrote it, I am going to re...
So, I have this complicated piece of code: |import std.file; |import std.stream; |int main() |{ | File log = new File("myfile.txt",FileMode.Out); | log.write("this is a test"); | log.close(); | return 1; |} and I try to compile it, I get, |ftest.d(6): function std.stream.Stream.write called with argument types: | (char[14]) |matches both: | std.stream.Stream.write(char[]) |and: | std.stream.Stream.write(wchar[]) Shouldn't it just match "std.stream.Stream.write(char[])"? thanks, josé
Nov 10 2005
jicman wrote:So, I have this complicated piece of code: |import std.file; |import std.stream; |int main() |{ | File log = new File("myfile.txt",FileMode.Out); | log.write("this is a test"); | log.close(); | return 1; |} and I try to compile it, I get, |ftest.d(6): function std.stream.Stream.write called with argument types: | (char[14]) |matches both: | std.stream.Stream.write(char[]) |and: | std.stream.Stream.write(wchar[]) Shouldn't it just match "std.stream.Stream.write(char[])"?The problem is that string literals can be implicitly converted to char, wchar, and dchar strings. To fix the overload resolution problem, try this: log.write( "this is a test"c ); The 'c' suffix indicates that the above is a char[] string. Sean
Nov 10 2005
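A complete sketch of Sean's fix, assuming the std.file/std.stream API of the time; the 'c' suffix commits the literal to char[], so only the write(char[]) overload matches:

import std.file;
import std.stream;

int main()
{
    File log = new File("myfile.txt", FileMode.Out);
    // "..."c is committed to char[], so only write(char[]) matches:
    log.write("this is a test"c);
    log.close();
    return 0;
}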
This produces a compile error: void write (char[] x){} void write (wchar[] x){} void main() { write ("part 1" "part 2"c); } The compiler complains about the two literal types not matching. This also fails: void main() { write ("part 1"c "part 2"c); } This strange-looking suffixing is present due to unwarranted & unwanted automatic type conversion, is it not? Wouldn't it be better to explicitly request conversion when it's actually wanted instead? Isn't that what the cast() operator is for? - Kris "Sean Kelly" <sean f4.ca> wrote in message news:dl0in9$2bet$1 digitaldaemon.com...jicman wrote:So, I have this complicated piece of code: |import std.file; |import std.stream; |int main() |{ | File log = new File("myfile.txt",FileMode.Out); | log.write("this is a test"); | log.close(); | return 1; |} and I try to compile it, I get, |ftest.d(6): function std.stream.Stream.write called with argument types: | (char[14]) |matches both: | std.stream.Stream.write(char[]) |and: | std.stream.Stream.write(wchar[]) Shouldn't it just match "std.stream.Stream.write(char[])"?The problem is that string literals can be implicitly converted to char, wchar, and dchar strings. To fix the overload resolution problem, try this: log.write( "this is a test"c ); The 'c' suffix indicates that the above is a char[] string. Sean
Nov 10 2005
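A sketch of the cast()-based alternative Kris alludes to, under the same hypothetical two-overload setup; committing the literal's type before (or at) the call sidesteps both the overload ambiguity and the mixed-suffix concatenation error:

void write (char[] x) {}
void write (wchar[] x) {}

void main()
{
    // An explicitly typed variable commits the concatenated literal:
    char[] s = "part 1" "part 2";
    write (s);

    // An explicit cast() requests the conversion only where wanted:
    write (cast(char[]) ("part 1" "part 2"));
}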
Sean Kelly wrote:jicman wrote:I just posted a "nice" fix on this thread. But it seems overkill (and brittle), if one assumes this is just a problem with string literals! _If_ it is true that this "problem" exists only with string literals, then it should be even easier to fix! The compiler knows (or at least _should_ know) the character width of the source code file. Now, if there's an undecorated string literal in it, then _simply_assume_ that is the _intended_ type of the string! (( At this time opponents will say "what if the source code file gets converted into another character width?" -- My answer: "Tough, ain't it!", since there's a law against gratuitous mucking with source code. )) So, implicitly just assume the source code literal character width. The 'c' suffix does _not_ exist so the compiler can force you to state the obvious. It's there so you _can_ be explicit _when_ it really matters to you. --- Oh, and if we want to be real fancy, we could also have a pragma stating the default for character literals! And when the pragma is not used, then assume based on the source.So, I have this complicated piece of code: |import std.file; |import std.stream; |int main() |{ | File log = new File("myfile.txt",FileMode.Out); | log.write("this is a test"); | log.close(); | return 1; |} and I try to compile it, I get, |ftest.d(6): function std.stream.Stream.write called with argument types: | (char[14]) |matches both: | std.stream.Stream.write(char[]) |and: | std.stream.Stream.write(wchar[]) Shouldn't it just match "std.stream.Stream.write(char[])"?The problem is that string literals can be implicitly converted to char, wchar, and dchar strings. To fix the overload resolution problem, try this: log.write( "this is a test"c ); The 'c' suffix indicates that the above is a char[] string.
Nov 11 2005
In article <4374598B.30604 nospam.org>, Georg Wrede says...The compiler knows (or at least _should_ know) the character width of the source code file. Now, if there's an undecorated string literal in it, then _simply_assume_ that is the _intended_ type of the string!The *programmer* assumes so *anyway*. Why on earth should the compiler assume anything else! BTW, D is really cool!
Nov 11 2005
bert says...In article <4374598B.30604 nospam.org>, Georg Wrede says...It is really cool. :-)The compiler knows (or at least _should_ know) the character width of the source code file. Now, if there's an undecorated string literal in it, then _simply_assume_ that is the _intended_ type of the string!The *programmer* assumes so *anyway*. Why on earth should the compiler assume anything else! BTW, D is really cool!
Nov 11 2005
"Georg Wrede" <georg.wrede nospam.org> wrote ...The compiler knows (or at least _should_ know) the character width of the source code file. Now, if there's an undecorated string literal in it, then _simply_assume_ that is the _intended_ type of the string!That sounds like a good idea; it would set the /default/ type for literals. But the compiler should still inspect the literal content to determine if it has explicit wchar or dchar characters within. The compiler apparently does this, but doesn't use it to infer literal type? This combination would very likely resolve all such problems, assuming the auto-casting were removed also?
Nov 11 2005
In article <4374598B.30604 nospam.org>, Georg Wrede says...The compiler knows (or at least _should_ know) the character width of the source code file. Now, if there's an undecorated string literal in it, then _simply_assume_ that is the _intended_ type of the string! (( At this time opponents will say "what if the source code file gets converted into another character width?" -- My answer: "Tough, ain't it!", since there's a law against gratuitous mucking with source code. ))Well that's a nice attitude. Makes copy-and-paste impossible, and makes writing code off html, plain text, and books impossible too, since the code's behaviour now depends on your language environment. I'm sure that won't cause any bugs at all ;-) Nick
Nov 14 2005
Nick wrote:In article <4374598B.30604 nospam.org>, Georg Wrede says...:-) there are actually 2 separate issues involved. First of all, the copy-and-paste issue: To be able to paste into the string, the text editor (or whatever) has to know the character width of the file to begin with, since pasting is done differently with the various UTF widths. Further, one cannot paste anything "in the wrong UTF width" as such, so the editor has to convert it into the width of the entire file first. (This _should_ be handled by the operating system (not the text editor), but I wouldn't bet on it, at least before 2010 or something. Not with at least _some_ "operating systems".) Second, the width the undecorated literal is to be stored as: What makes this issue interesting is whether it's feasible to assume something, or to declare the literal as being of "unspecified" width. There's lately been some research into the issue (in the D newsgroup). The jury is still out.The compiler knows (or at least _should_ know) the character width of the source code file. Now, if there's an undecorated string literal in it, then _simply_assume_ that is the _intended_ type of the string! (( At this time opponents will say "what if the source code file gets converted into another character width?" -- My answer: "Tough, ain't it!", since there's a law against gratuitous mucking with source code. ))Well that's a nice attitude. Makes copy-and-paste impossible, and makes writing code off html, plain text, and books impossible too, since the code's behaviour now depends on your language environment. I'm sure that won't cause any bugs at all ;-)
Nov 14 2005
This is the long standing mishmash between character literal arguments and parameters of type char[], wchar[], and/or dchar[]. Character literals don't really have a "solid" type ~ the compiler can, and will, convert between wide and narrow representations on the fly. Suppose you have the following methods: void write (char[] x); void write (wchar[] x); void write (dchar[] x); Given a literal argument: write ("what am I?"); D doesn't know whether to invoke the char[] or wchar[] signature, since the literal is treated as though it's possibly any of the three types. This is the kind of non-determinism you get when the compiler becomes too 'smart' (unwarranted automatic conversion, in this case). To /patch/ around this problem, literals may now be suffixed with a type-identifier, including 'c', 'w', and 'd'. Thus, the above example will compile when you do the following: write ("I am a char[], dammit!"c); I, for one, think this is silly. To skirt the issue, APIs end up being written as follows: void write (char[]); void writeW (wchar[]); void writeD (dchar[]); Is that redundant, or what? Well, it's what Phobos is forced to do in the Stream code (take a look). The error you ran into appears to be a situation where Walter's own code (std.file) bumps into this ~ wish that were enough to justify a real fix for this long-running concern. BTW; the correct thing happens when not using literals. For example, the following operates intuitively: char[] msg = "I am a char[], dammit!"; write (msg); - Kris "jicman" <jicman_member pathlink.com> wrote in message news:dl0hja$2aal$1 digitaldaemon.com...So, I have this complicated piece of code: |import std.file; |import std.stream; |int main() |{ | File log = new File("myfile.txt",FileMode.Out); | log.write("this is a test"); | log.close(); | return 1; |} and I try to compile it, I get, |ftest.d(6): function std.stream.Stream.write called with argument types: | (char[14]) |matches both: | std.stream.Stream.write(char[]) |and: | std.stream.Stream.write(wchar[]) Shouldn't it just match "std.stream.Stream.write(char[])"? thanks, josé
Nov 10 2005
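A minimal sketch of the decorated-name workaround Kris describes; with only one overload per name, an undecorated literal has a single target type to convert to, so every call resolves:

void write  (char[] x)  {}
void writeW (wchar[] x) {}
void writeD (dchar[] x) {}

void main()
{
    write  ("narrow");       // only write(char[]) exists: unambiguous
    writeW ("wide");         // the literal implicitly becomes wchar[]
    writeD ("double-wide");  // the literal implicitly becomes dchar[]
}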
Kris wrote:This is the long standing mishmash between character literal arguments and parameters of type char[], wchar[], and/or dchar[]. Character literals don't really have a "solid" type ~ the compiler can, and will, convert between wide and narrow representations on the fly.Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-) It is a problem for small example programs. Larger programs tend to (and IMHO should) have wrappers anyhow: void logwrite(char[] logfile, char[] entry) { std.stream.Stream.write(logfile, entry); }BTW; the correct thing happens when not using literals. For example, the following operates intuitively: char[] msg = "I am a char[], dammit!"; write (msg);Hmm, Kris's comment above gives me an idea for a _very_ easy fix for this in Phobos: Why not change Phobos void write ( char[] s) {.....}; void write (wchar[] s) {.....}; void write (dchar[] s) {.....}; into void _write ( char[] s) {.....}; void _write (wchar[] s) {.....}; void _write (dchar[] s) {.....}; void write (char[] s) { _write(s); } I think this would solve the issue with string literals as discussed in this thread. Also, overloading would not be hampered. And, those who really _need_ types other than the 8 bit chars, could still have their types work as usual. (( I also had 2 more lines void writeW (wchar[] s) { _write(s); } void writeD (dchar[] s) { _write(s); } above, but they're actually not needed, based on the assumption that the compiler is smart enough to not make redundant char type conversions, which I believe it is. -- And if not, then the 2 lines should be included. ))To /patch/ around this problem, literals may now be suffixed with a type-identifier, including 'c', 'w', and 'd'. Thus, the above example will compile when you do the following: write ("I am a char[], dammit!"c); I, for one, think this is silly. To skirt the issue, APIs end up being written as follows: void write (char[]); void writeW (wchar[]); void writeD (dchar[]); Is that redundant, or what? Well, it's what Phobos is forced to do in the Stream code (take a look). The error you ran into appears to be a situation where Walter's own code (std.file) bumps into this
Nov 11 2005
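A sketch of how Georg's proposal would behave, assuming the renamed overloads above; undecorated literals only ever see the single write(char[]), while explicitly typed strings can still reach the wider _write() overloads directly:

void _write (char[] s)  { /* ... */ }
void _write (wchar[] s) { /* ... */ }
void _write (dchar[] s) { /* ... */ }

void write (char[] s) { _write(s); }

void main()
{
    write ("a literal");          // unambiguous: only one write()

    dchar[] wide = "still fine";  // typed strings bypass the wrapper
    _write (wide);                // and pick the dchar[] overload
}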
"Georg Wrede" <georg.wrede nospam.org> wrote ...Kris wrote:That doesn't make it any less problematic :-)This is the long standing mishmash between character literal arguments and parameters of type char[], wchar[], and/or dchar[]. Character literals don't really have a "solid" type ~ the compiler can, and will, convert between wide and narrow representations on the fly.Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)It is a problem for small example programs. Larger programs tend to (and IMHO should) have wrappers anyhow:Not so. You'd see people complaining about this constantly if Stream.write() was not decorated to distinguish between the three relevant methods. Generally speaking, any code that deals with all three array types will bump into this. Mango.io has the same problem, since it exposes write() methods for every D type plus their array counterparts.Why not change Phobos void write ( char[] s) {.....}; void write (wchar[] s) {.....}; void write (dchar[] s) {.....}; into void _write ( char[] s) {.....}; void _write (wchar[] s) {.....}; void _write (dchar[] s) {.....}; void write (char[] s) {_write(s)}; I think this would solve the issue with string literals as discussed in this thread.Then, how would one write a dchar[] literal? You just moved the problem to the _write() method instead. I think there needs to be a general resolution instead. One might infer the literal type from the content therein?Also, overloading would not be hampered. And, those who really _need_ types other than the 8 bit chars, could still have their types work as usual.Ahh. I think non-ASCII folks would be troubled by this bias <g>
Nov 11 2005
Kris wrote:"Georg Wrede" <georg.wrede nospam.org> wrote ...char[] does NOT NECESSARILY MEAN an ASCII-only string in D. char[] can be a collection of UTF-8 code points, which further confuses the matter. So long as you can process each variant of Unicode encodings (UTF-8, UTF-16, and UTF-32), it should NOT matter which you choose as your default encoding for your project's strings. The only effect of the choice is the efficiency with which your project processes strings. You should not lose any data, unless you make incorrect assumptions in your code. I think it was a very wise decision to make char type separate from byte and ubyte, but I don't think it has separated far enough. There should be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32). Then, there should be a single ASCII character type called 'char'. This would allow strings to be defined to hold ASCII characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points. String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character. The default encoding should be modifiable with either commandline options or with pragmas, preferrably pragmas. For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]'). Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.Kris wrote:That doesn't make it any less problematic :-)This is the long standing mishmash between character literal arguments and parameters of type char[], wchar[], and/or dchar[]. Character literals don't really have a "solid" type ~ the compiler can, and will, convert between wide and narrow representations on the fly.Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)It is a problem for small example programs. Larger programs tend to (and IMHO should) have wrappers anyhow:Not so. You'd see people complaining about this constantly if Stream.write() was not decorated to distinguish between the three relevant methods. Generally speaking, any code that deals with all three array types will bump into this. Mango.io has the same problem, since it exposes write() methods for every D type plus their array counterparts.Why not change Phobos void write ( char[] s) {.....}; void write (wchar[] s) {.....}; void write (dchar[] s) {.....}; into void _write ( char[] s) {.....}; void _write (wchar[] s) {.....}; void _write (dchar[] s) {.....}; void write (char[] s) {_write(s)}; I think this would solve the issue with string literals as discussed in this thread.Then, how would one write a dchar[] literal? You just moved the problem to the _write() method instead. I think there needs to be a general resolution instead. One might infer the literal type from the content therein?Also, overloading would not be hampered. And, those who really _need_ types other than the 8 bit chars, could still have their types work as usual.Ahh. I think non-ASCII folks would be troubled by this bias <g>
Nov 21 2005
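A small sketch of James's point that char[] is not ASCII-only, assuming a UTF-8 source file and the Phobos of the time: a single non-ASCII code point occupies several UTF-8 code units, and foreach over dchar decodes them transparently:

import std.stdio;

void main()
{
    char[] s = "å";  // U+00E5: one code point, two UTF-8 code units

    writefln("code units:  %d", s.length);  // prints 2

    int points = 0;
    foreach (dchar c; s)  // foreach decodes UTF-8 one code point at a time
        points++;
    writefln("code points: %d", points);    // prints 1
}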
On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:Kris wrote:"Georg Wrede" <georg.wrede nospam.org> wrote ...Kris wrote:That doesn't make it any less problematic :-)This is the long standing mishmash between character literal arguments and parameters of type char[], wchar[], and/or dchar[]. Character literals don't really have a "solid" type ~ the compiler can, and will, convert between wide and narrow representations on the fly.Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)It is a problem for small example programs. Larger programs tend to (and IMHO should) have wrappers anyhow:Not so. You'd see people complaining about this constantly if Stream.write() was not decorated to distinguish between the three relevant methods. Generally speaking, any code that deals with all three array types will bump into this. Mango.io has the same problem, since it exposes write() methods for every D type plus their array counterparts.Why not change Phobos void write ( char[] s) {.....}; void write (wchar[] s) {.....}; void write (dchar[] s) {.....}; into void _write ( char[] s) {.....}; void _write (wchar[] s) {.....}; void _write (dchar[] s) {.....}; void write (char[] s) {_write(s)}; I think this would solve the issue with string literals as discussed in this thread.Then, how would one write a dchar[] literal? You just moved the problem to the _write() method instead. I think there needs to be a general resolution instead. One might infer the literal type from the content therein?Also, overloading would not be hampered. And, those who really _need_ types other than the 8 bit chars, could still have their types work as usual.Ahh. I think non-ASCII folks would be troubled by this bias <g>char[] does NOT NECESSARILY MEAN an ASCII-only string in D. char[] can be a collection of UTF-8 code points, which further confuses the matter. So long as you can process each variant of Unicode encodings (UTF-8, UTF-16, and UTF-32), it should NOT matter which you choose as your default encoding for your project's strings. The only effect of the choice is the efficiency with which your project processes strings. You should not lose any data, unless you make incorrect assumptions in your code. I think it was a very wise decision to make char type separate from byte and ubyte, but I don't think it has separated far enough. There should be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32). Then, there should be a single ASCII character type called 'char'. This would allow strings to be defined to hold ASCII characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points. String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character. The default encoding should be modifiable with either commandline options or with pragmas, preferably pragmas. For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]'). Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.Very nice. Well said James. It makes so much sense when laid out like this. D is only half way there to supporting international character sets. -- Derek (skype: derek.j.parnell) Melbourne, Australia 22/11/2005 10:51:23 AM
Nov 21 2005
Derek Parnell wrote:On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:I think it was a very wise decision to make char type separate from byte and ubyte, but I don't think it has separated far enough. There should be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32). Then, there should be a single ASCII character type called 'char'. This would allow strings to be defined to hold ASCII characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points. String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character. The default encoding should be modifiable with either commandline options or with pragmas, preferably pragmas. For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]'). Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.Very nice. Well said James. It makes so much sense when laid out like this. D is only half way there to supporting international character sets.I agree, but there must be a way to improve internationalization without this degree of complexity. If D ends up with 6+ character types I think I might scream. Is there any reason to support C-style code pages in-language in D? I would like to think not. As it stands, D supports three compatible encodings (char, wchar, dchar) that the programmer may choose between for reasons of data size and algorithm complexity. The ASCII-compatible subset of UTF-8 works fine with the char-based C functions, and the full UTF-16 or UTF-32 character sets are compatible with the wchar-based C functions (depending on platform)... so far as I know at any rate. I grant that the variable size of wchar in C is an irritating problem, but it's not insurmountable. Why bother with all that old C code page nonsense? Sean
Nov 21 2005
On Mon, 21 Nov 2005 16:43:50 -0800, Sean Kelly wrote:Derek Parnell wrote:On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:I agree, but there must be a way to improve internationalization without this degree of complexity. If D ends up with 6+ character types I think I might scream.I think it was a very wise decision to make char type separate from byte and ubyte, but I don't think it has separated far enough. There should be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32). Then, there should be a single ASCII character type called 'char'. This would allow strings to be defined to hold ASCII characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points. String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character. The default encoding should be modifiable with either commandline options or with pragmas, preferably pragmas. For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]'). Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.Very nice. Well said James. It makes so much sense when laid out like this. D is only half way there to supporting international character sets.Where did you get "6+ character types" from? James is (at worst) only adding one, ASCII. So we would end up with utf8 <==> schar[] (Short? chars) utf16 <==> wchar[] (Wide chars) utf32 <==> dchar[] (Double-wide chars) ascii <==> char[] (byte size chars) But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now. Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represent a fixed size array of 4 code points. In this scheme, the old 'char' would be a directly compatible value with C/C++ legacy code.Is there any reason to support C-style code pages in-language in D?Huh? What code pages? This is nowhere near anything James was talking about.I would like to think not. As it stands, D supports three compatible encodings (char, wchar, dchar) that the programmer may choose between for reasons of data size and algorithm complexity. The ASCII-compatible subset of UTF-8 works fine with the char-based C functions, and the full UTF-16 or UTF-32 character sets are compatible with the wchar-based C functions (depending on platform)... so far as I know at any rate. I grant that the variable size of wchar in C is an irritating problem, but it's not insurmountable. Why bother with all that old C code page nonsense?Sure the current system can work, but only if the coder does a lot of mundane, error-prone work, to make it happen. The compiler is a tool to help coders do better, so it should help us take care of incidental housekeeping so we can concentrate on algorithms rather than data representations in RAM. -- Derek (skype: derek.j.parnell) Melbourne, Australia 22/11/2005 12:15:38 PM
Nov 21 2005
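As an illustration of the mundane work Derek mentions, a hypothetical helper (assuming std.utf's toUTF32) for fetching the n-th character of a UTF-8 string under the current scheme; plain s[n] would index code units, not code points:

import std.utf;

// Hypothetical helper: return the n-th *character* of a UTF-8 string.
dchar nthChar(char[] s, size_t n)
{
    dchar[] wide = toUTF32(s);  // decode the whole string up front
    return wide[n];             // indexing is now per code point
}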
"Derek Parnell" <derek psych.ward> wrote ...On Mon, 21 Nov 2005 16:43:50 -0800, Sean Kelly wrote:[snip]Maybe. To maintain array indexing semantics, the compiler might implement such things as an array of pointers to byte arrays? Then, there's at least this problem :: dchar is always self-contained. It does not have surrogates, ever. Given that it's more efficient to store as a one-dimensional array, surely this would cause inconsistencies in usage? And what about BMP utf16? It doesn't need such treatment either (though extended utf16 would do). But I agree in principal ~ the semantics of indexing (as in arrays) don't work well with multi code-unit encodings. Packages to deal with such things typically offer iterators as a supplement. Take a look at ICU for examples? [snip]Derek Parnell wrote:Where did you get "6+ character types" from? James is (at worst) only adding one, ASCII. So we would end up with utf8 <==> schar[] (Short? chars) utf16 <==> wchar[] (Wide chars) utf32 <==> dchar[] (Double-wide chars) ascii <==> char[] (byte size chars) But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now. Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represents a fixed size array of 4 code points.On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote: Very nice. Well said James. It makes so much sense when laid out like this. D is only half way there to supporting international character sets.I agree, but there must be a way to improve internationalization without this degree of complexity. If D ends up with 6+ character types I think I might scream.Sure the current system can work, but only if the coder does a lot of mundane, error-prone work, to make it happen. The compiler is a tool to help coders do better, so it should help us take care of incidental housekeeping so we can concentrate of algorithms rather than data representations in RAM.I suspect it's a tall order to build such things into the compiler; especially when the issues are not clear-cut, and when there are heavy-duty libraries to take up the slack? Don't those libraries take care of data representation and incidental housekeeping on behalf of the developer?
Nov 21 2005
Derek Parnell wrote:Where did you get "6+ character types" from?I misunderstood and thought his cdpt8 would be added in addition to the existing character types.James is (at worst) only adding one, ASCII. So we would end up with utf8 <==> schar[] (Short? chars) utf16 <==> wchar[] (Wide chars) utf32 <==> dchar[] (Double-wide chars) ascii <==> char[] (byte size chars) But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now. Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represent a fixed size array of 4 code points.This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again? And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?Sure the current system can work, but only if the coder does a lot of mundane, error-prone work, to make it happen. The compiler is a tool to help coders do better, so it should help us take care of incidental housekeeping so we can concentrate on algorithms rather than data representations in RAM.The only somewhat confusing issue to me is that the symbol names "char" and "wchar" imply that the data stored therein is a complete character, when this is only sometimes true. I agree that this is a problem, but I'm not sure that variable width characters are the solution. It makes array manipulations oddly inconsistent, for one thing. Should the length property return the number of characters in the array? Would a size property be needed to determine the memory footprint of this array? What if I try something like this: utf8[] myString = "multiwidth"; utf8[] slice = myString[0..1]; slice[0] = '\U00000001'; Would the sliced array resize to fit the potentially different-sized character being inserted, or would myString end up corrupted? Sean
Nov 21 2005
On Mon, 21 Nov 2005 19:05:53 -0800, Sean Kelly wrote:Derek Parnell wrote:That is what I'm doing now to Build. Internally, all strings will be dchar[], but what I'm finding out is the huge lack of support for dchar[] in phobos. I've now coded my own routine to read text files in UTF formats, but store them as dchar[] in the application. Then I've had to code appropriate routines for all the other support functions: split(), strip(), find(), etc ...Where did you get "6+ character types" from?I misunderstood and thought his cdpt8 would be added in addition to the existing character types.James is (at worst) only adding one, ASCII. So we would end up with utf8 <==> schar[] (Short? chars) utf16 <==> wchar[] (Wide chars) utf32 <==> dchar[] (Double-wide chars) ascii <==> char[] (byte size chars) But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now. Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represent a fixed size array of 4 code points.This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again? And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?Yes.Sure the current system can work, but only if the coder does a lot of mundane, error-prone work, to make it happen. The compiler is a tool to help coders do better, so it should help us take care of incidental housekeeping so we can concentrate on algorithms rather than data representations in RAM.The only somewhat confusing issue to me is that the symbol names "char" and "wchar" imply that the data stored therein is a complete character, when this is only sometimes true. I agree that this is a problem, but I'm not sure that variable width characters are the solution. It makes array manipulations oddly inconsistent, for one thing. Should the length property return the number of characters in the array?Would a size property be needed to determine the memory footprint of this array?Yes.What if I try something like this: utf8[] myString = "multiwidth"; utf8[] slice = myString[0..1]; slice[0] = '\U00000001'; Would the sliced array resize to fit the potentially different-sized character being inserted, or would myString end up corrupted?Yes, it would be complex. No, the myString would not be corrupted. It would just be the same as doing it 'manually', only the compiler will do the hack work for you. char[] myString = "multiwidth"; char[] slice = myString[0..1]; // modify base string. myString = "\U00000001" ~ myString[1..$]; // reslice it because its address might have changed. slice = myString[0..1]; Messy doing it manually, so that's why a code-point array would be better than a byte/short/int array for strings. -- Derek (skype: derek.j.parnell) Melbourne, Australia 22/11/2005 2:17:40 PM
Nov 21 2005
Derek Parnell wrote:On Mon, 21 Nov 2005 19:05:53 -0800, Sean Kelly wrote:Derek Parnell wrote:That is what I'm doing now to Build. Internally, all strings will be dchar[], but what I'm finding out is the huge lack of support for dchar[] in phobos. I've now coded my own routine to read text files in UTF formats, but store them as dchar[] in the application. Then I've had to code appropriate routines for all the other support functions: split(), strip(), find(), etc ...Where did you get "6+ character types" from?I misunderstood and thought his cdpt8 would be added in addition to the existing character types.James is (at worst) only adding one, ASCII. So we would end up with utf8 <==> schar[] (Short? chars) utf16 <==> wchar[] (Wide chars) utf32 <==> dchar[] (Double-wide chars) ascii <==> char[] (byte size chars) But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now. Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represent a fixed size array of 4 code points.This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again? And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?Then, wouldn't having good dchar[] support in Phobos be a better solution than having to introduce another type in the language to do the same thing that dchar[] does? The only difference I see between such a type (a codepoint string) and a dchar string is in better storage size for the codepoint string, but is that difference worth it? (not to mention a codepoint string would have (in certain cases) much worse modification performance than a dchar string). Also, what is Phobos lacking in dchar[] support? -- Bruno Medeiros - CS/E student "Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
Nov 23 2005
"James Dunne" <james.jdunne gmail.com> wrote ...Kris wrote:[snip]Indeed. I was alluding to encoding multi-byte-utf8 literals by hand; but it was a piss-poor attempt at humour.Ahh. I think non-ASCII folks would be troubled by this bias <g>char[] does NOT NECESSARILY MEAN an ASCII-only string in D. char[] can be a collection of UTF-8 code points, which further confuses the matter.So long as you can process each variant of Unicode encodings (UTF-8, UTF-16, and UTF-32), it should NOT matter which you choose as your default encoding for your project's strings. The only effect of the choice is the efficiency with which your project processes strings. You should not lose any data, unless you make incorrect assumptions in your code.Right.String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character.They are. The 'c', 'w', and 'd' suffix provides the fine control. Auto instances map implicitly to 'c'. Explicitly typed instances (e.g. wchar[] s = "a wide string";) also provide fine control. The minor concern I have with this aspect is that the literal content does not play a role, whereas it does with char literals (such as '?', '\x0001', and '\X00000001'). No big deal there, although perhaps it's food for another topic?The default encoding should be modifiable with either commandline options or with pragmas, preferrably pragmas.I wondered about that also. Walter pointed out it would be similar to the signed/unsigned char-type switch prevalent in C compilers, which can cause grief. Perhaps D does need defaults like that, but some consistency in the interpretation of string literals would have to happen first. This required a subtle change: That change is to assign a resolvable type to 'undecorated' string-literal arguments in the same way as the "auto" keyword does. This would also make it consistent with undecorated integer-literals (as noted elsewhere). In short, an undecorated argument "literal" would be treated as a decorated "literal"c (that 'c' suffix makes it utf8), just like auto does. This would mean all uses of string literals are treated consistently, and all undecorated literals (string, char, numeric) have consistent rules when it comes to overload resolution (currently they do not). To elaborate, here's the undecorated string literal asymmetry: auto s = "literal"; // effectively adds an implicit 'c' suffix myFunc ("literal"); // Should be changed to behave as above What I hear you asking for is a way to alter that implicit suffix? I'd be really happy to just get the consistency first :-) These instances are all (clearly) explicitly typed: char[] s = "literal"; // utf8 wchar[] s = "literal"; // utf16 dchar[] s = "literal"; // utf32 auto s = "literal"c; // utf8 auto s = "literal"w; // utf16 auto s = "literal"d; // utf32 myFunc ("literal"c); // utf8 myFunc ("literal"w); // utf16 myFunc ("literal"d); // ut32For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]'). Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.If I understand correctly, you can. See above.
Nov 21 2005
On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:The minor concern I have with this aspect is that the literal content does not play a role, whereas it does with char literals (such as '?', '\x0001', and '\X00000001').But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");No big deal there, although perhaps it's food for another topic?Here seems like as good a place as any. Regan
Nov 22 2005
"Regan Heath" <regan netwin.co.nz> wrote...On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:Oh, that minor concern was in regard to consistency here also. I have no quibble with the character type being implied by content (consistent with numeric literals): 1) The type for literal chars is implied by their content ('?', '\u0001', '\U00000001') 2) The type of a numeric literal is implied by the content (0xFF, 0xFFFFFFFF, 1.234) 3) The type for literal strings is not influenced at all by the content. far as I'm aware). These two inconsistencies are small, but they may influence concerns elsewhere ...The minor concern I have with this aspect is that the literal content does not play a role, whereas it does with char literals (such as '?', '\x0001', and '\X00000001').But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");No big deal there, although perhaps it's food for another topic?Here seems like as good a place as any.
Nov 22 2005
On Tue, 22 Nov 2005 15:01:11 -0800, Kris <fu bar.com> wrote:"Regan Heath" <regan netwin.co.nz> wrote...I realise that. I'm just trying to explore whether they _should_ behave the same, or not, are they both apples or are they apples and oranges. I agree things should behave consistently, provided it makes sense for them to do so.On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:Oh, that minor concern was in regard to consistency here also.The minor concern I have with this aspect is that the literal content does not play a role, whereas it does with char literals (such as '?', '\x0001', and '\X00000001').But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");No big deal there, although perhaps it's food for another topic?Here seems like as good a place as any.I have no quibble with the character type being implied by contentI didn't think you did. My example above is a string literal, not a character literal. If the string literal type was implied by content would my example above be an error? "\U00000040" is a dchar (sized) character in a string literal. "abc \U00000040 def" could be used also. foo requires a wchar. If the type of the literal is taken to be dchar, based on contents, then it does not match wchar and you need the 'w' suffix or similar to resolve it. It seems the real question is, what did the programmer intend? Did they intend for the character to be represented exactly as they typed it? In this case, if it was passed exactly as written it would become 2 wchar code units, did they want that? Or, did they simply want the equivalent character in the resulting encoding? I think the latter is more likely. The former can create illegal UTF sequences. What do you think? The facts: 1) The type for literal chars is implied by their content ('?', '\u0001', '\U00000001') 2) The type of a numeric literal is implied by the content (0xFF, 0xFFFFFFFF, 1.234) 3) The type for literal strings is not influenced at all by the content. smaller types. Similar enough? Or is it in fact different? (as far as I'm aware). I'm not aware of any either. Regan
Nov 23 2005
"Regan Heath" <regan netwin.co.nz> wroteOn Tue, 22 Nov 2005 15:01:11 -0800, Kris <fu bar.com> wrote:To clarify: I'm already making the assumption that the compiler changes to eliminate the uncommited aspect of argument literals. That presupposes the "default" type will be char[] (like auto literals). This is a further, and probably minor, question as to whether it might be useful (and consistent) that "default" type be implied by the literal content. Suffix 'typing' and compile-time transcoding are still present and able. I'm not at all sure it would be terribly useful, given that the literal will potentially be transcoded at compile-time anyway. [snip]"Regan Heath" <regan netwin.co.nz> wrote...I realise that. I'm just trying to explore whether they _should_ behave the same, or not, are they both apples or are they apples and oranges. I agree things should behave consistently, provided it makes sense for them to do so.On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:Oh, that minor concern was in regard to consistency here also.The minor concern I have with this aspect is that the literal content does not play a role, whereas it does with char literals (such as '?', '\x0001', and '\X00000001').But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");No big deal there, although perhaps it's food for another topic?Here seems like as good a place as any.I have no quibble with the character type being implied by contentI didn't think you did. My example above is a string literal, not a character literal. If the string literal type was implied by content would my example above be an error?I think the latter is more likely. The former can create illegal UTF sequences. What do you think?I think I'd be perfectly content once argument-literals lose their uncommited status, and thus behave like auto literals <g>
Nov 23 2005
On Wed, 23 Nov 2005 13:58:20 -0800, Kris <fu bar.com> wrote:"Regan Heath" <regan netwin.co.nz> wroteOn Tue, 22 Nov 2005 15:01:11 -0800, Kris <fu bar.com> wrote:Same.On Tue, 22 Nov 2005 15:01:11 -0800, Kris <fu bar.com> wrote:To clarify: I'm already making the assumption that the compiler changes to eliminate the uncommitted aspect of argument literals. That presupposes the "default" type will be char[] (like auto literals)."Regan Heath" <regan netwin.co.nz> wrote...I realise that. I'm just trying to explore whether they _should_ behave the same, or not, are they both apples or are they apples and oranges. I agree things should behave consistently, provided it makes sense for them to do so.On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu bar.com> wrote:Oh, that minor concern was in regard to consistency here also.The minor concern I have with this aspect is that the literal content does not play a role, whereas it does with char literals (such as '?', '\x0001', and '\X00000001').But that makes sense, right? Character literals i.e. '\X00000001' will only _fit_ in certain types, the same is not true for string literals which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?) If this were to change would it make this an error: foo(wchar[] foo) {} foo("\U00000040");No big deal there, although perhaps it's food for another topic?Here seems like as good a place as any.I have no quibble with the character type being implied by contentI didn't think you did. My example above is a string literal, not a character literal. If the string literal type was implied by content would my example above be an error?This is a further, and probably minor, question as to whether it might be useful (and consistent) that the "default" type be implied by the literal content.Yes, that is what I thought we were doing, questioning whether it would be useful. My current feeling is that it's not, but we'll see...Suffix 'typing' and compile-time transcoding are still present and available.Yep.I'm not at all sure it would be terribly useful, given that the literal will potentially be transcoded at compile-time anyway.Like in my first example: foo(wchar[] foo) {} foo("\U00000040"); the string containing the dchar content would in fact be transcoded to wchar at compile time to match the one available overload. So, when wouldn't it be transcoded at compile time? All I can think of is "auto", eg. auto test = "abc \U00000040 def"; So, if this is the only case where the string contents make a difference I would call that inconsistent, and would instead opt for using the string literal suffix to specify an encoding where required, eg. auto test = "abc \U00000040 def"d; Then the statement "all string literals default to char[] unless the required encoding can be determined at compile time" would be true. Regan
Nov 23 2005
"Regan Heath" <regan netwin.co.nz> wrote in message news [snip]Then the statement "all string literals default to char[] unless a the required encoding can be determined at compile time" would be true.That would be great. Now, will this truly come to pass? <g>
Nov 23 2005
Kris wrote:This is the long standing mishmash between character literal arguments and parameters of type char[], wchar[], and/or dchar[]. Character literals don't really have a "solid" type ~ the compiler can, and will, convert between wide and narrow representations on the fly. Suppose you have the following methods: void write (char[] x); void write (wchar[] x); void write (dchar[] x); Given a literal argument: write ("what am I?"); D doesn't know whether to invoke the char[] or wchar[] signature, since the literal is treated as though it's possibly any of the three types. This is the kind of non-determinism you get when the compiler becomes too 'smart' (unwarranted automatic conversion, in this case).I agree, except that I think the problem in this case is that it's not converting "from" anything! There's no "exact match" which it tries first. A parallel case is that a floating point literal can be implicitly converted to float, double, real, cfloat, cdouble, creal. For fp literals, the default is double. It's a bit odd that with dchar [] q = "abc"; wchar [] w = "abc" "abc" is a dchar literal the first time, but a wchar literal the second, whereas with real q = 2.5; double w = 2.5; 2.5 is a double literal in both cases. No wonder array literals are such a problem...
Nov 11 2005
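A sketch of the asymmetry Don describes, assuming the overload rules of the time; the floating point literal has a committed default (double), so the first call resolves, while the uncommitted string literal makes the second call ambiguous:

void f (float x)  {}
void f (double x) {}

void g (char[] s)  {}
void g (wchar[] s) {}

void main()
{
    f (2.5);    // ok: 2.5 defaults to double, an exact match
    g ("abc");  // error: the literal matches both overloads equally
}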
"Don Clugston" <dac nospam.com.au> wrote ..Kris wrote:There would be if the auto-casting were disabled, and the type were determined via the literal content, in conjunction with the /default/ literal type suggested by GW. Yes?D doesn't know whether to invoke the char[] or wchar[] signature, since the literal is treated as though it's possibly any of the three types. This is the kind of non-determinism you get when the compiler becomes too 'smart' (unwarranted automatic conversion, in this case).I agree, except that I think the problem in this case is that it's not converting "from" anything! There's no "exact match" which it tries first.
Nov 11 2005
Kris says..."Don Clugston" <dac nospam.com.au> wrote ..Gosh, all I wanted was a simple explanation. :-) (kidding) I used writeString and it works, |17:24:22.68>type ftest.d |import std.file; |import std.stream; |int main() |{ | File log = new File("myfile.txt",FileMode.Out); | log.writeString("this is a test"); | log.close(); | return 1; |} thanks. Please, continue with your discussion. :-) joséKris wrote:There would be if the auto-casting were disabled, and the type were determined via the literal content, in conjunction with the /default/ literal type suggested by GW. Yes?D doesn't know whether to invoke the char[] or wchar[] signature, since the literal is treated as though it's possibly any of the three types. This is the kind of non-determinism you get when the compiler becomes too 'smart' (unwarranted automatic conversion, in this case).I agree, except that I think the problem in this case is that it's not converting "from" anything! There's no "exact match" which it tries first.
Nov 11 2005
In article <dl0hja$2aal$1 digitaldaemon.com>, jicman says...So, I have this complicated piece of code: |import std.file; |import std.stream; |int main() |{ | File log = new File("myfile.txt",FileMode.Out); | log.write("this is a test"); | log.close(); | return 1; |}Also note one thing though: Stream.write() will write the string in binary format, ie. it will write a binary int with the length, and then the string. If you want a plain ASCII file, which is probably what you want in a log file, you should use Stream.writeString(), or Stream.writeLine() which inserts a line break. Or you can use writef/writefln for more advanced formatting. If you already knew this then disregard this post ;-) Nick
Nov 11 2005
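A sketch of the plain-text logging Nick suggests, against the std.stream API of the time; writeString() emits the raw characters and writeLine() appends a line break, avoiding write()'s binary length prefix:

import std.file;
import std.stream;

int main()
{
    File log = new File("myfile.txt", FileMode.Out);

    log.writeString("this is a test");  // plain text, no length prefix
    log.writeLine("second entry");      // plain text plus a line break

    log.close();
    return 0;
}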
Nick says...In article <dl0hja$2aal$1 digitaldaemon.com>, jicman says...Disregard this post? Oh, no! My friend, you wrote it, I am going to read it. (Yes, I knew that. I was trying to quickly write some debugging code for something at work, and I found that compiler error and asked.) Thanks. joséSo, I have this complicated piece of code: |import std.file; |import std.stream; |int main() |{ | File log = new File("myfile.txt",FileMode.Out); | log.write("this is a test"); | log.close(); | return 1; |}Also note one thing though: Stream.write() will write the string in binary format, ie. it will write a binary int with the length, and then the string. If you want a plain ASCII file, which is probably what you want in a log file, you should use Stream.writeString(), or Stream.writeLine() which inserts a line break. Or you can use writef/writefln for more advanced formatting. If you already knew this then disregard this post ;-)
Nov 11 2005