D - Unicode discussion
- Elias Martenson (72/72) Dec 15 2003 DISCLAIMER: I am not a "D programmer". I certainly haven't written any
- Walter (20/84) Dec 15 2003 The data type you're looking for is implemented in D and is the 'dchar'....
- Elias Martenson (50/98) Dec 15 2003 Actually, byte or ubyte doesn't really matter. One is not supposed to
- Walter (50/103) Dec 15 2003 In a higher level language, yes. But in doing systems work, one always s...
- Lewis (6/167) Dec 15 2003 heres a page i found with some c++ code that may help in creating decode...
- Elias Martenson (6/13) Dec 16 2003 IBM has a set of Unicode tools. Last time I googled for them I found
- uwem (5/18) Dec 16 2003 You mean icu?!
- Elias Martenson (5/8) Dec 16 2003 Yes that's it! No wonder I didn't find it, I was searching for "classes
- Sean L. Palmer (8/17) Dec 16 2003 seems
- Walter (3/8) Dec 16 2003 You're right.
- Elias Martenson (7/19) Dec 17 2003 Agreed. Some kind of itarator for strings are desperately needed.
- Elias Martenson (85/154) Dec 16 2003 All right. I can accept this, of course. The problem I still have with
- Ben Hinkle (33/187) Dec 16 2003 I think Walter once said char had been called 'ascii'. That doesn't soun...
- Elias Martenson (9/14) Dec 16 2003 No. This would be extremely bad. The (unfortunately) very large amount
- Ben Hinkle (22/34) Dec 16 2003 sound
- Elias Martenson (31/51) Dec 16 2003 For legacy code, you should have to take an extra step to make it work.
- Ben Hinkle (7/11) Dec 16 2003 I didn't say the default type should be ASCII. I just said it should be
- Elias Martenson (32/46) Dec 16 2003 But for all intents and puproses, ASCII does not exist anymore. It's a
- Elias Martenson (4/5) Dec 16 2003 ^^^^ 8999 of course
- Ben Hinkle (20/24) Dec 16 2003 which was why I suggested doing away with the generic "char" type entire...
- Elias Martenson (21/47) Dec 17 2003 No, it would certainly NOT be safe. You must remember that ASCII doesn't...
- Hauke Duden (11/17) Dec 17 2003 None of these alternatives is correct. printf will only work correctly
- Charles (16/240) Dec 16 2003 It does sound insane, I like it. I vote for this.
- Carlos Santander B. (16/16) Dec 16 2003 "Elias Martenson" wrote in message
- Elias Martenson (18/25) Dec 16 2003 mbstowcs() = multi byte string to wide character string
- Carlos Santander B. (40/40) Dec 16 2003 Thank you both.
- Julio César Carrascal Urquijo (4/10) Dec 16 2003 mbstowcs - Multi Byte to Wide Character String
- Andy Friesen (4/11) Dec 16 2003 Ironically enough, you question answers Elias's question quite
- Elias Martenson (4/16) Dec 17 2003 Dang! How do you americans say? Three strikes, I'm out. :-)
- Walter (34/121) Dec 16 2003 Yes.
- Sean L. Palmer (26/42) Dec 17 2003 It's stupid to not agree on a standard size for char, since it's easy to
- Elias Martenson (15/38) Dec 17 2003 C doesn't define any standard sizes at all (well, you do have stdint.h
- Sean L. Palmer (4/6) Dec 17 2003 Sorry, "sign" of char.
- Elias Martenson (63/118) Dec 17 2003 Memory-wise perhaps. But for everything else UTF-8 is always slower.
- Walter (1/1) Dec 17 2003 I think we're mostly in agreement!
- Hauke Duden (34/46) Dec 16 2003 This is simply not true, Walter. The world has not gotten used to
- Elias Martenson (17/59) Dec 16 2003 Indeed. In many cases existing code would actually continue working,
- Hauke Duden (19/23) Dec 16 2003 They are not quite as few as one may think. For example, if you pass an
- Elias Martenson (11/24) Dec 16 2003 Exactly. But the number of functions that do these things are still
- Walter (55/94) Dec 16 2003
- Hauke Duden (35/59) Dec 17 2003 Right, it has been around for decades. And people still don't use it
- Walter (34/68) Dec 17 2003 UTF-8 has some nice advantages over other multibyte encodings in that it...
- Roald Ribe (58/126) Dec 18 2003 is
- Sean L. Palmer (12/19) Dec 18 2003 You raise some good points.
- Lewis (3/5) Dec 18 2003 Sorry if im stating something i lack knowledge in, but if there were no ...
- Elias Martenson (4/10) Dec 18 2003 Most likely ushort[].
- Walter (29/63) Dec 18 2003 I can't really stop clueless/lazy programmers from writing bad code
- Elias Martenson (26/51) Dec 19 2003 But it is possible to make it harder to do so. I believe that is what
- Sean L. Palmer (20/42) Dec 19 2003 Probably true.
- Rupert Millard (80/80) Dec 19 2003 There has been a lot of talk about doing things, but very little has
- Sean L. Palmer (15/30) Dec 19 2003 Cool beans! Thanks, Rupert!
- Rupert Millard (12/46) Dec 19 2003 I agree with you, but we just have to grin and bear it, unless / until
- Walter (11/62) Dec 19 2003 The problem with the operater* or operator~ syntax is it is ambiguous. I...
- Rupert Millard (13/40) Dec 20 2003 If you say it's ambiguous, I'll take your word for it and if you think b...
- Walter (3/5) Dec 20 2003 I haven't got that far yet!
- Sean L. Palmer (8/10) Dec 20 2003 It would be greppable if it were required that there be no space between...
- Sean L. Palmer (23/39) Dec 18 2003 and
- Elias Martenson (9/23) Dec 18 2003 This is why I have advocated a rename of dchar to char, and the current
- Walter (8/18) Dec 18 2003 Yes. Exactly.
- Karl Bochert (21/33) Dec 20 2003 A char is defined as a UTF-8 character but does not have enough storage ...
- Elias Martenson (8/14) Dec 20 2003 It's a fixed memory type. Look at it as an ubyte, but with some special
- Walter (18/35) Dec 20 2003 hold one!?
- Roald Ribe (8/17) Dec 21 2003 to
- Walter (3/10) Dec 22 2003 Sure, perhaps I misunderstood him.
- Serge K (3/6) Dec 30 2003 UTF-8 can represent all Unicode characters with no more then 4 bytes.
- Rupert Millard (9/16) Dec 21 2003 On Friday 19th, I posted a class that provides this functionality to thi...
- Ant (7/18) Dec 21 2003 I sorry to interrup
- Rupert Millard (13/38) Dec 21 2003 You had me worried here because I missed that post! However, they do
- Ilya Minkov (18/18) Dec 21 2003 I think this discussion of "language being wrong" is wrong. It is
- Hauke Duden (37/70) Dec 19 2003 The only situation I can think of where this might be useful is if you
- Walter (7/26) Dec 19 2003 I had the same thoughts!
- Hauke Duden (31/47) Dec 19 2003 Not really ;).
- Hauke Duden (27/31) Dec 19 2003 Just to clarify: I meant this in the context of creating a string
- Serge K (20/26) Dec 30 2003 UTF-32 never takes less memory than UTF-8. Period.
- Roald Ribe (15/41) Dec 30 2003 This is a good point. But I stand my ground: it may result in up to
- Serge K (36/48) Jan 03 2004 RTFM.
- Matthias Becker (1/6) Dec 17 2003 Shouldn't this wrapper be part of Phobos?
- Walter (7/13) Dec 17 2003 seems
- Roald Ribe (40/133) Dec 31 2003 seems
- Elohe (16/16) Jan 07 2004 First: I'm new in D and my english are bad.
DISCLAIMER: I am not a "D programmer". I certainly haven't written any real-world applications in the language yet, but I am very knowledgeable in localisation issues. After the recent discussion regarding Unicode in D, which seems to have faded away now, I have decided to write some initial comments on what needs to be done to the language and APIs to make it support all languages, not only English and Latin (which to my knowledge are the only languages that can be written using 7-bit ASCII).

char types
----------
Today, according to the specification, there are three char types: char, wchar and dchar. These are then used in arrays to create three different kinds of internal string representations: UTF-8, UTF-16 and UTF-32. There are several problems with this. First and foremost, with an expression such as "char[] foo" you get the impression that this is an array of characters. This is wrong. The UTF-8 specification dictates that a UTF-8 string is an array of bytes, not characters. This is an important distinction to make, since you cannot take the n'th character from a UTF-8 stream like this: string[n], because you may get part of a multibyte character sequence. The wchar data type has the exact same problem, since it uses UTF-16, which also uses variable lengths for its characters.

What is needed is a "char" datatype that is in fact able to hold a character. You need 21 bits to describe a Unicode character (Unicode allocates 17*2^16 code points, not all of which are defined yet), and therefore it seems reasonable to use a 32-bit data type for this. In my opinion this data type should be named "char". For UTF-8 and UTF-16 strings, one can use the "byte" and "short" data types, which would be in keeping with the Unicode standard, which (to my knowledge, I'd have to look up the exact wording) declares UTF-8 and UTF-16 strings to be sequences of bytes and 16-bit words respectively, and not "characters".

String classes and functions
----------------------------
There is a set of const char[] arrays containing various character sequences, including: hexdigits, digits, uppercase, letters, whitespace, etc. There are also character classification functions that accept 8-bit characters. These should really be replaced by a new but similar set of functions that work with 32-bit char types: isAlpha(), isNumber(), isUpper(), isLower(), isWhiteSpace(). These cannot be inlined functions, since newer versions of the Unicode standard can declare new code points and we need to be forward compatible. Another function is also needed: getCharacterCategory(), which returns the Unicode category. Some other functions are needed to determine other properties of the characters, such as the directionality. Take a look at the Java classes java.text.BreakIterator and java.text.Bidi to get some ideas.

Streams
-------
The current std.stream is not adequate for Unicode. It doesn't seem to take encodings into consideration at all but is simply a binary interface. Strings in the Phobos stream library seem to deal primarily with char[] and wchar[]. The most important stream type, dchar[], is not even considered. Another problem with the library is that the point at which native encoding<->Unicode conversion is performed is not defined. Personally, I have not given this much consideration yet, although I kind of like the way Java did it by introducing two different kinds of streams, byte streams and character streams. More discussion is clearly needed.
Interoperability
----------------
In particular, C often uses 8-bit char arrays to represent strings. This causes a problem when all strings are 32-bit internally. The most straightforward solution is to convert the UTF-32 char[] to a UTF-8 byte[] before a call to a legacy function. This would also very elegantly deal with the problem of zero-terminated C strings vs. non-zero-terminated D strings (one of the char[]->UTF-8 conversion functions should create a zero-terminated byte array).
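To make this concrete, here is a rough sketch of what such a call site could look like in current D terms (completely untested; it assumes conversion and zero-termination helpers along the lines of a toUTF8() and a toStringz(), and legacy_puts is just a made-up C function for the example):

    import std.utf;      // toUTF8 (assumed)
    import std.string;   // toStringz (assumed)

    // Made-up prototype for some legacy C function that wants a NUL-terminated string.
    extern (C) void legacy_puts(char* s);

    // Sketch: hand a UTF-32 D string to the legacy function.
    void callLegacy(dchar[] msg)
    {
        char[] utf8 = std.utf.toUTF8(msg);          // UTF-32 -> UTF-8
        legacy_puts(std.string.toStringz(utf8));    // add the trailing '\0' C expects
    }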
Dec 15 2003
"Elias Martenson" <elias-m algonet.se> wrote in message news:brjvsf$28lb$1 digitaldaemon.com...char types ---------- Today, according to the specification, there are three char types. char, wchar and dchar. These are then used in an array to create three different kinds of internal string representaions: UTF-8, UTF-16 and UTF-32. There are several problems with this. First and foremost, when an expression such as this: "char[] foo" you get the impression that this is an array of characters. This is wrong. The UTF-8 specification dictates that a UTF-8 string is an array of bytes, not characters. This is an important distiction to make since you cannot take the n'th character from a UTF-8 stream like this: string[n], since you may get a part of a multibyte character sequence. The wchar data type has the exact same problem, since it uses UTF-16 which also uses variable lengths for its characters. What is needed is a "char" datatype that is infact able to hold a character. You need 21 bits to describe a unicode character (Unicode allocates 17*2^16 code points, all of which are not yet defined) and therefore it seems reasonable to use a 32-bit data type for this. In my opinion this data type should be named "char". For UTF-8 and UTF-16 strings, one can use the "byte" and "short" data types, which would be in keeping with the Unicode standards which (to my knowledge, I'd have to look up the exact wording) declare UTF-8 strings as being sequences of bytes and 16-bit words respectively, and not "characters".The data type you're looking for is implemented in D and is the 'dchar'. A 'dchar' is 32 bits wide, wide enough for all the current and future unicode characters. A 'char' is really a UTF-8 byte and a 'wchar' is really a UTF-16 short. Having 'char' be a separate type from 'byte' is pretty handy for overloading purposes. (A minor clarification, 'byte' in D is signed, I think you meant 'ubyte', since UTF-8 bytes are unsigned.)String classes and functions ---------------------------- There are a set of const char[] arrays containing various character sequences including: hexdigits, digits, uppercase, letters, whitespace, etc... There are also character classification functions that accept 8-bit characters. These should really be replaced by a new but similar set of functions that work with 32-bit char types. isAlpha(), isNumber(), isUpper(), isLower(), isWhiteSpace() These cannot be inlined functions since newer versions of the Unicode standard can declare new code points and we need to be forward compatible. Another funtion is also needed: getCharacterCategory() which returns the Unicode category. Some other functions are needed to determine other properites of the characters such as the directionality. Take a look at the Java classes java.text.BreakIterator and java.text.Bidi to get some ideas.I agree that more needs to be done in the D runtime library along these lines. I am not an expert on unicode - would you care to write those functions and contribute them to the D project?Streams ------- The current std.stream is not adequate for Unicode. It doesn't seem to take encodings into consideration at all but is simply a binary interface.That's correct.Strings in the Phobos stream library seems to deal primarily with char[] and wchar[]. The most important stream type, dchar[] is not even considered. Another problem with the library is that the point as which native encodiding<->unicode conversion is performed is not defined.That's correct as well. The library's support for unicode is inadequate. 
But there also is a nice package (std.utf) which will convert between char[], wchar[], and dchar[]. This can be used to convert the text strings into whatever unicode stream type the underlying operating system API supports. (For win32 this would be UTF-16, I am unsure what linux supports.)Personally, I have not given this much considering yet, although I kind of like the way Java did it by introducing two different kinds of streams, byte streams and character stream. More discussion is clearly needed. Interoperability ---------------- In particular, C often uses 8-bit char arrays to represent strings. This causes a problem when all strings are 32-bit internally. The most straightforward olution is to convert UTF-32 char[] to UTF-8 byte[] before a call to a legacy function. This would also very elegantly deal with the problem is zero-terminated C strings, vs. non-zero terminated D strings (one of the char[]->UTF-8 conversions functions should create a zero-terminated byte array).D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name API's to UTF-16 via std.utf.toUTF16z(), for use calling the win32 API's.
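For example, something roughly like this (just a sketch, not compiled; the MessageBoxW prototype is simply assumed from the usual win32 headers):

    import std.utf;

    // Assumed prototype of the win32 "wide" API being called.
    extern (Windows) int MessageBoxW(void* hwnd, wchar* text, wchar* caption, uint type);

    void showMessage(char[] msg)
    {
        // char[] (UTF-8) -> zero-terminated UTF-16 at the OS boundary
        MessageBoxW(null, std.utf.toUTF16z(msg), std.utf.toUTF16z("D"), 0);
    }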
Dec 15 2003
Den Mon, 15 Dec 2003 02:28:01 -0800 skrev Walter:Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway. The overloading issue is interesting, but may I suggest that char and whcar are at least renamed to something more appropriate? Maybe utf8byte and utf16byte? I feel it's important to point out that they aren't characters. And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such function doesn't help either. I was almost going to provide a summary of the issues we're having in C with regards to this, but I don't know if it's necessary, and it's also getting late here (work early tomorrow).In my opinion this data type should be named "char". For UTF-8 and UTF-16 strings, one can use the "byte" and "short" data types, which would be in keeping with the Unicode standards which (to my knowledge, I'd have to look up the exact wording) declare UTF-8 strings as being sequences of bytes and 16-bit words respectively, and not "characters".The data type you're looking for is implemented in D and is the 'dchar'. A 'dchar' is 32 bits wide, wide enough for all the current and future unicode characters. A 'char' is really a UTF-8 byte and a 'wchar' is really a UTF-16 short. Having 'char' be a separate type from 'byte' is pretty handy for overloading purposes. (A minor clarification, 'byte' in D is signed, I think you meant 'ubyte', since UTF-8 bytes are unsigned.)I'd love to help out and do these things. But two things are needed first: - At least one other person needs to volunteer. I've had bad experiences when one person does this by himself, - The core concepts needs to be decided upon. Things seems to be somewhat in flux right now, with three different string types and all. At the very least it needs to be deicded what a "string" really is, is it a UTF-8 byte sequence or a UTF-32 character sequence? I haven't hid the fact that I would prefer the latter.[ my own comments regarding strings snipped ]I agree that more needs to be done in the D runtime library along these lines. I am not an expert on unicode - would you care to write those functions and contribute them to the D project?Agree. And as such it's very good.Streams ------- The current std.stream is not adequate for Unicode. It doesn't seem to take encodings into consideration at all but is simply a binary interface.That's correct.Yes. But this would then assume that char[] is always in native encoding and doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as the stream actually performs native decoding to UTF-8 when reading into a char[] array. Unless fundamental encoding/decoding is embedded in the streams library, it would be best to simply read text data into a byte array and then perform native decoding manually afterwards using functions similar to the C mbstowcs() and wcstombs(). The drawback to this is that you cannot read text data in platform encoding without copying through a separate buffer, even in cases when this is not needed.Strings in the Phobos stream library seems to deal primarily with char[] and wchar[]. The most important stream type, dchar[] is not even considered. 
Another problem with the library is that the point as which native encodiding<->unicode conversion is performed is not defined.That's correct as well. The library's support for unicode is inadequate. But there also is a nice package (std.utf) which will convert between char[], wchar[], and dchar[]. This can be used to convert the text strings into whatever unicode stream type the underlying operating system API supports. (For win32 this would be UTF-16, I am unsure what linux supports.)This can be done in a much better, platform independent way, by using the native<->unicode conversion routines. In C, as already mentioned, these are called mbstowcs() and wcstombs(). For Windows, these would convert to and from UTF-16. For Unix, these would convert to and from whatever encoding the application is running under (dictated by the LC_CTYPE environment variable). There really is no need to make the API's platform dependent in any way here. In general, you should be able to open a file, by specifying the file name as a dchar[], and then the libraries should handle the rest. This goes for all the other methods and functions that accept string parameters. This of course still depends on what a "string" really is, this really needs to be decided, and I think you are the only one who can make that call. Although more discussion on the subject might be needed first? Regards Elias MÃ¥rtensonIn particular, C often uses 8-bit char arrays to represent strings. This causes a problem when all strings are 32-bit internally. The most straightforward olution is to convert UTF-32 char[] to UTF-8 byte[] before a call to a legacy function. This would also very elegantly deal with the problem is zero-terminated C strings, vs. non-zero terminated D strings (one of the char[]->UTF-8 conversions functions should create a zero-terminated byte array).D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name API's to UTF-16 via std.utf.toUTF16z(), for use calling the win32 API's.
Dec 15 2003
"Elias Martenson" <no spam.spam> wrote in message news:pan.2003.12.15.23.07.24.569047 spam.spam...Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.The overloading issue is interesting, but may I suggest that char andwhcarare at least renamed to something more appropriate? Maybe utf8byte and utf16byte? I feel it's important to point out that they aren't characters.I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such function doesn't help either.I think the library functions should be improved to handle unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.I'd love to help out and do these things. But two things are needed first: - At least one other person needs to volunteer. I've had bad experiences when one person does this by himself,You're not by yourself. There's a whole D community here!- The core concepts needs to be decided upon. Things seems to be somewhat in flux right now, with three different string types and all. At the very least it needs to be deicded what a "string" really is, is it a UTF-8 byte sequence or a UTF-32 character sequence? I haven't hid the fact that I would prefer the latter.A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.ButThat's correct as well. The library's support for unicode is inadequate.char[],there also is a nice package (std.utf) which will convert betweensupports.wchar[], and dchar[]. This can be used to convert the text strings into whatever unicode stream type the underlying operating system APIchar[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.(For win32 this would be UTF-16, I am unsure what linux supports.)Yes. But this would then assume that char[] is always in native encoding and doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as the stream actually performs native decoding to UTF-8 when reading into a char[] array.Unless fundamental encoding/decoding is embedded in the streams library, it would be best to simply read text data into a byte array and then perform native decoding manually afterwards using functions similar to the C mbstowcs() and wcstombs(). The drawback to this is that you cannot read text data in platform encoding without copying through a separate buffer, even in cases when this is not needed.If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. 
If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue. I feel that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name API's to UTF-16 via std.utf.toUTF16z(), for use calling the win32 API's.This can be done in a much better, platform independent way, by using the native<->unicode conversion routines.In C, as already mentioned, these are called mbstowcs() and wcstombs(). For Windows, these would convert to and from UTF-16. For Unix, these would convert to and from whatever encoding the application is running under (dictated by the LC_CTYPE environment variable). There really is no need to make the API's platform dependent in any way here.After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF. This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!) UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ascii code, you can write portable code that is fully capable of handling the complete unicode character set.In general, you should be able to open a file, by specifying the file name as a dchar[], and then the libraries should handle the rest.It does that now, except they take a char[].This goes for all the other methods and functions that accept string parameters. This of course still depends on what a "string" really is, this really needs to be decided, and I think you are the only one who can make that call. Although more discussion on the subject might be needed first?It's been debated here before <g>.
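Such a wrapper might look roughly like this (untested sketch; it assumes std.utf.decode advances the index it is given and returns one decoded dchar, and that opIndex overloading works here the way the docs describe):

    import std.utf;

    // Sketch: a char[] wrapper whose [] yields whole characters, not bytes.
    class Utf8String
    {
        char[] data;

        this(char[] s) { data = s; }

        // n'th *character*; note this is linear in n, since the UTF-8
        // sequence must be walked from the start.
        dchar opIndex(uint n)
        {
            uint i = 0;
            dchar c;
            for (uint k = 0; k <= n; k++)
                c = std.utf.decode(data, i);   // decode() advances i past one character
            return c;
        }
    }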
Dec 15 2003
Walter wrote:"Elias Martenson" <no spam.spam> wrote in message news:pan.2003.12.15.23.07.24.569047 spam.spam...heres a page i found with some c++ code that may help in creating decoders etc... http://www.elcel.com/docs/opentop/API/ot/io/InputStreamReader.html for windows coding its easy enough to use com api's to manipulate and create unicode strings? (for utf16)Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.The overloading issue is interesting, but may I suggest that char andwhcarare at least renamed to something more appropriate? Maybe utf8byte and utf16byte? I feel it's important to point out that they aren't characters.I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such function doesn't help either.I think the library functions should be improved to handle unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.I'd love to help out and do these things. But two things are needed first: - At least one other person needs to volunteer. I've had bad experiences when one person does this by himself,You're not by yourself. There's a whole D community here!- The core concepts needs to be decided upon. Things seems to be somewhat in flux right now, with three different string types and all. At the very least it needs to be deicded what a "string" really is, is it a UTF-8 byte sequence or a UTF-32 character sequence? I haven't hid the fact that I would prefer the latter.A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.ButThat's correct as well. The library's support for unicode is inadequate.char[],there also is a nice package (std.utf) which will convert betweensupports.wchar[], and dchar[]. This can be used to convert the text strings into whatever unicode stream type the underlying operating system APIchar[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.(For win32 this would be UTF-16, I am unsure what linux supports.)Yes. But this would then assume that char[] is always in native encoding and doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as the stream actually performs native decoding to UTF-8 when reading into a char[] array.Unless fundamental encoding/decoding is embedded in the streams library, it would be best to simply read text data into a byte array and then perform native decoding manually afterwards using functions similar to the C mbstowcs() and wcstombs(). 
The drawback to this is that you cannot read text data in platform encoding without copying through a separate buffer, even in cases when this is not needed.If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue. I feel that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name API's to UTF-16 via std.utf.toUTF16z(), for use calling the win32 API's.This can be done in a much better, platform independent way, by using the native<->unicode conversion routines.In C, as already mentioned, these are called mbstowcs() and wcstombs(). For Windows, these would convert to and from UTF-16. For Unix, these would convert to and from whatever encoding the application is running under (dictated by the LC_CTYPE environment variable). There really is no need to make the API's platform dependent in any way here.After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF. This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!) UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ascii code, you can write portable code that is fully capable of handling the complete unicode character set.In general, you should be able to open a file, by specifying the file name as a dchar[], and then the libraries should handle the rest.It does that now, except they take a char[].This goes for all the other methods and functions that accept string parameters. This of course still depends on what a "string" really is, this really needs to be decided, and I think you are the only one who can make that call. Although more discussion on the subject might be needed first?It's been debated here before <g>.
Dec 15 2003
Lewis wrote:heres a page i found with some c++ code that may help in creating decoders etc... http://www.elcel.com/docs/opentop/API/ot/io/InputStreamReader.html for windows coding its easy enough to use com api's to manipulate and create unicode strings? (for utf16)

IBM has a set of Unicode tools. Last time I googled for them I found them right away, but now I can't. I'll keep looking and post again when I find the link. Regards Elias Mårtenson
Dec 16 2003
You mean icu?! http://oss.software.ibm.com/icu/ Bye uwe In article <brmlf3$83b$1 digitaldaemon.com>, Elias Martenson says...Lewis wrote:heres a page i found with some c++ code that may help in creating decoders etc... http://www.elcel.com/docs/opentop/API/ot/io/InputStreamReader.html for windows coding its easy enough to use com api's to manipulate and create unicode strings? (for utf16)IBM has a set of Unicode tools. Last time I googled for them I found them right away but not I can't. I'll keel looking and post again when I find the link. Regards Elias Mårtenson
Dec 16 2003
uwem wrote:You mean icu?! http://oss.software.ibm.com/icu/Yes that's it! No wonder I didn't find it, I was searching for "classes for unicode". Regards Elias Mårtenson
Dec 16 2003
"Walter" <walter digitalmars.com> wrote in message news:brll85$1oko$1 digitaldaemon.com..."Elias Martenson" <no spam.spam> wrote in message news:pan.2003.12.15.23.07.24.569047 spam.spam...seemsActually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.In a higher level language, yes. But in doing systems work, one alwaysto be looking at the lower level elements anyway. I wrestled with this forawhile, and eventually decided that char[], wchar[], and dchar[] would belowlevel representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.The problem is that [] would be a horribly inefficient way to index UTF-8 characters. foreach would be ok. Sean
Dec 16 2003
"Sean L. Palmer" <palmer.sean verizon.net> wrote in message news:brmeos$2v9c$1 digitaldaemon.com..."Walter" <walter digitalmars.com> wrote in messageYou're right.One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.The problem is that [] would be a horribly inefficient way to index UTF-8 characters. foreach would be ok.
Dec 16 2003
Walter wrote:"Sean L. Palmer" <palmer.sean verizon.net> wrote in message news:brmeos$2v9c$1 digitaldaemon.com...Agreed. Some kind of itarator for strings are desperately needed. May I ask that they be designed in such a way that they are compatible/consistent with other iterators, such as the collections and things like the break iterator (also for strings). Regards Elias Mårtenson"Walter" <walter digitalmars.com> wrote in messageYou're right.One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.The problem is that [] would be a horribly inefficient way to index UTF-8 characters. foreach would be ok.
Dec 17 2003
Walter wrote:"Elias Martenson" <no spam.spam> wrote in message news:pan.2003.12.15.23.07.24.569047 spam.spam...All right. I can accept this, of course. The problem I still have with this is the syntax though. We got to remember here that most english-only speaking people have little or no understanding of Unicode and are quite happy using someCharString[n] to access individual characters.Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.Yes, they have gotten used to it in C, and it's still a horrible hack. At least in C. It is possiblt to get the multiple encoding support to work in D, but it needs wrappers. More on that later.As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such function doesn't help either.I think the library functions should be improved to handle unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.Indeed, but no one else volunteered yet. :-)I'd love to help out and do these things. But two things are needed first: - At least one other person needs to volunteer. I've had bad experiences when one person does this by himself,You're not by yourself. There's a whole D community here!OK, if that is your descision then you will not see me argue against it. :-) However, suppose you are going to write a function that accepts a string. Let's call it log_to_file(). How do you declare it? Today, you have three different options: void log_to_file(char[] str); void log_to_file(wchar[] str); void log_to_file(dchar[] str); Which one of these should I use? Should I use all of them? Today, people seems to use the first option, but UTF-8 is horribly inefficient performance-wise. Also, in the case of char and wchar strings, how do I access an individual character? Unless I missed something, the only way today is to use decode(). This is a fairly common operation which needs a better syntax, or people will keep accessing individual elements using the array notation (str[n]). Obviously the three different string types needs to be wrapped somehow. Either through a class (named "String" perhaps?) or through a keyword ("string"?) that is able to encapsulate the different behaviour of the three different kinds of strings. Would it be possible to use something like this? 
dchar get_first_char(string str) { return str[0]; } string str1 = (dchar[])"A UTF-32 string"; string str2 = (char[])"A UTF-8 string"; // call the function to demonstrate that the "string" // type can be used in declarations dchar x = get_first_char(str1); dchar y = get_first_char(str2); I.e. the "string" data type would be a wrapper or supertype for the three different string types.- The core concepts needs to be decided upon. Things seems to be somewhat in flux right now, with three different string types and all. At the very least it needs to be deicded what a "string" really is, is it a UTF-8 byte sequence or a UTF-32 character sequence? I haven't hid the fact that I would prefer the latter.A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.char[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R. In Unix the platform specific encoding is determined by the environment variable LC_CTYPE, although the trend is to be moving towards UTF-8 for all locales. We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue. I feel that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. Obviously this layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF. This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? 
How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!)Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character". The Unix standard goes a step further and defines wchar_t to be a unicode character. Obviously D goes the Unix route here (for dchar), and that is very good. However, Windows defined wchar_t to be a 16-bit Unicode character back in the days Unicode fit inside 16 bits. This is the same mistake Java did, and we have now ended up with having UTF-16 strings internally. So, in the end, C (if you want to be portable between Unix and Windows) and Java both no longer allows you to work with individual characters, unless you know what you are doing (i.e. you are prepared to deal with surrogate pairs manually). My suggestion for the "string" data type will hide all the nitty gritty details with various encodins and allow you to extract the n'th dchar from a string, regardless of the internal encoding.UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ascii code, you can write portable code that is fully capable of handling the complete unicode character set.Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encoding, and if this is to remain, it needs to be abstracted away somehow.Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed. Regards Elias MårtensonIn general, you should be able to open a file, by specifying the file name as a dchar[], and then the libraries should handle the rest.It does that now, except they take a char[].
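P.S. To show what I mean by leaning on the C conversion routines, a rough sketch (untested; it assumes a Linux-style 32-bit wchar_t, and the name nativeToUTF32 is just made up for the example):

    // Binding sketch for the C routines mentioned above.
    alias dchar wchar_t;   // assumption: 32-bit wchar_t as on Linux; win32 uses 16 bits

    extern (C)
    {
        size_t mbstowcs(wchar_t* dest, char* src, size_t n);
        size_t wcstombs(char* dest, wchar_t* src, size_t n);
    }

    // Convert a zero-terminated string in the platform encoding (LC_CTYPE)
    // into UTF-32. Assumes the C locale has already been set with setlocale().
    dchar[] nativeToUTF32(char* s)
    {
        size_t n = mbstowcs(null, s, 0);     // first call only counts characters
        dchar[] buf = new dchar[n];
        if (n > 0)
            mbstowcs(&buf[0], s, n);         // second call does the conversion
        return buf;
    }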
Dec 16 2003
I think Walter once said char had been called 'ascii'. That doesn't sound all that bad to me. Perhaps we should have the primitive types 'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar. Insane, I know, but at least then you never will mistake an ascii[] for a utf32[] (or a utf8[], for that matter). -Ben "Elias Martenson" <elias-m algonet.se> wrote in message news:brml3p$7hp$1 digitaldaemon.com...Walter wrote:seems"Elias Martenson" <no spam.spam> wrote in message news:pan.2003.12.15.23.07.24.569047 spam.spam...Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.In a higher level language, yes. But in doing systems work, one alwaysfor ato be looking at the lower level elements anyway. I wrestled with thislowwhile, and eventually decided that char[], wchar[], and dchar[] would becharacters.level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.All right. I can accept this, of course. The problem I still have with this is the syntax though. We got to remember here that most english-only speaking people have little or no understanding of Unicode and are quite happy using someCharString[n] to access individual<g>.I see your point, but I just can't see making utf8byte into a keywordmuchThe world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't seechars.of an issue here.Yes, they have gotten used to it in C, and it's still a horrible hack. At least in C. It is possiblt to get the multiple encoding support to work in D, but it needs wrappers. More on that later.And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such function doesn't help either.I think the library functions should be improved to handle unicodeisBut I'm not much of an expert on how to do it right, so it is the way itfirst:for the moment.As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.I'd love to help out and do these things. But two things are neededUTF-8,Indeed, but no one else volunteered yet. :-)- At least one other person needs to volunteer. I've had bad experiences when one person does this by himself,You're not by yourself. There's a whole D community here!- The core concepts needs to be decided upon. Things seems to be somewhat in flux right now, with three different string types and all. At the very least it needs to be deicded what a "string" really is, is it a UTF-8 byte sequence or a UTF-32 character sequence? I haven't hid the fact that I would prefer the latter.A string in D can be char[], wchar[], or dchar[], corresponding to:-)UTF-16, or UTF-32 representations.OK, if that is your descision then you will not see me argue against it.However, suppose you are going to write a function that accepts a string. Let's call it log_to_file(). How do you declare it? Today, you have three different options: void log_to_file(char[] str); void log_to_file(wchar[] str); void log_to_file(dchar[] str); Which one of these should I use? Should I use all of them? Today, people seems to use the first option, but UTF-8 is horribly inefficient performance-wise. 
Also, in the case of char and wchar strings, how do I access an individual character? Unless I missed something, the only way today is to use decode(). This is a fairly common operation which needs a better syntax, or people will keep accessing individual elements using the array notation (str[n]). Obviously the three different string types needs to be wrapped somehow. Either through a class (named "String" perhaps?) or through a keyword ("string"?) that is able to encapsulate the different behaviour of the three different kinds of strings. Would it be possible to use something like this? dchar get_first_char(string str) { return str[0]; } string str1 = (dchar[])"A UTF-32 string"; string str2 = (char[])"A UTF-8 string"; // call the function to demonstrate that the "string" // type can be used in declarations dchar x = get_first_char(str1); dchar y = get_first_char(str2); I.e. the "string" data type would be a wrapper or supertype for the three different string types.'nativechar[] strings are UTF-8, and as such I don't know what you mean bythedecoding'. There is only one possible conversion of UTF-8 to UTF-16.The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R. In Unix the platform specific encoding is determined by the environment variable LC_CTYPE, although the trend is to be moving towards UTF-8 for all locales. We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.If you're talking about win32 code pages, I'm going to draw a line independent.sand and assert that D char[] strings are NOT locale or code pageDThey are UTF-8 strings. If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)The UTF-8 to UTF-16 conversion is defined and platform independent. Thethem.runtime library includes routines to convert back and forth betweenfeelThey could probably be optimized better, but that's another issue. Ilocalethat by designing D around UTF-8, UTF-16 and UTF-32 the problems withordependent character sets are pushed off to the side as merely an inputandoutput translation nuisance. The core routines all expect UTF strings,isso are platform and language independent. I personally think the futureandUTF, and locale dependent encodings will fall by the wayside.Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. Obviously this layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the languageUTF.runtime library is a bad idea. 
The core will support UTF, and locale dependent representations will only be supported by translating to/fromaboutThis should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? Howhavingwchar_t's being 32 bits wide on linux vs 16 bits on win32? How about#ifdef _UNICODE all over the place? I've done that too much already. No thanks!)Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character". The Unix standard goes a step further and defines wchar_t to be a unicode character. Obviously D goes the Unix route here (for dchar), and that is very good. However, Windows defined wchar_t to be a 16-bit Unicode character back in the days Unicode fit inside 16 bits. This is the same mistake Java did, and we have now ended up with having UTF-16 strings internally. So, in the end, C (if you want to be portable between Unix and Windows) and Java both no longer allows you to work with individual characters, unless you know what you are doing (i.e. you are prepared to deal with surrogate pairs manually). My suggestion for the "string" data type will hide all the nitty gritty details with various encodins and allow you to extract the n'th dchar from a string, regardless of the internal encoding.UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ascii code, you can write portable code that is fully capable of handling the complete unicode character set.Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encoding, and if this is to remain, it needs to be abstracted away somehow.Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed. Regards Elias MårtensonIn general, you should be able to open a file, by specifying the file name as a dchar[], and then the libraries should handle the rest.It does that now, except they take a char[].
Dec 16 2003
Ben Hinkle wrote:I think Walter once said char had been called 'ascii'. That doesn't sound all that bad to me. Perhaps we should have the primitive types 'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar. Insane, I know, but at least then you never will mistake an ascii[] for a utf32[] (or a utf8[], for that matter).

No. This would be extremely bad. The (unfortunately) very large number of English-only programmers will use "ascii" exclusively, and we'll end up with yet another English/Latin-only language. ASCII really has no place in modern computing environments anymore. All operating systems and languages have migrated, or are in the process of migrating, to Unicode. Regards Elias Mårtenson
Dec 16 2003
"Elias Martenson" <elias-m algonet.se> wrote in message news:brn3tp$t93$1 digitaldaemon.com...Ben Hinkle wrote:soundI think Walter once said char had been called 'ascii'. That doesn'tInsane,all that bad to me. Perhaps we should have the primitive types 'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar.utf32[]I know, but at least then you never will mistake an ascii[] for aBut ASCII has a place in a practical programming language designed to work with legacy system and code. If you pass a utf-8 or utf-32 format string that isn't ASCII to printf it probably won't print out what you want. That's life. In terms of encouraging a healthy, happy future... the only thing the D language definition can do is choose what type to use for string literals. ie, given the declarations void foo(ascii[]); void foo(utf8[]); void foo(utf32[]); what function does foo("bar") calll? Right now it would call foo(utf8[]). You are arguing it should call utf32[]. I am on the fence about what it should call. Phobos should have routines to handle any encoding - ascii (or just rely on the std.c for these), utf8, utf16 and utf32. -Ben(or a utf8[], for that matter).No. This would be extremely bad. The (unfortunately) very large amount of english-only programmers will use "ascii" exclusively, and we'll end up with yet another english/latin-only language. ASCII really has no place in modern computing enironments anymore. All oeprating systems and languages has migrated, or ar in the process of migrating, to Unicode.
Dec 16 2003
Ben Hinkle wrote:But ASCII has a place in a practical programming language designed to work with legacy system and code. If you pass a utf-8 or utf-32 format string that isn't ASCII to printf it probably won't print out what you want. That's life.For legacy code, you should have to take an extra step to make it work. However, it should certainly be possible. Allow me to compare to how Java does it: String str = "this is a unicode string"; byte[] asciiString = str.getBytes("ASCII"); You can also convert it to UTF-8 if you like: byte[] utf8String = str.getBytes("UTF-8"); If the "default" string type in D is a simple ASCII string, do you honestly think that programmers who only speak english will even bother to do the right thing? Do you think they will even know that they are writing effectively broken code? I am suffering from these kinds of bugs every day (I speak swedish natively, but have also need to work with cyrillic) and let me tell you: 99% of all problems I have are caused by bugs similar to this. Also, I don't think it's a good idea to design a language around a legacy character set (ASCII) which will hopefully be gone in a few years (for newly written programs that is).In terms of encouraging a healthy, happy future... the only thing the D language definition can do is choose what type to use for string literals. ie, given the declarations void foo(ascii[]); void foo(utf8[]); void foo(utf32[]); what function does foo("bar") calll? Right now it would call foo(utf8[]). You are arguing it should call utf32[]. I am on the fence about what it should call.Yes, with Walters previous posting in mind, I argue that in foo() is overloaded with all three string types, it would call the dchar[] version. If one is not available, it would fall back to the wchar[], and lastly the char[] version. Then again, I also argue that there should be a way to using the supertype, "string", to avoid having to mess with the overloading and transparent string conversions.Phobos should have routines to handle any encoding - ascii (or just rely on the std.c for these), utf8, utf16 and utf32.The C standard library has largely migrated away from pure ASCII. It's there for backwards compatibility reasons, but people people still tend to use them, that's not the languages fault though but rather the developers. Regards Elias Mårtenson
Dec 16 2003
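For comparison, here is roughly what that "extra step" could look like in D, with the conversion to a legacy zero-terminated C string made explicit, much like Java's getBytes(). This is only an illustrative sketch: it assumes Phobos's std.utf.toUTF8 and std.string.toStringz behave the way their names suggest, and logToFile is a made-up name.

    import std.utf;      // toUTF8: dchar[] -> char[] (UTF-8)
    import std.string;   // toStringz: append the terminating 0 for C
    import std.c.stdio;  // printf, a legacy C function

    // Hypothetical logging function working on 32-bit characters internally
    void logToFile(dchar[] msg)
    {
        // The re-encoding is an explicit, visible step, as in Java's
        // str.getBytes("UTF-8") -- nothing happens behind the programmer's back.
        char[] utf8 = toUTF8(msg);
        printf("%s\n", toStringz(utf8));
    }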
If the "default" string type in D is a simple ASCII string, do you honestly think that programmers who only speak english will even bother to do the right thing? Do you think they will even know that they are writing effectively broken code?I didn't say the default type should be ASCII. I just said it should be explicit when it is ASCII. For example, I think printf should be declared as accepting an ascii* format string, not a char* as it is currently declared (same for fopen etc etc). I said I didn't know what the default type should be, though I'm leaning towards UTF-8 so that casting to ascii[] doesn't have to reallocate anything. -Ben
Dec 16 2003
Ben Hinkle wrote:

I didn't say the default type should be ASCII. I just said it should be explicit when it is ASCII.

But for all intents and purposes, ASCII does not exist anymore. It's a legacy character set, and it should certainly not be the "natural" way of dealing with strings. Believe it or not, there are a lot of programmers out there who still believe that ASCII ought to be enough for anybody.

For example, I think printf should be declared as accepting an ascii* format string, not a char* as it is currently declared (same for fopen etc etc).

But printf() works very well with UTF-8 in most cases.

I said I didn't know what the default type should be, though I'm leaning towards UTF-8 so that casting to ascii[] doesn't have to reallocate anything.

True. But then again, isn't the intent to try to avoid legacy calls as much as possible? Is it a good idea to set the default in order to accommodate a legacy character set?

You have to remember that UTF-8 is very inefficient. Suppose you have a 10000 character long string, and you want to retrieve the 9000'th character from that string. If the string is UTF-32, that means a single memory lookup. With UTF-8 it could mean anywhere between 9000 and 54000 memory lookups. Now imagine if the string is ten or one hundred times as long...

Now, with the current design, what many people are going to do is:

    char c = str[9000];
    // now play happily(?) with the char "c" that probably isn't the
    // 9000'th character and maybe was a part of a UTF-8 multi byte
    // character

Again, this is a huge problem. The bug will not be evident until some other person (me, for example) tries to use non-ASCII characters. The above broken code may have run through every single test that the developer wrote, simply because he didn't think of putting a non-ASCII character in the string.

This is a real problem, and it desperately needs to be solved. Several solutions have already been presented; the question is just which one of them Walter will support. He already explained in the previous post, but it seems that there are still some things to be said.

Regards

Elias Mårtenson
Dec 16 2003
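To make the cost described above concrete, here is a small hand-rolled sketch (illustrative only, the name is made up) of what "the 9000'th character" actually requires when the string is stored as UTF-8: every sequence before it has to be walked. With a dchar[] the same lookup is simply str[n].

    // Returns the byte offset of the n'th character in a UTF-8 char[].
    // No error checking; the point is the O(n) walk, not robustness.
    size_t byteIndexOfChar(char[] s, size_t n)
    {
        size_t i = 0;
        for (size_t skipped = 0; skipped < n; skipped++)
        {
            ubyte b = cast(ubyte) s[i];
            if      (b < 0x80) i += 1;   // ASCII, single byte
            else if (b < 0xE0) i += 2;   // two-byte sequence
            else if (b < 0xF0) i += 3;   // three-byte sequence
            else               i += 4;   // four-byte sequence
        }
        return i;   // s[i] is the first byte of the n'th character
    }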
Elias Martenson wrote:char c = str[9000];^^^^ 8999 of course Regards Elias Mårtenson
Dec 16 2003
char c = str[8999]; // now play happily(?) with the char "c" that probably isn't the // 9000'th character and maybe was a part of a UTF-8 multi byte // characterwhich was why I suggested doing away with the generic "char" type entirely. If str was declared as an ascii array then it would be ascii c = str[8999]; Which is completely safe and reasonable. If it was declared as utf8[] then when the user writes ubyte c = str[8999] and they don't have any a-priori knowledge about str they should feel very nervous since I completely agree indexing into an arbitrary utf-8 encoded array is pretty meaningless. Plus in my experience using individual characters isn't that common - I'd say easily 90% of the time a variable is declared as char* or char[] rather than just char. By the way, I also think any utf8, utf16 and utf32 types should be aliased to ubyte, ushort, and uint. Should ascii be aliased to ubyte as well? I dunno. About Java and D: when I program in Java I never worry about the size of a char because Java is very different than C and you have to jump through hoops to call C. But when I program in D I feel like it is an extension of C like C++. Imagine if C++ decided that char should be 32 bits. That would have been very painful. -Ben
Dec 16 2003
Ben Hinkle wrote:

    char c = str[8999];
    // now play happily(?) with the char "c" that probably isn't the
    // 9000'th character and maybe was a part of a UTF-8 multi byte
    // character

which was why I suggested doing away with the generic "char" type entirely. If str was declared as an ascii array then it would be

    ascii c = str[8999];

Which is completely safe and reasonable.

No, it would certainly NOT be safe. You must remember that ASCII doesn't exist anymore. It's a legacy character set. It's dead. Gone. Bye bye. And yes, sometimes it's needed for backwards compatibility, but in those cases it should be made explicit that you are throwing away information when converting.

If it was declared as utf8[] then when the user writes ubyte c = str[8999] and they don't have any a-priori knowledge about str they should feel very nervous since I completely agree indexing into an arbitrary utf-8 encoded array is pretty meaningless. Plus in my experience using individual characters isn't that common - I'd say easily 90% of the time a variable is declared as char* or char[] rather than just char.

You are right. Actually it's probably more than 90%. Especially when dealing with unicode. Very often it's not allowed to split a unicode string because of composite characters. However, you still need to be able to do individual character classification, such as isspace().

By the way, I also think any utf8, utf16 and utf32 types should be aliased to ubyte, ushort, and uint. Should ascii be aliased to ubyte as well? I dunno.

ASCII has no business in a modern programming language.

About Java and D: when I program in Java I never worry about the size of a char because Java is very different than C and you have to jump through hoops to call C. But when I program in D I feel like it is an extension of C like C++. Imagine if C++ decided that char should be 32 bits. That would have been very painful.

All I was suggesting was a renaming of the types so that it's made explicit what type you have to use in order to be able to hold a single character. In D, this type is called "dchar"; char doesn't cut it. In C on unix, it's called wchar_t. In C on windows the type to use is called "int" or "long". And finally in Java, you have to use "int". In all of these languages, "char" is insufficient to hold a character. Don't you think it's logical that the data type that can hold a character is called "char"?

Regards

Elias Mårtenson
Dec 17 2003
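As a small illustration of the classification point above: a routine like isspace() has to take a dchar (a whole code point) rather than a char (a UTF-8 code unit) to be meaningful. The sketch below is hypothetical and only covers a handful of whitespace characters; a real implementation needs the Unicode tables.

    // Illustrative only: classification on whole characters, hence dchar.
    bool isSpaceChar(dchar c)
    {
        return c == ' '  || c == '\t' || c == '\n' || c == '\r' ||
               c == '\v' || c == '\f' ||
               c == 0x00A0 ||  // NO-BREAK SPACE
               c == 0x2028;    // LINE SEPARATOR
    }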
Elias Martenson wrote:

For example, I think printf should be declared as accepting an ascii* format string, not a char* as it is currently declared (same for fopen etc etc).

But printf() works very well with UTF-8 in most cases.

None of these alternatives is correct. printf will only work correctly with UTF-8 if the string data is plain ASCII, or if UTF-8 happens to be the current system code page. And ASCII will only work for english systems, which is even worse.

As I said before, the C functions should be passed strings encoded in the current system code page. That way all strings that are written in the system language will be printed perfectly. Also, characters that are not in the code page can be replaced with ? during the conversion, which is better than having printf output garbage.

Hauke
Dec 17 2003
It does sound insane, I like it. I vote for this.

C

"Ben Hinkle" <bhinkle4 juno.com> wrote in message news:brn1eq$ppk$1 digitaldaemon.com...

I think Walter once said char had been called 'ascii'. That doesn't sound all that bad to me. Perhaps we should have the primitive types 'ascii', 'utf8', 'utf16' and 'utf32' and remove char, wchar and dchar. Insane, I know, but at least then you never will mistake an ascii[] for a utf32[] (or a utf8[], for that matter).

-Ben
Dec 16 2003
"Elias Martenson" <elias-m algonet.se> wrote in message news:brml3p$7hp$1 digitaldaemon.com... | for example). Why not use the same names as are used in C? mbstowcs() | and wcstombs()? | Sorry to ask, but what do those do? What do they stand for? ------------------------- Carlos Santander "Elias Martenson" <elias-m algonet.se> wrote in message news:brml3p$7hp$1 digitaldaemon.com... | for example). Why not use the same names as are used in C? mbstowcs() | and wcstombs()? | Sorry to ask, but what do those do? What do they stand for? ------------------------- Carlos Santander
Dec 16 2003
On Tue, 16 Dec 2003 13:38:59 -0500, Carlos Santander B. wrote:

"Elias Martenson" <elias-m algonet.se> wrote in message news:brml3p$7hp$1 digitaldaemon.com...
| for example). Why not use the same names as are used in C? mbstowcs()
| and wcstombs()?
|

Sorry to ask, but what do those do? What do they stand for?

mbstowcs() = multi byte string to wide character string
wcstombs() = wide character string to multi byte string

A multi byte string is a (char *), i.e. the platform encoding. This means that if you are running Unix in a UTF-8 locale (standard these days) then it contains a UTF-8 string. If you are running Unix or Windows with an ISO-8859-1 locale, then it contains ISO-8859-1 data.

A wide character string is a (wchar_t *), which is a UTF-32 string on Unix and a UTF-16 string on Windows.

As you can see, the windows way of using UTF-16 causes the exact same problems as you would suffer when using UTF-8, so working with wchar_t on Windows would be of doubtful use if not for the fact that all Unicode functions in Windows deal with wchar_t. On Unix it's easier, since you know that the full Unicode range fits in a wchar_t.

This is the reason why I have been advocating against the UTF-16 representation in D. It makes little sense compared to UTF-8 and UTF-32.

Regards

Elias Mårtenson
Dec 16 2003
Thank you both.
Dec 16 2003
mbstowcs - Multi Byte to Wide Character String
wcstombs - Wide Character String to Multi Byte

Carlos Santander B. <carlos8294 msn.com> wrote in message news:brnpe0$206i$3 digitaldaemon.com...

"Elias Martenson" <elias-m algonet.se> wrote in message news:brml3p$7hp$1 digitaldaemon.com...
| for example). Why not use the same names as are used in C? mbstowcs()
| and wcstombs()?
|

Sorry to ask, but what do those do? What do they stand for?
Dec 16 2003
Carlos Santander B. wrote:

"Elias Martenson" <elias-m algonet.se> wrote in message news:brml3p$7hp$1 digitaldaemon.com...
| for example). Why not use the same names as are used in C? mbstowcs()
| and wcstombs()?
|

Sorry to ask, but what do those do? What do they stand for?

Ironically enough, your question answers Elias's question quite succinctly. ;)

 -- andy
Dec 16 2003
Andy Friesen wrote:Carlos Santander B. wrote:Dang! How do you americans say? Three strikes, I'm out. :-) Regards Elias Mårtenson"Elias Martenson" <elias-m algonet.se> wrote in message news:brml3p$7hp$1 digitaldaemon.com... | for example). Why not use the same names as are used in C? mbstowcs() | and wcstombs()? | Sorry to ask, but what do those do? What do they stand for?Ironically enough, you question answers Elias's question quite succinctly. ;)
Dec 17 2003
"Elias Martenson" <elias-m algonet.se> wrote in message news:brml3p$7hp$1 digitaldaemon.com...As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.Yes.However, suppose you are going to write a function that accepts a string. Let's call it log_to_file(). How do you declare it? Today, you have three different options: void log_to_file(char[] str); void log_to_file(wchar[] str); void log_to_file(dchar[] str); Which one of these should I use? Should I use all of them? Today, people seems to use the first option, but UTF-8 is horribly inefficient performance-wise.Do it as char[]. Have the internal implementation convert it to whatever format the underling operating system API uses. I don't agree that UTF-8 is horribly inefficient (this is from experience, UTF-32 is much, much worse).Also, in the case of char and wchar strings, how do I access an individual character? Unless I missed something, the only way today is to use decode(). This is a fairly common operation which needs a better syntax, or people will keep accessing individual elements using the array notation (str[n]).It's fairly easy to write a wrapper class for it that decodes it automatically with foreach and [] overloads.Obviously the three different string types needs to be wrapped somehow. Either through a class (named "String" perhaps?) or through a keyword ("string"?) that is able to encapsulate the different behaviour of the three different kinds of strings. Would it be possible to use something like this? dchar get_first_char(string str) { return str[0]; } string str1 = (dchar[])"A UTF-32 string"; string str2 = (char[])"A UTF-8 string"; // call the function to demonstrate that the "string" // type can be used in declarations dchar x = get_first_char(str1); dchar y = get_first_char(str2); I.e. the "string" data type would be a wrapper or supertype for the three different string types.The best thing is to stick with one scheme for a program.'nativechar[] strings are UTF-8, and as such I don't know what you mean byFor char types, yes. But not for UTF-16, and win32 internally is all UTF-16. There are no locale-specific encodings in UTF-16.decoding'. There is only one possible conversion of UTF-8 to UTF-16.The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R.In Unix the platform specific encoding is determined by the environment variable LC_CTYPE, although the trend is to be moving towards UTF-8 for all locales. We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.Since we're moving to UTF-8 for all locales, D will be there with UTF-8 <g>. Let's look forward instead of those backward locale dependent encodings.theIf you're talking about win32 code pages, I'm going to draw a line independent.sand and assert that D char[] strings are NOT locale or code pageNo, I think D will provide an optional filter for I/O which will translate to/from locale dependent encodings. Wherever possible, the UTF-16 API's will be used to avoid any need for locale dependent encodings.They are UTF-8 strings. If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.Right. 
So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. Obviously this layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?'cuz I can never remember how they're spelled <g>.andAfter wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the languageUTF.runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/fromaboutThis should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? Howhavingwchar_t's being 32 bits wide on linux vs 16 bits on win32? How aboutFrankly, I think the C standard is out to lunch on this. wchar_t should be unicode, and there really isn't a problem with using it as unicode. The C standard is also not helpful in the undefined size of wchar_t, or the sign of 'char'.#ifdef _UNICODE all over the place? I've done that too much already. No thanks!)Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character".The Unix standard goes a step further and defines wchar_t to be a unicode character. Obviously D goes the Unix route here (for dchar), and that is very good. However, Windows defined wchar_t to be a 16-bit Unicode character back in the days Unicode fit inside 16 bits. This is the same mistake Java did, and we have now ended up with having UTF-16 strings internally.Windows made the right decision given what was known at the time, it was the unicode folks who goofed by not defining unicode right in the first place.Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encoding, and if this is to remain, it needs to be abstracted away somehow.It already does that for string literals. I've thought about implicit conversions for runtime strings, but sometimes trouble results from too many implicit conversions, so I'm hanging back a bit on this to see how things evolve.Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed.In general, you should be able to open a file, by specifying the file name as a dchar[], and then the libraries should handle the rest.It does that now, except they take a char[].
Dec 16 2003
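A rough sketch of the kind of wrapper Walter mentions above: a class over a char[] (UTF-8) whose [] and foreach hand out whole dchars. The decoding is hand-rolled and unchecked, the class name is made up, and the exact opApply/opIndex conventions have varied between D versions, so treat this as an illustration of the idea rather than library code. Note that opIndex is deliberately written as the O(n) walk the earlier posts complain about.

    class Utf8String
    {
        char[] data;   // UTF-8 code units

        this(char[] s) { data = s; }

        // Decode one code point starting at byte offset i; advances i.
        // Minimal, no validation.
        private dchar decodeAt(inout size_t i)
        {
            ubyte b = cast(ubyte) data[i];
            if (b < 0x80) { i++; return b; }
            int len = (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
            uint v = b & (0x3F >> (len - 1));
            for (int k = 1; k < len; k++)
                v = (v << 6) | (cast(ubyte) data[i + k] & 0x3F);
            i += len;
            return cast(dchar) v;
        }

        // str[n]: the n'th character, found by walking from the start
        dchar opIndex(size_t n)
        {
            size_t i = 0;
            dchar c;
            for (size_t seen = 0; seen <= n; seen++)
                c = decodeAt(i);
            return c;
        }

        // foreach (dchar c; str) { ... }
        int opApply(int delegate(inout dchar) dg)
        {
            size_t i = 0;
            while (i < data.length)
            {
                dchar c = decodeAt(i);
                int r = dg(c);
                if (r)
                    return r;
            }
            return 0;
        }
    }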
"Walter" <walter digitalmars.com> wrote in message news:brnurb$2bc5$1 digitaldaemon.com...It's stupid to not agree on a standard size for char, since it's easy to "fix" the sign of a char register by biasing it by 128 (xor 0x80 works too), doing the operation, then biasing it again (un-biasing it). If all else fails, you can promote it. How often is this important anyway? If it's crucial, it's worth the time to emulate the sign if you have to. It is no good to run fast if the wrong results are generated. It's just a portability landmine, waiting for the unwary programmer, and shame on whoever let it get into a so-called "standard".Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character".Frankly, I think the C standard is out to lunch on this. wchar_t should be unicode, and there really isn't a problem with using it as unicode. The C standard is also not helpful in the undefined size of wchar_t, or the sign of 'char'.theThe Unix standard goes a step further and defines wchar_t to be a unicode character. Obviously D goes the Unix route here (for dchar), and that is very good. However, Windows defined wchar_t to be a 16-bit Unicode character back in the days Unicode fit inside 16 bits. This is the same mistake Java did, and we have now ended up with having UTF-16 strings internally.Windows made the right decision given what was known at the time, it wasunicode folks who goofed by not defining unicode right in the first place.I still don't understand why they couldn't have packed all the languages that actually get used into the lowest 16 bits, and put all the crud like box-drawing characters and visible control codes and byzantine musical notes and runes and Aleutian indian that won't fit into the next 16 pages. There's lots of gaps in the first 65536 anyway. And probably plenty of overlap, duplicated symbols (lots of languages have the same characters, especially latin-based ones). Hell they should probably have done away with accented characters being distinct characters and enforced a combining rule from the start. But the Unicode standards body wanted to please the typesetters, as opposed to giving the world a computer encoding that would actually be usable as a common text-storage and processing medium. This thread shows just how convoluted Unicode really is. I think someone can (and probably will) do better. Unfortunately I also believe that such an effort is doomed to failure. Sean
Dec 17 2003
Sean L. Palmer wrote:It's stupid to not agree on a standard size for char, since it's easy to "fix" the sign of a char register by biasing it by 128 (xor 0x80 works too), doing the operation, then biasing it again (un-biasing it). If all else fails, you can promote it. How often is this important anyway? If it's crucial, it's worth the time to emulate the sign if you have to. It is no good to run fast if the wrong results are generated. It's just a portability landmine, waiting for the unwary programmer, and shame on whoever let it get into a so-called "standard".C doesn't define any standard sizes at all (well, you do have stdint.h these days). This is both a curse and a blessing. More often than not, it's a curse though.I still don't understand why they couldn't have packed all the languages that actually get used into the lowest 16 bits, and put all the crud like box-drawing characters and visible control codes and byzantine musical notes and runes and Aleutian indian that won't fit into the next 16 pages. There's lots of gaps in the first 65536 anyway. And probably plenty of overlap, duplicated symbols (lots of languages have the same characters, especially latin-based ones). Hell they should probably have done away with accented characters being distinct characters and enforced a combining rule from the start. But the Unicode standards body wanted to please the typesetters, as opposed to giving the world a computer encoding that would actually be usable as a common text-storage and processing medium. This thread shows just how convoluted Unicode really is. I think someone can (and probably will) do better. Unfortunately I also believe that such an effort is doomed to failure.Agreed. Unicode has a lot of cruft. One of my favourite pet peeves are the two characters: 00C5 Å: LATIN CAPITAL LETTER A WITH RING ABOVE and 212B Å: ANGSTROM SIGN The comment even says that the preferred representation is the latin Å. But, like you say, trying to do it once again will not succeed. It has taken us 10 or so years to get where we are. I'd say we accept Unicode for what it is. It's a hell of a lot better than the previous mess. Regards Elias Mårtenson
Dec 17 2003
Sorry, "sign" of char. "Sean L. Palmer" <palmer.sean verizon.net> wrote in message news:brp52o$1gc6$1 digitaldaemon.com...It's stupid to not agree on a standard size for char, since it's easy to "fix" the sign of a char register by biasing it by 128 (xor 0x80 workstoo),
Dec 17 2003
Walter wrote:"Elias Martenson" <elias-m algonet.se> wrote in message news:brml3p$7hp$1 digitaldaemon.com...Good, I like it. :-)As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.Yes.Memory-wise perhaps. But for everything else UTF-8 is always slower. Consider what happens when the program is used with russian? Every single character will need special decoding, except punctuation of course. Now think about chinese and japenese. These are even worse.Which one of these should I use? Should I use all of them? Today, people seems to use the first option, but UTF-8 is horribly inefficient performance-wise.Do it as char[]. Have the internal implementation convert it to whatever format the underling operating system API uses. I don't agree that UTF-8 is horribly inefficient (this is from experience, UTF-32 is much, much worse).Indeed. But they will be slow. Now, personally I can accept the slowness. Again, it's your call. What we do need to make sure is that the string/character handling package that we build is comprehensive in terms on Unicode support, and also that every single string handling function handles UTF-32 as well as UTF-8. This way a developer who is having performance problems with the default UTF-8 strings can easily change his hotspots to work with UTF-32 instead.Also, in the case of char and wchar strings, how do I access an individual character? Unless I missed something, the only way today is to use decode(). This is a fairly common operation which needs a better syntax, or people will keep accessing individual elements using the array notation (str[n]).It's fairly easy to write a wrapper class for it that decodes it automatically with foreach and [] overloads.Unless the developer is bitten by the poor performance of UTF-8 that is. A package with perl-like functionality would be horribly slow if using UTF-8 rather than UTF-32. If we are to stick with UTF-8 as default internal string format, UTF-32 must be available as an option, and it must be easy to use.I.e. the "string" data type would be a wrapper or supertype for the three different string types.The best thing is to stick with one scheme for a program.For char types, yes. But not for UTF-16, and win32 internally is all UTF-16. There are no locale-specific encodings in UTF-16.True. But I can't see any use for UTF-16 outside communicating with external windows libraries. UTF-16 really is the worst of both worlds compared to UTF-8 and UTF-32. UTF-16 should really be considered the "native encoding" and left at that. Just like [the content of LC_CTYPE] is the native encoding when run in Unix. The developer should be shielded from the native encoding in that he should be able to say: "convert my string to the encoding my operating system wants (i.e. the native encoding)". As it happens, this is what wcstombs() does.Agreed. I am heavily lobbying for proper Unicode support everywhere. I've been bitten by too many broken applications. However, Windows has decided on UTF-16. Unix has decided on UTF-8. We need a way of transprently inputting and outputting strings so that they are converted to whatever encoding the host operating system uses. If we don't do this we are going to end up with a lot of conditional code that checks which OS (and encoding) is being used.In Unix the platform specific encoding is determined by the environment variable LC_CTYPE, although the trend is to be moving towards UTF-8 for all locales. 
We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.Since we're moving to UTF-8 for all locales, D will be there with UTF-8 <g>. Let's look forward instead of those backward locale dependent encodings.No, I think D will provide an optional filter for I/O which will translate to/from locale dependent encodings. Wherever possible, the UTF-16 API's will be used to avoid any need for locale dependent encodings.Why UTF-16? There is no need to involve platform specifics at this level. Remember that UTF-16 can be considered platform specific for Windows.'cuz I can never remember how they're spelled <g>.Allright... So how about adding to the utf8 package some functions called... Hmm... nativeToUTF8(), nativeToUTF32() and then an overloaded function utfToNative() (which accepts char[], wchar[] and dchar[]}. "native" in this case would be a byte[] or ubyte[] to point out that this form is not supposed to be used in the program.Indeed. That's why the Unix standard went a bit forther and specified a wchar_t to be a Unicode character. The problem is with Windows where wchar_t is 16-bit and thus cannot hold a Unicode character. And thus we end up with the current situation where using wchar_t in Windows really doesn't buy you anything because you have the same problems as you would with UTF-8. You still cannot assume that a wchar_t can hold a single character. You still need all the funky iterators and decoding stuff to be able to extract individual characters. This is why I'm saying that the UTF-16 in Windows is horrible, and that UTF-16 is the worst of both worlds.Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character".Frankly, I think the C standard is out to lunch on this. wchar_t should be unicode, and there really isn't a problem with using it as unicode. The C standard is also not helpful in the undefined size of wchar_t, or the sign of 'char'.Windows made the right decision given what was known at the time, it was the unicode folks who goofed by not defining unicode right in the first place.I agree 100%. Java is in the same boat. How many people know that from JDK1.5 and onwards it's a bad idea to use String.charAt()? (in JDK1.5 the internal representation for String will change from UCS-2 to UTF-16). In other words, the exact same problem Windows faced. The Unicode people argues that they never guaranteed that it was a 16-bit character set, and while this is technically true, they are really trying to cover up their mess.It already does that for string literals. I've thought about implicit conversions for runtime strings, but sometimes trouble results from too many implicit conversions, so I'm hanging back a bit on this to see how things evolve.True. We suffer from this in C++ (costly implicit conversions) and it would be nice to be able to avoid this. Regards Elias Mårtenson
Dec 17 2003
Walter wrote:This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all. A lot of english-speaking programmers simply treat chars as ASCII characters, even if there's some comment somewhere stating that the data should be UTF-8. I agree with Elias that the "char" type should be 32 bit, so that people who simply use a char array as a string, as they have done for years in other languages, will actually get the behaviour they expect, without losing the Unicode support. Btw: this could also be used to solve the "oops, I forgot to make the string null-terminated" problem when interacting with C functions. If the D char is a different type than the old C char (which could be called char_c or charz instead) then people will automatically be reminded that they need to convert them. So how about the following proposal: - char is a 32 bit Unicode character - wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16 or 32 bits (depending on the system), provided for interoperability with C functions - charz (or char_c? c_char?) is a normal 8 bit C character, also provided for interoperability with C functions UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This would at the same time remind users that the elements are NOT characters but simply a bunch of binary data. I don't see the need to define a new type for these - there are a lot of encodings out there, so why treat UTF-8 and UTF-16 specially? With this system it would be instantly obvious that D strings are Unicode. Interacting with legacy C code is still possible, and accidentally passing a wrong (e.g. UTF-8) string to a C function that expects ASCII or Latin-1 is impossible. Also, pure D code will automatically be UTF-32, which is exactly what you need if you want to make the lives of newbies easier. Otherwise people WILL end up using ASCII strings when they start out. HaukeThe overloading issue is interesting, but may I suggest that char andwhcarare at least renamed to something more appropriate? Maybe utf8byte and utf16byte? I feel it's important to point out that they aren't characters.I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.
Dec 16 2003
Hauke Duden wrote:I agree. You are better at explaining these things than I am. :-)I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all. A lot of english-speaking programmers simply treat chars as ASCII characters, even if there's some comment somewhere stating that the data should be UTF-8.I agree with Elias that the "char" type should be 32 bit, so that people who simply use a char array as a string, as they have done for years in other languages, will actually get the behaviour they expect, without losing the Unicode support.Indeed. In many cases existing code would actually continue working, since char[] would still declare a string. It wouldn't work when using legacy libraries though, but they won't anyway because of the zero-termination issue.Btw: this could also be used to solve the "oops, I forgot to make the string null-terminated" problem when interacting with C functions. If the D char is a different type than the old C char (which could be called char_c or charz instead) then people will automatically be reminded that they need to convert them.Exactly.So how about the following proposal: - char is a 32 bit Unicode character - wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16 or 32 bits (depending on the system), provided for interoperability with C functions - charz (or char_c? c_char?) is a normal 8 bit C character, also provided for interoperability with C functions UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This would at the same time remind users that the elements are NOT characters but simply a bunch of binary data. I don't see the need to define a new type for these - there are a lot of encodings out there, so why treat UTF-8 and UTF-16 specially? With this system it would be instantly obvious that D strings are Unicode. Interacting with legacy C code is still possible, and accidentally passing a wrong (e.g. UTF-8) string to a C function that expects ASCII or Latin-1 is impossible. Also, pure D code will automatically be UTF-32, which is exactly what you need if you want to make the lives of newbies easier. Otherwise people WILL end up using ASCII strings when they start out.We have to keep in mind, that in most cases, when you call a legacy C functions accepting (char *) the correct thing is to pas in a UTF-8 encoded string. The number of functions which actually fail when doing so are quite few. What I'm saying here is that there are actually few "C function[s] that expects ASCII or Latin-1". Most of them expect a (char *) and work on them as if they were a byte array. Compare this to my (and your) suggestion of using byte[] (or ubyte[]) for UTF-8 strings. Regards Elias Mårtenson
Dec 16 2003
Elias Martenson wrote:

We have to keep in mind that in most cases, when you call legacy C functions accepting (char *), the correct thing is to pass in a UTF-8 encoded string. The number of functions which actually fail when doing so is quite few.

They are not quite as few as one may think. For example, if you pass a UTF-8 string to fopen then it will only work correctly if the filename is made up of ASCII characters only. printf will print garbage if you pass it a UTF-8 character. If you use scanf to read a string from stdin then the returned string will not be UTF-8, so you have to deal with that. The is-functions (isalpha, etc.) will not work correctly for all characters. toupper, tolower, etc. are not able to work with non-ASCII characters. The list goes on...

Pretty much the only thing I can think of that will work correctly under all circumstances is simple C functions that pass strings through unmodified (if they modify them they might slice them in the middle of a UTF-8 sequence).

IMHO, the safest way to call C functions is to pass them strings encoded using the current system code page, because that's what the CRT expects a char array to be. Since the code page is different from system to system this makes a runtime conversion pretty much inevitable, but there's no way around that if you want Unicode support.

Hauke
Dec 16 2003
Hauke Duden wrote:

They are not quite as few as one may think. For example, if you pass a UTF-8 string to fopen then it will only work correctly if the filename is made up of ASCII characters only.

Depends on the OS. Unix handles it perfectly.

printf will print garbage if you pass it a UTF-8 character. If you use scanf to read a string from stdin then the returned string will not be UTF-8, so you have to deal with that. The is-functions (isalpha, etc.) will not work correctly for all characters. toupper, tolower, etc. are not able to work with non-ASCII characters. The list goes on...

Exactly. But the number of functions that do these things is still pretty small, compared to the total number of functions accepting strings. Take a look at your own code and try to classify them as UTF-8 safe or not. I think you'll be surprised.

Pretty much the only thing I can think of that will work correctly under all circumstances is simple C functions that pass strings through unmodified (if they modify them they might slice them in the middle of a UTF-8 sequence).

And, believe it or not, this is the major part of all such functions. But the discussion is really irrelevant, since we both agree that it is inherently unsafe.

Regards

Elias Mårtenson
Dec 16 2003
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:brnas5$1940$1 digitaldaemon.com...Walter wrote:<g>.I see your point, but I just can't see making utf8byte into a keywordmuchThe world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't seeMultibyte char programming in C has been common on the IBM PC for 20 years now (my C compiler has supported it for that long, since it was distributed to an international community), and it was standardized into C in 1989. I agree that many ignore it, but that's because it's badly designed. Dealing with locale-dependent encodings is a real chore in C.of an issue here.This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all.A lot of english-speaking programmers simply treat chars as ASCII characters, even if there's some comment somewhere stating that the data should be UTF-8.True, but code doesn't have to be changed much to allow for UTF-8. For example, D source text is UTF-8, and supporting that required little change in the D front end, and none in the back end. Trying to use UTF-32 internally to support this would have been a disaster.I agree with Elias that the "char" type should be 32 bit, so that people who simply use a char array as a string, as they have done for years in other languages, will actually get the behaviour they expect, without losing the Unicode support.Other problems are introduced with that for the naive programmer who expects it to work just like ascii. For example, many people don't bother multiplying by sizeof(char) when allocating storage for char arrays. chars and 'bytes' in C are used willy-nilly interchangeably. Direct manipulation of chars (without going through ctype.h) is common for converting lower case to upper case. Etc. The nice thing about UTF-8 is it does work just like ascii when you're dealing with ascii data.Btw: this could also be used to solve the "oops, I forgot to make the string null-terminated" problem when interacting with C functions. If the D char is a different type than the old C char (which could be called char_c or charz instead) then people will automatically be reminded that they need to convert them. So how about the following proposal: - char is a 32 bit Unicode characterAlready have that, it's 'dchar' <g>. There is nothing in D that prevents a programmer from using dchar's for his character handling chores.- wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16 or 32 bits (depending on the system), provided for interoperability with C functionsI've dealt with porting large projects between win32 and linux and the change in wchar_t size from 16 to 32. I've come to believe that method is a mistake, hence wchar and dchar in D. (One of the wretched problems is one cannot intermingle printf and wprintf to stdout in C.)- charz (or char_c? c_char?) is a normal 8 bit C character, also provided for interoperability with C functionsI agree that the 0 termination is an issue when calling C functions. I think this issue will fade, however, as the D libraries get more comprehensive. Another problem with 'normal' C chars is the confusion about whether they are signed or unsigned. The D char type is unsigned, period <g>.UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This would at the same time remind users that the elements are NOT characters but simply a bunch of binary data. 
I don't see the need to define a new type for these - there are a lot of encodings out there, so why treat UTF-8 and UTF-16 specially?Treating UTF-8 and UTF-16 specially in D has great advantages in making the internal workings of the compiler and runtime library consistent. (No more problems mixing printf and wprintf!) I'm convinced that UTF is becoming the linqua franca of computing, and the other encodings will be relegated to sideshow status.With this system it would be instantly obvious that D strings are Unicode. Interacting with legacy C code is still possible, and accidentally passing a wrong (e.g. UTF-8) string to a C function that expects ASCII or Latin-1 is impossible.Windows NT, 2000, XP, and onwards are internally all UTF-16. Any win32 API functions that accept 8 bit chars will immediately convert them to UTF-16. wchar_t's under win32 are UTF-16 encodings (including the 2 word encodings of UTF-16). Linux is internally UTF-8, if I'm not mistaken. This means D code will feel right at home with linux. Under win32, I plan on fixing all the runtime library functions to convert UTF-8 to UTF-16 internally and use the win32 API UTF-16 functions. Hence, UTF is where the operating systems are going, and D is looking forward to mapping cleanly onto that. I believe that following the C approach of code pages, signed/unsigned char confusion, varying wchar_t sizes, etc., is rapidly becoming obsolete.Also, pure D code will automatically be UTF-32, which is exactly what you need if you want to make the lives of newbies easier. Otherwise people WILL end up using ASCII strings when they start out.Over the last 10 years, I wrote two major internationalized apps. One used UTF-8 internally, and converted other encodings to/from it on input/output. The other used wchar_t throughout, and was ported to win32 and linux which mapped wchar_t to UTF-16 and UTF-32, respectively. The former project ran much faster, consumed far less memory, and (aside from the lack of support from C for UTF-8) simply had far fewer problems. The latter was big and slow. Especially and linux with the wchar_t's being UTF-32, it really hogged the memory.
Dec 16 2003
Walter wrote:Right, it has been around for decades. And people still don't use it properly. Don't make that same mistake again! I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all.Multibyte char programming in C has been common on the IBM PC for 20 years now (my C compiler has supported it for that long, since it was distributed to an international community), and it was standardized into C in 1989. I agree that many ignore it, but that's because it's badly designed. Dealing with locale-dependent encodings is a real chore in C.Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character. So you may end up using even more memory with UTF-8. And about computing complexity: if you ignore the overhead introduced by having to move more (or sometimes less) memory then manipulating UTF-32 strings is a LOT faster than UTF-8. Simply because random access is possible and you do not have to perform an expensive decode operation on each character. Also, how much text did your "bad experience" application use? It seems to me that even if you assume best-case for UTF-8 (e.g. one byte per character) then the memory overhead should not be much of an issue. It's only factor 4, after all. So assuming that your application uses 100.000 lines of text (which is a lot more than anything I've ever seen in a program), each 100 characters long and everything held in memory at once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32. These are hardly numbers that will bring a modern OS to its knees anymore. In a few years this might even fit completely into the CPU's cache! I think it's more important to have proper localization ability and programming ease than trying to conserve a few bytes for a limited group of people (i.e. english speakers). Being greedy with memory consumption when making long-term design decisions has always caused problems. For instance, it caused that major Y2K panic in the industry a few years ago! Please also keep in mind that a factor 4 will be compensated by memory enhancements in only 1-2 years time. Most people already have several hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit shortsighted to make the lives of D programmers harder forever, just to save a few megabytes of memory that people will laugh about in 5 years (or already laugh about right now)? HaukeAlso, pure D code will automatically be UTF-32, which is exactly what you need if you want to make the lives of newbies easier. Otherwise people WILL end up using ASCII strings when they start out.Over the last 10 years, I wrote two major internationalized apps. One used UTF-8 internally, and converted other encodings to/from it on input/output. The other used wchar_t throughout, and was ported to win32 and linux which mapped wchar_t to UTF-16 and UTF-32, respectively. The former project ran much faster, consumed far less memory, and (aside from the lack of support from C for UTF-8) simply had far fewer problems. The latter was big and slow. Especially and linux with the wchar_t's being UTF-32, it really hogged the memory.
Dec 17 2003
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:brpvmn$2o0t$1 digitaldaemon.com...I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.UTF-8 has some nice advantages over other multibyte encodings in that it is possible to find the start of a sequence without backing up to the beginning, none of the multibyte encodings have bit 7 clear (so they never conflict with ascii), and no additional information like code pages are necessary to decode them.Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character. So you may end up using even more memory with UTF-8.That's correct. And D supports UTF-32 programming if that works better for the particular application.And about computing complexity: if you ignore the overhead introduced by having to move more (or sometimes less) memory then manipulating UTF-32 strings is a LOT faster than UTF-8. Simply because random access is possible and you do not have to perform an expensive decode operation on each character.Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and away most operations on strings were copying them, storing them, hashing them, etc.Also, how much text did your "bad experience" application use?Maybe 100 megs. Extensive profiling and analysis showed that it would have run much faster if it was UTF-8 rather than UTF-32, not the least of which was it would have hit the 'wall' of thrashing the virtual memory much later.It seems to me that even if you assume best-case for UTF-8 (e.g. one byte per character) then the memory overhead should not be much of an issue. It's only factor 4, after all.It's a huge (!) issue. When you're pushing a web server to the max, using 4x memory means it runs 4x slower. (Actually about 2x slower because of other factors.)So assuming that your application uses 100.000 lines of text (which is a lot more than anything I've ever seen in a program), each 100 characters long and everything held in memory at once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32. These are hardly numbers that will bring a modern OS to its knees anymore. In a few years this might even fit completely into the CPU'scache! Server applications usually get maxed out on memory, and they deal primarilly with text. The bottom line is D will not be competitive with C++ pay a heavy price for using 2 bytes for a char. (Most benchmarks I've seen do not measure char processing speed or memory consumption.)I think it's more important to have proper localization ability and programming ease than trying to conserve a few bytes for a limited group of people (i.e. english speakers). Being greedy with memory consumption when making long-term design decisions has always caused problems. For instance, it caused that major Y2K panic in the industry a few years ago!You have a valid point, but things are always a tradeoff. D offers the flexibility of allowing the programmer to choose whether he wants to build his app around char, wchar, or dchar's. (None of my programs dating back to the 70's had any Y2K bugs in them <g>)Please also keep in mind that a factor 4 will be compensated by memory enhancements in only 1-2 years time.I don't agree that memory is improving that fast. Even if it is, people just load them up with more data to fill the memory up. 
I will agree that program code size is no longer that relevant, but data size is still pretty relevant. Stuff we were forced to do back in the bad old DOS 640k days seem pretty quaint now <g>.Most people already have several hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit shortsighted to make the lives of D programmers harder forever, just to save a few megabytes of memory that people will laugh about in 5 years (or already laugh about right now)?D programmers can use dchars if they want to.
Dec 17 2003
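A minimal sketch (not taken from any of the posts) of the self-synchronization property described above: continuation bytes always have the bit pattern 10xxxxxx, so the start of the sequence containing a given byte can be found without rescanning the string, and a plain ASCII byte can never appear inside a multibyte sequence. The function name is invented.

// Sketch only: find the start of the UTF-8 sequence that contains s[i].
// Continuation bytes are 0x80..0xBF (bit pattern 10xxxxxx), so we back up
// past them; an ASCII byte or a lead byte ends the scan.
uint sequenceStart(char[] s, uint i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        i--;
    return i;
}

unittest
{
    char[] s = "a\u00E9b";             // 'a', U+00E9 (2 bytes in UTF-8), 'b'
    assert(s.length == 4);
    assert(sequenceStart(s, 2) == 1);  // byte 2 is the trail byte of U+00E9
    assert(sequenceStart(s, 3) == 3);  // 'b' starts its own sequence
}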
"Walter" <walter digitalmars.com> wrote in message news:brqr8e$vmh$1 digitaldaemon.com..."Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:brpvmn$2o0t$1 digitaldaemon.com...isI don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.UTF-8 has some nice advantages over other multibyte encodings in that itpossible to find the start of a sequence without backing up to the beginning, none of the multibyte encodings have bit 7 clear (so they never conflict with ascii), and no additional information like code pages are necessary to decode them.But with UTF-32, this is not an issue at all.Yes, but that statement does not stop clueless/lazy programmers from using chars in libraries/programs where UTF-32 should have been used.Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character. So you may end up using even more memory with UTF-8.That's correct. And D supports UTF-32 programming if that works better for the particular application.andAnd about computing complexity: if you ignore the overhead introduced by having to move more (or sometimes less) memory then manipulating UTF-32 strings is a LOT faster than UTF-8. Simply because random access is possible and you do not have to perform an expensive decode operation on each character.Interestingly, it was rarely necessary to decode the UTF-8 strings. Faraway most operations on strings were copying them, storing them, hashing them, etc.If that is correct, it might be just as correct, and even faster, to treat it as binary data in most cases. No need to have that data reoresented as String at all times.later. I think the profiling might have shown very different numbers if the native language of the profiling crew/test files were traditional chinese texts, mixed with a lot of different languages.Also, how much text did your "bad experience" application use?Maybe 100 megs. Extensive profiling and analysis showed that it would have run much faster if it was UTF-8 rather than UTF-32, not the least of which was it would have hit the 'wall' of thrashing the virtual memory much4xIt seems to me that even if you assume best-case for UTF-8 (e.g. one byte per character) then the memory overhead should not be much of an issue. It's only factor 4, after all.It's a huge (!) issue. When you're pushing a web server to the max, usingmemory means it runs 4x slower. (Actually about 2x slower because of other factors.)I agree with you, speed is important. But if what you are serving is 8-bit .html files (latin language), why not treat the data as usigned bytes? You are describing the "special case" as the explanation of why UTF-32 should not be the general case. The definition of the language is what people are interested in at this point. What "dirty" tricks you use in the implementation to make it faster (right now, in some special cases, with a limited set of language data) is less interesting.C++So assuming that your application uses 100.000 lines of text (which is a lot more than anything I've ever seen in a program), each 100 characters long and everything held in memory at once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32. These are hardly numbers that will bring a modern OS to its knees anymore. In a few years this might even fit completely into the CPU'scache! 
Server applications usually get maxed out on memory, and they deal primarilly with text. The bottom line is D will not be competitive withif it does chars as 32 bits each. I doubt many realize this, but Java andpay a heavy price for using 2 bytes for a char. (Most benchmarks I've seen do not measure char processing speed or memory consumption.)I think this is a brilliant observation. I had not thought much about this. But I think my thought from above is still correct: why should the data for this special case be String at all? A good server software writer could obtain the ultimate speed by using unsigned bytes. That would give ultimate speed when necessary, and generally applicable String handling for all spoken languages would be enforced for String at the same time.ago!I think it's more important to have proper localization ability and programming ease than trying to conserve a few bytes for a limited group of people (i.e. english speakers). Being greedy with memory consumption when making long-term design decisions has always caused problems. For instance, it caused that major Y2K panic in the industry a few yearsYou have a valid point, but things are always a tradeoff. D offers the flexibility of allowing the programmer to choose whether he wants to build his app around char, wchar, or dchar's.With all due respect, I believe you are trading off in the wrong direction. Because you have a personal interest in good performance (which is good) you seem to not want to consider the more general cases as being the general ones. I propose (as an experiment) that you try to think "what would I do if I were a chinese?" each time you want to make a tradeoff on string handling. This is what good design is all about. In the performance trail of thought: do we all agree that the general String _manipulation_ handling in all programs will perform much better if choosing UTF-32 over UTF-8, when considering that the natural language data of the program would be traditional chinese? Another one: If UTF-32 were the base type of String, would it be applicable to have a "Compressed" attribute on each String? That way it could have as small as possible i/o, storage and memcpy size most of the time, and could be uncompressed for manipulation? This should take care of most of the "data size"/trashing related arguments...(None of my programs dating back to the 70's had any Y2K bugs in them <g>)justPlease also keep in mind that a factor 4 will be compensated by memory enhancements in only 1-2 years time.I don't agree that memory is improving that fast. Even if it is, peopleload them up with more data to fill the memory up. I will agree thatprogramcode size is no longer that relevant, but data size is still pretty relevant. Stuff we were forced to do back in the bad old DOS 640k daysseempretty quaint now <g>.The option to do so, is the problem. Because the programmers from a latin letter using country will most likely choose chars, because that is what they are used to, and because it will perform better (on systems with too little RAM). And that will be a loss for the international applicapability of D. Thanks to all who took the time to read my take on these issues. Regards, RoaldMost people already have several hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit shortsighted to make the lives of D programmers harder forever, just to save a few megabytes of memory that people will laugh about in 5 years (or already laugh about right now)?D programmers can use dchars if they want to.
Dec 18 2003
"Roald Ribe" <rr.no spam.teikom.no> wrote in message news:brsfkq$dpl$1 digitaldaemon.com...You raise some good points. This issue should not be treated too lightly. It should be possible to work with text as bytes, (for performance on interfacing with legacy non-Unicode strings) but that should definitely not be the preferred way. I think that there should be no char or wchar, and that dchar should be renamed char. That way if you see byte[] in the code you won't be tempted to think of it as a string but more like raw data. UTF-8 can be well represented by byte[], and if you want to work directly with UTF-8, you can use a wrapper class from the D standard library. SeanD programmers can use dchars if they want to.The option to do so, is the problem. Because the programmers from a latin letter using country will most likely choose chars, because that is what they are used to, and because it will perform better (on systems with too little RAM). And that will be a loss for the international applicapability of D. Thanks to all who took the time to read my take on these issues.
Dec 18 2003
I think that there should be no char or wchar, and that dchar should be renamed char.Sorry if I'm stating something I lack knowledge in, but if there were no wchar, what would you use to call the Windows wide API? regards
Dec 18 2003
Den Thu, 18 Dec 2003 15:27:02 -0500 skrev Lewis:I think that there should be no char or wchar, and that dchar should be renamed char.Sorry if I'm stating something I lack knowledge in, but if there were no wchar, what would you use to call the Windows wide API?Most likely ushort[]. Regards Elias Mårtenson
Dec 18 2003
"Roald Ribe" <rr.no spam.teikom.no> wrote in message news:brsfkq$dpl$1 digitaldaemon.com...Yes, but that statement does not stop clueless/lazy programmers from using chars in libraries/programs where UTF-32 should have been used.I can't really stop clueless/lazy programmers from writing bad code <g>.I think the profiling might have shown very different numbers if the native language of the profiling crew/test files were traditional chinese texts, mixed with a lot of different languages.If an app is going to process primarilly chinese, it will probably be more efficient using dchar[]. If an app is going to process primarilly english, then char[] is the right choice. The server app I wrote was for use primarilly by american and european companies. It had to handle chinese, but far and away the bulk of the data it needed to process was plain old ascii. D doesn't force such a choice on the app programmer - he can pick char[], wchar[] or dchar[] to match the probability of the bulk of the text it will be dealing with.I agree with you, speed is important. But if what you are serving is 8-bit .html files (latin language), why not treat the data as usigned bytes? You are describing the "special case" as the explanation of why UTF-32 should not be the general case.For overloading reasons. I never liked the C way of conflating chars with bytes. Having a utf type separate from a byte type enables more reasonable ways of handling things like string literals.buildYou have a valid point, but things are always a tradeoff. D offers the flexibility of allowing the programmer to choose whether he wants todirection.his app around char, wchar, or dchar's.With all due respect, I believe you are trading off in the wrongBecause you have a personal interest in good performance (which is good) you seem to not want to consider the more general cases as being the general ones. I propose (as an experiment) that you try to think "what would I do if I were a chinese?" each time you want to make a tradeoff on string handling. This is what good design is all about.I assume that a chinese programmer writing chinese apps would prefer to use dchar[]. And that is fully supported by D, so I am misunderstanding what our disagreement is about.In the performance trail of thought: do we all agree that the general String _manipulation_ handling in all programs will perform much better if choosing UTF-32 over UTF-8, when considering that the natural language data of the program would be traditional chinese?Sure. But if the data the program will see is not chinese, then performance will suffer. As a language designer, I cannot determine what data the programmer will see, so D provides char[], wchar[] and dchar[] and the programmer can make the choice based on the data for his app.Another one: If UTF-32 were the base type of String, would it be applicable to have a "Compressed" attribute on each String? That way it could have as small as possible i/o, storage and memcpy size most of the time, and could be uncompressed for manipulation? This should take care of most of the "data size"/trashing related arguments...An intriguing idea, but I am not convinced it would be superior to UTF-8. Data compression is relatively slow.D is not going to force one to write internationalized apps, just make it easy to write them if the programmer cares about it. As opposed to C where it is rather difficult to write internationalized apps, so few bother.D programmers can use dchars if they want to.The option to do so, is the problem. 
Because the programmers from a latin letter using country will most likely choose chars, because that is what they are used to, and because it will perform better (on systems with too little RAM). And that will be a loss for the international applicability of D.Thanks to all who took the time to read my take on these issues.It's a fun discussion!
Dec 18 2003
Walter wrote:"Roald Ribe" <rr.no spam.teikom.no> wrote in message news:brsfkq$dpl$1 digitaldaemon.com...But it is possible to make it harder to do so. I believe that is what this discussion is all about.Yes, but that statement does not stop clueless/lazy programmers from using chars in libraries/programs where UTF-32 should have been used.I can't really stop clueless/lazy programmers from writing bad code <g>.I don't think most programmers (at the time of writing the code) is aware of the fact that his application is going to be used outside the local region. An example is the current project I'm working in, the old application that out new one is designed to replace, is already exported throughout the world. Even though that is the case, when I came into the project there was absolutely zero understanding that we needed to support anything else than ISO-8859-1. As a result, we have lost a lot of time rewriting parts of the system. Now, I agree that the current D way would have made it a lot easier, but it could be even easier.I think the profiling might have shown very different numbers if the native language of the profiling crew/test files were traditional chinese texts, mixed with a lot of different languages.If an app is going to process primarilly chinese, it will probably be more efficient using dchar[]. If an app is going to process primarilly english, then char[] is the right choice. The server app I wrote was for use primarilly by american and european companies. It had to handle chinese, but far and away the bulk of the data it needed to process was plain old ascii.D doesn't force such a choice on the app programmer - he can pick char[], wchar[] or dchar[] to match the probability of the bulk of the text it will be dealing with.In the end, I think most people (including me) would be a lot happier if all that was done was renaming dchar into char. No functionality change at all, just a rename of the types. I think most people can see the advantage of D supporting UTF-8 natively, it just feels wrong with an array of "char" which isn't really an array of characters.For overloading reasons. I never liked the C way of conflating chars with bytes. Having a utf type separate from a byte type enables more reasonable ways of handling things like string literals.Right, I can see your reasoning, but does the type _really_ have to be named "char"?I assume that a chinese programmer writing chinese apps would prefer to use dchar[]. And that is fully supported by D, so I am misunderstanding what our disagreement is about.Possibly, but in todays world it's not unusual that an application is developed in europe but used in china, or developed in india but used in new zealand. Regards Elias Mårtenson
Dec 19 2003
"Elias Martenson" <elias-m algonet.se> wrote in message news:bruen1$f05$1 digitaldaemon.com...I don't think most programmers (at the time of writing the code) is aware of the fact that his application is going to be used outside the local region.Probably true.char[],D doesn't force such a choice on the app programmer - he can pickwillwchar[] or dchar[] to match the probability of the bulk of the text itEven despite the fact that in C and C++, char is byte-sized, it would probably be preferrable to just rename "char" to "bchar" and "dchar" to "char". This corresponds to byte and int, but then wchar seems out of place since in D there is short and ushort but not word. "schar" sounds like "signed char" and I believe we should stay away from that. What to do, what to do?be dealing with.In the end, I think most people (including me) would be a lot happier if all that was done was renaming dchar into char. No functionality change at all, just a rename of the types. I think most people can see the advantage of D supporting UTF-8 natively, it just feels wrong with an array of "char" which isn't really an array of characters.withFor overloading reasons. I never liked the C way of conflating charsreasonablebytes. Having a utf type separate from a byte type enables moreGood point. But there is the backward compatibility thing, which kind of sucks. It would subtly break any C app ported to D that allocated memory using malloc(N) and then stored a N-character string into it.ways of handling things like string literals.Right, I can see your reasoning, but does the type _really_ have to be named "char"?useI assume that a chinese programmer writing chinese apps would prefer toourdchar[]. And that is fully supported by D, so I am misunderstanding whatIt will still work, but won't be as efficient as it could be. Seandisagreement is about.Possibly, but in todays world it's not unusual that an application is developed in europe but used in china, or developed in india but used in new zealand.
Dec 19 2003
There has been a lot of talk about doing things, but very little has actually happened. Consequently, I have made a string interface and two rough and ready string classes for UTF-8 and UTF-32, which are attached to this message. Currently they only do a few things, one of which is to provide a consistent interface for character manipulation. The UTF-8 class also provides direct access to the bytes for when the user can do things more efficiently with these. They can also be appended to each other. In addition, each provides a constructor taking the other one as a parameter. Please bear in mind that I am only an amateur programmer, who knows very little about Unicode and has no experience of programming in the real world. Nevertheless, I can appreciate some of the issues here and I hope that these classes can be the foundation of something more useful. From, Rupert
[uuencoded attachment omitted: stringclasses.d; the code was later posted at http://www.wikiservice.at/wiki4d/wiki.cgi?StringClasses]
Dec 19 2003
Cool beans! Thanks, Rupert! This brings up a point. The main reason that I do not like opAssign/opAdd syntax for operator overloading is that it is not self-documenting that opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that opCatAssign corresponds to a ~= b. This information either has to be present in a comment or you have to go look it up. Yeah, D gurus will have it memorized, but I'd rather there be just one "name" for the function, and it should be the same both in the definition and at the point of call. Sean "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message news:brvghd$21n8$2 digitaldaemon.com...There has been a lot of talk about doing things, but very little has actually happened. Consequently, I have made a string interface and two rough and ready string classes for UTF-8 and UTF-32, which are attached to this message. Currently they only do a few things, one of which is to provide aconsistentinterface for character manipulation. The UTF-8 class also provides direct access to the bytes for when the user can do things more efficiently with these. They can also be appended to each other. In addition, each providesaconstructor taking the other one as a parameter. Please bear in mind that I am only an amateur programmer, who knows very little about Unicode and has no experience of programming in the realworld.Nevertheless, I can appreciate some of the issues here and I hope thattheseclasses can be the foundation of something more useful. From, Rupert
Dec 19 2003
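To keep the correspondence straight while reading this exchange, here is a small sketch of the operator names being discussed; the class is invented, but the method names are the D ones the posts refer to.

class Buf
{
    char[] data;

    this(char[] s) { data = s; }

    // invoked for: b[x .. y]
    char[] opSlice(int x, int y) { return data[x .. y]; }

    // invoked for: b ~= s
    void opCatAssign(char[] s) { data ~= s; }
}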
I agree with you, but we just have to grin and bear it, unless / until Walter changes his mind. I suppose I could have commented my code better though. Hopefully as I become more experienced, I will be a better judge of these things. "Sean L. Palmer" <palmer.sean verizon.net> wrote in message news:brvlj9$29qh$1 digitaldaemon.com...Cool beans! Thanks, Rupert! This brings up a point. The main reason that I do not like opAssign/opAdd syntax for operator overloading is that it is not self-documenting that opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that opCatAssign corresponds to a ~= b. This information either has to be present in a comment or you have to go look it up. Yeah, D gurus willhaveit memorized, but I'd rather there be just one "name" for the function,andit should be the same both in the definition and at the point of call. Sean "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message news:brvghd$21n8$2 digitaldaemon.com...toThere has been a lot of talk about doing things, but very little has actually happened. Consequently, I have made a string interface and two rough and ready string classes for UTF-8 and UTF-32, which are attacheddirectthis message. Currently they only do a few things, one of which is to provide aconsistentinterface for character manipulation. The UTF-8 class also provideswithaccess to the bytes for when the user can do things more efficientlyprovidesthese. They can also be appended to each other. In addition, eachaconstructor taking the other one as a parameter. Please bear in mind that I am only an amateur programmer, who knows very little about Unicode and has no experience of programming in the realworld.Nevertheless, I can appreciate some of the issues here and I hope thattheseclasses can be the foundation of something more useful. From, Rupert
Dec 19 2003
The problem with the operater* or operator~ syntax is it is ambiguous. It's also not greppable. "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message news:brvr60$2il5$1 digitaldaemon.com...I agree with you, but we just have to grin and bear it, unless / until Walter changes his mind. I suppose I could have commented my code better though. Hopefully as I become more experienced, I will be a better judgeofthese things. "Sean L. Palmer" <palmer.sean verizon.net> wrote in message news:brvlj9$29qh$1 digitaldaemon.com...opAssign/opAddCool beans! Thanks, Rupert! This brings up a point. The main reason that I do not likethatsyntax for operator overloading is that it is not self-documenting that opSlice corresponds to a[x..y] or that opAdd corresponds to a + b ormessageopCatAssign corresponds to a ~= b. This information either has to be present in a comment or you have to go look it up. Yeah, D gurus willhaveit memorized, but I'd rather there be just one "name" for the function,andit should be the same both in the definition and at the point of call. Sean "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote intwonews:brvghd$21n8$2 digitaldaemon.com...There has been a lot of talk about doing things, but very little has actually happened. Consequently, I have made a string interface andattachedrough and ready string classes for UTF-8 and UTF-32, which aretoverydirectthis message. Currently they only do a few things, one of which is to provide aconsistentinterface for character manipulation. The UTF-8 class also provideswithaccess to the bytes for when the user can do things more efficientlyprovidesthese. They can also be appended to each other. In addition, eachaconstructor taking the other one as a parameter. Please bear in mind that I am only an amateur programmer, who knowslittle about Unicode and has no experience of programming in the realworld.Nevertheless, I can appreciate some of the issues here and I hope thattheseclasses can be the foundation of something more useful. From, Rupert
Dec 19 2003
If you say it's ambiguous, I'll take your word for it and if you think being greppable is important, I'm also happy to accept that. My personal opinions are not all that strong - it's only a minor inconvenience to have to check the overload function names. More importantly, what do you think of my request for more opSlice overloads? From, Rupert "Walter" <walter digitalmars.com> wrote in message news:bs08b8$527$2 digitaldaemon.com...The problem with the operater* or operator~ syntax is it is ambiguous.It'salso not greppable. "Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message news:brvr60$2il5$1 digitaldaemon.com...thatI agree with you, but we just have to grin and bear it, unless / until Walter changes his mind. I suppose I could have commented my code better though. Hopefully as I become more experienced, I will be a better judgeofthese things. "Sean L. Palmer" <palmer.sean verizon.net> wrote in message news:brvlj9$29qh$1 digitaldaemon.com...opAssign/opAddCool beans! Thanks, Rupert! This brings up a point. The main reason that I do not likesyntax for operator overloading is that it is not self-documentingfunction,thatopSlice corresponds to a[x..y] or that opAdd corresponds to a + b oropCatAssign corresponds to a ~= b. This information either has to be present in a comment or you have to go look it up. Yeah, D gurus willhaveit memorized, but I'd rather there be just one "name" for theandit should be the same both in the definition and at the point of call. Sean
Dec 20 2003
"Rupert Millard" <rupertamillard hotmail.DELETE.THIS.com> wrote in message news:bs1d9b$2033$1 digitaldaemon.com...More importantly, what do you think of my request for more opSlice overloads?I haven't got that far yet!
Dec 20 2003
It would be greppable if it were required that there be no space between the operator and the symbol. (if you use regexp you can get around this) There should be some other way to embed the symbol into the identifier, if it's causing too many lexer problems. Sean "Walter" <walter digitalmars.com> wrote in message news:bs08b8$527$2 digitaldaemon.com...The problem with the operater* or operator~ syntax is it is ambiguous.It'salso not greppable.
Dec 20 2003
"Walter" <walter digitalmars.com> wrote in message news:brqr8e$vmh$1 digitaldaemon.com...Interestingly, it was rarely necessary to decode the UTF-8 strings. Farandaway most operations on strings were copying them, storing them, hashing them, etc.That is my experience as well. Either that or it's parsing them more or less linearly.justPlease also keep in mind that a factor 4 will be compensated by memory enhancements in only 1-2 years time.I don't agree that memory is improving that fast. Even if it is, peopleload them up with more data to fill the memory up. I will agree thatprogramcode size is no longer that relevant, but data size is still pretty relevant. Stuff we were forced to do back in the bad old DOS 640k daysseempretty quaint now <g>.Code size is actually still important on embedded apps (console video games) where the machine has small code cache size (8K or less) On PS2, optimizing for size produces faster code in most cases than optimizing for speed.So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and dchar[] means UTF-32? Unfortunately then a char won't hold a single Unicode character, you have to mix char and dchar. It would be nice to have a library function to pull the first character out of a UTF-8 string and increment the iterator pointer past it. dchar extractFirstChar(inout char* utf8string); That seems like an insanely useful text processing function. Maybe the reverse as well: void appendChar(char[] utf8string, dchar c); SeanMost people already have several hundred megabytes of RAM and it will soon be gigabytes. Isn't it a bit shortsighted to make the lives of D programmers harder forever, just to save a few megabytes of memory that people will laugh about in 5 years (or already laugh about right now)?D programmers can use dchars if they want to.
Dec 18 2003
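A rough sketch of what the proposed extractFirstChar could look like if written by hand. This is illustrative only, not an existing Phobos function; it takes a slice instead of a pointer, and it skips validation of malformed or overlong sequences.

// Decode the first code point of a UTF-8 string and advance the slice
// past it.  Sketch only: no error checking.
dchar extractFirstChar(inout char[] s)
{
    uint c = s[0];
    uint len = 1;

    if ((c & 0x80) == 0)         { /* 1-byte (ASCII) sequence */ }
    else if ((c & 0xE0) == 0xC0) { c &= 0x1F; len = 2; }
    else if ((c & 0xF0) == 0xE0) { c &= 0x0F; len = 3; }
    else                         { c &= 0x07; len = 4; }

    for (uint i = 1; i < len; i++)
        c = (c << 6) | (s[i] & 0x3F);    // fold in the 10xxxxxx trail bytes

    s = s[len .. s.length];              // advance past the decoded sequence
    return cast(dchar) c;
}

The appendChar direction would run the same bit-splitting backwards: emit one lead byte carrying the high bits, followed by 10xxxxxx trail bytes.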
Den Thu, 18 Dec 2003 10:49:31 -0800 skrev Sean L. Palmer:So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and dchar[] means UTF-32? Unfortunately then a char won't hold a single Unicode character, you have to mix char and dchar.This is why I have advocated a rename of dchar to char, and the current char to something else (my first suggestion was utf8byte, but I can see why it was rejected off hand. :-) ).It would be nice to have a library function to pull the first character out of a UTF-8 string and increment the iterator pointer past it. dchar extractFirstChar(inout char* utf8string); That seems like an insanely useful text processing function. Maybe the reverse as well: void appendChar(char[] utf8string, dchar c);At least my intention when starting this second round of discussion was to iron out what the "D way" of handling strings is, so we can get to work on these library functions that you request. Regards Elias MÃ¥rtenson
Dec 18 2003
"Sean L. Palmer" <palmer.sean verizon.net> wrote in message news:brssrg$135p$1 digitaldaemon.com...So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and dchar[] means UTF-32?Yes. Exactly.Unfortunately then a char won't hold a single Unicode character,Correct. But a dchar will.you have to mix char and dchar. It would be nice to have a library function to pull the first characteroutof a UTF-8 string and increment the iterator pointer past it. dchar extractFirstChar(inout char* utf8string);Check out the functions in std.utf.That seems like an insanely useful text processing function. Maybe the reverse as well: void appendChar(char[] utf8string, dchar c);Actually, a wrapper class around the string, overloading opApply, [], etc., will do the job nicely.
Dec 18 2003
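One possible shape for the wrapper class suggested above (names are invented; it builds on the hypothetical extractFirstChar sketched a couple of messages back rather than on any particular std.utf signature):

// Sketch: stores raw UTF-8 bytes, but foreach sees whole dchars.
class Utf8String
{
    char[] data;

    this(char[] s) { data = s; }

    int opApply(int delegate(inout dchar) dg)
    {
        char[] rest = data;
        while (rest.length)
        {
            dchar c = extractFirstChar(rest);   // decodes and advances rest
            int r = dg(c);
            if (r)
                return r;
        }
        return 0;
    }
}

// usage:
//   foreach (dchar c; new Utf8String(someUtf8Text)) { ... }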
On Thu, 18 Dec 2003 16:05:47 -0800, "Walter" <walter digitalmars.com> wrote:"Sean L. Palmer" <palmer.sean verizon.net> wrote in message news:brssrg$135p$1 digitaldaemon.com...A char is defined as a UTF-8 character but does not have enough storage to hold one!? ubute[4] declares storage for 4 ubytes btytes, but char[4] The D manual derscribes a char as being a UTF-8 char AND being 8-bits ? Can't a single UTF-8 character require multiple bytes for representation? A datatype is some storage and a set of operations that can be done on that storage. In what way are char and ubyte different datatypes? An array of a datatype is an indexable set of elements of that type. (Isn't it?) Given char foo[4]; does foo[2] not represent the third char in foo !!?? I would think that the datatype char would be a UTF-8 character, with no indication of the amount of storage it used. The compiler would be free to represent it internally however it chose. Indexing should work (perhaps inefficiently) D's datatypes seem to be of two different varieties; names for units of memory and names for abstract types. Some (ubyte) describe a fixed amount af physical storage, while others ( ifloat?) describe an abstract datatype whose physical structure is hidden (or at least irrelevant) Which is char? Karl BochertSo you're saying that char[] means UTF-8, and wchar[] means UTF-16, and dchar[] means UTF-32?Yes. Exactly.Unfortunately then a char won't hold a single Unicode character,Correct. But a dchar will.
Dec 20 2003
Den Sat, 20 Dec 2003 19:33:59 +0000 skrev Karl Bochert:D's datatypes seem to be of two different varieties; names for units of memory and names for abstract types. Some (ubyte) describe a fixed amount of physical storage, while others ( ifloat?) describe an abstract datatype whose physical structure is hidden (or at least irrelevant) Which is char?It's a fixed memory type. Look at it as a ubyte, but with some special guarantees (upheld by convention). By your own question you have pointed out that the name "char" is not very good. But I really should stop pointing this out, or I'll be banned before I even get started with providing any actual value to the project. :-) Regards Elias Mårtenson
Dec 20 2003
"Karl Bochert" <kbochert copper.net> wrote in message news:1103_1071948839 bose...A char is defined as a UTF-8 character but does not have enough storage tohold one!? Right.The D manual derscribes a char as being a UTF-8 char AND being 8-bits ?Yes.Can't a single UTF-8 character require multiple bytes for representation?No.A datatype is some storage and a set of operations that can be done onthat storage.In what way are char and ubyte different datatypes?Only how they are overloaded, and how string literals are handled.An array of a datatype is an indexable set of elements of that type.(Isn't it?)Given char foo[4]; does foo[2] not represent the third char in foo !!??If it makes more sense, it is the third byte in foo.I would think that the datatype char would be a UTF-8 character, with noindication ofthe amount of storage it used. The compiler would be free to represent itinternallyhowever it chose. Indexing should work (perhaps inefficiently)That would be a higher level view of it, and I suggest a wrapper class around it can provide this.D's datatypes seem to be of two different varieties; names for units ofmemoryand names for abstract types. Some (ubyte) describe a fixed amount afphysicalstorage, while others ( ifloat?) describe an abstract datatype whosephysical structureis hidden (or at least irrelevant) Which is char?char is a fixed 8 bits of storage.
Dec 20 2003
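A small assumed example, not from the post, showing the distinction in practice: indexing a char[] gives UTF-8 code units (bytes), while indexing a dchar[] gives whole characters. String literals could initialize either type in the D of this thread's time.

void main()
{
    char[]  a = "日本語";   // stored as UTF-8: 9 bytes, and a[2] is a trail byte
    dchar[] b = "日本語";   // stored as UTF-32: 3 dchars, and b[2] is '語'

    assert(a.length == 9);
    assert(b.length == 3);
}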
"Walter" <walter digitalmars.com> wrote in message news:bs3pmm$2m0v$2 digitaldaemon.com..."Karl Bochert" <kbochert copper.net> wrote in message news:1103_1071948839 bose...toA char is defined as a UTF-8 character but does not have enough storagehold one!? Right.representation?The D manual derscribes a char as being a UTF-8 char AND being 8-bits ?Yes.Can't a single UTF-8 character require multiple bytes forNo.??? A unicode character can result in up to 6 bytes used, when encoded with UTF-8. Which is what the poster meant to ask, I think. Roald
Dec 21 2003
"Roald Ribe" <rr.no spam.teikom.no> wrote in message news:bs4ddt$ig4$1 digitaldaemon.com...Sure, perhaps I misunderstood him.representation?Can't a single UTF-8 character require multiple bytes forNo.??? A unicode character can result in up to 6 bytes used, when encoded with UTF-8. Which is what the poster meant to ask, I think.
Dec 22 2003
??? A Unicode character can result in up to 6 bytes used, when encoded with UTF-8.UTF-8 can represent all Unicode characters with no more than 4 bytes. ISO/IEC 10646 (UCS-4) may require up to 6 bytes in UTF-8, but it is a superset of Unicode.
Dec 30 2003
itI would think that the datatype char would be a UTF-8 character, with noindication ofthe amount of storage it used. The compiler would be free to representinternallyOn Friday 19th, I posted a class that provides this functionality to this thread. You can see the message here: http://www.digitalmars.com/drn-bin/wwwnews?D/20619 As for the attached file - it does not appear to be accessible to users of the webservice, so I have placed it on the wiki at: http://www.wikiservice.at/wiki4d/wiki.cgi?StringClasses Ruperthowever it chose. Indexing should work (perhaps inefficiently)That would be a higher level view of it, and I suggest a wrapper class around it can provide this.
Dec 21 2003
In article <bs4ea9$jo2$1 digitaldaemon.com>, Rupert Millard says...I sorry to interrup (I'm one of the cluless here, in fact I call this the unicorn discussion) but isn't Vathix's String class suppose to cover that? http://www.digitalmars.com/drn-bin/wwwnews?D/19525 It's bigger so it must be better ;) AntitI would think that the datatype char would be a UTF-8 character, with noindication ofthe amount of storage it used. The compiler would be free to representinternallyOn Friday 19th, I posted a class that provides this functionality to this thread.however it chose. Indexing should work (perhaps inefficiently)That would be a higher level view of it, and I suggest a wrapper class around it can provide this.
Dec 21 2003
Ant <Ant_member pathlink.com> wrote in news:bs4gc8$n2c$1 digitaldaemon.com:In article <bs4ea9$jo2$1 digitaldaemon.com>, Rupert Millard says...You had me worried here because I missed that post! However, they do slightly different things, I think. Mine indexes characters rather than bytes in UTF-8 strings. Vathix's does many other string handling things. (e.g. changing case) My code needs to be integrated into his, if it can be - I'm not sure what implications his use of templates has. You're quite correct - as they currently are, his is vastly more useful - I can't think of many situations where you need to index whole characters rather than bytes. My main reason for writing it was that I enjoy writing code. RupertI sorry to interrup (I'm one of the cluless here, in fact I call this the unicorn discussion) but isn't Vathix's String class suppose to cover that? http://www.digitalmars.com/drn-bin/wwwnews?D/19525 It's bigger so it must be better ;) AntitI would think that the datatype char would be a UTF-8 character, with noindication ofthe amount of storage it used. The compiler would be free to representinternallyOn Friday 19th, I posted a class that provides this functionality to this thread.however it chose. Indexing should work (perhaps inefficiently)That would be a higher level view of it, and I suggest a wrapper class around it can provide this.
Dec 21 2003
I think this discussion of "language being wrong" is wrong. It is obviously clear that the char[], char, and other associated types don't have a sensible higher-level semantics. The examples are many. Obviously, I find it quite right for the language not to constrain the programmers to high-level types. It is a job for the library. Now, everyone: Walter has quite enough to do of what he does better than all of us. Improving on a standard library is a job which he delegates to us. A library class or struct String should be indexed by real character scanning, and not by the address, even if it means more overhead. And the result of this indexing, as well as any single character access, would be a dchar. The internal representation should still be accessible, in case someone finds high-level semantics a bottleneck within his application. Besides, myself and Mark have proposed a number of solutions a while ago, which would give strings non-standard storage, but would allow the high level representation to be significantly faster, at the cost of ease of operating on a lower-level representation. -eye
Dec 21 2003
Walter wrote:The only situation I can think of where this might be useful is if you want to jump directly into the middle of a string. And that isn't really useful for UTF-8 because you do not know how many characters were before that - so you have no idea where you've "landed".I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.UTF-8 has some nice advantages over other multibyte encodings in that it is possible to find the start of a sequence without backing up to the beginning, none of the multibyte encodings have bit 7 clear (so they never conflict with ascii), and no additional information like code pages are necessary to decode them.Hmmm. That IS interesting. Now that you mention it, I think this would also apply to most of my own code. Though it might depend on the kind of application.And about computing complexity: if you ignore the overhead introduced by having to move more (or sometimes less) memory then manipulating UTF-32 strings is a LOT faster than UTF-8. Simply because random access is possible and you do not have to perform an expensive decode operation on each character.Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and away most operations on strings were copying them, storing them, hashing them, etc.I hadn't thought of applications that do nothing but serve data/text to others. That's a good counter-example against some of my arguments. Having the server run at 1/2 capacity because of string encoding seems to be too much. So I think you're right in having multiple "native" encodings. That still leaves the problems of providing easy ways to work with strings, though, to ensure that newbies will "automatically" write Unicode capable applications. That's the only way I see to avoid the situation we see in C/C++ code right now. What's bad about multiple encodings is that all libraries would have to support 3 kinds of strings for everything. That's not really feasible in the real world - I certainly don't want to write every function 3 times. I can think of only two ways around that: 1) some sort of automatic conversion when the function is called. This might cause quite a bit of overhead. 2) using some sort of template and let the compiler generate the 3 special cases. I don't think normal templates will work here, because we also need to support string functions in interfaces. Maybe we need some kind of universal string argument type? So that the compiler can automatically generate 3 functions if that type is used in the parameter list? Seems a bit of a hack.... 3) making the string type abstract so that string objects are compatible, no matter what their encoding is. This has the added benefit (as I have mentioned a few times before ;)) that users could have strings in their own encoding, which comes in handy when you're dealing with legacy code that does not use US-ASCII. I think 3 would be the most feasible. You decide about the encoding when you create the string object and everything else is completely transparent. HaukeSo assuming that your application uses 100.000 lines of text (which is a lot more than anything I've ever seen in a program), each 100 characters long and everything held in memory at once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32. These are hardly numbers that will bring a modern OS to its knees anymore. In a few years this might even fit completely into the CPU'scache! 
Server applications usually get maxed out on memory, and they deal primarily with text. The bottom line is D will not be competitive with C++ if it does chars as 32 bits each. I doubt many realize this, but Java and C# pay a heavy price for using 2 bytes for a char. (Most benchmarks I've seen do not measure char processing speed or memory consumption.)
Dec 19 2003
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:bruief$kav$1 digitaldaemon.com...What's bad about multiple encodings is that all libraries would have to support 3 kinds of strings for everything. That's not really feasible in the real world - I certainly don't want to write every function 3 times.I had the same thoughts!I can think of only two ways around that: 1) some sort of automatic conversion when the function is called. This might cause quite a bit of overhead. 2) using some sort of template and let the compiler generate the 3 special cases. I don't think normal templates will work here, because we also need to support string functions in interfaces. Maybe we need some kind of universal string argument type? So that the compiler can automatically generate 3 functions if that type is used in the parameter list? Seems a bit of a hack....My first thought was to template all functions taking a string. It just got too complicated.3) making the string type abstract so that string objects are compatible, no matter what their encoding is. This has the added benefit (as I have mentioned a few times before ;)) that users could have strings in their own encoding, which comes in handy when you're dealing with legacy code that does not use US-ASCII. I think 3 would be the most feasible. You decide about the encoding when you create the string object and everything else is completelytransparent. I think 3 is the same as 1!
Dec 19 2003
Walter wrote:<snip>I can think of only two ways around that: 1) some sort of automatic conversion when the function is called. This might cause quite a bit of overhead.Not really ;). With 1 I meant having unrelated string classes (maybe source code compatible, but not derived from a common base class). That would mean that a temporary object would have to be created if a function takes, say, a UTF-8 string as an argument but you pass it a UTF-32 string. Pros: the compiler can do more inlining, since it knows the object type. Cons: the performance gain of the inlining is probably lost with all the conversions that will be going on if you use different libs. It is also not possible to easily add new string types without having to add the corresponding copy constructor and =operators to the existing ones. With 3 there would not be such a problem. All functions would have to use the common string interface for their arguments, so any kind of string object that implements this interface could be passed without a conversion. Pros: adding new string encodings is no problem, passing string objects never causes new objects to be created or data to be converted. Cons: most calls can probably not be inlined, since the functions will never know the actual class of the strings they work with. Also, if you want to pass a string constant to a function you'll have to explicitly wrap it in an object, since the compiler doesn't know what kind of object to create to convert a char[] to a string interface reference. The last point would go away if string constants were also string objects. I think that would be a good idea anyway, since that'd make the string interface the default way to deal with strings. Another solution would be if there was some way to write global conversion functions that are called to do implicit conversions between different types. Such functions could also be useful in many other circumstances, so that might be an idea to think about. Hauke3) making the string type abstract so that string objects are compatible, no matter what their encoding is. This has the added benefit (as I have mentioned a few times before ;)) that users could have strings in their own encoding, which comes in handy when you're dealing with legacy code that does not use US-ASCII. I think 3 would be the most feasible. You decide about the encoding when you create the string object and everything else is completelytransparent. I think 3 is the same as 1!
Dec 19 2003
Hauke Duden wrote:Another solution would be if there was some way to write global conversion functions that are called to do implicit conversions between different types. Such functions could also be useful in many other circumstances, so that might be an idea to think about.Just to clarify: I meant this in the context of creating a string interface instance from a string constant, not to convert between different string objects (which wouldn't make much sense). E.g. interface string { ... } class MyString implements string { ... } void print(string msg) { ... } Without an implicit conversion we'd have to write: print(new MyString("Hello World")); With an implicit conversion that'd look like this: string opConvert(char[] s) { return new MyString(s); } print("Hello World"); [The last line would translate to print(opConvert("Hello World")) ] Hauke
Dec 19 2003
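The opConvert hook proposed above does not exist in D; the explicit form of the idea can still be sketched. Interface, class and function names below are invented, and the only library call assumed is std.utf.toUTF32, one of the conversion routines mentioned earlier in the thread.

import std.utf;

// Sketch of option 3: callers see only the interface, so any encoding
// that implements it can be passed around without conversion.
interface IString
{
    dchar[] toUTF32();
}

class Utf8Str : IString
{
    char[] data;
    this(char[] s) { data = s; }
    dchar[] toUTF32() { return std.utf.toUTF32(data); }
}

void print(IString msg)
{
    dchar[] s = msg.toUTF32();
    // ... hand s to whatever output layer is in use ...
}

// Until something like the proposed opConvert exists, the wrapping stays
// explicit:
//   print(new Utf8Str("Hello World"));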
I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.Well, at least one can convert any Unicode string to UTF-8 without risk of losing information.Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character. So you may end up using even more memory with UTF-8.UTF-32 never takes less memory than UTF-8. Period. Any Unicode character takes no more than 4 bytes in UTF-8:
1 byte - ASCII
2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc...
3 bytes - most of the scripts in use.
4 bytes - rare/dead/special scripts
UTF-8 means multibyte encoding for most of the languages (except English and maybe some others). Most of the European and Asian languages need just one UTF-16 unit per character. For CJK languages the occurrence of UTF-16 surrogates in real texts is estimated at <1%. Other scripts encoded in "higher planes" cover very rare or dead languages and some special symbols. In most of the cases a UTF-16 string can be treated as a simple array of UCS-2 characters. You just need to know if it has surrogates // if (number_of_characters < number_of_16bit_units)
Dec 30 2003
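The surrogate test mentioned at the end of the previous message can also be written as a direct scan, since surrogate code units occupy the range U+D800 through U+DFFF. A small sketch with an invented function name:

// True if the UTF-16 string contains any surrogate code units, i.e. if it
// has fewer characters than 16-bit units.
bool hasSurrogates(wchar[] s)
{
    for (uint i = 0; i < s.length; i++)
    {
        if (s[i] >= 0xD800 && s[i] <= 0xDFFF)
            return true;
    }
    return false;
}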
"Serge K" <skarebo programmer.net> wrote in message news:bst8q3$218i$1 digitaldaemon.com...This is a good point. But I stand my ground: it may result in up to 6 bytes used for ecah character (worst case).I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.Well, at least one can convert any Unicode string to UTF-8 without risk of losing information.This is wrong. Read up on UTF-8 encoding.Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character. So you may end up using even more memory with UTF-8.UTF-32 never takes less memory than UTF-8. Period. Any Unicode character takes no more than 4 byte in UTF-8: 1 byte - ASCII 2 byte - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc... 3 byte - most of the scripts in use. 4 byte - rare/dead/special scriptsUTF-8 means multibyte encoding for most of the languages (except Englishandmaybe some others)Right.Most of the European and Asian languages need just one UTF-16 unit per character.Yes most, but not all.For CJK languages occurrence of the UTF-16 surrogates in the real texts is estimated as <1%.The code to handle it still has to be present...Other scripts encoded in "higher planes" cover very rare or dead languages and some special symbols. In most of the cases UTF-16 string can be treated as simple array of UCS-2 characters.Yes, but "most cases" is not a good argument when the original discussion was initiated to handle ALL laguages, in a way that the developer would find to be "natural", easy and integrated in the D language.You just need to know if it has surrogates // if (number_of_characters < nomber_of_16bit_units)There is no such thing as "just" with these issues (IMHO) ;-) Roald
Dec 30 2003
Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then
RTFM.
[The Unicode Standard, Version 4.0]
The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.
UTF-8
D36. UTF-8 encoding form: The Unicode encoding form which assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-5.
[Table 3-5. UTF-8 Bit Distribution]
Scalar Value                      1st Byte   2nd Byte   3rd Byte   4th Byte
00000000 0xxxxxxx                 0xxxxxxx
00000yyy yyxxxxxx                 110yyyyy   10xxxxxx
zzzzyyyy yyxxxxxx                 1110zzzz   10yyyyyy   10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx        11110uuu   10uuzzzz   10yyyyyy   10xxxxxx
[Appendix C : Relationship to ISO/IEC 10646]
C.3 UCS Transformation Formats
UTF-8
The term UTF-8 stands for UCS Transformation Format, 8-bit form. UTF-8 is an alternative coded representation form for all of the characters of ISO/IEC 10646. The ISO/IEC definition is identical in format to UTF-8 as described under definition D36 in Section 3.9, Unicode Encoding Forms.
...
The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as an encoding form of Unicode characters.
Jan 03 2004
In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.Shouldn't this wrapper be part of Phobos?
Dec 17 2003
"Matthias Becker" <Matthias_member pathlink.com> wrote in message news:brpr00$2grc$1 digitaldaemon.com...seemsIn a higher level language, yes. But in doing systems work, one alwaysfor ato be looking at the lower level elements anyway. I wrestled with thislowwhile, and eventually decided that char[], wchar[], and dchar[] would beEventually, yes. First things first, though, and the first step was making the innards of the D language and compiler fully unicode enabled.level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.Shouldn't this wrapper be part of Phobos?
Dec 17 2003
"Walter" <walter digitalmars.com> wrote in message news:brll85$1oko$1 digitaldaemon.com..."Elias Martenson" <no spam.spam> wrote in message news:pan.2003.12.15.23.07.24.569047 spam.spam...seemsActually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.In a higher level language, yes. But in doing systems work, one alwaysto be looking at the lower level elements anyway. I wrestled with this forawhile, and eventually decided that char[], wchar[], and dchar[] would belowlevel representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.characters.The overloading issue is interesting, but may I suggest that char andwhcarare at least renamed to something more appropriate? Maybe utf8byte and utf16byte? I feel it's important to point out that they aren'tI see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't seemuchof an issue here.isAnd here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such function doesn't help either.I think the library functions should be improved to handle unicode chars. But I'm not much of an expert on how to do it right, so it is the way itfor the moment.first:I'd love to help out and do these things. But two things are neededinadequate.- At least one other person needs to volunteer. I've had bad experiences when one person does this by himself,You're not by yourself. There's a whole D community here!- The core concepts needs to be decided upon. Things seems to be somewhat in flux right now, with three different string types and all. At the very least it needs to be deicded what a "string" really is, is it a UTF-8 byte sequence or a UTF-32 character sequence? I haven't hid the fact that I would prefer the latter.A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.That's correct as well. The library's support for unicode isButintochar[],there also is a nice package (std.utf) which will convert betweenwchar[], and dchar[]. This can be used to convert the text stringsnativesupports.whatever unicode stream type the underlying operating system API(For win32 this would be UTF-16, I am unsure what linux supports.)Yes. But this would then assume that char[] is always in native encoding and doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as the stream actually performs'nativedecoding to UTF-8 when reading into a char[] array.char[] strings are UTF-8, and as such I don't know what you mean bydecoding'. There is only one possible conversion of UTF-8 to UTF-16.dependent.Unless fundamental encoding/decoding is embedded in the streams library, it would be best to simply read text data into a byte array and then perform native decoding manually afterwards using functions similar to the C mbstowcs() and wcstombs(). 
The drawback to this is that you cannot read text data in platform encoding without copying through a separate buffer, even in cases when this is not needed.If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code pageThey are UTF-8 strings. If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.onD is headed that way. The current version of the library I'm workinglocaleThe UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue. I feel that by designing D around UTF-8, UTF-16 and UTF-32 the problems withconverts the char[] strings in the file name API's to UTF-16 via std.utf.toUTF16z(), for use calling the win32 API's.This can be done in a much better, platform independent way, by using the native<->unicode conversion routines.dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.UTF.In C, as already mentioned, these are called mbstowcs() and wcstombs(). For Windows, these would convert to and from UTF-16. For Unix, these would convert to and from whatever encoding the application is running under (dictated by the LC_CTYPE environment variable). There really is no need to make the API's platform dependent in any way here.After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/fromThis should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How abouthaving#ifdef _UNICODE all over the place? I've done that too much already. No thanks!) UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ascii code, you can write portable code that is fully capable of handling the complete unicode character set.Following this discussion, I have read some more on the subject. In additon to the speed issues that was mentioned, I have had some insights on the issues of endianess, serialization, BOM (Byte Order Mark) ++ Most of it can be found in a reasonably short pdf document: http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf There is even more to this than I first believed... Based on the new knowledge I become more and more convinced that the choice of UTF-8 encoding as the basic "correct thing to do" for general use in a programming language, is well founded. But when text _processing_ comes into play, other rules aplies. But: I still find it objectionable to call one byte in a UTF-8/Unicode based language a char! ;-) The naming will of course make it easier to do a straight port from C to D, but such a port will in most cases be of no use on the "International scene". Oh well, this can be argued for/against well both ways I guess... IMHO there should be no char type at all. Only byte. Or maybe to take more sizes into consideration: bin8, bin16, bin32, bin64... 
I think porting from C to D should involve renaming char's to bin8's.

Hmmm... It is sad when learning more makes you want to change less ;-)

Anyway, there is more to be learned...

Roald
Dec 31 2003
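A short sketch of the std.utf conversions referred to above: toUTF8,
toUTF16, toUTF32 and toUTF16z are the Phobos routines named in the thread,
while the hand-written MessageBoxW declaration and the example() function
are illustrative assumptions, not library code:

    import std.utf;

    // Illustrative only: a Win32 "wide" API declared by hand.
    extern (Windows) int MessageBoxW(void* hwnd, wchar* text,
                                     wchar* caption, uint type);

    void example()
    {
        char[] s = "Smörgåsbord";   // D source is UTF-8, so s is a UTF-8 string

        wchar[] w = toUTF16(s);     // UTF-8  -> UTF-16
        dchar[] d = toUTF32(s);     // UTF-8  -> UTF-32
        char[]  u = toUTF8(w);      // UTF-16 -> UTF-8 (round trip)

        // For calling a UTF-16 Win32 API, toUTF16z() yields a
        // zero-terminated UTF-16 string, as Walter describes above.
        MessageBoxW(null, toUTF16z(s), toUTF16z("Title"), 0);
    }

The point of the exchange above is that the char[] passed in is always
UTF-8; any locale dependent (code page) text has to be converted before it
ever becomes a char[].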
First: I'm new to D and my English is bad.

I really like UTF-8, but the truth is that it is not efficient all the
time (local character access, for example). In a small number of C/C++
programs I needed to use internal UTF-32 instead of UTF-8, but later I
introduced a hack: I indexed the UTF-8 character number/position and kept
a standard UTF-8 vector. The memory needed is lower than with UTF-32 in my
most frequent cases, and the memory efficiency is better than UTF-32. In
my experience this works very well for Latin and CJK languages (the two I
normally use), but for Cyrillic, Arabic and so on the memory use can be
bigger than UTF-32; with an efficient indexing system, though, it can be
brought down to about the same as UTF-32. In performance, the penalty is
on the order of 8 times slower than a UTF-32 implementation, but compared
to the penalty of standard UTF-8 it is very fast.

I recommend adding:

stringi -> an indexed string for UTF-8

and the possibility to mark the internal representation of the UTF string,
like:

string utf8-32 -> this marks a UTF-8 string, but internally it works as
UTF-32
Jan 07 2004
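A rough sketch of the indexing scheme described above, assuming
std.utf.stride and std.utf.decode; the IndexedUtf8String type and the
granularity of one index entry per 16 code points are hypothetical choices
made for illustration, not an existing library:

    import std.utf;

    // Hypothetical "indexed UTF-8 string": the text stays in UTF-8, and
    // every 16th code point's byte offset is remembered, so random access
    // costs a short forward scan instead of a scan from the beginning.
    class IndexedUtf8String
    {
        char[]   data;      // UTF-8 bytes
        size_t[] index;     // index[k] = byte offset of code point k*16

        this(char[] s)
        {
            data = s;
            size_t byteIndex = 0;
            size_t charIndex = 0;
            while (byteIndex < data.length)
            {
                if (charIndex % 16 == 0)
                    index ~= byteIndex;
                byteIndex += stride(data, byteIndex); // bytes in this code point
                charIndex++;
            }
        }

        // Decode the i-th code point, starting from the nearest indexed offset.
        dchar opIndex(size_t i)
        {
            size_t byteIndex = index[i / 16];
            for (size_t skip = i % 16; skip > 0; skip--)
                byteIndex += stride(data, byteIndex);
            return decode(data, byteIndex);
        }
    }

A denser index gives faster lookups at a higher memory cost; for scripts
whose characters take three or more UTF-8 bytes, the combined size of the
UTF-8 data and the index can approach that of plain UTF-32, as the poster
notes.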