digitalmars.D - Selectable encodings
- John C (22/22) Apr 06 2006 I know of three ways to support a user-selected char encoding in a libra...
- Mike Capp (10/11) Apr 06 2006 Apologies for going off at a tangent to your question, but I've never qu...
- Oskar Linde (4/14) Apr 06 2006 It is the latter. But I don't think much of the string handling code is
- James Dunne (20/46) Apr 06 2006 The char type is really a misnomer for dealing with UTF-8 encoded
- Anders F Björklund (8/14) Apr 06 2006 Yeah, but it does hold an *ASCII* character ?
- Mike Capp (24/30) Apr 06 2006 (Changing subject line since we seem to have rudely hijacked the OP's to...
- Georg Wrede (32/66) Apr 06 2006 Yes. And it's a _gross_ misnomer.
- Jari-Matti Mäkelä (13/26) Apr 06 2006 It's O(n) vs O(n). :) You have to go through all the bytes in both
- Thomas Kuehne (11/17) Apr 06 2006 -----BEGIN PGP SIGNED MESSAGE-----
- Jari-Matti Mäkelä (9/19) Apr 07 2006 Yes, I know. This was just an optimistic tongue-in-cheek analysis :)
- Thomas Kuehne (60/70) Apr 06 2006 -----BEGIN PGP SIGNED MESSAGE-----
- Walter Bright (4/7) Apr 06 2006 I don't know about that, but the code below isn't optimal. Replace...
- Sean Kelly (5/13) Apr 06 2006 I've been wondering about this. Will 'stride' be accurate for any
- Walter Bright (3/16) Apr 06 2006 UTF8stride[] will give 0xFF for values that are not at the beginning of
- Sean Kelly (10/28) Apr 06 2006 Thanks. I saw the 0xFF entries in UTF8stride and mostly wanted to make
- Walter Bright (3/31) Apr 06 2006 Take a look at std.utf.toUTFindex(), which takes care of the problem (by...
- Georg Wrede (8/38) Apr 06 2006 No fear. Any UTF-8 byte that belongs to a stride is clearly marked as
- kris (6/16) Apr 06 2006 It's not as simple as that any more. Lookup tables can sometimes cause
- Anders F Björklund (20/29) Apr 07 2006 I don't think so. UTF-8 is good for us in "non-British" Europe, and
- Sean Kelly (16/45) Apr 06 2006 Since UTF-8 is compatible with ASCII, might it not be reasonable to
- Anders F Björklund (12/18) Apr 06 2006 I'm not sure that C guys would miss a string class (after all, char[]
I know of three ways to support a user-selected char encoding in a library, but each has its drawbacks.

1) Method overloading. Introduces conflicts with string literals (forcing a c/w/d suffix to be used), and you can't overload by return type.

2) Parameterising all types that use strings. Making every class a template just to get this functionality seems over the top.

    class SomeClassT(TChar) {
        TChar[] getSomeString() {}
    }
    alias SomeClassT!(char) SomeClass;  // in library module
    alias SomeClassT!(wchar) SomeClass; // in user module

3) A compiler version condition with aliases. The version condition approach is the most attractive to me, but some people aren't fond of it.

    version (utf8)
        alias char mlchar;
    else version (utf16)
        alias wchar mlchar;
    else version (utf32)
        alias dchar mlchar;

There's a fourth way - encoding conversion - but there's a runtime cost.

So does anyone use an alternative way to enable users to select which char encoding they want to use at compile time?
Apr 06 2006
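A minimal sketch of how approach 3 plays out in practice (the module and function names here are hypothetical): the library is written once against the alias, and the user picks the encoding with -version on the compiler command line.

```d
// Hypothetical library module; built with e.g. dmd -version=utf16
module mylib.text;

version (utf8)
    alias char mlchar;
else version (utf16)
    alias wchar mlchar;
else version (utf32)
    alias dchar mlchar;
else
    alias char mlchar; // fall back to UTF-8

// Library code is written once against mlchar[]:
mlchar[] shout(mlchar[] s)
{
    return s ~ "!";
}
```

The obvious drawback, as the post notes, is that the choice is global: two libraries built with different -version flags cannot share a single mlchar type.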
In article <e12j34$2gi2$1 digitaldaemon.com>, John C says...

> version (utf8) alias mlchar char;

Apologies for going off at a tangent to your question, but I've never quite understood what D thinks it's doing here.

If char[] is an array of characters, then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So is char[] an array of characters from some other charset (e.g. the subset of UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8 string (in which case I suspect quite a lot of string-handling code is badly broken)?

cheers
Mike
Apr 06 2006
Mike Capp skrev:

> If char[] is an array of characters, then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So is char[] an array of characters from some other charset (e.g. the subset of UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8 string (in which case I suspect quite a lot of string-handling code is badly broken)?

It is the latter. But I don't think much of the string handling code is broken because of that.

/Oskar
Apr 06 2006
Oskar Linde wrote:

> It is the latter. But I don't think much of the string handling code is broken because of that.

The char type is really a misnomer for dealing with UTF-8 encoded strings. It should be named closer to "code-unit for UTF-8 encoding".

For my own research language I've chosen what I believe to be a nice type naming system:

    char  - 32-bit Unicode code point
    u8cu  - UTF-8 code unit
    u16cu - UTF-16 code unit
    u32cu - UTF-32 code unit

I could be wrong (and I bet I am) on the terminology used to describe char, but I really mean it to just store a full Unicode character such that strings of chars can safely assume character index == array index.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O M--
V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne
Apr 06 2006
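To make the misnomer concrete, here is a small sketch (assuming std.utf from Phobos, and a UTF-8 source file) showing that char[] indexes code units, not characters:

```d
import std.utf;

void main()
{
    char[] s = "naïve";    // 'ï' (U+00EF) takes two UTF-8 code units
    assert(s.length == 6); // six code units...

    dchar[] d = toUTF32(s);
    assert(d.length == 5); // ...but only five characters (code points)

    // s[3] is the continuation byte of 'ï', not the letter 'v':
    assert((s[3] & 0xC0) == 0x80);
}
```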
James Dunne wrote:

> The char type is really a misnomer for dealing with UTF-8 encoded strings. It should be named closer to "code-unit for UTF-8 encoding".

Yeah, but it does hold an *ASCII* character? Usually the D code handles char[] with dchar, but with a "short path" for ASCII characters...

> I could be wrong (and I bet I am) on the terminology used to describe char, but I really mean it to just store a full Unicode character such that strings of chars can safely assume character index == array index.

For the general case, UTF-32 is a pretty wasteful Unicode encoding just to have that privilege?

--anders
Apr 06 2006
(Changing subject line since we seem to have rudely hijacked the OP's topic)

In article <e13b56$is0$1 digitaldaemon.com>, Anders F Björklund says...

>> The char type is really a misnomer for dealing with UTF-8 encoded strings. It should be named closer to "code-unit for UTF-8 encoding".

(I fully agree with this statement, by the way.)

> Yeah, but it does hold an *ASCII* character?

I don't find that very helpful - seeing a char[] in code doesn't tell me anything about whether it's byte-per-character ASCII or possibly-multibyte UTF-8.

> For the general case, UTF-32 is a pretty wasteful Unicode encoding just to have that privilege?

I'm not sure there is a "general case", so it's hard to say. Some programmers have to deal with MBCS every day; others can go for years without ever having to worry about anything but vanilla ASCII.

"Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth. Finding the millionth character in a UTF-8 string means looping through at least a million bytes, and executing some conditional logic for each one. Finding the millionth character in a UTF-32 string is a simple pointer offset and one-word fetch.

At the risk of repeating James, I do think that spelling "string" as "char[]"/"wchar[]" is grossly misleading, particularly to people coming from any other C-family language. If I was doing any serious string-handling work in D I'd almost certainly write an opaque String class that overloaded opIndex (returning dchar) to do the right thing, and optimised the underlying storage to suit the app's requirements.

cheers
Mike
Apr 06 2006
Mike Capp wrote:

>> The char type is really a misnomer for dealing with UTF-8 encoded strings. It should be named closer to "code-unit for UTF-8 encoding".
> (I fully agree with this statement, by the way.)

Yes. And it's a _gross_ misnomer. And we who are used to D can't even _begin_ to appreciate the [unnecessary!] extra work and effort needed to gradually come to understand it "our way", for those new to D.

> I don't find that very helpful - seeing a char[] in code doesn't tell me anything about whether it's byte-per-character ASCII or possibly-multibyte UTF-8.

(( A dumb idea: the input stream has a flag that gets set as soon as the first non-ASCII character is found. ))

> I'm not sure there is a "general case", so it's hard to say. Some programmers have to deal with MBCS every day; others can go for years without ever having to worry about anything but vanilla ASCII.

True!! Folks in Boise, Idaho, vs. folks in non-British Europe or the Far East.

> "Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth.

It sure looks like it. Then again, studying the UTF-8 spec, and "why we did it this way" (sorry, no URL here. Anybody?), shows that it actually is _amazingly_ light on CPU cycles! Really.

(( I sure wish there was somebody in this NG who could write a Scientifically Valid test to compare the time needed to find the millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

> Finding the millionth character in a UTF-8 string means looping through at least a million bytes, and executing some conditional logic for each one. Finding the millionth character in a UTF-32 string is a simple pointer offset and one-word fetch.

True. And even if we'd exclude any "character width logic" in the search, we still end up with sequential lookup O(n) vs. O(1). Then again, when's the last time anyone here had to find the millionth character of anything? :-)

So, of course for library writers, this appears as most relevant, but for real-world programming tasks, I think after profiling, the time wasted may be minor, in practice.

(Ah, and of course, turning a UTF-8 input into UTF-32 and then straight shooting the millionth character, is way more expensive (both in time and size) than just a loop through the UTF-8 as such. Not to mention the losses if one were, instead, to have a million-character file on hard disk in UTF-32 (i.e. a 4MB file) to avoid the look-through. Probably the time reading in the file gets so much longer that this in itself defeats the "gain".)

> At the risk of repeating James, I do think that spelling "string" as "char[]"/"wchar[]" is grossly misleading, particularly to people coming from any other C-family language.

No argument here. :-) In the midst of The Great Character Width Brouhaha (about November last year), I tried to convince Walter on this particular issue.
Apr 06 2006
Georg Wrede wrote:

> (( I sure wish there was somebody in this NG who could write a Scientifically Valid test to compare the time needed to find the millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

It's O(n) vs O(n). :) You have to go through all the bytes in both cases. I guess the conversion has a higher coefficient.

> So, of course for library writers, this appears as most relevant, but for real world programming tasks, I think after profiling, the time wasted may be minor, in practice.

Why not use the same encoding throughout the whole program and its libraries? No need to convert anywhere.

> (Ah, and of course, turning a UTF-8 input into UTF-32 and then straight shooting the millionth character, is way more expensive (both in time and size) than just a loop through the UTF-8 as such. Not to mention the losses if one were, instead, to have a million-character file on hard disk in UTF-32 (i.e. a 4MB file) to avoid the look-through. Probably the time reading in the file gets so much longer that this in itself defeats the "gain".)

That's very true. A "normal" hard drive reads 60 MB/s. So reading a 4 MB file takes at least 66 ms, while a 1 MB UTF-8 file (only ASCII characters) is read in 17 ms (well, I'm a bit optimistic here :). A modern processor executes 3 000 000 000 operations in a second. Going through the UTF-8 stream takes 1 000 000 * 10 (perhaps?) operations and thus costs 3 ms. So it's actually faster to read UTF-8.

-- 
Jari-Matti
Apr 06 2006
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jari-Matti wrote:

> That's very true. A "normal" hard drive reads 60 MB/s. So, reading a 4 MB file takes at least 66 ms and a 1 MB UTF-8 file (only ASCII characters) is read in 17 ms (well, I'm a bit optimistic here :). A modern processor executes 3 000 000 000 operations in a second. Going through the UTF-8 stream takes 1 000 000 * 10 (perhaps?) operations and thus costs 3 ms. So it's actually faster to read UTF-8.

1) your sample: English (consider Chinese)
2) magic word: seek

Thomas

-----BEGIN PGP SIGNATURE-----
iD8DBQFENbhY3w+/yD4P9tIRArYCAJ4vxbiR2fim5rFh+AQ4O3e/Gc3xjQCbBnCV
BLrTa9vqU3l8ny+/8Sqw8Mc=
=59uu
-----END PGP SIGNATURE-----
Apr 06 2006
Thomas Kuehne wrote:

> 1) your sample: English (consider Chinese)
> 2) magic word: seek

Yes, I know. This was just an optimistic tongue-in-cheek analysis :) A real-world example would naturally have a lot of non-ASCII characters too, but the point is that reading huge loads of uncompressed UTF-32 data will usually be slower than reading UTF-8 if we are also checking against text corruption. I wonder if it's any faster to read UTF-32 files from a transparently compressed reiser4 drive?

-- 
Jari-Matti
Apr 07 2006
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Georg Wrede schrieb am 2006-04-06:

> It sure looks like it. Then again, studying the UTF-8 spec, and "why we did it this way" (sorry, no URL here. Anybody?), shows that it actually is _amazingly_ light on CPU cycles! Really.

Have a look at the encoding of Hangul (Korean) and polytonic Greek <g>

> (( I sure wish there was somebody in this NG who could write a Scientifically Valid test to compare the time needed to find the millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

Challenge: Provide a D implementation that first converts to UTF-32 and has a shorter runtime than the code below:

Thomas

-----BEGIN PGP SIGNATURE-----
iD8DBQFENbZw3w+/yD4P9tIRAjTkAJsEcE6xM0fSLrT3x+iArgdVacZIXgCgsnNa
19AB53HGi6fbH9AuHTMvjq4=
=gZWL
-----END PGP SIGNATURE-----
Apr 06 2006
Thomas Kuehne wrote:

> Challenge: Provide a D implementation that first converts to UTF-32 and has a shorter runtime than the code below:

I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().
Apr 06 2006
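A sketch of the indexing loop Walter describes (the helper name is made up; std.utf.toUTFindex() is the real, bounds-checked version in Phobos):

```d
import std.utf;

// Code-unit offset of the n-th code point in s; hypothetical helper.
size_t nthCodePoint(char[] s, size_t n)
{
    size_t i = 0;
    while (n--)
    {
        ubyte len = UTF8stride[s[i]];
        if (len == 0xFF)               // not the start of a valid sequence
            throw new Exception("invalid UTF-8 sequence");
        i += len;   // hop over the whole character in one step
    }
    return i;
}
```

One table lookup per character replaces the per-byte shifts, which is presumably what makes this faster than the sar-based loop.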
Walter Bright wrote:

> I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().

I've been wondering about this. Will 'stride' be accurate for any arbitrary string position or input data? I would assume so, but don't know enough about how UTF-8 is structured to be sure.

Sean
Apr 06 2006
Sean Kelly wrote:

> I've been wondering about this. Will 'stride' be accurate for any arbitrary string position or input data? I would assume so, but don't know enough about how UTF-8 is structured to be sure.

UTF8stride[] will give 0xFF for values that are not at the beginning of a valid UTF-8 sequence.
Apr 06 2006
Walter Bright wrote:

> UTF8stride[] will give 0xFF for values that are not at the beginning of a valid UTF-8 sequence.

Thanks. I saw the 0xFF entries in UTF8stride and mostly wanted to make sure an odd combination of bytes couldn't be mistaken as a valid character, as stride seems the best fit for an "is valid UTF-8 char" type function. I've been giving the 0xFF choice some thought, however, and while it would avoid stalling loops, the alternative is an access violation when evaluating short strings and just weird behavior for large strings. If I had to track down a program bug I'd almost prefer it be a tight endless loop.

Sean
Apr 06 2006
Sean Kelly wrote:

> Thanks. I saw the 0xFF entries in UTF8stride and mostly wanted to make sure an odd combination of bytes couldn't be mistaken as a valid character, as stride seems the best fit for an "is valid UTF-8 char" type function. I've been giving the 0xFF choice some thought, however, and while it would avoid stalling loops, the alternative is an access violation when evaluating short strings and just weird behavior for large strings. If I had to track down a program bug I'd almost prefer it be a tight endless loop.

Take a look at std.utf.toUTFindex(), which takes care of the problem (by throwing an exception).
Apr 06 2006
Sean Kelly wrote:

> Thanks. I saw the 0xFF entries in UTF8stride and mostly wanted to make sure an odd combination of bytes couldn't be mistaken as a valid character,

No fear. Any UTF-8 byte that belongs to a stride is clearly marked as such in the most significant bits. Thus, you can enter a byte[] at any place and immediately know if it's (1) a single-byte character, (2) the first in a stride, or (3) within a stride. Without looking at any of the other bytes.

> as stride seems the best fit for an "is valid UTF-8 char" type function. I've been giving the 0xFF choice some thought, however, and while it would avoid stalling loops, the alternative is an access violation when evaluating short strings and just weird behavior for large strings. If I had to track down a program bug I'd almost prefer it be a tight endless loop.

UTF-8 is precisely designed to be used in very tight ASM loops that don't need a lookup table.
Apr 06 2006
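The bit patterns Georg refers to can be checked without any table; a sketch (helper names are made up):

```d
// Classify a UTF-8 code unit from its top bits alone.
bool isAscii(ubyte b) { return (b & 0x80) == 0x00; } // 0xxxxxxx: single-byte char
bool isLead (ubyte b) { return (b & 0xC0) == 0xC0; } // 11xxxxxx: starts a sequence
bool isTrail(ubyte b) { return (b & 0xC0) == 0x80; } // 10xxxxxx: inside a sequence

unittest
{
    ubyte[] s = cast(ubyte[]) "ä";   // U+00E4 is 0xC3 0xA4 in UTF-8
    assert(isLead(s[0]) && isTrail(s[1]));
}
```

This self-synchronization is why you can land in the middle of a stream and find the nearest character boundary by scanning at most a few bytes in either direction.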
Walter Bright wrote:

> I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().

It's not as simple as that any more. Lookup tables can sometimes cause more stalls than straight-line code, especially with designs such as the P4. Not to mention the possibility of a bit of cache-thrashing with other programs. Thus, the lookup may be sub-optimal. Quite possibly less optimal.
Apr 06 2006
Georg Wrede wrote:

> True!! Folks in Boise, Idaho, vs. folks in non-British Europe or the Far East.

I don't think so. UTF-8 is good for us in "non-British" Europe, and UTF-16 is good in the East. UTF-32 is good for... finding codepoints? As long as the "exceptions" (high code units) are taken care of, there is really no difference between the three (or five) - it's all Unicode.

I prefer UTF-8 - because it is ASCII-compatible and endian-independent, but UTF-16 is not a bad choice if you handle a lot of non-ASCII chars. Just as long as other layers play along with the embedded NULs, and you have the proper BOM marks when storing it. It seemed to work for Java?

The argument was just against *UTF-32* as a storage type, nothing more. (As was rationalized in http://www.unicode.org/faq/utf_bom.html#UTF32)

--anders

PS. Thought that having std UTF type aliases would have helped, but I dunno:

    module std.stdutf;

    /* UTF code units */

    alias char  utf8_t;  // UTF-8
    alias wchar utf16_t; // UTF-16
    alias dchar utf32_t; // UTF-32

It's a little confusing anyway; many "char*" routines don't accept UTF?
Apr 07 2006
Mike Capp wrote:

> I don't find that very helpful - seeing a char[] in code doesn't tell me anything about whether it's byte-per-character ASCII or possibly-multibyte UTF-8.

Since UTF-8 is compatible with ASCII, might it not be reasonable to assume char strings are always UTF-8? I'll admit this suggests many of the D string functions are broken, but they can certainly be fixed. I've been considering rewriting find and rfind to support multibyte strings. Fixing find is pretty straightforward, though rfind might be a tad messy. As a related question, can anyone verify whether std.utf.stride will return a correct result for evaluating an arbitrary offset in all potential input strings?

> "Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth. Finding the millionth character in a UTF-8 string means looping through at least a million bytes, and executing some conditional logic for each one. Finding the millionth character in a UTF-32 string is a simple pointer offset and one-word fetch.

For what it's worth, I believe the correct behavior for string/array operations is to provide overloads for char[] and wchar[] that require input to be valid UTF-8 and UTF-16, respectively. If the user knows their data is pure ASCII or they otherwise want to process it as a fixed-width string, they can cast to ubyte[] or ushort[]. This is what I'm planning for std.array in Ares.

Sean
Apr 06 2006
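A sketch of the kind of multibyte-aware find Sean mentions (the function name is made up). Because UTF-8 is self-synchronizing, a plain byte-wise search can only match at a real character boundary, so only the index translation needs extra work:

```d
import std.string;

// Code-point index of pattern in s, or size_t.max if not found.
size_t findCodePoint(char[] s, char[] pattern)
{
    int byteIdx = std.string.find(s, pattern); // ordinary byte search
    if (byteIdx < 0)
        return size_t.max;

    // Translate the byte index into a character index by not
    // counting continuation bytes (10xxxxxx) along the way.
    size_t n = 0;
    for (size_t i = 0; i < cast(size_t) byteIdx; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

rfind would work the same way, scanning backwards; the messy part is stepping back over continuation bytes to find each lead byte.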
Mike Capp wrote:

> At the risk of repeating James, I do think that spelling "string" as "char[]"/"wchar[]" is grossly misleading, particularly to people coming from any other C-family language. If I was doing any serious string-handling work in D I'd almost certainly write an opaque String class that overloaded opIndex (returning dchar) to do the right thing, and optimised the underlying storage to suit the app's requirements.

I'm not sure that C guys would miss a string class (after all, char[] is a lot better than the raw "undefined" char* they used to be using...) but I do see how having an easy String class around is useful sometimes. I even wrote a simple one myself, based on something Java-like:

http://www.algonet.se/~afb/d/dcaf/html/class_string.html
http://www.algonet.se/~afb/d/dcaf/html/class_string_buffer.html

But for wxD we use a simple char[] alias for strings, which works just fine... If the backend uses UTF-16, it will convert them at runtime when needed. (wxWidgets can be built in an "ASCII"/UTF-8 or a "Unicode"/UTF-16 mode.) Then again it only does the occasional window title or dialog string etc.

--anders
Apr 06 2006