digitalmars.D - Internationalization library ─ advice/help
- Uwe Salomon (24/24) May 15 2005 During the writing of a string class for my Indigo library i "discovered...
- Andrew Fedoniouk (136/160) May 15 2005 Good idea, I like it.
- Uwe Salomon (5/7) May 15 2005 Hmm, thanks for that. As libiconv is no standard for windows :) this wil...
- Thomas Kuehne (15/22) May 17 2005 -----BEGIN PGP SIGNED MESSAGE-----
- Uwe Salomon (3/4) May 17 2005 [snip]
- Lars Ivar Igesund (9/15) May 18 2005 Also, look at
- Uwe Salomon (13/16) May 18 2005 Hmm, they are using GNU gettext() instead of the Qt tr(). Perhaps it wou...
- Uwe Salomon (36/36) May 18 2005 This is a first implementation for conversion between UTF encodings. I
- Ben Hinkle (24/61) May 18 2005 Speeding up std.utf would be good - how can one argue with that? :-)
- Uwe Salomon (24/32) May 18 2005 Yes, one of them sounds much better. I did not think long about
- Uwe Salomon (4/10) May 18 2005 Maybe i should add that if you convert a text which contains a lot of UT...
- Ben Hinkle (10/42) May 18 2005 I could see using the unsafe versions when you check the input once and ...
- Uwe Salomon (30/49) May 18 2005 Imagine a program that reads a lot of files from disk, does some fuzzy
- Ben Hinkle (6/56) May 18 2005 sounds reasonable
- Uwe Salomon (10/20) May 18 2005 Still you are right. I moved it out of the loop in toUtf16(). I will thi...
- Uwe Salomon (18/18) May 21 2005 I have now moved the UTF conversion code into the std.utf module. I have...
- Uwe Salomon (1/1) May 21 2005 And here goes the attachment %)
During the writing of a string class for my Indigo library i "discovered" the need for a thorough internationalization library for D. I think a good implementation of i18n functionality would be very important for the development of applications in D, thus for the future of D. There is the ICU port of the Mango tree, but as ICU is a C/C++ library, this is not as natural and fast as it could be. I would like to write a native D i18n library which is independent of third party libraries. As this is too big a project to develop by myself, and (i hope) of public interest for the D community, i would like to ask for: - Advice: What is needed? How should it be implemented? - Help: Who has the time and wants to help me? A total of 2 or 3 developers should be sufficient? My ideas are to write a compact core library that contains the most important features (character properties, UTF encodings, basic message translation), and then write some localization modules (number formatting, date formatting, comparing and searching). The goals should be simplicity and speed (but perhaps the community wants other things more?), avoiding complicated implementations and "template magic". And it should be well documented from the beginning, not a construction site on every corner. But those are just some ideas that come to my mind right now. I hope that everybody makes some helpful statements about what he/she thinks should be covered by the library on all accounts, and what would be very nice. Thanks & ciao uwe
May 15 2005
Good idea, I like it. FYI: On Windows MultiByteToWideChar and WideCharToMultiByte support many encodings other than mentioned directly in MSDN. I am using this list: lang_t langs[] = { {"asmo-708",708}, {"dos-720",720}, {"iso-8859-6",28596}, {"x-mac-arabic",10004}, {"windows-1256",1256}, {"ibm775",775}, {"iso-8859-4",28594}, {"windows-1257",1257}, {"ibm852",852}, {"iso-8859-2",28592}, {"x-mac-ce",10029}, {"windows-1250",1250}, {"euc-cn",51936}, {"gb2312",936}, {"hz-gb-2312",52936}, {"x-mac-chinesesimp",10008}, {"big5",950}, {"x-chinese-cns",20000}, {"x-chinese-eten",20002}, {"x-mac-chinesetrad",10002}, {"cp866",866}, {"iso-8859-5",28595}, {"koi8-r",20866}, {"koi8-u",21866}, {"x-mac-cyrillic",10007}, {"windows-1251",1251}, {"x-europa",29001}, {"x-ia5-german",20106}, {"ibm737",737}, {"iso-8859-7",28597}, {"x-mac-greek",10006}, {"windows-1253",1253}, {"ibm869",869}, {"dos-862",862}, {"iso-8859-8-i",38598}, {"iso-8859-8",28598}, {"x-mac-hebrew",10005}, {"windows-1255",1255}, {"x-ebcdic-arabic",20420}, {"x-ebcdic-cyrillicrussian",20880}, {"x-ebcdic-cyrillicserbianbulgarian",21025}, {"x-ebcdic-denmarknorway",20277}, {"x-ebcdic-denmarknorway-euro",1142}, {"x-ebcdic-finlandsweden",20278}, {"x-ebcdic-finlandsweden-euro",1143}, {"x-ebcdic-finlandsweden-euro",1143}, {"x-ebcdic-france-euro",1147}, {"x-ebcdic-germany",20273}, {"x-ebcdic-germany-euro",1141}, {"x-ebcdic-greekmodern",875}, {"x-ebcdic-greek",20423}, {"x-ebcdic-hebrew",20424}, {"x-ebcdic-icelandic",20871}, {"x-ebcdic-icelandic-euro",1149}, {"x-ebcdic-international-euro",1148}, {"x-ebcdic-italy",20280}, {"x-ebcdic-italy-euro",1144}, {"x-ebcdic-japaneseandkana",50930}, {"x-ebcdic-japaneseandjapaneselatin",50939}, {"x-ebcdic-japaneseanduscanada",50931}, {"x-ebcdic-japanesekatakana",20290}, {"x-ebcdic-koreanandkoreanextended",50933}, {"x-ebcdic-koreanextended",20833}, {"cp870",870}, {"x-ebcdic-simplifiedchinese",50935}, {"x-ebcdic-spain",20284}, {"x-ebcdic-spain-euro",1145}, {"x-ebcdic-thai",20838}, 
{"x-ebcdic-traditionalchinese",50937}, {"cp1026",1026}, {"x-ebcdic-turkish",20905}, {"x-ebcdic-uk",20285}, {"x-ebcdic-uk-euro",1146}, {"ebcdic-cp-us",37}, {"x-ebcdic-cp-us-euro",1140}, {"ibm861",861}, {"x-mac-icelandic",10079}, {"x-iscii-as",57006}, {"x-iscii-be",57003}, {"x-iscii-de",57002}, {"x-iscii-gu",57010}, {"x-iscii-ka",57008}, {"x-iscii-ma",57009}, {"x-iscii-or",57007}, {"x-iscii-pa",57011}, {"x-iscii-ta",57004}, {"x-iscii-te",57005}, {"euc-jp",51932}, {"iso-2022-jp",50220}, {"iso-2022-jp",50222}, {"csiso2022jp",50221}, {"x-mac-japanese",10001}, {"shift_jis",932}, {"ks_c_5601-1987",949}, {"euc-kr",51949}, {"iso-2022-kr",50225}, {"johab",1361}, {"x-mac-korean",10003}, {"iso-8859-3",28593}, {"iso-8859-15",28605}, {"x-ia5-norwegian",20108}, {"ibm437",437}, {"x-ia5-swedish",20107}, {"windows-874",874}, {"ibm857",857}, {"iso-8859-9",28599}, {"x-mac-turkish",10081}, {"windows-1254",1254}, //{(const char *)L"unicode",1200}, //{"unicodefffe",1201}, {"utf-7",65000}, {"utf-8",65001}, //{"us-ascii",20127}, {"us-ascii",1252}, {"windows-1258",1258}, {"ibm850",850}, {"x-ia5",20105}, {"iso-8859-1",1252}, //was 28591 {"macintosh",10000}, {"windows-1252",1252}, {"system",CP_ACP} }; Second member in these structs is codepage id directly used as first parameter of MultiByteToWideChar and WideCharToMultiByte Hope this will help. At least it might help to build translation tables automaticly :) Andrew. "Uwe Salomon" <post uwesalomon.de> wrote in message news:op.sqtvopik6yjbe6 sandmann.maerchenwald.net...During the writing of a string class for my Indigo library i "discovered" the need for a thorough internationalization library for D. I think a good implementation of i18n functionality would be very important for the development of applications in D, thus for the future of D. There is the ICU port of the Mango tree, but as ICU is a C/C++ library, this is not as natural and fast as it could be. 
[snip]
May 15 2005
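Many of the charset names in Andrew's (name, codepage) table are standard MIME names that other platforms' codec libraries also understand, which is what makes building translation tables from it feasible. A small illustrative sketch (in Python, since the table above is plain C data; the codec-name pairing shown here is an assumption for demonstration, not part of the original post):

```python
# A few entries from the (name, codepage) list above, paired with the
# Python codec names that cover the same encodings (assumed mapping).
samples = {
    "windows-1251": "cp1251",     # Cyrillic
    "iso-8859-2":   "iso8859-2",  # Central European
    "koi8-r":       "koi8-r",     # Russian
    "shift_jis":    "shift_jis",  # Japanese
}

def roundtrip(text, codec):
    """Encode to the legacy codepage and decode back, checking losslessness."""
    return text.encode(codec).decode(codec)

# ASCII survives a round-trip through every one of these codepages.
for mime_name, py_codec in samples.items():
    assert roundtrip("test", py_codec) == "test"
```

On Windows the second column of the table would instead be handed directly to MultiByteToWideChar/WideCharToMultiByte as the codepage argument, as Andrew describes.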
FYI: On Windows MultiByteToWideChar and WideCharToMultiByte support many encodings other than mentioned directly in MSDN.Hmm, thanks for that. As libiconv is not a standard on Windows :) this will come in handy. Is there anyone who knows about encoding/decoding (and programming specialties in general) on the Mac? Regrettably, i don't know a thing about the Mac programming environment at all. :( uwe
May 15 2005
Uwe Salomon wrote on Sun, 15 May 2005 19:47:03 +0200:During the writing of a string class for my Indigo library i "discovered" the need for a thorough internationalization library for D. [snip] some links: http://www.i18ngurus.com/ http://www.openi18n.org/ http://java.sun.com/j2se/corejava/intl/ http://doc.trolltech.com/3.3/i18n.html Thomas
May 17 2005
some links:[snip] These are very good and informative, thanks a lot! uwe
May 17 2005
Uwe Salomon wrote:Also, look at http://i18n.kde.org and http://developer.kde.org/documentation/library/kdeqt/kde3arch/kde-i18n-howto.html/ While KDE is based on Qt, it seems like they've expanded on the functionality, especially the part that has to do with translations of messages and gui. Lars Ivar Igesundsome links:[snip] These are very good and informative, thanks a lot! uwe
May 18 2005
While KDE is based on Qt, it seems like they've expanded on the functionality, especially the part that has to do with translations of messages and gui.Hmm, they are using GNU gettext() instead of the Qt tr(). Perhaps it would be a good idea to follow at least one of those routes, instead of inventing something totally new. I like the KDE markup i18n("String to translate"). If i used that, all the existing tools (KBabel, Emacs PO mode) as well as string extractors and friends would already be available. But it will make the lib dependent on GNU gettext(), or i would have to write my own .mo reader. gettext() is nonstandard for Windows, right? Please, would anybody be so kind and explain to me how translation of user messages works under Windows (roughly)? I remember them using resource files. Does the application load the right resource file at runtime? And how does it work for the Mac? Thanks for the help! uwe
May 18 2005
This is a first implementation for conversion between UTF encodings. I used UTF-8 <=> UTF-16 as an example. In sum, this is what i thought of: char[] toUtf8(wchar[] str, inout size_t eaten, char[] buffer); char[] toUtf8(wchar[] str, inout size_t eaten); char[] toUtf8(wchar[] str, char[] buffer); char[] toUtf8(wchar[] str); * The first function converts str into UTF-8, beginning at str[eaten], adjusting eaten up to where it converted (stopping before an incomplete sequence at the end of str), and using buffer if large enough, reallocating the buffer if space is not sufficient. It throws an exception if faced with invalid input encoding. * The second function allocates a sufficient buffer itself. * The third function converts str as a whole, asserting on an incomplete sequence at the end of str. It uses buffer if possible. * The fourth function does like the third, and allocates the buffer itself. * For every function there is a variant called fast_toUtf8() with the same parameters which relies on valid input, producing invalid output otherwise. It can be used if the input is guaranteed to be valid, and is much faster in that case. For more explanations and a coding example visit: http://www.uwesalomon.de/code/unicode/files/conversion-d.html The source is at http://www.uwesalomon.de/code/unicode/conversion.d This is a draft, and i will be very happy if everyone who is interested comments on it, especially the API "design" (i know, fast_toUtf8() is a clumsy name :). And another question (i hope this is not arrogant): should these functions (or especially the simple form, without eaten) be included into Phobos std.utf? They are *much* faster than the current implementation. If someone would say, "Nice stuff, kiddo. Debug that properly, adjust it to the std.utf module (use their exception etc.) and submit a patch. Perhaps we will look at it then." i would sure do that. 
:) But i am afraid that these kinds of guerrilla actions are rather unwanted, and i should better keep my mouth shut and code some useful stuff... Thanks uwe
May 18 2005
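The eaten parameter's contract (convert what you can, stop before an incomplete trailing sequence, report how far you got) is the same idea incremental decoders implement elsewhere. A hedged sketch in Python's codecs module, purely to illustrate the semantics, not the proposed D API:

```python
import codecs

# Feed a UTF-8 byte stream that ends in the middle of a multi-byte
# sequence; the decoder converts what it can and buffers the incomplete
# tail, just as the proposed toUtf8/toUtf16 overloads stop before it.
decoder = codecs.getincrementaldecoder("utf-8")()

data = "héllo".encode("utf-8")    # b'h\xc3\xa9llo'
head, tail = data[:2], data[2:]   # split inside the 2-byte 'é' sequence

part1 = decoder.decode(head, final=False)  # 'h' -- the lone 0xc3 is buffered
part2 = decoder.decode(tail, final=True)   # 'éllo' -- sequence completed
assert part1 + part2 == "héllo"
```

In the D API the caller would instead see eaten advanced only past the complete sequences, and would pass the remaining slice in on the next call.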
"Uwe Salomon" <post uwesalomon.de> wrote in message news:op.sqyw3zok6yjbe6 sandmann.maerchenwald.net...[snip]Speeding up std.utf would be good - how can one argue with that? :-) Three thoughts come to mind: 1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8Unchecked to indicate to the user that it's not just a faster version of another routine (since I'd call fast_foo over foo every time!) but one that makes significant assumptions about the input. I'm not actually sure how often it would be ok to call such a function anyway so maybe it isn't even needed. Getting the wrong answer quickly is not a good trade-off. 2) it looks like you reallocate the output buffer inside the loop - can it be moved to outside? 3) the formatting of the source code is somewhat unusual. I missed the loop at first: // Now do the conversion. if (pIn < endIn) do { // Check for enough space left in the buffer. if (pOut >= endOut) [snip 50 lines of code or so] } while (++pIn < endIn); On that first line my eye skipped right over the "do", and I had to backtrack once I saw the "while" down at the bottom.
May 18 2005
1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8UncheckedYes, one of them sounds much better. I did not think long about fast_xxx()... Perhaps also toUtf8Unverified(), regrettably that is very long.I'm not actually sure how often it would be ok to call such a function anyway so maybe it isn't even needed. Getting the wrong answer quickly is not a good trade-off.You are right, that is an important fact, especially for a standard library. Easy test: i converted a German email (mostly ASCII, some special characters) with 5000 characters from UTF8 to UTF16. I provided the buffer, because both functions are equally good at allocating memory. Normal compilation: * safe function: 0.100 ms * unsafe function: 0.088 ms (12% faster) Compilation -release -O: * safe function: 0.050 ms * unsafe function: 0.046 ms (8% faster) I am not sure how all this could benefit from an assembler implementation. Anyways, the speed gain is minimal (actually, i thought it would be a lot more!). Well, no need to search for a good "unsafe" name then. ;)2) it looks like you reallocate the output buffer inside the loop - can it be moved to outside?Why? To shorten the loop? I thought the buffer should only be reallocated if the conversion itself shows it is too short. Do you want to move it before (so that a reallocation *cannot* occur inside the loop), or just outside (with a goto SomeWhereOutsideTheLoop and after the reallocation goto BackIntoTheLoop)?3) the formatting of the source code is somewhat unusual. I missed the loop at first.Changed. Thanks for the reply, uwe
May 18 2005
Normal compilation: * safe function: 0.100 ms * unsafe function: 0.088 ms (12% faster) Compilation -release -O: * safe function: 0.050 ms * unsafe function: 0.046 ms (8% faster)Maybe i should add that if you convert text which contains a lot of UTF8 2/3-byte encodings (Asian languages), the unsafe function saves more: about 20% in comparison to the safe function. uwe
May 18 2005
"Uwe Salomon" <post uwesalomon.de> wrote in message news:op.sqy2lzec6yjbe6 sandmann.maerchenwald.net...I could see using the unsafe versions when you check the input once and then convert many slices that one then knows to be safe. So it isn't unreasonable to have it in there. I don't know the use cases well enough to offer up an opinion.1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8UncheckedYes, one of them sounds much better. I did not think long about fast_xxx()... Perhaps also toUtf8Unverified(), regrettably that is very long.I'm not actually sure how often it would be ok to call such a function anyway so maybe it isn't even needed. Getting the wrong answer quickly is not a good trade-off.You are right, that is an important fact, especially for a standard library. Easy test: i converted a german email (mostly ASCII, some special characters) with 5000 characters from UTF8 to UTF16. I provided the buffer, because both functions are equally well at allocating memory. Normal compilation: * safe function: 0.100 ms * unsafe function: 0.088 ms (12% faster) Compilation -release -O: * safe function: 0.050 ms * unsafe function: 0.046 ms (8 % faster) I am not sure how all this could benefit from an assembler implementation. Anyways, the speed gain is minimal (actually, i thought it would be a lot more!). Well, no need to search for a good "unsafe" name then. ;)How about if it needs to grow the buffer it does so with a large chunk instead of many small chunks. That is, the buffer doesn't have to fit exactly. Basically I have in mind that you estimate the maximum buffer size based on the number of input characters left and allocate that.2) it looks like you reallocate the output buffer inside the loop - can it be moved to outside?Why? To shorten the loop? I thought the buffer should only be reallocated if the conversion itself shows it is too short. 
Do you want to move it before (so that a reallocation *cannot* occure inside the loop), or just outside (with a goto SomeWhereOutsideTheLoop and after the reallocation goto BackIntoTheLoop)?3) the formatting of the source code is somewhat unusual. I missed the loop at first.Changed. Thanks for the reply, uwe
May 18 2005
I could see using the unsafe versions when you check the input once and then convert many slices that one then knows to be safe. So it isn't unreasonable to have it in there. I don't know the use cases well enough to offer up an opinion.Imagine a program that reads a lot of files from disk, does some fuzzy work on them, and writes some others back, for example a doc tool. It reads the source files in UTF8 format and converts them to the internally used UTF16 (using the safe functions). It then processes here and there, extracts the comments and formats them round. After that it puts out HTML files in UTF8. The comments need to be converted back to UTF8, and that's where the program could use the unsafe functions. At least that were my thoughts. But if the speed gain is under 30%, i think the fast versions are unnecessary. Imagine the doc tool needs a minute for output. With the current functions this would drop to 50 seconds at most, providing that the output only consists of UTF conversion (which is very unlikely).Hmm, the current source is: if (pOut >= endOut) { // ... buffer.length = buffer.length + (endIn - pIn) + 2; // Will be enough. // ... } This will grow the buffer only once? (endIn - pIn) is the number of UTF8 characters to be processed, and they cannot expand to more than the same amount of UTF16 characters (1-byte encoded UTF8 becomes 1-word encoded UTF16, 4-byte encoded UTF8 becomes 2-word encoded UTF16). The same goes for toUtf8(). But you are right, this could still be moved before the loop, especially this one in toUtf16(). That's because (endIn - pIn) is a very accurate guess for languages with a lot of ASCII in them. Ciao uweHow about if it needs to grow the buffer it does so with a large chunk instead of many small chunks. That is, the buffer doesn't have to fit exactly. 
Basically I have in mind that you estimate the maximum buffer size based on the number of input characters left and allocate that.2) it looks like you reallocate the output buffer inside the loop - can it be moved to outside?Why? To shorten the loop? I thought the buffer should only be reallocated if the conversion itself shows it is too short. Do you want to move it before (so that a reallocation *cannot* occure inside the loop), or just outside (with a goto SomeWhereOutsideTheLoop and after the reallocation goto BackIntoTheLoop)?
May 18 2005
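The bound Uwe relies on, that (endIn - pIn) UTF-8 bytes can never produce more than that many UTF-16 code units, follows from the per-character arithmetic he gives: 1-byte UTF-8 becomes 1 code unit, 2- and 3-byte sequences also become 1 unit, and a 4-byte sequence becomes a 2-unit surrogate pair. A quick Python spot-check of the property itself (not of the D code):

```python
def utf16_units(s):
    """Number of UTF-16 code units needed to encode string s."""
    return len(s.encode("utf-16-le")) // 2

# 1-byte UTF-8 -> 1 unit; 2- and 3-byte UTF-8 -> 1 unit;
# 4-byte UTF-8 -> 2 units (surrogate pair). So UTF-16 code units
# never exceed UTF-8 bytes, for any mix of scripts.
for s in ["hello", "grüße", "日本語テキスト", "𝄞 clef", "mixed ασκ 𐍈"]:
    assert utf16_units(s) <= len(s.encode("utf-8"))
```

This is why allocating (endIn - pIn) output units up front guarantees a single (re)allocation at most, even though for mostly-ASCII input the estimate is nearly exact and for CJK-heavy input it overshoots by up to 3x.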
"Uwe Salomon" <post uwesalomon.de> wrote in message news:op.sqy4o0ud6yjbe6 sandmann.maerchenwald.net...sounds reasonableI could see using the unsafe versions when you check the input once and then convert many slices that one then knows to be safe. So it isn't unreasonable to have it in there. I don't know the use cases well enough to offer up an opinion.Imagine a program that reads a lot of files from disk, does some fuzzy work on them, and writes some others back, for example a doc tool. It reads the source files in UTF8 format and converts them to the internally used UTF16 (using the safe functions). It then processes here and there, extracts the comments and formats them round. After that it puts out HTML files in UTF8. The comments need to be converted back to UTF8, and that's where the program could use the unsafe functions. At least that were my thoughts. But if the speed gain is under 30%, i think the fast versions are unnecessary. Imagine the doc tool needs a minute for output. With the current functions this would drop to 50 seconds at most, providing that the output only consists of UTF conversion (which is very unlikely).ok - I didn't look at the details. I just saw the resizing happening in the loop and guessed it was resizing a little bit each time. What you have seems reasonable.Hmm, the current source is: if (pOut >= endOut) { // ... buffer.length = buffer.length + (endIn - pIn) + 2; // Will be enough. // ... } This will grow the buffer only once? (endIn - pIn) is the number of UTF8 characters to be processed, and they cannot expand to more than the same amount of UTF16 characters (1-byte encoded UTF8 becomes 1-word encoded UTF16, 4-byte encoded UTF8 becomes 2-word encoded UTF16). The same goes for toUtf8().How about if it needs to grow the buffer it does so with a large chunk instead of many small chunks. That is, the buffer doesn't have to fit exactly. 
Basically I have in mind that you estimate the maximum buffer size based on the number of input characters left and allocate that.2) it looks like you reallocate the output buffer inside the loop - can it be moved to outside?Why? To shorten the loop? I thought the buffer should only be reallocated if the conversion itself shows it is too short. Do you want to move it before (so that a reallocation *cannot* occure inside the loop), or just outside (with a goto SomeWhereOutsideTheLoop and after the reallocation goto BackIntoTheLoop)?But you are right, this could still be moved before the loop, especially this one in toUtf16(). That's because (endIn - pIn) is a very accurate guess for languages with a lot of ASCII in them. Ciao uwe
May 18 2005
Still you are right. I moved it out of the loop in toUtf16(). I will think about it in the other functions, not sure what is best in each case (well, it always depends on the characters in the string). I am now writing the other 4 functions (that is much easier now, as the two were the most complex). After finishing and testing them, i'll beep again. :) By the way... how are the Phobos docs generated? Hand-crafted? I will also update the corresponding sections if you let me... Ciao uweThis will grow the buffer only once? (endIn - pIn) is the number of UTF8 characters to be processed, and they cannot expand to more than the same amount of UTF16 characters (1-byte encoded UTF8 becomes 1-word encoded UTF16, 4-byte encoded UTF8 becomes 2-word encoded UTF16). The same goes for toUtf8().ok - I didn't look at the details. I just saw the resizing happening in the loop and guessed it was resizing a little bit each time. What you have seems reasonable.
May 18 2005
I have now moved the UTF conversion code into the std.utf module. I have made the following changes: * The tabs are now spaces. Sorry... :) * Slight change in the UTF8stride array. Unicode 4.0.1 declares some encodings illegal, including 5- and 6-byte encodings and some at the beginning of the 2-byte range. * Slight change in stride(wchar) and toUTFindex(wchar) and toUCSindex(wchar). I just changed the detection of UTF16 surrogate values to a faster variant that does not need a local variable as well. * Replacement of all toUTF() functions, except the ones that only validate because the return type has the same encoding as the parameter. toUTF16z() is still there as well, but changed to use my own toUTF16 (it zero-terminates the strings anyways). I have not changed the encode/decode functions, even though they really need some change (especially the UTF8 decode() function). I will happily do that, but i want to know first if my previous work is ok. Ciao uwe
May 21 2005
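The post doesn't show the "faster variant" of the surrogate detection, but a common trick for the UTF-16 surrogate range (an assumption about what was meant, not a quote of Uwe's code) replaces the two comparisons 0xD800 <= u <= 0xDFFF with a single masked compare, which also needs no temporary:

```python
def is_surrogate(u):
    """True if the 16-bit UTF-16 code unit u is a surrogate (high or low)."""
    # Surrogates occupy 0xD800-0xDFFF, i.e. all values whose top five
    # bits are 11011 -- one AND plus one compare instead of two compares.
    return (u & 0xF800) == 0xD800

# Exhaustively check against the plain range test over all 16-bit values.
for u in range(0x10000):
    assert is_surrogate(u) == (0xD800 <= u <= 0xDFFF)
```

The same mask technique distinguishes high from low surrogates with (u & 0xFC00) == 0xD800 versus 0xDC00, which is handy in the decode() path as well.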