digitalmars.D - Chars and Strs
- Anders F Björklund (15/15) Feb 11 2005 Here's another long documentation essay,
- Andrew Fedoniouk (40/55) Feb 11 2005 Hi, Anders,
- Anders F Björklund (7/24) Feb 11 2005 What I meant to say was that *surrogates* need to be in pairs...
- Roald Ribe (15/17) Feb 12 2005 Most of the WIN32 API has two entries for each function. The 8 bit chara...
Here's another long documentation essay, on the other "missing" D type: strings...
http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs

I'll add some D sample code on how to convert to and from legacy encodings (manually) later.
http://www.algonet.se/~afb/d/mapping.zip
(using ftp://ftp.unicode.org/Public/MAPPINGS/)

And some character tables for US-ASCII and Latin-1,
http://www.algonet.se/~afb/d/latin1/iso-8859-1.html

Also needed is how to talk to the Windows console,
http://www.digitalmars.com/techtips/windows_utf.html
But that can wait until after I get back from vacation :-)

Any comments can only make it better, here or on Wiki4D...

Share and Enjoy,
--anders
Feb 11 2005
Hi, Anders,

I am looking at:

"All UTF-16 code units from 0xD800-0xDFFF are similarly just "surrogates" for a real code point, and *must be occur in pairs that can then be combined to form the real Unicode code unit*. The lower byte of the code units 0x0000-0x00FF are exactly the same as the ISO-8859-1 encoding, and 0x00-0x7F is the same as ASCII. They are also called "wide characters", by some operating systems."

Stuff in *...* (my mark) is, technically speaking, not the case. UTF-16 corresponds to UCS-2 (Basic Multilingual Plane, BMP) and does not need to "occur in pairs". It depends on the use case: program A supports only UCS-2 and program B supports UCS-4.

(BMP) The first plane defined in Unicode/ISO 10646, designed to include all scripts in active modern use. The BMP currently includes the Latin, Greek, Cyrillic, Devanagari, hiragana, katakana, and Cherokee scripts, among others, and a large body of mathematical, APL-related, and other miscellaneous characters. Most of the Han ideographs in current use are present in the BMP, but due to the large number of ideographs, many were placed in the Supplementary Ideographic Plane.

Windows natively uses UCS-2 (win32::wchar_t, "widechar") and only a couple of functions there (AFAIK) can treat widechars as sequences of UTF-16 codes. All modern browsers have UCS-2 as their internal representation. JavaScript and Java are also UCS-2-only languages (by their specs).

--------------------------------------------------------------------

I've found the D way of treating strings as char[], wchar[] and dchar[] pretty reasonable, as it allows working with text in the most optimal way.

The only thing I am not sure about yet: a string as an entity has its own methods. It is pretty traditional these days to use strings as objects: s.substr(1,4). But in fact strings are atomic types, so they should be handled like any other native type, e.g. int. Personally I think that substr(s,1,4) is more "honest" than s.substr(1,4). There are some aesthetic concerns, though.
But as deeper I am looking in the "string sense. E.g. the inability to work with strings as sequences (arrays) of characters is a source of many bottlenecks in these languages.

Andrew Fedoniouk.
http://terrainformatica.com
Feb 11 2005
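To make the UCS-2 versus UTF-16 distinction in the post above concrete, here is a minimal C sketch. It is not from the thread, and the function name `utf16_count_code_points` is made up for illustration. A UCS-2-only program effectively treats every 16-bit unit as one character, while a UTF-16-aware program counts a high surrogate followed by a low surrogate as a single code point. Counting an unpaired surrogate as one item is an assumption of this sketch, not a rule.

```c
#include <stddef.h>
#include <stdint.h>

/* High (leading) surrogates occupy 0xD800-0xDBFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Low (trailing) surrogates occupy 0xDC00-0xDFFF. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Count code points in a buffer of `len` UTF-16 code units.
 * A well-formed high+low pair counts as one code point.
 * Assumption of this sketch: an unpaired surrogate counts as one item.
 */
size_t utf16_count_code_points(const uint16_t *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (is_high_surrogate(s[i]) && i + 1 < len && is_low_surrogate(s[i + 1]))
            i++; /* skip the low half of the pair */
        count++;
    }
    return count;
}
```

For the four code units 0x0041, 0xD834, 0xDD1E, 0x00E4, a UCS-2 view sees four "characters", while the function above returns 3, because 0xD834 0xDD1E is the surrogate pair for U+1D11E.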
Andrew Fedoniouk wrote:

> "All UTF-16 code units from 0xD800-0xDFFF are similarly just "surrogates" for a real code point, and *must be occur in pairs that can then be combined to form the real Unicode code unit*. The lower byte of the code units 0x0000-0x00FF are exactly the same as the ISO-8859-1 encoding, and 0x00-0x7F is the same as ASCII. They are also called "wide characters", by some operating systems."
>
> Stuff in *...* (my mark) is, technically speaking, not the case.

Hmm, doesn't even seem to be a real sentence :-) "must be occur"

> UTF-16 corresponds to UCS-2 (Basic Multilingual Plane, BMP) and does not need to "occur in pairs". It depends on the use case: program A supports only UCS-2 and program B supports UCS-4.

What I meant to say was that *surrogates* need to be in pairs... (0xD800-0xDFFF)
Not all the other individual UTF-16 code units.
Got it from http://www.unicode.org/faq/utf_bom.html#UTF16

> Windows natively uses UCS-2 (win32::wchar_t, "widechar") and only a couple of functions there (AFAIK) can treat widechars as sequences of UTF-16 codes. All modern browsers have UCS-2 as their internal representation. JavaScript and Java are also UCS-2-only languages (by their specs).

Right, the "wide characters" should be mentioned down by the Z stuff...

--anders
Feb 11 2005
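The pairing rule for surrogates mentioned above can be sketched in C as follows. This is an illustration rather than code from the thread. The function names are hypothetical, and returning `UINT32_MAX` for an ill-formed pair is a choice made for this sketch. The arithmetic itself is the standard UTF-16 mapping: subtract 0x10000 from the code point, put the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate.

```c
#include <assert.h>
#include <stdint.h>

/* High surrogate for a supplementary code point (U+10000..U+10FFFF). */
uint16_t utf16_high_surrogate(uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

/* Low surrogate for a supplementary code point (U+10000..U+10FFFF). */
uint16_t utf16_low_surrogate(uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return (uint16_t)(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

/*
 * Combine a high and a low surrogate into one code point.
 * Returns UINT32_MAX (a sentinel chosen for this sketch) when the two
 * units are not a high surrogate followed by a low surrogate.
 */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low)
{
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
        return UINT32_MAX;
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10) | (uint32_t)(low - 0xDC00));
}
```

For example, U+1D11E splits into 0xD834 and 0xDD1E, and `utf16_decode_pair(0xD834, 0xDD1E)` gives back 0x1D11E. Swapping the two units, or passing an ordinary BMP unit such as 0x0041 as either half, yields the sentinel.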
Andrew Fedoniouk wrote:

> Windows natively uses UCS-2 (win32::wchar_t, "widechar") and only a couple of functions there (AFAIK) can treat widechars as sequences of UTF-16 codes.

Most of the WIN32 API has two entries for each function. The 8-bit character API functions have A appended to their names, and the 16-bit character functions have W appended. This is how each application can choose which API to use.

* On NT-based kernels the 16-bit char is what is used natively, and the A APIs just convert from the currently selected codepage into Unicode before calling the W API.
* On 9x/Me kernels, the W APIs come as redistributable DLLs (apps can include them in their installer). On these systems the W API just converts the Unicode strings/chars into the current codepage (where possible) and then calls the native A APIs.

So to conclude: most (all?) of the WIN32 API is available in both 8-bit and 16-bit versions. The only exception may be WIN32s, but I do not think anyone uses that for new software releases (if they ever did).

Roald
Feb 12 2005
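The NT-style arrangement described above, where the A entry point converts and then forwards to the W entry point, can be sketched in portable C. Everything here is hypothetical: `demo_count_non_ascii_w` and `demo_count_non_ascii_a` are invented names, not Win32 functions, and the sketch assumes the "current codepage" is ISO-8859-1, whose bytes map directly onto the code units 0x0000-0x00FF. On real Windows the conversion would depend on the active code page (for instance via `MultiByteToWideChar`).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "W" entry point: the native implementation, working on
 * NUL-terminated 16-bit code units. Counts units outside the ASCII range. */
size_t demo_count_non_ascii_w(const uint16_t *s)
{
    size_t count = 0;
    for (; *s != 0; s++)
        if (*s >= 0x80)
            count++;
    return count;
}

/* Widen a NUL-terminated ISO-8859-1 string to 16-bit code units.
 * Each byte becomes the code unit with the same value.
 * Returns a malloc'ed buffer the caller must free, or NULL on failure. */
uint16_t *latin1_to_utf16(const char *s)
{
    size_t n = strlen(s);
    uint16_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;
    for (size_t i = 0; i <= n; i++) /* <= n copies the terminator too */
        w[i] = (unsigned char)s[i];
    return w;
}

/* Hypothetical "A" entry point: convert from the (assumed Latin-1)
 * codepage, then forward to the W implementation. Returns 0 if the
 * conversion buffer cannot be allocated. */
size_t demo_count_non_ascii_a(const char *s)
{
    uint16_t *w = latin1_to_utf16(s);
    if (w == NULL)
        return 0;
    size_t result = demo_count_non_ascii_w(w);
    free(w);
    return result;
}
```

The 9x/Me arrangement is the mirror image: the W wrapper would narrow each string to the current codepage where possible and then call the native A implementation.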