digitalmars.D.learn - Char & the Extended ascii set
- Era Scarecrow (34/34) Jan 28 2012 It there any support for the extended ascii characters? (128-255). I un...
- Era Scarecrow (2/25) Jan 28 2012 Yeah, and while I'm finding more often then not what is breaking the Un...
Is there any support for the extended ascii characters (128-255)? I understand unicode is important, however working with some data and programs that don't support it, I am getting a problem where the program throws an exception because the data isn't valid utf-8. Do I have to handle it all as bytes/ubytes? If I do, then I lose out on many char specific functions. Alternatively I can rely on the C functions, but I want to avoid using them if I can.

Example: note the raw data below, being 39 vs -110

this._ID = "SPEL_wulfharth's cups"
rhs._ID = "SPEL_wulfharth▒s cups"

this._ID = [83, 80, 69, 76, 95, 119, 117, 108, 102, 104, 97, 114, 116, 104, 39, 115, 32, 99, 117, 112, 115, 0]
rhs._ID = [83, 80, 69, 76, 95, 119, 117, 108, 102, 104, 97, 114, 116, 104, -110, 115, 32, 99, 117, 112, 115, 0]

I have compiled a table for the appropriate conversions to proper unicode, which you can then use in reverse to get the data back to its previous state. However, I'm not sure.
//referenced from http://ascii-table.com/ascii-extended-pc-list.php
wchar[128] convertAsciiExtended = [
    0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
    0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
    0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
    0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,
    0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,
    0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,
    0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
    0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
    0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
    0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
    0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
    0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
    0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4,
    0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229,
    0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,
    0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0];
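For illustration, here is a minimal sketch of how a table like the one above might be applied in both directions. The function names fromExtendedAscii and toExtendedAscii are hypothetical, not part of Phobos, and the table is passed in as a parameter so the snippet compiles on its own. It assumes bytes below 128 are plain ASCII and that every byte from 128 to 255 indexes the table at (byte - 128).

```d
// Hypothetical helper: decode legacy bytes into a UTF-8 string.
// Assumes `table` has 128 entries covering bytes 128-255.
string fromExtendedAscii(const(ubyte)[] raw, const(wchar)[] table)
{
    string result;
    foreach (b; raw)
    {
        if (b < 0x80)
            result ~= cast(char) b;                  // plain ASCII, copy as-is
        else
            result ~= cast(dchar) table[b - 0x80];   // appending a dchar encodes it as UTF-8
    }
    return result;
}

// Hypothetical helper: the reverse lookup, from UTF-8 back to legacy bytes.
// Throws if a character has no entry in the table.
ubyte[] toExtendedAscii(const(char)[] text, const(wchar)[] table)
{
    ubyte[] result;
    foreach (dchar c; text)                          // decodes UTF-8 code point by code point
    {
        if (c < 0x80)
        {
            result ~= cast(ubyte) c;
            continue;
        }
        ptrdiff_t idx = -1;
        foreach (i, w; table)                        // linear search; the table is small
        {
            if (w == c) { idx = i; break; }
        }
        if (idx < 0)
            throw new Exception("character has no extended ascii equivalent");
        result ~= cast(ubyte)(idx + 0x80);
    }
    return result;
}
```

With the table from the post this would be called as fromExtendedAscii(data, convertAsciiExtended[]). Whether that table matches the encoding the source file really uses is a separate question that the sketch does not answer.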
Jan 28 2012
char is UTF-8 by definition, and D code is free to assume that that's the case. A lot of the string processing code in Phobos will throw if you give it ill-formed unicode. Now, you can put whatever you want in a char, but don't expect other D code to handle it correctly.

The only support in Phobos for dealing with alternate encodings is std.encoding. It currently supports "UTF-8, UTF-16, UTF-32, ASCII, ISO-8859-1 (also known as LATIN-1), and WINDOWS-1252." So, if you can get that to do the conversions that you want, then there you go, but otherwise you're on your own. Regardless, you need to convert your chars to proper UTF-8 if you want other D code (and especially Phobos) to handle them correctly.

Yeah, and I'm finding that, more often than not, what is breaking the Unicode is likely duplicates and errors in the source file (which is at least 10 years old, too). Since the bad formatting is sparse and rarely gets in the way, I've tried making a custom compare function that uses the phobos code but catches the exception when the UTF is badly formed, then converts the data and tries the compare again. The source format doesn't mark which fields are text; that has to be worked out from context when a field is needed, so converting everything to proper unicode up front would be wasted work 75%-95% of the time.
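A rough sketch of that convert-only-when-needed idea, using std.encoding for the conversion. The helper name toUtf8 is hypothetical, and the sketch assumes that anything which fails UTF-8 validation is Windows-1252; if the data is really in another code page (such as the one in the table earlier in the thread), the conversion step would need a custom table instead. It validates up front rather than catching the exception from the compare itself, which is a slightly different arrangement from the one described above.

```d
import std.encoding : transcode, Windows1252String;
import std.utf : validate, UTFException;

// Hypothetical helper: return `raw` as well-formed UTF-8.
// Assumption: input that is not valid UTF-8 is Windows-1252.
string toUtf8(const(ubyte)[] raw)
{
    auto text = cast(const(char)[]) raw;
    try
    {
        validate(text);      // throws UTFException on ill-formed UTF-8
        return text.idup;    // common case: already valid, no conversion
    }
    catch (UTFException)
    {
        string fixed;
        // Reinterpret the bytes as Windows-1252 and transcode to UTF-8.
        transcode(cast(Windows1252String) raw.idup, fixed);
        return fixed;
    }
}
```

Strings that are already valid pay only for the validation pass, so the conversion cost is limited to the rare fields that actually contain legacy bytes.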
Jan 28 2012