digitalmars.D - Working with utf
- Simen Haugen (16/16) Jun 14 2007 I hate it!
- Regan Heath (5/24) Jun 14 2007 I think what we want for this is a String class which internally stores ...
- Simen Haugen (7/17) Jun 14 2007 That would have been a very nice addition. I cannot even count how many
- Frits van Bommel (24/25) Jun 14 2007 Your time away from D (6 months was it?) is showing...
- Derek Parnell (8/27) Jun 14 2007 Convert to utf32 (dchar[]) then do your stuff and convert back to latin-...
- Simen Haugen (5/7) Jun 14 2007 You're kidding me, right? Then I only have to convert to utf-32 when rea...
- Derek Parnell (9/17) Jun 14 2007 dchar[] Y;
- Frits van Bommel (12/26) Jun 14 2007 Except his input is encoded as Latin-1, not UTF-8. Conversion is still
- Simen Haugen (6/17) Jun 14 2007 I tested this now, and it works like a charm. This means I can finally g...
- Frits van Bommel (9/28) Jun 14 2007 If you only ever need to represent Latin-1 (but need string functions,
- Simen Haugen (6/24) Jun 14 2007 Except that most functions in the string library takes a char and not dc...
- Derek Parnell (37/65) Jun 14 2007 I read the OP as saying he was already converting Latin-1 to utf8 and wa...
- Frits van Bommel (27/50) Jun 14 2007 That'd work, but will allocate more memory than required (5 to 6 times
- Oskar Linde (7/26) Jun 14 2007 The solution is simple. If all your data is latin-1, and your
I hate it! Say we have a string "øl". When I read this from a text file, it is two chars, but since this is not a utf8 string, I have to convert it to utf8 before I can do any string operations on it. I can easily live with that.

Say we have a file with several lines, and it's important that all lines are of equal length. The string "ol" is two chars, but the string "øl" is 3 chars in utf8. Because of this I have to convert it back to latin-1 before checking lengths.

The same applies to slicing, but even worse. For all I care, "ø" is one character, not two. If I slice "ø" to get the first character, I only get the first half of the character. Isn't it more obvious for string manipulation to treat each utf8 character as one character instead of two for values greater than 127?

I cannot find any nice solution for this, and have to convert to and from latin-1/utf8 all the time. There must be a better way...
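The length and slicing mismatch described here can be demonstrated in a few lines; this is a minimal sketch, assuming D1-era Phobos where std.utf.toUTF32 converts char[] to dchar[]:

```d
import std.stdio;
import std.utf;

void main()
{
    char[] s = "øl";             // UTF-8: 'ø' is two code units, 'l' is one
    writefln("%d", s.length);    // prints 3, not 2 -- length counts code units
    // s[0 .. 1] would be only half of 'ø': an invalid UTF-8 fragment.

    dchar[] d = toUTF32(s);      // UTF-32: one array element per code point
    writefln("%d", d.length);    // prints 2
    // d[0 .. 1] is exactly "ø" -- slicing by character now works.
}
```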
Jun 14 2007
Simen Haugen Wrote:

> I hate it! Say we have a string "øl". When I read this from a text file, it is two chars, but since this is not a utf8 string, I have to convert it to utf8 before I can do any string operations on it. I can easily live with that. Say we have a file with several lines, and it's important that all lines are of equal length. The string "ol" is two chars, but the string "øl" is 3 chars in utf8. Because of this I have to convert it back to latin-1 before checking lengths. The same applies to slicing, but even worse. For all I care, "ø" is one character, not two. If I slice "ø" to get the first character, I only get the first half of the character. I cannot find any nice solution for this, and have to convert to and from latin-1/utf8 all the time. There must be a better way...

I think what we want for this is a String class which internally stores the data as utf-8, 16 or 32 (making its own decision or being told which to use) and provides slicing of characters as opposed to codepoints. Then, all you need is to convert from latin-1 to String, do all your work with String and convert back to latin-1 only if/when you need to write it back to a file or similar.

My gut feeling is that this functionality belongs in a class and not the language itself. After all, you may want/need to manipulate utf-8, 16, or 32 codepoints directly for some reason.

Regan Heath
Jun 14 2007
"Regan Heath" <regan netmail.co.nz> wrote in message news:f4rd7m$dlb$1 digitalmars.com...

> I think what we want for this is a String class which internally stores the data as utf-8, 16 or 32 (making its own decision or being told which to use) and provides slicing of characters as opposed to codepoints. Then, all you need is to convert from latin-1 to String, do all your work with String and convert back to latin-1 only if/when you need to write it back to a file or similar. My gut feeling is that this functionality belongs in a class and not the language itself. After all, you may want/need to manipulate utf-8, 16, or 32 codepoints directly for some reason.

That would have been a very nice addition. I cannot even count how many hard-to-find bugs I've had because of this (both slicing and length). Utf8 and slicing are supported by the language, right? To me it sounds more like a bug that these won't work together, as I tend to trust that language features work.
Jun 14 2007
Regan Heath wrote:

> I think what we want for this is a String class which internally stores the data as utf-8, 16 or 32 (making its own decision or being told which to use) and provides slicing of characters as opposed to codepoints.

Your time away from D (6 months was it?) is showing... There's such a string implementation at http://www.dprogramming.com/dstring.php (Though IIRC it's a struct, not a class ;) )

Features:
* Indexing and slicing always work on code point indices, not code units.
* Content is stored as chars, wchars or dchars (whichever is sufficiently large to store every code point in the string as a single code unit).
* The size taken to store a string instance is equal to (char[]).sizeof: 2 * size_t.sizeof.
* The upper two bits of the size_t containing the length are used as a flag for what type the pointer refers to (char/wchar/dchar). This cuts the maximum length to a quarter, but that still allows strings up to 1 GiB on 32-bit machines, and much bigger on 64-bit machines.

It can still be a problem that sometimes multiple code points are needed to encode one "logical" character; the extra ones are for diacritics (accents). The above-mentioned size limitation can theoretically be a problem, but is probably a rare one. (When was the last time you needed more than a billion characters in a single string?)

Basically though, this allows you to pretend it's a dchar[] without the memory penalty (it only uses dchars internally if you use characters outside the BMP (> 0xFFFF), which are rarely needed in non-Asian languages).
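The tag-bits trick described above can be sketched roughly like this. These are hypothetical names; the real dstring implementation at the link above differs in detail:

```d
// Sketch: store the code-unit width in the top two bits of the length
// field, so the struct stays the same size as a plain char[].
struct TaggedString
{
    size_t taggedLength; // top 2 bits: 0 = char, 1 = wchar, 2 = dchar
    void*  data;         // points at char[], wchar[] or dchar[] storage

    // Mask off the tag bits to recover the actual length.
    size_t length() { return taggedLength & (size_t.max >> 2); }

    // Shift the tag bits down to see which element type `data` holds.
    uint width() { return taggedLength >> (size_t.sizeof * 8 - 2); }
}
```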
Jun 14 2007
On Thu, 14 Jun 2007 14:40:02 +0200, Simen Haugen wrote:

> I hate it! Say we have a string "øl". When I read this from a text file, it is two chars, but since this is not a utf8 string, I have to convert it to utf8 before I can do any string operations on it. I can easily live with that. Say we have a file with several lines, and it's important that all lines are of equal length. The string "ol" is two chars, but the string "øl" is 3 chars in utf8. Because of this I have to convert it back to latin-1 before checking lengths. The same applies to slicing, but even worse. For all I care, "ø" is one character, not two. If I slice "ø" to get the first character, I only get the first half of the character. I cannot find any nice solution for this, and have to convert to and from latin-1/utf8 all the time. There must be a better way...

Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1 when you're done. Each dchar[] element is a single character.

-- 
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
Jun 14 2007
"Derek Parnell" <derek psych.ward> wrote in message news:n1j2izm4a0x5.413sc7jjzk2x.dlg 40tude.net...

> Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1 when you're done. Each dchar[] element is a single character.

You're kidding me, right? Then I only have to convert to utf-32 when reading a file, and back to latin-1 when writing. That's great! (except I have to modify a lot of char[] to dchar[])
Jun 14 2007
On Thu, 14 Jun 2007 15:13:35 +0200, Simen Haugen wrote:

> "Derek Parnell" <derek psych.ward> wrote in message news:n1j2izm4a0x5.413sc7jjzk2x.dlg 40tude.net...
>> Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1 when you're done. Each dchar[] element is a single character.
> You're kidding me, right? Then I only have to convert to utf-32 when reading a file, and back to latin-1 when writing. That's great! (except I have to modify a lot of char[] to dchar[])

dchar[] Y;
char[] Z;
Y = std.utf.toUTF32(Z);

-- 
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
Jun 14 2007
Derek Parnell wrote:

> On Thu, 14 Jun 2007 15:13:35 +0200, Simen Haugen wrote:
>> You're kidding me, right? Then I only have to convert to utf-32 when reading a file, and back to latin-1 when writing. That's great! (except I have to modify a lot of char[] to dchar[])
>
> dchar[] Y;
> char[] Z;
> Y = std.utf.toUTF32(Z);

Except his input is encoded as Latin-1, not UTF-8. Conversion is still trivial though:

---
auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
dchar[] utf = new dchar[](latin1.length);
for (size_t i = 0; i < latin1.length; i++)
{
    utf[i] = latin1[i];
}
---

and the other way around. (The first 256 code points of Unicode are identical to Latin-1)
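The "other way around" mentioned above is just as short, though it only succeeds when every code point actually fits in Latin-1. A sketch (the function name and error handling are illustrative, not from the thread):

```d
// Convert UTF-32 back to Latin-1. Code points above 0xFF have no
// Latin-1 representation, so reject them explicitly instead of
// silently truncating.
ubyte[] toLatin1(dchar[] utf)
{
    ubyte[] latin1 = new ubyte[](utf.length);
    foreach (i, c; utf)
    {
        if (c > 0xFF)
            throw new Exception("code point not representable in Latin-1");
        latin1[i] = cast(ubyte) c;
    }
    return latin1;
}
```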
Jun 14 2007
I tested this now, and it works like a charm. This means I can finally get rid of all my conversions between utf8 and latin1! (together with all these hidden bugs) Thanks a lot for all your help.

"Frits van Bommel" <fvbommel REMwOVExCAPSs.nl> wrote in message news:f4rh01$lkt$2 digitalmars.com...

> Except his input is encoded as Latin-1, not UTF-8. Conversion is still trivial though:
>
> ---
> auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
> dchar[] utf = new dchar[](latin1.length);
> for (size_t i = 0; i < latin1.length; i++)
> {
>     utf[i] = latin1[i];
> }
> ---
>
> and the other way around. (The first 256 code points of Unicode are identical to Latin-1)
Jun 14 2007
Simen Haugen wrote:

> I tested this now, and it works like a charm. This means I can finally get rid of all my conversions between utf8 and latin1! (together with all these hidden bugs) Thanks a lot for all your help.

If you only ever need to represent Latin-1 (but need string functions, not just array functions), wchar[] will also work, and only take half the memory. If you don't need string functions, of course, you can just keep it as ubyte[]s the whole time.

(By "string functions" I mean stuff like case conversions, console output and so on. In particular, note that slicing & indexing work on all arrays, not just strings.)
Jun 14 2007
Except that most functions in the string library take a char and not a dchar as parameter. Then I still have to convert to utf8 whenever I want to use the functions, and then I'm no better off.

"Simen Haugen" <simen norstat.no> wrote in message news:f4riug$rrt$1 digitalmars.com...

> I tested this now, and it works like a charm. This means I can finally get rid of all my conversions between utf8 and latin1! (together with all these hidden bugs) Thanks a lot for all your help.
Jun 14 2007
On Thu, 14 Jun 2007 15:48:50 +0200, Frits van Bommel wrote:

> Derek Parnell wrote:
>> dchar[] Y;
>> char[] Z;
>> Y = std.utf.toUTF32(Z);
> Except his input is encoded as Latin-1, not UTF-8.

I read the OP as saying he was already converting Latin-1 to utf8 and was now concerned about converting utf8 to utf32, thus I gave that toUTF32() hint.

> Conversion is still trivial though:
>
> ---
> auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
> dchar[] utf = new dchar[](latin1.length);
> for (size_t i = 0; i < latin1.length; i++)
> {
>     utf[i] = latin1[i];
> }
> ---
>
> and the other way around. (The first 256 code points of Unicode are identical to Latin-1)

I was not aware of that. So if one needs to convert from Latin-1 to utf8 ...

import std.utf;

dchar[] Latin1toUTF32(ubyte[] pLatin1Text)
{
    dchar[] utf;
    utf.length = pLatin1Text.length;
    foreach (i, b; pLatin1Text)
        utf[i] = b;
    return utf;
}

char[] Latin1toUTF8(ubyte[] pLatin1Text)
{
    return std.utf.toUTF8(Latin1toUTF32(pLatin1Text));
}

import std.stdio;

void main()
{
    ubyte[] td;
    td.length = 256;
    for (int i = 0; i < 256; i++)
        td[i] = cast(ubyte) i;

    // On Windows, set the code page to 65001
    // and the font to Lucida Console.
    // eg. C:\> chcp 65001
    //     Active code page: 65001
    std.stdio.writefln("%s", Latin1toUTF8(td));
}

-- 
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
Jun 14 2007
Derek Parnell wrote:

> On Thu, 14 Jun 2007 15:48:50 +0200, Frits van Bommel wrote:
>> (The first 256 code points of Unicode are identical to Latin-1)
>
> I was not aware of that. So if one needs to convert from Latin-1 to utf8 ...
>
> import std.utf;
>
> dchar[] Latin1toUTF32(ubyte[] pLatin1Text)
> {
>     dchar[] utf;
>     utf.length = pLatin1Text.length;
>     foreach (i, b; pLatin1Text)
>         utf[i] = b;
>     return utf;
> }
>
> char[] Latin1toUTF8(ubyte[] pLatin1Text)
> {
>     return std.utf.toUTF8(Latin1toUTF32(pLatin1Text));
> }

That'd work, but will allocate more memory than required (5 to 6 times the length of the Latin-1 text worth of allocation - 4 times for the utf-32, plus 1 to 2 times for the utf-8). How about this:

---
import std.utf;

char[] Latin1toUTF8(ubyte[] lat1)
{
    char[] utf8;
    // preallocate
    utf8.length = lat1.length;
    /* optionally preallocate up to 2 * lat1.length characters
       instead (you'll never need more than that). */
    utf8.length = 0;
    foreach (latchar; lat1)
    {
        encode(utf8, latchar);
    }
    return utf8;
}
---

This should allocate 1 to 3 times the length of the Latin-1 text: 1 time the length as initial allocation, plus a doubling on reallocation if there are any non-ascii characters. (If I remember the allocation policy correctly.) It'll be 2 times the Latin-1 length if you preallocate that beforehand.

All memory allocation sizes calculated above exclude whatever extra memory the allocator adds to get a nice round bin size of course, so this is more of an estimate; it'll likely be a bit more.
Jun 14 2007
Simen Haugen skrev:

> I hate it! Say we have a string "øl". When I read this from a text file, it is two chars, but since this is not a utf8 string, I have to convert it to utf8 before I can do any string operations on it. I can easily live with that. Say we have a file with several lines, and it's important that all lines are of equal length. The string "ol" is two chars, but the string "øl" is 3 chars in utf8. Because of this I have to convert it back to latin-1 before checking lengths. The same applies to slicing, but even worse. For all I care, "ø" is one character, not two. If I slice "ø" to get the first character, I only get the first half of the character. I cannot find any nice solution for this, and have to convert to and from latin-1/utf8 all the time. There must be a better way...

The solution is simple. If all your data is latin-1, and your requirements are stated in the form of "number of latin-1 units", just use latin-1 as the encoding.

typedef ubyte latin1_char;
alias latin1_char[] latin1_string;

/Oskar
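With that typedef, .length and slicing behave exactly as the original post wants, since every Latin-1 character is one array element. A minimal sketch (assuming D1, where typedef creates a distinct type):

```d
typedef ubyte latin1_char;
alias latin1_char[] latin1_string;

void main()
{
    latin1_string s;
    s.length = 2;
    s[0] = cast(latin1_char) 0xF8; // 'ø' in Latin-1
    s[1] = cast(latin1_char) 'l';

    assert(s.length == 2);         // one element per character
    assert(s[0 .. 1].length == 1); // slicing one character works;
                                   // no half-character fragments
}
```

The distinct typedef (rather than a bare ubyte[]) also keeps Latin-1 data from being accidentally passed to functions expecting UTF-8 char[].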
Jun 14 2007