
digitalmars.D - Working with utf

reply "Simen Haugen" <simen norstat.no> writes:
I hate it!

Say we have a string "øl". When I read this from a text file, it is two
chars, but since it is not a UTF-8 string, I have to convert it to UTF-8
before I can do any string operations on it.
I can easily live with that. But say we have a file with several lines, and
it's important that all lines are of equal length.
The string "ol" is two chars, but the string "øl" is three chars in UTF-8.
Because of this I have to convert it back to Latin-1 before checking
lengths. The same applies to slicing, only worse.
For all I care, "ø" is one character, not two. If I slice "øl" to get the
first character, I only get the first half of it. Isn't it more
obvious that all string manipulation should treat each UTF-8 character as one
character instead of two for values greater than 127?

I cannot find any nice solution for this, and have to convert to and from
Latin-1/UTF-8 all the time.

There must be a better way...
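
A minimal D sketch of the mismatch (assuming D1-era std.utf; the source file
itself is UTF-8, so the literal below is already encoded, and the variable
names are just for illustration):
---
import std.stdio;
import std.utf;

void main()
{
    char[] utf8 = "øl";                     // 3 code units: 'ø' takes two
    dchar[] utf32 = std.utf.toUTF32(utf8);  // 2 elements: one per character

    writefln("utf8.length  = %d", utf8.length);
    writefln("utf32.length = %d", utf32.length);

    // Slicing the char[] at [0 .. 1] cuts 'ø' in half;
    // slicing the dchar[] keeps characters whole.
    writefln("%s", std.utf.toUTF8(utf32[0 .. 1]));  // prints "ø"
}
---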
Jun 14 2007
next sibling parent reply Regan Heath <regan netmail.co.nz> writes:
Simen Haugen Wrote:
 I hate it!
 
 Say we have a string "øl". When I read this from a text file, it is two 
 chars, but since this is no utf8 string, I have to convert it to utf8 before 
 I can do any string operations on it.
 I can easily live with that. Say we have a file with several lines, and its 
 important that all lines are of equal length.
 The string "ol" is two chars, but the string "øl" is 3 chars in utf8. 
 Because of this I have to convert it back to latin-1 before checking 
 lengths. The same applies to slicing, but even worse.
 For all I care, "ø" is one character, not two. If I slice "ø" to get the 
 first character, I only get the first half of the character. Isn't it more 
 obvious that all string manipulation works with all utf8 characters as one 
 character instead of two for values greater than 127?
 
 I cannot find any nice solutions for this, and have to convert to and from 
 latin-1/utf8 all the time.
 
 There must be a better way...
I think what we want for this is a String class which internally stores the
data as UTF-8, -16 or -32 (making its own decision or being told which to use)
and provides slicing by characters as opposed to code units.

Then, all you need is to convert from Latin-1 to String, do all your work with
String, and convert back to Latin-1 only if/when you need to write it back to
a file or similar.

My gut feeling is that this functionality belongs in a class and not in the
language itself. After all, you may want/need to manipulate UTF-8, -16 or -32
code units directly for some reason.

Regan Heath
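
A minimal sketch of that idea, assuming it simply stores dchar[] internally
(the name CodePointString and its methods are made up for illustration, not an
existing library):
---
import std.utf;

// Illustration only: indexing and slicing are per code point, not per code unit.
struct CodePointString
{
    private dchar[] data;

    static CodePointString fromUTF8(char[] s)
    {
        CodePointString r;
        r.data = std.utf.toUTF32(s);
        return r;
    }

    size_t length() { return data.length; }

    dchar opIndex(size_t i) { return data[i]; }

    CodePointString opSlice(size_t lo, size_t hi)
    {
        CodePointString r;
        r.data = data[lo .. hi];
        return r;
    }

    char[] toUTF8() { return std.utf.toUTF8(data); }
}
---
Latin-1 in and out would then just be a byte-for-byte widening/narrowing at
the edges.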
Jun 14 2007
next sibling parent "Simen Haugen" <simen norstat.no> writes:
"Regan Heath" <regan netmail.co.nz> wrote in message 
news:f4rd7m$dlb$1 digitalmars.com...
 I think what we want for this is a String class which internally stores 
 the data as utf-8, 16 or 32 (making it's own decision or being told which 
 to use) and provides slicing of characters as opposed to codpoints.

 Then, all you need is to convert from latin-1 to String, do all your work 
 with String and convert back to latin-1 only if/when you need to write it 
 back to a file or similar.

 My gut feeling is that this functionality belongs in a class and not the 
 language itself.  After all, you may want/need to manipulate utf-8, 16, or 
 32 codepoints directly for some reason.

 Regan Heath
That would have been a very nice addition. I can't even count how many
hard-to-find bugs I've had because of this (both slicing and length).

UTF-8 and slicing are both supported by the language, right? To me it sounds
more like a bug that they won't work together, as I tend to trust that
language features work.
Jun 14 2007
prev sibling parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Regan Heath wrote:
 I think what we want for this is a String class which internally stores the
data as utf-8, 16 or 32 (making it's own decision or being told which to use)
and provides slicing of characters as opposed to codpoints.  
Your time away from D (6 months, was it?) is showing... There's such a string
implementation at http://www.dprogramming.com/dstring.php (though IIRC it's a
struct, not a class ;) ).

Features:
* Indexing and slicing always work on code point indices, not code units.
* The contents are stored as chars, wchars or dchars (whichever is
sufficiently large to store every code point in the string as a single code
unit).
* The size taken to store a string instance is equal to (char[]).sizeof:
2 * size_t.sizeof.
* The upper two bits of the size_t containing the length are used as a flag
for what type the pointer refers to (char/wchar/dchar). This cuts the maximum
length to a quarter, but that still allows strings up to 1 GiB on 32-bit
machines, and much bigger on 64-bit machines.

It can still be a problem that sometimes multiple code points are needed to
encode one "logical" character: the extra ones are for diacritics (accents).
The above-mentioned size limitation can theoretically be a problem too, but
probably a rare one. (When was the last time you needed more than a billion
characters in a single string?)

Basically, though, this allows you to pretend it's a dchar[] without the
memory penalty (it only uses dchars internally if you use characters outside
the BMP (> 0xFFFF), which are rarely needed in non-Asian languages).
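
A rough sketch of that length-tagging trick in isolation (this is not
dstring's actual code, just the general idea, with made-up names):
---
// Sketch: keep a 2-bit encoding tag in the top bits of the length word,
// so the whole struct stays the size of a char[] (pointer + length).
const size_t TAG_SHIFT = size_t.sizeof * 8 - 2;
const size_t LEN_MASK  = (cast(size_t) 1 << TAG_SHIFT) - 1;

struct TaggedStr
{
    size_t taggedLen;  // 0 = char, 1 = wchar, 2 = dchar in the top two bits
    void*  ptr;

    size_t length() { return taggedLen & LEN_MASK; }
    uint   kind()   { return cast(uint) (taggedLen >> TAG_SHIFT); }
}
---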
Jun 14 2007
prev sibling next sibling parent reply Derek Parnell <derek psych.ward> writes:
On Thu, 14 Jun 2007 14:40:02 +0200, Simen Haugen wrote:

 I hate it!
 
 Say we have a string "øl". When I read this from a text file, it is two 
 chars, but since this is no utf8 string, I have to convert it to utf8 before 
 I can do any string operations on it.
 I can easily live with that. Say we have a file with several lines, and its 
 important that all lines are of equal length.
 The string "ol" is two chars, but the string "øl" is 3 chars in utf8. 
 Because of this I have to convert it back to latin-1 before checking 
 lengths. The same applies to slicing, but even worse.
 For all I care, "ø" is one character, not two. If I slice "ø" to get the 
 first character, I only get the first half of the character. Isn't it more 
 obvious that all string manipulation works with all utf8 characters as one 
 character instead of two for values greater than 127?
 
 I cannot find any nice solutions for this, and have to convert to and from 
 latin-1/utf8 all the time.
 
 There must be a better way...
Convert to UTF-32 (dchar[]), then do your stuff and convert back to Latin-1
when you're done. Each dchar[] element is a single character.

--
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
Jun 14 2007
parent reply "Simen Haugen" <simen norstat.no> writes:
"Derek Parnell" <derek psych.ward> wrote in message 
news:n1j2izm4a0x5.413sc7jjzk2x.dlg 40tude.net...
 Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1
 when you're done. Each dchar[] element is a single character.
You're kidding me, right? Then I only have to convert to UTF-32 when reading
a file, and back to Latin-1 when writing. That's great! (Except I have to
change a lot of char[] to dchar[].)
Jun 14 2007
parent reply Derek Parnell <derek psych.ward> writes:
On Thu, 14 Jun 2007 15:13:35 +0200, Simen Haugen wrote:

 "Derek Parnell" <derek psych.ward> wrote in message 
 news:n1j2izm4a0x5.413sc7jjzk2x.dlg 40tude.net...
 Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1
 when you're done. Each dchar[] element is a single character.
You're kidding me, right? Then I only have to convert to utf-32 when reading a file, and back to latin-1 when writing. Thats great! (except I have to modify a lot of char[] to dchar[])
dchar[] Y;
char[] Z;

Y = std.utf.toUTF32(Z);

--
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
Jun 14 2007
parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Derek Parnell wrote:
 On Thu, 14 Jun 2007 15:13:35 +0200, Simen Haugen wrote:
 
 "Derek Parnell" <derek psych.ward> wrote in message 
 news:n1j2izm4a0x5.413sc7jjzk2x.dlg 40tude.net...
 Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1
 when you're done. Each dchar[] element is a single character.
You're kidding me, right? Then I only have to convert to utf-32 when reading a file, and back to latin-1 when writing. Thats great! (except I have to modify a lot of char[] to dchar[])
 dchar[] Y;
 char[] Z;
 Y = std.utf.toUTF32(Z);
Except his input is encoded as Latin-1, not UTF-8. Conversion is still
trivial though:
---
auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
dchar[] utf = new dchar[](latin1.length);
for (size_t i = 0; i < latin1.length; i++) {
    utf[i] = latin1[i];
}
---
and the other way around.
(The first 256 code points of Unicode are identical to Latin-1.)
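
A sketch of "the other way around" (assuming every code point in the string
actually fits in Latin-1; the function name is made up):
---
ubyte[] toLatin1(dchar[] utf)
{
    ubyte[] latin1 = new ubyte[](utf.length);
    for (size_t i = 0; i < utf.length; i++) {
        assert(utf[i] <= 0xFF, "code point not representable in Latin-1");
        latin1[i] = cast(ubyte) utf[i];
    }
    return latin1;
}
---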
Jun 14 2007
next sibling parent reply "Simen Haugen" <simen norstat.no> writes:
I tested this now, and it works like a charm. This means I can finally get
rid of all my conversions between UTF-8 and Latin-1! (Together with all those
hidden bugs.)

Thanks a lot for all your help.

"Frits van Bommel" <fvbommel REMwOVExCAPSs.nl> wrote in message 
news:f4rh01$lkt$2 digitalmars.com...
 Except his input is encoded as Latin-1, not UTF-8. Conversion is still 
 trivial though:
 ---
 auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
 dchar[] utf = new dchar[](latin1.length);
 for(size_t i = 0; i < latin1.length; i++) {
     utf[i] = latin1[i];
 }
 ---
 and the other way around.
 (The first 256 code points of Unicode are identical to Latin-1) 
Jun 14 2007
next sibling parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Simen Haugen wrote:
 I tested this now, and it works like a charm. This means I can finally get 
 rid of all my convertions between utf8 and latin1! (together with all these 
 hidden bugs)
 
 Thanks a lot for all your help.
 
 "Frits van Bommel" <fvbommel REMwOVExCAPSs.nl> wrote in message 
 news:f4rh01$lkt$2 digitalmars.com...
 Except his input is encoded as Latin-1, not UTF-8. Conversion is still 
 trivial though:
 ---
 auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
 dchar[] utf = new dchar[](latin1.length);
 for(size_t i = 0; i < latin1.length; i++) {
     utf[i] = latin1[i];
 }
 ---
 and the other way around.
 (The first 256 code points of Unicode are identical to Latin-1) 
If you only ever need to represent Latin-1 (but need string functions, not
just array functions), wchar[] will also work, and only takes half the memory.
If you don't need string functions, of course, you can just keep it as
ubyte[]s the whole time.

(By "string functions" I mean stuff like case conversions, console output and
so on. In particular, note that slicing & indexing work on all arrays, not
just strings.)
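
A sketch of the wchar[] variant, analogous to the dchar[] loop quoted above
(the function name is made up for illustration):
---
import std.file;

wchar[] readLatin1(char[] filename)
{
    ubyte[] latin1 = cast(ubyte[]) std.file.read(filename);
    wchar[] text = new wchar[](latin1.length);
    for (size_t i = 0; i < latin1.length; i++) {
        text[i] = latin1[i];  // every Latin-1 value fits in a single wchar
    }
    return text;
}
---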
Jun 14 2007
prev sibling parent "Simen Haugen" <simen norstat.no> writes:
Except that most functions in the string library take a char[] and not a
dchar[] as parameter. Then I still have to convert to UTF-8 whenever I want
to use those functions, and I'm back where I started.
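
For instance, every such call ends up wrapped in a round trip like this (a
sketch assuming Phobos' std.string.toupper, which takes and returns char[];
the wrapper name is made up):
---
import std.string;
import std.utf;

dchar[] upperCase(dchar[] s)
{
    // Convert to UTF-8 just to reach the char[]-only routine, then back again.
    return std.utf.toUTF32(std.string.toupper(std.utf.toUTF8(s)));
}
---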

"Simen Haugen" <simen norstat.no> wrote in message 
news:f4riug$rrt$1 digitalmars.com...
I tested this now, and it works like a charm. This means I can finally get 
rid of all my convertions between utf8 and latin1! (together with all these 
hidden bugs)

 Thanks a lot for all your help.

 "Frits van Bommel" <fvbommel REMwOVExCAPSs.nl> wrote in message 
 news:f4rh01$lkt$2 digitalmars.com...
 Except his input is encoded as Latin-1, not UTF-8. Conversion is still 
 trivial though:
 ---
 auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
 dchar[] utf = new dchar[](latin1.length);
 for(size_t i = 0; i < latin1.length; i++) {
     utf[i] = latin1[i];
 }
 ---
 and the other way around.
 (The first 256 code points of Unicode are identical to Latin-1)
Jun 14 2007
prev sibling parent reply Derek Parnell <derek psych.ward> writes:
On Thu, 14 Jun 2007 15:48:50 +0200, Frits van Bommel wrote:

 Derek Parnell wrote:
 On Thu, 14 Jun 2007 15:13:35 +0200, Simen Haugen wrote:
 
 "Derek Parnell" <derek psych.ward> wrote in message 
 news:n1j2izm4a0x5.413sc7jjzk2x.dlg 40tude.net...
 Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1
 when you're done. Each dchar[] element is a single character.
You're kidding me, right? Then I only have to convert to utf-32 when reading a file, and back to latin-1 when writing. Thats great! (except I have to modify a lot of char[] to dchar[])
 dchar[] Y;
 char[] Z;
 Y = std.utf.toUTF32(Z);
Except his input is encoded as Latin-1, not UTF-8.
I read the OP as saying he was already converting Latin-1 to UTF-8 and was
now concerned about converting UTF-8 to UTF-32, thus I gave that toUTF32()
hint.
 Conversion is still 
 trivial though:
 ---
 auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
 dchar[] utf = new dchar[](latin1.length);
 for(size_t i = 0; i < latin1.length; i++) {
      utf[i] = latin1[i];
 }
 ---
 and the other way around.
 (The first 256 code points of Unicode are identical to Latin-1)
I was not aware of that. So if one needs to convert from Latin-1 to UTF-8 ...

import std.utf;

dchar[] Latin1toUTF32(ubyte[] pLatin1Text)
{
    dchar[] utf;
    utf.length = pLatin1Text.length;
    foreach (i, b; pLatin1Text)
        utf[i] = b;
    return utf;
}

char[] Latin1toUTF8(ubyte[] pLatin1Text)
{
    return std.utf.toUTF8(Latin1toUTF32(pLatin1Text));
}

import std.stdio;

void main()
{
    ubyte[] td;
    td.length = 256;
    for (int i = 0; i < 256; i++)
        td[i] = cast(ubyte) i;

    // On Windows, set the code page to 65001
    // and the font to Lucida Console.
    // e.g.  C:\> chcp 65001
    //       Active code page: 65001
    std.stdio.writefln("%s", Latin1toUTF8(td));
}

--
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
Jun 14 2007
parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Derek Parnell wrote:
 On Thu, 14 Jun 2007 15:48:50 +0200, Frits van Bommel wrote:
 
 (The first 256 code points of Unicode are identical to Latin-1)
 I was not aware of that. So if one needs to convert from Latin-1 to UTF-8 ...

 import std.utf;

 dchar[] Latin1toUTF32(ubyte[] pLatin1Text)
 {
     dchar[] utf;
     utf.length = pLatin1Text.length;
     foreach (i, b; pLatin1Text)
         utf[i] = b;
     return utf;
 }

 char[] Latin1toUTF8(ubyte[] pLatin1Text)
 {
     return std.utf.toUTF8(Latin1toUTF32(pLatin1Text));
 }
That'd work, but will allocate more memory than required (5 to 6 times the
length of the Latin-1 text worth of allocation: 4 times for the UTF-32, plus
1 to 2 times for the UTF-8). How about this:
---
import std.utf;

char[] Latin1toUTF8(ubyte[] lat1)
{
    char[] utf8;

    // Preallocate, then reset the length to 0 so the capacity is reused
    // when appending.
    utf8.length = lat1.length;
    /* Optionally preallocate up to 2 * lat1.length characters instead
       (you'll never need more than that). */
    utf8.length = 0;

    foreach (latchar; lat1) {
        std.utf.encode(utf8, latchar);
    }
    return utf8;
}
---
This should allocate 1 to 3 times the length of the Latin-1 text: 1 time the
length as the initial allocation, plus a doubling on reallocation if there are
any non-ASCII characters. (If I remember the allocation policy correctly.)
It'll be 2 times the Latin-1 length if you preallocate that much beforehand.

All memory allocation sizes calculated above exclude whatever extra memory the
allocator adds to get a nice round bin size, of course, so this is more of an
estimate; it'll likely be a bit more.
Jun 14 2007
prev sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Simen Haugen skrev:
 I hate it!
 
 Say we have a string "øl". When I read this from a text file, it is two 
 chars, but since this is no utf8 string, I have to convert it to utf8 before 
 I can do any string operations on it.
 I can easily live with that. Say we have a file with several lines, and its 
 important that all lines are of equal length.
 The string "ol" is two chars, but the string "øl" is 3 chars in utf8. 
 Because of this I have to convert it back to latin-1 before checking 
 lengths. The same applies to slicing, but even worse.
 For all I care, "ø" is one character, not two. If I slice "ø" to get the 
 first character, I only get the first half of the character. Isn't it more 
 obvious that all string manipulation works with all utf8 characters as one 
 character instead of two for values greater than 127?
 
 I cannot find any nice solutions for this, and have to convert to and from 
 latin-1/utf8 all the time.
 
 There must be a better way...
The solution is simple. If all your data is Latin-1, and your requirements are
stated in the form of "number of Latin-1 units", just use Latin-1 as the
encoding.

typedef ubyte latin1_char;
alias latin1_char[] latin1_string;

/Oskar
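
A usage sketch of that typedef (the filename and the write-back step are
illustrative only):
---
import std.file;

typedef ubyte latin1_char;
alias latin1_char[] latin1_string;

void main()
{
    latin1_string line =
        cast(latin1_string) std.file.read("some_latin-1_file.txt");

    // One element per character, so .length and slicing behave as expected.
    if (line.length > 0) {
        latin1_string first = line[0 .. 1];
        std.file.write("first-char.txt", cast(void[]) first);
    }
}
---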
Jun 14 2007