digitalmars.D - UTF-8 issues
- Eldar Insafutdinov (3/3) Sep 15 2008 I faced some issues with utf-8 support in D.
- Walter Bright (3/12) Sep 15 2008 This should help:
- Chris R. Miller (6/9) Sep 15 2008 IIRC a char array in D will compress itself for ASCII-encodable
- Jarrett Billingsley (11/20) Sep 15 2008 It's called UTF-8, and it's supposed to work like that. That D does
- Lutger (4/9) Sep 16 2008 There's also std.string of course. What do you find so lacking? (just
- Jarrett Billingsley (7/16) Sep 16 2008 The lack of any way to index or slice a string according to codepoint
- Benji Smith (30/32) Sep 15 2008 The important thing to remember is that a string is absolutely NOT an
- Eldar Insafutdinov (3/54) Sep 15 2008 So this example is only correct in case of latin chars, but in general i...
- Benji Smith (3/11) Sep 15 2008 That's my understanding.
- Oskar Linde (19/26) Sep 16 2008 It is not wrong for UTF-8 strings. It just won't work for arbitrary
I faced some issues with UTF-8 support in D. As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic chars, the length is wrong, indexing doesn't work, and neither does slicing. But foreach works correctly. So UTF-8 support is partial. Maybe there are functions in the standard library that do this work? I checked the new features in D2 and found no improvements to UTF-8 support. Am I wrong?
Sep 15 2008
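The behaviour described in this post can be reproduced with a few lines of D. The following is only a minimal sketch: the helper name codePointLength is made up for illustration and is not a Phobos function. It relies on the language feature that foreach with a dchar loop variable decodes UTF-8 as it goes.

```d
import std.stdio : writeln;

/// Hypothetical helper: counts code points by letting foreach decode the
/// UTF-8 sequence. This walks the whole string, so it is O(n).
size_t codePointLength(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string latin = "hello";
    string cyrillic = "привет";

    writeln(latin.length);              // 5: one byte per ASCII letter
    writeln(cyrillic.length);           // 12: two bytes per Cyrillic letter
    writeln(codePointLength(cyrillic)); // 6: code points, not bytes

    // Iterating with dchar yields whole characters, one per line.
    foreach (dchar c; cyrillic)
        writeln(c);
}
```

So .length is not wrong as such; it reports UTF-8 code units (bytes) rather than characters.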
Eldar Insafutdinov wrote:
> I faced some issues with UTF-8 support in D. As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic chars, the length is wrong, indexing doesn't work, and neither does slicing. But foreach works correctly. So UTF-8 support is partial. Maybe there are functions in the standard library that do this work? I checked the new features in D2 and found no improvements to UTF-8 support. Am I wrong?

This should help: http://www.digitalmars.com/d/2.0/phobos/std_utf.html
Sep 15 2008
Eldar Insafutdinov wrote:
> I faced some issues with UTF-8 support in D. As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic chars, the length is wrong, indexing doesn't work, and neither does slicing. But foreach works correctly. So UTF-8 support is partial. Maybe there are functions in the standard library that do this work? I checked the new features in D2 and found no improvements to UTF-8 support. Am I wrong?

IIRC a char array in D will compress itself for ASCII-encodable characters, which destroys the integrity of the length variable. Well, it's still valid in terms of how long in words the array is, but in terms of real characters it's no longer valid. If you used a wchar or dchar things would be different.
Sep 15 2008
On Mon, Sep 15, 2008 at 2:38 PM, Chris R. Miller <lordsauronthegreat gmail.com> wrote:
> Eldar Insafutdinov wrote:
>> I faced some issues with UTF-8 support in D. As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic chars, the length is wrong, indexing doesn't work, and neither does slicing. But foreach works correctly. So UTF-8 support is partial. Maybe there are functions in the standard library that do this work? I checked the new features in D2 and found no improvements to UTF-8 support. Am I wrong?
>
> IIRC a char array in D will compress itself for ASCII-encodable characters, which destroys the integrity of the length variable. Well, it's still valid in terms of how long in words the array is, but in terms of real characters it's no longer valid.

It's called UTF-8, and it's supposed to work like that. That D does not provide some kind of interface for dealing with multibyte encodings (other than foreach and the encode/decode functions) is a failing on its part, not Unicode's. (Though it could be argued that multibyte encodings are stupid as hell, and I would agree with that.)

> If you used a wchar or dchar things would be different.

If he used dchar it'd be different. wchar still has multi-element encodings (surrogate pairs) for codepoints outside the BMP. Which, admittedly, are not that common, but it can still happen.
Sep 15 2008
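To illustrate the point about surrogate pairs, here is a small sketch that stores the same single code point in all three D string types. The choice of U+1D11E (musical symbol G clef) is just an example of a character outside the BMP; nothing in the thread refers to it.

```d
import std.stdio : writeln;

void main()
{
    // One code point outside the BMP, written with a \U escape.
    string  s8  = "\U0001D11E";  // UTF-8
    wstring s16 = "\U0001D11Ew"w.length == 0 ? ""w : "\U0001D11E"w; // UTF-16
    dstring s32 = "\U0001D11E"d; // UTF-32

    writeln(s8.length);  // 4 code units (bytes)
    writeln(s16.length); // 2 code units (a surrogate pair)
    writeln(s32.length); // 1 code unit
}
```

Only the dchar-based string has a length equal to the number of code points; wchar merely makes the multi-unit case rarer.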
Jarrett Billingsley wrote:
...
> It's called UTF-8, and it's supposed to work like that. That D does not provide some kind of interface for dealing with multibyte encodings (other than foreach and the encode/decode functions) is a failing on its part, not Unicode's.

There's also std.string of course. What do you find so lacking? (just curious)
Sep 16 2008
On Tue, Sep 16, 2008 at 4:57 PM, Lutger <lutger.blijdestijn gmail.com> wrote:
> Jarrett Billingsley wrote:
> ...
>> It's called UTF-8, and it's supposed to work like that. That D does not provide some kind of interface for dealing with multibyte encodings (other than foreach and the encode/decode functions) is a failing on its part, not Unicode's.
>
> There's also std.string of course. What do you find so lacking? (just curious)

The lack of any way to index or slice a string according to codepoint indices (instead of byte/short indices), get the length of a string in codepoints, or to find the nearest beginning character given an arbitrary character index. (std.string is also embarrassingly missing any functionality for wchar[] or dchar[] but that's a slightly different issue.)
Sep 16 2008
Eldar Insafutdinov wrote:
> I faced some issues with UTF-8 support in D.

The important thing to remember is that a string is absolutely NOT an array of characters, and you can't treat it as such. As you've noticed, a char[] string is actually an array of UTF-8 encoded bytes. Iterating directly through that array is extremely touchy and error-prone. Instead, always use the standard library functions.

D1/Tango:
http://dsource.org/projects/tango/docs/current/tango.text.Util.html
http://dsource.org/projects/tango/docs/current/tango.text.convert.Utf.html

D1/Phobos:
http://digitalmars.com/d/1.0/phobos/std_utf.html

D2/Phobos:
http://digitalmars.com/d/2.0/phobos/std_utf.html

Although the libraries do a decent job of hiding the ugly details, my opinion (which is not very popular around here) is that D's string processing is a major design flaw.

> As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic chars, the length is wrong, indexing doesn't work, and neither does slicing.

Indexing, slicing, and length calculation of D strings are based on byte position, not character position. Character-position indexing and slicing are only possible by iterating from the beginning of the string, decoding the characters on the fly, and keeping track of the number of bytes used by each character. That's what the standard library functions basically do.

Calculating the actual character length of the string is fundamentally the same as in C, where strings are null-terminated (i.e., you can't determine the actual length of the string until you've iterated from the beginning to the end).

The Phobos & Tango libraries handle all of those details for you, but I think it's important to know what's going on behind the scenes, so that you have a rough idea of the true cost of each operation.

--benji
Sep 15 2008
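The iterate-and-decode procedure described in this post might look roughly like the sketch below. It assumes std.utf.stride from D2 Phobos, which returns the number of code units in the sequence starting at a given index; the helper names byteOffset and sliceByCodePoint are invented for this example and are not library functions.

```d
import std.stdio : writeln;
import std.utf : stride;

/// Hypothetical helper: byte offset of the code point with index n.
/// Walks from the start of the string, so the cost is O(n).
size_t byteOffset(string s, size_t n)
{
    size_t i = 0;
    while (n > 0 && i < s.length)
    {
        i += stride(s, i); // bytes used by the character starting at i
        --n;
    }
    return i;
}

/// Hypothetical helper: slice by code point indices [from, to).
string sliceByCodePoint(string s, size_t from, size_t to)
{
    return s[byteOffset(s, from) .. byteOffset(s, to)];
}

void main()
{
    string s = "привет мир";
    writeln(s.length);                   // 19 bytes
    writeln(sliceByCodePoint(s, 7, 10)); // мир
}
```

Each call rescans the string from the beginning, which is the "true cost" mentioned above: character-position access on a UTF-8 string is linear, not constant time.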
Benji Smith wrote:
> Eldar Insafutdinov wrote:
>> I faced some issues with UTF-8 support in D.
>
> The important thing to remember is that a string is absolutely NOT an array of characters, and you can't treat it as such. As you've noticed, a char[] string is actually an array of UTF-8 encoded bytes. Iterating directly through that array is extremely touchy and error-prone. Instead, always use the standard library functions.
>
> D1/Tango:
> http://dsource.org/projects/tango/docs/current/tango.text.Util.html
> http://dsource.org/projects/tango/docs/current/tango.text.convert.Utf.html
>
> D1/Phobos:
> http://digitalmars.com/d/1.0/phobos/std_utf.html
>
> D2/Phobos:
> http://digitalmars.com/d/2.0/phobos/std_utf.html
>
> Although the libraries do a decent job of hiding the ugly details, my opinion (which is not very popular around here) is that D's string processing is a major design flaw.
>
>> As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic chars, the length is wrong, indexing doesn't work, and neither does slicing.
>
> Indexing, slicing, and length calculation of D strings are based on byte position, not character position. Character-position indexing and slicing are only possible by iterating from the beginning of the string, decoding the characters on the fly, and keeping track of the number of bytes used by each character. That's what the standard library functions basically do.
>
> Calculating the actual character length of the string is fundamentally the same as in C, where strings are null-terminated (i.e., you can't determine the actual length of the string until you've iterated from the beginning to the end).
>
> The Phobos & Tango libraries handle all of those details for you, but I think it's important to know what's going on behind the scenes, so that you have a rough idea of the true cost of each operation.
>
> --benji

Yeah, I know that these operations work with bytes rather than chars. But it is stated explicitly here, http://www.digitalmars.com/d/2.0/cppstrings.html, that strings support slicing:

> D has the array slice syntax, not possible with C++:
> char[] s1 = "hello world";
> char[] s2 = s1[6 .. 11]; // s2 is "world"

So this example is only correct in case of latin chars, but in general it is wrong for UTF-8 strings.
Sep 15 2008
Eldar Insafutdinov wrote:
> Yeah, I know that these operations work with bytes rather than chars. But it is stated explicitly here, http://www.digitalmars.com/d/2.0/cppstrings.html, that strings support slicing:
>
>> D has the array slice syntax, not possible with C++:
>> char[] s1 = "hello world";
>> char[] s2 = s1[6 .. 11]; // s2 is "world"
>
> So this example is only correct in case of latin chars, but in general it is wrong for UTF-8 strings.

That's my understanding.

--benji
Sep 15 2008
Eldar Insafutdinov wrote:
>> D has the array slice syntax, not possible with C++:
>> char[] s1 = "hello world";
>> char[] s2 = s1[6 .. 11]; // s2 is "world"
>
> So this example is only correct in case of latin chars, but in general it is wrong for UTF-8 strings.

It is not wrong for UTF-8 strings. It just won't work for arbitrary indices. But I don't think you will ever use arbitrary indices. All indices will be the result of other string functions (such as find) which behave correctly for UTF-8 strings. Incrementing/decrementing can be done using std.utf or similar. UTF-8 also makes it very easy to determine if an arbitrary position in a UTF-8 sequence lies at the start or in the middle of a multi-byte encoded character.

Indexing a UTF-8 string by character rather than byte index is horribly inefficient. As others have said, if you really need to do that, use dchar[] (1). Although, I've never personally come across a place where I needed that.

1) Be aware that you will need to make sure your data is of a composed Unicode normal form, otherwise it could still use several code points (2) to represent a single grapheme.

2) A code point is a point in the Unicode codespace, which is what a dchar encodes.

-- Oskar
Sep 16 2008
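The remark that UTF-8 makes it easy to tell the start of a character from the middle rests on the encoding itself: every continuation byte has the bit pattern 10xxxxxx. A minimal sketch follows; isContinuation and alignToStart are illustrative names rather than Phobos functions, and indexOf is assumed from std.string in recent Phobos releases (the thread itself only mentions find).

```d
import std.stdio : writeln;
import std.string : indexOf;

/// Hypothetical helper: true if the byte at index i is a UTF-8
/// continuation byte (bit pattern 10xxxxxx). Requires i < s.length.
bool isContinuation(const(char)[] s, size_t i)
{
    return (s[i] & 0xC0) == 0x80;
}

/// Hypothetical helper: nearest index <= i that starts an encoded character.
size_t alignToStart(const(char)[] s, size_t i)
{
    while (i > 0 && isContinuation(s, i))
        --i;
    return i;
}

void main()
{
    string s = "привет мир";
    writeln(isContinuation(s, 1)); // true: second byte of the first letter
    writeln(alignToStart(s, 11));  // 10: index 11 is inside the sixth letter

    // Indices returned by search functions already fall on character
    // boundaries, so slicing with them is safe.
    auto pos = s.indexOf("мир");
    if (pos >= 0)
        writeln(s[pos .. $]);      // мир
}
```

Since a sequence is at most four bytes long, alignToStart steps back at most three bytes on valid input, so finding the nearest boundary is cheap even though counting characters is not.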