digitalmars.D - Inconsistency
- nickles (7/7) Oct 13 2013 Why does <string>.length return the number of bytes and not the
- Dicebot (4/12) Oct 13 2013 Because `length` must be O(1) operation for built-in arrays and
- Dmitry Olshansky (7/14) Oct 13 2013 ???
- nickles (7/10) Oct 13 2013 I do not agree:
- Dicebot (2/9) Oct 13 2013 Because you have a wrong understanding of what "length" means.
- nickles (3/3) Oct 13 2013 Ok, if my understanding is wrong, how do YOU measure the length of
- David Nadlinger (6/8) Oct 13 2013 Depends on how you define the "length" of a string. Doing that is
- Sönke Ludwig (11/13) Oct 13 2013 The thing is that even count(), which gives you the number of *code
- nickles (19/19) Oct 13 2013 Ok, I understand, that "length" is - obviously - used in analogy
- Michael (1/5) Oct 13 2013 First index is zero, no?
- Sönke Ludwig (12/23) Oct 13 2013 This will _not_ return a trailing surrogate of a Cyrillic letter. It
- nickles (23/38) Oct 13 2013 Both are not true for UTF-32. There is no interpretation (except
- Dicebot (13/25) Oct 13 2013 Ironically, reason is consistency. `string` is just
- Kagamin (3/8) Oct 15 2013 No, he needs graphemes, so `std.algorithm` won't work correctly
- anonymous (7/26) Oct 13 2013 This is not about endianness. It's "\u00E4" vs "a\u0308". The
- Peter Alexander (26/29) Oct 13 2013 You are correct in that UTF-8 is endian agnostic, but I don't
- Temtaime (17/17) Oct 13 2013 I've found another inconsistency problem.
- deadalnix (2/19) Oct 13 2013 The first one is made to interface with C. It is a special case.
- Andrej Mitrovic (2/6) Oct 13 2013 http://d.puremagic.com/issues/show_bug.cgi?id=6032
- Maxim Fomin (12/25) Oct 13 2013 This is impossible given current design. At runtime "säд"[2] is
- monarch_dodra (14/27) Oct 13 2013 I think the root misunderstanding is that you think that a string
- deadalnix (3/5) Oct 13 2013 That isn't an analogy. It is usually a good idea to try to
- nickles (6/6) Oct 14 2013 It's easy to state this, but - please - don't get sarcastic!
- Andrei Alexandrescu (21/22) Oct 14 2013 Thanks for making this point.
- Kagamin (3/7) Oct 15 2013 Most code doesn't need to count graphemes and lives happily with
- qznc (6/13) Oct 16 2013 Most code might be buggy then.
- Chris (9/23) Oct 16 2013 Now that you mention it, I had a program that would send strings
- monarch_dodra (3/29) Oct 16 2013 I'm not sure this is a "D" issue though: It's a fact of unicode
- Chris (6/37) Oct 16 2013 My point was it would have been nice to have a native D function
- Maxim Fomin (7/38) Oct 16 2013 As I argued previously, it is implementation issue which treats
- Jacob Carlborg (4/9) Oct 16 2013 Why would it require two code points?
- qznc (9/20) Oct 16 2013 It is either [U+00E4] as one code point or [a,U+0308] for two
- Jacob Carlborg (4/11) Oct 16 2013 Aha, now I see.
- monarch_dodra (18/31) Oct 16 2013 One of the interesting points, is with "ba\u00E4r" vs
- qznc (6/40) Oct 16 2013 I agree with your point. Nevertheless you understanding of
- Dmitry Olshansky (4/42) Oct 16 2013 --
- monarch_dodra (2/6) Oct 16 2013 Ah. Learn something new every day. :)
- Kagamin (4/9) Oct 20 2013 And on Windows it's case-insensitive - 2^^N variants of each
- Chris (6/23) Oct 14 2013 I recently discovered a bug in my program. If you take the letter
- Dmitry Olshansky (12/14) Oct 13 2013 It's all there:
- Sönke Ludwig (5/20) Oct 13 2013 But you have to take care to normalize the string WRT diacritics if the
- Maxim Fomin (39/49) Oct 13 2013 This is not the only inconsistency here.
- ilya-stromberg (4/12) Oct 13 2013 Technically, UTF-16 can contain 2 ushort's for 1 character, so
Why does <string>.length return the number of bytes and not the number of UTF-8 characters, whereas <wstring>.length and <dstring>.length return the number of UTF-16 and UTF-32 characters? Wouldn't it be more consistent to have <string>.length return the number of UTF-8 characters as well (instead of having to use std.utf.count(<string>))?
Oct 13 2013
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:Why does <string>.length return the number of bytes and not the number of UTF-8 characters, whereas <wstring>.length and <dstring>.length return the number of UTF-16 and UTF-32 characters? Wouldn't it be more consistent to have <string>.length return the number of UTF-8 characters as well (instead of having to use std.utf.count(<string>))?Because `length` must be an O(1) operation for built-in arrays, and for UTF-8 strings it would require storing an additional length field, making it binary incompatible with other array types.
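To make the distinction concrete, a minimal sketch (walkLength from std.range is the generic O(n) way to count the decoded elements):

import std.stdio;
import std.utf : count;
import std.range : walkLength;

void main()
{
    string s = "säд";
    writeln(s.length);     // 5 -- code units (bytes), O(1), stored with the array
    writeln(count(s));     // 3 -- code points, O(n), has to decode the UTF-8
    writeln(s.walkLength); // 3 -- the same count via the range interface
}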
Oct 13 2013
13-Oct-2013 16:36, nickles wrote:Why does <string>.length return the number of bytes and not the number of UTF-8 characters, whereas <wstring>.length and <dstring>.length return the number of UTF-16 and UTF-32 characters???? This is simply wrong. All strings return the number of code units. And it's only UTF-32 where a code point (~ character) happens to fit into one code unit.Wouldn't it be more consistent to have <string>.length return the number of UTF-8 characters as well (instead of having to use std.utf.count(<string>))?It's consistent as is. -- Dmitry Olshansky
Oct 13 2013
This is simply wrong. All strings return number of codeunits. And it's only UTF-32 where codepoint (~ character) happens to fit into one codeunit.I do not agree: writeln("säд".length); => 5 chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8) writeln(std.utf.count("säд")) => 3 chars: 5 (ibidem) writeln("säд"w.length); => 3 chars: 6 (2 x 3, UTF-16) writeln("säд"d.length); => 3 chars: 12 (4 x 3, UTF-32) This is not consistent - from my point of view.
Oct 13 2013
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:I do not agree: writeln("säд".length); => 5 chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8) writeln(std.utf.count("säд")) => 3 chars: 5 (ibidem) writeln("säд"w.length); => 3 chars: 6 (2 x 3, UTF-16) writeln("säд"d.length); => 3 chars: 12 (4 x 3, UTF-32) This is not consistent - from my point of view.Because you have a wrong understanding of what "length" means.
Oct 13 2013
Ok, if my understanding is wrong, how do YOU measure the length of a string? Do you always use count(), or is there an alternative?
Oct 13 2013
On Sunday, 13 October 2013 at 13:25:08 UTC, nickles wrote:Ok, if my understandig is wrong, how do YOU measure the length of a string?Depends on how you define the "length" of a string. Doing that is surprisingly difficult once the full variety of Unicode code points comes into play, even if you ignore the question of encoding (UTF-8, UTF-16, …). David
Oct 13 2013
On 13.10.2013 15:25, nickles wrote:Ok, if my understanding is wrong, how do YOU measure the length of a string? Do you always use count(), or is there an alternative?The thing is that even count(), which gives you the number of *code points*, isn't necessarily what is desired - that is, the number of actual display characters. UTF is quite a complex beast and doing any operations on it _correctly_ generally requires a lot of care. If you need to do these kinds of operations, I would highly recommend reading up on the basics of UTF and Unicode first (quick overview on Wikipedia: <http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings>). arr.length is meant to be used in conjunction with array indexing and slicing (arr[...]) and its value is consistent for all string and array types for this purpose.
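A small sketch of that gap between code points and what you see on screen (the decomposed spelling of "ä" is chosen for illustration):

import std.stdio;
import std.utf : count;

void main()
{
    string s = "a\u0308"; // 'a' followed by a combining diaeresis; displays as a single "ä"
    writeln(s.length);    // 3 -- code units
    writeln(count(s));    // 2 -- code points, still not the one display character you see
}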
Oct 13 2013
Ok, I understand, that "length" is - obviously - used in analogy to any array's length value. Still, this seems to be inconsistent. D elaborates on implementing "char"s as UTF-8 which means that a "char" in D can be of any length between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't then this (i.e. the character's length) be the "unit of measurement" for "char"s - like e.g. the size of the underlying struct in an array of "struct"s? The story continues with indexing "string"s: In a consistent implementation, shouldn't writeln("säд"[2]) return "д" instead of the trailing surrogate of this cyrillic letter? Btw. how do YOU implement this for "string" (for "dstring" it works - logically, for "wstring" the same problem arises for code points above D800)? Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?
Oct 13 2013
implementation, shouldn't writeln("säд"[2]) return "д" instead of the trailing surrogate of this cyrillic letter?First index is zero, no?
Oct 13 2013
On 13.10.2013 16:14, nickles wrote:Ok, I understand, that "length" is - obviously - used in analogy to any array's length value. Still, this seems to be inconsistent. D elaborates on implementing "char"s as UTF-8 which means that a "char" in D can be of any length between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't then this (i.e. the character's length) be the "unit of measurement" for "char"s - like e.g. the size of the underlying struct in an array of "struct"s? The story continues with indexing "string"s: In a consistent implementation, shouldn't writeln("säд"[2]) return "д" instead of the trailing surrogate of this cyrillic letter?This will _not_ return a trailing surrogate of a Cyrillic letter. It will return the second code unit of the "ä" character (U+00E4). However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented. If the string were in UTF-32, [2] could yield either the Cyrillic character, or the umlaut diacritic. The .length of the UTF-32 string could be either 3 or 4. There are multiple reasons why .length and index access is based on code units rather than code points or any higher level representation, but one is that the complexity would suddenly be O(n) instead of O(1). In-place modifications of char[] arrays also wouldn't be possible anymore as the size of the underlying array might have to change.
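A concrete sketch of both points (the byte values assume the usual UTF-8 encoding):

import std.stdio;

void main()
{
    string s = "säд";
    writeln(cast(ubyte[]) s);  // [115, 195, 164, 208, 148]
    writeln(cast(ubyte) s[2]); // 164 -- the second code unit of 'ä', not 'д'

    dstring composed   = "sä\u0434"d;       // 'ä' as the single code point U+00E4
    dstring decomposed = "sa\u0308\u0434"d; // 'a' plus the combining diaeresis U+0308
    writeln(composed.length);   // 3
    writeln(decomposed.length); // 4
}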
Oct 13 2013
This will _not_ return a trailing surrogate of a Cyrillic letter. It will return the second code unit of the "ä" character (U+00E4).True. It's UTF-8, not UTF-16.However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented.This is not true for UTF-8, which is not subject to "endianism".If the string were in UTF-32, [2] could yield either the Cyrillic character, or the umlaut diacritic. The .length of the UTF-32 string could be either 3 or 4.Neither is true for UTF-32. There is no interpretation (except for the "endianism", which could be taken care of in a library/the core) for the code point.There are multiple reasons why .length and index access is based on code units rather than code points or any higher level representation, but one is that the complexity would suddenly be O(n) instead of O(1).see my last statement belowIn-place modifications of char[] arrays also wouldn't be possible anymoreThey would be, butas the size of the underlying array might have to change.Well that's a point; on the other hand, D is constantly creating and throwing away new strings, so this isn't quite an argument. The current solution puts the programmer in charge of dealing with UTF-x, where a more consistent implementation would put the burden on the implementors of the libraries/core, i.e. the ones who usually have a better understanding of Unicode than the average programmer. Also, implementing such a semantics would not per se abandon a byte-wise access, would it? So, how do you guys handle UTF-8 strings in D? What are your solutions to the problems described? Does it all come down to converting "string"s and "wstring"s to "dstring"s, manipulating them, and re-convert them to "string"s? Btw, what would this mean in terms of speed? There is no irony in my questions. I'm really looking for solutions...
Oct 13 2013
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:Well that's a point; on the other hand, D is constantly creating and throwing away new strings, so this isn't quite an argument. The current solution puts the programmer in charge of dealing with UTF-x, where a more consistent implementation would put the burden on the implementors of the libraries/core, i.e. the ones who usually have a better understanding of Unicode than the average programmer.Ironically, the reason is consistency. `string` is just `immutable(char)[]` and it conforms to usual array behavior rules. Saying that array element value assignment may allocate is hardly a good option.So, how do you guys handle UTF-8 strings in D? What are your solutions to the problems described? Does it all come down to converting "string"s and "wstring"s to "dstring"s, manipulating them, and re-convert them to "string"s? Btw, what would this mean in terms of speed?If single element access is needed, str.front yields a decoded `dchar`. Or simply `foreach (dchar d; str)` - it won't hide the fact that it is an O(n) operation, at least. As `str.front` yields dchar, most `std.algorithm` and `std.range` utilities will also work correctly on default UTF-8 strings. Slicing / .length are probably the only operations that do not respect UTF-8 encoding (because they are exactly the same for all arrays).
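The decoded view in action, as a minimal sketch:

import std.stdio;
import std.array : front;
import std.algorithm : count;

void main()
{
    string s = "säд";
    writeln(s.front);      // 's' as a dchar -- front decodes the first code point
    foreach (dchar c; s)   // one iteration per code point, three in total
        write(c, ' ');     // s ä д
    writeln();
    writeln(s.count('д')); // 1 -- std.algorithm sees code points, not bytes
}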
Oct 13 2013
On Sunday, 13 October 2013 at 17:01:15 UTC, Dicebot wrote:If single element access is needed, str.front yields decoded `dchar`. Or simple `foreach (dchar d; str)` - it won't hide the fact it is O(n) operation at least. As `str.front` yields dchar, most `std.algorithm` and `std.range` utilities will also work correctly on default UTF-8 strings.No, he needs graphemes, so `std.algorithm` won't work correctly for him as Peter has shown: grapheme doesn't fit in dchar.
Oct 15 2013
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:This is not about endianness. It's "\u00E4" vs "a\u0308". The first is the single code point 'ä', the second is two code points, 'a' plus umlaut dots. [...]However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented.This is not true for UTF-8, which is not subject to "endianism".Well that's a point; on the other hand, D is constantly creating and throwing away new strings, so this isn't quite an argument. The current solution puts the programmer in charge of dealing with UTF-x, where a more consistent implementation would put the burden on the implementors of the libraries/core, i.e. the ones who usually have a better understanding of Unicode than the average programmer. Also, implementing such a semantics would not per se abandon a byte-wise access, would it? So, how do you guys handle UTF-8 strings in D? What are your solutions to the problems described? Does it all come down to converting "string"s and "wstring"s to "dstring"s, manipulating them, and re-convert them to "string"s? Btw, what would this mean in terms of speed? These is no irony in my questions. I'm really looking for solutions...I think, std.uni and std.utf are supposed to supply everything Unicode.
Oct 13 2013
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:You are correct in that UTF-8 is endian agnostic, but I don't believe that was Sönke's point. The point is that ä can be produced in Unicode in more than one way. This program illustrates: import std.stdio; void main() { string a = "ä"; string b = "a\u0308"; writeln(a); writeln(b); writeln(cast(ubyte[])a); writeln(cast(ubyte[])b); } This prints: ä ä [195, 164] [97, 204, 136] Notice that they are both the same "character" but have different representations. The first is just the 'ä' code point, which consists of two code units, the second is the 'a' code point followed by a Combining Diaeresis code point. In short, the string "ä" could be either 2 or 3 code units, and either 1 or 2 code points.However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented.This is not true for UTF-8, which is not subject to "endianism".
Oct 13 2013
I've found another inconsistency problem. void foo(const char *); void foo(const wchar *); void foo(const dchar *); void main() { foo(`123`); foo(`123`w); foo(`123`d); } Error: function hello.foo (const(char*)) is not callable using argument types (immutable(wchar)[]) Error: function hello.foo (const(char*)) is not callable using argument types (immutable(dchar)[]) And typeof(`123`).stringof == `string`. Why can `123` be stored as a null-terminated UTF-8 string in the rdata segment, but neither `123`w nor `123`d can? For example, wide strings (UTF-16) are usable with Windows *W functions.
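A possible workaround in the meantime, sketched with the Phobos helpers for building null-terminated pointers (toStringz in std.string, toUTF16z in std.utf; check your Phobos version):

import std.string : toStringz; // null-terminated UTF-8
import std.utf : toUTF16z;     // null-terminated UTF-16

void main()
{
    const(char)*  p8  = toStringz("123");  // for C functions taking a char*
    const(wchar)* p16 = toUTF16z("123"w);  // for Win32 *W functions such as MessageBoxW
    // pass p8 / p16 to the C or Windows API of your choice
}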
Oct 13 2013
On Sunday, 13 October 2013 at 22:34:00 UTC, Temtaime wrote:I've found another one inconsitency problem. void foo(const char *); void foo(const wchar *); void foo(const dchar *); void main() { foo(`123`); foo(`123`w); foo(`123`d); } Error: function hello.foo (const(char*)) is not callable using argument types (immutable(wchar)[]) Error: function hello.foo (const(char*)) is not callable using argument types (immutable(dchar)[]) And typeof(`123`).stringof == `string`. Why `123` can be stored as null terminated utf8 string in rdata segment and `123`w nor `123`d are not? For example wide strings(utf16) are usable with windows *W functions.The first one is made to interface with C. It is a special case.
Oct 13 2013
On 10/14/13, Temtaime <temtaime gmail.com> wrote:And typeof(`123`).stringof == `string`. Why `123` can be stored as null terminated utf8 string in rdata segment and `123`w nor `123`d are not? For example wide strings(utf16) are usable with windows *W functions.http://d.puremagic.com/issues/show_bug.cgi?id=6032
Oct 13 2013
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:Ok, I understand, that "length" is - obviously - used in analogy to any array's length value. Still, this seems to be inconsistent. D elaborates on implementing "char"s as UTF-8 which means that a "char" in D can be of any length between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't then this (i.e. the character's length) be the "unit of measurement" for "char"s - like e.g. the size of the underlying struct in an array of "struct"s? The story continues with indexing "string"s: In a consistent implementation, shouldn't writeln("säд"[2]) return "д" instead of the trailing surrogate of this cyrillic letter?This is impossible given the current design. At runtime "säд"[2] is viewed as struct { void *ptr; size_t length; }; ptr points to memory having at least five bytes and length having value 5. Druntime hasn't taken a UTF course. One option would be to add support in druntime so it can correctly handle such strings, or implement a separate string type which does not default to char[], but of course the easiest way is to convince everybody that everything is OK and advise using some library function which does the job correctly, essentially implying that the language does the job wrong (pardon me, some D skepticism, the deeper I am in it, the more critically I view it).
Oct 13 2013
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:Ok, I understand, that "length" is - obviously - used in analogy to any array's length value. Still, this seems to be inconsistent. D elaborates on implementing "char"s as UTF-8 which means that a "char" in D can be of any length between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't then this (i.e. the character's length) be the "unit of measurement" for "char"s - like e.g. the size of the underlying struct in an array of "struct"s? The story continues with indexing "string"s: In a consistent implementation, shouldn't writeln("säд"[2]) return "д" instead of the trailing surrogate of this cyrillic letter?I think the root misunderstanding is that you think that a string is random access. A string *isn't* random access. They are implemented *inside* an array, but unless you know *exactly* what you are doing, you shouldn't index, slice or take the length of a string. A string should be handled like a bidirectional range. Once you've understood that, it becomes much simpler. You want the first character? front. You want to skip the first character? popFront. You want an arbitrary character in o(N) time? myString.dropFrontExactly(N).front; You want an arbitrary character in o(1) time? You can't.
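A sketch of that O(n) lookup; note that the helper is spelled dropExactly in std.range (as far as I can tell there is no dropFrontExactly):

import std.stdio;
import std.array : front;
import std.range : dropExactly, walkLength;

void main()
{
    string s = "säд";
    writeln(s.walkLength);           // 3 code points
    writeln(s.dropExactly(2).front); // 'д' -- the third code point, reached in O(n)
}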
Oct 13 2013
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:Ok, I understand, that "length" is - obviously - used in analogy to any array's length value.That isn't an analogy. It is usually a good idea to try to understand a thing before judging if it is consistent.
Oct 13 2013
It's easy to state this, but - please - don't get sarcastic! I'm obviously (as I've learned) speaking about UTF-8 "char"s as they are NOT implemented right now in D; so I'm criticizing that D, as a language which emphasizes "UTF-8 characters", isn't taking "the last step", like e.g. Python does (and no, I'm not a Python fan, nor do I consider D a bad language).
Oct 14 2013
On 10/14/13 1:09 AM, nickles wrote:It's easy to state this, but - please - don't get sarcastic!Thanks for making this point. String handling in D follows two simple principles: 1. The support is a slice of code units (which often are immutable, seeing as string is an alias for immutable(char)[]). Slice primitives are readily accessible. 2. The standard library (and the foreach language construct) recognize that arrays of code units are special and define bidirectional range primitives on top of them. These are empty, save, front, back, popFront, and popBack. So for a string you may use the range primitives and related algorithms to manipulate code points, or the slice primitives to manipulate code units. This duality has been discussed in the past, and alternatives have been proposed (mainly gravitating around making one of the aspects explicit rather than implicit). It is my opinion that a better solution exists (in the form of making representation accessible only through a property .rep). But the current design has "won" not only because it's the existing one, but also because it has good simplicity and flexibility advantages. At this point there is no question about changing the semantics of existing constructs. Andrei
Oct 14 2013
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Oct 15 2013
On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:Most code might be buggy then. An issue that often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Oct 16 2013
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
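For reference, newer Phobos versions ship normalization in std.uni; a sketch of the conversion (if your Phobos predates it, a third-party library such as libutf8proc remains the fallback):

import std.stdio;
import std.uni; // normalize, NFC, NFD

void main()
{
    string composed   = "bär";                   // 'ä' as the single code point U+00E4
    string decomposed = normalize!NFD(composed); // 'a' + U+0308, the decomposed form OS X tends to hand back
    writeln(decomposed.length, " vs ", composed.length); // 5 vs 4 code units
    assert(normalize!NFC(decomposed) == composed);        // back to the composed form
}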
Oct 16 2013
On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:I'm not sure this is a "D" issue though: It's a fact of unicode that there are two different ways to write ä.On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Oct 16 2013
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra wrote:On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:My point was it would have been nice to have a native D function that can convert between the two types, especially because this is a well known issue. NSString (Cocoa / Objective-C) for example has things like precomposedStringWithCompatibilityMapping etc.On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:I'm not sure this is a "D" issue though: It's a fact of unicode that there are two different ways to write ä.On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Oct 16 2013
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra wrote:On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:As I argued previously, it is implementation issue which treats "bär" is sequence of objects which are not capable of representing values (like int[] = [3.14]). By the way, it is a rare case of type system hole. Usually in D you need cast or union to reinterpret some value, with "bär"[X] you need not.On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:I'm not sure this is a "D" issue though: It's a fact of unicode that there are two different ways to write ä.On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
Oct 16 2013
On 2013-10-16 10:03, qznc wrote:Most code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.Why would it require two code points? -- /Jacob Carlborg
Oct 16 2013
On Wednesday, 16 October 2013 at 12:18:40 UTC, Jacob Carlborg wrote:On 2013-10-16 10:03, qznc wrote:It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is "combining diaeresis" [0]. Not required, but possible. Those combining characters [1] provide a nearly infinite number of combinations. You can go crazy with it: http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work [0] http://www.fileformat.info/info/unicode/char/0308/index.htm [1] http://en.wikipedia.org/wiki/Combining_characterMost code might be buggy then. An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.Why would it require two code points?
Oct 16 2013
On 2013-10-16 14:33, qznc wrote:It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is "combining diaeresis" [0]. Not required, but possible. Those combining characters [1] provide a nearly infinite number of combinations. You can go crazy with it: http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work [0] http://www.fileformat.info/info/unicode/char/0308/index.htm [1] http://en.wikipedia.org/wiki/Combining_characterAha, now I see. -- /Jacob Carlborg
Oct 16 2013
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:On 2013-10-16 14:33, qznc wrote:One of the interesting points, is with "ba\u00E4r" vs "baa\u0308r", you can run a replace to replace 'a' with 'o'. Then, you'll get: "boär" vs "boör" Which is the correct behavior? There is no correct answer. So while a grapheme should never be separated from it's "letter" (eg, sorting "oäa" should *not* generate "aaö". What it *should* generate is up to debate), you can't entirely consider that a letter+grapheme is a single entity. Long story short: unicode is f***ing complicated. And I think D does a *damn* fine job of supporting it. In particular, it does an awesome job of *teaching* the coder *what* unicode is. Virtually everyone here has solid knowledge of unicode (I feel). They understand, and can explain it, and can work with. On the other hand, I don't know many C++ coders that understand unicode.It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is "combining diaeresis" [0]. Not required, but possible. Those combining characters [1] provide a nearly infinite number of combinations. You can go crazy with it: http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work [0] http://www.fileformat.info/info/unicode/char/0308/index.htm [1] http://en.wikipedia.org/wiki/Combining_characterAha, now I see.
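Spelled out as code, a sketch of that scenario using std.array.replace:

import std.stdio;
import std.array : replace;

void main()
{
    string precomposed = "ba\u00E4r";  // 'ä' as the single code point U+00E4
    string decomposed  = "baa\u0308r"; // 'ä' written as 'a' plus the combining diaeresis U+0308

    writeln(precomposed.replace("a", "o")); // boär -- the precomposed 'ä' is left alone
    writeln(decomposed.replace("a", "o"));  // boör -- the base 'a' of 'ä' gets replaced too
}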
Oct 16 2013
On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote:On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:I agree with your point. Nevertheless you understanding of grapheme is off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00e4 is the same grapheme as "a\u0308". http://en.wikipedia.org/wiki/GraphemeOn 2013-10-16 14:33, qznc wrote:One of the interesting points, is with "ba\u00E4r" vs "baa\u0308r", you can run a replace to replace 'a' with 'o'. Then, you'll get: "boär" vs "boör" Which is the correct behavior? There is no correct answer. So while a grapheme should never be separated from it's "letter" (eg, sorting "oäa" should *not* generate "aaö". What it *should* generate is up to debate), you can't entirely consider that a letter+grapheme is a single entity. Long story short: unicode is f***ing complicated. And I think D does a *damn* fine job of supporting it. In particular, it does an awesome job of *teaching* the coder *what* unicode is. Virtually everyone here has solid knowledge of unicode (I feel). They understand, and can explain it, and can work with. On the other hand, I don't know many C++ coders that understand unicode.It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is "combining diaeresis" [0]. Not required, but possible. Those combining characters [1] provide a nearly infinite number of combinations. You can go crazy with it: http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work [0] http://www.fileformat.info/info/unicode/char/0308/index.htm [1] http://en.wikipedia.org/wiki/Combining_characterAha, now I see.
Oct 16 2013
16-Oct-2013 23:42, qznc пишет:On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote:s/the same/canonically equivalent/ :)On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:I agree with your point. Nevertheless you understanding of grapheme is off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00e4 is the same grapheme as "a\u0308".On 2013-10-16 14:33, qznc wrote:One of the interesting points, is with "ba\u00E4r" vs "baa\u0308r", you can run a replace to replace 'a' with 'o'. Then, you'll get: "boär" vs "boör" Which is the correct behavior? There is no correct answer. So while a grapheme should never be separated from it's "letter" (eg, sorting "oäa" should *not* generate "aaö". What it *should* generate is up to debate), you can't entirely consider that a letter+grapheme is a single entity. Long story short: unicode is f***ing complicated. And I think D does a *damn* fine job of supporting it. In particular, it does an awesome job of *teaching* the coder *what* unicode is. Virtually everyone here has solid knowledge of unicode (I feel). They understand, and can explain it, and can work with. On the other hand, I don't know many C++ coders that understand unicode.It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is "combining diaeresis" [0]. Not required, but possible. Those combining characters [1] provide a nearly infinite number of combinations. You can go crazy with it: http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work [0] http://www.fileformat.info/info/unicode/char/0308/index.htm [1] http://en.wikipedia.org/wiki/Combining_characterAha, now I see.http://en.wikipedia.org/wiki/Grapheme-- Dmitry Olshansky
Oct 16 2013
On Wednesday, 16 October 2013 at 19:42:59 UTC, qznc wrote:I agree with your point. Nevertheless you understanding of grapheme is off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00e4 is the same grapheme as "a\u0308". http://en.wikipedia.org/wiki/GraphemeAh. Learn something new every day. :)
Oct 16 2013
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:Most code might be buggy then.All code is buggy.An issue the often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.And on Windows it's case-insensitive - 2^^N variants of each string. So what?
Oct 20 2013
On Sunday, 13 October 2013 at 13:40:21 UTC, Sönke Ludwig wrote:Am 13.10.2013 15:25, schrieb nickles:I recently discovered a bug in my program. If you take the letter "é" for example (Linux, Ubuntu 12.04), std.utf.count() returns 1 and .length returns 2. I needed the length to slice the string at a given point. Using .length instead of std.utf.count() fixed the bug.Ok, if my understandig is wrong, how do YOU measure the length of a string? Do you always use count(), or is there an alternative?The thing is that even count(), which gives you the number of *code points*, isn't necessarily what is desired - that is, the number of actual display characters. UTF is quite a complex beast and doing any operations on it _correctly_ generally requires a lot of care. If you need to do these kinds of operations, I would highly recommend to read up the basics of UTF and Unicode first (quick overview on Wikipedia: <http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings>). arr.length is meant to be used in conjunction with array indexing and slicing (arr[...]) and its value is consistent for all string and array types for this purpose.
Oct 14 2013
13-Oct-2013 17:25, nickles wrote:Ok, if my understanding is wrong, how do YOU measure the length of a string? Do you always use count(), or is there an alternative?It's all there: http://www.unicode.org/glossary/ http://www.unicode.org/versions/Unicode6.3.0/ I measure string length in code units (as defined in the above standard). This bears no easy relation to the number of visible characters but I don't mind it. Measuring the number of visible characters isn't trivial but can be done by counting the number of graphemes. For simple alphabets counting code points will do the trick as well (which is what count does). -- Dmitry Olshansky
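All three measurements side by side, as a sketch (byGrapheme comes with the newer std.uni, so availability depends on your Phobos version):

import std.stdio;
import std.utf : count;
import std.uni; // byGrapheme
import std.range : walkLength;

void main()
{
    string s = "noe\u0308l";          // "noël" with a decomposed 'ë'
    writeln(s.length);                // 6 -- code units
    writeln(count(s));                // 5 -- code points
    writeln(s.byGrapheme.walkLength); // 4 -- graphemes, i.e. the visible characters
}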
Oct 13 2013
On 13.10.2013 15:50, Dmitry Olshansky wrote:13-Oct-2013 17:25, nickles wrote:But you have to take care to normalize the string WRT diacritics if the estimate is supposed to work. OS X for example (if I remember right) always uses explicit combining characters, while Windows uses precomposed characters if possible.Ok, if my understanding is wrong, how do YOU measure the length of a string? Do you always use count(), or is there an alternative?It's all there: http://www.unicode.org/glossary/ http://www.unicode.org/versions/Unicode6.3.0/ I measure string length in code units (as defined in the above standard). This bears no easy relation to the number of visible characters but I don't mind it. Measuring the number of visible characters isn't trivial but can be done by counting the number of graphemes. For simple alphabets counting code points will do the trick as well (which is what count does).
Oct 13 2013
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:This is not the only inconsistency here. First of all, typeof("säд") yields string type (immutable(char)[]) while typeof(['s', 'ä', 'д']) yields neither char[], nor wchar[], nor even dchar[] but int[]. In this case D is close to C which also treats character literals as integer type. Secondly, character arrays are the only ones that have two kinds of array literals - the usual [item, item, item] and the special "blah", and as you see there is no correspondence between them. If you try char[] x = cast(char[])['s', 'ä', 'д'] then length would be indeed 3 (but don't use it - it is broken). In D a dynamic array is at the binary level represented as struct { void *ptr; size_t length; }. When you perform some operations on dynamic arrays they are implemented by the compiler as calls to runtime functions. However, during runtime it is impossible to do something useful on arrays for which there is only information about the address of the beginning and the total number of elements (this is a source of other problems in D). To handle this, the compiler generates and passes "TypeInfo" as a separate argument to runtime functions. TypeInfo contains some data, most relevant here is the size of the element. What happens is as follows. The compiler recognizes that "säд" should be a string literal and encoded as UTF-8 (http://dlang.org/lex.html#DoubleQuotedString), so the element type should be char, which requires having 5 elements in the array. So, at runtime an object "säд" is treated as an array of 5 elements each having 1 byte per element. Basically string (and char[]) plays a dual role in the language - on the one hand, it is an array of elements having strictly 1 byte size by definition, on the other hand D tries to use it as a 'generic' UTF type for which the size is not fixed. So, there is a contradiction - in source code such strings are viewed by the programmer as some abstract UTF string, but druntime views it as a 5 byte array. In my view, trouble begins when "säд" is internally cast to char (which is no better than int[] x = [3.14, 5.6]). And indeed, char[] x = ['s', 'ä', 'д'] is refused by the language, so there is great inconsistency here. By the way, the UTF definition is irrelevant here, this is purely an implementation issue (I think it is a design fault).
Oct 13 2013
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:Why does <string>.length return the number of bytes and not the number of UTF-8 characters, whereas <wstring>.length and <dstring>.length return the number of UTF-16 and UTF-32 characters? Wouldn't it be more consistent to have <string>.length return the number of UTF-8 characters as well (instead of having to use std.utf.count(<string>))?Technically, UTF-16 can contain 2 ushort's for 1 character, so <wstring>.length returns the number of ushort's, not the number of UTF-16 characters.
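Concretely, a sketch (U+1F600 stands in for any code point above U+FFFF):

import std.stdio;

void main()
{
    wstring w = "\U0001F600"w;     // one character outside the BMP
    writeln(w.length);             // 2 -- a UTF-16 surrogate pair
    writeln("\U0001F600"d.length); // 1 -- code point in UTF-32
    writeln("\U0001F600".length);  // 4 -- code units in UTF-8
}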
Oct 13 2013