digitalmars.D - Unicode handling comparison
- bearophile (7/7) Nov 27 2013 Through Reddit I have seen this small comparison of Unicode
- Simen Kjærås (4/9) Nov 27 2013 Indeed it does. Have you tried with std.uni?
- monarch_dodra (11/12) Nov 27 2013 I still think we're doing pretty good.
- David Nadlinger (9/14) Nov 27 2013 If you need to perform this kind of operations on Unicode strings
- Jacob Carlborg (5/10) Nov 27 2013 That didn't work out very well:
- Adam D. Ruppe (7/8) Nov 27 2013 Yeah, I saw it too. The fix is simple:
- Jacob Carlborg (4/9) Nov 27 2013 You were faster. But I created an issue as well.
- bearophile (4/13) Nov 27 2013 Thank you :-)
- Wyatt (12/21) Nov 27 2013 Seems like a pretty big "gotcha" from a usability standpoint;
- Dicebot (2/6) Nov 27 2013 It probably is, but is Unicode gotcha, not D one.
- Jacob Carlborg (4/8) Nov 27 2013 I think we should have that.
- Jakob Ovrum (14/20) Nov 27 2013 What would it do that std.uni doesn't already?
- Jacob Carlborg (5/6) Nov 27 2013 A class/struct that handles all these normalizations and other stuff
- Jakob Ovrum (3/7) Nov 27 2013 Sounds terrible :)
- Dicebot (6/15) Nov 27 2013 +1
- Jacob Carlborg (5/9) Nov 27 2013 I think it's missing a final high level abstraction. As with the rest of...
- Dmitry Olshansky (12/21) Nov 27 2013 This could give an idea of what Perl folks do to get the grapheme feel
- H. S. Teoh (15/39) Nov 27 2013 Maybe it should be called graphemeString?
- Wyatt (34/47) Nov 27 2013 Maybe. If I had called it...say, "normalisedString"? Would you
- Walter Bright (7/11) Nov 28 2013 Sadly, std.array is determined to decode (i.e. convert to dchar[]) all y...
- Jakob Ovrum (9/16) Nov 28 2013 Decoding by default means that algorithms can work reasonably
- bearophile (9/11) Nov 28 2013 If you want to sort an array of chars you need to use a dchar[],
- monarch_dodra (10/17) Nov 28 2013 I think it's great. It means by default, your strings will always
- Walter Bright (11/13) Nov 28 2013 front() in std.array looks like:
- H. S. Teoh (25/41) Nov 28 2013 OTOH, it is actually correct by default. If it *didn't* decode, things
- Dicebot (1/1) Nov 28 2013 http://dlang.org/phobos/std_encoding.html#.AsciiString ?
- monarch_dodra (11/12) Nov 28 2013 Yeah, that or just ubyte[].
- Walter Bright (4/7) Nov 28 2013 It doesn't have to be merely ASCII. You can do string substring searches...
- Dmitry Olshansky (19/26) Nov 28 2013 The greatest problem is surprisingly that you can't use range functions
- Walter Bright (3/7) Nov 28 2013 I suspect the correct approach would be to have the range over string to...
- Charles Hixson (12/28) Nov 27 2013 I don't like the overhead, and I don't know how important this is, but
- Dmitry Olshansky (6/22) Nov 27 2013 It's anything but cheap.
- Walter Bright (3/9) Nov 28 2013 Decoding isn't cheap, either, which is why I rant about it being the def...
- Jakob Ovrum (17/22) Nov 27 2013 Most of the points are good, but the author seems to confuse
- Wyatt (17/29) Nov 27 2013 I agree with the assertion that people SHOULD know how unicode
- Wyatt (9/10) Nov 27 2013 Whoops, overzealous pasting. That is, "e\u0308", which composes
- Jakob Ovrum (13/21) Nov 27 2013 Yes.
- Dmitry Olshansky (9/20) Nov 27 2013 As much as standard defines it. (actually they talk about boundaries,
- Jakob Ovrum (19/36) Nov 27 2013 I thought it was nice that std.uni had a proper terminology
- Charles Hixson (22/56) Nov 27 2013 I would put things a bit more emphatically. The codepoint is analogous
- Walter Bright (6/8) Nov 27 2013 Many things in Phobos either predate ranges, or are written by people wh...
- Dmitry Olshansky (4/8) Nov 27 2013 Which ones? Or do you mean more like isAlpha(rangeOfCodepoints)?
- Andrei Alexandrescu (3/10) Nov 27 2013 Yah, byGrapheme would be a great addition.
- H. S. Teoh (8/19) Nov 27 2013 [...]
- Dmitry Olshansky (13/30) Nov 27 2013 I could have sworn we had byGrapheme somewhere, well apparently not :(
- Jakob Ovrum (4/6) Nov 29 2013 Simple attempt:
- Simen Kjærås (40/49) Nov 27 2013 It shouldn't be hard to make, either:
- Gary Willoughby (3/10) Nov 27 2013 Ha, i was just discussing that here:
Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ D+Phobos seem to fail most things (it produces BAFFLE): http://dpaste.dzfl.pl/a5268c435 Bye, bearophile
Nov 27 2013
On 2013-11-27 13:46, bearophile wrote:Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ D+Phobos seem to fail most things (it produces BAFFLE): http://dpaste.dzfl.pl/a5268c435Indeed it does. Have you tried with std.uni? -- Simen
Nov 27 2013
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:D+Phobos seem to fail most things (it produces BAFFLE):I still think we're doing pretty good. At least, we *handle* unicode at all (looking at you C++). And we handle *true* unicode, not BMP-style UCS. We support every encoding from UTF8 through UTF32, and the possibility to also have ASCII. We don't yet totally handle things like diacritics or ligatures, but we are getting there. As a whole, I find that D is incredibly "unicode correct enough" out of the box, and with no extra effort involved.
Nov 27 2013
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ D+Phobos seem to fail most things (it produces BAFFLE): http://dpaste.dzfl.pl/a5268c435If you need to perform this kind of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results. As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap. David
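A minimal sketch of the suggestion above (the example strings are mine, not taken from the dpaste):

    import std.uni : normalize;

    void main()
    {
        string decomposed  = "noe\u0308l"; // 'e' followed by a combining diaeresis
        string precomposed = "no\u00EBl";  // precomposed 'ë'

        assert(decomposed != precomposed);            // raw code units differ
        assert(decomposed.normalize == precomposed);  // normalize defaults to NFC, so they now match
    }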
Nov 27 2013
On 2013-11-27 15:45, David Nadlinger wrote:If you need to perform this kind of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results.That didn't work out very well: std/uni.d(6301): Error: undefined identifier tuple -- /Jacob Carlborg
Nov 27 2013
On Wednesday, 27 November 2013 at 15:03:37 UTC, Jacob Carlborg wrote:std/uni.d(6301): Error: undefined identifier tupleYeah, I saw it too. The fix is simple: https://github.com/D-Programming-Language/phobos/pull/1728 tbh this makes me think version(unittest) might just be considered harmful. I'm sure that code passed the tests, but only because a vital import was in a version(unittest) section!
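An illustrative sketch of that pitfall, using a made-up module rather than the real std.uni code: the import only exists when compiling with -unittest, so the tests pass while a plain build dies with exactly this kind of "undefined identifier" error.

    // Compiles and passes with `dmd -unittest`, but a normal build fails with
    // "Error: undefined identifier tuple" because the import is unittest-only.
    version(unittest) import std.typecons : tuple;

    auto pair() { return tuple(1, 2); }

    unittest
    {
        assert(pair() == tuple(1, 2));
    }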
Nov 27 2013
On 2013-11-27 16:07, Adam D. Ruppe wrote:Yeah, I saw it too. The fix is simple: https://github.com/D-Programming-Language/phobos/pull/1728 tbh this makes me think version(unittest) might just be considered harmful. I'm sure that code passed the tests, but only because a vital import was in a version(unittest) secion!You were faster. But I created an issue as well. -- /Jacob Carlborg
Nov 27 2013
David Nadlinger:If you need to perform this kind of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results. As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap.Thank you :-) Bye, bearophile
Nov 27 2013
On Wednesday, 27 November 2013 at 14:45:32 UTC, David Nadlinger wrote:If you need to perform this kind of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results.Seems like a pretty big "gotcha" from a usability standpoint; it's not exactly intuitive. I understand WHY this decision was made, but it feels like a source of code smell and weird string comparison errors.As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap.I don't remember if it was brought up before, but this makes me wonder if something like an i18nString should exist for cases where it IS important. Making i18n stuff as simple as it looks like it "should" be has merit, IMO. (Maybe there's even room for a std.string.i18n submodule?) -Wyatt
Nov 27 2013
On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:Seems like a pretty big "gotcha" from a usability standpoint; it's not exactly intuitive. I understand WHY this decision was made, but it feels like a source of code smell and weird string comparison errors.It probably is, but it is a Unicode gotcha, not a D one.
Nov 27 2013
On 2013-11-27 17:15, Wyatt wrote:I don't remember if it was brought up before, but this makes me wonder if something like an i18nString should exist for cases where it IS important. Making i18n stuff as simple as it looks like it "should" be has merit, IMO. (Maybe there's even room for a std.string.i18n submodule?)I think we should have that. -- /Jacob Carlborg
Nov 27 2013
On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:I don't remember if it was brought up before, but this makes me wonder if something like an i18nString should exist for cases where it IS important. Making i18n stuff as simple as it looks like it "should" be has merit, IMO. (Maybe there's even room for a std.string.i18n submodule?) -WyattWhat would it do that std.uni doesn't already? i18nString sounds like a range of graphemes to me. I would like a convenient function in std.uni to get such a range of graphemes from a range of points, but I wouldn't want to elevate it to any particular status; that would be a knee-jerk reaction. D's granularity when it comes to Unicode is because there is an appropriate level of representation for each domain. Shoe-horning everything into a range of graphemes is something we should avoid. In D, we can write code that is both Unicode-correct and highly performant, while still being simple and pleasant to read. To write such code, one must have a modicum of understanding of how Unicode works (in order to choose the right tools from the toolbox), but I think it's a novel compromise.
Nov 27 2013
On 2013-11-27 18:22, Jakob Ovrum wrote:What would it do that std.uni doesn't already?A class/struct that handles all these normalizations and other stuff automatically. -- /Jacob Carlborg
Nov 27 2013
On Wednesday, 27 November 2013 at 17:30:22 UTC, Jacob Carlborg wrote:On 2013-11-27 18:22, Jakob Ovrum wrote:Sounds terrible :)What would it do that std.uni doesn't already?A class/struct that handles all these normalizations and other stuff automatically.
Nov 27 2013
On Wednesday, 27 November 2013 at 17:37:48 UTC, Jakob Ovrum wrote:On Wednesday, 27 November 2013 at 17:30:22 UTC, Jacob Carlborg wrote:+1 Working with graphemes is a rather expensive thing to do performance-wise. I like how D makes this fact obvious and provides continuous transition through abstraction levels here. It is important to make the costs obvious.On 2013-11-27 18:22, Jakob Ovrum wrote:Sounds terrible :)What would it do that std.uni doesn't already?A class/struct that handles all these normalizations and other stuff automatically.
Nov 27 2013
On 2013-11-27 18:56, Dicebot wrote:+1 Working with graphemes is rather expensive thing to do performance-wise. I like how D makes this fact obvious and provides continuous transition through abstraction levels here. It is important to make the costs obvious.I think it's missing a final high level abstraction. As with the rest of the abstractions you're not forced to use them. -- /Jacob Carlborg
Nov 27 2013
27-Nov-2013 22:54, Jacob Carlborg wrote:On 2013-11-27 18:56, Dicebot wrote:This could give an idea of what Perl folks do to make the grapheme feel like a unit of string: http://www.parrot.org/content/ucs-4-nfg-and-how-grapheme-tables-makes-it-awesome You seriously don't want this kind of behind the scenes work taking place in a systems language. P.S. The text linked presents some incorrect "facts" about Unicode that I'm not to be held responsible for :) I do believe however that the general idea described is interesting and is worth trying out in addition to what we have in std.uni. -- Dmitry Olshansky+1 Working with graphemes is a rather expensive thing to do performance-wise. I like how D makes this fact obvious and provides continuous transition through abstraction levels here. It is important to make the costs obvious.I think it's missing a final high level abstraction. As with the rest of the abstractions you're not forced to use them.
Nov 27 2013
On Wed, Nov 27, 2013 at 06:22:41PM +0100, Jakob Ovrum wrote:On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:Maybe it should be called graphemeString? I'm not sure what this has to do with i18n, though. Properly done i18n should use Unicode line-breaking algorithms and other such standardized functions, rather than manipulating graphemes directly (which fails to take into account double-width characters, language-specific decomposition rules, and many other gotchas, not to mention poorly-performing). AFAIK std.uni already provides a way to extract graphemes when you need it (e.g., for rendering fonts), so there's really no reason to default to graphemeString everywhere in your program. *That* is a sign of poorly written code, IMNSHO.I don't remember if it was brought up before, but this makes me wonder if something like an i18nString should exist for cases where it IS important. Making i18n stuff as simple as it looks like it "should" be has merit, IMO. (Maybe there's even room for a std.string.i18n submodule?) -WyattWhat would it do that std.uni doesn't already? i18nString sounds like a range of graphemes to me.I would like a convenient function in std.uni to get such a range of graphemes from a range of points, but I wouldn't want to elevate it to any particular status; that would be a knee-jerk reaction. D's granularity when it comes to Unicode is because there is an appropriate level of representation for each domain. Shoe-horning everything into a range of graphemes is something we should avoid. In D, we can write code that is both Unicode-correct and highly performant, while still being simple and pleasant to read. To write such code, one must have a modicum of understanding of how Unicode works (in order to choose the right tools from the toolbox), but I think it's a novel compromise.Agreed. T -- MASM = Mana Ada Sistem, Man!
Nov 27 2013
On Wednesday, 27 November 2013 at 17:22:43 UTC, Jakob Ovrum wrote:i18nString sounds like a range of graphemes to me.Maybe. If I had called it...say, "normalisedString"? Would you still think that? That was an off-the-cuff name because my morning brain imagined that this sort of thing would be useful for user input where you can't make assumptions about its form.I would like a convenient function in std.uni to get such a range of graphemes from a range of points, but I wouldn't want to elevate it to any particular status; that would be a knee-jerk reaction. D's granularity when it comes to Unicode is because there is an appropriate level of representation for each domain. Shoe-horning everything into a range of graphemes is something we should avoid.Okay, hold up. It's a bit late to prevent everyone from diving down this rabbit hole, but let me be clear: This really isn't about graphemes. Not really. They may be involved, but I think focusing on that obscures the point. If you recall the original article, I don't think he's being unfair in expecting "noël" to have a length of four no matter how it was composed. I don't think it's unfair to expect that "noël".take(3) returns "noë", and I don't think it's unfair that reversing it should be "lëon". All the places where his expectations were defied (and more!) are implementation details. While I stated before that I don't necessarily have anything against people learning more about unicode, neither do I fundamentally believe that's something a lot of people _need_ to worry about. I'm not saying the default string in D should change or anything crazy like that. All I'm suggesting is maybe, rather than telling people they should read a small book about the most arcane stuff imaginable and then explaining which tool does what when that doesn't take, we could just tell them "Here, use this library type where you need it" with the admonishment that it may be too slow if abused. I think THAT could be useful.In D, we can write code that is both Unicode-correct and highly performant, while still being simple and pleasant to read. To write such code, one must have a modicum of understanding of how Unicode works (in order to choose the right tools from the toolbox), but I think it's a novel compromise.See, this sways me only a little bit. The reason for that is, often, convenience greatly trumps elegance or performance. Sure I COULD write something in C to look for obvious bad stuff in my syslog, but would I bother when I have a shell with pipes, grep, cut, and sed? This all isn't to say I don't LIKE performance and elegance; but I live, work, and play on both sides of this spectrum, and I'd like to think they can peacefully coexist without too much fuss. -Wyatt
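For what it's worth, something close to the intuitive length already falls out of the existing std.uni primitives; here is a rough sketch (graphemeLength is a made-up helper name, not a Phobos function):

    import std.uni : graphemeStride;
    import std.stdio : writeln;

    // Count "user-perceived characters" by walking grapheme cluster boundaries.
    size_t graphemeLength(string s)
    {
        size_t count, i;
        while (i < s.length)
        {
            i += graphemeStride(s, i); // code units spanned by the grapheme starting at i
            ++count;
        }
        return count;
    }

    void main()
    {
        string s = "noe\u0308l";       // decomposed "noël"
        writeln(s.length);             // 6 code units
        writeln(graphemeLength(s));    // 4, the intuitive answer
    }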
Nov 27 2013
On 11/27/2013 9:22 AM, Jakob Ovrum wrote:In D, we can write code that is both Unicode-correct and highly performant, while still being simple and pleasant to read. To write such code, one must have a modicum of understanding of how Unicode works (in order to choose the right tools from the toolbox), but I think it's a novel compromise.Sadly, std.array is determined to decode (i.e. convert to dchar[]) all your strings when they are used as ranges. This means that all algorithms on strings will be crippled as far as performance goes. http://dlang.org/glossary.html#narrow strings Very, very few operations on strings need decoding. The decoding should have gone into a separate layer.
Nov 28 2013
On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright wrote:Sadly, std.array is determined to decode (i.e. convert to dchar[]) all your strings when they are used as ranges. This means that all algorithms on strings will be crippled as far as performance goes. http://dlang.org/glossary.html#narrow strings Very, very few operations on strings need decoding. The decoding should have gone into a separate layer.Decoding by default means that algorithms can work reasonably with strings without being designed specifically for strings. The algorithms can then later be specialized for narrow strings, which I believe is happening for a few algorithms in std.algorithm like substring search. Decoding is still available as a separate layer through std.utf, when more control over decoding is required.
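A sketch of that specialization pattern (countAscii is a hypothetical example, not a Phobos function): the generic branch decodes through front/popFront, while the narrow-string branch scans raw code units, which is safe for an ASCII needle because ASCII bytes never occur inside multi-byte UTF-8 sequences.

    import std.traits : isNarrowString;
    import std.range.primitives : empty, front, popFront;

    size_t countAscii(R)(R r, char needle)
    {
        assert(needle < 0x80, "the fast path is only valid for ASCII needles");
        size_t n;
        static if (isNarrowString!R)
        {
            // specialized path: compare raw bytes, no UTF decoding
            foreach (ubyte b; cast(const(ubyte)[]) r)
                if (b == needle) ++n;
        }
        else
        {
            // generic path: element by element via the range primitives
            for (; !r.empty; r.popFront())
                if (r.front == needle) ++n;
        }
        return n;
    }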
Nov 28 2013
Walter Bright:This means that all algorithms on strings will be crippled as far as performance goes.If you want to sort an array of chars you need to use a dchar[], or code like this: char[] word = "just a test".dup; auto sword = cast(char[])word.representation.sort().release; See: http://d.puremagic.com/issues/show_bug.cgi?id=10162 Bye, bearophile
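For comparison, a minimal sketch of the dchar[] route mentioned above: it pays for one decoding allocation up front, and sort then works directly.

    import std.algorithm : sort;

    void main()
    {
        dchar[] word = "just a test"d.dup; // decode once into an array of code points
        sort(word);                        // dchar[] is random-access, so sort just works
    }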
Nov 28 2013
On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright wrote:Sadly,I think it's great. It means by default, your strings will always be handled correctly. I think there's quite a few algorithms that were written without ever taking strings into account, but still happen to work with them.std.array is determined to decode (i.e. convert to dchar[]) all your strings when they are used as ranges. This means that all algorithms on strings will be crippled as far as performance goes.Quite a few algorithms in array/algorithm/string *don't* decode the string when they don't need to actually.Very, very few operations on strings need decoding. The decoding should have gone into a separate layer.Which operations are you thinking of in std.array that decode when they shouldn't?
Nov 28 2013
On 11/28/2013 5:24 AM, monarch_dodra wrote:Which operations are you thinking of in std.array that decode when they shouldn't?front() in std.array looks like:

    @property dchar front(T)(T[] a) @safe pure
    if (isNarrowString!(T[]))
    {
        assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof);
        size_t i = 0;
        return decode(a, i);
    }

So anytime I write a generic algorithm using empty, front, and popFront(), it decodes the strings, which is a large pessimization.
Nov 28 2013
On Thu, Nov 28, 2013 at 09:52:08AM -0800, Walter Bright wrote:On 11/28/2013 5:24 AM, monarch_dodra wrote:OTOH, it is actually correct by default. If it *didn't* decode, things like std.algorithm.sort and std.range.retro would mangle all your multibyte UTF-8 characters. Having said that, though, it would be nice if there were a standard ASCII string type that didn't decode by default. Always decoding strings *is* slow, esp. when you already know that it only contains ASCII characters. Maybe we want something like this:

    struct AsciiString
    {
        immutable(ubyte)[] impl;
        alias impl this;

        // This is so that .front returns char instead of ubyte
        @property char front() { return cast(char) impl[0]; }
        char opIndex(size_t idx) { ... /* ditto */ }
        ... // other range methods here
    }

    AsciiString assumeAscii(string s)
    {
        return AsciiString(cast(immutable(ubyte)[]) s);
    }

T -- "640K ought to be enough" -- Bill G., 1984. "The Internet is not a primary goal for PC usage" -- Bill G., 1995. "Linux has no impact on Microsoft's strategy" -- Bill G., 1999.Which operations are you thinking of in std.array that decode when they shouldn't?front() in std.array looks like: @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[])) { assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof); size_t i = 0; return decode(a, i); } So anytime I write a generic algorithm using empty, front, and popFront(), it decodes the strings, which is a large pessimization.
Nov 28 2013
On Thursday, 28 November 2013 at 18:55:44 UTC, Dicebot wrote:Yeah, that or just ubyte[]. The problem with both of these though, is printing :/ (which prints ugly as sin) Something like: struct AsciiChar { private char c; alias c this; } Could be a very easy and efficient alternative.
Nov 28 2013
On 11/28/2013 10:19 AM, H. S. Teoh wrote:Always decoding strings *is* slow, esp. when you already know that it only contains ASCII characters.It doesn't have to be merely ASCII. You can do string substring searches without any need for decoding, for example. You don't even need decoding to do regex. Decoding is rarely needed.
Nov 28 2013
28-Nov-2013 17:24, monarch_dodra wrote:On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright wrote:The greatest problem is surprisingly that you can't use range functions on the implicit codeunit range even if you REALLY wanted to. To take a nearby example, the only reason std.regex can't take e.g. retro of string: match(retro("hleb"), ".el."); is because of the automatic dumbing down at the moment you apply a range adapter. What I'd need in std.regex is a codeunit range that due to convention also "happens to be" a range of codepoints. The second problem is that string code is carefully special cased but the effort is completely wasted the moment you have a slice of char-s that come from anywhere else (circular buffer, for instance) than built-in strings. I had a (a bit cloudy) vision of settling the encoded ranges problem once and for good. That includes defining the notion of an encoded range that is 2 in one: some stronger (as in capabilities) range of code elements and the default decoded view imposed on top of it (that can be weaker). -- Dmitry OlshanskySadly,I think it's great. It means by default, your strings will always be handled correctly. I think there's quite a few algorithms that were written without ever taking strings into account, but still happen to work with them.
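A tiny sketch of the "dumbing down" in question: the moment retro is applied, the result is no longer a narrow string, so anything that insists on an array of char (as std.regex did at the time of this post) can no longer accept it.

    import std.range : retro;
    import std.range.primitives : ElementType;
    import std.traits : isSomeString;

    void main()
    {
        auto r = retro("hleb");
        static assert(!isSomeString!(typeof(r)));            // no longer an array of char
        static assert(is(ElementType!(typeof(r)) == dchar)); // elements are decoded code points
    }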
Nov 28 2013
On 11/28/2013 11:32 AM, Dmitry Olshansky wrote:I had a (a bit cloudy) vision of settling encoded ranges problem once and for good. That includes defining notion of an encoded range that is 2 in one: some stronger (as in capabilities) range of code elements and the default decoded view imposed on top of it (that can be weaker).I suspect the correct approach would be to have the range over string to produce bytes. If you want decoded values, then run it through an adapter algorithm.
Nov 28 2013
On 11/27/2013 06:45 AM, David Nadlinger wrote:On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:I don't like the overhead, and I don't know how important this is, but perhaps the best way to solve it would be to have string include a "normalization" byte, saying whether it was normalized, and if so in what way. That there can be multiple ways of normalizing is painful, but it *is* the standard. And this would allow normalization to be skipped whenever the comparison of two strings showed the same normalization (or lack thereof). What to do if they're normalized differently is a bit of a puzzle, but most reasonable solutions would work for most cases, so you just need a way to override the defaults. -- Charles HixsonThrough Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ D+Phobos seem to fail most things (it produces BAFFLE): http://dpaste.dzfl.pl/a5268c435If you need to perform this kind of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results. As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap. David
Nov 27 2013
27-Nov-2013 18:45, David Nadlinger wrote:On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ D+Phobos seem to fail most things (it produces BAFFLE): http://dpaste.dzfl.pl/a5268c435If you need to perform this kind of operations on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results. As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap.It's anything but cheap. At the minimum imagine crawling the string and issuing a table lookup per codepoint.
Nov 27 2013
On 11/27/2013 12:06 PM, Dmitry Olshansky wrote:27-Nov-2013 18:45, David Nadlinger пишет:Decoding isn't cheap, either, which is why I rant about it being the default behavior.As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap.It's anything but cheap. At the minimum imagine crawling the string and issuing a table lookup per codepoint.
Nov 28 2013
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/Most of the points are good, but the author seems to confuse UCS-2 with UTF-16, so the whole point about UTF-16 is plain wrong. The author also doesn't seem to understand the Unicode definitions of character and grapheme, which is a shame, because the difference is more or less the whole point of the post.D+Phobos seem to fail most things (it produces BAFFLE): http://dpaste.dzfl.pl/a5268c435D strings are arrays of code units and ranges of code points. The failure here is yours; in that you didn't use std.uni to handle graphemes. On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a range of code points to a range of graphemes (probably needs a `byCodePoint` to do the converse too). The two extant grapheme functions, `decodeGrapheme` and `graphemeStride`, are *awful* for string manipulation (granted, they are probably perfect for text rendering).
Nov 27 2013
On Wednesday, 27 November 2013 at 15:43:11 UTC, Jakob Ovrum wrote:The author also doesn't seem to understand the Unicode definitions of character and grapheme, which is a shame, because the difference is more or less the whole point of the post.I agree with the assertion that people SHOULD know how unicode works if they want to work with it, but the way our docs are now is off-putting enough that most probably won't learn anything. If they know, they know; if they don't, the wall of jargon is intimidating and hard to grasp (more examples up front of more things that you'd actually use std.uni for). Even though I'm decently familiar with Unicode, I was having trouble following all that (e.g. Isn't "noe\u0308l" a grapheme cluster according to std.uni?). On the flip side, std.utf has a serious dearth of examples and the relationship between the two isn't clear.On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a range of code points to a range of graphemes (probably needs a `byCodePoint` to do the converse too). The two extant grapheme functions, `decodeGrapheme` and `graphemeStride`, are *awful* for string manipulation (granted, they are probably perfect for text rendering).Yes, please. While operations on single codepoints and characters seem pretty robust (i.e. you can do lots of things with and to them), it feels like it just falls apart when you try to work with strings. It honestly surprised me how many things in std.uni don't seem to work on ranges. -Wyatt
Nov 27 2013
On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:trouble following all that (e.g. Isn't "noe\u0308l" a graphemeWhoops, overzealous pasting. That is, "e\u0308", which composes to "ë". A grapheme cluster seems to represent one printed character: "...a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it." Is that about right? -Wyatt
Nov 27 2013
On Wednesday, 27 November 2013 at 16:22:58 UTC, Wyatt wrote:Whoops, overzealous pasting. That is, "e\u0308", which composes to "ë". A grapheme cluster seems to represent one printed character: "...a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it." Is that about right? -WyattYes. A grapheme is also sometimes explained as being the unit that lay people intuitively think of as being a "character". The difference between a grapheme and a grapheme cluster is just a matter of perspective, like the difference between a character and a code point; the former simply refers to the decoded result, while the latter refers to the sum of encoding parts (where the parts are code points for grapheme cluster, and code units for a code point). Yet another example is that of the UTF-32 code unit: one UTF-32 code unit is (currently) equal to one Unicode code point, but both terms are meaningful in the right context.
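A small concrete illustration of the distinction, using the grapheme tools std.uni already has:

    import std.uni : decodeGrapheme, Grapheme;

    void main()
    {
        string s = "e\u0308l";          // 'e' + combining diaeresis, then 'l'
        Grapheme g = decodeGrapheme(s); // consumes one grapheme cluster from the front of s
        assert(g.length == 2);          // the cluster was encoded as two code points
        assert(g[0] == 'e' && g[1] == '\u0308');
        assert(s == "l");               // the input was advanced past the cluster
    }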
Nov 27 2013
27-Nov-2013 20:22, Wyatt wrote:On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:As much as the standard defines it. (actually they talk about boundaries, and a grapheme is what happens to be in between). More specifically D's std.uni follows the notion of the extended grapheme cluster. There is no need to stick with ugly legacy crap. See also http://www.unicode.org/reports/tr29/ trouble following all that (e.g. Isn't "noe\u0308l" a graphemeWhoops, overzealous pasting. That is, "e\u0308", which composes to "ë". A grapheme cluster seems to represent one printed character: "...a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it." Is that about right?
Nov 27 2013
On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:I agree with the assertion that people SHOULD know how unicode works if they want to work with it, but the way our docs are now is off-putting enough that most probably won't learn anything. If they know, they know; if they don't, the wall of jargon is intimidating and hard to grasp (more examples up front of more things that you'd actually use std.uni for). Even though I'm decently familiar with Unicode, I was having trouble following all that (e.g. Isn't "noe\u0308l" a grapheme cluster according to std.uni?). On the flip side, std.utf has a serious dearth of examples and the relationship between the two isn't clear.I thought it was nice that std.uni had a proper terminology section, complete with links to Unicode documents to kick-start beginners to Unicode. It mentions its relationship with std.utf right at the top. Maybe the first paragraph is just too thin, and it's hard to see the big picture. Maybe it should include a small leading paragraph detailing the three levels of Unicode granularity that D/Phobos chooses; arrays of code units -> ranges of code points -> std.uni for graphemes and algorithms.Yes, please. While operations on single codepoints and characters seem pretty robust (i.e. you can do lots of things with and to them), it feels like it just falls apart when you try to work with strings. It honestly surprised me how many things in std.uni don't seem to work on ranges. -WyattMost string code is Unicode-correct as long as it works on code points and all inputs are of the same normalization format; explicit grapheme-awareness is rarely a necessity. By that I mean the most common string operations, such as searching, getting a substring etc. will work without any special grapheme decoding (beyond normalization). The hiccups appear when code points are shuffled around, or the order is changed. Apart from these rare string manipulation cases, grapheme awareness is necessary for rendering code.
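A short sketch of that point: ordinary code point level search already does the right thing once both sides are in the same normalization form (example strings chosen for illustration):

    import std.algorithm : canFind;
    import std.uni : normalize;

    void main()
    {
        string haystack = "noe\u0308l"; // decomposed
        string needle   = "\u00EB";     // precomposed 'ë'

        assert(!haystack.canFind(needle));           // different code point sequences
        assert(haystack.normalize.canFind(needle));  // same NFC form, so plain search works
    }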
Nov 27 2013
On 11/27/2013 08:53 AM, Jakob Ovrum wrote:On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:I would put things a bit more emphatically. The codepoint is analogous to assembler, where the character is analogous to a high level language (and the binary representation is analogous to a binary representation). The desire is to make the characters easy to use in a way that is cheap to do. To me this means that the highlevel language (i.e., D) should make it easy to deal with characters, possible to deal with codepoints, and you can deal with binary representations if you really want to. (Also note the isomorphism between assembler code and binary is matched by an isomorphism between codepoints and binary representation.) To do this cheaply, D needs to know what kind of normalization each string is in. This is likely to cost one byte per string, unless there's some slack in the current representation. But is this worth while? This is the direction that things will eventually go, but that doesn't really mean that we need to push them in that direction today. But if D had a default normalization that occurred during i/o operations, to cost of the normalization would probably be lost during the impedance matching between RAM and storage. (Again, however, any default requires the ability to be overridden.) Also, of course, none of this will be of any significance to ASCII. -- Charles HixsonI agree with the assertion that people SHOULD know how unicode works if they want to work with it, but the way our docs are now is off-putting enough that most probably won't learn anything. If they know, they know; if they don't, the wall of jargon is intimidating and hard to grasp (more examples up front of more things that you'd actually use std.uni for). Even though I'm decently familiar with Unicode, I was having trouble following all that (e.g. Isn't "noe\u0308l" a grapheme cluster according to std.uni?). On the flip side, std.utf has a serious dearth of examples and the relationship between the two isn't clear.I thought it was nice that std.uni had a proper terminology section, complete with links to Unicode documents to kick-start beginners to Unicode. It mentions its relationship with std.utf right at the top. Maybe the first paragraph is just too thin, and it's hard to see the big picture. Maybe it should include a small leading paragraph detailing the three levels of Unicode granularity that D/Phobos chooses; arrays of code units -> ranges of code points -> std.uni for graphemes and algorithms.Yes, please. While operations on single codepoints and characters seem pretty robust (i.e. you can do lots of things with and to them), it feels like it just falls apart when you try to work with strings. It honestly surprised me how many things in std.uni don't seem to work on ranges. -WyattMost string code is Unicode-correct as long as it works on code points and all inputs are of the same normalization format; explicit grapheme-awareness is rarely a necessity. By that I mean the most common string operations, such as searching, getting a substring etc. will work without any special grapheme decoding (beyond normalization). The hiccups appear when code points are shuffled around, or the order is changed. Apart from these rare string manipulation cases, grapheme awareness is necessary for rendering code.
Nov 27 2013
On 11/27/2013 8:18 AM, Wyatt wrote:It honestly surprised me how many things in std.uni don't seem to work on ranges.Many things in Phobos either predate ranges, or are written by people who aren't used to ranges and don't think in terms of ranges. It's an ongoing issue, and one we need to improve upon. And, of course, you're welcome to pitch in and help with pull requests on the documentation and implementation!
Nov 27 2013
27-Nov-2013 20:18, Wyatt wrote:On Wednesday, 27 November 2013 at 15:43:11 UTC, Jakob Ovrum wrote:It honestly surprised me how many things in std.uni don't seem to work on ranges.Which ones? Or do you mean more like isAlpha(rangeOfCodepoints)? -- Dmitry Olshansky
Nov 27 2013
On 11/27/13 7:43 AM, Jakob Ovrum wrote:On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a range of code points to a range of graphemes (probably needs a `byCodePoint` to do the converse too). The two extant grapheme functions, `decodeGrapheme` and `graphemeStride`, are *awful* for string manipulation (granted, they are probably perfect for text rendering).Yah, byGrapheme would be a great addition. Andrei
Nov 27 2013
On Wed, Nov 27, 2013 at 10:07:43AM -0800, Andrei Alexandrescu wrote:On 11/27/13 7:43 AM, Jakob Ovrum wrote:[...] +1. This is better than the GraphemeString / i18nString proposal elsewhere in this thread, because it discourages people from using graphemes (poor performance) unless where actually necessary. T -- He who laughs last thinks slowest.On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a range of code points to a range of graphemes (probably needs a `byCodePoint` to do the converse too). The two extant grapheme functions, `decodeGrapheme` and `graphemeStride`, are *awful* for string manipulation (granted, they are probably perfect for text rendering).Yah, byGrapheme would be a great addition.
Nov 27 2013
27-Nov-2013 22:12, H. S. Teoh wrote:On Wed, Nov 27, 2013 at 10:07:43AM -0800, Andrei Alexandrescu wrote:On 11/27/13 7:43 AM, Jakob Ovrum wrote:I could have sworn we had byGrapheme somewhere, well apparently not :( BTW I believe that GraphemeString could still be a valuable addition. I know of at least one good implementation that gives you O(1) grapheme access with nice memory footprint numbers. It has many benefits but the chief problems with it are: a) It doesn't solve the interchange at all - you'd have to encode on write/re-code on read; b) It relies on having global shared state across the whole program, and that's the real show-stopper thing about it. In any case it's a direction well worth exploring.On 11/27/13 7:43 AM, Jakob Ovrum wrote:[...] +1. This is better than the GraphemeString / i18nString proposal elsewhere in this thread, because it discourages people from using graphemes (poor performance) unless where actually necessary.On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a range of code points to a range of graphemes (probably needs a `byCodePoint` to do the converse too). The two extant grapheme functions, `decodeGrapheme` and `graphemeStride`, are *awful* for string manipulation (granted, they are probably perfect for text rendering).Yah, byGrapheme would be a great addition.
Nov 27 2013
On Wednesday, 27 November 2013 at 20:13:32 UTC, Dmitry Olshansky wrote:I could have sworn we had byGrapheme somewhere, well apparently not :(Simple attempt: https://github.com/D-Programming-Language/phobos/pull/1736
Nov 29 2013
On 27.11.2013 19:07, Andrei Alexandrescu wrote:On 11/27/13 7:43 AM, Jakob Ovrum wrote:On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a range of code points to a range of graphemes (probably needs a `byCodePoint` to do the converse too). The two extant grapheme functions, `decodeGrapheme` and `graphemeStride`, are *awful* for string manipulation (granted, they are probably perfect for text rendering).Yah, byGrapheme would be a great addition.It shouldn't be hard to make, either:

    import std.uni : Grapheme, decodeGrapheme;
    import std.traits : isSomeString;
    import std.array : empty;

    struct ByGrapheme(T) if (isSomeString!T)
    {
        Grapheme _front;
        bool _empty;
        T _range;

        this(T value)
        {
            _range = value;
            popFront();
        }

        @property Grapheme front()
        {
            assert(!empty);
            return _front;
        }

        void popFront()
        {
            assert(!empty);
            _empty = _range.empty;
            if (!_empty)
            {
                _front = decodeGrapheme(_range);
            }
        }

        @property bool empty()
        {
            return _empty;
        }
    }

    auto byGrapheme(T)(T value) if (isSomeString!T)
    {
        return ByGrapheme!T(value);
    }

    void main()
    {
        import std.stdio;
        string s = "তঃঅ৩৵பஂஅபூ௩ᐁᑦᕵᙧᚠᚳᛦᛰ¥¼Ññ";
        writeln(s.byGrapheme);
    }

-- Simen
Nov 27 2013
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:Through Reddit I have seen this small comparison of Unicode handling between different programming languages: http://mortoray.com/2013/11/27/the-string-type-is-broken/ D+Phobos seem to fail most things (it produces BAFFLE): http://dpaste.dzfl.pl/a5268c435 Bye, bearophileHa, I was just discussing that here: http://forum.dlang.org/thread/xmusisihhbmefeigvxvd@forum.dlang.org
Nov 27 2013