digitalmars.D - Today's programming challenge - How's your Range-Fu ?
- Walter Bright (11/11) Apr 17 2015 Challenge level - Moderately easy
- H. S. Teoh via Digitalmars-d (33/47) Apr 17 2015 This is harder than it looks at first sight, actually. Mostly thanks to
- Walter Bright (3/9) Apr 17 2015 It'd be good enough to duplicate the existing behavior, which is to trea...
- John Colvin (8/22) Apr 18 2015 Code points aren't equivalent to characters. They're not the same
- Jacob Carlborg (4/8) Apr 18 2015 For that we have std.ascii.
- Walter Bright (9/27) Apr 18 2015 The first order of business is making wrap() work with ranges, and other...
- Panke (8/31) Apr 18 2015 Umlauts, if combined characters are used. Also words that still
- Walter Bright (3/23) Apr 18 2015 That doesn't make sense to me, because the umlauts and the accented e al...
- Panke (4/6) Apr 18 2015 Yes, but you may have perfectly fine unicode text where the
- Jacob Carlborg (19/21) Apr 18 2015 This code snippet demonstrates the problem:
- Chris (8/28) Apr 18 2015 Yep, this was the cause of some bugs I had in my program. The
- Gary Willoughby (4/40) Apr 18 2015 byGrapheme to the rescue:
- Jacob Carlborg (7/10) Apr 18 2015 How is byGrapheme supposed to be used? I tried this but it doesn't do
- Jakob Ovrum (8/18) Apr 18 2015 void main()
- H. S. Teoh via Digitalmars-d (26/61) Apr 18 2015 Wait, I thought the recommended approach is to normalize first, then do
- Tobias Pankrath (9/19) Apr 18 2015 1. Problem: Normalization is not closed under almost all
- Chris (7/93) Apr 18 2015 This is why on OS X I always normalized strings to composed.
- Walter Bright (3/6) Apr 18 2015 That should be done. There should be a fixed maximum codepoint count to
- H. S. Teoh via Digitalmars-d (7/14) Apr 18 2015 Why? Scanning a string for a grapheme of arbitrary length does not need
- Walter Bright (3/14) Apr 18 2015 If there's no need for allocation at all, why does it allocate? This sho...
- H. S. Teoh via Digitalmars-d (7/23) Apr 18 2015 AFAICT, the only reason it allocates is because it shares the same
- Andrei Alexandrescu (4/23) Apr 18 2015 Isn't this solved commonly with a normalization pass? We should have a
- Tobias Pankrath (4/8) Apr 18 2015 I don't think so. The thing is, even after normalization we have
- Chris (5/15) Apr 20 2015 Yes, again and again I encountered length related bugs with
- Panke (4/6) Apr 20 2015 I think it is 100% reliable, it just doesn't make the problems go
- Chris (21/28) Apr 20 2015 The problem is not normalization as such, the problem is with
- Panke (22/25) Apr 20 2015 There are three things that you need to be aware of when handling
- John Colvin (6/9) Apr 20 2015 Even that's not really true. In the end it's up to the font and
- H. S. Teoh via Digitalmars-d (31/41) Apr 20 2015 Yeah, even the grapheme count does not necessarily tell you how wide the
- Panke (2/7) Apr 20 2015 Why? Doesn't string.length give you the byte count?
- rumbu (9/17) Apr 20 2015 You'll also need the unicode character display width:
- JohnnyK (10/18) Apr 21 2015 I think what you are looking for is string.sizeof?
- John Colvin (7/23) Apr 21 2015 I was talking about the "you'll need the number of graphemes".
- Chris (5/31) Apr 20 2015 This is why I use a helper function that uses byCodePoint and
- John Colvin (5/39) Apr 19 2015 Normalisation can allow some simplifications, sometimes, but
- Walter Bright (4/6) Apr 18 2015 I won't deny what the spec says, but it doesn't make any sense to have t...
- H. S. Teoh via Digitalmars-d (14/22) Apr 18 2015 Well, *somebody* has to convert it to the single code point eacute,
- Walter Bright (6/23) Apr 18 2015 Data entry should be handled by the driver program, not a universal inte...
- H. S. Teoh via Digitalmars-d (6/13) Apr 18 2015 Take it up with the Unicode consortium. :-)
- Walter Bright (2/3) Apr 18 2015 I see nobody knows :-)
- Shachar Shemesh (27/30) Apr 18 2015 A lot of areas in Unicode are due to pre-Unicode legacy.
- Abdulhaq (2/38) Apr 19 2015 Yes Arabic is similar too
- Shachar Shemesh (20/29) Apr 19 2015 Actually, the Arabic presentation forms serve a slightly different
- "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= (13/17) Apr 19 2015 That's probably right. It is in fact a major feat to have the
- John Colvin (12/19) Apr 19 2015 é might be obvious, but Unicode isn't just for writing European
- ketmar (5/9) Apr 19 2015 e.
- weaselcat (3/7) Apr 19 2015 There's other uses for unicode?
- Nick B (17/20) Apr 19 2015 Ketmar
- ketmar (3/4) Apr 19 2015 alas, it's too late. now we'll live with that "unicode" crap for many
- Nick B (5/10) Apr 19 2015 Perhaps. or perhaps not. This community got together under Walter
- Jacob Carlborg (4/5) Apr 20 2015 https://xkcd.com/927/
- Shachar Shemesh (55/58) Apr 19 2015 This is not a very accurate depiction of Unicode.
- Paulo Pinto (8/40) Apr 18 2015 Also another issue is that lower case letters and upper case
- Tobias Pankrath (2/9) Apr 18 2015 While true, it does not affect wrap (the algorithm) as far as I
- Shachar Shemesh (20/22) Apr 17 2015 Which BiDi marking are you referring to? LRM/RLM and friends? If so,
- H. S. Teoh via Digitalmars-d (6/8) Apr 17 2015 Argh, my Perl script doth mock me!
- H. S. Teoh via Digitalmars-d (166/172) Apr 17 2015 [...]
- Walter Bright (2/4) Apr 17 2015 awesome! Please make a pull request for this so you get proper credit!
- H. S. Teoh via Digitalmars-d (5/11) Apr 17 2015 Doesn't that mean I have to add the autodecoding workarounds first?
- Walter Bright (8/16) Apr 17 2015 Before it gets pulled, yes, meaning that the element type of front() sho...
- ketmar (3/5) Apr 17 2015 there is some... inconsistency: `std.string.wrap` adds final "\n" to
- Panke (3/11) Apr 17 2015 A range of lines instead of inserted \n would be a good API as
- H. S. Teoh via Digitalmars-d (11/22) Apr 18 2015 Indeed, that would be even more useful, then you could just do
- Walter Bright (5/7) Apr 18 2015 Yes, although the overarching goal is:
Challenge level - Moderately easy Consider the function std.string.wrap: It takes a string as input, and returns a GC allocated string that is word-wrapped. It needs to be enhanced to: 1. Accept a ForwardRange as input. 2. Return a lazy ForwardRange that delivers the characters of the wrapped result one by one on demand. 3. Not allocate any memory. 4. The element encoding type of the returned range must be the same as the element encoding type of the input.
Apr 17 2015
On Fri, Apr 17, 2015 at 02:09:07AM -0700, Walter Bright via Digitalmars-d wrote:Challenge level - Moderately easy Consider the function std.string.wrap: It takes a string as input, and returns a GC allocated string that is word-wrapped. It needs to be enhanced to: 1. Accept a ForwardRange as input. 2. Return a lazy ForwardRange that delivers the characters of the wrapped result one by one on demand. 3. Not allocate any memory. 4. The element encoding type of the returned range must be the same as the element encoding type of the input.This is harder than it looks at first sight, actually. Mostly thanks to the complexity of Unicode... you need to identify zero-width, normal-width, and double-width characters, combining diacritics, various kinds of spaces (e.g. cannot break on non-breaking space) and treat them accordingly. Which requires decoding. (Well, in theory std.uni could be enhanced to work directly with encoded data, but right now it doesn't. In any case this is outside the scope of this challenge, I think.) Unfortunately, the only reliable way I know of currently that can deal with the spacing of Unicode characters correctly is to segment the input with byGrapheme, which currently is GC-dependent. So this fails (3). There's also the question of what to do with bidi markings: how do you handle counting the columns in that case? Of course, if you forego Unicode correctness, then you *could* just word-wrap on a per-character basis (i.e., every character counts as 1 column), but this also makes the resulting code useless as far as dealing with general Unicode data is concerned -- it'd only work for ASCII, and various character ranges inherited from the old 8-bit European encodings. Not to mention, line-breaking in Chinese encodings cannot work as prescribed anyway, because the rules are different (you can break anywhere at a character boundary except punctuation -- there is no such concept as a space character in Chinese writing). Same applies for Korean/Japanese. So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's). T -- All problems are easy in retrospect.
Apr 17 2015
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's).It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Apr 17 2015
On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:Code points aren't equivalent to characters. They're not the same thing in most European languages, never mind the rest of the world. If we have a line-wrapping algorithm in phobos that works by code points, it needs a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning. Code points are a useful chunk size for some tasks and completely insufficient for others.So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's).It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Apr 18 2015
On 2015-04-18 09:58, John Colvin wrote:Code points aren't equivalent to characters. They're not the same thing in most European languages, never mind the rest of the world. If we have a line-wrapping algorithm in phobos that works by code points, it needs a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning.For that we have std.ascii. -- /Jacob Carlborg
Apr 18 2015
On 4/18/2015 12:58 AM, John Colvin wrote:On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:I know a bit of German, for what characters is that not true?On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:Code points aren't equivalent to characters. They're not the same thing in most European languages,So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's).It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.never mind the rest of the world. If we have a line-wrapping algorithm in phobos that works by code points, it needs a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning. Code points are a useful chunk size for some tasks and completely insufficient for others.The first order of business is making wrap() work with ranges, and otherwise work the same as it always has (it's one of the oldest Phobos functions). There are different standard levels of Unicode support. The lowest level is working correctly with code points, which is what wrap() does. Going to a higher level of support comes after range support. I know little about combining characters. You obviously know much more, do you want to take charge of this function?
Apr 18 2015
On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:On 4/18/2015 12:58 AM, John Colvin wrote:Umlauts, if combined characters are used. Also words that still have their accents left after import from foreign languages. E.g. Café Getting all unicode correct seems a daunting task with a severe performance impact, esp. if we need to assume that a string might have any normalization form or none at all. See also: http://unicode.org/reports/tr15/#Norm_FormsOn Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:I know a bit of German, for what characters is that not true?On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:Code points aren't equivalent to characters. They're not the same thing in most European languages,So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's).It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Apr 18 2015
On 4/18/2015 1:26 AM, Panke wrote:On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.On 4/18/2015 12:58 AM, John Colvin wrote:Umlauts, if combined characters are used. Also words that still have their accents left after import from foreign languages. E.g. CaféOn Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:I know a bit of German, for what characters is that not true?On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:Code points aren't equivalent to characters. They're not the same thing in most European languages,So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's).It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Apr 18 2015
That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.Yes, but you may have perfectly fine unicode text where the combined form is used. Actually there is a normalization form for unicode that requires the combined form. To be fully correct phobos needs to handle that as well.
Apr 18 2015
On 2015-04-18 12:27, Walter Bright wrote:That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.This code snippet demonstrates the problem: import std.stdio; void main () { dstring a = "e\u0301"; dstring b = "é"; assert(a != b); assert(a.length == 2); assert(b.length == 1); writeln(a, " ", b); } If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htm -- /Jacob Carlborg
Apr 18 2015
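A normalization pass makes the two spellings compare equal. A minimal sketch using std.uni.normalize (NFC is chosen arbitrarily here; NFD would serve equally well for comparison purposes):

    import std.uni : normalize, NFC;

    void main()
    {
        dstring a = "e\u0301"; // decomposed: 'e' plus combining acute accent
        dstring b = "é";       // precomposed: U+00E9

        // Once both sides are in the same normalization form, they compare equal.
        assert(normalize!NFC(a) == normalize!NFC(b));
        assert(normalize!NFC(a).length == 1); // NFC composes to a single code point
    }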
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:On 2015-04-18 12:27, Walter Bright wrote:Yep, this was the cause of some bugs I had in my program. The thing is you never know if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.This code snippet demonstrates the problem: import std.stdio; void main () { dstring a = "e\u0301"; dstring b = "é"; assert(a != b); assert(a.length == 2); assert(b.length == 1); writeln(a, " ", b); } If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Apr 18 2015
On Saturday, 18 April 2015 at 11:52:52 UTC, Chris wrote:On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:byGrapheme to the rescue: http://dlang.org/phobos/std_uni.html#byGrapheme Or is this unsuitable here?On 2015-04-18 12:27, Walter Bright wrote:Yep, this was the cause of some bugs I had in my program. The thing is you never know if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.This code snippet demonstrates the problem: import std.stdio; void main () { dstring a = "e\u0301"; dstring b = "é"; assert(a != b); assert(a.length == 2); assert(b.length == 1); writeln(a, " ", b); } If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Apr 18 2015
On 2015-04-18 14:25, Gary Willoughby wrote:byGrapheme to the rescue: http://dlang.org/phobos/std_uni.html#byGrapheme Or is this unsuitable here?How is byGrapheme supposed to be used? I tried this but it doesn't do what I expected: foreach (e ; "e\u0301".byGrapheme) writeln(e); -- /Jacob Carlborg
Apr 18 2015
On Saturday, 18 April 2015 at 12:48:53 UTC, Jacob Carlborg wrote:On 2015-04-18 14:25, Gary Willoughby wrote:void main() { import std.stdio; import std.uni; foreach (e ; "e\u0301".byGrapheme) writeln(e[]); }byGrapheme to the rescue: http://dlang.org/phobos/std_uni.html#byGrapheme Or is this unsuitable here?How is byGrapheme supposed to be used? I tried this put it doesn't do what I expected: foreach (e ; "e\u0301".byGrapheme) writeln(e);
Apr 18 2015
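The e[] slice is what makes the difference: writeln(e) prints the Grapheme struct itself, while e[] yields the grapheme's code points. For counting rather than printing, walkLength over byGrapheme gives the user-perceived length; a small sketch:

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        // "e\u0301" is one user-perceived character built from two code points.
        assert("e\u0301".byGrapheme.walkLength == 1); // one grapheme
        assert("e\u0301".length == 3);                // but three UTF-8 code units
    }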
On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:Wait, I thought the recommended approach is to normalize first, then do string processing later? Normalizing first will eliminate inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual string function. Of course, even after normalization, you still have the issue of zero-width characters and combining diacritics, because not every language has precomposed characters handy. Using byGrapheme, within the current state of Phobos, is still the best bet as to correctly counting the number of printed columns as opposed to the number of "characters" (which, in the Unicode definition, does not always match the layman's notion of "character"). Unfortunately, byGrapheme may allocate, which fails Walter's requirements. Well, to be fair, byGrapheme only *occasionally* allocates -- only for input with unusually long sequences of combining diacritics -- for normal use cases you'll pretty much never have any allocations. But the language can't express the idea of "occasionally allocates", there is only "allocates" or "@nogc". Which makes it unusable in @nogc code. One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme. T -- Just because you survived after you did it, doesn't mean it wasn't stupid!On 2015-04-18 12:27, Walter Bright wrote:Yep, this was the cause of some bugs I had in my program. The thing is you never know if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.This code snippet demonstrates the problem: import std.stdio; void main () { dstring a = "e\u0301"; dstring b = "é"; assert(a != b); assert(a.length == 2); assert(b.length == 1); writeln(a, " ", b); } If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Apr 18 2015
Wait, I thought the recommended approach is to normalize first, then do string processing later? Normalizing first will eliminate inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual string function.1. Problem: Normalization is not closed under almost all operations. E.g. concatenating two normalized strings does not guarantee the result is in normalized form. 2. Problem: Some unicode algorithms e.g. string comparison require a normalization step. It doesn't matter which form you use, but you have to pick one. Now we could say that all strings passed to phobos have to be normalized as (say) NFC and that phobos functions thus skip the normalization.
Apr 18 2015
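A short illustration of problem 1: both operands below are individually valid NFC (a lone combining mark is a legal, already-normalized "defective" sequence), yet the concatenation is not. This assumes std.uni.normalize, as used elsewhere in this thread:

    import std.uni : normalize, NFC;

    void main()
    {
        string a = "e";      // already in NFC
        string b = "\u0301"; // a lone combining acute accent is also valid NFC

        string c = a ~ b;    // concatenation produces "e\u0301"...
        assert(c != normalize!NFC(c));   // ...which is not in NFC
        assert(normalize!NFC(c) == "é"); // NFC composes it to U+00E9
    }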
On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:This is why on OS X I always normalized strings to composed. However, there are always issues with Unicode, because, as you said, the layman's notion of what a character is is not the same as Unicode's. I wrote a utility function that uses byGrapheme and byCodePoint. It's a bit of an overhead, but I always get the correct length and character access (e.g. if txt.startsWith("é")).On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:Wait, I thought the recommended approach is to normalize first, then do string processing later? Normalizing first will eliminate inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual string function. Of course, even after normalization, you still have the issue of zero-width characters and combining diacritics, because not every language has precomposed characters handy. Using byGrapheme, within the current state of Phobos, is still the best bet as to correctly counting the number of printed columns as opposed to the number of "characters" (which, in the Unicode definition, does not always match the layman's notion of "character"). Unfortunately, byGrapheme may allocate, which fails Walter's requirements. Well, to be fair, byGrapheme only *occasionally* allocates -- only for input with unusually long sequences of combining diacritics -- for normal use cases you'll pretty much never have any allocations. But the language can't express the idea of "occasionally allocates", there is only "allocates" or "@nogc". Which makes it unusable in @nogc code. One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme. TOn 2015-04-18 12:27, Walter Bright wrote:Yep, this was the cause of some bugs I had in my program. The thing is you never know if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.This code snippet demonstrates the problem: import std.stdio; void main () { dstring a = "e\u0301"; dstring b = "é"; assert(a != b); assert(a.length == 2); assert(b.length == 1); writeln(a, " ", b); } If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Apr 18 2015
On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Apr 18 2015
On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of? T -- "How are you doing?" "Doing what?"One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Apr 18 2015
On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:If there's no need for allocation at all, why does it allocate? This should be fixed.On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of?One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Apr 18 2015
On Sat, Apr 18, 2015 at 11:37:27AM -0700, Walter Bright via Digitalmars-d wrote:On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:AFAICT, the only reason it allocates is because it shares the same underlying implementation as byGrapheme. There's probably a way to fix this, I just don't have the time right now to figure out the code. T -- Маленькие детки - маленькие бедки.On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:If there's no need for allocation at all, why does it allocate? This should be fixed.On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of?One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Apr 18 2015
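For reference, graphemeStride itself can be used to walk graphemes without materializing them, which is all a column counter needs; a sketch (whether a given Phobos version manages to avoid allocation inside graphemeStride is exactly the open question above):

    import std.uni : graphemeStride;

    // Count user-perceived characters by skipping whole graphemes.
    size_t countGraphemes(string s)
    {
        size_t count, i;
        while (i < s.length)
        {
            i += graphemeStride(s, i); // code units spanned by the next grapheme
            ++count;
        }
        return count;
    }

    void main()
    {
        assert(countGraphemes("e\u0301") == 1);
        assert(countGraphemes("café") == 4);
    }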
On 4/18/15 4:35 AM, Jacob Carlborg wrote:On 2015-04-18 12:27, Walter Bright wrote:Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Then the rest of Phobos doesn't need to mind these combining characters. -- AndreiThat doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.This code snippet demonstrates the problem: import std.stdio; void main () { dstring a = "e\u0301"; dstring b = "é"; assert(a != b); assert(a.length == 2); assert(b.length == 1); writeln(a, " ", b); } If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Apr 18 2015
Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline.Yes.Then the rest of Phobos doesn't need to mind these combining characters. -- AndreiI don't think so. The thing is, even after normalization we have to deal with combining characters because in all normalization forms there will be combining characters left after normalization.
Apr 18 2015
On Saturday, 18 April 2015 at 17:04:54 UTC, Tobias Pankrath wrote:Yes, again and again I encountered length related bugs with Unicode characters. Normalization is not 100% reliable. I don't know anyone who works with non English characters who doesn't have problems with Unicode related issues sometimes.Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline.Yes.Then the rest of Phobos doesn't need to mind these combining characters. -- AndreiI don't think so. The thing is, even after normalization we have to deal with combining characters because in all normalization forms there will be combining characters left after normalization.
Apr 20 2015
Yes, again and again I encountered length related bugs with Unicode characters. Normalization is not 100% reliable.I think it is 100% reliable, it just doesn't make the problems go away. It just guarantees that two strings normalized to the same form are binary equal iff they are equal in the unicode sense. Nothing about columns or string length or grapheme count.
Apr 20 2015
On Monday, 20 April 2015 at 11:04:58 UTC, Panke wrote:The problem is not normalization as such, the problem is with string (as opposed to dstring): import std.uni : normalize, NFC; void main() { dstring de_one = "é"; dstring de_two = "e\u0301"; assert(de_one.length == 1); assert(de_two.length == 2); string e_one = "é"; string e_two = "e\u0301"; string random = "ab"; assert(e_one.length == 2); assert(e_two.length == 3); assert(e_one.length == random.length); assert(normalize!NFC(e_one).length == 2); assert(normalize!NFC(e_two).length == 2); } This can lead to subtle bugs, cf. length of random and e_one. You have to convert everything to dstring to get the "expected" result. However, this is not always desirable.Yes, again and again I encountered length related bugs with Unicode characters. Normalization is not 100% reliable.I think it is 100% reliable, it just doesn't make the problems go away. It just guarantees that two strings normalized to the same form are binary equal iff they are equal in the unicode sense. Nothing about columns or string length or grapheme count.
Apr 20 2015
This can lead to subtle bugs, cf. length of random and e_one. You have to convert everything to dstring to get the "expected" result. However, this is not always desirable.There are three things that you need to be aware of when handling unicode: code units, code points and graphemes. In general the length of one guarantees nothing about the length of the other, except for utf32, which is a 1:1 mapping between code units and code points. In this thread, we were discussing the relationship between code points and graphemes. Your examples however apply to the relationship between code units and code points. To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units. If you normalize a string (in the sequence of characters/codepoints sense, not object.string) to NFC, it will decompose every precomposed character in the string (like é, a single code point), establish a defined order among the combining characters and then recompose a selected few graphemes (like é). This way é always ends up as a single code point in NFC. There are dozens of other combinations where you'll still have n:1 mapping between code points and graphemes left after normalization. Example given already in this thread: putting an arrow over a Latin letter is typical in math and always more than one codepoint.
Apr 20 2015
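The three levels side by side, for the running example (walkLength over a string iterates by auto-decoded code points):

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "e\u0301"; // 'e' plus combining acute accent

        assert(s.length == 3);                // code units: UTF-8 bytes
        assert(s.walkLength == 2);            // code points: auto-decoded dchars
        assert(s.byGrapheme.walkLength == 1); // graphemes: what a reader sees
    }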
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.Even that's not really true. In the end it's up to the font and layout engine to decide how much space anything takes up. Unicode doesn't play nicely with the idea of text as a grid of rows and fixed-width columns of characters, although quite a lot can (and is, see urxvt for example) be shoe-horned in.
Apr 20 2015
On Mon, Apr 20, 2015 at 06:03:49PM +0000, John Colvin via Digitalmars-d wrote:On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:Yeah, even the grapheme count does not necessarily tell you how wide the printed string really is. The characters in the CJK block are usually rendered with fonts that are, on average, twice as wide as your typical Latin/Cyrillic character, so even applications like urxvt that shoehorn proportional-width fonts into a text grid render CJK characters as two columns rather than one. Because of this, I actually wrote a function at one time to determine the width of a given Unicode character (i.e., zero, single, or double) as displayed in urxvt. Obviously, this is no help if you need to wrap lines rendered with a proportional font. And it doesn't even attempt to work correctly with bidi text. This is why I said at the beginning that wrapping a line of text is a LOT harder than it sounds. A function that only takes a string as input does not have the necessary information to do this correctly in all use cases. The current wrap() function doesn't even do it correctly modulo the information available: it doesn't handle combining diacritics and zero-width characters properly. In fact, it doesn't even handle control characters properly, except perhaps for \t and \n. There are so many things wrong with the current wrap() function (and many other string-processing functions in Phobos) that it makes it look like a joke when we claim that D provides Unicode correctness out-of-the-box. The only use case where wrap() gives the correct result is when you stick with pre-Unicode Latin strings to be displayed on a text console. As such, I don't really see the general utility of wrap() as it currently stands, and I question its value in Phobos, as opposed to an actually more useful implementation that, for instance, correctly implements the Unicode line-breaking algorithm. T -- It said to install Windows 2000 or better, so I installed Linux instead.To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.Even that's not really true. In the end it's up to the font and layout engine to decide how much space anything takes up. Unicode doesn't play nicely with the idea of text as a grid of rows and fixed-width columns of characters, although quite a lot can (and is, see urxvt for example) be shoe-horned in.
Apr 20 2015
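A deliberately naive version of the kind of width function described above might look like the sketch below. The block ranges are a rough assumption standing in for the real East_Asian_Width property, not a faithful reimplementation of it (or of whatever urxvt does), and "combining class > 0" is only an approximation of zero-width marks:

    import std.uni : combiningClass;

    // Estimated terminal columns for one code point: combining marks take 0,
    // a few common East Asian blocks take 2, everything else takes 1.
    size_t displayWidth(dchar c)
    {
        if (combiningClass(c) > 0) // most combining marks render at zero width
            return 0;
        if ((c >= 0x1100 && c <= 0x115F) || // Hangul Jamo
            (c >= 0x2E80 && c <= 0x9FFF) || // CJK radicals through ideographs
            (c >= 0xAC00 && c <= 0xD7A3) || // Hangul syllables
            (c >= 0xFF00 && c <= 0xFF60))   // fullwidth forms
            return 2;
        return 1;
    }

    void main()
    {
        import std.algorithm.iteration : map, sum;
        assert("ＡＢ漢e\u0301"d.map!displayWidth.sum == 7); // 2+2+2+1+0
    }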
On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:Why? Doesn't string.length give you the byte count?To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.Even that's not really true.
Apr 20 2015
On Monday, 20 April 2015 at 19:24:01 UTC, Panke wrote:On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:You'll also need the unicode character display width: Even if the font is monospaced, there are characters (Katakana, Hangul and even in the Latin script) with variable width. ABCDEFGH ＡＢＣＤＥＦＧＨ (Unicode 0xFF21 through 0xFF28). If the text above is not correctly displayed on your computer, a Korean console can be viewed here: http://upload.wikimedia.org/wikipedia/commons/1/14/KoreanDOSPrompt.pngOn Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:Why? Doesn't string.length give you the byte count?To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.Even that's not really true.
Apr 20 2015
On Monday, 20 April 2015 at 19:24:01 UTC, Panke wrote:On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:I think what you are looking for is string.sizeof? From the D reference .sizeof Returns the array length multiplied by the number of bytes per array element. .length Returns the number of elements in the array. This is a fixed quantity for static arrays. It is of type size_t. Isn't a string type an array of characters (char[] string UTF-8, wchar[] string UTF-16, and dchar[] string UTF-32) and not arbitrary bytes?On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:Why? Doesn't string.length give you the byte count?To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.Even that's not really true.
Apr 21 2015
On Tuesday, 21 April 2015 at 13:06:22 UTC, JohnnyK wrote:On Monday, 20 April 2015 at 19:24:01 UTC, Panke wrote:I was talking about the "you'll need the number of graphemes". s.length returns the number of elements in the slice, which in the case of D's string types gives is the same as the number of code units.On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:Why? Doesn't string.length give you the byte count?To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.Even that's not really true.I think what you are looking for is string.sizeof? From the D reference .sizeof Returns the array length multiplied by the number of bytes per array element. .length Returns the number of elements in the array. This is a fixed quantity for static arrays. It is of type size_t.That is for static arrays only. .sizeof for slices is just size_t.sizeof + T*.sizeof i.e. 8 on 32 bit, 16 on 64 bit.
Apr 21 2015
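A quick check of the difference (the comments assume a 64-bit target):

    void main()
    {
        string s = "héllo";
        char[5] fixed = "hello";

        assert(s.sizeof == size_t.sizeof + (void*).sizeof); // slice header: 16 bytes
        assert(fixed.sizeof == 5); // static array: the elements themselves
        assert(s.length == 6);     // "héllo" is 6 UTF-8 code units, not 5 characters
    }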
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:This is why I use a helper function that uses byCodePoint and byGrapheme. At least for my use cases it returns the correct length. However, I might think about an alternative version based on the discussion here.This can lead to subtle bugs, cf. length of random and e_one. You have to convert everything to dstring to get the "expected" result. However, this is not always desirable.There are three things that you need to be aware of when handling unicode: code units, code points and graphems.In general the length of one guarantees anything about the length of the other, except for utf32, which is a 1:1 mapping between code units and code points. In this thread, we were discussing the relationship between code points and graphemes. You're examples however apply to the relationship between code units and code points. To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units. If you normalize a string (in the sequence of characters/codepoints sense, not object.string) to NFC, it will decompose every precomposed character in the string (like é, single codeunit), establish a defined order between the composite characters and then recompose a selected few graphemes (like é). This way é always ends up as a single code unit in NFC. There are dozens of other combinations where you'll still have n:1 mapping between code points and graphemes left after normalization. Example given already in this thread: putting an arrow over an latin letter is typical in math and always more than one codepoint.
Apr 20 2015
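Chris doesn't show his helper, but something along these lines would cover the two use cases he names. The names here are made up for illustration, and normalizing both sides in startsWithGrapheme is one possible strategy rather than the only one:

    import std.range : walkLength;
    import std.uni : byGrapheme, normalize, NFC;

    // User-perceived length, independent of composed/decomposed input.
    size_t graphemeLength(string s)
    {
        return s.byGrapheme.walkLength;
    }

    // Prefix test that treats "é" and "e\u0301" as the same text.
    bool startsWithGrapheme(string text, string prefix)
    {
        import std.algorithm.searching : startsWith;
        return normalize!NFC(text).startsWith(normalize!NFC(prefix));
    }

    void main()
    {
        assert(graphemeLength("e\u0301") == 1);
        assert(startsWithGrapheme("e\u0301tude", "é"));
    }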
On Saturday, 18 April 2015 at 16:01:20 UTC, Andrei Alexandrescu wrote:On 4/18/15 4:35 AM, Jacob Carlborg wrote:Normalisation can allow some simplifications, sometimes, but knowing whether it will or not requires a lot of a priori knowledge about the input as well as the normalisation form.On 2015-04-18 12:27, Walter Bright wrote:Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Then the rest of Phobos doesn't need to mind these combining characters. -- AndreiThat doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.This code snippet demonstrates the problem: import std.stdio; void main () { dstring a = "e\u0301"; dstring b = "é"; assert(a != b); assert(a.length == 2); assert(b.length == 1); writeln(a, " ", b); } If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Apr 19 2015
On 4/18/2015 4:35 AM, Jacob Carlborg wrote:\u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htmI won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
Apr 18 2015
On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote:On 4/18/2015 4:35 AM, Jacob Carlborg wrote:Well, *somebody* has to convert it to the single code point eacute, whether it's the human (if the keyboard has a single key for it), or the code interpreting keystrokes (the user may have typed it as e + combining acute), or the program that generated the combination, or the program that receives the data. When we don't know provenance of incoming data, we have to assume the worst and run normalization to be sure that we got it right. The two code-point version may also arise from string concatenation, in which case normalization has to be done again (or possibly from the point of concatenation, given the right algorithms). T -- Mediocrity has been pushed to extremes.\u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htmI won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
Apr 18 2015
On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote:Data entry should be handled by the driver program, not a universal interchange format.On 4/18/2015 4:35 AM, Jacob Carlborg wrote:Well, *somebody* has to convert it to the single code point eacute, whether it's the human (if the keyboard has a single key for it), or the code interpreting keystrokes (the user may have typed it as e + combining acute), or the program that generated the combination, or the program that receives the data.\u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htmI won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.When we don't know provenance of incoming data, we have to assume the worst and run normalization to be sure that we got it right.I'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.
Apr 18 2015
On Sat, Apr 18, 2015 at 11:40:08AM -0700, Walter Bright via Digitalmars-d wrote:On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:[...]Take it up with the Unicode consortium. :-) T -- Tech-savvy: euphemism for nerdy.When we don't know provenance of incoming data, we have to assume the worst and run normalization to be sure that we got it right.I'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.
Apr 18 2015
On 4/18/2015 1:22 PM, H. S. Teoh via Digitalmars-d wrote:Take it up with the Unicode consortium. :-)I see nobody knows :-)
Apr 18 2015
On 18/04/15 21:40, Walter Bright wrote:I'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.A lot of areas in Unicode are due to pre-Unicode legacy. I'm guessing here, but looking at the code points, é (U00e9 - Latin small letter E with acute), which comes from Latin-1, which is designed to follow ISO-8859-1. U0301 (Combining acute accent) comes from "Combining diacritical marks". The way I understand things, Unicode would really prefer to use U0065+U0301 rather than U00e9. Because of legacy systems, and because they would rather have the ISO-8859 code pages be 1:1 mappings, rather than 1:n mappings, they introduced code points they really would rather do without. This also explains the "presentation forms" code pages (e.g. http://www.unicode.org/charts/PDF/UFB00.pdf). These were intended to be glyphs, rather than code points. Due to legacy reasons, it was not possible to simply discard them. They received code points, with a warning not to use these code points directly. Also, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, do not, typically, have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with less than 6 code points, despite having only three letters. The last paragraph isn't strictly true. You can use UFB2C + U05B7 for the first letter instead of U05E9 + U05C2 + U05B7. You would be using the presentation form which, as pointed out above, is only there for legacy. Shachar or shall I say שחר
Apr 18 2015
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:On 18/04/15 21:40, Walter Bright wrote:Yes Arabic is similar tooI'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.A lot of areas in Unicode are due to pre-Unicode legacy. I'm guessing here, but looking at the code points, é (U00e9 - Latin small letter E with acute), which comes from Latin-1, which is designed to follow ISO-8859-1. U0301 (Combining acute accent) comes from "Combining diacritical marks". The way I understand things, Unicode would really prefer to use U0065+U0301 rather than U00e9. Because of legacy systems, and because they would rather have the ISO-8859 code pages be 1:1 mappings, rather than 1:n mappings, they introduced code points they really would rather do without. This also explains the "presentation forms" code pages (e.g. http://www.unicode.org/charts/PDF/UFB00.pdf). These were intended to be glyphs, rather than code points. Due to legacy reasons, it was not possible to simply discard them. They received code points, with a warning not to use these code points directly. Also, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, do not, typically, have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with less than 6 code points, despite having only three letters. The last paragraph isn't strictly true. You can use UFB2C + U05B7 for the first letter instead of U05E9 + U05C2 + U05B7. You would be using the presentation form which, as pointed out above, is only there for legacy. Shachar or shall I say שחר
Apr 19 2015
On 19/04/15 10:51, Abdulhaq wrote:On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:On 18/04/15 21:40, Walter Bright wrote:Actually, the Arabic presentation forms serve a slightly different purpose. In Hebrew, the presentation forms are mostly for Biblical text, where certain decorations are usually done. For Arabic, the main reason for the presentation forms is shaping. Almost every Arabic letter can be written in up to four different forms (alone, start of word, middle of word and end of word). This means that Arabic has 28 letters, but over 100 different shapes for those letters. These days, when the font can do the shaping, the 28 letters suffice. During the DOS days, you needed to actually store those glyphs somewhere, which means that you needed to allocate a number to them. In Hebrew, some letters also have a final form. Since the numbers are so significantly smaller, however, (22 letters, 5 of which have final forms), Hebrew keyboards actually have all 27 letters on them. Going strictly by the "Unicode way", one would be expected to spell שלום with U05DE as the last letter, and let the shaping engine figure out that it should use the final form (or add a ZWNJ). Since all Hebrew code charts contained a final form Mem, however, you actually spell it with U05DD in the end, and it is considered a distinct letter. ShacharAlso, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, do not, typically, have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with less than 6 code points, despite having only three letters.Yes Arabic is similar too
Apr 19 2015
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:U0065+U0301 rather than U00e9. Because of legacy systems, and because they would rather have the ISO-8509 code pages be 1:1 mappings, rather than 1:n mappings, they introduced code points they really would rather do without.That's probably right. It is in fact a major feat to have the world adopt a new standard wholesale, but there are also difficult "semiotic" issues when you encode symbols and different languages view symbols differently (e.g. is "ä" an "a" or do you have two unique letters in the alphabet?) Take "å", it can represent a unit (ångström) or a letter with a circle above it, or a unique letter in the alphabet. The letter "æ" can be seen as a combination of "ae" or a unique letter. And we can expect languages, signs and practices to evolve over time too. How can you normalize encodings without normalizing writing practice and natural language development? That would be beyond the mandate of a unicode standard organization...
Apr 19 2015
On Saturday, 18 April 2015 at 17:50:12 UTC, Walter Bright wrote:On 4/18/2015 4:35 AM, Jacob Carlborg wrote:é might be obvious, but Unicode isn't just for writing European prose. Uses for combining characters includes (but is *nowhere* near to limited to) mathematical notation, where the combinatorial explosion of possible combinations that still belong to one grapheme cluster (character is a familiar but misleading word when talking about Unicode) would trivially become an insanely (more atoms than in the universe levels of) large number of characters. Unicode is a nightmarish system in some ways, but considering how incredibly difficult the problem it solves is, it's actually not too crazy.\u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htmI won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
Apr 19 2015
On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:é might be obvious, but Unicode isn't just for writing European prose.it is also to insert pictures of the animals into text.Unicode is a nightmarish system in some ways, but considering how incredibly difficult the problem it solves is, it's actually not too crazy.it's not crazy, it's just broken in all possible ways: http://file.bestmx.net/ee/articles/uni_vs_code.pdf
Apr 19 2015
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:There's other uses for unicode? 🐧é might be obvious, but Unicode isn't just for writing European prose.it is also to insert pictures of the animals into text.
Apr 19 2015
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:it's not crazy, it's just broken in all possible ways: http://file.bestmx.net/ee/articles/uni_vs_code.pdfKetmar Great link, and a really good argument about the problems with Unicode. Quote from 'Instead of Conclusion' Yes. This is the root of Unicode misdesign. They mixed up two mutually exclusive approaches. They blended badly two different abstraction levels: the textual level which corresponds to a language idea and the graphical level which does not care of a language, yet cares of writing direction, subscripts, superscripts and so on. In other words we need two different Unicodes built on these two opposite principles, instead of the one built on an insane mix of controversial axioms. end quote. Perhaps Unicode needs to be rebuilt from the ground up?
Apr 19 2015
On Mon, 20 Apr 2015 01:27:36 +0000, Nick B wrote:Perhaps Unicode needs to be rebuilt from the ground up?alas, it's too late. now we'll live with that "unicode" crap for many years.
Apr 19 2015
On Monday, 20 April 2015 at 03:39:54 UTC, ketmar wrote:On Mon, 20 Apr 2015 01:27:36 +0000, Nick B wrote:Perhaps. or perhaps not. This community got together under Walter and Andrei's leadership to build a new programming language on the pillars of the old. Perhaps a new Unicode standard could start that way as well?Perhaps Unicode needs to be rebuilt from the ground up?alas, it's too late. now we'll live with that "unicode" crap for many years.
Apr 19 2015
On 2015-04-20 08:04, Nick B wrote:Perhaps a new Unicode standard, could start that way as well ?https://xkcd.com/927/ -- /Jacob Carlborg
Apr 20 2015
On 19/04/15 22:58, ketmar wrote:On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote: it's not crazy, it's just broken in all possible ways: http://file.bestmx.net/ee/articles/uni_vs_code.pdfThis is not a very accurate depiction of Unicode. For example: And, moreover, BOM is meaningless without mentioning of encoding. So we have to specify encoding anyway. No. BOM is what lets you auto-detect the encoding. If you know you will be using UTF-8, 16 or 32 with an unknown encoding, BOM will tell you which it is. That is its entire purpose, in fact. And then: Unicode contains at least writing direction control symbols (LTR is U+200E and RTL is U+200F) whose role is IDENTICAL to the role of codepage-switching symbols with the associated disadvantages. No. LRM and RLM are invisible characters with defined directionality. Cutting them away from a substring would not invalidate your text more than cutting away actual text would under the same conditions. In any case, unlike page switching symbols, it would only affect your display, not your understanding of the text. Nonsense. He is right, I think, that denoting units with separate code points makes no sense, but the rest of his arguments seem completely off. For example, asking Latin and Cyrillic to share the same region merely because some letters look alike makes no sense, implementation-wise. Then there is his assumption that the situation is, somehow, worse than it was. Yes, if you knew your encoding was Windows-1255, you could assume the text is Hebrew. Or Yiddish. And this, I think, is one of the encodings with the least number of languages riding on it. Windows-1256 has Arabic, Persian, Urdu and others. Windows-1252 has the entire Western European script. As pointed out elsewhere in this thread, Spanish and French treat case folding of accented letters differently. Also, we see that the solution he thinks would work better actually doesn't. People living in France don't switch to a QWERTY keyboard when they want to type English. They type English with their AZERTY keyboard. There simply is no automatic way to tell what language something is typed in without a human telling you (or applying content based heuristics). Microsoft Word stores, for each letter, which was the keyboard language it was typed with. This causes great problems when copying to other editors, performing searches, or simply trying to get bidirectional text to appear correctly. The problem is so bad that phone numbers where the prefix appears after the actual number are not considered bad form or unusual, even in official PR material or when sending resumes. In fact, the only time you can count on someone to switch keyboards is when they need to switch to a language with a different alphabet. No Russian speaker will type English using the Russian layout, even if what she has to say happens to use letters with the same glyphs. You simply do not plan that much ahead. The point I'm driving at is that just because someone posted some rant on the Internet doesn't mean it's correct. When someone says something is broken, always ask them what they suggest instead. Shachar
Apr 19 2015
On Saturday, 18 April 2015 at 08:26:12 UTC, Panke wrote:
On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
On 4/18/2015 12:58 AM, John Colvin wrote:
On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote: So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. the Unicode line-breaking algorithm, which is one of the TRs).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Code points aren't equivalent to characters. They're not the same thing in most European languages.
I know a bit of German; for what characters is that not true?
Umlauts, if combining characters are used. Also words that still have their accents left after import from foreign languages, e.g. Café. Getting all of Unicode correct seems a daunting task with a severe performance impact, especially if we have to assume that a string might be in any normalization form or none at all. See also: http://unicode.org/reports/tr15/#Norm_Forms
Also, another issue is that lower-case and upper-case letters might have different size requirements, or look different depending on where in the word they are located: for example German ß and SS, or Greek σ and ς. I know Turkish also has similar cases. -- Paulo
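[To make the code point/character distinction concrete, here is a small demonstration of my own, not from the thread, using the two ways of spelling the Café example above:]

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string precomposed = "Caf\u00E9";   // é as a single code point (U+00E9)
    string combined    = "Cafe\u0301";  // 'e' + U+0301 COMBINING ACUTE ACCENT

    // Narrow strings auto-decode to dchar, so walkLength counts code points.
    assert(precomposed.walkLength == 4);
    assert(combined.walkLength == 5);   // five code points...

    // ...yet both display as the same four user-perceived characters.
    assert(precomposed.byGrapheme.walkLength == 4);
    assert(combined.byGrapheme.walkLength == 4);
}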
Apr 18 2015
Also, another issue is that lower-case and upper-case letters might have different size requirements, or look different depending on where in the word they are located: for example German ß and SS, or Greek σ and ς. I know Turkish also has similar cases. -- Paulo
While true, it does not affect wrap (the algorithm) as far as I can see.
Apr 18 2015
On 17/04/15 19:59, H. S. Teoh via Digitalmars-d wrote: There's also the question of what to do with bidi markings: how do you handle counting the columns in that case?
Which BiDi markings are you referring to? LRM/RLM and friends? If so, don't worry: the interface, as described, is incapable of properly handling BiDi anyway. The proper way to handle BiDi line wrapping is this. First you assign a BiDi level to each character (at which point the markings are, effectively, removed from the input, so there goes your problem). Then you accumulate the glyphs' widths until the line limit is reached, and then you reorder each line according to the BiDi levels you calculated earlier. As can easily be seen, this requires carrying per-paragraph BiDi information across the line-break logic, pretty much mandating multiple passes over the input. Since the requested interface does not allow that, proper BiDi line breaking is impossible with that interface. I'll mention that not everyone takes this as a serious problem. Windows' text control, for example, calculates line breaks on the text and then runs the BiDi algorithm on each line individually. Few people notice this. Then again, people have already grown used to BiDi text being scrambled. Shachar
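[A bare skeleton of the pass structure described above, entirely hypothetical: Phobos has no UAX #9 implementation, and assignBidiLevels/reorderVisually are stand-in names of mine. The point to notice is that pass 1 consumes the whole paragraph before pass 2 can break any line, which is exactly what a single streaming interface cannot express.]

import std.algorithm.comparison : min;

struct BidiChar { dchar c; ubyte level; }

// Pass 1 (stub): run the Unicode BiDi algorithm (UAX #9) over the whole
// paragraph, consuming LRM/RLM and friends and tagging each remaining
// character with its embedding level.
BidiChar[] assignBidiLevels(const(dchar)[] paragraph)
{
    assert(0, "stand-in for a real UAX #9 implementation");
}

// Pass 2: split into lines by accumulated width, still in logical order.
// (Grossly simplified: assumes every character occupies one column.)
BidiChar[][] breakByWidth(BidiChar[] para, size_t columns)
{
    BidiChar[][] lines;
    for (size_t i = 0; i < para.length; i += columns)
        lines ~= para[i .. min(i + columns, para.length)];
    return lines;
}

// Pass 3 (stub): reorder each broken line into visual order using the
// levels computed in pass 1.
dstring reorderVisually(const(BidiChar)[] line)
{
    assert(0, "stand-in for the UAX #9 reordering step");
}

// The pipeline: pass 1 needs the entire paragraph before pass 2 can
// break it, so the wrap logic cannot be a single forward pass.
dstring bidiWrap(const(dchar)[] paragraph, size_t columns)
{
    dstring result;
    foreach (line; breakByWidth(assignBidiLevels(paragraph), columns))
        result ~= reorderVisually(line) ~ "\n"d;
    return result;
}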
Apr 17 2015
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote: [...] -- All problems are easy in retrospect.
Argh, my Perl script doth mock me! T -- Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
Apr 17 2015
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote: [...] So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's). [...]

Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:

import std.range.primitives;

/**
 * Range version of $(D std.string.wrap).
 *
 * Bugs:
 * This function does not conform to the Unicode line-breaking algorithm. It
 * does not take into account zero-width characters, combining diacritics,
 * double-width characters, non-breaking spaces, and bidi markings. Strings
 * containing these characters therefore may not be wrapped correctly.
 */
auto wrapped(R)(R range, in size_t columns = 80, R firstindent = null,
                R indent = null, in size_t tabsize = 8)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    import std.algorithm.iteration : map, joiner;
    import std.range : chain;
    import std.uni;

    alias CharType = ElementType!R;

    // Returns: Wrapped lines.
    struct Result
    {
        private R range, indent;
        private size_t maxCols, tabSize;
        private size_t currentCol = 0;
        private R curIndent;
        bool empty = true;
        bool atBreak = false;

        this(R _range, R _firstindent, R _indent, size_t columns,
             size_t tabsize)
        {
            this.range = _range;
            this.curIndent = _firstindent.save;
            this.indent = _indent;
            this.maxCols = columns;
            this.tabSize = tabsize;
            empty = _range.empty;
        }

        @property CharType front()
        {
            if (atBreak)
                return '\n'; // should implicitly convert to wider characters
            else if (!curIndent.empty)
                return curIndent.front;
            else
                return range.front;
        }

        void popFront()
        {
            if (atBreak)
            {
                // We're at a linebreak.
                atBreak = false;
                currentCol = 0;

                // Start new line with indent
                curIndent = indent.save;
                return;
            }
            else if (!curIndent.empty)
            {
                // We're iterating over an initial indent.
                curIndent.popFront();
                currentCol++;
                return;
            }

            // We're iterating over the main range.
            range.popFront();
            if (range.empty)
            {
                empty = true;
                return;
            }

            if (range.front == '\t')
                currentCol += tabSize;
            else if (isWhite(range.front))
            {
                // Scan for next word boundary to decide whether or not
                // to break here.
                R tmp = range.save;
                assert(!tmp.empty);
                size_t col = currentCol;

                // Find start of next word
                while (!tmp.empty && isWhite(tmp.front))
                {
                    col++;
                    tmp.popFront();
                }

                // Remember start of next word so that if we need to
                // break, we won't introduce extraneous spaces to the
                // start of the new line.
                R nextWord = tmp.save;

                while (!tmp.empty && !isWhite(tmp.front))
                {
                    col++;
                    tmp.popFront();
                }
                assert(tmp.empty || isWhite(tmp.front));

                if (col > maxCols)
                {
                    // Word wrap needed. Move current range position to
                    // start of next word.
                    atBreak = true;
                    range = nextWord;
                    return;
                }
            }
            currentCol++;
        }

        @property Result save()
        {
            Result copy = this;
            copy.range = this.range.save;
            //copy.indent = this.indent.save; // probably not needed?
            copy.curIndent = this.curIndent.save;
            return copy;
        }
    }
    static assert(isForwardRange!Result);

    return Result(range, firstindent, indent, columns, tabsize);
}

unittest
{
    import std.algorithm.comparison : equal;
    auto s = ("This is a very long, artificially long, and gratuitously long "~
              "single-line sentence to serve as a test case for byParagraph.")
             .wrapped(30, ">>>>", ">>");
    assert(s.equal(
        ">>>>This is a very long,\n"~
        ">>artificially long, and\n"~
        ">>gratuitously long single-line\n"~
        ">>sentence to serve as a test\n"~
        ">>case for byParagraph."
    ));
}

I didn't bother with avoiding autodecoding -- that should be relatively easy to add, but I think it's stupid that we have to continually write workarounds in our code to get around auto-decoding. If it's so important that we don't autodecode, can we pretty please make the stupid decision already and kill it off for good?!

T -- To err is human; to forgive is not our policy. -- Samuel Adler
Apr 17 2015
On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote: Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:
awesome! Please make a pull request for this so you get proper credit!
Apr 17 2015
On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote: Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:
awesome! Please make a pull request for this so you get proper credit!
Doesn't that mean I have to add the autodecoding workarounds first? T -- Life is too short to run proprietary software. -- Bdale Garbee
Apr 17 2015
On 4/17/2015 11:46 AM, H. S. Teoh via Digitalmars-d wrote:
On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote: Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:
awesome! Please make a pull request for this so you get proper credit!
Doesn't that mean I have to add the autodecoding workarounds first?
Before it gets pulled, yes: the element type of front() should match the element encoding type of Range. There's also an issue with firstindent and indent being the same range type as 'range', which is not practical, since Range is likely a Voldemort type; I suggest making them simply of type 'string'. I don't see any point in making them ranges. A unit test with an input range is needed, and one with some multibyte unicode encodings.
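[For concreteness, the suggested signature change might look like this; a sketch of my own rendering of the suggestion, not a committed API. The indent parameters become plain strings while the input stays generic:]

import std.range.primitives : isForwardRange, ElementType;

// Same wrapper as above, but with the indents as plain strings, so the
// caller never has to name (or construct) the possibly-Voldemort type R.
auto wrapped(R)(R range, in size_t columns = 80,
                string firstindent = null, string indent = null,
                in size_t tabsize = 8)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    // ... body as in the implementation above, iterating the indents
    // as strings rather than as R ...
}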
Apr 17 2015
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote: Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:
there is some... inconsistency: `std.string.wrap` adds a final "\n" to the string. ;-) but i always hated it for that.
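[The behavior ketmar refers to is easy to verify; a one-line check of my own, assuming I recall std.string.wrap's documented contract correctly, namely that the result is always '\n'-terminated:]

unittest
{
    import std.string : wrap;
    // std.string.wrap appends a final '\n' even when no wrapping occurs.
    assert("hello world".wrap(80) == "hello world\n");
}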
Apr 17 2015
On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote: Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:
there is some... inconsistency: `std.string.wrap` adds a final "\n" to the string. ;-) but i always hated it for that.
A range of lines instead of inserted \n would be a good API as well.
Apr 17 2015
On Fri, Apr 17, 2015 at 08:44:51PM +0000, Panke via Digitalmars-d wrote:
On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote: Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:
there is some... inconsistency: `std.string.wrap` adds a final "\n" to the string. ;-) but i always hated it for that.
A range of lines instead of inserted \n would be a good API as well.
Indeed, that would be even more useful; then you could just do .joiner("\n") to get the original functionality. However, I think Walter's goal here is to match the original wrap() functionality. Perhaps the prospective wrapped() function could be implemented in terms of a byWrappedLines() function which does return a range of wrapped lines. T -- The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst
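[A sketch of that layering: byWrappedLines is the name from the post, but the body here is my own illustration and cheats by leaning on the eager, allocating std.string.wrap; a real version would wrap lazily, like the wrapped() range earlier in the thread.]

import std.algorithm.iteration : joiner, splitter;
import std.conv : to;
import std.string : wrap;

// Hypothetical lazy primitive, faked eagerly: a range of wrapped lines.
auto byWrappedLines(R)(R range, size_t columns = 80)
{
    return range.to!string
                .wrap(columns)[0 .. $ - 1] // drop wrap()'s trailing '\n'
                .splitter('\n');
}

// The original single-string behavior then falls out via joiner.
auto wrappedLines(R)(R range, size_t columns = 80)
{
    return range.byWrappedLines(columns).joiner("\n");
}

unittest
{
    import std.algorithm.comparison : equal;
    assert("a b c d".byWrappedLines(3).equal(["a b", "c d"]));
}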
Apr 18 2015
On 4/18/2015 1:32 PM, H. S. Teoh via Digitalmars-d wrote: However, I think Walter's goal here is to match the original wrap() functionality.
Yes, although the overarching goal is: Minimize Need For Using GC In Phobos. The method here is to use ranges rather than having to allocate string temporaries.
Apr 18 2015