digitalmars.D - VLERange: a range in between BidirectionalRange and RandomAccessRange
- Andrei Alexandrescu (36/36) Jan 10 2011 I've been thinking on how to better deal with Unicode strings. Currently...
- Michel Fortin (26/68) Jan 11 2011 Seems like a good idea to define things formally.
- Andrei Alexandrescu (9/36) Jan 11 2011 In the design as I thought of it, the effective length of one logical
- spir (25/53) Jan 11 2011 I think Michel is right. If I understand correctly, VLERange addresses
- Andrei Alexandrescu (31/81) Jan 11 2011 It' not about the data, it's about algorithms. Currently there are
- spir (24/51) Jan 11 2011 IIUC, for the case of text, VLERange helps abstracting from the annoying...
- Andrei Alexandrescu (3/34) Jan 11 2011 You should try text.front right now, you might be surprised :o).
- spir (18/30) Jan 11 2011 Hum, right now incorrectly returns "a" as expected. And indeed
- Michel Fortin (26/39) Jan 11 2011 Your understanding is correct.
- Andrei Alexandrescu (19/54) Jan 11 2011 I disagree. When I suggested this design I was worried of
- Steven Schveighoffer (11/48) Jan 11 2011 While this makes it possible to write algorithms that only accept
- Andrei Alexandrescu (13/21) Jan 11 2011 But that's neither here nor there. That would return the logical element...
- Steven Schveighoffer (18/39) Jan 11 2011 This solitary difference is a very thin argument -- foreach(d;
- Andrei Alexandrescu (18/60) Jan 11 2011 Unfinished sentence? Anyway, for my money you just described what we
- Steven Schveighoffer (24/54) Jan 13 2011 Sorry, I forgot '.' :)
- Andrei Alexandrescu (30/37) Jan 13 2011 Let's take a look:
- Steven Schveighoffer (22/59) Jan 13 2011 You might be looking at my previous version. The new version (recently ...
- Andrei Alexandrescu (15/67) Jan 13 2011 I was looking at your latest. It's code that compiles and runs, but
- Steven Schveighoffer (21/77) Jan 13 2011 iterating the code units is possible by accessing the array data. i.e. ...
- Nick Sabalausky (5/7) Jan 13 2011 I dunno, spir has successfully convinced me that most of the time it's
- spir (26/35) Jan 13 2011 You are right in that those 2 issues are really analog. In practice,
- Lutger Blijdestijn (19/29) Jan 15 2011 I agree. This is a very informative thread, thanks spir and everybody el...
- Michel Fortin (40/73) Jan 15 2011 I have my idea.
- Lutger Blijdestijn (5/23) Jan 15 2011 ...
- foobar (9/35) Jan 15 2011 My two cents are against this kind of design.
- Michel Fortin (47/91) Jan 15 2011 Nothing prevents that in the design I proposed. Andrei's design already
- foobar (19/126) Jan 15 2011 Ok, I guess I missed the "byDchar()" method.
- Michel Fortin (18/43) Jan 15 2011 What I don't understand is in what way using a string type would make
- foobar (10/31) Jan 15 2011 First thing, the question of possibility is irrelevant since I could als...
- Jonathan M Davis (28/108) Jan 15 2011 ake
- Michel Fortin (11/60) Jan 15 2011 I remember that someone already complained about this issue because he
- Jonathan M Davis (18/82) Jan 15 2011 If a character literal actually became a grapheme instead of a dchar, th...
- Michel Fortin (20/97) Jan 15 2011 Character literals are treated as simple numbers by the language. By
- foobar (10/41) Jan 15 2011 I Understand your concern regarding a simpler implementation. You want t...
- Michel Fortin (15/33) Jan 16 2011 It should also work for:
- foobar (6/49) Jan 16 2011 Right. This does require compiler changes.
- Michel Fortin (16/64) Jan 13 2011 That's forgetting that most of the time people care about graphemes
- Andrei Alexandrescu (10/70) Jan 13 2011 I'm not so sure about that. What do you base this assessment on? Denis
- Nick Sabalausky (39/46) Jan 13 2011 It's what they want, they just don't know it.
- Andrei Alexandrescu (6/18) Jan 13 2011 Thanks. One further question is: in the above example with
- Nick Sabalausky (9/33) Jan 13 2011 My understanding is "yes". At least that's what I've heard, and I've nev...
- Nick Sabalausky (8/44) Jan 13 2011 Heh, as if that wasn't bad enough, there's also digraphs which, from wha...
- Daniel Gibson (3/51) Jan 14 2011 OMG, this is really fucked up.
- Steven Schveighoffer (28/67) Jan 14 2011 http://en.wikipedia.org/wiki/Unicode_normalization
- Jonathan M Davis (10/85) Jan 14 2011 Well, there's plenty in std.string that already deals in strings rather ...
- spir (17/49) Jan 14 2011 The problem is then whether a font knows how to display it. My usual
- Michel Fortin (14/24) Jan 14 2011 Correct, there's a lot of combinations with no pre-combined form. This
- Gianluigi Rubino (3/12) Jan 14 2011 All the examples given so far worked fine on my iPhone.
- spir (12/15) Jan 14 2011 See my previous follow-up to nick's explanation. But the answer is yes,
- Daniel Gibson (5/15) Jan 14 2011 Agreed. Up until spir mentioned graphemes in this newsgroup I always
- spir (8/27) Jan 14 2011 That's what makes sense for the user in 99.9% case, thus that's what
- spir (26/76) Jan 14 2011 If anyone finds a pointer to such an explanation, bravo, and thank you.
- Nick Sabalausky (22/35) Jan 14 2011 Yea, most Unicode explanations seem to talk all about "code-units vs
- spir (67/101) Jan 16 2011 If anyone is interested, ICU's documentation is far more readable (and
- Walter Bright (5/11) Jan 15 2011 I know some German, and to the best of my knowledge there are zero combi...
- spir (54/89) Jan 14 2011 I'm aware of that, and I have no definitive answer to the question. The
- Steven Schveighoffer (8/45) Jan 14 2011 * I don't even know how to make a grapheme that is more than one
- spir (17/23) Jan 14 2011 1. See my text at
- Steven Schveighoffer (18/40) Jan 14 2011 I can't read that document, it's black background with super-dark-grey
- Michel Fortin (34/58) Jan 14 2011 Not in my knowledge. But I rarely deal with non-latin texts, there's
- Steven Schveighoffer (21/73) Jan 15 2011 Hm... this pushes the normalization outside the type, and into the
- Lutger Blijdestijn (5/19) Jan 15 2011 If its a matter of choosing which is the 'default' range, I'd think prop...
- Steven Schveighoffer (28/51) Jan 15 2011 English and (if I understand correctly) most other languages. Any
- foobar (5/66) Jan 15 2011 The above compromise provides zero benefit. The proposed default type st...
- Steven Schveighoffer (27/83) Jan 15 2011 I feel like you might be exaggerating, but maybe I'm completely wrong on...
- Steven Schveighoffer (7/24) Jan 15 2011 I see from Michel's post how normalization automatically can be bad. I ...
- foobar (7/105) Jan 15 2011 That was already shown by Michel and Spir where the equality operator is...
- Steven Schveighoffer (3/10) Jan 15 2011 Well said, I've changed my mind. Thanks for explaining.
- spir (16/18) Jan 17 2011 In a few days, D will have an external library able to deal with those
- spir (27/33) Jan 17 2011 Hello Steven,
- Steven Schveighoffer (7/37) Jan 17 2011 I'll reply to this to save you the trouble. I have reversed my position...
- Michel Fortin (34/88) Jan 15 2011 Why don't we build a compiler with an optimizer that generates correct
- Steven Schveighoffer (5/85) Jan 15 2011 You make very good points. I concede that using dchar as the element
- Michel Fortin (42/85) Jan 15 2011 Not really. It pushes the normalization to the string comparison
- Steven Schveighoffer (34/108) Jan 15 2011 Are these common requirements? I thought users mostly care about
- Michel Fortin (18/52) Jan 15 2011 I'm glad we agree on that now.
- Steven Schveighoffer (49/94) Jan 15 2011 It's a matter of me slowly wrapping my brain around unicode and how it's...
- foobar (3/119) Jan 15 2011 I like Michel's proposed semantics and I also agree with you that it sho...
- Steven Schveighoffer (12/18) Jan 17 2011 A grapheme would be its own specialized type. I'd probably remove the
- Jonathan M Davis (12/33) Jan 17 2011 I think that it would make good sense for a grapheme to be struct which ...
- Andrei Alexandrescu (4/37) Jan 17 2011 If someone makes a careful submission of a Grapheme to Phobos as
- Michel Fortin (67/137) Jan 15 2011 Actually, I don't think Unicode was so badly designed. It's just that
- Andrei Alexandrescu (23/119) Jan 15 2011 I'm unclear on where this is converging to. At this point the commitment...
- Jonathan M Davis (32/167) Jan 15 2011 Considering that strings are already dealt with specially in order to ha...
- Michel Fortin (12/25) Jan 15 2011 Walter's argument against changing this for foreach was that it'd
- Andrei Alexandrescu (4/26) Jan 16 2011 I think it's poor abstraction to represent a Grapheme as a string. It
- Andrei Alexandrescu (6/10) Jan 16 2011 It would make everything related a lot (a TON) slower, and it would
- Andrej Mitrovic (4/4) Jan 16 2011 And how would 3rd party libraries handle Graphemes? And C modules? I
- Steven Schveighoffer (15/27) Jan 17 2011 I would have agreed with you last week. Now I understand that using dch...
- Lars T. Kyllingstad (5/11) Jan 17 2011 Googling "unicode sample document" turned up a few examples. This one
- Andrei Alexandrescu (8/35) Jan 17 2011 This is one extreme. Char only works for English. Dchar works for most
- Andrei Alexandrescu (6/12) Jan 17 2011 Oh, one more thing. You don't need a lot of Unicode text containing
- Steven Schveighoffer (6/17) Jan 17 2011 True, benchmarking doesn't apply with combining characters because we ha...
- spir (19/30) Jan 17 2011 Correct. For this reason, we do not use the same source at all for
- spir (39/67) Jan 17 2011 Hello Steve & Andrei,
- Jonathan M Davis (19/187) Jan 15 2011 a
- Michel Fortin (22/47) Jan 15 2011 There's still a disagreement about whether a string or a code unit
- Andrei Alexandrescu (44/88) Jan 16 2011 Disagreement as that might be, a simple fact that needs to be taken into...
- Michel Fortin (55/117) Jan 16 2011 I think the only people who should *not* care are those who have
- Andrei Alexandrescu (22/136) Jan 16 2011 I love the increased precision, but again I'm not sure how many people
- Daniel Gibson (17/42) Jan 16 2011 So why does D use unicode anyway?
- Andrei Alexandrescu (4/52) Jan 16 2011 I think German text works well with dchar.
- Jonathan M Davis (85/144) Jan 16 2011 te
- Daniel Gibson (47/105) Jan 16 2011 Really? UTF32 - maybe. But IMHO even when not considering graphemes and ...
- Daniel Gibson (6/116) Jan 16 2011 of course I forgot:
- Michel Fortin (103/120) Jan 17 2011 As I said: all those people who are not validating the inputs to make
- Andrei Alexandrescu (80/197) Jan 17 2011 The question (which I see you keep on dodging :o)) is how much text
- Michel Fortin (88/207) Jan 17 2011 Not much, right now.
- Andrei Alexandrescu (7/10) Jan 17 2011 But at some point you must be able to talk about individual characters
- Michel Fortin (42/53) Jan 17 2011 It seems that it can. NSString only exposes individual UTF-16 code
- Michel Fortin (22/30) Jan 17 2011 This makes me think of what I did with my XML parser after you made
- Andrei Alexandrescu (3/29) Jan 17 2011 Very insightful. Thanks for sharing. Code it up and make a solid proposa...
- Steven Wawryk (3/38) Jan 17 2011 How does this differ from Steve Schveighoffer's string_t, subtract the
- Andrei Alexandrescu (3/44) Jan 18 2011 There's no string, only range...
- Steven Wawryk (8/43) Jan 18 2011 Which is exactly what I asked you about. I understand that you must be
- Andrei Alexandrescu (23/71) Jan 18 2011 One simple fact is that I'm not the only person who needs to look at a
- Steven Wawryk (10/53) Jan 18 2011 Ok, thanks for this suggestion. But if developing a proposal as
- Andrei Alexandrescu (8/73) Jan 18 2011 My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a
- Steven Wawryk (15/35) Jan 18 2011 I don't think that it did. I proposed no language change, nor anything
- Andrei Alexandrescu (13/45) Jan 18 2011 Adding a new string type would be disruptive. Unless I misunderstood,
- Michel Fortin (59/93) Jan 18 2011 What I use right now is this (see below). I'm not sure what would be a
- Andrei Alexandrescu (32/80) Jan 18 2011 [snip]
- Michel Fortin (33/121) Jan 18 2011 Yes, we need a grapheme range.
- spir (14/75) Jan 18 2011 On 01/18/2011 06:14 PM, Michel Fortin wrote:
- spir (13/39) Jan 18 2011 This looks like a very interesting approach. And clear.
- Ali Çehreli (33/37) Jan 18 2011 That's what I've been thinking. The users can choose whether they want
- spir (72/110) Jan 19 2011 This is very good and helpful summary. But you do not list all relevant
- spir (9/15) Jan 17 2011 Actually, there are at least 2 special cases:
- Steven Schveighoffer (44/149) Jan 17 2011 I didn't read the standard, all I understand about unicode is from this ...
- spir (27/32) Jan 17 2011 I think like you about pre-composed characters: they bring no real gain
- Ali Çehreli (11/17) Jan 17 2011 Thanks to all that has contributed, I am also following this thread with...
- spir (23/39) Jan 19 2011 This is true and false ;-)
- spir (21/25) Jan 17 2011 I am unsure now about the question of a text's (apparent) natural
- Gerrit Wichert (12/23) Jan 14 2011 I'm afraid that this is not a proper way to handle this problem. It may
- Steven Schveighoffer (5/30) Jan 15 2011 Actually, this would only lazily *and temporarily* convert the string pe...
- Joel C. Salomon (15/19) Jan 23 2011 Hebrew:
- Nick Sabalausky (13/15) Jan 14 2011 How to do that on the Windows (XP) command prompt, for anyone who doesn'...
- Nick Sabalausky (18/24) Jan 14 2011 Forget that step 2, that causes "Active code page: 65001" to be sent to
- Andrej Mitrovic (3/14) Jan 14 2011 Does that work for you? I get back:
- Nick Sabalausky (16/34) Jan 14 2011 Yea, it works for me (XP Pro SP2 32-bit), and my "chcp" is 437, not 6500...
- Andrej Mitrovic (4/5) Jan 14 2011 Nope, I still get the same results (tried with different fonts, lucida
- Nick Sabalausky (5/10) Jan 14 2011 Weird. Which version of windows are you on, and are you using the regula...
- Andrej Mitrovic (8/11) Jan 14 2011 Okay, it appears this is an issue with Console2. I'll have to report
- Andrej Mitrovic (10/12) Jan 14 2011 Woops, let me revise what I've said:
- Michel Fortin (50/61) Jan 14 2011 Apple implemented all these things in the NSString class in Cocoa. They
- Andrei Alexandrescu (25/42) Jan 14 2011 That's a strong indicator, but we shouldn't get ahead of ourselves.
- foobar (10/39) Jan 14 2011 Combining marks do need to be supported.
- Michel Fortin (18/31) Jan 14 2011 That's a good example. Although my attempt to extract the text from the
- foobar (3/17) Jan 15 2011 I've looked into this and I was wrong. Ruby is a layout feature as you s...
- Michel Fortin (16/63) Jan 14 2011 Then perhaps it's time we find out a way to handle non-Unicode
- spir (27/29) Jan 17 2011 Text has a perf module that provides such numbers (on different stages
- Andrei Alexandrescu (9/34) Jan 17 2011 Congrats on this great work. The initial numbers are in keeping with my
- spir (16/52) Jan 17 2011 Andrei, would you have a look at Text's current state, mainly
- Andrei Alexandrescu (28/39) Jan 17 2011 I think this is solid work that reveals good understanding of Unicode.
- spir (75/115) Jan 17 2011 We are exploring a new field. (Except for the work Objective-C designers...
- Andrei Alexandrescu (14/21) Jan 17 2011 Unfortunately I won't have much time to discuss all these points, but
- spir (26/47) Jan 18 2011 I think it is needed to repeat again the following: Text in my view (or
- Andrei Alexandrescu (3/37) Jan 18 2011 You don't provide O(n) indexing.
- Jonathan M Davis (10/13) Jan 17 2011 While it would be nice at times to be able to have an index with foreach...
- Andrei Alexandrescu (12/25) Jan 17 2011 It's a bit more difficult than that. When iterating a variable-length
- spir (23/57) Jan 18 2011 This is a very valid point: a range's logical offset is not necessary
- Steven Schveighoffer (12/45) Jan 19 2011 opApply in no way disables the range interface. It simply is used for
- spir (9/21) Jan 18 2011 You are right. I fully agree, in fact. On the other hand, think at
- spir (10/23) Jan 17 2011 Unfortunatly, things are complicated by _prepend_ combining marks that
- spir (12/70) Jan 11 2011 People interested in solving the general problem with Unicode strings
- Tomek Sowiński (28/69) Jan 11 2011
- Steven Wawryk (15/15) Jan 11 2011 Sorry if I'm jumping inhere without the appropriate background, but I
- Michel Fortin (30/47) Jan 11 2011 Actually, displaying a UTF-8/UTF-16 string involves a range of of
- Don (4/15) Jan 12 2011 I think the only problem that we really have, is that "char[]",
- Andrei Alexandrescu (34/49) Jan 12 2011 I hope to assuage part of that issue with representation(). Again, it's
- spir (26/29) Jan 12 2011 I'd like to know when it happens that codepoint is the appropriate level...
- Ali Çehreli (21/48) Jan 12 2011 Compare according to which alphabet's ordering? Surely not Unicode's...
- Michel Fortin (25/43) Jan 12 2011 I agree with you. I don't see many use for code points.
- Michel Fortin (13/28) Jan 12 2011 Crap, I meant to send this as UTF-8 with combining characters in it,
- spir (9/34) Jan 13 2011 Works :-) But your first post worked as well by me: for instance <<"é"
- spir (35/73) Jan 13 2011 Actually, I had once a real use case for codepoint beeing the proper
- Jonathan M Davis (48/72) Jan 13 2011 ) So
- spir (64/114) Jan 13 2011 D's arrays (even dchar[] & dstring) do not allow having correct results
- Michel Fortin (16/24) Jan 13 2011 D is not the first language dealing correctly with Unicode strings in
- spir (13/33) Jan 13 2011 Thank you very much for this information (I feel less lonely ;-).
- Michel Fortin (12/16) Jan 13 2011 Mac OS sorts file names in a "natural" way since a very long time
- Nick Sabalausky (3/13) Jan 13 2011 XP's explorer does that too. It's a very nice feature.
- Jonathan M Davis (48/171) Jan 13 2011 ". hope you
- Michel Fortin (24/38) Jan 13 2011 What's nice about Cocoa's way of handling strings is that even
- Andrej Mitrovic (3/3) Jan 13 2011 OT: Spir, do you know if I can change the syntax highlighting settings
- Nick Sabalausky (3/6) Jan 13 2011 I'm getting the same problem too.
- Michel Fortin (7/14) Jan 13 2011 I bypassed the problem by fetching the files from the repository. But I
- spir (78/104) Jan 13 2011 The problem is then: how does a library or application programmer know,
I've been thinking on how to better deal with Unicode strings. Currently strings are formally bidirectional ranges with a surreptitious random access interface. The random access interface accesses the support of the string, which is understood to hold data in a variable-encoded format. For as long as the programmer understands this relationship, code for string manipulation can be written with relative ease. However, there is still room for writing wrong code that looks legit.

Sometimes the best way to tackle a hairy reality is to invite it to the negotiation table and offer it promotion to first-class abstraction status. Along that vein I was thinking of defining a new range: VLERange, i.e. Variable Length Encoding Range. Such a range would have the power somewhere in between bidirectional and random access. The primitives offered would include empty, access to front and back, popFront and popBack (just like BidirectionalRange), and in addition properties typical of random access ranges: indexing, slicing, and length. Note that the result of the indexing operator is not the same as the element type of the range, as it only represents the unit of encoding.

In addition to these (and connecting the two), a VLERange would offer two additional primitives:

1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element.

2. size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element.

In both cases, offset is assumed to be at the beginning of a logical element of the range.

I suspect that a lot of functions in std.string can be written without Unicode-specific knowledge just by relying on such an interface. Moreover, algorithms can be generalized to other structures that use variable-length encoding, such as those used in data compression. (In that case, the support would be a bit array and the encoded type would be ubyte.)

Writing to such ranges is not addressed by this design.
Ideas are welcome. Adding VLERange would legitimize strings and would clarify their handling, at the cost of adding one additional concept that needs to be minded. Is the trade-off worthwhile?

Andrei
Jan 10 2011
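The two primitives above map directly onto UTF-8's self-describing lead and continuation bytes. A minimal sketch of stepSize/backstepSize for a UTF-8 buffer, written in Python purely as an editorial illustration (the proposal itself is for D ranges, and the snake_case names here are hypothetical):

```python
def step_size(data: bytes, offset: int) -> int:
    """Code units (bytes) occupied by the UTF-8 sequence starting at offset."""
    lead = data[offset]
    if lead < 0x80:   # 0xxxxxxx: single-byte (ASCII) sequence
        return 1
    if lead >= 0xF0:  # 11110xxx: four-byte sequence
        return 4
    if lead >= 0xE0:  # 1110xxxx: three-byte sequence
        return 3
    if lead >= 0xC0:  # 110xxxxx: two-byte sequence
        return 2
    raise ValueError("offset is not at the start of a code point")

def backstep_size(data: bytes, offset: int) -> int:
    """Backward step: scan back over continuation bytes (10xxxxxx)."""
    n = 1
    while data[offset - n] & 0xC0 == 0x80:
        n += 1
    return n

s = "aéℓ𝄞".encode("utf-8")  # 1-, 2-, 3- and 4-byte sequences in order
```

With this buffer, step_size(s, 0) is 1, step_size(s, 1) is 2, step_size(s, 3) is 3, and backstep_size(s, len(s)) is 4; both primitives assume, as the proposal states, that offset sits at the beginning of a logical element.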
On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:

> I've been thinking on how to better deal with Unicode strings. Currently strings are formally bidirectional ranges with a surreptitious random access interface. The random access interface accesses the support of the string, which is understood to hold data in a variable-encoded format. For as long as the programmer understands this relationship, code for string manipulation can be written with relative ease. However, there is still room for writing wrong code that looks legit.
>
> Sometimes the best way to tackle a hairy reality is to invite it to the negotiation table and offer it promotion to first-class abstraction status. Along that vein I was thinking of defining a new range: VLERange, i.e. Variable Length Encoding Range. Such a range would have the power somewhere in between bidirectional and random access. The primitives offered would include empty, access to front and back, popFront and popBack (just like BidirectionalRange), and in addition properties typical of random access ranges: indexing, slicing, and length. Note that the result of the indexing operator is not the same as the element type of the range, as it only represents the unit of encoding.

Seems like a good idea to define things formally.

> In addition to these (and connecting the two), a VLERange would offer two additional primitives:
>
> 1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element.
>
> 2. size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element.

I like the idea, but I'm not sure about this interface. What's the result of stepSize if your range must create two elements from one underlying unit? Perhaps in those cases the element type could be an array (to return more than one element from one iteration).

For instance, say we have a conversion range taking a Unicode string and converting it to ISO Latin 1. The best (lossy) conversion for "œ" is "oe" (one character to two characters); in this case 'front' could simply return "oe" (two characters) in one iteration, with stepSize being the size of the "œ" code point. In the same conversion process, encountering "e" followed by a combining "´" would return the pre-combined character "é" (two characters to one character).

> In both cases, offset is assumed to be at the beginning of a logical element of the range. I suspect that a lot of functions in std.string can be written without Unicode-specific knowledge just by relying on such an interface. Moreover, algorithms can be generalized to other structures that use variable-length encoding, such as those used in data compression. (In that case, the support would be a bit array and the encoded type would be ubyte.)

Applicability to other problems seems like a valuable benefit.

> Writing to such ranges is not addressed by this design. Ideas are welcome.

Writing, as in assigning to 'front'? That's not really possible with variable-length units, as it'd need to shift everything in case of a length difference. Or maybe you meant writing as in having an output range for variable-length elements... I'm not sure.

> Adding VLERange would legitimize strings and would clarify their handling, at the cost of adding one additional concept that needs to be minded. Is the trade-off worthwhile?

In my opinion it's not a trade-off at all; it's a formalization of how strings are handled, which is better in every regard than a "special case". I welcome this move very much.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 11 2011
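Michel's conversion range can be sketched concretely. The following Python fragment is an editorial illustration only; the FOLD table is hypothetical (Unicode deliberately gives "œ" no NFKD decomposition, so the oe-fold has to be done by hand), and NFC normalization handles his second case, where "e" plus combining "´" becomes one precomposed "é":

```python
import unicodedata

# Hypothetical fold table for characters with no Latin-1 code point:
# one underlying unit ("œ") yields two output elements ("oe").
FOLD = {"\u0153": "oe", "\u0152": "OE"}

def to_latin1(s: str) -> bytes:
    # NFC first: "e" + combining acute becomes precomposed "é"
    # (two units in, one element out), which Latin-1 can encode.
    composed = unicodedata.normalize("NFC", s)
    folded = "".join(FOLD.get(ch, ch) for ch in composed)
    return folded.encode("latin-1", errors="replace")
```

Under these assumptions, to_latin1("\u0153uf") yields b"oeuf" and to_latin1("e\u0301") yields b"\xe9", i.e. both the one-to-many and the many-to-one cases Michel describes.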
On 1/11/11 4:41 AM, Michel Fortin wrote:

> On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
>> In addition to these (and connecting the two), a VLERange would offer two additional primitives:
>> 1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element.
>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element.
>
> I like the idea, but I'm not sure about this interface. What's the result of stepSize if your range must create two elements from one underlying unit? Perhaps in those cases the element type could be an array (to return more than one element from one iteration). For instance, say we have a conversion range taking a Unicode string and converting it to ISO Latin 1. The best (lossy) conversion for "œ" is "oe" (one character to two characters); in this case 'front' could simply return "oe" (two characters) in one iteration, with stepSize being the size of the "œ" code point. In the same conversion process, encountering "e" followed by a combining "´" would return the pre-combined character "é" (two characters to one character).

In the design as I thought of it, the effective length of one logical element is one or more representation units. My understanding is that you are referring to a fractional number of representation units for one logical element.

>> Writing to such ranges is not addressed by this design. Ideas are welcome.
>
> Writing, as in assigning to 'front'? That's not really possible with variable-length units, as it'd need to shift everything in case of a length difference. Or maybe you meant writing as in having an output range for variable-length elements... I'm not sure.

Well, all of the above :o). Clearly assigning to e.g. front or back should not work. The question is what kind of API we can provide beyond simple append with put().

Andrei
Jan 11 2011
On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:

> On 1/11/11 4:41 AM, Michel Fortin wrote:
>> On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
>>> In addition to these (and connecting the two), a VLERange would offer two additional primitives:
>>> 1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element.
>>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element.
>>
>> I like the idea, but I'm not sure about this interface. What's the result of stepSize if your range must create two elements from one underlying unit? Perhaps in those cases the element type could be an array (to return more than one element from one iteration). For instance, say we have a conversion range taking a Unicode string and converting it to ISO Latin 1. The best (lossy) conversion for "œ" is "oe" (one character to two characters); in this case 'front' could simply return "oe" (two characters) in one iteration, with stepSize being the size of the "œ" code point. In the same conversion process, encountering "e" followed by a combining "´" would return the pre-combined character "é" (two characters to one character).
>
> In the design as I thought of it, the effective length of one logical element is one or more representation units. My understanding is that you are referring to a fractional number of representation units for one logical element.

I think Michel is right. If I understand correctly, VLERange addresses the low-level and rather simple issue of each codepoint being encoded as a variable number of code units. Right?

If yes, then what is the advantage of VLERange? D already has string/wstring/dstring, allowing to work with the most advantageous encoding according to given source data, and dstring abstracting from low-level encoding issues.

The main (and massively ignored) issue when manipulating unicode text is rather that, unlike with legacy character sets, one codepoint does *not* represent a character in the common sense. In character sets like latin-1:

* each code represents a character, in the common sense (eg "à")
* each character representation has the same size (1 or 2 bytes)
* each character has a single representation ("à" --> always 0xe0)

All of this is wrong with unicode. And these are complicated and high-level issues, that appear _after_ decoding, on codepoint sequences. If VLERange is helpful in dealing with those problems, then I don't understand your presentation, sorry. Do you for instance mean such a range would, under the hood, group together codes belonging to the same character (thus making indexing meaningful), and/or normalise (decomp & order) (thus allowing to comp/find/count correctly)?

denis
_________________
vita es estrany
spir.wikidot.com
Jan 11 2011
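spir's three latin-1 properties, and how Unicode breaks the third one, can be checked directly. An editorial illustration in Python (the thread's subject is D, but the Unicode facts are language-independent):

```python
import unicodedata

# latin-1: one fixed code per character -- "à" is always 0xe0.
assert "\u00e0".encode("latin-1") == b"\xe0"

# Unicode: the same character has two distinct representations...
precomposed = "\u00e0"   # U+00E0 LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"   # 'a' followed by U+0300 COMBINING GRAVE ACCENT

# ...so naive code-point comparison says they differ:
assert precomposed != decomposed

# Only after applying a normalization form do they compare equal:
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

This is exactly the "normalise (decomp & order)" work spir asks whether VLERange would do under the hood.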
On 1/11/11 9:09 AM, spir wrote:On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:It' not about the data, it's about algorithms. Currently there are algorithms that ostensibly work for bidirectional ranges, but internally "cheat" by detecting that the input is actually a string, and use that knowledge for better implementations. The benefit of VLERange would that that it legitimizes those algorithms. I wouldn't be surprised if an entire class of algorithms would in fact require VLERange (e.g. many of those that we commonly consider today "string" algorithms).On 1/11/11 4:41 AM, Michel Fortin wrote:I think Michel is right. If I understand correctly, VLERange addresses the low-level and rather simple issue of each codepoint beeing encoding as a variable number of code units. Right? If yes, then what is the advantage of VLERange? D already has string/wstring/dstring, allowing to work with the most advatageous encoding according to given source data, and dstring abstracting from low-level encoding issues.On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:In the design as I thought of it, the effective length of one logical element is one or more representation units. My understanding is that you are referring to a fractional number of representation units for one logical element.In addition to these (and connecting the two), a VLERange would offer two additional primitives: 1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element. 2. size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element.I like the idea, but I'm not sure about this interface. What's the result of stepSize if your range must create two elements from one underlying unit? Perhaps in those cases the element type could be an array (to return more than one element from one iteration). 
For instance, say we have a conversion range taking a Unicode string and converting it to ISO Latin 1. The best (lossy) conversion for "œ" is "oe" (one chararacter to two characters), in this case 'front' could simply return "oe" (two characters) in one iteration, with stepSize being the size of the "œ" code point. In the same conversion process, encountering "e" followed by a combining "´" would return pre-combined character "é" (two characters to one character).The main (and massively ignored) issue when manipulating unicode text is rather that, unlike with legacy character sets, one codepoint does *not* represent a character in the common sense. In character sets like latin-1: * each code represents a character, in the common sense (eg "à") * each character representation has the same size (1 or 2 bytes) * each character has a single representation ("à" --> always 0xe0) All of this is wrong with unicode. And these are complicated and high-level issues, that appear _after_ decoding, on codepoint sequences. If VLERange is helpful is dealing with those problems, then I don't understand your presentation, sorry. Do you for instance mean such a range would, under the hood, group together codes belonging to the same character (thus making indexing meaningful), and/or normalise (decomp & order) (thus allowing to comp/find/count correctly).?VLERange would offer automatic decoding in front, back, popFront, and popBack - just like BidirectionalRange does right now. It would also offer access to the representational support by means of indexing - also like char[] et al already do now. The difference is that VLERange being a formal concept, algorithms can specialize on it instead of (a) specializing for UTF strings or (b) specializing for BidirectionalRange and then manually detecting isSomeString inside. Conversely, when defining an algorithm you can specify VLARange as a requirement. 
Boyer-Moore is a perfect example - it doesn't work on bidirectional ranges, but it does work on VLERange. I suspect there are many like it. Of course, it would help a lot if we figured other remarkable VLERanges. Here are a few that come to mind: * Multibyte encodings other than UTF. Currently we have no special support for those beyond e.g. forward or bidirectional ranges. * Huffman, RLE, LZ encoded buffers (and many other compressed formats) * Vocabulary-based translation systems, e.g. associate each word with a number. * Others...? Some of these are forward-only (don't allow bidirectional access). Once we have a number of examples, it would be great to figure a number of remarkable algorithms operating on them. Andrei
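For UTF-8 specifically, the proposed stepSize primitive is easy to picture: the lead code unit alone determines the stride. A rough sketch in Python rather than D (the function name and layout are illustrative only, not part of any proposed interface):

```python
def step_size(buf: bytes, offset: int) -> int:
    """Length in code units of the element starting at offset (UTF-8)."""
    b = buf[offset]
    if b < 0x80:
        return 1            # ASCII: one code unit
    if b >> 5 == 0b110:
        return 2            # two-unit sequence
    if b >> 4 == 0b1110:
        return 3            # three-unit sequence
    if b >> 3 == 0b11110:
        return 4            # four-unit sequence
    raise ValueError("offset is not at the start of a code point")

s = "œuf".encode("utf-8")
assert step_size(s, 0) == 2   # 'œ' (U+0153) occupies two UTF-8 units
assert step_size(s, 2) == 1   # 'u' is a single unit
```

backstepSize would work the same way in reverse, scanning back over continuation units to the previous lead unit.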
Jan 11 2011
On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:IIUC, for the case of text, VLERange helps abstracting from the annoying fact that a codepoint is encoded as a variable number of code units. What I meant is issues like: auto text = "a\u0302"d; writeln(text); // "â" auto range = VLERange(text); // extracts characters correctly? auto letter = range.front(); // "a" or "â"? // case yes: compares correctly? assert(range.front() == "â"); // fail or pass? Both fail using all unicode-aware types I know of, because 1. They do not recognise that a character is represented by an arbitrary number of codes (code _points_). 2. They do not use normalised forms for comp, search, count, etc... (while in unicode a given char can have several representations).The main (and massively ignored) issue when manipulating unicode text is rather that, unlike with legacy character sets, one codepoint does *not* represent a character in the common sense. In character sets like latin-1: * each code represents a character, in the common sense (eg "à") * each character representation has the same size (1 or 2 bytes) * each character has a single representation ("à" --> always 0xe0) All of this is wrong with unicode. And these are complicated and high-level issues, that appear _after_ decoding, on codepoint sequences. If VLERange is helpful is dealing with those problems, then I don't understand your presentation, sorry. Do you for instance mean such a range would, under the hood, group together codes belonging to the same character (thus making indexing meaningful), and/or normalise (decomp & order) (thus allowing to comp/find/count correctly).?VLERange would offer automatic decoding in front, back, popFront, and popBack - just like BidirectionalRange does right now. 
It would also offer access to the representational support by means of indexing - also like char[] et al already do now.The difference is that VLERange being a formal concept, algorithms can specialize on it instead of (a) specializing for UTF strings or (b) specializing for BidirectionalRange and then manually detecting isSomeString inside. Conversely, when defining an algorithm you can specify VLERange as a requirement. Boyer-Moore is a perfect example - it doesn't work on bidirectional ranges, but it does work on VLERange. I suspect there are many like it. Of course, it would help a lot if we figured other remarkable VLERanges.I think I see the point, and the general usefulness of such an abstraction. But it would certainly be more useful in other fields than text manipulation, because there are far more annoying issues (that, like in the example above, simply prevent code correctness). Denis _________________ vita es estrany spir.wikidot.com
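The "group together codes belonging to the same character" idea spir raises can be sketched by attaching combining marks to their base code point. A rough Python illustration (the graphemes helper is hypothetical; real grapheme segmentation follows Unicode's UAX #29 rules and covers many more cases):

```python
import unicodedata

def graphemes(text: str):
    """Very rough grapheme grouping: attach combining marks to their base."""
    cluster = ""
    for cp in text:
        if cluster and unicodedata.combining(cp):
            cluster += cp          # combining mark: same logical character
        else:
            if cluster:
                yield cluster
            cluster = cp
    if cluster:
        yield cluster

# 'a' + combining circumflex + 'b' -> two characters, not three
assert list(graphemes("a\u0302b")) == ["a\u0302", "b"]
```

Indexing over such clusters, rather than over code points, is what would make s[i] "meaningful" in spir's sense.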
Jan 11 2011
On 1/11/11 4:46 PM, spir wrote:On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:You should try text.front right now, you might be surprised :o). AndreiIIUC, for the case of text, VLERange helps abstracting from the annoying fact that a codepoint is encoded as a variable number of code units. What I meant is issues like: auto text = "a\u0302"d; writeln(text); // "â" auto range = VLERange(text); // extracts characters correctly? auto letter = range.front(); // "a" or "â"? // case yes: compares correctly? assert(range.front() == "â"); // fail or pass?The main (and massively ignored) issue when manipulating unicode text is rather that, unlike with legacy character sets, one codepoint does *not* represent a character in the common sense. In character sets like latin-1: * each code represents a character, in the common sense (eg "à") * each character representation has the same size (1 or 2 bytes) * each character has a single representation ("à" --> always 0xe0) All of this is wrong with unicode. And these are complicated and high-level issues, that appear _after_ decoding, on codepoint sequences. If VLERange is helpful is dealing with those problems, then I don't understand your presentation, sorry. Do you for instance mean such a range would, under the hood, group together codes belonging to the same character (thus making indexing meaningful), and/or normalise (decomp & order) (thus allowing to comp/find/count correctly).?VLERange would offer automatic decoding in front, back, popFront, and popBack - just like BidirectionalRange does right now. It would also offer access to the representational support by means of indexing - also like char[] et al already do now.
Jan 11 2011
On 01/12/2011 02:22 AM, Andrei Alexandrescu wrote:Hum, right now incorrectly returns "a" as expected. And indeed assert ("â" == "a\u0302"); incorrectly fails as expected. Both would work with legacy charsets like latin-1. This is a new issue introduced with UCS, that requires an additional level of abstraction (in addition to the one required by the distinction codepoint/codeunit!) You may have a look at https://bitbucket.org/denispir/denispir-d/src/5ec6fe1e1065/Text.html for a rough implementation of a type that does the right thing, & at https://bitbucket.org/denispir/denispir-d/src/5ec6fe1e1065/U%20missing%20level%20of%20abstraction for a (far too long) explanation. (I have tried to mention those problems a dozen times already, but for some reason nearly everybody seems definitely deaf in front of them.) Denis _________________ vita es estrany spir.wikidot.comIIUC, for the case of text, VLERange helps abstracting from the annoying fact that a codepoint is encoded as a variable number of code units. What I meant is issues like: auto text = "a\u0302"d; writeln(text); // "â" auto range = VLERange(text); // extracts characters correctly? auto letter = range.front(); // "a" or "â"? // case yes: compares correctly? assert(range.front() == "â"); // fail or pass?You should try text.front right now, you might be surprised :o).
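The failing assert("â" == "a\u0302") comes down to normalization: the two strings are canonically equivalent but are different code point sequences, so a code-point-level comparison reports them unequal. Python's unicodedata module shows the idea (illustration only, not D):

```python
import unicodedata

precomposed = "\u00e2"        # 'â' as a single code point
decomposed  = "a\u0302"       # 'a' followed by combining circumflex

assert precomposed != decomposed                               # naive comparison fails
assert unicodedata.normalize("NFC", decomposed) == precomposed # compose first...
assert unicodedata.normalize("NFD", precomposed) == decomposed # ...or decompose first
```

Any string type that claims correct ==, find, or count has to normalize (or compare under canonical equivalence) somewhere; this is spir's point.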
Jan 11 2011
On 2011-01-11 11:36:54 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 1/11/11 4:41 AM, Michel Fortin wrote:Your understanding is correct. I think both cases (one becomes many & many becomes one) are important and must be supported. Your proposal only deals with the many-becomes-one case. I proposed returning arrays so we can deal with the one-becomes-many case ("œ" becoming "oe"). Another idea would be to introduce "substeps". When checking for the next character, in addition to determining its step length you could also determine the number of substeps in it. "œ" would have two substeps, "o" and "e", and when there is no longer any substep you move to the next step. All this said, I think this should stay an implementation detail as this would allow a variety of strategies. Also, keeping this an implementation detail means that your proposed 'stepSize' and 'backstepSize' need to be an implementation detail too (because they won't make sense for the one-to-many case). So they can't really be part of a standard VLE interface. As far as I know, all we really need to expose to algorithms is whether a range has elements of variable length, because this has an impact on your indexing capabilities. The rest seems unnecessary to me, or am I missing some use cases? -- Michel Fortin michel.fortin michelf.com http://michelf.com/For instance, say we have a conversion range taking a Unicode string and converting it to ISO Latin 1. The best (lossy) conversion for "œ" is "oe" (one character to two characters), in this case 'front' could simply return "oe" (two characters) in one iteration, with stepSize being the size of the "œ" code point. In the same conversion process, encountering "e" followed by a combining "´" would return pre-combined character "é" (two characters to one character).In the design as I thought of it, the effective length of one logical element is one or more representation units. 
My understanding is that you are referring to a fractional number of representation units for one logical element.
Jan 11 2011
On 1/11/11 11:13 AM, Michel Fortin wrote:On 2011-01-11 11:36:54 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I disagree. When I suggested this design I was worried about over-abstracting. Now this looks like abstracting for stuff that hasn't even been addressed concretely yet. Besides, using bit as an encoding unit sounds like an acceptable approach for anything fractional.On 1/11/11 4:41 AM, Michel Fortin wrote:Your understanding is correct. I think both cases (one becomes many & many becomes one) are important and must be supported. Your proposal only deals with the many-becomes-one case.For instance, say we have a conversion range taking a Unicode string and converting it to ISO Latin 1. The best (lossy) conversion for "œ" is "oe" (one character to two characters), in this case 'front' could simply return "oe" (two characters) in one iteration, with stepSize being the size of the "œ" code point. In the same conversion process, encountering "e" followed by a combining "´" would return pre-combined character "é" (two characters to one character).In the design as I thought of it, the effective length of one logical element is one or more representation units. My understanding is that you are referring to a fractional number of representation units for one logical element.I proposed returning arrays so we can deal with the one-becomes-many case ("œ" becoming "oe"). Another idea would be to introduce "substeps". When checking for the next character, in addition to determining its step length you could also determine the number of substeps in it. "œ" would have two substeps, "o" and "e", and when there is no longer any substep you move to the next step. All this said, I think this should stay an implementation detail as this would allow a variety of strategies. 
Also, keeping this an implementation detail means that your proposed 'stepSize' and 'backstepSize' need to be an implementation detail too (because they won't make sense for the one-to-many case). So they can't really be part of a standard VLE interface.If you don't have at least stepSize that tells you how large the stride is to get to the next element, it becomes impossible to move within the range using integral indexes.As far as I know, all we really need to expose to algorithms is whether a range has elements of variable length, because this has an impact on your indexing capabilities. The rest seems unnecessary to me, or am I missing some use cases?I think you could say that you don't really need stepSize because you can compute it as follows: auto r1 = r; r1.popFront(); size_t stepSize = r.length - r1.length; This is tenuous, inefficient, and impossible if the support range doesn't support length (I realize that variable-length encodings work over other ranges than random access, but then again this may be an overgeneralization). Andrei
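Andrei's fallback computation (copy the range, pop one element, subtract lengths) can be made concrete. A toy Python stand-in for a UTF-8 range (the Utf8Range class is hypothetical, sketched only to show why this route is tenuous compared with a direct stepSize):

```python
class Utf8Range:
    """Toy forward range over UTF-8 code units (a sketch, not Phobos)."""
    def __init__(self, buf: bytes):
        self.buf = buf

    @property
    def length(self) -> int:
        return len(self.buf)          # length in code units

    def pop_front(self) -> None:
        lead = self.buf[0]
        # stride determined by the lead unit, as in the stepSize discussion
        n = 1 if lead < 0x80 else 2 if lead >> 5 == 0b110 \
            else 3 if lead >> 4 == 0b1110 else 4
        self.buf = self.buf[n:]

# The tenuous computation: r1 = r; r1.popFront(); step = r.length - r1.length
r = Utf8Range("œuf".encode("utf-8"))
r1 = Utf8Range(r.buf)
r1.pop_front()
step = r.length - r1.length
assert step == 2                      # 'œ' spans two code units
```

It works, but it costs a copy and a pop per query, and it collapses entirely when the underlying range has no length, which is the point being made above.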
Jan 11 2011
On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I've been thinking on how to better deal with Unicode strings. Currently strings are formally bidirectional ranges with a surreptitious random access interface. The random access interface accesses the support of the string, which is understood to hold data in a variable-encoded format. For as long as the programmer understands this relationship, code for string manipulation can be written with relative ease. However, there is still room for writing wrong code that looks legit. Sometimes the best way to tackle a hairy reality is to invite it to the negotiation table and offer it promotion to first-class abstraction status. Along that vein I was thinking of defining a new range: VLERange, i.e. Variable Length Encoding Range. Such a range would have the power somewhere in between bidirectional and random access. The primitives offered would include empty, access to front and back, popFront and popBack (just like BidirectionalRange), and in addition properties typical of random access ranges: indexing, slicing, and length. Note that the result of the indexing operator is not the same as the element type of the range, as it only represents the unit of encoding. In addition to these (and connecting the two), a VLERange would offer two additional primitives: 1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element. 2. size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element. In both cases, offset is assumed to be at the beginning of a logical element of the range. I suspect that a lot of functions in std.string can be written without Unicode-specific knowledge just by relying on such an interface. Moreover, algorithms can be generalized to other structures that use variable-length encoding, such as those used in data compression. 
(In that case, the support would be a bit array and the encoded type would be ubyte.) Writing to such ranges is not addressed by this design. Ideas are welcome. Adding VLERange would legitimize strings and would clarify their handling, at the cost of adding one additional concept that needs to be minded. Is the trade-off worthwhile?While this makes it possible to write algorithms that only accept VLERanges, I don't think it solves the major problem with strings -- they are treated as arrays by the compiler. I'd also rather see an indexing operation return the element type, and have a separate function to get the encoding unit. This makes more sense for generic code IMO. I noticed you never commented on my proposed string type... That reminds me, I should update with suggested changes and re-post it. -Steve
Jan 11 2011
On 1/11/11 5:30 AM, Steven Schveighoffer wrote:While this makes it possible to write algorithms that only accept VLERanges, I don't think it solves the major problem with strings -- they are treated as arrays by the compiler.Except when they're not - foreach with dchar...I'd also rather see an indexing operation return the element type, and have a separate function to get the encoding unit. This makes more sense for generic code IMO.But that's neither here nor there. That would return the logical element at a physical position. I am very doubtful that much generic code could work without knowing they are in fact dealing with a variable-length encoding.I noticed you never commented on my proposed string type... That reminds me, I should update with suggested changes and re-post it.To be frank, I think it didn't mark a visible improvement. It solved some problems and brought others. There was disagreement over the offered primitives and their semantics. That being said, it's good you are doing this work. In the best case, you could bring a compelling abstraction to the table. In the worst, you'll become as happy about D's strings as I am :o). Andrei
Jan 11 2011
On Tue, 11 Jan 2011 11:54:08 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 1/11/11 5:30 AM, Steven Schveighoffer wrote:This solitary difference is a very thin argument -- foreach(d; byDchar(str)) would be just as good without requiring compiler help.While this makes it possible to write algorithms that only accept VLERanges, I don't think it solves the major problem with strings -- they are treated as arrays by the compiler.Except when they're not - foreach with dchar...It depends on the function, and the way the indexing is implemented.I'd also rather see an indexing operation return the element type, and have a separate function to get the encoding unit. This makes more sense for generic code IMO.But that's neither here nor there. That would return the logical element at a physical position. I am very doubtful that much generic code could work without knowing they are in fact dealing with a variable-length encoding.It is supposed to be simple, and provide the expected interface, without causing any undue performance degradation. That is, I should be able to do all the things with a replacement string type that I can with a char array today, as efficiently as I can today, except I should have to work to get at the code-units. The huge benefit is that I can say "I'm dealing with this as an array" when I know it's safe The disagreement will never be fully solved, as there is just as much disagreement about the current state of affairs ;) e.g. should foreach default to using dchar?I noticed you never commented on my proposed string type... That reminds me, I should update with suggested changes and re-post it.To be frank, I think it didn't mark a visible improvement. It solved some problems and brought others. There was disagreement over the offered primitives and their semantics.That being said, it's good you are doing this work. In the best case, you could bring a compelling abstraction to the table. 
In the worst, you'll become as happy about D's strings as I am :o).I don't think I'll ever be 'happy' with the way strings sit in phobos currently. I typically deal in ASCII (i.e. code units), and phobos works very hard to prevent that. -Steve
Jan 11 2011
On 1/11/11 11:21 AM, Steven Schveighoffer wrote:On Tue, 11 Jan 2011 11:54:08 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Unfinished sentence? Anyway, for my money you just described what we have now.On 1/11/11 5:30 AM, Steven Schveighoffer wrote:This solitary difference is a very thin argument -- foreach(d; byDchar(str)) would be just as good without requiring compiler help.While this makes it possible to write algorithms that only accept VLERanges, I don't think it solves the major problem with strings -- they are treated as arrays by the compiler.Except when they're not - foreach with dchar...It depends on the function, and the way the indexing is implemented.I'd also rather see an indexing operation return the element type, and have a separate function to get the encoding unit. This makes more sense for generic code IMO.But that's neither here nor there. That would return the logical element at a physical position. I am very doubtful that much generic code could work without knowing they are in fact dealing with a variable-length encoding.It is supposed to be simple, and provide the expected interface, without causing any undue performance degradation. That is, I should be able to do all the things with a replacement string type that I can with a char array today, as efficiently as I can today, except I should have to work to get at the code-units. The huge benefit is that I can say "I'm dealing with this as an array" when I know it's safeI noticed you never commented on my proposed string type... That reminds me, I should update with suggested changes and re-post it.To be frank, I think it didn't mark a visible improvement. It solved some problems and brought others. There was disagreement over the offered primitives and their semantics.The disagreement will never be fully solved, as there is just as much disagreement about the current state of affairs ;) e.g. 
should foreach default to using dchar?I disagree about the disagreement being unsolvable. I'm not rigid; if I saw a terrific abstraction in your string, I'd be all for it. It just shuffles some issues about, and although I agree it does one thing or two better than char[], at the end of the day it doesn't carry its weight.I wonder if we could and should extend some of the functions in std.string to work with ubyte[]. I did add a function called representation() that I didn't document yet. Essentially representation gives you the ubyte[], ushort[], or uint[] underneath a string, with the same qualifiers. Whenever you want an algorithm to work on ASCII in earnest, you can pass representation(s) to it instead of s. If you work a lot with ASCII, an AsciiString abstraction may be a better and more likely to be successful string type. Better yet, you could simply focus on AsciiChar and then define ASCII strings as arrays of AsciiChar. AndreiThat being said, it's good you are doing this work. In the best case, you could bring a compelling abstraction to the table. In the worst, you'll become as happy about D's strings as I am :o).I don't think I'll ever be 'happy' with the way strings sit in phobos currently. I typically deal in ASCII (i.e. code units), and phobos works very hard to prevent that.
Jan 11 2011
On Tue, 11 Jan 2011 18:00:30 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 1/11/11 11:21 AM, Steven Schveighoffer wrote:Sorry, I forgot '.' :)It is supposed to be simple, and provide the expected interface, without causing any undue performance degradation. That is, I should be able to do all the things with a replacement string type that I can with a char array today, as efficiently as I can today, except I should have to work to get at the code-units. The huge benefit is that I can say "I'm dealing with this as an array" when I know it's safeUnfinished sentence?Anyway, for my money you just described what we have now.All except the 'expected interface' part. The string type should deal with dchars exclusively, since that's what it is a range of. char[] gives you char's back when you index it. Anyone who doesn't use ASCII will be confused by this. Also, I expect to be able to use a char[] as an array, which Phobos doesn't let me in some cases (e.g. sorting ASCII character array).I see it as having two vast improvements: 1. If we replace char[] with a specific type for string, then char[] can be considered a true array by phobos, and phobos can now deal with a char[] array without the need to cast. 2. It protects the casual user from incorrectly using a string by making the default the correct API. Those to me are very important.The disagreement will never be fully solved, as there is just as much disagreement about the current state of affairs ;) e.g. should foreach default to using dchar?I disagree about the disagreement being unsolvable. I'm not rigid; if I saw a terrific abstraction in your string, I'd be all for it. It just shuffles some issues about, and although I agree it does one thing or two better than char[], at the end of the day it doesn't carry its weight.This, again, fails on point 2 above. A char[] is an array, and allows access to code-units, which is not the correct interface for a string. 
Supporting ubyte[] doesn't fix that problem. Correct as the default is usually a theme in D...I don't think I'll ever be 'happy' with the way strings sit in phobos currently. I typically deal in ASCII (i.e. code units), and phobos works very hard to prevent that.I wonder if we could and should extend some of the functions in std.string to work with ubyte[]. I did add a function called representation() that I didn't document yet. Essentially representation gives you the ubyte[], ushort[], or uint[] underneath a string, with the same qualifiers. Whenever you want an algorithm to work on ASCII in earnest, you can pass representation(s) to it instead of s.If you work a lot with ASCII, an AsciiString abstraction may be a better and more likely to be successful string type. Better yet, you could simply focus on AsciiChar and then define ASCII strings as arrays of AsciiChar.This seems like the wrong approach. Adding a new type does not fix the problems with the original type. We need to replace the original type or at least how it is treated by the compiler. -Steve
Jan 13 2011
On 1/13/11 8:52 AM, Steven Schveighoffer wrote:I see it as having two vast improvements: 1. If we replace char[] with a specific type for string, then char[] can be considered a true array by phobos, and phobos can now deal with a char[] array without the need to cast. 2. It protects the casual user from incorrectly using a string by making the default the correct API. Those to me are very important.Let's take a look: // Incorrect string code void fun(string s) { foreach (i; 0 .. s.length) { writeln("The character in position ", i, " is ", s[i]); } } // Incorrect string_t code void fun(string_t!char s) { foreach (i; 0 .. s.codeUnits) { writeln("The character in position ", i, " is ", s[i]); } } Both functions are incorrect, albeit in different ways. The only improvement I'm seeing is that the user needs to write codeUnits instead of length, which may make her think twice. Clearly, however, copiously incorrect code can be written with the proposed interface because it tries to hide the reality that underneath a variable-length encoding is being used, but doesn't hide it completely (albeit for good efficiency-related reasons). But wait, there's less. Functions for random-access range throughout Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next to s[i]. From a cursory look at string_t, std.range will qualify it as a RandomAccessRange without length. That's an odd beast but does not change the fixed-length encoding assumption. So you'd need to special-case algorithms for string_t, just like right now certain algorithms are specialized for string. Where's the progress? Andrei
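Both "incorrect" loops above index by code unit while the user most likely means code points. The mismatch is easy to make concrete in Python (illustration only; char[] indexing in D sees exactly these UTF-8 units):

```python
s = "café"
units = s.encode("utf-8")       # what indexing the representation actually sees
assert len(s) == 4              # four code points
assert len(units) == 5          # five UTF-8 code units: 'é' takes two
assert units[3:5].decode("utf-8") == "é"   # only the unit pair decodes correctly
```

Indexing units[3] alone yields half of 'é', which is precisely the class of bug both function bodies permit.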
Jan 13 2011
On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 1/13/11 8:52 AM, Steven Schveighoffer wrote:You might be looking at my previous version. The new version (recently posted) will throw an exception for that code if a multi-code-unit code-point is found. It also supports this: foreach(i, d; s) { writeln("The character in position ", i, " is ", d); } where i is the index (might not be sequential)I see it as having two vast improvements: 1. If we replace char[] with a specific type for string, then char[] can be considered a true array by phobos, and phobos can now deal with a char[] array without the need to cast. 2. It protects the casual user from incorrectly using a string by making the default the correct API. Those to me are very important.Let's take a look: // Incorrect string code void fun(string s) { foreach (i; 0 .. s.length) { writeln("The character in position ", i, " is ", s[i]); } } // Incorrect string_t code void fun(string_t!char s) { foreach (i; 0 .. s.codeUnits) { writeln("The character in position ", i, " is ", s[i]); } } Both functions are incorrect, albeit in different ways. The only improvement I'm seeing is that the user needs to write codeUnits instead of length, which may make her think twice. Clearly, however, copiously incorrect code can be written with the proposed interface because it tries to hide the reality that underneath a variable-length encoding is being used, but doesn't hide it completely (albeit for good efficiency-related reasons).But wait, there's less. Functions for random-access range throughout Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next to s[i]. From a cursory look at string_t, std.range will qualify it as a RandomAccessRange without length. That's an odd beast but does not change the fixed-length encoding assumption. 
So you'd need to special-case algorithms for string_t, just like right now certain algorithms are specialized for string.isRandomAccessRange requires hasLength (see here: http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532). This is not a random access range per that definition. But a string isn't a random access range anyways (it's specifically disallowed by std.range per that same reference). The plan is you would *not* have to special case algorithms for string_t as you do currently for char[]. If that's not the case, then we haven't achieved much. Simply put, we are separating out the strange nature of strings from arrays, so the exceptional treatment of them is handled by the type itself, not the functions using it. -Steve
Jan 13 2011
On 1/13/11 11:35 AM, Steven Schveighoffer wrote:On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.Let's take a look: // Incorrect string code void fun(string s) { foreach (i; 0 .. s.length) { writeln("The character in position ", i, " is ", s[i]); } } // Incorrect string_t code void fun(string_t!char s) { foreach (i; 0 .. s.codeUnits) { writeln("The character in position ", i, " is ", s[i]); } } Both functions are incorrect, albeit in different ways. The only improvement I'm seeing is that the user needs to write codeUnits instead of length, which may make her think twice. Clearly, however, copiously incorrect code can be written with the proposed interface because it tries to hide the reality that underneath a variable-length encoding is being used, but doesn't hide it completely (albeit for good efficiency-related reasons).You might be looking at my previous version. The new version (recently posted) will throw an exception for that code if a multi-code-unit code-point is found.It also supports this: foreach(i, d; s) { writeln("The character in position ", i, " is ", d); } where i is the index (might not be sequential)Well string supports that too, albeit with the nit that you need to specify dchar.That's an interesting twist. By the way I specified length is required then because I couldn't imagine having random access into something that I can't tell the length of. Apparently I was wrong :o).But wait, there's less. Functions for random-access range throughout Phobos routinely assume fixed-length encoding, i.e. 
s[i + 1] lies next to s[i]. From a cursory look at string_t, std.range will qualify it as a RandomAccessRange without length. That's an odd beast but does not change the fixed-length encoding assumption. So you'd need to special-case algorithms for string_t, just like right now certain algorithms are specialized for string.isRandomAccessRange requires hasLength (see here: http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532). This is not a random access range per that definition.But a string isn't a random access range anyways (it's specifically disallowed by std.range per that same reference).It isn't and it isn't supposed to be.The plan is you would *not* have to special case algorithms for string_t as you do currently for char[]. If that's not the case, then we haven't achieved much. Simply put, we are separating out the strange nature of strings from arrays, so the exceptional treatment of them is handled by the type itself, not the functions using it.That sounds reasonable. Andrei
Jan 13 2011
On Thu, 13 Jan 2011 15:51:00 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

On 1/13/11 11:35 AM, Steven Schveighoffer wrote:

iterating the code units is possible by accessing the array data, i.e. you could do:

foreach(i, c; s.data)

if you want the code units. That is the point of having a separate type. Using string_t tells the library "I'm using this data as a string". Using char[] tells the library "I'm using this data as an array." The difference here is, you have to *specifically* try to access the code units; the default is code points. All it does really is switch the default.

On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.

Let's take a look:

// Incorrect string code
void fun(string s)
{
    foreach (i; 0 .. s.length)
    {
        writeln("The character in position ", i, " is ", s[i]);
    }
}

// Incorrect string_t code
void fun(string_t!char s)
{
    foreach (i; 0 .. s.codeUnits)
    {
        writeln("The character in position ", i, " is ", s[i]);
    }
}

Both functions are incorrect, albeit in different ways. The only improvement I'm seeing is that the user needs to write codeUnits instead of length, which may make her think twice. Clearly, however, copiously incorrect code can be written with the proposed interface because it tries to hide the reality that a variable-length encoding is being used underneath, but doesn't hide it completely (albeit for good efficiency-related reasons).

You might be looking at my previous version. The new version (recently posted) will throw an exception for that code if a multi-code-unit code point is found.

This is not a small problem.

It also supports this:

foreach (i, d; s)
{
    writeln("The character in position ", i, " is ", d);
}

where i is the index (which might not be sequential).

Well, string supports that too, albeit with the nit that you need to specify dchar.

Yes, in fact, you could say that specifically defines VLERange ;) But actually, there are two types of VLE ranges: those which can be randomly accessed (where determining the beginning of a code point, given a random index, is possible) and those that cannot (where decoding depends on the exact order of the data). Actually, those would not be bi-directional ranges anyways.

isRandomAccessRange requires hasLength (see here: http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532). This is not a random access range per that definition.

That's an interesting twist. By the way, I specified that length is required because I couldn't imagine having random access into something that I can't tell the length of. Apparently I was wrong :o).

I agree with that assessment, which is why I omitted length.

-Steve

But a string isn't a random access range anyways (it's specifically disallowed by std.range per that same reference).

It isn't and it isn't supposed to be.
Jan 13 2011
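The throwing opIndex behaviour Steven describes might be sketched as follows. This is an illustrative reconstruction, not his posted code; the names string_t and codeUnits follow the thread, and the boundary check is specific to UTF-8:

```d
import std.utf : decode, UTFException;

// Illustrative sketch: indexing decodes the code point starting at a
// code-unit offset, and refuses offsets that land inside a
// multi-code-unit sequence instead of silently returning a fragment.
struct string_t(Char)
{
    immutable(Char)[] data;   // raw code units, accessible on request

    @property size_t codeUnits() const { return data.length; }

    dchar opIndex(size_t i) const
    {
        // In UTF-8, a continuation unit has the bit pattern 10xxxxxx.
        static if (is(Char == char))
            if (i > 0 && (data[i] & 0xC0) == 0x80)
                throw new UTFException("index is not at a code point boundary");
        size_t pos = i;
        return decode(data, pos);  // also throws on invalid sequences
    }
}
```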
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:ignon1$2p4k$1 digitalmars.com...This may sometimes not be what the user expected; most of the time they'd care about the code points.I dunno, spir has succesfuly convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.
Jan 13 2011
On 01/13/2011 11:00 PM, Nick Sabalausky wrote:

"Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org> wrote in message news:ignon1$2p4k$1 digitalmars.com...

This may sometimes not be what the user expected; most of the time they'd care about the code points.

I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.

You are right in that those 2 issues are really analogous. In practice, once universal text is truly and commonly used, I guess problems with codes-do-not-represent-characters may become far more obvious; and also far more serious, because (logical) errors can easily pass by unseen. [In fact, how can a programmer even know, for instance, that a search routine missed its target or returned a false positive, when dealing with characters from unknown languages? Indeed, there are test data sets, but they are useless if the tools one uses just ignore the issues.]

The problem with using a 16-bit representation and thus ignoring a fair amount of code points is maybe less serious, because there are rather few chances to randomly meet characters outside the BMP (Basic Multilingual Plane, the part of UCS whose code points are < 0x10000). Outside the BMP are the scripting systems of less commonly studied archeological languages, and various sets of images such as alchemical symbols, playing cards or domino tiles. I doubt they'll ever be commonly used, or else, in specialised apps, the programmer knows perfectly well what they are dealing with. A list of UCS blocks with pointers to detailed content can be found here: http://www.fileformat.info/info/unicode/block/index.htm Blocks over the BMP start with the line:

Linear B Syllabary U+10000 U+1007F (88)

Denis
_________________
vita es estrany
spir.wikidot.com
Jan 13 2011
Nick Sabalausky wrote:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:ignon1$2p4k$1 digitalmars.com...

This may sometimes not be what the user expected; most of the time they'd care about the code points.

I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.

I agree. This is a very informative thread, thanks spir and everybody else.

Going back to the topic, it seems to me that a unicode string is a surprisingly complicated data structure that can be viewed from multiple types of ranges. In the light of this thread, a dchar doesn't seem like such a useful type anymore; it is still a low-level abstraction for the purpose of correctly dealing with text. Perhaps it is even less useful, since it gives the illusion of correctness for those who are not in the know.

The algorithms in std.string can be upgraded to work correctly with all the issues mentioned, but the generic ones in std.algorithm will just subtly do the wrong thing when presented with dchar ranges. And, as I understood it, the purpose of a VleRange was exactly to make generic algorithms just work (tm) for strings.

Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if the VleRange of a string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often-used string algorithms could be specialized for performance, but if not, generic algorithms would still work.
Jan 15 2011
On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn <lutger.blijdestijn gmail.com> said:

Nick Sabalausky wrote:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:ignon1$2p4k$1 digitalmars.com...

This may sometimes not be what the user expected; most of the time they'd care about the code points.

I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.

I agree. This is a very informative thread, thanks spir and everybody else. Going back to the topic, it seems to me that a unicode string is a surprisingly complicated data structure that can be viewed from multiple types of ranges. In the light of this thread, a dchar doesn't seem like such a useful type anymore; it is still a low-level abstraction for the purpose of correctly dealing with text. Perhaps even less useful, since it gives the illusion of correctness for those who are not in the know. The algorithms in std.string can be upgraded to work correctly with all the issues mentioned, but the generic ones in std.algorithm will just subtly do the wrong thing when presented with dchar ranges. And, as I understood it, the purpose of a VleRange was exactly to make generic algorithms just work (tm) for strings. Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if the VleRange of a string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often-used string algorithms could be specialized for performance, but if not, generic algorithms would still work.

I have my idea.

I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

The second component would be to make the string equality operator (==) for strings compare them in their normalized form, so that ("e" with combining acute accent) == (pre-combined "é"). I think this would make D support for Unicode much more intuitive.

This implies some semantic changes, mainly that everywhere you write a "character" you must use double-quotes (string "a") instead of single quotes (code point 'a'), but from the user's point of view that's pretty much all there is to change.

There'll still be plenty of room for specialized algorithms, but their purpose would be limited to optimization. Correctness would be taken care of by the basic range interface, and foreach should follow suit and iterate by grapheme by default.

I wrote this example (or something similar) earlier in this thread:

foreach (grapheme; "exposé")
    if (grapheme == "é")
        break;

In this example, even if one of these two strings uses the pre-combined form of "é" and the other uses a combining acute accent, the equality would still hold since foreach iterates on full graphemes and == compares using normalization.

The important thing to keep in mind here is that the grapheme-splitting algorithm should be optimized for the case where there is no combining character, and the compare algorithm for the case where the string is already normalized, since most strings will exhibit these characteristics.

As for ASCII, we could make it easier to use ubyte[] for it by making string literals implicitly convert to ubyte[] if all their characters are in ASCII range.
-- Michel Fortin michel.fortin michelf.com http://michelf.com/"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:ignon1$2p4k$1 digitalmars.com...I agree. This is a very informative thread, thanks spir and everybody else. Going back to the topic, it seems to me that a unicode string is a surprisingly complicated data structure that can be viewed from multiple types of ranges. In the light of this thread, a dchar doesn't seem like such a useful type anymore, it is still a low level abstraction for the purpose of correctly dealing with text. Perhaps even less useful, since it gives the illusion of correctness for those who are not in the know. The algorithms in std.string can be upgraded to work correctly with all the issues mentioned, but the generic ones in std.algorithm will just subtly do the wrong thing when presented with dchar ranges. And, as I understood it, the purpose of a VleRange was exactly to make generic algorithms just work (tm) for strings. Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if VleRange of string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often used string algorithms could be specialized for performance, but if not, generic algorithms would still work.This may sometimes not be what the user expected; most of the time they'd care about the code points.I dunno, spir has succesfuly convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.
Jan 15 2011
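For reference, behaviour close to what Michel proposes eventually appeared in Phobos in library form. A sketch using today's std.uni (which did not exist when this thread was written); the combining-accent spelling of "exposé" is written out explicitly:

```d
import std.uni : byGrapheme, normalize, NFC;
import std.array : array;
import std.conv : to;

void main()
{
    // "exposé" spelled with a combining acute accent: 'e' + U+0301
    string s = "expose\u0301";

    foreach (g; s.byGrapheme)            // one full grapheme at a time
    {
        // g[] is a random-access range over the grapheme's code points
        string gs = g[].array.to!string;

        // normalized comparison: combining form matches pre-combined "é"
        if (normalize!NFC(gs) == normalize!NFC("é"))
            break;                       // found the accented letter
    }
}
```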
Michel Fortin wrote:On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn <lutger.blijdestijn gmail.com> said:...... Yes, this is exactly what I meant, but you are much clearer. I hope this can be made to work!Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if VleRange of string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often used string algorithms could be specialized for performance, but if not, generic algorithms would still work.I have my idea. I think it'd be a good idea is to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.
Jan 15 2011
Lutger Blijdestijn Wrote:

Michel Fortin wrote:

I have my idea. I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

Yes, this is exactly what I meant, but you are much clearer. I hope this can be made to work!

My two cents are against this kind of design. The "correct" approach IMO is a 'universal text' type which is a _container_ of said text. This type would provide ranges for the various abstraction levels, e.g. text.codeUnits to iterate by code units.

Here's a (perhaps contrived) example: Let's say I want to find the combining marks in some text. For instance, Hebrew uses combining marks for vowels (among other things) and they are optional in the language (there's a "full" form with vowels and a "missing" form without them). I have a Hebrew text in the "full" form and I want to strip it and convert it to the "missing" form. How would I accomplish this with your design?
Jan 15 2011
On 2011-01-15 09:09:17 -0500, foobar <foo bar.com> said:

Lutger Blijdestijn Wrote:

Michel Fortin wrote:

I have my idea. I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

Yes, this is exactly what I meant, but you are much clearer. I hope this can be made to work!

My two cents are against this kind of design. The "correct" approach IMO is a 'universal text' type which is a _container_ of said text. This type would provide ranges for the various abstraction levels. E.g. text.codeUnits to iterate by code units.

Nothing prevents that in the design I proposed. Andrei's design already implements "str".byDchar() that would work for code points. I'd suggest changing the API to by!char(), by!wchar(), and by!dchar() for when you deal with whatever kind of code unit or code point you want. This would be mostly symmetric to what you can already do with foreach:

foreach (char c; "hello") {}
foreach (wchar c; "hello") {}
foreach (dchar c; "hello") {}

// same as:
foreach (c; "hello".by!char()) {}
foreach (c; "hello".by!wchar()) {}
foreach (c; "hello".by!dchar()) {}

Here's a (perhaps contrived) example: Let's say I want to find the combining marks in some text.
For instance, Hebrew uses combining marks for vowels (among other things) and they are optional in the language (there's a "full" form with vowels and a "missing" form without them). I have a Hebrew text in the "full" form and I want to strip it and convert it to the "missing" form. How would I accomplish this with your design?

All you need is a range that takes a string as input and gives you code points in a decomposed form (NFD), then you use std.algorithm.filter on it:

// original string
auto str = "...";

// create normalized decomposed string as a lazy range of dchar (NFD)
auto decomposed = decompose(str);

// filter to remove your favorite combining code point (use the hex code you want)
auto filtered = filter!"a != 0xFABA"(decomposed);

// turn it back in composed form (NFC), optional
auto recomposed = compose(filtered);

// convert back to a string (could also be wstring or dstring)
string result = array(recomposed.by!char());

This last line is the one doing everything. All the rest just chains ranges together for doing on-the-fly decomposition, filtering, and recomposition; the last line uses that chain of ranges to fill the array.

A more naive implementation not taking advantage of code points but instead using a replacement table would also work:

string str = "...";
string result;
string[string] replacements = ["é":"e"]; // change this for what you want

foreach (grapheme; str)
{
    auto replacement = grapheme in replacements;
    if (replacement)
        result ~= replacement;
    else
        result ~= grapheme;
}

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 15 2011
Michel Fortin Wrote:

On 2011-01-15 09:09:17 -0500, foobar <foo bar.com> said:

My two cents are against this kind of design. The "correct" approach IMO is a 'universal text' type which is a _container_ of said text. This type would provide ranges for the various abstraction levels. E.g. text.codeUnits to iterate by code units.

Nothing prevents that in the design I proposed. Andrei's design already implements "str".byDchar() that would work for code points. I'd suggest changing the API to by!char(), by!wchar(), and by!dchar() for when you deal with whatever kind of code unit or code point you want. This would be mostly symmetric to what you can already do with foreach:

foreach (char c; "hello") {}
foreach (wchar c; "hello") {}
foreach (dchar c; "hello") {}

// same as:
foreach (c; "hello".by!char()) {}
foreach (c; "hello".by!wchar()) {}
foreach (c; "hello".by!dchar()) {}

Ok, I guess I missed the "byDchar()" method. I envisioned the same algorithm looking like this:

// original string
string str = "...";

// create normalized decomposed string as a lazy range of dchar (NFD)
// Note: explicitly specify code points range:
auto decomposed = decompose(str.codePoints);

// filter to remove your favorite combining code point
auto filtered = filter!"a != 0xFABA"(decomposed);

// turn it back in composed form (NFC), optional
auto recomposed = compose(filtered);

// convert back to a string
// Note: a string type can be constructed from a range of code points
string result = string(recomposed);

The difference is that a string type is distinct from the intermediate code point ranges (this happens in your design too, albeit in a less obvious way to the user). There is string-specific code. Why not encapsulate it in a string type instead of forcing the user to use complex APIs with templates everywhere?

Lutger Blijdestijn Wrote:

Michel Fortin wrote:

My two cents are against this kind of design. The "correct" approach IMO is a 'universal text' type which is a _container_ of said text. This type would provide ranges for the various abstraction levels. E.g. text.codeUnits to iterate by code units.

On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn <lutger.blijdestijn gmail.com> said: ...
Yes, this is exactly what I meant, but you are much clearer. I hope this can be made to work!

Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if the VleRange of a string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often-used string algorithms could be specialized for performance, but if not, generic algorithms would still work.

I have my idea. I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

Here's a (perhaps contrived) example: Let's say I want to find the combining marks in some text. For instance, Hebrew uses combining marks for vowels (among other things) and they are optional in the language (there's a "full" form with vowels and a "missing" form without them). I have a Hebrew text in the "full" form and I want to strip it and convert it to the "missing" form. How would I accomplish this with your design?

All you need is a range that takes a string as input and gives you code points in a decomposed form (NFD), then you use std.algorithm.filter on it:

// original string
auto str = "...";

// create normalized decomposed string as a lazy range of dchar (NFD)
auto decomposed = decompose(str);

// filter to remove your favorite combining code point (use the hex code you want)
auto filtered = filter!"a != 0xFABA"(decomposed);

// turn it back in composed form (NFC), optional
auto recomposed = compose(filtered);

// convert back to a string (could also be wstring or dstring)
string result = array(recomposed.by!char());

This last line is the one doing everything.
All the rest just chains ranges together for doing on-the-fly decomposition, filtering, and recomposition; the last line uses that chain of ranges to fill the array.

A more naive implementation not taking advantage of code points but instead using a replacement table would also work:

string str = "...";
string result;
string[string] replacements = ["é":"e"]; // change this for what you want

foreach (grapheme; str)
{
    auto replacement = grapheme in replacements;
    if (replacement)
        result ~= replacement;
    else
        result ~= grapheme;
}

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 15 2011
On 2011-01-15 10:59:52 -0500, foobar <foo bar.com> said:

Ok, I guess I missed the "byDchar()" method. I envisioned the same algorithm looking like this:

// original string
string str = "...";

// create normalized decomposed string as a lazy range of dchar (NFD)
// Note: explicitly specify code points range:
auto decomposed = decompose(str.codePoints);

// filter to remove your favorite combining code point
auto filtered = filter!"a != 0xFABA"(decomposed);

// turn it back in composed form (NFC), optional
auto recomposed = compose(filtered);

// convert back to a string
// Note: a string type can be constructed from a range of code points
string result = string(recomposed);

The difference is that a string type is distinct from the intermediate code point ranges (this happens in your design too, albeit in a less obvious way to the user). There is string-specific code. Why not encapsulate it in a string type instead of forcing the user to use complex APIs with templates everywhere?

What I don't understand is in what way using a string type would make the API less complex and use fewer templates. More generally, in what way would your string type behave differently than char[], wchar[], and dchar[]? I think we need to clarify how you expect your string type to behave before I can answer anything. I mean, beside cosmetic changes such as having a codePoints property instead of by!dchar or byDchar, what is your string type doing differently?

The above algorithm is already possible with strings as they are, provided you implement the 'decompose' and the 'compose' functions returning a range. In fact, you only changed two things in it: by!dchar became codePoints, and array() became string(). Surely you're expecting more benefits than that.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 15 2011
Michel Fortin Wrote:

What I don't understand is in what way using a string type would make the API less complex and use fewer templates. More generally, in what way would your string type behave differently than char[], wchar[], and dchar[]? I think we need to clarify how you expect your string type to behave before I can answer anything. I mean, beside cosmetic changes such as having a codePoints property instead of by!dchar or byDchar, what is your string type doing differently? The above algorithm is already possible with strings as they are, provided you implement the 'decompose' and the 'compose' functions returning a range. In fact, you only changed two things in it: by!dchar became codePoints, and array() became string(). Surely you're expecting more benefits than that.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

First thing, the question of possibility is irrelevant, since I could also write the same algorithm in brainfuck or assembly (with a lot more code). It's never a question of possibility but rather a question of ease of use for the user. What I want is to encapsulate all the low-level implementation details in one place, so that, as a user, I will not need to deal with them everywhere. One such detail is the encoding:

auto text = w"whatever";
// should be equivalent to:
auto text = new Text("whatever", Encoding.UTF16);

Now I want to write my own string function:

void func(Text a);
// instead of the current:
void func(T)(T a) if (isTextType!T); // why does the USER need all this?

Of course, the Text type would do the correct thing by default, which we both agree should be graphemes. Only if I need something advanced, like in the previous algorithm, then I explicitly need to specify that I work on code points or code units.

In a sentence: "Make the common case trivial and the complex case possible". The common case is what we Humans think of as characters (graphemes) and the complex case is the encoding level.
Jan 15 2011
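foobar's encapsulation idea might be sketched roughly as follows. All names here (Text, Encoding, codeUnits, codePoints) are taken from the post or invented for illustration; this is not an actual library type, and the view implementations are deliberately skeletal:

```d
// Hypothetical sketch of a Text container that owns an encoded buffer
// and exposes explicit views at each abstraction level.
enum Encoding { UTF8, UTF16, UTF32 }

struct Text
{
    private immutable(ubyte)[] buf;  // encoded payload
    private Encoding enc;

    this(string s, Encoding e = Encoding.UTF8)
    {
        buf = cast(immutable(ubyte)[]) s;
        enc = e;
    }

    // explicit low-level views; only requested when the user needs them
    auto codeUnits() const { return buf; }
    auto codePoints() const { return cast(string) buf; } // dchar-decoded by foreach

    // default iteration would be by grapheme (elided in this sketch)
}

// user-facing functions take the concrete type, not a constrained template:
void func(Text a) { /* ... */ }
```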
On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:

On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn <lutger.blijdestijn gmail.com> said:

Nick Sabalausky wrote:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:ignon1$2p4k$1 digitalmars.com...

This may sometimes not be what the user expected; most of the time they'd care about the code points.

I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.

I agree. This is a very informative thread, thanks spir and everybody else. Going back to the topic, it seems to me that a unicode string is a surprisingly complicated data structure that can be viewed from multiple types of ranges. In the light of this thread, a dchar doesn't seem like such a useful type anymore; it is still a low-level abstraction for the purpose of correctly dealing with text. Perhaps even less useful, since it gives the illusion of correctness for those who are not in the know. The algorithms in std.string can be upgraded to work correctly with all the issues mentioned, but the generic ones in std.algorithm will just subtly do the wrong thing when presented with dchar ranges. And, as I understood it, the purpose of a VleRange was exactly to make generic algorithms just work (tm) for strings. Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if the VleRange of a string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often-used string algorithms could be specialized for performance, but if not, generic algorithms would still work.

I have my idea.

I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

The second component would be to make the string equality operator (==) for strings compare them in their normalized form, so that ("e" with combining acute accent) == (pre-combined "é"). I think this would make D support for Unicode much more intuitive.

This implies some semantic changes, mainly that everywhere you write a "character" you must use double-quotes (string "a") instead of single quotes (code point 'a'), but from the user's point of view that's pretty much all there is to change.

There'll still be plenty of room for specialized algorithms, but their purpose would be limited to optimization. Correctness would be taken care of by the basic range interface, and foreach should follow suit and iterate by grapheme by default.

I wrote this example (or something similar) earlier in this thread:

foreach (grapheme; "exposé")
    if (grapheme == "é")
        break;

In this example, even if one of these two strings uses the pre-combined form of "é" and the other uses a combining acute accent, the equality would still hold since foreach iterates on full graphemes and == compares using normalization.

The important thing to keep in mind here is that the grapheme-splitting algorithm should be optimized for the case where there is no combining character, and the compare algorithm for the case where the string is already normalized, since most strings will exhibit these characteristics.

As for ASCII, we could make it easier to use ubyte[] for it by making string literals implicitly convert to ubyte[] if all their characters are in ASCII range.

I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types.

Now, given that dchar can't actually work completely as an element type, you'd either need the string type to be a new type or the element type to be a new type. So, either the string type has char[], wchar[], or dchar[] for its element type, or char[], wchar[], and dchar[] have something like uchar as their element type, where uchar is a struct which contains a char[], wchar[], or dchar[] which holds a single grapheme.

I think that it's a great idea that programmers try to use substrings and slices rather than dchar, but making the element type a slice of the original type sounds like it's really asking for trouble.

- Jonathan M Davis
Jan 15 2011
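The 'uchar' element type Jonathan sketches in words might look like this. The name and details are hypothetical (today's std.uni.Grapheme plays a similar role), and normalization is deliberately elided:

```d
// Hypothetical grapheme element type: a struct wrapping the slice of
// the original string that covers exactly one grapheme, so the element
// type of a string range stays distinct from the string type itself.
struct UChar(Char)
{
    immutable(Char)[] slice;  // one complete grapheme

    bool opEquals(const UChar rhs) const
    {
        // a real implementation would compare normalized forms (NFC/NFD);
        // this sketch compares raw code units only
        return slice == rhs.slice;
    }
}

unittest
{
    auto a = UChar!char("é");
    auto b = UChar!char("é");
    assert(a == b);  // identical encodings compare equal even in this sketch
}
```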
On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:

On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:

I have my idea. I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

The second component would be to make the string equality operator (==) for strings compare them in their normalized form, so that ("e" with combining acute accent) == (pre-combined "é"). I think this would make D support for Unicode much more intuitive.

This implies some semantic changes, mainly that everywhere you write a "character" you must use double-quotes (string "a") instead of single quotes (code point 'a'), but from the user's point of view that's pretty much all there is to change.

There'll still be plenty of room for specialized algorithms, but their purpose would be limited to optimization. Correctness would be taken care of by the basic range interface, and foreach should follow suit and iterate by grapheme by default.

I wrote this example (or something similar) earlier in this thread:

foreach (grapheme; "exposé")
    if (grapheme == "é")
        break;

In this example, even if one of these two strings uses the pre-combined form of "é" and the other uses a combining acute accent, the equality would still hold since foreach iterates on full graphemes and == compares using normalization.

I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types.

I remember that someone already complained about this issue because he had a tree of ranges, and Andrei said he would take a look at this problem eventually. Perhaps now would be a good time.

Now, given that dchar can't actually work completely as an element type, you'd either need the string type to be a new type or the element type to be a new type. So, either the string type has char[], wchar[], or dchar[] for its element type, or char[], wchar[], and dchar[] have something like uchar as their element type, where uchar is a struct which contains a char[], wchar[], or dchar[] which holds a single grapheme.

Having a new type for grapheme would work too. My preference still goes to reusing the string type because it makes the semantics simpler to understand, especially when comparing graphemes with literals.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 15 2011
On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:d mOn Saturday 15 January 2011 04:24:33 Michel Fortin wrote:I have my idea. =20 I think it'd be a good idea is to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme. =20 The second component would be to make the string equality operator (=3D=20 =3D) =20for strings compare them in their normalized form, so that ("e" with combining acute accent) =3D=3D (pre-combined "=E9"). I think this woul=If a character literal actually became a grapheme instead of a dchar, then = that=20 would likely solve that issue. But I fear that the semantics of having a ra= nge=20 be its own element type actually make understanding it _harder_, not simple= r.=20 Being forced to compare a string literals against what should be a characte= r=20 would definitely confuse programmers. Making a new character or grapheme ty= pe=20 which represented a grapheme would be _far_ simpler to understand IMO. Howe= ver,=20 making it work really well would likely require that the compiler know abou= t the=20 grapheme type like it knows about dchar. =2D Jonathan M Davis=20 ake =20=20 I remember that someone already complained about this issue because he had a tree of ranges, and Andrei said he would take a look at this problem eventually. Perhaps now would be a good time. =20D support for Unicode much more intuitive. =20 This implies some semantic changes, mainly that everywhere you write a "character" you must use double-quotes (string "a") instead of single quote (code point 'a'), but from the user's point of view that's pretty much all there is to change. 
There'll still be plenty of room for specialized algorithms, but their purpose would be limited to optimization. Correctness would be taken care of by the basic range interface, and foreach should follow suit and iterate by grapheme by default. I wrote this example (or something similar) earlier in this thread: foreach (grapheme; "exposé") if (grapheme == "é") break; In this example, even if one of these two strings uses the pre-combined form of "é" and the other uses a combining acute accent, the equality would still hold since foreach iterates on full graphemes and == compares using normalization. I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types. Now, given that dchar can't actually work completely as an element type, you'd either need the string type to be a new type or the element type to be a new type. So, either the string type has char[], wchar[], or dchar[] for its element type, or char[], wchar[], and dchar[] have something like uchar as their element type, where uchar is a struct which contains a char[], wchar[], or dchar[] which holds a single grapheme. Having a new type for grapheme would work too. My preference still goes to reusing the string type because it makes the semantics simpler to understand, especially when comparing graphemes with literals.
Jan 15 2011
On 2011-01-15 23:58:30 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:Character literals are treated as simple numbers by the language. By that I mean that you can write 'b' - 'a' == 1 and it'll be true. Arithmetic makes absolutely no sense for graphemes. If you want a special literal for graphemes, I'm afraid you'll have to invent something new. And at this point, why not use a string?On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:I have my idea. I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme. The second component would be to make the string equality operator (==) for strings compare them in their normalized form, so that ("e" with combining acute accent) == (pre-combined "é"). I think this would make D support for Unicode much more intuitive. This implies some semantic changes, mainly that everywhere you write a "character" you must use double-quotes (string "a") instead of single quotes (code point 'a'), but from the user's point of view that's pretty much all there is to change. If a character literal actually became a grapheme instead of a dchar, then that would likely solve that issue. But I fear that the semantics of having a range be its own element type actually make understanding it _harder_, not simpler. Being forced to compare a string literal against what should be a character would definitely confuse programmers. I remember that someone already complained about this issue because he had a tree of ranges, and Andrei said he would take a look at this problem eventually. Perhaps now would be a good time.
There'll still be plenty of room for specialized algorithms, but their purpose would be limited to optimization. Correctness would be taken care of by the basic range interface, and foreach should follow suit and iterate by grapheme by default. I wrote this example (or something similar) earlier in this thread: foreach (grapheme; "exposé") if (grapheme == "é") break; In this example, even if one of these two strings uses the pre-combined form of "é" and the other uses a combining acute accent, the equality would still hold since foreach iterates on full graphemes and == compares using normalization.I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types.Now, given that dchar can't actually work completely as an element type, you'd either need the string type to be a new type or the element type to be a new type. So, either the string type has char[], wchar[], or dchar[] for its element type, or char[], wchar[], and dchar[] have something like uchar as their element type, where uchar is a struct which contains a char[], wchar[], or dchar[] which holds a single grapheme.Having a new type for grapheme would work too. My preference still goes to reusing the string type because it makes the semantics simpler to understand, especially when comparing graphemes with literals.Making a new character or grapheme type which represented a grapheme would be _far_ simpler to understand IMO. However, making it work really well would likely require that the compiler know about the grapheme type like it knows about dchar.I'm looking for a simple solution. One that doesn't involve inventing a new grapheme literal syntax or adding new types the compiler must know about. 
I'm not really opposed to any of this, but the more complicated the solution, the less likely it is to be adopted. All I'm asking is that Unicode strings behave as Unicode strings should behave. Making iteration use graphemes by default and string comparison use the normalized form by default seems like a simple way to achieve that goal. The most important thing is not the implementation, but that the default behaviour be the right behaviour. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 15 2011
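Michel's two proposed defaults above (equality under normalization, iteration by grapheme) can be checked concretely. The sketch below uses Python only because its standard unicodedata module exposes the normalization forms under discussion, not because the thread is about Python; the str_eq helper is a hypothetical stand-in for the proposed default ==, not any real D API:

```python
import unicodedata

def str_eq(a, b):
    # Compare strings in a canonical form (NFC), so that "e" followed by
    # a combining acute accent compares equal to the precomposed "e-acute".
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

decomposed = "expose\u0301"   # "expose" + U+0301 COMBINING ACUTE ACCENT
precomposed = "expos\u00e9"   # precomposed U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(str_eq(decomposed, precomposed))  # True: same text under normalization
print(decomposed == precomposed)        # False: naive code-point comparison
```

Canonical equivalence under NFC/NFD is exactly the relation Michel wants the default == to respect.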
Michel Fortin Wrote:Character literals are treated as simple numbers by the language. By that I mean that you can write 'b' - 'a' == 1 and it'll be true. Arithmetic makes absolutely no sense for graphemes. If you want a special literal for graphemes, I'm afraid you'll have to invent something new. And at this point, why not use a string?I understand your concern regarding a simpler implementation. You want to minimize the disruption caused by the proposed change. I'd argue that creating a specialized string type as Steve suggests makes integration *easier*. Your suggestion requires that foreach be changed to default to graphemes. I agree that this can be done because it will not break silently, but with Steve's string type this is unnecessary since the type itself would provide a grapheme range interface and the compiler doesn't need to know about this type at all. string becomes a regular library type. Of course, the type should support: string foo = "bar"; by making an implicit conversion from current arrays (to minimize compiler changes) The only disruption as far as I can tell would be using 'a' type literals instead of "a" but that will come up in compilation after string defaults to the new type. Also, all occurrences of: string foo = ...; foreach (c; foo) {...} // c is now a grapheme will now do the correct thing by default.Making a new character or grapheme type which represented a grapheme would be _far_ simpler to understand IMO. However, making it work really well would likely require that the compiler know about the grapheme type like it knows about dchar.I'm looking for a simple solution. One that doesn't involve inventing a new grapheme literal syntax or adding new types the compiler must know about. I'm not really opposed to any of this, but the more complicated the solution, the less likely it is to be adopted. All I'm asking is that Unicode strings behave as Unicode strings should behave. 
Making iteration use graphemes by default and string comparison use the normalized form by default seems like a simple way to achieve that goal. The most important thing is not the implementation, but that the default behaviour be the right behaviour. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 15 2011
On 2011-01-16 02:11:14 -0500, foobar <foo bar.com> said:I understand your concern regarding a simpler implementation. You want to minimize the disruption caused by the proposed change. I'd argue that creating a specialized string type as Steve suggests makes integration *easier*. Your suggestion requires that foreach be changed to default to graphemes. I agree that this can be done because it will not break silently, but with Steve's string type this is unnecessary since the type itself would provide a grapheme range interface and the compiler doesn't need to know about this type at all. string becomes a regular library type. Of course, the type should support: string foo = "bar"; by making an implicit conversion from current arrays (to minimize compiler changes)It should also work for: auto foo = "bar";The only disruption as far as I can tell would be using 'a' type literals instead of "a" but that will come up in compilation after string defaults to the new type.You say "after string defaults to the new type", but I don't think this change to the language will pass. It'll break TDPL for one thing, so it's surely out for D2. And I somewhat doubt it's low-level enough for Walter's taste. I don't care much if the default type is an array or not, I just want the default type to work properly as a Unicode string. The very small participation in this thread from the key decision makers (Andrei and Walter) worries me, however. I'm not even sure we'll achieve that goal. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 16 2011
Michel Fortin Wrote:On 2011-01-16 02:11:14 -0500, foobar <foo bar.com> said:Right. This does require compiler changes.I understand your concern regarding a simpler implementation. You want to minimize the disruption caused by the proposed change. I'd argue that creating a specialized string type as Steve suggests makes integration *easier*. Your suggestion requires that foreach be changed to default to graphemes. I agree that this can be done because it will not break silently, but with Steve's string type this is unnecessary since the type itself would provide a grapheme range interface and the compiler doesn't need to know about this type at all. string becomes a regular library type. Of course, the type should support: string foo = "bar"; by making an implicit conversion from current arrays (to minimize compiler changes)It should also work for: auto foo = "bar";string is an alias in phobos so it's more of a stdlib change but I see your point about TDPL. I did get the feeling that Andrei is willing to make a change if it proves worthwhile by preventing writing bad code (which we both agree this change accomplishes).The only disruption as far as I can tell would be using 'a' type literals instead of "a" but that will come up in compilation after string defaults to the new type.You say "after string defaults to the new type", but I don't think this change to the language will pass. It'll break TDPL for one thing, so it's surely out for D2. And I somewhat doubt it's low-level enough for Walter's taste.I don't care much if the default type is an array or not, I just want the default type to work properly as a Unicode string. The very small participation in this thread from the key decision makers (Andrei and Walter) worries me, however. I'm not even sure we'll achieve that goal.Andrei did take part and even asked for links that explain the subject. Perhaps the quiet is due to the mastermind doing research on the topic rather than reluctance to make any changes. 
:) -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 16 2011
On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 1/13/11 11:35 AM, Steven Schveighoffer wrote:That's forgetting that most of the time people care about graphemes (user-perceived characters), not code points.On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.Let's take a look: // Incorrect string code void fun(string s) { foreach (i; 0 .. s.length) { writeln("The character in position ", i, " is ", s[i]); } } // Incorrect string_t code void fun(string_t!char s) { foreach (i; 0 .. s.codeUnits) { writeln("The character in position ", i, " is ", s[i]); } } Both functions are incorrect, albeit in different ways. The only improvement I'm seeing is that the user needs to write codeUnits instead of length, which may make her think twice. Clearly, however, copiously incorrect code can be written with the proposed interface because it tries to hide the reality that underneath a variable-length encoding is being used, but doesn't hide it completely (albeit for good efficiency-related reasons).You might be looking at my previous version. The new version (recently posted) will throw an exception for that code if a multi-code-unit code-point is found.Except it breaks with combining characters. For instance, take the string "t̃", which is two code points -- 't' followed by combining tilde (U+0303) -- and you'll get the following output: The character in position 0 is t The character in position 1 is ̃ (Note that the tilde becomes combined with the preceding space character.) 
The conception of character that normal people have does not match the notion of code points when combining characters enters the equation. -- Michel Fortin michel.fortin michelf.com http://michelf.com/It also supports this: foreach(i, d; s) { writeln("The character in position ", i, " is ", d); } where i is the index (might not be sequential)Well string supports that too, albeit with the nit that you need to specify dchar.
Jan 13 2011
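Michel's "t" + combining tilde example above is easy to reproduce. Here is a sketch in Python (chosen purely for illustration, since the point is about Unicode rather than any particular language), showing that per-code-point iteration splits the user-perceived character exactly as in his sample output:

```python
# "t" followed by U+0303 COMBINING TILDE: one grapheme, two code points.
s = "t\u0303"

# Iterating by code point, as the dchar-based range would:
for i, cp in enumerate(s):
    print("The character in position", i, "is", repr(cp))
# Position 0 holds 't'; position 1 holds the bare combining tilde,
# which renders attached to whatever character precedes it.

print(len(s))  # 2 code points for what a reader perceives as one character
```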
On 1/13/11 7:09 PM, Michel Fortin wrote:On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).On 1/13/11 11:35 AM, Steven Schveighoffer wrote:That's forgetting that most of the time people care about graphemes (user-perceived characters), not code points.On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.Let's take a look: // Incorrect string code void fun(string s) { foreach (i; 0 .. s.length) { writeln("The character in position ", i, " is ", s[i]); } } // Incorrect string_t code void fun(string_t!char s) { foreach (i; 0 .. s.codeUnits) { writeln("The character in position ", i, " is ", s[i]); } } Both functions are incorrect, albeit in different ways. The only improvement I'm seeing is that the user needs to write codeUnits instead of length, which may make her think twice. Clearly, however, copiously incorrect code can be written with the proposed interface because it tries to hide the reality that underneath a variable-length encoding is being used, but doesn't hide it completely (albeit for good efficiency-related reasons).You might be looking at my previous version. 
The new version (recently posted) will throw an exception for that code if a multi-code-unit code-point is found.This might be a good time to see whether we need to address graphemes systematically. Could you please post a few links that would educate me and others in the mysteries of combining characters? Thanks, AndreiExcept it breaks with combining characters. For instance, take the string "t̃", which is two code points -- 't' followed by combining tilde (U+0303) -- and you'll get the following output: The character in position 0 is t The character in position 1 is ̃ (Note that the tilde becomes combined with the preceding space character.) The conception of character that normal people have does not match the notion of code points when combining characters enters the equation.It also supports this: foreach(i, d; s) { writeln("The character in position ", i, " is ", d); } where i is the index (might not be sequential)Well string supports that too, albeit with the nit that you need to specify dchar.
Jan 13 2011
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:igoj6s$17r6$1 digitalmars.com...I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).It's what they want, they just don't know it. Graphemes are what many people *think* code points are.This might be a good time to see whether we need to address graphemes systematically. Could you please post a few links that would educate me and others in the mysteries of combining characters?Maybe someone else has a link to an explanation (I don't), but it's basically just this: Three levels of abstraction from lowest to highest: - Code Unit (ie, encoding) - Code Point (ie, what Unicode assigns distinct numbers to) - Grapheme (ie, what we think of as a "character") A code-point can be made up of one or more code-units. Likewise, a grapheme can be made up of one or more code-points. There are (at least) two types of code points: - Regular ones, such as letters, digits, and punctuation. - "Combining Characters", such as accent marks (or if you're familiar with Japanese, the little things in the upper-right corner that change an "s" to a "z" or an "h" to a "p". Or like German's umlaut - the two dots above a vowel). Ie, things that are not characters in their own right, but merely modify other characters. These can often (always?) be thought of as being like overlays. If a code point representing a "combining character" exists in a string, then instead of being displayed as a character it merely modifies whatever code-point came before it. 
So, for instance, if you want to store the German word for five (in all lower-case), there are two ways to do it: [ 'f', {u with the umlaut}, 'n', 'f' ] Or: [ 'f', 'u', {umlaut combining character}, 'n', 'f' ] Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes. Note that while some characters exist in pre-combined form (such as the {u with the umlaut} above), legend has it there are others that can only be represented using a combining character. It's also my understanding, though I'm not certain, that sometimes multiple combining characters can be used together on the same "root" character. Caveat: There may very well be further complications that I'm not aware of. Heck, knowing Unicode, there probably are.
Jan 13 2011
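Nick's two encodings of the German word "fünf" can be checked directly. A sketch in Python (used here only because its unicodedata module makes the normalization forms easy to demonstrate):

```python
import unicodedata

# The two encodings from the post: precomposed u-umlaut versus
# 'u' followed by U+0308 COMBINING DIAERESIS.
precomposed = "f\u00fcnf"    # [ 'f', {u with the umlaut}, 'n', 'f' ]
decomposed  = "fu\u0308nf"   # [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

print(precomposed == decomposed)  # False: different code-point sequences
# Canonical normalization maps each form onto the other:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Both sequences render identically; only normalization makes a naive comparison agree with what the reader sees.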
On 1/13/11 10:26 PM, Nick Sabalausky wrote: [snip][ 'f', {u with the umlaut}, 'n', 'f' ] Or: [ 'f', 'u', {umlaut combining character}, 'n', 'f' ] Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes. Note that while some characters exist in pre-combined form (such as the {u with the umlaut} above), legend has it there are others than can only be represented using a combining character. It's also my understanding, though I'm not certain, that sometimes multiple combining characters can be used together on the same "root" character.Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point? Andrei
Jan 13 2011
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:igoqrm$1n5r$1 digitalmars.com...On 1/13/11 10:26 PM, Nick Sabalausky wrote: [snip]My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain. FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character Michel or spir might have better links though.[ 'f', {u with the umlaut}, 'n', 'f' ] Or: [ 'f', 'u', {umlaut combining character}, 'n', 'f' ] Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes. Note that while some characters exist in pre-combined form (such as the {u with the umlaut} above), legend has it there are others than can only be represented using a combining character. It's also my understanding, though I'm not certain, that sometimes multiple combining characters can be used together on the same "root" character.Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?
Jan 13 2011
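Andrei's question (are there combinations with no single code point?) and Nick's "7 with an umlaut" guess can both be answered mechanically: canonical composition (NFC) leaves a base-plus-mark pair as two code points whenever no precomposed character exists, and no digit has a precomposed accented form. A Python sketch, for illustration only:

```python
import unicodedata

# "7" + U+0308 COMBINING DIAERESIS: a legal combination, but Unicode
# defines no precomposed "7 with umlaut", so NFC leaves it as two code points.
s = "7\u0308"
print(len(unicodedata.normalize("NFC", s)))            # 2: no precomposed form

# By contrast, "u" + the same mark composes to the single code point U+00FC.
print(len(unicodedata.normalize("NFC", "u\u0308")))    # 1
```

So the answer to Andrei's question is yes: some graphemes can only be represented as a base code point plus one or more combining marks.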
"Nick Sabalausky" <a a.a> wrote in message news:igori7$1ovh$1 digitalmars.com..."Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:igoqrm$1n5r$1 digitalmars.com...Heh, as if that wasn't bad enough, there's also digraphs which, from what I can tell, seem to be single code-points that represent more than one glyph/character/grapheme: http://en.wikipedia.org/wiki/Digraph_(orthography)#Digraphs_in_Unicode This page may be helpful too: http://en.wikipedia.org/wiki/Precomposed_characterOn 1/13/11 10:26 PM, Nick Sabalausky wrote: [snip]My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain. FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character Michel or spir might have better links though.[ 'f', {u with the umlaut}, 'n', 'f' ] Or: [ 'f', 'u', {umlaut combining character}, 'n', 'f' ] Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes. Note that while some characters exist in pre-combined form (such as the {u with the umlaut} above), legend has it there are others than can only be represented using a combining character. It's also my understanding, though I'm not certain, that sometimes multiple combining characters can be used together on the same "root" character.Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?
Jan 13 2011
Am 14.01.2011 08:00, schrieb Nick Sabalausky:"Nick Sabalausky"<a a.a> wrote in message news:igori7$1ovh$1 digitalmars.com...OMG, this is really fucked up. Can't we just go back to 8bit charsets like ISO 8859-* etc? :/"Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org> wrote in message news:igoqrm$1n5r$1 digitalmars.com...Heh, as if that wasn't bad enough, there's also digraphs which, from what I can tell, seem to be single code-points that represent more than one glyph/character/grapheme: http://en.wikipedia.org/wiki/Digraph_(orthography)#Digraphs_in_Unicode This page may be helpful too: http://en.wikipedia.org/wiki/Precomposed_characterOn 1/13/11 10:26 PM, Nick Sabalausky wrote: [snip]My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain. FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character Michel or spir might have better links though.[ 'f', {u with the umlaut}, 'n', 'f' ] Or: [ 'f', 'u', {umlaut combining character}, 'n', 'f' ] Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes. Note that while some characters exist in pre-combined form (such as the {u with the umlaut} above), legend has it there are others than can only be represented using a combining character. It's also my understanding, though I'm not certain, that sometimes multiple combining characters can be used together on the same "root" character.Thanks. 
One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?
Jan 14 2011
On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a a.a> wrote:"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:igoqrm$1n5r$1 digitalmars.com...http://en.wikipedia.org/wiki/Unicode_normalization Linked from that page, the normalization process is probably something we need to look at. Using decomposed canonical form would mean we need more state than just which code unit we are on, plus it creates more likelihood that a match will be found with part of a grapheme (spir or Michel brought it up earlier). So I think the correct case is to use composed canonical form. This is after just reading that page, so maybe I'm missing something. Non-composable combinations would be a problem. The string range is formed on the basis that the element type is a dchar. If there are combinations that cannot be composed into a single dchar, then the element type has to be a dchar array (or some other type which contains all the info). The other option is to simply leave them decomposed. Then you risk things like partial matches. I'm leaning towards a solution like this: While iterating a string, it should output dchars in normalized composed form. But a specialized comparison function should be used when doing things like searches or regex, because it might not be possible to compose two combining characters. The drawback to this is that a dchar might not be able to represent a grapheme (only if it cannot be composed), but I think it's too much of a hit in complexity and performance to make the element type of a string larger than a dchar. Those who wish to work with a more comprehensive string type can use a more complex string type such as the one created by spir. Does that sound reasonable? -SteveOn 1/13/11 10:26 PM, Nick Sabalausky wrote: [snip]My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". 
Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain. FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character[ 'f', {u with the umlaut}, 'n', 'f' ] Or: [ 'f', 'u', {umlaut combining character}, 'n', 'f' ] Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes. Note that while some characters exist in pre-combined form (such as the {u with the umlaut} above), legend has it there are others than can only be represented using a combining character. It's also my understanding, though I'm not certain, that sometimes multiple combining characters can be used together on the same "root" character.Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?
Jan 14 2011
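The partial-match risk Steven refers to (a search hitting only part of a grapheme when the string is in decomposed form) looks like this in practice. Python again, purely for illustration of the Unicode behavior:

```python
import unicodedata

# In decomposed (NFD) form, a naive substring search for "e" matches
# inside "é", because the base letter and its accent are separate
# code points.
haystack = unicodedata.normalize("NFD", "caf\u00e9")   # 'cafe' + combining acute
print("e" in haystack)   # True: spurious hit on the accented e

# In composed (NFC) form the grapheme is a single code point U+00E9,
# so the bare "e" no longer matches.
print("e" in unicodedata.normalize("NFC", haystack))   # False
```

This is the argument for iterating in composed canonical form by default, with a normalization-aware comparison for the combinations that cannot compose.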
On Friday 14 January 2011 04:47:59 Steven Schveighoffer wrote:On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a a.a> wrote:Well, there's plenty in std.string that already deals in strings rather than dchar, and for the most part, any case where you couldn't fit a grapheme in a dchar could be covered by using a string."Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:igoqrm$1n5r$1 digitalmars.com...http://en.wikipedia.org/wiki/Unicode_normalization Linked from that page, the normalization process is probably something we need to look at. Using decomposed canonical form would mean we need more state than just what code-unit are we on, plus it creates more likelyhood that a match will be found with part of a grapheme (spir or Michel brought it up earlier). So I think the correct case is to use composed canonical form. This is after just reading that page, so maybe I'm missing something. Non-composable combinations would be a problem. The string range is formed on the basis that the element type is a dchar. If there are combinations that cannot be composed into a single dchar, then the element type has to be a dchar array (or some other type which contains all the info). The other option is to simply leave them decomposed. Then you risk things like partial matches. I'm leaning towards a solution like this: While iterating a string, it should output dchars in normalized composed form. But a specialized comparison function should be used when doing things like searches or regex, because it might not be possible to compose two combining characters. The drawback to this is that a dchar might not be able to represent a grapheme (only if it cannot be composed), but I think it's too much of a hit in complexity and performance to make the element type of a string larger than a dchar.On 1/13/11 10:26 PM, Nick Sabalausky wrote: [snip]My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". 
I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain. FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character[ 'f', {u with the umlaut}, 'n', 'f' ] Or: [ 'f', 'u', {umlaut combining character}, 'n', 'f' ] Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes. Note that while some characters exist in pre-combined form (such as the {u with the umlaut} above), legend has it there are others than can only be represented using a combining character. It's also my understanding, though I'm not certain, that sometimes multiple combining characters can be used together on the same "root" character.Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?Those who wish to work with a more comprehensive string type can use a more complex string type such as the one created by spir. Does that sound reasonable?We really should have something along those lines it seems. From what little _I_ know, the basic approach that you suggest seems like the correct one, but perhaps someone more knowledgeable will be able to come up with a reason why it's not a good idea. Certainly, I think that any solution that I'd come up with would be similar to what you're suggesting. - Jonathan M Davis
Jan 14 2011
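The composed-vs-decomposed trade-off discussed in the post above can be demonstrated outside of D. A sketch in Python, whose `unicodedata` module implements the same Unicode normalization forms (this is an illustration of the Unicode behavior, not the D API under discussion):

```python
import unicodedata

# "é" in composed (NFC) and decomposed (NFD) canonical form
composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# The two spellings render identically but compare unequal code-unit-wise...
assert composed != decomposed
# ...until both sides are brought to the same canonical form.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# The partial-match hazard of decomposed form: a search for the bare
# combining accent "matches" inside the decomposed string...
assert "\u0301" in decomposed
# ...but not inside the composed one, which is why composed form is
# safer for naive substring searches.
assert "\u0301" not in unicodedata.normalize("NFC", decomposed)
```

This is exactly the "partial match with part of a grapheme" risk mentioned above: any code-unit-level search over decomposed text can land inside a grapheme.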
On 01/14/2011 07:44 AM, Nick Sabalausky wrote:"Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org> wrote in message news:igoqrm$1n5r$1 digitalmars.com...The problem is then whether a font knows how to display it. My usual fonts (DejaVu series, pretty good with Unicode) show: 7̈ meaning they do not know how to combine digits with diacritics (they do it well with other rather strange combinations.) But: one of the relevant advantages of decomposed forms is that when they don't know the character, they can still show at least the component marks, here '7' & '~'. Which is better than nothing for a user who knows the scripting system. If I try to display for instance a _precomposed_ syllable from a language my font does not know, I will get instead either a little square with the codepoint written inside in minuscule digits, or a placeholder like inverse-video "?". denis _________________ vita es estrany spir.wikidot.comOn 1/13/11 10:26 PM, Nick Sabalausky wrote: [snip]My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain.[ 'f', {u with the umlaut}, 'n', 'f' ] Or: [ 'f', 'u', {umlaut combining character}, 'n', 'f' ] Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes. Note that while some characters exist in pre-combined form (such as the {u with the umlaut} above), legend has it there are others than can only be represented using a combining character. It's also my understanding, though I'm not certain, that sometimes multiple combining characters can be used together on the same "root" character.Thanks. 
One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?
Jan 14 2011
On 2011-01-14 01:44:19 -0500, "Nick Sabalausky" <a a.a> said:"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:igoqrm$1n5r$1 digitalmars.com...Correct, there's a lot of combinations with no pre-combined form. This should be no surprise given that you can apply any number of combining marks to any character. mythical 7 with an umlaut: 7̈ mythical 7 with umlaut, ring above, and acute accent: 7̈̊́ I can't guarantee your news reader will display the above correctly, but it works as described in mine (Unison on Mac OS X). In fact, it should work in all Cocoa-based applications. This probably includes iOS-based devices too, but I haven't tested there. -- Michel Fortin michel.fortin michelf.com http://michelf.com/Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain.
Jan 14 2011
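Michel's mythical 7-with-umlaut also pins down the limit Steven worried about: NFC composition can only produce code points that Unicode actually defines in precomposed form. A Python check with `unicodedata` (an illustration of the Unicode rules, not D code):

```python
import unicodedata

u_umlaut = "u\u0308"      # 'u' + COMBINING DIAERESIS: a precomposed form exists
seven_umlaut = "7\u0308"  # '7' + COMBINING DIAERESIS: no precomposed form

# 'u' + diaeresis composes into the single code point U+00FC ('ü')...
assert unicodedata.normalize("NFC", u_umlaut) == "\u00fc"
# ...but '7' + diaeresis survives NFC as two code points, so a design that
# assumes one dchar per grapheme necessarily breaks on strings like this.
assert len(unicodedata.normalize("NFC", seven_umlaut)) == 2
```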
Michel Fortin <michel.fortin michelf.com> wrote:mythical 7 with an umlaut: 7̈ mythical 7 with umlaut, ring above, and acute accent: 7̈̊́ I can't guarantee your news reader will display the above correctly, but it works as described in mine (Unison on Mac OS X). In fact, it should work in all Cocoa-based applications. This probably includes iOS-based devices too, but I haven't tested there.All the examples given so far worked fine on my iPhone. Gianluigi
Jan 14 2011
On 01/14/2011 07:33 AM, Andrei Alexandrescu wrote:Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?See my previous follow-up to Nick's explanation. But the answer is yes, not only for usual characters, but due to the fact that a user is, theoretically and practically, totally free to combine base and combining codes --even to invent characters. The only limit is that fonts will not know how to display improbable combinations. (See also my presentation text, which shows an example of dots below and above Greek letters.) Denis _________________ vita es estrany spir.wikidot.com
Jan 14 2011
On 14.01.2011 07:26, Nick Sabalausky wrote:"Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org> wrote in message news:igoj6s$17r6$1 digitalmars.com...Agreed. Up until spir mentioned graphemes in this newsgroup I always thought that one Unicode code point == one character on the screen. I guess in the majority of use cases you want to operate on user-perceived characters.I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).It's what they want, they just don't know it. Graphemes are what many people *think* code points are.
Jan 14 2011
On 01/14/2011 01:52 PM, Daniel Gibson wrote:On 14.01.2011 07:26, Nick Sabalausky wrote:That's what makes sense for the user in the 99.9% case, thus that's what makes sense for the programmer, thus that's what makes sense for the language/type/lib designer. denis _________________ vita es estrany spir.wikidot.com"Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org> wrote in message news:igoj6s$17r6$1 digitalmars.com...Agreed. Up until spir mentioned graphemes in this newsgroup I always thought that one Unicode code point == one character on the screen. I guess in the majority of use cases you want to operate on user-perceived characters.I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).It's what they want, they just don't know it. Graphemes are what many people *think* code points are.
Jan 14 2011
On 01/14/2011 07:26 AM, Nick Sabalausky wrote:"Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org> wrote in message news:igoj6s$17r6$1 digitalmars.com...If anyone finds a pointer to such an explanation, bravo, and thank you. (You will certainly not find it in Unicode literature, for instance.) Nick's explanation below is good and concise. (Just 2 notes added.)I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).This might be a good time to see whether we need to address graphemes systematically. Could you please post a few links that would educate me and others in the mysteries of combining characters?Maybe someone else has a link to an explanation (I don't), but it's basically just this:Three levels of abstraction from lowest to highest: - Code Unit (ie, encoding) - Code Point (ie, what Unicode assigns distinct numbers to) - Grapheme (ie, what we think of as a "character") A code-point can be made up of one or more code-units. Likewise, a grapheme can be made up of one or more code-points. There are (at least) two types of code points: - Regular ones, such as letters, digits, and punctuation. - "Combining Characters", such as accent marks (or if you're familiar with Japanese, the little things in the upper-right corner that change an "s" to a "z" or an "h" to a "p". Or like German's umlaut - the two dots above a vowel). Ie, things that are not characters in their own right, but merely modify other characters. These can often (always?) be thought of as being like overlays.You can also say there are 2 kinds of characters: simple like "u" & composite "ü" or "ṵ̈̈". 
The former are coded with a single (base) code, the latter with one (rarely more) base codes and an arbitrary number of combining codes. For a majority of _common_ characters made of 2 or 3 codes (western language letters, Korean Hangul syllables,...), precombined codes have been added to the set. Thus, they can be coded with a single code like simple characters. [Also note, to avoid things being too simple ;-), some (few) combining codes called "prepend" come _before_ the base in raw code sequence...]If a code point representing a "combining character" exists in a string, then instead of being displayed as a character it merely modifies whatever code-point came before it. So, for instance, if you want to store the German word for five (in all lower-case), there are two ways to do it: [ 'f', {u with the umlaut}, 'n', 'f' ] Or: [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]Note: the second form is the base form for Unicode. There are reasons to have chosen it (see my text), and why UCS does not and simply cannot propose precomposed codes for all possible composite characters.Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes. Note that while some characters exist in pre-combined form (such as the {u with the umlaut} above), legend has it there are others than can only be represented using a combining character. It's also my understanding, though I'm not certain, that sometimes multiple combining characters can be used together on the same "root" character.There is no logical limit, only practical ones, such as how to display 3 diacritics above the same base? 
You can invent a script for a mythical folk's language if you like :-) Also, some examples of real language characters (Hebrew, IIRC) in Unicode test data sets hold up to 8 codes.Caveat: There may very well be further complications that I'm not aware of. Heck, knowing Unicode, there probably are.Denis _________________ vita es estrany spir.wikidot.com
Jan 14 2011
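Nick's two spellings of the German word for five can be checked mechanically: they have different code-point counts yet are canonically equivalent. A Python sketch using `unicodedata` as a stand-in for the Unicode normalization rules (illustrative, not the D API):

```python
import unicodedata

precomposed = "f\u00fcnf"   # [ 'f', 'ü', 'n', 'f' ] — 4 code points
decomposed  = "fu\u0308nf"  # [ 'f', 'u', {umlaut combining character}, 'n', 'f' ] — 5 code points

# Same four-letter word, different code-point counts:
assert len(precomposed) == 4 and len(decomposed) == 5
assert precomposed != decomposed
# Canonically equivalent — equal once both are normalized to NFC:
assert unicodedata.normalize("NFC", decomposed) == precomposed
```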
"spir" <denis.spir gmail.com> wrote in message news:mailman.619.1295012086.4748.digitalmars-d puremagic.com...If anyone finds a pointer to such an explanation, bravo, and than you. (You will certainly not find it in Unicode literature, for instance.) Nick's explanation below is good and concise. (Just 2 notes added.)Yea, most Unicode explanations seem to talk all about "code-units vs code-points" and then they'll just have a brief note like "There's also other things like digraphs and combining codes." And that'll be all they mention. You're right about the Unicode literature. It's the usual standards-body documentation, same as W3C: "Instead of only some people understanding how this works, lets encode the documentation in legalese (and have twenty only-slightly-different versions) to make sure that nobody understands how it works."You can also say there are 2 kinds of characters: simple like "u" & composite "" or "??". The former are coded with a single (base) code, the latter with one (rarely more) base codes and an arbitrary number of combining codes.Couple questions about the "more than one base codes": - Do you know an example offhand? - Does that mean like a ligature where the base codes form a single glyph, or does it mean that the combining code either spans or operates over multiple glyphs? Or can it go either way?For a majority of _common_ characters made of 2 or 3 codes (western language letters, korean Hangul syllables,...), precombined codes have been added to the set. Thus, they can be coded with a single code like simple characters.Out of curiosity, how do decomposed Hangul characters work? (Or do you know?) Not actually knowing any Korean, my understanding is that they're a set of 1 to 4 phoenetic glyphs that are then combined into one glyph. 
So, is it like a series of base codes that automatically combine, or are there combining characters involved?[Also note, to avoid things being too simple ;-), some (few) combining codes called "prepend" come _before_ the base in raw code sequence...]Fun!
Jan 14 2011
On 01/14/2011 08:20 PM, Nick Sabalausky wrote:"spir"<denis.spir gmail.com> wrote in message news:mailman.619.1295012086.4748.digitalmars-d puremagic.com...If anyone is interested, ICU's documentation is far more readable (and intended for programmers). ICU is *the* reference library for dealing with unicode (an IBM open source product, with C/C++/Java interfaces), used by many other products in the background. ICU: http://site.icu-project.org/ user guide: http://userguide.icu-project.org/ section about text segmentation: http://userguide.icu-project.org/boundaryanalysis Note that just like Unicode, they consider forming graphemes (grouping codes into character representations) a simple particular case of text segmentation, which they call "boundary analysis" (but they have the nice idea to use "character" instead of "grapheme"). The only mention I found in ICU's doc of the issue we have talked about here lengthily is (at http://userguide.icu-project.org/strings): "Handling Lengths, Indexes, and Offsets in Strings The length of a string and all indexes and offsets related to the string are always counted in terms of UChar code units, not in terms of UChar32 code points. (This is the same as in common C library functions that use char * strings with multi-byte encodings.) Often, a user thinks of a "character" as a complete unit in a language, like an 'Ä', while it may be represented with multiple Unicode code points including a base character and combining marks. (See the Unicode standard for details.) This often requires users to index and pass strings (UnicodeString or UChar *) with multiple code units or code points. It cannot be done with single-integer character types. Indexing of such "characters" is done with the BreakIterator class (in C: ubrk_ functions). Even with such "higher-level" indexing functions, the actual index values will be expressed in terms of UChar code units. 
When more than one code unit is used at a time, the index value changes by more than one at a time. [...] (ICU's UChar are like D wchar.)If anyone finds a pointer to such an explanation, bravo, and than you. (You will certainly not find it in Unicode literature, for instance.) Nick's explanation below is good and concise. (Just 2 notes added.)Yea, most Unicode explanations seem to talk all about "code-units vs code-points" and then they'll just have a brief note like "There's also other things like digraphs and combining codes." And that'll be all they mention. You're right about the Unicode literature. It's the usual standards-body documentation, same as W3C: "Instead of only some people understanding how this works, lets encode the documentation in legalese (and have twenty only-slightly-different versions) to make sure that nobody understands how it works."No. I know this only from it being mentioned in documentation. Unless we consider (see below) L jamo as base codes.You can also say there are 2 kinds of characters: simple like "u"& composite "ü" or "ṵ̈̈". The former are coded with a single (base) code, the latter with one (rarely more) base codes and an arbitrary number of combining codes.Couple questions about the "more than one base codes": - Do you know an example offhand?- Does that mean like a ligature where the base codes form a single glyph, or does it mean that the combining code either spans or operates over multiple glyphs? Or can it go either way?IIRC examples like ij in Dutch are only considered "compatibility equivalent" to the corresponding ligatures, just like eg "ss" for "ß" in German. Meaning they should not be considered equal by default; this would be an additional feature (language- and app-dependent). Unlike base "e"+ combining "^" really == "ê".I know nothing about the Korean language except what I studied about its scripting system for Unicode algorithms (but one can also code said algorithm blindly). 
See http://en.wikipedia.org/wiki/Hangul and about Hangul in Unicode http://en.wikipedia.org/wiki/Korean_language_and_computers. What I understand (beware, it's just wild deductions) is there are 3 kinds of "jamo" scripting marks (noted L, V, T) that can combine into syllabic "graphemes", respectively in first, medial, last place. These marks indeed somehow correspond to vocalic or consonantic phonemes. In Unicode, in addition to such jamo, which are simple marks (like base letters and diacritics in Latin-based languages), there are precombined codes for LV and LVT combinations (like for "ä" or "û"). We could thus think that Hangul syllables are limited to 3 jamo. But: according to Unicode's official "grapheme cluster break" algorithm (read: how to group codepoints into characters) (http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries), codes for L jamo can also be followed by _and_ should be combined with other L, LV or LVT codes. Similarly, LV or V should be combined with V or VT, and LVT or T with T. (Seems logical.) So, I do not know how complicated a Hangul syllable can be in practice or in theory. If there can be in practice whole syllables following other schemes than L / LV / LVT, then this is another example of real language whole characters that cannot be coded by a single codepoint. Denis _________________ vita es estrany spir.wikidot.com
Jan 16 2011
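The L/V/T jamo scheme spir describes is directly visible in normalization: NFC composes jamo sequences into precomposed Hangul syllables algorithmically (no per-character table needed), and NFD splits them back. A Python check with `unicodedata` (illustrative):

```python
import unicodedata

# L (choseong KIYEOK, U+1100) + V (jungseong A, U+1161) compose
# into the precomposed LV syllable U+AC00 (가):
assert unicodedata.normalize("NFC", "\u1100\u1161") == "\uac00"
# Appending a T (jongseong KIYEOK, U+11A8) gives the LVT syllable U+AC01 (각):
assert unicodedata.normalize("NFC", "\u1100\u1161\u11a8") == "\uac01"
# Round trip: NFD decomposes the precomposed syllable back into L, V, T jamo.
assert unicodedata.normalize("NFD", "\uac01") == "\u1100\u1161\u11a8"
```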
Nick Sabalausky wrote:Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes.I know some German, and to the best of my knowledge there are zero combining characters for it. The umlauts and the ß both have their own code points.legend has it there are others than can only be represented using a combining character.??? I've never seen or heard of any. Not even in the old script that was in common use in Germany until after WW2.
Jan 15 2011
On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:I'm aware of that, and I have no definitive answer to the question. The issue *does* exist --as shown even by trivial examples such as Michel's below, not corner cases. The actual question is _not_ whether code or "grapheme" is the proper level of abstraction. To this, the answer is clear: codes are simply meaningless in 99% of cases. (All historic software deals with chars, conceptually, but they happen to be coded with single codes.) (And what about Objective-C? Why did its designers even bother with that?). The question is rather: why do we nearly all happily go on ignoring the issue? My present guess is a combination of factors: * The issue is masked by the misleading use of "abstract character" in Unicode literature. "Abstract" is very correct, but they should have found another term than "character", say "abstract scripting mark". Their deceiving terminological choice lets most programmers believe that codepoints code characters, like in historic charsets. (Even worse: some doc explicitly states that ICU's notion of character matches the programming notion of character.) * Unicode added precomposed codes for a bunch of characters, supposedly for backward compatibility with said charsets. (But where is the gain? We need to decode them anyway...) The consequence is, at the pedagogical level, very bad: most text-producing software (like editors) use such precomposed codes when available for a given character. So that programmers can happily go on believing in the code=character myth. (Note: the gain in space is ridiculous for western text.) * Most characters that appear in western texts (at least "official" characters of natural languages) have precomposed forms. * Programmers can very easily be unaware their code is incorrect: how do you even notice it in test output? 
Thus, practically, programmers can (1) simply not know the issue (2) have code that really works in typical use cases for their software (3) not notice their code runs incorrectly. There is also an intermediate situation between (2) & (3), similar to old problems with previous ASCII-only apps: they work wrongly when used in a non-English environment, but what can users do, concretely? Most often, they just have to cope with incorrectness, reinterpret outputs differently, and/or find workarounds by cheating with the interface. The responsibility of designers of tools for programmers is, imo, important. We should make the issue clear, first (very difficult, it's a ubiquitous myth to break down), and propose services that run correctly in situations where said issue is relevant, here manipulation of universal text, even if not very efficient at first. On my side, and about D, I wish that most D programmers (1) are aware of the problem (2) understand its why's & how's (3) know there is a correct solution. Then, (4) whether they actually use it is their choice (and I don't care whether or not they do).That's forgetting that most of the time people care about graphemes (user-perceived characters), not code points.I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).Beware: far too long text. https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction (the directory above contains the current rough implementation of Text, plus a bit of its brother package DUnicode)This might be a good time to see whether we need to address graphemes systematically. Could you please post a few links that would educate me and others in the mysteries of combining characters?Except it breaks with combining characters. 
For instance, take the string "t̃", which is two code points -- 't' followed by combining tilde (U+0303) -- and you'll get the following output: The character in position 0 is t The character in position 1 is ̃ (Note that the tilde becomes combined with the preceding space character.) The conception of character that normal people have does not match the notion of code points when combining characters enter the equation.It also supports this: foreach(i, d; s) { writeln("The character in position ", i, " is ", d); } where i is the index (might not be sequential)Well string supports that too, albeit with the nit that you need to specify dchar.Thanks, AndreiDenis _________________ vita es estrany spir.wikidot.com
Jan 14 2011
On Fri, 14 Jan 2011 08:14:02 -0500, spir <denis.spir gmail.com> wrote:On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:* I don't even know how to make a grapheme that is more than one code-unit, let alone more than one code-point :) Every time I try, I get 'invalid utf sequence'. I feel significantly ignorant on this issue, and I'm slowly getting enough knowledge to join the discussion, but being a dumb American who only speaks English, I have a hard time grasping how this shit all works. -SteveI'm aware of that, and I have no definitive answer to the question. The issue *does* exist --as shown even by trivial examples such as Michel's below, not corner cases. The actual question is _not_ whether code or "grapheme" is the proper level of abstraction. To this, the answer is clear: codes are simply meaningless in 99% cases. (All historic software deal with chars, conceptually, but they happen too be coded with single codes.) (And what about Objective-C? Why did its designers even bother with that?). The question is rather: why do we nearly all happily go on ignoring the issue? My present guess is a combination of factors: * The issue is masked by the misleading use of "abstract character" in unicode literature. "Abstract" is very correct, but they should have found another term as "character", say "abstract scripting mark". Their deceiving terminological choice lets most programmers believe that codepoints code characters, like in historic charsets. (Even worse: some doc explicitely states that ICU's notion of character matches the programming notion of character.) * ICU added precomposed codes for a bunch of characters, supposedly for backward compatility with said charsets. (But where is the gain? We need to decode them anyway...) The consequence is, at the pedagogical level, very bad: most text-producing software (like editors) use such precomposed codes when available for a given character. 
So that programmers can happily go on believing in the code=character myth. (Note: the gain in space is ridiculous for western text.) * Most characters that appear in western texts (at least "official" characters of natural languages) have precomposed forms. * Programmers can very easily be unaware their code is incorrect: how do you even notice it in test output?That's forgetting that most of the time people care about graphemes (user-perceived characters), not code points.I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).
Jan 14 2011
On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:* I don't even know how to make a grapheme that is more than one code-unit, let alone more than one code-point :) Every time I try, I get 'invalid utf sequence'. I feel significantly ignorant on this issue, and I'm slowly getting enough knowledge to join the discussion, but being a dumb American who only speaks English, I have a hard time grasping how this shit all works.1. See my text at https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction 2. writeln ("A\u0308\u0330"); <A + umlaut above + tilde below> If it does not display properly, either set your terminal to UTF* or use a more Unicode-aware font (eg DejaVu series). The point is not playing like that with Unicode flexibility. Rather that composite characters are just normal thingies in most languages of the world. Actually, on this point, English is a rare exception (discarding letters imported from foreign languages like French 'à'); to the point of being, I guess, the only western language without any diacritic. Denis _________________ vita es estrany spir.wikidot.com
Jan 14 2011
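spir's `"A\u0308\u0330"` example also illustrates canonical ordering: each combining mark carries a combining class (marks above are class 230, marks below are class 220), and NFD sorts marks of different classes into a canonical order so such sequences can be compared at all. In Python (an illustration of the Unicode rules, not D):

```python
import unicodedata

s = "A\u0308\u0330"  # 'A' + COMBINING DIAERESIS (above) + COMBINING TILDE BELOW

# One user-perceived character, three code points:
assert len(s) == 3
# Combining classes: 230 = marks above, 220 = marks below.
assert unicodedata.combining("\u0308") == 230
assert unicodedata.combining("\u0330") == 220
# NFD reorders marks of different classes by ascending combining class
# (below before above) -- the canonical order used for comparison.
assert unicodedata.normalize("NFD", s) == "A\u0330\u0308"
```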
On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir gmail.com> wrote:On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:I can't read that document, it's black background with super-dark-grey text.* I don't even know how to make a grapheme that is more than one code-unit, let alone more than one code-point :) Every time I try, I get 'invalid utf sequence'. I feel significantly ignorant on this issue, and I'm slowly getting enough knowledge to join the discussion, but being a dumb American who only speaks English, I have a hard time grasping how this shit all works.1. See my text at https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction2. writeln ("A\u0308\u0330"); <A + tilde above + umlaut below> (or the opposite) If it does not display properly, either set your terminal to UTF* or use a more unicode-aware font (eg DejaVu series).OK, I'll have to remember this so I can use it to test my string type ;)The point is not playing like that with Unicode flexibility. Rather that composite characters are just normal thingies in most languages of the world. Actually, on this point, english is a rare exception (discarding letters imported from foreign languages like french 'à'); to the point of beeing, I guess, the only western language without any diacritic.Is it common to have multiple modifiers on a single character? The problem I see with using decomposed canonical form for strings is that we would have to return a dchar[] for each 'element', which severely complicates code that, for instance, only expects to handle English. I was hoping to lazily transform a string into its composed canonical form, allowing the (hopefully rare) exception when a composed character does not exist. My thinking was that this at least gives a useful string representation for 90% of usages, leaving the remaining 10% of usages to find a more complex representation (like your Text type). 
If we only get like 20% or 30% there by making dchar the element type, then we haven't made it useful enough. Either way, we need a string type that can be compared canonically for things like searches or opEquals. -Steve
Jan 14 2011
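Steven's "compared canonically" requirement can be sketched as thin wrappers that normalize both operands before comparing or searching. A Python sketch (the helper names `canon_eq` and `canon_find` are hypothetical, for illustration only; this is not a proposed D API):

```python
import unicodedata

def canon_eq(a: str, b: str) -> bool:
    """Equality under canonical equivalence: NFC-normalize both sides first."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def canon_find(haystack: str, needle: str) -> int:
    """Substring search under canonical equivalence; -1 if absent."""
    return unicodedata.normalize("NFC", haystack).find(
        unicodedata.normalize("NFC", needle))

# Raw comparison misses the match across forms; canonical comparison finds it.
assert "expose\u0301" != "expos\u00e9"
assert canon_eq("expose\u0301", "expos\u00e9")
assert canon_find("caf\u00e9 au lait", "cafe\u0301") == 0
```

Note that `canon_find` returns an index into the normalized haystack, not the original; mapping positions back to the un-normalized string is exactly the extra bookkeeping the posts above are wrestling with.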
On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir gmail.com> wrote:Not in my knowledge. But I rarely deal with non-latin texts, there's probably some scripts out there that take advantage of this.The point is not playing like that with Unicode flexibility. Rather that composite characters are just normal thingies in most languages of the world. Actually, on this point, english is a rare exception (discarding letters imported from foreign languages like french 'à'); to the point of beeing, I guess, the only western language without any diacritic.Is it common to have multiple modifiers on a single character?The problem I see with using decomposed canonical form for strings is that we would have to return a dchar[] for each 'element', which severely complicates code that, for instance, only expects to handle English.Actually, returning a sliced char[] or wchar[] could also be valid. User-perceived characters are basically a substring of one or more code points. I'm not sure it complicates that much the semantics of the language -- what's complicated about writing str.front == "a" instead of str.front == 'a'? -- although it probably would complicate the generated code and make it a little slower. In the case of NSString in Cocoa, you can only access the 'characters' in their UTF-16 form. But everything from comparison to search for substring is done using graphemes. It's like they implemented specialized Unicode-aware algorithms for these functions. There's no genericness about how it handles graphemes. I'm not sure yet about what would be the right approach for D.I was hoping to lazily transform a string into its composed canonical form, allowing the (hopefully rare) exception when a composed character does not exist. 
My thinking was that this at least gives a useful string representation for 90% of usages, leaving the remaining 10% of usages to find a more complex representation (like your Text type). If we only get like 20% or 30% there by making dchar the element type, then we haven't made it useful enough. Either way, we need a string type that can be compared canonically for things like searches or opEquals.I wonder if normalized string comparison shouldn't be built directly in the char[] wchar[] and dchar[] types instead. Also bring the idea above that iterating on a string would yield graphemes as char[] and this code would work perfectly irrespective of whether you used combining characters: foreach (grapheme; "exposé") { if (grapheme == "é") break; } I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things however. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 14 2011
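Michel's point above -- that == on strings should perform a normalized comparison so that precomposed and combining-mark spellings of "exposé" match -- can be illustrated outside D. A minimal sketch using Python's stdlib unicodedata module (chosen purely as an illustration, since the proposed D string type does not exist):

```python
import unicodedata

composed   = "expos\u00e9"    # 'é' as the single precomposed code point U+00E9
decomposed = "expose\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# A raw code-unit comparison sees two different sequences:
assert composed != decomposed

# After canonical normalization (NFC here), they compare equal:
assert unicodedata.normalize("NFC", composed) == \
       unicodedata.normalize("NFC", decomposed)
```

This is exactly the gap noted later in the thread: comparing the underlying arrays directly reports equivalent strings as unequal unless a normalization step is applied somewhere.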
On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin <michel.fortin michelf.com> wrote:On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:Hm... this pushes the normalization outside the type, and into the algorithms (such as find). I was hoping to avoid that. I think I can come up with an algorithm that normalizes into canonical form as it iterates. It just might return part of a grapheme if the grapheme cannot be composed. I do think that we could make a byGrapheme member to aid in this: foreach(grapheme; s.byGrapheme) // grapheme is a substring that contains one composed grapheme.On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir gmail.com> wrote:Not in my knowledge. But I rarely deal with non-latin texts, there's probably some scripts out there that takes advantage of this.The point is not playing like that with Unicode flexibility. Rather that composite characters are just normal thingies in most languages of the world. Actually, on this point, english is a rare exception (discarding letters imported from foreign languages like french 'à'); to the point of beeing, I guess, the only western language without any diacritic.Is it common to have multiple modifiers on a single character?The problem I see with using decomposed canonical form for strings is that we would have to return a dchar[] for each 'element', which severely complicates code that, for instance, only expects to handle English.Actually, returning a sliced char[] or wchar[] could also be valid. User-perceived characters are basically a substring of one or more code points. I'm not sure it complicates that much the semantics of the language -- what's complicated about writing str.front == "a" instead of str.front == 'a'? -- although it probably would complicate the generated code and make it a little slower.In the case of NSString in Cocoa, you can only access the 'characters' in their UTF-16 form. 
But everything from comparison to search for substring is done using graphemes. It's like they implemented specialized Unicode-aware algorithms for these functions. There's no genericness about how it handles graphemes. I'm not sure yet about what would be the right approach for D.I hope we can use generic versions, so the type itself handles the conversions. That makes any algorithm using the string range correct.No, in my vision of how strings should be typed, char[] is an array, not a string. It should be treated like an array of code-units, where two forms that create the same grapheme are considered different.I was hoping to lazily transform a string into its composed canonical form, allowing the (hopefully rare) exception when a composed character does not exist. My thinking was that this at least gives a useful string representation for 90% of usages, leaving the remaining 10% of usages to find a more complex representation (like your Text type). If we only get like 20% or 30% there by making dchar the element type, then we haven't made it useful enough. Either way, we need a string type that can be compared canonically for things like searches or opEquals.I wonder if normalized string comparison shouldn't be built directly in the char[] wchar[] and dchar[] types instead.Also bring the idea above that iterating on a string would yield graphemes as char[] and this code would work perfectly irrespective of whether you used combining characters: foreach (grapheme; "exposé") { if (grapheme == "é") break; } I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do thing however.I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. 
I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -Steve
Jan 15 2011
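The byGrapheme slicing proposed above can be approximated by grouping each base code point with its trailing combining marks. A simplified sketch in Python (full grapheme cluster segmentation per Unicode UAX #29 handles more cases than this, e.g. Hangul jamo and CRLF):

```python
import unicodedata

def graphemes(s):
    """Yield clusters of one base character plus any trailing combining
    marks -- a rough approximation of Unicode grapheme cluster slicing."""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster        # a new base character starts a new cluster
            cluster = ch
        else:
            cluster += ch        # combining mark: extend the current cluster
    if cluster:
        yield cluster

# "exposé" written with a combining accent: the final 'e' and U+0301
# come out as a single two-code-point cluster.
assert list(graphemes("expose\u0301")) == ["e", "x", "p", "o", "s", "e\u0301"]
```

Each yielded cluster is a substring of the original, which matches the suggestion that the element type be a slice of the underlying data rather than a dchar.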
Steven Schveighoffer wrote: ...If it's a matter of choosing which is the 'default' range, I'd think proper unicode handling is more reasonable than catering for english / ascii only. Especially since this is already the case in phobos string algorithms.I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things, however.I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -Steve
Jan 15 2011
On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn <lutger.blijdestijn gmail.com> wrote:Steven Schveighoffer wrote: ...English and (if I understand correctly) most other languages. Any language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals). What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary. Essentially, we would have three levels of types: char[], wchar[], dchar[] -- Considered to be arrays in every way. string_t!T (string, wstring, dstring) -- Specialized string types that do normalization to dchars, but do not handle perfectly all graphemes. Works with any algorithm that deals with bidirectional ranges. This is the default string type, and the type for string literals. Represented internally by a single char[], wchar[] or dchar[] array. * utfstring_t!T -- specialized string to deal with full unicode, which may perform worse than string_t, but supports everything unicode supports. May require a battery of specialized algorithms. * - name up for discussion Also note that phobos currently does *no* normalization as far as I can tell for things like opEquals. Two char[]'s that represent equivalent strings, but not in the same way, will compare as !=. -SteveIf its a matter of choosing which is the 'default' range, I'd think proper unicode handling is more reasonable than catering for english / ascii only. 
Especially since this is already the case in phobos string algorithms.I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do thing however.I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -Steve
Jan 15 2011
Steven Schveighoffer Wrote:On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn <lutger.blijdestijn gmail.com> wrote:The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I prefer the standard lib to not provide normalization at all and force me to use a 3rd party lib rather than provide an incomplete implementation that will give me a false sense of correctness and cause very subtle and hard to find bugs. Moreover, even if you ignore Hebrew as a tiny insignificant minority you cannot do the same for Arabic which has over one *billion* people that use that language. I firmly believe that in accordance with D's principle that the default behavior should be the correct & safe option, D should have the full unicode type (utfstring_t above) as the default. You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Use an ASCII type instead. In the same vein, a geneticist should use a DNA sequence type and not Unicode text.Steven Schveighoffer wrote: ...English and (if I understand correctly) most other languages. Any language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals). What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary. Essentially, we would have three levels of types: char[], wchar[], dchar[] -- Considered to be arrays in every way. 
string_t!T (string, wstring, dstring) -- Specialized string types that do normalization to dchars, but do not handle perfectly all graphemes. Works with any algorithm that deals with bidirectional ranges. This is the default string type, and the type for string literals. Represented internally by a single char[], wchar[] or dchar[] array. * utfstring_t!T -- specialized string to deal with full unicode, which may perform worse than string_t, but supports everything unicode supports. May require a battery of specialized algorithms. * - name up for discussion Also note that phobos currently does *no* normalization as far as I can tell for things like opEquals. Two char[]'s that represent equivalent strings, but not in the same way, will compare as !=. -SteveIf its a matter of choosing which is the 'default' range, I'd think proper unicode handling is more reasonable than catering for english / ascii only. Especially since this is already the case in phobos string algorithms.I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do thing however.I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -Steve
Jan 15 2011
On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo bar.com> wrote:Steven Schveighoffer Wrote:I feel like you might be exaggerating, but maybe I'm completely wrong on this, I'm not well-versed in unicode, or even languages that require unicode. The clear benefit I see is that with a string type which normalizes to canonical code points, you can use this in any algorithm without having it be unicode-aware for *most languages*. At least, that is how I see it. I'm looking at it as a code-reuse proposition. It's like calendars. There are quite a few different calendars in different cultures. But most people use a Gregorian calendar. So we have three options: a) Use a Gregorian calendar, and leave the other calendars to a 3rd party library b) Use a complicated calendar system where Gregorian calendars are treated with equal respect to all other calendars, none are the default. c) Use a Gregorian calendar by default, but include the other calendars as a separate module for those who wish to use them. I'm looking at my proposal as more of a c) solution. Can you show how normalization causes subtle bugs?English and (if I understand correctly) most other languages. Any language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals). What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary. Essentially, we would have three levels of types: char[], wchar[], dchar[] -- Considered to be arrays in every way. 
string_t!T (string, wstring, dstring) -- Specialized string types that do normalization to dchars, but do not handle perfectly all graphemes. Works with any algorithm that deals with bidirectional ranges. This is the default string type, and the type for string literals. Represented internally by a single char[], wchar[] or dchar[] array. * utfstring_t!T -- specialized string to deal with full unicode, which may perform worse than string_t, but supports everything unicode supports. May require a battery of specialized algorithms. * - name up for discussion Also note that phobos currently does *no* normalization as far as I can tell for things like opEquals. Two char[]'s that represent equivalent strings, but not in the same way, will compare as !=. -SteveThe above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I prefer the standard lib to not provide normalization at all and force me to use a 3rd party lib rather than provide an incomplete implementation that will give me a false sense of correctness and cause very subtle and hard to find bugs.More over, Even if you ignore Hebrew as a tiny insignificant minority you cannot do the same for Arabic which has over one *billion* people that use that language.I hope that the medium type works 'good enough' for those languages, with the high level type needed for advanced usages. At a minimum, comparison and substring should work for all languages.I firmly believe that in accordance with D's principle that the default behavior should be the correct & safe option, D should have the full unicode type (utfstring_t above) as the default. You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Use an ASCII type instead. In the same vain, a geneticist should use a DNA sequence type and not Unicode text.Or French, or Spanish, or German, etc... 
Look, even the lowest level is valid unicode, but if you want to start extracting individual graphemes, you need more machinery. In 99% of cases, I'd think you want to use strings as strings, not as sequences of graphemes, or code-units. -Steve
Jan 15 2011
On Sat, 15 Jan 2011 14:51:47 -0500, Steven Schveighoffer <schveiguy yahoo.com> wrote:I feel like you might be exaggerating, but maybe I'm completely wrong on this, I'm not well-versed in unicode, or even languages that require unicode. The clear benefit I see is that with a string type which normalizes to canonical code points, you can use this in any algorithm without having it be unicode-aware for *most languages*. At least, that is how I see it. I'm looking at it as a code-reuse proposition. It's like calendars. There are quite a few different calendars in different cultures. But most people use a Gregorian calendar. So we have three options: a) Use a Gregorian calendar, and leave the other calendars to a 3rd party library b) Use a complicated calendar system where Gregorian calendars are treated with equal respect to all other calendars, none are the default. c) Use a Gregorian calendar by default, but include the other calendars as a separate module for those who wish to use them. I'm looking at my proposal as more of a c) solution. Can you show how normalization causes subtle bugs?I see from Michel's post how normalization automatically can be bad. I also see that it can be wasteful. So I've shifted my position. Now I agree that we need a full unicode-compliant string type as the default. See my reply to Michel for more info on my revised proposal. -Steve
Jan 15 2011
Steven Schveighoffer Wrote:On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo bar.com> wrote:The calendar example is a very good one. What you're saying equivalent to saying is that most people use Gregorian but for efficiency and other reasons you want to not implement feb 29th.Steven Schveighoffer Wrote:I feel like you might be exaggerating, but maybe I'm completely wrong on this, I'm not well-versed in unicode, or even languages that require unicode. The clear benefit I see is that with a string type which normalizes to canonical code points, you can use this in any algorithm without having it be unicode-aware for *most languages*. At least, that is how I see it. I'm looking at it as a code-reuse proposition. It's like calendars. There are quite a few different calendars in different cultures. But most people use a Gregorian calendar. So we have three options: a) Use a Gregorian calendar, and leave the other calendars to a 3rd party library b) Use a complicated calendar system where Gregorian calendars are treated with equal respect to all other calendars, none are the default. c) Use a Gregorian calendar by default, but include the other calendars as a separate module for those who wish to use them. I'm looking at my proposal as more of a c) solution.English and (if I understand correctly) most other languages. Any language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals). What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary. 
Essentially, we would have three levels of types: char[], wchar[], dchar[] -- Considered to be arrays in every way. string_t!T (string, wstring, dstring) -- Specialized string types that do normalization to dchars, but do not handle perfectly all graphemes. Works with any algorithm that deals with bidirectional ranges. This is the default string type, and the type for string literals. Represented internally by a single char[], wchar[] or dchar[] array. * utfstring_t!T -- specialized string to deal with full unicode, which may perform worse than string_t, but supports everything unicode supports. May require a battery of specialized algorithms. * - name up for discussion Also note that phobos currently does *no* normalization as far as I can tell for things like opEquals. Two char[]'s that represent equivalent strings, but not in the same way, will compare as !=. -SteveThe above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I prefer the standard lib to not provide normalization at all and force me to use a 3rd party lib rather than provide an incomplete implementation that will give me a false sense of correctness and cause very subtle and hard to find bugs.Can you show how normalization causes subtle bugs?That was already shown by Michel and Spir where the equality operator is incorrect due to diacritics (the example with exposé). Your solution makes this far worse since it will reduce the bug to far fewer cases, making the problem far less obvious. One would test with exposé, which will work, and another test (let's say in Hebrew) and that will *not* work, and unless the programmer is a Unicode expert (which is very unlikely) the programmer is left scratching his head.As I explained above, 'good enough' in this case is far worse because it masks the problem. 
Also, If you want comparison to work in all languages including Hebrew/Arabic than it simply isn't good enough.More over, Even if you ignore Hebrew as a tiny insignificant minority you cannot do the same for Arabic which has over one *billion* people that use that language.I hope that the medium type works 'good enough' for those languages, with the high level type needed for advanced usages. At a minimum, comparison and substring should work for all languages.I'd like to have full Unicode support. I think it is a good thing for D to have in order to expand in the world. As an alternative, I'd settle for loud errors that make absolutely clear to the non-Unicode expert programmer that D simply does NOT support e.g. Normalization. As Spir already said, Unicode is something few understand and even it's own official docs do not explain such issues properly. We should not confuse users even further with incomplete support.I firmly believe that in accordance with D's principle that the default behavior should be the correct & safe option, D should have the full unicode type (utfstring_t above) as the default. You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Use an ASCII type instead. In the same vain, a geneticist should use a DNA sequence type and not Unicode text.Or French, or Spanish, or German, etc... Look, even the lowest level is valid unicode, but if you want to start extracting individual graphemes, you need more machinery. In 99% of cases, I'd think you want to use strings as strings, not as sequences of graphemes, or code-units. -Steve
Jan 15 2011
On Sat, 15 Jan 2011 15:46:11 -0500, foobar <foo bar.com> wrote:I'd like to have full Unicode support. I think it is a good thing for D to have in order to expand in the world. As an alternative, I'd settle for loud errors that make absolutely clear to the non-Unicode expert programmer that D simply does NOT support e.g. Normalization. As Spir already said, Unicode is something few understand and even it's own official docs do not explain such issues properly. We should not confuse users even further with incomplete support.Well said, I've changed my mind. Thanks for explaining. -Steve
Jan 15 2011
On 01/15/2011 09:46 PM, foobar wrote:I'd like to have full Unicode support. I think it is a good thing for D to have in order to expand in the world. As an alternative, I'd settle for loud errors that make absolutely clear to the non-Unicode expert programmer that D simply does NOT support e.g. Normalization. As Spir already said, Unicode is something few understand and even it's own official docs do not explain such issues properly. We should not confuse users even further with incomplete support.In a few days, D will have an external library able to deal with those issues, hopefully correctly and clearly for client programmers. Possibly, its design is not the best possible approach (esp for efficiency: Michel let me doubt about that, and my competence in this field is close to nothing). But it has the merit to exist and provide a clear example of the correct semantics. Let us use it as a base for experimentation. Then, everything can be redesigned from scratch if we realise I was initially completely wrong. In any case, it would certainly be a far easier and fast job to do now, after having explored the issues at length, and with a reference implementation at hand. Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:Hello Steven, How does an application know that a given text, which supposedly is written in a given natural language (as for instance indicated by an html header) does not also hold terms from other languages? There are various occasions for this: quotations, use of foreign words, pointers... A side-issue is raised by precomposed codes for composite characters. For most languages of the world, I guess (but unsure), all "official" characters have single-code representations. Good, but unfortunately this is not enforced by the standard (instead, the decomposed form can sensibly be considered the base form, but this is another topic). So that even if one knows for sure that all characters of all texts an app will ever deal with can be mapped to single codes, to be safe one would have to normalise to NFC anyway (Normalised Form Composed). Then, where is the actual gain? In fact, it is a loss because NFC is more costly than NFD (Decomposed) --actually, the standard NFC algo first decomposes to NFD to initially get a unique representation that can then be more easily (re)composed via simple mappings. For further information: Unicode's normalisation algos: http://unicode.org/reports/tr15/ list of technical reports: http://unicode.org/reports/ (Unicode's technical reports are far more readable than the standard itself, but unfortunately often refer to it.) Denis _________________ vita es estrany spir.wikidot.comMoreover, even if you ignore Hebrew as a tiny insignificant minority you cannot do the same for Arabic which has over one *billion* people that use that language.I hope that the medium type works 'good enough' for those languages, with the high level type needed for advanced usages. At a minimum, comparison and substring should work for all languages.
Jan 17 2011
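The NFC/NFD relationship spir describes -- NFD as the decomposed base form, NFC recomposing from it -- is easy to observe with any Unicode-aware library. A small check, sketched in Python's stdlib unicodedata as an illustration:

```python
import unicodedata

s = "\u00e9"                               # precomposed 'é' (U+00E9)

# NFD is the canonical decomposition: base letter plus combining mark.
nfd = unicodedata.normalize("NFD", s)
assert nfd == "e\u0301"

# NFC recomposes the decomposed form back into the unique precomposed
# representation -- the extra composition pass is the cost spir mentions.
assert unicodedata.normalize("NFC", nfd) == s
```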
On Mon, 17 Jan 2011 10:14:19 -0500, spir <denis.spir gmail.com> wrote:On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:I'll reply to this to save you the trouble. I have reversed my position since writing a lot of these posts. In summary, I think strings should default to an element type of a grapheme, which should be implemented via a slice of the original data. Updated string type forthcoming. -SteveHello Steven, How does an application know that a given text, which supposedly is written in a given natural language (as for instance indicated by an html header) does not also hold terms from other languages? There are various occasions for this: quotations, use of foreign words, pointers... A side-issue is raised by precomposed codes for composite characters. For most languages of the world, I guess (but unsure), all "official" characters have single-code representations. Good, but unfortunately this is not enforced by the standard (instead, the decomposed form can sensibly be considered the base form, but this is another topic). So that even if ones knows for sure that all characters of all texts an app will ever deal with can be mapped to single codes, to be safe one would have to normalise to NFC anyway (Normalised Form Composed). Then, where is the actual gain? In fact, it is a loss because NFC is more costly than NFD (Decomposed) --actually, the standard NFC algo first decomposes to NFD to initially get an unique representation that can then be more easily (re)composed via simple mappings. 
For further information: Unicode's normalisation algos: http://unicode.org/reports/tr15/ list of technical reports: http://unicode.org/reports/ (Unicode's technical reports are far more readible than the standard itself, but unfortunately often refer to it.)More over, Even if you ignore Hebrew as a tiny insignificant minority you cannot do the same for Arabic which has over one *billion* people that use that language.I hope that the medium type works 'good enough' for those languages, with the high level type needed for advanced usages. At a minimum, comparison and substring should work for all languages.
Jan 17 2011
On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn <lutger.blijdestijn gmail.com> wrote:Why don't we build a compiler with an optimizer that generates correct code *almost* all of the time? If you are worried about it not producing correct code for a given function, you can just add "pragma(correct_code)" in front of that function to disable the risky optimizations. No harm done, right? One thing I see very often, often on US web sites but also elsewhere, is that if you enter a name with an accented letter in a form (say Émilie), very often the accented letter gets changed to another semi-random character later in the process. Why? Because somewhere in the process lies an encoding mismatch that no one thought about and no one tested for. At the very least, the form should have rejected those unexpected characters and show an error when it could. Now, with proper Unicode handling up to the code point level, this kind of problem probably won't happen as often because the whole stack works with UTF encodings. But are you going to validate all of your inputs to make sure they have no combining code point? Don't assume that because you're in the United States no one will try to enter characters where you don't expect them. People love to play with Unicode symbols for fun, putting them in their name, signature, or even domain names (✪df.ws). Just wait until they discover they can combine them. ☺̰̎! There is also a variety of combining mathematical symbols with no pre-combined form, such as ≸. Writing in Arabic, Hebrew, Korean, or some other foreign language isn't a prerequisite to use combining characters.Steven Schveighoffer wrote: ...English and (if I understand correctly) most other languages. Any language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals). 
What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary.If its a matter of choosing which is the 'default' range, I'd think proper unicode handling is more reasonable than catering for english / ascii only. Especially since this is already the case in phobos string algorithms.I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do thing however.I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -SteveEssentially, we would have three levels of types: char[], wchar[], dchar[] -- Considered to be arrays in every way. string_t!T (string, wstring, dstring) -- Specialized string types that do normalization to dchars, but do not handle perfectly all graphemes. Works with any algorithm that deals with bidirectional ranges. This is the default string type, and the type for string literals. Represented internally by a single char[], wchar[] or dchar[] array. * utfstring_t!T -- specialized string to deal with full unicode, which may perform worse than string_t, but supports everything unicode supports. May require a battery of specialized algorithms. 
* - name up for discussion Also note that phobos currently does *no* normalization as far as I can tell for things like opEquals. Two char[]'s that represent equivalent strings, but not in the same way, will compare as !=.Basically, you're suggesting that the default way should be to handle Unicode *almost* right. And then, if you want to handle thing *really* right you need to be explicit about it by using "utfstring_t"? I understand your motivation, but it sounds backward to me. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 15 2011
On Sat, 15 Jan 2011 15:31:23 -0500, Michel Fortin <michel.fortin michelf.com> wrote:On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:You make very good points. I concede that using dchar as the element point is not correct for unicode strings. -SteveOn Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn <lutger.blijdestijn gmail.com> wrote:Why don't we build a compiler with an optimizer that generates correct code *almost* all of the time? If you are worried about it not producing correct code for a given function, you can just add "pragma(correct_code)" in front of that function to disable the risky optimizations. No harm done, right? One thing I see very often, often on US web sites but also elsewhere, is that if you enter a name with an accented letter in a form (say Émilie), very often the accented letter gets changed to another semi-random character later in the process. Why? Because somewhere in the process lies an encoding mismatch that no one thought about and no one tested for. At the very least, the form should have rejected those unexpected characters and show an error when it could. Now, with proper Unicode handling up to the code point level, this kind of problem probably won't happen as often because the whole stack works with UTF encodings. But are you going to validate all of your inputs to make sure they have no combining code point? Don't assume that because you're in the United States no one will try to enter characters where you don't expect them. People love to play with Unicode symbols for fun, putting them in their name, signature, or even domain names (✪df.ws). Just wait until they discover they can combine them. ☺̰̎! There is also a variety of combining mathematical symbols with no pre-combined form, such as ≸. 
Writing in Arabic, Hebrew, Korean, or some other foreign language isn't a prerequisite to use combining characters. Steven Schveighoffer wrote: ...English and (if I understand correctly) most other languages. Any language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals). What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary. If it's a matter of choosing which is the 'default' range, I'd think proper unicode handling is more reasonable than catering for english / ascii only. Especially since this is already the case in phobos string algorithms. I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things, however. I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -Steve Essentially, we would have three levels of types: char[], wchar[], dchar[] -- Considered to be arrays in every way. string_t!T (string, wstring, dstring) -- Specialized string types that do normalization to dchars, but do not handle all graphemes perfectly. 
Works with any algorithm that deals with bidirectional ranges. This is the default string type, and the type for string literals. Represented internally by a single char[], wchar[] or dchar[] array. * utfstring_t!T -- specialized string to deal with full unicode, which may perform worse than string_t, but supports everything unicode supports. May require a battery of specialized algorithms. * - name up for discussion. Also note that phobos currently does *no* normalization as far as I can tell for things like opEquals. Two char[]'s that represent equivalent strings, but not in the same way, will compare as !=. Basically, you're suggesting that the default way should be to handle Unicode *almost* right. And then, if you want to handle things *really* right you need to be explicit about it by using "utfstring_t"? I understand your motivation, but it sounds backward to me.
Jan 15 2011
On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin <michel.fortin michelf.com> wrote:Not really. It pushes the normalization to the string comparison operator, as explained later. Actually, returning a sliced char[] or wchar[] could also be valid. User-perceived characters are basically a substring of one or more code points. I'm not sure it complicates that much the semantics of the language -- what's complicated about writing str.front == "a" instead of str.front == 'a'? -- although it probably would complicate the generated code and make it a little slower. Hm... this pushes the normalization outside the type, and into the algorithms (such as find). I was hoping to avoid that. I think I can come up with an algorithm that normalizes into canonical form as it iterates. It just might return part of a grapheme if the grapheme cannot be composed. The problem with normalization while iterating is that you lose information about which code points are part of the grapheme. If you wanted to count the number of graphemes with a particular code point, you've lost that information. Moreover, if all you want is to count the number of graphemes, normalizing the character is a waste of time. I suggested in another post that we implement ranges for decomposing and recomposing on-the-fly a string in its normalized form. That's basically the same thing as you suggest, but it'd have to be explicit to avoid the problem above. Well, I agree there's a need for that sometime. But if what you want is just a dumb array of code units, why not use ubyte[], ushort[] and uint[] instead? It seems to me that the whole point of having a different type for char[], wchar[], and dchar[] is that you know they are Unicode strings and can treat them as such. 
And if you treat them as Unicode strings, then perhaps the runtime and the compiler should too, for consistency's sake. I wonder if normalized string comparison shouldn't be built directly into the char[], wchar[] and dchar[] types instead. No, in my vision of how strings should be typed, char[] is an array, not a string. It should be treated like an array of code-units, where two forms that create the same grapheme are considered different. I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write: foreach (dchar c; "exposé") {} foreach (wchar c; "exposé") {} foreach (char c; "exposé") {} // or foreach (dchar c; "exposé".by!dchar()) {} foreach (wchar c; "exposé".by!wchar()) {} foreach (char c; "exposé".by!char()) {} and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character. Also bring the idea above that iterating on a string would yield graphemes as char[] and this code would work perfectly irrespective of whether you used combining characters: foreach (grapheme; "exposé") { if (grapheme == "é") break; } I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things, however. I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). I think it should be the reverse. If you want your code to break when it encounters multi-code-point graphemes then it's your choice, but you should have to make your choice explicit. The default should be to handle strings correctly. 
-- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 15 2011
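The grapheme-slicing iteration Michel describes above can be sketched roughly as follows. This is an illustration only, not a proposed Phobos API: a "grapheme" here is naively taken to be one base code point plus any immediately following combining marks from the common combining block, whereas real segmentation (Unicode UAX #29) has many more rules, and the `ByGrapheme` name is hypothetical.

```d
import std.utf : decode, stride;

/// Very rough test for combining marks (covers only U+0300..U+036F,
/// the common combining diacritical marks block -- partial coverage!).
bool isCombining(dchar c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/// Naive grapheme-by-grapheme range over a UTF-8 string.
/// front returns a slice of the original string, as discussed above.
struct ByGrapheme
{
    string s;

    @property bool empty() { return s.length == 0; }

    @property string front()
    {
        size_t i = stride(s, 0);   // length of the base code point
        while (i < s.length)
        {
            size_t j = i;
            dchar c = decode(s, j);
            if (!isCombining(c))
                break;             // next base character starts here
            i = j;                 // absorb the combining mark
        }
        return s[0 .. i];
    }

    void popFront() { s = s[front.length .. $]; }
}

void main()
{
    // "étude" with a decomposed é: 'e' followed by U+0301.
    foreach (g; ByGrapheme("e\u0301tude"))
    {
        if (g == "e\u0301") { /* matched the grapheme for é */ }
    }
}
```

Note how front yields `"e\u0301"` as a single element here, which is exactly the behaviour that makes `grapheme == "é"` comparisons workable once comparison is normalized.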
On Sat, 15 Jan 2011 13:32:10 -0500, Michel Fortin <michel.fortin michelf.com> wrote:On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:Are these common requirements? I thought users mostly care about graphemes, not code points. Asking in the dark here, since I have next to zero experience with unicode strings. On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin <michel.fortin michelf.com> wrote:Not really. It pushes the normalization to the string comparison operator, as explained later. Actually, returning a sliced char[] or wchar[] could also be valid. User-perceived characters are basically a substring of one or more code points. I'm not sure it complicates that much the semantics of the language -- what's complicated about writing str.front == "a" instead of str.front == 'a'? -- although it probably would complicate the generated code and make it a little slower. Hm... this pushes the normalization outside the type, and into the algorithms (such as find). I was hoping to avoid that. I think I can come up with an algorithm that normalizes into canonical form as it iterates. It just might return part of a grapheme if the grapheme cannot be composed. The problem with normalization while iterating is that you lose information about which code points are part of the grapheme. If you wanted to count the number of graphemes with a particular code point, you've lost that information. Moreover, if all you want is to count the number of graphemes, normalizing the character is a waste of time. This is true. I can see this being a common need. I suggested in another post that we implement ranges for decomposing and recomposing on-the-fly a string in its normalized form. That's basically the same thing as you suggest, but it'd have to be explicit to avoid the problem above. OK, I see your point. Because ubyte[], ushort[] and uint[] do not say that their data is Unicode text. 
The point is, I want to write a function that takes UTF-8; ubyte[] opens it up to any data, not just UTF-8 data. But if we have a method of iterating code-units as you specify below, then I think we are OK. Well, I agree there's a need for that sometime. But if what you want is just a dumb array of code units, why not use ubyte[], ushort[] and uint[] instead? I wonder if normalized string comparison shouldn't be built directly into the char[], wchar[] and dchar[] types instead. No, in my vision of how strings should be typed, char[] is an array, not a string. It should be treated like an array of code-units, where two forms that create the same grapheme are considered different. It seems to me that the whole point of having a different type for char[], wchar[], and dchar[] is that you know they are Unicode strings and can treat them as such. And if you treat them as Unicode strings, then perhaps the runtime and the compiler should too, for consistency's sake. I'd agree with you, but then there's that pesky [] after it indicating it's an array. For consistency's sake, I'd say the compiler should treat T[] as an array of T's. I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar. What if I modified my proposed string_t type to return T[] as its element type, as you say, and string literals are typed as string_t!(whatever)? In addition, the restrictions I imposed on slicing a code point actually get imposed on slicing a grapheme. That is, it is illegal to substring a string_t in a way that slices through a grapheme (and by deduction, a code point)? 
Actually, we would need a grapheme to be its own type, because comparing two char[]'s that don't contain equivalent bits and having them be equal violates the expectation that char[] is an array. So the string_t!char would return a grapheme_t!char (names to be discussed) as its element type. I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write: foreach (dchar c; "exposé") {} foreach (wchar c; "exposé") {} foreach (char c; "exposé") {} // or foreach (dchar c; "exposé".by!dchar()) {} foreach (wchar c; "exposé".by!wchar()) {} foreach (char c; "exposé".by!char()) {} and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character. Also bring the idea above that iterating on a string would yield graphemes as char[] and this code would work perfectly irrespective of whether you used combining characters: foreach (grapheme; "exposé") { if (grapheme == "é") break; } I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things, however. I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. You are probably right. -Steve I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). I think it should be the reverse. If you want your code to break when it encounters multi-code-point graphemes then it's your choice, but you should have to make your choice explicit. The default should be to handle strings correctly.
Jan 15 2011
On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:I'm glad we agree on that now. I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write: foreach (dchar c; "exposé") {} foreach (wchar c; "exposé") {} foreach (char c; "exposé") {} // or foreach (dchar c; "exposé".by!dchar()) {} foreach (wchar c; "exposé".by!wchar()) {} foreach (char c; "exposé".by!char()) {} and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character. I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar. What if I modified my proposed string_t type to return T[] as its element type, as you say, and string literals are typed as string_t!(whatever)? In addition, the restrictions I imposed on slicing a code point actually get imposed on slicing a grapheme. That is, it is illegal to substring a string_t in a way that slices through a grapheme (and by deduction, a code point)? I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string, however. Some other raw comments: I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?). 
If strings and arrays of code units are distinct, slicing in the middle of a grapheme or in the middle of a code point could throw an error, but for performance reasons it should probably check for that only when array bounds checking is turned on (that would require compiler support however).Actually, we would need a grapheme to be its own type, because comparing two char[]'s that don't contain equivalent bits and having them be equal, violates the expectation that char[] is an array. So the string_t!char would return a grapheme_t!char (names to be discussed) as its element type.Or you could make a grapheme a string_t. ;-) -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 15 2011
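The checked-slicing idea above (reject slices that cut through a multi-code-unit sequence, skipping the check when bounds checking is off) might look roughly like this. `UniSlice` and `onBoundary` are hypothetical names for illustration; the check is shown for UTF-8 code points only, since grapheme boundaries would need a similar but costlier test:

```d
/// Sketch of a string wrapper whose slicing rejects cuts through the
/// middle of a code point. D_NoBoundsChecks is the predefined version
/// identifier set when array bounds checking is disabled.
struct UniSlice
{
    string data;

    string opSlice(size_t lo, size_t hi)
    {
        version (D_NoBoundsChecks) { /* skip validation with bounds checks off */ }
        else
        {
            assert(onBoundary(lo) && onBoundary(hi),
                   "slice cuts through the middle of a code point");
        }
        return data[lo .. hi];
    }

    private bool onBoundary(size_t i)
    {
        // UTF-8 continuation bytes have the bit pattern 0b10xxxxxx,
        // so any other byte (or end-of-string) marks a boundary.
        return i == data.length || (data[i] & 0xC0) != 0x80;
    }
}
```

Using assert rather than throwing RangeError matches the point made below: both abort the program, and the validation disappears in release builds.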
On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin <michel.fortin michelf.com> wrote:On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper). I once told a colleague who was on a standards committee that their proposed KLV standard (key length value) was ridiculous. The wise committee had decided that in order to avoid future issues, the length would be encoded as a single byte if < 128, or 128 + length of the length field for anything higher. This means you could potentially have to parse and process a 127-byte integer!I'm glad we agree on that now.I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write: foreach (dchar c; "exposé") {} foreach (wchar c; "exposé") {} foreach (char c; "exposé") {} // or foreach (dchar c; "exposé".by!dchar()) {} foreach (wchar c; "exposé".by!wchar()) {} foreach (char c; "exposé".by!char()) {} and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. 
Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility. However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.What if I modified my proposed string_t type to return T[] as its element type, as you say, and string literals are typed as string_t!(whatever)? In addition, the restrictions I imposed on slicing a code point actually get imposed on slicing a grapheme. That is, it is illegal to substring a string_t in a way that slices through a grapheme (and by deduction, a code point)?I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments: I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).If strings and arrays of code units are distinct, slicing in the middle of a grapheme or in the middle of a code point could throw an error, but for performance reasons it should probably check for that only when array bounds checking is turned on (that would require compiler support however).Not really, it could use assert, but that throws an assert error instead of a RangeError. Of course, both are errors and will abort the program. I do wish there was a version(noboundscheck) to do this kind of stuff with...I'm a little uneasy having a range return itself as its element type. 
For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t. It does seem daunting to have so many types, but at the same time, types convey relationships at compile time that can make coding impossible to get wrong, or make things actually possible when having a single type doesn't. I'll give you an example from a previous life: Tango had a type called DateTime. This type represented *either* a point in time, or a span of time (depending on how you used it). But I proposed we switch to two distinct types, one for a point in time, one for a span of time. It was argued that both were so similar, why couldn't we just keep one type? The answer is simple -- having them be separate types allows me to express relationships that the compiler enforces. For example, you can add two time spans together, but you can't add two points in time together. Or maybe you want a function to accept a time span (like a sleep operation). If there was only one type, then sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;) I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality. Catching bugs during compilation is soooo much better than experiencing them during runtime. -SteveActually, we would need a grapheme to be its own type, because comparing two char[]'s that don't contain equivalent bits and having them be equal, violates the expectation that char[] is an array. So the string_t!char would return a grapheme_t!char (names to be discussed) as its element type.Or you could make a grapheme a string_t. ;-)
Jan 15 2011
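The KLV length encoding Steve describes (a single byte below 128 is the length itself; otherwise the low seven bits count the following length bytes) is BER-style and can be sketched as below. `decodeKlvLength` is a hypothetical name for illustration, and a real parser would validate input and reject the theoretically legal 127-byte case instead of asserting:

```d
/// Decode a BER-style KLV length field, as described above.
/// Returns the length; `consumed` receives how many bytes it occupied.
ulong decodeKlvLength(const(ubyte)[] data, out size_t consumed)
{
    immutable first = data[0];
    if (first < 0x80)
    {
        consumed = 1;
        return first;              // short form: one byte, length < 128
    }
    immutable n = first & 0x7F;    // long form: n length bytes follow
    assert(n > 0 && n <= ulong.sizeof, "unsupported length-of-length");
    ulong len = 0;
    foreach (b; data[1 .. 1 + n])
        len = (len << 8) | b;      // big-endian accumulation
    consumed = 1 + n;
    return len;
}

unittest
{
    size_t used;
    ubyte[] shortForm = [0x05];
    ubyte[] longForm  = [0x82, 0x01, 0x00];
    assert(decodeKlvLength(shortForm, used) == 5 && used == 1);
    assert(decodeKlvLength(longForm, used) == 256 && used == 3);
}
```

The assert marks exactly the absurdity being complained about: the format permits length fields far larger than any integer type a decoder would realistically use.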
Steven Schveighoffer Wrote:On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin <michel.fortin michelf.com> wrote:I like Michel's proposed semantics and I also agree with you that it should be a distinct string type and not break consistency of regular arrays. Regarding your last point: Do you mean that a grapheme would be a sub-type of string? (a specialization where the string represents a single element)? If so, then it sounds good to me. On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee-defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper). I once told a colleague who was on a standards committee that their proposed KLV standard (key length value) was ridiculous. The wise committee had decided that in order to avoid future issues, the length would be encoded as a single byte if < 128, or 128 + length of the length field for anything higher. This means you could potentially have to parse and process a 127-byte integer! I'm glad we agree on that now. I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write: foreach (dchar c; "exposé") {} foreach (wchar c; "exposé") {} foreach (char c; "exposé") {} // or foreach (dchar c; "exposé".by!dchar()) {} foreach (wchar c; "exposé".by!wchar()) {} foreach (char c; "exposé".by!char()) {} and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character. I think this is a good idea. 
I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility. However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.What if I modified my proposed string_t type to return T[] as its element type, as you say, and string literals are typed as string_t!(whatever)? In addition, the restrictions I imposed on slicing a code point actually get imposed on slicing a grapheme. That is, it is illegal to substring a string_t in a way that slices through a grapheme (and by deduction, a code point)?I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments: I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).If strings and arrays of code units are distinct, slicing in the middle of a grapheme or in the middle of a code point could throw an error, but for performance reasons it should probably check for that only when array bounds checking is turned on (that would require compiler support however).Not really, it could use assert, but that throws an assert error instead of a RangeError. 
Of course, both are errors and will abort the program. I do wish there was a version(noboundscheck) to do this kind of stuff with...I'm a little uneasy having a range return itself as its element type. For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t. It does seem daunting to have so many types, but at the same time, types convey relationships at compile time that can make coding impossible to get wrong, or make things actually possible when having a single type doesn't. I'll give you an example from a previous life: Tango had a type called DateTime. This type represented *either* a point in time, or a span of time (depending on how you used it). But I proposed we switch to two distinct types, one for a point in time, one for a span of time. It was argued that both were so similar, why couldn't we just keep one type? The answer is simple -- having them be separate types allows me to express relationships that the compiler enforces. For example, you can add two time spans together, but you can't add two points in time together. Or maybe you want a function to accept a time span (like a sleep operation). If there was only one type, then sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;) I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality. Catching bugs during compilation is soooo much better than experiencing them during runtime. -SteveActually, we would need a grapheme to be its own type, because comparing two char[]'s that don't contain equivalent bits and having them be equal, violates the expectation that char[] is an array. So the string_t!char would return a grapheme_t!char (names to be discussed) as its element type.Or you could make a grapheme a string_t. ;-)
Jan 15 2011
On Sat, 15 Jan 2011 17:19:48 -0500, foobar <foo bar.com> wrote:I like Michel's proposed semantics and I also agree with you that it should be a distinct string type and not break consistency of regular arrays. Regarding your last point: Do you mean that a grapheme would be a sub-type of string? (a specialization where the string represents a single element)? If so, then it sounds good to me. A grapheme would be its own specialized type. I'd probably remove the range primitives to really differentiate it. Unfortunately, due to the inability to statically check this, the invariant would have to be a runtime check. Most likely this check would be disabled in release mode. This can cause problems, and I can see why it is attractive to use strings to implement graphemes, but that also has its problems. With grapheme being its own type, we are providing a way to optimize functions, and allow further restrictions on function parameters. At the end of the day, perhaps grapheme *should* just be a string. We'll have to see how this breaks in practice, either way. -Steve
Jan 17 2011
On Monday 17 January 2011 04:08:08 Steven Schveighoffer wrote:On Sat, 15 Jan 2011 17:19:48 -0500, foobar <foo bar.com> wrote:I think that it would make good sense for a grapheme to be a struct which holds a string, as Andrei suggested: struct Grapheme(Char) if (isSomeChar!Char) { private const Char[] rep; ... } I really think that trying to use strings to represent graphemes is asking for it. The element of a range should be a different type than that of the range itself. - Jonathan M Davis I like Michel's proposed semantics and I also agree with you that it should be a distinct string type and not break consistency of regular arrays. Regarding your last point: Do you mean that a grapheme would be a sub-type of string? (a specialization where the string represents a single element)? If so, then it sounds good to me. A grapheme would be its own specialized type. I'd probably remove the range primitives to really differentiate it. Unfortunately, due to the inability to statically check this, the invariant would have to be a runtime check. Most likely this check would be disabled in release mode. This can cause problems, and I can see why it is attractive to use strings to implement graphemes, but that also has its problems. With grapheme being its own type, we are providing a way to optimize functions, and allow further restrictions on function parameters. At the end of the day, perhaps grapheme *should* just be a string. We'll have to see how this breaks in practice, either way.
Jan 17 2011
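A sketch fleshing out the struct from the post above: everything beyond `rep` is a hypothetical addition (a read-only accessor and normalized equality), and `normalize` is deliberately a stub, since a real version would apply Unicode canonical normalization (NFC/NFD):

```d
import std.traits : isSomeChar;

/// Grapheme as proposed above: a struct wrapping a slice of code units,
/// so its type differs from the string range that yields it.
struct Grapheme(Char) if (isSomeChar!Char)
{
    private const Char[] rep;

    this(const(Char)[] slice) { rep = slice; }

    /// The underlying code units, read-only.
    @property const(Char)[] str() const { return rep; }

    /// Equality compares normalized forms, so that a precomposed
    /// character and its decomposed equivalent compare equal.
    bool opEquals(const Grapheme other) const
    {
        return normalize(rep) == normalize(other.rep);
    }
}

/// Placeholder for canonical normalization (identity in this sketch).
private const(C)[] normalize(C)(const(C)[] s) { return s; }
```

Keeping the representation private also gives the type somewhere to hang the runtime invariant discussed earlier (that `rep` holds exactly one grapheme).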
On 1/17/11 6:25 AM, Jonathan M Davis wrote:On Monday 17 January 2011 04:08:08 Steven Schveighoffer wrote:If someone makes a careful submission of a Grapheme to Phobos as described above, it has a high chance of being accepted. Andrei On Sat, 15 Jan 2011 17:19:48 -0500, foobar<foo bar.com> wrote:I think that it would make good sense for a grapheme to be a struct which holds a string, as Andrei suggested: struct Grapheme(Char) if (isSomeChar!Char) { private const Char[] rep; ... } I really think that trying to use strings to represent graphemes is asking for it. The element of a range should be a different type than that of the range itself. - Jonathan M Davis I like Michel's proposed semantics and I also agree with you that it should be a distinct string type and not break consistency of regular arrays. Regarding your last point: Do you mean that a grapheme would be a sub-type of string? (a specialization where the string represents a single element)? If so, then it sounds good to me. A grapheme would be its own specialized type. I'd probably remove the range primitives to really differentiate it. Unfortunately, due to the inability to statically check this, the invariant would have to be a runtime check. Most likely this check would be disabled in release mode. This can cause problems, and I can see why it is attractive to use strings to implement graphemes, but that also has its problems. With grapheme being its own type, we are providing a way to optimize functions, and allow further restrictions on function parameters. At the end of the day, perhaps grapheme *should* just be a string. We'll have to see how this breaks in practice, either way.
Jan 17 2011
On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything; I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now, can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support. That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.I'm glad we agree on that now.I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write: foreach (dchar c; "exposé") {} foreach (wchar c; "exposé") {} foreach (char c; "exposé") {} // or foreach (dchar c; "exposé".by!dchar()) {} foreach (wchar c; "exposé".by!wchar()) {} foreach (char c; "exposé".by!char()) {} and it'll work. 
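The by!char / by!wchar / by!dchar distinction above means the same text has three different element counts. A quick illustration (sketched in Python rather than D, purely because the counts are language-independent; the variable names are mine):

```python
# "exposé" with a precomposed é (U+00E9): the same text has three
# different lengths depending on the unit you walk it by.
s = "expos\u00e9"

utf8_units = len(s.encode("utf-8"))            # what char[] iteration sees
utf16_units = len(s.encode("utf-16-le")) // 2  # what wchar[] iteration sees
code_points = len(s)                           # what dchar iteration sees

print(utf8_units, utf16_units, code_points)    # 7 6 6
```

Note that even the dchar count stops matching the user-perceived character count once the é is stored decomposed, which is what motivates the grapheme default discussed here.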
But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.Indeed, the change would probably be too radical for D2. I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works. Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the ground that it would silently break D1 compatibility. This is a valid point in my opinion. I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point. That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do. One more thing: NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. 
all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operation. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments: I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility. However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.I can understand the utility of a separate type in your DateTime example, but in this case I fail to see any advantage. I mean, a grapheme is a slice of a string, can have multiple code points (like a string), can be appended the same way as a string, can be composed or decomposed using canonical normalization or compatibility normalization (like a string), and should be sorted, uppercased, and lowercased according to Unicode rules (like a string). Basically, a grapheme is just a string that happens to contain only one grapheme. What would a custom type do differently than a string? Also, grapheme == "a" is easy to understand because both are strings. But if a grapheme is a separate type, what would a grapheme literal look like? 
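The claim above that a normalized grapheme can occupy more than one dchar is easy to demonstrate with the two Unicode spellings of é (a Python sketch; the behavior is defined by the Unicode normalization forms, not by any particular language):

```python
import unicodedata

precomposed = "\u00e9"                                   # é as one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # 'e' + combining U+0301

print(len(precomposed), len(decomposed))  # 1 2: one grapheme, 1 vs 2 code points
print(precomposed == decomposed)          # False: code-point comparison fails
# Normalizing both sides to a common form restores equality.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

This is exactly the case where by-dchar iteration silently gives the wrong answer while a grapheme-slice default would not.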
So in the end I don't think a grapheme needs a specific type, at least not for general purpose text processing. If I split a string on whitespace, do I get a range where elements are of type "word"? No, just sliced strings. That said, I'm much less concerned by the type used to represent a grapheme than by the Unicode correctness. I'm not opposed to a separate type, I just don't really see the point. -- Michel Fortin michel.fortin michelf.com http://michelf.com/Or you could make a grapheme a string_t. ;-)I'm a little uneasy having a range return itself as its element type. For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t. It does seem daunting to have so many types, but at the same time, types convey relationships at compile time that can make coding impossible to get wrong, or make things actually possible when having a single type doesn't. I'll give you an example from a previous life: [...] I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality. Catching bugs during compilation is soooo much better than experiencing them during runtime.
Jan 15 2011
On 1/15/11 4:45 PM, Michel Fortin wrote:On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base. It may be more realistic to consider using what we have as back-end for grapheme-oriented processing. For example: struct Grapheme(Char) if (isSomeChar!Char) { private const Char[] rep; ... } auto byGrapheme(S)(S s) if (isSomeString!S) { ... } string s = "Hello"; foreach (g; byGrapheme(s)) { ... } AndreiOn Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin <michel.fortin michelf.com> wrote:Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now, can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support. That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything; I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. 
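The byGrapheme range Andrei sketches would segment on the fly by attaching each combining mark to the base character before it. A naive version of that clustering (in Python, with unicodedata.combining standing in for the Unicode property lookup; real segmentation follows UAX #29 and handles many more cases, so this is only a sketch of the idea):

```python
import unicodedata

def by_grapheme(s):
    # Naive clustering: a combining mark (nonzero canonical combining
    # class) joins the cluster started by the preceding base character.
    cluster = ""
    for cp in s:
        if cluster and unicodedata.combining(cp):
            cluster += cp
        else:
            if cluster:
                yield cluster
            cluster = cp
    if cluster:
        yield cluster

# "noël" with a decomposed ë (e + combining diaeresis U+0308):
# 5 code points, but 4 grapheme clusters.
print(list(by_grapheme("noe\u0308l")))  # ['n', 'o', 'e\u0308', 'l']
```

Note each yielded element is a slice of the input, matching the "Grapheme wraps a Char[] slice" design rather than a single code point.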
It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).I'm glad we agree on that now.I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write: foreach (dchar c; "exposé") {} foreach (wchar c; "exposé") {} foreach (char c; "exposé") {} // or foreach (dchar c; "exposé".by!dchar()) {} foreach (wchar c; "exposé".by!wchar()) {} foreach (char c; "exposé".by!char()) {} and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.Indeed, the change would probably be too radical for D2. I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works. Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the ground that it would silently break D1 compatibility. This is a valid point in my opinion. 
I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point. That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do. One more thing: NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operation. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments: I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility. However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.
Jan 15 2011
On Saturday 15 January 2011 15:59:27 Andrei Alexandrescu wrote:On 1/15/11 4:45 PM, Michel Fortin wrote:On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin <michel.fortin michelf.com> wrote:Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now, can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support. That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything; I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).I'm glad we agree on that now.I'm not suggesting we impose it, just that we make it the default. 
If you want to iterate by dchar, wchar, or char, just write: foreach (dchar c; "exposé") {} foreach (wchar c; "exposé") {} foreach (char c; "exposé") {} // or foreach (dchar c; "exposé".by!dchar()) {} foreach (wchar c; "exposé".by!wchar()) {} foreach (char c; "exposé".by!char()) {} and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.Considering that strings are already dealt with specially in order to have an element of dchar, I wouldn't think that it would be all that disruptive to make it so that they had an element type of Grapheme instead. Wouldn't that then fix all of std.algorithm and the like without really disrupting anything? The issue of foreach remains, but without being willing to change what foreach defaults to, you can't really fix it - though I'd suggest that we at least make it a warning to iterate over strings without specifying the type. And if foreach were made to understand Grapheme like it understands dchar, then you could do foreach(Grapheme g; str) { ... } and have the compiler warn about foreach(g; str) { ... } and tell you to use Grapheme if you want to be comparing actual characters. Regardless, by making strings ranges of Grapheme rather than dchar, I would think that we would solve most of the problem. 
At minimum, we'd have pretty much the same problems that we have right now with char and wchar arrays, but we'd get rid of a whole class of unicode problems. So, nothing would be worse, but some of it would be better. - Jonathan M DavisI'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base. It may be more realistic to consider using what we have as back-end for grapheme-oriented processing. For example: struct Grapheme(Char) if (isSomeChar!Char) { private const Char[] rep; ... } auto byGrapheme(S)(S s) if (isSomeString!S) { ... } string s = "Hello"; foreach (g; byGrapheme(s)) { ... }Indeed, the change would probably be too radical for D2. I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works. Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the ground that it would silently break D1 compatibility. This is a valid point in my opinion. I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point. That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. 
The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do. One more thing: NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operation. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments: I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility. However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.
Jan 15 2011
On 2011-01-15 22:25:47 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:The issue of foreach remains, but without being willing to change what foreach defaults to, you can't really fix it - though I'd suggest that we at least make it a warning to iterate over strings without specifying the type. And if foreach were made to understand Grapheme like it understands dchar, then you could do foreach(Grapheme g; str) { ... } and have the compiler warn about foreach(g; str) { ... } and tell you to use Grapheme if you want to be comparing actual characters.Walter's argument against changing this for foreach was that it'd *silently* break compatibility with existing D1 code. Changing the default to a grapheme makes this argument obsolete: since a grapheme is essentially a string, you can't compare it with char or wchar or dchar directly, so it'll break at compile time with an error and you'll have to decide what to do. So Walter would have to find another argument to defend the status quo. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 15 2011
On 1/15/11 10:47 PM, Michel Fortin wrote:On 2011-01-15 22:25:47 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:I think it's poor abstraction to represent a Grapheme as a string. It should be its own type. AndreiThe issue of foreach remains, but without being willing to change what foreach defaults to, you can't really fix it - though I'd suggest that we at least make it a warning to iterate over strings without specifying the type. And if foreach were made to understand Grapheme like it understands dchar, then you could do foreach(Grapheme g; str) { ... } and have the compiler warn about foreach(g; str) { ... } and tell you to use Grapheme if you want to be comparing actual characters.Walter's argument against changing this for foreach was that it'd *silently* break compatibility with existing D1 code. Changing the default to a grapheme makes this argument obsolete: since a grapheme is essentially a string, you can't compare it with char or wchar or dchar directly, so it'll break at compile time with an error and you'll have to decide what to do. So Walter would have to find another argument to defend the status quo.
Jan 16 2011
On 1/15/11 9:25 PM, Jonathan M Davis wrote:Considering that strings are already dealt with specially in order to have an element of dchar, I wouldn't think that it would be all that distruptive to make it so that they had an element type of Grapheme instead. Wouldn't that then fix all of std.algorithm and the like without really disrupting anything?It would make everything related a lot (a TON) slower, and it would break all client code that uses dchar as the element type, or is otherwise unprepared to use Graphemes explicitly. There is no question there will be disruption. Andrei
Jan 16 2011
And how would 3rd party libraries handle Graphemes? And C modules? I think making these Graphemes the default would make quite a mess, since you would have to convert back and forth between char[] and Grapheme[] all the time (right?).
Jan 16 2011
On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 1/15/11 9:25 PM, Jonathan M Davis wrote:I would have agreed with you last week. Now I understand that using dchar is just as useless for unicode as using char. Will it be slower? Perhaps. A TON slower? Probably not. But it will be correct. Correct and slow is better than incorrect and fast. If I showed you a shortest-path algorithm that ran in O(V) time, but didn't always find the shortest path, would you call it a success? We need to get some real numbers together. I'll see what I can create for a type, but someone else needs to supply the input :) I'm on short supply of unicode data, and any attempts I've made to create some result in failure. I have one example of one composed character in this thread that I can cling to, but in order to supply some real numbers, we need a large amount of data. -SteveConsidering that strings are already dealt with specially in order to have an element of dchar, I wouldn't think that it would be all that distruptive to make it so that they had an element type of Grapheme instead. Wouldn't that then fix all of std.algorithm and the like without really disrupting anything?It would make everything related a lot (a TON) slower, and it would break all client code that uses dchar as the element type, or is otherwise unprepared to use Graphemes explicitly. There is no question there will be disruption.
Jan 17 2011
On Mon, 17 Jan 2011 07:44:17 -0500, Steven Schveighoffer wrote:We need to get some real numbers together. I'll see what I can create for a type, but someone else needs to supply the input :) I'm on short supply of unicode data, and any attempts I've made to create some result in failure. I have one example of one composed character in this thread that I can cling to, but in order to supply some real numbers, we need a large amount of data.Googling "unicode sample document" turned up a few examples. This one looks promising: http://www.humancomp.org/unichtm/unichtm.htm -Lars
Jan 17 2011
On 1/17/11 6:44 AM, Steven Schveighoffer wrote:On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:This is one extreme. Char only works for English. Dchar works for most languages. It won't work for a few. That doesn't make it useless for languages that work with it.On 1/15/11 9:25 PM, Jonathan M Davis wrote:I would have agreed with you last week. Now I understand that using dchar is just as useless for unicode as using char.Considering that strings are already dealt with specially in order to have an element of dchar, I wouldn't think that it would be all that distruptive to make it so that they had an element type of Grapheme instead. Wouldn't that then fix all of std.algorithm and the like without really disrupting anything?It would make everything related a lot (a TON) slower, and it would break all client code that uses dchar as the element type, or is otherwise unprepared to use Graphemes explicitly. There is no question there will be disruption.Will it be slower? Perhaps. A TON slower? Probably not.It will be a ton slower.But it will be correct. Correct and slow is better than incorrect and fast. If I showed you a shortest-path algorithm that ran in O(V) time, but didn't always find the shortest path, would you call it a success?The comparison doesn't apply.We need to get some real numbers together. I'll see what I can create for a type, but someone else needs to supply the input :) I'm on short supply of unicode data, and any attempts I've made to create some result in failure. I have one example of one composed character in this thread that I can cling to, but in order to supply some real numbers, we need a large amount of data.I very much appreciate that you're doing actual work on this. Andrei
Jan 17 2011
On 1/17/11 6:44 AM, Steven Schveighoffer wrote:We need to get some real numbers together. I'll see what I can create for a type, but someone else needs to supply the input :) I'm on short supply of unicode data, and any attempts I've made to create some result in failure. I have one example of one composed character in this thread that I can cling to, but in order to supply some real numbers, we need a large amount of data.Oh, one more thing. You don't need a lot of Unicode text containing combining characters to write benchmarks. (You do need it for testing purposes.) Most text won't contain combining characters anyway, so after you implement graphemes, just benchmark them on regular text. Andrei
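A minimal version of the benchmark Andrei describes, grapheme iteration over ordinary text where almost no clustering actually happens, might look like the following (a Python sketch; the naive by_grapheme below is a stand-in for a real grapheme segmenter, and only the shape of the comparison is meant, not absolute numbers):

```python
import timeit
import unicodedata

def by_grapheme(s):
    # Stand-in grapheme iterator: combining marks join the preceding base.
    cluster = ""
    for cp in s:
        if cluster and unicodedata.combining(cp):
            cluster += cp
        else:
            if cluster:
                yield cluster
            cluster = cp
    if cluster:
        yield cluster

# Regular text with no combining characters, as suggested above.
text = "The quick brown fox jumps over the lazy dog. " * 1000

t_points = timeit.timeit(lambda: sum(1 for _ in text), number=10)
t_clusters = timeit.timeit(lambda: sum(1 for _ in by_grapheme(text)), number=10)
print(f"by code point: {t_points:.4f}s, by grapheme: {t_clusters:.4f}s")

# On combining-free text both walks must see the same elements,
# so correctness can be checked even without exotic test data.
assert sum(1 for _ in text) == sum(1 for _ in by_grapheme(text))
```

The per-element property lookup is where the "a TON slower" worry comes from; the benchmark isolates exactly that overhead on the common case.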
Jan 17 2011
On Mon, 17 Jan 2011 10:00:57 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 1/17/11 6:44 AM, Steven Schveighoffer wrote:True, benchmarking doesn't apply with combining characters because we have nothing to compare it to. The current scheme fails on it anyways, so it by default would be the best solution. -SteveWe need to get some real numbers together. I'll see what I can create for a type, but someone else needs to supply the input :) I'm on short supply of unicode data, and any attempts I've made to create some result in failure. I have one example of one composed character in this thread that I can cling to, but in order to supply some real numbers, we need a large amount of data.Oh, one more thing. You don't need a lot of Unicode text containing combining characters to write benchmarks. (You do need it for testing purposes.) Most text won't contain combining characters anyway, so after you implement graphemes, just benchmark them on regular text.
Jan 17 2011
On 01/17/2011 04:00 PM, Andrei Alexandrescu wrote:On 1/17/11 6:44 AM, Steven Schveighoffer wrote:Correct. For this reason, we do not use the same source at all for correctness and performance testing. It is impossible to define typical or representative source (who judges?) But at very minimum, source texts for perf measurement should mix languages as diverse as possible, including some material of the ones known to be problematic and/or atypical (english, korean, hebrew...) The following (ripped and composed from ICU data sets) is just that: https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt Content: 12 natural languages 34767 bytes = utf8 code units --> 20133 code points --> 22033 normal codes (NFD decomposed) --> 19205 piles = true characters Denis _________________ vita es estrany spir.wikidot.comWe need to get some real numbers together. I'll see what I can create for a type, but someone else needs to supply the input :) I'm on short supply of unicode data, and any attempts I've made to create some result in failure. I have one example of one composed character in this thread that I can cling to, but in order to supply some real numbers, we need a large amount of data.Oh, one more thing. You don't need a lot of Unicode text containing combining characters to write benchmarks. (You do need it for testing purposes.) Most text won't contain combining characters anyway, so after you implement graphemes, just benchmark them on regular text.
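Denis's byte-to-code-point-to-NFD-code counts can be reproduced in miniature. A sketch (in Python; the sample string is my own, chosen because both Hangul and accented Latin expand under NFD decomposition):

```python
import unicodedata

sample = "expos\u00e9 \ud55c\uad6d\uc5b4"   # "exposé 한국어"
nfd = unicodedata.normalize("NFD", sample)  # decompose to NFD, like the data set

print(len(sample.encode("utf-8")))  # 17 UTF-8 code units
print(len(sample))                  # 10 code points
print(len(nfd))                     # 16 codes after NFD (é -> 2, syllables -> jamo)
```

The three counts diverge for the same reason as in the full data set: one layer counts storage units, one counts code points, and normalization changes the code-point count without changing the text.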
Jan 17 2011
On 01/17/2011 01:44 PM, Steven Schveighoffer wrote:On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Hello Steve & Andrei, I see 2 questions: (1) whether we should provide Unicode correctness as a default or not? and relative points of level of abstraction & normalisation (2) what is the best way to implement such correctness? Let us put aside (1) for a while; anyway nothing prevents us from experimenting while waiting for an agreement; such an experiment would in fact feed the debate with real facts instead of "airy" ideas. It seems there are 2 opposite approaches to Unicode correctness. Mine was to build a type that systematically abstracts UCS-created issues (that real whole characters are coded by mini-arrays of codes I call "code piles", that those piles have variable lengths, _and_ that characters even may have several representations). Then, in my wild guesses, every text manipulation method should obviously be "flash fast", actually faster than any on the fly algo by several orders of magnitude. But Michel made me doubt that point. The other approach is precisely to provide the needed abstraction ("piling" and normalisation) on the fly. Like proposed by Michel, and like Objective-C does, IIUC. This way seems to me closer to a kind of re-design of Steven's new String type and/or Andrei's VLERange. As you say, we need real timing numbers to decide. I think we should measure at least 2 routines: * indexing (or better, iteration?) which only requires "piling" * counting occurrences of a given character or slice, which requires both piling and normalisation I do not feel like implementing such routines for the on the fly version, and have no time for this in the coming days; but if anyone volunteers, feel free to rip code and data from Text's current implementation if it may help. As source text, we can use the one at https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt (already my source for perf measures). 
Its only merit is that it is a text (about Unicode!) in twelve rather different languages. [My intuitive guess is that Michel is wrong by orders of magnitude --but again I know next to nothing about code performance.] Denis _________________ vita es estrany spir.wikidot.comOn 1/15/11 9:25 PM, Jonathan M Davis wrote:I would have agreed with you last week. Now I understand that using dchar is just as useless for unicode as using char. Will it be slower? Perhaps. A TON slower? Probably not. But it will be correct. Correct and slow is better than incorrect and fast. If I showed you a shortest-path algorithm that ran in O(V) time, but didn't always find the shortest path, would you call it a success? We need to get some real numbers together. I'll see what I can create for a type, but someone else needs to supply the input :) I'm in short supply of unicode data, and any attempts I've made to create some result in failure. I have one example of one composed character in this thread that I can cling to, but in order to supply some real numbers, we need a large amount of data. -SteveConsidering that strings are already dealt with specially in order to have an element of dchar, I wouldn't think that it would be all that disruptive to make it so that they had an element type of Grapheme instead. Wouldn't that then fix all of std.algorithm and the like without really disrupting anything?It would make everything related a lot (a TON) slower, and it would break all client code that uses dchar as the element type, or is otherwise unprepared to use Graphemes explicitly. There is no question there will be disruption.
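The counting routine spir proposes is easy to make concrete. The sketch below is in Python rather than D so it can be run anywhere; the behaviour it demonstrates is a property of Unicode itself, not of the language, and the sample string is constructed for the example:

```python
import unicodedata

# Constructed sample: the same character twice, once precomposed (U+00E9),
# once decomposed (e + U+0301 COMBINING ACUTE ACCENT).
text = "caf\u00e9 cafe\u0301"

naive = text.count("\u00e9")  # code-point comparison only
piled = unicodedata.normalize("NFC", text).count("\u00e9")  # normalize ("pile") first

print(naive)  # 1 -- misses the decomposed occurrence
print(piled)  # 2 -- the count a user would expect
```

This is exactly why the counting benchmark needs both piling and normalisation: without them, two renderings of the same character compare unequal.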
Jan 17 2011
On Saturday 15 January 2011 19:25:47 Jonathan M Davis wrote:On Saturday 15 January 2011 15:59:27 Andrei Alexandrescu wrote:On 1/15/11 4:45 PM, Michel Fortin wrote:On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin <michel.fortin michelf.com> wrote:On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write: foreach (dchar c; "exposé") {} foreach (wchar c; "exposé") {} foreach (char c; "exposé") {} // or foreach (dchar c; "exposé".by!dchar()) {} foreach (wchar c; "exposé".by!wchar()) {} foreach (char c; "exposé".by!char()) {} and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since the normalized grapheme can occupy more than one dchar.I'm glad we agree on that now.It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee-defined standard where there are 10 ways to do everything; I was trying to weed out the lesser-used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).Actually, I don't think Unicode was so badly designed. 
It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now, can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support. That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.Indeed, the change would probably be too radical for D2.I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we. The only thing we can change (I hope) at this point is how iterating on strings works.I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments: I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility.
However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.Considering that strings are already dealt with specially in order to have an element of dchar, I wouldn't think that it would be all that disruptive to make it so that they had an element type of Grapheme instead. Wouldn't that then fix all of std.algorithm and the like without really disrupting anything?The issue of foreach remains, but without being willing to change what foreach defaults to, you can't really fix it - though I'd suggest that we at least make it a warning to iterate over strings without specifying the type. And if foreach were made to understand Grapheme like it understands dchar, then you could do foreach(Grapheme g; str) { ... } and have the compiler warn about foreach(g; str) { ... } and tell you to use Grapheme if you want to be comparing actual characters.Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the ground that it would silently break D1 compatibility. This is a valid point in my opinion.I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point.That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do. 
One more thing: NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operations. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base. It may be more realistic to consider using what we have as back-end for grapheme-oriented processing. For example: struct Grapheme(Char) if (isSomeChar!Char) { private const Char[] rep; ... } auto byGrapheme(S)(S s) if (isSomeString!S) { ... } string s = "Hello"; foreach (g; byGrapheme(s)) { ... }Regardless, by making strings ranges of Grapheme rather than dchar, I would think that we would solve most of the problem. At minimum, we'd have pretty much the same problems that we have right now with char and wchar arrays, but we'd get rid of a whole class of unicode problems. So, nothing would be worse, but some of it would be better.I suppose that the one major omission though is that string comparisons would be by code unit, not graphemes, which would be a problem. == could be made to use graphemes instead, but then you couldn't compare them by code units or code points unless you cast to ubyte[], ushort[], or uint[]... 
It would still probably be worth making == use graphemes though. - Jonathan M Davis
Jan 15 2011
On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base.There's still a disagreement about whether a string or a code unit array should be the default string representation, and whether iterating on a code unit array should give you code unit or grapheme elements. Of those who participated in the discussion, I don't think anyone is disputing the idea that a grapheme element is better than a dchar element for iterating over a string.It may be more realistic to consider using what we have as back-end for grapheme-oriented processing. For example: struct Grapheme(Char) if (isSomeChar!Char) { private const Char[] rep; ... } auto byGrapheme(S)(S s) if (isSomeString!S) { ... } string s = "Hello"; foreach (g; byGrapheme(s)) { ... }No doubt it's easier to implement it that way. The problem is that in most cases it won't be used. How many people really know what is a grapheme? Of those, how many will forget to use byGrapheme at one time or another? And so in most programs string manipulation will misbehave in the presence of combining characters or unnormalized strings. If you want to help D programmers write correct code when it comes to Unicode manipulation, you need to help them iterate on real characters (graphemes), and you need the algorithms to apply to real characters (graphemes), not the approximation of a Unicode character that is a code point. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
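The code-point/grapheme distinction Michel describes is easy to demonstrate. Below is a minimal sketch in Python rather than D (the Unicode behaviour is language-independent), using a deliberately naive clustering rule: attach any combining mark to the preceding base character. Full segmentation is defined by Unicode's UAX #29 and handles many more cases:

```python
import unicodedata

def graphemes(s):
    """Naive grapheme clustering: glue combining marks (combining class != 0)
    onto the preceding base character. Real segmentation follows UAX #29."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = "e\u0301tude"  # 'étude' with a decomposed é (e + combining acute)

print(len(s))             # 6 -- code points, i.e. what a dchar range sees
print(len(graphemes(s)))  # 5 -- user-perceived characters
print(graphemes(s)[0])    # the two-code-point slice rendered as one 'é'
```

The first element is a slice of more than one code point, which is exactly why Michel argues the element type of a string range should be a slice (a grapheme), not a dchar.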
Jan 15 2011
On 1/15/11 10:45 PM, Michel Fortin wrote:On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base.There's still a disagreement about whether a string or a code unit array should be the default string representation, and whether iterating on a code unit array should give you code unit or grapheme elements. Of those who participated in the discussion, I don't think anyone is disputing the idea that a grapheme element is better than a dchar element for iterating over a string.Disagreement as that might be, a simple fact that needs to be taken into account is that as of right now all of Phobos uses UTF arrays for string representation and dchar as element type. Besides, for one I do dispute the idea that a grapheme element is better than a dchar element for iterating over a string. The grapheme has the attractiveness of being theoretically clean but at the same time is woefully inefficient and helps languages that few D users need to work with. At least that's my perception, and we need some serious numbers instead of convincing rhetoric to make a big decision. It's all a matter of picking one's trade-offs. Clearly ASCII is out as no serious amount of non-English text can be trafficked without diacritics. So switching to UTF makes a lot of sense, and that's what D did. When I introduced std.range and std.algorithm, they'd handle char[] and wchar[] no differently than any other array. A lot of algorithms simply did the wrong thing by default, so I attempted to fix that situation by defining byDchar(). So instead of passing some string str to an algorithm, one would pass byDchar(str). A couple of weeks went by in testing that state of affairs, and before long I figured that I needed to insert byDchar() virtually _everywhere_. There were a couple of algorithms (e.g. Boyer-Moore) that happened to work with arrays for subtle reasons (needless to say, they won't work with graphemes at all). But by and large the situation was that the simple and intuitive code was wrong and that the correct code necessitated inserting byDchar(). So my next decision, which understandably some of the people who didn't go through the experiment may find unintuitive, was to make byDchar() the default. This cleaned up a lot of crap in std itself and saved a lot of crap in the yet-unwritten client code. 
I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried. Now, thanks to the effort people have spent in this group (thank you!), I have an understanding of the grapheme issue. I guarantee that grapheme-level iteration will have a high cost incurred to it: efficiency and changes in std. The languages that need composing characters for producing meaningful text are few and far between, so it makes sense to confine support for them to libraries that are not the default, unless we find ways to not disrupt everyone else.I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF aray representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base.There's still a disagreement about whether a string or a code unit array should be the default string representation, and whether iterating on a code unit array should give you code unit or grapheme elements. Of those who who participated in the discussion, I don't think anyone is disputing the idea that a grapheme element is better than a dchar element for iterating over a string.How many people really should care?It may be more realistic to consider using what we have as back-end for grapheme-oriented processing. For example: struct Grapheme(Char) if (isSomeChar!Char) { private const Char[] rep; ... } auto byGrapheme(S)(S s) if (isSomeString!S) { ... } string s = "Hello"; foreach (g; byGrapheme(s) { ... }No doubt it's easier to implement it that way. The problem is that in most cases it won't be used. How many people really know what is a grapheme?Of those, how many will forget to use byGrapheme at one time or another? 
And so in most programs string manipulation will misbehave in the presence of combining characters or unnormalized strings.But most strings don't contain combining characters or unnormalized strings.If you want to help D programmers write correct code when it comes to Unicode manipulation, you need to help them iterate on real characters (graphemes), and you need the algorithms to apply to real characters (graphemes), not the approximation of a Unicode character that is a code point.I don't think the situation is as clean cut, as grave, and as urgent as you say. Andrei
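The "unnormalized strings" failure Michel describes shows up even in plain equality, before any algorithm runs. A small illustration in Python (the precomposed/decomposed pair is a constructed example; the same holds for D's built-in string comparison, which is per code unit):

```python
import unicodedata

a = "caf\u00e9"   # precomposed: 4 code points
b = "cafe\u0301"  # decomposed: 5 code points, identical rendered text

print(a == b)  # False -- code-unit/code-point comparison sees different sequences
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))  # True -- equal after normalization
```

Whether the default `==` should pay for normalization is precisely the efficiency trade-off being argued here.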
Jan 16 2011
On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 1/15/11 10:45 PM, Michel Fortin wrote:I think the only people who should *not* care are those who have validated that the input does not contain any combining code point. If you know the input *can't* contain combining code points, then it's safe to ignore them. If we don't make correct Unicode handling the default, someday someone is going to ask a developer to fix a problem where his system doesn't handle some text correctly. Later that day, he'll come to the realization that almost none of his D code and none of the D libraries he use handle unicode correctly, and he'll say: can't fix this. His peer working on a similar Objective-C program will have a good laugh. Sure, correct Unicode handling is slower and more complicated to implement, but at least you know you'll get the right results.No doubt it's easier to implement it that way. The problem is that in most cases it won't be used. How many people really know what is a grapheme?How many people really should care?I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow? A few years ago, many Unicode symbols didn't even show up correctly on Windows. Today, we have Unicode domain names and people start putting funny symbols in them (for instance: <http://◉.ws>). I haven't seen it yet, but we'll surely see combining characters in domain names soon enough (if only as a way to make fun of programs that can't handle Unicode correctly). Well, let me be the first to make fun of such programs: <http://☺̭̏.michelf.com/>. Also, not all combining characters are marks meant to be used by some foreign languages. Some are used for mathematics for instance. 
Or you could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay indicating some kind of prohibition.Of those, how many will forget to use byGrapheme at one time or another? And so in most programs string manipulation will misbehave in the presence of combining characters or unnormalized strings.But most strings don't contain combining characters or unnormalized strings.I agree it's probably not as clean cut as I say (I'm trying to keep complicated things simple here), but it's something important to decide early because the cost of changing it increase as more code is written. Quoting the first part of the same post (out of order):If you want to help D programmers write correct code when it comes to Unicode manipulation, you need to help them iterate on real characters (graphemes), and you need the algorithms to apply to real characters (graphemes), not the approximation of a Unicode character that is a code point.I don't think the situation is as clean cut, as grave, and as urgent as you say.Disagreement as that might be, a simple fact that needs to be taken into account is that as of right now all of Phobos uses UTF arrays for string representation and dchar as element type. Besides, for one I do dispute the idea that a grapheme element is better than a dchar element for iterating over a string. The grapheme has the attractiveness of being theoretically clean but at the same time is woefully inefficient and helps languages that few D users need to work with. At least that's my perception, and we need some serious numbers instead of convincing rhetoric to make a big decision.You'll no doubt get more performance from a grapheme-aware specialized algorithm working directly on code points than by iterating on graphemes returned as string slices. But both will give *correct* results. Implementing a specialized algorithm of this kind becomes an optimization, and it's likely you'll want an optimized version for most string algorithms. 
I'd like to have some numbers too about performance, but I have none at this time.It's all a matter of picking one's trade-offs. Clearly ASCII is out as no serious amount of non-English text can be trafficked without diacritics. So switching to UTF makes a lot of sense, and that's what D did. When I introduced std.range and std.algorithm, they'd handle char[] and wchar[] no differently than any other array. A lot of algorithms simply did the wrong thing by default, so I attempted to fix that situation by defining byDchar(). So instead of passing some string str to an algorithm, one would pass byDchar(str). A couple of weeks went by in testing that state of affairs, and before late I figured that I need to insert byDchar() virtually _everywhere_. There were a couple of algorithms (e.g. Boyer-Moore) that happened to work with arrays for subtle reasons (needless to say, they won't work with graphemes at all). But by and large the situation was that the simple and intuitive code was wrong and that the correct code necessitated inserting byDchar(). So my next decision, which understandably some of the people who didn't go through the experiment may find unintuitive, was to make byDchar() the default. This cleaned up a lot of crap in std itself and saved a lot of crap in the yet-unwritten client code.But were your algorithms *correct* in the first place? I'd argue that by making byDchar the default you've not saved yourself from any crap because dchar isn't the right layer of abstraction.I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried.It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point character and can't imagine yourself having to deal with them on a semi-frequent basis. 
Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.Now, thanks to the effort people have spent in this group (thank you!), I have an understanding of the grapheme issue. I guarantee that grapheme-level iteration will have a high cost incurred to it: efficiency and changes in std. The languages that need composing characters for producing meaningful text are few and far between, so it makes sense to confine support for them to libraries that are not the default, unless we find ways to not disrupt everyone else.We all are more aware of the problem now, that's a good thing. :-) -- Michel Fortin michel.fortin michelf.com http://michelf.com/
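Michel's U+20E0 example, and the way code-point-level operations tear such clusters apart, can be shown directly. A Python sketch with constructed strings (naive reversal stands in for any algorithm that permutes code points):

```python
# U+20E0 COMBINING ENCLOSING CIRCLE BACKSLASH overlaid on 'A' ("no A"):
s = "A\u20e0"
print(len(s))  # 2 code points, but one user-perceived symbol

# Code-point-level operations break such clusters, e.g. naive reversal:
word = "expose\u0301"  # 'exposé', decomposed
backwards = word[::-1]
print(backwards.startswith("\u0301"))  # True -- the accent is now orphaned
```

The reversed string begins with a bare combining mark, so the accent detaches from its base letter; a grapheme-aware reverse would keep `e\u0301` together as one element.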
Jan 16 2011
On 1/16/11 3:20 PM, Michel Fortin wrote:On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I agree. Now let me ask again: how many people really should care?On 1/15/11 10:45 PM, Michel Fortin wrote:I think the only people who should *not* care are those who have validated that the input does not contain any combining code point. If you know the input *can't* contain combining code points, then it's safe to ignore them.No doubt it's easier to implement it that way. The problem is that in most cases it won't be used. How many people really know what is a grapheme?How many people really should care?If we don't make correct Unicode handling the default, someday someone is going to ask a developer to fix a problem where his system doesn't handle some text correctly. Later that day, he'll come to the realization that almost none of his D code and none of the D libraries he use handle unicode correctly, and he'll say: can't fix this. His peer working on a similar Objective-C program will have a good laugh. Sure, correct Unicode handling is slower and more complicated to implement, but at least you know you'll get the right results.I love the increased precision, but again I'm not sure how many people ever manipulate text with combining characters. Meanwhile they'll complain that D is slower than other languages.I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?Of those, how many will forget to use byGrapheme at one time or another? 
And so in most programs string manipulation will misbehave in the presence of combining characters or unnormalized strings.But most strings don't contain combining characters or unnormalized strings.A few years ago, many Unicode symbols didn't even show up correctly on Windows. Today, we have Unicode domain names and people start putting funny symbols in them (for instance: <http://◉.ws>). I haven't seen it yet, but we'll surely see combining characters in domain names soon enough (if only as a way to make fun of programs that can't handle Unicode correctly). Well, let me be the first to make fun of such programs: <http://☺̭̏.michelf.com/>.Would you bet the language on that?Also, not all combining characters are marks meant to be used by some foreign languages. Some are used for mathematics for instance. Or you could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay indicating some kind of prohibition.Agreed.I agree it's probably not as clean cut as I say (I'm trying to keep complicated things simple here), but it's something important to decide early because the cost of changing it increase as more code is written.If you want to help D programmers write correct code when it comes to Unicode manipulation, you need to help them iterate on real characters (graphemes), and you need the algorithms to apply to real characters (graphemes), not the approximation of a Unicode character that is a code point.I don't think the situation is as clean cut, as grave, and as urgent as you say.Quoting the first part of the same post (out of order):I spent a fair amount of time comparing ASCII vs. Unicode code speed. The fact of the matter is that the overhead is measurable and often high. Also it occurs at a very core level. For starters, the grapheme itself is larger and has one extra indirection. 
I am confident the marginal overhead for graphemes would be considerable.Disagreement as that might be, a simple fact that needs to be taken into account is that as of right now all of Phobos uses UTF arrays for string representation and dchar as element type. Besides, for one I do dispute the idea that a grapheme element is better than a dchar element for iterating over a string. The grapheme has the attractiveness of being theoretically clean but at the same time is woefully inefficient and helps languages that few D users need to work with. At least that's my perception, and we need some serious numbers instead of convincing rhetoric to make a big decision.You'll no doubt get more performance from a grapheme-aware specialized algorithm working directly on code points than by iterating on graphemes returned as string slices. But both will give *correct* results. Implementing a specialized algorithm of this kind becomes an optimization, and it's likely you'll want an optimized version for most string algorithms. I'd like to have some numbers too about performance, but I have none at this time.It was correct for all but a couple languages. Again: most of today's languages don't ever need combining characters.It's all a matter of picking one's trade-offs. Clearly ASCII is out as no serious amount of non-English text can be trafficked without diacritics. So switching to UTF makes a lot of sense, and that's what D did. When I introduced std.range and std.algorithm, they'd handle char[] and wchar[] no differently than any other array. A lot of algorithms simply did the wrong thing by default, so I attempted to fix that situation by defining byDchar(). So instead of passing some string str to an algorithm, one would pass byDchar(str). A couple of weeks went by in testing that state of affairs, and before late I figured that I need to insert byDchar() virtually _everywhere_. There were a couple of algorithms (e.g. 
Boyer-Moore) that happened to work with arrays for subtle reasons (needless to say, they won't work with graphemes at all). But by and large the situation was that the simple and intuitive code was wrong and that the correct code necessitated inserting byDchar(). So my next decision, which understandably some of the people who didn't go through the experiment may find unintuitive, was to make byDchar() the default. This cleaned up a lot of crap in std itself and saved a lot of crap in the yet-unwritten client code.But were your algorithms *correct* in the first place? I'd argue that by making byDchar the default you've not saved yourself from any crap because dchar isn't the right layer of abstraction.Do you, and can you?I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried.It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point character and can't imagine yourself having to deal with them on a semi-frequent basis.Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.They can't be unaware and write said code.All I wish is it's not blown out of proportion. It fares rather low on my list of library issues that D has right now. AndreiNow, thanks to the effort people have spent in this group (thank you!), I have an understanding of the grapheme issue. I guarantee that grapheme-level iteration will have a high cost incurred to it: efficiency and changes in std. The languages that need composing characters for producing meaningful text are few and far between, so it makes sense to confine support for them to libraries that are not the default, unless we find ways to not disrupt everyone else.We all are more aware of the problem now, that's a good thing. :-)
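One rough way to get the "serious numbers" both sides keep asking for: time code-point iteration against even a trivial grapheme-aware pass over the same text. This Python micro-benchmark is only a sketch of the methodology, not a statement about D's performance; the sample text is constructed, absolute timings will vary by machine, and a real grapheme walk (UAX #29) would cost more than the combining-class check used here:

```python
import timeit
import unicodedata

# Decomposed text: accents are separate combining code points (constructed sample).
text = unicodedata.normalize("NFD", "\u00e9tude na\u00efve expos\u00e9 ") * 1000

def by_codepoint(s):
    # Baseline: one step per code point.
    return sum(1 for _ in s)

def by_grapheme(s):
    # Trivial grapheme-aware count: don't count combining marks as elements.
    n = 0
    for ch in s:
        if not unicodedata.combining(ch):
            n += 1
    return n

print(by_codepoint(text))  # 22000 code points
print(by_grapheme(text))   # 19000 user-perceived characters
print("code points:", timeit.timeit(lambda: by_codepoint(text), number=100))
print("graphemes:  ", timeit.timeit(lambda: by_grapheme(text), number=100))
```

Even this minimal per-element table lookup adds measurable overhead, which is the shape of the cost Andrei is worried about; how large it is in optimized D code is exactly what needs measuring.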
Jan 16 2011
On 17.01.2011 00:58, Andrei Alexandrescu wrote:On 1/16/11 3:20 PM, Michel Fortin wrote:So why does D use unicode anyway? If you don't care about not-often used languages anyway, you could have used UCS-2 like Java. Or plain 8bit ISO-8859-* (the user can decide which encoding he wants/needs). You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has less overhead anyway".On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.But most strings don't contain combining characters or unnormalized strings.I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?Fun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics. I think especially when you're handling names (and much software does, I think) it's crucial to have proper support for all kinds of chars. Of course many programmers are not aware that, if Umlaute and ß work, it doesn't mean that all other kinds of strange characters work as well. Cheers, - DanielDo you, and can you?I think it's reasonable to understand why I'm happy with the current state of affairs. 
It is better than anything we've had before and better than everything else I've tried.It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point characters and can't imagine yourself having to deal with them on a semi-frequent basis.Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.They can't be unaware and write said code.
Jan 16 2011
On 1/16/11 6:42 PM, Daniel Gibson wrote:Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:I consider UTF8 superior to all of the above.On 1/16/11 3:20 PM, Michel Fortin wrote:So why does D use unicode anyway? If you don't care about not-often used languages anyway, you could have used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide which encoding he wants/needs). You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has fewer overhead anyway".On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.But most strings don't contain combining characters or unnormalized strings.I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?I think German text works well with dchar. AndreiFun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics. I think especially when you're handling names (and much software does, I think) it's crucial to have proper support for all kinds of chars. Of course many programmers are not aware that, if Umlaute and ß works it doesn't mean that all other kinds of strange characters work as well. Cheers, - DanielDo you, and can you?I think it's reasonable to understand why I'm happy with the current state of affairs. 
It is better than anything we've had before and better than everything else I've tried.It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point character and can't imagine yourself having to deal with them on a semi-frequent basis.Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.They can't be unaware and write said code.
Jan 16 2011
On Sunday 16 January 2011 18:45:26 Andrei Alexandrescu wrote:On 1/16/11 6:42 PM, Daniel Gibson wrote:Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:On 1/16/11 3:20 PM, Michel Fortin wrote:On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:But most strings don't contain combining characters or unnormalized strings.I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.So why does D use unicode anyway? If you don't care about not-often used languages anyway, you could have used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide which encoding he wants/needs). You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has fewer overhead anyway".I consider UTF8 superior to all of the above.I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried.It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point character and can't imagine yourself having to deal with them on a semi-frequent basis.Fun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics. I think especially when you're handling names (and much software does, I think) it's crucial to have proper support for all kinds of chars. Of course many programmers are not aware that, if Umlaute and ß works it Do you, and can you?
Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.They can't be unaware and write said code.I think that whether dchar will be enough will depend primarily on where the unicode is coming from and what the programmer is doing with it. There's plenty which will just work regardless of whether code points are pre-combined or not, and there's other stuff which will have subtle bugs if they're not pre-combined. For the most part, Western languages should have pre-combined characters, but whether a program sees them in combined form or not will depend on where the text comes from. If it comes from a file, then it all depends on the program which wrote the file. If it comes from the console, then it depends on what that console does. If it comes from a socket or pipe or whatnot, then it depends on whatever program is sending the data. So, the question becomes what the norm is? Are unicode characters normally pre-combined or left as separate code points? The majority of English text will be fine regardless, since English only uses accented characters and the like when including foreign words, but most any other European language will have accented characters and then it's an open question. If it's more likely that a D program will receive pre-combined characters than not, then many programs will likely be safe treating a code point as a character. But if the odds are high that a D program will receive characters which are not yet combined, then certain sets of text will invariably result in bugs in your average D program. I don't think that there's much question that from a performance standpoint and from the standpoint of trying to avoid breaking TDPL and a lot of pre-existing code, we should continue to treat a code point - a dchar - as an abstract character.
Moving to graphemes could really harm performance - and there _are_ plenty of programs that couldn't care less about unicode. However, it's quite clear that in a number of circumstances, that's going to result in buggy code. The question then is whether it's okay to take a performance hit just to correctly handle unicode. And I expect that a _lot_ of people are going to say no to that. D already does better at handling unicode than many other languages, so it's definitely a step up as it is. The cost for handling unicode completely correctly is quite high from the sounds of it - all of a sudden you're effectively (if not literally) dealing with arrays of arrays instead of arrays. So, I think that it's a viable option to say that the default path that D will take is the _mostly_ correct but still reasonably efficient path, and then - through 3rd party libraries or possibly even with a module in Phobos - we'll provide a means to handle unicode 100% correctly for those who really care. At minimum, we need the tools to handle unicode correctly, but if we can't handle it both correctly and efficiently, then I'm afraid that it's just not going to be reasonable to handle it correctly - especially if we can handle it _almost_ correctly and still be efficient. Regardless, the real question is how likely a D program is to deal with unicode which is not pre-combined. If the odds are relatively low in the general case, then sticking to dchar should be fine. But if the odds are relatively high, then not going to graphemes could mean that there will be a _lot_ of buggy D programs out there. - Jonathan M Davisdoesn't mean that all other kinds of strange characters work as well. Cheers, - DanielI think German text works well with dchar.
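The pre-combined versus decomposed distinction at the heart of this exchange can be made concrete with a short sketch. Python is used here purely for illustration, since its standard `unicodedata` module exposes Unicode normalization directly; the same behavior applies to D strings compared code point by code point:

```python
import unicodedata

# "é" can arrive pre-combined (NFC: one code point) or decomposed
# (NFD: base letter 'e' followed by combining acute accent U+0301).
nfc = unicodedata.normalize("NFC", "café")
nfd = unicodedata.normalize("NFD", "café")

print(len(nfc))  # 4 code points
print(len(nfd))  # 5 code points: the accent is a separate code point
print(nfc == nfd)                                # False at the code-point level
print(unicodedata.normalize("NFC", nfd) == nfc)  # True once normalized
```

Any code that compares strings code point by code point will see `nfc` and `nfd` as unequal even though they render identically, which is exactly the class of subtle bug being discussed: whether it bites depends on which form the input source happened to produce.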
Jan 16 2011
Am 17.01.2011 03:45, schrieb Andrei Alexandrescu:On 1/16/11 6:42 PM, Daniel Gibson wrote:Really? UTF32 - maybe. But IMHO even when not considering graphemes and such UTF8 sucks hard in comparison to those because one code point consists of 1-4 code units (even in German 1-2 code units).Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:I consider UTF8 superior to all of the above.On 1/16/11 3:20 PM, Michel Fortin wrote:So why does D use unicode anyway? If you don't care about not-often used languages anyway, you could have used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide which encoding he wants/needs). You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has fewer overhead anyway".On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.But most strings don't contain combining characters or unnormalized strings.I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?Yes, but even in Germany there are people whose names contain "strange" characters ;) Is it common to have programs that deal with text in a specific language but not with names? I do understand your resistance to support Unicode properly - it's a lot of trouble and makes things inefficient (more inefficient than UTF8/16 already are because of that code point != code unit thing). Another thing is that due to bad support from fonts or console/GUI technology it may happen (quite often) that one grapheme is *not* displayed as a single character, thus messing up formatting anyway (Still you probably should cut a string within a grapheme). 
So here's what I think can be done (and, at least the first two points, especially the first, should be done): 1. Mention the Grapheme and Digraph situation in string related documentation (std.string and maybe string-related stuff in std.algorithm like Splitter) to make sure people who use Phobos are aware of the problem. Then at least they can't say that nobody told them when their Objective-C using colleagues are laughing at their broken unicode-support ;) 2. Maybe add some functions that *do* deal with this. Like "bool isPartOfGrapheme(dchar c)" or "bool isDigraph(dchar c)" so people can check themselves, if they just split their string within a grapheme or something. 3. Include a proper Unicode-string type/module, if somebody has the time and knowledge to develop one. spir already started something like that AFAIK and Steven Schveighoffer also is even working on a complete string type - maybe these efforts could be combined? I guess default strings will stay mostly the way they are (but please add an ASCII type or allow ubyte[] asciiStr = "asdf";). Having an additional type in Phobos that works correctly in all cases (e.g. Arabic, Hebrew, Japanese, ..) would be really great, though. UniString uStr = new UniString("sdfüñẫ"); UniString uStr2 = uStr[3..$]; // "üñẫ" UniGraph ug = uStr[5]; // 'ẫ' size_t i = uStr2.length; // 3 something like that maybe (of course plus a lot of other stuff like proper comparison for different encodings of the same char like a modified icmp() discussed before). But something like size_t len = uniLen("sdfüñẫ"); // 6 string s = uniSlice(str, 3, str.length); // == str.uniSlice(3, str.length); etc may be just as good. (I hope this all made sense)I think German text works well with dchar.Fun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics. 
I think especially when you're handling names (and much software does, I think) it's crucial to have proper support for all kinds of chars. Of course many programmers are not aware that, if Umlaute and ß works it doesn't mean that all other kinds of strange characters work as well. Cheers, - DanielDo you, and can you?I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried.It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point character and can't imagine yourself having to deal with them on a semi-frequent basis.Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.They can't be unaware and write said code.AndreiCheers, - Daniel
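The three length notions that the hypothetical `UniString` above distinguishes (code units, code points, graphemes) can be illustrated with a sketch. Python is used here only because its `unicodedata` module is handy; note that the grapheme count below merely treats each non-combining code point as starting a new grapheme, which handles combining marks but is not full UAX #29 segmentation:

```python
import unicodedata

# force the fully decomposed (NFD) form so combining marks are explicit:
# ü -> u + U+0308, ñ -> n + U+0303, ẫ -> a + U+0302 + U+0303
s = unicodedata.normalize("NFD", "üñẫ")

units = len(s.encode("utf-8"))   # UTF-8 code units (bytes): 11
points = len(s)                  # code points: 7 (3 bases + 4 marks)
# crude grapheme count: each non-combining code point starts a grapheme
graphemes = sum(1 for c in s if not unicodedata.combining(c))  # 3

print(units, points, graphemes)  # 11 7 3
```

The spread between the three numbers for a three-"character" string is the whole problem: `length`, slicing, and indexing each mean something different depending on which unit the string type counts in.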
Jan 16 2011
Am 17.01.2011 04:38, schrieb Daniel Gibson:Am 17.01.2011 03:45, schrieb Andrei Alexandrescu:I meant you should *not* cut a string within a grapheme.On 1/16/11 6:42 PM, Daniel Gibson wrote:Really? UTF32 - maybe. But IMHO even when not considering graphemes and such UTF8 sucks hard in comparison to those because one code point consists of 1-4 code units (even in German 1-2 code units).Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:I consider UTF8 superior to all of the above.On 1/16/11 3:20 PM, Michel Fortin wrote:So why does D use unicode anyway? If you don't care about not-often used languages anyway, you could have used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide which encoding he wants/needs). You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has fewer overhead anyway".On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.But most strings don't contain combining characters or unnormalized strings.I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?Yes, but even in Germany there are people whose names contain "strange" characters ;) Is it common to have programs that deal with text in a specific language but not with names? I do understand your resistance to support Unicode properly - it's a lot of trouble and makes things inefficient (more inefficient than UTF8/16 already are because of that code point != code unit thing). 
Another thing is that due to bad support from fonts or console/GUI technology it may happen (quite often) that one grapheme is *not* displayed as a single character, thus messing up formatting anyway (Still you probably should cut a string within a grapheme).I think German text works well with dchar.Fun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics. I think especially when you're handling names (and much software does, I think) it's crucial to have proper support for all kinds of chars. Of course many programmers are not aware that, if Umlaute and ß works it doesn't mean that all other kinds of strange characters work as well. Cheers, - DanielDo you, and can you?I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried.It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point character and can't imagine yourself having to deal with them on a semi-frequent basis.Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.They can't be unaware and write said code.So here's what I think can be done (and, at least the first two points, especially the first, should be done): 1. Mention the Grapheme and Digraph situation in string related documentation (std.string and maybe string-related stuff in std.algorithm like Splitter) to make sure people who use Phobos are aware of the problem. Then at least they can't say that nobody told them when their Objective-C using colleagues are laughing at their broken unicode-support ;) 2. Maybe add some functions that *do* deal with this. 
Like "bool isPartOfGrapheme(dchar c)" or "bool isDigraph(dchar c)" so people can check themselves, if they just split their string within a grapheme or something. 3. Include a proper Unicode-string type/module, if somebody has the time and knowledge to develop one. spir already started something like that AFAIK and Steven Schveighoffer also is even working on a complete string type - maybe these efforts could be combined? I guess default strings will stay mostly the way they are (but please add an ASCII type or allow ubyte[] asciiStr = "asdf";). Having an additional type in Phobos that works correctly in all cases (e.g. Arabic, Hebrew, Japanese, ..) would be really great, though. UniString uStr = new UniString("sdfüñẫ"); UniString uStr2 = uStr[3..$]; // "üñẫ" UniGraph ug = uStr[5]; // 'ẫ' size_t i = uStr2.length; // 3of course I forgot: string s = uStr2.toString(); dstring s2 = uStr2.toDString(); to convert it back to a "normal" stringsomething like that maybe (of course plus a lot of other stuff like proper comparison for different encodings of the same char like a modified icmp() discussed before). But something like size_t len = uniLen("sdfüñẫ"); // 6 string s = uniSlice(str, 3, str.length); // == str.uniSlice(3, str.length); etc may be just as good. (I hope this all made sense)AndreiCheers, - Daniel
Jan 16 2011
On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 1/16/11 3:20 PM, Michel Fortin wrote:As I said: all those people who are not validating the inputs to make sure they don't contain combining code points. As far as I know, no one is doing that, so that means everybody should use algorithms capable of handling multi-code-point graphemes. If someone indeed is doing this validation, he'll probably also be smart enough to make his algorithms work with dchars. That said, no one should really have to care but those who implement the string manipulation functions. The idea behind making the grapheme the element type is to make it easier to write grapheme-aware string manipulation functions, even if you don't know about graphemes. But the reality is probably more mixed than that. - - - I gave some thought about all this, and came to an interesting realization that made me refine the proposal. The new proposal is disruptive perhaps as much as the first, but in a different way. But first, let's state a few facts to reframe the current discussion: Fact 1: most people don't know Unicode very well Fact 2: most people are confused by code units, code points, graphemes, and what is a 'character' Fact 3: most people won't bother with all this, they'll just use the basic language facilities and assume everything works correctly if it works correctly for them Now, let's define two goals: Goal 1: make most people's string operations work correctly Goal 2: make most people's string operations work fast To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not sure we agree on this, but let's continue. From the above 3 facts, we can deduce that a user won't want to bother with using byDchar, byGrapheme, or byWhatever when using algorithms. You were annoyed by having to write byDchar everywhere, so you changed the element type to always be dchar and you don't have to write byDchar anymore.
That's understandable and perfectly reasonable. The problem is of course that it doesn't give you correct results. Most of the time what you really want is to use graphemes, dchar just happen to be a good approximation of that that works most of the time. Iterating by grapheme is somewhat problematic, and it degrades performance. Same for comparing graphemes for normalized equivalence. That's all true. I'm not too sure what we can do about that. It can be optimized, but it's very understandable that some people won't be satisfied by the performance and will want to avoid graphemes. Speaking of optimization, I do understand that iterating by grapheme using the range interface won't give you the best performance. It's certainly convenient as it enables the reuse of existing algorithms with graphemes, but more specialized algorithms and interfaces might be more suited. One observation I made with having dchar as the default element type is that not all algorithms really need to deal with dchar. If I'm searching for code point 'a' in a UTF-8 string, decoding code units into code points is a waste of time. Why? because the only way to represent code point 'a' is by having code point 'a'. And guess what? The almost same optimization can apply to graphemes: if you're searching for 'a' in a grapheme-aware manner in a UTF-8 string, all you have to do is search for the UTF-8 code unit 'a', then check if the 'a' code unit is followed by a combining mark code point to confirm it is really a 'a', not a composed grapheme. Iterating the string by code unit is enough for these cases, and it'd increase performance by a lot. So making dchar the default type is no doubt convenient because it abstracts things enough so that generic algorithms can work with strings, but it has a performance penalty that you don't always need. I made an example using UTF-8, it applies even more to UTF-16. And it applies to grapheme-aware manipulations too. 
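The optimization described here, searching at the code-unit level and only then verifying that the match is not part of a composed grapheme, can be sketched as follows. Python is used for runnability and the function name is hypothetical; a D version would scan the `immutable(char)[]` array the same way. The trick relies on UTF-8 being self-synchronizing: an ASCII byte like 0x61 can never occur inside a multi-byte sequence, so no decoding is needed to find candidate matches:

```python
import unicodedata

def find_grapheme_ascii(haystack: str, needle: str) -> int:
    """Find ASCII character `needle` as a whole grapheme; return its
    code-point index, or -1. Scans UTF-8 code units directly, then
    rejects matches followed by a combining mark (which would make the
    match the base of a larger grapheme)."""
    data = haystack.encode("utf-8")
    i = data.find(ord(needle))
    while i != -1:
        rest = data[i + 1:].decode("utf-8")
        if not (rest and unicodedata.combining(rest[0])):
            # translate the byte offset back to a code-point index
            return len(data[:i].decode("utf-8"))
        i = data.find(ord(needle), i + 1)
    return -1

s = unicodedata.normalize("NFD", "mañana")  # 'n' + combining tilde inside
print(find_grapheme_ascii(s, "n"))  # 5: skips the 'n' that underlies 'ñ'
```

The fast path never decodes a single code point unless a candidate byte is found, which is the performance win being argued for.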
This penalty with generic algorithms comes from the fact that they take a predicate of the form "a == 'a'" or "a == b", which is ill-suited for strings because you always need to fully decode the string (by dchar or by graphemes) for the purpose of calling the predicate. Given that comparing characters for something else than equality or them being part of a set is very rarely something you do, generic algorithms miss a big optimization opportunity here. - - - So here's what I think we should do: Todo 1: disallow generic algorithms on naked strings: string-specific Unicode-aware algorithms should be used instead; they can share the same name if their usage is similar Todo 2: to use a generic algorithm with a string, you must dress the string using one of toDchar, toGrapheme, toCodeUnits; this way your intentions are clear Todo 3: string-specific algorithms can be implemented as simple wrappers for generic algorithms with the string dressed correctly for the task, or they can implement more sophisticated algorithms to increase performance There are two major benefits to this approach: Benefit 1: if indeed you really don't want the performance penalty that comes with checking for composed graphemes, you can bypass it at some specific places in your code using byDchar, or you can disable it altogether by modifying the string-specific algorithms and recompiling Phobos. Benefit 2: we don't have to rush to implement graphemes in the Unicode-aware algorithms. Just make sure the interface for string-specific algorithms *can* accept graphemes, and we can roll out support for them at a later time once we have a decent implementation. Also, all this is leaving the question open as to what to do when someone uses the string as a range. In my opinion, it should either iterate on code units (because the string is actually an array, and because that's what foreach does) or simply disallow iteration (asking that you dress the string first using toCodeUnit, toDchar, or toGrapheme).
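The "dress the string first" proposal amounts to forcing callers to pick an iteration view explicitly. A minimal sketch of such an interface follows; the names `by_code_units`, `by_code_points`, and `by_graphemes` are hypothetical stand-ins for the proposed toCodeUnits/toDchar/toGrapheme, Python is used for runnability, and the grapheme view only handles combining marks, not full UAX #29 clusters:

```python
import unicodedata

def by_code_units(s: str):
    yield from s.encode("utf-8")  # UTF-8 code units (bytes)

def by_code_points(s: str):
    yield from s                  # Python iterates code points natively

def by_graphemes(s: str):
    # group each base code point with the combining marks that follow it
    cluster = ""
    for c in s:
        if cluster and not unicodedata.combining(c):
            yield cluster
            cluster = ""
        cluster += c
    if cluster:
        yield cluster

s = unicodedata.normalize("NFD", "eé")  # 'e', then 'e' + combining U+0301
print(list(by_graphemes(s)))  # two clusters: 'e' and 'e'+U+0301
```

With the view explicit at the call site, a generic algorithm never has to guess which element type the caller meant, which is the main point of the Todo 2 design above.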
Do you like that more? -- Michel Fortin michel.fortin michelf.com http://michelf.com/On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I agree. Now let me ask again: how many people really should care?On 1/15/11 10:45 PM, Michel Fortin wrote:I think the only people who should *not* care are those who have validated that the input does not contain any combining code point. If you know the input *can't* contain combining code points, then it's safe to ignore them.No doubt it's easier to implement it that way. The problem is that in most cases it won't be used. How many people really know what is a grapheme?How many people really should care?
Jan 17 2011
On 1/17/11 10:34 AM, Michel Fortin wrote:On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:The question (which I see you keep on dodging :o)) is how much text contains combining code points. I have worked in NLP for years, and still do. I even worked on Arabic text (albeit Romanized). I work with Wikipedia. I use Unicode all the time, but I have yet to have trouble with a combining character. I was just vaguely aware of their existence up until this discussion, but just waved it away and guess what - it worked for me. It does not serve us well to rigidly claim that the only good way of doing anything Unicode is to care about graphemes. Even NSString exposes the UTF16 underlying encoding and provides dedicated functions for grapheme-based processing. For one thing, if you care about the width of a word in printed text (one of the case where graphemes are important), you need font information. And - surprise! - some fonts do NOT support combining characters and print signs next to one another instead of juxtaposing them, so the "wrong" method of counting characters is more informative.On 1/16/11 3:20 PM, Michel Fortin wrote:As I said: all those people who are not validating the inputs to make sure they don't contain combining code points.On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I agree. Now let me ask again: how many people really should care?On 1/15/11 10:45 PM, Michel Fortin wrote:I think the only people who should *not* care are those who have validated that the input does not contain any combining code point. If you know the input *can't* contain combining code points, then it's safe to ignore them.No doubt it's easier to implement it that way. The problem is that in most cases it won't be used. 
How many people really know what is a grapheme?How many people really should care?As far as I know, no one is doing that, so that means everybody should use algorithms capable of handling multi-code-point graphemes. If someone indeed is doing this validation, he'll probably also be smart enough to make his algorithms work with dchars.I am not sure everybody should use graphemes.That said, no one should really have to care but those who implement the string manipulation functions. The idea behind making the grapheme the element type is to make it easier to write grapheme-aware string manipulation functions, even if you don't know about graphemes. But the reality is probably more mixed than that.The reality is indeed more mixed. Inevitably at some point the API needs to answer the question: "what is the first character of this string?" Transparency is not possible. You break all string code out there.- - - I gave some thought about all this, and came to an interesting realization that made me refine the proposal. The new proposal is disruptive perhaps as much as the first, but in a different way. But first, let's state a few facts to reframe the current discussion: Fact 1: most people don't know Unicode very well Fact 2: most people are confused by code units, code points, graphemes, and what is a 'character' Fact 3: most people won't bother with all this, they'll just use the basic language facilities and assume everything works correctly if it works correctly for themNice :o).Now, let's define two goals: Goal 1: make most people's string operations work correctly Goal 2: make most people's string operations work fastGoal 3: don't break all existing code Goal 4: make most people's string-based code easy to write and understandTo me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not sure we agree on this, but let's continue.I think we disagree about what "most" means.
For you it means "people who don't understand Unicode well but deal with combining characters anyway". For me it's "the largest percentage of D users across various writing systems".From the above 3 facts, we can deduce that a user won't want to bother to using byDchar, byGrapheme, or byWhatever when using algorithms. You were annoyed by having to write byDchar everywhere, so changed the element type to always be dchar and you don't have to write byDchar anymore. That's understandable and perfectly reasonable. The problem is of course that it doesn't give you correct results. Most of the time what you really want is to use graphemes, dchar just happen to be a good approximation of that that works most of the time.Again, it's a matter of tradeoffs. I chose dchar because char was plain _wrong_ most of the time, not because char was a pretty darn good approximation that worked for most people most of the time. The fact remains that dchar _is_ a pretty darn good approximation that also has pretty good darn speed. So I'd say that I _still_ want to use dchar most of the time. Committing to graphemes would complicate APIs for _everyone_ and would make things slower for _everyone_ for the sake of combining characters that _never_ occur in _most_ people's text. This is bad design, pure and simple. A good design is to cater for the majority and provide dedicated APIs for the few.Iterating by grapheme is somewhat problematic, and it degrades performance.Yes.Same for comparing graphemes for normalized equivalence.Yes, although I think you can optimize code such that comparing two strings wholesale only has a few more comparisons on the critical path. That would be still slower, but not as slow as iterating by grapheme in a naive implementation.That's all true. I'm not too sure what we can do about that. 
It can be optimized, but it's very understandable that some people won't be satisfied by the performance and will want to avoid graphemes.I agree.Speaking of optimization, I do understand that iterating by grapheme using the range interface won't give you the best performance. It's certainly convenient as it enables the reuse of existing algorithms with graphemes, but more specialized algorithms and interfaces might be more suited.Even the specialized algorithms will be significantly slower.One observation I made with having dchar as the default element type is that not all algorithms really need to deal with dchar. If I'm searching for code point 'a' in a UTF-8 string, decoding code units into code points is a waste of time. Why? because the only way to represent code point 'a' is by having code point 'a'.Right. That's why many algorithms in std are specialized for such cases.And guess what? The almost same optimization can apply to graphemes: if you're searching for 'a' in a grapheme-aware manner in a UTF-8 string, all you have to do is search for the UTF-8 code unit 'a', then check if the 'a' code unit is followed by a combining mark code point to confirm it is really a 'a', not a composed grapheme. Iterating the string by code unit is enough for these cases, and it'd increase performance by a lot.Unfortunately it all breaks as soon as you go beyond one code point. You can't search efficiently, you can't compare efficiently. Boyer-Moore and friends are out. I'm not saying that we shouldn't implement the correct operations! I'm just not convinced they should be the default.So making dchar the default type is no doubt convenient because it abstracts things enough so that generic algorithms can work with strings, but it has a performance penalty that you don't always need. I made an example using UTF-8, it applies even more to UTF-16. And it applies to grapheme-aware manipulations too.It is true that UTF manipulation incurs overhead. 
The tradeoff has many dimensions: UTF-16 is bulkier and less cache friendly, ASCII is not sufficient for most people, the UTF decoding overhead is not that high... it's difficult to find the sweetest spot.This penalty with generic algorithms comes from the fact that they take a predicate of the form "a == 'a'" or "a == b", which is ill-suited for strings because you always need to fully decode the string (by dchar or by graphemes) for the purpose of calling the predicate. Given that comparing characters for something else than equality or them being part of a set is very rarely something you do, generic algorithms miss a big optimization opportunity here.How can we improve that? You can't argue for an inefficient scheme just because what we have isn't as efficient as it could possibly be.- - - So here's what I think we should do: Todo 1: disallow generic algorithms on naked strings: string-specific Unicode-aware algorithms should be used instead; they can share the same name if their usage is similarI don't understand this. We already do this, and by "Unicode-aware" we understand using dchar throughout. This is transparent to client code.Todo 2: to use a generic algorithm with a string, you must dress the string using one of toDchar, toGrapheme, toCodeUnits; this way your intentions are clearBreaks a lot of existing code. Won't fly with Walter unless it solves much more than this. One thing I would change about the built-in strings is that they implicitly are two things at the same time. Asking for representation should be explicit.Todo 3: string-specific algorithms can be implemented as simple wrappers for generic algorithms with the string dressed correctly for the task, or they can implement more sophisticated algorithms to increase performanceOne thing I like about the current scheme is that all bidirectional-range algorithms work out of the box with all strings, and lend themselves to optimization whenever you want to. This will have trouble passing Walter's wanking test.
Mine too; every time I need to write a bunch of forwarding functions, that's a signal something went wrong somewhere. Remember MFC? :o)There's two major benefits to this approach: Benefit 1: if indeed you really don't want the performance penalty that comes with checking for composed graphemes, you can bypass it at some specific places in your code using byDchar, or you can disable it altogether by modifying the string-specific algorithms and recompiling Phobos. Benefit 2: we don't have to rush to implementing graphemes in the Unicode-aware algorithms. Just make sure the interface for string-specific algorithms *can* accept graphemes, and we can roll out support for them at a later time once we have a decent implementation.I'm not seeing the drawbacks. Hurts everyone for the sake of a few, breaks existent code, makes all string processing a mess, would-be users will throw their hands in the air seeing the simplest examples, but we'll have the satisfaction of high-five-ing one another telling ourselves that we did the right thing.Also, all this is leaving the question open as to what to do when someone uses the string as a range. In my opinion, it should either iterate on code units (because the string is actually an array, and because that's what foreach does) or simply disallow iteration (asking that you dress the string first using toCodeUnit, toDchar, or toGrapheme). Do you like that more?This is not about liking. I like doing the right thing as much as you do, and I think Phobos shows that. Clearly doing the right thing through and through is handling combining characters appropriately. The problem is keeping all desiderata in careful balance. Andrei
Jan 17 2011
On 2011-01-17 12:33:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 1/17/11 10:34 AM, Michel Fortin wrote:Not much, right now. The problem is that the answer to this question is likely to change as Unicode support improves in operating system and applications. Shouldn't we future-proof Phobos?As I said: all those people who are not validating the inputs to make sure they don't contain combining code points.The question (which I see you keep on dodging :o)) is how much text contains combining code points.It does not serve us well to rigidly claim that the only good way of doing anything Unicode is to care about graphemes.For the time being we can probably afford it.Even NSString exposes the UTF16 underlying encoding and provides dedicated functions for grapheme-based processing. For one thing, if you care about the width of a word in printed text (one of the case where graphemes are important), you need font information. And - surprise! - some fonts do NOT support combining characters and print signs next to one another instead of juxtaposing them, so the "wrong" method of counting characters is more informative.Generally what OS X does in those case is that it displays that character in another font. That said, counting grapheme is never a good way to tell how much space some text will take (unless the application enforces a fixed width per grapheme). It's more useful for telling the number of character in a text document, similar to a word count.I'm not sure what you mean by that.That said, no one should really have to care but those who implement the string manipulation functions. The idea behind making the grapheme the element type is to make it easier to write grapheme-aware string manipulation functions, even if you don't know about graphemes. But the reality is probably more mixed than that.The reality is indeed more mixed. Inevitably at some point the API needs to answer the question: "what is the first character of this string?" 
Transparency is not possible. You break all string code out there.Those are worthy goals too.- - - I gave some thought about all this, and came to an interesting realizations that made me refine the proposal. The new proposal is disruptive perhaps as much as the first, but in a different way. But first, let's state a few facts to reframe the current discussion: Fact 1: most people don't know Unicode very well Fact 2: most people are confused by code units, code points, graphemes, and what is a 'character' Fact 3: most people won't bother with all this, they'll just use the basic language facilities and assume everything work correctly if it it works correctly for themNice :o).Now, let's define two goals: Goal 1: make most people's string operations work correctly Goal 2: make most people's string operations work fastGoal 3: don't break all existing code Goal 4: make most people's string-based code easy to write and understandIt's not just D users, it's also for the users of programs written by D users. I can't count how many times I've seen accented character mishandled on websites and elsewhere, and I probably have an aversion about doing the same thing to people of other cultures and languages. If the operating system supports combining marks, users have an expectations that applications running on it will deal with them correctly too, and they'll (rightfully) blame your application if it doesn't work. Same for websites. I understand that in some situations you don't want to deal with graphemes even if you theoretically should, but I don't think it should be the default.To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not sure we agree on this, but let's continue.I think we disagree about what "most" means. For you it means "people who don't understand Unicode well but deal with combining characters anyway". For me it's "the largest percentage of D users across various writing systems".Ok. 
Say you were searching for the needle "toil" in an UTF-8 haystack, I see two way to extend the optimization described above: 1. search for the easy part "toil", then check its surrounding graphemes to confirm it's really "toil" 2. search for a code point matching '' or 'e', then confirm that the code points following it form the right graphemes. Implementing the second one can be done by converting the needle to a regular expression operating at code-unit level. With that you can search efficiently for the needle directly in code units without having to decode and/or normalize the whole haystack.One observation I made with having dchar as the default element type is that not all algorithms really need to deal with dchar. If I'm searching for code point 'a' in a UTF-8 string, decoding code units into code points is a waste of time. Why? because the only way to represent code point 'a' is by having code point 'a'.Right. That's why many algorithms in std are specialized for such cases.And guess what? The almost same optimization can apply to graphemes: if you're searching for 'a' in a grapheme-aware manner in a UTF-8 string, all you have to do is search for the UTF-8 code unit 'a', then check if the 'a' code unit is followed by a combining mark code point to confirm it is really a 'a', not a composed grapheme. Iterating the string by code unit is enough for these cases, and it'd increase performance by a lot.Unfortunately it all breaks as soon as you go beyond one code point. You can't search efficiently, you can't compare efficiently. Boyer-Moore and friends are out.You ask what's inefficient about generic algorithms having customizable predicates? You can't implement the above optimization if you can't guaranty the predicate is "==". That said, perhaps we can detect "==" and only apply the optimization then. Being able to specify the predicate doesn't gain you much for strings, because a < 'a' doesn't make much sense. 
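Strategy 1 above — search for the easy part, then confirm the surrounding graphemes — can be sketched as follows (Python for concreteness; a real implementation would use full UAX #29 grapheme segmentation rather than just combining-class checks):

```python
import unicodedata

def grapheme_find(haystack: str, needle: str) -> int:
    """Find `needle` as whole graphemes: locate a code-point match,
    then reject it if it starts mid-grapheme (its first character is a
    combining mark attaching to an earlier base) or is extended by
    combining marks that follow it in the haystack."""
    start = 0
    while True:
        i = haystack.find(needle, start)
        if i == -1:
            return -1
        j = i + len(needle)
        starts_ok = i == 0 or unicodedata.combining(haystack[i]) == 0
        ends_ok = j >= len(haystack) or unicodedata.combining(haystack[j]) == 0
        if starts_ok and ends_ok:
            return i
        start = i + 1
```

The cheap code-point search does the bulk of the work; the grapheme boundary check only runs on candidate hits, so the whole haystack is never normalized or segmented.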
All you need to check for is equality with some value or membership of given character set, both of which can use the optimization above.This penalty with generic algorithms comes from the fact that they take a predicate of the form "a == 'a'" or "a == b", which is ill-suited for strings because you always need to fully decode the string (by dchar or by graphemes) for the purpose of calling the predicate. Given that comparing characters for something else than equality or them being part of a set is very rarely something you do, generic algorithms miss a big optimization opportunity here.How can we improve that? You can't argue for an inefficient scheme just because what we have isn't as efficient as it could possibly be.That's probably because you haven't understood the intent (I might not have made it very clear either). The problem I see currently is that you rely on dchar being the element type. That should be an implementation detail, not something client code can see or rely on. By making it an implementation detail, you can later make grapheme-aware algorithms the default without changing the API. Since you're the gatekeeper to Phobos, you can make this change conditional to getting an acceptable level of performance out of the grapheme-aware algorithms, or on other factors like the amount of combining characters you encounter in the wild in the next few years. So the general string functions would implement your compromise (using dchar) but not commit indefinitely to it. Someone who really want to work in code point can use toDchar, someone who want to deal with graphemes uses toGraphemes, someone who doesn't care won't choose anything and get the default behaviour of compromise. 
All you need to do for this is document it, and try to make sure the string APIs don't force the implementation to work with code points. So here's what I think we should do: Todo 1: disallow generic algorithms on naked strings: string-specific Unicode-aware algorithms should be used instead; they can share the same name if their usage is similar. I don't understand this. We already do this, and by "Unicode-aware" we understand using dchar throughout. This is transparent to client code. No, it doesn't break anything. This is just the continuation of what I tried to explain above: if you want to be sure you're working with graphemes or dchar, say it. Also, it said nothing about iteration or foreach, so I'm not sure why it wouldn't fly with Walter. It can stay as it is, except for one thing: you and Walter should really get on the same wavelength regarding ElementType!(char[]) and foreach(c; string). I don't care that much which is the default, but they absolutely need to be the same. Todo 2: to use a generic algorithm with a string, you must dress the string using one of toDchar, toGrapheme, toCodeUnits; this way your intentions are clear. Breaks a lot of existing code. Won't fly with Walter unless it solves world hunger. Nevertheless I agree that a problem with the built-in strings is that they implicitly are two things at the same time. Asking for representation should be explicit. I like this as the default behaviour too. I think however that you should restrict the algorithms that work out of the box to those which can also work with graphemes. This way you can change the behaviour in the future and support graphemes by a simple upgrade of Phobos. Algorithms that don't work with graphemes would still work with toDchar. So what doesn't work with graphemes? Predicates such as "a < b" for instance. 
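The three dressings map naturally onto three element types. A rough Python analogue (the names mirror the proposed D toCodeUnits/toDchar/toGraphemes; the grapheme clustering here is a simplification of UAX #29):

```python
import unicodedata

def to_code_units(s: str) -> list[int]:
    return list(s.encode("utf-8"))        # like iterating char[]

def to_dchar(s: str) -> list[str]:
    return list(s)                        # like a range of dchar

def to_graphemes(s: str) -> list[str]:
    """A base code point plus any trailing combining marks; real
    grapheme clusters (UAX #29) involve more rules than this."""
    out: list[str] = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out
```

For "e" + U+0301 the three views disagree on length — 3 code units, 2 code points, 1 grapheme — which is exactly why the client should have to say which one it means.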
That's pretty much it. Todo 3: string-specific algorithms can be implemented as simple wrappers for generic algorithms with the string dressed correctly for the task, or they can implement more sophisticated algorithms to increase performance. One thing I like about the current scheme is that all bidirectional-range algorithms work out of the box with all strings, and lend themselves to optimization whenever you want to. This will have trouble passing Walter's wanking test. Mine too; every time I need to write a bunch of forwarding functions, that's a signal something went wrong somewhere. Remember MFC? :o) The idea is that we write the API as it would apply to graphemes, but we implement it using dchar for the time being. Some function signatures might have to differ a bit. Well then, don't you find it balanced enough? I'm not asking that everything be done with graphemes. I'm not even asking that anything be done with graphemes by default. I'm only asking that we keep the API clean enough so we can pass to graphemes by default in the future without having to rewrite all the code everywhere to use byGrapheme. If this isn't the right balance, I don't know what is. -- Michel Fortin michel.fortin michelf.com http://michelf.com/ Do you like that more? This is not about liking. I like doing the right thing as much as you do, and I think Phobos shows that. Clearly doing the right thing through and through is handling combining characters appropriately. The problem is keeping all desiderata in careful balance.
Jan 17 2011
On 1/17/11 2:29 PM, Michel Fortin wrote:The problem I see currently is that you rely on dchar being the element type. That should be an implementation detail, not something client code can see or rely on.But at some point you must be able to talk about individual characters in a text. It can't be something that client code doesn't see!!! SuperDuperText txt; auto c = giveMeTheFirstCharacter(txt); What is the type of c? That is visible to the client! Andrei
Jan 17 2011
On 2011-01-17 15:49:26 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 1/17/11 2:29 PM, Michel Fortin wrote:It seems that it can. NSString only exposes individual UTF-16 code units directly (or semi-directly via an accessor method), even though searching and comparing is grapheme-aware. I'm not saying it's a good design, but it certainly can work in practice. In any case, I didn't mean to say the client code shouldn't be aware of the characters in a string. I meant that the client shouldn't assume the algorithm works at the same layer as ElementType!(string) for a given string type. Even if ElementType!(string) is dchar, the default function you get if you don't use any of toCodeUnit, toDchar, or toGrapheme can work at the dchar or grapheme level if it makes more sense that way. In other words, the client says: "I have two strings, compare them!" The client didn't specify if they should be compared by char, wchar, dchar, or by normalized grapheme; so we do what's sensible. That's what I call the 'default' string functions, those you get when you don't ask for anything specific. They should have a signature making them able to work at the grapheme level, even though they might not for practical reasons (performance). This way if it becomes more important or practical to support graphemes, it's easy to evolve to them.The problem I see currently is that you rely on dchar being the element type. That should be an implementation detail, not something client code can see or rely on.But at some point you must be able to talk about individual characters in a text. It can't be something that client code doesn't see!!!SuperDuperText txt; auto c = giveMeTheFirstCharacter(txt); What is the type of c? That is visible to the client!That depends on how you implement the giveMeTheFirstCharacter function. :-) More seriously, you have four choices: 1. code unit 2. code point 3. grapheme 4. 
require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation. You and Walter can't come to understand each other between 1 and 2, regarding foreach and ranges. To keep things consistent with what I said above I'd tend to say 4, but that's weird for something that looks like an array. My second choice goes for 1 when it comes to consistency, and 3 when it comes to correctness, and 2 when it comes to being practical. Given something is going to be inconsistent either way, I'd say any of the above is acceptable. But please make sure you and Walter agree on the default element type for ranges and foreach. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
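The first three choices give three different answers — and three different types — for the same text. Illustrated in Python with a simplified grapheme rule (the function names are made up for the example):

```python
import unicodedata

s = "e\u0301tude"  # starts with 'e' + combining acute accent

def first_code_unit(text: str) -> int:
    return text.encode("utf-8")[0]        # choice 1: an integer byte

def first_code_point(text: str) -> str:
    return text[0]                        # choice 2: one code point

def first_grapheme(text: str) -> str:
    # choice 3: base plus trailing combining marks (simplified rule)
    n = 1
    while n < len(text) and unicodedata.combining(text[n]):
        n += 1
    return text[:n]
```

Here choice 1 yields the byte 0x65, choice 2 a lone 'e' with its accent stripped, and only choice 3 returns what a reader would call the first character.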
Jan 17 2011
On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> said:More seriously, you have four choice: 1. code unit 2. code point 3. grapheme 4. require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation.This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
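A sketch of that dual-level interface (Python over raw UTF-8 bytes; front_unit/pop_front_unit are snake-cased versions of Fortin's proposed names, and the cursor assumes it always sits on a code-point boundary):

```python
class Utf8Cursor:
    """One cursor that can advance by a single code unit (cheap, for
    testing or skipping ASCII) or by a whole code point, interleaved
    freely -- the property a separate toDchar slice would lose."""
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0

    def empty(self) -> bool:
        return self.pos >= len(self.data)

    def front_unit(self) -> int:          # like frontUnit
        return self.data[self.pos]

    def pop_front_unit(self) -> None:     # like popFrontUnit
        self.pos += 1

    def _len(self) -> int:
        # code point length from its lead byte (assumes a boundary)
        b = self.data[self.pos]
        return 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4

    def front(self) -> str:               # like front: one code point
        return self.data[self.pos:self.pos + self._len()].decode("utf-8")

    def pop_front(self) -> None:          # like popFront
        self.pos += self._len()
```

An XML-ish scanner can then compare front_unit() against ord('<') without decoding anything, and fall back to front() only inside text content.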
Jan 17 2011
On 1/17/11 9:48 PM, Michel Fortin wrote:On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> said:Very insightful. Thanks for sharing. Code it up and make a solid proposal! AndreiMore seriously, you have four choice: 1. code unit 2. code point 3. grapheme 4. require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation.This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.
Jan 17 2011
On 18/01/11 16:46, Andrei Alexandrescu wrote:On 1/17/11 9:48 PM, Michel Fortin wrote:How does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> said:Very insightful. Thanks for sharing. Code it up and make a solid proposal! AndreiMore seriously, you have four choice: 1. code unit 2. code point 3. grapheme 4. require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation.This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.
Jan 17 2011
On 1/18/11 1:58 AM, Steven Wawryk wrote:On 18/01/11 16:46, Andrei Alexandrescu wrote:There's no string, only range... AndreiOn 1/17/11 9:48 PM, Michel Fortin wrote:How does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> said:Very insightful. Thanks for sharing. Code it up and make a solid proposal! AndreiMore seriously, you have four choice: 1. code unit 2. code point 3. grapheme 4. require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation.This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.
Jan 18 2011
On 19/01/11 02:40, Andrei Alexandrescu wrote:On 1/18/11 1:58 AM, Steven Wawryk wrote:Which is exactly what I asked you about. I understand that you must be very busy, but how do I get you to look at the actual technical content of something? Is there something in the way I phrase things that makes you dismiss my introductory motivation without looking into the content? I don't mean this as a criticism. I really want to know because I'm considering a proposal on a different topic but wasn't sure it's worth it as there seems to be a barrier to getting things considered.On 18/01/11 16:46, Andrei Alexandrescu wrote:There's no string, only range...On 1/17/11 9:48 PM, Michel Fortin wrote:How does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.Very insightful. Thanks for sharing. Code it up and make a solid proposal! Andrei
Jan 18 2011
On 1/18/11 6:00 PM, Steven Wawryk wrote:On 19/01/11 02:40, Andrei Alexandrescu wrote:One simple fact is that I'm not the only person who needs to look at a design. If you want to propose something for inclusion in Phobos, please put the code in good shape, document it properly, and make a submission in this newsgroup following the Boost model. I get one vote and everyone else gets a vote. Looking back at our exchanges in search for a perceived dismissive attitude on my part (apologies if it seems that way - it was unintentional), I infer your annoyance stems from my answer to this:On 1/18/11 1:58 AM, Steven Wawryk wrote:Which is exactly what I asked you about. I understand that you must be very busy, But how do I get you to look at the actual technical content of something? Is there something in the way I phrase thing that you dismiss my introductory motivation without looking into the content? I don't mean this as a criticism. I really want to know because I'm considering a proposal on a different topic but wasn't sure it's worth it as there seems to be a barrier to getting things considered.On 18/01/11 16:46, Andrei Alexandrescu wrote:There's no string, only range...On 1/17/11 9:48 PM, Michel Fortin wrote:How does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. 
If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.Very insightful. Thanks for sharing. Code it up and make a solid proposal! AndreiI happen to have discussed at length my beef with Steve's proposal. Now in one sentence you change the proposed design on the fly without fleshing out the consequences, add to it again without substantiation, and presumably expect me to come with a salient analysis of the result. I don't think it's fair to characterize my answer to that as dismissive, nor to pressure me into expanding on it. Finally, let me say again what I already said for a few times: in order to experiment with grapheme-based processing, we need a byGrapheme range. There is no need for a new string class. We need a range over the existing string types. That would allow us to play with graphemes, assess their efficiency and ubiquity, and would ultimately put us in a better position when it comes to deciding whether it makes sense to make grapheme a character type or the default character type. AndreiHow does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?
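A byGrapheme range needs no new string type — just a lazy range over the existing one. A minimal sketch (Python generator; combining marks only, where UAX #29 has more cases) is already enough to make a generic algorithm like reversal behave correctly on composed characters:

```python
import unicodedata

def by_grapheme(s: str):
    """Lazily yield grapheme clusters from an ordinary string: a base
    code point plus any combining marks that follow it (simplified;
    full UAX #29 also covers Hangul jamo, ZWJ sequences, etc.)."""
    i = 0
    while i < len(s):
        j = i + 1
        while j < len(s) and unicodedata.combining(s[j]):
            j += 1
        yield s[i:j]
        i = j
```

Reversing "ba" + U+0301 + "c" grapheme-by-grapheme keeps the accent attached to its base, whereas reversing code points moves the accent onto the wrong letter — a quick way to assess whether grapheme-level defaults pay their way.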
Jan 18 2011
On 19/01/11 11:37, Andrei Alexandrescu wrote:On 1/18/11 6:00 PM, Steven Wawryk wrote:Ok, thanks for this suggestion. But if developing a proposal as concrete code is a lot of work that may be rejected, is there a way to sound out the idea first before deciding to commit to developing it?Which is exactly what I asked you about. I understand that you must be very busy, But how do I get you to look at the actual technical content of something? Is there something in the way I phrase thing that you dismiss my introductory motivation without looking into the content? I don't mean this as a criticism. I really want to know because I'm considering a proposal on a different topic but wasn't sure it's worth it as there seems to be a barrier to getting things considered.One simple fact is that I'm not the only person who needs to look at a design. If you want to propose something for inclusion in Phobos, please put the code in good shape, document it properly, and make a submission in this newsgroup following the Boost model. I get one vote and everyone else gets a vote.Looking back at our exchanges in search for a perceived dismissive attitude on my part (apologies if it seems that way - it was unintentional), I infer your annoyance stems from my answer to this:No, this was just a summary. Here is the post that you answered dismissively: news://news.digitalmars.com:119/ih030g$1ok1$1 digitalmars.comHow does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?In the interest of moving this on, would it become acceptable to you if: 1. indexing and slicing of the code-point range were removed? 2. any additional ranges are exposed to the user according to decisions made about graphemes, etc? 3. other constructive criticisms were accommodated? Steve On 15/01/11 03:33, Andrei Alexandrescu wrote:On 1/14/11 5:06 AM, Steven Schveighoffer wrote:I respectfully disagree. 
A stream built on fixed-sized units, but with variable length elements, where you can determine the start of an element in O(1) time given a random index absolutely provides random-access. It just doesn't provide length.I equally respectfully disagree. I think random access is defined as accessing the ith element in O(1) time. That's not the case here. AndreiI happen to have discussed at length my beef with Steve's proposal. Now in one sentence you change the proposed design on the fly without fleshing out the consequences, add to it again without substantiation, and presumably expect me to come with a salient analysis of the result. I don't think it's fair to characterize my answer to that as dismissive, nor to pressure me into expanding on it.Sorry, I could have given more context. But you didn't discuss what I asked, based on the observation that your detailed criticisms of Steve's proposal all related to a single aspect of it. Steve
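The O(1) claim rests on UTF-8 being self-synchronizing: continuation bytes are tagged 0b10xxxxxx, so from any byte index the start of the enclosing code point is at most three bytes back. A short Python sketch of that boundary search:

```python
def code_point_start(data: bytes, i: int) -> int:
    """Given any byte index into valid UTF-8, back up to the first
    byte of the enclosing code point. Continuation bytes match
    0b10xxxxxx, so this loop runs at most three times -- O(1)."""
    while data[i] & 0xC0 == 0x80:   # continuation byte?
        i -= 1
    return i
```

This finds the start of *some* element in constant time from a random index; what it cannot do is find the i-th element in constant time, which is the definition the reply insists on.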
Jan 18 2011
On 1/18/11 7:48 PM, Steven Wawryk wrote:On 19/01/11 11:37, Andrei Alexandrescu wrote:This is the best place as far as I know.On 1/18/11 6:00 PM, Steven Wawryk wrote:Ok, thanks for this suggestion. But if developing a proposal as concrete code is a lot of work that may be rejected, is there a way to sound out the idea first before deciding to commit to developing it?Which is exactly what I asked you about. I understand that you must be very busy, But how do I get you to look at the actual technical content of something? Is there something in the way I phrase thing that you dismiss my introductory motivation without looking into the content? I don't mean this as a criticism. I really want to know because I'm considering a proposal on a different topic but wasn't sure it's worth it as there seems to be a barrier to getting things considered.One simple fact is that I'm not the only person who needs to look at a design. If you want to propose something for inclusion in Phobos, please put the code in good shape, document it properly, and make a submission in this newsgroup following the Boost model. I get one vote and everyone else gets a vote.My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a response. If you found that dismissive, I'd be hard pressed to improve it. To quote myself:Looking back at our exchanges in search for a perceived dismissive attitude on my part (apologies if it seems that way - it was unintentional), I infer your annoyance stems from my answer to this:No, this was just a summary. Here is the post that you answered dismissively: news://news.digitalmars.com:119/ih030g$1ok1$1 digitalmars.comHow does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?I believe the proposed scheme: 1. Changes the language in a major way; 2. Is highly disruptive; 3. Improves the status quo in only minor ways. I'd be much more willing to improve things by e.g. 
defining the representation() function I talked about a bit ago, and other less disruptive additions.That took into consideration your amendments.> > In the interest of moving this on, would it become acceptable to you if: > > 1. indexing and slicing of the code-point range were removed? > 2. any additional ranges are exposed to the user according to decisions > made about graphemes, etc? > 3. other constructive criticisms were accommodated? > > Steve > > > On 15/01/11 03:33, Andrei Alexandrescu wrote: >> On 1/14/11 5:06 AM, Steven Schveighoffer wrote: >>> I respectfully disagree. A stream built on fixed-sized units, but with >>> variable length elements, where you can determine the start of an >>> element in O(1) time given a random index absolutely provides >>> random-access. It just doesn't provide length. >> >> I equally respectfully disagree. I think random access is defined as >> accessing the ith element in O(1) time. That's not the case here. >> >> Andrei >I really don't know what to add to make my answer more meaningful. AndreiI happen to have discussed at length my beef with Steve's proposal. Now in one sentence you change the proposed design on the fly without fleshing out the consequences, add to it again without substantiation, and presumably expect me to come with a salient analysis of the result. I don't think it's fair to characterize my answer to that as dismissive, nor to pressure me into expanding on it.Sorry, I could have given more context. But you didn't discuss what I asked, based on the observation that your detailed criticisms of Steve's proposal all related to a single aspect of it.
Jan 18 2011
On 19/01/11 13:53, Andrei Alexandrescu wrote:My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a response. If you found that dismissive, I'd be hard pressed to improve it. To quote myself:I don't think that it did. I proposed no language change, nor anything disruptive. The change in status quo I proposed was essentially the same one you encouraged here, about a type that gives the user the choice of what kind of range to be operated on. It appears to me that you were responding to some perception you had about Steve's full proposal (that may have been triggered by something I said in the introduction), not what I actually said in the content. So, I would still be interested to know how to sound out this newsgroup with an idea (before coding commitment) and have the suggestions considered on something more than a superficial level. Is the newsgroup too busy? Should there be people nominated to screen ideas that are worth looking at? Should I use a completely different approach? Your suggestions so far I will take into account, but it still looks like there's a barrier to me.I believe the proposed scheme: 1. Changes the language in a major way; 2. Is highly disruptive; 3. Improves the status quo in only minor ways. I'd be much more willing to improve things by e.g. defining the representation() function I talked about a bit ago, and other less disruptive additions.That took into consideration your amendments.Sorry, I could have given more context. But you didn't discuss what I asked, based on the observation that your detailed criticisms of Steve's proposal all related to a single aspect of it.I really don't know what to add to make my answer more meaningful. Andrei
Jan 18 2011
On 1/18/11 9:46 PM, Steven Wawryk wrote:On 19/01/11 13:53, Andrei Alexandrescu wrote:Adding a new string type would be disruptive. Unless I misunderstood, there is still a new string type in Steve's proposal, and one that would be the default one, even after the amendments you mentioned. That is a problem because people write this: auto s = "hello"; and the question is, what is the type of s. The change in status quo I proposed was essentially the sameMy response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a response. If you found that dismissive, I'd be hard pressed to improve it. To quote myself:I don't think that it did. I proposed no language change, nor anything disruptive.I believe the proposed scheme: 1. Changes the language in a major way; 2. Is highly disruptive; 3. Improves the status quo in only minor ways. I'd be much more willing to improve things by e.g. defining the representation() function I talked about a bit ago, and other less disruptive additions.That took into consideration your amendments.one you encouraged here, about a type that gives the user the choice of what kind of range to be operated on. It appears to me that you were responding to some perception you had about Steve's full proposal (that may have been triggered by something I said in the introduction), not what I actually said in the content.If that's what it is, great. To clarify: no new string type, only a range that iterates one grapheme over existing strings.So, I would still be interested to know how to sound out this newsgroup with an idea (before coding commitment) and have the suggestions considered on something more than a superficial level. Is the newsgroup too busy? Should there be people nominated to screen ideas that are worth looking at? Should I use a completely different approach? 
Your suggestions so far I will take into account, but it still looks like there's a barrier to me.My perception is that you want to minimize risks before starting to invest work into this. I'm not sure how you can do that. Andrei
Jan 18 2011
On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 1/17/11 9:48 PM, Michel Fortin wrote:What I use right now is this (see below). I'm not sure what would be a good name for it though. The expectation is that I'll get either an ASCII char or something out of ASCII range if it isn't ASCII. The abstraction doesn't seem very 'solid' to me, in the sense that I can't see how it'd apply to ranges other than strings, so it's only useful for strings (the character array kind), and it's only useful as a workaround since you made ElementType!(char[]) a dchar. Well, any range returning char, dchar, wchar could map frontUnit to front and popFrontUnit to popFront to keep things working, but it makes the optimization rather pointless. I don't really have an idea where to go from here. char frontUnit(string input) { assert(input.length > 0); return input[0]; } wchar frontUnit(wstring input) { assert(input.length > 0); return input[0]; } dchar frontUnit(dstring input) { assert(input.length > 0); return input[0]; } void popFrontUnit(ref string input) { assert(input.length > 0); input = input[1..$]; } void popFrontUnit(ref wstring input) { assert(input.length > 0); input = input[1..$]; } void popFrontUnit(ref dstring input) { assert(input.length > 0); input = input[1..$]; } version (unittest) { import std.string : front, popFront; } unittest { string test = "été"; assert(test.length == 5); string test2 = test; assert(test2.front == 'é'); test2.popFront(); assert(test2.length == 3); // removed "é" which is two UTF-8 code units string test3 = test; assert(test3.frontUnit == "é"c[0]); test3.popFrontUnit(); assert(test3.length == 4); // removed first half of "é", which is one UTF-8 code unit } -- Michel Fortin michel.fortin michelf.com http://michelf.com/On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> said:Very insightful. Thanks for sharing. Code it up and make a solid proposal!More seriously, you have four choices: 1. code unit 2.
code point 3. grapheme 4. require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation.This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.
Jan 18 2011
On 1/18/11 7:17 AM, Michel Fortin wrote:On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:[snip] I was thinking along the lines of: struct Grapheme { private string support_; ... } struct ByGrapheme { private string iteratee_; bool empty(); Grapheme front(); void popFront(); // Additional funs dchar frontCodePoint(); void popFrontCodePoint(); char frontCodeUnit(); void popFrontCodeUnit(); ... } // helper function ByGrapheme byGrapheme(string s); // usage string s = ...; size_t i; foreach (g; byGrapheme(s)) { } We need this range in Phobos. AndreiOn 1/17/11 9:48 PM, Michel Fortin wrote:What I use right now is this (see below). I'm not sure what would be a good name for it though. The expectation is that I'll get either an ASCII char or something out of ASCII range if it isn't ASCII. The abstraction doesn't seem very 'solid' to me, in the sense that I can't see how it'd apply to ranges other than strings, so it's only useful for strings (the character array kind), and it's only useful as a workaround since you made ElementType!(char[]) a dchar. Well, any range returning char,dchar,wchar could map frontUnit to front and popFrontUnit to popFront to keep things working, but it makes the optimization rather pointless. I don't really have an idea where to go from here.On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> said:Very insightful. Thanks for sharing. Code it up and make a solid proposal!More seriously, you have four choice: 1. code unit 2. code point 3. grapheme 4. require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation.This makes me think of what I did with my XML parser after you made code points the element type for strings. 
Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.
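For readers following along: the skeleton above can be fleshed out into a compilable sketch. The boundary rule used below (one base code point plus any trailing combining marks) is a deliberate simplification of full Unicode segmentation (UAX #29), and byGrapheme here is just this thread's proposed helper, not an existing Phobos function at the time of writing.

```d
import std.uni : isMark;   // combining-mark classification
import std.utf : decode;

// Simplified grapheme range: a grapheme is taken to be one base code
// point followed by any combining marks. Real UAX #29 segmentation
// has more rules (Hangul jamo, ZWJ, etc.); this is only a sketch.
struct ByGrapheme
{
    private string iteratee_;

    bool empty() const { return iteratee_.length == 0; }

    string front()   // the grapheme, as a slice of the original string
    {
        return iteratee_[0 .. graphemeEnd()];
    }

    void popFront()
    {
        iteratee_ = iteratee_[graphemeEnd() .. $];
    }

    private size_t graphemeEnd()
    {
        size_t i = 0;
        decode(iteratee_, i);              // skip the base code point
        while (i < iteratee_.length)
        {
            size_t j = i;
            if (!isMark(decode(iteratee_, j)))
                break;                     // next base character: stop
            i = j;                         // combining mark: extend
        }
        return i;
    }
}

ByGrapheme byGrapheme(string s) { return ByGrapheme(s); }

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT is one grapheme
    auto r = byGrapheme("e\u0301x");
    assert(r.front == "e\u0301");   // 3 code units, one grapheme
    r.popFront();
    assert(r.front == "x");
    r.popFront();
    assert(r.empty);
}
```

Note that front returns a slice of the original string, so a multi-code-unit grapheme involves no copying or normalization.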
Jan 18 2011
On 2011-01-18 11:38:45 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 1/18/11 7:17 AM, Michel Fortin wrote:Yes, we need a grapheme range. But that's not what my thing was about. It was about shortcutting code point decoding when it isn't necessary while still keeping the ability to decode to code points when iterating on the same range. For instance, here's a simple made up example: string s = "<hello>"; if (!s.empty && s.frontUnit == '<') s.popFrontUnit(); // skip while (!s.empty && s.frontUnit != '>') s.popFront(); // do something with each code point if (!s.empty && s.frontUnit == '>') s.popFrontUnit(); // skip assert(s.empty); Here, since I know I'm testing and skipping for '<', an ASCII character, decoding the code point is wasted time, so I skip that decoding. The problem is that this optimization can't happen with a range that abstracts things at the code point level. I can do it with strings because strings still allow you to access code units through the indexing operators, but this can't really apply to ranges of code points in general. And parsing with a range of code units would also be a pain, because even if I'm testing for '<' for the first character, sometimes I really need to advance by code point and test for code points. One thing that might be interesting is benchmarking my XML parser by replacing every instance of frontUnit and popFrontUnit with front and popFront. That won't change the results, but it'd give us an idea of the overhead of the unnecessarily decoded code points. -- Michel Fortin michel.fortin michelf.com http://michelf.com/On 1/18/11 7:17 AM, Michel Fortin wrote:On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:[snip] I was thinking along the lines of: struct Grapheme { private string support_; ...
} struct ByGrapheme { private string iteratee_; bool empty(); Grapheme front(); void popFront(); // Additional funs dchar frontCodePoint(); void popFrontCodePoint(); char frontCodeUnit(); void popFrontCodeUnit(); ... } // helper function ByGrapheme byGrapheme(string s); // usage string s = ...; size_t i; foreach (g; byGrapheme(s)) { } We need this range in Phobos.On 1/17/11 9:48 PM, Michel Fortin wrote:What I use right now is this (see below). I'm not sure what would be a good name for it though. The expectation is that I'll get either an ASCII char or something out of ASCII range if it isn't ASCII. The abstraction doesn't seem very 'solid' to me, in the sense that I can't see how it'd apply to ranges other than strings, so it's only useful for strings (the character array kind), and it's only useful as a workaround since you made ElementType!(char[]) a dchar. Well, any range returning char,dchar,wchar could map frontUnit to front and popFrontUnit to popFront to keep things working, but it makes the optimization rather pointless. I don't really have an idea where to go from here.On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> said:Very insightful. Thanks for sharing. Code it up and make a solid proposal!More seriously, you have four choice: 1. code unit 2. code point 3. grapheme 4. require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation.This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. 
If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.
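Michel's measurement idea can be approximated outside the parser with a micro-benchmark along these lines. This is only a sketch: the StopWatch import path shown is the modern std.datetime.stopwatch one (older Phobos versions placed it elsewhere), and a real XML parser would show a smaller gap than this synthetic scan because parsing does more than scanning.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;
import std.utf : decode;

void main()
{
    // Mostly-ASCII input, roughly the shape of XML markup.
    string doc;
    foreach (i; 0 .. 100_000)
        doc ~= "<tag attr='value'>text</tag>";

    // Pass 1: per-code-unit scan (what frontUnit/popFrontUnit amount to).
    auto sw = StopWatch(AutoStart.yes);
    size_t units;
    for (size_t i = 0; i < doc.length; ++i)
        if (doc[i] == '<') ++units;
    auto unitTime = sw.peek();

    // Pass 2: per-code-point scan (what front/popFront amount to);
    // decode() advances the index past each decoded code point.
    sw.reset();
    size_t points;
    for (size_t i = 0; i < doc.length; )
        if (decode(doc, i) == '<') ++points;
    auto pointTime = sw.peek();

    assert(units == points);   // same answer, different cost
    writefln("code units: %s   code points: %s", unitTime, pointTime);
}
```

Both passes count the same '<' characters; only the decoding cost differs, which is exactly the overhead Michel wants to isolate.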
Jan 18 2011
On 01/18/2011 06:14 PM, Michel Fortin wrote: On 2011-01-18 11:38:45 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:This means a single string type that exposes various _synchronous_ range levels (codeunit, codepoint, grapheme), doesn't it? As opposed to Andrei's approach of ranges being structures external to string types, IIUC, which thus move on independently?I was thinking along the lines of: struct Grapheme { private string support_; ... } struct ByGrapheme { private string iteratee_; bool empty(); Grapheme front(); void popFront(); // Additional funs dchar frontCodePoint(); void popFrontCodePoint(); char frontCodeUnit(); void popFrontCodeUnit(); ... } // helper function ByGrapheme byGrapheme(string s); // usage string s = ...; size_t i; foreach (g; byGrapheme(s)) { } We need this range in Phobos.Yes, we need a grapheme range. But that's not what my thing was about. It was about shortcutting code point decoding when it isn't necessary while still keeping the ability to decode to code points when iterating on the same range. For instance, here's a simple made up example: string s = "<hello>"; if (!s.empty && s.frontUnit == '<') s.popFrontUnit(); // skip while (!s.empty && s.frontUnit != '>') s.popFront(); // do something with each code point if (!s.empty && s.frontUnit == '>') s.popFrontUnit(); // skip assert(s.empty); Here, since I know I'm testing and skipping for '<', an ASCII character, decoding the code point is wasted time, so I skip that decoding. The problem is that this optimization can't happen with a range that abstracts things at the code point level. I can do it with strings because strings still allow you to access code units through the indexing operators, but this can't really apply to ranges of code points in general.
And parsing with a range of code units would also be a pain, because even if I'm testing for '<' for the first character, sometimes I really need to advance by code point and test for code points.One thing that might be interesting is benchmarking my XML parser by replacing every instance of frontUnit and popFrontUnit with front and popFront. That won't change the results, but it'd give us an idea of the overhead of the unnecessarily decoded code points.Yes, would you have time to do it? I would be interested in such perf measurements. (--> your idea about a Text variant, for which I would like to know whether it's worth still decoding systematically.) Denis _________________ vita es estrany spir.wikidot.com
Jan 18 2011
On 01/18/2011 04:48 AM, Michel Fortin wrote:On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> said:This looks like a very interesting approach. And clear. I guess range synchronisation would be based on an internal lowest-level (codeunit) index. Then, you also need internal validity-checking and/or offsetting routines when a higher-level range is used after a lower-level one has been used. (I mean eg to ensure start-of-codepoint after a codeunit popFront, or throw an error.) Also, how to avoid duplicating many operational functions (eg find a given slice) for each level? Denis _________________ vita es estrany spir.wikidot.comMore seriously, you have four choices: 1. code unit 2. code point 3. grapheme 4. require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation.This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.
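On spir's validity question: for UTF-8 at least, resynchronizing after a raw code-unit pop is cheap, because continuation bytes are recognizable by their 10xxxxxx bit pattern. The helper name toCodePointStart below is hypothetical, introduced only to sketch the check:

```d
// Hypothetical helper: after advancing by raw code units, move the
// index forward past any UTF-8 continuation bytes (10xxxxxx) so that
// it points at the start of a code point again. A variant could step
// backward instead, or throw if the index is mid-sequence.
size_t toCodePointStart(string s, size_t i)
{
    while (i < s.length && (s[i] & 0xC0) == 0x80)
        ++i;
    return i;
}

void main()
{
    string s = "été";   // 'é' occupies 2 code units in UTF-8
    // index 1 lands in the middle of the first 'é': resync to 't'
    assert(toCodePointStart(s, 1) == 2);
    // index 2 is already a code point start, so nothing to do
    assert(toCodePointStart(s, 2) == 2);
    // index 0 is a lead byte, also already a valid start
    assert(toCodePointStart(s, 0) == 0);
}
```

The same trick works for UTF-16 by testing for trailing surrogates; only UTF-32 needs no resynchronization at all.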
Jan 18 2011
Michel Fortin wrote:On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> said:So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want.That's what I've been thinking. The users can choose whether they want random access or not. A grapheme-aware string can provide random access at a space cost, or no random access for efficient space use. I see 5 layers in string processing. Layers 1 and 2 are currently handled by D, sometimes in an unclear way. e.g. char[] may be used as an array of code units or an array of code points depending on the type of iteration. 1) Code units: This is what D provides with its string types. This layer models RandomAccessRange. 2) Code points: This is what D and Phobos provide for example with foreach(d; stride(s, 1)). dchar[] models RandomAccessRange at this layer; char[] and wchar[] model ForwardRange at this layer. (If I understand it correctly, Steven Schveighoffer is trying to provide a pseudo-RandomAccessRange to char[] and wchar[] with his string type.) 3) Graphemes: This is what the string type that spir is working on provides. There could be at least two types: 3a) RandomAccessGraphemeRange: Has random access but the data type is large. 3b) ForwardGraphemeRange: space-efficient but does not provide random access. I think the programmers would be happy to be able to choose. 4) Letters: Uses either 3a or 3b. This is the layer where the idea of a writing system enters the picture: lower/upper case transformations and sorting happen at this layer. (I have a library that tries to handle this layer but is ignorant of graphemes; I am waiting for spir's string type. ;)) 4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange. 4b) Models ForwardRange if based on a ForwardGraphemeRange. 5) Text: Collection of Letters. This is where a name like "ali & tim" is correctly capitalized as "ALİ & TIM" because the text consists of two separate writing systems.
(The same library that I mentioned in 4 tries to handle this layer as well.) Ali
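Ali's layers 1 and 2 are directly observable in D today, because foreach chooses code-unit or code-point iteration based on the declared element type:

```d
void main()
{
    string s = "exposé";   // 'é' is encoded as 2 UTF-8 code units

    // Layer 1: code units. length counts code units, not characters.
    assert(s.length == 7);
    size_t units;
    foreach (char c; s) ++units;    // iterates raw code units
    assert(units == 7);

    // Layer 2: code points. With a dchar loop variable, foreach
    // decodes UTF-8 on the fly.
    size_t points;
    foreach (dchar c; s) ++points;  // iterates decoded code points
    assert(points == 6);
}
```

This is precisely the ambiguity Ali points out: the same char[] acts as an array of code units or a sequence of code points depending on how you iterate it.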
Jan 18 2011
On 01/19/2011 08:43 AM, Ali Çehreli wrote:Michel Fortin wrote: > On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> > said: > So perhaps the best interface for strings would be to provide multiple > range-like interfaces that you can use at the level you want. That's what I've been thinking. The users can choose whether they want random access or not. A grapheme-aware string can provide random access at a space cost, or no random access for efficient space use. I see 5 layers in string processing. Layers 1 and 2 are currently handled by D, sometimes in an unclear way. e.g. char[] may be used as an array of code units or an array of code points depending on the type of iteration.This is a very good and helpful summary. But you do not list all relevant aspects of the question, I guess. Defining which codes belong to a given grapheme (what I call "piling") is necessary for true O(1) random-access, but not only that. More importantly, all operations involving equality comparison (find, count, replace,...) require normalisation --in addition to piling. A few notes:1) Code units: This is what D provides with its string types. This layer models RandomAccessRangeThis level is a pure implementation artifact that simply cannot make any sense (from user and thus programmer points of view). Any kind of text manipulation (slice, find, replace...) may lead to random incorrectness, except when source texts can be guaranteed to hold plain ASCII (which may be hard to prove). Conversely, pieces of text only passed around by an app do not require any more costly representation, in terms of time (decoding) or space.
In addition, concat works provided all pieces share the same encoding (ASCII being a subset of most historic charsets and of UTF-8).2) Code points: This is what D and Phobos provide for example with foreach(d; stride(s, 1)). dchar[] models RandomAccessRange at this layer; char[] and wchar[] model ForwardRange at this layer. (If I understand it correctly, Steven Schveighoffer is trying to provide a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)This level is also a kind of implementation artifact, compared to historic charsets, but actually based on a real fact of natural languages: they hold composite characters that can thus be coded by combining lower-level codes which represent "scripting marks" (base & combining ones). For this reason, this level can have some sense. My latest guess is that apps that consider text as a study object (read linguistic apps), instead of a means, may regularly need to operate at this level, in addition to the next one. Normalisation can be applied at this level --and is necessary for the above kind of use case. But using it for operations requiring compare will typically also require "piling", that is the next level, if only to determine what is to be compared.3) Graphemes: This is what the string type that spir is working on provides. There could be at least two types:This is the meaningful level for, probably, nearly all applications.3a) RandomAccessGraphemeRange: Has random access but the data type is largeI guess this is Text's approach? Text is "flash fast" indeed for any operation benefiting from random-access. But not only: since it normalises its input, it should be far faster for any operation using compare (rough evaluations suggest a speed ratio of 1 to 2 orders of magnitude). The cost is high in terms of space, which in turn certainly reduces its speed gain in the general case, because of cache (miss) effects.
(Thank you Michel for making this clear.)3b) ForwardGraphemeRange: space-efficient but does not provide random accessIs this what Andrei expects, namely a Grapheme type with a corresponding ByGrapheme iterator IIUC? Time efficiency of operations? 3) metadata RandomAccessGraphemeRange Michel Fortin suggested (off list) an alternative approach to Text: instead of actually "piling" at construction time, just store metadata about grapheme bounds. The core benefit is indeed to keep "normal" text storage (meaning *char[], for modification): would this point please Andrei better? I let you evaluate various consequences of this change (mostly positive, I guess). The same metadata principle could certainly be used for further optimisations, but this is another story. I'm motivated to implement this variant; it looks like the best of both worlds to me. (support welcome ;-)I think the programmers would be happy to be able to choose. 4) Letters: Uses either 3a or 3b. This is the layer where the idea of a writing system enters the picture: lower/upper case transformations and sorting happen at this layer. (I have a library that tries to handle this layer but is ignorant of graphemes; I am waiting for spir's string type. ;)) 4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange 4b) Models ForwardRange if based on a ForwardGraphemeRangeI do not understand what this level means. For me, letters are, precisely, archetypical true characters, meaning level 3. [Note: "grapheme", used by Unicode to denote the common sense of "character", is simply wrong: "sh" and "ti" are graphemes in English (for the same phoneme /ʃ/), not characters; and tab, §, or © are probably not considered graphemes by linguists, while they are characters. This is the reason why I try to avoid this term and use "character", like ICU's doc, to avoid even more confusion.]5) Text: Collection of Letters.
This is where a name like "ali & tim" is correctly capitalized as "ALİ & TIM" because the text consists of two separate writing systems. (The same library that I mentioned in 4 tries to handle this layer as well.)This is an immensely complicated field. Note that it has nothing to do with text & character representation issues: whatever the character set, one has to confront problems like uppercase of 'i', 'ss' vs 'ß', definition of "letter" or "character", matching, sorting order... Text does not even try to address natural language issues. Instead it deals only, but hopefully clearly & correctly, with restoring simple and safe representation for client apps.AliDenis _________________ vita es estrany spir.wikidot.com
Jan 19 2011
On 01/17/2011 05:34 PM, Michel Fortin wrote:As I said: all those people who are not validating the inputs to make sure they don't contain combining code points. As far as I know, no one is doing that, so that means everybody should use algorithms capable of handling multi-code-point graphemes. If someone indeed is doing this validation, he'll probably also be smart enough to make his algorithms to work with dchars.Actually, there are at least 2 special cases: * apps that only deal with pre-unicode source stuff * apps that only deal with source stuff "mechanically" generated by text-producing software which itself guarantees single-code-only graphemes Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
On Sat, 15 Jan 2011 17:45:37 -0500, Michel Fortin <michel.fortin michelf.com> wrote:On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:I didn't read the standard, all I understand about unicode is from this NG ;) What I meant was the ability to do things more than one way seems like a committee-designed standard. Usually with one of those, you have one party who "absolutely needs" one way of doing things (most likely because all their code is based on it), and other parties who want it a different way. When compromises occur, the end result is, you have a standard that's unnecessarily difficult to implement.On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin <michel.fortin michelf.com> wrote:Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now, can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support.On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee-defined standard where there are 10 ways to do everything, and I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).I'm glad we agree on that now.I'm not suggesting we impose it, just that we make it the default.
If you want to iterate by dchar, wchar, or char, just write: foreach (dchar c; "exposé") {} foreach (wchar c; "exposé") {} foreach (char c; "exposé") {} // or foreach (dchar c; "exposé".by!dchar()) {} foreach (wchar c; "exposé".by!wchar()) {} foreach (char c; "exposé".by!char()) {} and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.Indeed, the change would probably be too radical for D2. I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we. The only thing we can change (I hope) at this point is how iterating on strings work.I was hoping to change string literal types. If we don't do that, we have a half-ass solution. I don't think it's going to be impossible, because string, wstring, dstring are all aliases. In fact, with my current proposed type, this already works: mystring s = "hello"; But this doesn't: auto s = "hello"; // still typed as immutable(char)[] This isn't so bad, just require one to specify the type, right? Well, it fails miserably here: foo(mystring s) {...} foo("hello"); // fails to match. 
In order to have a string type, string literals have to be typed as that type.Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the ground that it would silently break D1 compatibility. This is a valid point in my opinion. I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point. That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do.Changing iteration and not indexing is not going to fix the mess we have right now.One more thing: NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operations. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.But is NSString typed the *exact same* as an array, or is it a wrapper for an array? Looking at the docs, it appears it is not.
In all other respects, it should act similar to a string (as you say, printing, upper-casing, comparison, etc.)I can understand the utility of a separate type in your DateTime example, but in this case I fail to see any advantage. I mean, a grapheme is a slice of a string, can have multiple code points (like a string), can be appended the same way as a string, can be composed or decomposed using canonical normalization or compatibility normalization (like a string), and should be sorted, uppercased, and lowercased according to Unicode rules (like a string). Basically, a grapheme is just a string that happens to contain only one grapheme. What would a custom type do differently than a string?Or you could make a grapheme a string_t. ;-)I'm a little uneasy having a range return itself as its element type. For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t. It does seem daunting to have so many types, but at the same time, types convey relationships at compile time that can make coding impossible to get wrong, or make things actually possible when having a single type doesn't. I'll give you an example from a previous life: [...] I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality. Catching bugs during compilation is soooo much better than experiencing them during runtime.Also, grapheme == "a" is easy to understand because both are strings. But if a grapheme is a separate type, what would a grapheme literal look like?A grapheme should be comparable to a string literal. It should be assignable to a string literal. The drawback is we would need a runtime check to ensure the string literal was actually one grapheme. Some compiler help in this regard would be useful, but I'm not sure how the mechanics would work (you couldn't exactly type a literal differently based on its contents). 
Another possibility is to come up with a different syntax to denote grapheme literals.So in the end I don't think a grapheme needs a specific type, at least not for general purpose text processing. If I split a string on whitespace, do I get a range where elements are of type "word"? No, just sliced strings.It is not clear that using a separate type is the "right answer." It may be that an element of a string should be a string. This does work in other languages that don't have a concept of a character. An extra type however, allows us to have more concrete positions to work with.That said, I'm much less concerned by the type used to represent a grapheme than by the Unicode correctness. I'm not opposed to a separate type, I just don't really see the point.I will try to explain better by making an actual candidate type. -Steve
Jan 17 2011
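[The "slice containing the grapheme" default discussed above can be illustrated with a deliberately simplified segmentation rule: a base code point plus any following combining marks form one cluster. This sketch is in Python, using the stdlib unicodedata module because it exposes combining classes directly; it is a simplification of full UAX #29 grapheme segmentation, not the proposed D type.]

```python
import unicodedata

def graphemes(s):
    """Split s into simplified grapheme clusters: a base code point
    followed by any combining marks (combining class != 0).
    This is a simplification of full UAX #29 segmentation."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the previous base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# "expose" + combining acute accent: the final e-acute is stored decomposed
s = "expose\u0301"
print(len(s))               # 7 code points
print(len(graphemes(s)))    # 6 user-perceived characters
print(graphemes(s)[-1])     # the last cluster holds 2 code points
```

Note how the last "character" is two code points wide even though a user sees one letter; that mismatch is exactly what the grapheme-based default is meant to hide.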
On 01/15/2011 11:45 PM, Michel Fortin wrote:That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.I agree with you about pre-composed characters: they bring no real gain (even for easing passage from historic charsets, since texts must be decoded anyway, then mapping to single or multiple codes is nothing). But they add complication to the design in proposing two parallel representation schemes (one character <--> one "code pile" versus one character <--> one precomposed code). And impose much weight on the back of software (and programmers) relative to correct indexing/slicing and comparison, search, count, etc. Where normalisation forms enter the game. My choice would be: * decomposed form only * ordering imposed by the standard at text-composition time ==> no normalisation because everything is normalised from scratch. Remains only what I call "piling". But we cannot easily get rid of it -- without separators in standard UTF encodings. I had the idea of UTF-33 ;-): an alternative freely agreed-upon encoding that just says (in addition to UTF-32) that the content is already normalised (NFD decomposed and ordered): either so produced initially or already processed. So that software can happily read texts in and only think about piling if needed. UTF-33+ would add "grapheme" separators (a costly solution in terms of space) to get rid of piling. The aim indeed being to avoid stupidly doing the same job multiple times on the same text. Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
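[Denis's "decomposed form only, ordering imposed at text-composition time" is essentially what NFD already produces: full decomposition plus canonical reordering of marks. A small Python illustration, using the stdlib unicodedata module as a convenient exposure of the Unicode algorithms:]

```python
import unicodedata

# NFD fully decomposes: the precomposed e-acute becomes base + mark,
# which is the "normalised from scratch" form of the hypothetical UTF-33.
decomposed = unicodedata.normalize("NFD", "\u00e9")     # é
print([hex(ord(c)) for c in decomposed])                # ['0x65', '0x301']

# Canonical ordering: marks with different combining classes are put
# into a unique order, so equivalent inputs end up byte-identical.
a = "q\u0323\u0307"   # q + dot below (class 220) + dot above (class 230)
b = "q\u0307\u0323"   # same marks, typed in the other order
print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))
```

The second pair is the classic UAX #15 example: after canonical ordering, both spellings normalize to q + dot-below + dot-above.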
Thanks to all that has contributed, I am also following this thread with great interest. :) Michel Fortin wrote:I mean, a grapheme is a slice of a string, can have multiple code points (like a string), can be appended the same way as a string, can be composed or decomposed using canonical normalization or compatibility normalization (like a string), and should be sorted, uppercased, and lowercased according to Unicode rules (like a string). Basically, a grapheme is just a string that happens to contain only one grapheme.I would like to stress the fact that Unicode knows nothing about sorting, uppercasing, or lowercasing. Those operations are tied to the alphabet (or writing system) that a certain grapheme happens to belong to at a given time. For example, we cannot uppercase the letter i without knowing what alphabet we are dealing with. Two possibilities: I and İ (I dot above). It is the same issue with sorting. Ali
Jan 17 2011
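[Ali's point about the Turkish dotted and dotless I is easy to demonstrate: locale-unaware case mapping gives the default Unicode answer, which is wrong for Turkish text. A Python sketch; its str.upper/str.lower implement the default, locale-independent Unicode case mappings:]

```python
# The default mapping uppercases 'i' to 'I' -- wrong for Turkish,
# where the uppercase of 'i' is 'İ' (I with dot above, U+0130).
print("i".upper())    # I

# And the default lowercase of 'İ' is 'i' plus a combining dot above,
# not a plain 'i' -- another place where the alphabet matters.
lowered = "\u0130".lower()
print([hex(ord(c)) for c in lowered])   # ['0x69', '0x307']
```

So a correct uppercase/lowercase operation needs a language parameter, not just the string, which is exactly why Unicode's default mappings cannot be the whole story.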
On 01/18/2011 06:11 AM, Ali Çehreli wrote:Thanks to all that has contributed, I am also following this thread with great interest. :) Michel Fortin wrote: > I mean, a grapheme is a slice of a string, can have multiple code points > (like a string), can be appended the same way as a string, can be > composed or decomposed using canonical normalization or compatibility > normalization (like a string), and should be sorted, uppercased, and > lowercased according to Unicode rules (like a string). Basically, a > grapheme is just a string that happens to contain only one grapheme. I would like to stress the fact that Unicode knows nothing about sorting, uppercasing, or lowercasing. Those operations are tied to the alphabet (or writing system) that a certain grapheme happens to belong to at a given time. For example, we cannot uppercase the letter i without knowing what alphabet we are dealing with. Two possibilities: I and İ (I dot above). It is the same issue with sorting.This is true and false ;-) You are right, indeed, that issues like sorting are language-specific, and more, use-case-specific. The case of Turkish is a good example. For another, in French I do not even know whether there is an official rule! Anyway, whatever the answer, even famous newspapers and official documents use different rules. Most of them drop accents on uppercase, possibly because of computer limitations; there is a recent move (back) toward accented uppercase. This is very annoying: "Hélène" has 2 consistent and used uppercase versions. Conversely, how is software supposed to guess the lowercase version of "HELENE"? As for Unicode, it does define norms for casing and so-called collation (comparison, for sorting) algorithms. I don't know much more; I have never applied them personally, for reasons like the ones above. The full list of its technical docs can be found at http://unicode.org/reports/. See in particular http://unicode.org/reports/tr10/ for collation. 
(Unfortunately, case mapping is now part of the core standard doc, so it's hard to get at separately.) Denis _________________ vita es estrany spir.wikidot.com
Jan 19 2011
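[The collation issue Denis points to (UTS #10) shows up even in trivial sorting: comparing by code point puts accented letters after the entire ASCII range. A small, hypothetical Python illustration with made-up example words:]

```python
# Code-point comparison is not collation: 'ô' (U+00F4) compares
# greater than 'z' (U+007A), so "côte" lands after "coz".
words = ["cote", "coz", "c\u00f4te"]
print(sorted(words))   # ['cote', 'coz', 'côte']
# French collation would instead put "côte" right after "cote",
# treating the accent as a secondary difference.  A real
# implementation would use the Unicode Collation Algorithm (UTS #10),
# e.g. via ICU; the stdlib's locale.strxfrm depends on which locales
# the platform has installed, so it is not shown here.
```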
On 01/15/2011 05:59 PM, Steven Schveighoffer wrote:I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above).I am unsure now about the question of a text's (apparent) natural language in relation to unicode issues. For instance English, precisely, seems to often include foreign words literally (or is it a kind of pedantry from highly educated people?). In fact, users are free to include whatever characters they like, as long as their text-composition interface allows it. All main OSes, I guess, now have at least one standard way to type in characters (or code points) that are not directly accessible on keyboards, and applications sometimes offer another. Some kinds of users love to play with such flexibility. So, maybe, the right question is not the one of natural language but of text-composition means. I guess that as soon as a human user may have freely typed or edited a text, we cannot guarantee much about its actual content, what do you think? The case of historic ASCII-only text is relevant, indeed, but will quickly become less common. And how does an application writer recognise them without iterating the whole content? (The encoding is UTF-8-compatible.) Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
Am 14.01.2011 15:34, schrieb Steven Schveighoffer:Is it common to have multiple modifiers on a single character? The problem I see with using decomposed canonical form for strings is that we would have to return a dchar[] for each 'element', which severely complicates code that, for instance, only expects to handle English. I was hoping to lazily transform a string into its composed canonical form, allowing the (hopefully rare) exception when a composed character does not exist. My thinking was that this at least gives a useful string representation for 90% of usages, leaving the remaining 10% of usages to find a more complex representation (like your Text type). If we only get like 20% or 30% there by making dchar the element type, then we haven't made it useful enough.I'm afraid that this is not a proper way to handle this problem. It may be better for a language not to 'translate' by default. If the user wants to convert the codepoints this can be requested on demand. But premature default conversion is a subtle way to lose information that may be important. Imagine we want to write a tool for dealing with the in/output of some other ignorant legacy software. Even if it is only text files, that software may choke on some converted input. So I believe that it is very important that we are able to reproduce strings in exactly the form in which we read them in. Gerrit
Jan 14 2011
On Fri, 14 Jan 2011 15:54:19 -0500, Gerrit Wichert <gwichert yahoo.com> wrote:Am 14.01.2011 15:34, schrieb Steven Schveighoffer:Actually, this would only lazily *and temporarily* convert the string per grapheme. Essentially, the original is left alone, so no harm there. -Steve.Is it common to have multiple modifiers on a single character? The problem I see with using decomposed canonical form for strings is that we would have to return a dchar[] for each 'element', which severely complicates code that, for instance, only expects to handle English. I was hoping to lazily transform a string into its composed canonical form, allowing the (hopefully rare) exception when a composed character does not exist. My thinking was that this at least gives a useful string representation for 90% of usages, leaving the remaining 10% of usages to find a more complex representation (like your Text type). If we only get like 20% or 30% there by making dchar the element type, then we haven't made it useful enough.I'm afraid that this is not a proper way to handle this problem. It may be better for a language not to 'translate' by default. If the user wants to convert the codepoints this can be requested on demand. But pemature default conversion is a subltle way to lose information that may be important. Imagine we want to write a tool for dealing with the in/output of some other ignorant legacy software. Even if it is only text files, that software may choke on some converted input. So i belive that it is very importent that we are able to reproduce strings in exact that form in which we have read them in.
Jan 15 2011
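[Steven's "hopefully rare exception when a composed character does not exist" is real: NFC recomposes only where a precomposed code point is defined in Unicode. A Python sketch via the stdlib unicodedata module, using q plus a combining tilde as an example of a combination with no precomposed form:]

```python
import unicodedata

# NFC re-composes where a precomposed code point exists...
composed = unicodedata.normalize("NFC", "e\u0301")
print(hex(ord(composed)))   # 0xe9 -- a single precomposed é

# ...but some combinations have no precomposed form, so even NFC must
# leave them as base + combining mark.  Unicode defines no precomposed
# q-with-tilde, so this stays two code points:
q_tilde = unicodedata.normalize("NFC", "q\u0303")
print(len(q_tilde))   # 2
```

This is why a lazily-composed view still cannot promise one code point per character; it only makes the common case compact.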
On 01/14/2011 09:34 AM, Steven Schveighoffer wrote:Is it common to have multiple modifiers on a single character? The problem I see with using decomposed canonical form for strings is that we would have to return a dchar[] for each 'element', which severely complicates code that, for instance, only expects to handle English.Hebrew: • Almost every letter in a printed Hebrew bible has at least one of— ‣ vowel marker (the Hebrew alphabet is otherwise consonantal) and ‣ a /dagesh/ dot, indicating the difference between /b/ & /v/, or between /mm/ and /m/; • almost every word has at least one letter with a cantillation mark in addition to the above; and • other marks too complicated & off-topic to explain. Vietnamese uses Latin letters with accents playing multiple roles, so there are often two or three accent marks on a single letter; e.g., the name of the creator of pdfTeX is spelled “Hàn Thế Thành”, with two accents on the “e”. I’m sure there are others. —Joel
Jan 23 2011
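[Joel's Vietnamese example can be checked directly: the 'ế' in "Thế" carries both a circumflex and an acute accent, so the single precomposed code point decomposes to a base letter plus two combining marks. In Python, via the stdlib unicodedata module:]

```python
import unicodedata

# U+1EBF is 'ế' (e with circumflex and acute): one precomposed code
# point, but three code points in canonical decomposed form (NFD).
nfd = unicodedata.normalize("NFD", "\u1ebf")
print([hex(ord(c)) for c in nfd])   # ['0x65', '0x302', '0x301']
```

So even languages written in Latin letters routinely stack more than one mark on a base character, which is the case Steven was asking about.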
"spir" <denis.spir gmail.com> wrote in message news:mailman.624.1295013588.4748.digitalmars-d puremagic.com...If it does not display properly, either set your terminal to UTF* or use a more unicode-aware font (eg DejaVu series).How to do that on the Windows (XP) command prompt, for anyone who doesn't know: Step 1: Right-click title bar, "Properties", "Font" tab, set font to "Lucida Console" (It'll look weird at first, but you get used to it.) Step 2 (I had to google this step): For just the current terminal session: Run "chcp 65001". (i.e. "CHange Code Page") Also, you can run "chcp" to just see what codepage you're already set to. To make it work permanently: Put "chcp 65001" into the registry key "HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"
Jan 14 2011
"Nick Sabalausky" <a a.a> wrote in message news:igq9u6$1bqu$1 digitalmars.com...Step 2 (I had to google this step): For just the current terminal session: Run "chcp 65001". (Ie "CHange Code Page) Also, you can run "chcp" to just see what codepage you're already set to. To make it work permanently: Put "chcp 65001" into the registry key "HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"Forget that step 2, that causes "Active code page: 65001" to be sent to stdout *every* time system() is invoked. We shouldn't be relying on that. *This* is what should be done (and this really should be done in all D command line apps - or better yet, put into the runtime): import std.stdio; version(Windows) { import std.c.windows.windows; extern(Windows) export BOOL SetConsoleOutputCP(UINT); } void main() { version(Windows) SetConsoleOutputCP(65001); writeln("HuG says: Fukken Über Death Terminal"); } See also: http://d.puremagic.com/issues/show_bug.cgi?id=1448
Jan 14 2011
On 1/14/11, Nick Sabalausky <a a.a> wrote:import std.stdio; version(Windows) { import std.c.windows.windows; extern(Windows) export BOOL SetConsoleOutputCP(UINT); } void main() { version(Windows) SetConsoleOutputCP(65001); writeln("HuG says: Fukken Über Death Terminal"); }Does that work for you? I get back: HuG says: Fukken Über Death Terminal
Jan 14 2011
"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message news:mailman.631.1295038817.4748.digitalmars-d puremagic.com...On 1/14/11, Nick Sabalausky <a a.a> wrote:Yea, it works for me (XP Pro SP2 32-bit), and my "chcp" is 437, not 65001. The NG or copy-paste might have messed it up. Try with a code-point escape sequence: import std.stdio; version(Windows) { import std.c.windows.windows; extern(Windows) export BOOL SetConsoleOutputCP(UINT); } void main() { version(Windows) SetConsoleOutputCP(65001); writeln("HuG says: Fukken \u00DCber Death Terminal"); }import std.stdio; version(Windows) { import std.c.windows.windows; extern(Windows) export BOOL SetConsoleOutputCP(UINT); } void main() { version(Windows) SetConsoleOutputCP(65001); writeln("HuG says: Fukken Über Death Terminal"); }Does that work for you? I get back: HuG says: Fukken Über Death Terminal
Jan 14 2011
On 1/14/11, Nick Sabalausky <a a.a> wrote:Try with a code-point escape sequenceNope, I still get the same results (tried with different fonts, lucida etc.., but I don't think it's a font issue). Maybe I have my settings messed up or something.
Jan 14 2011
"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message news:mailman.633.1295044452.4748.digitalmars-d puremagic.com...On 1/14/11, Nick Sabalausky <a a.a> wrote:Weird. Which version of windows are you on, and are you using the regular command line or powershell or something else? If you run "chcp 65001" from the cmd line first, does it work then?Try with a code-point escape sequenceNope, I still get the same results (tried with different fonts, lucida etc.., but I don't think it's a font issue). Maybe I have my settings messed up or something.
Jan 14 2011
On 1/14/11, Nick Sabalausky <a a.a> wrote:Weird. Which version of windows are you on, and are you using the regular command line or powershell or something else? If you run "chcp 65001" from the cmd line first, does it work then?Okay, it appears this is an issue with Console2. I'll have to report it to the dev, although he hasn't fixed much of anything for ages already. I'm really contemplating writing my own shell by now. (no Linux jokes now, please. :p) Works fine in cmd.exe, Lucida font without calling 65001 manually. In fact, it works with the 437 code page as well when I comment out SetConsoleOutputCP.
Jan 14 2011
On 1/15/11, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:fact, it works with the 437 code page as well when I comment out SetConsoleOutputCP.Woops, let me revise what I've said: If the code has the call to change the codepage, then I'll get back the correct result in console. If it doesn't, I have to switch the codepage manually. I don't know what the problem with Console2 is, but if I change cmd.exe to always use a Lucida font then Console2 will output the correct result (even though I'm using fixedsys in Console2). This is getting too specific and I don't want to hijack the thread. Everything is working fine now. Thx. :)
Jan 14 2011
On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 1/13/11 7:09 PM, Michel Fortin wrote:Apple implemented all these things in the NSString class in Cocoa. They did all this work on Unicode at the beginning of Mac OS X, at a time where making such changes wouldn't break anything. It's a hard thing to change later when you have code that depends on the old behaviour. It's a complicated matter and not so many people will understand the issues, so it's no wonder many languages just deal with code points.That's forgetting that most of the time people care about graphemes (user-perceived characters), not code points.I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).This might be a good time to see whether we need to address graphemes systematically. Could you please post a few links that would educate me and others in the mysteries of combining characters?As usual, Wikipedia offers a good summary and a couple of references. Here's the part about combining characters: <http://en.wikipedia.org/wiki/Combining_character>. There are basically four ranges of code points which are combining: - Combining Diacritical Marks (0300–036F) - Combining Diacritical Marks Supplement (1DC0–1DFF) - Combining Diacritical Marks for Symbols (20D0–20FF) - Combining Half Marks (FE20–FE2F) A code point followed by one or more code points in these ranges is conceptually a single character (a grapheme). But for comparing strings correctly, you need to determine the canonical equivalence. Wikipedia describes it in its Unicode normalization article <http://en.wikipedia.org/wiki/Unicode_normalization>. The full algorithm specification can be found here: <http://unicode.org/reports/tr15/>. 
The canonical form has both a composed and decomposed variant, the first trying to use pre-combined characters when possible, the second not using any pre-combined character. Not only combining marks are concerned, there are a few single-code-point characters which have a duplicate somewhere else in the code point table. Also, there's two normalizations: the canonical one (described above) and the compatibility one which is more lax (making the ligature "fl" equivalent to "fl", for instance). If a user searches for some text in a document, it's probably better to search using the compatibility normalization so that "flower" (with ligature) and "flower" (without ligature) can match each other. If you want to search case-insensitively, then you'll need to implement the collation algorithm, but that's getting further. If you're wondering which direction to take, this official FAQ seems like a good resource (especially the first few questions): <http://www.unicode.org/faq/normalization.html> One important thing to note is that most of the time, strings come already in the normalized pre-composed form. So the normalization algorithm should be optimized for the case it has nothing to do. That's what is said in section 1.3 Description of the Normalization Algorithm in the specification: <http://www.unicode.org/reports/tr15/#Description_Norm>. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 14 2011
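[The two equivalences Michel describes behave differently on his ligature example: canonical normalization (NFC/NFD) leaves the "fl" ligature alone, while compatibility normalization (NFKC/NFKD) folds it to plain "fl". A short Python check via the stdlib unicodedata module, which implements the UAX #15 forms:]

```python
import unicodedata

# Canonical equivalence: precomposed vs decomposed é are different
# code point sequences but the same abstract character.
a, b = "\u00e9", "e\u0301"
print(a == b)                                                   # False
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))                          # True

# Compatibility equivalence is laxer: the 'fl' ligature U+FB02
# matches "fl" only under NFKC, not under canonical NFC.
lig = "\ufb02"
print(unicodedata.normalize("NFC", lig) == lig)                 # True
print(unicodedata.normalize("NFKC", lig))                       # fl
```

This is why a search feature would normalize both needle and haystack with a compatibility form, while equality of "the same text" only needs a canonical form.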
On 1/14/11 7:50 AM, Michel Fortin wrote:On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:That's a strong indicator, but we shouldn't get ahead of ourselves. D took a certain risk by defaulting to Unicode at a time where the dominant extant systems languages left the decision to more or less exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other languages were just starting to adopt Unicode. I think that risk was justified because the relative loss in speed was often acceptable and the gains were there. Even so, there are people in this newsgroup who protest against the loss in efficiency and argue that life is harder for ASCII users. Switching to variable-length representation of graphemes as bundles of dchars and committing to that through and through will bring with it a larger hit in efficiency and an increased difficulty in usage. I agree that at a level that's the "right" thing to do, but I don't have yet the feeling that combining characters are a widely-adopted winner. For the most part, fonts don't support combining characters, and as a font dilettante I can tell that putting arbitrary sets of diacritics on top of characters is not what one should be doing as it'll look terrible. Unicode is begrudgingly acknowledging combining characters. Only a handful of libraries deal with them. I don't know how many applications need or care for them, versus how many applications do fine with precombined characters. I have trouble getting combining characters to combine on this machine in any of the applications I use - and this is a Mac. AndreiOn 1/13/11 7:09 PM, Michel Fortin wrote:Apple implemented all these things in the NSString class in Cocoa. They did all this work on Unicode at the beginning of Mac OS X, at a time where making such changes wouldn't break anything. It's a hard thing to change later when you have code that depend on the old behaviour. 
It's a complicated matter and not so many people will understand the issues, so it's no wonder many languages just deal with code points.That's forgetting that most of the time people care about graphemes (user-perceived characters), not code points.I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).
Jan 14 2011
Andrei Alexandrescu Wrote:That's a strong indicator, but we shouldn't get ahead of ourselves. D took a certain risk by defaulting to Unicode at a time where the dominant extant systems languages left the decision to more or less exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other languages were just starting to adopt Unicode. I think that risk was justified because the relative loss in speed was often acceptable and the gains were there. Even so, there are people in this who protest against the loss in efficiency and argue that life is harder for ASCII users. Switching to variable-length representation of graphemes as bundles of dchars and committing to that through and through will bring with it a larger hit in efficiency and an increased difficulty in usage. I agree that at a level that's the "right" thing to do, but I don't have yet the feeling that combining characters are a widely-adopted winner. For the most part, fonts don't support combining characters, and as a font dilettante I can tell that putting arbitrary sets of diacritics on top of characters is not what one should be doing as it'll look terrible. Unicode is begrudgingly acknowledging combining characters. Only a handful of libraries deal with them. I don't know how many applications need or care for them, versus how many applications do fine with precombined characters. I have trouble getting combining characters to combine on this machine in any of the applications I use - and this is a Mac. AndreiCombining marks do need to be supported. Some languages use combining marks extensively (see my other post) and of course font for those languages exist and they do support this. Mac doesn't support all languages so I'm unsure if it's the best example out there. here's an example of the Hebrew bible: http://www.scripture4all.org/OnlineInterlinear/Hebrew_Index.htm Just look at the any of the PDFs there to see how Hebrew looks like with all sorts of different marks. 
In the same vein I could have found a Japanese text with ruby (where a Kanji letter has on top of it Hiragana text that tells you how to read it) Using a dchar as a string element instead of a proper grapheme will make it really hard to work with texts in such languages. Regarding efficiency concerns for ASCII users - there's no rule that forces us to have a single string type, just look for comparison at how many integral types D has. I believe that the correct thing is to have a 'universal string' type be the default (just like int is for integral types) and provide additional types for other commonly useful encodings such as ASCII. A geneticist for instance should use a 'DNA' type that encodes the four DNA letters instead of an ASCII string or even worse, a universal (Unicode) string.
Jan 14 2011
On 2011-01-14 18:02:32 -0500, foobar <foo bar.com> said:Combining marks do need to be supported. Some languages use combining marks extensively (see my other post) and of course font for those languages exist and they do support this. Mac doesn't support all languages so I'm unsure if it's the best example out there. here's an example of the Hebrew bible: http://www.scripture4all.org/OnlineInterlinear/Hebrew_Index.htm Just look at the any of the PDFs there to see how Hebrew looks like with all sorts of different marks.That's a good example. Although my attempt to extract the text from the PDF wasn't perfect, I can confirm that the marks I got in the copy-pasted text are indeed combining code points, not pre-combined ones. This character for instance has a combining mark: "יָ"; and it can't be represented by a pre-combined code point because there is no pre-combined form for it (or at least I couldn't find one). Some hebrew characters have a pre-combined form for the middle dot and some other marks, presumably the most common ones, but it was clearly insufficient for this text.In the same vain I could have found a Japanese text with ruby (where a Kanji letter has on top of it Hiragana text that tells you how to read it)Are you sure those are combining code points? I thought ruby was a layout feature, not something that's part of Unicode. And I can't find combining code points that would match those. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 14 2011
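[The check Michel performed on the copy-pasted Hebrew text, whether a mark is stored as a combining code point, comes down to looking at its canonical combining class. In Python, via the stdlib unicodedata module:]

```python
import unicodedata

# Hebrew yod + point qamats, as in Michel's "יָ" example: the vowel
# point is a separate, combining code point attached to the letter.
yod_qamats = "\u05d9\u05b8"
classes = [unicodedata.combining(c) for c in yod_qamats]
print(classes)   # [0, 18] -- 0 for the base letter, nonzero for the mark
print(unicodedata.name("\u05b8"))   # HEBREW POINT QAMATS
```

A nonzero combining class is what marks a code point as combining, and it is also the key used by canonical ordering during normalization.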
Michel Fortin Wrote:I've looked into this and I was wrong. Ruby is a layout feature as you said. Sorry for the confusion.In the same vain I could have found a Japanese text with ruby (where a Kanji letter has on top of it Hiragana text that tells you how to read it)Are you sure those are combining code points? I though ruby was a layout feature, not something part of Unicode. And I can't find combining code points that would match those.-- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 15 2011
On 2011-01-14 17:04:08 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:

> On 1/14/11 7:50 AM, Michel Fortin wrote:
> > On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
> > > On 1/13/11 7:09 PM, Michel Fortin wrote:
> > > > That's forgetting that most of the time people care about graphemes
> > > > (user-perceived characters), not code points.
> > >
> > > I'm not so sure about that. What do you base this assessment on? Denis
> > > wrote a library that according to him does grapheme-related stuff nobody
> > > else does. So apparently graphemes are not what people care about
> > > (although they might be what people should care about).
> >
> > Apple implemented all these things in the NSString class in Cocoa. They
> > did all this work on Unicode at the beginning of Mac OS X, at a time when
> > making such changes wouldn't break anything. It's a hard thing to change
> > later when you have code that depends on the old behaviour. It's a
> > complicated matter and not so many people will understand the issues, so
> > it's no wonder many languages just deal with code points.
>
> That's a strong indicator, but we shouldn't get ahead of ourselves. D
> took a certain risk by defaulting to Unicode at a time when the dominant
> extant systems languages left the decision to more or less exotic
> libraries, Java used UTF-16 de jure but UCS-2 de facto, and other
> languages were just starting to adopt Unicode. I think that risk was
> justified because the relative loss in speed was often acceptable and the
> gains were there. Even so, there are people in this group who protest
> against the loss in efficiency and argue that life is harder for ASCII
> users.

Then perhaps it's time we find out a way to handle non-Unicode encodings too. We can get away with treating ASCII strings as Unicode strings because of a useful property of UTF-8, but should we really do this?

Also, it'd really help this discussion to have some hard numbers about the cost of decoding graphemes.

> Switching to variable-length representation of graphemes as bundles of
> dchars and committing to that through and through will bring with it a
> larger hit in efficiency and an increased difficulty in usage. I agree
> that at a level that's the "right" thing to do, but I don't yet have the
> feeling that combining characters are a widely-adopted winner. For the
> most part, fonts don't support combining characters, and as a font
> dilettante I can tell that putting arbitrary sets of diacritics on top of
> characters is not what one should be doing as it'll look terrible.
> Unicode is begrudgingly acknowledging combining characters. Only a
> handful of libraries deal with them. I don't know how many applications
> need or care for them, versus how many applications do fine with
> precombined characters. I have trouble getting combining characters to
> combine on this machine in any of the applications I use - and this is a
> Mac.

I'm using the character palette: Edit menu > Special Characters... From there you can insert arbitrary code points. Use the search function of the palette to get code points with "combining" in their names, then click the big character box on the lower left to insert them. Have fun!

-- 
Michel Fortin  michel.fortin michelf.com  http://michelf.com/
Jan 14 2011
On 01/15/2011 12:21 AM, Michel Fortin wrote:

> Also, it'd really help this discussion to have some hard numbers about
> the cost of decoding graphemes.

Text has a perf module that provides such numbers, on different stages of Text object construction. (The measured algos are not yet stabilised, so said numbers change regularly, but in the right direction ;-) You can try the current version at https://bitbucket.org/denispir/denispir-d/src (the perf module is called chrono.d).

For information: recently, the cost of full text construction (decoding, normalisation (both decomposition & ordering), and piling) was about 5 times decoding alone, the heavy part (~70%) being piling. But Stephan just informed me about a new gain in piling I have not yet tested. This performance places our library in between Windows native tools and ICU in terms of speed, which is imo rather good for a brand new tool written in a still unstable language.

I have carefully read your arguments that Text's approach of systematically "piling" and normalising source texts is not the right one from an efficiency point of view, even for strict use cases of universal text manipulation (because the relative space cost would indirectly cause time cost due to cache effects). Instead, you state we should "pile" and/or normalise on the fly. But I am, similarly to you, rather doubtful on this point without any numbers available. So, let us produce some benchmark results on both approaches if you like.

Denis
_________________
vita es estrany
spir.wikidot.com
Jan 17 2011
On 1/17/11 10:55 AM, spir wrote:

> Text has a perf module that provides such numbers (on different stages of
> Text object construction). [...]
> So, let us produce some benchmark results on both approaches if you like.

Congrats on this great work. The initial numbers are in keeping with my expectation; UTF adds for certain primitives up to 3x overhead compared to ASCII, and I expect combining character handling to bring about as much on top of that.

Your work and Steve's won't go to waste; one way or another we need to add grapheme-based processing to D. I think it would be great if later on a Phobos submission was made.

Andrei
Jan 17 2011
On 01/17/2011 06:36 PM, Andrei Alexandrescu wrote:

> Congrats on this great work. The initial numbers are in keeping with my
> expectation; UTF adds for certain primitives up to 3x overhead compared
> to ASCII, and I expect combining character handling to bring about as
> much on top of that. Your work and Steve's won't go to waste; one way or
> another we need to add grapheme-based processing to D. I think it would
> be great if later on a Phobos submission was made.

Andrei, would you have a look at Text's current state, mainly the interface, when you have time for that (no hurry) at https://bitbucket.org/denispir/denispir-d/src

It is actually a bit more than just a string type considering true characters as natural elements.
* It is a textual type providing a client interface of common text manipulation methods similar to the ones in common high-level languages (including the fact that a character is a singleton string).
* The repo also holds the main module (unicodedata) of Text's sister lib (dunicode), providing access to various unicode algos and data. (We are about to merge the 2 libs into a new repository.)

Denis
_________________
vita es estrany
spir.wikidot.com
Jan 17 2011
On 1/17/11 12:23 PM, spir wrote:

> Andrei, would you have a look at Text's current state, mainly the
> interface, when you have time for that (no hurry) at
> https://bitbucket.org/denispir/denispir-d/src
> [...]

I think this is solid work that reveals a good understanding of Unicode. That being said, there are a few things I disagree about and I don't think it can be integrated into Phobos.

One thing is that it looks a lot more like D1 code than D2. D2 code of this kind is automatically expected to play nice with the rest of Phobos (ranges and algorithms). As it is, the code is an island that implements its own algorithms (mostly by equivalent handwritten code).

In detail:

* Line 130: representing a text as a dchar[][] has its advantages but major efficiency issues. To be frank I think it's a disaster. I think a representation building on UTF strings directly is bound to be vastly better.
* 163: equality does what std.algorithm.equal does.
* 174: equality also does what std.algorithm.equal does (possibly with a custom pred).
* 189: TextException is unnecessary.
* 340: Unless properly motivated, iteration with opApply is archaic and inefficient.
* 370: Why lose the information that the result is in fact a single Pile?
* 430, 456, 474: contains, indexOf, count and probably others should use generic algorithms, not duplicate them.
* 534: replace is std.array.replace.
* 623: copy copies the piles shallowly (not sure if that's a problem).

As I mentioned before: why not focus on defining a Grapheme type (what you call Pile, but using UTF encoding) and defining a ByGrapheme range that iterates a UTF-encoded string by grapheme?

Andrei
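A minimal sketch of what such a ByGrapheme forward range could look like. The `isCombining` test below only covers the four basic combining ranges and ignores prepended marks and Hangul, so it is a rough approximation of the TR29 rules, not an implementation of them:

```d
import std.range.primitives : empty, front, popFront;

/// True for code points in the four basic combining ranges (approximation).
bool isCombining(dchar c)
{
    return (c >= 0x0300 && c <= 0x036F)
        || (c >= 0x1DC0 && c <= 0x1DFF)
        || (c >= 0x20D0 && c <= 0x20FF)
        || (c >= 0xFE20 && c <= 0xFE2F);
}

/// Forward range yielding one grapheme (base + trailing combining marks)
/// at a time, as a slice of the underlying UTF-8 string.
struct ByGrapheme
{
    string s;

    @property bool empty() const { return s.length == 0; }

    @property string front()
    {
        auto t = s;
        t.popFront();                       // skip the base code point
        while (!t.empty && isCombining(t.front))
            t.popFront();                   // absorb trailing combining marks
        return s[0 .. s.length - t.length]; // one grapheme, no copying
    }

    void popFront() { s = s[front.length .. $]; }

    @property ByGrapheme save() { return this; }
}

unittest
{
    auto r = ByGrapheme("e\u0301a"); // 'e' + COMBINING ACUTE ACCENT, then 'a'
    assert(r.front == "e\u0301");
    r.popFront();
    assert(r.front == "a");
    r.popFront();
    assert(r.empty);
}
```

Because `front` returns a slice of the original string, iterating this way allocates nothing; algorithms taking forward ranges can then operate on graphemes directly.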
Jan 17 2011
On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:

> I think this is solid work that reveals good understanding of Unicode.
> That being said, there are a few things I disagree about and I don't
> think it can be integrated into Phobos.

We are exploring a new field. (Except for the work Objective-C designers did -- but we just discovered it.)

> One thing is that it looks a lot more like D1 code than D2. D2 code of
> this kind is automatically expected to play nice with the rest of Phobos
> (ranges and algorithms). As it is, the code is an island that implements
> its own algorithms (mostly by equivalent handwritten code).

Right. We initially wanted precisely to let it play nicely with the rest of new Phobos. This meant mainly providing a range interface, which also gives access to std.algorithm routines. But we were blocked by current bugs related to ranges. I have posted about those issues (you may remember having replied to that post).

> * Line 130: representing a text as a dchar[][] has its advantages but
> major efficiency issues. To be frank I think it's a disaster. I think a
> representation building on UTF strings directly is bound to be vastly
> better.

I don't understand your point. Where is the difference with D's builtin types, then? Also, which efficiency issue do you mean? Upon text object construction, we do agree and I have given some data. But this happens only once; it is an investment intended to provide correctness first, and efficiency of _every_ operation on the constructed text. As for the speed of such methods / algorithms operating _correctly_ on universal text: precisely, since there is no alternative to Text (yet), there are also no performance data available to judge by. (What about comparing Objective-C's NSString to Text's current performance for indexing, slicing, searching, counting, ...? Even in its current experimental stage, I bet it would not be ridiculous, rather the opposite. But I may be completely wrong.)

> * 163: equality does what std.algorithm.equal does.
> * 174: equality also does what std.algorithm.equal does (possibly with a
> custom pred)

Right, these are unimportant tool funcs at the "pile" level. (Initially introduced because the builtin "==" showed strange inefficiency in our case. May test again later.)

> * 189: TextException is unnecessary

Agreed.

> * 340: Unless properly motivated, iteration with opApply is archaic and
> inefficient.

See the range bug evoked above. opApply is the only workaround AFAIK. Also, ranges cannot yet provide indexed iteration like foreach(i, char ; text) {...}

> * 370: Why lose the information that the result is in fact a single Pile?

I don't know what information loss you mean. Generally speaking, Pile is more or less an implementation detail used to internally represent a true character, while Text is the important thing. At one point we had to choose whether to make Pile an obviously exposed type as well, or not. I chose (after some exchange on the topic) not to, for a few reasons:
* Simplicity: one type does all the job well.
* Avoiding confusion due to conflict with historic string types, whose elements (codes = characters) were atomic thingies. This was also a reason not to name it simply "Character"; "Pile" for me was supposed to evoke the technical side rather than the meaningful side.
* Lightness of the interface: if we expose Pile obviously, then we need to double all methods that may take or return a single character, like searching, counting, replacing, etc., and also possibly indexing and iteration.

In fact, the resulting interface is more or less like a string type in high-level languages such as Python, with the motivating difference that it operates correctly on universal text. Now, it seems you rather expect, maybe, the character/pile type to be the important thing, and Text to just be a sequence of them? (possibly even unnecessary to define formally)

> * 430, 456, 474: contains, indexOf, count and probably others should use
> generic algorithms, not duplicate them.
> * 534: replace is std.array.replace

I had to write the algos because most of them in std.algorithm require a range interface, IIUC; and also for testing purposes.

> * 623: copy copies the piles shallowly (not sure if that's a problem)

I had the same question.

> As I mentioned before - why not focus on defining a Grapheme type (what
> you call Pile, but using UTF encoding) and defining a ByGrapheme range
> that iterates a UTF-encoded string by grapheme?

Dunno. This simply was not my approach. It seems to me Text as is provides clients with an interface as simple and clear as possible, while operating correctly in the background. If you just build a ByGrapheme iterator, then you have no choice other than abstracting on the fly (constructing piles on the fly for operations like indexing, and normalising them in addition for searching, counting...). As I said in other posts, this may be the right thing to do from an efficiency point of view, but this remains to be proven. I bet the opposite, in fact: that --with the same implementation language and the same investment in optimisation-- the approach defining a true textual type like Text is inevitably more efficient by orders of magnitude (*). Again, Text's construction cost is an initial investment. Prove me wrong (**).

Denis

(*) Except, probably, for the choice of making the ElementType a singleton Text (seems costly).
(**) I'm now aware of the high speed loss Text certainly suffers from representing characters as mini-arrays, but I guess it is marginally relevant compared to the gain of not piling and normalising for every operation.

_________________
vita es estrany
spir.wikidot.com
Jan 17 2011
On 1/17/11 5:13 PM, spir wrote:

> On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
> > * Line 130: representing a text as a dchar[][] has its advantages but
> > major efficiency issues. To be frank I think it's a disaster. I think a
> > representation building on UTF strings directly is bound to be vastly
> > better.
>
> I don't understand your point. Where is the difference with D's builtin
> types, then?

Unfortunately I won't have much time to discuss all these points, but this is a simple one: using dchar[][] wastes memory and time. You need to build on a flatter representation. Don't confuse the abstraction you are building with its underlying representation. The difference between your abstraction and char[]/wchar[]/dchar[] (which I strongly recommend you build on) is that the abstraction offers different, higher-level primitives that the representation doesn't.

Let me repeat again: if anyone in this community wants to put work into a forward range that iterates one grapheme at a time, that work would be very valuable, because it would allow us to experiment with graphemes in a non-disruptive way while benefiting from a host of algorithms. ByGrapheme and friends will help more than defining new string types.

Andrei
Jan 17 2011
On 01/18/2011 03:52 AM, Andrei Alexandrescu wrote:

> Unfortunately I won't have much time to discuss all these points, but
> this is a simple one: using dchar[][] wastes memory and time. You need to
> build on a flatter representation. Don't confuse the abstraction you are
> building with its underlying representation. The difference between your
> abstraction and char[]/wchar[]/dchar[] (which I strongly recommend you to
> build on) is that the abstractions offer different, higher-level
> primitives that the representation doesn't.

I think it is needed to repeat again the following: Text in my view (or whatever variant solution works correctly with universal text) is _not_ intended as a basic string type, even less a default one. If programmers can guarantee all their app's input will only ever hold single-codepoint characters, _or_ if they just pass pieces of text around without manipulation, then such a tool is big overkill.

It has a time cost at Text construction time, which I consider an investment. It also has some space & time cost for operations, which should be only slightly relevant compared to the speed offered by the simple fact that routines can then operate (nearly) just like with historic charsets. Indexing is just normal O(1) indexing, possibly plus producing the result; not O(n) across the source while building piles along the way (1000X slower? 1000000X slower?). Counting is just O(n) with mini-array compares, not building & normalising piles across the whole code sequence (10X, 100X slower?).

> Let me repeat again: if anyone in this community wants to put work in a
> forward range that iterates one grapheme at a time, that work would be
> very valuable because it will allow us to experiment with graphemes in a
> non-disruptive way while benefiting of a host of algorithms. ByGrapheme
> and friends will help more than defining new string types.

Right. I understand your point of view, esp. "non-disruptive". But then, how do we avoid the possibly huge inefficiency evoked above? We have no true perf numbers yet, right, for any alternative to Text's approach. But for this reason we also should not randomly speak of this approach's space & time costs. Compared to what?

Denis
_________________
vita es estrany
spir.wikidot.com
Jan 18 2011
On 1/18/11 7:25 AM, spir wrote:

> Indexing is just normal O(1) indexing, possibly plus producing the
> result. Not O(n) across the source with building piles along the way.
> (1000X slower, 1000000X slower?) Counting is just O(n) with mini-array
> compares, not building & normalising piles across the whole code
> sequence. (10X, 100X slower?)

You don't provide O(n) indexing.

Andrei
Jan 18 2011
On Monday 17 January 2011 15:13:42 spir wrote:

> See range bug evoked above. opApply is the only workaround AFAIK. Also,
> ranges cannot yet provide indexed iteration like
> foreach(i, char ; text) {...}

While it would be nice at times to be able to have an index with foreach when using ranges, I would point out that it's trivial to just declare a variable which you increment each iteration, so it's easy to get an index even when using foreach with ranges. Certainly, I wouldn't consider the lack of an index with foreach and ranges a good reason to use opApply instead of ranges. There may be other reasons which make it worthwhile, but it's so trivial to get an index that the loss of range abilities (particularly the ability to use such ranges with std.algorithm) dwarfs it in importance.

- Jonathan M Davis
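In code, the suggestion is just this (a sketch using a throwaway range from std.range; the hand-kept counter replaces the index that foreach over an array would supply):

```d
import std.range : iota;

void main()
{
    auto r = iota(10, 40, 10); // a range yielding 10, 20, 30 (no array index)
    size_t i = 0;
    foreach (e; r)             // range-based foreach gives elements only...
    {
        // ...so keep the index ourselves
        assert(e == (i + 1) * 10);
        ++i;
    }
    assert(i == 3); // i ended up being the element count
}
```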
Jan 17 2011
On 1/17/11 11:48 PM, Jonathan M Davis wrote:

> While it would be nice at times to be able to have an index with foreach
> when using ranges, I would point out that it's trivial to just declare a
> variable which you increment each iteration, so it's easy to get an index
> even when using foreach with ranges. [...]

It's a bit more difficult than that. When iterating a variable-length encoded range, what you need more than the current item being iterated is the physical offset reached inside the range. That's not all that difficult either, as the range can always provide an extra primitive, but it is a bit annoying (e.g. because it makes iteration with foreach impossible if you want the index, unless you return a tuple with each step).

At any rate, I agree with two things. One, we need to fix the foreach situation. Two, even before we find a fix: at this point, committing to iteration with opApply essentially commits the iteratee to an island where all basic algorithms need to be reinvented from first principles.

Andrei
Jan 17 2011
On 01/18/2011 07:11 AM, Andrei Alexandrescu wrote:

> It's a bit more difficult than that. When iterating a variable-length
> encoded range, what you need more than the current item being iterated is
> the physical offset reached inside the range. That's not all that
> difficult either as the range can always provide an extra primitive, but
> a bit annoying (e.g. because it makes iteration with foreach impossible
> if you want the index, unless you return a tuple with each step).

This is a very valid point: a range's logical offset is not necessarily equal to its physical (hum) offset, even on a plain sequence. But in the case of Text it is, in fact, precisely because code points have been grouped into "piles", each representing a true character (grapheme). This is actually one third of the purpose of Text (the others being to ensure a unique representation of each character, and to provide users with a clear interface). Thus, Jonathan's point simply applies to Text.

> At any rate, I agree with two things - one, we need to fix the foreach
> situation. Two, even before we find a fix, at this point committing to
> iteration with opApply essentially commits the iteratee to an island
> where all basic algorithms need to be reinvented from first principles.

I agree. The situation would be different if D had not proposed indexed iteration already, and programmers routinely counted manually and/or called an extra range primitive, as you say.

Upon using opApply: it works fine nevertheless, at least for a first rough implementation like in the case of Text. Reinventing basic algos is not an issue at this stage, as long as they are simple enough, and mainly for testing. (Actually, it can be an advantage in avoiding integration issues, possibly due to D's current beta stage -- I mean bugs that pop up only when combining given features -- like we had e.g. with range & formatValue.)

Denis
_________________
vita es estrany
spir.wikidot.com
Jan 18 2011
On Tue, 18 Jan 2011 01:11:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

> At any rate, I agree with two things - one, we need to fix the foreach
> situation. Two, even before we find a fix, at this point committing to
> iteration with opApply essentially commits the iteratee to an island
> where all basic algorithms need to be reinvented from first principles.

opApply in no way disables the range interface. It is simply used for foreach. So the only "algorithm" which behaves differently is foreach itself; if you use the range primitives, opApply is nowhere to be found.

That being said, we have an annoying situation in all this: opApply must be used to foreach with indexes, *and* ranges are used to foreach over elements. If any opApply is found, the compiler gives up on using the range functions for foreach (this is reflected in my most recent string_t code). This means you will have to implement a "wrapper" opApply around the range primitives in order to also implement indexed foreach.

-Steve
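A sketch of the wrapper described above (the type and method names here are made up for illustration, not taken from the actual string_t code): once any opApply exists, the compiler stops using the range primitives for foreach, so a plain overload forwarding to them must be provided alongside the indexed one.

```d
struct MyText
{
    dstring data;

    // Range primitives (used by algorithms, and by the wrappers below).
    @property bool empty() const { return data.length == 0; }
    @property dchar front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }

    // Plain foreach: must be reintroduced once any opApply is present,
    // since the compiler then ignores the range primitives for foreach.
    int opApply(scope int delegate(dchar) dg)
    {
        for (auto r = this; !r.empty; r.popFront())
            if (auto res = dg(r.front)) return res;
        return 0;
    }

    // Indexed foreach: same iteration, with a counter threaded through.
    int opApply(scope int delegate(size_t, dchar) dg)
    {
        size_t i = 0;
        for (auto r = this; !r.empty; r.popFront(), ++i)
            if (auto res = dg(i, r.front)) return res;
        return 0;
    }
}

unittest
{
    auto t = MyText("ab"d);
    size_t n;
    foreach (i, c; t) // indexed form, via the second opApply
    {
        assert(i == n);
        assert(c == "ab"d[i]);
        ++n;
    }
    assert(n == 2);
    foreach (c; t) { } // plain form still works, via the first opApply
}
```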
Jan 19 2011
On 01/18/2011 06:48 AM, Jonathan M Davis wrote:

> While it would be nice at times to be able to have an index with foreach
> when using ranges, I would point out that it's trivial to just declare a
> variable which you increment each iteration, so it's easy to get an index
> even when using foreach with ranges. Certainly, I wouldn't consider the
> lack of index with foreach and ranges a good reason to use opApply
> instead of ranges. There may be other reasons which make it worthwhile,
> but it's so trivial to get an index that the loss of range abilities
> (particularly the ability to use such ranges with std.algorithm) dwarfs
> it in importance.

You are right; I fully agree, in fact. On the other hand, think of the expectations of users of a library providing iteration over "naturally" sequential thingies. The point is that D makes indexed iteration available elsewhere.

Denis
_________________
vita es estrany
spir.wikidot.com
Jan 18 2011
On 01/14/2011 04:50 PM, Michel Fortin wrote:

> > This might be a good time to see whether we need to address graphemes
> > systematically. Could you please post a few links that would educate me
> > and others in the mysteries of combining characters?
>
> As usual, Wikipedia offers a good summary and a couple of references.
> Here's the part about combining characters:
> <http://en.wikipedia.org/wiki/Combining_character>.
> There's basically four ranges of code points which are combining:
> - Combining Diacritical Marks (0300–036F)
> - Combining Diacritical Marks Supplement (1DC0–1DFF)
> - Combining Diacritical Marks for Symbols (20D0–20FF)
> - Combining Half Marks (FE20–FE2F)
> A code point followed by one or more code points in these ranges is
> conceptually a single character (a grapheme).

Unfortunately, things are complicated by _prepend_ combining marks that happen in a code sequence _before_ the base mark. The Unicode algorithm is described here: http://unicode.org/reports/tr29/ section 3 (humanly readable ;-). See esp. the first table in section 3.1.

Denis
_________________
vita es estrany
spir.wikidot.com
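A small illustration of why these combining code points matter for comparisons. (std.uni.normalize used below is from today's Phobos, which did not yet exist at the time of this thread):

```d
import std.uni : normalize, NFC;

void main()
{
    string precomposed = "\u00E9";  // 'é' as a single code point
    string combining   = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT

    // The two spellings denote the same grapheme but differ code unit by
    // code unit, so a naive comparison says they are different:
    assert(precomposed != combining);

    // After canonical composition (NFC) they compare equal:
    assert(normalize!NFC(combining) == precomposed);
}
```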
Jan 17 2011
On 01/11/2011 02:30 PM, Steven Schveighoffer wrote:On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:People interested in solving the general problem with Unicode strings may have a look at https://bitbucket.org/denispir/denispir-d. All constructive feedback welcome. (This will be asked for review in a short while. The main / client interface module is Text.d. A (long) presentation of the issues, reasons, solution can be found in the text called "U missing level of abstraction") Denis _________________ vita es estrany spir.wikidot.comI've been thinking on how to better deal with Unicode strings. Currently strings are formally bidirectional ranges with a surreptitious random access interface. The random access interface accesses the support of the string, which is understood to hold data in a variable-encoded format. For as long as the programmer understands this relationship, code for string manipulation can be written with relative ease. However, there is still room for writing wrong code that looks legit. Sometimes the best way to tackle a hairy reality is to invite it to the negotiation table and offer it promotion to first-class abstraction status. Along that vein I was thinking of defining a new range: VLERange, i.e. Variable Length Encoding Range. Such a range would have the power somewhere in between bidirectional and random access. The primitives offered would include empty, access to front and back, popFront and popBack (just like BidirectionalRange), and in addition properties typical of random access ranges: indexing, slicing, and length. Note that the result of the indexing operator is not the same as the element type of the range, as it only represents the unit of encoding. In addition to these (and connecting the two), a VLERange would offer two additional primitives: 1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element. 2. 
size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element. In both cases, offset is assumed to be at the beginning of a logical element of the range. I suspect that a lot of functions in std.string can be written without Unicode-specific knowledge just by relying on such an interface. Moreover, algorithms can be generalized to other structures that use variable-length encoding, such as those used in data compression. (In that case, the support would be a bit array and the encoded type would be ubyte.) Writing to such ranges is not addressed by this design. Ideas are welcome. Adding VLERange would legitimize strings and would clarify their handling, at the cost of adding one additional concept that needs to be minded. Is the trade-off worthwhile?While this makes it possible to write algorithms that only accept VLERanges, I don't think it solves the major problem with strings -- they are treated as arrays by the compiler. I'd also rather see an indexing operation return the element type, and have a separate function to get the encoding unit. This makes more sense for generic code IMO. I noticed you never commented on my proposed string type... That reminds me, I should update with suggested changes and re-post it.
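For UTF-8 the two proposed primitives can be computed directly from the bytes: the forward step from the lead byte, the backward step by scanning over continuation bytes. A minimal Python sketch of the idea follows; the names `step_size`/`back_step_size` and the bytes-based interface are illustrative assumptions here, not any actual Phobos API, and `offset` is assumed to sit at the beginning of a logical element, as in Andrei's spec:

```python
def step_size(buf: bytes, offset: int) -> int:
    # Length of the code point starting at `offset`, read from the lead byte.
    b = buf[offset]
    if b < 0x80:
        return 1   # ASCII, 0xxxxxxx
    if b < 0xE0:
        return 2   # 110xxxxx lead byte
    if b < 0xF0:
        return 3   # 1110xxxx lead byte
    return 4       # 11110xxx lead byte

def back_step_size(buf: bytes, offset: int) -> int:
    # Size of the code point that ends just before `offset`:
    # scan backward over continuation bytes (10xxxxxx).
    n = 1
    while buf[offset - n] & 0xC0 == 0x80:
        n += 1
    return n

buf = "a\u00e9\u20ac".encode("utf-8")   # 1-, 2-, and 3-byte code points
assert step_size(buf, 0) == 1 and step_size(buf, 1) == 2 and step_size(buf, 3) == 3
assert back_step_size(buf, len(buf)) == 3 and back_step_size(buf, 3) == 2
```

Note that for a bit-array support (the compression case Andrei mentions) the step sizes would come from the code tables rather than from lead bytes, but the interface shape stays the same.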
Jan 11 2011
Andrei Alexandrescu wrote:I've been thinking on how to better deal with Unicode strings. Currently strings are formally bidirectional ranges with a surreptitious random access interface. The random access interface accesses the support of the string, which is understood to hold data in a variable-encoded format. For as long as the programmer understands this relationship, code for string manipulation can be written with relative ease. However, there is still room for writing wrong code that looks legit. Sometimes the best way to tackle a hairy reality is to invite it to the negotiation table and offer it promotion to first-class abstraction status. Along that vein I was thinking of defining a new range: VLERange, i.e. Variable Length Encoding Range. Such a range would have the power somewhere in between bidirectional and random access. The primitives offered would include empty, access to front and back, popFront and popBack (just like BidirectionalRange), and in addition properties typical of random access ranges: indexing, slicing, and length.For some compressions implementing *back is troublesome if not impossible...Note that the result of the indexing operator is not the same as the element type of the range, as it only represents the unit of encoding.It's worth mentioning explicitly -- a VLERange is dually typed. It's important for searching. Statically check if original and encoded match; if so, perform fast search directly on the encoded elements. I think an important feature of a VLERange should be dropping itself down to an encoded-typed range, so that front and back return raw data. Dual typing will also affect foreach -- in the general case you'd want to choose whether to decode or not by typing the element.
I can't stop thinking that VLERange is a two-piece bikini making a bare random-access range safe to look at, and that you can take off when partners have confidence, not a limited random-access probing facility to span the void between front and back.In addition to these (and connecting the two), a VLERange would offer two additional primitives: 1. size_t stepSize(size_t offset) gives the length of the step needed to skip to the next element. 2. size_t backstepSize(size_t offset) gives the size of the _backward_ step that goes to the previous element. In both cases, offset is assumed to be at the beginning of a logical element of the range.So when I move the spinner in an iPod, I get catapulted in position with the raw data opIndex and from there I try to work my way to the next frame to start playback. Sounds promising.I suspect that a lot of functions in std.string can be written without Unicode-specific knowledge just by relying on such an interface. Moreover, algorithms can be generalized to other structures that use variable-length encoding, such as those used in data compression. (In that case, the support would be a bit array and the encoded type would be ubyte.)I agree, acknowledging encoding/compression as a general direction will bring substantial benefits.Writing to such ranges is not addressed by this design. Ideas are welcome.Yeah, we can address outputting later, that's fair.Adding VLERange would legitimize strings and would clarify their handling, at the cost of adding one additional concept that needs to be minded. Is the trade-off worthwhile?Well, the only way to find out is to try it. My advice: VLERanges originated as a solution to the string problem, so start with a non-string incarnation. Having at least two (one, we know, is string) plugs that fit the same socket will spur confidence in the abstraction. -- Tomek
Jan 11 2011
Sorry if I'm jumping in here without the appropriate background, but I don't understand why jumping through these hoops is necessary. Please let me know if I'm missing anything. Many problems can be solved by another layer of indirection. Isn't a string essentially a bidirectional range of code points built on top of a random access range of code units? It seems to me that each abstraction separately already fits within the existing D range framework and all the difficulties arise as a consequence of trying to lump them into a single abstraction. Why not choose which of these abstractions is most appropriate in a given situation instead of trying to shoe-horn both concepts into a single abstraction, and provide for easy conversion between them? When character representation is the primary requirement then make it a bidirectional range of code points. When storage representation and random access is required then make it a random access range of code units.
Jan 11 2011
On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw acres.com.au> said:Sorry if I'm jumping in here without the appropriate background, but I don't understand why jumping through these hoops is necessary. Please let me know if I'm missing anything. Many problems can be solved by another layer of indirection. Isn't a string essentially a bidirectional range of code points built on top of a random access range of code units?Actually, displaying a UTF-8/UTF-16 string involves a range of glyphs layered over a range of graphemes layered over a range of code points layered over a range of code units. Glyphs represent the visual characters you can get from a font; they often map one-to-one with graphemes but not always (ligatures for instance). Graphemes are what people generally reason about when they see text (the so-called "user-perceived characters"); they often map one-to-one with code points but not always (combining marks for instance). Code points are a list of standardized codes representing various elements of a string, and code units basically encode the code points. If you're writing an XML, JSON or whatever else parser you'll probably care about code points. If you're advancing the insertion point in a text field or counting the number of user-perceived characters you'll probably want to deal with graphemes. For searching a substring inside a string, or comparing strings, you'll probably want to deal with either graphemes or collation elements (collation elements are layered on top of code points). To print a string you'll need to map graphemes to the glyphs from a particular font. 
Reducing string operations to code point manipulations will only work as long as all your graphemes, collation elements, or glyphs map one-to-one with code points.It seems to me that each abstraction separately already fits within the existing D range framework and all the difficulties arise as a consequence of trying to lump them into a single abstraction.It's true that each of these abstractions can fit within the existing range framework.Why not choose which of these abstractions is most appropriate in a given situation instead of trying to shoe-horn both concepts into a single abstraction, and provide for easy conversion between them? When character representation is the primary requirement then make it a bidirectional range of code points. When storage representation and random access is required then make it a random access range of code units.I think you're right. The need for a new concept isn't that great, and it gets complicated really fast. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
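Michel's layering can be observed concretely with Python's standard `unicodedata` module; this is purely an illustration of the counts at each layer (code units, code points, user-perceived characters), nothing D-specific:

```python
import unicodedata

# Decompose "é" into "e" + U+0301 (combining acute accent).
s = unicodedata.normalize("NFD", "\u00E9")

assert len(s) == 2                    # two code points ...
assert len(s.encode("utf-8")) == 3    # ... three UTF-8 code units ...
# ... yet one user-perceived character (grapheme): a base letter
# followed by a combining mark (nonzero combining class).
assert unicodedata.combining(s[1]) != 0
```

The same string thus has three different "lengths" depending on which layer you ask, which is exactly why conflating the layers produces wrong code that looks legit.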
Jan 11 2011
Michel Fortin wrote:On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw acres.com.au> said:I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points is always the appropriate level of abstraction.Why not choose which of these abstractions is most appropriate in a given situation instead of trying to shoe-horn both concepts into a single abstraction, and provide for easy conversion between them? When character representation is the primary requirement then make it a bidirectional range of code points. When storage representation and random access is required then make it a random access range of code units.I think you're right. The need for a new concept isn't that great, and it gets complicated really fast.
Jan 12 2011
On 1/12/11 11:28 AM, Don wrote:Michel Fortin wrote:On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw acres.com.au> said:Why not choose which of these abstractions is most appropriate in a given situation instead of trying to shoe-horn both concepts into a single abstraction, and provide for easy conversion between them? When character representation is the primary requirement then make it a bidirectional range of code points. When storage representation and random access is required then make it a random access range of code units.I think you're right. The need for a new concept isn't that great, and it gets complicated really fast.I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points is always the appropriate level of abstraction.

I hope to assuage part of that issue with representation(). Again, it's not documented yet (mainly because of the famous ddoc bug that prevents auto functions from carrying documentation). Here it is:

/**
 * Returns the representation type of a string, which is the same type
 * as the string except the character type is replaced by $(D ubyte),
 * $(D ushort), or $(D uint) depending on the character width.
 *
 * Example:
 * ----
 * string s = "hello";
 * static assert(is(typeof(representation(s)) == immutable(ubyte)[]));
 * ----
 */
/*private*/ auto representation(Char)(Char[] s) if (isSomeChar!Char)
{
    // Get representation type
    static if (Char.sizeof == 1) enum t = "ubyte";
    else static if (Char.sizeof == 2) enum t = "ushort";
    else static if (Char.sizeof == 4) enum t = "uint";
    else static assert(false); // can't happen due to isSomeChar!Char

    // Get representation qualifier
    static if (is(Char == immutable)) enum q = "immutable";
    else static if (is(Char == const)) enum q = "const";
    else static if (is(Char == shared)) enum q = "shared";
    else enum q = "";

    // Result type is qualifier(RepType)[]
    static if (q.length)
        return mixin("cast(" ~ q ~ "(" ~ t ~ ")[]) s");
    else
        return mixin("cast(" ~ t ~ "[]) s");
}

Andrei
Jan 12 2011
On 01/12/2011 08:28 PM, Don wrote:I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points is always the appropriate level of abstraction.I'd like to know when it happens that codepoint is the appropriate level of abstraction. * If pieces of text are not manipulated, meaning just used in the application, or just transferred via the application as is (from file / input / literal to any kind of output), then any kind of encoding just works. One can even concatenate, provided all pieces use the same encoding. --> _lower_ level than codepoint is OK. * But any kind of manipulation (indexing, slicing, compare, search, count, replace, not to speak about regex/parsing) requires operating at the _higher_ level of characters (in the common sense). Just like with historic character sets in which codes used to represent characters (not lower-level thingies as in UCS). Else, one reads, compares, changes meaningless bits of text. As I see it now, we need 2 types: * One plain string similar to good old ones (bytestring would do the job, since most unicode is utf8 encoded) for the first kind of use above. With optional validity check when it's supposed to be unicode text. * One higher-level type abstracting from codepoint (not code unit) issues, restoring the necessary properties: (1) each character is one element in the sequence (2) each character is always represented the same way. Denis _________________ vita es estrany spir.wikidot.com
Jan 12 2011
spir wrote:On 01/12/2011 08:28 PM, Don wrote:When on a document that describes code points... :)I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points is always the appropriate level of abstraction.I'd like to know when it happens that codepoint is the appropriate level of abstraction.* If pieces of text are not manipulated, meaning just used in the application, or just transferred via the application as is (from file / input / literal to any kind of output), then any kind of encoding just works. One can even concatenate, provided all pieces use the same encoding. --> _lower_ level than codepoint is OK. * But any of manipulation (indexing, slicing, compare,Compare according to which alphabet's ordering? Surely not Unicode's... I may be alone in this, but ordering is tied to an alphabet (or writing system), not locale. I try to solve that issue with my trileri library: http://code.google.com/p/trileri/source/browse/#svn%2Ftrunk%2Ftr Warning: the code is in Turkish and is not aware of the concept of collation at all; it has its own simplistic view of text, where every character is an entity that can be lower/upper cased to a single character.search, count, replace, not to speak about regex/parsing) requires operating at the _higher_ level of characters (in the common sense).I don't know this about Unicode: should e and ´ (acute accent) always be collated? If so, wouldn't it be impossible to put those two in that order, say, in a text book? (Perhaps Unicode defines a way to stop collation.)Just like with historic character sets in which codes used to represent characters (not lower-level thingies as in UCS). Else, one reads, compares, changes meaningless bits of text. As I see it now, we need 2 types:I think we need more than 2 types...* One plain string similar to good old ones (bytestring would do the job, since most unicode is utf8 encoded) for the first kind of use above. 
With optional validity check when it's supposed to be unicode text.Agreed. D gives us three UTF encodings, but I am not sure that there is only one abstraction above that.* One higher-level type abstracting from codepoint (not code unit) issues, restoring the necessary properties: (1) each character is one element in the sequence (2) each character is always represented the same way.I think VLERange should solve only the variable-length-encoding issue. It should not get into higher abstractions. Ali
Jan 12 2011
On 2011-01-12 14:57:58 -0500, spir <denis.spir gmail.com> said:On 01/12/2011 08:28 PM, Don wrote:I agree with you. I don't see many use for code points. One of these uses is writing a parser for a format defined in term of code points (XML for instance). But beyond that, I don't see one.I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points is always the appropriate level of abstraction.I'd like to know when it happens that codepoint is the appropriate level of abstraction.* If pieces of text are not manipulated, meaning just used in the application, or just transferred via the application as is (from file / input / literal to any kind of output), then any kind of encoding just works. One can even concatenate, provided all pieces use the same encoding. --> _lower_ level than codepoint is OK. * But any of manipulation (indexing, slicing, compare, search, count, replace, not to speak about regex/parsing) requires operating at the _higher_ level of characters (in the common sense). Just like with historic character sets in which codes used to represent characters (not lower-level thingies as in UCS). Else, one reads, compares, changes meaningless bits of text.Very true. In the same way that code points can span on multiple code units, user-perceived characters (graphemes) can span on multiple code points. A funny exercise to make a fool of an algorithm working only with code points would be to replace the word "fortune" in a text containing the word "fortuné". If the last "é" is expressed as two code points, as "e" followed by a combining acute accent (this: é), replacing occurrences of "fortune" by "expose" would also replace "fortuné" with "exposé" because the combining acute accent remains as the code point following the word. Quite amusing, but it doesn't really make sense that it works like that. 
In the case of "é", we're lucky enough to also have a pre-combined character to encode it as a single code point, so encountering "é" written as two code points is quite rare. But not all combinations of marks and characters can be represented as a single code point. The correct thing to do is to treat "é" (single code point) and "é" ("e" + combining acute accent) as equivalent. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
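The failure Michel describes is easy to reproduce with a plain code-point-level replace; here is a quick Python sketch of it (illustrative only):

```python
# "fortuné" with é spelled as "e" + U+0301 (combining acute accent)
text = "fortune\u0301"

# A naive code-point-level replace matches the bare "fortune" ...
replaced = text.replace("fortune", "expose")

# ... and the combining accent stays behind, attaching to the new word,
# so the result reads as "exposé".
assert replaced == "expose\u0301"
```

A grapheme-aware (or normalization-aware) replace would refuse the match, because the "e" in the text is only part of the character "é".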
Jan 12 2011
On 2011-01-12 19:45:36 -0500, Michel Fortin <michel.fortin michelf.com> said:A funny exercise to make a fool of an algorithm working only with code points would be to replace the word "fortune" in a text containing the word "fortuné". If the last "é" is expressed as two code points, as "e" followed by a combining acute accent (this: é), replacing occurrences of "fortune" by "expose" would also replace "fortuné" with "exposé" because the combining acute accent remains as the code point following the word. Quite amusing, but it doesn't really make sense that it works like that. In the case of "é", we're lucky enough to also have a pre-combined character to encode it as a single code point, so encountering "é" written as two code points is quite rare. But not all combinations of marks and characters can be represented as a single code point. The correct thing to do is to treat "é" (single code point) and "é" ("e" + combining acute accent) as equivalent.Crap, I meant to send this as UTF-8 with combining characters in it, but my news client converted everything to ISO-8859-1. I'm not sure it'll work, but here's my second attempt at posting real combining marks: Single code point: é e with combining mark: é t with combining mark: t̂ t with two combining marks: t̂̃ -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 12 2011
On 01/13/2011 01:51 AM, Michel Fortin wrote:On 2011-01-12 19:45:36 -0500, Michel Fortin <michel.fortin michelf.com> said:Works :-) But your first post worked as well by me: for instance <<"é" ("e" + combining acute accent)>> was displayed "é" as a single accented letter. I guess maybe your email client did not convert into iso-8859-1 on sending, but on reading (mine is set for utf-8). Denis _________________ vita es estrany spir.wikidot.comA funny exercise to make a fool of an algorithm working only with code points would be to replace the word "fortune" in a text containing the word "fortuné". If the last "é" is expressed as two code points, as "e" followed by a combining acute accent (this: é), replacing occurrences of "fortune" by "expose" would also replace "fortuné" with "exposé" because the combining acute accent remains as the code point following the word. Quite amusing, but it doesn't really make sense that it works like that. In the case of "é", we're lucky enough to also have a pre-combined character to encode it as a single code point, so encountering "é" written as two code points is quite rare. But not all combinations of marks and characters can be represented as a single code point. The correct thing to do is to treat "é" (single code point) and "é" ("e" + combining acute accent) as equivalent.Crap, I meant to send this as UTF-8 with combining characters in it, but my news client converted everything to ISO-8859-1. I'm not sure it'll work, but here's my second attempt at posting real combining marks: Single code point: é e with combining mark: é t with combining mark: t̂ t with two combining marks: t̂̃
Jan 13 2011
On 01/13/2011 01:45 AM, Michel Fortin wrote:On 2011-01-12 14:57:58 -0500, spir <denis.spir gmail.com> said:Actually, I had once a real use case for codepoint being the proper level of abstraction: a linguistic app of which one operational func counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you see what I mean. Once the text is properly NFD decomposed, each of those marks is coded as a codepoint. (But if it's not decomposed, then most of those marks are probably hidden by precomposed codes coding characters like "ä".) So that even such an app benefits from a higher-level type basically operating on normalised (NFD) characters.On 01/12/2011 08:28 PM, Don wrote:I agree with you. I don't see many use for code points. One of these uses is writing a parser for a format defined in term of code points (XML for instance). But beyond that, I don't see one.I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points is always the appropriate level of abstraction.I'd like to know when it happens that codepoint is the appropriate level of abstraction.You'll find another example in the introduction of the text at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction About your last remark, this is precisely one of the two abstractions my Text type provides: it groups together in "piles" codes that belong to the same "true" character (grapheme) like "é". So that the resulting text representation is a sequence of "piles", each representing a character. Consequence: indexing, slicing, etc work sensibly (and even other operations are faster for they do not need to perform that "piling" again & again). In addition to that, the string is first NFD-normalised, thus each character can have one & only one representation. Consequence: search, count, replace, etc, and compare (*) work as expected. 
In your case:

// 2 forms of "é"
assert(Text("\u00E9") == Text("\u0065\u0301"));

Denis (*) According to UCS coding, not language-specific idiosyncrasies. More generally, Text abstracts from lower-level issues _introduced_ by UCS, Unicode's character set. It does not cope with script-, language-, culture-, domain-, app-specific needs such as custom text sorting rules. Some base routines for such operations are provided by Text's brother lib DUnicode (access to some code properties, safe concat, casefolded compare, NF* normalisation). _________________ vita es estrany spir.wikidot.com* If pieces of text are not manipulated, meaning just used in the application, or just transferred via the application as is (from file / input / literal to any kind of output), then any kind of encoding just works. One can even concatenate, provided all pieces use the same encoding. --> _lower_ level than codepoint is OK. * But any of manipulation (indexing, slicing, compare, search, count, replace, not to speak about regex/parsing) requires operating at the _higher_ level of characters (in the common sense). Just like with historic character sets in which codes used to represent characters (not lower-level thingies as in UCS). Else, one reads, compares, changes meaningless bits of text.Very true. In the same way that code points can span on multiple code units, user-perceived characters (graphemes) can span on multiple code points. A funny exercise to make a fool of an algorithm working only with code points would be to replace the word "fortune" in a text containing the word "fortuné". If the last "é" is expressed as two code points, as "e" followed by a combining acute accent (this: é), replacing occurrences of "fortune" by "expose" would also replace "fortuné" with "exposé" because the combining acute accent remains as the code point following the word. Quite amusing, but it doesn't really make sense that it works like that. 
In the case of "é", we're lucky enough to also have a pre-combined character to encode it as a single code point, so encountering "é" written as two code points is quite rare. But not all combinations of marks and characters can be represented as a single code point. The correct thing to do is to treat "é" (single code point) and "é" ("e" + combining acute accent) as equivalent.
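The equality spir's Text assert expresses corresponds to Unicode canonical equivalence. Since Text itself is a D type not shown here, the same equivalence can be illustrated in Python by normalizing both forms with the stdlib `unicodedata` module:

```python
import unicodedata

precomposed = "\u00E9"   # "é" as a single code point
decomposed  = "e\u0301"  # "é" as "e" + combining acute accent

# At the raw code point level the two spellings differ ...
assert precomposed != decomposed
# ... but they are canonically equivalent: normalizing either form
# to the other's normal form makes them compare equal.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

Normalizing once at construction time, as Text does, pays the normalization cost a single time instead of on every comparison.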
Jan 13 2011
On Thursday 13 January 2011 01:49:31 spir wrote:On 01/13/2011 01:45 AM, Michel Fortin wrote:On 2011-01-12 14:57:58 -0500, spir <denis.spir gmail.com> said:On 01/12/2011 08:28 PM, Don wrote:I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points is always the appropriate level of abstraction.I'd like to know when it happens that codepoint is the appropriate level of abstraction.I agree with you. I don't see many use for code points. One of these uses is writing a parser for a format defined in term of code points (XML for instance). But beyond that, I don't see one.Actually, I had once a real use case for codepoint being the proper level of abstraction: a linguistic app of which one operational func counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you see what I mean. Once the text is properly NFD decomposed, each of those marks is coded as a codepoint. (But if it's not decomposed, then most of those marks are probably hidden by precomposed codes coding characters like "ä".) So that even such an app benefits from a higher-level type basically operating on normalised (NFD) characters.There's also the question of efficiency. On the whole, string operations can be very expensive - particularly when you're doing a lot of them. The fact that D's arrays are so powerful may reduce the problem in D, but in general, if you're doing a lot with strings, it can get costly, performance-wise. The question then is what is the cost of actually having strings abstracted to the point that they really are ranges of characters rather than code units or code points or whatever? If the cost is large enough, then dealing with strings as arrays as they currently are and having the occasional unicode issue could very well be worth it. 
As it is, there are plenty of people who don't want to have to care about unicode in the first place, since the programs that they write only deal with ASCII characters. The fact that D makes it so easy to deal with unicode code points is a definite improvement, but taking the abstraction to the point that you're definitely dealing with characters rather than code units or code points could be too costly. Now, if it can be done efficiently, then having unicode dealt with properly without the programmer having to worry about it would be a big boon. As it is, D's handling of unicode is a big boon, even if it doesn't deal with graphemes and the like. So, I think that we definitely should have an abstraction for unicode which uses characters as the elements in the range and doesn't have to care about the underlying encoding of the characters (except perhaps picking whether char, wchar, or dchar is used internally, and therefore how much space it requires). However, I'm not at all convinced that such an abstraction can be done efficiently enough to make it the default way of handling strings. - Jonathan M Davis
Jan 13 2011
On 01/13/2011 11:16 AM, Jonathan M Davis wrote:On Thursday 13 January 2011 01:49:31 spir wrote:D's arrays (even dchar[] & dstring) do not allow having correct results when dealing with UCS/Unicode text in the general case. See Michel's example (and several ones I posted on this list, and the text at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction for a very lengthy explanation). You and some other people seem to still mistake Unicode's low-level issue of codepoint vs code unit for the higher-level issue of codes _not_ representing characters in the common sense ("graphemes"). The above pointed text was written precisely to introduce to this issue because obviously no-one wants to face it... (Eg each time I evoke it on this list it is ignored, except by Michel, but the same is true everywhere else, including on the Unicode mailing list!). The core of the problem is the misleading term "abstract character" which deceivingly lets programmers believe that a codepoint codes a character, like in historic character sets -- which is *wrong*. No Unicode document AFAIK explains this. This is a case of unsaid lie. Compared to legacy charsets, dealing with Unicode actually requires *2* levels of abstraction... (one to decode codepoints from code units, one to construct characters from codepoints) Note that D's stdlib currently provides no means to do this, not even on the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode library) (good luck ;-). But even ICU, as well as supposedly unicode-aware types or libraries for any language, would not give you an abstraction producing correct results for Michel's example. For instance, Python3 code fails as miserably as any other. AFAIK, D is the first and only language having such a tool (Text.d at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).On 01/13/2011 01:45 AM, Michel Fortin wrote:On 2011-01-12 14:57:58 -0500, spir<denis.spir gmail.com> said:Actually, I had once a real use case for codepoint being the proper level of abstraction: a linguistic app of which one operational func counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you see what I mean. Once the text is properly NFD decomposed, each of those marks is coded as a codepoint. (But if it's not decomposed, then most of those marks are probably hidden by precomposed codes coding characters like "ä".) So that even such an app benefits from a higher-level type basically operating on normalised (NFD) characters.On 01/12/2011 08:28 PM, Don wrote:I agree with you. I don't see many use for code points. One of these uses is writing a parser for a format defined in term of code points (XML for instance). But beyond that, I don't see one.I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points is always the appropriate level of abstraction.I'd like to know when it happens that codepoint is the appropriate level of abstraction.There's also the question of efficiency. 
On the whole, string operations can be very expensive - particularly when you're doing a lot of them. The fact that D's arrays are so powerful may reduce the problem in D, but in general, if you're doing a lot with strings, it can get costly, performance-wise.On 2011-01-12 14:57:58 -0500, spir<denis.spir gmail.com> said:Actually, I had once a real use case for codepoint beeing the proper level of abstraction: a linguistic app of which one operational func counts occurrences of "scripting marks" like 'a'& '¨' in "ä". hope you see what I mean. Once the text is properly NFD decomposed, each of those marks in coded as a codepoint. (But if it's not decomposed, then most of those marks are probably hidden by precomposed codes coding characters like "ä".) So that even such an app benefits from a higher-level type basically operating on normalised (NFD) characters.On 01/12/2011 08:28 PM, Don wrote:I agree with you. I don't see many use for code points. One of these uses is writing a parser for a format defined in term of code points (XML for instance). But beyond that, I don't see one.I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points is always the appropriate level of abstraction.I'd like to know when it happens that codepoint is the appropriate level of abstraction.The question then is what is the cost of actually having strings abstracted to the point that they really are ranges of characters rather than code units or code points or whatever? If the cost is large enough, then dealing with strings as arrays as they currently are and having the occasional unicode issue could very well be worth it. As it is, there are plenty of people who don't want to have to care about unicode in the first place, since the programs that they write only deal with ASCII characters. 
The fact that D makes it so easy to deal with unicode code points is a definite improvement, but taking the abstraction to the point that you're definitely dealing with characters rather than code units or code points could be too costly.When _manipulating_ text (indexing, search, changing), you have the choice between: * On-the-fly abstraction (composing characters on the fly, and/or normalising them), for each operation for each piece of text (including parameters, including literals). * Use of a type that constructs this abstraction once only for each piece of text. Note that a single count operation is forced to construct this abstraction on the fly for the whole text... (and for the searched snippet). Also note that optimisation is probably easier in the second case, for the abstraction operation is then standard.Now, if it can be done efficiently, then having unicode dealt with properly without the programmer having to worry about it would be a big boon. As it is, D's handling of unicode is a big boon, even if it doesn't deal with graphemes and the like.It has a cost at initial Text construction time. Currently, on my very slow computer, 1MB of source text requires ~ 500 ms (decoding + decomposition + ordering + "piling" codes into characters). Decoding only, using D's builtin std.utf.decode, takes about 100 ms. The bottleneck is piling: 70% of the time on average, on a test case melting texts from a dozen natural languages. We would be very glad to get the community's help in optimising this phase :-) (We have progressed very much already in terms of speed, but now reach limits of our competences.)So, I think that we definitely should have an abstraction for unicode which uses characters as the elements in the range and doesn't have to care about the underlying encoding of the characters (except perhaps picking whether char, wchar, or dchar is used internally, and therefore how much space it requires). 
However, I'm not at all convinced that such an abstraction can be done efficiently enough to make it the default way of handling strings.If you only have ASCII, or if you don't manipulate text at all, then as said in a previous post any string representation works fine (whatever the encoding it possibly uses under the hood). D's builtin char/dchar/wchar and string/dstring/wstring are very nice and well done, but they are not necessary in such a use case. Actually, as shown by Steven's repeated complaints, they rather get in the way when dealing with non-unicode source data (IIUC, by assuming string elements are utf codes). And they do not even try to solve the real issues one necessarily meets when manipulating unicode texts, which are due to UCS's coding format. Thus my previous statement: the level of codepoints is nearly never the proper level of abstraction.- Jonathan M DavisDenis _________________ vita es estrany spir.wikidot.com
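To make spir's "scripting marks" use case concrete: the count only works once the text is decomposed, since precomposed code points hide the marks. A minimal sketch in Python (chosen because its unicodedata module is compact; the function name is illustrative, not spir's actual Text.d API):

```python
# Counting combining marks ("scripting marks") in a string, as in the
# linguistic app described above. Requires NFD decomposition first, so
# that marks hidden inside precomposed characters like "ä" (U+00E4)
# become separate combining code points.
import unicodedata

def count_marks(text: str) -> int:
    nfd = unicodedata.normalize("NFD", text)
    return sum(1 for cp in nfd if unicodedata.combining(cp) != 0)

# "ä" precomposed is a single code point; NFD splits it into 'a' + U+0308.
assert count_marks("\u00e4") == 1
# Without decomposition, the mark is invisible to a code point scan:
assert sum(1 for cp in "\u00e4" if unicodedata.combining(cp)) == 0
```

This is exactly the case where a type that normalises once at construction pays off: every later count or search sees the marks directly.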
Jan 13 2011
On 2011-01-13 06:48:46 -0500, spir <denis.spir gmail.com> said:Note that D's stdlib currently provides no means to do this, not even on the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode library) (good luck ;-). But even ICU, as well as supposedly unicode-aware types or libraries for any language, would not give you an abstraction producing correct results for Michel's example. For instance, Python3 code fails as miserably as any other. AFAIK, D is the first and only language having such a tool (Text.d at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).D is not the first language dealing correctly with Unicode strings in this manner. Objective-C's NSString class search and compare methods deal with characters with combining marks correctly. If you want to compare code points, you can do so explicitly using the NSLiteralSearch option, but the default is to compare the canonical version (at the grapheme level). <http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html%23//apple_ref/doc/uid/20000149-CJBBGBAI> In Cocoa, string sorting and case-insensitive comparison is also dependent on the user's locale settings, although you can also specify your own locale if the user's locale is not what you want. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
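The difference Michel describes between NSLiteralSearch and the default canonical comparison boils down to normalizing before comparing. A hedged sketch of the idea in Python (illustration only, not Cocoa's actual implementation):

```python
# Literal (code point level) comparison vs canonical comparison.
import unicodedata

composed = "\u00e9"     # "é" as one precomposed code point
decomposed = "e\u0301"  # "é" as 'e' + combining acute accent

# A literal comparison (NSLiteralSearch-style) says they differ:
assert composed != decomposed

# A canonical comparison normalizes both sides first, so the two
# spellings of "é" compare equal:
nfc = lambda s: unicodedata.normalize("NFC", s)
assert nfc(composed) == nfc(decomposed)
```
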
Jan 13 2011
On 01/13/2011 02:47 PM, Michel Fortin wrote:On 2011-01-13 06:48:46 -0500, spir <denis.spir gmail.com> said:Thank you very much for this information (I feel less lonely ;-). I'll have a look at this NSString class ASAP, looks like it does The-Right-Thing by default (an Apple product...)Note that D's stdlib currently provides no means to do this, not even on the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode library) (good luck ;-). But even ICU, as well as supposedly unicode-aware types or libraries for any language, would not give you an abstraction producing correct results for Michel's example. For instance, Python3 code fails as miserably as any other. AFAIK, D is the first and only language having such a tool (Text.d at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).D is not the first language dealing correctly with Unicode strings in this manner. Objective-C's NSString class search and compare methods deal with characters with combining marks correctly. If you want to compare code points, you can do so explicitly using the NSLiteralSearch option, but the default is to compare the canonical version (at the grapheme level). <http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html%23//apple_ref/doc/uid/20000149-CJBBGBAI>In Cocoa, string sorting and case-insensitive comparison is also dependent on the user's locale settings, although you can also specify your own locale if the user's locale is not what you want.On this point, I'm more doubtful. (Locale settings do not guarantee anything about the right way of sorting for a given domain, a given app, a given use case. There is an infinity of potential choices. But maybe it's the right default? See kde trying to invent a, hum, "natural", way of sorting file names...) Denis _________________ vita es estrany spir.wikidot.com
Jan 13 2011
On 2011-01-13 14:11:44 -0500, spir <denis.spir gmail.com> said:In Cocoa, string sorting and case-insensitive comparison is also dependent on the user's locale settings, although you can also specify your own locale if the user's locale is not what you want. See kde trying to invent a, hum, "natural", way of sorting file names...)Mac OS has sorted file names in a "natural" way for a very long time (since Mac OS 8 I believe). By natural, I mean that numbers inside the file name are sorted in numeric order while the rest is sorted character by character. For instance "My File 2" will go before "My File 10" in file listings because "2" is less than "10". There's an option in NSString comparison methods to use this ordering, but it's not the default. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
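The "natural" ordering Michel describes can be sketched with a simple sort key that compares digit runs numerically and everything else character by character. This is an illustrative approximation, not the actual Mac OS (or Explorer) algorithm:

```python
# Natural sort key: split the name into digit and non-digit runs, and
# convert digit runs to ints so "2" sorts before "10".
import re

def natural_key(name: str):
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

names = ["My File 10", "My File 2"]
assert sorted(names, key=natural_key) == ["My File 2", "My File 10"]
```
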
Jan 13 2011
"Michel Fortin" <michel.fortin michelf.com> wrote in message news:igo5v2$gq2$1 digitalmars.com...On 2011-01-13 14:11:44 -0500, spir <denis.spir gmail.com> said:XP's explorer does that too. It's a very nice feature.In Cocoa, string sorting and case-insensitive comparison is also dependent on the user's locale settings, although you can also specify your own locale if the user's locale is not what you want. See kde trying to invent a, hum, "natural", way of sorting file names...)Mac OS has sorted file names in a "natural" way for a very long time (since Mac OS 8 I believe). By natural, I mean that numbers inside the file name are sorted in numeric order while the rest is sorted character by character. For instance "My File 2" will go before "My File 10" in file listings because "2" is less than "10".
Jan 13 2011
On Thursday 13 January 2011 03:48:46 spir wrote:On 01/13/2011 11:16 AM, Jonathan M Davis wrote:On Thursday 13 January 2011 01:49:31 spir wrote:On 01/13/2011 01:45 AM, Michel Fortin wrote:On 2011-01-12 14:57:58 -0500, spir<denis.spir gmail.com> said:On 01/12/2011 08:28 PM, Don wrote:I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points are always the appropriate level of abstraction.I'd like to know when it happens that codepoint is the appropriate level of abstraction.Actually, I had once a real use case for codepoint being the proper level of abstraction: a linguistic app of which one operational func counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you see what I mean. Once the text is properly NFD decomposed, each of those marks is coded as a codepoint. (But if it's not decomposed, then most of those marks are probably hidden by precomposed codes coding characters like "ä".) So that even such an app benefits from a higher-level type basically operating on normalised (NFD) characters.I agree with you. I don't see many uses for code points. One of these uses is writing a parser for a format defined in terms of code points (XML for instance). But beyond that, I don't see one.D's arrays (even dchar[] & dstring) do not allow having correct results when dealing with UCS/Unicode text in the general case. See Michel's example (and several ones I posted on this list, and the text at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction for a very lengthy explanation).There's also the question of efficiency. On the whole, string operations can be very expensive - particularly when you're doing a lot of them. The fact that D's arrays are so powerful may reduce the problem in D, but in general, if you're doing a lot with strings, it can get costly, performance-wise. 
You and some other people seem to still mistake Unicode's low level issue of codepoint vs code unit, with the higher-level issue of codes _not_ representing characters in the common sense ("graphemes"). The above pointed text was written precisely to introduce this issue because obviously no-one wants to face it... (Eg each time I evoke it on this list it is ignored, except by Michel, but the same is true everywhere else, including on the Unicode mailing list!). The core of the problem is the misleading term "abstract character" which deceivingly lets programmers believe that a codepoint codes a character, like in historic character sets -- which is *wrong*. No Unicode document AFAIK explains this. This is a lie by omission. Compared to legacy charsets, dealing with Unicode actually requires *2* levels of abstraction... (one to decode codepoints from code units, one to construct characters from codepoints) Note that D's stdlib currently provides no means to do this, not even on the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode library) (good luck ;-). But even ICU, as well as supposedly unicode-aware types or libraries for any language, would not give you an abstraction producing correct results for Michel's example. For instance, Python3 code fails as miserably as any other. AFAIK, D is the first and only language having such a tool (Text.d at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).I wasn't saying that code points are guaranteed to be characters. I was saying that in most cases they are, so if efficiency is an issue, then having properly abstract characters could be too costly. 
However, having a range type which properly abstracts characters and deals with whatever graphemes and normalization and whatnot that it has to would be a very good thing to have. The real question is whether it can be made efficient enough to even consider using it normally instead of just when you know that you're really going to need it. The fact that you're seeing such a large drop in performance with your Text type definitely would support the idea that it could be just plain too expensive to use such a type in the average case. Even something like a 20% drop in performance could be devastating if you're dealing with code which does a lot of string processing. Regardless though, there will obviously be cases where you'll need something like your Text type if you want to process unicode correctly. However, regardless of what the best way to handle unicode is in general, I think that it's painfully clear that your average programmer doesn't know much about unicode. Even understanding the nuances between char, wchar, and dchar is more than your average programmer seems to understand at first. The idea that a char wouldn't be guaranteed to be an actual character is not something that many programmers take to immediately. It's quite foreign to how chars are typically dealt with in other languages, and many programmers never worry about unicode at all, only dealing with ASCII. So, not only is unicode a rather disgusting problem, but it's not one that your average programmer begins to grasp as far as I've seen. Unless the issue is abstracted away completely, it takes a fair bit of explaining to understand how to deal with unicode properly. - Jonathan M DavisThe question then is what is the cost of actually having strings abstracted to the point that they really are ranges of characters rather than code units or code points or whatever? 
If the cost is large enough, then dealing with strings as arrays as they currently are and having the occasional unicode issue could very well be worth it. As it is, there are plenty of people who don't want to have to care about unicode in the first place, since the programs that they write only deal with ASCII characters. The fact that D makes it so easy to deal with unicode code points is a definite improvement, but taking the abstraction to the point that you're definitely dealing with characters rather than code units or code points could be too costly.When _manipulating_ text (indexing, search, changing), you have the choice between: * On the fly abstraction (composing characters on the fly, and/or normalising them), for each operation for each piece of text (including parameters, including literals). * Use of a type that constructs this abstraction once only for each piece of text. Note that a single count operation is forced to construct this abstraction on the fly for the whole text... (and for the searched snippet). Also note that optimisation is probably easier in the second case, for the abstraction operation is then standard.Now, if it can be done efficiently, then having unicode dealt with properly without the programmer having to worry about it would be a big boon. As it is, D's handling of unicode is a big boon, even if it doesn't deal with graphemes and the like.It has a cost at initial Text construction time. Currently, on my very slow computer, 1MB source text requires ~ 500 ms (decoding + decomposition + ordering + "piling" codes into characters). Decoding only using D's builtin std.utf.decode takes about 100 ms. The bottleneck is piling: 70% of the time on average, on a test case mixing texts from a dozen natural languages. We would be very glad to get the community's help in optimising this phase :-) (We have progressed very much already in terms of speed, but now reach the limits of our competence.) 
So, I think that we definitely should have an abstraction for unicode which uses characters as the elements in the range and doesn't have to care about the underlying encoding of the characters (except perhaps picking whether char, wchar, or dchar is used internally, and therefore how much space it requires). However, I'm not at all convinced that such an abstraction can be done efficiently enough to make it the default way of handling strings.If you only have ASCII, or if you don't manipulate text at all, then as said in a previous post any string representation works fine (whatever the encoding it possibly uses under the hood). D's builtin char/dchar/wchar and string/dstring/wstring are very nice and well done, but they are not necessary in such a use case. Actually, as shown by Steven's repeated complaints, they rather get in the way when dealing with non-unicode source data (IIUC, by assuming string elements are utf codes). And they do not even try to solve the real issues one necessarily meets when manipulating unicode texts, which are due to UCS's coding format. Thus my previous statement: the level of codepoints is nearly never the proper level of abstraction.
Jan 13 2011
On 2011-01-13 07:10:09 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:However, regardless of what the best way to handle unicode is in general, I think that it's painfully clear that your average programmer doesn't know much about unicode. Even understanding the nuances between char, wchar, and dchar is more than your average programmer seems to understand at first. The idea that a char wouldn't be guaranteed to be an actual character is not something that many programmers take to immediately. It's quite foreign to how chars are typically dealt with in other languages, and many programmers never worry about unicode at all, only dealing with ASCII. So, not only is unicode a rather disgusting problem, but it's not one that your average programmer begins to grasp as far as I've seen. Unless the issue is abstracted away completely, it takes a fair bit of explaining to understand how to deal with unicode properly.What's nice about Cocoa's way of handling strings is that even programmers not bothering about it get things right most of the time. Strings are compared in their canonical form (graphemes), unless you request a literal comparison; and they are sorted and compared case-insensitively according to the user's locale, unless you specify your own locale settings. Its only major pitfall is that indexing is done on UTF-16 code units. The cost for this correctness is a small performance penalty, but I think it's the right path to take. For when performance or access to code points is important, the programmer should still be able to go down one layer and play with code points directly. That said, we need to make sure the performance drop is minimal. I somewhat doubt that spir's approach of storing strings as an array of piles of characters is the right approach for most usage scenarios, but this area would need a little more research. 
spir's approach is certainly the ultimate step in correctness as it allows O(1) indexing of graphemes, but personally I'd favor not to have indexing and just do on-the-fly decoding at the grapheme level when performing various string operations. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
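Michel's on-the-fly alternative can be sketched roughly as follows. This toy clusterer only groups a base code point with its trailing combining marks and ignores the full UAX #29 grapheme rules (Hangul jamo, ZWJ sequences, etc.); it is meant only to illustrate decoding one grapheme at a time without building an index first:

```python
# On-the-fly grapheme-level iteration (simplified: base + combining marks).
import unicodedata

def graphemes(text: str):
    cluster = ""
    for cp in text:
        # A non-combining code point starts a new cluster.
        if cluster and unicodedata.combining(cp) == 0:
            yield cluster
            cluster = ""
        cluster += cp
    if cluster:
        yield cluster

# "än" with decomposed "ä": three code points, but two graphemes.
assert list(graphemes("a\u0308n")) == ["a\u0308", "n"]
```

Iterating this way costs O(n) per pass but needs no up-front construction, which is the trade-off against spir's pre-piled representation.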
Jan 13 2011
OT: Spir, do you know if I can change the syntax highlighting settings on bitbucket? I can't see anything with these gray on dark-gray colors: http://i.imgur.com/SmLk1.jpg
Jan 13 2011
"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message news:mailman.604.1294932704.4748.digitalmars-d puremagic.com...OT: Spir, do you know if I can change the syntax highlighting settings on bitbucket? I can't see anything with these gray on dark-gray colors: http://i.imgur.com/SmLk1.jpgI'm getting the same problem too.
Jan 13 2011
On 2011-01-13 15:39:14 -0500, "Nick Sabalausky" <a a.a> said:"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message news:mailman.604.1294932704.4748.digitalmars-d puremagic.com...I bypassed the problem by fetching the files from the repository. But I agree it's very annoying. -- Michel Fortin michel.fortin michelf.com http://michelf.com/OT: Spir, do you know if I can change the syntax highlighting settings on bitbucket? I can't see anything with these gray on dark-gray colors: http://i.imgur.com/SmLk1.jpgI'm getting the same problem too.
Jan 13 2011
On 01/13/2011 01:10 PM, Jonathan M Davis wrote:I wasn't saying that code points are guaranteed to be characters. I was saying that in most cases they are, so if efficiency is an issue, then having properly abstract characters could be too costly.The problem is then: how does a library or application programmer know, for sure, that all true characters (graphemes) from all source texts its software will ever deal with are coded with a single codepoint? If you cope with ASCII only now & forever, then you know that. If you do not manipulate text at all, then the question vanishes. Else, you cannot know, I guess. The problem is partially masked because most of us currently process only western language sources, for which scripts there exist precomposed codes for every _predefined_ character, and text-producing software (like editors) usually uses precomposed codes when available. Hope I'm clear. (I hope this use of precomposed codes will change because the gain in space for western langs is ridiculous and the cost in processing is instead relevant.) In the future, all of this may change, so that the issue would more often be obvious for many programmers dealing with international text. Note that even now nothing prevents a user (including a programmer in source code!), let alone a text-producing software, from using decomposed coding (the right choice imo). And there are true characters, and you can "invent" as many fancy characters as you like, for which no precomposed code is defined, indeed. All of this is valid unicode and must be properly dealt with.However, having a range type which properly abstracts characters and deals with whatever graphemes and normalization and whatnot that it has to would be a very good thing to have. The real question is whether it can be made efficient enough to even consider using it normally instead of just when you know that you're really going to need it. 
Regarding ranges, we initially planned to expose a range interface on our type for iteration, instead of opApply, for better integration with the coming D2 style and algorithms. But we had to drop it due to a few range bugs exposed in a previous thread (search for "range usability" IIRC).The fact that you're seeing such a large drop in performance with your Text type definitely would support the idea that it could be just plain too expensive to use such a type in the average case. Even something like a 20% drop in performance could be devastating if you're dealing with code which does a lot of string processing. Regardless though, there will obviously be cases where you'll need something like your Text type if you want to process unicode correctly.The question of efficiency is not as you present it. If you cannot guarantee that every character is coded by a single code (in all pieces of text, including params and literals), then you *must* construct an abstraction at the level of true characters --and even probably normalise them. You have the choice of doing it on the fly for _every_ operation, or using a tool like the type Text. In the latter case, not only is everything far simpler for client code, but the abstraction is constructed only once (and forever ;-). In the first case, the cost is the same (or rather higher because optimisation can probably be more efficient for a single standard case than for various operation cases); but _multiplied_ by the number of operations you need to perform on each piece of text. Thus, for a given operation, you get the slowest possible run: for instance indexing is O(k*n) where k is the cost of "piling" a single char, and n the char count... In the second case, the efficiency issue happens only initially for each piece of text. Then, every operation is as fast as possible: indexing is indeed O(1). 
But: this O(1) is slightly slower than with historic charsets because characters are now represented by mini code arrays instead of single codes. The same point applies even more for every operation involving compares (search, count, replace). We cannot solve this: it is due to UCS's coding scheme.However, regardless of what the best way to handle unicode is in general, I think that it's painfully clear that your average programmer doesn't know much about unicode.True. Even those who think they are informed. Because Unicode's docs not only ignore the problem, they all contribute to creating it by using the deceiving term "abstract character" (and often worse, "character" alone) to denote what a codepoint codes. All articles I have ever read _about_ Unicode by third parties simply follow suit. Raising this issue on the unicode mailing list usually results in plain silence.Even understanding the nuances between char, wchar, and dchar is more than your average programmer seems to understand at first. The idea that a char wouldn't be guaranteed to be an actual character is not something that many programmers take to immediately. It's quite foreign to how chars are typically dealt with in other languages, and many programmers never worry about unicode at all, only dealing with ASCII.(average programmer ? ;-) Not that much to "how chars are typically dealt with in other languages", rather to how characters were coded in historic charsets. Other languages ignore the issue, and thus run incorrectly with universal text, the same way as D's builtin tools do. About ASCII, note that the only kind of source it's able to encode is plain English text, without any bit of fancy thingy in it. A single non-breaking space, "≥", "×" (the product sign U+00D7), or a letter imported from a foreign language like in "à la" (same for "αβγ"), not to mention "©" & "®", is enough to fall outside ASCII...So, not only is unicode a rather disgusting problem, but it's not one that your average programmer begins to grasp as far as I've seen. 
Unless the issue is abstracted away completely, it takes a fair bit of explaining to understand how to deal with unicode properly.Please have a look at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3, read https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction, and try https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/Text.d Any feedback welcome (esp on reformulating the text concisely ;-)- Jonathan M DavisDenis _________________ vita es estrany spir.wikidot.com
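The "construct the abstraction once, then index in O(1)" trade-off spir describes above can be sketched like this. PiledText is a hypothetical toy, not spir's actual Text.d, and the clustering is simplified to base + combining marks (no full UAX #29 handling):

```python
# Pay the decomposition + "piling" cost once at construction; after
# that, indexing true characters is a plain array lookup.
import unicodedata

class PiledText:
    def __init__(self, s: str):
        nfd = unicodedata.normalize("NFD", s)
        self.piles = []  # one entry per true character (base + marks)
        for cp in nfd:
            if self.piles and unicodedata.combining(cp) != 0:
                self.piles[-1] += cp  # attach mark to previous pile
            else:
                self.piles.append(cp)

    def __getitem__(self, i):  # O(1) character indexing
        return self.piles[i]

    def __len__(self):
        return len(self.piles)

t = PiledText("h\u00e9h")     # "héh": 3 true characters
assert len(t) == 3
assert t[1] == "e\u0301"      # the piled character, in NFD form
```

Each "pile" is a small string rather than a single code, which is exactly why compares stay slightly slower than with historic one-code-per-character charsets.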
Jan 13 2011