digitalmars.D - Why the hell doesn't foreach decode strings
- Martin Nowak (6/6) Oct 20 2011 It just took me over one hour to find out the unthinkable.
- Peter Alexander (18/24) Oct 20 2011 D has got itself into a tricky situation in this regard. Doing it either...
- Walter Bright (5/8) Oct 20 2011 There is no 'fixing' it, even to break compatibility. Sometimes you want...
- Adam D. Ruppe (4/6) Oct 20 2011 Well, they could always cast it to a ubyte[].
- Jonathan M Davis (9/20) Oct 20 2011 True, but if the default were dchar, then the common case would be have ...
- Walter Bright (8/16) Oct 20 2011 Is that really the common case? It's certainly the *slow* case. Common s...
- Jonathan M Davis (52/74) Oct 20 2011 And why would you iterate over a string with foreach without decoding it...
- Walter Bright (18/32) Oct 20 2011 No, it doesn't. If I'm searching for a dchar, I'll be searching for a su...
- Martin Nowak (11/52) Oct 21 2011 You have a good point here. I would have immediately thrown out the loop...
- Walter Bright (3/9) Oct 21 2011 I understand the issue, but I don't think it's resolvable. It's a lot li...
- Andrei Alexandrescu (17/29) Oct 21 2011 It is resolvable, just not without breaking compatibility. Latching on
- bearophile (11/20) Oct 21 2011 Maybe I have missed your precedent explanations of this idea. Or to me t...
- Jacob Carlborg (6/20) Oct 22 2011 Can't we implement a new string type that people can choose to use if
- Timon Gehr (4/23) Oct 22 2011 Having multiple standard string types is bad. Furthermore, it is hard to...
- Michel Fortin (35/70) Oct 20 2011 It works for non-ASCII too. You're probably missing an interesting
- Jonathan M Davis (14/14) Oct 20 2011 Actually, I'd have to say that having foreach default to iterating over ...
- Walter Bright (3/7) Oct 20 2011 Making such a string type would be terribly inefficient. It would make D...
- Steven Schveighoffer (4/15) Oct 21 2011 I don't think it would. Do you have any proof to support this?
- Walter Bright (3/6) Oct 21 2011 I've done string processing code, and done a lot of profiling of them. E...
- Andrei Alexandrescu (4/11) Oct 21 2011 The key here is to allow people to easily either decode strings or not,
- Steven Schveighoffer (9/17) Oct 24 2011 What I mean is, default to a well-built string type, and let people who ...
- travert phare.normalesup.org (Christophe Travert) (18/26) Oct 28 2011 I definitely agree with you, but I have a piece of news for you :
- Andrei Alexandrescu (7/23) Oct 28 2011 It was to allow people who know Unicode to use strings without too much
- Norbert Nemec (16/20) Oct 24 2011 Indeed - expressing strings as arrays of characters will always fall
- Peter Alexander (23/33) Oct 21 2011 Then separate those ways of viewing strings.
- Jonathan M Davis (6/48) Oct 21 2011 In another post in this thread, Walter said in reference to post on
- so (8/14) Oct 21 2011 Well he is right, the reason built-in strings everywhere and rarely (so ...
- bearophile (5/6) Oct 21 2011 We should aim to something *better* than dynamic languages, where possib...
- so (11/21) Oct 21 2011 With spoiled i didn't mean those languages do it better in general
- Martin Nowak (18/29) Oct 21 2011 =
- Peter Alexander (2/17) Oct 22 2011 Which operations do you believe would be less efficient?
- Walter Bright (3/4) Oct 22 2011 All of the ones that don't require decoding, such as searching, would be...
- kennytm (3/8) Oct 22 2011 You can std.algorithm.find to do searching, not foreach. The former can
- Timon Gehr (2/10) Oct 22 2011 Afaics the current std.algorithm.find implementation decodes its argumen...
- Andrei Alexandrescu (5/16) Oct 22 2011 That can be easily fixed. Currently single-element find does decoding
- Timon Gehr (4/22) Oct 22 2011 Ok, I actually did not see that, thanks. However this is usually still
- Andrei Alexandrescu (4/7) Oct 22 2011 I understand. For some reason I'm not seeing the URL of your pull
- Steven Schveighoffer (11/15) Oct 24 2011 Searching that does not do decoding is fundamentally incorrect. That is...
- Simen Kjaeraas (7/17) Oct 24 2011 Assuming both string are valid UTF-8, you can. Continuation bytes can ne...
- Steven Schveighoffer (19/38) Oct 24 2011 d =
- Dmitry Olshansky (19/54) Oct 24 2011 Assuming language support stays on stage of "codepoint is a character"
- Steven Schveighoffer (51/105) Oct 24 2011 t
- Walter Bright (4/6) Oct 24 2011 There are different levels of Unicode support. D operates at the most ba...
- Timon Gehr (4/95) Oct 24 2011 The answer is probably to get proper unicode normalization into Phobos.
- Dmitry Olshansky (37/127) Oct 24 2011 Last time I checked, the legion of "everything is ASCII" zealots was
- Steven Schveighoffer (80/218) Oct 26 2011 hat
- Dmitry Olshansky (21/146) Oct 26 2011 That's it.
- travert phare.normalesup.org (Christophe) (14/18) Oct 28 2011 string being seen as range of codepoint (dchar) is already aweful
- Simen Kjaeraas (8/10) Oct 24 2011 If this is correct (and it is), then decoding to dchar is simply not
- Steven Schveighoffer (23/31) Oct 26 2011 a
- Andrei Alexandrescu (11/34) Oct 29 2011 Why doesn't that simple search work?
- Jonathan M Davis (53/86) Oct 29 2011 ot
- Steven Schveighoffer (31/67) Oct 31 2011 f a
- Andrei Alexandrescu (3/11) Oct 31 2011 I see this as a decoding primitive issue that can be resolved.
- Walter Bright (5/14) Oct 24 2011 Sure you can. A Unicode character is a string, a Unicode string is a str...
- Steven Schveighoffer (7/25) Oct 24 2011 d =
- Michel Fortin (30/32) Oct 24 2011 The more I think about it, the more I think it should work like this:
- Steven Schveighoffer (22/48) Oct 26 2011 .
- Michel Fortin (23/30) Oct 26 2011 Well, treating the string as an array of dchar doesn't work in the
- Steven Schveighoffer (29/54) Oct 26 2011 the
- Michel Fortin (15/23) Oct 26 2011 And all I'm saying is that such a standalone function that eliminates
- so (6/10) Oct 21 2011 Indeed, if one knows/understands how unicode works AND knows the structu...
- Jacob Carlborg (5/23) Oct 20 2011 The index could skip certain indexes to make that assert pass. But that
- so (7/13) Oct 21 2011 I really can't believe why people expect that.
- Timon Gehr (3/9) Oct 21 2011 I actually like the current behaviour.
- Alvaro (14/20) Oct 22 2011 Maybe I didn't fully get your point, but, you do know that you can do
- Daniel Gibson (5/17) Oct 22 2011 One visible Unicode character can consists of several dchars.
It just took me over one hour to find out the unthinkable. foreach(c; str) will deduce c to immutable(char) and doesn't care about unicode. Now there is so many unicode transcoding happening in the language that it starts to get annoying, but the most basic string iteration doesn't support it by default?
Oct 20 2011
On 20/10/11 8:37 PM, Martin Nowak wrote:It just took me over one hour to find out the unthinkable. foreach(c; str) will deduce c to immutable(char) and doesn't care about unicode. Now there is so many unicode transcoding happening in the language that it starts to get annoying, but the most basic string iteration doesn't support it by default?D has got itself into a tricky situation in this regard. Doing it either way introduces an unintuitive mess. The way it is now, you get the problem that you just described where foreach is unaware of Unicode. If you changed it to loop as Unicode, then indices won't match up: immutable(int)[] a = ... foreach (x, i; a) assert(x == a[i]); // ok immutable(char)[] b = ... foreach (x, i; b) assert(x == b[i]); // not necessarily! Also, the loop won't necessarily iterate b.length times. There's inconsistencies all over the place. The whole mess is caused by conflating the idea of an array with a variable length encoding that happens to use an array for storage. I don't believe there is any clean and tidy way to fix the problem without breaking compatibility.
Oct 20 2011
On 10/20/2011 2:49 PM, Peter Alexander wrote:The whole mess is caused by conflating the idea of an array with a variable length encoding that happens to use an array for storage. I don't believe there is any clean and tidy way to fix the problem without breaking compatibility.There is no 'fixing' it, even to break compatibility. Sometimes you want to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars. Someone will be dissatisfied no matter what. There is no way to program strings in D without being aware of UTF-8 encoding.
Oct 20 2011
Walter Bright wrote:Sometimes you want to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars.Well, they could always cast it to a ubyte[]. But on the other hand, you can just ask for dchar now. So yeah.
Oct 20 2011
On Thursday, October 20, 2011 19:26:43 Walter Bright wrote:On 10/20/2011 2:49 PM, Peter Alexander wrote:True, but if the default were dchar, then the common case would be have fewer bugs (still allowing you to explicitly use char or wchar when you want to). At minimum, I think that it would be a good idea to implement http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning not to explicitly give the type with foreach for arrays of char or wchar. It would catch bugs without changing the behavior of any existing code, and it still allows you to iterate over either code units or code points. - Jonathan M DavisThe whole mess is caused by conflating the idea of an array with a variable length encoding that happens to use an array for storage. I don't believe there is any clean and tidy way to fix the problem without breaking compatibility.There is no 'fixing' it, even to break compatibility. Sometimes you want to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars. Someone will be dissatisfied no matter what. There is no way to program strings in D without being aware of UTF-8 encoding.
Oct 20 2011
On 10/20/2011 7:37 PM, Jonathan M Davis wrote:True, but if the default were dchar, then the common case would be have fewer bugsIs that really the common case? It's certainly the *slow* case. Common string operations like searching, copying, etc., do not require decoding.(still allowing you to explicitly use char or wchar when you want to). At minimum, I think that it would be a good idea to implement http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning not to explicitly give the type with foreach for arrays of char or wchar. It would catch bugs without changing the behavior of any existing code, and it still allows you to iterate over either code units or code points.I like the type deduction feature of foreach, and don't think it should be removed for strings. Currently, it's consistent - T[] gets an element type of T. I want to reiterate that there's no way to program strings in D without being cognizant of them being a multibyte representation. D is both a high level and a low level language, and you can pick which to use, but you still gotta pick.
Oct 20 2011
On Thursday, October 20, 2011 20:37:40 Walter Bright wrote:On 10/20/2011 7:37 PM, Jonathan M Davis wrote:And why would you iterate over a string with foreach without decoding it unless you specifically need to operate on code units (which I would expect to be _very_ rare)? Sure, copying doesn't require decoding, but searching sure does (unless you're specifically looking for a code unit rather than a code point, which would not be normal). Most anything which needs to operate on the characters of a string needs to decode them. And iterating over them to do much of anything would require decoding, since otherwise you're operating on code units, and how often does anyone do that unless they're specifically messing around with character encodings? Sure, if you _know_ that you're dealing with a string with only ASCII, it's faster to just iterate over chars, but then you can explicitly give the type of the foreach variable as char, but normally what people care about is iterating over characters, not pieces of characters. So, I would expect the case where people _want_ to iterate over chars to be rare. In most cases, iterating over a string as chars is a bug - one which in many cases won't be quickly caught, because the programmer is English speaking and uses almost exclusively ASCII for whatever testing that they do. The default for string handling really should be to treat them as ranges of dchar but still make it easy for them to be treated as arrays of code units when necessary. There's plenty of code in Phobos which is able to special case strings and make operating on them more efficient when it's not necessary to operate on them as ranges of dchar or when decoding the string explicitly with functions such as stride. But the default is still to operate on them as ranges of dchar, because that is what is normally correct. Defaulting to the guaranteed correct handling of characters and special casing when it's possible to write code more efficiently than that is definitely the way to go about it, and it's how Phobos generally does it. The fact that foreach doesn't is incongruous with how strings are handled in most other cases.True, but if the default were dchar, then the common case would be have fewer bugsIs that really the common case? It's certainly the *slow* case. Common string operations like searching, copying, etc., do not require decoding.Sure, the type deduction of foreach is great, and it's completely consistent that iterating over an array of chars would iterate over chars rather than dchars when you don't give the type. However, in most cases, that is _not_ what the programmer actually wants. They want to iterate over characters, not pieces of characters. So, the default at this point is _wrong_ in the common case. As such, I'm very leery of any code which uses foreach over a string without specifying the iteration type. And in fact, unless the code is clearly intended to operate on code units, I would expect a comment indicating that the use of char instead of dchar was intentional, or I'd still consider it likely that it's a bug and a mistake on the programmer's part (likely due to a misunderstanding of unicode and how D deals with it).(still allowing you to explicitly use char or wchar when you want to). At minimum, I think that it would be a good idea to implement http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning not to explicitly give the type with foreach for arrays of char or wchar. It would catch bugs without changing the behavior of any existing code, and it still allows you to iterate over either code units or code points.I like the type deduction feature of foreach, and don't think it should be removed for strings. Currently, it's consistent - T[] gets an element type of T.I want to reiterate that there's no way to program strings in D without being cognizant of them being a multibyte representation. D is both a high level and a low level language, and you can pick which to use, but you still gotta pick.I fully agree that programmers need to properly understand unicode to use strings in D properly. However, the problem is that the default handling of strings with foreach is _not_ what programmers are going to normally want, so the default will cause bugs. If strings defaulted to iterating as ranges of dchar, or if programmers had to say what type they wanted to iterate over when dealing with strings (or at least got a warning if they didn't), then there would be fewer bugs. Pretty much every time that the use of strings with foreach comes up on this list, most everyone agrees that it's a wart in the language that the default is to iterate over chars rather than dchars. Not everyone agrees on the best way to fix the problem, but most everyone agrees that it _is_ a problem. - Jonathan M Davis
Oct 20 2011
On 10/20/2011 8:58 PM, Jonathan M Davis wrote:And why would you iterate over a string with foreach without decoding it unless you specifically need to operate on code units (which I would expect to be _very_ rare)? Sure, copying doesn't require decoding, but searching sure doesNo, it doesn't. If I'm searching for a dchar, I'll be searching for a substring in the UTF-8 string. It's far, FAR more efficient to search as a substring rather than decoding while searching. Even more, 99.9999% of searches involve an ascii search string. It is simply not necessary to decode the searched string, as encoded chars cannot be ascii. For example: foreach (c; somestring) if (c == '+') found it! gains absolutely nothing by decoding somestring.(unless you're specifically looking for a code unit rather than a code point, which would not be normal). Most anything which needs to operate on the characters of a string needs to decode them. And iterating over them to do much of anything would require decoding, since otherwise you're operating on code units, and how often does anyone do that unless they're specifically messing around with character encodings?What you write sounds intuitively correct, but in my experience writing Unicode processing code, it simply isn't true. One rarely needs to decode.However, in most cases, that is _not_ what the programmer actually wants. They want to iterate over characters, not pieces of characters. So, the default at this point is _wrong_ in the common case.This is simply not my experience when working with Unicode. Performance takes a big hit when one structures an algorithm to require decoding/encoding. Doing the algorithm using substrings is a huge win. Take a look at dmd's lexer, it handles Unicode correctly and avoids doing decoding as much as possible.
Oct 20 2011
On Fri, 21 Oct 2011 06:39:56 +0200, Walter Bright <newshound2 digitalmars.com> wrote:On 10/20/2011 8:58 PM, Jonathan M Davis wrote:You have a good point here. I would have immediately thrown out the loop AFTER profiling. What hits me here is that I had an incorrect program with built-in unicode aware strings. This is counterintuitive to correct unicode handling throughout the std library, and even more to the complementary operation of appending any char type to strings. martinAnd why would you iterate over a string with foreach without decoding it unless you specifically need to operate on code units (which I would expect to be _very_ rare)? Sure, copying doesn't require decoding, but searching sure doesNo, it doesn't. If I'm searching for a dchar, I'll be searching for a substring in the UTF-8 string. It's far, FAR more efficient to search as a substring rather than decoding while searching. Even more, 99.9999% of searches involve an ascii search string. It is simply not necessary to decode the searched string, as encoded chars cannot be ascii. For example: foreach (c; somestring) if (c == '+') found it! gains absolutely nothing by decoding somestring.(unless you're specifically looking for a code unit rather than a code point, which would not be normal). Most anything which needs to operate on the characters of a string needs to decode them. And iterating over them to do much of anything would require decoding, since otherwise you're operating on code units, and how often does anyone do that unless they're specifically messing around with character encodings?What you write sounds intuitively correct, but in my experience writing Unicode processing code, it simply isn't true. One rarely needs to decode.However, in most cases, that is _not_ what the programmer actually wants. They want to iterate over characters, not pieces of characters. So, the default at this point is _wrong_ in the common case.This is simply not my experience when working with Unicode. Performance takes a big hit when one structures an algorithm to require decoding/encoding. Doing the algorithm using substrings is a huge win. Take a look at dmd's lexer, it handles Unicode correctly and avoids doing decoding as much as possible.
Oct 21 2011
On 10/21/2011 2:51 AM, Martin Nowak wrote:You have a good point here. I would have immediately thrown out the loop AFTER profiling. What hits me here is that I had an incorrect program with built-in unicode aware strings. This is counterintuitive to correct unicode handling throughout the std library, and even more to the complementary operation of appending any char type to strings.I understand the issue, but I don't think it's resolvable. It's a lot like the signed/unsigned issue. Java got rid of it by simply not having any unsigned types.
Oct 21 2011
On 10/21/11 1:38 PM, Walter Bright wrote:On 10/21/2011 2:51 AM, Martin Nowak wrote:It is resolvable, just not without breaking compatibility. Latching on the notion that a problem is unsolvable is highly nocive because it sets up the mind for failing to not only look for solutions, but also to see and understand them when they're in the open. I said this a number of times, and I repeat: if we had the luxury of doing it over again, I'd disable random access and .length for char[], wchar[], and their qualified versions. For those types I would add a property .rep that yields respectively ubyte[], ushort[], and the appropriately-qualified variants. This would shed the remaining awkwardnesses from a generally very elegant approach to string handling. The loop issue would be trivial to solve. foreach (x; s) would iterate one dchar at a time, whereas foreach (x; s.rep) would iterate one ubyte or ushort at a time. There would be the ability to iterate foreach (ref x; s.rep) but not foreach (ref x; s). AndreiYou have a good point here. I would have immediately thrown out the loop AFTER profiling. What hits me here is that I had an incorrect program with built-in unicode aware strings. This is counterintuitive to correct unicode handling throughout the std library, and even more to the complementary operation of appending any char type to strings.I understand the issue, but I don't think it's resolvable.
Oct 21 2011
Andrei Alexandrescu:Latching on the notion that a problem is unsolvable is highly nocive because it sets up the mind for failing to not only look for solutions, but also to see and understand them when they're in the open.Right. Experimental psychology has confirmed this some decades ago :-)I said this a number of times, and I repeat:Maybe I have missed your precedent explanations of this idea. Or to me this time it seems more clear and focused.if we had the luxury of doing it over again,Unfortunately D3 language can't break too much backward compatibility :-|I'd disable random access and .length for char[], wchar[], and their qualified versions. For those types I would add a property .rep that yields respectively ubyte[], ushort[], and the appropriately-qualified variants.Good. But the need to know the length of a variable-sized string is common. So I presume in such cases you have to use: somestring.walkLength() This is not too much bad, but it looks a bit long, and it requires an import. So maybe a shorter named property function in the object module is needed: somestring.wlength What about slices? I do need to take slices of variable-length strings too. Bye, bearophile
Oct 21 2011
On 2011-10-21 20:38, Walter Bright wrote:On 10/21/2011 2:51 AM, Martin Nowak wrote:Can't we implement a new string type that people can choose to use if they want. It will hide all the Unicode details that has been brought up by this thread. -- /Jacob CarlborgYou have a good point here. I would have immediately thrown out the loop AFTER profiling. What hits me here is that I had an incorrect program with built-in unicode aware strings. This is counterintuitive to correct unicode handling throughout the std library, and even more to the complementary operation of appending any char type to strings.I understand the issue, but I don't think it's resolvable. It's a lot like the signed/unsigned issue. Java got rid of it by simply not having any unsigned types.
Oct 22 2011
On 10/22/2011 02:14 PM, Jacob Carlborg wrote:On 2011-10-21 20:38, Walter Bright wrote:Having multiple standard string types is bad. Furthermore, it is hard to meaningfully hide all the Unicode details. Not even immutable(dchar)[] necessarily encodes one character as one code unit.On 10/21/2011 2:51 AM, Martin Nowak wrote:Can't we implement a new string type that people can choose to use if they want. It will hide all the Unicode details that has been brought up by this thread.You have a good point here. I would have immediately thrown out the loop AFTER profiling. What hits me here is that I had an incorrect program with built-in unicode aware strings. This is counterintuitive to correct unicode handling throughout the std library, and even more to the complementary operation of appending any char type to strings.I understand the issue, but I don't think it's resolvable. It's a lot like the signed/unsigned issue. Java got rid of it by simply not having any unsigned types.
Oct 22 2011
On 2011-10-21 03:58:50 +0000, Jonathan M Davis <jmdavisProg gmx.com> said:Sure, if you _know_ that you're dealing with a string with only ASCII, it's faster to just iterate over charsIt works for non-ASCII too. You're probably missing an interesting property of UTF encodings: if you want to search for a substring in a well-formed UTF sequence, you do not need to decode the bigger string, comparing the UTF-x code units of the substring with the UTF-x code units of the bigger string is plenty enough. Similarly, if you're searching for the 'ê' code point in an UTF-8 string, the most efficient way is to search the string for the two-byte UTF-8 sequence you would use to encode 'ê' (in other words, convert 'ê' to a string). Decoding the whole string is a wasteful process.Sure, if you _know_ that you're dealing with a string with only ASCII, it's faster to just iterate over chars, but then you can explicitly give the type of the foreach variable as char, but normally what people care about is iterating over characters, not pieces of characters.If you want to iterate over what people consider characters, then you need to take into account combining marks that form multi-code-point graphemes. (You'll probably want to deal with unicode normalization too.) Treating code points as if they were characters is a misconception in the same way as treating UTF-16 code units as character is: both works most of the time but also fail in a number of cases.So, I would expect the case where people _want_ to iterate over chars to be rare. In most cases, iterating over a string as chars is a bug - one which in many cases won't be quickly caught, because the programmer is English speaking and uses almost exclusively ASCII for whatever testing that they do.That's a real problem. But is treating everything as dchar the only solution to that problem?Defaulting to the guaranteed correct handling of characters and special casing when it's possible to write code more efficiently than that is definitely the way to go about it, and it's how Phobos generally does it.Iterating on dchar is not guarantied to be correct, it only has significantly more chances of being correct.The fact that foreach doesn't is incongruous with how strings are handled in most other cases.You could also argue that ranges are doing things the wrong way.I note that you keep confusing characters with code units.I like the type deduction feature of foreach, and don't think it should be removed for strings. Currently, it's consistent - T[] gets an element type of T.Sure, the type deduction of foreach is great, and it's completely consistent that iterating over an array of chars would iterate over chars rather than dchars when you don't give the type. However, in most cases, that is _not_ what the programmer actually wants. They want to iterate over characters, not pieces of characters.That said I wouldn't expect most programmers understand Unicode. Giving them dchars by default won't eliminate bugs related to multi-code-point characters, but it'll likely eliminate bugs relating to multi-code-unit sequences. That could be a good start. I'd say choosing dchar is a practical compromise between the "characters by default" and "the type of the array by default", but it is neither of those ideals. How is that pragmatic trade-off going to fare a few years in the future? I'm a little skeptical that this is the ideal solution. -- Michel Fortin michel.fortin michelf.com http://michelf.com/I want to reiterate that there's no way to program strings in D without being cognizant of them being a multibyte representation. D is both a high level and a low level language, and you can pick which to use, but you still gotta pick.I fully agree that programmers need to properly understand unicode to use strings in D properly. However, the problem is that the default handling of strings with foreach is _not_ what programmers are going to normally want, so the default will cause bugs.
Oct 20 2011
Actually, I'd have to say that having foreach default to iterating over char for string is a bit like if it defaulted to iterating over byte for int[]. Sure, it would work in some cases, but in the general case, it would be very wrong. Yes, it's consistent with how arrays are normally iterated over to have foreach iterate over char for char[], which is why it's arguably not a good idea to make it default to dchar. But it's _wrong_ in most cases, so at least giving a warning when the programmer doesn't give an iteration type for foreach with an array of char or wchar would be a big help. It's this very problem that leads some people to argue that string should be its own type which holds an array of code units (which can be accessed when needed) rather than doing what we do now where we try and treat a string as both an array of chars and a range of dchars. The result is schizophrenic. - Jonathan M Davis
Oct 20 2011
On 10/20/2011 9:06 PM, Jonathan M Davis wrote:It's this very problem that leads some people to argue that string should be its own type which holds an array of code units (which can be accessed when needed) rather than doing what we do now where we try and treat a string as both an array of chars and a range of dchars. The result is schizophrenic.Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.
Oct 20 2011
On Fri, 21 Oct 2011 00:43:20 -0400, Walter Bright <newshound2 digitalmars.com> wrote:On 10/20/2011 9:06 PM, Jonathan M Davis wrote:I don't think it would. Do you have any proof to support this? -SteveIt's this very problem that leads some people to argue that string should be its own type which holds an array of code units (which can be accessed when needed) rather than doing what we do now where we try and treat a string as both an array of chars and a range of dchars. The result is schizophrenic.Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.
Oct 21 2011
On 10/21/2011 4:14 AM, Steven Schveighoffer wrote:I've done string processing code, and done a lot of profiling of them. Every cycle is critical, and decoding adds a *lot* of cycles.Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.I don't think it would. Do you have any proof to support this?
Oct 21 2011
On 10/21/11 1:39 PM, Walter Bright wrote:On 10/21/2011 4:14 AM, Steven Schveighoffer wrote:The key here is to allow people to easily either decode strings or not, without defaulting to an error-prone choice. AndreiI've done string processing code, and done a lot of profiling of them. Every cycle is critical, and decoding adds a *lot* of cycles.Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.I don't think it would. Do you have any proof to support this?
Oct 21 2011
On Fri, 21 Oct 2011 14:39:58 -0400, Walter Bright <newshound2 digitalmars.com> wrote:On 10/21/2011 4:14 AM, Steven Schveighoffer wrote:What I mean is, default to a well-built string type, and let people who want to deal with arrays of code-units deal with arrays of code-units. This schizophrenic view phobos has of char[] arrays as not being arrays is horrendous to work with. For my usage, I almost never iterate over string characters or graphemes, I just pass strings. -SteveI've done string processing code, and done a lot of profiling of them. Every cycle is critical, and decoding adds a *lot* of cycles.Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.I don't think it would. Do you have any proof to support this?
Oct 24 2011
Walter Bright , dans le message (digitalmars.D:147161), a écrit :On 10/20/2011 9:06 PM, Jonathan M Davis wrote:I definitely agree with you, but I have a piece of news for you : The whole phobos alreaday treats strings as a dchar ranges, and IS inefficient for processing strings. The fact is: char[] is not char[] in phobos. It is Range!dchar. This is aweful and schizophrenic, and also inefficient. The purpose was to allow people to use strings without knowing anything about unicode. That is why Jonathan proposed having a specific string structure to manipulate strings without having to worry about unicode, just like how strings are currently manipulated in phobos, and letting char[] be char[], and be as efficient as they should be. I was not there when it was decided to treat strings as dchar ranges, but now it is done. The only thing I can do is use ubyte[] instead of char[] so that phobos treat them as proper arrays, and propose optimized overloads for various phobos algorithm to make them as efficient as they should be (which I didn't find the time to do yet). -- ChristopheIt's this very problem that leads some people to argue that string should be its own type which holds an array of code units (which can be accessed when needed) rather than doing what we do now where we try and treat a string as both an array of chars and a range of dchars. The result is schizophrenic.Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.
Oct 28 2011
On 10/28/11 4:41 AM, Christophe Travert wrote:I definitely agree with you, but I have a piece of news for you : The whole phobos alreaday treats strings as a dchar ranges, and IS inefficient for processing strings. The fact is: char[] is not char[] in phobos. It is Range!dchar. This is aweful and schizophrenic, and also inefficient. The purpose was to allow people to use strings without knowing anything about unicode.It was to allow people who know Unicode to use strings without too much arcana.That is why Jonathan proposed having a specific string structure to manipulate strings without having to worry about unicode, just like how strings are currently manipulated in phobos, and letting char[] be char[], and be as efficient as they should be.No need. Use http://www.digitalmars.com/d/2.0/phobos/std_string.html#representationI was not there when it was decided to treat strings as dchar ranges, but now it is done. The only thing I can do is use ubyte[] instead of char[] so that phobos treat them as proper arrays, and propose optimized overloads for various phobos algorithm to make them as efficient as they should be (which I didn't find the time to do yet).What would you have done if you were there? What is your design of choice? Andrei
Oct 28 2011
On 21.10.2011 06:06, Jonathan M Davis wrote:It's this very problem that leads some people to argue that string should be its own type which holds an array of code units (which can be accessed when needed) rather than doing what we do now where we try and treat a string as both an array of chars and a range of dchars. The result is schizophrenic.Indeed - expressing strings as arrays of characters will always fall short of the unicode concept in some way. A true unicode-compliant languages have to handle strings as opaque objects that do not have any encoding. There is a number of operations that can be done with these objects (concatenation, comparison, searching, etc.). Any kind of defined memory representation can only be obtained by an explicit encoding operation. Python3, for example, did a fundamental step by introducing this fundamental distinction. At first it seems silly, having to think about encodings so often when writing trivial code. After a short while, the strict conceptual separation between unencoded "strings" and encoded "arrays of something" really helps avoiding ugly problems. Sure, for a performance-critical language, the issue becomes a lot trickier. I still think it is possible and ultimately the only way to solve tricky problems that will otherwise always crop up somewhere.
Oct 24 2011
On 21/10/11 3:26 AM, Walter Bright wrote:On 10/20/2011 2:49 PM, Peter Alexander wrote:Then separate those ways of viewing strings. Here's one solution that I believe would satisfy everyone: 1. Remove the string, wstring and dstring aliases. An array of char should be an array of char, i.e. the same as array of byte. Same for arrays of wchar and dchar. This way, arrays of T have no subtle differences for certain kinds of T. 2. Add string, wstring and dstring structs with the following interface: a. foreach should iterate as dchar. b. property front() would be dchar. c. property length() would not exist. d. property buffer() returns the underlying immutable array of char, wchar etc. e. Remove opIndex and co. What this does: - Makes all array types consistent and intuitive. - Makes looping over strings do the expected thing. - Provides an interface to the underlying 8-bit chars for those that want it. Of course, people will still need to understand UTF-8. I don't think that's a problem. It's unreasonable to expect the language to do the thinking for you. The problem is that we have people that *do* understand UTF-8 (like the OP), but *don't* understand D's strings.The whole mess is caused by conflating the idea of an array with a variable length encoding that happens to use an array for storage. I don't believe there is any clean and tidy way to fix the problem without breaking compatibility.There is no 'fixing' it, even to break compatibility. Sometimes you want to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars. Someone will be dissatisfied no matter what.
Oct 21 2011
On Friday, October 21, 2011 11:11 Peter Alexander wrote:On 21/10/11 3:26 AM, Walter Bright wrote:In another post in this thread, Walter said in reference to post on essentially this idea: "Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings." Now, whether that's true is debatable, but that's his stance on the idea. - Jonathan M DavisOn 10/20/2011 2:49 PM, Peter Alexander wrote:Then separate those ways of viewing strings. Here's one solution that I believe would satisfy everyone: 1. Remove the string, wstring and dstring aliases. An array of char should be an array of char, i.e. the same as array of byte. Same for arrays of wchar and dchar. This way, arrays of T have no subtle differences for certain kinds of T. 2. Add string, wstring and dstring structs with the following interface: a. foreach should iterate as dchar. b. property front() would be dchar. c. property length() would not exist. d. property buffer() returns the underlying immutable array of char, wchar etc. e. Remove opIndex and co. What this does: - Makes all array types consistent and intuitive. - Makes looping over strings do the expected thing. - Provides an interface to the underlying 8-bit chars for those that want it. Of course, people will still need to understand UTF-8. I don't think that's a problem. It's unreasonable to expect the language to do the thinking for you. The problem is that we have people that *do* understand UTF-8 (like the OP), but *don't* understand D's strings.The whole mess is caused by conflating the idea of an array with a variable length encoding that happens to use an array for storage. I don't believe there is any clean and tidy way to fix the problem without breaking compatibility.There is no 'fixing' it, even to break compatibility. Sometimes you want to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars. Someone will be dissatisfied no matter what.
Oct 21 2011
On Fri, 21 Oct 2011 21:38:39 +0300, Jonathan M Davis <jmdavisProg gmx.com> wrote:In another post in this thread, Walter said in reference to post on essentially this idea: "Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings." Now, whether that's true is debatable, but that's his stance on the idea. - Jonathan M DavisWell he is right, the reason built-in strings everywhere and rarely (so rare that i have never seen one yet) someone comes up with a better alternative. IMO people are spoiled by dynamic languages and fail to see the strengths of D strings. Someone coming from C/C++ this is heaven, both for performance and flexibility.
Oct 21 2011
so:IMO people are spoiled by dynamic languagesWe should aim to something *better* than dynamic languages, where possible. If you have to do certain things (even "slow" ones), there's no point in making them harder than necessary in D. Python is a good language, but it's not perfect, and in a new language I'd like something even better. D contains several small things that are better than Python (an many things that are worse than Python). Bye, bearophile
Oct 21 2011
On Sat, 22 Oct 2011 02:28:28 +0300, bearophile <bearophileHUGS lycos.com> wrote:so:With spoiled i didn't mean those languages do it better in general (expressiveness, maybe), what i meant is that they just work, without any effort from the programmer. It may be all right for them and you could say it is what every language should do but in reality this is not the case for D like languages. Not with the expense of efficiency on a low level task. We could aim all we like but we can't beat dynamic languages that easily, if you don't care reaching or beating C (on every aspect, system access, performance...) you don't have much of a limit do you? :)IMO people are spoiled by dynamic languagesWe should aim to something *better* than dynamic languages, where possible. If you have to do certain things (even "slow" ones), there's no point in making them harder than necessary in D. Python is a good language, but it's not perfect, and in a new language I'd like something even better. D contains several small things that are better than Python (an many things that are worse than Python). Bye, bearophile
Oct 21 2011
On Sat, 22 Oct 2011 01:28:28 +0200, bearophile <bearophileHUGS lycos.com==wrote:so:IMO people are spoiled by dynamic languagesWe should aim to something *better* than dynamic languages, where =possible. If you have to do certain things (even "slow" ones), there's==no point in making them harder than necessary in D. Python is a good language, but it's not perfect, and in a new language==I'd like something even better. D contains several small things that a=re =better than Python (an many things that are worse than Python). Bye, bearophileWell the first thing I tried out was: for c in "f#a print c Which I still didn't get to run after: - reading SyntaxError: Non-ASCII character '\xe2' in file ./run.py on = = line 1, but no encoding declared; see = http://www.python.org/peps/pep-0263.html for details - adding a BOM to the script - remove the shebang which collides with the BOM - writing unicode("f#a martin
Oct 21 2011
On 21/10/11 11:03 PM, so wrote:On Fri, 21 Oct 2011 21:38:39 +0300, Jonathan M Davis <jmdavisProg gmx.com> wrote:Which operations do you believe would be less efficient?In another post in this thread, Walter said in reference to post on essentially this idea: "Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings." Now, whether that's true is debatable, but that's his stance on the idea. - Jonathan M DavisWell he is right, the reason built-in strings everywhere and rarely (so rare that i have never seen one yet) someone comes up with a better alternative. IMO people are spoiled by dynamic languages and fail to see the strengths of D strings. Someone coming from C/C++ this is heaven, both for performance and flexibility.
Oct 22 2011
On 10/22/2011 2:21 AM, Peter Alexander wrote:Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.
Oct 22 2011
Walter Bright <newshound2 digitalmars.com> wrote:On 10/22/2011 2:21 AM, Peter Alexander wrote:You can std.algorithm.find to do searching, not foreach. The former can decide whichever efficient method to use.Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.
Oct 22 2011
On 10/22/2011 09:37 PM, kennytm wrote:Walter Bright<newshound2 digitalmars.com> wrote:Afaics the current std.algorithm.find implementation decodes its arguments.On 10/22/2011 2:21 AM, Peter Alexander wrote:You can std.algorithm.find to do searching, not foreach. The former can decide whichever efficient method to use.Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.
Oct 22 2011
On 10/22/11 3:05 PM, Timon Gehr wrote:On 10/22/2011 09:37 PM, kennytm wrote:That can be easily fixed. Currently single-element find does decoding but substring find avoids it if possible: https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L2819 AndreiWalter Bright<newshound2 digitalmars.com> wrote:Afaics the current std.algorithm.find implementation decodes its arguments.On 10/22/2011 2:21 AM, Peter Alexander wrote:You can std.algorithm.find to do searching, not foreach. The former can decide whichever efficient method to use.Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.
Oct 22 2011
On 10/22/2011 10:42 PM, Andrei Alexandrescu wrote:On 10/22/11 3:05 PM, Timon Gehr wrote:Ok, I actually did not see that, thanks. However this is usually still not the most efficient implementation in case the first argument is string and the other is wstring/dstring.On 10/22/2011 09:37 PM, kennytm wrote:That can be easily fixed. Currently single-element find does decoding but substring find avoids it if possible: https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L2819Walter Bright<newshound2 digitalmars.com> wrote:Afaics the current std.algorithm.find implementation decodes its arguments.On 10/22/2011 2:21 AM, Peter Alexander wrote:You can std.algorithm.find to do searching, not foreach. The former can decide whichever efficient method to use.Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.
Oct 22 2011
On 10/22/11 5:32 PM, Timon Gehr wrote:Ok, I actually did not see that, thanks. However this is usually still not the most efficient implementation in case the first argument is string and the other is wstring/dstring.I understand. For some reason I'm not seeing the URL of your pull request fixing that :o). Andrei
Oct 22 2011
On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2 digitalmars.com> wrote:On 10/22/2011 2:21 AM, Peter Alexander wrote:Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars. But if you want to do the fundamentally incorrect thing (perhaps because you as the programmer know the limits of what the data may contain), then you should be able to do that for efficiency. But to default to something incorrect and then claim it's for efficiency is almost laughable. If I gave you an O(lg(n)) shortest path algorithm that did not always find the shortest path, would you agree it was an improvment over dijkstra's? -SteveWhich operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.
Oct 24 2011
On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer <schveiguy yahoo.com> wrote:On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2 digitalmars.com> wrote:Assuming both string are valid UTF-8, you can. Continuation bytes can never be confused with the first byte of a code point, and the first byte always identifies how many continuation bytes there should be. -- SimenOn 10/22/2011 2:21 AM, Peter Alexander wrote:Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars.Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.
Oct 24 2011
On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas <simen.kjaras gmail.com> wrote:On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer =<schveiguy yahoo.com> wrote:On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright =d =<newshound2 digitalmars.com> wrote:On 10/22/2011 2:21 AM, Peter Alexander wrote:Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, woul==be less efficient if decoding was done.Searching that does not do decoding is fundamentally incorrect. That=is, if you want to find a substring in a string, you cannot just ==compare chars.Assuming both string are valid UTF-8, you can. Continuation bytes can =never be confused with the first byte of a code point, and the first byte =always identifies how many continuation bytes there should be.As others have pointed out in the past to me (and I thought as you did once), the same characters can be encoded in *different ways*. They mus= t be normalized to accurately compare. Plus, a combining character (such as an umlaut or accent) is part of a character, but may be a separate code point. If that's on the last character in the word such as fianc=C3=A9, then searching for fiance wil= l result in a match without proper decoding! Or if fianc=C3=A9 uses a precomposed =C3=A9, it won't match. So two valid representations of the= word either match or they don't. It's just a complete mess without proper unicode decoding. -Steve
Oct 24 2011
On 24.10.2011 23:41, Steven Schveighoffer wrote:On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas <simen.kjaras gmail.com> wrote:Assuming language support stays on stage of "codepoint is a character" it's totaly expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks to hit certain issues.On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer <schveiguy yahoo.com> wrote:As others have pointed out in the past to me (and I thought as you did once), the same characters can be encoded in *different ways*. They must be normalized to accurately compare.On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2 digitalmars.com> wrote:Assuming both string are valid UTF-8, you can. Continuation bytes can never be confused with the first byte of a code point, and the first byte always identifies how many continuation bytes there should be.On 10/22/2011 2:21 AM, Peter Alexander wrote:Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars.Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.Plus, a combining character (such as an umlaut or accent) is part of a character, but may be a separate code point. If that's on the last character in the word such as fiancé, then searching for fiance will result in a match without proper decoding!Now if you are going to do real characters... If source/needle are normalized you still can avoid lots of work by searching without decoding. All you need to decode is one codepoint on each successful match to see if there is a modifier at end of matched portion. But it depends on how you want to match if it's case-insensitive search it will be a lot more complicated, but anyway it boils down to this: 1) do inexact search, get likely match ( false positives are OK, negatives not) no decoding here 2) once found check it (or parts of it) with proper decoding There are cultural subtleties, that complicate these steps if you take them into account, but it's doable. Or if fiancé uses aprecomposed é, it won't match. So two valid representations of the word either match or they don't. It's just a complete mess without proper unicode decoding.It's a complete mess even with proper decoding ;)-Steve-- Dmitry Olshansky
Oct 24 2011
On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky = <dmitry.olsh gmail.com> wrote:On 24.10.2011 23:41, Steven Schveighoffer wrote:tOn Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas <simen.kjaras gmail.com> wrote:On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer <schveiguy yahoo.com> wrote:On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2 digitalmars.com> wrote:On 10/22/2011 2:21 AM, Peter Alexander wrote:Searching that does not do decoding is fundamentally incorrect. Tha=Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.nis, if you want to find a substring in a string, you cannot just compare chars.Assuming both string are valid UTF-8, you can. Continuation bytes ca=dnever be confused with the first byte of a code point, and the first byte always identifies how many continuation bytes there should be.As others have pointed out in the past to me (and I thought as you di=ustonce), the same characters can be encoded in *different ways*. They m==be normalized to accurately compare.Assuming language support stays on stage of "codepoint is a character"=it's totaly expected to ignore modifiers and compare identically =normalized UTF without decoding. Yes, it risks to hit certain issues.Again, the "risk" is that it fails to achieve the goal you ask of it! D-language: Here, use this search algorithm, it works most of the time, = = but may not work correctly in some cases. If you run into one of those = = cases, you'll have to run a specialized search algorithm for strings. User: How do I know I hit one of those cases? D-language: You'll have to run the specialized version to find out. User: Why wouldn't I just run the specialized version first? D-language: Well, because it's slower! User: But don't I have to use both algorithms to make sure I find the da= ta? D-language: Only if you "care" about accuracy! Call me ludicrous, but is this really what we want to push on someone as= a = "unicode-aware" language?aPlus, a combining character (such as an umlaut or accent) is part of =willcharacter, but may be a separate code point. If that's on the last character in the word such as fianc=C3=A9, then searching for fiance =result in a match without proper decoding!Now if you are going to do real characters... If source/needle are =normalized you still can avoid lots of work by searching without =decoding. All you need to decode is one codepoint on each successful =match to see if there is a modifier at end of matched portion. But it depends on how you want to match if it's case-insensitive searc=h =it will be a lot more complicated, but anyway it boils down to this: 1) do inexact search, get likely match ( false positives are OK, =negatives not) no decoding here 2) once found check it (or parts of it) with proper decoding There are cultural subtleties, that complicate these steps if you take==them into account, but it's doable.I agree with you that simple searches using only byte (or dchar) = comparison does not work, and can be optimized based on several factors.= = The easiest thing is to find a code unit sequence that only has one vali= d = form, then search for that without decoding. Then when found, decode th= e = characters around it. Or if that isn't possible, create all the = un-normalized forms for one grapheme (based on how likely it is to occur= ), = and search for one of those in the undecoded stream. This can all be built into a specialized string type. There's actually = = some really interesting problems to solve in this space I think. I've = created a basic string type that has lamented in my unfinished pile of = stuff to do. I think it can be done in a way that is close to as = efficient as arrays for the most common operations (slicing, indexing, = etc.), but is *correct* before it is efficient. You should always be ab= le = to drop into "array mode" and deal with the code-units.Or if fianc=C3=A9 uses ahe wordprecomposed =C3=A9, it won't match. So two valid representations of t=Yes, all the more reason to solve the problem correctly so the hapless = unicode novice user doesn't have to! -Steveeither match or they don't. It's just a complete mess without proper unicode decoding.It's a complete mess even with proper decoding ;)
Oct 24 2011
On 10/24/2011 1:52 PM, Steven Schveighoffer wrote:Call me ludicrous, but is this really what we want to push on someone as a "unicode-aware" language?There are different levels of Unicode support. D operates at the most basic level (forgot what it is called) where only recognition of the encodings is necessary.
Oct 24 2011
On 10/24/2011 10:52 PM, Steven Schveighoffer wrote:On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:The answer is probably to get proper unicode normalization into Phobos. The efficient versions can then specify that they are guaranteed to work perfectly accurate on normalized input only.On 24.10.2011 23:41, Steven Schveighoffer wrote:Again, the "risk" is that it fails to achieve the goal you ask of it! D-language: Here, use this search algorithm, it works most of the time, but may not work correctly in some cases. If you run into one of those cases, you'll have to run a specialized search algorithm for strings. User: How do I know I hit one of those cases? D-language: You'll have to run the specialized version to find out. User: Why wouldn't I just run the specialized version first? D-language: Well, because it's slower! User: But don't I have to use both algorithms to make sure I find the data? D-language: Only if you "care" about accuracy! Call me ludicrous, but is this really what we want to push on someone as a "unicode-aware" language?On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas <simen.kjaras gmail.com> wrote:Assuming language support stays on stage of "codepoint is a character" it's totaly expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks to hit certain issues.On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer <schveiguy yahoo.com> wrote:As others have pointed out in the past to me (and I thought as you did once), the same characters can be encoded in *different ways*. They must be normalized to accurately compare.On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2 digitalmars.com> wrote:Assuming both string are valid UTF-8, you can. Continuation bytes can never be confused with the first byte of a code point, and the first byte always identifies how many continuation bytes there should be.On 10/22/2011 2:21 AM, Peter Alexander wrote:Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars.Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.I agree with you that simple searches using only byte (or dchar) comparison does not work, and can be optimized based on several factors. The easiest thing is to find a code unit sequence that only has one valid form, then search for that without decoding. Then when found, decode the characters around it. Or if that isn't possible, create all the un-normalized forms for one grapheme (based on how likely it is to occur), and search for one of those in the undecoded stream. This can all be built into a specialized string type. There's actually some really interesting problems to solve in this space I think. I've created a basic string type that has lamented in my unfinished pile of stuff to do. I think it can be done in a way that is close to as efficient as arrays for the most common operations (slicing, indexing, etc.), but is *correct* before it is efficient. You should always be able to drop into "array mode" and deal with the code-units.Plus, a combining character (such as an umlaut or accent) is part of a character, but may be a separate code point. If that's on the last character in the word such as fiancé, then searching for fiance will result in a match without proper decoding!Now if you are going to do real characters... If source/needle are normalized you still can avoid lots of work by searching without decoding. All you need to decode is one codepoint on each successful match to see if there is a modifier at end of matched portion. But it depends on how you want to match if it's case-insensitive search it will be a lot more complicated, but anyway it boils down to this: 1) do inexact search, get likely match ( false positives are OK, negatives not) no decoding here 2) once found check it (or parts of it) with proper decoding There are cultural subtleties, that complicate these steps if you take them into account, but it's doable.Or if fiancé uses aYes, all the more reason to solve the problem correctly so the hapless unicode novice user doesn't have to! -Steveprecomposed é, it won't match. So two valid representations of the word either match or they don't. It's just a complete mess without proper unicode decoding.It's a complete mess even with proper decoding ;)
Oct 24 2011
On 25.10.2011 0:52, Steven Schveighoffer wrote:On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Last time I checked, the legion of "everything is ASCII" zealots was pretty large ;) So the "goal" is a blury line, meaning that "search for a substring on in this chunk of unicode text" can mean very different things, and be correct in it's own sense on about 3 different levels. There is no default way, so I'm not sure we have to embed all of this complexity into language right now. Phobos is were we must first implement this, once it works we may look at revisiting our pick of default comparison method, etc.On 24.10.2011 23:41, Steven Schveighoffer wrote:Again, the "risk" is that it fails to achieve the goal you ask of it!On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas <simen.kjaras gmail.com> wrote:Assuming language support stays on stage of "codepoint is a character" it's totaly expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks to hit certain issues.On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer <schveiguy yahoo.com> wrote:As others have pointed out in the past to me (and I thought as you did once), the same characters can be encoded in *different ways*. They must be normalized to accurately compare.On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2 digitalmars.com> wrote:Assuming both string are valid UTF-8, you can. Continuation bytes can never be confused with the first byte of a code point, and the first byte always identifies how many continuation bytes there should be.On 10/22/2011 2:21 AM, Peter Alexander wrote:Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars.Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.D-language: Here, use this search algorithm, it works most of the time, but may not work correctly in some cases. If you run into one of those cases, you'll have to run a specialized search algorithm for strings. User: How do I know I hit one of those cases? D-language: You'll have to run the specialized version to find out. User: Why wouldn't I just run the specialized version first? D-language: Well, because it's slower! User: But don't I have to use both algorithms to make sure I find the data? D-language: Only if you "care" about accuracy! Call me ludicrous, but is this really what we want to push on someone as a "unicode-aware" language?No, but we might want to fix a string search to do a little more - namely check if it skewed a graphem (assuming normalizations match). BTW adding normalization to std.uni is another thing to do right now. That should be a good enough start, and looking at unicode standard, things are rather fluid there meaning that "unicode-aware" language should be sufixed by version number :)Yes, it's all revolves around find all of variations of a substring. If user is willing to spend some time upfront, this could be done in one pass e.g. using the trick I employed in new std.regex. For a rough idea see ShiftOr string search, that is easy & fast inexact search: http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060 There are even better variations of this stuff in the wild.I agree with you that simple searches using only byte (or dchar) comparison does not work, and can be optimized based on several factors. The easiest thing is to find a code unit sequence that only has one valid form, then search for that without decoding. Then when found, decode the characters around it. Or if that isn't possible, create all the un-normalized forms for one grapheme (based on how likely it is to occur), and search for one of those in the undecoded stream.Plus, a combining character (such as an umlaut or accent) is part of a character, but may be a separate code point. If that's on the last character in the word such as fiancé, then searching for fiance will result in a match without proper decoding!Now if you are going to do real characters... If source/needle are normalized you still can avoid lots of work by searching without decoding. All you need to decode is one codepoint on each successful match to see if there is a modifier at end of matched portion. But it depends on how you want to match if it's case-insensitive search it will be a lot more complicated, but anyway it boils down to this: 1) do inexact search, get likely match ( false positives are OK, negatives not) no decoding here 2) once found check it (or parts of it) with proper decoding There are cultural subtleties, that complicate these steps if you take them into account, but it's doable.This can all be built into a specialized string type. There's actually some really interesting problems to solve in this space I think. I've created a basic string type that has lamented in my unfinished pile of stuff to do. I think it can be done in a way that is close to as efficient as arrays for the most common operations (slicing, indexing, etc.), but is *correct* before it is efficient. You should always be able to drop into "array mode" and deal with the code-units.Here is where we probably differ, to me there is no simple safe fast string type. Why? Because if you need anything "not pedantic and safe above all" you work with implicit data type of e.g. "a chunk of UTF-8 bytes that are normalized in NFC", ... I'd rather see them as sealed array with a bunch of meta info, and string functions to specialize on them in order to do some cool speed hacks. Of course, drilling down to row data should be possible for the user as well.Hapless novice unicode user don't stand a chance. I'm serious. One needs to know a lot of these "basics" to e.g. know why indexing by "true" character could be so slow, about encodings, normalizations etc. Yet we can save a metric ton of special cased crap from ever hitting his eyes. -- Dmitry OlshanskyOr if fiancé uses aYes, all the more reason to solve the problem correctly so the hapless unicode novice user doesn't have to!precomposed é, it won't match. So two valid representations of the word either match or they don't. It's just a complete mess without proper unicode decoding.It's a complete mess even with proper decoding ;)
Oct 24 2011
On Mon, 24 Oct 2011 18:25:25 -0400, Dmitry Olshansky = <dmitry.olsh gmail.com> wrote:On 25.10.2011 0:52, Steven Schveighoffer wrote:hatOn Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:On 24.10.2011 23:41, Steven Schveighoffer wrote:On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas <simen.kjaras gmail.com> wrote:On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer <schveiguy yahoo.com> wrote:On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2 digitalmars.com> wrote:On 10/22/2011 2:21 AM, Peter Alexander wrote:Searching that does not do decoding is fundamentally incorrect. T=Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.canis, if you want to find a substring in a string, you cannot just compare chars.Assuming both string are valid UTF-8, you can. Continuation bytes =enever be confused with the first byte of a code point, and the first byt=didalways identifies how many continuation bytes there should be.As others have pointed out in the past to me (and I thought as you ==once), the same characters can be encoded in *different ways*. They=r"must be normalized to accurately compare.Assuming language support stays on stage of "codepoint is a characte=.it's totaly expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks to hit certain issues=Again, the "risk" is that it fails to achieve the goal you ask of it!=Last time I checked, the legion of "everything is ASCII" zealots was =pretty large ;) So the "goal" is a blury line, meaning that "search fo=r =a substring on in this chunk of unicode text" can mean very different ==things, and be correct in it's own sense on about 3 different levels. There is no default way, so I'm not sure we have to embed all of this ==complexity into language right now. Phobos is were we must first =implement this, once it works we may look at revisiting our pick of =default comparison method, etc.If we say D obeys the subset of unicode that only works on ascii = characters, well, then I think we do not support unicode at all ;) I agree the string type needs to be fleshed out before the language adop= ts = it. I think we could legitimately define a string type that auto-assign= s = from it's respective char[] array type. Once it's shown that the type = is = nearly as fast as a char[] array, it can be used as the default type for= = string literals (the only real place where the language gets in the way)= .e,D-language: Here, use this search algorithm, it works most of the tim=ebut may not work correctly in some cases. If you run into one of thos=cases, you'll have to run a specialized search algorithm for strings.==User: How do I know I hit one of those cases? D-language: You'll have to run the specialized version to find out. User: Why wouldn't I just run the specialized version first? D-language: Well, because it's slower! User: But don't I have to use both algorithms to make sure I find the=asdata? D-language: Only if you "care" about accuracy! Call me ludicrous, but is this really what we want to push on someone=a "unicode-aware" language?No, but we might want to fix a string search to do a little more - =namely check if it skewed a graphem (assuming normalizations match).That's a big assumption. Valid unicode is valid unicode, even if it's n= ot = normalized.BTW adding normalization to std.uni is another thing to do right now. ==That should be a good enough start, and looking at unicode standard, =things are rather fluid there meaning that "unicode-aware" language =should be sufixed by version number :)I agree we need normalization and it is not necessary to involve the = compiler in this. I'd suggest a two to three phase approach: 1. Leave phobos' schizo view of "char arrays are not arrays" for now, an= d = build the necessary framework to get a string type that actually works. 2. Remove schizo view. 3. (optional) make compiler use library-defined string type for string = literals.f aPlus, a combining character (such as an umlaut or accent) is part o=e willcharacter, but may be a separate code point. If that's on the last character in the word such as fianc=C3=A9, then searching for fianc=result in a match without proper decoding!Now if you are going to do real characters... If source/needle are normalized you still can avoid lots of work by searching without decoding. All you need to decode is one codepoint on each successful=omatch to see if there is a modifier at end of matched portion. But it depends on how you want to match if it's case-insensitive search it will be a lot more complicated, but anyway it boils down t=kethis: 1) do inexact search, get likely match ( false positives are OK, negatives not) no decoding here 2) once found check it (or parts of it) with proper decoding There are cultural subtleties, that complicate these steps if you ta=rs.them into account, but it's doable.I agree with you that simple searches using only byte (or dchar) comparison does not work, and can be optimized based on several facto=lThe easiest thing is to find a code unit sequence that only has one valid form, then search for that without decoding. Then when found, decode the characters around it. Or if that isn't possible, create al=othe un-normalized forms for one grapheme (based on how likely it is t=ne =occur), and search for one of those in the undecoded stream.Yes, it's all revolves around find all of variations of a substring. If user is willing to spend some time upfront, this could be done in o=pass e.g. using the trick I employed in new std.regex. For a rough idea see ShiftOr string search, that is easy & fast inexact search: http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060 There are even better variations of this stuff in the wild.This sounds good. I'll have a look at it when I have time.yThis can all be built into a specialized string type. There's actuall=some really interesting problems to solve in this space I think. I've=fcreated a basic string type that has lamented in my unfinished pile o=,stuff to do. I think it can be done in a way that is close to as efficient as arrays for the most common operations (slicing, indexing=etc.), but is *correct* before it is efficient. You should always be able to drop into "array mode" and deal with the code-units.Here is where we probably differ, to me there is no simple safe fast =string type. Why? Because if you need anything "not pedantic and safe ==above all" you work with implicit data type of e.g. "a chunk of UTF-8 ==bytes that are normalized in NFC", ...Specialized types which dictate normalization could be created. We do n= ot = have to always assume the worst if we are optimizing. But in general, when you read a utf-8 file, it could have anything in it= . = It may not even be normalized.I'd rather see them as sealed array with a bunch of meta info, and =string functions to specialize on them in order to do some cool speed ==hacks. Of course, drilling down to row data should be possible for the==user as well.This is exactly what I am envisioning ;) Except I'd build the meta-info= = into the type.the =Or if fianc=C3=A9 uses aprecomposed =C3=A9, it won't match. So two valid representations of=rword either match or they don't. It's just a complete mess without prope=sYes, all the more reason to solve the problem correctly so the haples=unicode decoding.It's a complete mess even with proper decoding ;)y =unicode novice user doesn't have to!Hapless novice unicode user don't stand a chance. I'm serious. One needs to know a lot of these "basics" to e.g. know wh=indexing by "true" character could be so slow, about encodings, =normalizations etc. Yet we can save a metric ton of special cased crap==from ever hitting his eyes.I disagree. 99% of the time I use strings, I don't care about indexes. = I = just want to deal with whole strings and substrings. I rarely use = arbitrary indexes, and when I do, I'm assuming ascii data. The hapless novice unicode user doesn't need to know unicode to do: writefln("hello, world!"); Or to do: if(find(str, "xyz")) {...} Or to even use regex! -Steve
Oct 26 2011
On 26.10.2011 16:12, Steven Schveighoffer wrote: <snip>I agree the string type needs to be fleshed out before the language adopts it. I think we could legitimately define a string type that auto-assigns from it's respective char[] array type. Once it's shown that the type is nearly as fast as a char[] array, it can be used as the default type for string literals (the only real place where the language gets in the way).That's it.Something like that, I may land a hand here, as I plan to merge some unicode stuff between std.regex and std.uni. As extras probably striding by grapheme and support for various degree of case-insensitive string comparison.That's a big assumption. Valid unicode is valid unicode, even if it's not normalized.D-language: Here, use this search algorithm, it works most of the time, but may not work correctly in some cases. If you run into one of those cases, you'll have to run a specialized search algorithm for strings. User: How do I know I hit one of those cases? D-language: You'll have to run the specialized version to find out. User: Why wouldn't I just run the specialized version first? D-language: Well, because it's slower! User: But don't I have to use both algorithms to make sure I find the data? D-language: Only if you "care" about accuracy! Call me ludicrous, but is this really what we want to push on someone as a "unicode-aware" language?No, but we might want to fix a string search to do a little more - namely check if it skewed a graphem (assuming normalizations match).BTW adding normalization to std.uni is another thing to do right now. That should be a good enough start, and looking at unicode standard, things are rather fluid there meaning that "unicode-aware" language should be sufixed by version number :)I agree we need normalization and it is not necessary to involve the compiler in this. I'd suggest a two to three phase approach: 1. Leave phobos' schizo view of "char arrays are not arrays" for now, and build the necessary framework to get a string type that actually works. 2. Remove schizo view. 3. (optional) make compiler use library-defined string type for string literals.There should be a way of e.g. auto my_string = normalize(readText("file.txt"), Normalize.NFKC); //my_string has typed normalized UTF-8 or somesuch Yes, that should keep general non-assuming case. Though it still can do some clever things, since as W3C points out >90% of text on web is already normalized in NFC (I can search for a proof link, if hard pressed).This sounds good. I'll have a look at it when I have time.Yes, it's all revolves around find all of variations of a substring. If user is willing to spend some time upfront, this could be done in one pass e.g. using the trick I employed in new std.regex. For a rough idea see ShiftOr string search, that is easy & fast inexact search: http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060 There are even better variations of this stuff in the wild.I agree with you that simple searches using only byte (or dchar) comparison does not work, and can be optimized based on several factors. The easiest thing is to find a code unit sequence that only has one valid form, then search for that without decoding. Then when found, decode the characters around it. Or if that isn't possible, create all the un-normalized forms for one grapheme (based on how likely it is to occur), and search for one of those in the undecoded stream.Plus, a combining character (such as an umlaut or accent) is part of a character, but may be a separate code point. If that's on the last character in the word such as fiancé, then searching for fiance will result in a match without proper decoding!Now if you are going to do real characters... If source/needle are normalized you still can avoid lots of work by searching without decoding. All you need to decode is one codepoint on each successful match to see if there is a modifier at end of matched portion. But it depends on how you want to match if it's case-insensitive search it will be a lot more complicated, but anyway it boils down to this: 1) do inexact search, get likely match ( false positives are OK, negatives not) no decoding here 2) once found check it (or parts of it) with proper decoding There are cultural subtleties, that complicate these steps if you take them into account, but it's doable.Specialized types which dictate normalization could be created. We do not have to always assume the worst if we are optimizing. But in general, when you read a utf-8 file, it could have anything in it. It may not even be normalized.This can all be built into a specialized string type. There's actually some really interesting problems to solve in this space I think. I've created a basic string type that has lamented in my unfinished pile of stuff to do. I think it can be done in a way that is close to as efficient as arrays for the most common operations (slicing, indexing, etc.), but is *correct* before it is efficient. You should always be able to drop into "array mode" and deal with the code-units.Here is where we probably differ, to me there is no simple safe fast string type. Why? Because if you need anything "not pedantic and safe above all" you work with implicit data type of e.g. "a chunk of UTF-8 bytes that are normalized in NFC", ...Yeah, I meant that the type caries this info. Looks like we are more in agreement then I suspected ;)I'd rather see them as sealed array with a bunch of meta info, and string functions to specialize on them in order to do some cool speed hacks. Of course, drilling down to row data should be possible for the user as well.This is exactly what I am envisioning ;) Except I'd build the meta-info into the type.To print chars you don't need to know unicode, you OS needs to :) Though that wasn't string processing. However find/match is indeed a good example. You are probably right, if we cover enough of common operations most people might not have to "look inside". -- Dmitry OlshanskyI disagree. 99% of the time I use strings, I don't care about indexes. I just want to deal with whole strings and substrings. I rarely use arbitrary indexes, and when I do, I'm assuming ascii data. The hapless novice unicode user doesn't need to know unicode to do: writefln("hello, world!"); Or to do: if(find(str, "xyz")) {...} Or to even use regex!Hapless novice unicode user don't stand a chance. I'm serious. One needs to know a lot of these "basics" to e.g. know why indexing by "true" character could be so slow, about encodings, normalizations etc. Yet we can save a metric ton of special cased crap from ever hitting his eyes.Or if fiancé uses aYes, all the more reason to solve the problem correctly so the hapless unicode novice user doesn't have to!precomposed é, it won't match. So two valid representations of the word either match or they don't. It's just a complete mess without proper unicode decoding.It's a complete mess even with proper decoding ;)
Oct 26 2011
Dmitry Olshansky , dans le message (digitalmars.D:147415), a écrit :Assuming language support stays on stage of "codepoint is a character" it's totaly expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks to hit certain issues.string being seen as range of codepoint (dchar) is already aweful enough. Now seeing strings as range of displayable caracters just do not make sense. Unicode is too complicated to allow doing this for a general purpose string manipulation. All the transformations to displayable characters can only be done when displaying characters ! Just like fiancé is hidden is you write fiance' (with the approriate unicode character to have the ' placed over the 'e'). You can hide any word by using delete characters. You have to make asumption on the input, and you have to put limitations to the algorithm because in any case, you can have unexpected behavior. And I can assure you there is less unexpected behavior if you treat strings as dchar range or even char[], than if you treat them as displayable characters.It's a complete mess even with proper decoding ;)Sure, that's why we better not decode.
Oct 28 2011
On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer <schveiguy yahoo.com> wrote:Plus, a combining character (such as an umlaut or accent) is part of a character, but may be a separate code point.If this is correct (and it is), then decoding to dchar is simply not enough. You seem to advocate decoding to graphemes, which is a whole different matter. -- Simen
Oct 24 2011
On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas = <simen.kjaras gmail.com> wrote:On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer =<schveiguy yahoo.com> wrote:aPlus, a combining character (such as an umlaut or accent) is part of =character, but may be a separate code point.If this is correct (and it is), then decoding to dchar is simply not =enough. You seem to advocate decoding to graphemes, which is a whole different==matter.I am advocating that. And it's a matter of perception. D can say "we = only support code-point decoding" and what that means to a user is, "we = = don't support language as you know it." Sure it's a part of unicode, bu= t = it takes that extra piece to make it actually usable to people who requi= re = unicode. Even in English, fianc=C3=A9 has an accent. To say D supports unicode, = but = then won't do a simple search on a file which contains a certain *valid*= = encoding of that word is disingenuous to say the least. D needs a fully unicode-aware string type. I advocate D should use it a= s = the default string type, but it needs one whether it's the default or no= t = in order to say it supports unicode. -Steve
Oct 26 2011
On 10/26/11 7:18 AM, Steven Schveighoffer wrote:On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas <simen.kjaras gmail.com> wrote:Why doesn't that simple search work? foreach (line; stdin.byLine()) { if (line.canFind("fiancé")) { writeln("There it is."); } }On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer <schveiguy yahoo.com> wrote:I am advocating that. And it's a matter of perception. D can say "we only support code-point decoding" and what that means to a user is, "we don't support language as you know it." Sure it's a part of unicode, but it takes that extra piece to make it actually usable to people who require unicode. Even in English, fiancé has an accent. To say D supports unicode, but then won't do a simple search on a file which contains a certain *valid* encoding of that word is disingenuous to say the least.Plus, a combining character (such as an umlaut or accent) is part of a character, but may be a separate code point.If this is correct (and it is), then decoding to dchar is simply not enough. You seem to advocate decoding to graphemes, which is a whole different matter.D needs a fully unicode-aware string type. I advocate D should use it as the default string type, but it needs one whether it's the default or not in order to say it supports unicode.How do you define "supports Unicode"? For my money, the main sin of (w)string is that it offers [] and .length with potentially confusing semantics, so if I could I'd curb, not expand, its interface. Andrei
Oct 29 2011
On Saturday, October 29, 2011 09:42:54 Andrei Alexandrescu wrote:On 10/26/11 7:18 AM, Steven Schveighoffer wrote:ofOn Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas =20 <simen.kjaras gmail.com> wrote:On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer =20 <schveiguy yahoo.com> wrote:Plus, a combining character (such as an umlaut or accent) is part=ota character, but may be a separate code point.=20 If this is correct (and it is), then decoding to dchar is simply n=rentenough. You seem to advocate decoding to graphemes, which is a whole diffe=ematter.=20 I am advocating that. And it's a matter of perception. D can say "w="weonly support code-point decoding" and what that means to a user is,=, butdon't support language as you know it." Sure it's a part of unicode=de, butit takes that extra piece to make it actually usable to people who require unicode. =20 Even in English, fianc=C3=A9 has an accent. To say D supports unico=alid*then won't do a simple search on a file which contains a certain *v=If the strings aren't normalized the same way, then it might not find f= ianc=C3=A9. If=20 they _are_ normalized the same way and fianc=C3=A9 is in there except t= hat the =C3=A9 is=20 actually modified by another code point after it (e.g. a subscript of 2= - not=20 exactly likely in this case but certainly possible), then that string w= ould =20 be found when it shouldn't be. The bigger problem though, I think, is w= hen=20 you're searching for a string which is the same without the modifiers -= which=20 would be fiance in this case - since then if the modfiying code points = are=20 after, then find will think that it found the string that you were look= ing for=20 when it didn't. Once you're dealing with modifying code points, in the general case, yo= u=20 _must_ operate on the grapheme level to ensure that you find exactly wh= at=20 you're looking for and only what you're looking for. If we assume that = all=20 strings are normalized the same way and pick the right normalization fo= r it=20 (and provide a function to normalize strings that way of course), then = we=20 could probably make that work 100% of the time (assuming that there's a= =20 normalized form with all of the modifying code points being _before_ th= e code=20 point that we modify and that no modifying code point can be a characte= r on=20 its own), but I'd have to study up on it more to be sure. Regardless, while searching for fianc=C3=A9 has a decent chance of succ= ess=20 (especially if programs generall favor using single code points instea= d of=20 multiple code points wherever possible), it's still a risky proposition= =20 without at least doing unicode normalization if not outright using a ra= nge of=20 graphemes rather than code points. - Jonathan M Davisencoding of that word is disingenuous to say the least.=20 Why doesn't that simple search work? =20 foreach (line; stdin.byLine()) { if (line.canFind("fianc=C3=A9")) { writeln("There it is."); } }
Oct 29 2011
On Sat, 29 Oct 2011 10:42:54 -0400, Andrei Alexandrescu = <SeeWebsiteForEmail erdani.org> wrote:On 10/26/11 7:18 AM, Steven Schveighoffer wrote:f aOn Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas <simen.kjaras gmail.com> wrote:On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer <schveiguy yahoo.com> wrote:Plus, a combining character (such as an umlaut or accent) is part o=character, but may be a separate code point.If this is correct (and it is), then decoding to dchar is simply not=ntenough. You seem to advocate decoding to graphemes, which is a whole differe=wematter.I am advocating that. And it's a matter of perception. D can say "we only support code-point decoding" and what that means to a user is, "=butdon't support language as you know it." Sure it's a part of unicode, =, butit takes that extra piece to make it actually usable to people who require unicode. Even in English, fianc=C3=A9 has an accent. To say D supports unicode=id*then won't do a simple search on a file which contains a certain *val=I think Jonathan answered that quite well, nothing else to add...encoding of that word is disingenuous to say the least.Why doesn't that simple search work? foreach (line; stdin.byLine()) { if (line.canFind("fianc=C3=A9")) { writeln("There it is."); } }asD needs a fully unicode-aware string type. I advocate D should use it=the default string type, but it needs one whether it's the default or=not in order to say it supports unicode.How do you define "supports Unicode"? For my money, the main sin of =(w)string is that it offers [] and .length with potentially confusing ==semantics, so if I could I'd curb, not expand, its interface.LOL, I'm so used to programming that I'm trying to figure out what the = meaning of sin(string) (as in sine) means :) I think there are two problems with [] and .length. First, that they = imply "get nth character" and "number of characters" respectively, and = second, that many times they *actually are* those things. So I agree with you the proposed string type needs to curb that interfac= e, = while giving us a fully character/grapheme aware interface (which is = currently lacking). I made an early attempt at doing this, and I will eventually get around = to = finishing it. I was in the middle of creating an algorithm to = as-efficiently-as-possible delineate a grapheme, when I got sidetracked = by = other things :) There are still lingering issues with the language whic= h = makes this a less-than-ideal replacement (arrays currently enjoy a lot o= f = "extra features" that custom types do not). -Steve
Oct 31 2011
On 10/31/11 8:20 AM, Steven Schveighoffer wrote:I see this as a decoding primitive issue that can be resolved. AndreiWhy doesn't that simple search work? foreach (line; stdin.byLine()) { if (line.canFind("fiancé")) { writeln("There it is."); } }I think Jonathan answered that quite well, nothing else to add...
Oct 31 2011
On 10/24/2011 7:02 AM, Steven Schveighoffer wrote:On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2 digitalmars.com> wrote:Sure you can. A Unicode character is a string, a Unicode string is a string of those strings. So, searching for a Unicode character is searching for a substring. Doing this character by character is standard, rote stuff. It's how grep works, for example.On 10/22/2011 2:21 AM, Peter Alexander wrote:Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars.Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.
Oct 24 2011
On Mon, 24 Oct 2011 17:27:46 -0400, Walter Bright = <newshound2 digitalmars.com> wrote:On 10/24/2011 7:02 AM, Steven Schveighoffer wrote:On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright =d =<newshound2 digitalmars.com> wrote:On 10/22/2011 2:21 AM, Peter Alexander wrote:Which operations do you believe would be less efficient?All of the ones that don't require decoding, such as searching, woul==be less efficient if decoding was done.Searching that does not do decoding is fundamentally incorrect. That =is, if you want to find a substring in a string, you cannot just compare chars.Sure you can. A Unicode character is a string, a Unicode string is a =string of those strings. So, searching for a Unicode character is =searching for a substring.What if the source character is encoded differently than the search = string? This is basic unicode stuff. See my example with fianc=C3=A9. -Steve
Oct 24 2011
On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer" <schveiguy yahoo.com> said:What if the source character is encoded differently than the search string? This is basic unicode stuff. See my example with fiancé.The more I think about it, the more I think it should work like this: just like we assume they contain well-formed UTF sequences, char[], wchar[], and dchar[] should also be assumed to contain **normalized** unicode strings. Which normalization form to use? no idea. Just pick a sensible one in the four. <http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms> Once we know all strings are normalized in the same way, we can then compare two strings bitwise to check if they're the same. And we can check for a substring in the same manner except we need to insert a simple check after the match to verify that it isn't surrounded by combining marks applying to the first or last character. And it'll also deeply simplify any proper Unicode string handling code we'll add in the future. - - - That said, I fear that forcing a specific normalization might be problematic. You don't always want to have to normalize everything... <http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences> So perhaps we could simplify things a bit more: don't pick a standard normalization form. Just assume that both strings being used are in the same normalization form. Comparison will work, searching for substring in the way specified above will work, and other functions could document which normalization form they accept. Problem solved... somewhat. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 24 2011
On Mon, 24 Oct 2011 19:58:59 -0400, Michel Fortin = <michel.fortin michelf.com> wrote:On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer" =<schveiguy yahoo.com> said:.What if the source character is encoded differently than the search string? This is basic unicode stuff. See my example with fianc=C3=A9=The more I think about it, the more I think it should work like this: ==just like we assume they contain well-formed UTF sequences, char[], =wchar[], and dchar[] should also be assumed to contain **normalized** ==unicode strings. Which normalization form to use? no idea. Just pick a==sensible one in the four. <http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms> Once we know all strings are normalized in the same way, we can then =compare two strings bitwise to check if they're the same. And we can =check for a substring in the same manner except we need to insert a =simple check after the match to verify that it isn't surrounded by =combining marks applying to the first or last character. And it'll als=o =deeply simplify any proper Unicode string handling code we'll add in t=he =future. - - - That said, I fear that forcing a specific normalization might be =problematic. You don't always want to have to normalize everything... <http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normal=ization_differences>So perhaps we could simplify things a bit more: don't pick a standard ==normalization form. Just assume that both strings being used are in th=e =same normalization form. Comparison will work, searching for substring==in the way specified above will work, and other functions could docume=nt =which normalization form they accept. Problem solved... somewhat.It's even easier than this: a) you want to do a proper string comparison not knowing what state the = = unicode strings are in, use the full-fledged decode-when-needed string = type, and its associated str.find method. b) you know they are both the same normalized form and want to optimize,= = use std.algorithm.find(haystack.asArray, needle.asArray). -Steve
Oct 26 2011
On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer" <schveiguy yahoo.com> said:It's even easier than this: a) you want to do a proper string comparison not knowing what state the unicode strings are in, use the full-fledged decode-when-needed string type, and its associated str.find method. b) you know they are both the same normalized form and want to optimize, use std.algorithm.find(haystack.asArray, needle.asArray).Well, treating the string as an array of dchar doesn't work in the general case, even with strings normalized the same way your fiancé example can break. So should never treat them as plain arrays unless I'm sure I have no combining marks in the string. I'm not opposed to having a new string type being developed, but I'm skeptical about its inclusion in the language. We already have three string types which can be assumed to contain valid UTF sequences. I think the first thing to do is not to develop a new string type, but to develop the normalization and grapheme splitting algorithms, and those to find a substring, using the existing char[], wchar[] and dchar[] types. Then write a program with proper handling of Unicode using those and hand-optimize it. If that proves to be a pain (it might well be), write a new string type, rewrite the program using it, do some benchmarks and then we'll know if it's a good idea, and will be able to quantify the drawbacks. But right now, all this arguing for or against a new string type is stacking hypothesis against other hypothesis, it won't lead anywhere. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 26 2011
On Wed, 26 Oct 2011 10:04:08 -0400, Michel Fortin = <michel.fortin michelf.com> wrote:On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer" =<schveiguy yahoo.com> said:theIt's even easier than this: a) you want to do a proper string comparison not knowing what state =gunicode strings are in, use the full-fledged decode-when-needed strin=ze,type, and its associated str.find method. b) you know they are both the same normalized form and want to optimi=use std.algorithm.find(haystack.asArray, needle.asArray).Well, treating the string as an array of dchar doesn't work in the =general case, even with strings normalized the same way your fianc=C3=A9==example can break. So should never treat them as plain arrays unless I='m =sure I have no combining marks in the string.If you have the same normalization, you would at least find all the vali= d = instances. You'd have to eliminate false positives, by decoding the nex= t = dchar after the match to see if it's a combining character. This would = be = a simple function to write. All I'm saying is, doing a search on the array should be available.I'm not opposed to having a new string type being developed, but I'm =skeptical about its inclusion in the language. We already have three =string types which can be assumed to contain valid UTF sequences. I =think the first thing to do is not to develop a new string type, but t=o =develop the normalization and grapheme splitting algorithms, and those==to find a substring, using the existing char[], wchar[] and dchar[] =types. Then write a program with proper handling of Unicode using tho=se =and hand-optimize it. If that proves to be a pain (it might well be), ==write a new string type, rewrite the program using it, do some =benchmarks and then we'll know if it's a good idea, and will be able t=o =quantify the drawbacks. But right now, all this arguing for or against a new string type is =stacking hypothesis against other hypothesis, it won't lead anywhere.I have a half-implemented string type which does just this. I need to = finish it. There are some language barriers that need to be sorted out = = (such as how to deal with tail-const for arbitrary types). I think a user who no longer posts here (spir) had some full = implementation of unicode strings using classes, though I don't see that= = being comparable in performance. -Steve
Oct 26 2011
On 2011-10-26 14:20:54 +0000, "Steven Schveighoffer" <schveiguy yahoo.com> said:If you have the same normalization, you would at least find all the vali d instances. You'd have to eliminate false positives, by decoding the nex t dchar after the match to see if it's a combining character. This would be a simple function to write. All I'm saying is, doing a search on the array should be available.And all I'm saying is that such a standalone function that eliminates the false positive should be available outside of a string type too. You'll likely need it for your string type anyway, but there's no reason you should't be able to use it standalone on a char[]/wchar[]/dchar[]. I think most string algorithms should be able to work with normalized character arrays; a new string type should just be a shell around these that makes things easier to the user.I have a half-implemented string type which does just this. I need to finish it.That should be interesting. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 26 2011
On Fri, 21 Oct 2011 21:11:14 +0300, Peter Alexander <peter.alexander.au gmail.com> wrote:Of course, people will still need to understand UTF-8. I don't think that's a problem. It's unreasonable to expect the language to do the thinking for you. The problem is that we have people that *do* understand UTF-8 (like the OP), but *don't* understand D's strings.Indeed, if one knows/understands how unicode works AND knows the structure of D strings, the current scheme is the best of both worlds. foreach(e; string) // chars foreach(e; byUTFxxx(string)) // ...
Oct 21 2011
On 2011-10-20 23:49, Peter Alexander wrote:On 20/10/11 8:37 PM, Martin Nowak wrote:The index could skip certain indexes to make that assert pass. But that would be confusing as well. -- /Jacob CarlborgIt just took me over one hour to find out the unthinkable. foreach(c; str) will deduce c to immutable(char) and doesn't care about unicode. Now there is so many unicode transcoding happening in the language that it starts to get annoying, but the most basic string iteration doesn't support it by default?D has got itself into a tricky situation in this regard. Doing it either way introduces an unintuitive mess. The way it is now, you get the problem that you just described where foreach is unaware of Unicode. If you changed it to loop as Unicode, then indices won't match up: immutable(int)[] a = ... foreach (x, i; a) assert(x == a[i]); // ok immutable(char)[] b = ... foreach (x, i; b) assert(x == b[i]); // not necessarily!
Oct 20 2011
On Thu, 20 Oct 2011 22:37:56 +0300, Martin Nowak <dawg dawgfoto.de> wrote:It just took me over one hour to find out the unthinkable. foreach(c; str) will deduce c to immutable(char) and doesn't care about unicode. Now there is so many unicode transcoding happening in the language that it starts to get annoying, but the most basic string iteration doesn't support it by default?I really can't believe why people expect that. By definition string is an array of chars, and this way it is consistent. All we need is: foreach(c, whatever(str)) instead of: foreach(c, str)
Oct 21 2011
On 10/20/2011 09:37 PM, Martin Nowak wrote:It just took me over one hour to find out the unthinkable. foreach(c; str) will deduce c to immutable(char) and doesn't care about unicode. Now there is so many unicode transcoding happening in the languageIn the standard library. Not in the language.that it starts to get annoying, but the most basic string iteration doesn't support it by default?I actually like the current behaviour.
Oct 21 2011
El 20/10/2011 21:37, Martin Nowak escribió:It just took me over one hour to find out the unthinkable. foreach(c; str) will deduce c to immutable(char) and doesn't care about unicode. Now there is so many unicode transcoding happening in the language that it starts to get annoying, but the most basic string iteration doesn't support it by default?Maybe I didn't fully get your point, but, you do know that you can do the following, right? string str = "Ñandú"; foreach(dchar c; str) ... and it decodes full unicode characters just fine. Or maybe you are just talking about the auto type inference, just to make sure... OTOH, as others say, it's not rare to iterate on 8-bit units if you're dealing with ASCII or if you are parsing and looking for operators and separators (which are normally 8-bit). Then you can leave the rest untouched or extract the parts in between without caring if they are 1 byte per character or several. (e.g. parsing XML, or JSON, or CSV, or INI, or conf, or D source, etc.)
Oct 22 2011
Am 22.10.2011 12:55, schrieb Alvaro:El 20/10/2011 21:37, Martin Nowak escribió:One visible Unicode character can consists of several dchars. (This is called "Grapheme") Cheers, - DanielIt just took me over one hour to find out the unthinkable. foreach(c; str) will deduce c to immutable(char) and doesn't care about unicode. Now there is so many unicode transcoding happening in the language that it starts to get annoying, but the most basic string iteration doesn't support it by default?Maybe I didn't fully get your point, but, you do know that you can do the following, right? string str = "Ñandú"; foreach(dchar c; str) ...
Oct 22 2011