digitalmars.D - Proposal for fixing dchar ranges
- Steven Schveighoffer (35/35) Mar 10 2014 I proposed this inside the long "major performance problem with ...
- Steven Schveighoffer (4/5) Mar 10 2014 An array of char even.
- Dicebot (6/41) Mar 10 2014 It will break any code that slices stored char[] strings directly
- Steven Schveighoffer (25/69) Mar 10 2014
- Dicebot (7/13) Mar 10 2014 Broken as in "you are not supposed to do it in user code"? Yes.
- Steven Schveighoffer (20/32) Mar 10 2014 If the idea to ensure the user cannot slice a code point was added, you ...
- H. S. Teoh (20/52) Mar 10 2014 I like this idea. Special-casing char[] in templates was a bad idea. It
- Steven Schveighoffer (9/18) Mar 10 2014 I agree that is a limitation of the proposal. It's more of a language-wi...
- Boyd (8/43) Mar 10 2014 I personally love this idea, though I think it probably
- Steven Schveighoffer (3/6) Mar 10 2014 What silent breaking changes?
- Boyd (4/11) Mar 10 2014 Utf8 aware slicing for strings would be an issue.
- Steven Schveighoffer (3/4) Mar 10 2014 I'm not proposing to add this.
- Boyd (5/10) Mar 10 2014 Ok, then you just destroyed my sole hypothetical objection to
- Brad Anderson (8/43) Mar 10 2014 Generally I think it's a good idea. Going a bit further you could
- Steven Schveighoffer (5/7) Mar 10 2014 You're the second person to mention that, I was not planning on disablin...
- John Colvin (3/12) Mar 10 2014 How is slicing any better than indexing?
- Steven Schveighoffer (9/22) Mar 10 2014 Because one can slice out a multi-code-unit code point, one cannot acces...
- John Colvin (9/34) Mar 10 2014 In order to be correct, both require exactly the same knowledge:
- Steven Schveighoffer (12/27) Mar 10 2014 Using indexing, you simply cannot get the single code unit that represen...
- John Colvin (7/38) Mar 10 2014 I think I understand your motivation now. Indexing never provides
- Steven Schveighoffer (13/16) Mar 10 2014 What it would do is remove the confusion of is(typeof(r.front) !=
- Johannes Pfau (10/22) Mar 10 2014 Unfortunately slicing by code units is probably the most important
- Steven Schveighoffer (13/35) Mar 10 2014 Slicing can never be a code point based operation. It would be too slow ...
- Steven Schveighoffer (12/15) Mar 10 2014 I said that wrong, of course it has meaning. What I mean is that if you ...
- Brad Anderson (3/12) Mar 10 2014 Sorry, I misunderstood. That sounds reasonable.
- Dicebot (4/13) Mar 11 2014 It is unacceptable to have slicing which is not O(1) for basic
- Steven Schveighoffer (3/15) Mar 11 2014 It would be O(1), work just like it does today.
- Dicebot (5/7) Mar 11 2014 Today it works by allowing arbitrary index and not checking if
- Steven Schveighoffer (13/20) Mar 11 2014 Well, a valid improvement would be to throw an exception when the slice ...
- Chris Williams (6/9) Mar 11 2014 If the indexes put into the slice aren't by code-point, but
- Steven Schveighoffer (34/41) Mar 11 2014 No, where we are today is that in some cases, the language treats a char...
- Johannes Pfau (13/32) Mar 11 2014 Yes, you can workaround the count problem, but then it is not
- Steven Schveighoffer (9/20) Mar 11 2014 I look at it a different way -- indexes are increasing, just not
- monarch_dodra (10/22) Mar 12 2014 I think it is important to remember that in terms of
- monarch_dodra (24/28) Mar 12 2014 I want to mention something I've had trouble with recently, that
- John Colvin (9/44) Mar 10 2014 I know warnings are disliked, but couldn't we make the slicing
- Steven Schveighoffer (15/59) Mar 10 2014
- Chris Williams (24/26) Mar 10 2014 If I was writing something like a chat or terminal window, I
- Walter Bright (4/8) Mar 10 2014 Proposals to make a string class for D have come up many times. I have a...
- Steven Schveighoffer (5/17) Mar 10 2014 I wholly agree, they should be an array type. But what they are now is
- Johannes Pfau (14/24) Mar 10 2014 Question: which type T doesn't have slicing, has an ElementType of
- Artem Tarasov (3/8) Mar 10 2014 In addition, hasLength!T == false, which totally freaked me out
- H. S. Teoh (38/66) Mar 10 2014 I'm on the fence about this one. The nice thing about strings being an
- John Colvin (3/68) Mar 10 2014 You started off on the fence, but you seem pretty convinced by
- Steven Schveighoffer (10/22) Mar 10 2014 BTW, this escaped my view the first time reading your post, but I am NOT...
- Walter Bright (3/5) Mar 10 2014 Right, but here I used the term "class" to be more generic as in being a...
- Steven Schveighoffer (7/12) Mar 10 2014 Then I don't understand your point. What strings are already is a
- Walter Bright (5/7) Mar 10 2014 With no enforcement, and that is by design.
- John Colvin (6/13) Mar 10 2014 I don't see how this proposal would limit that access. The raw
- Steven Schveighoffer (26/33) Mar 10 2014 The functionality added via phobos can hardly be considered extraneous. ...
- Walter Bright (4/9) Mar 10 2014 You divide the D world into two camps - those that use 'struct string', ...
- Steven Schveighoffer (13/25) Mar 10 2014 Really? It's not that divisive. However, the situation is certainly bett...
- Walter Bright (6/14) Mar 10 2014 I deserve that criticism. On the other hand, I've pretty much given up o...
- bearophile (5/8) Mar 10 2014 There are still some breaking changed that I'd like to perform in
- Meta (4/13) Mar 10 2014 That damnable comma operator is one of the worst things that was
- H. S. Teoh (8/20) Mar 11 2014 I've always been of the opinion that the comma operator in a for loop
- bearophile (5/8) Mar 11 2014 The place for the discussion about the comma operator:
- John Colvin (8/18) Mar 11 2014 I would go so far as to say this is a good thing, as long as the
- Ary Borenszweig (5/16) Mar 12 2014 You can also look at Erlang, where strings are just lists of numbers.
- Andrei Alexandrescu (4/22) Mar 12 2014 Erlang's mistake was different from what you believe was D's mistake.
- Ary Borenszweig (2/27) Mar 12 2014 What's D's mistake then?
- Andrei Alexandrescu (4/5) Mar 12 2014 I don't think we made a mistake with D's strings. They could have been
- John Colvin (13/48) Mar 10 2014 just to check I understand this fully:
- Steven Schveighoffer (25/77) Mar 10 2014
- John Colvin (3/82) Mar 11 2014 Awesome, let's do this :)
- Kagamin (4/4) Mar 11 2014 Automatic decoding by default itself is a WTF factor. The problem
- Marco Leise (37/37) Mar 17 2014 The Unicode standard is too complex for general purpose
- Dmitry Olshansky (21/54) Mar 18 2014 There is ICU and very few other things, like support in OSX frameworks
- Marco Leise (32/97) Mar 19 2014 ...ｄｔｈ, ᆨᆨᆨᆚᆚ...
- Dmitry Olshansky (28/76) Mar 19 2014 If that of any comfort other languages are even worse here. In C++ your
- Marco Leise (26/55) Mar 19 2014 And I thought of going the slow route where normalized and
I proposed this inside the long "major performance problem with std.array.front," I've also proposed it before, a long time ago. But seems to be getting no attention buried in that thread, not even negative attention :)

An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals. e.g.:

struct string
{
   immutable(char)[] representation;
   this(char[] data) { representation = data;}
   ... // dchar range primitives
}

Then, a char[] array is simply an array of char[].

points:
1. No more issues with foreach(c; "cassé"), it iterates via dchar
2. No more issues with "cassé"[4], it is a static compiler error.
3. No more awkward ASCII manipulation using ubyte[].
4. No more phobos schizophrenia saying char[] is not an array.
5. No more special casing char[] array templates to fool the compiler.
6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler.

Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. It's EXPLICITLY a dchar range. Use std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues.

I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier.

-Steve
Mar 10 2014
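[Editorial sketch: one hypothetical way the proposed wrapper could flesh out its dchar range primitives on top of std.utf.decodeFront. It is named String here only because `string` is currently an alias for immutable(char)[]; the range-primitive bodies are my illustration, not part of the proposal.]

```d
import std.utf : decodeFront;

struct String
{
    immutable(char)[] representation;

    // dchar range primitives: iteration decodes whole code points, so
    // there is no way to accidentally pull out half of a multi-unit
    // sequence while iterating.
    @property bool empty() const { return representation.length == 0; }

    @property dchar front() const
    {
        auto tmp = representation;
        return tmp.decodeFront(); // decode one code point, don't advance
    }

    void popFront()
    {
        representation.decodeFront(); // advance past one code point
    }
}

void main()
{
    auto s = String("casse\u0301"); // 'e' followed by a combining acute
    size_t n;
    foreach (dchar c; s)            // iterates by code point...
        ++n;
    assert(n == 6);                 // ...6 code points, 7 code units
}
```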
On Mon, 10 Mar 2014 09:35:44 -0400, Steven Schveighoffer <schveiguy yahoo.com> wrote:Then, a char[] array is simply an array of char[].An array of char even. -Steve
Mar 10 2014
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:I proposed this inside the long "major performance problem with std.array.front," I've also proposed it before, a long time ago. But seems to be getting no attention buried in that thread, not even negative attention :) An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals. e.g.: struct string { immutable(char)[] representation; this(char[] data) { representation = data;} ... // dchar range primitives } Then, a char[] array is simply an array of char[]. points: 1. No more issues with foreach(c; "cassé"), it iterates via dchar 2. No more issues with "cassé"[4], it is a static compiler error. 3. No more awkward ASCII manipulation using ubyte[]. 4. No more phobos schizophrenia saying char[] is not an array. 5. No more special casing char[] array templates to fool the compiler. 6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler. Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. It's EXPLICITLY a dchar range. Use std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues. I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier. -SteveIt will break any code that slices stored char[] strings directly which may or may not be breaking UTF depending on how indices are calculated. Also adding one more runtime dependency into language but there are so many that it probably does not matter.
Mar 10 2014
On Mon, 10 Mar 2014 10:48:26 -0400, Dicebot <public dicebot.lv> wrote:On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:I proposed this inside the long "major performance problem with std.array.front," I've also proposed it before, a long time ago. But seems to be getting no attention buried in that thread, not even negative attention :) An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals. e.g.: struct string { immutable(char)[] representation; this(char[] data) { representation = data;} ... // dchar range primitives } Then, a char[] array is simply an array of char[]. points: 1. No more issues with foreach(c; "cassé"), it iterates via dchar 2. No more issues with "cassé"[4], it is a static compiler error. 3. No more awkward ASCII manipulation using ubyte[]. 4. No more phobos schizophrenia saying char[] is not an array. 5. No more special casing char[] array templates to fool the compiler. 6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler. Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. It's EXPLICITLY a dchar range. Use std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues. I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier.It will break any code that slices stored char[] strings directly which may or may not be breaking UTF depending on how indices are calculated.That is already broken. What I'm looking to do is remove the cruft and "WTF" factor of the current state of affairs (an array that's not an array).

Originally (in that long ago proposal) I had proposed to check for and disallow invalid slicing during runtime. In fact, it could be added if desired with the type defined by the library.Also adding one more runtime dependency into language but there are so many that it probably does not matter.alias string = immutable(char)[];

There isn't much extra dependency one must add to revert to the original behavior. In fact, one nice thing about this proposal is the compiler changes can be done and tested before any real meddling with the string type is done.

-Steve
Mar 10 2014
On Monday, 10 March 2014 at 15:01:54 UTC, Steven Schveighoffer wrote:That is already broken. What I'm looking to do is remove the cruft and "WTF" factor of the current state of affairs (an array that's not an array). Originally (in that long ago proposal) I had proposed to check for and disallow invalid slicing during runtime. In fact, it could be added if desired with the type defined by the library.Broken as in "you are not supposed to do it in user code"? Yes. Broken as in "does the wrong thing" - no. If your index is properly calculated, it is no different from casting to ubyte[] and then slicing. I am pretty sure even Phobos does it here and there.
Mar 10 2014
On Mon, 10 Mar 2014 11:11:23 -0400, Dicebot <public dicebot.lv> wrote:On Monday, 10 March 2014 at 15:01:54 UTC, Steven Schveighoffer wrote:If the idea to ensure the user cannot slice a code point was added, you would still be able to slice via str.representation[a..b], or even str.ptr[a..b] if you were so sure of the length you didn't want it to be checked ;) The idea behind the proposal is to make it fully backwards compatible with existing code, except for randomly accessing a char, and probably .length. Slicing would still work as it does now, but could be adjusted later. It will break existing code. To fix those breaks, you would need to use the char[] array directly via the representation member, or rethink your code to be UTF-correct. Basically, instead of pretending an array isn't an array, create a new mostly-compatible type that behaves as we want it to behave in all circumstances, not just when you use phobos algorithms. The breaks may be trivial to work around, and might seem annoying. However, they may be actual UTF bugs that make your code more correct when you fix them. The biggest problem right now is the lack of the ability to implicitly cast to tail-const with a custom struct. We can keep an alias-this link for those cases until we can fix that in the compiler. -SteveThat is already broken. What I'm looking to do is remove the cruft and "WTF" factor of the current state of affairs (an array that's not an array). Originally (in that long ago proposal) I had proposed to check for and disallow invalid slicing during runtime. In fact, it could be added if desired with the type defined by the library.Broken as if in "you are not supposed to do it user code"? Yes. Broken as in "does the wrong thing" - no. If your index is properly calculated, it is no different from casting to ubyte[] and then slicing. I am pretty sure even Phobos does it here and there.
Mar 10 2014
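[Editorial sketch of the escape hatches described above. String stands in for the proposed type; whether the checked slice validates code point boundaries is the open design question, and the opSlice/ptr bodies here are my illustration.]

```d
struct String
{
    immutable(char)[] representation;

    // Checked path: a later revision could validate that a and b fall
    // on code point boundaries (that runtime check is hypothetical).
    String opSlice(size_t a, size_t b) const
    {
        return String(representation[a .. b]);
    }

    // Fully unchecked view, as with arrays today.
    @property immutable(char)* ptr() const { return representation.ptr; }
}

void main()
{
    auto s = String("casse\u0301");

    auto checked   = s[0 .. 5];                // could later throw on a bad boundary
    auto raw       = s.representation[0 .. 6]; // raw code units, bounds-checked
    auto unchecked = s.ptr[0 .. 6];            // no checks at all, as today
}
```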
On Mon, Mar 10, 2014 at 09:35:44AM -0400, Steven Schveighoffer wrote: [...]An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals. e.g.: struct string { immutable(char)[] representation; this(char[] data) { representation = data;} ... // dchar range primitives } Then, a char[] array is simply an array of char[]. points: 1. No more issues with foreach(c; "cassé"), it iterates via dchar 2. No more issues with "cassé"[4], it is a static compiler error. 3. No more awkward ASCII manipulation using ubyte[]. 4. No more phobos schizophrenia saying char[] is not an array. 5. No more special casing char[] array templates to fool the compiler. 6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler.I like this idea. Special-casing char[] in templates was a bad idea. It makes Phobos code needlessly complex, and the inconsistent treatment of char[] sometimes as an array of char and sometimes not causes silly issues like foreach defaulting to char but range iteration defaulting to dchar. Enclosing it in a struct means we can enforce string rules separately from the fact that it's a char array.Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. It's EXPLICITLY a dchar range. Use std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues. I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier.[...] The only concern I have is the current use of char[] and const(char)[] as mutable strings, and the current implicit conversion from string to const(char)[]. 
We would need similar wrappers for char[] and const(char)[], and string and mutablestring must be implicitly convertible to conststring, otherwise a LOT of existing code will break in a major way. Plus, these wrappers should also expose the same dchar range API with .representation giving a way to get at the raw code units. T -- It is the quality rather than the quantity that matters. -- Lucius Annaeus Seneca
Mar 10 2014
On Mon, 10 Mar 2014 10:54:50 -0400, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:The only concern I have is the current use of char[] and const(char)[] as mutable strings, and the current implicit conversion from string to const(char)[]. We would need similar wrappers for char[] and const(char)[], and string and mutablestring must be implicitly convertible to conststring, otherwise a LOT of existing code will break in a major way.I agree that is a limitation of the proposal. It's more of a language-wide problem that one cannot make a struct that can be tail-const-ified. One idea to begin with is to weakly bind to immutable(char)[] using alias this. That way, existing code devolves to current behavior. Then you pick off the primitives you want by defining them in the struct itself.Plus, these wrappers should also expose the same dchar range API with .representation giving a way to get at the raw code units.It already does that, representation is a public member. -Steve
Mar 10 2014
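[Editorial sketch of the weak binding described above (my illustration): anything not defined on the struct falls back to the raw array via alias this, so existing code devolves to current behavior, while primitives picked off and defined on the struct take precedence.]

```d
import std.utf : decodeFront;

struct String
{
    immutable(char)[] representation;

    // Weak binding: unknown operations devolve to the plain array.
    alias representation this;

    // Picked-off primitive: front decodes instead of returning a char.
    @property dchar front() const
    {
        auto tmp = representation;
        return tmp.decodeFront();
    }
}

void main()
{
    auto s = String("casse\u0301");
    assert(s.length == 7);  // falls through alias this: code units
    assert(s.front == 'c'); // defined on String: a decoded dchar
    static assert(is(typeof(s.front) == dchar));
}
```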
I personally love this idea, though I think it probably introduces too much silent breaking changes for it to be universally acceptable by D users. Perhaps naming it 'String', and deprecating 'string' would make it more acceptable? ------------ On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:I proposed this inside the long "major performance problem with std.array.front," I've also proposed it before, a long time ago. But seems to be getting no attention buried in that thread, not even negative attention :) An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals. e.g.: struct string { immutable(char)[] representation; this(char[] data) { representation = data;} ... // dchar range primitives } Then, a char[] array is simply an array of char[]. points: 1. No more issues with foreach(c; "cassé"), it iterates via dchar 2. No more issues with "cassé"[4], it is a static compiler error. 3. No more awkward ASCII manipulation using ubyte[]. 4. No more phobos schizophrenia saying char[] is not an array. 5. No more special casing char[] array templates to fool the compiler. 6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler. Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. It's EXPLICITLY a dchar range. Use std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues. I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier. -Steve
Mar 10 2014
On Mon, 10 Mar 2014 11:11:50 -0400, Boyd <gaboonviper gmx.net> wrote:I personally love this idea, though I think it probably introduces too much silent breaking changes for it to be universally acceptable by D users.What silent breaking changes? -Steve
Mar 10 2014
Utf8 aware slicing for strings would be an issue. ---------- On Monday, 10 March 2014 at 15:13:26 UTC, Steven Schveighoffer wrote:On Mon, 10 Mar 2014 11:11:50 -0400, Boyd <gaboonviper gmx.net> wrote:I personally love this idea, though I think it probably introduces too much silent breaking changes for it to be universally acceptable by D users.What silent breaking changes? -Steve
Mar 10 2014
On Mon, 10 Mar 2014 11:20:49 -0400, Boyd <gaboonviper gmx.net> wrote:Utf8 aware slicing for strings would be an issue.I'm not proposing to add this. -Steve
Mar 10 2014
Ok, then you just destroyed my sole hypothetical objection to this. ----------- On Monday, 10 March 2014 at 15:22:41 UTC, Steven Schveighoffer wrote:On Mon, 10 Mar 2014 11:20:49 -0400, Boyd <gaboonviper gmx.net> wrote:Utf8 aware slicing for strings would be an issue.I'm not proposing to add this. -Steve
Mar 10 2014
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:I proposed this inside the long "major performance problem with std.array.front," I've also proposed it before, a long time ago. But seems to be getting no attention buried in that thread, not even negative attention :) An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals. e.g.: struct string { immutable(char)[] representation; this(char[] data) { representation = data;} ... // dchar range primitives } Then, a char[] array is simply an array of char[]. points: 1. No more issues with foreach(c; "cassé"), it iterates via dchar 2. No more issues with "cassé"[4], it is a static compiler error. 3. No more awkward ASCII manipulation using ubyte[]. 4. No more phobos schizophrenia saying char[] is not an array. 5. No more special casing char[] array templates to fool the compiler. 6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler. Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. It's EXPLICITLY a dchar range. Use std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues. I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier. -SteveGenerally I think it's a good idea. Going a bit further you could also enable Short String Optimization but you'd have to encapsulate the backing array. It seems like this would be an even bigger breaking change than Walter's proposal though (right or wrong, slicing strings is very common).
Mar 10 2014
On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:It seems like this would be an even bigger breaking change than Walter's proposal though (right or wrong, slicing strings is very common).You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve
Mar 10 2014
On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:How is slicing any better than indexing?It seems like this would be an even bigger breaking change than Walter's proposal though (right or wrong, slicing strings is very common).You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve
Mar 10 2014
On Mon, 10 Mar 2014 14:01:45 -0400, John Colvin <john.loughran.colvin gmail.com> wrote:On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:Because one can slice out a multi-code-unit code point, one cannot access it via index. Strings would be horribly crippled without slicing. Without indexing, they are fine. A possibility is to allow index, but actually decode the code point at that index (error on invalid index). That might actually be the correct mechanism. -SteveOn Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:How is slicing any better than indexing?It seems like this would be an even bigger breaking change than Walter's proposal though (right or wrong, slicing strings is very common).You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve
Mar 10 2014
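[Editorial sketch of the "decode the code point at that index" alternative mentioned above (hypothetical): indexing returns a dchar, and an index that lands in the middle of a code point is an error rather than silently yielding a fragment.]

```d
import std.utf : decode;

struct String
{
    immutable(char)[] representation;

    // a[i] decodes the code point that *starts* at code unit i; an
    // index into the middle of a sequence throws a UTFException.
    dchar opIndex(size_t i) const
    {
        auto r = representation; // head-mutable copy for decode()
        size_t idx = i;
        return decode(r, idx);
    }
}

void main()
{
    auto s = String("casse\u0301");
    assert(s[4] == 'e');      // single-unit code point
    assert(s[5] == '\u0301'); // two code units decoded as one dchar
    // s[6] would throw: it points into the middle of the combining mark
}
```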
On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer wrote:On Mon, 10 Mar 2014 14:01:45 -0400, John Colvin <john.loughran.colvin gmail.com> wrote:In order to be correct, both require exactly the same knowledge: The beginning of a code point, followed by the end of a code point. In the indexing case they just happen to be the same code-point and happen to be one code unit from each other. I don't see how one is any more or less errror-prone or fundamentally wrong than the other. I do understand that slicing is more important however.On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:Because one can slice out a multi-code-unit code point, one cannot access it via index. Strings would be horribly crippled without slicing. Without indexing, they are fine. A possibility is to allow index, but actually decode the code point at that index (error on invalid index). That might actually be the correct mechanism. -SteveOn Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:How is slicing any better than indexing?It seems like this would be an even bigger breaking change than Walter's proposal though (right or wrong, slicing strings is very common).You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve
Mar 10 2014
On Mon, 10 Mar 2014 15:30:00 -0400, John Colvin <john.loughran.colvin gmail.com> wrote:On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer wrote:Because one can slice out a multi-code-unit code point, one cannot access it via index. Strings would be horribly crippled without slicing. Without indexing, they are fine. A possibility is to allow index, but actually decode the code point at that index (error on invalid index). That might actually be the correct mechanism.In order to be correct, both require exactly the same knowledge: The beginning of a code point, followed by the end of a code point. In the indexing case they just happen to be the same code-point and happen to be one code unit from each other. I don't see how one is any more or less error-prone or fundamentally wrong than the other.Using indexing, you simply cannot get the single code unit that represents a multi-code-unit code point. It doesn't fit in a char. It's guaranteed to fail, whereas slicing will give you access to all the data in the string. Now, with indexing actually decoding a code point, one can alias a[i] to a[i..$].front(), which means decode the first code point you come to at index i. This means indexing is slow(er), and returns a dchar. I think as a first step, that might be too much to add silently. I'd rather break it first, then add it back later. -Steve
Mar 10 2014
On Monday, 10 March 2014 at 20:00:07 UTC, Steven Schveighoffer wrote:On Mon, 10 Mar 2014 15:30:00 -0400, John Colvin <john.loughran.colvin gmail.com> wrote:I think I understand your motivation now. Indexing never provides anything that slicing doesn't do more generally.On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer wrote:Using indexing, you simply cannot get the single code unit that represents a multi-code-unit code point. It doesn't fit in a char. It's guaranteed to fail, whereas slicing will give you access to the all the data in the string.Because one can slice out a multi-code-unit code point, one cannot access it via index. Strings would be horribly crippled without slicing. Without indexing, they are fine. A possibility is to allow index, but actually decode the code point at that index (error on invalid index). That might actually be the correct mechanism.In order to be correct, both require exactly the same knowledge: The beginning of a code point, followed by the end of a code point. In the indexing case they just happen to be the same code-point and happen to be one code unit from each other. I don't see how one is any more or less errror-prone or fundamentally wrong than the other.Now, with indexing actually decoding a code point, one can alias a[i] to a[i..$].front(), which means decode the first code point you come to at index i. This means indexing is slow(er), and returns a dchar. I think as a first step, that might be too much to add silently. I'd rather break it first, then add it back later. -SteveOf course that i has to be at the beginning of a code-point. Doesn't seem like that useful a feature and potentially very confusing for people who naively expect normal indexing.
Mar 10 2014
On Mon, 10 Mar 2014 16:54:34 -0400, John Colvin <john.loughran.colvin gmail.com> wrote:Of course that i has to be at the beginning of a code-point. Doesn't seem like that useful a feature and potentially very confusing for people who naively expect normal indexing.What it would do is remove the confusion of is(typeof(r.front) != typeof(r[0])) Naivety is to be expected when you have made your C-derived language's default string type an encoded UTF8 array called char[]. It doesn't magically make D programs UTF aware. I would suggest that a lofty goal is for the string type to be completely safe, and efficient, and only allow raw access via the .representation member. But I don't think, given the current code base, that we can achieve that in one proposal. It has to be gradual. This is a first step. -Steve
Mar 10 2014
Am Mon, 10 Mar 2014 13:55:00 -0400 schrieb "Steven Schveighoffer" <schveiguy yahoo.com>:On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:Unfortunately slicing by code units is probably the most important safety issue with the current implementation: As was mentioned in the other thread: size_t index = str.countUntil('a'); auto slice = str[0..index]; This can be a safety and security issue. (I realize that this would break lots of code so I'm not sure if we should/can fix it. But I think this was the most important problem mentioned in the other thread.)It seems like this would be an even bigger breaking change than Walter's proposal though (right or wrong, slicing strings is very common).You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve
Mar 10 2014
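[Editorial note, spelling out the hazard above: under the auto-decoding behavior being discussed, countUntil walks a string as a dchar range and returns a count of code points, while slicing uses code units. Illustrative values only.]

```d
import std.algorithm : countUntil;

void main()
{
    string str = "1\u00E9a";          // '1', 'é' (U+00E9, 2 UTF-8 code units), 'a'
    auto index = str.countUntil('a'); // counts decoded code points: '1', 'é' -> 2
    auto slice = str[0 .. index];     // but slicing is by code unit, so this
                                      // cuts 'é' in half, leaving invalid
                                      // UTF-8 in the slice
    assert(slice.length == 2);
    assert(str[3] == 'a');            // the code-unit index of 'a' is really 3
}
```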
On Mon, 10 Mar 2014 14:54:22 -0400, Johannes Pfau <nospam example.com> wrote:Am Mon, 10 Mar 2014 13:55:00 -0400 schrieb "Steven Schveighoffer" <schveiguy yahoo.com>:Slicing can never be a code point based operation. It would be too slow (read linear complexity). What needs to be broken is the expectation that an index is the number of code points or characters in a string. Think of an index as a position that has no real meaning except they are ordered in the stream. Like a set of ordered numbers, not necessarily consecutive. The index 4 may not exist, while 5 does. At this point, my proposal does not fix that particular problem, but I don't think there's any way to fix that "problem" except to train the user who wrote it not to do that. However, it does not leave us in a worse position. -SteveOn Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:Unfortunately slicing by code units is probably the most important safety issue with the current implementation: As was mentioned in the other thread: size_t index = str.countUntil('a'); auto slice = str[0..index]; This can be a safety and security issue. (I realize that this would break lots of code so I'm not sure if we should/can fix it. But I think this was the most important problem mentioned in the other thread.)It seems like this would be an even bigger breaking change than Walter's proposal though (right or wrong, slicing strings is very common).You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve
Mar 10 2014
On Mon, 10 Mar 2014 16:06:25 -0400, Steven Schveighoffer <schveiguy yahoo.com> wrote:Think of an index as a position that has no real meaning except they are ordered in the stream. Like a set of ordered numbers, not necessarily consecutive. The index 4 may not exist, while 5 does.I said that wrong, of course it has meaning. What I mean is that if you have two positions, the ordering will indicate where the characters/graphemes/code points occur in the stream, but their value will not be indicative of how far they are apart in terms of characters/graphemes/code points. In other words, if I have two characters, at position p1 and p2, then p1 > p2 => p1 comes later in the string than p2 p1 == p2 => p1 and p2 refer to the same character p1 - p2 => not defined to any particular value. -Steve
Mar 10 2014
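The "ordered but not consecutive" positions described above can be seen directly with today's char[] indexing. A small sketch (the string uses the precomposed form of 'é', U+00E9, as an assumption):

```d
// Byte offsets into a UTF-8 string are ordered, but not every offset
// is a valid position: offset 5 below falls inside the two-code-unit
// encoding of 'é' (U+00E9 = 0xC3 0xA9).
void main()
{
    string s = "cass\u00E9";
    assert(s.length == 6); // 6 code units, only 5 code points

    // A position is valid only where a code point starts, i.e. the
    // code unit there is not a UTF-8 continuation byte (0b10xxxxxx).
    bool[] starts;
    foreach (i; 0 .. s.length)
        starts ~= (s[i] & 0xC0) != 0x80;
    assert(starts == [true, true, true, true, true, false]);
}
```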
On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:Sorry, I misunderstood. That sounds reasonable.It seems like this would be an even bigger breaking change than Walter's proposal though (right or wrong, slicing strings is very common).You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve
Mar 10 2014
On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:It is unacceptable to have slicing which is not O(1) for basic types.It seems like this would be an even bigger breaking change than Walter's proposal though (right or wrong, slicing strings is very common).You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve
Mar 11 2014
On Tue, 11 Mar 2014 09:11:22 -0400, Dicebot <public dicebot.lv> wrote:On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:It would be O(1), work just like it does today. -SteveOn Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:It is unacceptable to have slicing which is not O(1) for basic types.It seems like this would be an even bigger breaking change than Walter's proposal though (right or wrong, slicing strings is very common).You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length.
Mar 11 2014
On Tuesday, 11 March 2014 at 14:04:38 UTC, Steven Schveighoffer wrote:It would be O(1), work just like it does today. -SteveToday it works by allowing arbitrary index and not checking if resulting slice is valid UTF-8. Anything that implies decoding is O(n). What exactly do you have in mind for this?
Mar 11 2014
On Tue, 11 Mar 2014 10:06:47 -0400, Dicebot <public dicebot.lv> wrote:On Tuesday, 11 March 2014 at 14:04:38 UTC, Steven Schveighoffer wrote:Well, a valid improvement would be to throw an exception when the slice didn't start/end on a valid code point. This is easily checkable in O(1) time, but I wouldn't recommend it to begin with; it may have huge performance issues. Typically, one does not arbitrarily slice up via some specific value, they use a function to get an index, and they don't care what the index value actually is. Alternatively, it could be done via assert, to disable it during release mode. This might be acceptable. But I would never expect any kind of indexing or slicing to use "number of code points", which clearly requires O(n) decoding to determine its position. That would be disastrous. -SteveIt would be O(1), work just like it does today. -SteveToday it works by allowing arbitrary index and not checking if resulting slice is valid UTF-8. Anything that implies decoding is O(n). What exactly do you have in mind for this?
Mar 11 2014
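The O(1) start/end check described above amounts to rejecting slice boundaries that land on a UTF-8 continuation byte. A hypothetical sketch (the helper name is made up, not part of any proposal):

```d
import std.exception : enforce, assertThrown;

// Hypothetical helper: validate slice boundaries in O(1) by checking
// that neither endpoint lands on a continuation byte (0b10xxxxxx).
string checkedSlice(string s, size_t lo, size_t hi)
{
    enforce(lo <= hi && hi <= s.length, "out of bounds");
    enforce(lo == s.length || (s[lo] & 0xC0) != 0x80, "lo splits a code point");
    enforce(hi == s.length || (s[hi] & 0xC0) != 0x80, "hi splits a code point");
    return s[lo .. hi];
}

void main()
{
    string s = "cass\u00E9"; // 'é' occupies offsets 4 and 5
    assert(checkedSlice(s, 0, 4) == "cass");
    assertThrown(checkedSlice(s, 0, 5)); // ends mid-code-point
}
```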
On Tuesday, 11 March 2014 at 14:16:31 UTC, Steven Schveighoffer wrote:But I would never expect any kind of indexing or slicing to use "number of code points", which clearly requires O(n) decoding to determine its position. That would be disastrous.If the indexes put into the slice aren't by code point, and people need to use proper helper functions to convert a code point into an index, then we're basically back to where we are today.
Mar 11 2014
On Tue, 11 Mar 2014 13:18:46 -0400, Chris Williams <yoreanon-chrisw yahoo.co.jp> wrote:On Tuesday, 11 March 2014 at 14:16:31 UTC, Steven Schveighoffer wrote:No, where we are today is that in some cases, the language treats a char[] as an array of char, in other cases, it treats a char[] as a bi-directional dchar range. What I'm proposing is we have a type that defines "This is what a string looks like," and it is consistent across all uses of the string, instead of the schizophrenic view we have now. I would also point out that quite a bit of deception and nonsense is needed to maintain that view, including things like assert(!hasLength!(char[]) && __traits(compiles, { char[] x; auto y = x.length;})). The documentation for hasLength says "Tests if a given range has the length attribute," which is clearly a lie. However, I want to define right here that an index is not a number of code points. One does not frequently get code point counts, one gets indexes. It has always been that way, and I'm not planning to change that. That you can't use the index to determine the number of code points that came before it is not a frequent issue that arises. e.g., I want to find the first instance of "xyz" in a string, do I care how many code points it has to go through, or what point I have to slice the string to get that? A previous poster brings up this incorrect code: auto index = countUntil(str, "xyz"); auto newstr = str[index..$]; But it can easily be done this way also: auto index = indexOf(str, "xyz"); auto codepts = walkLength(str[0..index]); auto newstr = str[index..$]; Given how D works, I think it would be very costly and near impossible to somehow make the incorrect slice operation statically rejected. One simply has to be trained in what a code point is, and what a code unit is. HOWEVER, for the most part, nobody needs to care. Strings work fine without having to randomly access specific code points or slice based on them. Using indexes works just fine. 
-Steve

But I would never expect any kind of indexing or slicing to use "number of code points", which clearly requires O(n) decoding to determine its position. That would be disastrous.If the indexes put into the slice aren't by code point, and people need to use proper helper functions to convert a code point into an index, then we're basically back to where we are today.
Mar 11 2014
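The countUntil pitfall discussed above can be made concrete: countUntil walks a string as a dchar range and returns a code point count, while slicing takes code unit indexes. A small sketch:

```d
import std.algorithm : countUntil;
import std.string : indexOf;

void main()
{
    string s = "\u00E9x";        // "éx": 'é' is two code units
    auto cp = s.countUntil('x'); // code points before 'x': 1
    auto cu = s.indexOf('x');    // code unit (byte) index of 'x': 2
    assert(cp == 1);
    assert(cu == 2);
    assert(s[cu .. $] == "x");   // correct slice, by code units
    assert(s[cp .. $] != "x");   // slicing with the code point count is wrong
}
```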
Am Tue, 11 Mar 2014 14:02:26 -0400 schrieb "Steven Schveighoffer" <schveiguy yahoo.com>:A previous poster brings up this incorrect code: auto index = countUntil(str, "xyz"); auto newstr = str[index..$]; But it can easily be done this way also: auto index = indexOf(str, "xyz"); auto codepts = walkLength(str[0..index]); auto newstr = str[index..$]; Given how D works, I think it would be very costly and near impossible to somehow make the incorrect slice operation statically rejected. One simply has to be trained what a code point is, and what a code unit is. HOWEVER, for the most part, nobody needs to care. Strings work fine without having to randomly access specific code points or slice based on them. Using indexes works just fine. -SteveYes, you can work around the count problem, but then it is not "consistent across all uses of the string". What if the above code was a generic template written for arrays? Then it silently fails for strings and you have to special case it. I think the problem here is that ranges / algorithms have to work on the same data type as slicing/indexing. If .front returns code units, then indexing/slicing should be done with code units. If it returns code points then slicing has to happen on code points for consistency or it should be disallowed. (Slicing on code units is important - no doubt. But it is error prone and should be explicit in some way: string.sliceCP(a, b) or string.representation[a...b])
Mar 11 2014
On Tue, 11 Mar 2014 14:25:10 -0400, Johannes Pfau <nospam example.com> wrote:Yes, you can workaround the count problem, but then it is not "consistent across all uses of the string". What if the above code was a generic template written for arrays? Then it silently fails for strings and you have to special case it. I think the problem here is that if ranges / algorithms have to work on the same data type as slicing/indexing. If .front returns code units, then indexing/slicing should be done with code units. If it returns code points then slicing has to happen on code points for consistency or it should be disallowed. (Slicing on code units is important - no doubt. But it is error prone and should be explicit in some way: string.sliceCP(a, b) or string.representation[a...b])I look at it a different way -- indexes are increasing, just not consecutive. If there is a way to say "indexes are not linear", then that would be a good trait to expose. For instance, think of a tree-map, which has keys that may not be consecutive. Should you be able to slice such a container? I'd say yes. But tree[0..5] may not get you the first 5 elements. -Steve
Mar 11 2014
On Tuesday, 11 March 2014 at 18:26:36 UTC, Johannes Pfau wrote:I think the problem here is that ranges / algorithms have to work on the same data type as slicing/indexing. If .front returns code units, then indexing/slicing should be done with code units. If it returns code points then slicing has to happen on code points for consistency or it should be disallowed. (Slicing on code units is important - no doubt. But it is error prone and should be explicit in some way: string.sliceCP(a, b) or string.representation[a...b])I think it is important to remember that in terms of ranges/algorithms, strings are not indexable, nor sliceable ranges. The "only way" to generically slice a string in generic code is to explicitly test that a range is actually a string, and then knowingly call an "internal primitive" that is NOT a part of the range traits. So slicing/indexing *is* already disallowed, in terms of ranges/algorithms anyways.
Mar 12 2014
On Tuesday, 11 March 2014 at 18:02:26 UTC, Steven Schveighoffer wrote:No, where we are today is that in some cases, the language treats a char[] as an array of char, in other cases, it treats a char[] as a bi-directional dchar range. -SteveI want to mention something I've had trouble with recently, that I haven't seen mentioned yet, but is related: The ambiguity of the "lone char". By that I mean: When a function accepts 'char' as an argument, it is (IMO) very hard to know what it is actually accepting: 1. An ASCII char in the 0 .. 128 range? 2. A code unit? 3. (heaven forbid) a codepoint in the 0 .. 256 range packed into a char? Currently (fortunately? unfortunately?) the current choice taken in our algorithms is 3, which is actually the 'safest' solution. So if you write: find("cassé", cast(char)'é'); It *will* correctly find the 'é', but it *won't* search for it in individual code units. -------- Another more pernicious case is that of output ranges. "put" is supposed to know how to convert any string/char width into any string/char width. Again, things become funky if you tell "put" to place a string into a sink that accepts a char. Is the sink actually telling you to feed it code units? Or ASCII?
Mar 12 2014
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:I proposed this inside the long "major performance problem with std.array.front," I've also proposed it before, a long time ago. But seems to be getting no attention buried in that thread, not even negative attention :) An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals. e.g.: struct string { immutable(char)[] representation; this(char[] data) { representation = data;} ... // dchar range primitives } Then, a char[] array is simply an array of char. points: 1. No more issues with foreach(c; "cassé"), it iterates via dchar 2. No more issues with "cassé"[4], it is a static compiler error. 3. No more awkward ASCII manipulation using ubyte[]. 4. No more phobos schizophrenia saying char[] is not an array. 5. No more special casing char[] array templates to fool the compiler. 6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler. Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. It's EXPLICITLY a dchar range. Use std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues. I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier. -SteveI know warnings are disliked, but couldn't we make the slicing and indexing work as currently but issue a warning*? It's not ideal but it does mean we get backwards compatibility. In my mind this is an important enough improvement to justify a little unpleasantness. We can't afford the breakage but we also should definitely act on this. *Alternatively, they could just be deprecated from the get-go.
Mar 10 2014
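The proposal's elided range primitives ("... // dchar range primitives") might be filled in as follows. This is only an assumed sketch: the struct is renamed String to avoid clashing with the built-in alias, and decode/stride from std.utf are one possible way to implement the primitives.

```d
import std.utf : decode, stride;

// Sketch of the proposed wrapper: a struct over immutable(char)[]
// that is a dchar range by construction. The primitive bodies are
// assumptions, not part of the original proposal.
struct String
{
    immutable(char)[] representation;

    @property bool empty() { return representation.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(representation, i); // decode one code point
    }

    void popFront()
    {
        // advance by the width of the leading code point
        representation = representation[stride(representation, 0) .. $];
    }
}

void main()
{
    auto s = String("cass\u00E9");
    size_t n;
    foreach (dchar c; s) ++n;
    assert(n == 5);                       // iterates by code point
    assert(s.representation.length == 6); // raw code units still reachable
}
```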
On Mon, 10 Mar 2014 13:59:53 -0400, John Colvin <john.loughran.colvin gmail.com> wrote:On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:I proposed this inside the long "major performance problem with std.array.front," I've also proposed it before, a long time ago. But seems to be getting no attention buried in that thread, not even negative attention :) An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals. e.g.: struct string { immutable(char)[] representation; this(char[] data) { representation = data;} ... // dchar range primitives } Then, a char[] array is simply an array of char. points: 1. No more issues with foreach(c; "cassé"), it iterates via dchar 2. No more issues with "cassé"[4], it is a static compiler error. 3. No more awkward ASCII manipulation using ubyte[]. 4. No more phobos schizophrenia saying char[] is not an array. 5. No more special casing char[] array templates to fool the compiler. 6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler. Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. It's EXPLICITLY a dchar range. Use std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues. I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier. -SteveI know warnings are disliked, but couldn't we make the slicing and indexing work as currently but issue a warning*? It's not ideal but it does mean we get backwards compatibility.As I mentioned elsewhere (but repeating here for viewers), I was not planning on disabling slicing.
Indexing is rarely a feature one needs or should use, especially with encoded strings. -Steve
Mar 10 2014
On Monday, 10 March 2014 at 18:13:14 UTC, Steven Schveighoffer wrote:Indexing is rarely a feature one needs or should use, especially with encoded strings.If I was writing something like a chat or terminal window, I would want to be able to jump to chunks of text based on some sort of buffer length, then search for actual character boundaries. Similarly, if I was indexing text, I don't care what the underlying data is, just whether any particular set of n-bytes has been seen together among some document. For the latter case, I don't need to be able to interpret the data as text while indexing, but once I perform an actual search and want to jump the user to that line in the file, being able to take a byte offset that I had stored in the index and convert that to a textual position would be good. I do think that D should have something like alias String8 = UTF!char; alias String16 = UTF!wchar; alias String32 = UTF!dchar; And that those sit on top of an underlying immutable(xchar)[] buffer, providing variants of things like foreach and length based on code-point or grapheme boundaries. But I don't think there's any value in reinterpreting "string". Not being a struct or an object, it doesn't have the extensibility to be useful for all the variations of access that working with Unicode and the underlying bytes warrants.
Mar 10 2014
On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals.Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.
Mar 10 2014
On Mon, 10 Mar 2014 14:30:07 -0400, Walter Bright <newshound2 digitalmars.com> wrote:On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:I wholly agree, they should be an array type. But what they are now is worse. -SteveAn idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals.Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.
Mar 10 2014
Am Mon, 10 Mar 2014 11:30:07 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:Question: which type T doesn't have slicing, has an ElementType of dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == char and still satisfies isArray? It's a string. Would you call that 'an array type'? writeln(isArray!string); //true writeln(hasSlicing!string); //false writeln(ElementType!string.stringof); //dchar writeln(ElementEncodingType!string.stringof); //char I wouldn't call that an array. Part of the problem is that you want string to be arrays (fixed size elements, direct indexing) and Andrei doesn't want them to be arrays (operating on code points => not fixed size => not arrays).An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals.Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.
Mar 10 2014
On Monday, 10 March 2014 at 18:50:28 UTC, Johannes Pfau wrote:Question: which type T doesn't have slicing, has an ElementType of dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == char and still satisfies isArray?In addition, hasLength!T == false, which totally freaked me out when I first discovered that.
Mar 10 2014
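The inconsistency discussed above can be checked directly with today's phobos traits; a small sketch:

```d
import std.range : hasLength, hasSlicing, isRandomAccessRange;
import std.traits : isArray;

void main()
{
    // string is an array to the compiler...
    static assert(isArray!string);
    string s = "abc";
    assert(s.length == 3); // ...and .length compiles and works

    // ...but phobos' range traits disown the array primitives:
    static assert(!hasLength!string);
    static assert(!hasSlicing!string);
    static assert(!isRandomAccessRange!string);
}
```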
On Mon, Mar 10, 2014 at 07:49:04PM +0100, Johannes Pfau wrote:Am Mon, 10 Mar 2014 11:30:07 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:I'm on the fence about this one. The nice thing about strings being an array type, is that it is a familiar concept to C coders, and it allows array slicing for extracting substrings, etc., which fits nicely with the C view of strings as character arrays. As a C coder myself, I like it this way too. But the bad thing about strings being an array type, is that it's a holdover from C, and it allows slicing for extracting substrings -- malformed substrings by permitting slicing a multibyte (multiword) character. Basically, the nice aspects of strings being arrays only apply when you're dealing with ASCII (or mostly-ASCII) strings. These very same "nice" aspects turn into problems when dealing with anything non-ASCII. The only way the user can get it right using only array operations, is if they understand the whole of Unicode in their head and are willing to reinvent Unicode algorithms every time they slice a string or do some operation on it. Since D purportedly supports Unicode by default, it shouldn't be this way. D should *actually* support Unicode all the way -- use proper Unicode algorithms for substring extraction, collation, line-breaking, normalization, etc.. Being a systems language, of course, means that D should allow you to get under the hood and do things directly with the raw string representation -- but this shouldn't be the *default* modus operandi. The default should be a properly-encapsulated string type with Unicode algorithms to operate on it (with the option of reaching into the raw representation where necessary).On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. 
Then, make the compiler use this type for literals.Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.Question: which type T doesn't have slicing, has an ElementType of dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == char and still satisfies isArray? It's a string. Would you call that 'an array type'? writeln(isArray!string); //true writeln(hasSlicing!string); //false writeln(ElementType!string.stringof); //dchar writeln(ElementEncodingType!string.stringof); //char I wouldn't call that an array. Part of the problem is that you want string to be arrays (fixed size elements, direct indexing) and Andrei doesn't want them to be arrays (operating on code points => not fixed size => not arrays).Exactly. What we have right now is a frankensteinian hybrid that's neither fully an array, nor fully a Unicode string type. If we call the current messy AA implementation split between compiler, aaA.d, and object.di a design problem, then I'd call the current state of D strings a design problem too. This underlying inconsistency is ultimately what leads to the poor performance of strings in std.algorithm. It's precisely because of this that I've given up on using std.algorithm for strings altogether -- std.regex is far better: more flexible, more expressive, and more performant, and specifically designed to operate on strings. Nowadays I only use std.algorithm for non-string ranges (because then the behaviour is actually consistent!!). T -- MS Windows: 64-bit overhaul of 32-bit extensions and a graphical shell for a 16-bit patch to an 8-bit operating system originally coded for a 4-bit microprocessor, written by a 2-bit company that can't stand 1-bit of competition.
Mar 10 2014
On Monday, 10 March 2014 at 19:48:34 UTC, H. S. Teoh wrote:On Mon, Mar 10, 2014 at 07:49:04PM +0100, Johannes Pfau wrote:You started off on the fence, but you seem pretty convinced by the end!Am Mon, 10 Mar 2014 11:30:07 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:I'm on the fence about this one. The nice thing about strings being an array type, is that it is a familiar concept to C coders, and it allows array slicing for extracting substrings, etc., which fits nicely with the C view of strings as character arrays. As a C coder myself, I like it this way too. But the bad thing about strings being an array type, is that it's a holdover from C, and it allows slicing for extracting substrings -- malformed substrings by permitting slicing a multibyte (multiword) character. Basically, the nice aspects of strings being arrays only apply when you're dealing with ASCII (or mostly-ASCII) strings. These very same "nice" aspects turn into problems when dealing with anything non-ASCII. The only way the user can get it right using only array operations, is if they understand the whole of Unicode in their head and are willing to reinvent Unicode algorithms every time they slice a string or do some operation on it. Since D purportedly supports Unicode by default, it shouldn't be this way. D should *actually* support Unicode all the way -- use proper Unicode algorithms for substring extraction, collation, line-breaking, normalization, etc.. Being a systems language, of course, means that D should allow you to get under the hood and do things directly with the raw string representation -- but this shouldn't be the *default* modus operandi. 
The default should be a properly-encapsulated string type with Unicode algorithms to operate on it (with the option of reaching into the raw representation where necessary).On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals.Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.
Mar 10 2014
On Mon, 10 Mar 2014 14:30:07 -0400, Walter Bright <newshound2 digitalmars.com> wrote:On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:BTW, this escaped my view the first time reading your post, but I am NOT proposing a string *class*. In fact, I'm not proposing we change anything technical about strings, the code generated should be basically identical. What I'm proposing is to encapsulate what you can and can't do with a string in the type itself, instead of making the standard library flip over backwards to treat it as something else when the compiler treats it as a simple array of char. -SteveAn idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals.Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.
Mar 10 2014
On 3/10/2014 11:54 AM, Steven Schveighoffer wrote:BTW, this escaped my view the first time reading your post, but I am NOT proposing a string *class*.Right, but here I used the term "class" to be more generic as in being a user defined type, i.e. struct or class. I should have been more clear.
Mar 10 2014
On Mon, 10 Mar 2014 15:01:20 -0400, Walter Bright <newshound2 digitalmars.com> wrote:On 3/10/2014 11:54 AM, Steven Schveighoffer wrote:Then I don't understand your point. What strings are already is a user-defined type, but with horrible enforcement. i.e. things that shouldn't be allowed are only disallowed if you opt-in using phobos' template constraints. -SteveBTW, this escaped my view the first time reading your post, but I am NOT proposing a string *class*.Right, but here I used the term "class" to be more generic as in being a user defined type, i.e. struct or class. I should have been more clear.
Mar 10 2014
On 3/10/2014 1:36 PM, Steven Schveighoffer wrote:What strings are already is a user-defined type,No, they are not.but with horrible enforcement.With no enforcement, and that is by design. Keep in mind that D is a systems programming language, and that means unfettered access to strings.
Mar 10 2014
On Monday, 10 March 2014 at 20:52:27 UTC, Walter Bright wrote:On 3/10/2014 1:36 PM, Steven Schveighoffer wrote:I don't see how this proposal would limit that access. The raw immutable(char)[] is still there, ready to be used just as always. It seems like it fits the D ethos: safe and reasonably fast by default, unsafe and lightning fast on request. (Admittedly a bad wording, sometimes the fastest can still be safe)What strings are already is a user-defined type,No, they are not.but with horrible enforcement.With no enforcement, and that is by design. Keep in mind that D is a systems programming language, and that means unfettered access to strings.
Mar 10 2014
On Mon, 10 Mar 2014 16:52:27 -0400, Walter Bright <newshound2 digitalmars.com> wrote:On 3/10/2014 1:36 PM, Steven Schveighoffer wrote:The functionality added via phobos can hardly be considered extraneous. One would not use strings without the library.What strings are already is a user-defined type,No, they are not.The enforcement is opt-in. That is, you have to use phobos' templates in order to use them "properly": auto getIt(R)(R r, size_t idx) { if(idx < r.length) return r[idx]; assert(0); } The above compiles fine for strings. However, it does not compile fine if you do: auto getIt(R)(R r, size_t idx) if(hasLength!R && isRandomAccessRange!R) Any other range will fail to compile for the more strict version and the simple implementation without template constraints. In other words, the compiler doesn't believe the same thing phobos does. Shooting one's foot is quite easy.but with horrible enforcement.With no enforcement, and that is by design.Keep in mind that D is a systems programming language, and that means unfettered access to strings.Access is fine, with clear intentions. And we do not have unfettered access. I cannot sort a mutable string of ASCII characters without first converting it to ubyte[]. What in my proposal makes you think you don't have unfettered access? The underlying immutable(char)[] representation is accessible. In fact, you would have more access, since phobos functions would then work with a char[] like it's a proper array. -Steve
Mar 10 2014
On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:What in my proposal makes you think you don't have unfettered access? The underlying immutable(char)[] representation is accessible. In fact, you would have more access, since phobos functions would then work with a char[] like it's a proper array.You divide the D world into two camps - those that use 'struct string', and those that use immutable(char)[] strings.I imagine only code that is currently UTF ignorant will break,This also makes it a non-starter.
Mar 10 2014
On Mon, 10 Mar 2014 17:52:05 -0400, Walter Bright <newshound2 digitalmars.com> wrote:On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:Really? It's not that divisive. However, the situation is certainly better than today's world of those who use 'string' and those who use 'string.representation'. Those who use string.representation would actually get much more use out of it. Those who use string would see no changes.What in my proposal makes you think you don't have unfettered access? The underlying immutable(char)[] representation is accessible. In fact, you would have more access, since phobos functions would then work with a char[] like it's a proper array.You divide the D world into two camps - those that use 'struct string', and those that use immutable(char)[] strings.> I imagine only code that is currently UTF ignorant will break, This also makes it a non-starter.You're the guardian of changes to the language, clearly holding a veto on any proposals. But this doesn't come across as very open-minded, especially from someone who wanted to do something that would change the fundamental treatment of strings last week. IMO, breaking incorrect code is a good idea, and worth at least exploring. -Steve
Mar 10 2014
On 3/10/2014 3:26 PM, Steven Schveighoffer wrote:On Mon, 10 Mar 2014 17:52:05 -0400, Walter Bright <newshound2 digitalmars.com> wrote:I deserve that criticism. On the other hand, I've pretty much given up on fixing std.array.front() because of that. In the last couple days, we also wound up annoying a valuable client with some minor breakage with std.json, reiterating how important it is to not break code if we can at all avoid it.This also makes it a non-starter.You're the guardian of changes to the language, clearly holding a veto on any proposals. But this doesn't come across as very open-minded, especially from someone who wanted to do something that would change the fundamental treatment of strings last week.IMO, breaking incorrect code is a good idea, and worth at least exploring.Breaking broken code, yes.
Mar 10 2014
Walter Bright:In the last couple days, we also wound up annoying a valuable client with some minor breakage with std.json, reiterating how important it is to not break code if we can at all avoid it.There are still some breaking changes that I'd like to perform in D, like deprecating certain usages of the comma operator, etc. Bye, bearophile
Mar 10 2014
On Tuesday, 11 March 2014 at 00:02:13 UTC, bearophile wrote:Walter Bright:That damnable comma operator is one of the worst things that was inherited from C. IMO, it has no use outside the header of a for loop, and even there it's suspect.In the last couple days, we also wound up annoying a valuable client with some minor breakage with std.json, reiterating how important it is to not break code if we can at all avoid it.There are still some breaking changes that I'd like to perform in D, like deprecating certain usages of the comma operator, etc. Bye, bearophile
Mar 10 2014
On Tue, Mar 11, 2014 at 12:49:40AM +0000, Meta wrote:On Tuesday, 11 March 2014 at 00:02:13 UTC, bearophile wrote:[...]Walter Bright:In the last couple days, we also wound up annoying a valuable client with some minor breakage with std.json, reiterating how important it is to not break code if we can at all avoid it.There are still some breaking changes that I'd like to perform in D, like deprecating certain usages of the comma operator, etc.That damnable comma operator is one of the worst things that was inherited from C. IMO, it has no use outside the header of a for loop, and even there it's suspect.I've always been of the opinion that the comma operator in a for loop should be treated as special syntax, rather than a language-wide operator. The comma operator must die. :P T -- Public parking: euphemism for paid parking. -- Flora
Mar 11 2014
Meta:That damnable comma operator is one of the worst things that was inherited from C. IMO, it has no use outside the header of a for loop, and even there it's suspect.The place for the discussion about the comma operator: https://d.puremagic.com/issues/show_bug.cgi?id=2659 Bye, bearophile
Mar 11 2014
On Monday, 10 March 2014 at 21:52:04 UTC, Walter Bright wrote:On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:I would go so far as to say this is a good thing, as long as the 'struct string' is transparently the default. If you want good unicode support that works in a sane and relatively transparent manner, just write string, use literals as normal etc. If you want a normal array of characters, that behaves sanely and consistently as an array, use char[] with relevant qualifiers.What in my proposal makes you think you don't have unfettered access? The underlying immutable(char)[] representation is accessible. In fact, you would have more access, since phobos functions would then work with a char[] like it's a proper array.You divide the D world into two camps - those that use 'struct string', and those that use immutable(char)[] strings.
Mar 11 2014
On 3/10/14, 3:30 PM, Walter Bright wrote:On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:You can also look at Erlang, where strings are just lists of numbers. Eventually they realized it was a huge mistake and introduced another type, a binary string, which is much more efficient and works as expected. I think making strings behave like arrays is a design mistake.An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals.Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.
Mar 12 2014
On 3/12/14, 6:24 AM, Ary Borenszweig wrote:On 3/10/14, 3:30 PM, Walter Bright wrote:Erlang's mistake was different from what you believe was D's mistake. There is no comparison to be drawn. AndreiOn 3/10/2014 6:35 AM, Steven Schveighoffer wrote:You can also look at Erlang, where strings are just lists of numbers. Eventually they realized it was a huge mistake and introduced another type, a binary string, which is much more efficient and works as expected. I think making strings behave like arrays is a design mistake.An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals.Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.
Mar 12 2014
On 3/12/14, 1:53 PM, Andrei Alexandrescu wrote:On 3/12/14, 6:24 AM, Ary Borenszweig wrote:What's D's mistake then?On 3/10/14, 3:30 PM, Walter Bright wrote:Erlang's mistake was different from what you believe was D's mistake. There is no comparison to be drawn. AndreiOn 3/10/2014 6:35 AM, Steven Schveighoffer wrote:You can also look at Erlang, where strings are just lists of numbers. Eventually they realized it was a huge mistake and introduced another type, a binary string, which is much more efficient and works as expected. I think making strings behave like arrays is a design mistake.An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals.Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.
Mar 12 2014
On 3/12/14, 10:29 AM, Ary Borenszweig wrote:What's D's mistake then?I don't think we made a mistake with D's strings. They could have been done better if we made all iteration requests explicit. Andrei
Mar 12 2014
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:I proposed this inside the long "major performance problem with std.array.front," I've also proposed it before, a long time ago. But seems to be getting no attention buried in that thread, not even negative attention :) An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals. e.g.: struct string { immutable(char)[] representation; this(char[] data) { representation = data;} ... // dchar range primitives } Then, a char[] array is simply an array of char[]. points: 1. No more issues with foreach(c; "cassé"), it iterates via dchar 2. No more issues with "cassé"[4], it is a static compiler error. 3. No more awkward ASCII manipulation using ubyte[]. 4. No more phobos schizophrenia saying char[] is not an array. 5. No more special casing char[] array templates to fool the compiler. 6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler. Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. It's EXPLICITLY a dchar range. Use std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues. I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier. -Stevejust to check I understand this fully: in this new scheme, what would this do? auto s = "cassé".representation; foreach(i, c; s) write(i, ':', c, ' '); writeln(s); Currently - without the .representation - I get 0:c 1:a 2:s 3:s 4:e 5:̠6:` cassé or, to spell it out a bit more: 0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81 cassé
Mar 10 2014
On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin <john.loughran.colvin gmail.com> wrote:On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:I proposed this inside the long "major performance problem with std.array.front," I've also proposed it before, a long time ago. But seems to be getting no attention buried in that thread, not even negative attention :) An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals. e.g.: struct string { immutable(char)[] representation; this(char[] data) { representation = data;} ... // dchar range primitives } Then, a char[] array is simply an array of char[]. points: 1. No more issues with foreach(c; "cassé"), it iterates via dchar. 2. No more issues with "cassé"[4], it is a static compiler error. 3. No more awkward ASCII manipulation using ubyte[]. 4. No more phobos schizophrenia saying char[] is not an array. 5. No more special casing char[] array templates to fool the compiler. 6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler. Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. It's EXPLICITLY a dchar range. Use std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues. I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier.The plan is for foreach on s to iterate by char, and foreach on "cassé" to iterate by dchar. What this means is the accent will be iterated separately from the e, and likely gets put onto the colon after 5. However, the half code-units that have no meaning anywhere (0xCC and 0x81) would not be iterated.
In your above code, using .representation would be equivalent to what it is now without .representation (i.e. over char), and without .representation would be equivalent to this on today's compiler (except faster): foreach(i, dchar c; s) -Stevejust to check I understand this fully: in this new scheme, what would this do? auto s = "cassé".representation; foreach(i, c; s) write(i, ':', c, ' '); writeln(s); Currently - without the .representation - I get 0:c 1:a 2:s 3:s 4:e 5:̠6:` cassé or, to spell it out a bit more: 0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81 cassé
Mar 10 2014
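[Editor's note: the char/dchar distinction Steven describes can be reproduced with today's compiler by choosing the iteration type explicitly. A sketch; the string uses 'e' plus the U+0301 combining acute, as in the thread:

```d
import std.stdio;

void main()
{
    string s = "casse\u0301"; // 'e' followed by U+0301: 7 UTF-8 code units

    // foreach over char walks code units: 0xCC and 0x81 appear separately
    foreach (i, char c; s)
        writef("%s:%x ", i, c);
    writeln();

    // foreach over dchar decodes code points: U+0301 is one element,
    // and i is the code-unit index where that code point starts
    foreach (i, dchar c; s)
        writef("%s:%s ", i, c);
    writeln();
}
```

Under the proposal, the dchar behavior would be the default for `string` and the char behavior the default for `.representation`.]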
On Monday, 10 March 2014 at 22:15:34 UTC, Steven Schveighoffer wrote:On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin <john.loughran.colvin gmail.com> wrote:Awesome, let's do this :)On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:The plan is for foreach on s to iterate by char, and foreach on "cassé" to iterate by dchar. What this means is the accent will be iterated separately from the e, and likely gets put onto the colon after 5. However, the half code-units that has no meaning anywhere (xCC and X81) would not be iterated. In your above code, using .representation would be equivalent to what it is now without .representation (i.e. over char), and without .representation would be equivalent to this on today's compiler (except faster): foreach(i, dchar c; s) -SteveI proposed this inside the long "major performance problem with std.array.front," I've also proposed it before, a long time ago. But seems to be getting no attention buried in that thread, not even negative attention :) An idea to fix the whole problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range, that actually dictates the rules we want. Then, make the compiler use this type for literals. e.g.: struct string { immutable(char)[] representation; this(char[] data) { representation = data;} ... // dchar range primitives } Then, a char[] array is simply an array of char[]. points: 1. No more issues with foreach(c; "cassé"), it iterates via dchar 2. No more issues with "cassé"[4], it is a static compiler error. 3. No more awkward ASCII manipulation using ubyte[]. 4. No more phobos schizophrenia saying char[] is not an array. 5. No more special casing char[] array templates to fool the compiler. 6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler. Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. 
It's EXPLICITLY a dchar range. Use std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues. I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier. -Stevejust to check I understand this fully: in this new scheme, what would this do? auto s = "cassé".representation; foreach(i, c; s) write(i, ':', c, ' '); writeln(s); Currently - without the .representation - I get 0:c 1:a 2:s 3:s 4:e 5:̠6:` cassé or, to spell it out a bit more: 0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81 cassé
Mar 11 2014
Automatic decoding by default itself is a WTF factor. The problem with it is it encourages unicode ignorance and pretends to work correctly, so it's harder for the developer to discover the incorrectness.
Mar 11 2014
The Unicode standard is too complex for general purpose algorithms to do useful things on D strings. We don't see that however, since our writing systems are sufficiently well supported. As an inspiration I'll leave a string here that contains combined characters in Korean (http://decodeunicode.org/hangul_syllables) and Latin as well as full width characters that span 2 characters in e.g. Latin, Greek or Cyrillic scripts (http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms): Halfwidth / Ｆｕｌｌｗｉｄｔｈ, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊ (I used the "unfonts" package for the Hangul part) What I want to say is that for correct Unicode handling we should either use existing libraries or get a feeling for what the Unicode standard provides, then form use cases out of it. For example when we talk about the length of a string we are actually talking about 4 different things: - number of code units - number of code points - number of user perceived characters - display width using a monospace font The same distinction applies for slicing, depending on use case. Related: - What normalization do D strings use. Both Linux and MacOS X use UTF-8, but the binary representation of non-ASCII file names is different. - How do we handle sorting strings? The topic matter is complex, but not difficult (as in rocket science). If we really want to find a solution, we should form an expert group and stop talking until we read the latest Unicode specs. They are a moving target. Don't expect to ever be "done" with full Unicode support in D. -- Marco
Mar 17 2014
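[Editor's note: three of the four meanings of "length" above can be demonstrated with Phobos as it stands; a sketch, assuming std.uni's byGrapheme (part of the new std.uni discussed later in the thread):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "casse\u0301"; // 'e' + U+0301 combining acute
    assert(s.length == 7);                // code units (UTF-8)
    assert(s.walkLength == 6);            // code points (auto-decoded range)
    assert(s.byGrapheme.walkLength == 5); // user-perceived characters
    // display width additionally depends on the font/terminal
    // and is not something the string alone can answer
}
```
]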
18-Mar-2014 10:21, Marco Leise wrote:The Unicode standard is too complex for general purpose algorithms to do useful things on D strings. We don't see that however, since our writing systems are sufficiently well supported.As an inspiration I'll leave a string here that contains combined characters in Korean (http://decodeunicode.org/hangul_syllables) and Latin as well as full width characters that span 2 characters in e.g. Latin, Greek or Cyrillic scripts (http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms): Halfwidth / Fullwidth, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊ (I used the "unfonts" package for the Hangul part) What I want to say is that for correct Unicode handling we should either use existing libraries or get a feeling for what the Unicode standard provides, then form use cases out of it.There is ICU and very few other things, like support in OSX frameworks (NSString). Industry in general kinda sucks on this point but desperately wants to improve.For example when we talk about the length of a string we are actually talking about 4 different things: - number of code units - number of code points - number of user perceived characters - display width using a monospace font The same distinction applies for slicing, depending on use case. Related: - What normalization do D strings use. Both Linux and MacOS X use UTF-8, but the binary representation of non-ASCII file names is different.There is no single normalization to fix on. D programs may be written for Linux only, for Mac-only or for both. IMO we should just provide ways to normalize strings. (std.uni.normalize has 'normalize' for starters).- How do we handle sorting strings?Unicode collation algorithm and provide ways to tweak the default one.The topic matter is complex, but not difficult (as in rocket science). If we really want to find a solution, we should form an expert group and stop talking until we read the latest Unicode specs.Well, I did.
You seem motivated, would you like to join the group?They are a moving target. Don't expect to ever be "done" with full Unicode support in D.The 6.x standard line seems pretty stable to me. There is a point in providing support that is worth approaching. After that the ROI drops steadily as the amount of work to specialize for each specific culture rises. At some point we can only talk about opening up ways to specialize. D (or any library for that matter) won't ever have all possible tinkering that Unicode standard permits. So I expect D to be "done" with Unicode one day simply by reaching a point of having all universally applicable stuff (and stated defaults) plus having a toolbox to craft your own versions of algorithms. This is the goal of new std.uni. -- Dmitry Olshansky
Mar 18 2014
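[Editor's note: the point that D leaves normalization to the user can be illustrated with std.uni.normalize; a sketch, assuming the NFC/NFD forms exported by std.uni:

```d
import std.uni : normalize, NFC, NFD;

void main()
{
    string precomposed = "caf\u00E9";  // 'é' as a single code point
    string decomposed  = "cafe\u0301"; // 'e' + combining acute

    // plain == compares code units, so equivalent strings differ
    assert(precomposed != decomposed);

    // after normalizing both to the same form, equivalence holds
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));
    assert(normalize!NFD(precomposed) == normalize!NFD(decomposed));
}
```
]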
On Tue, 18 Mar 2014 23:18:16 +0400, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:18-Mar-2014 10:21, Marco Leise wrote:The Unicode standard is too complex for general purpose algorithms to do useful things on D strings. We don't see that however, since our writing systems are sufficiently well supported.As an inspiration I'll leave a string here that contains combined characters in Korean (http://decodeunicode.org/hangul_syllables) and Latin as well as full width characters that span 2 characters in e.g. Latin, Greek or Cyrillic scripts (http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms): Halfwidth / Ｆｕｌｌｗｉｄｔｈ, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊Normalizations C and D are the non-lossy ones and as far as I understood equivalent. So I agree.(I used the "unfonts" package for the Hangul part) What I want to say is that for correct Unicode handling we should either use existing libraries or get a feeling for what the Unicode standard provides, then form use cases out of it.There is ICU and very few other things, like support in OSX frameworks (NSString). Industry in general kinda sucks on this point but desperately wants to improve.For example when we talk about the length of a string we are actually talking about 4 different things: - number of code units - number of code points - number of user perceived characters - display width using a monospace font The same distinction applies for slicing, depending on use case. Related: - What normalization do D strings use. Both Linux and MacOS X use UTF-8, but the binary representation of non-ASCII file names is different.There is no single normalization to fix on.
D programs may be written for Linux only, for Mac-only or for both.IMO we should just provide ways to normalize strings. (std.uni.normalize has 'normalize' for starters).I wondered if anyone will actually read up on normalization prior to touching Unicode strings. I didn't, Andrei didn't and so on... So I expect strA == strB to be common enough, just like floatA == floatB until the news spread. Since == is supposed to compare for equivalence, could we hide all those details in an opaque string type and offer correct comparison functions?- How do we handle sorting strings?Unicode collation algorithm and provide ways to tweak the default one.I wish I didn't look at the UCA. Jeeeez... But yeah, that's the way to go. Big frameworks like Java added a Collate class with predefined constants for several languages. That's too much work for us. But the API doesn't need to preclude adding those.The topic matter is complex, but not difficult (as in rocket science). If we really want to find a solution, we should form an expert group and stop talking until we read the latest Unicode specs.Well, I did. You seem motivated, would you like to join the group?Yes, I'd like to see a Unicode 6.x approved stamp on D. I didn't know that you already wrote all the simple algorithms for 2.064. Those would have been my candidates to work on, too. Is there anything that can be implemented in a day or two? :)They are a moving target. Don't expect to ever be "done" with full Unicode support in D.The 6.x standard line seems pretty stable to me. There is a point in providing support that is worth approaching. After that the ROI drops steadily as the amount of work to specialize for each specific culture rises. At some point we can only talk about opening up ways to specialize. D (or any library for that matter) won't ever have all possible tinkering that Unicode standard permits.
So I expect D to be "done" with Unicode one day simply by reaching a point of having all universally applicable stuff (and stated defaults) plus having a toolbox to craft your own versions of algorithms. This is the goal of new std.uni.Sorting strings is a very basic feature, but as I learned now also highly complex. I expected some kind of tables for download that would suffice, but the rules are pretty detailed. E.g. in German phonebook order, ä/ö/ü has the same order as ae/oe/ue. -- Marco
Mar 19 2014
19-Mar-2014 18:42, Marco Leise wrote:On Tue, 18 Mar 2014 23:18:16 +0400, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Right, the KC & KD ones are really all about fuzzy matching and searching.Normalizations C and D are the non-lossy ones and as far as I understood equivalent. So I agree.Related: - What normalization do D strings use. Both Linux and MacOS X use UTF-8, but the binary representation of non-ASCII file names is different.There is no single normalization to fix on. D programs may be written for Linux only, for Mac-only or for both.If that is of any comfort, other languages are even worse here. In C++ you are hopeless without ICU.IMO we should just provide ways to normalize strings. (std.uni.normalize has 'normalize' for starters).I wondered if anyone will actually read up on normalization prior to touching Unicode strings. I didn't, Andrei didn't and so on... So I expect strA == strB to be common enough, just like floatA == floatB until the news spread.Since == is supposed to compare for equivalence, could we hide all those details in an opaque string type and offer correct comparison functions?Well, turns out the Unicode standard ties equivalence to normalization forms. In other words unless both your strings are normalized the same way there is really no point in trying to compare them. As for opaque type - we could have say String!NFC and String!NFD or some-such. It would then make sure the normalization is the right one.Needless to say I had a nice jaw-dropping moment when I realized what elephant I have missed with our std.uni (somewhere in the middle of the work).I wish I didn't look at the UCA. Jeeeez... But yeah, that's the way to go.- How do we handle sorting strings?Unicode collation algorithm and provide ways to tweak the default one.Big frameworks like Java added a Collate class with predefined constants for several languages. That's too much work for us. 
But the API doesn't need to preclude adding those.Indeed some kind of Collator is in order. On the use side of things it's simply a functor that compares strings. The fact that it's full of tables and the like is well hidden. The only thing above that is caching preprocessed strings, which may be useful for databases and string indexes.Cool, consider yourself enlisted :) I reckon word and line breaking algorithms are a piece of cake compared to UCA. Given the power toys of CodepointSet and toTrie it shouldn't be that hard to come up with a prototype. Then we just move precomputed versions of related tries to std/internal/ and that's it, ready for public consumption.Yes, I'd like to see a Unicode 6.x approved stamp on D. I didn't know that you already wrote all the simple algorithms for 2.064. Those would have been my candidates to work on, too. Is there anything that can be implemented in a day or two? :)The topic matter is complex, but not difficult (as in rocket science). If we really want to find a solution, we should form an expert group and stop talking until we read the latest Unicode specs.Well, I did. You seem motivated, would you like to join the group?This is tailoring, an awful thing that makes cultural differences what they are in Unicode ;) What we need first and foremost is a DUCET-based version (default Unicode collation element tables). -- Dmitry OlshanskyD (or any library for that matter) won't ever have all possible tinkering that Unicode standard permits. So I expect D to be "done" with Unicode one day simply by reaching a point of having all universally applicable stuff (and stated defaults) plus having a toolbox to craft your own versions of algorithms. This is the goal of new std.uni.Sorting strings is a very basic feature, but as I learned now also highly complex. I expected some kind of tables for download that would suffice, but the rules are pretty detailed. E.g. in German phonebook order, ä/ö/ü has the same order as ae/oe/ue.
Mar 19 2014
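[Editor's note: the String!NFC / String!NFD idea floated above could look roughly like this sketch. The `String` wrapper and its shape are hypothetical; only std.uni's `normalize` and `NormalizationForm` are real:

```d
import std.uni : normalize, NFC, NormalizationForm;

// Hypothetical wrapper: normalize once on construction, so that
// opEquals is Unicode equivalence rather than a raw code-unit compare.
struct String(NormalizationForm form = NFC)
{
    immutable(char)[] representation;

    this(string data) { representation = normalize!form(data); }

    bool opEquals(const String rhs) const
    {
        // both sides are already in the same normalization form
        return representation == rhs.representation;
    }
}

void main()
{
    auto a = String!NFC("cafe\u0301"); // decomposed input
    auto b = String!NFC("caf\u00E9");  // precomposed input
    assert(a == b); // equivalent once both are in NFC
}
```

The type parameter guarantees that two `String!NFC` values are always comparable, at the cost of normalizing every constructed string up front.]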
On Thu, 20 Mar 2014 01:55:08 +0400, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Well, turns out the Unicode standard ties equivalence to normalization forms. In other words unless both your strings are normalized the same way there is really no point in trying to compare them. As for opaque type - we could have say String!NFC and String!NFD or some-such. It would then make sure the normalization is the right one.And I thought of going the slow route where normalized and unnormalized strings can coexist and be compared. No NFD or NFC, just UTF-8 strings. Pros: + Learning about normalization isn't needed to use strings correctly. And few people do that. + Strings don't need to be normalized. Every modification to data is bad, e.g. when said string is fed back to the source. Think about a file name on a file system where a different normalization is a different file. Cons: - Comparisons for already normalized strings are unnecessarily slow. Maybe the normalization form (NFC, NFD, mixed) could be stored alongside the string.Cool, consider yourself enlisted :) I reckon word and line breaking algorithms are a piece of cake compared to UCA. Given the power toys of CodepointSet and toTrie it shouldn't be that hard to come up with a prototype. Then we just move precomputed versions of related tries to std/internal/ and that's it, ready for public consumption.Would a typical use case be to find the previous/next boundary given a code unit index? E.g. the cursor sits on a word and you want to jump to the start or end of it. Just iterating the words and lines might not be too useful.
D (or any library for that matter) won't ever have all possible tinkering that Unicode standard permits. So I expect D to be "done" with Unicode one day simply by reaching a point of having all universally applicable stuff (and stated defaults) plus having a toolbox to craft your own versions of algorithms. This is the goal of new std.uni.Sorting strings is a very basic feature, but as I learned now also highly complex. I expected some kind of tables for download that would suffice, but the rules are pretty detailed. E.g. in German phonebook order, ä/ö/ü has the same order as ae/oe/ue.This is tailoring, an awful thing that makes cultural differences what they are in Unicode ;) What we need first and foremost is a DUCET-based version (default Unicode collation element tables).Of course. -- Marco
Mar 19 2014
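[Editor's note: for the cursor-movement use case Marco raises, std.uni already exposes a forward-stepping primitive. A sketch, assuming std.uni's `graphemeStride`, which returns the code-unit length of the grapheme starting at a given index:

```d
import std.uni : graphemeStride;

void main()
{
    string s = "casse\u0301"; // 'e' + U+0301: one grapheme, 3 code units
    size_t cursor = 4;        // sitting on the 'e'

    // step the cursor forward by one user-perceived character
    cursor += graphemeStride(s, cursor);
    assert(cursor == s.length); // skipped 'e' and its combining mark

    // stepping backwards has no direct primitive; one would scan back
    // to a known boundary and re-run graphemeStride from there
}
```
]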