digitalmars.D - std.algorithm.remove and principle of least astonishment
- klickverbot (43/43) Oct 16 2010 Hello all,
- Steven Schveighoffer (18/53) Oct 16 2010 My guess is that since INPUT is a string, phobos has unwisely decided to...
- Andrei Alexandrescu (4/7) Oct 16 2010 s/un//
- Andrei Alexandrescu (4/8) Oct 16 2010 I'm not seeing that. I'm seeing strings working automagically with most
- Steven Schveighoffer (52/59) Oct 16 2010 I've seen several posts regarding char[] being considered differently by...
- Andrei Alexandrescu (7/9) Oct 16 2010 I did so. It was called byDchar and it would accept a string type. It
- Tomek Sowiński (4/10) Oct 16 2010 Why did it suck?
- Andrei Alexandrescu (5/13) Oct 16 2010 Because 99% of the times you'd want to pass byDchar, but it was easy to
- Steven Schveighoffer (5/19) Oct 16 2010 So call it string, and make the compiler use it as the default type for ...
- Steven Schveighoffer (11/19) Oct 16 2010 The compiler thinks they are. And they look like arrays (T[] looks like...
- Andrei Alexandrescu (5/25) Oct 16 2010 It would do wrong or useless things otherwise. I'd probably do some
- Rainer Deyke (4/7) Oct 16 2010 Then rename them to something else. Problem solved.
- Bruno Medeiros (26/35) Nov 19 2010 "They are not arrays."? So why are they arrays then? :3
- Andrei Alexandrescu (3/40) Nov 19 2010 I don't think that would mark an improvement.
- Rainer Deyke (12/40) Nov 19 2010 You don't see the advantage of generic types behaving in a generic
- spir (14/54) Nov 20 2010
- Rainer Deyke (24/47) Nov 20 2010 std::vector<bool> in C++ is a specialization of std::vector that packs
- Andrei Alexandrescu (13/58) Nov 20 2010 The parallel does not stand scrutiny. The problem with vector<bool> in
- Rainer Deyke (13/47) Nov 20 2010 The problem with std::vector<bool> is that it pretends to be a
- Andrei Alexandrescu (35/80) Nov 21 2010 char[] does not exhibit the same issues that vector<bool> has. The
- Rainer Deyke (25/51) Nov 21 2010 I agree that there are differences. For one thing, if you iterate over
- Andrei Alexandrescu (18/80) Nov 21 2010 This is sensible because a string may be seen as a sequence of code
- Rainer Deyke (40/77) Nov 21 2010 I'm not interested in discussing if char[] is overall a better data
- Andrei Alexandrescu (34/49) Nov 21 2010 This is exactly where your point falls apart. I'm actually glad you
- Rainer Deyke (40/77) Nov 21 2010 That the range view and the array view provide direct access to the same
- Andrei Alexandrescu (10/85) Nov 21 2010 This is not a guarantee by ranges, it's just a mistaken assumption.
- Andrew Wiley (8/33) Nov 21 2010 One gotcha that seems to occur here is this code:
- Rainer Deyke (30/78) Nov 22 2010 Are you saying that arrays of T do not function as ranges of T when T is
- Jonathan M Davis (4/13) Nov 22 2010 I believe that he means that you either use them as ranges or you use th...
- Rainer Deyke (6/12) Nov 22 2010 It is impossible to have a non-empty array without at some point using
- Andrei Alexandrescu (4/15) Nov 22 2010 Thanks.
- Rainer Deyke (11/30) Nov 22 2010 I think this bug is a symptom of a larger issue. The range abstraction
- Jesse Phillips (4/11) Nov 22 2010 Note that this issue with foreach has been discussed before. The suggest...
- spir (19/34) Nov 23 2010 d solution was to have infer dchar instead of char (shot down since iter...
- so (4/18) Nov 22 2010 Or better, if you want both ranges and random access do same thing,
- Steven Schveighoffer (13/20) Nov 22 2010 I want to use char[] as an array. I want to sort the array, how do I do...
- Michel Fortin (19/22) Nov 22 2010 It's amusing to read this from my perspective.
- Andrei Alexandrescu (3/14) Nov 22 2010 Why do you want to sort an array of char?
- Steven Schveighoffer (23/38) Nov 22 2010 You're dodging the question. You claim that if I want to use it as an
- Andrei Alexandrescu (19/58) Nov 22 2010 Of course you can. After you were to admit that it makes next to no
- Steven Schveighoffer (29/52) Nov 22 2010 That wasn't what you said -- you said I can use char[] as an array if I ...
- Andrei Alexandrescu (30/87) Nov 22 2010 That still stays valid. The thing is, sort doesn't sort arrays, it sorts...
- foobar (6/119) Nov 22 2010 Canonical example: DNA.
- Andrei Alexandrescu (7/12) Nov 22 2010 I think it's best to encode DNA strings as sequences of ubyte. UTF
- Jonathan M Davis (7/25) Nov 22 2010 The problem with char is that so many people are used to thinking of cha...
- foobar (6/23) Nov 23 2010 The isn't a quantitative issue but an existential one. I agree that it's...
- Andrei Alexandrescu (5/27) Nov 23 2010 Yes, and the language offers the abstraction abilities to define such
- foobar (9/45) Nov 23 2010 It's simple, a mediocre language (Java) with mediocre libraries has orde...
- Andrei Alexandrescu (18/80) Nov 23 2010 I don't think the dynamics of programming language success can be
- Bruno Medeiros (4/5) Nov 24 2010 Java has mediocre libraries?? Are you serious about that opinion?
- Jonathan M Davis (15/23) Nov 23 2010 I think that what he's saying is that the names char, wchar, and dchar a...
- so (5/27) Nov 23 2010 That actually is an excellent idea, wiping all 3 of them and replacing
- Daniel Gibson (17/34) Nov 23 2010 And in Java a char is a 16bit unicode char that is generally handled as ...
- Don (13/56) Nov 24 2010 I don't think that's a valid comparison, since we have integer types,
- spir (16/22) Nov 24 2010
- Daniel Gibson (3/5) Nov 24 2010 probably because you can't write ubyte[] str = "asdf"; and they want to ...
- Andrei Alexandrescu (3/9) Nov 24 2010 Probably the assignment should be allowed.
- spir (12/18) Nov 24 2010 e ubyte? But instead bug into char. Is this only because of C baggage?
- KennyTM~ (20/80) Nov 22 2010 Right, and D3 should simply disable using char and wchar as an
- Bruno Medeiros (19/24) Nov 24 2010 More exactly, that the following is true for any T:
- Bruno Medeiros (42/70) Nov 24 2010 Actually, I'll reply here, on why I would like these guarantees:
- Jonathan M Davis (16/32) Nov 21 2010 Character arrays are arrays of code units and ranges of code points (of ...
- Jonathan M Davis (10/45) Nov 21 2010 Actually, the better implementation would probably be to provide wrapper...
- Andrei Alexandrescu (25/33) Nov 21 2010 I agree except for the majority of cases part. In fact the original
- Jonathan M Davis (30/69) Nov 21 2010 Well, I don't know for certain whether people would normally want to ite...
- Michel Fortin (29/38) Nov 21 2010 Well, basically these two arguments are the same: iterating by code
- spir (58/93) Nov 22 2010
- spir (31/68) Nov 22 2010 r ranges
- Bruno Medeiros (5/12) Nov 24 2010 Those things you would have done differently, would any of them impact
- Michel Fortin (34/37) Nov 21 2010 It's convenient that char[] and wchar[] expose a dchar bidirectional
- Andrei Alexandrescu (13/46) Nov 21 2010 I understand the concern, and that's why I strongly support formal
- Jonathan M Davis (10/45) Nov 21 2010 We could always define an abstract Character (or whatever you want to ca...
- Lutger Blijdestijn (8/17) Nov 21 2010 Is there a plan to make std.string and std.algorithm more compatible wit...
- spir (20/33) Nov 22 2010 Sure, D helps a lot. I agree with abstraction levels independant of inte...
- spir (45/55) Nov 22 2010 l it)
- spir (48/85) Nov 22 2010
- Michel Fortin (14/17) Nov 22 2010 Just to add to the compexity: graphemes aren't always equivalent to
- spir (17/26) Nov 22 2010 Mac
- Michel Fortin (13/37) Nov 22 2010 Is searching for a word in a text file less general purpose than
- Michel Fortin (18/81) Nov 22 2010 I agree there might be a use case for a special data type allowing fast
- spir (20/25) Nov 22 2010 It's true as long as you can assert each string is iterated at most once...
- Michel Fortin (14/27) Nov 22 2010 I think you missed my point.
- klickverbot (2/2) Oct 16 2010 In case it was not clear, this is what I want to achive:
- Andrei Alexandrescu (22/65) Oct 16 2010 Thanks for the input. This is not a bug, it's what I believe to be a
- klickverbot (6/11) Oct 16 2010 I see that there is a problem due the difference of code units and code
- Pelle (2/13) Oct 16 2010 Try it with ä or ░ instead of x.
- Andrei Alexandrescu (5/16) Oct 16 2010 Strings are dual types. They have [] and .length but not with the
- Andrei Alexandrescu (7/18) Oct 16 2010 To drive my point home: if you wanted to replace not 'x', but instead a
- foobar (5/13) Nov 24 2010 It all depends on the scale you use.
Hello all,

I decided to have a go at solving some easy programming puzzles with D2/Phobos to see how Phobos, especially ranges and std.algorithm, work out in simple real-world use cases (the puzzle in question is from hacker.org, by the way).

The following code is a direct translation of a simple problem description to D (it is horrible from performance point of view, but that's certainly no issue here).

---
import std.algorithm;
import std.conv;
import std.stdio;

// The original input string is longer, but irrelevant to this post.
enum INPUT = "93752xxx746x27x1754xx90x93xxxxx238x44x75xx087509";

void main() {
   uint sum;
   auto tmp = INPUT.dup;
   size_t i;
   while ( i < tmp.length ) {
      char c = tmp[ i ];
      if ( c == 'x' ) {
         tmp = remove( tmp, i );
         i -= 2;
      } else {
         sum += to!uint( [ c ] );
         ++i;
      }
   }
   writeln( sum );
}
---

Quite contrary to what you would expect, the call to »remove« fails to compile with the following error messages: »std/algorithm.d(4287): Error: front(src) is not an lvalue« and »std/algorithm.d(4287): Error: front(tgt) is not an lvalue«.

I am intentionally posting this to this NG and not to d.…D.learn, since this is a quite gross violation of the principle of least surprise in my eyes. If this isn't a bug, a better error message via a template constraint or a static assert would be something worth looking at in my opinion, since one would probably expect this to compile and not to fail within Phobos code.

David
Oct 16 2010
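Editorial aside: a minimal sketch of one way to get the byte-wise removal the post above is after, assuming a Phobos release that provides std.string.representation (which may postdate this thread) and assuming the data is pure ASCII. The helper name removeCodeUnit is hypothetical, not a Phobos function. The idea is to hand std.algorithm a ubyte[] view, which it treats as an ordinary random-access array.

```d
import std.algorithm : remove;
import std.string : representation;

/// Hypothetical helper: removes the code unit at `index`, treating the
/// buffer as plain bytes. Only sensible when the data is known to be ASCII,
/// since it can split a multi-byte UTF-8 sequence otherwise.
char[] removeCodeUnit(char[] buf, size_t index)
{
    ubyte[] bytes = buf.representation; // same memory, viewed as ubyte[]
    bytes = remove(bytes, index);       // ubyte[] is an ordinary array to std.algorithm
    return cast(char[]) bytes;          // reinterpret the shortened slice as char[] again
}

void main()
{
    char[] tmp = "93x75".dup;
    tmp = removeCodeUnit(tmp, 2);       // drop the 'x'
    assert(tmp == "9375");
}
```

The cast back to char[] is where the caller takes responsibility for the result still being valid UTF-8.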
On Sat, 16 Oct 2010 14:29:59 -0400, klickverbot <see klickverbot.at> wrote:
> Hello all,
>
> I decided to have a go at solving some easy programming puzzles with D2/Phobos to see how Phobos, especially ranges and std.algorithm, work out in simple real-world use cases (the puzzle in question is from hacker.org, by the way).
>
> The following code is a direct translation of a simple problem description to D (it is horrible from performance point of view, but that's certainly no issue here).
>
> ---
> import std.algorithm;
> import std.conv;
> import std.stdio;
>
> // The original input string is longer, but irrelevant to this post.
> enum INPUT = "93752xxx746x27x1754xx90x93xxxxx238x44x75xx087509";
>
> void main() {
>    uint sum;
>    auto tmp = INPUT.dup;
>    size_t i;
>    while ( i < tmp.length ) {
>       char c = tmp[ i ];
>       if ( c == 'x' ) {
>          tmp = remove( tmp, i );
>          i -= 2;
>       } else {
>          sum += to!uint( [ c ] );
>          ++i;
>       }
>    }
>    writeln( sum );
> }
> ---
>
> Quite contrary to what you would expect, the call to »remove« fails to compile with the following error messages: »std/algorithm.d(4287): Error: front(src) is not an lvalue« and »std/algorithm.d(4287): Error: front(tgt) is not an lvalue«.

My guess is that since INPUT is a string, phobos has unwisely decided to treat strings not as random access arrays of chars, but as a bidirectional range of dchar. This means that even though you can randomly access the characters (phobos can't take that away from you), it artificially imposes restrictions (such as making front an rvalue) where it wouldn't do the same to an int[] or ubyte[].

Andrei, I am increasingly seeing people struggling with the decision to make strings bidirectional ranges of dchar instead of what the compiler says they are. This needs a different solution. It's too confusing/difficult to deal with.

I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions. This means people will have to use these ranges when they want to treat them as bidir ranges of dchar, but the current situation is at least annoying, if not a complete turn-off to D.

And it vastly simplifies code that uses ranges, since they now don't have to contain special cases for char[] and wchar[].

-Steve
Oct 16 2010
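Editorial aside: the two views described in the post above can be observed directly. This is a small sketch, assuming a Phobos release in which narrow strings are still decoded by default when used as ranges. The helper names codeUnitCount and codePointCount are made up for illustration.

```d
import std.range.primitives : front, walkLength;

/// Number of UTF-8 code units: the array view the compiler provides.
size_t codeUnitCount(string s) { return s.length; }

/// Number of code points: the range view that std.algorithm works with.
size_t codePointCount(string s) { return s.walkLength; }

void main()
{
    string s = "\u00E4b"; // 'ä' followed by 'b'

    // Array view: indexing yields a single code unit.
    static assert(is(typeof(s[0]) == immutable(char)));
    assert(codeUnitCount(s) == 3);          // 'ä' occupies two code units

    // Range view: front decodes and returns a dchar by value (an rvalue).
    static assert(is(typeof(s.front) == dchar));
    assert(codePointCount(s) == 2);

    // An int[] has only one view; both counts agree.
    int[] a = [1, 2, 3];
    assert(a.length == a.walkLength);
}
```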
On 10/16/2010 01:39 PM, Steven Schveighoffer wrote:
> My guess is that since INPUT is a string, phobos has unwisely decided to treat strings not as random access arrays of chars, but as a bidirectional range of dchar.

s/un// :o)

Andrei
Oct 16 2010
On 10/16/2010 01:39 PM, Steven Schveighoffer wrote:
> Andrei, I am increasingly seeing people struggling with the decision to make strings bidirectional ranges of dchar instead of what the compiler says they are. This needs a different solution. It's too confusing/difficult to deal with.

I'm not seeing that. I'm seeing strings working automagically with most of std.algorithm without ever destroying a wide string.

Andrei
Oct 16 2010
On Sat, 16 Oct 2010 15:49:56 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 10/16/2010 01:39 PM, Steven Schveighoffer wrote:
>> Andrei, I am increasingly seeing people struggling with the decision to make strings bidirectional ranges of dchar instead of what the compiler says they are. This needs a different solution. It's too confusing/difficult to deal with.
>
> I'm not seeing that. I'm seeing strings working automagically with most of std.algorithm without ever destroying a wide string.

I've seen several posts regarding char[] being considered differently by the compiler and std.algorithm. The most prominent was the fact that:

foreach(x; str)

iterates over individual char's, not dchars.

While I agree that a bidirectional range is the only sane way to view utf-8 strings, a char[] is not necessarily a utf-8 string. It's an array of utf-8 code units. At least to the compiler. You can interpret it as a utf-8 string, or as an array. And the compiler allows both. std.algorithm doesn't. This half-ass attempt to make strings safe just fosters confusion.

My suggestion is to make a range that enforces the correct restrictions on strings. The compiler should treat string literals as a polysemous type that is by default this new type, or could optionally be an array of immutable characters.

So for example if you define:

struct string(T) if (is(T == char) || is(T == wchar))
{
   private immutable(T)[] data;
   // range functions to ensure data is only accessed via dchar
   ...
}

Which then is used by the compiler to represent string literals, then we have control over what a string literal allows without littering std.algorithm with special cases (and any external algorithms that might encounter strings).

So for example, I'd want something like this:

immutable(char)[] asciiarr = "abcdef";
auto str = "abcdef"; // typed as string
foreach(x; str) { assert(is(typeof(x) == dchar)); }
foreach(ref x; str) // fails
foreach(ref x; asciiarr) // ok, x is of type immutable(char)

The truth is, 100% of the time for me, I want to use string literals to represent ASCII strings, not utf-8 strings (I speak English, so I care almost nothing for unicode). And std.algorithm steadfastly refuses to treat them as such. I think it's just too limited. Yes, it would be nice if by default strings were bi-directional ranges of dchar, to be on the safe side, but I also want the ability to have an array of chars, which works as an array, even in std.algorithm, *and* is initializeable via string literals.

My requirements for the string struct would be:

1. only access via dchar
2. prevent slicing a code point
3. Indexing returns a dchar as well, which provides pseudo-random access (if you access an index that's in the middle of a code point, you get an exception).

-Steve
Oct 16 2010
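Editorial aside: the foreach behaviour mentioned in the post above can be checked with a short sketch. The template countAs is a hypothetical helper written for this illustration. It relies on the language rule that the element type declared in a foreach over a string selects whether code units are passed through or decoded.

```d
/// Hypothetical helper: counts loop iterations when the foreach variable
/// is declared as C (char, wchar or dchar).
size_t countAs(C)(string s)
{
    size_t n;
    foreach (C c; s) // the declared type selects the decoding
        ++n;
    return n;
}

void main()
{
    string s = "\u00E4b"; // 'ä' followed by 'b'

    // With no declared type, foreach infers the array element type,
    // so the loop variable is one byte wide.
    foreach (c; s)
        static assert(c.sizeof == 1);

    assert(countAs!char(s) == 3);  // one iteration per UTF-8 code unit
    assert(countAs!dchar(s) == 2); // one iteration per code point
}
```

So the built-in loop defaults to the array view, while the range primitives default to the decoded view, which is the mismatch the post complains about.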
On 10/16/2010 01:39 PM, Steven Schveighoffer wrote:
> I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions.

I did so. It was called byDchar and it would accept a string type. It sucked.

char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays.

Andrei
Oct 16 2010
Andrei Alexandrescu wrote:
> On 10/16/2010 01:39 PM, Steven Schveighoffer wrote:
>> I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions.
>
> I did so. It was called byDchar and it would accept a string type. It sucked.

Why did it suck?

--
Tomek
Oct 16 2010
On 10/16/2010 02:58 PM, Tomek Sowiński wrote:
> Andrei Alexandrescu wrote:
>> On 10/16/2010 01:39 PM, Steven Schveighoffer wrote:
>>> I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions.
>>
>> I did so. It was called byDchar and it would accept a string type. It sucked.
>
> Why did it suck?

Because 99% of the time you'd want to pass byDchar, but it was easy to forget. Then the algorithm would compile and run without byDchar, just with useless semantics.

Andrei
Oct 16 2010
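Editorial aside: later Phobos releases ship explicit wrappers in std.utf, byCodeUnit and byDchar. Whether that byDchar matches the earlier experiment mentioned in the post above is not something this thread establishes, so treat the following only as a sketch of what "forgetting the wrapper" changes, assuming a Phobos release that has both functions. The three helper names are hypothetical.

```d
import std.range.primitives : walkLength;
import std.utf : byCodeUnit, byDchar;

/// Elements seen when the caller explicitly asks for code units.
size_t unitsExplicit(string s) { return s.byCodeUnit.walkLength; }

/// Elements seen when the caller explicitly asks for code points.
size_t pointsExplicit(string s) { return s.byDchar.walkLength; }

/// Elements seen when the string is passed as-is (decoded by default).
size_t defaultView(string s) { return s.walkLength; }

void main()
{
    string s = "\u00E4b"; // 'ä' followed by 'b'
    assert(unitsExplicit(s) == 3);
    assert(pointsExplicit(s) == 2);
    assert(defaultView(s) == pointsExplicit(s)); // the default is the code-point view
}
```

With decoding as the default, omitting the wrapper gives the code-point view; under the opposite default, omitting it would silently give code units, which is the failure mode the post describes.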
On Sat, 16 Oct 2010 17:28:17 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 10/16/2010 02:58 PM, Tomek Sowiński wrote:
>> Andrei Alexandrescu wrote:
>>> On 10/16/2010 01:39 PM, Steven Schveighoffer wrote:
>>>> I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions.
>>>
>>> I did so. It was called byDchar and it would accept a string type. It sucked.
>>
>> Why did it suck?
>
> Because 99% of the time you'd want to pass byDchar, but it was easy to forget. Then the algorithm would compile and run without byDchar, just with useless semantics.

So call it string, and make the compiler use it as the default type for string literals.

-Steve
Oct 16 2010
On Sat, 16 Oct 2010 15:51:23 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 10/16/2010 01:39 PM, Steven Schveighoffer wrote:
>> I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions.
>
> I did so. It was called byDchar and it would accept a string type. It sucked.
>
> char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays.

The compiler thinks they are. And they look like arrays (T[] looks like an array to me no matter what T is). And I *want* an array of characters in most cases. If you want a special type for strings, make them a special type. D should not have this schizophrenic view of strings.

Plus it strikes me as extremely unclean and bloated for every algorithm that might have a range of char's passed into it to treat it specially (ignoring what the compiler says).

-Steve
Oct 16 2010
On 10/16/2010 03:14 PM, Steven Schveighoffer wrote:
> On Sat, 16 Oct 2010 15:51:23 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
>> On 10/16/2010 01:39 PM, Steven Schveighoffer wrote:
>>> I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions.
>>
>> I did so. It was called byDchar and it would accept a string type. It sucked.
>>
>> char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays.
>
> The compiler thinks they are. And they look like arrays (T[] looks like an array to me no matter what T is). And I *want* an array of characters in most cases. If you want a special type for strings, make them a special type. D should not have this schizophrenic view of strings.
>
> Plus it strikes me as extremely unclean and bloated for every algorithm that might have a range of char's passed into it to treat it specially (ignoring what the compiler says).

It would do wrong or useless things otherwise. I'd probably do some things differently if I started over, but given the circumstances I think std.algorithm does the best it could ever do with strings.

Andrei
Oct 16 2010
On 10/16/2010 13:51, Andrei Alexandrescu wrote:
> char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays.

Then rename them to something else. Problem solved.

--
Rainer Deyke - rainerd eldwood.com
Oct 16 2010
On 16/10/2010 20:51, Andrei Alexandrescu wrote:
> On 10/16/2010 01:39 PM, Steven Schveighoffer wrote:
>> I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions.
>
> I did so. It was called byDchar and it would accept a string type. It sucked.
>
> char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays.
>
> Andrei

"They are not arrays."? So why are they arrays then? :3

Sorry, what I mean is: so we agree that char[] and wchar[] are special. Unlike *all other arrays*, there are restrictions to what you can assign to each element of the array. So conceptually they are not arrays, but in the type system they are very much arrays. (or described alternatively: implemented with arrays).

Isn't this a clear sign that what currently is char[] and wchar[] (= UTF-8 and UTF-16 encoded strings) should not be arrays, but instead a struct which would correctly represent the semantics and contracts of char[] and wchar[]?

Let me clarify what I'm suggesting:
* char[] and wchar[] would be just arrays of char's and wchar's, completely orthogonal with other array types, no restrictions on assignment, no further contracts.
* UTF-8 and UTF-16 encoded strings would have their own struct-based type, let's call them string and wstring, which would likely use char[] and wchar[] as the contents (but these fields would be internal), and have whatever methods are appropriate, including opIndex.
* string literals would be of type string and wstring, not char[] and wchar[].
* for consistency, probably this would be true for UTF-32 as well: we would have a dstring, with dchar[] as the contents.

Problem solved. You're welcome. (as John Hodgeman would say)

No?

--
Bruno Medeiros - Software Engineer
Nov 19 2010
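Editorial aside: a minimal sketch of the kind of struct-based string type proposed in the post above. The name Utf8Text and its members are hypothetical and exist only for this illustration; nothing here is a Phobos or compiler API apart from std.utf.decode and std.utf.stride. opIndex and the bidirectional primitives are left out to keep the sketch short.

```d
import std.utf : decode, stride;

/// Hypothetical UTF-8 text type: the code-unit array is an internal detail,
/// and the range interface only ever hands out whole code points.
struct Utf8Text
{
    private immutable(char)[] data;

    this(string s) { data = s; }

    @property bool empty() { return data.length == 0; }

    /// Decodes the first code point; throws UTFException on malformed input.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    /// Skips exactly one encoded code point.
    void popFront()
    {
        data = data[stride(data, 0) .. $];
    }

    /// The raw storage, available only when asked for explicitly.
    @property immutable(char)[] codeUnits() { return data; }
}

/// Counts code points by iterating the hypothetical type as a range.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; Utf8Text(s))
        ++n;
    return n;
}

void main()
{
    assert(countCodePoints("\u00E4b") == 2);
    assert(Utf8Text("\u00E4b").codeUnits.length == 3);
}
```

Under such a design a plain char[] would stay an ordinary array, and generic code would see only one behaviour per type.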
On 11/19/10 12:59 PM, Bruno Medeiros wrote:
> On 16/10/2010 20:51, Andrei Alexandrescu wrote:
>> On 10/16/2010 01:39 PM, Steven Schveighoffer wrote:
>>> I suggest wrapping a char[] or wchar[] (of all constancies) with a special range that imposes the restrictions.
>>
>> I did so. It was called byDchar and it would accept a string type. It sucked.
>>
>> char[] and wchar[] are special. They embed their UTF affiliation in their type. I don't think we should make a wash of all that by handling them as arrays. They are not arrays.
>>
>> Andrei
>
> "They are not arrays."? So why are they arrays then? :3
>
> Sorry, what I mean is: so we agree that char[] and wchar[] are special. Unlike *all other arrays*, there are restrictions to what you can assign to each element of the array. So conceptually they are not arrays, but in the type system they are very much arrays. (or described alternatively: implemented with arrays).
>
> Isn't this a clear sign that what currently is char[] and wchar[] (= UTF-8 and UTF-16 encoded strings) should not be arrays, but instead a struct which would correctly represent the semantics and contracts of char[] and wchar[]?
>
> Let me clarify what I'm suggesting:
> * char[] and wchar[] would be just arrays of char's and wchar's, completely orthogonal with other array types, no restrictions on assignment, no further contracts.
> * UTF-8 and UTF-16 encoded strings would have their own struct-based type, let's call them string and wstring, which would likely use char[] and wchar[] as the contents (but these fields would be internal), and have whatever methods are appropriate, including opIndex.
> * string literals would be of type string and wstring, not char[] and wchar[].
> * for consistency, probably this would be true for UTF-32 as well: we would have a dstring, with dchar[] as the contents.
>
> Problem solved. You're welcome. (as John Hodgeman would say)
>
> No?

I don't think that would mark an improvement.

Andrei
Nov 19 2010
On 11/19/2010 16:40, Andrei Alexandrescu wrote:
> On 11/19/10 12:59 PM, Bruno Medeiros wrote:
>> Sorry, what I mean is: so we agree that char[] and wchar[] are special. Unlike *all other arrays*, there are restrictions to what you can assign to each element of the array. So conceptually they are not arrays, but in the type system they are very much arrays. (or described alternatively: implemented with arrays).
>>
>> Isn't this a clear sign that what currently is char[] and wchar[] (= UTF-8 and UTF-16 encoded strings) should not be arrays, but instead a struct which would correctly represent the semantics and contracts of char[] and wchar[]?
>>
>> Let me clarify what I'm suggesting:
>> * char[] and wchar[] would be just arrays of char's and wchar's, completely orthogonal with other array types, no restrictions on assignment, no further contracts.
>> * UTF-8 and UTF-16 encoded strings would have their own struct-based type, let's call them string and wstring, which would likely use char[] and wchar[] as the contents (but these fields would be internal), and have whatever methods are appropriate, including opIndex.
>> * string literals would be of type string and wstring, not char[] and wchar[].
>> * for consistency, probably this would be true for UTF-32 as well: we would have a dstring, with dchar[] as the contents.
>>
>> Problem solved. You're welcome. (as John Hodgeman would say)
>>
>> No?
>
> I don't think that would mark an improvement.

You don't see the advantage of generic types behaving in a generic manner? Do you know how much pain std::vector<bool> caused in C++?

I asked this before, but I received no answer. Let me ask it again. Imagine a container Vector!T that uses T[] internally. Then consider Vector!char. What would be its correct element type? What would be its correct behavior during iteration? What would be its correct response when asked to return its length? Assuming you come up with a coherent set of semantics for Vector!char, how would you implement it? Do you see how easy it would be to implement it incorrectly?

--
Rainer Deyke - rainerd eldwood.com
Nov 19 2010
On Fri, 19 Nov 2010 22:04:51 -0700 Rainer Deyke <rainerd eldwood.com> wrote:
> On 11/19/2010 16:40, Andrei Alexandrescu wrote:
>> On 11/19/10 12:59 PM, Bruno Medeiros wrote:
>>> Sorry, what I mean is: so we agree that char[] and wchar[] are special. Unlike *all other arrays*, there are restrictions to what you can assign to each element of the array. So conceptually they are not arrays, but in the type system they are very much arrays. (or described alternatively: implemented with arrays).
>>>
>>> Isn't this a clear sign that what currently is char[] and wchar[] (= UTF-8 and UTF-16 encoded strings) should not be arrays, but instead a struct which would correctly represent the semantics and contracts of char[] and wchar[]?
>>>
>>> Let me clarify what I'm suggesting:
>>> * char[] and wchar[] would be just arrays of char's and wchar's, completely orthogonal with other array types, no restrictions on assignment, no further contracts.
>>> * UTF-8 and UTF-16 encoded strings would have their own struct-based type, let's call them string and wstring, which would likely use char[] and wchar[] as the contents (but these fields would be internal), and have whatever methods are appropriate, including opIndex.
>>> * string literals would be of type string and wstring, not char[] and wchar[].
>>> * for consistency, probably this would be true for UTF-32 as well: we would have a dstring, with dchar[] as the contents.
>>>
>>> Problem solved. You're welcome. (as John Hodgeman would say)
>>>
>>> No?
>>
>> I don't think that would mark an improvement.
>
> You don't see the advantage of generic types behaving in a generic manner? Do you know how much pain std::vector<bool> caused in C++?
>
> I asked this before, but I received no answer. Let me ask it again. Imagine a container Vector!T that uses T[] internally. Then consider Vector!char. What would be its correct element type? What would be its correct behavior during iteration? What would be its correct response when asked to return its length? Assuming you come up with a coherent set of semantics for Vector!char, how would you implement it? Do you see how easy it would be to implement it incorrectly?

Hello Rainer,

The original proposal by Bruno would simplify some project I have in mind (namely, a higher-level universal text type already evoked). The issues you point to intuitively seem relevant to me, but I cannot really understand any. Would you be kind enough to expand a bit on each question? (Thinking of people who know nothing of C++ -- yes, they exist ;-)

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Nov 20 2010
On 11/20/2010 05:12, spir wrote:
> On Fri, 19 Nov 2010 22:04:51 -0700 Rainer Deyke <rainerd eldwood.com> wrote:
>> You don't see the advantage of generic types behaving in a generic manner? Do you know how much pain std::vector<bool> caused in C++?
>>
>> I asked this before, but I received no answer. Let me ask it again. Imagine a container Vector!T that uses T[] internally. Then consider Vector!char. What would be its correct element type? What would be its correct behavior during iteration? What would be its correct response when asked to return its length? Assuming you come up with a coherent set of semantics for Vector!char, how would you implement it? Do you see how easy it would be to implement it incorrectly?
>
> Hello Rainer,
>
> The original proposal by Bruno would simplify some project I have in mind (namely, a higher-level universal text type already evoked). The issues you point to intuitively seem relevant to me, but I cannot really understand any. Would you be kind enough to expand a bit on each question? (Thinking of people who know nothing of C++ -- yes, they exist ;-)

std::vector<bool> in C++ is a specialization of std::vector that packs eight booleans into a byte instead of storing each element separately. It doesn't behave exactly like other std::vectors and technically doesn't meet the C++ requirements of a container, although it tries to come as close as possible. This means that any code that uses std::vector<bool> needs to be extra careful to take those differences into account. This is especially an issue when dealing with generic code that uses std::vector<T>, where T may or may not be bool.

The issue with Vector!char is similar. Because char[] is not a true array, generic code that uses T[] can unexpectedly fail when T is char. Other containers of char behave like normal containers, iterating over individual chars. char[] iterates over dchars. Vector!char can, depending on its implementation, iterate over chars, iterate over dchars, or fail to compile at all when instantiated with T=char. It's not even clear which of these is the correct behavior.

Vector!char is just an example. Any generic code that uses T[] can unexpectedly fail to compile or behave incorrectly when used with T=char. If I were to use D2 in its present state, I would try to avoid both char/wchar and arrays as much as possible in order to avoid this trap. This would mean avoiding large parts of Phobos, and providing safe wrappers around the rest.

--
Rainer Deyke - rainerd eldwood.com
Nov 20 2010
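Editorial aside: the concern about generic code over T[] in the post above can be reproduced with a very small template. This is a sketch assuming a Phobos release that decodes narrow strings by default. The function name lengthMatchesIteration is hypothetical.

```d
import std.range.primitives : walkLength;

/// Hypothetical generic check: does iterating `arr` as a range visit
/// exactly `arr.length` elements? Generic code over T[] tends to assume so.
bool lengthMatchesIteration(T)(T[] arr)
{
    return arr.length == arr.walkLength;
}

void main()
{
    assert(lengthMatchesIteration([1, 2, 3]));        // T = int: holds
    assert(lengthMatchesIteration("abc".dup));        // T = char, ASCII only: happens to hold
    assert(!lengthMatchesIteration("\u00E4bc".dup));  // T = char, non-ASCII: 4 units, 3 code points
}
```

The same template text therefore has data-dependent behaviour for T = char that it does not have for any other element type shown here.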
On 11/20/10 12:32 PM, Rainer Deyke wrote:
> On 11/20/2010 05:12, spir wrote:
>> On Fri, 19 Nov 2010 22:04:51 -0700 Rainer Deyke <rainerd eldwood.com> wrote:
>>> You don't see the advantage of generic types behaving in a generic manner? Do you know how much pain std::vector<bool> caused in C++?
>>>
>>> I asked this before, but I received no answer. Let me ask it again. Imagine a container Vector!T that uses T[] internally. Then consider Vector!char. What would be its correct element type? What would be its correct behavior during iteration? What would be its correct response when asked to return its length? Assuming you come up with a coherent set of semantics for Vector!char, how would you implement it? Do you see how easy it would be to implement it incorrectly?
>>
>> Hello Rainer,
>>
>> The original proposal by Bruno would simplify some project I have in mind (namely, a higher-level universal text type already evoked). The issues you point to intuitively seem relevant to me, but I cannot really understand any. Would you be kind enough to expand a bit on each question? (Thinking of people who know nothing of C++ -- yes, they exist ;-)
>
> std::vector<bool> in C++ is a specialization of std::vector that packs eight booleans into a byte instead of storing each element separately. It doesn't behave exactly like other std::vectors and technically doesn't meet the C++ requirements of a container, although it tries to come as close as possible. This means that any code that uses std::vector<bool> needs to be extra careful to take those differences into account. This is especially an issue when dealing with generic code that uses std::vector<T>, where T may or may not be bool.
>
> The issue with Vector!char is similar. Because char[] is not a true array, generic code that uses T[] can unexpectedly fail when T is char. Other containers of char behave like normal containers, iterating over individual chars. char[] iterates over dchars. Vector!char can, depending on its implementation, iterate over chars, iterate over dchars, or fail to compile at all when instantiated with T=char. It's not even clear which of these is the correct behavior.

The parallel does not stand scrutiny. The problem with vector<bool> in C++ is that it implements no formal abstraction, although it is a specialization of one.

D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges.

> Vector!char is just an example. Any generic code that uses T[] can unexpectedly fail to compile or behave incorrectly when used with T=char. If I were to use D2 in its present state, I would try to avoid both char/wchar and arrays as much as possible in order to avoid this trap. This would mean avoiding large parts of Phobos, and providing safe wrappers around the rest.

It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs. The above is only fallacious presupposition.

Algorithms in Phobos are abstracted on the formal range interface, and as such you won't be exposed to risks when using them with strings.

Andrei
Nov 20 2010
On 11/20/2010 16:58, Andrei Alexandrescu wrote:
> On 11/20/10 12:32 PM, Rainer Deyke wrote:
>> std::vector<bool> in C++ is a specialization of std::vector that packs eight booleans into a byte instead of storing each element separately. It doesn't behave exactly like other std::vectors and technically doesn't meet the C++ requirements of a container, although it tries to come as close as possible. This means that any code that uses std::vector<bool> needs to be extra careful to take those differences into account. This is especially an issue when dealing with generic code that uses std::vector<T>, where T may or may not be bool.
>>
>> The issue with Vector!char is similar. Because char[] is not a true array, generic code that uses T[] can unexpectedly fail when T is char. Other containers of char behave like normal containers, iterating over individual chars. char[] iterates over dchars. Vector!char can, depending on its implementation, iterate over chars, iterate over dchars, or fail to compile at all when instantiated with T=char. It's not even clear which of these is the correct behavior.
>
> The parallel does not stand scrutiny. The problem with vector<bool> in C++ is that it implements no formal abstraction, although it is a specialization of one.

The problem with std::vector<bool> is that it pretends to be a std::vector, but isn't. If it was called dynamic_bitset instead, nobody would have complained. char[] has exactly the same problem.

>> Vector!char is just an example. Any generic code that uses T[] can unexpectedly fail to compile or behave incorrectly when used with T=char. If I were to use D2 in its present state, I would try to avoid both char/wchar and arrays as much as possible in order to avoid this trap. This would mean avoiding large parts of Phobos, and providing safe wrappers around the rest.
>
> It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs.

Sorry, but no. It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit.

> The above is only fallacious presupposition. Algorithms in Phobos are abstracted on the formal range interface, and as such you won't be exposed to risks when using them with strings.

I'm not concerned about algorithms, I'm concerned about code that uses arrays directly. Like my Vector!char example, which I see you still haven't addressed.

--
Rainer Deyke - rainerd eldwood.com
Nov 20 2010
On 11/20/10 9:42 PM, Rainer Deyke wrote:
> On 11/20/2010 16:58, Andrei Alexandrescu wrote:
>> On 11/20/10 12:32 PM, Rainer Deyke wrote:
>>> std::vector<bool> in C++ is a specialization of std::vector that packs eight booleans into a byte instead of storing each element separately. It doesn't behave exactly like other std::vectors and technically doesn't meet the C++ requirements of a container, although it tries to come as close as possible. This means that any code that uses std::vector<bool> needs to be extra careful to take those differences into account. This is especially an issue when dealing with generic code that uses std::vector<T>, where T may or may not be bool.
>>>
>>> The issue with Vector!char is similar. Because char[] is not a true array, generic code that uses T[] can unexpectedly fail when T is char. Other containers of char behave like normal containers, iterating over individual chars. char[] iterates over dchars. Vector!char can, depending on its implementation, iterate over chars, iterate over dchars, or fail to compile at all when instantiated with T=char. It's not even clear which of these is the correct behavior.
>>
>> The parallel does not stand scrutiny. The problem with vector<bool> in C++ is that it implements no formal abstraction, although it is a specialization of one.
>
> The problem with std::vector<bool> is that it pretends to be a std::vector, but isn't. If it was called dynamic_bitset instead, nobody would have complained. char[] has exactly the same problem.

char[] does not exhibit the same issues that vector<bool> has. The situation is very different, and again, trying to reduce one to another misses a lot of the picture.

vector<bool> hides representation and in doing so becomes non-compliant with vector<T> which does expose representation. Worse, vector<bool> is not compliant with any concept, express or implied, which makes vector<bool> virtually unusable with generic code.

In contrast, char[] exposes a meaningful representation (array of code units) that is often useful, and obeys a slightly weaker formal abstraction (bidirectional range) which is also useful. It's simply a very different setup from vector<bool>, and again attempting to use one in predicting the fate of the other is a poor approach.

>>> Vector!char is just an example. Any generic code that uses T[] can unexpectedly fail to compile or behave incorrectly when used with T=char. If I were to use D2 in its present state, I would try to avoid both char/wchar and arrays as much as possible in order to avoid this trap. This would mean avoiding large parts of Phobos, and providing safe wrappers around the rest.
>>
>> It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs.
>
> Sorry, but no. It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit.

You may rest assured that if anything, strings are not a problem. The way the abstractions are laid out makes D's strings the best approach to Unicode strings I know about.

>> The above is only fallacious presupposition. Algorithms in Phobos are abstracted on the formal range interface, and as such you won't be exposed to risks when using them with strings.
>
> I'm not concerned about algorithms, I'm concerned about code that uses arrays directly. Like my Vector!char example, which I see you still haven't addressed.

When you define your abstractions, you are free to decide how you want to go about them. The D programming language makes it unequivocally clear that char[] is an array of UTF-8 code units that offers a bidirectional range of code points. Same about wchar[] (replace UTF-8 with UTF-16). dchar[] is an array of UTF-32 code points which are equivalent to code units, and as such is a full random-access range.

If you define your own function that uses an array directly, such as sort(), then attempting to sort a char[] will get you exactly what you expect - you sort the code units in the array. The sort routine in the standard library is modeled to work with random access ranges, and will refuse to sort a char[].

I have often reflected whether I'd do things differently if I could go back in time and join Walter when he invented D's strings. I might have done one or two things differently, but the gain would be marginal at best. In fact, it's not impossible the balance of things could have been hurt. Between speed, simplicity, effectiveness, abstraction, access to representation, and economy of means, D's strings are the best compromise out there that I know of, bar none by a wide margin.

Andrei
Nov 21 2010
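A minimal sketch of the two views described in the post above, checked at compile time. It assumes a Phobos release in which narrow strings are decoded when used as ranges (the behavior under discussion in this thread); the `static assert` lines are illustrative and not taken from any post.

```d
import std.algorithm : sort;
import std.range : ElementType, isBidirectionalRange, isRandomAccessRange;

// Array view: indexing a char[] yields a single UTF-8 code unit.
static assert(is(typeof((char[]).init[0]) == char));

// Range view: the element type of char[] is a decoded code point,
// and the range is bidirectional but not random access.
static assert(is(ElementType!(char[]) == dchar));
static assert(isBidirectionalRange!(char[]));
static assert(!isRandomAccessRange!(char[]));

// dchar[]: code units and code points coincide, so it is a
// full random-access range.
static assert(is(ElementType!(dchar[]) == dchar));
static assert(isRandomAccessRange!(dchar[]));

// The library sort requires random access, so it rejects char[]
// at compile time while accepting dchar[].
static assert(!__traits(compiles, sort((char[]).init)));
static assert( __traits(compiles, sort((dchar[]).init)));

void main() {}
```

If the file compiles, every claim in the comments holds for the installed Phobos; a failing `static assert` would indicate that the library's string handling differs from what the thread describes.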
On 11/21/2010 11:23, Andrei Alexandrescu wrote:
> On 11/20/10 9:42 PM, Rainer Deyke wrote:
>> On 11/20/2010 16:58, Andrei Alexandrescu wrote:
>>> The parallel does not stand scrutiny. The problem with vector<bool> in C++ is that it implements no formal abstraction, although it is a specialization of one.
>>
>> The problem with std::vector<bool> is that it pretends to be a std::vector, but isn't. If it was called dynamic_bitset instead, nobody would have complained. char[] has exactly the same problem.
>
> char[] does not exhibit the same issues that vector<bool> has. The situation is very different, and again, trying to reduce one to another misses a lot of the picture.

I agree that there are differences. For one thing, if you iterate over a std::vector<bool> you get actual booleans, albeit through an extra layer of indirection. If you iterate over char[] you might get chars or you might get dchars depending on the method you use for iterating.

char[] isn't the equivalent of std::vector<bool>. It's worse. char[] is the equivalent of a vector<bool> that keeps the current behavior of std::vector<bool> when iterating through iterators, but gives access to bytes of packed booleans when using operator[].

> vector<bool> hides representation and in doing so becomes non-compliant with vector<T> which does expose representation. Worse, vector<bool> is not compliant with any concept, express or implied, which makes vector<bool> virtually unusable with generic code.

The ways in which std::vector<bool> differs from any other vector are well understood. It uses proxies instead of true references. Its iterators meet the requirements of input/output iterators (or in boost terms, readable, writable iterators with random access traversal). Any generic code written with these limitations in mind can use std::vector<T> freely. (The C++ standard library doesn't play nicely with std::vector<bool>, but that's another issue entirely.)

std::vector<bool> is a useful type, it just isn't a std::vector. In that respect, its situation is analogous to that of char[].

>>> It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs.
>>
>> Sorry, but no. It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit.
>
> You may rest assured that if anything, strings are not a problem.

I'm not concerned about strings, I'm concerned about *arrays*. Arrays of T, where T may or may not be a character type. I see that you ignored my Vector!char example yet again.

Your assurances aren't increasing my confidence in D, they're decreasing my confidence in your judgment (and by extension my confidence in D).

--
Rainer Deyke - rainerd eldwood.com
Nov 21 2010
On 11/21/10 6:12 PM, Rainer Deyke wrote:
> On 11/21/2010 11:23, Andrei Alexandrescu wrote:
>> On 11/20/10 9:42 PM, Rainer Deyke wrote:
>>> On 11/20/2010 16:58, Andrei Alexandrescu wrote:
>>>> The parallel does not stand scrutiny. The problem with vector<bool> in C++ is that it implements no formal abstraction, although it is a specialization of one.
>>>
>>> The problem with std::vector<bool> is that it pretends to be a std::vector, but isn't. If it was called dynamic_bitset instead, nobody would have complained. char[] has exactly the same problem.
>>
>> char[] does not exhibit the same issues that vector<bool> has. The situation is very different, and again, trying to reduce one to another misses a lot of the picture.
>
> I agree that there are differences. For one thing, if you iterate over a std::vector<bool> you get actual booleans, albeit through an extra layer of indirection. If you iterate over char[] you might get chars or you might get dchars depending on the method you use for iterating.

This is sensible because a string may be seen as a sequence of code points or a sequence of code units. Either view is useful.

> char[] isn't the equivalent of std::vector<bool>. It's worse. char[] is the equivalent of a vector<bool> that keeps the current behavior of std::vector<bool> when iterating through iterators, but gives access to bytes of packed booleans when using operator[].

I explained why char[] is better than vector<bool>. Ignoring the explanation and restating a fallacious conclusion based on an overstretched parallel does little to push the discussion forward.

Again: code units _are_ well-defined, useful to have access to, and good for a variety of uses. Please understand this.

>> vector<bool> hides representation and in doing so becomes non-compliant with vector<T> which does expose representation. Worse, vector<bool> is not compliant with any concept, express or implied, which makes vector<bool> virtually unusable with generic code.
>
> The ways in which std::vector<bool> differs from any other vector are well understood. It uses proxies instead of true references. Its iterators meet the requirements of input/output iterators (or in boost terms, readable, writable iterators with random access traversal). Any generic code written with these limitations in mind can use std::vector<T> freely. (The C++ standard library doesn't play nicely with std::vector<bool>, but that's another issue entirely.)
>
> std::vector<bool> is a useful type, it just isn't a std::vector. In that respect, its situation is analogous to that of char[].

>>>> It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs.
>>>
>>> Sorry, but no. It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit.
>>
>> You may rest assured that if anything, strings are not a problem.
>
> I'm not concerned about strings, I'm concerned about *arrays*. Arrays of T, where T may or may not be a character type. I see that you ignored my Vector!char example yet again.

I sure have replied to it, but probably my reply hasn't been read. Please allow me to paste it again:

> When you define your abstractions, you are free to decide how you want to go about them. The D programming language makes it unequivocally clear that char[] is an array of UTF-8 code units that offers a bidirectional range of code points. Same about wchar[] (replace UTF-8 with UTF-16). dchar[] is an array of UTF-32 code points which are equivalent to code units, and as such is a full random-access range.

So it's up to you what Vector!char does. In D char[] is an array of code units that can be iterated as a bidirectional range of code points. I don't see anything cagey about that.

> Your assurances aren't increasing my confidence in D, they're decreasing my confidence in your judgment (and by extension my confidence in D).

I prefaced my assurances with logical arguments that I can only assume went unread. You are of course free to your opinion (though it would be great if it were more grounded in real reasons); the rest of us will continue enjoying D2 strings.

Andrei
Nov 21 2010
On 11/21/2010 17:31, Andrei Alexandrescu wrote:
> On 11/21/10 6:12 PM, Rainer Deyke wrote:
>> I agree that there are differences. For one thing, if you iterate over a std::vector<bool> you get actual booleans, albeit through an extra layer of indirection. If you iterate over char[] you might get chars or you might get dchars depending on the method you use for iterating.
>
> This is sensible because a string may be seen as a sequence of code points or a sequence of code units. Either view is useful.

I don't dispute that either view is useful.

>> char[] isn't the equivalent of std::vector<bool>. It's worse. char[] is the equivalent of a vector<bool> that keeps the current behavior of std::vector<bool> when iterating through iterators, but gives access to bytes of packed booleans when using operator[].
>
> I explained why char[] is better than vector<bool>. Ignoring the explanation and restating a fallacious conclusion based on an overstretched parallel does little to push the discussion forward.

I'm not interested in discussing if char[] is overall a better data structure than std::vector<bool>. I'm focusing on one particular property of both.

std::vector<bool> fails to provide some of the guarantees of all other instances of std::vector<T>. This means that generic code that uses std::vector<T> needs to take special consideration of std::vector<bool> if it wants to work correctly when T = bool. This is an indisputable fact.

char[] and wchar[] fail to provide some of the guarantees of all other instances of T[]. This means that generic code that uses T[] needs to take special consideration of char[] if it wants to work correctly when T = char. This is also an indisputable fact.

I don't think it's much of a stretch to draw an analogy from std::vector<bool> to char[] based on this. However, even if std::vector<bool> did not exist, I would still consider this a design flaw of char[].

> Again: code units _are_ well-defined, useful to have access to, and good for a variety of uses. Please understand this.

Again, I understand this and don't dispute it. It's a complete non-sequitur to this discussion. I'm not arguing against the string type providing access to both code points and code units. I'm arguing against the string type having the name of the array when it doesn't share the behavior of an array.

>> I'm not concerned about strings, I'm concerned about *arrays*. Arrays of T, where T may or may not be a character type. I see that you ignored my Vector!char example yet again.
>
> I sure have replied to it, but probably my reply hasn't been read. Please allow me to paste it again:
>
>> When you define your abstractions, you are free to decide how you want to go about them. The D programming language makes it unequivocally clear that char[] is an array of UTF-8 code units that offers a bidirectional range of code points. Same about wchar[] (replace UTF-8 with UTF-16). dchar[] is an array of UTF-32 code points which are equivalent to code units, and as such is a full random-access range.
>
> So it's up to you what Vector!char does. In D char[] is an array of code units that can be iterated as a bidirectional range of code points. I don't see anything cagey about that.

Ah, I did read that, but it doesn't address my concerns about Vector!char at all. I'm aware that I can write Vector!char to act like a container of code units. I'm also aware that I can write Vector!char to automatically translate to code points.

My concerns are these:

- When writing code that uses T[], it is often natural to mix range-based access and index-based access, with the assumption that both provide direct access to the same underlying data. However, with char[] this assumption is incorrect, as the underlying data is transformed when viewing the array as a range. This means that generic code that uses T[] must take special consideration of char[] or it may unexpectedly produce incorrect results when T = char.

- char[] sets a precedent of Container!char providing a dchar range interface. Other containers must choose to either follow this precedent or to avoid it. Either choice may require extra work when implementing the container. Either choice can lead to surprising behavior for the user of the container.

--
Rainer Deyke - rainerd eldwood.com
Nov 21 2010
On 11/21/10 22:09 CST, Rainer Deyke wrote:
> On 11/21/2010 17:31, Andrei Alexandrescu wrote:
> char[] and wchar[] fail to provide some of the guarantees of all other instances of T[].

What exactly are those guarantees?

> - When writing code that uses T[], it is often natural to mix range-based access and index-based access, with the assumption that both provide direct access to the same underlying data. However, with char[] this assumption is incorrect, as the underlying data is transformed when viewing the array as a range. This means that generic code that uses T[] must take special consideration of char[] or it may unexpectedly produce incorrect results when T = char.

This is exactly where your point falls apart. I'm actually glad you wrote it down explicitly because this makes it simple to achieve the goal of putting you in the position to both understand where your point is wrong, and also the goal of putting you in the position for an "aha" moment or at least an "all right, grumble grumble" moment.

What you're saying is that you write generic code that requires T[], and then the code itself uses front, popFront, and other range-specific functions in conjunction with it.

But this is exactly the problem. If you want to use range primitives, you submit to the requirement of ranges. So you write the generic function to ask for ranges (with e.g. isForwardRange etc). Otherwise your code is incorrect.

If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not. Why? Because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined.

So: if you want to use char[] as an array with the built-in array interface, no problem. If you want to use char[] as a range with the range interface as defined by std.range, again no problem. But asking for one and then surreptitiously using the other is simply incorrect code. You can't use std.range while at the same time complaining you can't be bothered to read its docs.

> - char[] sets a precedent of Container!char providing a dchar range interface. Other containers must choose to either follow this precedent or to avoid it. Either choice may require extra work when implementing the container. Either choice can lead to surprising behavior for the user of the container.

Encoded strings bring with them the necessity of encoding and decoding. That is an expected feature. It is up to your container whether it wants to do so or it needs to pass it to the client.

I challenge you to define an alternative built-in string that fares better than string & Comp. Before long you'll be overwhelmed by the various necessities imposed by your constraints.

Andrei
Nov 21 2010
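The distinction drawn in the post above, between asking for an array and asking for a range, can be sketched as follows. `countSlots` and `countItems` are hypothetical helpers invented for this illustration, and the sketch assumes a Phobos in which `popFront` on a narrow string advances by one decoded code point.

```d
import std.range : empty, popFront, isForwardRange;

// Hypothetical helper: asks for an array and uses only the
// built-in array primitives (length and slicing).
size_t countSlots(T)(T[] a)
{
    size_t n = 0;
    while (a.length != 0)
    {
        a = a[1 .. $]; // chop off the first array element
        ++n;
    }
    return n;
}

// Hypothetical helper: asks for a range and uses only the
// range primitives defined by std.range.
size_t countItems(R)(R r) if (isForwardRange!R)
{
    size_t n = 0;
    for (; !r.empty; r.popFront())
        ++n;
    return n;
}

void main()
{
    int[] nums = [1, 2, 3];
    assert(countSlots(nums) == 3);
    assert(countItems(nums) == 3);

    string s = "日本語"; // 3 code points, 9 UTF-8 code units
    assert(countSlots(s) == 9); // array view: code units
    assert(countItems(s) == 3); // range view: code points
}
```

For `int[]` the two interfaces agree; for `string` each one is self-consistent, but they count different things, which is the crux of the disagreement in this thread.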
On 11/21/2010 21:56, Andrei Alexandrescu wrote:
> On 11/21/10 22:09 CST, Rainer Deyke wrote:
>> On 11/21/2010 17:31, Andrei Alexandrescu wrote:
>> char[] and wchar[] fail to provide some of the guarantees of all other instances of T[].
>
> What exactly are those guarantees?

That the range view and the array view provide direct access to the same data.

One of the useful features of most arrays is that an array of T can be treated as a range of T. However, this feature is missing for arrays of char and wchar.

>> - When writing code that uses T[], it is often natural to mix range-based access and index-based access, with the assumption that both provide direct access to the same underlying data. However, with char[] this assumption is incorrect, as the underlying data is transformed when viewing the array as a range. This means that generic code that uses T[] must take special consideration of char[] or it may unexpectedly produce incorrect results when T = char.
>
> What you're saying is that you write generic code that requires T[], and then the code itself uses front, popFront, and other range-specific functions in conjunction with it.

No, I'm saying that I write generic code that declares T[] and then passes it off to a function that operates on ranges, or to a foreach loop.

> But this is exactly the problem. If you want to use range primitives, you submit to the requirement of ranges. So you write the generic function to ask for ranges (with e.g. isForwardRange etc). Otherwise your code is incorrect.

Again, my generic function declares the array as a local variable or a member variable. It cannot declare a generic range.

> If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined.

It absolutely is natural to mix these in code that is written without consideration for strings, especially when you consider that foreach also uses the range interface.

Let's say I have an array and I want to iterate over the first ten items. My first instinct would be to write something like this:

    foreach (item; array[0 .. 10]) {
      doSomethingWith(item);
    }

Simple, natural, readable code. Broken for arrays of char or wchar, but in a way that is difficult to detect.

> So: if you want to use char[] as an array with the built-in array interface, no problem. If you want to use char[] as a range with the range interface as defined by std.range, again no problem. But asking for one and then surreptitiously using the other is simply incorrect code. You can't use std.range while at the same time complaining you can't be bothered to read its docs.

This would sound reasonable if I were using char[] directly. I'm not. I'm using T[] in a generic context. I may not have considered the case of T = char when I wrote the code. The code may even have originally used Widget[] before I decided to make it generic.

> I challenge you to define an alternative built-in string that fares better than string & Comp. Before long you'll be overwhelmed by the various necessities imposed by your constraints.

Easy:

- string_t becomes a keyword.
- Syntactically speaking, string_t!T is the name of a type when T is a type.
- For every built-in character type T (including const and immutable versions), the type currently called T[] is now called string_t!T, but otherwise maintains all of its current behavior.
- For every other type T, string_t!T is an error.
- char[] and wchar[] (including const and immutable versions) are plain arrays of code units, even when viewed as a range.

It's not my preferred solution, but it's easy to explain, it fixes the main problem with the current system, and it only costs one keyword. (I'd rather treat string_t as a library template with compiler support like and rename it to String, but then it wouldn't be a built-in string.)

--
Rainer Deyke - rainerd eldwood.com
Nov 21 2010
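A small sketch of what a generic "first n items" loop like the one in the post above does in practice. `firstItems` is a hypothetical helper written for this illustration; the sketch relies on the language rule that a `foreach` with an inferred element type over a `T[]` yields elements of type `T`, and that slicing is always by array index.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical generic helper modeled on the quoted loop:
// collect the first n items of an array via foreach over a slice.
T[] firstItems(T)(T[] array, size_t n)
{
    T[] result;
    foreach (item; array[0 .. n]) // item has type T
        result ~= item;
    return result;
}

void main()
{
    // Ordinary element type: the first two items, as expected.
    assert(firstItems([10, 20, 30, 40], 2) == [10, 20]);

    // T = char: the "items" are UTF-8 code units, so the first
    // four of them cover one whole character plus one stray byte.
    char[] text = "日本語".dup;
    char[] head = firstItems(text, 4);
    assert(head.length == 4);
    assert(head == text[0 .. 4]);

    // The result is no longer well-formed UTF-8.
    assertThrown!UTFException(validate(head));
}
```

The loop itself behaves like it does for any other array; the surprise only appears once the result is treated as text, for example when it is decoded or printed.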
On 11/21/10 11:59 PM, Rainer Deyke wrote:
> On 11/21/2010 21:56, Andrei Alexandrescu wrote:
>> On 11/21/10 22:09 CST, Rainer Deyke wrote:
>>> On 11/21/2010 17:31, Andrei Alexandrescu wrote:
>>> char[] and wchar[] fail to provide some of the guarantees of all other instances of T[].
>>
>> What exactly are those guarantees?
>
> That the range view and the array view provide direct access to the same data.

Where do ranges state that assumption?

> One of the useful features of most arrays is that an array of T can be treated as a range of T. However, this feature is missing for arrays of char and wchar.

This is not a guarantee by ranges, it's just a mistaken assumption.

>>> - When writing code that uses T[], it is often natural to mix range-based access and index-based access, with the assumption that both provide direct access to the same underlying data. However, with char[] this assumption is incorrect, as the underlying data is transformed when viewing the array as a range. This means that generic code that uses T[] must take special consideration of char[] or it may unexpectedly produce incorrect results when T = char.
>>
>> What you're saying is that you write generic code that requires T[], and then the code itself uses front, popFront, and other range-specific functions in conjunction with it.
>
> No, I'm saying that I write generic code that declares T[] and then passes it off to a function that operates on ranges, or to a foreach loop.

A function that operates on ranges would have an appropriate constraint so it would work properly or not at all. foreach works fine with all arrays.

>> But this is exactly the problem. If you want to use range primitives, you submit to the requirement of ranges. So you write the generic function to ask for ranges (with e.g. isForwardRange etc). Otherwise your code is incorrect.
>
> Again, my generic function declares the array as a local variable or a member variable. It cannot declare a generic range.

>> If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined.
>
> It absolutely is natural to mix these in code that is written without consideration for strings, especially when you consider that foreach also uses the range interface.
>
> Let's say I have an array and I want to iterate over the first ten items. My first instinct would be to write something like this:
>
>     foreach (item; array[0 .. 10]) {
>       doSomethingWith(item);
>     }
>
> Simple, natural, readable code. Broken for arrays of char or wchar, but in a way that is difficult to detect.

Why is it broken? Please try it to convince yourself of the contrary.

>> So: if you want to use char[] as an array with the built-in array interface, no problem. If you want to use char[] as a range with the range interface as defined by std.range, again no problem. But asking for one and then surreptitiously using the other is simply incorrect code. You can't use std.range while at the same time complaining you can't be bothered to read its docs.
>
> This would sound reasonable if I were using char[] directly. I'm not. I'm using T[] in a generic context. I may not have considered the case of T = char when I wrote the code. The code may even have originally used Widget[] before I decided to make it generic.

Fine. Use T[] generically in conjunction with the array primitives. If you plan to use them with the range primitives, you do as ranges do.

>> I challenge you to define an alternative built-in string that fares better than string & Comp. Before long you'll be overwhelmed by the various necessities imposed by your constraints.
>
> Easy:
>
> - string_t becomes a keyword.
> - Syntactically speaking, string_t!T is the name of a type when T is a type.
> - For every built-in character type T (including const and immutable versions), the type currently called T[] is now called string_t!T, but otherwise maintains all of its current behavior.
> - For every other type T, string_t!T is an error.
> - char[] and wchar[] (including const and immutable versions) are plain arrays of code units, even when viewed as a range.
>
> It's not my preferred solution, but it's easy to explain, it fixes the main problem with the current system, and it only costs one keyword. (I'd rather treat string_t as a library template with compiler support like and rename it to String, but then it wouldn't be a built-in string.)

I very much prefer the current state of affairs.

Andrei
Nov 21 2010
On Mon, Nov 22, 2010 at 1:08 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 11/21/10 11:59 PM, Rainer Deyke wrote:
>> On 11/21/2010 21:56, Andrei Alexandrescu wrote:
>>> On 11/21/10 22:09 CST, Rainer Deyke wrote:
>>>> - When writing code that uses T[], it is often natural to mix range-based access and index-based access, with the assumption that both provide direct access to the same underlying data. However, with char[] this assumption is incorrect, as the underlying data is transformed when viewing the array as a range. This means that generic code that uses T[] must take special consideration of char[] or it may unexpectedly produce incorrect results when T = char.
>>>
>>> What you're saying is that you write generic code that requires T[], and then the code itself uses front, popFront, and other range-specific functions in conjunction with it.
>>
>> No, I'm saying that I write generic code that declares T[] and then passes it off to a function that operates on ranges, or to a foreach loop.
>
> A function that operates on ranges would have an appropriate constraint so it would work properly or not at all. foreach works fine with all arrays.

One gotcha that seems to occur here is this code:

    foreach(index, character; someString)
        assert(someString[index] == character);

I don't really have much that's meaningful to add to this discussion except to say that it shouldn't be easy to write code like the above. I spent a few hours today figuring out why that wouldn't work.
Nov 21 2010
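One way the assertion in the post above can fail is when the loop variable is declared as dchar; the post does not say which form was used, so this is only a guess at the cause. The sketch below uses a hypothetical helper, indexMatchesDecoded, to show that foreach with a dchar element decodes code points while the index still counts UTF-8 code units.

```d
import std.stdio : writefln;

/// Hypothetical helper: true if every decoded code point equals the
/// code unit stored at the index that foreach reports for it.
bool indexMatchesDecoded(string s)
{
    foreach (i, dchar c; s)
    {
        if (s[i] != c)
            return false;
    }
    return true;
}

void main()
{
    string ascii = "abc";
    string mixed = "aé"; // 'é' (U+00E9) takes two UTF-8 code units: 0xC3 0xA9

    // With the element type left to inference, foreach yields code units,
    // so the assertion holds for any string.
    foreach (i, c; mixed)
        assert(mixed[i] == c);

    // Asking for dchar makes foreach decode; the index is where each
    // code point starts, and s[i] is only its first code unit.
    assert(indexMatchesDecoded(ascii));
    assert(!indexMatchesDecoded(mixed));

    foreach (i, dchar c; mixed)
        writefln("index %s: code point U+%04X, first code unit 0x%02X",
                 i, cast(uint) c, cast(uint) mixed[i]);
}
```

For pure ASCII text the two views coincide, which is why code like this can pass tests for a long time before breaking.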
On 11/22/2010 00:08, Andrei Alexandrescu wrote: On 11/21/10 11:59 PM, Rainer Deyke wrote: Are you saying that arrays of T do not function as ranges of T when T is not a character type? That the range view and the array view provide direct access to the same data. Where do ranges state that assumption? I'm not saying that this feature is guaranteed for all arrays, because it clearly isn't. I'm saying that this feature is present for T[] where T is not a character type, and missing for T[] where T is a character type. When writing code that is not intended to operate on character data, it is natural to use this feature. The code then breaks when the code is used with character data. One of the useful features of most arrays is that an array of T can be treated as a range of T. However, this feature is missing for arrays of char and wchar. This is not a guarantee by ranges, it's just a mistaken assumption. It "works", but produces different results when iterating over a character array than when iterating over a non-character array. Code can compile, have well-defined behavior, run, produce correct results in most cases, but still be wrong. No, I'm saying that I write generic code that declares T[] and then passes it off to a function that operates on ranges, or to a foreach loop. A function that operates on ranges would have an appropriate constraint so it would work properly or not at all. foreach works fine with all arrays. I see, foreach still iterates over code units by default. Of course, this means that foreach over ranges doesn't work with strings, which in turn means that algorithms that use foreach over ranges are broken. Observe: import std.stdio; import std.algorithm; void main() { writeln(count!("true")("日本語")); // Three characters. } Output (compiled with Digital Mars D Compiler v2.050): 9 Let's say I have an array and I want to iterate over the first ten items. My first instinct would be to write something like this: foreach (item; array[0 .. 10]) { doSomethingWith(item); } Simple, natural, readable code. Broken for arrays of char or wchar, but in a way that is difficult to detect. Why is it broken? Please try it to convince yourself of the contrary. Fine. Use T[] generically in conjunction with the array primitives. If you plan to use them with the range primitives, you do as ranges do. If arrays can't operate as ranges, what's the point of giving them a range interface? Care to support that with some arguments, or is it just a purely subjective preference? -- Rainer Deyke - rainerd eldwood.com Easy: - string_t becomes a keyword. - Syntactically speaking, string_t!T is the name of a type when T is a type. - For every built-in character type T (including const and immutable versions), the type currently called T[] is now called string_t!T, but otherwise maintains all of its current behavior. - For every other type T, string_t!T is an error. - char[] and wchar[] (including const and immutable versions) are plain arrays of code units, even when viewed as a range. It's not my preferred solution, but it's easy to explain, it fixes the main problem with the current system, and it only costs one keyword. (I'd rather treat string_t as a library template with compiler support like and rename it to String, but then it wouldn't be a built-in string.) I very much prefer the current state of affairs.
Nov 22 2010
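The class of bug described in the message above can be sketched with two generic counters over T[]. The function names countWithForeach and countWithRangePrimitives are made up for illustration and are not taken from std.algorithm; the sketch assumes a Phobos in which range primitives decode strings into code points.

```d
import std.range : empty, popFront;
import std.stdio : writeln;

/// Counts elements using the language's built-in array iteration.
size_t countWithForeach(T)(T[] items)
{
    size_t n;
    foreach (item; items)
        ++n;
    return n;
}

/// Counts elements using the range primitives from the library.
size_t countWithRangePrimitives(T)(T[] items)
{
    size_t n;
    for (; !items.empty; items.popFront())
        ++n;
    return n;
}

void main()
{
    // For ordinary arrays the two views agree.
    assert(countWithForeach([1, 2, 3]) == countWithRangePrimitives([1, 2, 3]));

    // For strings, foreach walks UTF-8 code units while popFront
    // skips a whole code point at a time.
    writeln(countWithForeach("日本語"));         // code units
    writeln(countWithRangePrimitives("日本語")); // code points
}
```

Each of the three characters occupies three UTF-8 code units, so the first counter reports 9 and the second reports 3, matching the discrepancy reported in the thread.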
On Monday 22 November 2010 02:01:38 Rainer Deyke wrote:On 11/22/2010 00:08, Andrei Alexandrescu wrote:I believe that he means that you either use them as ranges or you use them as arrays. Mixing the two sets of operations is asking for trouble. - Jonathan M DavisOn 11/21/10 11:59 PM, Rainer Deyke wrote:Are you saying that arrays of T do not function as ranges of T when T is not a character type?That the range view and the array view provide direct access to the same data.Where do ranges state that assumption?
Nov 22 2010
On 11/22/2010 03:57, Jonathan M Davis wrote:On Monday 22 November 2010 02:01:38 Rainer Deyke wrote:It is impossible to have a non-empty array without at some point using an array operation. If you can't mix array operations with range operations, then you can't use arrays as ranges. -- Rainer Deyke - rainerd eldwood.comAre you saying that arrays of T do not function as ranges of T when T is not a character type?I believe that he means that you either use them as ranges or you use them as arrays. Mixing the two sets of operations is asking for trouble.
Nov 22 2010
On 11/22/10 4:01 AM, Rainer Deyke wrote:
> I see, foreach still iterates over code units by default. Of course, this means that foreach over ranges doesn't work with strings, which in turn means that algorithms that use foreach over ranges are broken. Observe:
>
>     import std.stdio;
>     import std.algorithm;
>
>     void main() {
>         writeln(count!("true")("日本語")); // Three characters.
>     }
>
> Output (compiled with Digital Mars D Compiler v2.050): 9

Thanks. http://d.puremagic.com/issues/show_bug.cgi?id=5257

Andrei
Nov 22 2010
On 11/22/2010 11:55, Andrei Alexandrescu wrote:
> On 11/22/10 4:01 AM, Rainer Deyke wrote:
>> I see, foreach still iterates over code units by default. Of course, this means that foreach over ranges doesn't work with strings, which in turn means that algorithms that use foreach over ranges are broken. Observe:
>>
>>     import std.stdio;
>>     import std.algorithm;
>>
>>     void main() {
>>         writeln(count!("true")("日本語")); // Three characters.
>>     }
>>
>> Output (compiled with Digital Mars D Compiler v2.050): 9
> Thanks. http://d.puremagic.com/issues/show_bug.cgi?id=5257

I think this bug is a symptom of a larger issue. The range abstraction is too fragile. If even you can't use the range abstraction correctly (in the library that defines this abstraction no less), how can you expect anyone else to do so?

At the very least, this is a sign that std.algorithm needs more thorough testing, and/or a thorough code review. This is far from the only use of foreach on a range in std.algorithm. It just happens to be the first example I found to illustrate my point.

-- Rainer Deyke - rainerd eldwood.com
Nov 22 2010
Rainer Deyke Wrote:
> On 11/22/2010 11:55, Andrei Alexandrescu wrote:
>> http://d.puremagic.com/issues/show_bug.cgi?id=5257
> I think this bug is a symptom of a larger issue. The range abstraction is too fragile. If even you can't use the range abstraction correctly (in the library that defines this abstraction no less), how can you expect anyone else to do so?

Note that this issue with foreach has been discussed before. The suggested solution was to have foreach infer dchar instead of char (shot down since iterating char is useful and it is simple to add the type dchar). Maybe a range interface (as found in std.string) should take precedence over arrays in foreach? Or maybe foreach should only work with ranges and opApply (that would mean std.array would need to be imported to use foreach with arrays)?

That wouldn't address your exact issue. I tend to agree with Andrei, as you should be coding to the Range interface, which will prevent any misuse of char/wchar. On the other hand, why can't I have a range of char (I mean get one from an array, not that I would ever want to)?

Anyway, I agree char[] is a special case, but I also agree it isn't an issue.
Nov 22 2010
On Tue, 23 Nov 2010 00:10:40 -0500 Jesse Phillips <jessekphillips+D gmail.com> wrote:
> Rainer Deyke Wrote:
>> On 11/22/2010 11:55, Andrei Alexandrescu wrote:
>>> http://d.puremagic.com/issues/show_bug.cgi?id=5257
>> I think this bug is a symptom of a larger issue. The range abstraction is too fragile. If even you can't use the range abstraction correctly (in the library that defines this abstraction no less), how can you expect anyone else to do so?
> Note that this issue with foreach has been discussed before. The suggested solution was to have foreach infer dchar instead of char (shot down since iterating char is useful and it is simple to add the type dchar). Maybe a range interface (as found in std.string) should take precedence over arrays in foreach? Or maybe foreach should only work with ranges and opApply (that would mean std.array would need to be imported to use foreach with arrays)?
>
> That wouldn't address your exact issue. I tend to agree with Andrei, as you should be coding to the Range interface, which will prevent any misuse of char/wchar. On the other hand, why can't I have a range of char (I mean get one from an array, not that I would ever want to)?
>
> Anyway, I agree char[] is a special case, but I also agree it isn't an issue.

This issue may also be interpreted as one more sign that text types in general are special enough to require a distinct (set of) type(s). Which would not prevent freely using *char[] as a plain array (even if I personally cannot imagine what for).

Denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Nov 23 2010
> Easy:
> - string_t becomes a keyword.
> - Syntactically speaking, string_t!T is the name of a type when T is a type.
> - For every built-in character type T (including const and immutable versions), the type currently called T[] is now called string_t!T, but otherwise maintains all of its current behavior.
> - For every other type T, string_t!T is an error.
> - char[] and wchar[] (including const and immutable versions) are plain arrays of code units, even when viewed as a range.
> It's not my preferred solution, but it's easy to explain, it fixes the main problem with the current system, and it only costs one keyword. (I'd rather treat string_t as a library template with compiler support like and rename it to String, but then it wouldn't be a built-in string.)

Or better, if you want both ranges and random access to do the same thing, convert it to byte[], short[] and int[].
Nov 22 2010
On Sun, 21 Nov 2010 23:56:17 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined.I want to use char[] as an array. I want to sort the array, how do I do this? (assume array.sort as a property is deprecated, as it should be) The problem is that the library *won't let you* treat them as arrays. Some functions see char[] as an array, and some see them as a range of dchars, you can't declare to those functions "No! this is an array!" or "No, this is a dchar range!" That is the main problem I see with how the current code works. BTW, you may not understand that we don't want to go back to the days of 'byDchar'. We want strings (including literals) to be special type because they are a special type (not an array). -Steve
Nov 22 2010
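The split between how the compiler and the library see char[], as described in the message above, can be made visible with the range traits in std.range. This is a sketch that assumes a Phobos version in which strings are decoded by the range primitives; the static assertions record what that library reports, not what the language itself guarantees.

```d
import std.range : ElementType, hasLength, isBidirectionalRange,
    isRandomAccessRange;

// The library's view: char[] is a bidirectional range of dchar
// with no usable length and no random access.
static assert(is(ElementType!(char[]) == dchar));
static assert(isBidirectionalRange!(char[]));
static assert(!isRandomAccessRange!(char[]));
static assert(!hasLength!(char[]));

// The same traits applied to ubyte[] describe a plain array.
static assert(is(ElementType!(ubyte[]) == ubyte));
static assert(isRandomAccessRange!(ubyte[]));
static assert(hasLength!(ubyte[]));

void main()
{
    // The compiler's view: char[] is an array of char with a length
    // measured in code units and O(1) indexing.
    char[] text = "héllo".dup;
    static assert(is(typeof(text[0]) == char));
    assert(text.length == 6); // 'é' occupies two code units
}
```

An algorithm constrained on isRandomAccessRange therefore rejects char[] even though the language allows indexing it directly.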
On 2010-11-22 10:37:48 -0500, "Steven Schveighoffer" <schveiguy yahoo.com> said:BTW, you may not understand that we don't want to go back to the days of 'byDchar'. We want strings (including literals) to be special type because they are a special type (not an array).It's amusing to read this from my perspective. In my project where I'm implementing the Objective-C object model, I implemented literal Objective-C strings a few days ago. It's basically a fourth string type understood by the compiler that generates a static NSString instance in the object file. String literals with no explicit type are implicitly converted whenever needed, so it really is painless to use: NSString str = "hello"; // implicit conversion, but only for compile-time constants Here you have your NSString, all stored as static data, no memory allocation at all. So you now have your special string type that works with literals and is not an array. But it's Cocoa-only. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Nov 22 2010
On 11/22/10 9:37 AM, Steven Schveighoffer wrote:On Sun, 21 Nov 2010 23:56:17 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Why do you want to sort an array of char? AndreiIf you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined.I want to use char[] as an array. I want to sort the array, how do I do this? (assume array.sort as a property is deprecated, as it should be)
Nov 22 2010
On Mon, 22 Nov 2010 12:07:55 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 11/22/10 9:37 AM, Steven Schveighoffer wrote:You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want. More points -- what about a redblacktree!(char)? Is that 'invalid'? I mean, it's automatically sorted, so what should I do, throw an error if you try to build one? Is an Array!char a string? What about an SList!char? The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements. FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question. You might say "oh, well that's stupid!" but then so is using the index operator on a char[] array, no? I see no difference. I'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build. -SteveOn Sun, 21 Nov 2010 23:56:17 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Why do you want to sort an array of char?If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined.I want to use char[] as an array. 
I want to sort the array, how do I do this? (assume array.sort as a property is deprecated, as it should be)
Nov 22 2010
On 11/22/10 11:22 AM, Steven Schveighoffer wrote:On Mon, 22 Nov 2010 12:07:55 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said "well if somehow you do imagine such a situation, you achieve that by saying what you means: cast the char[] to ubyte[] and sort that".On 11/22/10 9:37 AM, Steven Schveighoffer wrote:You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want.On Sun, 21 Nov 2010 23:56:17 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Why do you want to sort an array of char?If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined.I want to use char[] as an array. I want to sort the array, how do I do this? (assume array.sort as a property is deprecated, as it should be)More points -- what about a redblacktree!(char)? Is that 'invalid'? I mean, it's automatically sorted, so what should I do, throw an error if you try to build one?No, it still has well-defined semantics. It just doesn't have much sense to it. Why would you use a redblacktree of char? Probably you want one of ubyte, so then why don't you say so?Is an Array!char a string? What about an SList!char?Depends on how Array or SList are defined. 
D chose to convey char[] and wchar[] specific meaning revealing that they are sequences of code points, i.e. Unicode strings.The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements.If you don't want to be restricted to utf requirements, use ubyte and ushort. You're saying "I want to use UTF code points without any associated UTF meaning".FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question.Example?You might say "oh, well that's stupid!" but then so is using the index operator on a char[] array, no? I see no difference.There is a difference. Often in a loop you know the index at which a code point starts.I'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build.I think that's a great idea. Andrei
Nov 22 2010
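The workaround suggested in the message above, casting char[] to ubyte[] before sorting, might look like the following sketch. The helper name sortAsBytes and the sample hand are made up for illustration, and the approach only makes sense when the text is known to be ASCII, as in the poker example discussed in this thread.

```d
import std.algorithm : sort;
import std.stdio : writeln;

/// Hypothetical helper: sorts ASCII-only text in place by viewing it
/// as raw bytes. Sorting multi-byte UTF-8 this way would scramble it.
char[] sortAsBytes(char[] text)
{
    sort(cast(ubyte[]) text);
    return text;
}

void main()
{
    char[] hand = "KA2Q9".dup;

    // With a Phobos that decodes strings, char[] is not a random-access
    // range, so passing it to sort directly does not compile.
    static assert(!__traits(compiles, sort(hand)));

    writeln(sortAsBytes(hand)); // prints the cards in ASCII order
}
```

The cast reinterprets the same memory, so the original char[] ends up sorted without a copy; whether having to spell out that cast is acceptable is exactly the point under dispute here.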
On Mon, 22 Nov 2010 12:40:16 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 11/22/10 11:22 AM, Steven Schveighoffer wrote:That wasn't what you said -- you said I can use char[] as an array if I want to use it as an array, not that I can use ubyte[] as an array (nobody disputes that).You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want.Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said "well if somehow you do imagine such a situation, you achieve that by saying what you means: cast the char[] to ubyte[] and sort that".A literal defining an array of ubytes or ushorts is considerably more painful than one of chars.The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements.If you don't want to be restricted to utf requirements, use ubyte and ushort. You're saying "I want to use UTF code points without any associated UTF meaning".In some poker-hand detection code I've written in C++ (and actually in D too) in the past, I can use characters to represent each card. A straightforward way to do this is to add each 'card' to a string, then sort the string. This allows me to use string functions and regex to detect hand types. You can do the same with ubytes, but it's not as easy to understand. And easy to understand means easier to avoid mistakes. The point is, the domain of valid elements in my application is defined by me, not by the library. The library is making assumptions that my poker hands may contain utf8 characters, while I know in my case they cannot. 
If I could convey this in a way that allows me to keep the nice properties of char arrays (i.e. printing as strings), then I would be fine with the library assuming unless I told it so. But there is no way currently, the library steadfastly refuses to look at it any other way than a utf-8 code sequence. It doesn't help matters that the compiler steadfastly looks at them as arrays. What I want is for the compiler *and* the library to look at strings as not arrays, and for both to look at char[] as an array. So I can clearly define my intent of how I want them to treat such variables.FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question.Example?Here I am continuing to argue. I swear I'll stop after this :) At least until I have my string type ready. -SteveI'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build.
Nov 22 2010
On 11/22/10 12:01 PM, Steven Schveighoffer wrote:On Mon, 22 Nov 2010 12:40:16 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:That still stays valid. The thing is, sort doesn't sort arrays, it sorts random-access ranges.On 11/22/10 11:22 AM, Steven Schveighoffer wrote:That wasn't what you said -- you said I can use char[] as an array if I want to use it as an array, not that I can use ubyte[] as an array (nobody disputes that).You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want.Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said "well if somehow you do imagine such a situation, you achieve that by saying what you means: cast the char[] to ubyte[] and sort that".I've been thinking for a while to have to!(const(ubyte)[]) simply insert a cast when passed const(char)[]. The cast is sound - you are asking for a view of individual code points in a string. That should help with literals.A literal defining an array of ubytes or ushorts is considerably more painful than one of chars.The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements.If you don't want to be restricted to utf requirements, use ubyte and ushort. 
You're saying "I want to use UTF code points without any associated UTF meaning".Why not ubytes?In some poker-hand detection code I've written in C++ (and actually in D too) in the past, I can use characters to represent each card.FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question.Example?A straightforward way to do this is to add each 'card' to a string, then sort the string. This allows me to use string functions and regex to detect hand types. You can do the same with ubytes, but it's not as easy to understand.Why?And easy to understand means easier to avoid mistakes. The point is, the domain of valid elements in my application is defined by me, not by the library. The library is making assumptions that my poker hands may contain utf8 characters, while I know in my case they cannot.Then what's wrong with ubyte? Why do you encode as UTF something that you know isn't UTF? Would you put an integral in a real even though you know it's only integral?If I could convey this in a way that allows me to keep the nice properties of char arrays (i.e. printing as strings), then I would be fine with the library assuming unless I told it so.How would printing as strings be meaningful? I'd suspect you'd want to print a poker hand better than by using one character per card. Even if for some odd reason you want to print ubytes as characters in some exceptional situation, why don't you write a routine that does that and get over with?But there is no way currently, the library steadfastly refuses to look at it any other way than a utf-8 code sequence. It doesn't help matters that the compiler steadfastly looks at them as arrays. What I want is for the compiler *and* the library to look at strings as not arrays, and for both to look at char[] as an array. So I can clearly define my intent of how I want them to treat such variables.I totally understand where you're coming from. 
I believe you also understand where I'm coming from: within the constraints of making UTF built-in, integrated, efficient, and easy to understand, I think the current decisions taken by the language are good. To directly reply to your point: instead of ascribing your desired meaning to char[], you should use char[] for UTF-8 strings exclusively. For arrays of bytes, there's always ubyte[].I suspect you'll notice before long that it's a considerably more difficult task than it might seem in the beginning, and that the result is bound to be less satisfactory than the current strings in at least some dimensions. But I welcome the initiative to bring a concrete abstraction (heh, oxymoron) on the table. AndreiHere I am continuing to argue. I swear I'll stop after this :) At least until I have my string type ready.I'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build.
Nov 22 2010
Andrei Alexandrescu Wrote: On 11/22/10 12:01 PM, Steven Schveighoffer wrote: Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc. D's [w|D|]char types make no sense since they are NOT characters and the concept doesn't fit for unicode since as someone else wrote, there are different levels of abstractions in unicode (code point, code unit, grapheme). Naming matters and having a cat called dog (char is actually code unit) is a source of bugs. On Mon, 22 Nov 2010 12:40:16 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote: That still stays valid. The thing is, sort doesn't sort arrays, it sorts random-access ranges. On 11/22/10 11:22 AM, Steven Schveighoffer wrote: That wasn't what you said -- you said I can use char[] as an array if I want to use it as an array, not that I can use ubyte[] as an array (nobody disputes that). You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want. Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said "well if somehow you do imagine such a situation, you achieve that by saying what you means: cast the char[] to ubyte[] and sort that". I've been thinking for a while to have to!(const(ubyte)[]) simply insert a cast when passed const(char)[]. The cast is sound - you are asking for a view of individual code points in a string. That should help with literals. A literal defining an array of ubytes or ushorts is considerably more painful than one of chars. The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. 
When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements.If you don't want to be restricted to utf requirements, use ubyte and ushort. You're saying "I want to use UTF code points without any associated UTF meaning".Why not ubytes?In some poker-hand detection code I've written in C++ (and actually in D too) in the past, I can use characters to represent each card.FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question.Example?A straightforward way to do this is to add each 'card' to a string, then sort the string. This allows me to use string functions and regex to detect hand types. You can do the same with ubytes, but it's not as easy to understand.Why?And easy to understand means easier to avoid mistakes. The point is, the domain of valid elements in my application is defined by me, not by the library. The library is making assumptions that my poker hands may contain utf8 characters, while I know in my case they cannot.Then what's wrong with ubyte? Why do you encode as UTF something that you know isn't UTF? Would you put an integral in a real even though you know it's only integral?If I could convey this in a way that allows me to keep the nice properties of char arrays (i.e. printing as strings), then I would be fine with the library assuming unless I told it so.How would printing as strings be meaningful? I'd suspect you'd want to print a poker hand better than by using one character per card. Even if for some odd reason you want to print ubytes as characters in some exceptional situation, why don't you write a routine that does that and get over with?But there is no way currently, the library steadfastly refuses to look at it any other way than a utf-8 code sequence. It doesn't help matters that the compiler steadfastly looks at them as arrays. 
What I want is for the compiler *and* the library to look at strings as not arrays, and for both to look at char[] as an array. So I can clearly define my intent of how I want them to treat such variables.I totally understand where you're coming from. I believe you also understand where I'm coming from: within the constraints of making UTF built-in, integrated, efficient, and easy to understand, I think the current decisions taken by the language are good. To directly reply to your point: instead of ascribing your desired meaning to char[], you should use char[] for UTF-8 strings exclusively. For arrays of bytes, there's always ubyte[].I suspect you'll notice before long that it's a considerably more difficult task than it might seem in the beginning, and that the result is bound to be less satisfactory than the current strings in at least some dimensions. But I welcome the initiative to bring a concrete abstraction (heh, oxymoron) on the table. AndreiHere I am continuing to argue. I swear I'll stop after this :) At least until I have my string type ready.I'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build.
Nov 22 2010
On 11/22/10 5:59 PM, foobar wrote:
> Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc.

I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte.

> D's [w|D|]char types make no sense since they are NOT characters and the concept doesn't fit for unicode since as someone else wrote, there are different levels of abstractions in unicode (code point, code unit, grapheme). Naming matters and having a cat called dog (char is actually code unit) is a source of bugs.

I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean it would be odd if they were something else.

Andrei
Nov 22 2010
On Monday 22 November 2010 16:45:43 Andrei Alexandrescu wrote:
> On 11/22/10 5:59 PM, foobar wrote:
>> Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc.
> I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte.
>> D's [w|D|]char types make no sense since they are NOT characters and the concept doesn't fit for unicode since as someone else wrote, there are different levels of abstractions in unicode (code point, code unit, grapheme). Naming matters and having a cat called dog (char is actually code unit) is a source of bugs.
> I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean it would be odd if they were something else.

The problem with char is that so many people are used to thinking of char as a character rather than a code unit. Once you get past that, though, it's fine. I think that it's very well thought out as it is. It just takes some getting used to. Unfortunately though, it seems that thinking of a char as a UTF-8 code unit and _never_ dealing with it as a character is hard for a lot of people to adjust to.

- Jonathan M Davis
Nov 22 2010
Andrei Alexandrescu Wrote:
> On 11/22/10 5:59 PM, foobar wrote:
>> Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc.
> I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte.

How would I go about printing DNA sequences then? Printing a ubyte should print its numeric value, and NOT a char. What is actually needed here is an ASCIIChar type or even a stricter DNAChar.

>> D's [w|d|]char types make no sense since they are NOT characters and the concept doesn't fit for Unicode since, as someone else wrote, there are different levels of abstraction in Unicode (code point, code unit, grapheme). Naming matters, and having a cat called dog (char is actually a code unit) is a source of bugs.
> I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean it would be odd if they were something else.
> Andrei

This isn't a quantitative issue but an existential one. I agree that it's easy to use dogs once someone tells you that everywhere you want a dog you should denote it with "cat". Why do you need to learn that mistake _AT_ALL_? It is odd for YOU to think otherwise because you have ALREADY learned and become accustomed to using a "cat" every time you need a dog. That does not mean that this is indeed correct. This is the same issue people are having with D's enum.

You just don't seem to get that learning is location dependent. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people with a different location on the learning curve. This goes with many of your excellent implementations that get awful names. Very C++ on your part - you need to be a C++ guru just to write a hello world app.
Nov 23 2010
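For the DNA case, a minimal sketch of the "sequence of ubyte" route, assuming the data is plain ASCII. The helper names asText and sortedBases are hypothetical, not Phobos functions; std.string.representation comes from Phobos releases later than this thread.

```d
import std.algorithm.sorting : sort;
import std.string : representation;

// Hypothetical helper: reinterpret raw bytes as text so that
// printing shows letters rather than numeric values.
// Assumes every byte is valid ASCII.
const(char)[] asText(const(ubyte)[] bytes)
{
    return cast(const(char)[]) bytes;
}

// Hypothetical helper: copy a string's code units into a mutable
// ubyte[] and sort them as plain bytes, with no UTF decoding.
ubyte[] sortedBases(string s)
{
    auto bases = s.representation.dup; // immutable(ubyte)[] -> ubyte[]
    sort(bases);
    return bases;
}
```

Passing a ubyte[] straight to writeln prints the numeric values, which is the behaviour foobar objects to; writeln(asText(bytes)) prints the letters. A dedicated ASCIIChar or DNAChar type, as proposed in the post, would make that conversion unnecessary.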
On 11/23/10 3:49 AM, foobar wrote:
> Andrei Alexandrescu Wrote:
>> On 11/22/10 5:59 PM, foobar wrote:
>>> Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc.
>> I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte.
> How would I go about printing DNA sequences then? Printing a ubyte should print its numeric value, and NOT a char. What is actually needed here is an ASCIIChar type or even a stricter DNAChar.

Yes, and the language offers the abstraction abilities to define such types.

>>> D's [w|d|]char types make no sense since they are NOT characters and the concept doesn't fit for Unicode since, as someone else wrote, there are different levels of abstraction in Unicode (code point, code unit, grapheme). Naming matters, and having a cat called dog (char is actually a code unit) is a source of bugs.
>> I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean it would be odd if they were something else.
> This isn't a quantitative issue but an existential one. I agree that it's easy to use dogs once someone tells you that everywhere you want a dog you should denote it with "cat". Why do you need to learn that mistake _AT_ALL_? It is odd for YOU to think otherwise because you have ALREADY learned and become accustomed to using a "cat" every time you need a dog. That does not mean that this is indeed correct. This is the same issue people are having with D's enum. You just don't seem to get that learning is location dependent. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people with a different location on the learning curve. This goes with many of your excellent implementations that get awful names. Very C++ on your part - you need to be a C++ guru just to write a hello world app.

I think I don't understand what you're suggesting.

Andrei
Nov 23 2010
Andrei Alexandrescu Wrote:On 11/23/10 3:49 AM, foobar wrote:It's simple, a mediocre language (Java) with mediocre libraries has orders of magnitude more success than C++ with it's libs fine tuned for performance. Why? because from a regular programmer's POV which just wants to get things done (TM), Java is geared towards easy and quick use. the are many libs for all common use cases, there is a common style and good naming conventions and 9/10 times you can write code by the feel without spending half an hour to read documentation. There are no obscure function names in Latin or Greek (even if the Latin/Greek term is more precise in math terms) in short, Java is KISS, C++ is not. If you want D to succeed you need to acknowledge this and act according to this. Make the common case trivial and the special case possible. "char" is NOT fine and is misleading. I'm not asking to change this right now and would accept a response like "it's too late to change now" or whatever. However, I do expect you to at least acknowledge the issue and not dismiss it. Your code might be excellent but it caters only to you and a small amount of programmers that share your style. D will not succeed in general programmer public until you start catering for the common people and stop dismissing their complaints. D2 is way more complex than D1 becasue of this (and the const system) and I'm singling you out because you are the main developer of D's standard lib and because you set the design goals/style of it.Andrei Alexandrescu Wrote:Yes, and the language offers the abstraction abilities to define such types.On 11/22/10 5:59 PM, foobar wrote:how would I go about printing DNA sequences then? printing a ubyte should print it's numeric value, and NOT a char. What actually needed here is a ASCIIChar type or even a more stricter DNAChar.Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. 
I shouldn't need to cast it in order to do operations on it like sort, find, etc.I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte.I think I don't understand what you're suggesting. AndreiThe isn't a quantitative issue but an existential one. I agree that it's easy to use dogs once someone tells you that everywhere you want a dog you should denote it with "cat". Why do you need to learn that mistake _AT_ALL_ ? it is odd for YOU to think otherwise because you have ALREADY learned and accustomed to use a "cat" every time you need a dog. That does not mean that this is indeed correct. This is the same issue people having with D's enum. You just don't seem to get that learning is location depended. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people with a different location on the learning curve. This goes with many of your excellent implementations that get awful names. Very C++ on your part - you need to be a c++ guru just to write a hello world app.D's [w|D|]char types make no sense since they are NOT characters and the concept doesn't fit for unicode since as someone else wrote, there are different levels of abstractions in unicode (copde point, code unit, grapheme). Naming matters and having a cat called dog (char is actually code unit) is a source of bugs.I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean it would be odd if they were something else.
Nov 23 2010
On 11/23/10 12:15 PM, foobar wrote:Andrei Alexandrescu Wrote:I don't think the dynamics of programming language success can be represented with a one-dimensional explanation. There are many other factors (marketing, perception, historical setting, etc. etc. etc.) Many languages offer easier and quicker ways to get done than Java, which is quite verbose. And Java programmers in fact spend large amounts of time reading documentation of the massive APIs they are working with. I'm not framing that as a bad thing; I'm just clarifying why I think your attempt at explaining Java's success is not only incomplete, but wrong.On 11/23/10 3:49 AM, foobar wrote:It's simple, a mediocre language (Java) with mediocre libraries has orders of magnitude more success than C++ with it's libs fine tuned for performance. Why? because from a regular programmer's POV which just wants to get things done (TM), Java is geared towards easy and quick use. the are many libs for all common use cases, there is a common style and good naming conventions and 9/10 times you can write code by the feel without spending half an hour to read documentation. There are no obscure function names in Latin or Greek (even if the Latin/Greek term is more precise in math terms) in short, Java is KISS, C++ is not.Andrei Alexandrescu Wrote:Yes, and the language offers the abstraction abilities to define such types.On 11/22/10 5:59 PM, foobar wrote:how would I go about printing DNA sequences then? printing a ubyte should print it's numeric value, and NOT a char. What actually needed here is a ASCIIChar type or even a more stricter DNAChar.Canonical example: DNA. I shouldn't need to write a special function to print it since it IS a string. I shouldn't need to cast it in order to do operations on it like sort, find, etc.I think it's best to encode DNA strings as sequences of ubyte. UTF routines will work slower on them than functions for ubyte.I think I don't understand what you're suggesting. 
AndreiThe isn't a quantitative issue but an existential one. I agree that it's easy to use dogs once someone tells you that everywhere you want a dog you should denote it with "cat". Why do you need to learn that mistake _AT_ALL_ ? it is odd for YOU to think otherwise because you have ALREADY learned and accustomed to use a "cat" every time you need a dog. That does not mean that this is indeed correct. This is the same issue people having with D's enum. You just don't seem to get that learning is location depended. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people with a different location on the learning curve. This goes with many of your excellent implementations that get awful names. Very C++ on your part - you need to be a c++ guru just to write a hello world app.D's [w|D|]char types make no sense since they are NOT characters and the concept doesn't fit for unicode since as someone else wrote, there are different levels of abstractions in unicode (copde point, code unit, grapheme). Naming matters and having a cat called dog (char is actually code unit) is a source of bugs.I think the names are fine. It doesn't take much learning to understand that char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively. I mean it would be odd if they were something else.If you want D to succeed you need to acknowledge this and act according to this. Make the common case trivial and the special case possible. "char" is NOT fine and is misleading. I'm not asking to change this right now and would accept a response like "it's too late to change now" or whatever. 
However, I do expect you to at least acknowledge the issue and not dismiss it.What would be a good replacement name for "char"?Your code might be excellent but it caters only to you and a small amount of programmers that share your style.I'm curious how you validated this assumption.D will not succeed in general programmer public until you start catering for the common people and stop dismissing their complaints.Since you are trying to build the impression that this is a common pattern, you should have no trouble finding plenty of examples.D2 is way more complex than D1 becasue of this (and the const system) and I'm singling you out because you are the main developer of D's standard lib and because you set the design goals/style of it.I have had a Google alert tuned for the exact string "D programming language" for a good while. The general opinion that I seem to have gathered is that Phobos 2 is a major pro, not a con, in deciding to choose D2 versus D1. Andrei
Nov 23 2010
On 23/11/2010 18:15, foobar wrote:
> It's simple, a mediocre language (Java) with mediocre libraries has orders of magnitude more success than C++ with its libs fine-tuned for performance. Why?

Java has mediocre libraries?? Are you serious about that opinion?

--
Bruno Medeiros - Software Engineer
Nov 24 2010
On Tuesday, November 23, 2010 09:05:05 Andrei Alexandrescu wrote:
>> You just don't seem to get that learning is location dependent. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people with a different location on the learning curve. This goes with many of your excellent implementations that get awful names. Very C++ on your part - you need to be a C++ guru just to write a hello world app.
> I think I don't understand what you're suggesting.

I think that what he's saying is that the names char, wchar, and dchar as UTF-8, UTF-16, and UTF-32 code points respectively make sense to you because you're used to them, but anyone learning D (particularly those who are used to char in other languages being an ASCII character) won't find them at all intuitive or obvious.

Honestly, the only semi-reasonable alternative to char, wchar, and dchar that I can think of would be utf8, utf16, and utf32. But then everyone would be wondering where char was, and I'm not sure that it would really help any in the long run anyway. It would be more explicit though.

But given char and wchar_t in C++, I really don't think that it's much of a stretch to use char, wchar, and dchar. The only thing really different about it is that D insists that char is always a UTF-8 code unit rather than it really being usable as an ASCII character.

- Jonathan M Davis
Nov 23 2010
> I think that what he's saying is that the names char, wchar, and dchar as UTF-8, UTF-16, and UTF-32 code points respectively make sense to you because you're used to them, but anyone learning D (particularly those who are used to char in other languages being an ASCII character) won't find them at all intuitive or obvious.

They should first realize this is another language.

> Honestly, the only semi-reasonable alternative to char, wchar, and dchar that I can think of would be utf8, utf16, and utf32. But then everyone would be wondering where char was, and I'm not sure that it would really help any in the long run anyway. It would be more explicit though. But given char and wchar_t in C++, I really don't think that it's much of a stretch to use char, wchar, and dchar. The only thing really different about it is that D insists that char is always a UTF-8 code unit rather than it really being usable as an ASCII character.

That actually is an excellent idea, wiping all 3 of them and replacing with these.

--
Using Opera's revolutionary email client: http://www.opera.com/mail/
Nov 23 2010
Jonathan M Davis wrote:
> On Tuesday, November 23, 2010 09:05:05 Andrei Alexandrescu wrote:
>>> You just don't seem to get that learning is location dependent. What makes sense to YOU based on your location on the learning curve isn't absolute and does NOT reflect on people with a different location on the learning curve. This goes with many of your excellent implementations that get awful names. Very C++ on your part - you need to be a C++ guru just to write a hello world app.
>> I think I don't understand what you're suggesting.
> I think that what he's saying is that the names char, wchar, and dchar as UTF-8, UTF-16, and UTF-32 code points respectively make sense to you because you're used to them, but anyone learning D (particularly those who are used to char in other languages being an ASCII character) won't find them at all intuitive or obvious.

And in Java a char is a 16-bit Unicode char that is generally handled as a code unit (since Java 1.5, 32-bit surrogate pair code units consisting of 2 chars are also supported, but I don't know if that really works in the whole standard lib and if people actually use it). So also for Java programmers 1 char == 1 printed character, even though it supports more than ASCII.

> Honestly, the only semi-reasonable alternative to char, wchar, and dchar that I can think of would be utf8, utf16, and utf32.

Naa, that sounds like it's a whole UTF-* string and not just a code point to me. utf8codepoint maybe? or utf8cp? .. That sucks, IMHO it should stay the way it is.

But maybe an ASCII type (or maybe a more general 8-bit text type that also supports ISO-* charsets etc?) would be helpful. One that string literals can implicitly be converted to (so ubyte[] or an alias of that won't work). Also the compiler would have to make sure that all characters of the string can be represented in ASCII (or ISO-*).

Cheers,
- Daniel
Nov 23 2010
Andrei Alexandrescu wrote:On 11/22/10 12:01 PM, Steven Schveighoffer wrote:On Mon, 22 Nov 2010 12:40:16 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:That still stays valid. The thing is, sort doesn't sort arrays, it sorts random-access ranges.On 11/22/10 11:22 AM, Steven Schveighoffer wrote:That wasn't what you said -- you said I can use char[] as an array if I want to use it as an array, not that I can use ubyte[] as an array (nobody disputes that).You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want.Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said "well if somehow you do imagine such a situation, you achieve that by saying what you means: cast the char[] to ubyte[] and sort that".Then what's wrong with ubyte? Why do you encode as UTF something that you know isn't UTF?And easy to understand means easier to avoid mistakes. The point is, the domain of valid elements in my application is defined by me, not by the library. The library is making assumptions that my poker hands may contain utf8 characters, while I know in my case they cannot.The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements.If you don't want to be restricted to utf requirements, use ubyte and ushort. You're saying "I want to use UTF code points without any associated UTF meaning".Would you put an integral in a real even though you know it's only integral?I don't think that's a valid comparison, since we have integer types, but we don't have ASCII types. 
Here's the issue as I see it: there are very common use cases (and lots of existing C code) for a type which stores an ASCII character. I think we're seeing the exact same issue that causes to people to mistakenly use 'uint' when they mean 'positive integer'. It LOOKS as though a char is a subset of dchar (ie, a dchar in the range 0..0x7F). It LOOKS as though a uint is a subset of int (ie, an int in the range 0..int.max). But in both cases, the possibility that the high bit could be set, changes the semantics.
Nov 24 2010
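A short sketch of the two "looks like a subset, but isn't" cases described above. The function names widen and signedGreater are invented for illustration; the behaviour shown follows from D's implicit conversion rules.

```d
// A char converts implicitly to dchar, but a char with the high bit
// set is a fragment of a UTF-8 sequence, not a character. Widening
// it yields an unrelated code point.
dchar widen(char c)
{
    return c; // compiles without a cast
}

// In a mixed comparison the int operand is converted to uint,
// so a negative value compares as a very large one.
bool signedGreater(int i, uint u)
{
    return i > u;
}
```

For the UTF-8 encoding of U+00E9 (bytes 0xC3 0xA9), widen on the first code unit gives U+00C3 rather than U+00E9, and signedGreater(-1, 1) is true.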
On Wed, 24 Nov 2010 13:39:19 +0100 Don <nospam nospam.com> wrote:
> I think we're seeing the exact same issue that causes people to mistakenly use 'uint' when they mean 'positive integer'. It LOOKS as though a char is a subset of dchar (ie, a dchar in the range 0..0x7F).

Cannot be, in the sense of uint being a subset of ulong. That's why "char", if not perfect, is a good name, providing the programmer with a hint about actual semantics. What I don't understand is why people who need unsigned bytes do not use ubyte, but instead buy into char. Is this only because of C baggage?

> It LOOKS as though a uint is a subset of int (ie, an int in the range 0..int.max).

This indeed is a big issue. I would prefer uint (= Natural) to be implemented as a subset of int:
uint 0 --> +7fffffff
int -f000000 --> +7fffffff

Denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Nov 24 2010
spir wrote:
> What I don't understand is why people who need unsigned bytes do not use ubyte, but instead buy into char. Is this only because of C baggage?

Probably because you can't write
ubyte[] str = "asdf";
and they want to have "ascii-chars" in their ubyte arrays.
Nov 24 2010
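One possible way around the missing literal conversion, sketched with std.string.representation from later Phobos releases. The function names asciiBytes and mutableBytes are hypothetical.

```d
import std.string : representation;

// Hypothetical helper: view a string literal's code units as bytes.
// No copy is made; the result aliases the string's memory.
immutable(ubyte)[] asciiBytes(string s)
{
    return s.representation;
}

// Hypothetical helper: a mutable copy, for code that needs ubyte[].
ubyte[] mutableBytes(string s)
{
    return s.representation.dup;
}
```

This keeps the literal readable at the call site, e.g. mutableBytes("asdf"), at the cost of one function call. It does not check that the contents are ASCII.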
On 11/24/10 9:35 AM, Daniel Gibson wrote:
> spir wrote:
>> What I don't understand is why people who need unsigned bytes do not use ubyte, but instead buy into char. Is this only because of C baggage?
> Probably because you can't write ubyte[] str = "asdf"; and they want to have "ascii-chars" in their ubyte arrays.

Probably the assignment should be allowed.

Andrei
Nov 24 2010
On Wed, 24 Nov 2010 16:35:59 +0100 Daniel Gibson <metalcaedes gmail.com> wrote:
> spir wrote:
>> What I don't understand is why people who need unsigned bytes do not use ubyte, but instead buy into char. Is this only because of C baggage?
> Probably because you can't write ubyte[] str = "asdf"; and they want to have "ascii-chars" in their ubyte arrays.

Oh yes, sorry for the noise. Then, I don't see any other solution than having a proper ByteString type built into the compiler (that would indeed work for any single-byte encoding, not only ASCII), with a corresponding string literal pre/post-fix (one more ;-).

Denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Nov 24 2010
On Nov 23, 10 01:40, Andrei Alexandrescu wrote:On 11/22/10 11:22 AM, Steven Schveighoffer wrote:Right, and D3 should simply disable using char and wchar as an independent type, like void, since using a single code unit makes next to no sense either. As a side-effect, no one can complain containers of char and wchar doesn't work as expected because it simply won't compile. Then we can rightfully say char[] and wchar[] are special. char c = 'A'; // error: A single code unit makes no sense. Make it a ubyte or dchar instead. int[char] d; // error: Indexing by a code unit makes no sense. Make it an int[ubyte] or int[dchar] instead. :pOn Mon, 22 Nov 2010 12:07:55 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said "well if somehow you do imagine such a situation, you achieve that by saying what you means: cast the char[] to ubyte[] and sort that".On 11/22/10 9:37 AM, Steven Schveighoffer wrote:You're dodging the question. You claim that if I want to use it as an array, I use it as an array, if I want to use it as a range, use it as a range. I'm simply pointing out why you can't use it as an array -- because phobos treats it as a bidirectional range, and you can't force it to do what you want.On Sun, 21 Nov 2010 23:56:17 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Why do you want to sort an array of char?If you want to work with arrays, use a[0] to access the front, a[$ - 1] to access the back, and a = a[1 .. $] to chop off the first element of the array. It is not AT ALL natural to mix those with a.front, a.back etc. It is not - why? because std.range defines them with specific meanings for arrays in general and for arrays of characters in particular. If you submit to use std.range's abstraction, you submit to using it the way it is defined.I want to use char[] as an array. 
I want to sort the array, how do I do this? (assume array.sort as a property is deprecated, as it should be)One possible application could be (assume ASCII for a moment) pure bool slowIsAnagramOf(in char[] a, in char[] b) { auto c = a.dup, d = b.dup; sort(c); sort(d); return c == d; }More points -- what about a redblacktree!(char)? Is that 'invalid'? I mean, it's automatically sorted, so what should I do, throw an error if you try to build one?No, it still has well-defined semantics. It just doesn't have much sense to it. Why would you use a redblacktree of char? Probably you want one of ubyte, so then why don't you say so?Is an Array!char a string? What about an SList!char?Depends on how Array or SList are defined. D chose to convey char[] and wchar[] specific meaning revealing that they are sequences of code points, i.e. Unicode strings.The thing is, *only* when one wants to create strings, does one want to view the data type as a bidirectional string. When one wants to deal with chars as an element of a container, I don't want to be restricted to utf requirements.If you don't want to be restricted to utf requirements, use ubyte and ushort. You're saying "I want to use UTF code points without any associated UTF meaning".FWIW, I deal in ASCII pretty much exclusively, so sorting an array of char is not out of the question.Example?You might say "oh, well that's stupid!" but then so is using the index operator on a char[] array, no? I see no difference.There is a difference. Often in a loop you know the index at which a code point starts.I'm going to drop out of this discussion in order to develop a viable alternative to using arrays to represent strings. Then we can discuss the merits/drawbacks of such a type. I think it will be simple to build.I think that's a great idea. Andrei
Nov 22 2010
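The anagram example from the post above, reworked along the lines of the "sort it as ubyte[]" suggestion. This is a sketch that assumes ASCII input, as the post does; it uses std.string.representation (a later Phobos addition) in place of an explicit cast.

```d
import std.algorithm.sorting : sort;
import std.string : representation;

// Sorts copies of both inputs as raw bytes and compares them.
// Only meaningful for ASCII (or any single-byte encoding).
bool slowIsAnagramOf(in char[] a, in char[] b)
{
    auto c = a.dup.representation; // char[] viewed as ubyte[]
    auto d = b.dup.representation;
    sort(c);
    sort(d);
    return c == d;
}
```

Calling sort directly on a char[] is rejected at compile time because Phobos does not treat char[] as a random-access range; going through ubyte[] states the "these are just bytes" intent explicitly.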
On 22/11/2010 04:56, Andrei Alexandrescu wrote:
> On 11/21/10 22:09 CST, Rainer Deyke wrote:
>> On 11/21/2010 17:31, Andrei Alexandrescu wrote:
>> char[] and wchar[] fail to provide some of the guarantees of all other instances of T[].
> What exactly are those guarantees?

More exactly, that the following is true for any T:

    foreach (character; (T[]).init)
    {
        static assert(is(typeof(character) == T));
    }
    static assert(std.range.isRandomAccessRange!(T[]));

It is not true for char and wchar (the second assert fails).

Another guarantee, similar in nature, and roughly described, is that functions in std.algorithm should never fail or throw when using an array as an argument (assuming the other arguments are valid). So for example:

    std.algorithm.filter!("true")(anArray)

should not throw, for any value of anArray. But it may if anArray is of type char[] or wchar[] and there is an encoding exception.

I'll leave the arguing of whether we want those guarantees for other subthreads, but it should be well agreed by now that the above is not guaranteed.

--
Bruno Medeiros - Software Engineer
Nov 24 2010
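The guarantees discussed above can be checked at compile time. The following sketch uses the range traits from std.range.primitives and holds for Phobos versions in which narrow strings are auto-decoded, which is an assumption about the library version in use.

```d
import std.range.primitives : ElementType, isBidirectionalRange,
    isRandomAccessRange;

// Ordinary arrays are random-access ranges of their element type.
static assert(isRandomAccessRange!(int[]));
static assert(isRandomAccessRange!(ubyte[]));
static assert(is(ElementType!(int[]) == int));

// dchar[] behaves like any other array.
static assert(isRandomAccessRange!(dchar[]));

// char[] and wchar[] are the exceptions: as ranges they are only
// bidirectional, and their range element type is dchar.
static assert(!isRandomAccessRange!(char[]));
static assert(!isRandomAccessRange!(wchar[]));
static assert(isBidirectionalRange!(char[]));
static assert(is(ElementType!(char[]) == dchar));
static assert(is(ElementType!(wchar[]) == dchar));

// Indexing, by contrast, still yields the code unit type.
static assert(is(typeof((char[]).init[0]) == char));
```

The last assertion is the array view and the ones before it are the range view, which is the split the thread is arguing about.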
On 24/11/2010 13:07, Bruno Medeiros wrote:On 22/11/2010 04:56, Andrei Alexandrescu wrote:Actually, I'll reply here, on why I would like these guarantees: I think these guarantees are desirable due to a general design principle of mine that goes something like this: * Avoid "bad" abstractions: the abstraction should reflect intent as closely and clearly as possible. Yeah, that may not tell anyone much because it's very hard to objectively define whether an abstraction is "bad" or not, or better or worse than another. However, here are a few guidelines: - within the same level of functionality, things should be as simple and as orthogonal as possible. - don't confuse implementation with contract/interface/API. (note that I said "confuse", not "expose") char[] is not as orthogonal as possible. char[] does not reflect it's underlying intent as clearly as it could. If it was defined in a struct, you could directly document the expectation that the underlying string must be a valid UTF-8 encoding. In fact, you could even make that a contract. If instead of an argument based on a design principle, you ask for concrete examples of why this is undesirable, well, I have no examples to give... I haven't used D enough to run into real-world examples, but I believe that whenever the above principle is violated, then it is very likely that problems and/or annoyances will occur sooner or later. I should point out however, that, at least for me, the undesirability of the current behavior is actually very low. Compared to other language issues (whether current ones, or past ones), it does not seem that significant. For example, static arrays not being proper values types (plus their .init thing) was much worse, man, that annoyed the shit out of me. Then again, someone with more experience using D might encounter a more serious real-world case regarding the current behavior. 
Also, regarding this: On 22/11/2010 17:40, Andrei Alexandrescu wrote:On 11/21/10 22:09 CST, Rainer Deyke wrote:More exactly, that the following is true for any T: foreach(character; (T[]).init) { static assert(is(typeof(character) == T)); } static assert(std.range.isRandomAccessRange!(T[])); It is not true for char and wchar (the second assert fails). Another guarantee, similar in nature, and roughly described, is that functions in std.algorithm should never fail or throw when using an array as a argument (assuming the other arguments are valid). So for example: std.algorithm.filter!("true")(anArray) Should not throw, for any value of anArray. But it may if anArray is of type char[] or wchar[] and there is an encoding exception. I'll leave the arguing of whether we want those guarantees for other subthreads, but it should be well agreed by now, that the above is not guaranteed.On 11/21/2010 17:31, Andrei Alexandrescu wrote: char[] and wchar[] fail to provide some of the guarantees of all other instances of T[].What exactly are those guarantees?Of course you can. After you were to admit that it makes next to no sense to sort an array of code units, I would have said "well if somehow you do imagine such a situation, you achieve that by saying what you means: cast the char[] to ubyte[] and sort that".Casting to ubyte[] does solve the use case, I agree. It does so with a minor inconvenience (having to cast), but it's very minor and I don't think it's that significant. Rather, I'm more concerned with the use cases that actually want to use a char[] as a UTF-8 encoded string. As I mentioned above, I'm afraid of situations where this inconsistency might cause more significant inconveniences, maybe even bugs! -- Bruno Medeiros - Software Engineer
Nov 24 2010
On Sunday 21 November 2010 16:12:14 Rainer Deyke wrote:
>> You may rest assured that if anything, strings are not a problem.
> I'm not concerned about strings, I'm concerned about *arrays*. Arrays of T, where T may or may not be a character type. I see that you ignored my Vector!char example yet again. Your assurances aren't increasing my confidence in D, they're decreasing my confidence in your judgment (and by extension my confidence in D).
>> It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs.
> Sorry, but no. It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit.

Character arrays are arrays of code units and ranges of code points (of dchar specifically). If you want them to be treated as code points, access them as ranges. If you want to treat them as code units, access them as arrays. So, as far as character arrays go, there shouldn't be any problems. You just have to be aware of the difference between a char or wchar and a character.

Now, as for Array!char or any other container which could be considered a sequence of code units, there, we could be in trouble if we want to treat them as code points rather than code units. I believe that ranges over them would be over code units rather than code points, and if that's the case, you're going to have to deal with char and wchar as arrays if you want to treat them as ranges of dchar. We should be able to get around the problem by special-casing the containers on char and wchar, but that would mean more work for anyone implementing a container where it would be reasonable to see its elements as a sequence of code units making up a string. It's quite doable though.

- Jonathan M Davis
Nov 21 2010
On Sunday 21 November 2010 16:48:53 Jonathan M Davis wrote:
> On Sunday 21 November 2010 16:12:14 Rainer Deyke wrote:
>>> You may rest assured that if anything, strings are not a problem.
>> I'm not concerned about strings, I'm concerned about *arrays*. Arrays of T, where T may or not be a character type. I see that you ignored my Vector!char example yet again. Your assurances aren't increasing my confidence in D, they're decreasing my confidence in your judgment (and by extension my confidence in D).
>>> It may be wise in fact to start using D2 and make criticism grounded in reality that could help us improve the state of affairs.
>> Sorry, but no. It would take a huge investment of time and effort on my part to switch from C++ to D. I'm not going to make that leap without looking first, and I'm not going to make it when I can see that I'm about to jump into a spike pit.
> Character arrays are arrays of code units and ranges of code points (of dchar specifically). If you want them to be treated as code points, access them as ranges. If you want to treat them as code units, access them as arrays. So, as far as character arrays go, there shouldn't be any problems. You just have to be aware of the difference between a char or wchar and a character.
> Now, as for Array!char or any other container which could be considered a sequence of code units, there, we could be in trouble if we want to treat them as code points rather than code units. I believe that ranges over them would be over code units rather than code points, and if that's the case, you're going to have to deal with char and wchar as arrays if you want to treat them as ranges of dchar. We should be able to get around the problem by special-casing the containers on char and wchar, but that would mean more work for anyone implementing a container where it would be reasonable to see its elements as a sequence of code units making up a string. It's quite doable though.

Actually, the better implementation would probably be to provide wrapper ranges for ranges of char and wchar so that you could access them as ranges of dchar. Doing otherwise would make it so that you couldn't access them directly as ranges of char or wchar, which would be limiting, and since it's likely that anyone actually wanting strings would just use strings, there's a good chance that in the majority of cases, what you'd want would really be a range of char or wchar anyway. Regardless, it's quite possible to access containers of char or wchar as ranges of dchar if you need to.

- Jonathan M Davis
Nov 21 2010
On 11/21/10 7:00 PM, Jonathan M Davis wrote:
> Actually, the better implementation would probably be to provide wrapper ranges for ranges of char and wchar so that you could access them as ranges of dchar. Doing otherwise would make it so that you couldn't access them directly as ranges of char or wchar, which would be limiting, and since it's likely that anyone actually wanting strings would just use strings, there's a good chance that in the majority of cases, what you'd want would really be a range of char or wchar anyway. Regardless, it's quite possible to access containers of char or wchar as ranges of dchar if you need to.

I agree except for the majority of cases part. In fact the original design of range interfaces for char[] and wchar[] was to require byDchar() to get a bidirectional interface over the arrays of code units.

That design, with which I experimented for a while, had two drawbacks:

1. It had the default reversed, i.e. most often you want to regard a char[] or a wchar[] as a range of code points, not as an array of code units.

2. It had the unpleasant effect that most algorithms in std.algorithm and beyond did the wrong thing by default, and the right thing only if you wrapped everything with byDchar().

The second iteration of the design, which is currently in use, was to define in std.range the primitives such that char[] and wchar[] offer by default the bidirectional range interface. I have gone through all algorithms in std.algorithm and std.string and noticed with amazed satisfaction that they most always did the right thing, and that I could tweak the few that didn't to complete a satisfactory implementation. (indexOf has slipped through the cracks.) I think that experience with the current design is speaking in its favor.

One thing could be done to drive the point home: a function byCodeUnit() could be added that actually does iterate a char[] or a wchar[] one code unit at a time (and consequently restores their behavior as T[]). That function could be simply a cast to ubyte[]/ushort[], or it could introduce a random-access range.

Andrei
Nov 21 2010
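Both variants of the byCodeUnit() idea floated in the post above can be tried out with functions that Phobos gained after this thread: std.utf.byCodeUnit (a random-access range over code units) and std.string.representation (the reinterpretation as ubyte[]/ushort[]). The sketch below assumes a Phobos release that ships them; it does not describe the 2010 library.

```d
import std.range : isRandomAccessRange, walkLength;
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    // U+00E4 (a with diaeresis) occupies two UTF-8 code units.
    string s = "h\u00E4ns";

    // Array view: five code units.
    assert(s.length == 5);

    // Default range view: four decoded code points.
    assert(s.walkLength == 4);

    // Explicit code-unit range: random access and length are back.
    auto units = s.byCodeUnit;
    static assert(isRandomAccessRange!(typeof(units)));
    assert(units.length == 5);
    assert(units[0] == 'h');

    // The cast variant: the same memory seen as plain bytes.
    immutable(ubyte)[] raw = s.representation;
    assert(raw.length == 5);
}
```

Either form makes the choice of level visible at the call site, which is the "drive the point home" effect the post asks for.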
On Sunday 21 November 2010 17:21:27 Andrei Alexandrescu wrote:
> On 11/21/10 7:00 PM, Jonathan M Davis wrote:
>> Actually, the better implementation would probably be to provide wrapper ranges for ranges of char and wchar so that you could access them as ranges of dchar. Doing otherwise would make it so that you couldn't access them directly as ranges of char or wchar, which would be limiting, and since it's likely that anyone actually wanting strings would just use strings, there's a good chance that in the majority of cases, what you'd want would really be a range of char or wchar anyway. Regardless, it's quite possible to access containers of char or wchar as ranges of dchar if you need to.
> I agree except for the majority of cases part. In fact the original design of range interfaces for char[] and wchar[] was to require byDchar() to get a bidirectional interface over the arrays of code units.
> That design, with which I experimented for a while, had two drawbacks:
> 1. It had the default reversed, i.e. most often you want to regard a char[] or a wchar[] as a range of code points, not as an array of code units.
> 2. It had the unpleasant effect that most algorithms in std.algorithm and beyond did the wrong thing by default, and the right thing only if you wrapped everything with byDchar().
> The second iteration of the design, which is currently in use, was to define in std.range the primitives such that char[] and wchar[] offer by default the bidirectional range interface. I have gone through all algorithms in std.algorithm and std.string and noticed with amazed satisfaction that they most always did the right thing, and that I could tweak the few that didn't to complete a satisfactory implementation. (indexOf has slipped through the cracks.) I think that experience with the current design is speaking in its favor.
> One thing could be done to drive the point home: a function byCodeUnit() could be added that actually does iterate a char[] or a wchar[] one code unit at a time (and consequently restores their behavior as T[]). That function could be simply a cast to ubyte[]/ushort[], or it could introduce a random-access range.

Well, I don't know for certain whether people would normally want to iterate over Array!char as a char range or a dchar range. However, when thinking about the likely uses, it seems to me that if you really want a string, you'd likely be using a string rather than Array!char, so I figure that the most likely use case for Array!char would be to iterate over a range of char. But I could be totally wrong about that.

As for character arrays, I do think that the normal use case is to want to see them as ranges of dchar rather than char or wchar. However, that can get a bit funny due to the fact that while the _programmer_ almost always views them that way, the _algorithms_ vary quite a bit more in whether they really want dchar or whether char or wchar works just fine. I do agree that the current design works quite well overall though.

If I were to change it, I'd probably make strings into structs which have an array property (giving access to the char[] or wchar[] array if you need it) and give the struct a range interface which was over dchar. To really make that work, though, you'd need uniform function call syntax (or things like str.splitlines() would quit working), and there could be other reasons why it would fall apart. However, it would quickly and easily make dchar iteration the default while still allowing access to the interior char[] or wchar[]. But since you'd still have to special case functions which actually wanted the char[] or wchar[], I'm not sure if you ultimately gain much - though it does fix the foreach error where it defaults to char or wchar.

Overall, what we have works quite well. It _is_ a bit convoluted at times, but it's generally convoluted because of the nature of unicode rather than how we're implementing it. It's not perfect (unicode is too disgusting for perfection to be possible anyway), but it works _far_ better than any other language that I've used, and I actually understand unicode and its issues far better than I did before messing around with D strings.

- Jonathan M Davis
Nov 21 2010
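The struct-based alternative described in the post above could look roughly like this. It is a minimal sketch under stated assumptions: the type name Str is hypothetical, the member name array follows the wording of the post, and only the input-range primitives are shown (no back/popBack, no mutation).

```d
import std.range : ElementType, isInputRange, walkLength;
import std.utf : decode, stride;

// Hypothetical string wrapper: the raw code units stay reachable through
// .array, while the range interface always yields decoded dchar values.
struct Str
{
    string array;

    @property bool empty() { return array.length == 0; }

    @property dchar front()
    {
        size_t index = 0;
        return decode(array, index); // decodes the first code point
    }

    void popFront()
    {
        // stride gives the number of code units in the first code point
        array = array[stride(array, 0) .. $];
    }
}

static assert(isInputRange!Str);
static assert(is(ElementType!Str == dchar));

void main()
{
    auto s = Str("h\u00E4");
    assert(s.array.length == 3); // code units
    assert(s.walkLength == 2);   // code points

    // foreach goes through the range primitives, so it sees dchar, not char.
    foreach (c; Str("h\u00E4"))
    {
        static assert(is(typeof(c) == dchar));
    }
}
```

Because foreach on a struct uses front/popFront/empty, this design removes the char-by-default foreach behaviour the post mentions, at the price of wrapping every string.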
On 2010-11-21 20:21:27 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
> That design, with which I experimented for a while, had two drawbacks:
> 1. It had the default reversed, i.e. most often you want to regard a char[] or a wchar[] as a range of code points, not as an array of code units.
> 2. It had the unpleasant effect that most algorithms in std.algorithm and beyond did the wrong thing by default, and the right thing only if you wrapped everything with byDchar().

Well, basically these two arguments are the same: iterating by code unit isn't a good default. And I agree. But I'm unconvinced that iterating by dchar is the right default either. For one thing it has more overhead, and for another it still doesn't represent a character.

Now, add graphemes to the equation and you have a representation that matches the user-perceived character concept, but for that you add another layer of decoding overhead and a variable-size data type to manipulate (a grapheme is a sequence of code points). And you have to use Unicode normalization when comparing graphemes. So is that a good default? Probably not. It might be "correct" in some sense, but it's totally overkill for most cases.

My thinking is that there is no good default. If you write an XML parser, you'll probably want to work at the code point level; if you write a JSON parser, you can easily skip the overhead and work at the UTF-8 code unit level until you start parsing a string; if you write something to count the number of user-perceived characters or want to map characters to a font then you'll want graphemes...

Perhaps there should be simply no default; perhaps you should be forced to choose explicitly at which layer you want to operate each time you apply an algorithm on a string... and to make this less painful we could have functions in std.string acting as a thin layer over similar ones in std.algorithm that would automatically choose the right representation for the algorithm depending on the operation.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Nov 21 2010
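The JSON remark in the post above relies on a property of UTF-8: bytes below 0x80 never occur inside a multi-byte sequence, so ASCII punctuation can be searched for without decoding. A minimal sketch, assuming std.string.representation from a later Phobos release; countAscii is a made-up helper, and it deliberately ignores JSON string quoting rules.

```d
import std.algorithm : count;
import std.string : representation;

// Hypothetical helper: counts an ASCII byte at the code-unit level,
// with no UTF-8 decoding at all.
size_t countAscii(string text, char ascii)
{
    assert(ascii < 0x80, "only ASCII needles are safe at this level");
    return text.representation.count(cast(ubyte) ascii);
}

void main()
{
    // Non-ASCII content does not disturb the byte-level scan.
    string json = `["été", "naïve", 1]`;
    assert(countAscii(json, ',') == 2);
    assert(countAscii(json, '[') == 1);
}
```

A real parser would switch to code-point decoding only once it is inside a string literal, which is the split the post describes.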
On Sun, 21 Nov 2010 21:26:53 -0500 Michel Fortin <michel.fortin michelf.com> wrote:
> On 2010-11-21 20:21:27 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
>> That design, with which I experimented for a while, had two drawbacks:
>> 1. It had the default reversed, i.e. most often you want to regard a char[] or a wchar[] as a range of code points, not as an array of code units.
>> 2. It had the unpleasant effect that most algorithms in std.algorithm and beyond did the wrong thing by default, and the right thing only if you wrapped everything with byDchar().

Hello Michel,

> Well, basically these two arguments are the same: iterating by code unit isn't a good default. And I agree. But I'm unconvinced that iterating by dchar is the right default either. For one thing it has more overhead, and for another it still doesn't represent a character.

This is an issue evoked in a previous thread some weeks ago. More on it below.

> Now, add graphemes to the equation and you have a representation that matches the user-perceived character concept, but for that you add another layer of decoding overhead and a variable-size data type to manipulate (a grapheme is a sequence of code points). And you have to use Unicode normalization when comparing graphemes. So is that a good default? Probably not. It might be "correct" in some sense, but it's totally overkill for most cases.

It is not possible, as the writer of a text-processing lib or Text type, to define a right level of abstraction (code unit, code point, or grapheme) that would both be usually efficient and avoid unexpected failures for "naive" use of the tool.
The only safe level in 99% of cases is the highest-level one, namely grapheme. Only then can one be sure that, for instance, text.count("ä") will actually count the "ä"s in the source text. But in most cases, this is overkill. It depends on what the text actually is, *and* on what the programmer knows about it (I mean that texts may be plain ASCII, so that even unsigned byte strings would do the job, but if the programmer cannot guess it...).
The tool writer cannot guess anything.

> My thinking is that there is no good default. If you write an XML parser, you'll probably want to work at the code point level; if you write a JSON parser, you can easily skip the overhead and work at the UTF-8 code unit level until you start parsing a string; if you write something to count the number of user-perceived characters or want to map characters to a font then you'll want graphemes...

At least 3 factors must be taken into account:

1. The actual content of source texts. For instance, 99.999% of all texts won't ever hold code points > ffff. This tells which size should be used for code units. The safe general choice indeed being 32 bits.

2. The normalisation form of graphemes; whether they are decomposed (the right choice), or in unknown form or possibly in mixed forms, or as precomposed as possible. In the latter case (by far the most common one for western language texts), if one can assert that every grapheme in every source text to be dealt with has a fully precomposed form (= 1 single code *point*), then the level of code points is safe enough.

3. Whether text is just transferred through an app or is also processed. Many apps just use some bits of input texts (files, user input, literals) as is, without any processing, and often output some of them, possibly concatenated. This is safe whatever the abstraction level of text representation used; one can concat plain utf8 representing composite graphemes in decomposed form.
But as soon as any text-processing routine is used (indexing, slicing, find, count, replace...), then questions arise about correctness of the app.

And, as said already, to be able to safely choose any lower level of representation, the programmer must know about the content, its properties, its UCS coding. For instance, imagine you need to write an app dealing with texts containing phonetic symbols (IPA). How do you know which is the lowest safe level?
* What is the common coding of IPA graphemes in UCS?
* Can they be coded in various ways (yes!, too bad..)
* What is the highest code point ever possibly needed? (==> is utf8 or utf16 enough for code points?)
* Do all graphemes have a fully precomposed form?
* Can I be sure that all texts will actually be coded in precomposed form (this depends on text producing tools), "for ever"?

> Perhaps there should be simply no default; perhaps you should be forced to choose explicitly at which layer you want to operate each time you apply an algorithm on a string... and to make this less painful we could have functions in std.string acting as a thin layer over similar ones in std.algorithm that would automatically choose the right representation for the algorithm depending on the operation.

My next project should be to write one Text type dealing at the highest level -- if only to showcase the issues involved by the "missing level of abstraction" in common tools supposed to deal with universal text.
This is much easier in D thanks to proper string types, and availability of tools to cope with lower-level issues, mainly decoding/encoding and validity checking (I do not know yet how practical said tools are).

denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Nov 22 2010
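The text.count("ä") concern raised above is easy to reproduce: the same user-perceived character can be stored precomposed (U+00E4) or decomposed (a followed by U+0308), and a code-point-level search only finds the form it was given. A small sketch, assuming std.uni.normalize from a later Phobos release; the sample strings are made up.

```d
import std.algorithm : count;
import std.uni : NFC, normalize;

void main()
{
    string precomposed = "\u00E4";  // one code point
    string decomposed = "a\u0308";  // base letter + combining diaeresis

    // Same user-perceived character, different code point sequences.
    assert(precomposed != decomposed);
    assert(normalize!NFC(decomposed) == precomposed);

    // A text that mixes both forms.
    string text = "b\u00E4r ba\u0308r";

    // Searching at the code point level misses the decomposed occurrence.
    assert(text.count(precomposed) == 1);

    // After normalising the whole text once, both occurrences match.
    assert(normalize!NFC(text).count(precomposed) == 2);
}
```

Normalising to a composed form only helps when a precomposed code point exists; for combinations without one, the grapheme level discussed in the post is still needed.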
On Sun, 21 Nov 2010 19:21:27 -0600 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 11/21/10 7:00 PM, Jonathan M Davis wrote:
>> Actually, the better implementation would probably be to provide wrapper ranges for ranges of char and wchar so that you could access them as ranges of dchar. Doing otherwise would make it so that you couldn't access them directly as ranges of char or wchar, which would be limiting, and since it's likely that anyone actually wanting strings would just use strings, there's a good chance that in the majority of cases, what you'd want would really be a range of char or wchar anyway. Regardless, it's quite possible to access containers of char or wchar as ranges of dchar if you need to.
> I agree except for the majority of cases part. In fact the original design of range interfaces for char[] and wchar[] was to require byDchar() to get a bidirectional interface over the arrays of code units.
> That design, with which I experimented for a while, had two drawbacks:
> 1. It had the default reversed, i.e. most often you want to regard a char[] or a wchar[] as a range of code points, not as an array of code units.
> 2. It had the unpleasant effect that most algorithms in std.algorithm and beyond did the wrong thing by default, and the right thing only if you wrapped everything with byDchar().

I find these points most relevant. The issue is that *char[] actually are the mutable variants of *string. So that one needs to use them as textual types, meaning as strings of code points. Thus, I do not think the most common case is to have them iterated as strings of code _units_.

> The second iteration of the design, which is currently in use, was to define in std.range the primitives such that char[] and wchar[] offer by default the bidirectional range interface. I have gone through all algorithms in std.algorithm and std.string and noticed with amazed satisfaction that they most always did the right thing, and that I could tweak the few that didn't to complete a satisfactory implementation. (indexOf has slipped through the cracks.) I think that experience with the current design is speaking in its favor.

This makes the safe and common case default.

> One thing could be done to drive the point home: a function byCodeUnit() could be added that actually does iterate a char[] or a wchar[] one code unit at a time (and consequently restores their behavior as T[]). That function could be simply a cast to ubyte[]/ushort[], or it could introduce a random-access range.

For sure, this would be useful in the cases where one really needs code units. And make it clear that default iteration is _not_ over code units (thus avoiding part of the criticism).

Maybe an alternative would be (or have been) to have complete lexical distinction between (text) strings and true char arrays, that applies whatever constness or mutability is wished.
* char[] is always an array of plain unsigned ints
* mutable strings can be defined using mutable(string) for text processing, still being indexed and iterated as strings of code _points_.

> Andrei

Denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Nov 22 2010
On 21/11/2010 18:23, Andrei Alexandrescu wrote:
> I have often reflected whether I'd do things differently if I could go back in time and join Walter when he invented D's strings. I might have done one or two things differently, but the gain would be marginal at best. In fact, it's not impossible the balance of things could have been hurt. Between speed, simplicity, effectiveness, abstraction, access to representation, and economy of means, D's strings are the best compromise out there that I know of, bar none by a wide margin.

Those things you would have done differently, would any of them impact this particular issue?

-- 
Bruno Medeiros - Software Engineer
Nov 24 2010
On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
> D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges.

It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way.

There is no easy notion of "character" in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as "the default character unit" is repeating the same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character?

Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. In other words:

    string str = "hello";
    foreach (cu; str) {}             // code unit iteration
    foreach (cp; str.codePoints) {}  // code point iteration, bidirectional range of dchar
    foreach (gr; str.graphemes) {}   // grapheme iteration, bidirectional range of graphemes

That'd be much cleaner than having some sort of hybrid code-point/code-unit array/range.

Here's a nice reference about unicode graphemes, word segmentation, and related algorithms.
<http://unicode.org/reports/tr29/>

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Nov 21 2010
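The three explicit levels proposed in the post above can be approximated with ranges that later Phobos releases provide. The sketch below is an assumption-laden illustration: str.codePoints and str.graphemes from the post are hypothetical names, and std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme are used in their place.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // "cafe" followed by U+0301 (combining acute accent).
    string str = "cafe\u0301";

    // Code units: four ASCII bytes plus two bytes for U+0301.
    assert(str.byCodeUnit.walkLength == 6);

    // Code points: the accent is a separate code point.
    assert(str.byDchar.walkLength == 5);

    // Graphemes: 'e' and the accent form one user-perceived character.
    assert(str.byGrapheme.walkLength == 4);
}
```

With all three spelled out at the call site, the reader of the code can see which notion of "character" each loop is using.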
On 11/21/10 7:11 PM, Michel Fortin wrote:
> On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
>> D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges.
> It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way.

I agree.

> There is no easy notion of "character" in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as "the default character unit" is repeating same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character?

I understand the concern, and that's why I strongly support formal abstractions that are supported by, but largely independent from, representations. If graphemes are to be modeled, D is in better shape than other languages. What we need to do is define a range byGrapheme() that accepts char[], wchar[], or dchar[].

> Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. In other words:
>
>     string str = "hello";
>     foreach (cu; str) {}             // code unit iteration
>     foreach (cp; str.codePoints) {}  // code point iteration, bidirectional range of dchar
>     foreach (gr; str.graphemes) {}   // grapheme iteration, bidirectional range of graphemes
>
> That'd be much cleaner than having some sort of hybrid code-point/code-unit array/range.
> Here's a nice reference about unicode graphemes, word segmentation, and related algorithms.
> <http://unicode.org/reports/tr29/>

I agree except for the fact that in my experience you want to iterate over code points much more often than over code units. Iterating by code unit by default is almost always wrong. That's why D's strings offer the bidirectional interface by default. I have reasons to believe it was a good decision.

Andrei
Nov 21 2010
On Sunday 21 November 2010 17:27:06 Andrei Alexandrescu wrote:
> On 11/21/10 7:11 PM, Michel Fortin wrote:
>> On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
>>> D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges.
>> It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way.
> I agree.
>> There is no easy notion of "character" in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as "the default character unit" is repeating same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character?
> I understand the concern, and that's why I strongly support formal abstractions that are supported by, but largely independent from, representations. If graphemes are to be modeled, D is in better shape than other languages. What we need to do is define a range byGrapheme() that accepts char[], wchar[], or dchar[].
>> Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. In other words:

We could always define an abstract Character (or whatever you want to call it) which holds a character - regardless of whether it uses a grapheme or not - and make it relatively easy to iterate over Characters rather than dchars. It would be nice if they abolished graphemes though... It is quite possible that while D's handling of unicode is a huge improvement over other languages, by treating dchar as a full character essentially everywhere, we're opening ourselves up for a variety of bugs caused by graphemes which will be subtle and hard to find. But I'm not sure what the correct solution to that is.

- Jonathan M Davis
Nov 21 2010
Andrei Alexandrescu wrote: ...
> I agree except for the fact that in my experience you want to iterate over code points much more often than over code units. Iterating by code unit by default is almost always wrong. That's why D's strings offer the bidirectional interface by default. I have reasons to believe it was a good decision.
>
> Andrei

Is there a plan to make std.string and std.algorithm more compatible with this view? Nearly all algorithms in std.string work with slices or substrings rather than code units or code points. I sometimes found it hard to mix and match that approach with the API that std.algorithm offers. Maybe I'm missing something.
Nov 21 2010
On Sun, 21 Nov 2010 19:27:06 -0600 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
>> There is no easy notion of "character" in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as "the default character unit" is repeating same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character?
> I understand the concern, and that's why I strongly support formal abstractions that are supported by, but largely independent from, representations. If graphemes are to be modeled, D is in better shape than other languages. What we need to do is define a range byGrapheme() that accepts char[], wchar[], or dchar[].

Sure, D helps a lot. I agree with abstraction levels independent of internal representation in the general case (I think it's one major aspect and advantage of ranges, isn't it?). But it yields a huge efficiency issue in this very case. Namely that if one deals with a text at the level of graphemes while the representation is a string of code points, then every little routine has to reconstruct the graphemes on the fly. For instance, indexing 3 times will do 3 times the job of constructing a string of graphemes (up to the given indices).

Thus, when one has to do text processing, even of the simplest kind, it is necessary to use a dedicated type (or any kind of tool using a high-level representation). (Analogous to the need of first decoding code units into code points, only once, before dealing with code points -- but at a higher level.)

See also answer to Michel's post.

Denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Nov 22 2010
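The dedicated type argued for in the post above, one that pays for grapheme segmentation once instead of in every routine, might be sketched as follows. Everything here is an assumption for illustration: the name Text, its three members, and the use of std.uni.byGrapheme, Grapheme and normalize from a later Phobos release. It is nowhere near a complete text type.

```d
import std.algorithm : equal;
import std.array : array;
import std.conv : to;
import std.uni : byGrapheme, Grapheme, NFC, normalize;

// Hypothetical text type: normalises and segments the source once,
// then answers length/index/count queries on the stored graphemes.
struct Text
{
    private Grapheme[] elements;

    this(string source)
    {
        // One-time cost: normalisation plus grapheme segmentation.
        elements = normalize!NFC(source).byGrapheme.array;
    }

    @property size_t length() { return elements.length; }

    // Returns the i-th user-perceived character as a UTF-8 string.
    string opIndex(size_t i)
    {
        return elements[i][].array.to!string;
    }

    // Counts how often a single grapheme occurs in the text.
    size_t count(string grapheme)
    {
        auto needle = normalize!NFC(grapheme);
        size_t hits;
        foreach (ref element; elements)
        {
            if (equal(element[], needle))
                ++hits;
        }
        return hits;
    }
}

void main()
{
    // The source spells the second character in decomposed form.
    auto t = Text("ba\u0308r");
    assert(t.length == 3);
    assert(t[1] == "\u00E4");
    assert(t.count("a\u0308") == 1);
    assert(t.count("a") == 0);
}
```

Indexing is then a plain array access, so indexing three times no longer repeats the segmentation work; the trade-off is the memory and construction cost the thread calls overkill for the common case.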
On Sun, 21 Nov 2010 17:56:15 -0800 Jonathan M Davis <jmdavisProg gmx.com> wrote:
> We could always define an abstract Character (or whatever you want to call it) which holds a character - regardless of whether it uses a grapheme or not - and make it relatively easy to iterate over Characters rather than dchars.

This is not a solution, it would force constructing graphemes for each routine applied to a given text. You need to do it only once.

> It would be nice if they abolished graphemes though...

What is the alternative? For a given set of base characters (say ascii letters, cardinal NC) and a given set of "combining marks" (say latin diacritics, cardinal ND), what is the number of combinations? If I'm right, the answer is NC * 2^ND (in other words, an astronomical number). We would need thousands of bits for each code point ;-)
Also, we cannot predict the future. Think that for each new diacritic, you must double the number of precomposed characters, simply by adding this diacritic to every already existing combination. We cannot know what would be needed in a few years.

The error UCS & Unicode have made is the opposite one. To silently pretend that code points represent characters (I cannot believe that choosing the term "abstract character" to denote what is coded by a code point was innocent. It can only introduce confusion). They should have said that a code point represents, say, an "abstract mark". And made clear that a character, meaning a logical text element, is represented by a mini-array of code units (what I call a code stack, see other post for why).
This would have avoided confusion from the start, and encouraged programmers to design proper, correct, text representations -- at least for text processing. Now, and only because of that, everybody seems to discover the consequent issues 20 years too late. Even in unicode circles: I have tried to evoke this on the Unicode mailing list several times in past years, with almost no echo at all. People do not *want* to hear of it.

I think this has been a deliberate marketing choice for the UCS/Unicode standard. Probably they were afraid of reactions from programming communities if they had made clear that dealing with universal text required adding *2* levels of abstraction over plain ASCII. Another error was to promote using code units for space-efficiency. Else, there would be only 1 new level.

> It is quite possible that while D's handling of unicode is a huge improvement over other languages, by treating dchar as a full character essentially everywhere, we're opening ourselves up for a variety of bugs caused by graphemes which will be subtle and hard to find.
> But I'm not sure what the correct solution to that is.

There is one general solution as long as efficiency is considered irrelevant: a text is represented as a string of graphemes. There is no solution with efficiency because the cases for which this is overkill are the most common ones (as of now, but this will change with the growth of computing in Asian countries).

Denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Nov 22 2010
On Sun, 21 Nov 2010 20:11:23 -0500 Michel Fortin <michel.fortin michelf.com> wrote:

> On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
>
> > D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges.
>
> It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way.

True.

> There is no easy notion of "character" in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as "the default character unit" is repeating the same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character?

True, but only partially. The error of using utf16 to represent code points is far less serious in practice, because code points > ffff have about no chance to ever be present in any text a programmer will ever have to deal with. (This error was in fact initially caused by the standard people who first thought ffff was enough, so that 16-bit tools and encodings were created and used.)

But I fully agree with "what's the point of working with the intermediary representation (code points) when it doesn't represent a character?". *This* is wrong and may cause much damage. Actually, it means apps simply do not work correctly; a logical error; and one that can hardly be automatically detected.

A side-issue is that in present times we mostly deal with source texts for which there exist precomposed characters, _and_ text-producing tools usually use them. So programmers who ignore the issue may think they are right. But both of those facts may soon be wrong.

> Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. In other words:
>
>     string str = "hello";
>     foreach (cu; str) {} // code unit iteration
>     foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar
>     foreach (gr; str.graphemes) {} // grapheme iteration, bidirectional range of graphemes
>
> That'd be much cleaner than having some sort of hybrid code-point/code-unit array/range.

Yop, but the ability to iterate over graphemes, while the internal representation is a string of codes, or code units, is *not* what we need:

    text.count(c);

would have to construct graphemes on the fly on the whole string. Every text processing routine performed on a given text will have to do it on all or part of the text (indexing for instance would do it only up to the given index). Meaning every routine would have to do the job of constructing a string of graphemes (and normalising it) that should be done only once. Hope I'm clear.

That is the reason why we need a proper Text type as a string of graphemes. The same abstraction offered by dchar (from code units to code points) is needed at a higher level (from code points to graphemes). Each element would be what I call a "stack", a mini-array of dchars. Then, we can deal with it like with a plain ASCII or Latin-1 text.

    c c c c c c c c c    dstring = dchar[] --> coded string
    c c c c c c c c c    text = stack[] --> logical string

> Here's a nice reference about unicode graphemes, word segmentation, and related algorithms.
> <http://unicode.org/reports/tr29/>

I once implemented the algorithm used to construct graphemes out of code points, as a base for a grapheme-level Text type, with all common text processing routines (*) (in Lua). I plan to do this in & for D in a short while. As said, it should be simpler thanks to D's true string types, which already abstract from lower-level issues.

(*) Actually, once one has a string of <graphemes/codes/code-units>, routines are the same whatever the kind of element. There could be a generic version in std.string.

-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Nov 22 2010
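For readers following the three levels discussed above (code units, code points, graphemes), here is a minimal sketch of what explicit iteration at each level can look like. It assumes a present-day Phobos: `std.utf.byCodeUnit`, `std.utf.byDchar` and `std.uni.byGrapheme` were, as far as I know, added after this thread and are not the `codePoints`/`graphemes` properties Michel proposes. The helper names are hypothetical.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

/// Number of UTF-8 code units (bytes) in `s`.
size_t codeUnitCount(string s) { return s.byCodeUnit.walkLength; }

/// Number of code points (dchar values) in `s`.
size_t codePointCount(string s) { return s.byDchar.walkLength; }

/// Number of grapheme clusters (roughly, user-perceived characters) in `s`.
size_t graphemeCount(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: rendered as a single "é".
    string s = "e\u0301";
    // Three different answers to "how long is this text?"
    writeln(codeUnitCount(s), " ", codePointCount(s), " ", graphemeCount(s)); // prints "3 2 1"
}
```

Each wrapper is lazy, so the grapheme count is recomputed on every call; that is exactly the cost spir objects to and Michel considers acceptable for one-pass iteration.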
On 2010-11-22 06:57:36 -0500, spir <denis.spir gmail.com> said:

> (*) Actually, once one has a string of <graphemes/codes/code-units>, routines are the same whatever the kind of element. There could be a generic version in std.string.

Just to add to the complexity: graphemes aren't always equivalent to user-perceived characters either. Ligatures can contain more than one user-perceived character. If you're looking for the substring "flourish" in a string, should it fail to match when it encounters "ﬂourish" just because of the "ﬂ" (fl) ligature? On most Mac applications it matches both thanks to sensible defaults in NSString's search and comparison algorithms.

So perhaps we need yet another layer over graphemes to represent user-perceived characters.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Nov 22 2010
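The ligature case can be tried out with compatibility normalization. The sketch below assumes a present-day Phobos with `std.uni.normalize` and the `NFKC` form; `containsCompat` is a hypothetical helper, and it is not a claim about how NSString implements its search.

```d
import std.algorithm.searching : canFind;
import std.stdio : writeln;
import std.uni : NFKC, normalize;

/// Substring test after mapping both strings to NFKC, which replaces
/// compatibility characters such as U+FB02 (the "fl" ligature) by "fl".
bool containsCompat(string haystack, string needle)
{
    return normalize!NFKC(haystack).canFind(normalize!NFKC(needle));
}

void main()
{
    string text = "a \uFB02ourish of trumpets"; // written with the ligature
    writeln(text.canFind("flourish"));        // plain search misses the ligature
    writeln(containsCompat(text, "flourish")); // search after NFKC finds it
}
```

Compatibility normalization loses information (the ligature cannot be recovered afterwards), which is one reason to apply it to a search key rather than to the stored text.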
On Mon, 22 Nov 2010 07:34:15 -0500 Michel Fortin <michel.fortin michelf.com> wrote:

> Just to add to the complexity: graphemes aren't always equivalent to user-perceived characters either. Ligatures can contain more than one user-perceived character. If you're looking for the substring "flourish" in a string, should it fail to match when it encounters "ﬂourish" just because of the "ﬂ" (fl) ligature? On most Mac applications it matches both thanks to sensible defaults in NSString's search and comparison algorithms.

That's true. I guess you're thinking of the distinction between NFD/NFC "canonical forms" and NFKD/NFKC ones (so-called "compatibility").

> So perhaps we need yet another layer over graphemes to represent user-perceived characters.

In my view, this is not the responsibility of a general-purpose tool. I guess, but may be wrong, we are clearly entering the field of app logic and semantics. These are for me _not_ general-purpose points (but builtin types & libraries often offer clearly non-general routines like ones dealing with casing, or even less general: the set of ASCII letters). These issues would have to be dealt with either by apps or by domain-specific libraries.

I find it wrong that Unicode even simply provides standard canonical forms for them (but fortunately common libs do not implement them AFAIK).

denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Nov 22 2010
On 2010-11-22 08:57:39 -0500, spir <denis.spir gmail.com> said:

> On Mon, 22 Nov 2010 07:34:15 -0500 Michel Fortin <michel.fortin michelf.com> wrote:
>
> > Just to add to the complexity: graphemes aren't always equivalent to user-perceived characters either. Ligatures can contain more than one user-perceived character. If you're looking for the substring "flourish" in a string, should it fail to match when it encounters "ﬂourish" just because of the "ﬂ" (fl) ligature? On most Mac applications it matches both thanks to sensible defaults in NSString's search and comparison algorithms.
>
> That's true. I guess you're thinking of the distinction between NFD/NFC "canonical forms" and NFKD/NFKC ones (so-called "compatibility").
>
> > So perhaps we need yet another layer over graphemes to represent user-perceived characters.
>
> In my view, this is not the responsibility of a general-purpose tool. I guess, but may be wrong, we are clearly entering the field of app logic and semantics. These are for me _not_ general-purpose points (but builtin types & libraries often offer clearly non-general routines like ones dealing with casing, or even less general: the set of ASCII letters). These issues would have to be dealt with either by apps or by domain-specific libraries.

Is searching for a word in a text file less general purpose than searching for a specific combination of graphemes forming that word? That the implementation to get it right is quite complex doesn't make a tool less general purpose.

The sole reason searching works this way in most Mac OS X (and iOS) applications is that Apple implemented it at the core of its string type and made it the default way of searching substrings and comparing strings. It's dubious whether even half of Mac applications would have implemented the thing correctly otherwise.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Nov 22 2010
On 2010-11-22 06:57:36 -0500, spir <denis.spir gmail.com> said:

> On Sun, 21 Nov 2010 20:11:23 -0500 Michel Fortin <michel.fortin michelf.com> wrote:
>
> > On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
> >
> > > D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges.
> >
> > It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way.
>
> True.
>
> > There is no easy notion of "character" in unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as "the default character unit" is repeating the same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character?
>
> True, but only partially. The error of using utf16 to represent code points is far less serious in practice, because code points > ffff have about no chance to ever be present in any text a programmer will ever have to deal with. (This error was in fact initially caused by the standard people who first thought ffff was enough, so that 16-bit tools and encodings were created and used.)
>
> But I fully agree with "what's the point of working with the intermediary representation (code points) when it doesn't represent a character?". *This* is wrong and may cause much damage. Actually, it means apps simply do not work correctly; a logical error; and one that can hardly be automatically detected.
>
> A side-issue is that in present times we mostly deal with source texts for which there exist precomposed characters, _and_ text-producing tools usually use them. So programmers who ignore the issue may think they are right. But both of those facts may soon be wrong.
>
> > Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. In other words:
> >
> >     string str = "hello";
> >     foreach (cu; str) {} // code unit iteration
> >     foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar
> >     foreach (gr; str.graphemes) {} // grapheme iteration, bidirectional range of graphemes
> >
> > That'd be much cleaner than having some sort of hybrid code-point/code-unit array/range.
>
> Yop, but the ability to iterate over graphemes, while the internal representation is a string of codes, or code units, is *not* what we need:
>
>     text.count(c);
>
> would have to construct graphemes on the fly on the whole string.

I agree there might be a use case for a special data type allowing fast random access to graphemes and able to retain the precise count of graphemes. But if what you do only requires iterating over all graphemes, a wrapper range that converts to graphemes on the fly might be less overhead than building a separate data structure.

In fact, this separate data structure to hold graphemes is probably going to require more memory, and more memory will fit worse in the processor's cache. Compare the cost in performance of a cache miss versus one or two comparisons to check if the next code point is part of the same grapheme and you might actually find the version that iterates by converting code points to graphemes on the fly faster for long strings.

As long as you don't need random access to the graphemes, I don't think you need a separate data structure.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Nov 22 2010
On Mon, 22 Nov 2010 08:24:33 -0500 Michel Fortin <michel.fortin michelf.com> wrote:

> I agree there might be a use case for a special data type allowing fast random access to graphemes and able to retain the precise count of graphemes. But if what you do only requires iterating over all graphemes, a wrapper range that converts to graphemes on the fly might be less overhead than building a separate data structure.

It's true as long as you can assert each string is iterated at most once. But the job of constructing an instance of "UText" (say, grapheme string) should be exactly the same as what each iteration has to do on the fly. Or do I miss a point?

Also, it's not only about indexing or iterating. Simply finding/counting/replacing given characters (I mean in the sense of graphemes) or slices requires the string to be not only grouped, but also normalised (else how is the routine supposed to recognise the same char in another form?). That is a heavy job as well, one you don't want to do twice. Grouping makes normalising easier (you only cope with a mini-array of codes at once, already known to represent a whole char) (and sorting codes in stacks is easier as well).

Finally, to avoid reprocessing already processed text, I had the idea of "utf33" ;-) This is utf32 plus the guarantee that character forms are already normalised and sorted.

denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Nov 22 2010
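To illustrate the normalisation point (recognising "the same char in another form"), here is a hedged sketch that normalises once and then counts a grapheme. It relies on `std.uni.normalize` and `std.uni.byGrapheme` from a present-day Phobos, neither of which existed when this thread was written; `countGrapheme` is a hypothetical name, not a Phobos function.

```d
import std.algorithm.comparison : equal;
import std.algorithm.searching : count;
import std.stdio : writeln;
import std.uni : byGrapheme, NFC, normalize;

/// Counts the graphemes of `text` that are canonically equivalent to `needle`.
/// Both inputs are brought to NFC first, so "é" written as U+00E9 and as
/// 'e' + U+0301 compare equal.
size_t countGrapheme(string text, string needle)
{
    auto t = normalize!NFC(text);
    auto n = normalize!NFC(needle);
    // g[] is the grapheme's code points as a range of dchar.
    return t.byGrapheme.count!(g => g[].equal(n));
}

void main()
{
    // "été" with the first é decomposed and the last one precomposed.
    string text = "e\u0301t\u00E9";
    writeln(countGrapheme(text, "\u00E9")); // prints 2
}
```

Here the normalisation and grouping are redone on every call. A Text type along the lines spir describes would store the normalised, grouped form and pay that cost once.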
On 2010-11-22 09:09:41 -0500, spir <denis.spir gmail.com> said:

> On Mon, 22 Nov 2010 08:24:33 -0500 Michel Fortin <michel.fortin michelf.com> wrote:
>
> > I agree there might be a use case for a special data type allowing fast random access to graphemes and able to retain the precise count of graphemes. But if what you do only requires iterating over all graphemes, a wrapper range that converts to graphemes on the fly might be less overhead than building a separate data structure.
>
> It's true as long as you can assert each string is iterated at most once. But the job of constructing an instance of "UText" (say, grapheme string) should be exactly the same as what each iteration has to do on the fly. Or do I miss a point?

I think you missed my point. My point was that decoding on the fly while iterating might be as fast or maybe faster in most cases (which don't include grapheme clusters) than if you had already predecoded the graphemes and stored them in a grapheme-oriented data structure. I say that mostly because the variable-length nature of a grapheme makes it hard to store one efficiently.

That's my opinion, but debating that is rather pointless in the absence of an implementation of each to compare.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Nov 22 2010
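One possible middle ground between the two positions, sketched under the assumption of a present-day Phobos (`std.uni.graphemeStride` postdates this thread): segment the text once into an array of slices of the original string, one slice per grapheme. `splitGraphemes` is a hypothetical helper; it gives random access by grapheme index without copying the text, at the cost of one slice (pointer plus length) per grapheme. It says nothing about which approach is faster, which, as Michel notes, would need measuring.

```d
import std.stdio : writeln;
import std.uni : graphemeStride;

/// Splits `s` into grapheme clusters, each returned as a slice of `s`.
string[] splitGraphemes(string s)
{
    string[] parts;
    for (size_t i = 0; i < s.length; )
    {
        // Number of code units taken by the grapheme starting at index i.
        immutable len = graphemeStride(s, i);
        parts ~= s[i .. i + len];
        i += len;
    }
    return parts;
}

void main()
{
    auto gs = splitGraphemes("ne\u0301e"); // "née" with a decomposed é
    writeln(gs.length);                   // grapheme count, known after one pass
    writeln(gs[1] == "e\u0301");          // direct access to the second grapheme
}
```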
In case it was not clear, this is what I want to achieve: »tmp = tmp[ 0 .. i ] ~ tmp[ ( i + 1 ) .. $ ];«
Oct 16 2010
On 10/16/2010 01:29 PM, klickverbot wrote:

> Hello all,
>
> I decided to have a go at solving some easy programming puzzles with D2/Phobos to see how Phobos, especially ranges and std.algorithm, work out in simple real-world use cases (the puzzle in question is from hacker.org, by the way).
>
> The following code is a direct translation of a simple problem description to D (it is horrible from a performance point of view, but that's certainly no issue here).
>
> ---
> import std.algorithm;
> import std.conv;
> import std.stdio;
>
> // The original input string is longer, but irrelevant to this post.
> enum INPUT = "93752xxx746x27x1754xx90x93xxxxx238x44x75xx087509";
>
> void main() {
>    uint sum;
>    auto tmp = INPUT.dup;
>    size_t i;
>    while ( i < tmp.length ) {
>       char c = tmp[ i ];
>       if ( c == 'x' ) {
>          tmp = remove( tmp, i );
>          i -= 2;
>       } else {
>          sum += to!uint( [ c ] );
>          ++i;
>       }
>    }
>    writeln( sum );
> }
> ---
>
> Quite contrary to what you would expect, the call to »remove« fails to compile with the following error messages: »std/algorithm.d(4287): Error: front(src) is not an lvalue« and »std/algorithm.d(4287): Error: front(tgt) is not an lvalue«.
>
> I am intentionally posting this to this NG and not to d.…D.learn, since this is a quite gross violation of the principle of least surprise in my eyes.
>
> If this isn't a bug, a better error message via a template constraint or a static assert would be something worth looking at in my opinion, since one would probably expect this to compile and not to fail within Phobos code.
>
> David

Thanks for the input. This is not a bug, it's what I believe to be a very intentional feature: strings are not ordinary arrays because characters have variable length. As such, assigning to "the first character in a string" is not allowed because the assignment might mess up the next character.

It's a good test bed. Simply replacing this:

    auto tmp = INPUT.dup;

with this:

    auto tmp = cast(ubyte[]) INPUT.dup;

makes the program work and print 322 (you also must include std.conv).

How do you all believe we could improve this example?

1. remove() could be specialized for char[] and wchar[] because it can be made to work with some effort and is a worthwhile algorithm for strings.

2. to!(ubyte[]) should work for char[] by making a copy and casting it to ubyte[]. So this should have worked:

    auto tmp = to!(ubyte[])(INPUT);

to! is better than cast because it always does the right thing and never undermines type safety.

Whadday'all think?

Andrei
Oct 16 2010
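For reference, a self-contained sketch of the byte-array variant discussed above. It assumes a present-day Phobos and uses `std.string.representation` to obtain the `ubyte[]` view instead of the cast; that function is, to my knowledge, a later addition and not the `to!(ubyte[])` overload Andrei proposes. `puzzleSum` is a hypothetical wrapper around the original loop.

```d
import std.algorithm.mutation : remove;
import std.stdio : writeln;
import std.string : representation;

enum INPUT = "93752xxx746x27x1754xx90x93xxxxx238x44x75xx087509";

/// Runs the puzzle loop from the original post on the bytes of `input`.
/// Only meaningful for ASCII input, where one byte is one character.
uint puzzleSum(string input)
{
    ubyte[] tmp = input.dup.representation; // char[] viewed as ubyte[]
    uint sum;
    size_t i;
    while (i < tmp.length)
    {
        if (tmp[i] == 'x')
        {
            tmp = tmp.remove(i); // compiles: ubyte[] has assignable elements
            i -= 2;              // assumes at least two digits precede each 'x'
        }
        else
        {
            sum += tmp[i] - '0';
            ++i;
        }
    }
    return sum;
}

void main()
{
    writeln(puzzleSum(INPUT)); // prints 322, the value quoted in the post above
}
```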
On 10/16/10 9:47 PM, Andrei Alexandrescu wrote:

> Thanks for the input. This is not a bug, it's what I believe to be a very intentional feature: strings are not ordinary arrays because characters have variable length. As such, assigning to "the first character in a string" is not allowed because the assignment might mess up the next character.

I see that there is a problem due to the difference between code units and code points, but why does the following work then?

    tmp = tmp[ 0 .. i ] ~ tmp[ ( i + 1 ) .. $ ];

This is equivalent to my (naïve?) mental model of remove(), and thus it seems very counter-intuitive to me that one works, but the other doesn't.
Oct 16 2010
On 10/16/2010 09:56 PM, klickverbot wrote:

> On 10/16/10 9:47 PM, Andrei Alexandrescu wrote:
>
> > Thanks for the input. This is not a bug, it's what I believe to be a very intentional feature: strings are not ordinary arrays because characters have variable length. As such, assigning to "the first character in a string" is not allowed because the assignment might mess up the next character.
>
> I see that there is a problem due to the difference between code units and code points, but why does the following work then?
>
>     tmp = tmp[ 0 .. i ] ~ tmp[ ( i + 1 ) .. $ ];
>
> This is equivalent to my (naïve?) mental model of remove(), and thus it seems very counter-intuitive to me that one works, but the other doesn't.

Try it with ä or ░ instead of x.
Oct 16 2010
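The suggestion above can be checked mechanically. This small sketch removes one code unit with the slicing expression from the thread and then asks `std.utf.validate` whether the result is still well-formed UTF-8; `isValidAfterCut` is a hypothetical helper.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

/// Drops the code unit at index `i` and reports whether the result is
/// still valid UTF-8.
bool isValidAfterCut(string s, size_t i)
{
    auto cut = s[0 .. i] ~ s[i + 1 .. $];
    try
    {
        validate(cut);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    writeln(isValidAfterCut("axb", 1));       // 'x' is one code unit: still valid
    writeln(isValidAfterCut("a\u00E4b", 1));  // cuts ä (two code units) in half
}
```

Slicing and concatenation work on code units and never look at character boundaries, which is why the expression compiles for `char[]` yet silently corrupts text containing multi-byte characters.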
On 10/16/2010 02:56 PM, klickverbot wrote:

> On 10/16/10 9:47 PM, Andrei Alexandrescu wrote:
>
> > Thanks for the input. This is not a bug, it's what I believe to be a very intentional feature: strings are not ordinary arrays because characters have variable length. As such, assigning to "the first character in a string" is not allowed because the assignment might mess up the next character.
>
> I see that there is a problem due to the difference between code units and code points, but why does the following work then?
>
>     tmp = tmp[ 0 .. i ] ~ tmp[ ( i + 1 ) .. $ ];
>
> This is equivalent to my (naïve?) mental model of remove(), and thus it seems very counter-intuitive to me that one works, but the other doesn't.

Strings are dual types. They have [] and .length but not with the semantics required by ranges. So formally they don't support isRandomAccessRange and hasLength.

Andrei
Oct 16 2010
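The "dual type" behaviour can be observed directly through the range traits. The checks below hold for Phobos with auto-decoding of narrow strings, which, as far as I know, is still how `char[]` is presented to range algorithms; they are a sketch of the current state, not a statement about the 2010 library.

```d
import std.range.primitives : ElementType, hasAssignableElements, hasLength,
    isBidirectionalRange, isRandomAccessRange;

// To range-based algorithms, char[] is a bidirectional range of dchar...
static assert(isBidirectionalRange!(char[]));
static assert(is(ElementType!(char[]) == dchar));

// ...with no random access, no length, and no assignable elements,
// which is why remove() rejects it.
static assert(!isRandomAccessRange!(char[]));
static assert(!hasLength!(char[]));
static assert(!hasAssignableElements!(char[]));

// ubyte[] is an ordinary random-access array with assignable elements.
static assert(isRandomAccessRange!(ubyte[]));
static assert(hasLength!(ubyte[]));
static assert(hasAssignableElements!(ubyte[]));

void main() {}
```

The language-level `tmp[i]` and `tmp.length` still work on `char[]`, counting code units; only the range view hides them.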
On 10/16/2010 02:56 PM, klickverbot wrote:

> On 10/16/10 9:47 PM, Andrei Alexandrescu wrote:
>
> > Thanks for the input. This is not a bug, it's what I believe to be a very intentional feature: strings are not ordinary arrays because characters have variable length. As such, assigning to "the first character in a string" is not allowed because the assignment might mess up the next character.
>
> I see that there is a problem due to the difference between code units and code points, but why does the following work then?
>
>     tmp = tmp[ 0 .. i ] ~ tmp[ ( i + 1 ) .. $ ];
>
> This is equivalent to my (naïve?) mental model of remove(), and thus it seems very counter-intuitive to me that one works, but the other doesn't.

To drive my point home: if you wanted to replace not 'x', but instead a multibyte character, your algorithm wouldn't work. It essentially assumes the string has one byte per character, and the needed cast to byte[] reflects that. If anything, I'd call this a success.

Andrei
Oct 16 2010
Bruno Medeiros Wrote:

> On 23/11/2010 18:15, foobar wrote:
>
> > It's simple, a mediocre language (Java) with mediocre libraries has orders of magnitude more success than C++ with its libs fine-tuned for performance. Why?
>
> Java has mediocre libraries?? Are you serious about that opinion?
>
> --
> Bruno Medeiros - Software Engineer

It all depends on the scale you use. If we equate programming with cooking, then using C++ is like trying to make a feast out of single atoms. Using Java would then be equivalent to buying at the supermarket. It's fine for most people. On this scale, the naked chef has his own farm with organic livestock and also a garden so he can get the best ingredients.
Nov 24 2010