digitalmars.D - string is rarely useful as a function argument
- Peter Alexander (17/17) Dec 28 2011 string is immutable(char)[]
- bearophile (5/12) Dec 28 2011 What are the Phobos functions that unnecessarily accept a string?
- Peter Alexander (7/17) Dec 28 2011 Any time you want to create a string without allocating memory.
- bearophile (15/20) Dec 28 2011 I have discussed a bit two or three times about this topic. In a post I ...
- Peter Alexander (3/23) Dec 28 2011 That only works when you allocate memory for the string, which is what I...
- Walter Bright (2/6) Dec 28 2011 Is the buffer ever going to be reused with a different string in it?
- Peter Alexander (12/19) Dec 28 2011 Possibly.
- Timon Gehr (5/25) Dec 28 2011 You are approximately saying (paraphrasing): "The question is whether a
- Peter Alexander (5/34) Dec 28 2011 No, I'm saying that people talk about animals more often than cows, so
- Walter Bright (4/21) Dec 28 2011 If such a change is made, then people will use const string when they me...
- Peter Alexander (4/13) Dec 28 2011 Then people should learn what const and immutable mean!
- Walter Bright (10/17) Dec 28 2011 People do what is convenient, and as endless experience shows, doing the...
- Andrei Alexandrescu (6/26) Dec 28 2011 Yes. Contrary to the OP, I don't think it's fair to dismiss a valid
- Walter Bright (7/10) Dec 28 2011 And as Bruce Eckel discovered, even the people who know better will deli...
- Andrei Alexandrescu (10/30) Dec 28 2011 Oh, one more thing - one good thing that could come out of this thread
- Robert Jacques (2/36) Dec 28 2011 Would slicing, i.e. s[i..j] still be valid? If so, what would be the rec...
- Andrei Alexandrescu (4/7) Dec 28 2011 find, findSplit etc. from std.algorithm, std.utf functions etc.
- Timon Gehr (3/10) Dec 28 2011 That does not do the right thing. It would look more like
- foobar (7/50) Dec 28 2011 That's a good idea which I wonder about its implementation
- Andrei Alexandrescu (3/4) Dec 28 2011 Implementation would entail a change in the compiler.
- Timon Gehr (5/9) Dec 28 2011 Special casing char[] and wchar[] in the language would be extremely
- foobar (5/10) Dec 28 2011 Why? D should be plenty powerful to implement this without
- Andrei Alexandrescu (3/13) Dec 28 2011 It's an awesome idea, but for an academic debate at best.
- foobar (7/25) Dec 28 2011 I don't follow you. You've suggested a change that I agree with.
- Andrei Alexandrescu (16/39) Dec 28 2011 If we have two facilities (string and e.g. String) we've lost. We'd need...
- Adam D. Ruppe (16/18) Dec 28 2011 Have you actually tried to do it? Thanks to alias this, the custom
- Walter Bright (3/7) Dec 28 2011 I've seen the damage done in C++ with multiple string types. Being able ...
- Adam D. Ruppe (32/34) Dec 28 2011 Note that I'm on your side here re strings, but you're
- Andrei Alexandrescu (5/12) Dec 28 2011 Nah, that still breaks a lotta code because people parameterize on T[],
- Adam D. Ruppe (13/15) Dec 29 2011 /* snip struct string */
- Andrei Alexandrescu (9/18) Dec 28 2011 This.
- Jakob Ovrum (13/35) Dec 28 2011 I don't think this is a problem you can solve without educating
- Jonathan M Davis (19/31) Dec 28 2011 Ultimately, the programmer _does_ need to understand unicode properly if...
- deadalnix (3/34) Dec 29 2011 That is the whole point of D IMO. I think we shouldn't let an ego
- Walter Bright (7/16) Dec 28 2011 I think this goes to, at some point, the language is no longer able to h...
- Walter Bright (6/11) Dec 28 2011 If that ever happens, I owe you a beer. Maybe two!
- Timon Gehr (3/18) Dec 29 2011 I fully agree. If I had to design an imperative programming language,
- Derek (10/12) Dec 29 2011 I'm not quite sure about that last sentence. I suspect that the better w...
- Sean Kelly (17/54) Dec 29 2011 Don't we already have String-like support with ranges? I'm not sure I u...
- Jonathan M Davis (16/18) Dec 29 2011 To avoid common misusage. It's way to easy to misuse the length property...
- Adam D. Ruppe (28/29) Dec 28 2011 I don't think I agree. Wouldn't something like this work?
- foobar (8/37) Dec 28 2011 My thinking exactly. Of course we can't put "@disable" right away
- Adam D. Ruppe (4/7) Dec 28 2011 I actually like strings just the way they are... but if
- Timon Gehr (3/48) Dec 28 2011 In what way would the proposed change improve encapsulation, and why
- foobar (8/13) Dec 28 2011 I'm not sure what are you asking here. Are you asking what are
- Timon Gehr (5/18) Dec 28 2011 I know the benefits of encapsulation and none of them applies here. The
- Timon Gehr (3/37) Dec 28 2011 Why? char and wchar are unicode code units, ubyte/ushort are unsigned
- Jonathan M Davis (29/31) Dec 28 2011 It's an issue of the correct usage being the easy path. As it stands, it...
- Timon Gehr (17/50) Dec 28 2011 I was educated enough not to make that mistake, because I read the
- foobar (26/42) Dec 28 2011 I agree that it's useful. It is however the incorrect abstraction
- Timon Gehr (11/54) Dec 28 2011 Well, if the alternative is slowly butchering the language I will be
- foobar (12/104) Dec 28 2011 From a pragmatic view point people can also continue programming
- Timon Gehr (13/96) Dec 29 2011 I disagree.
- bearophile (7/20) Dec 28 2011 We have discussed this topic some times in past, it's not an easy topic....
- Vladimir Panteleev (6/47) Dec 29 2011 I think it would be simpler to just make dstring the default
- Gor Gyolchanyan (13/57) Dec 29 2011 This a a great idea! In this case the default string will be a
- Walter Bright (3/10) Dec 29 2011 dstring consumes 4x the memory, and this can easily cause perf degradati...
- Gor Gyolchanyan (41/52) Dec 29 2011 What if the string converted itself from utf-8 to utf-32 back and
- Gor Gyolchanyan (10/64) Dec 29 2011 oops. I accidentally made a recursive call in the setter. scratch
- Andrei Alexandrescu (3/6) Dec 29 2011 memory == time
- Don (12/20) Dec 29 2011 If I understand this correctly, most others don't. Effectively, .rep
- Andrei Alexandrescu (6/27) Dec 29 2011 Yes, I mean "rep" as a short for "representation" but upon first sight
- Regan Heath (6/38) Dec 30 2011 +1 for this idea, however named.
- Joshua Reusch (11/42) Dec 30 2011 Maybe it could happen if we
- Timon Gehr (22/69) Dec 30 2011 Wrong.
- Jakob Ovrum (9/27) Dec 30 2011 I strongly agree with this. It would be nice to have everything
- Andrei Alexandrescu (4/5) Dec 30 2011 What we have now is adequate. The scheme I proposed is optimal.
- deadalnix (14/77) Dec 30 2011 ATOS origin was hacked because of bad management of unicode in string in...
- Timon Gehr (7/98) Dec 30 2011 I am not. I am just assuming that the proposed change does not help with...
- Chad J (16/24) Dec 31 2011 Tsk tsk. Missing the point.
- Timon Gehr (11/35) Dec 31 2011 Not at all. And I don't take anyone seriously who feels the need to 'Tsk...
- Chad J (15/63) Dec 31 2011 Well, you've certainly a right to it.
- deadalnix (5/26) Jan 01 2012 Well, if you write correct code, you don't need assertion. They will
- Timon Gehr (3/31) Jan 01 2012 You miss the point. Testing and assertions are part of how I write
- deadalnix (5/41) Jan 04 2012 So, to write correct code, you need to asume you'll write incorrect
- Timon Gehr (7/51) Jan 04 2012 You are free to believe whatever you want, but I think that strategy you...
- Timon Gehr (3/6) Jan 04 2012 Another major use of them is the checked documentation of assumptions,
- Walter Bright (6/7) Dec 30 2011 Consider your X macro implementation. Strip out the utf.stride code and ...
- Timon Gehr (3/10) Dec 30 2011 You are right, that obviously needs fixing. ☺
- Andrei Alexandrescu (8/15) Dec 30 2011 It's true for any encoding with the prefix property, such as Huffman.
- Timon Gehr (5/22) Dec 30 2011 auto raw(S)(S s) if(isNarrowString!S){
- Andrei Alexandrescu (4/31) Dec 30 2011 Almost there.
- Timon Gehr (2/34) Dec 30 2011 alias std.string.representation raw;
- Andrei Alexandrescu (7/8) Dec 30 2011 I meant your implementation is incomplete.
- Timon Gehr (21/29) Dec 30 2011 D strings are arrays. An array without .length and operator[] is close
- Don (6/26) Dec 31 2011 No, it isn't. That's the problem. char[] is not an array of char.
- Timon Gehr (6/36) Dec 31 2011 char[] is an array of char and the additional invariant is not enforced
- Don (9/38) Dec 31 2011 No, it isn't an ordinary array. For example with concatenation. char[]
- Timon Gehr (18/58) Jan 01 2012 Yes it will.
- Sean Kelly (10/19) Dec 31 2011 I'm not sure I understand what's wrong with length. Of all the times I ...
- Walter Bright (2/13) Dec 30 2011 Any other multibyte character encoding I've seen standardized for use in...
- Michel Fortin (30/36) Dec 30 2011 After reading most of the thread, it seems to me like you're
- Jonathan M Davis (19/22) Dec 30 2011 The problem is that what's more likely to happen in a lot of cases is th...
- Timon Gehr (9/31) Dec 30 2011 Then that is the fault of the guy who created the tests. At least that
- Walter Bright (12/15) Dec 30 2011 I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8
- Andrei Alexandrescu (10/18) Dec 30 2011 The lower frequency of bugs makes them that much more difficult to spot....
- Brad Anderson (19/40) Dec 30 2011 I don't know that Phobos would be an appropriate place for it but offeri...
- Walter Bright (11/22) Dec 31 2011 I'm not so sure it's quite the same. Java was designed before there were...
- kenji hara (3/35) Dec 31 2011 I fully agree with Walter. No need more wrapper for string.
- Andrei Alexandrescu (17/48) Dec 31 2011 Disagree. I mean simple they are, no contest. They could and should be
- Michel Fortin (41/54) Dec 31 2011 Perfect? At one time Java and other frameworks started to use UTF-16 as
- Andrei Alexandrescu (19/74) Dec 31 2011 I'm not sure how you concluded I drew such assumptions.
- Vladimir Panteleev (8/12) Dec 31 2011 According to my research[1], std.array.replace (which uses
- Michel Fortin (61/87) Dec 31 2011 1: Because treating UTF-8 strings as a range of code point encourage
- Andrei Alexandrescu (46/126) Dec 31 2011 That's sort of difficult to refute. Anyhow, I think it's great that
- Michel Fortin (34/105) Dec 31 2011 As I keep saying, if you handle combining code points at the range
- Andrei Alexandrescu (3/8) Dec 31 2011 You just found a bug!
- Timon Gehr (6/22) Dec 31 2011 There is nothing wrong with the scheme on the conceptual level (except
- Timon Gehr (2/12) Dec 31 2011 +1.
- Sean Kelly (8/23) Dec 31 2011 I don't know that Unicode expertise is really required here anyway. All...
- Andrei Alexandrescu (7/14) Dec 31 2011 Clearly this is a what-if debate. The best level of agreement we could
- Timon Gehr (2/16) Dec 31 2011 That would be great.
- Michel Fortin (24/28) Dec 31 2011 It's not bytes vs. characters, it's code units vs. code points vs. user
- Sean Kelly (20/41) Dec 31 2011 Sorry, I was simplifying. The distinction I was trying to make was betwe...
- bearophile (5/6) Dec 31 2011 I don't know if we need, but I agree those things are an improvement ove...
- Piotr Szturmaj (19/23) Dec 31 2011 +1
- Chad J (53/141) Dec 31 2011 *sigh*, FINE. Code units and /code points/ would be the same.
- Timon Gehr (31/172) Dec 31 2011 int[]
- Chad J (62/285) Dec 31 2011 I'll do one better and ultra relax:
- Timon Gehr (41/326) Dec 31 2011 It is imo already mostly a non-problem, but YMMV:
- Chad J (35/374) Dec 31 2011 Meh, I'd still prefer it be an array of UTF-8 code /points/ represented
- a (6/8) Jan 01 2012 By saying you want an array of code points you already define
- Timon Gehr (3/12) Jan 01 2012 That actually looks like a bug that might happen in real world code.
- Chad J (17/33) Jan 01 2012 In my mind it's defined something like this:
- Timon Gehr (4/37) Jan 01 2012 I think the main issue here is that char implicitly converts to dchar:
- Chad J (13/56) Jan 01 2012 I agree.
- Timon Gehr (6/62) Jan 01 2012 I think the conversion char -> dchar should just require an explicit
- Chad J (33/104) Jan 01 2012 What of valid transfers of ASCII characters into dchar?
- Timon Gehr (4/108) Jan 01 2012 That is an interesting point of view. Your proposal would therefore be
- Chad J (23/143) Jan 01 2012 I just ran the example and wow, x didn't type-infer to dchar like I
- Artur Skawina (4/19) Dec 28 2011 eg things like std.demangle? (which wraps core.demangle and that one acc...
- Gor Gyolchanyan (7/22) Dec 28 2011 I agree, the string parameters are indeed irritating, but changing the
- mta`chrono (15/15) Dec 28 2011 I understand your intention. It was one of the main irritations when I
- Ali Çehreli (3/20) Dec 28 2011 Agreed. I've talked about this in D.learn a number of times myself.
- Ali Çehreli (6/7) Dec 28 2011 After seeing others' comments that focus more on the alias, I need to
- Andrei Alexandrescu (20/37) Dec 28 2011 I'm afraid you're wrong here. The current setup is very good, and much
- Peter Alexander (10/24) Dec 28 2011 I don't follow your argument. You've said (paraphrasing) "If a function
- Andrei Alexandrescu (8/35) Dec 28 2011 I'm saying (paraphrasing) "X is modularly bankrupt and unsafe, and Y is
- Jakob Ovrum (6/10) Dec 28 2011 Also, 'in char[]', which is conceptually much safer, isn't that
- Jonathan M Davis (9/14) Dec 28 2011 in char[] is _not_ safer than immutable(char)[]. In fact it's _less_ saf...
- Jakob Ovrum (3/10) Dec 28 2011 I didn't say it was. Please read more closely.
- Jonathan M Davis (13/36) Dec 28 2011 Agreed. And for a number of functions, taking const(char)[] would be wor...
- deadalnix (3/31) Dec 29 2011 Is inout a solution for the standard lib here ?
- Jonathan M Davis (21/30) Dec 29 2011 ith
- Walter Bright (8/10) Dec 28 2011 I have a very different experience with strings. I can't even remember a...
- Andrei Alexandrescu (4/15) Dec 28 2011 I remember the day at Kahili we figured immutable(char)[] will just work...
- Timon Gehr (3/20) Dec 28 2011 I agree. But I am confused by the fact that you are suggesting it
- Peter Alexander (8/19) Dec 28 2011 We can disagree on this, but I think the fact that Phobos rarely uses
- Sean Kelly (12/19) Dec 28 2011 Most common to me buffer reuse. I'll read a line of a file into a buffer...
- Dejan Lekic (3/3) Dec 28 2011 Peter, having string as immutable(char)[] was perhaps one of the
- Gor Gyolchanyan (8/11) Dec 28 2011 Having a mutable string is a bad idea also because it's mutability is
- so (8/25) Dec 28 2011 As you said string is not a structure but an alias.
- mta`chrono (11/11) Dec 30 2011 there are lot of people suggesting to change how string behaves. but
string is immutable(char)[]

I rarely *ever* need an immutable string. What I usually need is const(char)[]. I'd say 99%+ of the time I need only a const string.

This is quite irritating because "string" is the most convenient and intuitive thing to type. I often get into situations where I've written a function that takes a string, and then I can't call it because all I have is a char[]. I could copy the char[] into a new string, but that's expensive, and I'd rather I could just call the function.

I think it's telling that most Phobos functions use 'const(char)[]' or 'in char[]' instead of 'string' for their arguments. The ones that use 'string' are usually using it unnecessarily and should be fixed to use const(char)[].

In an ideal world I'd much prefer if string was an alias for const(char)[], but string literals were immutable(char)[]. It would require a little more effort when dealing with concurrency, but that's a price I would be willing to pay to make the string alias useful in function parameters.
Dec 28 2011
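To make the complaint in the post above concrete, here is a minimal sketch (the function names are hypothetical): a parameter typed `string` rejects a `char[]` argument, while a `const(char)[]` parameter accepts mutable, const, and immutable data alike.

```d
// Hypothetical consumers illustrating the mismatch described above.
void takesString(string s) {}            // string is immutable(char)[]
void takesConstChars(const(char)[] s) {}

void main()
{
    char[] buf = "hello".dup;   // a mutable buffer
    // takesString(buf);        // compile error: char[] does not convert to immutable(char)[]
    takesConstChars(buf);       // fine: mutable converts implicitly to const
    takesString(buf.idup);      // works, but allocates a fresh immutable copy
}
```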
Peter Alexander:
> I often get into situations where I've written a function that takes a string, and then I can't call it because all I have is a char[].

I suggest you show some of these situations.

> I think it's telling that most Phobos functions use 'const(char)[]' or 'in char[]' instead of 'string' for their arguments. The ones that use 'string' are usually using it unnecessarily and should be fixed to use const(char)[].

What are the Phobos functions that unnecessarily accept a string?

Bye,
bearophile
Dec 28 2011
On 28/12/11 12:42 PM, bearophile wrote:
> Peter Alexander:
>> I often get into situations where I've written a function that takes a string, and then I can't call it because all I have is a char[].
> I suggest you show some of these situations.

Any time you want to create a string without allocating memory.

char[N] buffer;
// write into buffer
// try to use buffer as string

>> I think it's telling that most Phobos functions use 'const(char)[]' or 'in char[]' instead of 'string' for their arguments. [...]
> What are the Phobos functions that unnecessarily accept a string?

Good question. I can't see any just now, although I have come across some in the past. Perhaps they have already been fixed.
Dec 28 2011
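A hedged sketch of the stack-buffer scenario above (consumer names are hypothetical): a slice of a fixed-size buffer converts implicitly to `const(char)[]`, but never to `string`, because data on the stack cannot be promised immutable.

```d
void useConst(const(char)[] s) {}   // hypothetical consumer taking const
void useString(string s) {}         // hypothetical consumer taking string

void main()
{
    char[8] buffer;
    buffer[0 .. 5] = "hello";        // write into buffer, no heap allocation
    useConst(buffer[0 .. 5]);        // fine: char[] converts to const(char)[]
    // useString(buffer[0 .. 5]);    // compile error: stack data is not immutable
    useString(buffer[0 .. 5].idup);  // works only by allocating a copy
}
```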
Peter Alexander:
> Any time you want to create a string without allocating memory.
>
> char[N] buffer;
> // write into buffer
> // try to use buffer as string

I have discussed this topic two or three times in the past. In a post I even suggested the idea of "scoped immutability", which was not appreciated. Generally, creating immutable data structures is a source of trouble in all languages, and in D it's not a solved problem yet.

In D today you are sometimes able to rewrite that as:

string foo(in int n) pure {
    auto buffer = new char[n];
    // write into buffer
    return buffer;
}

void bar(string s) {}

void main() {
    string s = foo(5);
    bar(s); // use buffer as string
}

Bye,
bearophile
Dec 28 2011
On 28/12/11 1:27 PM, bearophile wrote:
> I have discussed this topic two or three times in the past. [...]
> In D today you are sometimes able to rewrite that as:
>
>     string foo(in int n) pure {
>         auto buffer = new char[n];
>         // write into buffer
>         return buffer;
>     }

That only works when you allocate memory for the string, which is what I would like to avoid.
Dec 28 2011
On 12/28/2011 5:16 AM, Peter Alexander wrote:
> Any time you want to create a string without allocating memory.
>
> char[N] buffer;
> // write into buffer
> // try to use buffer as string

Is the buffer ever going to be reused with a different string in it?
Dec 28 2011
On 28/12/11 5:16 PM, Walter Bright wrote:
> Is the buffer ever going to be reused with a different string in it?

Possibly.

I know what argument is coming next: "But if the function you call stores the string you passed in then it can't rely on seeing a consistent value!"

I know this. These functions should request immutable(char)[] because that's what they need. Functions that don't store the string should use const(char)[].

The question is whether string should alias immutable(char)[] or const(char)[]. In my experience (which is echoed in Phobos), const(char)[] is used much more often than immutable(char)[], so string should alias const(char)[].
Dec 28 2011
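The "consistent value" argument can be illustrated with a short sketch (names are hypothetical): a function that stores its argument for later needs immutable data, otherwise the caller could change the contents underneath it.

```d
string saved;                        // hypothetical global that keeps the string

void store(string s) { saved = s; }  // safe: immutable contents can never change

void main()
{
    char[] buf = "abc".dup;
    // store(buf);               // compile error: buf could later mutate `saved`
    store(buf.idup);             // a copy must be made to get immutability
    buf[0] = 'x';                // mutating the buffer...
    assert(saved == "abc");      // ...cannot affect the stored string
}
```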
On 12/28/2011 07:07 PM, Peter Alexander wrote:
> The question is whether string should alias immutable(char)[] or const(char)[]. In my experience (which is echoed in Phobos) is that const(char)[] is used much more often than immutable(char)[], so it should alias const(char)[].

You are approximately saying (paraphrasing): "The question is whether a cow is a cow or an animal. In my experience (which is echoed at the farm down the valley) is that there are more animals than there are cows. So we should call all our animals cows."
Dec 28 2011
On 28/12/11 6:03 PM, Timon Gehr wrote:
> You are approximately saying (paraphrasing): "The question is whether a cow is a cow or an animal. [...] So we should call all our animals cows."

No, I'm saying that people talk about animals more often than cows, so it should be easier and more intuitive to say "animal" than it is to say "cow". People can still call things cows if that is what they're talking about.
Dec 28 2011
On 12/28/2011 10:07 AM, Peter Alexander wrote:
> I know what argument is coming next: "But if the function you call stores the string you passed in then it can't rely on seeing a consistent value!"

Exactly.

> The question is whether string should alias immutable(char)[] or const(char)[]. [...]

If such a change is made, then people will use const string when they mean immutable, and the values underneath are not guaranteed to be consistent.
Dec 28 2011
On 28/12/11 6:15 PM, Walter Bright wrote:
> If such a change is made, then people will use const string when they mean immutable, and the values underneath are not guaranteed to be consistent.

Then people should learn what const and immutable mean!

I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.
Dec 28 2011
On 12/28/2011 10:35 AM, Peter Alexander wrote:
> Then people should learn what const and immutable mean!
>
> I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice, sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Relying on people to do the right thing is a strategy that never works very well - not for programming, nor any other endeavor.
Dec 28 2011
On 12/28/11 12:46 PM, Walter Bright wrote:
> People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. [...]

Yes. Contrary to the OP, I don't think it's fair to dismiss a valid concern by framing it as a user education issue. It has very often been aired in the olden days of C++, and never in a winning argument. (Right off the bat - auto_ptr.)

Andrei
Dec 28 2011
On 12/28/2011 10:56 AM, Andrei Alexandrescu wrote:
> Yes. Contrary to the OP, I don't think it's fair to dismiss a valid concern by framing it as a user education issue. [...]

And as Bruce Eckel discovered, even the people who know better will deliberately pick the wrong method, because it's easier, and they justify it to themselves by saying they'll go back and fix it later. And of course that doesn't happen.

Bruce decided there was something fundamentally wrong with a feature when he found himself writing articles exhorting people to do X instead of Y, and then in his own code preferring to do the simpler Y.
Dec 28 2011
On 12/28/11 12:46 PM, Walter Bright wrote:
> People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. [...]

Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length, and s.rep[i] instead of s[i], would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.

Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.

Andrei
Dec 28 2011
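The proposed `.rep` property does not exist in the language; a rough user-level approximation of the idea, for illustration only, is a helper that reinterprets a narrow string's code units as unsigned bytes (Phobos's existing std.string.representation, mentioned later in the thread, does essentially this for the read-only case):

```d
// Hypothetical approximation of the proposed .rep (not an actual D property):
// expose the raw UTF-8 code units of a string as ubyte[].
immutable(ubyte)[] rep(string s)
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    string s = "héllo";
    assert(s.length == 6);       // 'é' occupies two UTF-8 code units
    assert(rep(s).length == 6);  // same count, but the element type is ubyte
    assert(rep(s)[0] == 'h');    // indexing yields a code unit, not a character
}
```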
On Wed, 28 Dec 2011 11:00:52 -0800, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. [...]

Would slicing, i.e. s[i..j], still be valid? If so, what would be the recommended way of finding i and j?
Dec 28 2011
On 12/28/11 1:17 PM, Robert Jacques wrote:
> Would slicing, i.e. s[i..j], still be valid?

No, only s.rep[i .. j].

> If so, what would be the recommended way of finding i and j?

find, findSplit etc. from std.algorithm, std.utf functions etc.

Andrei
Dec 28 2011
On 12/28/2011 08:29 PM, Andrei Alexandrescu wrote:
>> Would slicing, i.e. s[i..j], still be valid?
> No, only s.rep[i .. j].

That does not do the right thing. It would look more like cast(string)s.rep[i .. j].
Dec 28 2011
On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei Alexandrescu wrote:
> Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. [...]

That's a good idea, though I wonder about its implementation strategy. ATM string is simply an alias for a char array; are you suggesting string should be a wrapper struct instead (like the one previously suggested by Steven)?

I'm all for making string a properly encapsulated type.
Dec 28 2011
On 12/28/11 1:18 PM, foobar wrote:
> That's a good idea, though I wonder about its implementation strategy.

Implementation would entail a change in the compiler.

Andrei
Dec 28 2011
On 12/28/2011 08:30 PM, Andrei Alexandrescu wrote:
> Implementation would entail a change in the compiler.

Special casing char[] and wchar[] in the language would be extremely ugly and inconsistent and would break nearly every D program. And for me, it would cripple D's strings quite a lot. Why do you think it is worthwhile?
Dec 28 2011
On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu wrote:On 12/28/11 1:18 PM, foobar wrote:Why? D should be plenty powerful to implement this without modifying the compiler. Sounds like you suggest that char[] will behave differently than other T[] which is a very poor idea IMO.That's a good idea which I wonder about its implementation strategy.Implementation would entail a change in the compiler. Andrei
Dec 28 2011
On 12/28/11 1:48 PM, foobar wrote:On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu wrote:It's an awesome idea, but for an academic debate at best. AndreiOn 12/28/11 1:18 PM, foobar wrote:Why? D should be plenty powerful to implement this without modifying the compiler. Sounds like you suggest that char[] will behave differently than other T[] which is a very poor idea IMO.That's a good idea which I wonder about its implementation strategy.Implementation would entail a change in the compiler. Andrei
Dec 28 2011
On Wednesday, 28 December 2011 at 21:57:00 UTC, Andrei Alexandrescu wrote:On 12/28/11 1:48 PM, foobar wrote:I don't follow you. You've suggested a change that I agree with. Adam provided a prototype string library type that accomplishes your specified goals without any changes to the compiler. What are we missing here? IF it boils down to changing the compiler or leaving the status-quo, I'm voting against the compiler change.On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu wrote:It's an awesome idea, but for an academic debate at best. AndreiOn 12/28/11 1:18 PM, foobar wrote:Why? D should be plenty powerful to implement this without modifying the compiler. Sounds like you suggest that char[] will behave differently than other T[] which is a very poor idea IMO.That's a good idea which I wonder about its implementation strategy.Implementation would entail a change in the compiler. Andrei
Dec 28 2011
On 12/28/11 4:18 PM, foobar wrote:On Wednesday, 28 December 2011 at 21:57:00 UTC, Andrei Alexandrescu wrote:If we have two facilities (string and e.g. String) we've lost. We'd need to slowly change the built-in string type. I discussed the matter with Walter. He completely disagrees, and sees the idea as a sheer way to complicate stuff for no good. He mentions how he frequently uses .length, indexing, and slicing in narrow strings. I know Walter's code, so I know where he's coming from. He understands UTF in and out, and I have zero doubt he actually knows all essential constants, masks, and ranges by heart. I've seen his code and indeed it's an amazing feat of minimal opportunistic on-demand decoding. So I know where he's coming from, but I also know next to nobody codes like him. A casual string user almost always writes string code (iteration, indexing) the wrong way and would be tremendously helped by a clean distinction between abstraction and representation. Nagonna happen. AndreiOn 12/28/11 1:48 PM, foobar wrote:I don't follow you. You've suggested a change that I agree with. Adam provided a prototype string library type that accomplishes your specified goals without any changes to the compiler. What are we missing here? IF it boils down to changing the compiler or leaving the status-quo, I'm voting against the compiler change.On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu wrote:It's an awesome idea, but for an academic debate at best. AndreiOn 12/28/11 1:18 PM, foobar wrote:Why? D should be plenty powerful to implement this without modifying the compiler. Sounds like you suggest that char[] will behave differently than other T[] which is a very poor idea IMO.That's a good idea which I wonder about its implementation strategy.Implementation would entail a change in the compiler. Andrei
Dec 28 2011
On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei Alexandrescu wrote:If we have two facilities (string and e.g. String) we've lost. We'd need to slowly change the built-in string type.Have you actually tried to do it? Thanks to alias this, the custom string can be used with existing std.string functions and assignments from literals. I suppose that technically there's two facilities: immutable(char)[] and string, but I don't see what difference that makes at all. string is just an alias. It could be changed to a struct with ease; you can do it in your own private module. I really think you (you!) are underestimating D's current capabilities. (Again, I do not think this is a good move - I'm with Walter on it - but let's not sell the language short.)
Dec 28 2011
On 12/28/2011 8:32 PM, Adam D. Ruppe wrote:On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei Alexandrescu wrote:I've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.If we have two facilities (string and e.g. String) we've lost. We'd need to slowly change the built-in string type.Have you actually tried to do it?
Dec 28 2011
On Thursday, 29 December 2011 at 05:37:00 UTC, Walter Bright wrote:I've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.Note that I'm on your side here re strings, but you're underselling the D language too! These conversions are implicit both ways, and completely free. D structs can wrap other D types perfectly well. Check this out: string a = "hello"; a = a.replace("h", "j"); assert(a == "jello"); this actually works, today, with a custom string type in the D language. Just define a struct string in your module. alias this does most the magic. In C++, std::string and char* are very different. === #include<string> void a(const char* str) {} int main() { std::string me = "lol"; // works a(me); // ...but this doesn't work return 0; } === But, in D, that *does work*. A struct string can be used on a function that calls for a const(char)[]. It can be used for a function that calls for an immutable(char)[]. It can be used for a function that calls for a struct string. A string struct works exactly the same way as a string alias. Right down to the name! It's not storeable in a variable typed char[] (or wchar[] nor dchar[]), but neither are D strings today.
Dec 28 2011
On 12/29/11 12:01 AM, Adam D. Ruppe wrote:On Thursday, 29 December 2011 at 05:37:00 UTC, Walter Bright wrote:Nah, that still breaks a lotta code because people parameterize on T[], use isSomeString/isSomeChar etc. Nagonna. AndreiI've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.Note that I'm on your side here re strings, but you're underselling the D language too! These conversions are implicit both ways, and completely free. D structs can wrap other D types perfectly well.
Dec 28 2011
On Thursday, 29 December 2011 at 06:09:17 UTC, Andrei Alexandrescu wrote:Nah, that still breaks a lotta code because people parameterize on T[], use isSomeString/isSomeChar etc./* snip struct string */ import std.traits; void tem(T)(T t) if(isSomeString!T) {} void tem2(T : immutable(char)[])(T t) {} string a = "test"; tem(a); // works tem2(a); // works It's the alias this magic again. (btw I also tried renaming struct string to struct STRING, and it still worked, so it wasn't just naming coincidence!)
Dec 29 2011
On 12/28/11 11:36 PM, Walter Bright wrote:On 12/28/2011 8:32 PM, Adam D. Ruppe wrote:This. The only solution is to explain Walter no other programmer in the world codes UTF like him. Really. I emulate that sometimes (learned from him) but I see code from hundreds of people day in and day out - it's never like his. Once we convince him, he'll be like "ah, I see what you mean. Requiring .rep is awesome. Let's do it." AndreiOn Thursday, 29 December 2011 at 04:17:37 UTC, Andrei Alexandrescu wrote:I've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.If we have two facilities (string and e.g. String) we've lost. We'd need to slowly change the built-in string type.Have you actually tried to do it?
Dec 28 2011
On Thursday, 29 December 2011 at 06:08:05 UTC, Andrei Alexandrescu wrote:On 12/28/11 11:36 PM, Walter Bright wrote:I don't think this is a problem you can solve without educating people. They will need to know a thing or two about how UTF works to know the performance implications of many of the "safe" ways to handle UTF strings. Further, for much use of Unicode strings in D you can't get away with not knowing anything anyway because D only abstracts up to code points, not graphemes. Imagine trying to explain to the unknowing programmer what is going on when an algorithm function broke his grapheme and he doesn't know the first thing about Unicode. I'm not claiming to be an expert myself, but I believe D offers Unicode the right way as it is.On 12/28/2011 8:32 PM, Adam D. Ruppe wrote:This. The only solution is to explain Walter no other programmer in the world codes UTF like him. Really. I emulate that sometimes (learned from him) but I see code from hundreds of people day in and day out - it's never like his. Once we convince him, he'll be like "ah, I see what you mean. Requiring .rep is awesome. Let's do it." AndreiOn Thursday, 29 December 2011 at 04:17:37 UTC, Andrei Alexandrescu wrote:I've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.If we have two facilities (string and e.g. String) we've lost. We'd need to slowly change the built-in string type.Have you actually tried to do it?
Dec 28 2011
On Thursday, December 29, 2011 07:33:28 Jakob Ovrum wrote:I don't think this is a problem you can solve without educating people. They will need to know a thing or two about how UTF works to know the performance implications of many of the "safe" ways to handle UTF strings. Further, for much use of Unicode strings in D you can't get away with not knowing anything anyway because D only abstracts up to code points, not graphemes. Imagine trying to explain to the unknowing programmer what is going on when an algorithm function broke his grapheme and he doesn't know the first thing about Unicode. I'm not claiming to be an expert myself, but I believe D offers Unicode the right way as it is.Ultimately, the programmer _does_ need to understand unicode properly if they're going to write code which is both correct and efficient. However, if the easy way to use strings in D is correct, even if it's not as efficient as we'd like, then at least code will tend to be correct in its use of unicode. And then if the programmer wants their string processing to be more efficient, they need to actually learn how unicode works so that they code for it more efficiently. The issue, however, is that it's currently _way_ too easy to use strings completely incorrectly and operate on code units as if they were characters. A _lot_ of programmers will be using string and char[] as if a char were a character, and that's going to create a lot of bugs. Making it harder to operate on a char[] or string as if it were an array of characters will seriously reduce such bugs and on some level will force people to become better educated about unicode. No, it doesn't completely solve the problem, since then we're operating at the code point level rather than the unicode level, but it's still a _lot_ better than operating on the code unit level as is likely to happen now. - Jonathan M Davis
Dec 28 2011
On 29/12/2011 07:48, Jonathan M Davis wrote:On Thursday, December 29, 2011 07:33:28 Jakob Ovrum wrote:That is the whole point of D IMO. I think we shouldn't let an ego question dictate language decisions.I don't think this is a problem you can solve without educating people. They will need to know a thing or two about how UTF works to know the performance implications of many of the "safe" ways to handle UTF strings. Further, for much use of Unicode strings in D you can't get away with not knowing anything anyway because D only abstracts up to code points, not graphemes. Imagine trying to explain to the unknowing programmer what is going on when an algorithm function broke his grapheme and he doesn't know the first thing about Unicode. I'm not claiming to be an expert myself, but I believe D offers Unicode the right way as it is.Ultimately, the programmer _does_ need to understand unicode properly if they're going to write code which is both correct and efficient. However, if the easy way to use strings in D is correct, even if it's not as efficient as we'd like, then at least code will tend to be correct in its use of unicode. And then if the programmer wants their string processing to be more efficient, they need to actually learn how unicode works so that they code for it more efficiently. The issue, however, is that it's currently _way_ too easy to use strings completely incorrectly and operate on code units as if they were characters. A _lot_ of programmers will be using string and char[] as if a char were a character, and that's going to create a lot of bugs. Making it harder to operate on a char[] or string as if it were an array of characters will seriously reduce such bugs and on some level will force people to become better educated about unicode. 
No, it doesn't completely solve the problem, since then we're operating at the code point level rather than the unicode level, but it's still a _lot_ better than operating on the code unit level as is likely to happen now. - Jonathan M Davis
Dec 29 2011
On 12/28/2011 10:33 PM, Jakob Ovrum wrote:I don't think this is a problem you can solve without educating people. They will need to know a thing or two about how UTF works to know the performance implications of many of the "safe" ways to handle UTF strings. Further, for much use of Unicode strings in D you can't get away with not knowing anything anyway because D only abstracts up to code points, not graphemes. Imagine trying to explain to the unknowing programmer what is going on when an algorithm function broke his grapheme and he doesn't know the first thing about Unicode. I'm not claiming to be an expert myself, but I believe D offers Unicode the right way as it is.I think this goes to, at some point, the language is no longer able to hide the realities of the underlying machine. This happens with floating point (they are NOT mathematical real numbers), integers (they overflow), etc. Keep in mind that D already has a string type where the code points match the characters: dstring
Dec 28 2011
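Walter's point about dstring can be checked directly (a sketch, not code from the thread):

```d
void main()
{
    string  utf8  = "häuser";   // UTF-8:  'ä' occupies two code units
    dstring utf32 = "häuser"d;  // UTF-32: one code unit per code point

    assert(utf8.length  == 7);
    assert(utf32.length == 6);
    assert(utf32[1] == 'ä');    // indexing a dstring addresses whole code points

    // Caveat raised elsewhere in the thread: graphemes built from combining
    // marks can still span several dchars, so even dstring is not "one index
    // per visible character" in the general case.
}
```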
On 12/28/2011 10:08 PM, Andrei Alexandrescu wrote:The only solution is to explain Walter no other programmer in the world codes UTF like him. Really. I emulate that sometimes (learned from him) but I see code from hundreds of people day in and day out - it's never like his. Once we convince him, he'll be like "ah, I see what you mean. Requiring .rep is awesome. Let's do it."If that ever happens, I owe you a beer. Maybe two! Maybe it's hubris, but I think D nails what a string type should be. I'm extremely reluctant to mess with its success. It strikes the right balance between aesthetics, efficiency and utility. C++11 and C11 appear to have copied it.
Dec 28 2011
On 12/29/2011 07:53 AM, Walter Bright wrote:On 12/28/2011 10:08 PM, Andrei Alexandrescu wrote:I fully agree. If I had to design an imperative programming language, this is how its strings would work.The only solution is to explain Walter no other programmer in the world codes UTF like him. Really. I emulate that sometimes (learned from him) but I see code from hundreds of people day in and day out - it's never like his. Once we convince him, he'll be like "ah, I see what you mean. Requiring .rep is awesome. Let's do it."If that ever happens, I owe you a beer. Maybe two! Maybe it's hubris, but I think D nails what a string type should be. I'm extremely reluctant to mess with its success. It strikes the right balance between aesthetics, efficiency and utility.C++11 and C11 appear to have copied it.
Dec 29 2011
On Thu, 29 Dec 2011 16:36:59 +1100, Walter Bright <newshound2 digitalmars.com> wrote:I've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.I'm not quite sure about that last sentence. I suspect that the better way for applications to handle strings of characters would be to internally store and manipulate them as utf-32 (dchar[]) and only when doing I/O use the other utf forms. So converting from the different forms is very helpful. -- Derek Parnell Melbourne, Australia
Dec 29 2011
Don't we already have String-like support with ranges? I'm not sure I understand the point in having special behavior for char arrays. Sent from my iPhone On Dec 28, 2011, at 8:17 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 12/28/11 4:18 PM, foobar wrote:On Wednesday, 28 December 2011 at 21:57:00 UTC, Andrei Alexandrescu wrote:On 12/28/11 1:48 PM, foobar wrote:On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu wrote:On 12/28/11 1:18 PM, foobar wrote:That's a good idea which I wonder about its implementation strategy.Implementation would entail a change in the compiler. AndreiWhy? D should be plenty powerful to implement this without modifying the compiler. Sounds like you suggest that char[] will behave differently than other T[] which is a very poor idea IMO.It's an awesome idea, but for an academic debate at best. AndreiI don't follow you. You've suggested a change that I agree with. Adam provided a prototype string library type that accomplishes your specified goals without any changes to the compiler. What are we missing here? IF it boils down to changing the compiler or leaving the status-quo, I'm voting against the compiler change.If we have two facilities (string and e.g. String) we've lost. We'd need to slowly change the built-in string type. I discussed the matter with Walter. He completely disagrees, and sees the idea as a sheer way to complicate stuff for no good. He mentions how he frequently uses .length, indexing, and slicing in narrow strings. I know Walter's code, so I know where he's coming from. He understands UTF in and out, and I have zero doubt he actually knows all essential constants, masks, and ranges by heart. I've seen his code and indeed it's an amazing feat of minimal opportunistic on-demand decoding. So I know where he's coming from, but I also know next to nobody codes like him. A casual string user almost always writes string code (iteration, indexing) the wrong way and would be tremendously helped by a clean distinction between abstraction and representation. Nagonna happen. Andrei
Dec 29 2011
On Thursday, December 29, 2011 11:32:52 Sean Kelly wrote:Don't we already have String-like support with ranges? I'm not sure I understand the point in having special behavior for char arrays.To avoid common misusage. It's way too easy to misuse the length property on narrow strings. Programmers shouldn't be using the length property on narrow strings unless they know what they're doing, but it's likely the first thing that any programmer is going to use for the length of a string, because that's how arrays in general work. If it weren't legal to simply use the length property of a char[] or to directly slice it or index it, then those common misusages would be harder to do. You could still do them via .rep or .raw or whatever we'd call it, but it would no longer be the path of least resistance. Yes, Phobos may avoid the issue, because for the most part its developers understand the issues, but many programmers who do not understand them will make mistakes in their own code which should arguably be harder to make, simply because it's the path of least resistance, and they don't know any better. - Jonathan M Davis
Dec 29 2011
On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu wrote:Implementation would entail a change in the compiler.I don't think I agree. Wouldn't something like this work? === struct string { immutable(char)[] rep; alias rep this; auto opAssign(immutable(char)[] rhs) { rep = rhs; return this; } this(immutable(char)[] rhs) { rep = rhs; } // disable these here so it isn't passed on to .rep @disable void opSlice(){ assert(0); } @disable size_t length() { assert(0); } } === I did some quick tests and the basics seemed ok: /* paste impl from above */ import std.string : replace; void main() { string a = "test"; // works a = a.replace("test", "mang"); // works // a = a[0..1]; // correctly fails to compile assert(0, a); // works }
Dec 28 2011
On Wednesday, 28 December 2011 at 19:48:28 UTC, Adam D. Ruppe wrote:On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu wrote:My thinking exactly. Of course we can't put "@disable" right away and should start with "@deprecated" to allow for a proper migration period. I'd also like a transition of the string related functions to this type. The previous ones can remain as simple wrappers/aliases/whatever for backwards compatibility.Implementation would entail a change in the compiler.I don't think I agree. Wouldn't something like this work? === struct string { immutable(char)[] rep; alias rep this; auto opAssign(immutable(char)[] rhs) { rep = rhs; return this; } this(immutable(char)[] rhs) { rep = rhs; } // disable these here so it isn't passed on to .rep @disable void opSlice(){ assert(0); } @disable size_t length() { assert(0); } } === I did some quick tests and the basics seemed ok: /* paste impl from above */ import std.string : replace; void main() { string a = "test"; // works a = a.replace("test", "mang"); // works // a = a[0..1]; // correctly fails to compile assert(0, a); // works }
Dec 28 2011
On Wednesday, 28 December 2011 at 20:01:15 UTC, foobar wrote:I'd also like a transition of the string related functions to this type. the previous ones can remain as simple wrappers/aliases/whatever for backwards compatibility.I actually like strings just the way they are... but if we had to change, I'm sure we can do a good job in the library relatively easily.
Dec 28 2011
On 12/28/2011 08:18 PM, foobar wrote:On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei Alexandrescu wrote:In what way would the proposed change improve encapsulation, and why would it even be desirable for such a basic data structure?On 12/28/11 12:46 PM, Walter Bright wrote:That's a good idea which I wonder about its implementation strategy. ATM string is simply an alias of a char array, are you suggesting string should be a wrapper struct instead (like the one previously suggested by Steven)? I'm all for making string a properly encapsulated type.On 12/28/2011 10:35 AM, Peter Alexander wrote:Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum. AndreiOn 28/12/11 6:15 PM, Walter Bright wrote:People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice: sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. that never works very well - not for programming, nor any other endeavor.If such a change is made, then people will use const string when they mean immutable, and the values underneath are not guaranteed to be consistent.Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.
Dec 28 2011
On Wednesday, 28 December 2011 at 19:38:53 UTC, Timon Gehr wrote: [snip]I'm not sure what you are asking here. Are you asking what the benefits of encapsulation are? This topic was discussed to death more than once and I'd suggest searching the NG archives for the details. Also, if you haven't already, I'd suggest reading about Unicode and its levels of abstraction: code points, code units, graphemes, etc...I'm all for making string a properly encapsulated type.In what way would the proposed change improve encapsulation, and why would it even be desirable for such a basic data structure?
Dec 28 2011
On 12/28/2011 08:55 PM, foobar wrote:On Wednesday, 28 December 2011 at 19:38:53 UTC, Timon Gehr wrote: [snip]I know the benefits of encapsulation and none of them applies here. The proposed change is nothing but a breaking interface change.I'm not sure what are you asking here. Are you asking what are the benefits of encapsulation?I'm all for making string a properly encapsulated type.In what way would the proposed change improve encapsulation, and why would it even be desirable for such a basic data structure?This topic was discussed to death more than once and I'd suggest searching the NG archives for the details. Also, If you hadn't already I'd suggest reading about Unicode and its levels of abstraction: code point, code units, graphemes, etc...'char' is a code unit. Therefore that is the level of abstraction the data type char[] provides.
Dec 28 2011
On 12/28/2011 08:00 PM, Andrei Alexandrescu wrote:On 12/28/11 12:46 PM, Walter Bright wrote:Why? char and wchar are unicode code units, ubyte/ushort are unsigned integrals. It is clear that char/wchar are a better match.On 12/28/2011 10:35 AM, Peter Alexander wrote:Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.On 28/12/11 6:15 PM, Walter Bright wrote:People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice: sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. that never works very well - not for programming, nor any other endeavor.If such a change is made, then people will use const string when they mean immutable, and the values underneath are not guaranteed to be consistent.Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum. Andrei
Dec 28 2011
On Wednesday, December 28, 2011 21:25:39 Timon Gehr wrote:Why? char and wchar are unicode code units, ubyte/ushort are unsigned integrals. It is clear that char/wchar are a better match.It's an issue of the correct usage being the easy path. As it stands, it's incredibly easy to use narrow strings incorrectly. By forcing any array of char or wchar to use .rep.length instead of .length, the relatively automatic (and generally incorrect) usage of .length on a string wouldn't immediately work. It would force you to work more at doing the wrong thing. Unfortunately, walkLength isn't necessarily any easier than .rep.length, but it does force people to look into why they can't do .length, which will generally better educate them and will hopefully reduce the misuse of narrow strings. If we make rep ubyte[] and ushort[] for char[] and wchar[] respectively, then we reinforce the fact that you shouldn't operate on chars or wchars. It also makes it simple for the compiler to never allow you to use length on char[] or wchar[], since it doesn't have to worry about whether you got that char[] or wchar[] from a rep property or not. Now, I don't know if this is really a good move at this point. If we were to really do this right, we'd need to disallow indexing and slicing of the char[] and wchar[] as well, which would break that much more code. It also pretty quickly makes it look like string should be its own type rather than an array, since it's acting less and less like an array. Not to mention, even the correct usage of .rep would become rather irritating (e.g. slicing it when you know that the indices that you're dealing with aren't going to cut into any code points), because you'd have to cast from ubyte[] to char[] whenever you did that. So, I think that the general sentiment behind this is a good one, but I don't know if the exact idea is ultimately a good one - particularly at this stage in the game. 
If we're going to make a change like this which would break as much code as this would, we'd need to be _very_ certain that it's what we want to do. - Jonathan M Davis
Dec 28 2011
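The `.rep` view being debated can be approximated today with a cast; here `rep` is just a local variable standing in for the hypothetical property (a sketch, not code from the thread):

```d
void main()
{
    string s = "aä";

    // What a hypothetical s.rep might expose: raw code units as integrals.
    auto rep = cast(immutable(ubyte)[]) s;

    assert(rep.length == 3);  // 'a' is one byte, 'ä' is two
    assert(rep[1] == 0xC3);   // lead byte of the two-byte sequence for 'ä'
    assert(rep[2] == 0xA4);   // continuation byte

    // Slicing rep at arbitrary indices makes the byte-level intent explicit,
    // but as Jonathan notes, going back requires a cast:
    string tail = cast(string) rep[1 .. 3];
    assert(tail == "ä");
}
```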
Apparently my previous post was lost. Apologies if this comes out twice. On 12/28/2011 09:39 PM, Jonathan M Davis wrote:On Wednesday, December 28, 2011 21:25:39 Timon Gehr wrote:I was educated enough not to make that mistake, because I read the entire language specification before deciding the language was awesome and downloading the compiler. I find it strange that the product should be made less usable because we do not expect users to read the manual. But it is of course a valid point.Why? char and wchar are unicode code units, ubyte/ushort are unsigned integrals. It is clear that char/wchar are a better match.It's an issue of the correct usage being the easy path. As it stands, it's incredibly easy to use narrow strings incorrectly. By forcing any array of char or wchar to use .rep.length instead of .length, the relatively automatic (and generally incorrect) usage of .length on a string wouldn't immediately work. It would force you to work more at doing the wrong thing. Unfortunately, walkLength isn't necessarily any easier than .rep.length, but it does force people to look into why they can't do .length, which will generally better educate them and will hopefully reduce the misuse of narrow strings.If we make rep ubyte[] and ushort[] for char[] and wchar[] respectively, then we reinforce the fact that you shouldn't operate on chars or wchars.There is nothing wrong with operating at the code unit level. Efficient slicing is very desirable.It also makes it simply for the compiler to never allow you to use length on char[] or wchar[], since it doesn't have to worry about whether you got that char[] or wchar[] from a rep property or not. Now, I don't know if this is really a good move at this point. If we were to really do this right, we'd need to disallow indexing and slicing of the char[] and wchar[] as well, which would break that much more code. 
It also pretty quickly makes it look like string should be its own type rather than an array, since it's acting less and less like an array.Exactly. It is acting less and less like an array of code units. But it *is* an array of code units. If the general consensus is that we need a string data type that acts at a different abstraction level by default (with which I'd disagree, but apparently I don't have a popular opinion here), then we need a string type in the standard library to do that. Changing the language so that an array of code units stops behaving like an array of code units is not a solution.Not to mention, even the correct usage of .rep would become rather irritating (e.g. slicing it when you know that the indicies that you're dealing with aren't going to cut into any code points), because you'd have to cast from ubyte[] to char[] whenever you did that. So, I think that the general sentiment behind this is a good one, but I don't know if the exact idea is ultimately a good one - particularly at this stage in the game. If we're going to make a change like this which would break as much code as this would, we'd need to be _very_ certain that it's what we want to do. - Jonathan M DavisI agree.
Dec 28 2011
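The `.length` vs `walkLength` distinction discussed above is easy to demonstrate. A small sketch (assuming a current DMD/Phobos, where `std.range.walkLength` decodes narrow strings by code point):

```d
import std.range : walkLength;

void main()
{
    string s = "über";          // 'ü' encodes as two UTF-8 code units
    assert(s.length == 5);      // .length counts code units
    assert(s.walkLength == 4);  // walkLength counts decoded code points
}
```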
On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:I was educated enough not to make that mistake, because I read the entire language specification before deciding the language was awesome and downloading the compiler. I find it strange that the product should be made less usable because we do not expect users to read the manual. But it is of course a valid point.That's awfully optimistic to expect people to read the manual.There is nothing wrong with operating at the code unit level. Efficient slicing is very desirable.I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code. i.e. if I need a name variable in a class: codeUnit[] name; // bug! string Name; // correct I expect that most uses of code-unit arrays should be in the standard library anyway since it provides the string manipulation routines. It all boils down to making the common case trivial and the rare case possible. You can use the underlying data structure (code units) if you need it but the default "string" is what people expect when thinking about what such a type does (a string of letters). D's already 80% there since Phobos already treats strings as bi-directional ranges of code-points which is much closer to the mental image of a string of letters, so I think this is about bringing the current design to its final conclusion.Exactly. It is acting less and less like an array of code units. But it *is* an array of code units. If the general consensus is that we need a string data type that acts at a different abstraction level by default (with which I'd disagree, but apparently I don't have a popular opinion here), then we need a string type in the standard library to do that. Changing the language so that an array of code units stops behaving like an array of code units is not a solution.I agree that we should not break T[] for any T and instead introduce a library type. 
While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth a consideration if the benefit worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.
Dec 28 2011
On 12/28/2011 11:12 PM, foobar wrote:On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:Well, if the alternative is slowly butchering the language I will be awfully optimistic about it all day long.I was educated enough not to make that mistake, because I read the entire language specification before deciding the language was awesome and downloading the compiler. I find it strange that the product should be made less usable because we do not expect users to read the manual. But it is of course a valid point.That's awfully optimistic to expect people to read the manual.I would not go as far as to call it 'incorrect'.There is nothing wrong with operating at the code unit level. Efficient slicing is very desirable.I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code.i.e. if I need a name variable in a class: codeUnit[] name; // bug! string Name; // correctFrom a pragmatic viewpoint it does not matter because if string is used like this, then codeUnit[] does exactly the same thing. Nobody forces anyone to index or slice into a string variable when they don't need that functionality. All engineers have to work with leaky abstractions. Why is it such a big deal?I expect that most uses of code-unit arrays should be in the standard library anyway since it provides the string manipulation routines. It all boils down to making the common case trivial and the rare case possible. You can use the underlying data structure (code units) if you need it but the default "string" is what people expect when thinking about what such a type does (a string of letters). 
D's already 80% there since Phobos already treats strings as bi-directional ranges of code-points which is much closer to the mental image of a string of letters, so I think this is about bringing the current design to its final conclusion.Well, that mental image is just not the right one when dealing with Unicode.What will such a type offer, except that it disallows indexing and slicing?Exactly. It is acting less and less like an array of code units. But it *is* an array of code units. If the general consensus is that we need a string data type that acts at a different abstraction level by default (with which I'd disagree, but apparently I don't have a popular opinion here), then we need a string type in the standard library to do that. Changing the language so that an array of code units stops behaving like an array of code units is not a solution.I agree that we should not break T[] for any T and instead introduce a library type. While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth a consideration if the benefit worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.
Dec 28 2011
On Wednesday, 28 December 2011 at 22:39:15 UTC, Timon Gehr wrote:On 12/28/2011 11:12 PM, foobar wrote:From a pragmatic view point people can also continue programming in C++ instead of investing a lot of effort learning a new language. The only difference between programming languages is the human interface aspect. Anything you can program with D you could also do in assembly yet you prefer D because it's more convenient. In that regard, a code-unit array is definitely worse than a string type. A programmer can choose to either change his 'naive' mental image or change the programming language. Most will do the latter. Computers need to adapt and be human friendly, not vice-versa.On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:Well, if the alternative is slowly butchering the language I will be awfully optimistic about it all day long.I was educated enough not to make that mistake, because I read the entire language specification before deciding the language was awesome and downloading the compiler. I find it strange that the product should be made less usable because we do not expect users to read the manual. But it is of course a valid point.That's awfully optimistic to expect people to read the manual.I would not go as far as to call it 'incorrect'.There is nothing wrong with operating at the code unit level. Efficient slicing is very desirable.I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code.i.e. if I need a name variable in a class: codeUnit[] name; // bug! string Name; // correctFrom a pragmatic viewpoint it does not matter because if string is used like this, then codeUnit[] does exactly the same thing. Nobody forces anyone to index or slice into a string variable when they don't need that functionality. All engineers have to work with leaky abstractions. 
Why is it such a big deal?I expect that most uses of code-unit arrays should be in the standard library anyway since it provides the string manipulation routines. It all boils down to making the common case trivial and the rare case possible. You can use the underlying data structure (code units) if you need it but the default "string" is what people expect when thinking about what such a type does (a string of letters). D's already 80% there since Phobos already treats strings as bi-directional ranges of code-points which is much closer to the mental image of a string of letters, so I think this is about bringing the current design to its final conclusion.Well, that mental image is just not the right one when dealing with Unicode.What will such a type offer, except that it disallows indexing and slicing?Exactly. It is acting less and less like an array of code units. But it *is* an array of code units. If the general consensus is that we need a string data type that acts at a different abstraction level by default (with which I'd disagree, but apparently I don't have a popular opinion here), then we need a string type in the standard library to do that. Changing the language so that an array of code units stops behaving like an array of code units is not a solution.I agree that we should not break T[] for any T and instead introduce a library type. While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth a consideration if the benefit worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.
Dec 28 2011
On 12/29/2011 07:45 AM, foobar wrote:On Wednesday, 28 December 2011 at 22:39:15 UTC, Timon Gehr wrote:I disagree. Pragmatism: "Dealing with things sensibly and realistically in a way that is based on practical rather than theoretical considerations." In practice, programming in D beats the pants off programming in C++.On 12/28/2011 11:12 PM, foobar wrote:From a pragmatic view point people can also continue programming in C++ instead of investing a lot of effort learning a new language.On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:Well, if the alternative is slowly butchering the language I will be awfully optimistic about it all day long.I was educated enough not to make that mistake, because I read the entire language specification before deciding the language was awesome and downloading the compiler. I find it strange that the product should be made less usable because we do not expect users to read the manual. But it is of course a valid point.That's awfully optimistic to expect people to read the manual.I would not go as far as to call it 'incorrect'.There is nothing wrong with operating at the code unit level. Efficient slicing is very desirable.I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code.i.e. if I need a name variable in a class: codeUnit[] name; // bug! string Name; // correctFrom a pragmatic viewpoint it does not matter because if string is used like this, then codeUnit[] does exactly the same thing. Nobody forces anyone to index or slice into a string variable when they don't need that functionality. All engineers have to work with leaky abstractions. Why is it such a big deal?I expect that most uses of code-unit arrays should be in the standard library anyway since it provides the string manipulation routines. It all boils down to making the common case trivial and the rare case possible. 
You can use the underlying data structure (code units) if you need it but the default "string" is what people expect when thinking about what such a type does (a string of letters). D's already 80% there since Phobos already treats strings as bi-directional ranges of code-points which is much closer to the mental image of a string of letters, so I think this is about bringing the current design to its final conclusion.Well, that mental image is just not the right one when dealing with Unicode.What will such a type offer, except that it disallows indexing and slicing?Exactly. It is acting less and less like an array of code units. But it *is* an array of code units. If the general consensus is that we need a string data type that acts at a different abstraction level by default (with which I'd disagree, but apparently I don't have a popular opinion here), then we need a string type in the standard library to do that. Changing the language so that an array of code units stops behaving like an array of code units is not a solution.I agree that we should not break T[] for any T and instead introduce a library type. While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth a consideration if the benefit worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.The only difference between programming languages is the human interface aspect.No. There is also the aspect of how well it maps to the machine it will run on. 
An interface always has two sides.Anything you can program with D you could also do in assembly yet you prefer D because it's more convenient.I prefer D because it is more productive.In that regard, a code-unit array is definitely worse than a string type.A code-unit array type is a string type, albeit a simple one.A programmer can choose to either change his 'naive' mental image or change the programming language. Most will do the latter.A programmer does not care about how D strings work or he is happy that they are so simple to work with.Computers need to adapt and be human friendly, not vice-versa.When I meet a computer that adapts itself in order to be human friendly, I'll buy you a cookie.
Dec 29 2011
Andrei Alexandrescu:one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation.Robert Jacques:We have discussed this topic some times in past, it's not an easy topic. I agree with the general desires under your ideas Andrei, I suggested something related, time ago. The idea of forbidding s.length, s[i] and s[i..j] for narrow strings seems interesting. (I suggested something different, to keep them but turn them into operations that do the right thing on narrow strings. Some people didn't appreciate the idea because it changes the computational complexity of such operations). But I suggest to step a bit back and look at the situation from a bit more distance, to avoid small patches to D that look like a pirate eyepatch :-) Narrow strings are more memory (and performance) efficient, and sometimes I want to slice them too, and do it correctly (so somestring.rep[i..j] is not enough). So I suggest to give something to perform correct slicing of narrow strings too. Bye, bearophileWould slicing, i.e. s[i..j] still be valid?No, only s.rep[i .. j].If so, what would be the recommended way of finding i and j?find, findSplit etc. from std.algorithm, std.utf functions etc.
Dec 28 2011
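The `std.algorithm` functions named above always return slices that start on a code point boundary, which is the point of recommending them over manual indexing. A hedged sketch of that style of usage:

```d
import std.algorithm : find, findSplit;

void main()
{
    string s = "schöne Grüße";

    // find returns a slice of the original string starting at the
    // match -- never in the middle of a multi-byte code point
    assert(s.find("Grüße") == "Grüße");

    // findSplit yields the parts before, at, and after the separator
    auto parts = s.findSplit(" ");
    assert(parts[0] == "schöne");
    assert(parts[2] == "Grüße");
}
```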
On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei Alexandrescu wrote:On 12/28/11 12:46 PM, Walter Bright wrote:I think it would be simpler to just make dstring the default string type. dstring is simple and safe. People who want better memory usage can use UTF-8 at their own discretion.On 12/28/2011 10:35 AM, Peter Alexander wrote:Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation.On 28/12/11 6:15 PM, Walter Bright wrote:People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice: sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. a strategy that never works very well - not for programming, nor any other endeavor.If such a change is made, then people will use const string when they mean immutable, and the values underneath are not guaranteed to be consistent.Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.
Dec 29 2011
This is a great idea! In this case the default string will be a random-access range, not a bidirectional range. Also, processing dstring is faster than string, because no encoding needs to be done. Processing power is more expensive than memory. utf-8 is valuable only to pass it as an ASCII string (which is not too common) and to store large chunks of it. Both these cases are much less common than all the rest of string processing. +1 On Thu, Dec 29, 2011 at 12:04 PM, Vladimir Panteleev <vladimir thecybershadow.net> wrote:On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei Alexandrescu wrote:-- Bye, Gor Gyolchanyan.On 12/28/11 12:46 PM, Walter Bright wrote:I think it would be simpler to just make dstring the default string type. dstring is simple and safe. People who want better memory usage can use UTF-8 at their own discretion.On 12/28/2011 10:35 AM, Peter Alexander wrote:Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation.On 28/12/11 6:15 PM, Walter Bright wrote:People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice: sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. that never works very well - not for programming, nor any other endeavor.If such a change is made, then people will use const string when they mean immutable, and the values underneath are not guaranteed to be consistent.Then people should learn what const and immutable mean! 
I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.
Dec 29 2011
On 12/29/2011 12:12 AM, Gor Gyolchanyan wrote:This a a great idea! In this case the default string will be a random-access range, not a bidirectional range. Also, processing dstring is faster, then string, because no encoding needs to be done. Processing power is more expensive, then memory. utf-8 is valuable only to pass it as an ASCII string (which is not too common) and to store large chunks of it. Both these cases are much less common then all the rest of string processing.dstring consumes 4x the memory, and this can easily cause perf degradations due to thrashing and poor cache locality.
Dec 29 2011
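The 4x figure follows directly from the code unit sizes; for ASCII text the payload really is four times larger:

```d
void main()
{
    string  s = "hello";    // UTF-8: one byte per ASCII character
    dstring d = "hello"d;   // UTF-32: four bytes per code point

    assert(s.length * char.sizeof  == 5);   // 5 bytes of payload
    assert(d.length * dchar.sizeof == 20);  // 20 bytes for the same text
}
```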
What if the string converted itself from utf-8 to utf-32 back and forth as necessary (utf-8 for storing and utf-32 for processing):

struct String
{
public:
    bool encoded() property const
    {
        return _encoded;
    }

    bool encoded(bool should) property
    {
        if(should)
            if(!encoded)
            {
                _utf8 = to!string(_utf32);
                encoded = true;
            }
        else
            if(encoded)
            {
                _utf32 = to!dstring(_utf8);
                encoded = false;
            }
    }

    // Here goes the part where you get to use the string

private:
    bool _encoded;
    union
    {
        string _utf8;
        dstring _utf32;
    }
}

This has a lot of drawbacks and is purely a curiosity. The idea is to express the encoding of a string as a property of the string, rather than as a difference between separate string types. On Thu, Dec 29, 2011 at 1:02 PM, Walter Bright <newshound2 digitalmars.com> wrote:On 12/29/2011 12:12 AM, Gor Gyolchanyan wrote:-- Bye, Gor Gyolchanyan.This is a great idea! In this case the default string will be a random-access range, not a bidirectional range. Also, processing dstring is faster than string, because no encoding needs to be done. Processing power is more expensive than memory. utf-8 is valuable only to pass it as an ASCII string (which is not too common) and to store large chunks of it. Both these cases are much less common than all the rest of string processing.dstring consumes 4x the memory, and this can easily cause perf degradations due to thrashing and poor cache locality.
Dec 29 2011
oops. I accidentally made a recursive call in the setter. scratch that, it should change the attribute. On Thu, Dec 29, 2011 at 6:58 PM, Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> wrote:What if the string converted itself from utf-8 to utf-32 back and forth as necessary (utf-8 for storing and utf-32 for processing):

struct String
{
public:
    bool encoded() property const
    {
        return _encoded;
    }

    bool encoded(bool should) property
    {
        if(should)
            if(!encoded)
            {
                _utf8 = to!string(_utf32);
                encoded = true;
            }
        else
            if(encoded)
            {
                _utf32 = to!dstring(_utf8);
                encoded = false;
            }
    }

    // Here goes the part where you get to use the string

private:
    bool _encoded;
    union
    {
        string _utf8;
        dstring _utf32;
    }
}

This has a lot of drawbacks and is purely a curiosity. The idea is to express the encoding of a string as a property of the string, rather than as a difference between separate string types. On Thu, Dec 29, 2011 at 1:02 PM, Walter Bright <newshound2 digitalmars.com> wrote:On 12/29/2011 12:12 AM, Gor Gyolchanyan wrote:This is a great idea! In this case the default string will be a random-access range, not a bidirectional range. Also, processing dstring is faster than string, because no encoding needs to be done. Processing power is more expensive than memory. utf-8 is valuable only to pass it as an ASCII string (which is not too common) and to store large chunks of it. Both these cases are much less common than all the rest of string processing.dstring consumes 4x the memory, and this can easily cause perf degradations due to thrashing and poor cache locality.
-- Bye, Gor Gyolchanyan.
Dec 29 2011
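The correction described above ("it should change the attribute") presumably means assigning the private `_encoded` field instead of recursing through the property. A hypothetical sketch of the repaired version (still a curiosity with the drawbacks the author notes, e.g. losing the union's active-member tracking on copies):

```d
import std.conv : to;

struct String
{
    bool encoded() const @property { return _encoded; }

    bool encoded(bool should) @property
    {
        if (should && !_encoded)
        {
            _utf8 = to!string(_utf32);
            _encoded = true;        // assign the field, not the property
        }
        else if (!should && _encoded)
        {
            _utf32 = to!dstring(_utf8);
            _encoded = false;
        }
        return _encoded;
    }

private:
    bool _encoded = true;           // start in the encoded (UTF-8) state
    union
    {
        string _utf8;
        dstring _utf32;
    }
}

void main()
{
    String s;
    s._utf8 = "abc";                // private is module-level in D
    s.encoded = false;              // decode: _utf32 = to!dstring(_utf8)
    assert(s._utf32 == "abc"d);
    s.encoded = true;               // re-encode back to UTF-8
    assert(s._utf8 == "abc");
}
```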
On 12/29/11 2:04 AM, Vladimir Panteleev wrote:I think it would be simpler to just make dstring the default string type. dstring is simple and safe. People who want better memory usage can use UTF-8 at their own discretion.memory == time Andrei
Dec 29 2011
On 28.12.2011 20:00, Andrei Alexandrescu wrote:Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change. If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write: ubyte [] u = s.rep; and use u from then on. I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.
Dec 29 2011
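For what it's worth, a library version of this idea exists in Phobos (at least in later releases) under yet another name: `std.string.representation` reinterprets the same memory as an array of unsigned integrals, with no copying:

```d
import std.string : representation;

void main()
{
    string s = "Grüße";
    immutable(ubyte)[] raw = s.representation; // same memory, ubyte view
    assert(raw.length == 7);   // 'ü' and 'ß' each take two code units
    assert(raw[0] == 'G');     // ASCII bytes compare as expected
}
```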
On 12/29/11 12:28 PM, Don wrote:On 28.12.2011 20:00, Andrei Alexandrescu wrote:Exactly!Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it's just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write: ubyte [] u = s.rep; and use u from then on. I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen... Andrei
Dec 29 2011
On Thu, 29 Dec 2011 18:36:27 -0000, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 12/29/11 12:28 PM, Don wrote:+1 for this idea, however named. R -- Using Opera's revolutionary email client: http://www.opera.com/mail/On 28.12.2011 20:00, Andrei Alexandrescu wrote:Exactly!Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it's just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write: ubyte [] u = s.rep; and use u from then on. I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...
Dec 30 2011
On 29.12.2011 19:36, Andrei Alexandrescu wrote:On 12/29/11 12:28 PM, Don wrote:Maybe it could happen if we 1. make dstring the default string type -- code units and characters would be the same or 2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindex so programmers could use the slices/indexing/length (no laziness problems), and if they really want code units use .raw/.rep (or better .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32) But generally I liked the idea of just having an alias for strings...On 28.12.2011 20:00, Andrei Alexandrescu wrote:Exactly!Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write: ubyte [] u = s.rep; and use u from then on. I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.Yes, I mean "rep" as short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...Andrei-- Joshua Reusch
Dec 30 2011
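The std.utf functions mentioned in the proposal above behave like this -- which also illustrates why forwarding length and opIndex to them would cost a linear scan per call:

```d
import std.utf : count, decode, toUTFindex;

void main()
{
    string s = "aüb";
    assert(s.count == 3);         // code points -- an O(n) scan
    assert(s.toUTFindex(2) == 3); // code-unit index of the 3rd code point
    size_t i = 1;
    assert(decode(s, i) == 'ü');  // decode the code point at unit index 1
    assert(i == 3);               // i advanced past both units of 'ü'
}
```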
On 12/30/2011 08:33 PM, Joshua Reusch wrote:Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:Inefficient.On 12/29/11 12:28 PM, Don wrote:Maybe it could happen if we 1. make dstring the default strings type --On 28.12.2011 20:00, Andrei Alexandrescu wrote:Exactly!Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it's just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write: ubyte [] u = s.rep; and use u from then on. I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...code units and characters would be the sameWrong.or 2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindexInconsistent and inefficient (it blows up the algorithmic complexity).so programmers could use the slices/indexing/length (no lazyness problems), and if they really want codeunits use .raw/.rep (or better .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)Anyone who intends to write efficient string processing code needs this. 
Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.But generally I liked the idea of just having an alias for strings...Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen? 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing. 2. They get screwed up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again. I have *never* seen a user in D.learn complain about it. There might have been some I missed, but it is certainly not a prevalent problem. Also, just because a user can type .rep does not mean he understands Unicode: He is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_.
Dec 30 2011
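Andrei's point above -- that s.rep[i] should hand back a bare code unit (ubyte), not a char -- is easiest to see at the encoding level. A small sketch in Python (used here only because the byte-level behaviour is language-independent; D's char[] indexing corresponds to the raw view):

```python
# The "representation" view of a UTF-8 string is its code units (bytes),
# which stop matching its characters as soon as the text leaves ASCII.
s = "héllo"              # 'é' occupies two UTF-8 code units
raw = s.encode("utf-8")  # the ubyte[]-like .rep/.raw view

print(len(s))    # 5 characters (code points)
print(len(raw))  # 6 code units
print(raw[1])    # 195 -- a bare code unit of 'é', not a character
```

This is exactly the distinction the proposal draws: length and indexing on the raw view answer questions about storage, not about characters.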
On Friday, 30 December 2011 at 19:55:45 UTC, Timon Gehr wrote:I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen? 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing. 2. They get screwed up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again. I have *never* seen an user in D.learn complain about it. They might have been some I missed, but it is certainly not a prevalent problem. Also, just because an user can type .rep does not mean he understands Unicode: He is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_.I strongly agree with this. It would be nice to have everything be simple, work correctly *and* efficiently at the same time, but I don't believe the proposed changes make a definite improvement. In the end, if you don't want to use the standard library or other UTF-aware string libraries, you'll have to know the basics of UTF to write the correct code. I too wish it was harder to write it incorrectly, but the current solution is simply the best one to appear yet.
Dec 30 2011
On 12/30/11 1:55 PM, Timon Gehr wrote:Me too. I think the way we have it now is optimal.What we have now is adequate. The scheme I proposed is optimal. I agree with all of your other remarks. Andrei
Dec 30 2011
On 30/12/2011 20:55, Timon Gehr wrote:On 12/30/2011 08:33 PM, Joshua Reusch wrote:ATOS Origin was hacked because of bad management of Unicode in strings in some of their software. Consequences can be more important than you may think. Additionally, you make an assumption that is really wrong: that an educated programmer will not make mistakes. C programmers will tell you exactly the same thing if the discussion comes to pointers. But the fact is, we all make mistakes. Many of them! We should opt into unsafe behaviour that relies on programmer capabilities only when needed. I do understand pointers. I do make mistakes with them, and it does have crazy consequences sometimes. And I do not trust anyone who tells me he/she doesn't. Because sometimes we are all morons.Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:Inefficient.On 12/29/11 12:28 PM, Don wrote:Maybe it could happen if we 1. make dstring the default strings type --On 28.12.2011 20:00, Andrei Alexandrescu wrote:Exactly!Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it's just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write: ubyte [] u = s.rep; and use u from then on. I don't like the name 'rep'. 
Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...code units and characters would be the sameWrong.or 2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindexInconsistent and inefficient (it blows up the algorithmic complexity).so programmers could use the slices/indexing/length (no lazyness problems), and if they really want codeunits use .raw/.rep (or better .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.But generally I liked the idea of just having an alias for strings...Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?
Dec 30 2011
On 12/30/2011 10:36 PM, deadalnix wrote:Le 30/12/2011 20:55, Timon Gehr a écrit :And cast(string)s.rep[i..j] would magically fix all those bugs?On 12/30/2011 08:33 PM, Joshua Reusch wrote:ATOS origin was hacked because of bad management of unicode in string in some of their software.Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:Inefficient.On 12/29/11 12:28 PM, Don wrote:Maybe it could happen if we 1. make dstring the default strings type --On 28.12.2011 20:00, Andrei Alexandrescu wrote:Exactly!Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it's just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write: ubyte [] u = s.rep; and use u from then on. I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...code units and characters would be the sameWrong.or 2. 
forward string.length to std.utf.count and opIndex to std.utf.toUTFindexInconsistent and inefficient (it blows up the algorithmic complexity).so programmers could use the slices/indexing/length (no lazyness problems), and if they really want codeunits use .raw/.rep (or better .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.But generally I liked the idea of just having an alias for strings...Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?Consequences can be more importants than you may think. Additionnaly, you make an asumption that is realy wrong : an educated programmer will not make mistake.I am not. I am just assuming that the proposed change does not help with that.C programmers will just tell you excactly the same thing is the discution comes to pointers. But the fact is, we all do mistakes. Many of them ! We should go into unsafe behaviour, that rely on programmer capabilities only when needed. I do understand pointers. I do make mistake with them and it does have crazy consequences sometime. And I do not trust anyone that say me he/she doesn't. Because sometime we all are morrons.as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.
Dec 30 2011
On 12/30/2011 05:27 PM, Timon Gehr wrote:On 12/30/2011 10:36 PM, deadalnix wrote:Tsk tsk. Missing the point. I believe what deadalnix is trying to say is this: Programmers should try to write correct code, but should never trust themselves to write correct code. ... Programs worth writing are complex enough that there is no way any of us can write perfectly correct code on a first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear-downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test, right?) then I will have a hard time taking you seriously. That said, it is extremely pleasant to have a language that catches you when you inevitably fall.Because sometime we all are morrons.as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.
Dec 31 2011
On 12/31/2011 06:32 PM, Chad J wrote:On 12/30/2011 05:27 PM, Timon Gehr wrote:Not at all. And I don't take anyone seriously who feels the need to 'Tsk tsk' btw.On 12/30/2011 10:36 PM, deadalnix wrote:Tsk tsk. Missing the point.Because sometime we all are morrons.as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.I believe what deadalnix is trying to say is this: Programmers should try to write correct code, but should never trust themselves to write correct code.No, programmers should write correct code and then test it thoroughly. 'Trying to' is the wrong way to go about anything. And there is no need to distrust oneself. Anyway, I have a _very hard time_ translating 'acting like a moron' to 'writing correct code'.... Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.Testing is the main part of my development. Furthermore, I use assertions all over the place.That said, it is extremely pleasant to have a language that catches you when you inevitably fall.That is why I also like Haskell.
Dec 31 2011
On 12/31/2011 01:13 PM, Timon Gehr wrote:On 12/31/2011 06:32 PM, Chad J wrote:Well, you've certainly a right to it. I just take it a little rough when it seems like someone's words are being intentionally misread.On 12/30/2011 05:27 PM, Timon Gehr wrote:Not at all. And I don't take anyone seriously who feels the need to 'Tsk tsk' btw.On 12/30/2011 10:36 PM, deadalnix wrote:Tsk tsk. Missing the point.Because sometime we all are morrons.as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.There's a perfect reason to distrust oneself: oneself is a squishy meatbag that makes mistakes. Repeated "trying" with rigor applied will lead to success.I believe what deadalnix is trying to say is this: Programmers should try to write correct code, but should never trust themselves to write correct code.No, programmers should write correct code and then test it thoroughly. 'Trying to' is the wrong way to go about anything. And there is no need to distrust oneself.Anyway, I have a _very hard time_ translating 'acting like a moron' to 'writing correct code'.I'm pretty sure it's suggestive. If an intelligent or careful person acts like a moron, then they will be forced to assume that they will make mistakes, and therefore take measures to ensure that ALL mistakes are caught and fixed or mitigated. That is how you get from 'acting like a moron' to 'writing correct code'.I hear ya. I feel Haskell is an important language to understand, if not know how to use effectively. I wish I knew how to use it better than I do, but I haven't had too many projects that are amenable to it.... Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." 
If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.Testing is the main part of my development. Furthermore, I use assertions all over the place.That said, it is extremely pleasant to have a language that catches you when you inevitably fall.That is why I also like Haskell.
Dec 31 2011
On 31/12/2011 19:13, Timon Gehr wrote:On 12/31/2011 06:32 PM, Chad J wrote:Well, if you write correct code, you don't need assertions. They will always be true because your code is correct. Stop wasting your time with them. See how stupid this becomes?On 12/30/2011 05:27 PM, Timon Gehr wrote:Testing is the main part of my development. Furthermore, I use assertions all over the place.On 12/30/2011 10:36 PM, deadalnix wrote:Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.Because sometime we all are morrons.as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.
Jan 01 2012
On 01/01/2012 11:36 PM, deadalnix wrote:Le 31/12/2011 19:13, Timon Gehr a écrit :You miss the point. Testing and assertions are part of how I write correct code.On 12/31/2011 06:32 PM, Chad J wrote:Well, if you write correct code, you don't need assertion. They will always be true because your code is correct. Stop wasting your time with See how stupid this becomes ?On 12/30/2011 05:27 PM, Timon Gehr wrote:Testing is the main part of my development. Furthermore, I use assertions all over the place.On 12/30/2011 10:36 PM, deadalnix wrote:Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.Because sometime we all are morrons.as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.
Jan 01 2012
On 01/01/2012 23:46, Timon Gehr wrote:On 01/01/2012 11:36 PM, deadalnix wrote:So, to write correct code, you need to assume you'll write incorrect code. Writing correct code is your goal. Assuming you'll do stupid stuff is a quality required to advance toward this goal. And, saying that you test and assert a lot, you confirm that point.On 31/12/2011 19:13, Timon Gehr wrote:You miss the point. Testing and assertions are part of how I write correct code.On 12/31/2011 06:32 PM, Chad J wrote:Well, if you write correct code, you don't need assertion. They will always be true because your code is correct. Stop wasting your time with See how stupid this becomes ?On 12/30/2011 05:27 PM, Timon Gehr wrote:Testing is the main part of my development. Furthermore, I use assertions all over the place.On 12/30/2011 10:36 PM, deadalnix wrote:Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.Because sometime we all are morrons.acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.
Jan 04 2012
On 01/04/2012 07:08 PM, deadalnix wrote:Le 01/01/2012 23:46, Timon Gehr a écrit :You are free to believe whatever you want, but I think that strategy you are describing is a recipe for writing buggy code.On 01/01/2012 11:36 PM, deadalnix wrote:So, to write correct code, you need to asume you'll write incorrect code. Writing correct code is your goal. Asuming you'll do stupid stuff is a quality required to advance toward this goal.Le 31/12/2011 19:13, Timon Gehr a écrit :You miss the point. Testing and assertions are part of how I write correct code.On 12/31/2011 06:32 PM, Chad J wrote:Well, if you write correct code, you don't need assertion. They will always be true because your code is correct. Stop wasting your time with See how stupid this becomes ?On 12/30/2011 05:27 PM, Timon Gehr wrote:Testing is the main part of my development. Furthermore, I use assertions all over the place.On 12/30/2011 10:36 PM, deadalnix wrote:Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.Because sometime we all are morrons.acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.And, saying that you test and assert a lot,Code for which no tests exist is neither correct nor incorrect. Assertions are a neat way to detect parts of the application whose implementation is incomplete.you confirm that point.No.
Jan 04 2012
On 01/04/2012 11:31 PM, Timon Gehr wrote:Code for which no tests exist is neither correct nor incorrect. Assertions are a neat way to detect parts of the application whose implementation is incomplete.Another major use of them is the checked documentation of assumptions, mainly in method preconditions.
Jan 04 2012
On 12/30/2011 11:55 AM, Timon Gehr wrote:Me too. I think the way we have it now is optimal.Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.
Dec 30 2011
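Walter's claim -- naive byte indexing stays correct for ASCII in UTF-8 but not in other multibyte encodings -- can be checked directly. A sketch in Python (Shift-JIS chosen as the classic counterexample; its trail bytes can fall in the ASCII range):

```python
# In UTF-8 a byte below 0x80 only ever encodes that ASCII character, so
# scanning for an ASCII delimiter byte-by-byte is safe. In Shift-JIS it
# is not: the trail byte of a multibyte character may itself be ASCII.
sjis = "表".encode("shift_jis")  # b'\x95\x5c' -- trail byte is '\'
utf8 = "表".encode("utf-8")      # b'\xe8\xa1\xa8' -- no ASCII bytes

print(b"\\" in sjis)  # True: a naive scan finds a bogus backslash
print(b"\\" in utf8)  # False: UTF-8 never hides ASCII in a sequence
```

This stray-backslash problem is exactly what bit C code on Shift-JIS systems, and what UTF-8's design rules out.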
On 12/30/2011 11:01 PM, Walter Bright wrote:On 12/30/2011 11:55 AM, Timon Gehr wrote:You are right, that obviously needs fixing. ☺ Thanks!Me too. I think the way we have it now is optimal.Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8.That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.
Dec 30 2011
On 12/30/11 4:01 PM, Walter Bright wrote:On 12/30/2011 11:55 AM, Timon Gehr wrote:It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. AndreiMe too. I think the way we have it now is optimal.Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.
Dec 30 2011
On 12/31/2011 12:00 AM, Andrei Alexandrescu wrote:On 12/30/11 4:01 PM, Walter Bright wrote:auto raw(S)(S s) if(isNarrowString!S){ static if(is(S==string)) return cast(ubyte[])s; else static if(is(S==wstring)) return cast(ushort[])s; }On 12/30/2011 11:55 AM, Timon Gehr wrote:It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. AndreiMe too. I think the way we have it now is optimal.Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.
Dec 30 2011
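The D sketch above can be mirrored outside D as well. A hypothetical Python helper (the name raw and its return type are this illustration's choice, not an existing API) that exposes the code units as a plain integer buffer, just as the proposed .raw/.rep would:

```python
def raw(s: str) -> bytearray:
    """Expose the UTF-8 code units of s as a mutable integer buffer --
    the analogue of viewing a D string as ubyte[]."""
    return bytearray(s.encode("utf-8"))

u = raw("☺")
print(list(u))  # [226, 152, 186] -- the three code units of U+263A
```

The point of the type change is the same as in the thread: once you hold a byte buffer instead of a string, nothing pretends the elements are characters.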
On 12/30/11 5:07 PM, Timon Gehr wrote:On 12/31/2011 12:00 AM, Andrei Alexandrescu wrote:Almost there. https://github.com/D-Programming-Language/phobos/blob/master/std/string.d#L809 AndreiOn 12/30/11 4:01 PM, Walter Bright wrote:auto raw(S)(S s) if(isNarrowString!S){ static if(is(S==string)) return cast(ubyte[])s; else static if(is(S==wstring)) return cast(ushort[])s; }On 12/30/2011 11:55 AM, Timon Gehr wrote:It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. AndreiMe too. I think the way we have it now is optimal.Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.
Dec 30 2011
On 12/31/2011 01:03 AM, Andrei Alexandrescu wrote:On 12/30/11 5:07 PM, Timon Gehr wrote:alias std.string.representation raw;On 12/31/2011 12:00 AM, Andrei Alexandrescu wrote:Almost there. https://github.com/D-Programming-Language/phobos/blob/master/std/string.d#L809 AndreiOn 12/30/11 4:01 PM, Walter Bright wrote:auto raw(S)(S s) if(isNarrowString!S){ static if(is(S==string)) return cast(ubyte[])s; else static if(is(S==wstring)) return cast(ushort[])s; }On 12/30/2011 11:55 AM, Timon Gehr wrote:It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. AndreiMe too. I think the way we have it now is optimal.Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.
Dec 30 2011
On 12/30/11 6:07 PM, Timon Gehr wrote:alias std.string.representation raw;I meant your implementation is incomplete. But the main point is that presence of representation/raw is not the issue. The availability of good-for-nothing .length and operator[] are the issue. Putting in place the convention of using .raw is hardly useful within the context. Andrei
Dec 30 2011
On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:On 12/30/11 6:07 PM, Timon Gehr wrote:It was more a sketch than an implementation. It is not even type safe :o).alias std.string.representation raw;I meant your implementation is incomplete.But the main point is that presence of representation/raw is not the issue. The availability of good-for-nothing .length and operator[] are the issue. Putting in place the convention of using .raw is hardly useful within the context.D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but an utf-8 code unit. Therefore char[] is an array of code units. length gives the number of code units. operator[i] gives the i-th code unit. Nothing wrong or good-for-nothing about that. .raw would return ubyte[], therefore it would lose all type information. Effectively, what .raw does is a type cast that will let code point data alias with integral data. Consider: void foo(ubyte[] b)in{assert(b.length);}body{ b[0]=2; // perfectly fine } void main(){ char[] s = "☺".dup; auto b = s.raw; foo(b); writeln(s); // oops... } I fail to understand why that is desirable.
Dec 30 2011
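Timon's objection, replayed concretely: once the raw view is just integers, a perfectly legal write through it can silently destroy the UTF-8 invariant. A Python sketch of the same failure mode as his foo/main example:

```python
# Writing an arbitrary byte through the raw view is fine as byte-array
# data, but it can leave the underlying string malformed.
buf = bytearray("☺".encode("utf-8"))  # valid UTF-8: [226, 152, 186]
buf[0] = 2                             # legal on a byte buffer...
try:
    buf.decode("utf-8")
except UnicodeDecodeError as e:
    print("no longer valid UTF-8:", e.reason)  # invalid start byte
```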
On 31.12.2011 01:56, Timon Gehr wrote:On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated. In reality, char[] and wchar[] are compressed forms of dstring.On 12/30/11 6:07 PM, Timon Gehr wrote:It was more a sketch than an implementation. It is not even type safe :o).alias std.string.representation raw;I meant your implementation is incomplete.But the main point is that presence of representation/raw is not the issue. The availability of good-for-nothing .length and operator[] are the issue. Putting in place the convention of using .raw is hardly useful within the context.D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but an utf-8 code unit. Therefore char[] is an array of code units..raw would return ubyte[], therefore it would lose all type information. Effectively, what .raw does is a type cast that will let code point data alias with integral data.Exactly. It's just a "I know what I'm doing" signal.
Dec 31 2011
On 12/31/2011 01:15 PM, Don wrote:On 31.12.2011 01:56, Timon Gehr wrote:char[] is an array of char and the additional invariant is not enforced by the language.On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated.On 12/30/11 6:07 PM, Timon Gehr wrote:It was more a sketch than an implementation. It is not even type safe :o).alias std.string.representation raw;I meant your implementation is incomplete.But the main point is that presence of representation/raw is not the issue. The availability of good-for-nothing .length and operator[] are the issue. Putting in place the convention of using .raw is hardly useful within the context.D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but an utf-8 code unit. Therefore char[] is an array of code units.In reality, char[] and wchar[] are compressed forms of dstring.No, it is a "I don't know what I'm doing" signal: ubyte[] does not carry any sign of an additional invariant, and the aliasing can be used to break the invariant that is commonly assumed for char[]. That was my point..raw would return ubyte[], therefore it would lose all type information. Effectively, what .raw does is a type cast that will let code point data alias with integral data.Exactly. It's just a "I know what I'm doing" signal.
Dec 31 2011
On 31.12.2011 17:13, Timon Gehr wrote:On 12/31/2011 01:15 PM, Don wrote:No, it isn't an ordinary array. For example with concatenation. char[] ~ int will never create an invalid string. You can end up with multiple chars being appended, even from a single append. foreach is different, too. They are a bit magical. There's quite a lot of code in the compiler to make sure that strings remain valid. The additional invariant is not enforced in the case of slicing; that's the point.On 31.12.2011 01:56, Timon Gehr wrote:char[] is an array of char and the additional invariant is not enforced by the language.On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated.On 12/30/11 6:07 PM, Timon Gehr wrote:It was more a sketch than an implementation. It is not even type safe :o).alias std.string.representation raw;I meant your implementation is incomplete.But the main point is that presence of representation/raw is not the issue. The availability of good-for-nothing .length and operator[] are the issue. Putting in place the convention of using .raw is hardly useful within the context.D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but an utf-8 code unit. Therefore char[] is an array of code units.
Dec 31 2011
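Don's remark that a single append can add multiple chars is simply UTF-8 encoding at work: one code point above U+007F expands to several code units. Sketched in Python at the byte level:

```python
# Appending one non-ASCII character to the byte-level string adds more
# than one code unit.
buf = bytearray(b"abc")
buf += "é".encode("utf-8")  # one character, two code units appended
print(len(buf))             # 5, not 4
print(buf.decode("utf-8"))  # abcé -- still a valid string
```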
On 01/01/2012 08:10 AM, Don wrote:On 31.12.2011 17:13, Timon Gehr wrote:Yes it will. void main() { char[] x; writeln(x~255); }On 12/31/2011 01:15 PM, Don wrote:No, it isn't an ordinary array. For example with concatenation. char[] ~ int will never create an invalid string.On 31.12.2011 01:56, Timon Gehr wrote:char[] is an array of char and the additional invariant is not enforced by the language.On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated.On 12/30/11 6:07 PM, Timon Gehr wrote:It was more a sketch than an implementation. It is not even type safe :o).alias std.string.representation raw;I meant your implementation is incomplete.But the main point is that presence of representation/raw is not the issue. The availability of good-for-nothing .length and operator[] are the issue. Putting in place the convention of using .raw is hardly useful within the context.D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but an utf-8 code unit. Therefore char[] is an array of code units.You can end up with multiple chars being appended, even from a single append. foreach is different, too. They are a bit magical.Fair enough, but type conversion rules are a bit magical in general. void main() { auto a = cast(short[])[1,2,3]; auto b = [1,2,3]; auto c = cast(short[])b; assert(a!=c); }There's quite a lot of code in the compiler to make sure that strings remain valid.At the same time, there are many language features that allow to create invalid strings. auto a = "\377\252\314"; auto b = x"FF AA CC"; auto c = import("binary");The additional invariant is not enforced in the case of slicing; that's the point.
Jan 01 2012
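Timon's concatenation point can be condensed into a runnable check (a sketch assuming a D2 compiler and Phobos, where std.utf.validate throws UTFException on malformed input):

```d
import std.utf : validate, UTFException;

void main()
{
    char[] x;
    auto y = x ~ 255;      // accepted: 255 fits in char, yielding byte 0xFF
    bool malformed;
    try
        validate(y);       // 0xFF can never occur in well-formed UTF-8
    catch (UTFException e)
        malformed = true;
    assert(malformed);
}
```

So the type system alone does not preserve the UTF-8 invariant on char[].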
I'm not sure I understand what's wrong with length. Of all the times I get a length in one sizable i18nalized app at work I can think of only one instance where I actually want the character count rather than the byte count. Is there some other reason I'm not aware of that length is undesirable? Sent from my iPhone On Dec 30, 2011, at 4:12 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 12/30/11 6:07 PM, Timon Gehr wrote:alias std.string.representation raw;I meant your implementation is incomplete. But the main point is that presence of representation/raw is not the issue. The availability of good-for-nothing .length and operator[] are the issue. Putting in place the convention of using .raw is hardly useful within the context. Andrei
Dec 31 2011
On 12/30/2011 3:00 PM, Andrei Alexandrescu wrote:On 12/30/11 4:01 PM, Walter Bright wrote:Any other multibyte character encoding I've seen standardized for use in C.On 12/30/2011 11:55 AM, Timon Gehr wrote:It's true for any encoding with the prefix property, such as Huffman.Me too. I think the way we have it now is optimal.Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.
Dec 30 2011
On 2011-12-30 23:00:49 +0000, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw.After reading most of the thread, it seems to me like you're deconstructing strings as arrays one piece at a time, to the point where instead of arrays we'd basically get a string struct and do things on it. Maybe it's part of a grand scheme, more likely it's one realization after another leading to one change after another… let's see where all this will lead us:

0. in the beginning, strings were char[] arrays
1. arrays are generalized as ranges
2. phobos starts treating char arrays as bidirectional ranges of dchar (instead of random access ranges of char)
3. foreach on char[] should iterate over dchar by default
4. remove .length, random access, and slicing from char arrays
5. replace char[] with a struct { ubyte[] raw; }

Number 1 is great by itself, no debate there. Number 2 is debatable. Numbers 3 and 4 are somewhat required for consistency with number 2. Number 5 is just the logical conclusion of all these changes. If we want a fundamental change to what strings are in D, perhaps we should start focusing on the broader issue instead of trying to pass piecemeal changes one after the other. For consistency's sake, I think we should either stop after 1 or go all the way to 5. Either we do it fully or we don't do it at all. All those divergent interpretations of strings end up hurting the language. Walter and Andrei ought to find a way to agree with each other. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Dec 30 2011
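Point 2 of the list above is directly observable: Phobos's range primitives decode char[] to dchar even though indexing still yields raw code units (a sketch assuming current std.range semantics; the string contents are illustrative):

```d
import std.range : front;

void main()
{
    string s = "é!";       // 'é' encodes as two UTF-8 code units (0xC3 0xA9)
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 0x00E9);  // range view: first code *point*
    assert(s[0] == 0xC3);       // array view: first code *unit*
    assert(s.length == 3);      // .length counts code units
}
```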
On Friday, December 30, 2011 20:55:42 Timon Gehr wrote:1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.The problem is that what's more likely to happen in a lot of cases is that they use it wrong and don't notice, because they're only using ASCII in testing, _but_ they have bugs all over the place, because their code is actually used with unicode in the field. Yes, diligent programmers will generally find such problems, but with the current scheme, it's _so_ easy to use length when you shouldn't, that it's pretty much a guarantee that it's going to happen. I'm not sure that Andrei's suggestion is the best one at this point, but I sure wouldn't be against it being introduced. It wouldn't entirely fix the problem by any means, but programmers would then have to work harder at screwing it up and so there would be fewer mistakes. Arguably, the first issue with D strings is that we have char. In most languages, char is supposed to be a character, so many programmers will code with that expectation. If we had something like utf8unit, utf16unit, and utf32unit (arguably very bad, albeit descriptive, names) and no char, then it would force programmers to become semi-educated about the issues. There's no way that that's changing at this point though. - Jonathan M Davis
Dec 30 2011
On 12/31/2011 04:30 AM, Jonathan M Davis wrote:On Friday, December 30, 2011 20:55:42 Timon Gehr wrote:Then that is the fault of the guy who created the tests. At least that guy should be familiar with the issues, otherwise he is at the wrong position. Software should never be released without thorough testing.1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.The problem is that what's more likely to happen in a lot of cases is that they use it wrong and don't notice, because they're only using ASCII in testing, _but_ they have bugs all over the place, because their code is actually used with unicode in the field.Yes, diligent programmers will generally find such problems, but with the current scheme, it's _so_ easy to use length when you shouldn't, that it's pretty much a guarantee that it's going to happen. I'm not sure that Andrei's suggestion is the best one at this point, but I sure wouldn't be against it being introduced. It wouldn't entirely fix the problem by any means, but programmers would then have to work harder at screwing it up and so there would be fewer mistakes.Programmers would then also have to work harder at doing it right and at memoizing special cases, so there is absolutely no net gain.Arguably, the first issue with D strings is that we have char. In most languages, char is supposed to be a character, so many programmers will code with that expectation. If we had something like utf8unit, utf16unit, and utf32unit (arguably very bad, albeit descriptive, names) and no char, then it would force programmers to become semi-educated about the issues. There's no way that that's changing at this point though. - Jonathan M DavisA programmer has to have basic knowledge of the language he is programming in. That includes knowing the meaning of all basic types. 
If he fails at that, testing should definitely catch that kind of trivial bug.
Dec 30 2011
On 12/30/2011 7:30 PM, Jonathan M Davis wrote:Yes, diligent programmers will generally find such problems, but with the current scheme, it's _so_ easy to use length when you shouldn't, that it's pretty much a guarantee that it's going to happen.I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional. This was definitely not true of older multibyte schemes, like Shift-JIS (shudder), but those schemes ought to be terminated with extreme prejudice. But it definitely will take a long time to live down the bugs and miasma of code that had to deal with them. C and C++ still live with that because of their agenda of backwards compatibility. They still support EBCDIC, after all, that was obsolete even in the 70's. And I still see posts on comp.moderated.c++ that say "you shouldn't write string code like that, because it won't work on EBCDIC!" Sheesh!
Dec 30 2011
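The self-synchronizing property Walter relies on can be shown in a few lines: every code unit of a multi-byte UTF-8 sequence has its high bit set, so a plain code-unit scan for an ASCII byte can never produce a false match (sketch; the string contents are illustrative):

```d
void main()
{
    string s = "prix: 10€ - $5";  // '€' encodes as 0xE2 0x82 0xAC
    size_t hits;
    foreach (i; 0 .. s.length)    // naive [i]/.length indexing
        if (s[i] == '$')
            ++hits;
    assert(hits == 1);            // no spurious match inside '€'
}
```

This is exactly why the naive indexing version of the X macro code kept working.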
On 12/30/11 10:09 PM, Walter Bright wrote:On 12/30/2011 7:30 PM, Jonathan M Davis wrote:The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well. We need .raw and we must abolish .length and [] for narrow strings. AndreiYes, diligent programmers will generally find such problems, but with the current scheme, it's _so_ easy to use length when you shouldn't, that it's pretty much a guarantee that it's going to happen.I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.
Dec 30 2011
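The hazard under discussion is easy to reproduce: .length counts UTF-8 code units while the range primitives count code points (a sketch assuming current Phobos semantics):

```d
import std.range : walkLength;

void main()
{
    string s = "héllo";
    assert(s.length == 6);      // 'é' occupies two code units
    assert(s.walkLength == 5);  // five code points
    // s[1] is *not* 'é'; it is the first byte of its two-byte encoding
}
```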
On Sat, Dec 31, 2011 at 12:09 AM, Andrei Alexandrescu < SeeWebsiteForEmail erdani.org> wrote:On 12/30/11 10:09 PM, Walter Bright wrote:I don't know that Phobos would be an appropriate place for it but offering some easy to access string data containing extensive and advanced unicode which users could easily add to their programs unit tests may help people ensure proper unicode usage. Unicode seems to be one of those things where you either know it really well or you know just enough to get yourself in trouble so having test data written by unicode experts could be very useful for the rest of us mortals. I googled around a bit. This Stack Overflow came up < http://stackoverflow.com/questions/6136800/unicode-test-strings-for-unit-tests> that recommends these - UTF-8 stress test: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt - Quick Brown Fox in a variety of languages: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt I didn't see too much beyond those two. Regards, Brad A.On 12/30/2011 7:30 PM, Jonathan M Davis wrote:The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well. We need .raw and we must abolish .length and [] for narrow strings. AndreiYes, diligent programmers will generally find such problems, but with the current scheme, it's _so_ easy to use length when you shouldn't, that it's pretty much a guarantee that it's going to happen.I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.
Dec 30 2011
On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:On 12/30/11 10:09 PM, Walter Bright wrote:I'm not so sure it's quite the same. Java was designed before there were surrogate pairs, they kinda got the rug pulled out from under them. So, they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit==codepoint, it is embedded in the design of the language, library, and culture. This is not true of D. It's designed from the ground up to deal properly with UTF. D has very simple language features to deal with it.I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well.We need .raw and we must abolish .length and [] for narrow strings.I don't believe that fixes anything and breaks every D project out there. We're chasing phantoms here, and I worry a lot about over-engineering trivia. And, we already have a type to deal with it: dstring
Dec 31 2011
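For comparison, dstring behaves the way newcomers expect string to behave, since UTF-32 code units and code points coincide, at the cost of memory (sketch assuming std.conv.to for transcoding):

```d
import std.conv : to;

void main()
{
    string  s = "héllo";
    dstring d = to!dstring(s);
    assert(s.length == 6);  // UTF-8 code units
    assert(d.length == 5);  // UTF-32: units == code points
    assert(d[1] == 'é');    // random access by code point works
}
```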
2011/12/31 Walter Bright <newshound2 digitalmars.com>:On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:I fully agree with Walter. No need more wrapper for string. Kenji HaraOn 12/30/11 10:09 PM, Walter Bright wrote:I'm not so sure it's quite the same. Java was designed before there were surrogate pairs, they kinda got the rug pulled out from under them. So, they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit==codepoint, it is embedded in the design of the language, library, and culture. This is not true of D. It's designed from the ground up to deal properly with UTF. D has very simple language features to deal with it.I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well.We need .raw and we must abolish .length and [] for narrow strings.I don't believe that fixes anything and breaks every D project out there. We're chasing phantoms here, and I worry a lot about over-engineering trivia. And, we already have a type to deal with it: dstring
Dec 31 2011
On 12/31/11 2:04 AM, Walter Bright wrote:On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:I disagree. It is designed to make dealing with UTF possible.On 12/30/11 10:09 PM, Walter Bright wrote:I'm not so sure it's quite the same. Java was designed before there were surrogate pairs, they kinda got the rug pulled out from under them. So, they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit==codepoint, it is embedded in the design of the language, library, and culture. This is not true of D. It's designed from the ground up to deal properly with UTF.I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well.D has very simple language features to deal with it.Disagree. I mean simple they are, no contest. They could and should be much better, make correct code easier to write, and make incorrect code more difficult to write. Claiming we reached perfection there doesn't quite fit.I agree. This is the only reason that keeps me from furthering the issue.We need .raw and we must abolish .length and [] for narrow strings.I don't believe that fixes anything and breaks every D project out there.We're chasing phantoms here, and I worry a lot about over-engineering trivia.I disagree. 
I understand that seems trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through insistence it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But it being moot does not afford me to concede this point. I am right.And, we already have a type to deal with it: dstringNo. Andrei
Dec 31 2011
On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 12/31/11 2:04 AM, Walter Bright wrote:Perfect? At one time Java and other frameworks started to use UTF-16 code units as if they were characters, and that turned out badly for them. Now we know that not even code points should be considered characters, thanks to characters spanning multiple code points. You might call it perfect, but for that you have made two assumptions: 1. treating code points as characters is good enough, and 2. the performance penalty of decoding everything is tolerable Ranges of code points might be perfect for you, but it's a tradeoff that won't work in every situation. The whole concept of generic algorithms working on strings efficiently doesn't work. Applying generic algorithms to strings by treating them as a range of code points is both wasteful (because it forces you to decode everything) and incomplete (because of multi-code-point characters), and it should be avoided. Algorithms working on Unicode strings should be designed with Unicode in mind. And the best way to design efficient Unicode algorithms is to access the array of code units directly and read each character at the level of abstraction required and know what you're doing. I'm not against making strings more opaque to encourage people to use the Unicode algorithms from the standard library instead of rolling their own. But I doubt the current approach of using .raw alone will prevent many from doing dumb things. On the other side I'm sure it'll make it more complicated to write Unicode algorithms, because accessing and especially slicing the raw content of char[] will become tiresome. I'm not convinced it's a net win. As for Walter being the only one coding by looking at the code units directly, that's not true.
All my parser code looks at code units directly and only decodes to code points where necessary (just look at the XML parsing code I posted a while ago to get an idea of how it can apply to ranges). And I don't think it's because I've seen Walter's code before, I think it is because I know how Unicode works and I want to make my parser efficient. I've done the same for a parser in C++ a while ago. I can hardly imagine I'm the only one (with Walter and you). I think this is how efficient algorithms dealing with Unicode should be written. -- Michel Fortin michel.fortin michelf.com http://michelf.com/We're chasing phantoms here, and I worry a lot about over-engineering trivia.I disagree. I understand that seems trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through insistence it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But it being moot does not afford me to concede this point. I am right.
Dec 31 2011
On 12/31/11 8:17 CST, Michel Fortin wrote:On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:Sorry, I exaggerated. I meant "a net improvement while keeping simplicity".On 12/31/11 2:04 AM, Walter Bright wrote:Perfect?We're chasing phantoms here, and I worry a lot about over-engineering trivia.I disagree. I understand that seems trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through insistence it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But it being moot does not afford me to concede this point. I am right.At one time Java and other frameworks started to use UTF-16 as if they were characters, that turned wrong on them. Now we know that not even code points should be considered characters, thanks to characters spanning on multiple code points. You might call it perfect, but for that you have made two assumptions: 1. treating code points as characters is good enough, and 2. the performance penalty of decoding everything is tolerableI'm not sure how you concluded I drew such assumptions.Ranges of code points might be perfect for you, but it's a tradeoff that won't work in every situations.Ranges can be defined to span logical glyphs that span multiple code points.The whole concept of generic algorithms working on strings efficiently doesn't work.Apparently std.algorithm does.Applying generic algorithms to strings by treating them as a range of code points is both wasteful (because it forces you to decode everything) and incomplete (because of multi-code-point characters) and it should be avoided.An algorithm that gains by accessing the encoding can do so - and indeed some do. 
Spanning multi-code-point characters is a matter of defining the range appropriately; it doesn't break the abstraction.Algorithms working on Unicode strings should be designed with Unicode in mind. And the best way to design efficient Unicode algorithms is to access the array of code units directly and read each character at the level of abstraction required and know what you're doing.As I said, that's happening already.I'm not against making strings more opaque to encourage people to use the Unicode algorithms from the standard library instead of rolling their own.I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical character vs. array of code units) more distinguished from each other. That's a Good Thing(tm).But I doubt the current approach of using .raw alone will prevent many from doing dumb things.I agree. But I think it would be a sensible improvement over now, when you get to do a ton of dumb things with much more ease.On the other side I'm sure it'll make it it more complicated to write Unicode algorithms because accessing and especially slicing the raw content of char[] will become tiresome. I'm not convinced it's a net win.Many Unicode algorithms don't need slicing. Those that do carefully mix manipulation of code points with manipulation of representation. It is a net win that the two operations are explicitly distinguished.As for Walter being the only one coding by looking at the code units directly, that's not true. All my parser code look at code units directly and only decode to code points where necessary (just look at the XML parsing code I posted a while ago to get an idea to how it can apply to ranges). And I don't think it's because I've seen Walter code before, I think it is because I know how Unicode works and I want to make my parser efficient. I've done the same for a parser in C++ a while ago. I can hardly imagine I'm the only one (with Walter and you). 
I think this is how efficient algorithms dealing with Unicode should be written.Congratulations. Andrei
Dec 31 2011
On Saturday, 31 December 2011 at 15:03:13 UTC, Andrei Alexandrescu wrote:According to my research[1], std.array.replace (which uses std.algorithm under the hood) can be at least 40% faster when there is a match and 70% faster when there isn't one. I don't think this is actually related to UTF, though. [1]: http://dump.thecybershadow.net/5cfb6713ce6628686c6aa8a23b15c99e/test.dThe whole concept of generic algorithms working on strings efficiently doesn't work.Apparently std.algorithm does.
Dec 31 2011
On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 12/31/11 8:17 CST, Michel Fortin wrote:1: Because treating UTF-8 strings as a range of code point encourage people to think so. 2: From things you posted on the newsgroup previously. Sorry I don't have the references, but it'd take too long to dig them back.At one time Java and other frameworks started to use UTF-16 as if they were characters, that turned wrong on them. Now we know that not even code points should be considered characters, thanks to characters spanning on multiple code points. You might call it perfect, but for that you have made two assumptions: 1. treating code points as characters is good enough, and 2. the performance penalty of decoding everything is tolerableI'm not sure how you concluded I drew such assumptions.I'm talking about the default interpretation, where string ranges are ranges of code units, making that tradeoff the default. And also, I think we can agree that a logical glyph range would be terribly inefficient in practice, although it could be a nice teaching tool.Ranges of code points might be perfect for you, but it's a tradeoff that won't work in every situations.Ranges can be defined to span logical glyphs that span multiple code points.First, it doesn't really work. It seems to work fine, but it doesn't handle (yet) characters spanning multiple code points. To handle this case, you could use a logical glyph range, but that'd be quite inefficient. Or you can improve the algorithm working on code points so that it checks for combining characters on the edges, but then is it still a generic algorithm? Second, it doesn't work efficiently. Sure you can specialize the algorithm so it does not decode all code units when it's not necessary, but then does it still classify as a generic algorithm? My point is that *generic* algorithms cannot work *efficiently* with Unicode, not that they can't work at all. 
And even then, for the inefficient generic algorithm to work correctly with all input, the user needs to choose the correct Unicode representation for the problem at hand, which requires some general knowledge of Unicode. Which is why I'd just discourage generic algorithms for strings.It's a good abstraction to show the theory of Unicode. But it's not the way to go if you want efficiency. For efficiency you need for each element in the string to use the lowest abstraction required to handle this element, so your algorithm needs to know about the various abstraction layers. This is the kind of "range" I'd use to create algorithms dealing with Unicode properly:

    struct UnicodeRange(U)
    {
        U frontUnit() @property;
        dchar frontPoint() @property;
        immutable(U)[] frontGlyph() @property;
        void popFrontUnit();
        void popFrontPoint();
        void popFrontGlyph();
        ...
    }

Not really a range per your definition of ranges, but basically it lets you intermix working with units, code points, and glyphs. Add a way to slice at the unit level and a way to know the length at the unit level and it's all I need to make an efficient parser, or any algorithm really. The problem with .raw is that it creates a separate range for the units. This means you can't look at the frontUnit and then decide to pop the unit and then look at the next, decide you need to decode using frontPoint, then call popPoint and return to looking at the front unit. Also, I'm not sure the "glyph" part of that range is required most of the time, because most of the time you don't need to decode glyphs to be glyph-aware. But it'd be nice if you wanted to count them, and having it there alongside the rest makes users aware of them.
-- Michel Fortin michel.fortin michelf.com http://michelf.com/I'm not against making strings more opaque to encourage people to use the Unicode algorithms from the standard library instead of rolling their own.I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical character vs. array of code units) more distinguished from each other. That's a Good Thing(tm).
Dec 31 2011
On 12/31/11 10:47 AM, Michel Fortin wrote:On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:That's sort of difficult to refute. Anyhow, I think it's great that algorithms can use types to go down to the representation if needed, and stay up at bidirectional range level otherwise.On 12/31/11 8:17 CST, Michel Fortin wrote:1: Because treating UTF-8 strings as a range of code point encourage people to think so. 2: From things you posted on the newsgroup previously. Sorry I don't have the references, but it'd take too long to dig them back.At one time Java and other frameworks started to use UTF-16 as if they were characters, that turned wrong on them. Now we know that not even code points should be considered characters, thanks to characters spanning on multiple code points. You might call it perfect, but for that you have made two assumptions: 1. treating code points as characters is good enough, and 2. the performance penalty of decoding everything is tolerableI'm not sure how you concluded I drew such assumptions.Well people who want that could use byGlyph() or something. If you want glyphs, you gotta pay the price.I'm talking about the default interpretation, where string ranges are ranges of code units, making that tradeoff the default. 
And also, I think we can agree that a logical glyph range would be terribly inefficient in practice, although it could be a nice teaching tool.Ranges of code points might be perfect for you, but it's a tradeoff that won't work in every situations.Ranges can be defined to span logical glyphs that span multiple code points.Oh yes it does.First, it doesn't really work.The whole concept of generic algorithms working on strings efficiently doesn't work.Apparently std.algorithm does.It seems to work fine, but it doesn't handle (yet) characters spanning multiple code points.That's the job of std.range, not std.algorithm.To handle this case, you could use a logical glyph range, but that'd be quite inefficient. Or you can improve the algorithm working on code points so that it checks for combining characters on the edges, but then is it still a generic algorithm? Second, it doesn't work efficiently. Sure you can specialize the algorithm so it does not decode all code units when it's not necessary, but then does it still classify as a generic algorithm? My point is that *generic* algorithms cannot work *efficiently* with Unicode, not that they can't work at all. And even then, for the inneficient generic algorithm to work correctly with all input, the user need to choose the correct Unicode representation to for the problem at hand, which requires some general knowledge of Unicode. Which is why I'd just discourage generic algorithms for strings.I think you are in a position that is defensible, but not generous and therefore undesirable. The military equivalent would be defending a fortified landfill drained by a sewer. You don't _want_ to be there. Taking your argument to its ultimate conclusion is that we give up on genericity for strings and go home. Strings are a variable-length encoding on top of an array. That is a relatively easy abstraction to model. 
Currently we don't have a dedicated model for that - we offer the encoded data as a bidirectional range and also the underlying array. Algorithms that work with bidirectional ranges work out of the box. Those that can use the representation gainfully can opportunistically specialize on isSomeString!R. You contend that that doesn't "work", and I think you're wrong. But to the extent you have a case, an abstraction could be defined for variable-length encodings, and algorithms could be defined to work with that abstraction. I thought several times about that, but couldn't gather enough motivation for the simple reason that the current approach _works_.Correct.It's a good abstraction to show the theory of Unicode. But it's not the way to go if you want efficiency. For efficiency you need for each element in the string to use the lowest abstraction required to handle this element, so your algorithm needs to know about the various abstraction layers.I'm not against making strings more opaque to encourage people to use the Unicode algorithms from the standard library instead of rolling their own.I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical character vs. array of code units) more distinguished from each other. That's a Good Thing(tm).This is the kind of "range" I'd use to create algorithms dealing with Unicode properly: struct UnicodeRange(U) { U frontUnit() property; dchar frontPoint() property; immutable(U)[] frontGlyph() property; void popFrontUnit(); void popFrontPoint(); void popFrontGlyph(); ... }We already have most of that. For a string s, s[0] is frontUnit, s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is popFrontPoint. We only need to define the glyph routines. But I think you'd be stopping short. You want generic variable-length encoding, not the above.Not really a range per your definition of ranges, but basically it lets you intermix working with units, code points, and glyphs. 
Add a way to slice at the unit level and a way to know the length at the unit level and it's all I need to make an efficient parser, or any algorithm really. Except for the glyphs implementation, we're already there. You are talking about existing capabilities! The problem with .raw is that it creates a separate range for the units. That's the best part about it. This means you can't look at the frontUnit and then decide to pop the unit and then look at the next, decide you need to decode using frontPoint, then call popPoint and return to looking at the front unit. Of course you can. while (condition) { if (s.raw.front == someFrontUnitThatICareAbout) { s.raw.popFront(); auto c = s.front; s.popFront(); } } Now that I wrote it I'm even more enthralled with the coolness of the scheme. You essentially have access to two separate ranges on top of the same fabric. Andrei
Dec 31 2011
On 2011-12-31 18:56:01 +0000, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said: On 12/31/11 10:47 AM, Michel Fortin wrote: As I keep saying, if you handle combining code points at the range level you'll have very inefficient code. But I think you get that. It seems to work fine, but it doesn't handle (yet) characters spanning multiple code points. That's the job of std.range, not std.algorithm. I don't get the analogy. To handle this case, you could use a logical glyph range, but that'd be quite inefficient. Or you can improve the algorithm working on code points so that it checks for combining characters on the edges, but then is it still a generic algorithm? Second, it doesn't work efficiently. Sure you can specialize the algorithm so it does not decode all code units when it's not necessary, but then does it still classify as a generic algorithm? My point is that *generic* algorithms cannot work *efficiently* with Unicode, not that they can't work at all. And even then, for the inefficient generic algorithm to work correctly with all input, the user needs to choose the correct Unicode representation for the problem at hand, which requires some general knowledge of Unicode. Which is why I'd just discourage generic algorithms for strings. I think you are in a position that is defensible, but not generous and therefore undesirable. The military equivalent would be defending a fortified landfill drained by a sewer. You don't _want_ to be there. Taking your argument to its ultimate conclusion is that we give up on genericity for strings and go home. That is more or less what I am saying. Genericity for strings leads to inefficient algorithms, and you don't want inefficient algorithms, at least not without being warned in advance. This is why for instance you give a special name to inefficient (linear) operations in std.container. 
In the same way, I think generic operations on strings should be disallowed unless you opt in by explicitly saying on which representation you want the algorithm to perform its task. Indeed. I came up with this concept when writing my XML parser, I defined frontUnit and popFrontUnit and used it all over the place (in conjunction with slicing). And I rarely needed to decode whole code points using front and popFront. This is the kind of "range" I'd use to create algorithms dealing with Unicode properly: struct UnicodeRange(U) { U frontUnit() property; dchar frontPoint() property; immutable(U)[] frontGlyph() property; void popFrontUnit(); void popFrontPoint(); void popFrontGlyph(); ... } We already have most of that. For a string s, s[0] is frontUnit, s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is popFrontPoint. We only need to define the glyph routines. But I think you'd be stopping short. You want generic variable-length encoding, not the above. Really? How'd that work? Except for the glyphs implementation, we're already there. You are talking about existing capabilities! Depends. It should create a *linked* range, not a *separate* one, in the sense that if you advance the "raw" range with popFront, it should advance the underlying "code point" range too. The problem with .raw is that it creates a separate range for the units. That's the best part about it. But will s.raw.popFront() also pop a single unit from s? "raw" would need to be defined as a reinterpret cast of the reference to the char[] to do what I want, something like this: ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; } The current std.string.representation doesn't do that at all. Also, how does it work with slicing? 
It can work with raw, but you'll have to cast things everywhere because raw is a ubyte[]: string s = "éà"; s = cast(typeof(s))s.raw[0..4]; This means you can't look at the frontUnit and then decide to pop the unit and then look at the next, decide you need to decode using frontPoint, then call popPoint and return to looking at the front unit. Of course you can. while (condition) { if (s.raw.front == someFrontUnitThatICareAbout) { s.raw.popFront(); auto c = s.front; s.popFront(); } } Now that I wrote it I'm even more enthralled with the coolness of the scheme. You essentially have access to two separate ranges on top of the same fabric. Glad you like the concept. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
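The unit/point/glyph distinction the post relies on can be made concrete with a combining character. A minimal illustration (in Python rather than D, since the fact being shown is language-agnostic): "é" written as "e" plus a combining accent is one user-perceived glyph, two code points, and three UTF-8 code units.

```python
# 'e' followed by U+0301 COMBINING ACUTE ACCENT: renders as "é"
s = "e\u0301"

units = s.encode("utf-8")   # the code-unit level (what a D char[] stores)
points = list(s)            # the code-point level (what dchar iteration yields)

assert len(units) == 3      # 0x65, then 0xCC 0x81 for the combining mark
assert len(points) == 2     # 'e' and the combining accent
# A glyph/grapheme-level range would report exactly 1 "character";
# neither the unit level nor the point level gives you that for free.
```

This is why frontUnit, frontPoint, and frontGlyph can all legitimately disagree on what "the front of the string" is.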
Dec 31 2011
On 12/31/11 2:44 PM, Michel Fortin wrote:But will s.raw.popFront() also pop a single unit from s? "raw" would need to be defined as a reinterpret cast of the reference to the char[] to do what I want, something like this: ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; } The current std.string.representation doesn't do that at all.You just found a bug! Andrei
Dec 31 2011
On 12/31/2011 07:56 PM, Andrei Alexandrescu wrote:On 12/31/11 10:47 AM, Michel Fortin wrote:There is nothing wrong with the scheme on the conceptual level (except maybe that .raw.popFront() lets you invalidate the code point range). But making built-in arrays behave that way is like fitting a square peg in a round hole. immutable(char)[] is actually what .raw should return, not what it should be called on. It is already the raw representation.This means you can't look at the frontUnit and then decide to pop the unit and then look at the next, decide you need to decode using frontPoint, then call popPoint and return to looking at the front unit.Of course you can. while (condition) { if (s.raw.front == someFrontUnitThatICareAbout) { s.raw.popFront(); auto c = s.front; s.popFront(); } } Now that I wrote it I'm even more enthralled with the coolness of the scheme. You essentially have access to two separate ranges on top of the same fabric. Andrei
Dec 31 2011
On 12/31/2011 03:17 PM, Michel Fortin wrote: As for Walter being the only one coding by looking at the code units directly, that's not true. All my parser code looks at code units directly and only decodes to code points where necessary (just look at the XML parsing code I posted a while ago to get an idea of how it can apply to ranges). And I don't think it's because I've seen Walter's code before, I think it is because I know how Unicode works and I want to make my parser efficient. I've done the same for a parser in C++ a while ago. I can hardly imagine I'm the only one (with Walter and you). I think this is how efficient algorithms dealing with Unicode should be written. +1.
Dec 31 2011
I don't know that Unicode expertise is really required here anyway. All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science. That said, I'm on the fence about this change. It breaks consistency for a benefit I'm still weighing. With this change, the char type will still be a single byte, correct? What happens to foreach on strings? Sent from my iPhone On Dec 31, 2011, at 8:20 AM, Timon Gehr <timon.gehr gmx.ch> wrote: On 12/31/2011 03:17 PM, Michel Fortin wrote: As for Walter being the only one coding by looking at the code units directly, that's not true. All my parser code looks at code units directly and only decodes to code points where necessary (just look at the XML parsing code I posted a while ago to get an idea of how it can apply to ranges). And I don't think it's because I've seen Walter's code before, I think it is because I know how Unicode works and I want to make my parser efficient. I've done the same for a parser in C++ a while ago. I can hardly imagine I'm the only one (with Walter and you). I think this is how efficient algorithms dealing with Unicode should be written. +1.
Dec 31 2011
On 12/31/11 10:47 AM, Sean Kelly wrote:I don't know that Unicode expertise is really required here anyway. All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science. That said, I'm on the fence about this change. It breaks consistency for a benefit I'm still weighing. With this change, the char type will still be a single byte, correct? What happens to foreach on strings?Clearly this is a what-if debate. The best level of agreement we could ever reach is "well, it would've been nice... sigh". It's possible that we'll define a Rope type in std.container - a heavy-duty string type with small string optimization, interning, the works. That type may use insights we are deriving from this exchange. Andrei
Dec 31 2011
On 12/31/2011 08:06 PM, Andrei Alexandrescu wrote:On 12/31/11 10:47 AM, Sean Kelly wrote:That would be great.I don't know that Unicode expertise is really required here anyway. All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science. That said, I'm on the fence about this change. It breaks consistency for a benefit I'm still weighing. With this change, the char type will still be a single byte, correct? What happens to foreach on strings?Clearly this is a what-if debate. The best level of agreement we could ever reach is "well, it would've been nice... sigh". It's possible that we'll define a Rope type in std.container - a heavy-duty string type with small string optimization, interning, the works. That type may use insights we are deriving from this exchange. Andrei
Dec 31 2011
On 2011-12-31 16:47:40 +0000, Sean Kelly <sean invisibleduck.org> said: I don't know that Unicode expertise is really required here anyway. All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science. It's not bytes vs. characters, it's code units vs. code points vs. user-perceived characters (grapheme clusters). One character can span multiple code points, and can be represented in various ways depending on which Unicode normalization you pick. But most people don't know that. If you want to count the number of *characters*, counting code points isn't really it, as you should avoid counting the combining ones. If you want to search for a substring, you need to be sure both strings use the same normalization first, and if not normalize them appropriately so that equivalent code point combinations are always represented the same. That said, if you are implementing an XML or JSON parser, since those specs are defined in terms of code points you should probably write your code in terms of code points (hopefully without decoding code points when you don't need to). On the other hand, if you're writing something that processes text (like counting the average number of *characters* per word in a document), then you should be aware of combining characters. How to pack all this into an easy-to-use package is most challenging. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
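Michel's point about normalization-sensitive substring search is easy to demonstrate. A sketch (in Python, since the behavior is defined by Unicode itself, not by any one language): the same word in NFC and NFD forms compares unequal and defeats naive search until both sides are normalized.

```python
import unicodedata

nfc = "caf\u00e9"    # "café" precomposed (NFC): é is one code point
nfd = "cafe\u0301"   # "café" decomposed (NFD): e + combining acute accent

# Canonically equivalent text, yet naive comparison and search fail:
assert nfc != nfd
assert nfd.find(nfc) == -1

# Normalizing both sides to the same form first makes search behave:
assert unicodedata.normalize("NFC", nfd) == nfc

# And counting code points over-counts "characters" in the NFD form:
assert len(nfc) == 4
assert len(nfd) == 5
```

The same trap exists in any language whose string search compares code units directly, which is exactly why the post says both strings must share a normalization before searching.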
Dec 31 2011
Sorry, I was simplifying. The distinction I was trying to make was between generic operations (in my experience the majority) vs. encoding-aware ones. Sent from my iPhone On Dec 31, 2011, at 12:48 PM, Michel Fortin <michel.fortin michelf.com> wrote: On 2011-12-31 16:47:40 +0000, Sean Kelly <sean invisibleduck.org> said: I don't know that Unicode expertise is really required here anyway. All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science. It's not bytes vs. characters, it's code units vs. code points vs. user-perceived characters (grapheme clusters). One character can span multiple code points, and can be represented in various ways depending on which Unicode normalization you pick. But most people don't know that. If you want to count the number of *characters*, counting code points isn't really it, as you should avoid counting the combining ones. If you want to search for a substring, you need to be sure both strings use the same normalization first, and if not normalize them appropriately so that equivalent code point combinations are always represented the same. That said, if you are implementing an XML or JSON parser, since those specs are defined in terms of code points you should probably write your code in terms of code points (hopefully without decoding code points when you don't need to). On the other hand, if you're writing something that processes text (like counting the average number of *characters* per word in a document), then you should be aware of combining characters. How to pack all this into an easy-to-use package is most challenging. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Dec 31 2011
Andrei Alexandrescu: We need .raw and we must abolish .length and [] for narrow strings. I don't know if we need that, but I agree those things are an improvement over the current state. To replace the disabled slicing I think something like Python's islice() will be useful. Bye, bearophile
Dec 31 2011
Timon Gehr wrote: Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. +1 From D's string docs: "char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format. dchar[] strings are in UTF-32 format." I would additionally add some clarifications: char[] is an array of 8-bit code units. A Unicode code point may take up to 4 chars. wchar[] is an array of 16-bit code units. A Unicode code point may take up to 2 wchars. dchar[] is an array of 32-bit code units. A Unicode code point always fits into one dchar. Each of these formats may encode any Unicode string. If you need indexing or slicing use: * char[] or string when working with ASCII code points. * wchar[] or wstring when working with Basic Multilingual Plane (BMP) code points. * dchar[] or dstring when working with all possible code points. If you do not need indexing or slicing you may use any of the formats.
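The code-unit counts claimed above check out against the encodings themselves. A quick illustration (in Python for verifiability; the numbers are properties of UTF-8/16/32, not of any language), using one BMP code point and one astral (beyond-BMP) code point:

```python
e_acute = "\u00e9"       # U+00E9 LATIN SMALL LETTER E WITH ACUTE, inside the BMP
clef    = "\U0001d11e"   # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

# UTF-8: 1 to 4 one-byte code units per code point
assert len(e_acute.encode("utf-8")) == 2
assert len(clef.encode("utf-8")) == 4

# UTF-16: 1 unit for BMP code points, 2 (a surrogate pair) beyond the BMP
assert len(e_acute.encode("utf-16-le")) // 2 == 1
assert len(clef.encode("utf-16-le")) // 2 == 2

# UTF-32: always exactly 1 unit per code point
assert len(clef.encode("utf-32-le")) // 4 == 1
```

This is also why wchar[]/wstring indexing is only safe for BMP text: a surrogate pair is two wchars.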
Dec 31 2011
On 12/30/2011 02:55 PM, Timon Gehr wrote: On 12/30/2011 08:33 PM, Joshua Reusch wrote: But correct (enough). On 29.12.2011 19:36, Andrei Alexandrescu wrote: Inefficient. On 12/29/11 12:28 PM, Don wrote: Maybe it could happen if we 1. make dstring the default string type --On 28.12.2011 20:00, Andrei Alexandrescu wrote: Exactly! Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum. If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change. If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write: ubyte[] u = s.rep; and use u from then on. I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect. Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen... *sigh*, FINE. Code units and /code points/ would be the same. code units and characters would be the same Wrong. Inconsistent? How? Inefficiency is a lot easier to deal with than incorrectness. If something is inefficient, then in the right places I will NOTICE. If something is incorrect, it can hide for years until that one person (or country, in this case) with a different usage pattern than the others uncovers it. or 2. 
forward string.length to std.utf.count and opIndex to std.utf.toUTFindex. Inconsistent and inefficient (it blows up the algorithmic complexity). What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding. so programmers could use the slices/indexing/length (no laziness problems), and if they really want code units use .raw/.rep (or better .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)) Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice. How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code? But generally I liked the idea of just having an alias for strings... Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen? 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing. 2. They get screwed up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again. Except they don't. Because there are a lot of programmers that will never put in non-ASCII strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ASCII strings in. This could make some messes. I have *never* seen a user in D.learn complain about it. 
There might have been some I missed, but it is certainly not a prevalent problem. Also, just because a user can type .rep does not mean he understands Unicode: He is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_. You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-Latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise. ... There's another issue at play here too: efficiency vs correctness as a default. Here's the tradeoff -- Option A: char[i] returns the i'th byte of the string as a (char) type. Consequences: (1) Code is efficient and INcorrect. (2) It requires extra effort to write correct code. (3) Detecting the incorrect code may take years, as these errors can hide easily. Option B: char[i] returns the i'th codepoint of the string as a (dchar) type. Consequences: (1) Code is INefficient and correct. (2) It requires extra effort to write efficient code. (3) Detecting the inefficient code happens in minutes. It is VERY noticeable when your program runs too slowly. This is how I see it. And I really like my correct code. 
If it's too slow, and I'll /know/ when it's too slow, then I'll profile->tweak->profile->etc until the slowness goes away. I'm totally digging option B.
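The Option A hazard described above (byte indexing that "works" on ASCII and silently goes wrong past 0x80) is easy to reproduce. A sketch in Python rather than D, operating on the raw UTF-8 bytes the same way a D char[] would be indexed:

```python
s = "héllo"
b = s.encode("utf-8")      # the code-unit view, like a D char[]

assert len(s) == 5         # 5 code points
assert len(b) == 6         # 6 code units: 'é' occupies two bytes

# Slicing at byte positions can split 'é' mid-sequence, corrupting the text:
try:
    b[:2].decode("utf-8")  # b'h\xc3' ends inside a multi-byte sequence
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
assert not split_ok
```

With pure-ASCII test data the byte and code-point views coincide, so the bug stays invisible until non-ASCII input arrives, which is exactly the "hides for years" failure mode being argued about.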
Dec 31 2011
On 12/31/2011 07:22 PM, Chad J wrote:On 12/30/2011 02:55 PM, Timon Gehr wrote:Relax.On 12/30/2011 08:33 PM, Joshua Reusch wrote:But correct (enough).Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:Inefficient.On 12/29/11 12:28 PM, Don wrote:Maybe it could happen if we 1. make dstring the default strings type --On 28.12.2011 20:00, Andrei Alexandrescu wrote:Exactly!Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it's just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write: ubyte [] u = s.rep; and use u from then on. I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...*sigh*, FINE. Code units and /code points/ would be the same.code units and characters would be the sameWrong.int[] bool[] float[] char[]Inconsistent? How?or 2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindexInconsistent and inefficient (it blows up the algorithmic complexity).Inefficiency is a lot easier to deal with than incorrect. 
If something is inefficient, then in the right places I will NOTICE. If something is incorrect, it can hide for years until that one person (or country, in this case) with a different usage pattern than the others uncovers it.Except that the proposal would make slicing strings go away.What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.so programmers could use the slices/indexing/length (no lazyness problems), and if they really want codeunits use .raw/.rep (or better .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?But generally I liked the idea of just having an alias for strings...Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen? 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.2. 
They get screwed up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again.Except they don't. Because there are a lot of programmers that will never put in non-ascii strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in. This could make some messes.I have *never* seen an user in D.learn complain about it. They might have been some I missed, but it is certainly not a prevalent problem. Also, just because an user can type .rep does not mean he understands Unicode: He is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_.You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.... There's another issue at play here too: efficiency vs correctness as a default. Here's the tradeoff -- Option A: char[i] returns the i'th byte of the string as a (char) type. 
Consequences: (1) Code is efficient and INcorrect. Do you have an example of impactful incorrect code resulting from those semantics? (2) It requires extra effort to write correct code. (3) Detecting the incorrect code may take years, as these errors can hide easily. None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things: 1. char[] is an array of char 2. immutable(char)[] is the default string type 3. the programmer does not know about 1. and/or 2. I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit. Option B: char[i] returns the i'th codepoint of the string as a (dchar) type. Consequences: (1) Code is INefficient and correct. It is awfully optimistic to assume the code will be correct. (2) It requires extra effort to write efficient code. (3) Detecting the inefficient code happens in minutes. It is VERY noticeable when your program runs too slowly. Except when in testing only small inputs are used and only 2 years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DOS'd. Polynomial blowup in runtime can be as large a problem in practice as a correctness bug. This is how I see it. And I really like my correct code. If it's too slow, and I'll /know/ when it's too slow, then I'll profile->tweak->profile->etc until the slowness goes away. I'm totally digging option B. Those kinds of inefficiencies build up and make the whole program run sluggish, and it will possibly be too late when you notice. Option B is not even on the table. This thread is about a breaking interface change and special-casing T[] for T in {char, wchar}.
Dec 31 2011
On 12/31/2011 02:02 PM, Timon Gehr wrote:On 12/31/2011 07:22 PM, Chad J wrote:I'll do one better and ultra relax: http://www.youtube.com/watch?v=jimQoWXzc0Q ;)On 12/30/2011 02:55 PM, Timon Gehr wrote:Relax.On 12/30/2011 08:33 PM, Joshua Reusch wrote:But correct (enough).Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:Inefficient.On 12/29/11 12:28 PM, Don wrote:Maybe it could happen if we 1. make dstring the default strings type --On 28.12.2011 20:00, Andrei Alexandrescu wrote:Exactly!Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it's just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write: ubyte [] u = s.rep; and use u from then on. I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...*sigh*, FINE. Code units and /code points/ would be the same.code units and characters would be the sameWrong.I'll refer to another limb of this thread when foobar mentioned a mental model of strings as strings of letters. 
Now, given annoying corner cases, we probably can't get strings of /letters/, but I'd at least like to make it as far as code points. That seems very doable. I mention this because I find that forwarding string.length and opIndex would be much more consistent with this mental model of strings as strings of unicode code points, which, IMO, is more important than it being binary consistent with the other things. I'd much rather have char[] behave more like an array of code points than an array of bytes. I don't need an array of bytes. That's ubyte[]; I have that already.int[] bool[] float[] char[]Inconsistent? How?or 2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindexInconsistent and inefficient (it blows up the algorithmic complexity).Yeah, Andrei's proposal says that. But I'm speaking of Joshua's:Inefficiency is a lot easier to deal with than incorrect. If something is inefficient, then in the right places I will NOTICE. If something is incorrect, it can hide for years until that one person (or country, in this case) with a different usage pattern than the others uncovers it.Except that the proposal would make slicing strings go away.What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.so programmers could use the slices/indexing/length (no lazyness problems), and if they really want codeunits use .raw/.rep (or better .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)Anyone who intends to write efficient string processing code needs this. 
Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.I kind-of like either, but I'd prefer Joshua's suggestion.so programmers could use the slices/indexing/length ...Or, you know, we could design the language a little differently and make this become mostly a non-problem. That would be cool.Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?But generally I liked the idea of just having an alias for strings...Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen? 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.Probably not. I played fast and loose with this a lot in my early D code. Then this same conversation happened like ~3 years ago on this newsgroup. Then I learned more about unicode and had a bit of a bitter taste regarding char[] and how it handled indexing. I thought I could just index char[]s willy nilly. But no, I can't. And the compiler won't tell me. It just silently does what I don't want. Maybe unicode is easy, but we sure as hell aren't born with it, and the language doesn't give beginners ANY red flags about this. 
I find myself pretty fortified against this issue due to having known about it before anything unpleasant happened, but I don't like the idea of others having to learn the hard way.
There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.
2. They get screwed-up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again.
Except they don't, because there are a lot of programmers that will never put in non-ASCII strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ASCII strings in. This could make some messes.
I have *never* seen a user in D.learn complain about it. There might have been some I missed, but it is certainly not a prevalent problem. Also, just because a user can type .rep does not mean he understands Unicode: he is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_.
You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-Latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herded people into writing correct code to begin with, because they won't otherwise.
Nope. Sorry.
I learned about it before it had a chance to bite me. But this is only because I frequent(ed) the newsgroup and had a good throw on my dice roll.... There's another issue at play here too: efficiency vs correctness as a default. Here's the tradeoff -- Option A: char[i] returns the i'th byte of the string as a (char) type. Consequences: (1) Code is efficient and INcorrect.Do you have an example of impactful incorrect code resulting from those semantics?I can get behind this. Honestly I'd like the default string type to be intelligent and optimize itself into whichever UTF-N encoding is optimal for content I throw into it. Maybe this means it should lazily expand itself to the narrowest character type that maintains a 1-to-1 ratio between code units and code points so that indexing/slicing remain O(1), or maybe it's a bag of disparate encodings, or maybe someone can think of a better strategy. Just make it /reasonably/ fast and help me with correctness as much as possible. If I need more performance or more unicode pedantics, I'll do my homework then and only then. Of course this is probably never going to happen I'm afraid. Even the problem of making such a (probably) struct work at compile time in templates as if it were a native type... agh, headaches.(2) It requires extra effort to write correct code. (3) Detecting the incorrect code may take years, as these errors can hide easily.None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things: 1. char[] is an array of char 2. immutable(char)[] is the default string type 3. the programmer does not know about 1. and/or 2. I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.I see what you mean there. I'm still not entirely happy with it though. I don't think these are reasonable requirements. It sounds like forced premature optimization to me. 
I have found myself in a number of places in different problem domains where optimality-is-correctness. Make it too slow and the program isn't worth writing. I can't imagine doing this for workloads I can't test on or anticipate, though: I'd have to operate like NASA and make things 10x more expensive than they need to be. Correctness, on the other hand, can be easily (relatively speaking) obtained by only allowing the user to input data you can handle and then making sure the program can handle it as promised. Test, test, test, etc.
Option B: char[i] returns the i'th code point of the string as a (dchar) type. Consequences: (1) Code is INefficient and correct.
It is awfully optimistic to assume the code will be correct.
(2) It requires extra effort to write efficient code. (3) Detecting the inefficient code happens in minutes. It is VERY noticeable when your program runs too slowly.
Except when in testing only small inputs are used, and only 2 years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DoS'd. Polynomial blowup in runtime can be as large a problem in practice as a correctness bug.
I get the feeling that the typical divide-and-conquer profiling strategy will find the more expensive operations /at least/ most of the time. Unfortunately, I have only experience to speak from on this matter.
This is how I see it. And I really like my correct code. If it's too slow, and I'll /know/ when it's too slow, then I'll profile->tweak->profile->etc until the slowness goes away. I'm totally digging option B.
Those kinds of inefficiencies build up and make the whole program run sluggish, and it will possibly be too late when you notice.
Option B is not even on the table. This thread is about a breaking interface change and special-casing T[] for T in {char, wchar}.
Yeah, I know. I'm referring to what Joshua wrote, because I like option B.
Even if it's academic, I'll say I like it anyways, if only for the sake of argument.
Dec 31 2011
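For readers following along, a minimal sketch of the distinction the thread revolves around. Today's Phobos has std.string.representation, which is essentially the ".rep"/".raw" accessor being proposed here, and std.range.walkLength, which counts by decoding; the particular string literal is just an illustration:

```d
import std.stdio;
import std.range : walkLength;
import std.string : representation;

void main()
{
    string s = "häuser";        // 'ä' occupies two UTF-8 code units
    writeln(s.length);          // 7: code units, not characters
    writeln(s.walkLength);      // 6: code points, counted by decoding
    immutable(ubyte)[] raw = s.representation; // the "raw" byte view
    writeln(raw[0]);            // 104: 'h' as a ubyte, not a char
}
```

Note how representation hands back ubyte elements, exactly the "wrong element type" versus "honest element type" trade-off debated below.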
On 01/01/2012 02:34 AM, Chad J wrote:On 12/31/2011 02:02 PM, Timon Gehr wrote:char[] is not an array of bytes: it is an array of UTF-8 code units.On 12/31/2011 07:22 PM, Chad J wrote:I'll do one better and ultra relax: http://www.youtube.com/watch?v=jimQoWXzc0Q ;)On 12/30/2011 02:55 PM, Timon Gehr wrote:Relax.On 12/30/2011 08:33 PM, Joshua Reusch wrote:But correct (enough).Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:Inefficient.On 12/29/11 12:28 PM, Don wrote:Maybe it could happen if we 1. make dstring the default strings type --On 28.12.2011 20:00, Andrei Alexandrescu wrote:Exactly!Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it's just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write: ubyte [] u = s.rep; and use u from then on. I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...*sigh*, FINE. 
Code units and /code points/ would be the same.
code units and characters would be the same
Wrong.
I'll refer to another limb of this thread where foobar mentioned a mental model of strings as strings of letters. Now, given annoying corner cases, we probably can't get strings of /letters/, but I'd at least like to make it as far as code points. That seems very doable. I mention this because I find that forwarding string.length and opIndex would be much more consistent with this mental model of strings as strings of Unicode code points, which, IMO, is more important than it being binary-consistent with the other things. I'd much rather have char[] behave more like an array of code points than an array of bytes. I don't need an array of bytes. That's ubyte[]; I have that already.
int[] bool[] float[] char[] Inconsistent? How?
or 2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindex
Inconsistent and inefficient (it blows up the algorithmic complexity).
It is IMO already mostly a non-problem, but YMMV:

    import std.stdio;

    void main(){
        string s = readln();
        int nest = 0;
        foreach(x; s){ // iterates by code unit
            if(x == '(') nest++;
            else if(x == ')' && --nest < 0) goto unbalanced;
        }
        if(!nest){ writeln("balanced parentheses"); return; }
    unbalanced:
        writeln("unbalanced parentheses");
    }

That code is UTF-aware, even though it does not explicitly deal with UTF. I'd claim it is like this most of the time.
Yeah, Andrei's proposal says that. But I'm speaking of Joshua's:
Inefficiency is a lot easier to deal with than incorrectness. If something is inefficient, then in the right places I will NOTICE. If something is incorrect, it can hide for years until that one person (or country, in this case) with a different usage pattern than the others uncovers it.
Except that the proposal would make slicing strings go away.
What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these.
Slicing is super useful for script-like coding.so programmers could use the slices/indexing/length (no lazyness problems), and if they really want codeunits use .raw/.rep (or better .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.I kind-of like either, but I'd prefer Joshua's suggestion.so programmers could use the slices/indexing/length ...Or, you know, we could design the language a little differently and make this become mostly a non-problem. That would be cool.Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?But generally I liked the idea of just having an alias for strings...Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen? 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.How often do you actually need to get, for example, the 10th character of a string? I think it is a very uncommon operation. If the indexing is just part of an iteration that looks once at each char and handles some ASCII characters in certain ways, there is no potential correctness problem. As soon as code talks about non-ascii characters, it has to be UTF aware anyway.Probably not. 
I played fast and loose with this a lot in my early D code. Then this same conversation happened like ~3 years ago on this newsgroup. Then I learned more about Unicode and had a bit of a bitter taste regarding char[] and how it handled indexing. I thought I could just index char[]s willy-nilly. But no, I can't. And the compiler won't tell me. It just silently does what I don't want.
There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.
2. They get screwed-up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again.
Except they don't, because there are a lot of programmers that will never put in non-ASCII strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ASCII strings in. This could make some messes.
I have *never* seen a user in D.learn complain about it. There might have been some I missed, but it is certainly not a prevalent problem. Also, just because a user can type .rep does not mean he understands Unicode: he is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_.
You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-Latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in.
No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.Maybe unicode is easy, but we sure as hell aren't born with it, and the language doesn't give beginners ANY red flags about this. I find myself pretty fortified against this issue due to having known about it before anything unpleasant happened, but I don't like the idea of others having to learn the hard way.Hm, well. The first thing I looked up when I learned D supports Unicode is how Unicode/UTF work in detail. After that, the semantics of char[] were very clear to me.I might be wrong, but I somewhat have the impression we might be chasing phantoms here. I have so far never seen a bug in real world code caused by inadvertent misuse of D string indexing or slicing.Nope. Sorry. I learned about it before it had a chance to bite me. But this is only because I frequent(ed) the newsgroup and had a good throw on my dice roll.... There's another issue at play here too: efficiency vs correctness as a default. Here's the tradeoff -- Option A: char[i] returns the i'th byte of the string as a (char) type. Consequences: (1) Code is efficient and INcorrect.Do you have an example of impactful incorrect code resulting from those semantics?It is using a better algorithm that performs faster by a linear factor. I would be very leery of something that looks like a constant time array indexing operation take linear time. I think premature optimization is about writing near-optimal hard-to-debug and maintain code that only gains some constant factors in parts of the code that are not performance critical.I can get behind this. Honestly I'd like the default string type to be intelligent and optimize itself into whichever UTF-N encoding is optimal for content I throw into it. 
Maybe this means it should lazily expand itself to the narrowest character type that maintains a 1-to-1 ratio between code units and code points so that indexing/slicing remain O(1), or maybe it's a bag of disparate encodings, or maybe someone can think of a better strategy. Just make it /reasonably/ fast and help me with correctness as much as possible. If I need more performance or more unicode pedantics, I'll do my homework then and only then. Of course this is probably never going to happen I'm afraid. Even the problem of making such a (probably) struct work at compile time in templates as if it were a native type... agh, headaches.(2) It requires extra effort to write correct code. (3) Detecting the incorrect code may take years, as these errors can hide easily.None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things: 1. char[] is an array of char 2. immutable(char)[] is the default string type 3. the programmer does not know about 1. and/or 2. I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.I see what you mean there. I'm still not entirely happy with it though. I don't think these are reasonable requirements. It sounds like forced premature optimization to me.Option B: char[i] returns the i'th codepoint of the string as a (dchar) type. Consequences: (1) Code is INefficient and correct.It is awfully optimistic to assume the code will be correct.(2) It requires extra effort to write efficient code. (3) Detecting the inefficient code happens in minutes. It is VERY noticable when your program runs too slowly.Except when in testing only small inputs are used and only 2 years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DOS'd. 
Polynomial blowup in runtime can be as large a problem in practice as a correctness bug.
I have found myself in a number of places in different problem domains where optimality-is-correctness. Make it too slow and the program isn't worth writing. I can't imagine doing this for workloads I can't test on or anticipate, though: I'd have to operate like NASA and make things 10x more expensive than they need to be. Correctness, on the other hand, can be easily (relatively speaking) obtained by only allowing the user to input data you can handle and then making sure the program can handle it as promised. Test, test, test, etc.
Yes, what I meant is that if the inefficiencies are spread out more or less uniformly, then fixing it all up might seem to be too much work and too much risk.
I get the feeling that the typical divide-and-conquer profiling strategy will find the more expensive operations /at least/ most of the time. Unfortunately, I have only experience to speak from on this matter.
This is how I see it. And I really like my correct code. If it's too slow, and I'll /know/ when it's too slow, then I'll profile->tweak->profile->etc until the slowness goes away. I'm totally digging option B.
Those kinds of inefficiencies build up and make the whole program run sluggish, and it will possibly be too late when you notice.
OK.
Option B is not even on the table. This thread is about a breaking interface change and special-casing T[] for T in {char, wchar}.
Yeah, I know. I'm referring to what Joshua wrote, because I like option B. Even if it's academic, I'll say I like it anyways, if only for the sake of argument.
Dec 31 2011
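Timon's "iterates by code unit" comment can be made concrete: the element type named in a foreach over a string decides whether D decodes at all. A small sketch (the literal is arbitrary):

```d
import std.stdio;

void main()
{
    string s = "π ≈ 3.14";
    size_t units, points;
    foreach (char c; s)  ++units;    // no decoding: one step per UTF-8 byte
    foreach (dchar c; s) ++points;   // decoding: one step per code point
    writefln("%s code units, %s code points", units, points); // 11 vs 8
}
```

This is why the parentheses example works untouched on non-ASCII input: matching ASCII characters against code units never splits a multi-byte sequence, because UTF-8 continuation bytes never collide with ASCII values.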
On 12/31/2011 09:17 PM, Timon Gehr wrote:On 01/01/2012 02:34 AM, Chad J wrote:Meh, I'd still prefer it be an array of UTF-8 code /points/ represented by an array of bytes (which are the UTF-8 code units).On 12/31/2011 02:02 PM, Timon Gehr wrote:char[] is not an array of bytes: it is an array of UTF-8 code units.On 12/31/2011 07:22 PM, Chad J wrote:I'll do one better and ultra relax: http://www.youtube.com/watch?v=jimQoWXzc0Q ;)On 12/30/2011 02:55 PM, Timon Gehr wrote:Relax.On 12/30/2011 08:33 PM, Joshua Reusch wrote:But correct (enough).Maybe it could happen if we 1. make dstring the default strings type --Inefficient.*sigh*, FINE. Code units and /code points/ would be the same.code units and characters would be the sameWrong.I'll refer to another limb of this thread when foobar mentioned a mental model of strings as strings of letters. Now, given annoying corner cases, we probably can't get strings of /letters/, but I'd at least like to make it as far as code points. That seems very doable. I mention this because I find that forwarding string.length and opIndex would be much more consistent with this mental model of strings as strings of unicode code points, which, IMO, is more important than it being binary consistent with the other things. I'd much rather have char[] behave more like an array of code points than an array of bytes. I don't need an array of bytes. That's ubyte[]; I have that already.int[] bool[] float[] char[]Inconsistent? How?or 2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindexInconsistent and inefficient (it blows up the algorithmic complexity).I'm willing to agree with this. I still don't like the possibility that folks encounter corner-cases in that not-most-of-the-time. I'm not going to rage-face too hard if this never changes though. 
There would be a number of other things more important to fix before this, IMO.It is imo already mostly a non-problem, but YMMV: void main(){ string s = readln(); int nest = 0; foreach(x;s){ // iterates by code unit if(x=='(') nest++; else if(x==')' && --nest<0) goto unbalanced; } if(!nest){ writeln("balanced parentheses"); return; } unbalanced: writeln("unbalanced parentheses"); } That code is UTF aware, even though it does not explicitly deal with UTF. I'd claim it is like this most of the time.Yeah, Andrei's proposal says that. But I'm speaking of Joshua's:Inefficiency is a lot easier to deal with than incorrect. If something is inefficient, then in the right places I will NOTICE. If something is incorrect, it can hide for years until that one person (or country, in this case) with a different usage pattern than the others uncovers it.Except that the proposal would make slicing strings go away.What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.so programmers could use the slices/indexing/length (no lazyness problems), and if they really want codeunits use .raw/.rep (or better .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.I kind-of like either, but I'd prefer Joshua's suggestion.so programmers could use the slices/indexing/length ...Or, you know, we could design the language a little differently and make this become mostly a non-problem. That would be cool.Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.How do you know they are only working with ASCII? They might be /now/. 
But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?
But generally I liked the idea of just having an alias for strings...
Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen? 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.
If you haven't been educated about Unicode or how D handles it, you might write this:

    char[] str;
    ... load str ...
    for ( int i = 0; i < str.length; i++ )
    {
        font.render(str[i]); // Ewww.
        ...
    }

It'd be neat if that gave a compiler error, or just passed code points as dchars. Maybe a compiler error is best in this light.
How often do you actually need to get, for example, the 10th character of a string? I think it is a very uncommon operation. If the indexing is just part of an iteration that looks once at each char and handles some ASCII characters in certain ways, there is no potential correctness problem. As soon as code talks about non-ASCII characters, it has to be UTF-aware anyway.
Probably not. I played fast and loose with this a lot in my early D code. Then this same conversation happened like ~3 years ago on this newsgroup. Then I learned more about Unicode and had a bit of a bitter taste regarding char[] and how it handled indexing. I thought I could just index char[]s willy-nilly. But no, I can't. And the compiler won't tell me. It just silently does what I don't want.
There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.
2.
They get screwed-up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again.
Except they don't, because there are a lot of programmers that will never put in non-ASCII strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ASCII strings in. This could make some messes.
I have *never* seen a user in D.learn complain about it. There might have been some I missed, but it is certainly not a prevalent problem. Also, just because a user can type .rep does not mean he understands Unicode: he is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_.
You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-Latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herded people into writing correct code to begin with, because they won't otherwise.
Possibly.
Maybe Unicode is easy, but we sure as hell aren't born with it, and the language doesn't give beginners ANY red flags about this. I find myself pretty fortified against this issue due to having known about it before anything unpleasant happened, but I don't like the idea of others having to learn the hard way.
Hm, well.
The first thing I looked up when I learned D supports Unicode is how Unicode/UTF work in detail. After that, the semantics of char[] were very clear to me.I might be wrong, but I somewhat have the impression we might be chasing phantoms here. I have so far never seen a bug in real world code caused by inadvertent misuse of D string indexing or slicing.Nope. Sorry. I learned about it before it had a chance to bite me. But this is only because I frequent(ed) the newsgroup and had a good throw on my dice roll.... There's another issue at play here too: efficiency vs correctness as a default. Here's the tradeoff -- Option A: char[i] returns the i'th byte of the string as a (char) type. Consequences: (1) Code is efficient and INcorrect.Do you have an example of impactful incorrect code resulting from those semantics?This wouldn't be the first data structure to require linear time indexing. I mean, linked lists exists. I do feel that heavy-duty optimization puts the onus on the programmer to know what to do. The programming language is responsible for merely making it possible, not for making it the default path. The latter is fairly impossible. Correctness, on the other hand, should involve some hand-holding. It's that notion of the language catching me when I fall. I think the language should (and can) help a lot with program correctness if designed right. D is already really good on these counts, and even helps quite a bit when optimization gets down-and-dirty.It is using a better algorithm that performs faster by a linear factor. I would be very leery of something that looks like a constant time array indexing operation take linear time. I think premature optimization is about writing near-optimal hard-to-debug and maintain code that only gains some constant factors in parts of the code that are not performance critical.I can get behind this. 
Honestly I'd like the default string type to be intelligent and optimize itself into whichever UTF-N encoding is optimal for content I throw into it. Maybe this means it should lazily expand itself to the narrowest character type that maintains a 1-to-1 ratio between code units and code points so that indexing/slicing remain O(1), or maybe it's a bag of disparate encodings, or maybe someone can think of a better strategy. Just make it /reasonably/ fast and help me with correctness as much as possible. If I need more performance or more unicode pedantics, I'll do my homework then and only then. Of course this is probably never going to happen I'm afraid. Even the problem of making such a (probably) struct work at compile time in templates as if it were a native type... agh, headaches.(2) It requires extra effort to write correct code. (3) Detecting the incorrect code may take years, as these errors can hide easily.None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things: 1. char[] is an array of char 2. immutable(char)[] is the default string type 3. the programmer does not know about 1. and/or 2. I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.I see what you mean there. I'm still not entirely happy with it though. I don't think these are reasonable requirements. It sounds like forced premature optimization to me.Option B: char[i] returns the i'th codepoint of the string as a (dchar) type. Consequences: (1) Code is INefficient and correct.It is awfully optimistic to assume the code will be correct.(2) It requires extra effort to write efficient code. (3) Detecting the inefficient code happens in minutes. It is VERY noticable when your program runs too slowly.Except when in testing only small inputs are used and only 2 years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DOS'd. 
Polynomial blowup in runtime can be as large a problem as a correctness bug in practice just fine.Ah, right. Because code refactoring tends to suck. I get you. This is, of course, still the same reason why I'd never want to have to go through my code and replace all of the "font.render(str[i]);". Yeah, starting a number of years ago it won't happen to me, but it might get someone else.I have found myself in a number of places in different problem domains where optimality-is-correctness. Make it too slow and the program isn't worth writing. I can't imagine doing this for workloads I can't test on or anticipate though: I'd have to operate like NASA and make things 10x more expensive than they need to be. Correctness, on the other hand, can be easily (relatively speaking) obtained by only allowing the user to input data you can handle and then making sure the program can handle it as promised. Test, test, test, etc.Yes, what I meant is, that if the inefficiencies are spread out more or less uniformly, then fixing it all up might seem to be too much work and too much risk.I get the feeling that the typical divide-and-conquer profiling strategy will find the more expensive operations /at least/ most of the time. Unfortunately, I have only experience to speak from on this matter.This is how I see it. And I really like my correct code. If it's too slow, and I'll /know/ when it's too slow, then I'll profile->tweak->profile->etc until the slowness goes away. I'm totally digging option B.Those kinds of inefficiencies build up and make the whole program run sluggish, and it will possibly be to late when you notice.OK.Option B is not even on the table. This thread is about a breaking interface change and special casing T[] for T in {char, wchar}.Yeah, I know. I'm refering to what Joshua wrote, because I like option B. Even if it's academic, I'll say I like it anyways, if only for the sake of argument.
Dec 31 2011
Meh, I'd still prefer it be an array of UTF-8 code /points/ represented by an array of bytes (which are the UTF-8 code units).By saying you want an array of code points you already define representation. And if you want that there already is dchar[]. You probably meant a range of code points represented by an array of code units. But such a range can't have opIndex, since opIndex implies a constant time operation. If you want nth element of the range, you can use std.range.drop or write your own nth() function.
Jan 01 2012
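The drop-based approach suggested above can be sketched as follows. This is a hedged example, not from the thread: the sample string is mine, and it assumes a Phobos where range primitives pop narrow strings by code point (which is how `std.range.drop` behaves on `string`):

```d
import std.range : drop;
import std.stdio : writeln;

void main()
{
    string s = "añb";        // 'ñ' occupies two code units
    // s[1] would yield a lone code unit in the middle of 'ñ';
    // drop(1) instead skips one code *point*.
    auto rest = s.drop(1);
    writeln(rest);           // ñb
}
```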
On 01/01/2012 05:53 AM, Chad J wrote:If you haven't been educated about unicode or how D handles it, you might write this: char[] str; ... load str ... for ( int i = 0; i< str.length; i++ ) { font.render(str[i]); // Ewww. ... }That actually looks like a bug that might happen in real world code. What is the signature of font.render?
Jan 01 2012
On 01/01/2012 07:59 AM, Timon Gehr wrote:On 01/01/2012 05:53 AM, Chad J wrote:In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.If you haven't been educated about unicode or how D handles it, you might write this: char[] str; ... load str ... for ( int i = 0; i< str.length; i++ ) { font.render(str[i]); // Ewww. ... }That actually looks like a bug that might happen in real world code. What is the signature of font.render?
Jan 01 2012
On 01/01/2012 04:13 PM, Chad J wrote:On 01/01/2012 07:59 AM, Timon Gehr wrote:I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.On 01/01/2012 05:53 AM, Chad J wrote:In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.If you haven't been educated about unicode or how D handles it, you might write this: char[] str; ... load str ... for ( int i = 0; i< str.length; i++ ) { font.render(str[i]); // Ewww. ... }That actually looks like a bug that might happen in real world code. What is the signature of font.render?
Jan 01 2012
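The implicit reinterpret-cast Timon describes can be demonstrated in a few lines (the sample string is mine):

```d
void main()
{
    string s = "é";       // one code point, two code units: 0xC3 0xA9
    dchar d = s[0];       // compiles: char implicitly converts to dchar
    assert(d == 0xC3);    // but d holds the lone lead code unit...
    assert(d != 'é');     // ...not the code point U+00E9
}
```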
On 01/01/2012 10:39 AM, Timon Gehr wrote:On 01/01/2012 04:13 PM, Chad J wrote:I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late. The bigger problem is that a char is being taken from a char[] and thereby loses its context as (potentially) being part of a larger codepoint.On 01/01/2012 07:59 AM, Timon Gehr wrote:I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.On 01/01/2012 05:53 AM, Chad J wrote:In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.If you haven't been educated about unicode or how D handles it, you might write this: char[] str; ... load str ... for ( int i = 0; i< str.length; i++ ) { font.render(str[i]); // Ewww. ... }That actually looks like a bug that might happen in real world code. What is the signature of font.render?
Jan 01 2012
On 01/01/2012 08:01 PM, Chad J wrote:On 01/01/2012 10:39 AM, Timon Gehr wrote:I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to;On 01/01/2012 04:13 PM, Chad J wrote:I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late.On 01/01/2012 07:59 AM, Timon Gehr wrote:I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.On 01/01/2012 05:53 AM, Chad J wrote:In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.If you haven't been educated about unicode or how D handles it, you might write this: char[] str; ... load str ... for ( int i = 0; i< str.length; i++ ) { font.render(str[i]); // Ewww. ... }That actually looks like a bug that might happen in real world code. What is the signature of font.render?The bigger problem is that a char is being taken from a char[] and thereby loses its context as (potentially) being part of a larger codepoint.If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. 
If it is not set, then it is a single ASCII character.
Jan 01 2012
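Timon's point about the high bit follows directly from the UTF-8 encoding: every code unit of a multi-byte sequence has bit 7 set, and ASCII code units never do. A small sketch (example values mine):

```d
void main()
{
    string s = "é";               // UTF-8: 0xC3 0xA9
    assert(s.length == 2);
    assert((s[0] & 0x80) != 0);   // lead unit of a multi-byte sequence
    assert((s[1] & 0x80) != 0);   // continuation unit
    assert(('A' & 0x80) == 0);    // ASCII: high bit clear
}
```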
On 01/01/2012 02:25 PM, Timon Gehr wrote:On 01/01/2012 08:01 PM, Chad J wrote:What of valid transfers of ASCII characters into dchar? Normally this is a widening operation, so I can see how it is permissible.On 01/01/2012 10:39 AM, Timon Gehr wrote:I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to;On 01/01/2012 04:13 PM, Chad J wrote:I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late.On 01/01/2012 07:59 AM, Timon Gehr wrote:I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.On 01/01/2012 05:53 AM, Chad J wrote:In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.If you haven't been educated about unicode or how D handles it, you might write this: char[] str; ... load str ... for ( int i = 0; i< str.length; i++ ) { font.render(str[i]); // Ewww. ... }That actually looks like a bug that might happen in real world code. What is the signature of font.render?See above. I think that assigning from a char[i] to another char[j] is probably safe. Similarly for slicing. 
These calculations tend to occur, I suspect, when the text is well-anchored. I believe your balanced parentheses example falls into this category: (repasted for reader convenience) void main(){ string s = readln(); int nest = 0; foreach(x;s){ // iterates by code unit if(x=='(') nest++; else if(x==')' && --nest<0) goto unbalanced; } if(!nest){ writeln("balanced parentheses"); return; } unbalanced: writeln("unbalanced parentheses"); } With these observations in hand, I would consider the safety of operations to go like this: char[i] = char[j]; // (Reasonably) Safe char[i1..i2] = char[j1..j2]; // (Reasonably) Safe char = char; // Safe dchar = char // Safe. Widening. char = char[i]; // Not safe. Should error. dchar = char[i]; // Not safe. Should error. (Corollary) dchar = dchar[i]; // Safe. char = char[i1..i2]; // Nonsensical; already an error.The bigger problem is that a char is being taken from a char[] and thereby loses its context as (potentially) being part of a larger codepoint.If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.
Jan 01 2012
On 01/02/2012 12:16 AM, Chad J wrote:On 01/01/2012 02:25 PM, Timon Gehr wrote:That is an interesting point of view. Your proposal would therefore be to constrain char to the ASCII range except if it is embedded in an array? It would break the balanced parentheses example.On 01/01/2012 08:01 PM, Chad J wrote:What of valid transfers of ASCII characters into dchar? Normally this is a widening operation, so I can see how it is permissible.On 01/01/2012 10:39 AM, Timon Gehr wrote:I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to;On 01/01/2012 04:13 PM, Chad J wrote:I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late.On 01/01/2012 07:59 AM, Timon Gehr wrote:I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.On 01/01/2012 05:53 AM, Chad J wrote:In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.If you haven't been educated about unicode or how D handles it, you might write this: char[] str; ... load str ... for ( int i = 0; i< str.length; i++ ) { font.render(str[i]); // Ewww. ... 
}That actually looks like a bug that might happen in real world code. What is the signature of font.render?See above. I think that assigning from a char[i] to another char[j] is probably safe. Similarly for slicing. These calculations tend to occur, I suspect, when the text is well-anchored. I believe your balanced parentheses example falls into this category: (repasted for reader convenience) void main(){ string s = readln(); int nest = 0; foreach(x;s){ // iterates by code unit if(x=='(') nest++; else if(x==')'&& --nest<0) goto unbalanced; } if(!nest){ writeln("balanced parentheses"); return; } unbalanced: writeln("unbalanced parentheses"); } With these observations in hand, I would consider the safety of operations to go like this: char[i] = char[j]; // (Reasonably) Safe char[i1..i2] = char[j1..j2]; // (Reasonably) Safe char = char; // Safe dchar = char // Safe. Widening. char = char[i]; // Not safe. Should error. dchar = char[i]; // Not safe. Should error. (Corollary) dchar = dchar[i]; // Safe. char = char[i1..i2]; // Nonsensical; already an error.The bigger problem is that a char is being taken from a char[] and thereby loses its context as (potentially) being part of a larger codepoint.If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.
Jan 01 2012
On 01/01/2012 06:36 PM, Timon Gehr wrote:On 01/02/2012 12:16 AM, Chad J wrote:I just ran the example and wow, x didn't type-infer to dchar like I expected it to. I thought the comment might be wrong, but no, it is correct, x type-infers to char. I expected it to behave more like the old days before type inference showed up everywhere: void main(){ string s = readln(); int nest = 0; foreach(dchar x;s){ // iterates by code POINT; notice the dchar. if(x=='(') nest++; else if(x==')'&& --nest<0) goto unbalanced; } if(!nest){ writeln("balanced parentheses"); return; } unbalanced: writeln("unbalanced parentheses"); } This version wouldn't be broken. If the type inference changed, the other version wouldn't be broken either. This could break other things though. Bummer.On 01/01/2012 02:25 PM, Timon Gehr wrote:That is an interesting point of view. Your proposal would therefore be to constrain char to the ASCII range except if it is embedded in an array? It would break the balanced parentheses example.On 01/01/2012 08:01 PM, Chad J wrote:What of valid transfers of ASCII characters into dchar? Normally this is a widening operation, so I can see how it is permissible.On 01/01/2012 10:39 AM, Timon Gehr wrote:I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to;On 01/01/2012 04:13 PM, Chad J wrote:I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. 
Seems like too little, too late.On 01/01/2012 07:59 AM, Timon Gehr wrote:I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.On 01/01/2012 05:53 AM, Chad J wrote:In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.If you haven't been educated about unicode or how D handles it, you might write this: char[] str; ... load str ... for ( int i = 0; i< str.length; i++ ) { font.render(str[i]); // Ewww. ... }That actually looks like a bug that might happen in real world code. What is the signature of font.render?See above. I think that assigning from a char[i] to another char[j] is probably safe. Similarly for slicing. These calculations tend to occur, I suspect, when the text is well-anchored. I believe your balanced parentheses example falls into this category: (repasted for reader convenience) void main(){ string s = readln(); int nest = 0; foreach(x;s){ // iterates by code unit if(x=='(') nest++; else if(x==')'&& --nest<0) goto unbalanced; } if(!nest){ writeln("balanced parentheses"); return; } unbalanced: writeln("unbalanced parentheses"); } With these observations in hand, I would consider the safety of operations to go like this: char[i] = char[j]; // (Reasonably) Safe char[i1..i2] = char[j1..j2]; // (Reasonably) Safe char = char; // Safe dchar = char // Safe. Widening. char = char[i]; // Not safe. Should error. dchar = char[i]; // Not safe. Should error. (Corollary) dchar = dchar[i]; // Safe. 
char = char[i1..i2]; // Nonsensical; already an error.The bigger problem is that a char is being taken from a char[] and thereby loses its context as (potentially) being part of a larger codepoint.If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.
Jan 01 2012
On 12/28/11 13:42, bearophile wrote:Peter Alexander:eg things like std.demangle? (which wraps core.demangle and that one accepts const(char)[]). IIRC eg the stdio functions taking file names want strings too; never investigated if they really need this, just .iduped the args... In general, a lot of things break when trying to switch to "proper" const(char)[] in apps, usually because the app itself used "string" instead of the const version, but fixing it up often also uncovers lib API issues. arturI often get into situations where I've written a function that takes a string, and then I can't call it because all I have is a char[].I suggest you to show some of such situations.I think it's telling that most Phobos functions use 'const(char)[]' or 'in char[]' instead of 'string' for their arguments. The ones that use 'string' are usually using it unnecessarily and should be fixed to use const(char)[].What are the Phobos functions that unnecessarily accept a string?
Dec 28 2011
I agree, the string parameters are indeed irritating, but changing the alias would bring much more pain than it would relieve. On Wed, Dec 28, 2011 at 4:06 PM, Peter Alexander <peter.alexander.au gmail.com> wrote:string is immutable(char)[] I rarely *ever* need an immutable string. What I usually need is const(char)[]. I'd say 99%+ of the time I need only a const string. This is quite irritating because "string" is the most convenient and intuitive thing to type. I often get into situations where I've written a function that takes a string, and then I can't call it because all I have is a char[]. I could copy the char[] into a new string, but that's expensive, and I'd rather I could just call the function. I think it's telling that most Phobos functions use 'const(char)[]' or 'in char[]' instead of 'string' for their arguments. The ones that use 'string' are usually using it unnecessarily and should be fixed to use const(char)[]. In an ideal world I'd much prefer if string was an alias for const(char)[], but string literals were immutable(char)[]. It would require a little more effort when dealing with concurrency, but that's a price I would be willing to pay to make the string alias useful in function parameters.-- Bye, Gor Gyolchanyan.
Dec 28 2011
I understand your intention. It was one of the main irritations when I moved to D. Here is a function that unnecessarily uses string. /** * replaces foo by bar within text. */ string replace(string text, string foo, string bar) { // ... } The function is crap because it can't be called with mutable char[]. Okay, that's true. Therefore you suggested to alias const(char)[] instead of immutable(char)[] ??? But I think inout() is your man in this case. If I remember correctly, it has been fixed recently. I'm not quite sure if I got your point. So forgive me if I was wrong.
Dec 28 2011
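The inout() suggestion above can be sketched with a simpler hypothetical function than replace, so the signature stays readable; the function name firstWord and its body are mine:

```d
// The constancy of the result follows the constancy of the argument,
// so one definition serves char[], const(char)[] and string callers.
inout(char)[] firstWord(inout(char)[] text)
{
    foreach (i, c; text)
        if (c == ' ')
            return text[0 .. i];
    return text;
}

void main()
{
    string s = "hello world";
    char[] m = s.dup;
    string a = firstWord(s);   // immutable in, immutable out
    char[] b = firstWord(m);   // mutable in, mutable out
    assert(a == "hello" && b == "hello");
}
```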
On 12/28/2011 04:06 AM, Peter Alexander wrote:string is immutable(char)[] I rarely *ever* need an immutable string. What I usually need is const(char)[]. I'd say 99%+ of the time I need only a const string. This is quite irritating because "string" is the most convenient and intuitive thing to type. I often get into situations where I've written a function that takes a string, and then I can't call it because all I have is a char[]. I could copy the char[] into a new string, but that's expensive, and I'd rather I could just call the function. I think it's telling that most Phobos functions use 'const(char)[]' or 'in char[]' instead of 'string' for their arguments. The ones that use 'string' are usually using it unnecessarily and should be fixed to use const(char)[]. In an ideal world I'd much prefer if string was an alias for const(char)[], but string literals were immutable(char)[]. It would require a little more effort when dealing with concurrency, but that's a price I would be willing to pay to make the string alias useful in function parameters.Agreed. I've talked about this in D.learn a number of times myself. Ali
Dec 28 2011
On 12/28/2011 08:00 AM, Ali Çehreli wrote:Agreed. I've talked about this in D.learn a number of times myself.After seeing others' comments that focus more on the alias, I need to clarify: I don't have an opinion on the alias itself. I agree with the subject line that function parameter lists should mostly have const(char)[] instead of string. Ali
Dec 28 2011
On 12/28/11 6:06 AM, Peter Alexander wrote:string is immutable(char)[] I rarely *ever* need an immutable string. What I usually need is const(char)[]. I'd say 99%+ of the time I need only a const string. This is quite irritating because "string" is the most convenient and intuitive thing to type. I often get into situations where I've written a function that takes a string, and then I can't call it because all I have is a char[]. I could copy the char[] into a new string, but that's expensive, and I'd rather I could just call the function. I think it's telling that most Phobos functions use 'const(char)[]' or 'in char[]' instead of 'string' for their arguments. The ones that use 'string' are usually using it unnecessarily and should be fixed to use const(char)[]. In an ideal world I'd much prefer if string was an alias for const(char)[], but string literals were immutable(char)[]. It would require a little more effort when dealing with concurrency, but that's a price I would be willing to pay to make the string alias useful in function parameters.I'm afraid you're wrong here. The current setup is very good, and much better than one in which "string" would be an alias for const(char)[]. The problem is escaping. A function that transitorily operates on a string indeed does not care about the origin of the string, but storing a string inside an object is a completely different deal. The setup class Query { string name; ... } is safe, minimizes data copying, and never causes surprises to anyone ("I set the name of my query and a little later it's all messed up!"). So immutable(char)[] is the best choice for a correct string abstraction compared against both char[] and const(char)[]. In fact it's in a way good that const(char)[] takes longer to type, because it also carries larger liabilities. If you want to create a string out of a char[] or const(char)[], use std.conv.to or the unsafe assumeUnique. Andrei
Dec 28 2011
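The two conversions Andrei mentions might look like this in use (the buffer names are mine; assumeUnique is the one from std.exception):

```d
import std.conv : to;
import std.exception : assumeUnique;

void main()
{
    char[] buf = "hello".dup;
    string copied = to!string(buf);   // allocates an immutable copy
    buf[0] = 'j';
    assert(copied == "hello");        // unaffected by later mutation

    char[] built = "world".dup;
    string s = assumeUnique(built);   // no copy; unsafe if the mutable
                                      // reference is used afterwards
    assert(s == "world");
}
```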
On 28/12/11 4:27 PM, Andrei Alexandrescu wrote:The problem is escaping. A function that transitorily operates on a string indeed does not care about the origin of the string, but storing a string inside an object is a completely different deal. The setup class Query { string name; ... } is safe, minimizes data copying, and never causes surprises to anyone ("I set the name of my query and a little later it's all messed up!"). So immutable(char)[] is the best choice for a correct string abstraction compared against both char[] and const(char)[]. In fact it's in a way good that const(char)[] takes longer to type, because it also carries larger liabilities.I don't follow your argument. You've said (paraphrasing) "If a function does A then X is best, but if a function does B then Y is best, so Y is best." If a function needs to store the string then by all means it should use immutable(char)[]. However, this is a much rarer case than functions that simply use the string transitorily as you put it. Again, there are very, very few functions in Phobos that accept a string as an argument. The vast majority accept `const(char)[]` or `in char[]`. This speaks volumes about how useful the string alias is.
Dec 28 2011
On 12/28/11 11:42 AM, Peter Alexander wrote:On 28/12/11 4:27 PM, Andrei Alexandrescu wrote:I'm saying (paraphrasing) "X is modularly bankrupt and unsafe, and Y is modular and safe, so Y is best".The problem is escaping. A function that transitorily operates on a string indeed does not care about the origin of the string, but storing a string inside an object is a completely different deal. The setup class Query { string name; ... } is safe, minimizes data copying, and never causes surprises to anyone ("I set the name of my query and a little later it's all messed up!"). So immutable(char)[] is the best choice for a correct string abstraction compared against both char[] and const(char)[]. In fact it's in a way good that const(char)[] takes longer to type, because it also carries larger liabilities.I don't follow your argument. You've said (paraphrasing) "If a function does A then X is best, but if a function does B then Y is best, so Y is best."If a function needs to store the string then by all means it should use immutable(char)[]. However, this is a much rarer case than functions that simply use the string transitorily as you put it.Rarity is a secondary concern to modularity and safety.Again, there are very, very few functions in Phobos that accept a string as an argument. The vast majority accept `const(char)[]` or `in char[]`. This speaks volumes about how useful the string alias is.Phobos consists of many functions and few entity types. Application code is rife with entity types. I kindly suggest you reconsider your position; the current setup is indeed very solid. Andrei
Dec 28 2011
On Wednesday, 28 December 2011 at 16:27:15 UTC, Andrei Alexandrescu wrote:So immutable(char)[] is the best choice for a correct string abstraction compared against both char[] and const(char)[]. In fact it's in a way good that const(char)[] takes longer to type, because it also carries larger liabilities.Also, 'in char[]', which is conceptually much safer, isn't that much longer to type. It would be cool if 'scope' was actually implemented as more than an optimization though.
Dec 28 2011
On Wednesday, December 28, 2011 19:25:15 Jakob Ovrum wrote:Also, 'in char[]', which is conceptually much safer, isn't that much longer to type. It would be cool if 'scope' was actually implemented apart from an optimization though.in char[] is _not_ safer than immutable(char)[]. In fact it's _less_ safe. It's also far more restrictive. Many, many functions return a portion of the string that they are passed in. That slicing would be impossible with scope, and because in char[] makes no guarantees about the elements not changing after the function call, you'd often have to dup or idup it in order to avoid bugs. immutable(char)[] avoids all of that. You can safely slice it without having to worry about duping it to avoid it changing out from under you. - Jonathan M Davis
Dec 28 2011
On Wednesday, 28 December 2011 at 20:49:54 UTC, Jonathan M Davis wrote:On Wednesday, December 28, 2011 19:25:15 Jakob Ovrum wrote:I didn't say it was. Please read more closely.Also, 'in char[]', which is conceptually much safer, isn't that much longer to type. It would be cool if 'scope' was actually implemented apart from an optimization though.in char[] is _not_ safer than immutable(char)[].
Dec 28 2011
On Wednesday, December 28, 2011 10:27:15 Andrei Alexandrescu wrote:I'm afraid you're wrong here. The current setup is very good, and much better than one in which "string" would be an alias for const(char)[]. The problem is escaping. A function that transitorily operates on a string indeed does not care about the origin of the string, but storing a string inside an object is a completely different deal. The setup class Query { string name; ... } is safe, minimizes data copying, and never causes surprises to anyone ("I set the name of my query and a little later it's all messed up!"). So immutable(char)[] is the best choice for a correct string abstraction compared against both char[] and const(char)[]. In fact it's in a way good that const(char)[] takes longer to type, because it also carries larger liabilities. If you want to create a string out of a char[] or const(char)[], use std.conv.to or the unsafe assumeUnique.Agreed. And for a number of functions, taking const(char)[] would be worse, because they would have to dup or idup the string, whereas with immutable(char)[], they can safely slice it without worrying about its value changing. I think that if we want to make it so that immutable(char)[] isn't forced as much, then we need to make proper use of templates (which also can allow you to not force char over wchar or dchar) and inout - and perhaps in some cases, a templated function could allow you to indicate what type of character you want returned. But in general, string is by far the most useful and least likely to cause bugs with slicing. So, I think that string should remain immutable(char)[]. - Jonathan M Davis
Dec 28 2011
On 28/12/2011 21:43, Jonathan M Davis wrote:On Wednesday, December 28, 2011 10:27:15 Andrei Alexandrescu wrote:Is inout a solution for the standard lib here? The user could idup if a string is needed from a const/mutable char[]I'm afraid you're wrong here. The current setup is very good, and much better than one in which "string" would be an alias for const(char)[]. The problem is escaping. A function that transitorily operates on a string indeed does not care about the origin of the string, but storing a string inside an object is a completely different deal. The setup class Query { string name; ... } is safe, minimizes data copying, and never causes surprises to anyone ("I set the name of my query and a little later it's all messed up!"). So immutable(char)[] is the best choice for a correct string abstraction compared against both char[] and const(char)[]. In fact it's in a way good that const(char)[] takes longer to type, because it also carries larger liabilities. If you want to create a string out of a char[] or const(char)[], use std.conv.to or the unsafe assumeUnique.Agreed. And for a number of functions, taking const(char)[] would be worse, because they would have to dup or idup the string, whereas with immutable(char)[], they can safely slice it without worrying about its value changing.
Dec 29 2011
On Thursday, December 29, 2011 17:01:19 deadalnix wrote:
> On 28/12/2011 21:43, Jonathan M Davis wrote:
>> Agreed. And for a number of functions, taking const(char)[] would be
>> worse, because they would have to dup or idup the string, whereas with
>> immutable(char)[], they can safely slice it without worrying about its
>> value changing.
>
> Is inout a solution for the standard lib here?
>
> The user could idup if a string is needed from a const/mutable char[].

In some places, yes. Phobos doesn't use inout as much as it probably should, simply because it was only recently that inout was made to work properly. Regardless, you have to be careful about taking const(char)[], because there's a risk of forcing what could be an unnecessary idup. The best solution to that, however, depends on what exactly the function is doing. If it's simply slicing a portion of the string that's passed in and returning it, then inout is a great solution. On the other hand, if it actually needs an immutable(char)[] internally, then there's a good chance that it should just take a string. It depends on what the function is ultimately doing.

- Jonathan M Davis
Dec 29 2011
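The "simply slicing and returning" case Jonathan mentions is where inout shines; a minimal sketch (the firstWord helper is hypothetical, not a Phobos function):

```d
// inout lets one function body serve mutable, const, and immutable callers,
// returning a slice carrying the same qualifier the caller passed in.
inout(char)[] firstWord(inout(char)[] s)
{
    size_t i = 0;
    while (i < s.length && s[i] != ' ')
        ++i;
    return s[0 .. i]; // a slice, no copy, qualifier preserved
}

void main()
{
    string s = "hello world";
    char[] m = "hello world".dup;

    string a = firstWord(s); // immutable in, immutable out
    char[] b = firstWord(m); // mutable in, mutable out

    assert(a == "hello");
    assert(b == "hello");
}
```

Without inout, the signature would either force the caller to idup (if it took string) or force the function to return const and lose the caller's mutability (if it took const(char)[]).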
On 12/28/2011 4:06 AM, Peter Alexander wrote:
> I rarely *ever* need an immutable string. What I usually need is
> const(char)[]. I'd say 99%+ of the time I need only a const string.

I have a very different experience with strings. I can't even remember a case where I wanted to modify an existing string (this includes all my C and C++ usage of strings). It's always assemble a string at one place, and then refer to that string ever after (and never modify it).

What immutable strings make possible is treating strings as if they were value types. Nearly every language I know of treats them as immutable except for C and C++.
Dec 28 2011
On 12/28/11 11:11 AM, Walter Bright wrote:
> On 12/28/2011 4:06 AM, Peter Alexander wrote:
>> I rarely *ever* need an immutable string. What I usually need is
>> const(char)[]. I'd say 99%+ of the time I need only a const string.
>
> I have a very different experience with strings. I can't even remember
> a case where I wanted to modify an existing string (this includes all
> my C and C++ usage of strings). It's always assemble a string at one
> place, and then refer to that string ever after (and never modify it).
>
> What immutable strings make possible is treating strings as if they
> were value types. Nearly every language I know of treats them as
> immutable except for C and C++.

I remember the day at Kahili we figured immutable(char)[] will just work as it needs to. It felt pretty awesome.

Andrei
Dec 28 2011
On 12/28/2011 06:40 PM, Andrei Alexandrescu wrote:
> On 12/28/11 11:11 AM, Walter Bright wrote:
>> What immutable strings make possible is treating strings as if they
>> were value types. Nearly every language I know of treats them as
>> immutable except for C and C++.
>
> I remember the day at Kahili we figured immutable(char)[] will just
> work as it needs to. It felt pretty awesome.

I agree. But I am confused by the fact that you are suggesting it actually does not work as it needs to at other places in this thread.
Dec 28 2011
On 28/12/11 5:11 PM, Walter Bright wrote:
> On 12/28/2011 4:06 AM, Peter Alexander wrote:
>> I rarely *ever* need an immutable string. What I usually need is
>> const(char)[]. I'd say 99%+ of the time I need only a const string.
>
> I have a very different experience with strings. I can't even remember
> a case where I wanted to modify an existing string (this includes all
> my C and C++ usage of strings). It's always assemble a string at one
> place, and then refer to that string ever after (and never modify it).

We can disagree on this, but I think the fact that Phobos rarely uses 'string' and instead uses 'const(char)[]' or 'in char[]' speaks louder than either of our experiences.

> What immutable strings make possible is treating strings as if they
> were value types. Nearly every language I know of treats them as
> immutable except for C and C++.

Yes, and I wouldn't want to remove that. Immutable strings are good, but requiring immutable strings when you don't need them is definitely not good. Phobos knows this, so it doesn't use string, which leads me to question what use the string alias is.
Dec 28 2011
Most common to me: buffer reuse. I'll read a line of a file into a buffer, operate on it, then read the next line into the same buffer. If references to the buffer may escape, it's obviously unsafe to cast to immutable.

Sent from my iPhone

On Dec 28, 2011, at 9:11 AM, Walter Bright <newshound2 digitalmars.com> wrote:
> On 12/28/2011 4:06 AM, Peter Alexander wrote:
>> I rarely *ever* need an immutable string. What I usually need is
>> const(char)[]. I'd say 99%+ of the time I need only a const string.
>
> I have a very different experience with strings. I can't even remember
> a case where I wanted to modify an existing string (this includes all
> my C and C++ usage of strings). It's always assemble a string at one
> place, and then refer to that string ever after (and never modify it).
>
> What immutable strings make possible is treating strings as if they
> were value types. Nearly every language I know of treats them as
> immutable except for C and C++.
Dec 28 2011
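The buffer-reuse hazard described above can be sketched as follows; this is an editor's illustration of the cast the post warns against, with made-up contents standing in for lines read from a file:

```d
void main()
{
    char[] buf = "line one".dup; // the reusable read buffer

    // Unsafe shortcut: pretend the buffer's current contents are immutable
    // while keeping a mutable reference to the same memory.
    string s = cast(string) buf;

    // Reusing the buffer for the next "line" silently changes s as well,
    // breaking the immutability guarantee the rest of the program relies on.
    buf[0 .. 8] = "line two";
    assert(s == "line two"); // s no longer holds what it promised to hold
}
```

std.exception.assumeUnique exists for the legitimate version of this cast, but its contract is that no other mutable reference survives; reusing the buffer afterwards, as above, violates exactly that.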
Peter, having string as immutable(char)[] was perhaps one of the best D2 decisions so far, in my humble opinion. I strongly disagree with you on this one.
Dec 28 2011
Having a mutable string is a bad idea also because its mutability takes the form of array-element manipulation, but a string (except for dstring) is not semantically an array of characters, so mutating its elements isn't safe.

On Wed, Dec 28, 2011 at 9:19 PM, Dejan Lekic <dejan.lekic gmail.com> wrote:
> Peter, having string as immutable(char)[] was perhaps one of the best D2
> decisions so far, in my humble opinion. I strongly disagree with you on
> this one.

--
Bye,
Gor Gyolchanyan.
Dec 28 2011
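Gor's point is that char[] exposes UTF-8 code units, not characters, so an in-place element write can corrupt a multi-byte sequence. A minimal editor's sketch (the example string is arbitrary):

```d
import std.exception : assertThrown;
import std.utf : validate;

void main()
{
    char[] s = "café".dup; // 'é' encodes as two UTF-8 code units: 0xC3 0xA9
    assert(s.length == 5); // 4 characters, 5 code units

    s[3] = 'e';            // overwrites only the first byte of 'é',
                           // leaving a stray continuation byte behind

    assertThrown(validate(s)); // the array is no longer valid UTF-8
}
```

dstring doesn't have this problem because each dchar element is a full code point, which is why the post singles it out.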
On Wed, 28 Dec 2011 14:06:06 +0200, Peter Alexander <peter.alexander.au gmail.com> wrote:
> string is immutable(char)[]
>
> I rarely *ever* need an immutable string. What I usually need is
> const(char)[]. I'd say 99%+ of the time I need only a const string.
> This is quite irritating because "string" is the most convenient and
> intuitive thing to type.
>
> I often get into situations where I've written a function that takes a
> string, and then I can't call it because all I have is a char[]. I
> could copy the char[] into a new string, but that's expensive, and I'd
> rather I could just call the function.
>
> I think it's telling that most Phobos functions use 'const(char)[]' or
> 'in char[]' instead of 'string' for their arguments. The ones that use
> 'string' are usually using it unnecessarily and should be fixed to use
> const(char)[].
>
> In an ideal world I'd much prefer if string was an alias for
> const(char)[], but string literals were immutable(char)[]. It would
> require a little more effort when dealing with concurrency, but that's
> a price I would be willing to pay to make the string alias useful in
> function parameters.

As you said, string is not a structure but an alias. Your argument is not against string but against the functions that you think should not accept only strings. If you are sure that a function could work on your char[] (but it won't accept it), that just shows we need to focus on the function rather than on string, no?
Dec 28 2011
There are a lot of people suggesting changes to how string behaves. But remember, D is awesome compared to other languages precisely because it does not wrap string in a class or struct: you can use string/char[] without losing your _nativeness_. Programmers targeting embedded systems are really happy because of this.

By the way, I don't want to blame anyone, but I think we have diverged from the original purpose of this topic: __"string is rarely useful as a function argument"__. I think the OP's point is that choosing the _string_ type for function arguments is _wrong_ in most cases. And there isn't much use of inout in Phobos because it was broken for a long time.
Dec 30 2011