digitalmars.D - The Case Against Autodecode
- Walter Bright (28/33) May 12 2016 Here are some that are not matters of opinion.
- Vladimir Panteleev (9/52) May 12 2016 12. The result of autodecoding, a range of Unicode code points,
- H. S. Teoh via Digitalmars-d (22/32) May 12 2016 Example of string special-casing leading to bugs:
- H. S. Teoh via Digitalmars-d (27/36) May 12 2016 A range of Unicode code points is not the same as a range of graphemes
- Marc Schütz (6/14) May 13 2016 In fact, even most European languages are affected if NFD
- Marco Leise (13/19) May 13 2016 +1 for leaning back and contemplate exactly what auto-decode
- H. S. Teoh via Digitalmars-d (7/28) May 13 2016 [...]
- Daniel Kozak (12/55) May 12 2016 For me it is not about autodecoding. I would like to have
- Walter Bright (2/10) May 12 2016 I can't find any actionable information in this.
- Marco Leise (13/15) May 12 2016 More precisely they are byte strings with '/' reserved to
- Walter Bright (9/11) May 12 2016 I would have agreed with you in the past, but more and more it just does...
- Jack Stouffer (17/18) May 12 2016 If you're serious about removing auto-decoding, which I think you
- Jack Stouffer (11/19) May 12 2016 To hammer this home a little more, Python 3 had a really useful
- Walter Bright (2/6) May 12 2016 I agree, if it is possible at all.
- Chris (13/23) May 13 2016 I don't know to which extent my problems with string handling are
- Walter Bright (2/5) May 13 2016 You can avoid autodecode by using .byChar
- Chris (8/16) May 13 2016 Hm. It would be difficult to make sure that my whole code base
- Vladimir Panteleev (2/7) May 13 2016 https://twitter.com/StopForumSpam
- Chris (3/11) May 13 2016 I don't understand. Does that mean we have to solve CAPTCHAs
- Iakh (9/19) May 13 2016 A plan:
- Ola Fosheim Grøstad (10/16) May 13 2016 Python 2 is/was deployed at a much larger scale and with far more
- Nick Treleaven (8/12) May 13 2016 char[] is always going to be unsafe for UTF-8. I don't think we
- H. S. Teoh via Digitalmars-d (9/21) May 13 2016 alias String = typeof(std.uni.byGrapheme(immutable(char)[].init));
- Nick Sabalausky (20/35) May 29 2016 As much as I agree on the importance of a good smooth migration path, I
- Jack Stouffer (12/22) May 29 2016 If it happens, they better. The D1 fork was maintained for almost
- Nick Sabalausky (3/9) May 30 2016 D1 -> D2 was a vastly more disruptive change than getting rid of
- Jack Stouffer (3/5) May 30 2016 Don't be so sure. All string handling code would become broken,
- Andrei Alexandrescu (3/8) May 30 2016 That kind of makes this thread less productive than "How to improve
- Dmitry Olshansky (6/15) May 30 2016 1. Generalize to all ranges of code units i.e. ranges of char/wchar.
- Jack Stouffer (5/7) May 30 2016 Please don't misunderstand, I'm for fixing string behavior. But,
- Andrei Alexandrescu (5/9) May 30 2016 Surely the misunderstanding is not on this side of the table :o). By
- Jonathan M Davis via Digitalmars-d (16/25) May 31 2016 I think that the first step is getting Phobos to work with all ranges of
- Andrei Alexandrescu (3/10) May 31 2016 Great. Could you put together a sample PR so we understand the
- Vladimir Panteleev (11/16) May 30 2016 Assuming silent breakage is on the table, what would be broken,
- Seb (4/21) May 30 2016 132 lines in Phobos use auto-decoding - that should be fixable ;-)
- Andrei Alexandrescu (3/26) May 30 2016 Thanks for this investigation! Results are about as I'd have speculated....
- Jack Stouffer (8/11) May 30 2016 Did it, the results are a large number of phobos modules fail to
- Andrei Alexandrescu (3/14) May 30 2016 It was also made at a time when the community was smaller by a couple
- Chris (10/34) May 30 2016 I suggest providing an automatic tool (either within the compiler
- Marco Leise (39/44) May 30 2016 It makes a difference for every function. But it still isn't
- Chris (4/5) May 30 2016 I was actually talking about ICU with a colleague today. Could it
- Marco Leise (40/45) May 30 2016 You have to compare to the situation before, when every
- Joakim (33/38) May 31 2016 Part of it is the complexity of written language, part of it is
- Jonathan M Davis via Digitalmars-d (18/24) May 31 2016 Considering that *nix land uses UTF-8 almost exclusively, and many C
- Joakim (26/55) May 31 2016 And there are a lot more languages that will be twice as long
- Marco Leise (36/68) May 31 2016 Maybe you can dig up your old post and we can look at each of
- Joakim (29/97) May 31 2016 Not interested. I believe you were part of that thread then.
- Timon Gehr (3/16) May 31 2016 It is probably this one. Not sure what "exactly the issues" are though.
- Walter Bright (10/11) May 31 2016 I agree. I dealt the madness of code pages, Shift-JIS, EBCDIC, locales, ...
- ag0aep6g (4/8) May 31 2016 Wikipedia says [1] that UCS-2 is essentially UTF-16 without surrogate
- Walter Bright (2/5) May 31 2016 Thanks for the correction.
- Marco Leise (11/14) May 31 2016 I think so too, although more APIs than just Windows use
- ag0aep6g (4/10) May 31 2016 Guys, may I ask you to move this discussion to a new thread? I'd like to...
- Joakim (4/18) May 31 2016 No, this is the root of the problem, but I'm not interested in
- Marc Schütz (14/33) Jun 01 2016 I assume you're talking about the web here. In this case, plain
- Joakim (12/48) Jun 01 2016 No, I explicitly said not the web in a subsequent post. The
- Marco Leise (23/25) Jun 01 2016 I've used 56k and had a phone conversation with my sister
- Joakim (92/165) Jun 01 2016 I see that max 2G speeds are 100-200 kbits/s. At that rate, it
- Wyatt (20/30) Jun 01 2016 It's not hard. I think a lot of us remember when a 14.4 modem
- Patrick Schluter (17/50) Jun 01 2016 Indeed, Joakim's proposal is so insane it beggars belief (why not
- deadalnix (2/4) Jun 01 2016 That should be obvious to anyone living outside the USA.
- Nick Sabalausky (4/8) Jun 01 2016 Or anyone in the USA who's ever touched a product that includes a manual...
- Kagamin (2/7) Jun 01 2016 https://msdn.microsoft.com/th-th inside too :)
- Kagamin (3/7) Jun 01 2016 UTF-8 encoded SMS work fine for me in GSM network, didn't notice
- Adam D. Ruppe (10/12) May 30 2016 Actually, my main rule of thumb is: don't mess with strings. Get
- Jack Stouffer (4/11) May 12 2016 This is a great example of special casing in Phobos that someone
- Bill Hicks (10/43) May 12 2016 Wow, that's eleven things wrong with just one tiny element of D,
- Ethan Watson (5/6) May 13 2016 Actually, chap, it's the attitude that's the turn-off in your
- poliklosio (10/21) May 13 2016 You get banned because there is a difference between torpedoing a
- Ola Fosheim Grøstad (49/58) May 15 2016 Not really. The dominating precursor to C, BCPL was a
- Chris (12/21) May 13 2016 Is there any PL that doesn't have multiple issues? Look at Swift.
- Kagamin (4/6) May 13 2016 D is a better broken thing among all the broken things in this
- Walter Bright (4/8) May 13 2016 Posts that engage in personal attacks and bring up personal issues about...
- Jonathan M Davis via Digitalmars-d (37/73) May 13 2016 It also results in constantly special-casing algorithms for narrow strin...
- Chris (4/16) May 13 2016 Why not just try it in a separate test release? Only then can we
- Marc Schütz (11/30) May 13 2016 char[], wchar[] etc. can simply be made non-ranges, so that the
- Jonathan M Davis via Digitalmars-d (13/32) May 13 2016 It also means yet more special cases. You have arrays which aren't treat...
- Kagamin (3/6) May 13 2016 UTF-16 was a migration from UCS-2, and UCS-2 was superior at the
- Jonathan M Davis via Digitalmars-d (25/31) May 13 2016 The history of why UTF-16 was chosen isn't really relevant to my point
- Marc Schütz (3/8) May 13 2016 This just means that filenames mustn't be represented as strings;
- Walter Bright (5/12) May 13 2016 It means much more than that, filenames are just an example. I recently ...
- Steven Schveighoffer (17/19) May 13 2016 I'll repeat what I said in the other thread.
- Alex Parrill (9/31) May 13 2016 Well, the "auto" part of autodecoding means "automatically doing
- Steven Schveighoffer (10/44) May 13 2016 No, the problem isn't the auto-decoding. The problem is having *arrays*
- Jon D (49/59) May 15 2016 Given the importance of performance in the auto-decoding topic,
- Jack Stouffer (5/9) May 15 2016 Here is another benchmark (see the above comment for the code to
- H. S. Teoh via Digitalmars-d (99/110) May 15 2016 I decide to do my own benchmarking too. Here's the code:
- jmh530 (2/12) May 16 2016 Interesting that LDC is slower than DMD for char[].
- Andrei Alexandrescu (75/114) May 26 2016 This might be a good time to discuss this a tad further. I'd appreciate
- Jack Stouffer (28/39) May 26 2016 For an example where the std.algorithm/range functions don't cut
- H. S. Teoh via Digitalmars-d (87/134) May 26 2016 [...]
- Andrei Alexandrescu (2/9) May 26 2016 No, that's not necessary (or correct). -- Andrei
- Marco Leise (12/36) May 30 2016 Am Thu, 26 May 2016 16:23:16 -0700
- Andrew Godfrey (4/4) May 30 2016 I like "make string iteration explicit" but I wonder about other
- Adam D. Ruppe (5/9) May 30 2016 The comparison predicate does that...
- Andrew Godfrey (7/16) May 30 2016 Thanks! You left out some details but I think I see - an example
- Marco Leise (33/37) May 30 2016 You are just scratching the surface! Unicode strings are
- Vladimir Panteleev (31/111) May 26 2016 It is completely wasted mental effort.
- Jonathan M Davis via Digitalmars-d (5/20) May 31 2016 In addition, as soon as you have ubyte[], none of the string-related
- Andrei Alexandrescu (2/5) May 31 2016 That'd be nice to fix indeed. Please break the ground? -- Andrei
- Kagamin (8/14) May 27 2016 Sounds like you want to say that string should be smarter than an
- Andrei Alexandrescu (3/8) May 27 2016 That's my understanding too. And I think the design rationale is wrong.
- Marc Schütz (30/104) May 27 2016 It is not, which has been shown by various posts in this thread.
- Andrei Alexandrescu (3/4) May 27 2016 Couldn't quite find strong arguments. Could you please be more explicit
- Marc Schütz (36/41) May 28 2016 There are several possibilities of what iteration over a char
- Andrei Alexandrescu (10/13) May 28 2016 OK, that's a fair argument, thanks. So it seems there should be no
- Walter Bright (6/8) May 28 2016 An array of code units provides consistency, predictability, flexibility...
- Andrew Godfrey (18/27) May 28 2016 You're right. An "array of code units" is a very useful low-level
- Chris (14/17) May 29 2016 Unicode graphemes are not always the same as graphemes in natural
- Tobias Müller (10/22) May 29 2016 No, this is well established terminology, you are confusing
- default0 (6/29) May 29 2016 I am pretty sure that a single grapheme in unicode does not
- Tobias M (9/13) May 29 2016 Grapheme is a linguistic term. AFAIUI, a grapheme cluster is a
- Chris (4/18) May 29 2016 Which is why we need to agree on a terminology, i.e. be clear
- Chris (14/37) May 29 2016 Ok, you have a point there, to be precise is a multigraph (a
- Tobias M (6/18) May 29 2016 What I meant was, a phoneme is the "character" (smallest unit) in
- H. S. Teoh via Digitalmars-d (22/34) May 29 2016 [...]
- Walter Bright (4/8) May 29 2016 As far as D is concerned, we are not going to invent our own concepts ar...
- Walter Bright (2/3) May 29 2016 For D, we should stick with the terminology as defined by Unicode.
- Andrei Alexandrescu (3/11) May 30 2016 Buying it. -- Andrei
- Timon Gehr (3/13) May 30 2016 I'm buying it. IMO alias string=immutable(char)[] is the most useful
- Andrei Alexandrescu (22/36) May 30 2016 Wouldn't D then be seen (and rightfully so) as largely not supporting
- H. S. Teoh via Digitalmars-d (24/64) May 30 2016 They already randomly work or not work on ranges of dchar. I hope we
- Walter Bright (4/6) May 30 2016 When I wrote Warp, the only point of which was speed, I couldn't use pho...
- Chris (6/14) May 31 2016 Two questions:
- Walter Bright (8/11) May 31 2016 It's been a while so I don't remember exactly, but as I recall if the AP...
- Timon Gehr (16/57) May 30 2016 In D, enum does not mean enumeration, const does not mean constant, pure...
- Nick Sabalausky (2/4) May 30 2016 My new favorite quote :)
- Jack Stouffer (5/9) May 28 2016 Yes!
- Dicebot (13/20) May 29 2016 Ideally there should not be a way to iterate a (unicode) string at all
- Marc Schütz (11/26) May 30 2016 I think this is going too far. It's sufficient if they (= char
- Andrei Alexandrescu (2/20) May 30 2016 That's... what I said. -- Andrei
- Adam D. Ruppe (9/10) May 30 2016 You said "not arrays", he said "not ranges".
- Seb (15/26) May 30 2016 That's a great idea - the compiler should also issue deprecation
- ag0aep6g (5/16) May 30 2016 All this is only sensible when we move to a dedicated string type that's...
- Marc Schütz (5/10) May 30 2016 I agree; most of the troubles have been with auto-decoding. In an
- Walter Bright (3/4) May 30 2016 Why? strings are arrays of code units. All the trouble comes from errati...
- Andrei Alexandrescu (5/10) May 30 2016 That's not an argument. Objects are arrays of bytes, or tuples of their
- Walter Bright (4/16) May 31 2016 If there is an abstraction for strings that is efficient, consistent, us...
- deadalnix (4/27) May 31 2016 Thing is, more info is needed to support unicode properly.
- Andrei Alexandrescu (6/25) May 31 2016 It's been mentioned several times: a string type that does not offer
- deadalnix (4/5) May 31 2016 It is a slice type. It should work as a slice type. Every other
- Jonathan M Davis via Digitalmars-d (55/61) May 31 2016 Not exactly. Such a string type does not hide the fact that it's UTF.
- Andrei Alexandrescu (2/12) May 31 2016 How is that different from what I said? -- Andrei
- Jonathan M Davis via Digitalmars-d (9/22) May 31 2016 My point was that Walter was stating that you can't have a type that hid...
- Marc Schütz (38/43) May 31 2016 So, strings are _implemented_ as arrays of code units. But
- Seb (8/14) May 31 2016 If we follow Adam's proposal to deprecate front, back, popFront
- ag0aep6g (6/9) May 31 2016 After checking some of those 132 places, they are in generic functions
- Andrei Alexandrescu (5/7) May 31 2016 It is terrible, no two ways about it. We've been very very careful with
- Kagamin (4/11) May 31 2016 If the user doesn't know how he wants to iterate and you leave
- Adam D. Ruppe (5/7) May 30 2016 I don't agree on changing those. Indexing and slicing a char[] is
- Walter Bright (2/5) May 30 2016 Yup. It isn't hard at all to use arrays of codeunits correctly.
- Andrei Alexandrescu (3/10) May 30 2016 Trouble is, it isn't hard at all to use arrays of codeunits incorrectly,...
- H. S. Teoh via Digitalmars-d (6/16) May 30 2016 Neither does autodecoding make code anymore correct. It just better
- default0 (10/26) May 31 2016 Thinking about this a bit more - what algorithms are actually
- Marco Leise (9/16) May 31 2016 Calculating the buffer size of a string, validation and
- Jonathan M Davis via Digitalmars-d (19/28) May 31 2016 Equality does not require decoding. Similarly, functions like find don't
- Andrei Alexandrescu (9/20) May 31 2016 Good idea. We could overload functions such as find on char, wchar, and
- Marco Leise (14/21) May 31 2016 Both "equality" and "find" require byGrapheme.
- Dmitry Olshansky (6/13) May 31 2016 Ehm as long as all you care for is operating on substrings I'd say.
- Jonathan M Davis via Digitalmars-d (8/20) May 31 2016 Yeah, but Phobos provides the tools to do that reasonably easily even wh...
- H. S. Teoh via Digitalmars-d (15/26) May 31 2016 [...]
- Chris (24/53) May 27 2016 On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu
- Andrei Alexandrescu (6/48) May 27 2016 That's what happens with autodecoding.
- ag0aep6g (5/19) May 27 2016 They only work "properly" if you define "properly" as "in terms of code
- Chris (11/18) May 27 2016 I agree. It has happened to me that characters like "é" return
- Andrei Alexandrescu (2/3) May 27 2016 Would normalization make length 1? -- Andrei
- Adam D. Ruppe (2/3) May 27 2016 In some, but not all cases.
- Dmitry Olshansky (4/7) May 27 2016 No, this is not the point of normalization.
- Andrei Alexandrescu (2/8) May 27 2016 What is? -- Andrei
- Minas Mina (5/15) May 27 2016 This video will be helpfull :)
- tsbockman (16/19) May 27 2016 1) A grapheme may include several combining characters (such as
- Dmitry Olshansky (5/23) May 28 2016 Quite accurate statement of the goals. Normalization is all about having...
- Minas Mina (12/22) May 27 2016 Here is an example about normalization.
- David Nadlinger (4/7) May 27 2016 Unless I'm mistaken, this depends on the form used. For example,
- Jonathan M Davis via Digitalmars-d (7/13) May 31 2016 Yeah. For better or worse, there are different normalization schemes for
- Chris (3/7) May 28 2016 No, I've tried it. I think dchar[] returns one or you check by
- H. S. Teoh via Digitalmars-d (51/74) May 27 2016 Exactly. And we just keep getting stuck on this point. It seems that the
- Andrei Alexandrescu (4/8) May 27 2016 Which languages are covered by code points, and which languages require
- ag0aep6g (9/12) May 27 2016 I don't think there is value in distinguishing by language. The point of...
- Andrei Alexandrescu (3/5) May 27 2016 It seems code points are kind of useless because they don't really mean
- ag0aep6g (5/7) May 27 2016 I think so, yeah.
- H. S. Teoh via Digitalmars-d (7/13) May 27 2016 That's what we've been trying to say all along! :-P They're a kind of
- Andrei Alexandrescu (2/3) May 27 2016 If that's the case things are pretty dire, autodecoding or not. -- Andre...
- H. S. Teoh via Digitalmars-d (8/13) May 27 2016 Like it or not, Unicode ain't merely some glorified form of C's ASCII
- Jonathan M Davis via Digitalmars-d (21/24) May 31 2016 True enough. Correctly handling Unicode in the general case is ridiculou...
- Tobias M (9/21) May 29 2016 Code points are *the fundamental unit* of unicode. AFAIK most
- Andrei Alexandrescu (2/21) May 29 2016 So now code points are good? -- Andrei
- H. S. Teoh via Digitalmars-d (10/35) May 29 2016 It depends on what you're trying to accomplish. That's the point we're
- Andrei Alexandrescu (4/10) May 30 2016 I see. Again this all to me sounds like "naked arrays of characters are
- Jonathan M Davis via Digitalmars-d (18/26) May 31 2016 Exactly. And even a given function can't necessarily always be defined t...
- Adam D. Ruppe (11/13) May 27 2016 It might help to think of code points as being a kind of byte
- H. S. Teoh via Digitalmars-d (9/27) May 27 2016 Fun fact: on some old Unix boxen, Backspace + underscore was interpreted
- Steven Schveighoffer (6/11) May 27 2016 The only unmistakably correct use I can think of is transcoding from one...
- H. S. Teoh via Digitalmars-d (43/52) May 27 2016 This is a complicated issue; for a full explanation you'll probably want
- Marco Leise (11/33) May 30 2016 1: Auto-decoding shall ALWAYS do the proper thing
- Marco Leise (4/5) May 30 2016 *Correction: Koreans
- Jonathan M Davis via Digitalmars-d (22/71) May 31 2016 Exactly. Saying that operating at the code point level - UTF-32 - is cor...
- Andrei Alexandrescu (4/6) May 31 2016 Could you please substantiate that? My understanding is that code unit
- Jonathan M Davis via Digitalmars-d (57/63) May 31 2016 Okay. If you have the letter A, it will fit in one UTF-8 code unit, one
- Andrei Alexandrescu (9/60) May 31 2016 Does walkLength yield the same number for all representations?
- Timon Gehr (3/6) May 31 2016 code point
- Andrei Alexandrescu (2/5) May 31 2016 foreach, too. -- Andrei
- ZombineDev (3/9) Jun 01 2016 Incorrect. https://dpaste.dzfl.pl/ba7a65d59534
- Andrei Alexandrescu (2/11) Jun 01 2016 Try typing the iteration variable with "dchar". -- Andrei
- Adam D. Ruppe (4/5) Jun 01 2016 Or you can type it as wchar...
- ZombineDev (7/20) Jun 01 2016 I think you are not getting my point. This is not autodecoding.
- ZombineDev (2/23) Jun 01 2016 in std.range.primitives.
- Andrei Alexandrescu (14/16) Jun 01 2016 I understand where you're coming from, but it actually is autodecoding.
- Jack Stouffer (9/12) Jun 01 2016 This seems to be a miscommunication with semantics. This is not
- Andrei Alexandrescu (7/14) Jun 01 2016 No, this is autodecoding pure and simple. We can't move the goals
- Timon Gehr (5/21) Jun 02 2016 It does not share most of the characteristics that make Phobos'
- ZombineDev (25/43) Jun 01 2016 Regardless of how different people may call it, it's not what
- Andrei Alexandrescu (13/25) Jun 01 2016 Yes, definitely - but then again we can't after each invalidated claim
- Kagamin (5/12) Jun 02 2016 Do you mean you agree that range primitives for strings can be
- ZombineDev (51/80) Jun 02 2016 My claim was not invalidated. I just didn't want to waste time
- ZombineDev (7/15) Jun 02 2016 B) This strange feature is here because we chose compatibility
- Andrei Alexandrescu (78/158) Jun 02 2016 Your claim was obliterated, and now you continue arguing it by adjusting...
- Timon Gehr (3/25) Jun 02 2016 It's not "on the fly". You two were presumably using different
- cym13 (23/97) Jun 02 2016 If you are to stay with autodecoding (and I hope you won't) then
- tsbockman (4/8) Jun 02 2016 That would cause just as much - if not more - code breakage as
- Andrei Alexandrescu (13/38) Jun 02 2016 That's not going to work. A false impression created in this thread has
- Marc Schütz (9/11) Jun 02 2016 They _are_ useless for almost anything you can do with strings.
- Andrei Alexandrescu (51/59) Jun 02 2016 Pretty much everything. Consider s and s1 string variables with possibly...
- ag0aep6g (5/9) Jun 02 2016 Doesn't work with autodecoding (to code points) when a combining
- Andrei Alexandrescu (2/11) Jun 02 2016 Works if s is normalized appropriately. No?
- Timon Gehr (2/13) Jun 02 2016 No. assert(!"ö̶".normalize!NFC.any!(c => c== 'ö'));
- ag0aep6g (7/18) Jun 02 2016 Works when normalized to precomposed characters, yes.
- tsbockman (10/13) Jun 02 2016 Your 'ö' examples will NOT work reliably with auto-decoded code
- Andrei Alexandrescu (7/19) Jun 02 2016 They do work per spec: find this code point. It would be surprising if
- Brad Anderson (4/15) Jun 02 2016 If there were to be a unicode lieutenant, Dmitry seems to be the
- ag0aep6g (6/8) Jun 02 2016 The "spec" here is how the range primitives for narrow strings are
- Andrei Alexandrescu (3/4) Jun 02 2016 And want to return to the point where char[] is but an indiscriminated
- default0 (4/10) Jun 02 2016 Just make RCStr the most amazing string type of any standard
- Andrei Alexandrescu (2/10) Jun 02 2016 Soon as this thread ends. -- Andrei
- ag0aep6g (6/8) Jun 02 2016 I think you'd have to substantiate how that would be worse than
- Andrei Alexandrescu (5/13) Jun 02 2016 I gave a long list of std.algorithm uses that perform virtually randomly...
- ag0aep6g (5/6) Jun 02 2016 Yes it does. You've been given plenty examples where it falls apart.
- Andrei Alexandrescu (6/12) Jun 02 2016 That is correct.
- Timon Gehr (4/6) Jun 02 2016 So far, I needed to count the number of characters 'ö' inside some
- Timon Gehr (4/11) Jun 02 2016 (Obviously this isn't even what the example would do. I predict I will
- Andrei Alexandrescu (4/14) Jun 02 2016 You may look for a specific dchar, and it'll work. How about
- Timon Gehr (11/27) Jun 02 2016 .̂ ̪.̂
- Andrei Alexandrescu (4/14) Jun 02 2016 Count delimited words. Did you also look at balancedParens?
- Timon Gehr (5/30) Jun 02 2016 On 02.06.2016 22:01, Timon Gehr wrote:
- ag0aep6g (7/9) Jun 02 2016 They're simply not possible. Won't compile. There is no single UTF-8
- ag0aep6g (4/10) Jun 02 2016 I'm ignoring combining characters there. You can search for 'a' in code
- Andrei Alexandrescu (12/23) Jun 02 2016 Of course you can. Can you search for an int in a short[]? Oh yes you
- Andrei Alexandrescu (2/6) Jun 02 2016 Correx, indeed you can't. -- Andrei
- ag0aep6g (4/13) Jun 02 2016 Yes, you're right, of course they do. char implicitly converts to dchar....
- Andrei Alexandrescu (3/17) Jun 02 2016 I do think that's an interesting option in PL design space, but that
- tsbockman (23/42) Jun 02 2016 Your examples will pass or fail depending on how (and whether)
- Andrei Alexandrescu (8/14) Jun 02 2016 And that's fine. Want graphemes, .byGrapheme wags its tail in that
- H. S. Teoh via Digitalmars-d (27/44) Jun 02 2016 [...]
- deadalnix (44/101) Jun 02 2016 False. Many characters can be represented by different sequences
- Andrei Alexandrescu (2/10) Jun 02 2016 True. "Are all code points equal to this one?" -- Andrei
- Timon Gehr (2/13) Jun 02 2016 I.e. you are saying that 'works' means 'operates on code points'.
- Andrei Alexandrescu (2/3) Jun 02 2016 Affirmative. -- Andrei
- H. S. Teoh via Digitalmars-d (23/27) Jun 02 2016 Again, a ridiculous position. I can use exactly the same line of
- cym13 (12/26) Jun 02 2016 A:“We should decode to code points”
- Andrei Alexandrescu (3/11) Jun 02 2016 With autodecoding all of std.algorithm operates correctly on code
- Timon Gehr (2/14) Jun 02 2016 No, without it, it operates correctly on code units.
- cym13 (35/50) Jun 02 2016 Allow me to try another angle:
- Andrei Alexandrescu (17/40) Jun 02 2016 You mean all 35 of them?
- H. S. Teoh via Digitalmars-d (10/23) Jun 02 2016 With ASCII strings, all of std.algorithm operates correctly on ASCII
- deadalnix (11/25) Jun 02 2016 The good thing when you define works by whatever it does right
- Andrei Alexandrescu (2/3) Jun 02 2016 No, it works as it was designed. -- Andrei
- deadalnix (3/7) Jun 02 2016 Nobody says it doesn't. Everybody says the design is crap.
- Andrei Alexandrescu (2/8) Jun 02 2016 I think I like it more after this thread. -- Andrei
- deadalnix (4/15) Jun 02 2016 You start reminding me of the joke with that guy complaining that
- Andrei Alexandrescu (2/15) Jun 02 2016 Touché. (Get it?) -- Andrei
- Andrei Alexandrescu (4/13) Jun 02 2016 Meh, thinking of it again: I don't like it more, I'd still do it
- Nick Sabalausky (2/11) Jun 03 2016 Well there's a fantastic argument.
- Timon Gehr (4/6) Jun 02 2016 It also has false positives (you can combine 'ö' with some combining
- Walter Bright (5/15) Jun 02 2016 There are 3 levels of Unicode support. What Andrei is talking about is L...
- Andrei Alexandrescu (2/20) Jun 02 2016 Apparently I'm not the only idiot. -- Andrei
- deadalnix (3/26) Jun 02 2016 To be able to convert back and forth from/to unicode in a
- Walter Bright (2/6) Jun 02 2016 Sorry, that makes no sense, as it is saying "they're the same, only diff...
- John Colvin (6/29) Jun 02 2016 There are languages that make heavy use of diacritics, often
- Jonathan M Davis via Digitalmars-d (7/16) Jun 02 2016 Yeah. I'm inclined to think that the fact that there are multiple
- Walter Bright (4/9) Jun 02 2016 I didn't say ordering, I said there should be no such thing as "normaliz...
- H. S. Teoh via Digitalmars-d (51/63) Jun 03 2016 I think it was a combination of historical baggage and trying to
- Walter Bright (7/11) Jun 03 2016 It is not inevitable. Simply disallow the 2 codepoint sequences - the si...
- Vladimir Panteleev (5/19) Jun 03 2016 I don't think it would work (or at least, the analogy doesn't
- Jonathan M Davis via Digitalmars-d (16/38) Jun 03 2016 I would have argued that no composited characters should have ever exist...
- Chris (2/46) Jun 03 2016 I do exactly this. Validate and normalize.
- deadalnix (3/4) Jun 05 2016 And once you've done this, auto decoding is useless because the
- Walter Bright (3/6) Jun 03 2016 So don't add new precomposited characters when a recognized existing seq...
- Walter Bright (6/16) Jun 03 2016 I don't see that this is tricky at all. Adding additional semantic meani...
- Vladimir Panteleev (7/31) Jun 03 2016 That's not right either. Cyrillic letters can look slightly
- H. S. Teoh via Digitalmars-d (17/42) Jun 03 2016 Yeah, lowercase Cyrillic П is п, which looks like lowercase Greek π i...
- Walter Bright (2/6) Jun 03 2016 It's almost as if printed documents and books have never existed!
- H. S. Teoh via Digitalmars-d (30/37) Jun 03 2016 But if we were to encode appearance instead of logical meaning, that
- Walter Bright (11/36) Jun 03 2016 No.
- H. S. Teoh via Digitalmars-d (48/69) Jun 03 2016 [...]
- Walter Bright (13/44) Jun 03 2016 It works for books. Unicode invented a problem, and came up with a thoro...
- H. S. Teoh via Digitalmars-d (37/92) Jun 03 2016 This madness already exists *without* Unicode. If you have a page with a
- Walter Bright (14/30) Jun 04 2016 It's not a problem that Unicode can solve. As you said, the meaning is i...
- docandrew (17/58) Jun 05 2016 Even if a character in different languages share a glyph or look
- deadalnix (5/7) Jun 05 2016 Interestingly enough, I've mentioned earlier here that only
- Walter Bright (8/15) Jun 05 2016 You'd be in error. I've been casually working on my grandfather's thesis...
- Patrick Schluter (30/48) Jun 04 2016 In Unicode there are 2 different codepoints for lower case sigma
- Patrick Schluter (25/25) Jun 04 2016 One has also to take into consideration that Unicode is the way
- ketmar (6/8) Jun 03 2016 some old xUSSR books which has some English text sometimes used
- Walter Bright (2/3) Jun 03 2016 Nobody here suggested using the wrong font, it's completely irrelevant.
- ketmar (4/8) Jun 03 2016 you suggested that unicode designers should make similar-looking
- deadalnix (2/12) Jun 05 2016 TIL: books are read by computers.
- Walter Bright (2/3) Jun 05 2016 I should introduce you to a fabulous technology called OCR. :-)
- Walter Bright (2/7) Jun 03 2016 How did people ever get by with printed books and documents?
- Timon Gehr (2/12) Jun 03 2016 They can disambiguate the letters based on context well enough.
- Walter Bright (4/7) Jun 03 2016 Characters do not have semantic meaning. Their meaning is always inferre...
- Adam D. Ruppe (4/5) Jun 03 2016 Printed books pick one font and one layout, then is read by
- Jonathan M Davis via Digitalmars-d (33/50) Jun 03 2016 Actually, I would argue that the moment that Unicode is concerned with w...
- Walter Bright (4/9) Jun 03 2016 What I meant was pretty clear. Font is an artistic style that does not c...
- Adam D. Ruppe (5/6) Jun 03 2016 Nah, then it is an Awesome Font that is totally Web Scale!
- Jonathan M Davis via Digitalmars-d (35/45) Jun 05 2016 Well, maybe I misunderstood what was being argued, but it seemed like yo...
- Dmitry Olshansky (6/26) Jun 03 2016 Yeah, Unicode was not meant to be easy it seems. Or this is whatever
- Alix Pexton (38/44) Jun 04 2016 Typing as someone who as spent some time creating typefaces, having two
- Timon Gehr (34/118) Jun 02 2016 Doesn't work. Shouldn't compile. (char and wchar shouldn't be comparable...
- jmh530 (25/27) Jun 02 2016 In Andrei's original post, he says that s is a string variable.
- Andrei Alexandrescu (3/4) Jun 02 2016 That would be another language design option, which we don't have the
- Andrei Alexandrescu (3/4) Jun 02 2016 As expected. Different code units for different folks. That's a
- Andrei Alexandrescu (2/4) Jun 02 2016 The goal is to operate on code units. -- Andrei
- Andrei Alexandrescu (2/6) Jun 02 2016 s/units/points/
- ag0aep6g (5/6) Jun 02 2016 You sure you got the right word there? The code unit is the smallest
- Andrei Alexandrescu (3/4) Jun 02 2016 By whom? The "support level 1" folks yonder at the Unicode standard? :o)...
- tsbockman (6/24) Jun 02 2016 From the standard:
- Andrei Alexandrescu (4/7) Jun 02 2016 Code point/Level 1 support sounds like a sweet spot between
- ag0aep6g (3/5) Jun 02 2016 Do they say that level 1 should be the default, and do they give a
- Andrei Alexandrescu (6/11) Jun 02 2016 No, but that sounds agreeable to me, especially since it breaks no code
- default0 (36/42) Jun 02 2016 The level 2 support description noted that it should be opt-in
- Walter Bright (5/7) Jun 02 2016 The o is inferred as a wchar. The lamda then is inferred to return a wch...
- Timon Gehr (4/14) Jun 02 2016 It still would not be the right thing. The lambda shouldn't compile. It
- Andrei Alexandrescu (2/3) Jun 02 2016 But it is meaningful to compare Unicode code points. -- Andrei
- Timon Gehr (3/6) Jun 02 2016 It is also meaningful to compare two utf-8 code units or two utf-16 code...
- Andrei Alexandrescu (2/10) Jun 02 2016 By decoding them of course. -- Andrei
- Timon Gehr (6/17) Jun 02 2016 That makes no sense, I cannot decode single code units.
- Andrei Alexandrescu (2/21) Jun 02 2016 Then you lost me. (I'm sure you're making a good point.) -- Andrei
- Timon Gehr (4/28) Jun 02 2016 Basically:
- Walter Bright (5/21) Jun 02 2016 Yes, you have a good point. But we do allow things like:
- Timon Gehr (15/22) Jun 02 2016 Well, this is a somewhat different case, because 10000 is just not
- Walter Bright (4/7) Jun 02 2016 Not exactly. (c == 'ö') is always false for the same reason that (b == ...
- Timon Gehr (7/17) Jun 02 2016 Yes. And _additionally_, some other concerns apply that are not there
- Vladimir Panteleev (10/13) Jun 02 2016 Why allowing char/wchar/dchar comparisons is wrong:
- Andrei Alexandrescu (2/8) Jun 02 2016 The lambda returns bool. -- Andrei
- Walter Bright (3/6) Jun 02 2016 Can be made to work without autodecoding.
- Andrei Alexandrescu (6/13) Jun 02 2016 By special casing? Perhaps. I seem to recall though that one major issue...
- Timon Gehr (3/16) Jun 02 2016 The major issue is that it special cases when there's different, more
- Walter Bright (6/17) Jun 02 2016 The argument to canFind() can be detected as not being a char, then deco...
- Jonathan M Davis via Digitalmars-d (7/19) Jun 02 2016 How do you suggest that we handle the normalization issue? Should we jus...
- Walter Bright (2/3) Jun 02 2016 Started a new thread for that one.
- Jonathan M Davis via Digitalmars-d (63/78) Jun 02 2016 Yeah, I believe that you do have to do some special casing, though it wo...
- Marco Leise (24/30) Jun 02 2016 Andrei, your ignorance is really starting to grind on
- Walter Bright (7/8) Jun 02 2016 That's my fault.
- Andrei Alexandrescu (40/70) Jun 02 2016 Indeed there seem to be serious questions about my competence, basic
- Marco Leise (62/143) Jun 03 2016 That's not my general impression, but something is different
- Jonathan M Davis via Digitalmars-d (40/44) Jun 03 2016 It comes down to the question of whether it's better to fail quickly whe...
- jmh530 (43/52) Jun 02 2016 I've been lurking on this thread for a while and was convinced by
- Andrei Alexandrescu (26/38) Jun 02 2016 Yah, this is a bummer and one of the larger issues of our community:
- Adam D. Ruppe (9/12) Jun 02 2016 We wrote a PR to implement the first step in the autodecode
- deadalnix (2/14) Jun 02 2016 https://www.youtube.com/watch?v=MJiBjfvltQw
- Kagamin (3/5) Jun 02 2016 It outright deprecated popFront - that's not the first step in
- Adam D. Ruppe (4/6) Jun 02 2016 Which gave us the list of places inside Phobos to fix, only about
- Kagamin (4/7) Jun 02 2016 Yes, it was a research PR that was never meant to be an
- Andrei Alexandrescu (3/10) Jun 02 2016 I closed it because it wasn't an actual implementation, in full
- Walter Bright (3/6) Jun 02 2016 Nothing prevents anyone from doing that on their own (it's trivial) in o...
- Walter Bright (8/9) Jun 02 2016 That's right. It's going about things backwards.
- Adam D. Ruppe (4/6) Jun 02 2016 The compiler can help you with that. That's the point of the do
- Walter Bright (2/4) Jun 02 2016 What is supposed to be done with "do not merge" PRs other than close the...
- Jack Stouffer (3/5) Jun 02 2016 Experimentally iterate until something workable comes about. This
- tsbockman (9/11) Jun 02 2016 Occasionally people need to try something on the auto tester (not
- Andrei Alexandrescu (2/10) Jun 02 2016 Feel free to reopen if it helps, it wasn't closed in anger. -- Andrei
- Walter Bright (5/15) Jun 02 2016 That doesn't seem to apply here, either.
- tsbockman (5/18) Jun 02 2016 I was just responding to the general question you posed about "do
- Andrei Alexandrescu (13/48) Jun 02 2016 You mean https://github.com/dlang/phobos/pull/4384, the one with "[do
- Adam D. Ruppe (18/20) Jun 02 2016 Not at this time, no, but I also wouldn't advise you to close it
- Andrei Alexandrescu (3/4) Jun 02 2016 I don't think the plan is realistic. How can I tell you this without you...
- Adam D. Ruppe (7/9) Jun 02 2016 You get out of the way and let the community get to work.
- Andrei Alexandrescu (7/14) Jun 02 2016 This applies to high-risk work that is also of commensurately
- Kagamin (7/11) Jun 02 2016 Autodecode doesn't need to be removed from phobos completely, it
- Andrei Alexandrescu (2/11) Jun 02 2016 Yah, and then such code will work with RCStr. -- Andrei
- Kagamin (5/13) Jun 02 2016 Yes, do consider Walter's proposal, it will be an enabling
- Andrei Alexandrescu (3/13) Jun 02 2016 Walter and I have a unified view on this. Although I'd need to raise the...
- ZombineDev (4/25) Jun 02 2016 The primitive is byUTF!dchar:
- H. S. Teoh via Digitalmars-d (60/92) Jun 02 2016 Appeal to authority.
- Andrei Alexandrescu (37/126) Jun 02 2016 There is no denying. If I did things all over again, autodecoding would
- Jonathan M Davis via Digitalmars-d (19/23) Jun 02 2016 Are folks going to not start using D because of auto-decoding? No, becau...
- Andrei Alexandrescu (2/11) Jun 02 2016 Actually ranges are a major reason for which people look into D. -- Andr...
- Steven Schveighoffer (8/13) Jun 02 2016 If this doesn't happen, then all this push to change anything in Phobos
- Andrei Alexandrescu (6/20) Jun 02 2016 Yeah, it's a miracle the language stays glued eh.
- Steven Schveighoffer (13/35) Jun 02 2016 The push to make Phobos only use byDchar (or any other band-aid fixes
- Andrei Alexandrescu (3/5) Jun 02 2016 A good idea for all of us. Could you also please look on my post on our
- Timon Gehr (4/13) Jun 02 2016 He is just saying that the fundamental reason why autodecoding is bad is...
- deadalnix (10/28) Jun 02 2016 This, deep down, point at the fact that conversion from/to char
- Timon Gehr (11/20) Jun 02 2016 The current situation is bad:
- Jonathan M Davis via Digitalmars-d (28/52) May 31 2016 walkLength treats a code point like it's a character. My point is that
- Timon Gehr (9/31) May 31 2016 What's "correct"? Maybe the user intended to count the number of code
- Wyatt (6/10) May 31 2016 That's a property of your font and font rendering engine, not
- Timon Gehr (8/20) May 31 2016 Sure. Hence "context". If you are e.g. trying to manually underline some...
- Jonathan M Davis via Digitalmars-d (7/28) May 31 2016 It can't, which is precisely why having it select for you was a bad desi...
- H. S. Teoh via Digitalmars-d (19/31) May 31 2016 [...]
- Jonathan M Davis via Digitalmars-d (22/60) May 31 2016 In the vast majority of cases what folks care about is full characters,
- Andrei Alexandrescu (2/3) May 31 2016 How are you so sure? -- Andrei
- Marco Leise (9/13) May 31 2016 Because a full character is the typical unit of a written
- Jonathan M Davis via Digitalmars-d (46/57) May 31 2016 Exactly. How many folks here have written code where the correct thing t...
- Jack Stouffer (11/12) May 31 2016 This thread is going in circles; the against crowd has stated
- Marc Schütz (5/10) Jun 01 2016 He doesn't need to be sure. You are the one advocating for code
- Andrei Alexandrescu (2/3) May 31 2016 No, it treats a code point like it's a code point. -- Andrei
- Jonathan M Davis via Digitalmars-d (41/44) May 31 2016 Wasn't the whole point of operating at the code point level by default t...
- Andrei Alexandrescu (9/13) May 31 2016 The point is to operate on representation-independent entities (Unicode
- Max Samukha (11/14) May 31 2016 Unicode FAQ disagrees (http://unicode.org/faq/utf_bom.html):
- H. S. Teoh via Digitalmars-d (14/30) May 31 2016 This is basically saying that we operate on dchar[] by default, except
- Marc Schütz (3/15) Jun 01 2016 _Both_ are low-level representation-specific artifacts.
- Andrei Alexandrescu (4/17) Jun 01 2016 Maybe this is a misunderstanding. Representation = how things are laid
- Nick Sabalausky (6/17) Jun 01 2016 As has been explained countless times already, code points are a non-1:1...
- Andrei Alexandrescu (7/12) Jun 01 2016 The relevance is meandering across the discussion, and it's good to have...
- Marc Schütz (14/26) Jun 02 2016 Ok, if you define it that way, sure. I was thinking in terms of
- H. S. Teoh via Digitalmars-d (28/29) May 31 2016 Let's put the question this way. Given the following string, what do
- Steven Schveighoffer (3/8) May 31 2016 Compiler error.
- Timon Gehr (2/12) May 31 2016 What about e.g. joiner?
- H. S. Teoh via Digitalmars-d (17/32) May 31 2016 joiner is one of those algorithms that can work perfectly fine *without*
- Steven Schveighoffer (3/15) May 31 2016 Compiler error. Better than what it does now.
- Marc Schütz (6/9) Jun 01 2016 I believe everything that does only concatenation will work
- Steven Schveighoffer (8/17) Jun 02 2016 This means that a string is a range. What is it a range of? If you want
- Marc Schütz (11/28) Jun 02 2016 No, I don't want to make string a range of anything, I want to
- Timon Gehr (2/6) Jun 02 2016 If strings are not ranges, returning a range of chars is inconsistent.
- Kagamin (7/10) Jun 02 2016 After the first migration step joiner will return a decoded dchar
- Andrei Alexandrescu (3/6) May 31 2016 The number of code units in the string. That's the contract promised and...
- Andrei Alexandrescu (2/9) May 31 2016 Code points I mean. -- Andrei
- Nick Sabalausky (7/17) May 31 2016 Yes, we know it's the contract. ***That's the problem.*** As everybody
- Jonathan M Davis via Digitalmars-d (16/34) May 31 2016 Exactly. Operating at the code point level rarely makes sense. What sort...
- ag0aep6g (6/9) May 31 2016 You got the terms mixed up. Code unit is lower level. Code point is
- Andrei Alexandrescu (2/8) May 31 2016 Apologies and thank you. -- Andrei
- Andrei Alexandrescu (3/7) May 31 2016 The way I see it is it's specialization to speed things up without
- Nick Sabalausky (6/14) May 31 2016 Problem is, that "higher"[1] level abstraction you don't want to give up...
- Walter Bright (60/133) May 27 2016 It's a consequence of autodecoding, not arrays.
- Andrei Alexandrescu (2/3) May 27 2016 Always valid or potentially invalid as well? -- Andrei
- Walter Bright (8/11) May 27 2016 Some years ago I would have said always valid. Experience, however, says...
- Andrei Alexandrescu (3/5) May 27 2016 Violent agreement is occurring here. We have plenty of those and need
- Martin Nowak (4/19) May 29 2016 There are more than 2 choices here, see the related discussion on
- Marco Leise (5/5) May 30 2016 A relevant thread in the Rust bug tracker I remember from
On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:

> I am as unclear about the problems of autodecoding as I am about the
> necessity to remove curl. Whenever I ask I hear some arguments that work
> well emotionally but are scant on reason and engineering. Maybe it's time
> to rehash them? I just did so about curl, no solid argument seemed to come
> together. I'd be curious of a crisp list of grievances about autodecoding.
> -- Andrei

Here are some that are not matters of opinion.

1. Ranges of characters do not autodecode, but arrays of characters do. This is a glaring inconsistency.

2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.

3. Wrapping an array in a struct with an alias this to an array turns off autodecoding, another special case.

4. Autodecoding is slow and has no place in high speed string processing.

5. Very few algorithms require decoding.

6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.

7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autodecode does not play with that.

8. In my work with UTF-8 streams, dealing with autodecode has caused me considerable extra work every time. A convenient timesaver it ain't.

9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.

10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.

11. Indexing an array produces different results than autodecoding, another glaring special case.
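For illustration, a minimal sketch of points 1, 10 and 11, assuming the stock Phobos behavior of the time; the sample string and variable names are only examples:

import std.range : ElementType, isRandomAccessRange, walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "héllo";                 // 6 code units, 5 code points

    // Points 1/11: the array indexes and slices by code unit, but
    // range iteration autodecodes to dchar.
    static assert(is(typeof(s[0]) == immutable(char)));
    static assert(is(ElementType!string == dchar));
    assert(s.length == 6);              // code units
    assert(s.walkLength == 5);          // code points, via autodecoding

    // Point 10: despite being an array, a string is not treated as a
    // random-access range by Phobos' range traits.
    static assert(!isRandomAccessRange!string);

    // Explicitly iterating code units sidesteps all of the above.
    assert(s.byCodeUnit.walkLength == 6);
}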
May 12 2016
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:

> Here are some that are not matters of opinion.
> [...]
> 11. Indexing an array produces different results than autodecoding,
> another glaring special case.

12. The result of autodecoding, a range of Unicode code points, is rarely actually useful, and code that relies on autodecoding is rarely actually, universally correct. Graphemes are occasionally useful for a subset of scripts, and a subset of that subset has all graphemes mapped to single code points, but this only applies to some scripts/languages. In the majority of cases, autodecoding provides only the illusion of correctness.
May 12 2016
On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d wrote:

> On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
> > 1. Ranges of characters do not autodecode, but arrays of characters do.
> > This is a glaring inconsistency.
> > 2. Every time one wants an algorithm to work with both strings and
> > ranges, you wind up special casing the strings to defeat the
> > autodecoding, or to decode the ranges. Having to constantly special case
> > it makes for more special cases when plugging together components. These
> > issues often escape detection when unittesting because it is convenient
> > to unittest only with arrays.
[...]

Example of string special-casing leading to bugs:

https://issues.dlang.org/show_bug.cgi?id=15972

This particular issue highlights the problem quite well: one can hardly see how a single char could need to be "auto-decoded" to a dchar. Unfortunately, due to Phobos algorithms assuming autodecoding, the resulting range of char is not recognized as "string-like" data by .joiner, thus causing a compile error.

The workaround (as described in the bug comments) also illustrates the inconsistency in handling ranges of char vs. ranges of dchar: writing .joiner("\n".byCodeUnit) will actually fix the problem, basically by explicitly disabling autodecoding. We can, of course, fix .joiner to recognize this case and handle it correctly, but the fact that using .byCodeUnit works perfectly proves that autodecoding is not necessary here. Which begs the question: why have autodecoding at all, and then require .byCodeUnit to work around the issues it causes?

T

--
It is widely believed that reinventing the wheel is a waste of time; but I disagree: without wheel reinventers, we would still be stuck with wooden horse-cart wheels.
May 12 2016
On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d wrote:
[...]

> 12. The result of autodecoding, a range of Unicode code points, is rarely
> actually useful, and code that relies on autodecoding is rarely actually,
> universally correct. Graphemes are occasionally useful for a subset of
> scripts, and a subset of that subset has all graphemes mapped to single
> code points, but this only applies to some scripts/languages. In the
> majority of cases, autodecoding provides only the illusion of correctness.

A range of Unicode code points is not the same as a range of graphemes (a grapheme is what a layperson would consider to be a "character"). Autodecoding returns dchar, a code point, rather than a grapheme. Therefore, autodecoding actually only produces intuitively correct results when your string has a 1-to-1 correspondence between grapheme and code point. In general, this is only true for a small subset of languages, mainly a few common European languages and a handful of others. It doesn't work for Korean, and doesn't work for any language that uses combining diacritics or other modifiers. You need byGrapheme to have the correct results.

So basically autodecoding, as currently implemented, fails to meet its goal of segmenting a string by "character" (i.e., grapheme), and yet imposes a performance penalty that is difficult to "turn off" (you have to sprinkle your code with byCodeUnit everywhere, and many Phobos algorithms just return a range of dchar anyway). Not to mention that a good number of string algorithms don't actually *need* autodecoding at all.

(One could make a case for auto-segmenting by grapheme, but that's even worse in terms of performance (it requires a non-trivial Unicode algorithm involving lookup tables, and may need memory allocation). At the end of the day, we're back to square one: iterate by code unit, and explicitly ask for byGrapheme where necessary.)

T

--
"I'm running Windows '98."  "Yes."  "My computer isn't working now."  "Yes, you already said that." -- User-Friendly
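A small sketch of the code point vs. grapheme mismatch, using std.uni.byGrapheme; the combining-accent sample string is only an example:

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one grapheme
    // ("é" as the user sees it), two code points, three UTF-8 code units.
    string s = "e\u0301";

    assert(s.length == 3);                 // code units
    assert(s.walkLength == 2);             // code points (what autodecoding yields)
    assert(s.byGrapheme.walkLength == 1);  // graphemes
}

Only the byGrapheme count matches what a user would call one character; the autodecoded walk still reports two.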
May 12 2016
On Thursday, 12 May 2016 at 23:16:23 UTC, H. S. Teoh wrote:

> Therefore, autodecoding actually only produces intuitively correct results
> when your string has a 1-to-1 correspondence between grapheme and code
> point. In general, this is only true for a small subset of languages,
> mainly a few common European languages and a handful of others. It doesn't
> work for Korean, and doesn't work for any language that uses combining
> diacritics or other modifiers. You need byGrapheme to have the correct
> results.

In fact, even most European languages are affected if NFD normalization is used, which is the default on MacOS X.

And this is actually the main problem with it: It was introduced to make unicode string handling correct. Well, it doesn't, therefore it has no justification.
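A short sketch of that effect, assuming std.uni's normalize with its NFC/NFD form tags; the sample string is only an example:

import std.range : walkLength;
import std.uni : normalize, NFC, NFD, byGrapheme;

void main()
{
    string composed = "é";                   // precomposed U+00E9

    // NFD (the form mentioned above as the MacOS X default) splits it
    // into 'e' + U+0301; the code point count changes, the grapheme
    // count does not.
    auto decomposed = composed.normalize!NFD;
    assert(composed.walkLength == 1);
    assert(decomposed.walkLength == 2);
    assert(decomposed.byGrapheme.walkLength == 1);

    // Normalizing back to NFC restores the single code point.
    assert(decomposed.normalize!NFC.walkLength == 1);
}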
May 13 2016
On Fri, 13 May 2016 10:49:24 +0000, Marc Schütz <schuetzm gmx.net> wrote:

> In fact, even most European languages are affected if NFD normalization is
> used, which is the default on MacOS X.
>
> And this is actually the main problem with it: It was introduced to make
> unicode string handling correct. Well, it doesn't, therefore it has no
> justification.

+1 for leaning back and contemplating exactly what auto-decode was aiming for and how it missed that goal. You'll see that an ö may still be cut between the o and the ¨. Hangul symbols are composed of pieces that go in different corners. Those would also be split up by auto-decode.

Can we handle real world text AT ALL? Are graphemes good enough to find the column in a fixed width display of some string (e.g. line+column of an error)? No, there may still be full-width characters in there that take 2 columns. :p

--
Marco
May 13 2016
On Fri, May 13, 2016 at 09:26:40PM +0200, Marco Leise via Digitalmars-d wrote:

> On Fri, 13 May 2016 10:49:24 +0000, Marc Schütz <schuetzm gmx.net> wrote:
> [...]
> Can we handle real world text AT ALL? Are graphemes good enough to find
> the column in a fixed width display of some string (e.g. line+column of an
> error)? No, there may still be full-width characters in there that take 2
> columns. :p
[...]

A simple lookup table ought to fix this. Preferably in std.uni so that it doesn't get reinvented by every other project.

T

--
Don't modify spaghetti code unless you can eat the consequences.
May 13 2016
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:

> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
> [...]
> Here are some that are not matters of opinion.
> [...]

For me it is not about autodecoding. I would like to have something like a String type which does that. But what really pisses me off is that the current string type is an alias to immutable(char)[] (so it is not usable at all). This is a real problem for me, because it makes working on arrays of chars almost impossible. Even char[] is unusable. So I am forced to use ubyte[], but this is really not an array of chars.

ATM D does not support even full Unicode strings or even a basic array of chars :(. I hope this will be fixed one day, so I could start to expand D in Czech; until then I am unable to do that.
May 12 2016
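Points 1, 10 and 11 of the list can be seen directly in a few lines; a small sketch using nothing beyond the array itself and std.range.primitives:

    import std.range.primitives : front, walkLength;

    void main()
    {
        string s = "\u00F6";          // the single code point 'ö', two UTF-8 code units

        assert(s.length == 2);        // as an array: length counts code units
        assert(s[0] == 0xC3);         // indexing yields a raw code unit
        assert(s.front == 0x00F6);    // as a range: front is the autodecoded dchar
        assert(s.walkLength == 1);    // iteration counts code points, not units
    }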
On 5/12/2016 4:23 PM, Daniel Kozak wrote:But what I am really pissed off about is that the current string type is an alias to immutable(char)[] (so it is not usable at all). This is really a problem for me, because this makes working on arrays of chars almost impossible. Even char[] is unusable. So I am forced to use ubyte[], but this is really not an array of chars. ATM D does not support even full Unicode strings or even a basic array of chars :(. I hope this will be fixed one day, so I could start to expand D in Czech; until then I am unable to do that.I can't find any actionable information in this.
May 12 2016
Am Thu, 12 May 2016 13:15:45 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames.More precisely they are byte strings with '/' reserved to separate path elements. While on an out-of-the-box Linux nowadays everything is typically presented as UTF-8, there are still die-hards that use code pages, corrupted file systems or incorrectly bound network shares displaying with the wrong charset. It is safer to work with them as a ubyte[] and that also bypasses auto-decoding. I'd like 'string' to mean valid UTF-8 in D as far as the encoding goes. A filename should not be a 'string'. -- Marco
May 12 2016
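A sketch of that byte-view approach (the file-name bytes here are made up for illustration): reinterpreting the memory as ubyte[] keeps invalid sequences intact and keeps decoding, and the UTFException it can raise, out of the picture.

    import std.utf : UTFException, validate;

    void main()
    {
        // A file name as raw bytes; 0xE9 is Latin-1 'é' and not valid UTF-8.
        immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9, 0x2E, 0x74, 0x78, 0x74];

        // The byte view can be sliced, compared and passed around freely.
        assert(raw.length == 8 && raw[3] == 0xE9);

        // The same bytes viewed as a string blow up as soon as something decodes them.
        auto asString = cast(string) raw;
        bool threw = false;
        try { validate(asString); }
        catch (UTFException) { threw = true; }
        assert(threw);
    }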
On 5/12/2016 4:52 PM, Marco Leise wrote:I'd like 'string' to mean valid UTF-8 in D as far as the encoding goes. A filename should not be a 'string'.I would have agreed with you in the past, but more and more it just doesn't seem practical. UTF-8 is dirty in the real world, and D code will have to deal with it. By dealing with it I mean not crash, throw exceptions, or other tantrums when encountering it. Unless it matters, it should pass the invalid encodings along unmolested and without comment. For example, if you're searching for 'a' in a UTF-8 string, what does it matter if there are invalid encodings in that string? For filenames/paths in particular, having redone the file/path code in Phobos, I realized that invalid encodings are completely immaterial.
May 12 2016
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:Here are some that are not matters of opinion.If you're serious about removing auto-decoding, which I think you and others have shown has merits, you have to have THE SIMPLEST migration path ever, or you will kill D. I'm talking a simple press of a button. I'm not exaggerating here. Python, a language which was much more popular than D at the time, came out with two versions in 2008: Python 2.7 which had numerous unicode problems, and Python 3.0 which fixed those problems. Almost eight years later, and Python 2 is STILL the more popular version despite Py3 having five major point releases since and Python 2 only getting security patches. Think the tango vs phobos problem, only a little worse. D is much less popular now than was Python at the time, and Python 2 problems were more straight forward than the auto-decoding problem. You'll need a very clear migration path, years long deprecations, and automatic tools in order to make the transition work, or else D's usage will be permanently damaged.
May 12 2016
On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:I'm not exaggerating here. Python, a language which was much more popular than D at the time, came out with two versions in 2008: Python 2.7 which had numerous unicode problems, and Python 3.0 which fixed those problems. Almost eight years later, and Python 2 is STILL the more popular version despite Py3 having five major point releases since and Python 2 only getting security patches. Think the tango vs phobos problem, only a little worse.To hammer this home a little more, Python 3 had a really useful library in order to abstract most of the differences automatically. But despite that, here is a list of the top 200 Python packages in 2011, three years after the fork, and if they supported Python 3 or not: https://web.archive.org/web/20110215214547/http://python3wos.appspot.com/ This is _three years_ later, and only 18 out of the top 200 supported Python 3. And here it is now, eight years later, at 174 out of 200 https://python3wos.appspot.com/
May 12 2016
On 5/12/2016 5:47 PM, Jack Stouffer wrote:D is much less popular now than was Python at the time, and Python 2 problems were more straight forward than the auto-decoding problem. You'll need a very clear migration path, years long deprecations, and automatic tools in order to make the transition work, or else D's usage will be permanently damaged.I agree, if it is possible at all.
May 12 2016
On Friday, 13 May 2016 at 01:00:54 UTC, Walter Bright wrote:On 5/12/2016 5:47 PM, Jack Stouffer wrote:I don't know to which extent my problems with string handling are related to autodecode. However, I had to write some utility functions to get around issues with code points, graphemes and the like. While it is not a huge issue in terms of programming time, it does slow down my program, because even simple operations may be referred to a utility function to make sure the result is correct (.length for example). But that might be an issue related to Unicode in general (or D's handling of it). If autodecode is killed, could we have a test version asap? I'd be willing to test my programs with autodecode turned off and see what happens. Others should do likewise and we could come up with a transition strategy based on what happened.D is much less popular now than was Python at the time, and Python 2 problems were more straight forward than the auto-decoding problem. You'll need a very clear migration path, years long deprecations, and automatic tools in order to make the transition work, or else D's usage will be permanently damaged.I agree, if it is possible at all.
May 13 2016
On 5/13/2016 2:12 AM, Chris wrote:If autodecode is killed, could we have a test version asap? I'd be willing to test my programs with autodecode turned off and see what happens. Others should do likewise and we could come up with a transition strategy based on what happened.You can avoid autodecode by using .byChar
May 13 2016
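For reference, what that looks like at a call site; a minimal sketch (std.utf.byChar yields char, std.utf.byCodeUnit keeps the array's code unit type and its random access):

    import std.algorithm.searching : count;
    import std.utf : byChar, byCodeUnit;

    void main()
    {
        string s = "Segeltörns";

        auto decoded = s.count!(ch => ch == 'S');            // autodecoding: iterates dchar
        auto chars   = s.byChar.count!(ch => ch == 'S');     // no decoding: iterates char
        auto units   = s.byCodeUnit.count!(ch => ch == 'S'); // no decoding, still random-access

        assert(decoded == 1 && chars == 1 && units == 1);
    }

The catch, as the reply below points out, is coverage: every string-handling call site has to opt out explicitly.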
On Friday, 13 May 2016 at 13:17:44 UTC, Walter Bright wrote:On 5/13/2016 2:12 AM, Chris wrote:Hm. It would be difficult to make sure that my whole code base doesn't do something, somewhere, that triggers auto decode. PS Why do I get a "StopForumSpam error" every time I post today? Has anyone else experienced the same problem: "StopForumSpam error: Socket error: Lookup error: getaddrinfo error: Name or service not known. Please solve a CAPTCHA to continue."If autodecode is killed, could we have a test version asap? I'd be willing to test my programs with autodecode turned off and see what happens. Others should do likewise and we could come up with a transition strategy based on what happened.You can avoid autodecode by using .byChar
May 13 2016
On Friday, 13 May 2016 at 13:41:30 UTC, Chris wrote:PS Why does do I get a "StopForumSpam error" every time I post today? Has anyone else experienced the same problem: "StopForumSpam error: Socket error: Lookup error: getaddrinfo error: Name or service not known. Please solve a CAPTCHA to continue."https://twitter.com/StopForumSpam
May 13 2016
On Friday, 13 May 2016 at 14:06:28 UTC, Vladimir Panteleev wrote:On Friday, 13 May 2016 at 13:41:30 UTC, Chris wrote:I don't understand. Does that mean we have to solve CAPTCHAs every time we post? Annoying CAPTCHAs at that.PS Why does do I get a "StopForumSpam error" every time I post today? Has anyone else experienced the same problem: "StopForumSpam error: Socket error: Lookup error: getaddrinfo error: Name or service not known. Please solve a CAPTCHA to continue."https://twitter.com/StopForumSpam
May 13 2016
On Friday, 13 May 2016 at 01:00:54 UTC, Walter Bright wrote:On 5/12/2016 5:47 PM, Jack Stouffer wrote:A plan: 1. Mark as deprecated the places where auto-decoding is used. I think it's all the "range" functions for string (front, popFront, back, ...). Force using byChar & co. 2. Introduce a new String type in Phobos. 3. After ages, make immutable(char)[] an ordinary array again. Is it OK? Profit?D is much less popular now than was Python at the time, and Python 2 problems were more straight forward than the auto-decoding problem. You'll need a very clear migration path, years long deprecations, and automatic tools in order to make the transition work, or else D's usage will be permanently damaged.I agree, if it is possible at all.
May 13 2016
On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:D is much less popular now than was Python at the time, and Python 2 problems were more straight forward than the auto-decoding problem. You'll need a very clear migration path, years long deprecations, and automatic tools in order to make the transition work, or else D's usage will be permanently damaged.Python 2 is/was deployed at a much larger scale and with far more library dependencies, so I don't think it is comparable. It is easier for D to get away with breaking changes. I am still using Python 2.7 exclusively, but now I use: from __future__ import division, absolute_import, with_statement, unicode_literals D can do something similar. C++ is using a comparable solution. Use switches to turn on different compatibility levels.
May 13 2016
On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:If you're serious about removing auto-decoding, which I think you and others have shown has merits, you have to the THE SIMPLEST migration path ever, or you will kill D. I'm talking a simple press of a button.char[] is always going to be unsafe for UTF-8. I don't think we can remove it or auto-decoding, only discourage use of it. We need a String struct IMO, without length or indexing. Its front can do autodecoding, and it has a ubyte[] raw() property too. (Possibly the byte length of front can be cached for use in popFront, assuming it was faster). This would be a gradual transition.
May 13 2016
On Fri, May 13, 2016 at 12:16:30PM +0000, Nick Treleaven via Digitalmars-d wrote:On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:alias String = typeof(std.uni.byGrapheme(immutable(char)[].init)); :-) Well, OK, perhaps you could wrap this in a struct that allows extraction of .raw, etc.. But basically this isn't hard to implement today. We already have all of the tools necessary. T -- Dogs have owners ... cats have staff. -- Krista CasadaIf you're serious about removing auto-decoding, which I think you and others have shown has merits, you have to the THE SIMPLEST migration path ever, or you will kill D. I'm talking a simple press of a button.char[] is always going to be unsafe for UTF-8. I don't think we can remove it or auto-decoding, only discourage use of it. We need a String struct IMO, without length or indexing. Its front can do autodecoding, and it has a ubyte[] raw() property too. (Possibly the byte length of front can be cached for use in popFront, assuming it was faster). This would be a gradual transition.
May 13 2016
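A rough shape of what the two previous posts describe, combined: default iteration over graphemes plus a raw code-unit view. Purely illustrative; the struct and its members are hypothetical, not existing Phobos symbols.

    import std.uni : byGrapheme;

    struct String
    {
        private immutable(char)[] data;

        auto graphemes() { return data.byGrapheme; }                        // decoded view
        immutable(ubyte)[] raw() { return cast(immutable(ubyte)[]) data; }  // raw code units
    }

    void main()
    {
        import std.range : walkLength;

        auto s = String("héllo");
        assert(s.raw.length == 6);            // 'é' takes two code units in UTF-8
        assert(s.graphemes.walkLength == 5);  // but is a single perceived character
    }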
On 05/12/2016 08:47 PM, Jack Stouffer wrote:If you're serious about removing auto-decoding, which I think you and others have shown has merits, you have to the THE SIMPLEST migration path ever, or you will kill D. I'm talking a simple press of a button. I'm not exaggerating here. Python, a language which was much more popular than D at the time, came out with two versions in 2008: Python 2.7 which had numerous unicode problems, and Python 3.0 which fixed those problems. Almost eight years later, and Python 2 is STILL the more popular version despite Py3 having five major point releases since and Python 2 only getting security patches. Think the tango vs phobos problem, only a little worse. D is much less popular now than was Python at the time, and Python 2 problems were more straight forward than the auto-decoding problem. You'll need a very clear migration path, years long deprecations, and automatic tools in order to make the transition work, or else D's usage will be permanently damaged.As much as I agree on the importance of a good smooth migration path, I don't think the "Python 2 vs 3" situation is really all that comparable here. Unlike Python, we wouldn't be maintaining a "with auto-decoding" fork for years and years and years, ensuring nobody ever had a pressing reason to bother migrating. And on top of that, we don't have a culture and design philosophy that promotes "do the lazy thing first and the robust thing never". D users are more likely than dynamic language users to be willing to make a few changes for the sake of improvement. Heck, we weather breaking fixes enough anyway. There was even one point within the last couple years where something (forget offhand what it was) was removed from std.datetime and its replacement was added *in the very same compiler release*. No transition period. It was an annoying pain (at least to me), but I got through it fine and never even entertained the thought of just sticking with the old compiler. Not sure most people even noticed it. Point is, in D, even when something does need to change, life goes on fine. As long as we don't maintain a long-term fork ;) Naturally, minimizing breakage is important here, but I really don't think Python's UTF migration situation is all that comparable.
May 29 2016
On Sunday, 29 May 2016 at 17:35:35 UTC, Nick Sabalausky wrote:Unlike Python, we wouldn't be maintaining a "with auto-decoding" fork for years and years and years, ensuring nobody ever had a pressing reason to bother migrating.If it happens, they better. The D1 fork was maintained for almost three years for a good reason.Heck, we weather breaking fixes enough anyway.Not nearly on a scale similar to changing how strings are iterated; not since the D1/D2 split.It was an annoying pain (at least to me), but I got through it fine and never even entertained the thought of just sticking with the old compiler. Not sure most people even noticed it. Point is, in D, even when something does need to change, life goes on fine. As long as we don't maintain a long-term fork ;)The problem is not active users. The problem is companies who have > 10K LOC and libraries that are no longer maintained. E.g. It took Sociomantic eight years after D2's release to switch only a few parts of their projects to D2. With the loss of old libraries/old code (even old answers on SO), all of a sudden you lose a lot of the network effect that makes programming languages much more useful.
May 29 2016
On 05/29/2016 09:58 PM, Jack Stouffer wrote:The problem is not active users. The problem is companies who have > 10K LOC and libraries that are no longer maintained. E.g. It took Sociomantic eight years after D2's release to switch only a few parts of their projects to D2. With the loss of old libraries/old code (even old answers on SO), all of a sudden you lose a lot of the network effect that makes programming languages much more useful.D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.
May 30 2016
On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.Don't be so sure. All string handling code would become broken, even if it appears to work at first.
May 30 2016
On 05/30/2016 12:34 PM, Jack Stouffer wrote:On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:That kind of makes this thread less productive than "How to improve autodecoding?" -- AndreiD1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.Don't be so sure. All string handling code would become broken, even if it appears to work at first.
May 30 2016
On 30-May-2016 21:24, Andrei Alexandrescu wrote:On 05/30/2016 12:34 PM, Jack Stouffer wrote:1. Generalize to all ranges of code units i.e. ranges of char/wchar. 2. Operating on codeunits explicitly would then always involve a step through ubyte/byte. -- Dmitry OlshanskyOn Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:That kind of makes this thread less productive than "How to improve autodecoding?" -- AndreiD1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.Don't be so sure. All string handling code would become broken, even if it appears to work at first.
May 30 2016
On Monday, 30 May 2016 at 18:24:23 UTC, Andrei Alexandrescu wrote:That kind of makes this thread less productive than "How to improve autodecoding?" -- AndreiPlease don't misunderstand, I'm for fixing string behavior. But, let's not pretend that this wouldn't be one of the (if not the) largest breaking change since D2. As I said, straight up removing auto-decoding would break all string handling code.
May 30 2016
On 05/30/2016 03:00 PM, Jack Stouffer wrote:On Monday, 30 May 2016 at 18:24:23 UTC, Andrei Alexandrescu wrote:Surely the misunderstanding is not on this side of the table :o). By "that" I meant your assertion at face value (i.e. assuming it's a fact) "All string handling code would become broken, even if it appears to work at first". -- AndreiThat kind of makes this thread less productive than "How to improve autodecoding?" -- AndreiPlease don't misunderstand, I'm for fixing string behavior.
May 30 2016
On Monday, May 30, 2016 14:24:23 Andrei Alexandrescu via Digitalmars-d wrote:On 05/30/2016 12:34 PM, Jack Stouffer wrote:I think that the first step is getting Phobos to work with all ranges of character types - be they char, wchar, dchar, or graphemes. Then the algorithms themselves will work whether we have auto-decoding or not. With that done, we can at minimum tell folks to use byCodeUnit, byChar!T, byGrapheme, etc. to get the correct, efficient behavior. Right now, if you try to use ranges like byCodeUnit, they work with some of Phobos but not enough to really work as a viable replacement to auto-decoding strings. With all that done, at least it should be reasonably easy for folks to sanely get around auto-decoding, though the question still remains at that point how possible it will be to remove auto-decoding and treat ranges of char the same way that byCodeUnit would. But at bare minimum, it's what we need to do to make it possible and reasonable to work around auto-decoding when you need to while specifying the level of Unicode that you actually want to operate at. - Jonathan M DavisOn Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:That kind of makes this thread less productive than "How to improve autodecoding?" -- AndreiD1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.Don't be so sure. All string handling code would become broken, even if it appears to work at first.
May 31 2016
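What "specifying the level of Unicode you want to operate at" looks like in practice; a small sketch (which Phobos functions accept such ranges today depends on their template constraints, which is exactly the gap described above):

    import std.algorithm.searching : canFind, startsWith;
    import std.utf : byCodeUnit;

    void main()
    {
        string path = "/usr/share/doc/übersicht.txt";

        // The needles are ASCII, so the code-unit level is enough: no decoding at all.
        assert(path.byCodeUnit.canFind('/'));
        assert(path.byCodeUnit.startsWith("/usr".byCodeUnit));
    }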
On 5/31/16 2:21 PM, Jonathan M Davis via Digitalmars-d wrote:I think that the first step is getting Phobos to work with all ranges of character types - be they char, wchar, dchar, or graphemes. Then the algorithms themselves will work whether we have auto-decoding or not. With that done, we can at minimum tell folks to use byCodeUnit, byChar!T, byGrapheme, etc. to get the correct, efficient behavior. Right now, if you try to use ranges like byCodeUnit, they work with some of Phobos but not enough to really work as a viable replacement to auto-decoding strings.Great. Could you put together a sample PR so we understand the implications better? Thanks! -- Andrei
May 31 2016
On Monday, 30 May 2016 at 16:34:49 UTC, Jack Stouffer wrote:On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:Assuming silent breakage is on the table, what would be broken, really? Code that must intentionally count or otherwise operate code points, sure. But how much of all string handling code is like that? Perhaps it would be worth trying to silently remove autodecoding and seeing how much of Phobos breaks, as an experiment. Has this been tried before? (Not saying this is a route we should take, but it doesn't seem to me that it will break "all string handling code" either.)D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.Don't be so sure. All string handling code would become broken, even if it appears to work at first.
May 30 2016
On Monday, 30 May 2016 at 21:39:14 UTC, Vladimir Panteleev wrote:On Monday, 30 May 2016 at 16:34:49 UTC, Jack Stouffer wrote:132 lines in Phobos use auto-decoding - that should be fixable ;-) See them: http://sprunge.us/hUCL More details: https://github.com/dlang/phobos/pull/4384On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:Assuming silent breakage is on the table, what would be broken, really? Code that must intentionally count or otherwise operate code points, sure. But how much of all string handling code is like that? Perhaps it would be worth trying to silently remove autodecoding and seeing how much of Phobos breaks, as an experiment. Has this been tried before? (Not saying this is a route we should take, but it doesn't seem to me that it will break "all string handling code" either.)D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.Don't be so sure. All string handling code would become broken, even if it appears to work at first.
May 30 2016
On 5/30/16 7:52 PM, Seb wrote:On Monday, 30 May 2016 at 21:39:14 UTC, Vladimir Panteleev wrote:Thanks for this investigation! Results are about as I'd have speculated. -- AndreiOn Monday, 30 May 2016 at 16:34:49 UTC, Jack Stouffer wrote:132 lines in Phobos use auto-decoding - that should be fixable ;-) See them: http://sprunge.us/hUCL More details: https://github.com/dlang/phobos/pull/4384On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:Assuming silent breakage is on the table, what would be broken, really? Code that must intentionally count or otherwise operate code points, sure. But how much of all string handling code is like that? Perhaps it would be worth trying to silently remove autodecoding and seeing how much of Phobos breaks, as an experiment. Has this been tried before? (Not saying this is a route we should take, but it doesn't seem to me that it will break "all string handling code" either.)D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.Don't be so sure. All string handling code would become broken, even if it appears to work at first.
May 30 2016
On Monday, 30 May 2016 at 21:39:14 UTC, Vladimir Panteleev wrote:Perhaps it would be worth trying to silently remove autodecoding and seeing how much of Phobos breaks, as an experiment. Has this been tried before?Did it; the result is that a large number of Phobos modules fail to compile because of template constraints that test for is(Unqual!(ElementType!S2) == dchar). As a result, anything that imports std.format or std.uni fails to compile. Also, I see some errors caused by the fact that is(string.front == immutable) now. It's hard to find specifics because D halts execution after one test failure.
May 30 2016
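The constraint pattern referred to here, next to the generalized form Jonathan and Dmitry describe; a sketch only, since actual Phobos signatures vary from function to function:

    import std.range.primitives : ElementType, isInputRange;
    import std.traits : Unqual, isSomeChar;
    import std.utf : byCodeUnit;

    // Status quo: only ranges whose elements are dchar get through, which is
    // what stops compiling once strings no longer decode to dchar.
    enum acceptsDcharRange(R) = isInputRange!R && is(Unqual!(ElementType!R) == dchar);

    // Generalized: accept ranges of char, wchar or dchar alike.
    enum acceptsAnyCharRange(R) = isInputRange!R && isSomeChar!(ElementType!R);

    static assert( acceptsDcharRange!string);                    // autodecoded today
    static assert(!acceptsDcharRange!(typeof("".byCodeUnit)));   // rejected by the old constraint
    static assert( acceptsAnyCharRange!(typeof("".byCodeUnit))); // accepted by the general one
    static assert( acceptsAnyCharRange!string);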
On 05/30/2016 12:25 PM, Nick Sabalausky wrote:On 05/29/2016 09:58 PM, Jack Stouffer wrote:It was also made at a time when the community was smaller by a couple orders of magnitude. -- AndreiThe problem is not active users. The problem is companies who have > 10K LOC and libraries that are no longer maintained. E.g. It took Sociomantic eight years after D2's release to switch only a few parts of their projects to D2. With the loss of old libraries/old code (even old answers on SO), all of a sudden you lose a lot of the network effect that makes programming languages much more useful.D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.
May 30 2016
On Sunday, 29 May 2016 at 17:35:35 UTC, Nick Sabalausky wrote:On 05/12/2016 08:47 PM, Jack Stouffer wrote: As much as I agree on the importance of a good smooth migration path, I don't think the "Python 2 vs 3" situation is really all that comparable here. Unlike Python, we wouldn't be maintaining a "with auto-decoding" fork for years and years and years, ensuring nobody ever had a pressing reason to bother migrating. And on top of that, we don't have a culture and design philosophy that promotes "do the lazy thing first and the robust thing never". D users are more likely than dynamic language users to be willing to make a few changes for the sake of improvement. Heck, we weather breaking fixes enough anyway. There was even one point within the last couple years where something (forget offhand what it was) was removed from std.datetime and its replacement was added *in the very same compiler release*. No transition period. It was an annoying pain (at least to me), but I got through it fine and never even entertained the thought of just sticking with the old compiler. Not sure most people even noticed it. Point is, in D, even when something does need to change, life goes on fine. As long as we don't maintain a long-term fork ;) Naturally, minimizing breakage is important here, but I really don't think Python's UTF migration situation is all that comparable.I suggest providing an automatic tool (either within the compiler or as a separate program like dfix) to help with the transition. Ideally the tool would advise the user where potential problems are and how to fix them. If it's true that auto decode is unnecessary in many cases, then it shouldn't affect the whole code base. But I might be mistaken here. Maybe we should make a list of the functions where auto decode does make a difference, see how common they are, and work out a strategy from there. Destroy.
May 30 2016
Am Mon, 30 May 2016 09:26:09 +0000 schrieb Chris <wendlec tcd.ie>:If it's true that auto decode is unnecessary in many cases, then it shouldn't affect the whole code base. But I might be mistaken here. Maybe we should make a list of the functions where auto decode does make a difference, see how common they are, and work out a strategy from there. Destroy.It makes a difference for every function. But it still isn't necessary in many cases. It's fairly simple: code unit == bytes/chars; code point == auto-decode; grapheme* == .byGrapheme. So if for now you used auto-decode you iterated code-points, which works correctly for most scripts in NFC**. And here lies the rub and why people say auto-decoding is unnecessary most of the time: If you are working with XML, CSV or JSON or another structured text format, these all use ASCII characters for their syntax elements. Code unit, code point and graphemes become all the same and auto-decoding just slows you down. When on the other hand you work with real world international text, you'll want to work with graphemes. One example is putting an ellipsis in long text: "Alle Segeltörns im Überblick" (in NFD, e.g. OS X file name) may display as this with auto-decode: "Alle Segelto…¨berblick" and this with byGrapheme: "Alle Segeltö…Überblick" But at that point you are likely also in need of localized sorting of strings, a set of algorithms that may change with the rise and fall of nations or reformations. So you'll use the platform's go-to Unicode library instead of what Phobos offers. For Java and Linux that would be ICU***. That last point makes me think we should not bother much with decoding in Phobos at all. Odds are we miss other capabilities to make good use of it. Users of auto-decode should review their code to see if code-points is really what they want and potentially switch to no-decoding or .byGrapheme. * What we typically perceive as one unit in written text. ** A normalization form where e.g. 'ö' is a single code-point, as opposed to NFD, where 'ö' would be assembled from the two 'o' and '¨' code-points as in OS X file names. *** http://site.icu-project.org/home#TOC-What-is-ICU- -- Marco
May 30 2016
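A sketch of the ellipsis example at the grapheme level (the helper name is made up; it assumes the input may be in NFD, as in the OS X file-name case, and rebuilds the result through an appender):

    import std.array : appender;
    import std.range : drop, take, walkLength;
    import std.uni : byGrapheme;

    // Keep the first `head` and the last `tail` user-perceived characters and
    // put an ellipsis in between, never splitting a base letter from its marks.
    string ellipsize(string s, size_t head, size_t tail)
    {
        immutable n = s.byGrapheme.walkLength;
        if (n <= head + tail)
            return s;

        auto result = appender!string();
        foreach (g; s.byGrapheme.take(head))
            foreach (i; 0 .. g.length)
                result.put(g[i]);           // re-emit every code point of the grapheme
        result.put("…");
        foreach (g; s.byGrapheme.drop(n - tail))
            foreach (i; 0 .. g.length)
                result.put(g[i]);
        return result.data;
    }

Run on the NFD file name above, the combining marks stay attached to their base letters, matching the byGrapheme variant of the output rather than the auto-decoded one.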
On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:*** http://site.icu-project.org/home#TOC-What-is-ICU-I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.
May 30 2016
Am Mon, 30 May 2016 17:35:36 +0000 schrieb Chris <wendlec tcd.ie>:I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.You have to compare to the situation before, when every operating system with every localization had its own encoding. Have some text file with ASCII art in a DOS code page? Doesn't render on Windows with the same locale. Open Cyrillic text on a Latin system? Indigestible. Someone wrote a website on Windows and incorrectly tagged it with an ISO charset? The browser has to fix it up for them. One objection I remember was the Han Unification: https://en.wikipedia.org/wiki/Han_unification Not everyone liked how Chinese, Japanese, Korean were represented with a common set of ideograms. At the time Unicode was still 16-bit and the unified symbols would already make up 32% of all code points. In my eyes many of the perceived problems of Unicode stem from the fact that it raises awareness of different writing systems all over the globe in a way that we didn't have to deal with when software was developed locally instead of globally on GitHub, when the target was Windows instead of cross-platform and mobile, when we were lucky if we localized for a couple of Latin languages, but Asia was a real barrier. I don't know what you and your colleague discussed about ICU, but likely whether you should add another dependency and what alternatives there are. In Linux user space, almost everything is an outside project, an extra library, most of them with alternatives. My own research led me to the point where I came to think that there was one set of libraries without real alternatives: ICU -> HarfBuzz -> Pango That's the go-to chain for Unicode text, from text processing through rendering to layout. Moreover many successful open-source projects make use of it: LibreOffice, sqlite, Qt, libxml2, WebKit to name a few. Unicode is here to stay, no matter what could have been done better in the past, and I think it is perfectly safe to bet on ICU on Linux for what e.g. Windows has built-in. Otherwise just do as Adam Ruppe said:Don't mess with strings. Get them from the user, store them without modification, spit them back out again.:p -- Marco
May 30 2016
On Monday, 30 May 2016 at 17:35:36 UTC, Chris wrote:On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:Part of it is the complexity of written language, part of it is bad technical decisions. Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity. I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine. Fast-forward years later and exactly the issues I raised are now causing pain. UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched. D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_ The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language! This is madness. Yes, the complexity of diacritics and combining characters will remain, but that is complexity that is inherent to the variety of written language. UTF-8 is not: it is just a bad technical decision, likely chosen for ASCII compatibility and some misguided notion that being able to combine arbitrary language strings with no other metadata was worthwhile. It is not.*** http://site.icu-project.org/home#TOC-What-is-ICU-I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.
May 31 2016
On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d wrote:UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched.Considering that *nix land uses UTF-8 almost exclusively, and many C libraries do even on Windows, I very much doubt that UTF-8 is going anywhere anytime soon - if ever. The Win32 API is UTF-16, but the vast sea of code that is C or C++ generally uses UTF-8, as do plenty of other programming languages. And even aside from English, most European languages are going to be more efficient with UTF-8, because they're still primarily ASCII even if they contain characters that are not. Stuff like Chinese is definitely worse in UTF-8 than it would be in UTF-16, but there are a lot of languages other than English which are going to encode better with UTF-8 than UTF-16 - let alone UTF-32. Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too much uses it for it to be going anywhere, and most folks have no problem with that. Any attempt to get rid of it would be a huge, uphill battle. But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even without involving the standard library - so anyone who wants to avoid UTF-8 is free to do so. - Jonathan M Davis
May 31 2016
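For reference, the native support being referred to: each string type is just an array of its code unit type, and the literal suffix picks the encoding.

    void main()
    {
        string  a = "Übergrößenträger";     // UTF-8 code units
        wstring b = "Übergrößenträger"w;    // UTF-16 code units
        dstring c = "Übergrößenträger"d;    // UTF-32 code units

        static assert(is(typeof(a[0]) == immutable(char)));
        static assert(is(typeof(b[0]) == immutable(wchar)));
        static assert(is(typeof(c[0]) == immutable(dchar)));

        // Same text, different lengths: .length counts code units, not characters.
        assert(a.length == 20);   // four of the sixteen letters take two bytes in UTF-8
        assert(b.length == 16);
        assert(c.length == 16);
    }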
On Tuesday, 31 May 2016 at 18:34:54 UTC, Jonathan M Davis wrote:On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d wrote:I agree that both UTF encodings are somewhat popular now.UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched.Considering that *nix land uses UTF-8 almost exclusively, and many C libraries do even on Windows, I very much doubt that UTF-8 is going anywhere anytime soon - if ever. The Win32 API is C or C++ generally uses UTF-8 as do plenty of other programming languages.And even aside from English, most European languages are going to be more efficient with UTF-8, because they're still primarily ASCII even if they contain characters that are not. Stuff like Chinese is definitely worse in UTF-8 than it would be in UTF-16, but there are a lot of languages other than English which are going to encode better with UTF-8 than UTF-16 - let alone UTF-32.And there are a lot more languages that will be twice as long than English, ie ASCII.Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too much uses it for it to be going anywhere, and most folks have no problem with that. Any attempt to get rid of it would be a huge, uphill battle.I disagree, it is inevitable. Any tech so complex and inefficient cannot last long.But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even without involving the standard library - so anyone who wants to avoid UTF-8 is free to do so.Yes, but not by using UTF-16/32, which use too much memory. I've suggested a single-byte encoding for most languages instead, both in my last post and the earlier thread. D could use this new encoding internally, while keeping its current UTF-8/16 strings around for any outside UTF-8/16 data passed in. Any of that data run through algorithms that don't require decoding could be kept in UTF-8, but the moment any decoding is required, D would translate UTF-8 to the new encoding, which would be much easier for programmers to understand and manipulate. If UTF-8 output is needed, you'd have to encode back again. Yes, this translation layer would be a bit of a pain, but the new encoding would be so much more efficient and understandable that it would be worth it, and you're already decoding and encoding back to UTF-8 for those algorithms now. All that's changing is that you're using a new and different encoding than dchar as the default. If it succeeds for D, it could then be sold more widely as a replacement for UTF-8/16. I think this would be the right path forward, not navigating this UTF-8/16 mess further.
May 31 2016
Am Tue, 31 May 2016 16:29:33 +0000 schrieb Joakim <dlang joakim.fea.st>:Part of it is the complexity of written language, part of it is bad technical decisions. Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity. I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine. Fast-forward years later and exactly the issues I raised are now causing pain.Maybe you can dig up your old post and we can look at each of your complaints in detail.UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched.You don't download twice the data. First of all, some languages had two-byte encodings before UTF-8, and second web content is full of HTML syntax and gzip compressed afterwards. Take this Thai Wikipedia entry for example: https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2 The download of the gzipped html is 11% larger in UTF-8 than in Thai TIS-620 single-byte encoding. And that is dwarfed by the size of JS + images. (I don't have the numbers, but I expect the effective overhead to be ~2%). Ironically a lot of symbols we take for granted would then have to be implemented as HTML entities using their Unicode code points(sic!). Amongst them basic stuff like dashes, degree (°) and minute (′), accents in names, non-breaking space or footnotes (↑).D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_That would have put D on an island. "Some kind of header" would be a horrible mess to have in strings, because you have to account for it when concatenating strings and scan for them all the time to see if there is some interspersed 2 byte encoding in the stream. That's hardly better than UTF-8. And yes, a huge amount of websites mix scripts and a lot of other text uses the available extra symbols like ° or α,β,γ.The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language! This is madness.No thx, madness was when we couldn't reliably open text files, because nowhere was the encoding stored and when you had to compile programs for each of a dozen codepages, so localized text would be rendered correctly. And your retro codepage system won't convince the world to drop Unicode either.Yes, the complexity of diacritics and combining characters will remain, but that is complexity that is inherent to the variety of written language. UTF-8 is not: it is just a bad technical decision, likely chosen for ASCII compatibility and some misguided notion that being able to combine arbitrary language strings with no other metadata was worthwhile. It is not.The web proves you wrong. Scripts do get mixed often. Be it Wikipedia, a foreign language learning site or mathematical symbols. -- Marco
May 31 2016
On Tuesday, 31 May 2016 at 20:20:46 UTC, Marco Leise wrote:Am Tue, 31 May 2016 16:29:33 +0000 schrieb Joakim <dlang joakim.fea.st>:Not interested. I believe you were part of that thread then. Google it if you want to read it again.Part of it is the complexity of written language, part of it is bad technical decisions. Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity. I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine. Fast-forward years later and exactly the issues I raised are now causing pain.Maybe you can dig up your old post and we can look at each of your complaints in detail.The vast majority can be encoded in a single byte, and are unnecessarily forced to two bytes by the inefficient UTF-8/16 encodings. HTML syntax is a non sequitur; compression helps but isn't as efficient as a proper encoding.UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched.You don't download twice the data. First of all, some languages had two-byte encodings before UTF-8, and second web content is full of HTML syntax and gzip compressed afterwards.Take this Thai Wikipedia entry for example: https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2 The download of the gzipped html is 11% larger in UTF-8 than in Thai TIS-620 single-byte encoding. And that is dwarfed by the size of JS + images. (I don't have the numbers, but I expect the effective overhead to be ~2%).Nobody on a 2G connection is waiting minutes to download such massive web pages. They are mostly sending text to each other on their favorite chat app, and waiting longer and using up more of their mobile data quota if they're forced to use bad encodings.Ironically a lot of symbols we take for granted would then have to be implemented as HTML entities using their Unicode code points(sic!). Amongst them basic stuff like dashes, degree (°) and minute (′), accents in names, non-breaking space or footnotes (↑).No, they just don't use HTML, opting for much superior mobile apps instead. :)Let's see: a constant-time addition to a header or constantly decoding every character every time I want to manipulate the string... I wonder which is a better choice?! You would not "intersperse" any other encodings, unless you kept track of those substrings in the header. My whole point is that such mixing of languages or "extra symbols" is an extreme minority use case: the vast majority of strings are a single language.D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_That would have put D on an island. 
"Some kind of header" would be a horrible mess to have in strings, because you have to account for it when concatenating strings and scan for them all the time to see if there is some interspersed 2 byte encoding in the stream. That's hardly better than UTF-8. And yes, a huge amount of websites mix scripts and a lot of other text uses the available extra symbols like ° or α,β,γ.Unicode _is_ a retro codepage system, they merely standardized a bunch of the most popular codepages. So that's not going away no matter what system you use. :)The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language! This is madness.No thx, madness was when we couldn't reliably open text files, because nowhere was the encoding stored and when you had to compile programs for each of a dozen codepages, so localized text would be rendered correctly. And your retro codepage system wont convince the world to drop Unicode either.Those are some of the least-trafficked parts of the web, which itself is dying off as the developing world comes online through mobile apps, not the bloated web stack. Anyway, I'm not interested in rehashing this dumb argument again. The UTF-8/16 encodings are a horrible mess, and D made a big mistake by baking them in.Yes, the complexity of diacritics and combining characters will remain, but that is complexity that is inherent to the variety of written language. UTF-8 is not: it is just a bad technical decision, likely chosen for ASCII compatibility and some misguided notion that being able to combine arbitrary language strings with no other metadata was worthwhile. It is not.The web proves you wrong. Scripts do get mixed often. Be it Wikipedia, a foreign language learning site or mathematical symbols.
May 31 2016
On 31.05.2016 22:20, Marco Leise wrote:Am Tue, 31 May 2016 16:29:33 +0000 schrieb Joakim<dlang joakim.fea.st>:It is probably this one. Not sure what "exactly the issues" are though. http://forum.dlang.org/thread/bwbuowkblpdxcpysejpb forum.dlang.orgMaybe you can dig up your old post and we can look at each of your complaints in detail.Part of it is the complexity of written language, part of it is bad technical decisions. Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity. I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine. Fast-forward years later and exactly the issues I raised are now causing pain.
May 31 2016
On 5/31/2016 1:20 PM, Marco Leise wrote:[...]I agree. I dealt with the madness of code pages, Shift-JIS, EBCDIC, locales, etc., in the pre-Unicode days. Despite its problems, Unicode (and UTF-8) is a major improvement, and I mean major. 16 years ago, I bet that Unicode was the future, and events have shown that to be correct. But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful pretty much only as a transitional encoding to talk with Windows APIs. Nobody uses UCS-2 (it consumes far too much memory).
May 31 2016
On 06/01/2016 12:47 AM, Walter Bright wrote:But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful pretty much only as a transitional encoding to talk with Windows APIs. Nobody uses UCS-2 (it consumes far too much memory).Wikipedia says [1] that UCS-2 is essentially UTF-16 without surrogate pairs. I suppose you mean UTF-32/UCS-4. [1] https://en.wikipedia.org/wiki/UTF-16
May 31 2016
On 5/31/2016 4:00 PM, ag0aep6g wrote:Wikipedia says [1] that UCS-2 is essentially UTF-16 without surrogate pairs. I suppose you mean UTF-32/UCS-4. [1] https://en.wikipedia.org/wiki/UTF-16Thanks for the correction.
May 31 2016
Am Tue, 31 May 2016 15:47:02 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful pretty much only as a transitional encoding to talk with Windows APIs.I think so too, although more APIs than just Windows use UTF-16. Think of Java or ICU. Aside from their Java heritage they found that it is the fastest encoding for transcoding from and to Unicode as UTF-16 codepoints cover most 8-bit codepages. Also Qt defined a char as UTF-16 code point, but they probably regret it as the 'charmap' program KCharSelect is now unable to show Unicode characters >= 0x10000. -- Marco
May 31 2016
On 05/31/2016 06:29 PM, Joakim wrote:D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_Guys, may I ask you to move this discussion to a new thread? I'd like to follow the (already crowded) autodecode thing, and this is really a separate topic.
May 31 2016
On Tuesday, 31 May 2016 at 20:28:32 UTC, ag0aep6g wrote:On 05/31/2016 06:29 PM, Joakim wrote:No, this is the root of the problem, but I'm not interested in debating it, so you can go back to discussing how to avoid the elephant in the room.D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_Guys, may I ask you to move this discussion to a new thread? I'd like to follow the (already crowded) autodecode thing, and this is really a separate topic.
May 31 2016
On Tuesday, 31 May 2016 at 16:29:33 UTC, Joakim wrote:UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world.I assume you're talking about the web here. In this case, plain text makes up only a minor part of the entire traffic, the majority of which is images (binary data), javascript and stylesheets (almost pure ASCII), and HTML markup (ditto). It's like not significant even without taking compression into account, which is ubiquitous.It is unnecessarily inefficient, which is precisely why auto-decoding is a problem.No, inefficiency is the least of the problems with auto-decoding.It is only a matter of time till UTF-8 is ditched.This is ridiculous, even if your other claims were true.D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_I think I remember that post, and - sorry to be so blunt - it was one of the worst things I've ever seen proposed regarding text encoding.The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language! This is madness.No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.
Jun 01 2016
On Wednesday, 1 June 2016 at 10:04:42 UTC, Marc Schütz wrote:On Tuesday, 31 May 2016 at 16:29:33 UTC, Joakim wrote:No, I explicitly said not the web in a subsequent post. The ignorance here of what 2G speeds are like is mind-boggling.UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world.I assume you're talking about the web here. In this case, plain text makes up only a minor part of the entire traffic, the majority of which is images (binary data), javascript and stylesheets (almost pure ASCII), and HTML markup (ditto). It's like not significant even without taking compression into account, which is ubiquitous.Right... that's why this 200-post thread was spawned with that as the main reason.It is unnecessarily inefficient, which is precisely why auto-decoding is a problem.No, inefficiency is the least of the problems with auto-decoding.The UTF-8 encoding is what's ridiculous.It is only a matter of time till UTF-8 is ditched.This is ridiculous, even if your other claims were true.Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_I think I remember that post, and - sorry to be so blunt - it was one of the worst things I've ever seen proposed regarding text encoding.Lol, this may be the dumbest argument put forth yet. I don't think anyone here even understands what a good encoding is and what it's for, which is why there's no point in debating this.The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language! This is madness.No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.
Jun 01 2016
Am Wed, 01 Jun 2016 13:57:27 +0000 schrieb Joakim <dlang joakim.fea.st>:No, I explicitly said not the web in a subsequent post. The ignorance here of what 2G speeds are like is mind-boggling.I've used 56k and had a phone conversation with my sister while she was downloading a 800 MiB file over 2G. You just learn to be patient (or you already are when the next major city is hundreds of kilometers away) and load only what you need. Your point about the costs convinced me more. Here is one article spiced up with numbers and figures: http://www.thequint.com/technology/2016/05/30/almost-every-indian-may-be-online-if-data-cost-cut-to-one-third But even if you could prove with a study that UTF-8 caused a notable bandwith cost in real life, it would - I think - be a matter of regional ISPs to provide special servers and apps that reduce data volume. There is also the overhead of key exchange when establishing a secure connection: http://stackoverflow.com/a/20306907/4038614 Something every app should do, but will increase bandwidth use. Then there is the overhead of using XML in applications like WhatsApp, which I presume is quite popular around the world. I'm just trying to broaden the view a bit here. This note from the XMPP that WhatsApp and Jabber use will make you cringe: https://tools.ietf.org/html/rfc6120#section-11.6 -- Marco
Jun 01 2016
On Wednesday, 1 June 2016 at 14:58:47 UTC, Marco Leise wrote:Am Wed, 01 Jun 2016 13:57:27 +0000 schrieb Joakim <dlang joakim.fea.st>:I see that max 2G speeds are 100-200 kbits/s. At that rate, it would have taken her more than 10 hours to download such a large file, that's nuts. The worst part is when the download gets interrupted and you have to start over again because most download managers don't know how to resume, including the stock one on Android. Also, people in these countries buy packs of around 100-200 MB for 30-60 US cents, so they would never download such a large file. They use messaging apps like Whatsapp or WeChat, which nobody in the US uses, to avoid onerous SMS charges.No, I explicitly said not the web in a subsequent post. The ignorance here of what 2G speeds are like is mind-boggling.I've used 56k and had a phone conversation with my sister while she was downloading a 800 MiB file over 2G. You just learn to be patient (or you already are when the next major city is hundreds of kilometers away) and load only what you need. Your point about the costs convinced me more.Here is one article spiced up with numbers and figures: http://www.thequint.com/technology/2016/05/30/almost-every-indian-may-be-online-if-data-cost-cut-to-one-thirdYes, only the middle class, which are at most 10-30% of the population in these developing countries, can even afford 2G. The way to get costs down even further is to make the tech as efficient as possible. Of course, much of the rest of the population are illiterate, so there are bigger problems there.But even if you could prove with a study that UTF-8 caused a notable bandwith cost in real life, it would - I think - be a matter of regional ISPs to provide special servers and apps that reduce data volume.Yes, by ditching UTF-8.There is also the overhead of key exchange when establishing a secure connection: http://stackoverflow.com/a/20306907/4038614 Something every app should do, but will increase bandwidth use.That's not going to happen, even HTTP/2 ditched that requirement. Also, many of those countries' govts will not allow it: google how Blackberry had to give up their keys for "secure" BBM in many countries. It's not just Canada and the US spying on their citizens.Then there is the overhead of using XML in applications like WhatsApp, which I presume is quite popular around the world. I'm just trying to broaden the view a bit here.I didn't know they used XML. Googling it now, I see mention that they switched to an "internally developed protocol" at some point, so I doubt they're using XML now.This note from the XMPP that WhatsApp and Jabber use will make you cringe: https://tools.ietf.org/html/rfc6120#section-11.6Haha, no wonder Jabber is dead. :) I jumped on Jabber for my own messages a decade ago, as it seemed like an open way out of that proprietary messaging mess, then I read that they're using XML and gave up on it. On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:Well, then apparently you're unaware of how bloated web pages are nowadays. It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.No, I explicitly said not the web in a subsequent post. The ignorance here of what 2G speeds are like is mind-boggling.It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge.Codepages and incompatible encodings were terrible then, too. 
Never again.This only shows you probably don't know the difference between an encoding and a code page, which are orthogonal concepts in Unicode. It's not surprising, as Walter and many others responding show the same ignorance. I explained this repeatedly in the previous thread, but it depends on understanding the tech, and I can't spoon-feed that to everyone.I think we can do a lot better.Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.Are you trolling? Because I was just calling it like it is. The vast majority of software is written for _one_ language, the local one. You may think otherwise because the software that sells the most and makes the most money is internationalized software like Windows or iOS, because it can be resold into many markets. But as a percentage of lines of code written, such international code is almost nothing.This just makes it feel like you're trolling. You're not just trolling, right?No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.Lol, this may be the dumbest argument put forth yet.No, I have never once suggested "turning back." I have suggested a new scheme that retains one technical aspect of the prior schemes, ie constant-width encoding for each language, with a single byte sufficing for most. _You and several others_, including Walter, see that and automatically translate that to, "He wants EBCDIC to come back!," as though that were the only possible single-byte encoding and largely ignoring the possibilities of the header scheme I suggested. I could call that "trolling" by all of you, :) but I'll instead call it what it likely is, reactionary thinking, and move on.I don't think anyone here even understands what a good encoding is and what it's for, which is why there's no point in debating this.And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past. This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed.If you have to deal with delivering the fastest possible i18n at GSM data rates, well, that's a tough problem and it sounds like you might need to do something pretty special. Turning the entire ecosystem into your special case is not the answer.I don't think you understand: _you_ are the special case. The 5 billion people outside the US and EU are _not the special case_. Yes, they have not mattered so far, because they were too poor to buy computers. But the "computers" with the most sales these days are smartphones, and Motorola just launched their new Moto G4 in India and Samsung their new C5 and C7 in China. They didn't bother announcing release dates for these mid-range phones- well, they're high-end in those countries- in the US. That's because "computer" sales in all these non-ASCII countries now greatly outweighs the US. Now, a large majority of people in those countries don't have smartphones or text each other, so a significant chunk of the minority who do buy mostly ~$100 smartphones over there can likely afford a fatter text encoding and I don't know what encodings these developing markets are commonly using now. 
The problem is all the rest, and those just below who cannot afford it at all, in part because the tech is not as efficient as it could be yet. Ditching UTF-8 will be one way to make it more efficient. On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:Indeed, Joakim's proposal is so insane it beggars belief (why not go back to baudot encoding, it's only 5 bit, hurray, it's so much faster when used with flag semaphores).I suspect you don't understand my proposal.As a programmer in the European Commission translation unit, working on the probably biggest translation memory in the world for 14 years, I can attest that Unicode is a blessing. When I remember the shit we had in our documents because of the code pages before most programs could handle utf-8 or utf-16 (and before 2004 we only had 2 alphabets to take care of, Western and Greek). What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual.Oh, I'm well aware of this. I just think a variable-length encoding like UTF-8 or UTF-16 is a bad design. And what you have to realize is that most strings in most software will only have one language. Anyway, the scheme I sketched out handles multiple languages: it just doesn't optimize for completely random jumbles of characters from every possible language, which is what UTF-8 is optimized for and is a ridiculous decision.Translators of course handle nearly exclusively with at least bi-lingual documents. Any document encountered by a translator must at least be able to present the source and the target language. But even outside of that specific population, multilingual documents are very, very common.You are likely biased by the fact that all your documents are bilingual: they're _not_ common for the vast majority of users. Even if they were, UTF-8 is as suboptimal, compared to the constant-width encoding scheme I've sketched, for bilingual or even trilingual documents as it is for a single language, so even if I were wrong about their frequency, it wouldn't matter.
Jun 01 2016
On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:It's telling that you think the encoding of the text is anything but the tiniest fraction of the problem. You should look at where the actual weight of a "modern" web page comes from.It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge.Well, then apparently you're unaware of how bloated web pages are nowadays. It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller."I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_" Yeah, that? That's codepages. And your exact proposal to put encodings in the header was ALSO tried around the time that Unicode was getting hashed out. It sucked. A lot. (Not as bad as storing it in the directory metadata, though.)Codepages and incompatible encodings were terrible then, too. Never again.This only shows you probably don't know the difference between an encoding and a code page,Maybe. But no one's done it yet.I think we can do a lot better.Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.The vast majority of software is written for _one_ language, the local one. You may think otherwise because the software that sells the most and makes the most money is internationalized software like Windows or iOS, because it can be resold into many markets. But as a percentage of lines of code written, such international code is almost nothing.I'm surprised you think this even matters after talking about web pages. The browser is your most common string processing situation. Nothing else even comes close.largely ignoring the possibilities of the header scheme I suggested."Possibilities" that were considered and discarded decades ago by people with way better credentials. The era of single-byte encodings is gone, it won't come back, and good riddance to bad rubbish.I could call that "trolling" by all of you, :) but I'll instead call it what it likely is, reactionary thinking, and move on.It's not trolling to call you out for clearly not doing your homework.I don't think you understand: _you_ are the special case.Oh, I understand perfectly. _We_ (whoever "we" are) can handle any sequence of glyphs and combining characters (correctly-formed or not) in any language at any time, so we're the special case...? Yeah, it sounds funny to me, too.The 5 billion people outside the US and EU are _not the special case_.Fortunately, it works for them to.The problem is all the rest, and those just below who cannot afford it at all, in part because the tech is not as efficient as it could be yet. Ditching UTF-8 will be one way to make it more efficient.All right, now you've found the special case; the case where the generic, unambiguous encoding may need to be lowered to something else: people for whom that encoding is suboptimal because of _current_ network constraints. I fully acknowledge it's a couple billion people and that's nothing to sneeze at, but I also see that it's a situation that will become less relevant over time. -Wyatt
Jun 01 2016
On Wednesday, 1 June 2016 at 18:30:25 UTC, Wyatt wrote:On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:I'm well aware that text is a small part of it. My point is that they're not downloading those web pages, they're using mobile instead, as I explicitly said in a prior post. My only point in mentioning the web bloat to you is that _your perception_ is off because you seem to think they're downloading _current_ web pages over 2G connections, and comparing it to your downloads of _past_ web pages with modems. Not only did it take minutes for us back then, it takes _even longer_ now. I know the text encoding won't help much with that. Where it will help is the mobile apps they're actually using, not the bloated websites they don't use.On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:It's telling that you think the encoding of the text is anything but the tiniest fraction of the problem. You should look at where the actual weight of a "modern" web page comes from.It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge.Well, then apparently you're unaware of how bloated web pages are nowadays. It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.You know what's also codepages? Unicode. The UCS is a standardized set of code pages for each language, often merely picking the most popular code page at that time. I don't doubt that nothing I'm saying hasn't been tried in some form before. The question is whether that alternate form would be better if designed and implemented properly, not if a botched design/implementation has ever been attempted."I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_" Yeah, that? That's codepages. And your exact proposal to put encodings in the header was ALSO tried around the time that Unicode was getting hashed out. It sucked. A lot. (Not as bad as storing it in the directory metadata, though.)Codepages and incompatible encodings were terrible then, too. Never again.This only shows you probably don't know the difference between an encoding and a code page,That's what people said about mobile devices for a long time, until about a decade ago. It's time we got this right.Maybe. But no one's done it yet.I think we can do a lot better.Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.No, it's certainly popular software, but at the scale we're talking about, ie all string processing in all software, it's fairly small. And the vast majority of webapps that handle strings passed from a browser are written to only handle one language, the local one.The vast majority of software is written for _one_ language, the local one. You may think otherwise because the software that sells the most and makes the most money is internationalized software like Windows or iOS, because it can be resold into many markets. But as a percentage of lines of code written, such international code is almost nothing.I'm surprised you think this even matters after talking about web pages. The browser is your most common string processing situation. Nothing else even comes close.Lol, credentials. 
:D If you think that matters at all in the face of the blatant stupidity embodied by UTF-8, I don't know what to tell you.largely ignoring the possibilities of the header scheme I suggested."Possibilities" that were considered and discarded decades ago by people with way better credentials. The era of single-byte encodings is gone, it won't come back, and good riddance to bad rubbish.That's funny, because it's precisely you and others who haven't done your homework. So are you all trolling me? By your definition of trolling, which btw is not the standard one, _you_ are the one doing it.I could call that "trolling" by all of you, :) but I'll instead call it what it likely is, reactionary thinking, and move on.It's not trolling to call you out for clearly not doing your homework.And you're doing so by mostly using a single-byte encoding for _your own_ Euro-centric languages, ie ASCII, while imposing unnecessary double-byte and triple-byte encodings on everyone else, despite their outnumbering you 10 to 1. That is the very definition of a special case.I don't think you understand: _you_ are the special case.Oh, I understand perfectly. _We_ (whoever "we" are) can handle any sequence of glyphs and combining characters (correctly-formed or not) in any language at any time, so we're the special case...?Yeah, it sounds funny to me, too.I'm happy to hear you find your privilege "funny," but I'm sorry to tell you, it won't last.At a higher and unneccessary cost, which is why it won't last.The 5 billion people outside the US and EU are _not the special case_.Fortunately, it works for them to.I continue to marvel at your calling a couple billion people "the special case," presumably thinking ~700 million people in the US and EU primarily using the single-byte encoding of ASCII are the general case. As for the continued relevance of such constrained use, I suggest you read the link Marco provided above. The vast majority of the worlwide literate population doesn't have a smartphone or use a cellular data plan, whereas the opposite is true if you include featurephones, largely because they can by used only for voice. As that article notes, costs for smartphones and 2G data plans will have to come down for them to go wider. That will take decades to roll out, though the basic tech design will mostly be done now. The costs will go down by making the tech more efficient, and ditching UTF-8 will be one of the ways the tech will be made more efficient.The problem is all the rest, and those just below who cannot afford it at all, in part because the tech is not as efficient as it could be yet. Ditching UTF-8 will be one way to make it more efficient.All right, now you've found the special case; the case where the generic, unambiguous encoding may need to be lowered to something else: people for whom that encoding is suboptimal because of _current_ network constraints. I fully acknowledge it's a couple billion people and that's nothing to sneeze at, but I also see that it's a situation that will become less relevant over time.
Jun 01 2016
On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:No, I explicitly said not the web in a subsequent post. The ignorance here of what 2G speeds are like is mind-boggling.It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge. Codepages and incompatible encodings were terrible then, too. Never again.Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.This just makes it feel like you're trolling. You're not just trolling, right?No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.Lol, this may be the dumbest argument put forth yet.I don't think anyone here even understands what a good encoding is and what it's for, which is why there's no point in debating this.And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past. This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed. If you have to deal with delivering the fastest possible i18n at GSM data rates, well, that's a tough problem and it sounds like you might need to do something pretty special. Turning the entire ecosystem into your special case is not the answer. -Wyatt
Jun 01 2016
On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:Indeed, Joakim's proposal is so insane it beggars belief (why not go back to baudot encoding, it's only 5 bit, hurray, it's so much faster when used with flag semaphores). As a programmer in the European Commission translation unit, working on the probably biggest translation memory in the world for 14 years, I can attest that Unicode is a blessing. When I remember the shit we had in our documents because of the code pages before most programs could handle utf-8 or utf-16 (and before 2004 we only had 2 alphabets to take care of, Western and Greek). What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual. Translators of course handle nearly exclusively with at least bi-lingual documents. Any document encountered by a translator must at least be able to present the source and the target language. But even outside of that specific population, multilingual documents are very, very common.No, I explicitly said not the web in a subsequent post. The ignorance here of what 2G speeds are like is mind-boggling.It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge. Codepages and incompatible encodings were terrible then, too. Never again.Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.This just makes it feel like you're trolling. You're not just trolling, right?No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.Lol, this may be the dumbest argument put forth yet.I don't think anyone here even understands what a good encoding is and what it's for, which is why there's no point in debating this.And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past. This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed.If you have to deal with delivering the fastest possible i18n at GSM data rates, well, that's a tough problem and it sounds like you might need to do something pretty special. Turning the entire ecosystem into your special case is not the answer.
Jun 01 2016
On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual.That should be obvious to anyone living outside the USA.
Jun 01 2016
On 06/01/2016 12:26 PM, deadalnix wrote:On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:Or anyone in the USA who's ever touched a product that includes a manual or a safety warning, or gone to high school (a foreign language class is pretty much universally mandatory, even in the US).What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual.That should be obvious to anyone living outside the USA.
Jun 01 2016
On Wednesday, 1 June 2016 at 16:26:36 UTC, deadalnix wrote:On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:https://msdn.microsoft.com/th-th inside too :)What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual.That should be obvious to anyone living outside the USA.
Jun 01 2016
On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:If you have to deal with delivering the fastest possible i18n at GSM data rates, well, that's a tough problem and it sounds like you might need to do something pretty special. Turning the entire ecosystem into your special case is not the answer.
UTF-8 encoded SMS messages work fine for me on a GSM network; I haven't noticed any problems.
Jun 01 2016
On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:When on the other hand you work with real world international text, you'll want to work with graphemes.
Actually, my main rule of thumb is: don't mess with strings. Get them from the user, store them without modification, spit them back out again. Wherever possible, don't do anything more. But if you do have to implement the rest, eh, it still depends on what you're doing. If I want an ellipsis, for example, I like to take font size into account too - basically, I do a dry run of the whole font render to get the length in pixels, then slice off the partial grapheme... So yeah, that's kinda complicated...
May 30 2016
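For the "slice off the partial grapheme" step above, here is a minimal D sketch that cuts a string only on grapheme cluster boundaries using std.uni.graphemeStride. The font-measurement dry run is left out: maxGraphemes is just a stand-in for whatever limit the pixel measurement produced, and the function name is made up for illustration.

import std.uni : graphemeStride;

/// Keep at most maxGraphemes grapheme clusters of s, so a trailing
/// combining mark is never cut off from its base character.
string truncateGraphemes(string s, size_t maxGraphemes)
{
    size_t i, seen;
    while (i < s.length && seen < maxGraphemes)
    {
        i += graphemeStride(s, i); // advance one whole grapheme, in code units
        ++seen;
    }
    return i < s.length ? s[0 .. i] ~ "…" : s;
}

unittest
{
    // "e" + U+0301 COMBINING ACUTE ACCENT is one grapheme but three code units.
    assert(truncateGraphemes("e\u0301xyz", 1) == "e\u0301…");
}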
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.This is a great example of special casing in Phobos that someone showed me: https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L1714
May 12 2016
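For readers who haven't clicked through, the special casing at that link takes roughly the shape sketched below. This is an illustration only, not the actual Phobos code: a hypothetical helper whose narrow-string branch drops down to code units so the generic branch doesn't auto-decode.

import std.traits : isNarrowString;

// Hypothetical example: count occurrences of an ASCII character in any
// range of characters.
size_t countChar(R)(R haystack, char needle)
{
    static if (isNarrowString!R)
    {
        // Special case: view the string as code units so the generic
        // branch below doesn't auto-decode it into dchars.
        import std.utf : byCodeUnit;
        return countChar(haystack.byCodeUnit, needle);
    }
    else
    {
        size_t n;
        foreach (c; haystack)
            if (c == needle) ++n;
        return n;
    }
}

unittest
{
    assert(countChar("año\n", '\n') == 1);
}

Every algorithm that cares about performance ends up carrying a branch like this, which is exactly the "constant special casing" Walter describes.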
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:Here are some that are not matters of opinion. 1. Ranges of characters do not autodecode, but arrays of characters do. This is a glaring inconsistency. 2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays. 3. Wrapping an array in a struct with an alias this to an array turns off autodecoding, another special case. 4. Autodecoding is slow and has no place in high speed string processing. 5. Very few algorithms require decoding. 6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow. 7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autocode does not play with that. 8. In my work with UTF-8 streams, dealing with autodecode has caused me considerably extra work every time. A convenient timesaver it ain't. 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there. 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place. 11. Indexing an array produces different results than autodecoding, another glaring special case.Wow, that's eleven things wrong with just one tiny element of D, with the potential to cause problems, whether fixed or not. And I get called a troll and other names when I list half a dozen things wrong with D, my posts get removed/censored, etc, all because I try to inform people not to waste time with D because it's a broken and failed language. *sigh* Phobos, a piece of useless rock orbiting a dead planet ... the irony.
May 12 2016
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:*rant*Actually, chap, it's the attitude that's the turn-off in your post there. Listing problems in order to improve them, and listing problems to convince people something is a waste of time are incompatible mindsets around here.
May 13 2016
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:You get banned because there is a difference between torpedoing a project and having constructive criticism. Also, you are missing the point by claiming that a technical problem is sure to kill D. Note that very successful languages like C++, python and so on also have undergone heated discussions about various features, and often live design mistakes for many years. The real reason why languages are successful is what they enable, not how many quirks they have. Quirks are why they get replaced by others 20 years later. :)(...)Wow, that's eleven things wrong with just one tiny element of D, with the potential to cause problems, whether fixed or not. And I get called a troll and other names when I list half a dozen things wrong with D, my posts get removed/censored, etc, all because I try to inform people not to waste time with D because it's a broken and failed language. *sigh* Phobos, a piece of useless rock orbiting a dead planet ... the irony.
May 13 2016
On Sunday, 15 May 2016 at 01:45:25 UTC, Bill Hicks wrote:From a technical point, D is not successful, for the most part. C/C++ at least can use the excuse that they were created during a time when we didn't have the experience and the knowledge that we do now.Not really. The dominating precursor to C, BCPL was a bootstrapping language for CPL. C was a quick hack to implement Unix. C++ has always been viewed as a hack and was heavily criticised since its inception as a ugly bastardized language that got many things wrong. Reality is, current main stream programming languages draw on theory that has been well understood for 40+ years. There is virtually no innovation, but a lot of repeated mistakes. Some esoteric languages draw on more modern concepts and innovate, but I can't think of a single mainstream language that does that.If by successful you mean the size of the user base, then D doesn't have that either. The number of D users is most definitely less than 10k. The number of people who have tried D is no doubt greater than that, but that's the thing with D, it has a low retention rate, for obvious reasons.Yes, but D can make breaking changes, something C++ cannot do. Unfortunately there is no real willingness to clean up the language, so D is moving way too slow to become competitive. But that is more of a cultural issue than a language issue. I am personally increasingly involved with C++, but unfortunately, there is no single C++ language. The C/C++ committees have unfortunately tried to make the C-languages more high performant and high level at the cost of correctness. So, now you either have to do heavy code reviews or carefully select compiler options to get a sane C++ environment. Like, in modern C/C++ the compiler assumes that there is no aliasing between pointers to different types. So if I cast a scalar float pointer to a simd pointer I either have to: 1. make sure that I turn off that assumption by using the compiler switch "-fno-strict-aliasing" and add "__restrict__" where I know there is no aliasing, or 2. Put __may_alias__ on my simd pointers. 3. Carefully place memory barriers between pointer type casts. 4. Dig into the compiler internals to figure out what it does. C++ is trying way too hard to become a high level language, without the foundation to support it. This is an area where D could do well, but it isn't doing enough to get there, neither on the theoretical level or the implementation level. Rust seems to try, but I don't think they will make it as they don't seem to have a broad view of programming. Maybe someone will build a new language over the Rust mid-level IR (MIR) that will be successful. I'm hopeful, but hey, it won't happen in less than 5 years. Until then there is only three options for C++ish progamming: C++, D and Loci. Currently C++ is the path of least resistance (but with very high initial investment, 1+ year for an experienced educated programmer). So clearly a language comparable to D _could_ make headway, but not without a philosophical change that makes it a significant improvement over C++ and systematically adresses the C++ short-comings one by one (while retaining the application area and basic programming model).
May 15 2016
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:Wow, that's eleven things wrong with just one tiny element of D, with the potential to cause problems, whether fixed or not. And I get called a troll and other names when I list half a dozen things wrong with D, my posts get removed/censored, etc, all because I try to inform people not to waste time with D because it's a broken and failed language. *sigh* Phobos, a piece of useless rock orbiting a dead planet ... the irony.
Is there any PL that doesn't have multiple issues? Look at Swift. They keep changing it, although it started out as _the_ big alternative to the chronically ill C++. There is no such thing as the perfect PL, and as hardware changes, PLs become outdated anyway and have to catch up. The question is not whether a language sucks, the question is which language sucks the least for the task at hand. PS: I wonder whether Bill Hicks knows you're using his name? But I guess he's lost interest in this planet and happily lives on Mars now.
May 13 2016
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:not to waste time with D because it's a broken and failed language.
D is a better broken thing among all the broken things in this broken world, so it's only to be expected that people prefer to spend their time on it.
May 13 2016
On 5/12/2016 11:50 PM, Bill Hicks wrote:And I get called a troll and other names when I list half a dozen things wrong with D, my posts get removed/censored, etc, all because I try to inform people not to waste time with D because it's a broken and failed language.Posts that engage in personal attacks and bring up personal issues about other forum members get removed. You're welcome to post here in a reasonably professional manner.
May 13 2016
On Thursday, May 12, 2016 13:15:45 Walter Bright via Digitalmars-d wrote:On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote: > I am as unclear about the problems of autodecoding as I am about the > necessity to remove curl. Whenever I ask I hear some arguments that work > well emotionally but are scant on reason and engineering. Maybe it's > time to rehash them? I just did so about curl, no solid argument seemed > to come together. I'd be curious of a crisp list of grievances about > autodecoding. -- Andrei Here are some that are not matters of opinion. 1. Ranges of characters do not autodecode, but arrays of characters do. This is a glaring inconsistency. 2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays. 3. Wrapping an array in a struct with an alias this to an array turns off autodecoding, another special case. 4. Autodecoding is slow and has no place in high speed string processing. 5. Very few algorithms require decoding. 6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow. 7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autocode does not play with that. 8. In my work with UTF-8 streams, dealing with autodecode has caused me considerably extra work every time. A convenient timesaver it ain't. 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there. 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place. 11. Indexing an array produces different results than autodecoding, another glaring special case.It also results in constantly special-casing algorithms for narrow strings in order to avoid auto-decoding. Phobos does this all over the place. We have a ridiculous amount of code in Phobos just to avoid auto-decoding, and anyone who wants high performance will have to do the same. And it's not like auto-decoding is even correct. It would be one thing if auto-decoding were fully correct but slow, but to be fully correct, it would need to operate at the grapheme level, not the code point level. So, by default, we get slower code without actually getting fully correct code. So, we're neither fast nor correct. We _are_ correct in more cases than we'd be if we simply acted like ASCII was all there was, but what we end up with is the illusion that we're correct when we're not. IIRC, Andrei talked in TDPL about how Java's choice to go with UTF-16 was worse than the choice to go with UTF-8, because it was correct in many more cases to operate on the code unit level as if a code unit were a character, and it was therefore harder to realize that what you were doing was wrong, whereas with UTF-8, it's obvious very quickly. 
We currently have that same problem with auto-decoding except that it's treating UTF-32 code units as if they were full characters rather than treating UTF-16 code units as if they were full characters. Ideally, algorithms would be Unicode aware as appropriate, but the default would be to operate on code units with wrappers to handle decoding by code point or grapheme. Then it's easy to write fast code while still allowing for full correctness. Granted, it's not necessarily easy to get correct code that way, but anyone who wants fully correctness without caring about efficiency can just use ranges of graphemes. Ranges of code points are rare regardless. Based on what I've seen in previous conversations on auto-decoding over the past few years (be it in the newsgroup, on github, or at dconf), most of the core devs think that auto-decoding was a major blunder that we continue to pay for. But unfortunately, even if we all agree that it was a huge mistake and want to fix it, the question remains of how to do that without breaking tons of code - though since AFAIK, Andrei is still in favor of auto-decoding, we'd have a hard time going forward with plans to get rid of it even if we had come up with a good way of doing so. But I would love it if we could get rid of auto-decoding and clean up string handling in D. - Jonathan M Davis
May 13 2016
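To make the code unit / code point / grapheme distinction above concrete, here is a small example using facilities that already exist in Phobos (std.utf.byCodeUnit and std.uni.byGrapheme); the expected results are in the comments.

import std.range.primitives : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived
    // character, two code points, three UTF-8 code units.
    string s = "e\u0301";

    writeln(s.byCodeUnit.walkLength); // 3 -- code units, no decoding
    writeln(s.walkLength);            // 2 -- code points, what auto-decoding iterates
    writeln(s.byGrapheme.walkLength); // 1 -- graphemes, the user-perceived count
}

Only the last number matches what a user would call the string's length, which is the "neither fast nor correct" point: the auto-decoded default pays for decoding yet still stops at code points.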
On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:Based on what I've seen in previous conversations on auto-decoding over the past few years (be it in the newsgroup, on github, or at dconf), most of the core devs think that auto-decoding was a major blunder that we continue to pay for. But unfortunately, even if we all agree that it was a huge mistake and want to fix it, the question remains of how to do that without breaking tons of code - though since AFAIK, Andrei is still in favor of auto-decoding, we'd have a hard time going forward with plans to get rid of it even if we had come up with a good way of doing so. But I would love it if we could get rid of auto-decoding and clean up string handling in D. - Jonathan M DavisWhy not just try it in a separate test release? Only then can we know to what extent it actually breaks code, and what remedies we could come up with.
May 13 2016
On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:Ideally, algorithms would be Unicode aware as appropriate, but the default would be to operate on code units with wrappers to handle decoding by code point or grapheme. Then it's easy to write fast code while still allowing for full correctness. Granted, it's not necessarily easy to get correct code that way, but anyone who wants fully correctness without caring about efficiency can just use ranges of graphemes. Ranges of code points are rare regardless.char[], wchar[] etc. can simply be made non-ranges, so that the user has to choose between .byCodePoint, .byCodeUnit (or .representation as it already exists), .byGrapheme, or even higher-level units like .byLine or .byWord. Ranges of char, wchar however stay as they are today. That way it's harder to accidentally get it wrong.Based on what I've seen in previous conversations on auto-decoding over the past few years (be it in the newsgroup, on github, or at dconf), most of the core devs think that auto-decoding was a major blunder that we continue to pay for. But unfortunately, even if we all agree that it was a huge mistake and want to fix it, the question remains of how to do that without breaking tons of code - though since AFAIK, Andrei is still in favor of auto-decoding, we'd have a hard time going forward with plans to get rid of it even if we had come up with a good way of doing so. But I would love it if we could get rid of auto-decoding and clean up string handling in D.There is a simple deprecation path that's already been suggested. `isInputRange` and friends can output a helpful deprecation warning when they're called with a range that currently triggers auto-decoding.
May 13 2016
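One possible shape of that deprecation hook, sketched as purely hypothetical code (nothing like this exists in Phobos today): a wrapper trait that forwards to isInputRange but emits a compile-time note whenever it is instantiated with an auto-decoding narrow string.

import std.range.primitives : isInputRange;
import std.traits : isNarrowString;

// Hypothetical: during a deprecation window, range algorithms would
// constrain on this instead of plain isInputRange.
template isInputRangeNoAutodecode(R)
{
    static if (isNarrowString!R)
        pragma(msg, "Deprecation: " ~ R.stringof ~
            " auto-decodes; choose .byCodeUnit, .byGrapheme or .representation");
    enum isInputRangeNoAutodecode = isInputRange!R;
}

static assert(isInputRangeNoAutodecode!string);  // passes, but prints the note
static assert(isInputRangeNoAutodecode!(int[])); // passes silently

Once char[] and wchar[] stop being ranges, the wrapper would go away and the plain trait would simply reject them.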
On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:IIRC, Andrei talked in TDPL about how Java's choice to go with UTF-16 was worse than the choice to go with UTF-8, because it was correct in many more casesUTF-16 was a migration from UCS-2, and UCS-2 was superior at the time.May 13 2016On Friday, May 13, 2016 12:52:13 Kagamin via Digitalmars-d wrote:On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:The history of why UTF-16 was chosen isn't really relevant to my point (Win32 has the same problem as Java and for similar reasons). My point was that if you use UTF-8, then it's obvious _really_ fast when you screwed up Unicode-handling by treating a code unit as a character, because anything beyond ASCII is going to fall flat on its face. But with UTF-16, a _lot_ more code units are representable as a single code point - as well as a single grapheme - so it's far easier to write code that treats a code unit as if it were a full character without realizing that you're screwing it up. UTF-8 is fail-fast in this regard, whereas UTF-16 is not. UTF-32 takes that problem to a new level, because now you'll only notice problems when you're dealing with a grapheme constructed of multiple code points. So, odds are that even if you test with Unicode strings, you won't catch the bugs. It'll work 99% of the time, and you'll get subtle bugs the rest of the time. There are reasons to operate at the code point level, but in general, you either want to be operating at the code unit level or the grapheme level, not the code point level, and if you don't know what you're doing, then anything other than the grapheme level is likely going to be wrong if you're manipulating individual characters. Fortunately, a lot of string processing doesn't need to operate on individual characters and as long as the standard library functions get it right, you'll tend to be okay, but still, operating at the code point level is almost always wrong, and it's even harder to catch when it's wrong than when treating UTF-16 code units as characters. - Jonathan M DavisIIRC, Andrei talked in TDPL about how Java's choice to go with UTF-16 was worse than the choice to go with UTF-8, because it was correct in many more casesUTF-16 was a migration from UCS-2, and UCS-2 was superior at the time.May 13 2016On Friday, 13 May 2016 at 21:46:28 UTC, Jonathan M Davis wrote:The history of why UTF-16 was chosen isn't really relevant to my point (Win32 has the same problem as Java and for similar reasons). My point was that if you use UTF-8, then it's obvious _really_ fast when you screwed up Unicode-handling by treating a code unit as a character, because anything beyond ASCII is going to fall flat on its face.On the other hand if you deal with UTF-16 text, you can't interpret it in a way other than UTF-16, people either get it correct or give up, even for ASCII, even with casts, it's that resilient. With UTF-8 problems happened on a massive scale in LAMP setups: mysql used latin1 as a default encoding and almost everything worked fine.May 17 2016On Tuesday, 17 May 2016 at 09:53:17 UTC, Kagamin wrote:With UTF-8 problems happened on a massive scale in LAMP setups: mysql used latin1 as a default encoding and almost everything worked fine.^ latin-1 with Swedish collation rules. And even if you set the encoding to "utf8", almost everything works fine until you discover that you need to set the encoding to "utf8mb4" to get real utf8. 
Also, MySQL has per-connection character encoding settings, so even if your application is properly set up to use utf8, you can break things by accidentally connecting with a client using the default pretty-much-latin1 encoding. With MySQL's "silently ram the square peg into the round hole" design philosophy, this can cause data corruption. But, of course, almost everything works fine. Just some examples of why broken utf8 exists (and some venting of MySQL trauma).May 17 2016On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autocode does not play with that.This just means that filenames mustn't be represented as strings; it's unrelated to auto decoding.May 13 2016On 5/13/2016 3:43 AM, Marc Schütz wrote:On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:It means much more than that, filenames are just an example. I recently fixed MicroEmacs (my text editor) to assume the source is UTF-8, and display Unicode characters. But it still needs to work with dirty UTF-8 without throwing exceptions, modifying the text in-place, or other tantrums.7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autocode does not play with that.This just means that filenames mustn't be represented as strings; it's unrelated to auto decoding.May 13 2016On 5/12/16 4:15 PM, Walter Bright wrote:10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.I'll repeat what I said in the other thread. The problem isn't auto-decoding. The problem is hijacking the char[] and wchar[] (and variants) array type to mean autodecoding non-arrays. If you think this code makes sense, then my definition of sane varies slightly from yours: static assert(!hasLength!R && is(typeof(R.init.length))); static assert(!is(ElementType!R == R.init[0])); static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $]))); I think D would be fine if string meant some auto-decoding struct with an immutable(char)[] array backing. I can accept and work with that. I can transform that into a char[] that makes sense if I have no use for auto-decoding. As of today, I have to use byCodePoint, or .representation, etc. and it's very unwieldy. If I ran D, that's what I would do. -SteveMay 13 2016On Friday, 13 May 2016 at 16:05:21 UTC, Steven Schveighoffer wrote:On 5/12/16 4:15 PM, Walter Bright wrote:Well, the "auto" part of autodecoding means "automatically doing it for plain strings", right? If you explicitly do decoding, I think it would just be "decoding"; there's no "auto" part. I doubt anyone is going to complain if you add in a struct wrapper around a string that iterates over code units or graphemes. The issue most people have, as you say, is the fact that the default for strings is to decode.10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.I'll repeat what I said in the other thread. The problem isn't auto-decoding. The problem is hijacking the char[] and wchar[] (and variants) array type to mean autodecoding non-arrays. 
If you think this code makes sense, then my definition of sane varies slightly from yours: static assert(!hasLength!R && is(typeof(R.init.length))); static assert(!is(ElementType!R == R.init[0])); static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $]))); I think D would be fine if string meant some auto-decoding struct with an immutable(char)[] array backing. I can accept and work with that. I can transform that into a char[] that makes sense if I have no use for auto-decoding. As of today, I have to use byCodePoint, or .representation, etc. and it's very unwieldy. If I ran D, that's what I would do. -SteveMay 13 2016On 5/13/16 5:25 PM, Alex Parrill wrote:On Friday, 13 May 2016 at 16:05:21 UTC, Steven Schveighoffer wrote:No, the problem isn't the auto-decoding. The problem is having *arrays* do that. Sometimes. I would be perfectly fine with a custom string type that all string literals were typed as, as long as I can get a sanely behaving array out of it.On 5/12/16 4:15 PM, Walter Bright wrote:Well, the "auto" part of autodecoding means "automatically doing it for plain strings", right? If you explicitly do decoding, I think it would just be "decoding"; there's no "auto" part.10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.I'll repeat what I said in the other thread. The problem isn't auto-decoding. The problem is hijacking the char[] and wchar[] (and variants) array type to mean autodecoding non-arrays. If you think this code makes sense, then my definition of sane varies slightly from yours: static assert(!hasLength!R && is(typeof(R.init.length))); static assert(!is(ElementType!R == R.init[0])); static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $]))); I think D would be fine if string meant some auto-decoding struct with an immutable(char)[] array backing. I can accept and work with that. I can transform that into a char[] that makes sense if I have no use for auto-decoding. As of today, I have to use byCodePoint, or .representation, etc. and it's very unwieldy. If I ran D, that's what I would do.I doubt anyone is going to complain if you add in a struct wrapper around a string that iterates over code units or graphemes. The issue most people have, as you say, is the fact that the default for strings is to decode.I want to clarify that I don't really care if strings by default auto-decode. I think that's fine. What I dislike is that immutable(char)[] auto-decodes. -SteveMay 13 2016On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:Given the importance of performance in the auto-decoding topic, it seems reasonable to quantify it. I took a stab at this. It would of course be prudent to have others conduct similar analysis rather than rely on my numbers alone. Measurements were done using an artificial scenario, counting lower-case ascii letters. This had the effect of calling front/popFront many times on a long block of text. Runs were done both treating the text as char[] and ubyte[] and comparing the run times. (char[] performs auto-decoding, ubyte[] does not.) Timings were done with DMD and LDC, and on two different data sets. One data set was a mix of latin languages (e.g. German, English, Finnish, etc.), the other non-Latin languages (e.g. Japanese, Chinese, Greek, etc.). The goal being to distinguish between scenarios with high and low Ascii character content. 
The result: For DMD, auto-decoding showed a 1.6x to 2.6x cost. For LDC, a 12.2x to 12.9x cost.
Details:
- Test program: https://dpaste.dzfl.pl/67c7be11301f
- DMD 2.071.0. Options: -release -O -boundscheck=off -inline
- LDC 1.0.0-beta1 (based on DMD v2.070.2). Options: -release -O -boundscheck=off
- Machine: Macbook Pro (2.8 GHz Intel I7, 16GB ram)
Runs for each combination were done five times and the median times used. The median times and the char[] to ubyte[] ratio are below:

|          |           | char[]    | ubyte[]   |       |
| Compiler | Text type | time (ms) | time (ms) | ratio |
|----------+-----------+-----------+-----------+-------|
| DMD      | Latin     | 7261      | 4513      | 1.6   |
| DMD      | Non-latin | 10240     | 3928      | 2.6   |
| LDC      | Latin     | 11773     | 913       | 12.9  |
| LDC      | Non-latin | 10756     | 883       | 12.2  |

Note: The numbers above don't provide enough info to derive a front/popFront rate. The program artificially makes multiple loops to increase the run-times. (For these runs, the program's repeat-count was set to 20.)
Characteristics of the two data sets:

|           |         |         |             | Bytes per |           |
| Text type | Bytes   | DChars  | Ascii Chars | DChar     | Pct Ascii |
|-----------+---------+---------+-------------+-----------+-----------|
| Latin     | 4156697 | 4059016 | 3965585     | 1.024     | 97.7%     |
| Non-latin | 4061554 | 1949290 | 348164      | 2.084     | 17.9%     |

Run-to-run variability - The run times recorded were quite stable. The largest delta between minimum and median time for any group was 17 milliseconds.
I am as unclear about the problems of autodecoding as I am about the necessity to remove curl. Whenever I ask I hear some arguments that work well emotionally but are scant on reason and engineering. Maybe it's time to rehash them? I just did so about curl, no solid argument seemed to come together. I'd be curious of a crisp list of grievances about autodecoding. -- Andrei
May 15 2016
On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:Given the importance of performance in the auto-decoding topic, it seems reasonable to quantify it. I took a stab at this. It would of course be prudent to have others conduct similar analysis rather than rely on my numbers alone.
Here is another benchmark (see the above comment for the code to apply the patch to) that measures the iteration time difference: http://forum.dlang.org/post/ndj6dm$a6c$1 digitalmars.com The result is a 756% slow down
May 15 2016
On Mon, May 16, 2016 at 12:31:04AM +0000, Jack Stouffer via Digitalmars-d wrote:On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:I decided to do my own benchmarking too. Here's the code: /** * Simple-minded benchmark for measuring performance degradation caused by * autodecoding. */ import std.typecons : Flag, Yes, No; size_t countNewlines(Flag!"autodecode" autodecode)(const(char)[] input) { size_t count = 0; static if (autodecode) { import std.array; foreach (dchar ch; input) { if (ch == '\n') count++; } } else // !autodecode { import std.utf : byCodeUnit; foreach (char ch; input.byCodeUnit) { if (ch == '\n') count++; } } return count; } void main(string[] args) { import std.datetime : benchmark; import std.file : read; import std.stdio : writeln, writefln; string input = (args.length >= 2) ? 
args[1] : "/usr/src/d/phobos/std/datetime.d"; uint n = 50; auto data = cast(char[]) read(input); writefln("Input: %s (%d bytes)", input, data.length); size_t count; writeln("With autodecoding:"); auto result = benchmark!({ count = countNewlines!(Yes.autodecode)(data); })(n); writefln("Newlines: %d Time: %s msecs", count, result[0].msecs); writeln("Without autodecoding:"); result = benchmark!({ count = countNewlines!(No.autodecode)(data); })(n); writefln("Newlines: %d Time: %s msecs", count, result[0].msecs); } // vim:set sw=4 ts=4 et: Just for fun, I decided to use std/datetime.d, one of the largest modules in Phobos, as a test case. For comparison, I compiled with dmd (latest git head) and gdc 5.3.1. The compile commands were: dmd -O -inline bench.d -ofbench.dmd gdc -O3 bench.d -o bench.gdc Here are the results from bench.dmd: Input: /usr/src/d/phobos/std/datetime.d (1464089 bytes) With autodecoding: Newlines: 35398 Time: 331 msecs Without autodecoding: Newlines: 35398 Time: 254 msecs And the results from bench.gdc: Input: /usr/src/d/phobos/std/datetime.d (1464089 bytes) With autodecoding: Newlines: 35398 Time: 253 msecs Without autodecoding: Newlines: 35398 Time: 25 msecs These results are pretty typical across multiple runs. There is a variance of about 20 msecs or so between bench.dmd runs, but the bench.gdc runs vary only by about 1-2 msecs. So for bench.dmd, autodecoding adds about a 30% overhead to running time, whereas for bench.gdc, autodecoding costs an order of magnitude increase in running time. As an interesting aside, compiling with dmd without -O -inline causes the non-autodecoding case to be actually consistently *slower* than the autodecoding case. Apparently in this case the performance is dominated by the cost of calling non-inlined range primitives on byCodeUnit, whereas a manual for-loop over the array of chars produces similar results to the -O -inline case. I find this interesting, because it shows that the cost of autodecoding is relatively small compared to the cost of unoptimized range primitives. Nevertheless, it does make a big difference when range primitives are properly optimized. It is especially poignant in the case of gdc that, given a superior optimizer, the non-autodecoding case can be made an order of magnitude faster, whereas the autodecoding case is presumably complex enough to defeat the optimizer. T -- Democracy: The triumph of popularity over principle. -- C.BondGiven the importance of performance in the auto-decoding topic, it seems reasonable to quantify it. I took a stab at this. It would of course be prudent to have others conduct similar analysis rather than rely on my numbers alone.Here is another benchmark (see the above comment for the code to apply the patch to) that measures the iteration time difference: http://forum.dlang.org/post/ndj6dm$a6c$1 digitalmars.com The result is a 756% slow downMay 15 2016On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:Runs for each combination were done five times and the median times used. The median times and the char[] to ubyte[] ratio are below: | | | char[] | ubyte[] | | Compiler | Text type | time (ms) | time (ms) | ratio | |----------+-----------+-----------+-----------+-------| | DMD | Latin | 7261 | 4513 | 1.6 | | DMD | Non-latin | 10240 | 3928 | 2.6 | | LDC | Latin | 11773 | 913 | 12.9 | | LDC | Non-latin | 10756 | 883 | 12.2 |Interesting that LDC is slower than DMD for char[].May 16 2016This might be a good time to discuss this a tad further. 
I'd appreciate if the debate stayed on point going forward. Thanks! My thesis: the D1 design decision to represent strings as char[] was disastrous and probably one of the largest weaknesses of D1. The decision in D2 to use immutable(char)[] for strings is a vast improvement but still has a number of issues. The approach to autodecoding in Phobos is an improvement on that decision. The insistent shunning of a user-defined type to represent strings is not good and we need to rid ourselves of it. On 05/12/2016 04:15 PM, Walter Bright wrote:On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote: > I am as unclear about the problems of autodecoding as I am about the necessity > to remove curl. Whenever I ask I hear some arguments that work well emotionally > but are scant on reason and engineering. Maybe it's time to rehash them? I just > did so about curl, no solid argument seemed to come together. I'd be curious of > a crisp list of grievances about autodecoding. -- Andrei Here are some that are not matters of opinion. 1. Ranges of characters do not autodecode, but arrays of characters do. This is a glaring inconsistency.Agreed. At the point of that decision, the party line was "arrays of characters are strings, nothing else is or should be". Now it is apparent that shouldn't have been the case.2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.This is a consequence of 1. It is at least partially fixable.3. Wrapping an array in a struct with an alias this to an array turns off autodecoding, another special case.This is also a consequence of 1.4. Autodecoding is slow and has no place in high speed string processing.I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing. Also allow me to point that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.5. Very few algorithms require decoding.The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.find("abc") s.findSplit("abc") s.findSplit('a') s.count!(c => "!()-;:,.?".canFind(c)) // punctuation However the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control characters Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.Agreed. 
This is probably the most glaring mistake. I think we should open a discussion no fixing this everywhere in the stdlib, even at the cost of breaking code.7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autocode does not play with that.If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.8. In my work with UTF-8 streams, dealing with autodecode has caused me considerably extra work every time. A convenient timesaver it ain't.Objection. Vague.9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width. Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte? This is a fundamental question for which we need a rigorous answer. What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any other)? If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF8 code point.11. Indexing an array produces different results than autodecoding, another glaring special case.This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding. Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean. AndreiMay 26 2016On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing.For an example where the std.algorithm/range functions don't cut it, my random format date string parser first breaks up the given character range into tokens. Once it has the tokens, it checks several known formats. One piece of that is checking if some of the tokens are in AAs of month and day names for fast tests of presence. 
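A reduced sketch of that shape (hypothetical names, not the actual parser code): the month-name AA is keyed by string, and every incoming token is forced down to UTF-8 code units before the lookup.

    import std.array : array;
    import std.exception : assumeUnique;
    import std.utf : byChar;

    // Hypothetical reduction, not the real date-parsing code.
    bool isMonthName(R)(R token, const int[string] months)
    {
        // byChar yields UTF-8 code units whether the token came from a char,
        // wchar or dchar source, so the AA key is always byte-identical and
        // no autodecoding happens along the way.
        string key = token.byChar.array.assumeUnique;
        return (key in months) !is null;
    }

    unittest
    {
        auto months = ["jan" : 1, "feb" : 2, "mar" : 3];
        assert(isMonthName("jan", months));
        assert(isMonthName("jan"w, months));   // wstring input, same key bytes
        assert(!isMonthName("midsummer", months));
    }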
Because the AAs are int[string], and it's unknowable the encoding of string (it's complicated), during tokenization, the character range must be forced to UTF-8 with byChar with all isSomeString!R == true inputs to avoid the auto-decoding and subsequent AA key mismatch.Agreed. This is probably the most glaring mistake. I think we should open a discussion no fixing this everywhere in the stdlib, even at the cost of breaking code.See the discussion here: https://issues.dlang.org/show_bug.cgi?id=14519 I think some of the proposals there are interesting.Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.If you agree that iterating over code units and code points isn't what people want/need most of the time, then I will quote something from my article on the subject: "I really don't see the benefit of the automatic behavior fulfilling this one specific corner case when you're going to make everyone else call a range generating function when they want to iterate over code units or graphemes. Just make everyone call a range generating function to specify the type of iteration and save a lot of people the trouble!" I think the only clear way forward is to not make strings ranges and force people to make a decision when passing them to range functions. The HUGE problem is the code this will break, which is just about all of it.May 26 2016On Thu, May 26, 2016 at 12:00:54PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]On 05/12/2016 04:15 PM, Walter Bright wrote:[...]General Unicode strings have a lot of non-ASCII characters. Why are we only optimizing for the ASCII case?4. Autodecoding is slow and has no place in high speed string processing.I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing. Also allow me to point that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.Question: what should count return, given a string containing (1) combining diacritics, or (2) Korean text? Or (3) zero-width spaces?5. Very few algorithms require decoding.The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.find("abc") s.findSplit("abc") s.findSplit('a') s.count!(c => "!()-;:,.?".canFind(c)) // punctuation However the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control charactersCurrently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.The problem is that often such decisions can only be made by the user, because it depends on what the user wants to accomplish. What should count return, given some Unicode string? 
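As a small illustration of how much the answer varies (a sketch; the literal is "café" with the accent in decomposed form, i.e. 'e' followed by a combining acute):

    import std.algorithm.searching : count;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "cafe\u0301";               // "café", decomposed

        assert(s.byCodeUnit.count == 6);       // UTF-8 code units (bytes)
        assert(s.count == 5);                  // code points -- what autodecoding iterates
        assert(s.byGrapheme.count == 4);       // user-perceived characters
    }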
If the user wants to determine the size of a buffer (e.g., to store a string minus some characters to be stripped), then count should return the byte count. If the user wants to count the number of matching visual characters, then count should return the number of graphemes. If the user wants to determine the visual width of the (filtered) string, then count should not be used at all, but instead a font metric algorithm. (I can't think of a practical use case where you'd actually need to count code points(!).) Having the library arbitrarily choose one use case over the others (especially one that seems the least applicable to practical situations) just doesn't seem right to me at all. Rather, the user ought to specify what exactly is to be counted, i.e., s.byCodeUnit.count(), s.byCodePoint.count(), or s.byGrapheme.count(). [...]Therefore, instead of: myString.splitter!"abc".joiner!"def".count; we have to write: myString.representation .splitter!("abc".representation) .joiner!("def".representation) .count; Great. [...]9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[].That is a strawman. We are not arguing for eliminating the distinction between char[] and ubyte[]. Rather, the complaint is that autodecoding represents a constant overhead in string processing that's often *unnecessary*. Many string operations don't *need* to autodecode, and even those that may seem like they do, are often better implemented differently. For example, filtering a string by a non-ASCII character can actually be done via substring search -- expand the non-ASCII character into 1 to 6 code units, and then do the equivalent of C's strstr(). This will not have false positives thanks to the way UTF-8 is designed. It eliminates the overhead of decoding every single character -- in implementational terms, it could, for example, first scan for the first 1st byte by linear scan through the string without decoding, which is a lot faster than decoding every single character and then comparing with the target. Only when the first byte matches does it need to do the slightly more expensive operation of substring comparison. Similarly, splitter does not need to operate on code points at all. It's unnecessarily slow that way. Most use cases of splitter has lots of data in between delimiters, which means most of the work done by autodecoding is wasted. Instead, splitter should just scan for the substring to split on -- again the design of UTF-8 guarantees there will be no false positives -- and only put in the effort where it's actually needed: at the delimiters, not the data in between. The same could be said of joiner, and many other common string algorithms. There aren't many algorithms that actually need to decode; decoding should be restricted to them, rather than an overhead applied across the board. [...]Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.[...] 
We already have a clear definition: char, wchar, and dchar are Unicode code units, and the latter is also Unicode code points. That's all there is to it. If we want Phobos to truly be able to take advantage of the fact that char[], wchar[], dchar[] contain Unicode strings, we need to stop the navel gazing at what byte representations and bits mean, and look at the bigger picture. Consider char[] as a unit in itself, a complete Unicode string -- the actual code units don't really matter, as they are just an implementation detail. What you want to be able to do is for a Phobos algorithm to decide, OK, in order to produce output X, it's faster to do substring scanning, and in order to produce output Y, it's better to decode first. In other words, decoding or not decoding ought to be a decision made at the algorithm level (or higher), depending on the need at hand. It should not be hard-boiled into the lower-level internals of how strings are handled, such that higher-level algorithms are straitjacketed and forced to work with the decoded stream, even when they actually don't *need* decoding to do what they want. In the cases where Phobos is unable to make a decision (e.g., what should count return -- which depends on what the user is trying to accomplish), it should be left to the user. The user shouldn't have to work against a default setting that only works for a subset of use cases. T -- Without geometry, life would be pointless. -- VSMay 26 2016On 05/26/2016 07:23 PM, H. S. Teoh via Digitalmars-d wrote:Therefore, instead of: myString.splitter!"abc".joiner!"def".count; we have to write: myString.representation .splitter!("abc".representation) .joiner!("def".representation) .count;No, that's not necessary (or correct). -- AndreiMay 26 2016Am Thu, 26 May 2016 16:23:16 -0700 schrieb "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com>:On Thu, May 26, 2016 at 12:00:54PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]Hey, I was about to answer exactly the same. It reminds me that a few years ago I proposed making string iteration explicit by code-unit, code-point and grapheme in "Rust" and there was virtually no debate about doing it in the sense that to enable people to write correct code they'd need to understand a bit of Unicode and pick the right primitive. If you don't know what to pick you look it up. -- Marcos.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control charactersQuestion: what should count return, given a string containing (1) combining diacritics, or (2) Korean text? Or (3) zero-width spaces?Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.The problem is that often such decisions can only be made by the user, because it depends on what the user wants to accomplish. What should count return, given some Unicode string? If the user wants to determine the size of a buffer (e.g., to store a string minus some characters to be stripped), then count should return the byte count. If the user wants to count the number of matching visual characters, then count should return the number of graphemes. If the user wants to determine the visual width of the (filtered) string, then count should not be used at all, but instead a font metric algorithm. 
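Spelled out, each of those intents is one explicit call away (a sketch reusing the punctuation predicate from earlier in the thread; the variable names are just for illustration):

    import std.algorithm.searching : canFind, count;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        auto s = "Hi, there!";

        // Buffer size for the stripped copy: count the surviving code units.
        auto bytes = s.byCodeUnit.count!(c => !"!()-;:,.?".canFind(c));

        // Number of matching visual characters: count the surviving graphemes.
        auto chars = s.byGrapheme.count!(g => !"!()-;:,.?".canFind(g[0]));

        assert(bytes == 8 && chars == 8);   // equal here only because the input is ASCII
    }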
(I can't think of a practical use case where you'd actually need to count code points(!).)
May 30 2016

I like "make string iteration explicit" but I wonder about other constructs. E.g. What about "sort an array of strings"? How would you tell a generic sort function whether you want it to interpret strings by code unit vs code point vs grapheme?
May 30 2016

On Monday, 30 May 2016 at 17:14:47 UTC, Andrew Godfrey wrote:
I like "make string iteration explicit" but I wonder about other constructs. E.g. What about "sort an array of strings"? How would you tell a generic sort function whether you want it to interpret strings by code unit vs code point vs grapheme?
The comparison predicate does that...

    sort!( (string a, string b) { /* you interpret a and b here and return the comparison */ })(["hi", "there"]);

May 30 2016

On Monday, 30 May 2016 at 18:26:32 UTC, Adam D. Ruppe wrote:
On Monday, 30 May 2016 at 17:14:47 UTC, Andrew Godfrey wrote:
Thanks! You left out some details but I think I see - an example predicate might be "cmp(a.byGrapheme, b.byGrapheme)" and by the looks of it, that code works in D today. (However, "cmp(a, b)" would default to code points today, which is surprising to almost everyone, and that's more what this thread is about).
I like "make string iteration explicit" but I wonder about other constructs. E.g. What about "sort an array of strings"? How would you tell a generic sort function whether you want it to interpret strings by code unit vs code point vs grapheme?
The comparison predicate does that...

    sort!( (string a, string b) { /* you interpret a and b here and return the comparison */ })(["hi", "there"]);

May 30 2016

On Mon, 30 May 2016 17:14:47 +0000, Andrew Godfrey <X y.com> wrote:
I like "make string iteration explicit" but I wonder about other constructs. E.g. What about "sort an array of strings"? How would you tell a generic sort function whether you want it to interpret strings by code unit vs code point vs grapheme?
You are just scratching the surface! Unicode strings are sorted following the Unicode Collation Algorithm, which is described in the 86-page document here (http://www.unicode.org/reports/tr10/) and implemented in the ICU library mentioned before. Some obvious considerations from the description of the algorithm:

In Sweden z comes before ö, while in Germany it's the reverse.
In Germany, words in a dictionary are sorted differently from lists of names in a phone book. dictionary: of < öf, phone book: öf < of
Spanish sorts 'll' as one character right after 'l'.

The default collation is selected in Windows through the control panel's localization app and on Linux (Posix) using the LC_COLLATE environment variable. The actual string sorting in the user's locale can then be performed with the C library using http://www.cplusplus.com/reference/cstring/strcoll/ or OS-specific functions like CompareStringEx on Windows https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx

TL;DR: neither code points nor grapheme clusters are adequate for string sorting. Also two strings may compare unequal byte for byte, while they are actually the same text in different normalization forms. (E.g. umlauts on OS X (NFD) vs. rest of the world (NFC)). Admittedly I find myself using str1 == str2 without first normalizing both, because it is frigging convenient and fast.

-- Marco
May 30 2016

On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
It is completely wasted mental effort.
4.
Autodecoding is slow and has no place in high speed string processing.I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D.As far as I can see, the language currently does not provide the facilities to implement the above without autodecoding.5. Very few algorithms require decoding.The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.count!(c => "!()-;:,.?".canFind(c)) // punctuationHowever the following do require autodecoding: s.walkLengthUsage of the result of this expression will be incorrect in many foreseeable cases.s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuationDitto.s.count!(c => c >= 32) // non-control charactersDitto, with a big red flag. If you are dealing with control characters, the code is likely low-level enough that you need to be explicit in what you are counting. It is likely not what actually needs to be counted. Such confusion can lead to security risks.Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.It should be explicit.This is not practical. Do you really see changing std.file and std.path to accept ubyte[] for all path arguments?7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autocode does not play with that.If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.I can confirm this vague subjective observation. For example, DustMite reimplements some std.string functions in order to be able to handle D files with invalid UTF-8 characters.8. In my work with UTF-8 streams, dealing with autodecode has caused me considerably extra work every time. A convenient timesaver it ain't.Objection. Vague.This is neither easy nor practical. It makes writing reliable string handling code a chore in D. Because it is difficult to find all places where this must be done, it is not possible to do on a program-wide scale, thus bugs can only be discovered when this or that component fails because it was not tested with Unicode strings.9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)Why?10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width. Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. 
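For concreteness, a few of the distinctions the language and library already attach to char[] but not to ubyte[] (a sketch, not an exhaustive list):

    import std.utf : validate;

    void main()
    {
        char[]  text  = "héllo".dup;
        ubyte[] bytes = [0x68, 0xC3, 0xA9];

        // The language treats char[] as UTF-8: foreach with a dchar loop
        // variable decodes on the fly; there is no such rule for ubyte[].
        foreach (dchar c; text) { }

        // Library functions such as std.utf.validate accept only character arrays.
        validate(text);          // fine, "héllo" is well-formed UTF-8
        // validate(bytes);      // would not compile: ubyte[] is not a string type

        // And string literals type as arrays of char, never of ubyte.
        static assert(is(typeof("abc") == string));
    }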
Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte? This is a fundamental question for which we need a rigorous answer.What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any other)? If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF8 code point.I don't follow this line of reasoning at all.There is no convincing argument why indexing and slicing should not simply operate on code units.11. Indexing an array produces different results than autodecoding, another glaring special case.This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.I don't follow. Though, making char implicitly convertible to wchar and dchar has clearly been a mistake.May 26 2016On Friday, May 27, 2016 04:31:49 Vladimir Panteleev via Digitalmars-d wrote:On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei AlexandrescuIn addition, as soon as you have ubyte[], none of the string-related functions work. That's fixable, but as it stands, operating on ubyte[] instead of char[] is a royal pain. - Jonathan M DavisThis is neither easy nor practical. It makes writing reliable string handling code a chore in D. Because it is difficult to find all places where this must be done, it is not possible to do on a program-wide scale, thus bugs can only be discovered when this or that component fails because it was not tested with Unicode strings.9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)May 31 2016On 05/31/2016 02:57 PM, Jonathan M Davis via Digitalmars-d wrote:In addition, as soon as you have ubyte[], none of the string-related functions work. That's fixable, but as it stands, operating on ubyte[] instead of char[] is a royal pain.That'd be nice to fix indeed. Please break the ground? -- AndreiMay 31 2016On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:Sounds like you want to say that string should be smarter than an array of code units in dealing with unicode. As I understand, design rationale behind strings being plain arrays of code units is that it's impractical for the string to smarter than array of code units - it just won't cut it, while plain array provides simple and easy to understand implementation of string.11. Indexing an array produces different results than autodecoding, another glaring special case.This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. 
That error predates autodecoding.May 27 2016On 5/27/16 6:26 AM, Kagamin wrote:As I understand, design rationale behind strings being plain arrays of code units is that it's impractical for the string to smarter than array of code units - it just won't cut it, while plain array provides simple and easy to understand implementation of string.That's my understanding too. And I think the design rationale is wrong. -- AndreiMay 27 2016On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:This might be a good time to discuss this a tad further. I'd appreciate if the debate stayed on point going forward. Thanks! My thesis: the D1 design decision to represent strings as char[] was disastrous and probably one of the largest weaknesses of D1. The decision in D2 to use immutable(char)[] for strings is a vast improvement but still has a number of issues. The approach to autodecoding in Phobos is an improvement on that decision.It is not, which has been shown by various posts in this thread. Iterating by code points is at least as wrong as iterating by code units; it can be argued it is worse because it sometimes makes the fact that it's wrong harder to detect.The insistent shunning of a user-defined type to represent strings is not good and we need to rid ourselves of it.While this may be true, it has nothing to do with auto decoding. I assume you would want such a user-define string type to auto-decode as well, right?On 05/12/2016 04:15 PM, Walter Bright wrote:Yes.5. Very few algorithms require decoding.The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.find("abc") s.findSplit("abc") s.findSplit('a')s.count!(c => "!()-;:,.?".canFind(c)) // punctuationIdeally yes, but this is a special case that cannot be detected by `count`.However the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control charactersNo, they do not need _auto_ decoding, they need a decision _by the user_ what they should be decoded to. Code units? Code points? Graphemes? Words? Lines?Currently the standard library operates at code point levelBecause it auto decodes.even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.No one wants to take that second part away. For example, the `find` can provide an overload that accepts `const(char)[]` directly, while `walkLength` doesn't, requiring a decision by the caller.I believe a library type would be more appropriate than bare `ubyte[]`. It should provide conversion between the OS encoding (which can be detected automatically) and UTF strings, for example. And it should be used for any "strings" that comes from outside the program, like main's arguments, env variables...7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autocode does not play with that.If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.This would no longer work if char[] and char ranges were to be treated identically.9. Autodecode cannot be turned off, i.e. 
it isn't practical to avoid importing std.array one way or another, and then autodecode is there.Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)Agreed.10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width. Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte? This is a fundamental question for which we need a rigorous answer. What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any other)?If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF8 code point.Distinguishing them is the right thing to do, but auto decoding is not the way to achieve that, see above.May 27 2016On 5/27/16 6:56 AM, Marc Schütz wrote:It is not, which has been shown by various posts in this thread.Couldn't quite find strong arguments. Could you please be more explicit on which you found most convincing? -- AndreiMay 27 2016On Friday, 27 May 2016 at 13:34:33 UTC, Andrei Alexandrescu wrote:On 5/27/16 6:56 AM, Marc Schütz wrote:There are several possibilities of what iteration over a char range can mean. (For the sake of simplicity, let's ignore special cases like `find` and `split`; instead, let's look at `walkLength`, `retro` and similar.) BEFORE the introduction of auto decoding, it used to iterate over UTF8 code _units_, which is wrong for any non-ASCII data (except for the unlikely case where you really want code units). AFTER the introduction of auto decoding, it iterates over UTF8 code _points_, which is wrong for combined characters, e.g. äöüéòàñ on MacOS X, more "exotic" ones everywhere (except for the even more unlikely case where you really want code points). That is, both the BEFORE and AFTER behaviour are wrong, both break for various kinds of input in different ways. So, is AFTER an improvement over BEFORE? The set of inputs where auto decoding produces wrong output is likely smaller, making it slightly less likely to encounter problems in practice; on the other hand, it's still wrong, and it's harder to find these problems during testing. That's like "improving" a bicycle so that it only breaks down after riding it for 30 minutes instead of just after 10 minutes, so you won't notice it during a test ride. But there are even more possibilities. It could iterate over graphemes, which is expensive, but more likely to produce the results that the user wants. 
Or it could iterate by lines, or words (and there are different ways to define what a word is), and so on. The fundamental problem is choosing one of those possibilities over the others without knowing what the user actually wants, which is what both BEFORE and AFTER do. So, what was the original goal when introducing auto decoding? To improve correctness, right? I would argue that this goal has not been achieved. Have a look at the article [1], which IMO gives good criteria for how a _correct_ string type should behave. Both BEFORE and AFTER fail most of them. [1] https://mortoray.com/2013/11/27/the-string-type-is-broken/It is not, which has been shown by various posts in this thread.Couldn't quite find strong arguments. Could you please be more explicit on which you found most convincing? -- AndreiMay 28 2016On 5/28/16 6:59 AM, Marc Schütz wrote:The fundamental problem is choosing one of those possibilities over the others without knowing what the user actually wants, which is what both BEFORE and AFTER do.OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string, and furthermore iterating for each constituent of a string should be fairly rare. Strings and substrings yes, but not individual points/units/graphemes unless expressly asked. (Indeed some languages treat strings as first-class entities and individual characters are mere short substrings.) So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives. AndreiMay 28 2016On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required. A string class does not do that (from the article: "I admit the correct answer is not always clear").May 28 2016On Saturday, 28 May 2016 at 19:04:14 UTC, Walter Bright wrote:On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:You're right. An "array of code units" is a very useful low-level primitive. I've dealt with a lot of code that uses these (more or less correctly) in various languages. But when providing such a thing, I think it's very important to make it *look* like a low-level primitive, and use the type system to distinguish it from higher-level ones. E.g. A string literal should not implicitly convert into an array of code units. What should it implicitly convert to? I'm not sure. Something close to how it looks in the source code, probably. A sequential range of graphemes? From all the detail in this thread, I wonder now if "a grapheme" is even an unambiguous concept across different environments. But one thing I'm sure of (and this is from other languages/API's, not from D specifically): A function which converts from one representation to another, but doesn't keep track of the change (e.g. Different compile-time type; e.g. State in a "string" class about whether it is in normalized form), is a "bug farm".So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required. 
A string class does not do that (from the article: "I admit the correct answer is not always clear").May 28 2016On Saturday, 28 May 2016 at 22:29:12 UTC, Andrew Godfrey wrote: [snip]From all the detail in this thread, I wonder now if "a grapheme" is even an unambiguous concept across different environments.Unicode graphemes are not always the same as graphemes in natural (written) languages. If <é> is composed in Unicode, it is still one grapheme in a written language, not two distinct characters. However, in natural languages two characters can be one grapheme, as in English <sh>, it represents the sound in `shower, shop, fish`. In German the same sound is represented by three characters <sch> as in `Schaf` ("sheep"). A bit nit-picky but we should make clear that we talk about "Unicode graphemes" that map to single characters on the written page. But is that at all possible across all languages? To avoid confusion and misunderstandings we should agree on the terminology first.May 29 2016On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:Unicode graphemes are not always the same as graphemes in natural (written) languages. If <é> is composed in Unicode, it is still one grapheme in a written language, not two distinct characters. However, in natural languages two characters can be one grapheme, as in English <sh>, it represents the sound in `shower, shop, fish`. In German the same sound is represented by three characters <sch> as in `Schaf` ("sheep"). A bit nit-picky but we should make clear that we talk about "Unicode graphemes" that map to single characters on the written page. But is that at all possible across all languages? To avoid confusion and misunderstandings we should agree on the terminology first.No, this is well established terminology, you are confusing several things here: - A grapheme is a "character" as written on the page - A phoneme is a spoken "character" - A codepoint is the fundamental "unit" of unicode Graphemes are built from one or more codepoints. Phonemes are a different topic and not really covered by the unicode standard AFAIK. Except for the IPA notation, but these are again graphemes that represent phonemes.May 29 2016On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:I am pretty sure that a single grapheme in unicode does not correspond to your notion of "character". I am pretty sure that what you think of as a "character" is officially called "Grapheme Cluster" not "Grapheme". See here: http://www.unicode.org/glossary/#grapheme_clusterUnicode graphemes are not always the same as graphemes in natural (written) languages. If <é> is composed in Unicode, it is still one grapheme in a written language, not two distinct characters. However, in natural languages two characters can be one grapheme, as in English <sh>, it represents the sound in `shower, shop, fish`. In German the same sound is represented by three characters <sch> as in `Schaf` ("sheep"). A bit nit-picky but we should make clear that we talk about "Unicode graphemes" that map to single characters on the written page. But is that at all possible across all languages? To avoid confusion and misunderstandings we should agree on the terminology first.No, this is well established terminology, you are confusing several things here: - A grapheme is a "character" as written on the page - A phoneme is a spoken "character" - A codepoint is the fundamental "unit" of unicode Graphemes are built from one or more codepoints. 
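In D/std.uni terms, the same point (a small sketch; the composed and decomposed forms of 'é' are the classic example):

    import std.range.primitives : walkLength;
    import std.uni;

    void main()
    {
        string composed   = "\u00E9";     // é as a single code point
        string decomposed = "e\u0301";    // e + combining acute: two code points

        assert(composed.walkLength == 1 && decomposed.walkLength == 2);
        assert(composed.byGrapheme.walkLength == 1);
        assert(decomposed.byGrapheme.walkLength == 1);   // one grapheme built from two code points
        assert(decomposed.normalize!NFC == composed);    // same text once normalized
    }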
Phonemes are a different topic and not really covered by the unicode standard AFAIK. Except for the IPA notation, but these are again graphemes that represent phonemes.May 29 2016On Sunday, 29 May 2016 at 12:08:52 UTC, default0 wrote:I am pretty sure that a single grapheme in unicode does not correspond to your notion of "character". I am pretty sure that what you think of as a "character" is officially called "Grapheme Cluster" not "Grapheme".Grapheme is a linguistic term. AFAIUI, a grapheme cluster is a cluster of codepoints representing a grapheme. It's called "cluster" in the unicode spec, because there there is no dedicated grapheme unit. I put "character" into quotes, because the term is not really well defined. I just used it for a short and pregnant answer. I'm sure there's a better/more correct definition of graphem/phoneme, but it's probably also much longer and complicated.May 29 2016On Sunday, 29 May 2016 at 13:04:18 UTC, Tobias M wrote:On Sunday, 29 May 2016 at 12:08:52 UTC, default0 wrote:I am pretty sure that a single grapheme in unicode does not correspond to your notion of "character". I am pretty sure that what you think of as a "character" is officially called "Grapheme Cluster" not "Grapheme".Grapheme is a linguistic term. AFAIUI, a grapheme cluster is a cluster of codepoints representing a grapheme. It's called "cluster" in the unicode spec, because there there is no dedicated grapheme unit.I put "character" into quotes, because the term is not really well defined. I just used it for a short and pregnant answer. I'm sure there's a better/more correct definition of graphem/phoneme, but it's probably also much longer and complicated.Which is why we need to agree on a terminology, i.e. be clear when we use linguistic terms and when we use Unicode specific terminology.May 29 2016On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:Ok, you have a point there, to be precise <sh> is a multigraph (a digraph)(cf. [1]). In French you can have multigraphs consisting of three or more characters <eau> /o/, as in Irish <aoi> => /i:/. However, a phoneme is not necessarily a spoken "character" as <sh> represents one phoneme but consists of two "characters" or graphemes. <th> can represent two different phonemes (voiced and unvoiced "th" as in `this` vs. `thorough`). My point was that we have to be _very_ careful not to mix our cultural experience with written text with machine representations. There's bound to be confusion. That's why we should always make clear what we refer to when we use the words grapheme, character, code point etc. [1] https://en.wikipedia.org/wiki/GraphemeUnicode graphemes are not always the same as graphemes in natural (written) languages. If <é> is composed in Unicode, it is still one grapheme in a written language, not two distinct characters. However, in natural languages two characters can be one grapheme, as in English <sh>, it represents the sound in `shower, shop, fish`. In German the same sound is represented by three characters <sch> as in `Schaf` ("sheep"). A bit nit-picky but we should make clear that we talk about "Unicode graphemes" that map to single characters on the written page. But is that at all possible across all languages? 
To avoid confusion and misunderstandings we should agree on the terminology first.No, this is well established terminology, you are confusing several things here: - A grapheme is a "character" as written on the page - A phoneme is a spoken "character" - A codepoint is the fundamental "unit" of unicode Graphemes are built from one or more codepoints. Phonemes are a different topic and not really covered by the unicode standard AFAIK. Except for the IPA notation, but these are again graphemes that represent phonemes.May 29 2016On Sunday, 29 May 2016 at 12:41:50 UTC, Chris wrote:Ok, you have a point there, to be precise <sh> is a multigraph (a digraph)(cf. [1]). In French you can have multigraphs consisting of three or more characters <eau> /o/, as in Irish <aoi> => /i:/. However, a phoneme is not necessarily a spoken "character" as <sh> represents one phoneme but consists of two "characters" or graphemes. <th> can represent two different phonemes (voiced and unvoiced "th" as in `this` vs. `thorough`).What I meant was, a phoneme is the "character" (smallest unit) in a spoken language, not that it corresponds to a character (whatever that means).My point was that we have to be _very_ careful not to mix our cultural experience with written text with machine representations. There's bound to be confusion. That's why we should always make clear what we refer to when we use the words grapheme, character, code point etc.I used 'character' in quotes, because it's not a well defined therm. Code point, grapheme and phoneme are well defined.May 29 2016On Sun, May 29, 2016 at 01:13:36PM +0000, Tobias M via Digitalmars-d wrote:On Sunday, 29 May 2016 at 12:41:50 UTC, Chris wrote:[...] Calling a phoneme a "character" is misleading. A phoneme is a logical sound unit in a spoken language, whereas a "character" is a unit of written language. The two do not necessarily have a direct correspondence (or even any correspondence whatsoever). In a language like English, whose writing system was codified many hundreds of years ago, the spoken language has sufficiently diverged from the written language (specifically, in the way words are spelt) that the correspondence between the two is complex at best, downright arbitrary at worst. For example, the 'o' in "women" and the 'i' in "fish" map to the same phoneme, the short /i/, in (common dialects of) spoken English, in spite of being two completely different characters. Therefore conflating "character" and "phoneme" is misleading and is only confusing the issue. As far as Unicode is concerned, it is a standard for representing *written* text, not spoken language, so concepts like phonemes aren't even relevant in the first place. Let's not get derailed from the present discussion by confusing the two. T -- What are you when you run out of Monet? Baroque.Ok, you have a point there, to be precise <sh> is a multigraph (a digraph)(cf. [1]). In French you can have multigraphs consisting of three or more characters <eau> /o/, as in Irish <aoi> => /i:/. However, a phoneme is not necessarily a spoken "character" as <sh> represents one phoneme but consists of two "characters" or graphemes. <th> can represent two different phonemes (voiced and unvoiced "th" as in `this` vs. `thorough`).What I meant was, a phoneme is the "character" (smallest unit) in a spoken language, not that it corresponds to a character (whatever that means).May 29 2016On 5/29/2016 5:56 PM, H. S. 
Teoh via Digitalmars-d wrote:As far as Unicode is concerned, it is a standard for representing *written* text, not spoken language, so concepts like phonemes aren't even relevant in the first place. Let's not get derailed from the present discussion by confusing the two.As far as D is concerned, we are not going to invent our own concepts around text that is different from Unicode or redefine Unicode terms. Unicode is what it is, and D is going to work with it.May 29 2016On 5/29/2016 4:47 AM, Tobias Müller wrote:No, this is well established terminology, you are confusing several things here:For D, we should stick with the terminology as defined by Unicode.May 29 2016On 05/28/2016 03:04 PM, Walter Bright wrote:On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:Nope. Not buying it.So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.A string class does not do thatBuying it. -- AndreiMay 30 2016On 30.05.2016 18:01, Andrei Alexandrescu wrote:On 05/28/2016 03:04 PM, Walter Bright wrote:I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:Nope. Not buying it.So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.May 30 2016On 05/30/2016 03:04 PM, Timon Gehr wrote:On 30.05.2016 18:01, Andrei Alexandrescu wrote:Wouldn't D then be seen (and rightfully so) as largely not supporting Unicode, seeing as its many many core generic algorithms seem to randomly work or not on arrays of characters? Wouldn't ranges - the most important artifact of D's stdlib - default for strings on the least meaningful approach to strings (dumb code units)? Would a smattering of Unicode primitives in std.utf and friends entitle us to claim D had dyed Unicode in its wool? (All are not rhetorical.) I.e. wouldn't be in a worse place than now? (This is rhetorical.) The best argument for autodecoding is to contemplate where we'd be without it: the ghetto of Unicode string handling. I'm not going to debate this further (though I'll look for meaningful answers to the questions above). But this thread has been informative in that it did little to change my conviction that autodecoding is a good thing for D, all things considered (i.e. the wrong decision to not encapsulate string as a separate type distinct from bare array of code units). I'd lie if I said it did nothing. It did, but only a little. Funny thing is that's not even what's important. What's important is that autodecoding is here to stay - there's no realistic way to eliminate it from D. So the focus should be making autodecoding the best it could ever be. AndreiOn 05/28/2016 03:04 PM, Walter Bright wrote:I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:Nope. Not buying it.So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.An array of code units provides consistency, predictability, flexibility, and performance. 
It's a solid base upon which the programmer can build what he needs as required.May 30 2016On Mon, May 30, 2016 at 03:28:38PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:On 05/30/2016 03:04 PM, Timon Gehr wrote:They already randomly work or not work on ranges of dchar. I hope we don't have to rehash all the examples of why things that seem to work, like count, filter, map, etc., actually *don't* work outside of a very narrow set of languages. The best of all this is that they *both* don't work properly *and* make your program pay for the performance overhead, even when you're not even using them -- thanks to ubiquitous autodecoding.On 30.05.2016 18:01, Andrei Alexandrescu wrote:Wouldn't D then be seen (and rightfully so) as largely not supporting Unicode, seeing as its many many core generic algorithms seem to randomly work or not on arrays of characters?On 05/28/2016 03:04 PM, Walter Bright wrote:I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:Nope. Not buying it.So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.Wouldn't ranges - the most important artifact of D's stdlib - default for strings on the least meaningful approach to strings (dumb code units)?No, ideally there should *not* be a default range type -- the user needs to specify what he wants to iterate by, whether code unit, code point, or grapheme, etc..Would a smattering of Unicode primitives in std.utf and friends entitle us to claim D had dyed Unicode in its wool? (All are not rhetorical.)I have no idea what this means.I.e. wouldn't be in a worse place than now? (This is rhetorical.) The best argument for autodecoding is to contemplate where we'd be without it: the ghetto of Unicode string handling.I've no idea what you're talking about. Without autodecoding we'd actually have faster string handling, and forcing the user to specify the unit of iteration would actually bring more Unicode-awareness which would improve the quality of string handling code, instead of proliferating today's wrong code that just happens to work in some languages but make a hash of things everywhere else.I'm not going to debate this further (though I'll look for meaningful answers to the questions above). But this thread has been informative in that it did little to change my conviction that autodecoding is a good thing for D, all things considered (i.e. the wrong decision to not encapsulate string as a separate type distinct from bare array of code units). I'd lie if I said it did nothing. It did, but only a little. Funny thing is that's not even what's important. What's important is that autodecoding is here to stay - there's no realistic way to eliminate it from D. So the focus should be making autodecoding the best it could ever be.[...] If I ever had to write string-heavy code, I'd probably fork Phobos just so I can get decent performance. Just sayin'. T -- People walk. Computers run.May 30 2016On 5/30/2016 12:52 PM, H. S. Teoh via Digitalmars-d wrote:If I ever had to write string-heavy code, I'd probably fork Phobos just so I can get decent performance. Just sayin'.When I wrote Warp, the only point of which was speed, I couldn't use phobos because of autodecoding. 
I have since recoded a number of phobos functions so they didn't autodecode, so the situation is better.May 30 2016On Monday, 30 May 2016 at 21:39:00 UTC, Walter Bright wrote:On 5/30/2016 12:52 PM, H. S. Teoh via Digitalmars-d wrote:Two questions: 1. Given you experience with Warp, how hard would it be to clean Phobos up? 2. After recoding a number of Phobos functions, how much code did actually break (yours or someone else's)?.If I ever had to write string-heavy code, I'd probably fork Phobos just so I can get decent performance. Just sayin'.When I wrote Warp, the only point of which was speed, I couldn't use phobos because of autodecoding. I have since recoded a number of phobos functions so they didn't autodecode, so the situation is better.May 31 2016On 5/31/2016 1:57 AM, Chris wrote:1. Given you experience with Warp, how hard would it be to clean Phobos up?It's not hard, it's just a bit tedious.2. After recoding a number of Phobos functions, how much code did actually break (yours or someone else's)?.It's been a while so I don't remember exactly, but as I recall if the API had to change, I created a new overload or a new name, and left the old one as it is. For the std.path functions, I just changed them. While that technically changed the API, I'm not aware of any actual problems it caused. (Decoding file strings is a latent bug anyway, as pointed out elsewhere in this thread. It's a change that had to be made sooner or later.)May 31 2016On 30.05.2016 21:28, Andrei Alexandrescu wrote:On 05/30/2016 03:04 PM, Timon Gehr wrote:In D, enum does not mean enumeration, const does not mean constant, pure is not pure, lazy is not lazy, and char does not mean character.On 30.05.2016 18:01, Andrei Alexandrescu wrote:Wouldn't D then be seen (and rightfully so) as largely not supporting Unicode, seeing as its many many core generic algorithms seem to randomly work or not on arrays of characters?On 05/28/2016 03:04 PM, Walter Bright wrote:I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:Nope. Not buying it.So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.Wouldn't ranges - the most important artifact of D's stdlib - default for strings on the least meaningful approach to strings (dumb code units)?I don't see how that's the least meaningful approach. It's the data that you actually have sitting in memory. It's the data that you can slice and index and get a length for in constant time.Would a smattering of Unicode primitives in std.utf and friends entitle us to claim D had dyed Unicode in its wool? (All are not rhetorical.) ...We should support Unicode by having all the required functionality and properly documenting the data formats used. What is the goal here? I.e. what does a language that has "Unicode dyed in its wool" have that other languages do not? Why isn't it enough to provide data types for UTF8/16/32 and Unicode algorithms operating on them?I.e. wouldn't be in a worse place than now? (This is rhetorical.) The best argument for autodecoding is to contemplate where we'd be without it: the ghetto of Unicode string handling. ...Those questions seem to be mostly marketing concerns. I'm more concerned with whether I find it convenient to use. 
Autodecoding does not improve Unicode support.I'm not going to debate this further (though I'll look for meaningful answers to the questions above). But this thread has been informative in that it did little to change my conviction that autodecoding is a good thing for D, all things considered (i.e. the wrong decision to not encapsulate string as a separate type distinct from bare array of code units). I'd lie if I said it did nothing. It did, but only a little. Funny thing is that's not even what's important. What's important is that autodecoding is here to stay - there's no realistic way to eliminate it from D. So the focus should be making autodecoding the best it could ever be. AndreiSure, I didn't mean to engage in a debate (it seems there is no decision to be made here that might affect me in the future).May 30 2016On 05/30/2016 04:30 PM, Timon Gehr wrote:In D, enum does not mean enumeration, const does not mean constant, pure is not pure, lazy is not lazy, and char does not mean character.My new favorite quote :)May 30 2016On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu wrote:OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a stringYes!So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.If you're proposing a library type, a la RCStr, as an alternative then yeah.May 28 2016On 05/28/2016 03:04 PM, Andrei Alexandrescu wrote:On 5/28/16 6:59 AM, Marc Schütz wrote:Ideally there should not be a way to iterate a (unicode) string at all without explictily stating mode of operations, i.e. struct String { private void[] data; CodeUnitRange byCodeUnit ( ); CodePointRange byCodePoint ( ); GraphemeRange byGrapheme ( ); bool normalize ( ); } (byGrapheme and normalize have rather expensive dependencies so probably better to provide those via UFCS on demand)The fundamental problem is choosing one of those possibilities over the others without knowing what the user actually wants, which is what both BEFORE and AFTER do.OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string.May 29 2016On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu wrote:On 5/28/16 6:59 AM, Marc Schütz wrote:I think this is going too far. It's sufficient if they (= char slices, not ranges) can't be iterated over directly, i.e. aren't input ranges (and maybe don't work with foreach). That would force the user to append .byCodeUnit etc. as needed. This provides a very nice deprecation path, by the way, it's just not clear whether it can be implemented with the way `deprecated` currently works. I.e. deprecate/warn every time auto decoding kicks in, print a nice message to the user, and later remove auto decoding and make isInputRange!string return false.The fundamental problem is choosing one of those possibilities over the others without knowing what the user actually wants, which is what both BEFORE and AFTER do.OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string, and furthermore iterating for each constituent of a string should be fairly rare. Strings and substrings yes, but not individual points/units/graphemes unless expressly asked. (Indeed some languages treat strings as first-class entities and individual characters are mere short substrings.) 
So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.May 30 2016On 05/30/2016 07:58 AM, Marc Schütz wrote:On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu wrote:That's... what I said. -- AndreiOn 5/28/16 6:59 AM, Marc Schütz wrote:I think this is going too far. It's sufficient if they (= char slices, not ranges) can't be iterated over directly, i.e. aren't input ranges (and maybe don't work with foreach).The fundamental problem is choosing one of those possibilities over the others without knowing what the user actually wants, which is what both BEFORE and AFTER do.OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string, and furthermore iterating for each constituent of a string should be fairly rare. Strings and substrings yes, but not individual points/units/graphemes unless expressly asked. (Indeed some languages treat strings as first-class entities and individual characters are mere short substrings.) So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.May 30 2016On Monday, 30 May 2016 at 12:45:27 UTC, Andrei Alexandrescu wrote:That's... what I said. -- AndreiYou said "not arrays", he said "not ranges". So that just means making the std.range.primitives.popFront and front add a constraint if(!isSomeString()). Language built-ins still work, but the library rejects them. Indeed, we could add a deprecated overload then that points people to the other range getter methods (byCodeUnit, byCodePoint, byGrapheme, etc.)... this might be our migration path.May 30 2016On Monday, 30 May 2016 at 12:59:08 UTC, Adam D. Ruppe wrote:On Monday, 30 May 2016 at 12:45:27 UTC, Andrei Alexandrescu wrote:That's a great idea - the compiler should also issue deprecation warnings when I try to do things like: string a = "你好"; a[1]; // deprecation: direct access to a Unicode string is highly error-prone. Please specify the type of access. More details (shortlink) a[1] = "b"; // deprecation: direct index assignment to a Unicode string is ... a.length; // deprecation: a Unicode string has multiple definitions of length. Please specify your iteration (...). More details (shortlink) ... Btw should a[] be an alias for `byCodeUnit` or also trigger a warning?That's... what I said. -- AndreiYou said "not arrays", he said "not ranges". So that just means making the std.range.primitives.popFront and front add a constraint if(!isSomeString()). Language built-ins still work, but the library rejects them. Indeed, we could add a deprecated overload then that points people to the other range getter methods (byCodeUnit, byCodePoint, byGrapheme, etc.)... this might be our migration path.May 30 2016On 05/30/2016 04:35 PM, Seb wrote:That's a great idea - the compiler should also issue deprecation warnings when I try to do things like: string a = "你好"; a[1]; // deprecation: direct access to a Unicode string is highly error-prone. Please specify the type of access. More details (shortlink) a[1] = "b"; // deprecation: direct index assignment to a Unicode string is ... a.length; // deprecation: a Unicode string has multiple definitions of length. Please specify your iteration (...). More details (shortlink) ... Btw should a[] be an alias for `byCodeUnit` or also trigger a warning?All this is only sensible when we move to a dedicated string type that's not just an alias of `immutable(char)[]`. `immutable(char)[]` explicitly is an array of code units. 
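A rough sketch of what such a migration shim might look like; these are hypothetical overloads, not actual Phobos code, shown only to make the deprecation path concrete.

    import std.traits : isNarrowString;

    // Hypothetical shims, NOT actual Phobos code: keep the old entry points
    // compiling but point users at an explicit view instead of auto-decoding.
    deprecated("auto-decoding is going away; use .byCodeUnit, .byDchar or .byGrapheme")
    dchar front(S)(S s) if (isNarrowString!S)
    {
        import std.utf : decode;
        size_t i = 0;
        return decode(s, i);          // decode the first code point, as before
    }

    deprecated("auto-decoding is going away; use .byCodeUnit, .byDchar or .byGrapheme")
    void popFront(S)(ref S s) if (isNarrowString!S)
    {
        import std.utf : stride;
        s = s[stride(s, 0) .. $];     // drop one code point's worth of code units
    }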
It would not be acceptable, in my opinion, if the normal array syntax got broken for it.May 30 2016On Monday, 30 May 2016 at 14:56:36 UTC, ag0aep6g wrote:All this is only sensible when we move to a dedicated string type that's not just an alias of `immutable(char)[]`. `immutable(char)[]` explicitly is an array of code units. It would not be acceptable, in my opinion, if the normal array syntax got broken for it.I agree; most of the troubles have been with auto-decoding. In an ideal world, we'd also want to change the way `length` and `opIndex` work, but if we only fix the range primitives, we've achieved almost as much with fewer compatibility problems.May 30 2016On 5/30/2016 8:34 AM, Marc Schütz wrote:In an ideal world, we'd also want to change the way `length` and `opIndex` work,Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.May 30 2016On 5/30/16 5:51 PM, Walter Bright wrote:On 5/30/2016 8:34 AM, Marc Schütz wrote:That's not an argument. Objects are arrays of bytes, or tuples of their fields, etc. The whole point of encapsulation is superimposing a more structured view on top of the representation. Operating on open-heart representation is risky, and strings are no exception. -- AndreiIn an ideal world, we'd also want to change the way `length` and `opIndex` work,Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.May 30 2016On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:On 5/30/16 5:51 PM, Walter Bright wrote:Consistency is a factual argument, and autodecode is not consistent.On 5/30/2016 8:34 AM, Marc Schütz wrote:That's not an argument.In an ideal world, we'd also want to change the way `length` and `opIndex` work,Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.Objects are arrays of bytes, or tuples of their fields, etc. The whole point of encapsulation is superimposing a more structured view on top of the representation. Operating on open-heart representation is risky, and strings are no exception.If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it. Autodecoding is not it.May 31 2016On Tuesday, 31 May 2016 at 07:56:54 UTC, Walter Bright wrote:On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:+1On 5/30/16 5:51 PM, Walter Bright wrote:Consistency is a factual argument, and autodecode is not consistent.On 5/30/2016 8:34 AM, Marc Schütz wrote:That's not an argument.In an ideal world, we'd also want to change the way `length` and `opIndex` work,Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.Thing is, more info is needed to support unicode properly. Collation for instance.Objects are arrays of bytes, or tuples of their fields, etc. The whole point of encapsulation is superimposing a more structured view on top of the representation. Operating on open-heart representation is risky, and strings are no exception.If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it. Autodecoding is not it.May 31 2016On 5/31/16 3:56 AM, Walter Bright wrote:On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:Consistency with what? 
Consistent with what?On 5/30/16 5:51 PM, Walter Bright wrote:Consistency is a factual argument, and autodecode is not consistent.On 5/30/2016 8:34 AM, Marc Schütz wrote:That's not an argument.In an ideal world, we'd also want to change the way `length` and `opIndex` work,Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges. -- AndreiObjects are arrays of bytes, or tuples of their fields, etc. The whole point of encapsulation is superimposing a more structured view on top of the representation. Operating on open-heart representation is risky, and strings are no exception.If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it.May 31 2016On Tuesday, 31 May 2016 at 15:07:09 UTC, Andrei Alexandrescu wrote:Consistency with what? Consistent with what?It is a slice type. It should work as a slice type. Every other design stink.May 31 2016On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d wrote:On 5/31/16 3:56 AM, Walter Bright wrote:Not exactly. Such a string type does not hide the fact that it's UTF. Rather, it forces you to deal with the fact that its UTF. I have to agree with Walter in that there really isn't a way to automatically handle Unicode correctly and efficiently while hiding the fact that it's doing all of the stuff that has to be done for UTF. That being said, while an array of code units is really what a string should be underneath the hood, having a string type that provides byCodeUnit, byCodePoint, and byGrapheme is an improvement over treating immutable(char)[] as string, even if byCodeUnit returns immutable(char)[], because it forces the programmer to decide what they want to do rather than blindingly operate on immutable(char)[] as if a char were a full character. And as long as it provides access to each level of Unicode, then it's possible for programmers who know what they're doing to efficiently operate on Unicode while simultaneously making it much more obvious to those who don't know what they're doing that they don't know they're doing rather than having them blindly act like char is a full character. There's really no reason why we couldn't define a string type that operated that way while continuing to treat arrays of char the way that we do now in the language, though transitioning to such a scheme is not at all straightforward in terms of avoiding code breakage. Defining a String type would be simple enough, and any function in Phobos which accepted a string could be changed to accept a String, but we'd have problems with many functions which currently returned string, since changing what they returned would break code. But even if Phobos were somehow completly changed over to use a new String type, and even if the string alias were deprecated/removed, we'd still have to deal with arrays of char, wchar, and dchar and run the risk of someone using those and having problems, because they didn't treat them as arrays of code units. We can't really prevent that, just make it so that string/String is something else that makes the Unicode issue obvious so that folks are less likely to blindly treat chars as full characters. But even then, it's not like it would be hard for folks to just use the wrong Unicode level. 
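One concrete way to "use the wrong Unicode level" today: indexing and slicing immutable(char)[] is in code units, so a slice can cut a multi-byte character in half. A small self-contained example:

    void main()
    {
        import std.stdio : writeln;
        import std.utf : validate, UTFException;

        string s = "über";
        auto cut = s[0 .. 1];          // indexes are code units: this cuts the 2-byte 'ü' in half
        try
        {
            validate(cut);
        }
        catch (UTFException e)
        {
            writeln("not valid UTF-8: ", e.msg);
        }
    }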
All we'd really be doing is shoving the issue in their face so that they'd have to acknowledge it on some level and maybe then actually learn enough to operate on Unicode strings correctly. But then again, since all you're really doing at that point is shoving the Unicode issues in folks' faces by not treating strings as ranges or indexable and forcing them to call byCodeUnit, byCodePoint, byGrapheme, etc., I don't know that it actually solves much over treating immutable(char)[] as string. Programmers still have to learn Unicode enough to handle it correctly, just like they do now (whether we have autodecoding or not). And such a string type really doesn't make the Unicode handling any easier. It just make it harder to ignore the Unicode issues. The Unicode problem is a lot like the floating point problems that have been discussed recently. Programmers want it to "just work" without them having to worry about the details, but that really doesn't work, and while the average programmer may not understand either floating point operations or Unicode properly, the average programmer does actually have to work with both on a regular basis. I'm not at all convinced that having string be an alias of immutable(char)[] was a mistake, but having a struct that's not a range may very well be an improvement. It _would_ at least make some of the Unicode issues more obvious, but it doesn't really solve much from what I can see. - Jonathan M DavisIf there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it.It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.May 31 2016On 05/31/2016 12:45 PM, Jonathan M Davis via Digitalmars-d wrote:On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d wrote:How is that different from what I said? -- AndreiOn 5/31/16 3:56 AM, Walter Bright wrote:Not exactly. Such a string type does not hide the fact that it's UTF. Rather, it forces you to deal with the fact that its UTF.If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it.It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.May 31 2016On Tuesday, May 31, 2016 13:01:11 Andrei Alexandrescu via Digitalmars-d wrote:On 05/31/2016 12:45 PM, Jonathan M Davis via Digitalmars-d wrote:wrote:On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-dMy point was that Walter was stating that you can't have a type that hides the fact that it's dealing with Unicode while still being efficient, whereas you mentioned a proposal for a type that does not hide the fact that it's dealing with Unicode. So, you weren't really responding with a type that rebutted Walter's statement. Rather, you responded with a type that attempts to make its Unicode nature more explicit than immutable(char)[]. - Jonathan M DavisHow is that different from what I said? -- AndreiOn 5/31/16 3:56 AM, Walter Bright wrote:Not exactly. Such a string type does not hide the fact that it's UTF. 
Rather, it forces you to deal with the fact that its UTF.If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it.It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.May 31 2016On Monday, 30 May 2016 at 21:51:36 UTC, Walter Bright wrote:On 5/30/2016 8:34 AM, Marc Schütz wrote:So, strings are _implemented_ as arrays of code units. But indiscriminately treating them as such in all situations leads to wrong results (just like arrays of code points would). In an ideal world, the programs someone intuitively writes will do the right thing, and if they can't, they at least refuse to compile. If we agree that it's up to the user whether to iterate over a string by code unit or code points or graphemes, and that we shouldn't arbitrarily choose one of those (except when we know that it's what the user wants), then the same applies to indexing, slicing and counting. On the other hand, changing such low-level things will likely be impractical, that's why I said "In an ideal world".In an ideal world, we'd also want to change the way `length` and `opIndex` work,Why? strings are arrays of code units.All the trouble comes from erratically pretending otherwise.For me, the trouble comes from pretending otherwise _without being told to_. To make sure there are no misunderstandings, here is what is suggested as an alternative to the current situation: * `char[]`, `wchar[]` (and `dchar[]`?) no longer pass `isInputRange`. * Ranges with element type `char`, `wchar`, and `dchar` do pass `isInputRange`. * A bunch of rangeifying helpers are added to `std.string` (I believe they are already there): `byCodePoint`, `byCodeUnit`, `byChar`, `byWchar`, `byDchar`, ... * Algorithms like `find`, `join(er)` get overloads that accept char slices directly. * Built-in operators and `length` of char slices are unchanged. Advantages: * Algorithms that can work _correctly_ without any kind of decoding will do so. * Algorithms that would yield incorrect results won't compile, requiring the user to make a decision regarding the desired element type. * No auto-decoding. => Best performance depending on the actual requirements. => No results that look correct when tested with only precomposed characters but are wrong in the general case. * Behaviour of [] and .length is no worse than today.May 31 2016On Tuesday, 31 May 2016 at 13:33:14 UTC, Marc Schütz wrote:On Monday, 30 May 2016 at 21:51:36 UTC, Walter Bright wrote:If we follow Adam's proposal to deprecate front, back, popFront and popBack, we don't even need to touch the compiler and it's trivial to do so. The proof of concept change needs eight lines. https://github.com/dlang/phobos/pull/4384 Explicitly stating the type of iteration in the 132 places with auto-decoding in Phobos doesn't sound that terrible.[...]So, strings are _implemented_ as arrays of code units. But indiscriminately treating them as such in all situations leads to wrong results (just like arrays of code points would). [...]May 31 2016On 05/31/2016 04:33 PM, Seb wrote:https://github.com/dlang/phobos/pull/4384 Explicitly stating the type of iteration in the 132 places with auto-decoding in Phobos doesn't sound that terrible.After checking some of those 132 places, they are in generic functions that take ranges. std.algorithm.equal, std.range.take - stuff like that. 
That's expected, of course, as the range primitives are used there. But those places are not the ones we'd have to fix. We'd have to fix the code that uses those generic functions on strings.May 31 2016On 5/31/16 10:33 AM, Seb wrote:Explicitly stating the type of iteration in the 132 places with auto-decoding in Phobos doesn't sound that terrible.It is terrible, no two ways about it. We've been very very careful with changes that caused a handful or breakages in Phobos. It really means every D project on the planet will be broken. We can't contemplate that, it's suicide. -- AndreiMay 31 2016On Tuesday, 31 May 2016 at 13:33:14 UTC, Marc Schütz wrote:In an ideal world, the programs someone intuitively writes will do the right thing, and if they can't, they at least refuse to compile. If we agree that it's up to the user whether to iterate over a string by code unit or code points or graphemes, and that we shouldn't arbitrarily choose one of those (except when we know that it's what the user wants), then the same applies to indexing, slicing and counting.If the user doesn't know how he wants to iterate and you leave the decision to the user... erm... it's not going to give correct result :)May 31 2016On Monday, 30 May 2016 at 14:35:03 UTC, Seb wrote:That's a great idea - the compiler should also issue deprecation warnings when I try to do things like:I don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units). Besides, it'd be a much bigger change than the library transition.May 30 2016On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:I don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units).Yup. It isn't hard at all to use arrays of codeunits correctly.May 30 2016On 5/30/16 6:00 PM, Walter Bright wrote:On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:Trouble is, it isn't hard at all to use arrays of codeunits incorrectly, too. -- AndreiI don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units).Yup. It isn't hard at all to use arrays of codeunits correctly.May 30 2016On Tue, May 31, 2016 at 12:13:57AM -0400, Andrei Alexandrescu via Digitalmars-d wrote:On 5/30/16 6:00 PM, Walter Bright wrote:Neither does autodecoding make code anymore correct. It just better hides the fact that the code is wrong. T -- I've been around long enough to have seen an endless parade of magic new techniques du jour, most of which purport to remove the necessity of thought about your programming problem. In the end they wind up contributing one or two pieces to the collective wisdom, and fade away in the rearview mirror. -- Walter BrightOn 5/30/2016 11:25 AM, Adam D. Ruppe wrote:Trouble is, it isn't hard at all to use arrays of codeunits incorrectly, too. -- AndreiI don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units).Yup. It isn't hard at all to use arrays of codeunits correctly.May 30 2016On Tuesday, 31 May 2016 at 06:45:56 UTC, H. S. Teoh wrote:On Tue, May 31, 2016 at 12:13:57AM -0400, Andrei Alexandrescu via Digitalmars-d wrote:Thinking about this a bit more - what algorithms are actually correct when implemented on the level of code units? 
Off the top of my head I can only really think of copying and hashing, since you want to do that on the byte level anyways. I would also think that if you know your strings are normalized in the same normalization form (for example because they come from the same normalized source), you can check two strings for equality on the code unit level, but my understanding of unicode is still quite lacking, so I'm not sure on that.On 5/30/16 6:00 PM, Walter Bright wrote:Neither does autodecoding make code anymore correct. It just better hides the fact that the code is wrong. TOn 5/30/2016 11:25 AM, Adam D. Ruppe wrote:Trouble is, it isn't hard at all to use arrays of codeunits incorrectly, too. -- AndreiI don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units).Yup. It isn't hard at all to use arrays of codeunits correctly.May 31 2016Am Tue, 31 May 2016 07:17:03 +0000 schrieb default0 <Kevin.Labschek gmx.de>:Thinking about this a bit more - what algorithms are actually correct when implemented on the level of code units?Calculating the buffer size of a string, validation and fast versions of general algorithms that can be defined in terms of ASCII, like skipAsciiWhitespace(), splitByComma(), splitByLineAscii().I would also think that if you know your strings are normalized in the same normalization form (for example because they come from the same normalized source), you can check two strings for equality on the code unit level, but my understanding of unicode is still quite lacking, so I'm not sure on that.That's correct. -- MarcoMay 31 2016On Tuesday, May 31, 2016 07:17:03 default0 via Digitalmars-d wrote:Thinking about this a bit more - what algorithms are actually correct when implemented on the level of code units? Off the top of my head I can only really think of copying and hashing, since you want to do that on the byte level anyways. I would also think that if you know your strings are normalized in the same normalization form (for example because they come from the same normalized source), you can check two strings for equality on the code unit level, but my understanding of unicode is still quite lacking, so I'm not sure on that.Equality does not require decoding. Similarly, functions like find don't either. Something like filter generally would, but it's also not particularly normal to filter a string on a by-character basis. You'd probably want to get to at least the word level in that case. To make matters worse, functions like find or splitter are frequently used to look for ASCII delimiters, even when the strings themselves contain Unicode characters. So, even if decoding were necessary when looking for a Unicode character, it's utterly wasteful when the character you're looking for is ASCII. But searching generally does not require decoding so long as the same character is always encoded the same way. So, Unicode normalization _can_ be a problem, but that's a problem with code points as well as code units (since the normalization has to do with the order of code points when multiple code points make up a single grapheme). You'd have to go to the grapheme level to avoid that problem. And that's why at least some of the time, string-processing code is going to need to normalize its strings before doing searches. But the searches themselves can then operate at the code unit level. 
- Jonathan M DavisMay 31 2016On 05/31/2016 12:54 PM, Jonathan M Davis via Digitalmars-d wrote:Equality does not require decoding. Similarly, functions like find don't either. Something like filter generally would, but it's also not particularly normal to filter a string on a by-character basis. You'd probably want to get to at least the word level in that case.It's nice that the stdlib takes care of that.To make matters worse, functions like find or splitter are frequently used to look for ASCII delimiters, even when the strings themselves contain Unicode characters. So, even if decoding were necessary when looking for a Unicode character, it's utterly wasteful when the character you're looking for is ASCII.Good idea. We could overload functions such as find on char, wchar, and dchar. Jonathan, could you look into a PR to do that?But searching generally does not require decoding so long as the same character is always encoded the same way.Yah, a good rule of thumb is to get the same (consistent, heh) results for a given string (including a given normalization) regardless of the encoding used. So e.g. it's nice that walkLength returns the same number for the string whether it's UTF8/16/32. AndreiMay 31 2016Am Tue, 31 May 2016 13:06:16 -0400 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:On 05/31/2016 12:54 PM, Jonathan M Davis via Digitalmars-d wrote:Both "equality" and "find" require byGrapheme. ⇰ The equivalence algorithm first brings both strings to a common normalization form (NFD or NFC), which works on one grapheme cluster at a time and afterwards does the binary comparison. http://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence ⇰ Find would yield false positives for the start of grapheme clusters. I.e. will match 'o' in an NFD "ö" (simplified example). http://www.unicode.org/reports/tr10/#Searching -- MarcoEquality does not require decoding. Similarly, functions like find don't either. Something like filter generally would, but it's also not particularly normal to filter a string on a by-character basis. You'd probably want to get to at least the word level in that case. It's nice that the stdlib takes care of that.May 31 2016On 31-May-2016 01:00, Walter Bright wrote:On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:Ehm as long as all you care for is operating on substrings I'd say.
Working with individual character requires either decoding or clever tricks like operating on encoded UTF directly.I don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units).Yup. It isn't hard at all to use arrays of codeunits correctly.May 31 2016On Tue, May 31, 2016 at 10:47:56PM +0300, Dmitry Olshansky via Digitalmars-d wrote:On 31-May-2016 01:00, Walter Bright wrote:[...] Working on individual characters needs byGrapheme, unless you know beforehand that the character(s) you're working with are ASCII, or fits in a single code unit. About "clever tricks", it's not really that hard. I was thinking that things like s.canFind('Ш') should translate the 'Ш' into a UTF-8 byte sequence, and then do a substring search directly on the encoded string. This way, a large number of single-character algorithms don't even need to decode. The way UTF-8 is designed guarantees that there will not be any false positives. This will eliminate a lot of the current overhead of autodecoding. T -- Klein bottle for rent ... inquire within. -- Stephen MulraneyOn 5/30/2016 11:25 AM, Adam D. Ruppe wrote:Ehm as long as all you care for is operating on substrings I'd say. Working with individual character requires either decoding or clever tricks like operating on encoded UTF directly.I don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units).Yup. It isn't hard at all to use arrays of codeunits correctly.May 31 2016On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote: [snip]I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing.I disagree. "if used naively" shouldn't be the default. A user (naively) expects string algorithms to work as efficiently as possible without overheads. To tell the user later that s/he shouldn't _naively_ have used a certain algorithm provided by the library is a bit cynical. Having to redesign a code base because of hidden behavior is a big turn off, having to go through Phobos to determine where the hidden pitfalls are is not the user's job.Also allow me to point that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.And what if you deal with non-ASCII heavy text? Does the user have to guess an micro-optimize for simple use cases?But how is the user supposed to know without being a core contributor to Phobos? If using a library method that works well in one case can slow down your code in a slightly different case, something is wrong with the language/library design. For simple cases the burden shouldn't be on the user, or, if it is, s/he should be informed about it in order to be able to make well-informed decisions. Personally I wouldn't mind having to decide in each case what I want (provided I have a best practices cheat sheet :)), so I can get the best out of it. But to keep guessing, testing and benchmarking each string handling library function is not good at all. [snip]5. 
Very few algorithms require decoding.The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.find("abc") s.findSplit("abc") s.findSplit('a') s.count!(c => "!()-;:,.?".canFind(c)) // punctuation However the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control characters Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.May 27 2016On 5/27/16 7:19 AM, Chris wrote:On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote: [snip]Misunderstanding.I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing.I disagree."if used naively" shouldn't be the default. A user (naively) expects string algorithms to work as efficiently as possible without overheads.That's what happens with autodecoding.Misunderstanding.Also allow me to point that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.And what if you deal with non-ASCII heavy text? Does the user have to guess an micro-optimize for simple use cases?Misunderstanding. All examples work properly today because of autodecoding. -- AndreiBut how is the user supposed to know without being a core contributor to Phobos?5. Very few algorithms require decoding.The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.find("abc") s.findSplit("abc") s.findSplit('a') s.count!(c => "!()-;:,.?".canFind(c)) // punctuation However the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control characters Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.May 27 2016On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2Misunderstanding. All examples work properly today because of autodecoding. -- AndreiHowever the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control characters Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. 
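The c < 0x80 fast path mentioned above can be sketched as follows; countNonControl is a made-up name mirroring the s.count!(c => c >= 32) example, and this is illustrative rather than Phobos code.

    // Sketch of the c < 0x80 fast path: decode only when the lead code unit
    // says the code point is not ASCII (countNonControl is a made-up name).
    size_t countNonControl(string s)
    {
        import std.utf : decode;

        size_t n;
        for (size_t i = 0; i < s.length; )
        {
            dchar c = s[i];
            if (c < 0x80)
                ++i;                   // ASCII: the code unit is the code point
            else
                c = decode(s, i);      // multi-byte sequence: decode advances i
            if (c >= 32)
                ++n;
        }
        return n;
    }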
Leaving such a decision to the library seems like a wise thing to do.But how is the user supposed to know without being a core contributor to Phobos?May 27 2016On Friday, 27 May 2016 at 13:47:32 UTC, ag0aep6g wrote:I agree. It has happened to me that characters like "é" return length == 2, which has been the cause of some bugs in my code. I'm wiser now, of course, but you wouldn't expect this, if you write if (input.length == 1) speakCharacter(input); // e.g. when spelling a word else processInput(input); The worst thing is that you never know, what's going on under the hood and where autodecode slows you down, unbeknownst to yourself.Misunderstanding. All examples work properly today because of autodecoding. -- AndreiThey only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2May 27 2016On 5/27/16 10:15 AM, Chris wrote:It has happened to me that characters like "é" return length == 2Would normalization make length 1? -- AndreiMay 27 2016On Friday, 27 May 2016 at 18:11:22 UTC, Andrei Alexandrescu wrote:Would normalization make length 1? -- AndreiIn some, but not all cases.May 27 2016On 27-May-2016 21:11, Andrei Alexandrescu wrote:On 5/27/16 10:15 AM, Chris wrote:No, this is not the point of normalization. -- Dmitry OlshanskyIt has happened to me that characters like "é" return length == 2Would normalization make length 1? -- AndreiMay 27 2016On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:On 27-May-2016 21:11, Andrei Alexandrescu wrote:What is? -- AndreiOn 5/27/16 10:15 AM, Chris wrote:No, this is not the point of normalization.It has happened to me that characters like "é" return length == 2Would normalization make length 1? -- AndreiMay 27 2016On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:This video will be helpfull :) https://www.youtube.com/watch?v=n0GK-9f4dl8 It talks about Unicode in C++, but also explains how unicode works.On 27-May-2016 21:11, Andrei Alexandrescu wrote:What is? -- AndreiOn 5/27/16 10:15 AM, Chris wrote:No, this is not the point of normalization.It has happened to me that characters like "é" return length == 2Would normalization make length 1? -- AndreiMay 27 2016On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:1) A grapheme may include several combining characters (such as diacritics) whose order is not supposed to be semantically significant. Normalization sorts them in a standardized way so that string comparisons return the expected result for graphemes which differ only by the internal order of their constituent combining code points. 2) Some graphemes (like accented latin letters) can be represented by a single code point OR a letter followed by a combining diacritic. Normalization either splits them all apart (NFD), or combines them whenever possible (NFC). Again, this is primarily intended to make things like string comparisons work as expected, and perhaps to simplify low-level tasks like graphical rendering of text. (Disclaimer: This is an oversimplification, because nothing about Unicode is ever simple.)No, this is not the point of normalization.What is? -- AndreiMay 27 2016On 28-May-2016 01:04, tsbockman wrote:On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:Quite accurate statement of the goals. 
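Phobos exposes normalization through std.uni.normalize; a small example of the canonical-equivalence point made above:

    import std.uni : normalize, NFC, NFD;

    unittest
    {
        string precomposed = "\u00E9";    // é as a single code point
        string decomposed  = "e\u0301";   // 'e' followed by a combining acute

        assert(precomposed != decomposed);                               // raw code units differ
        assert(normalize!NFC(precomposed) == normalize!NFC(decomposed)); // equal once normalized
        assert(normalize!NFD(precomposed) == decomposed);                // NFD splits it apart
    }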
Normalization is all about having canonical order of combining code points.On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:1) A grapheme may include several combining characters (such as diacritics) whose order is not supposed to be semantically significant. Normalization sorts them in a standardized way so that string comparisons return the expected result for graphemes which differ only by the internal order of their constituent combining code points. 2) Some graphemes (like accented latin letters) can be represented by a single code point OR a letter followed by a combining diacritic. Normalization either splits them all apart (NFD), or combines them whenever possible (NFC). Again, this is primarily intended to make things like string comparisons work as expected, and perhaps to simplify low-level tasks like graphical rendering of text.No, this is not the point of normalization.What is? -- Andrei(Disclaimer: This is an oversimplification, because nothing about Unicode is ever simple.)-- Dmitry OlshanskyMay 28 2016On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:Here is an example about normalization. In Unicode, the grapheme Ä is composed of two code points: A (the ascii A) and the ¨ character. However, one of the goals of unicode was to be backwards compatible with earlier encodings that extended ASCII (codepages). In some codepages, Ä was an actual codepoint. So in some cases you would have the unicode one which is two codepoints and the one from some codepages which would be one. Those should be the same though, i.e. compare the same. In order to do that, there is normalization. What it does is to _expand_ the single codepoint Ä into A + ¨On 27-May-2016 21:11, Andrei Alexandrescu wrote:What is? -- AndreiOn 5/27/16 10:15 AM, Chris wrote:No, this is not the point of normalization.It has happened to me that characters like "é" return length == 2Would normalization make length 1? -- AndreiMay 27 2016On Friday, 27 May 2016 at 22:12:57 UTC, Minas Mina wrote:Those should be the same though, i.e. compare the same. In order to do that, there is normalization. What it does is to _expand_ the single codepoint Ä into A + ¨Unless I'm mistaken, this depends on the form used. For example, in NFKC you'd get the single codepoint Ä. — DavidMay 27 2016On Friday, May 27, 2016 23:16:58 David Nadlinger via Digitalmars-d wrote:On Friday, 27 May 2016 at 22:12:57 UTC, Minas Mina wrote:Yeah. For better or worse, there are different normalization schemes for Unicode. A normalization scheme makes the encodings consistent, but that doesn't mean that each of the different normalization schemes does the same thing, just that if you apply the same normalization scheme to two strings, then all graphemes within those strings will be encoded identically. - Jonathan M DavisThose should be the same though, i.e. compare the same. In order to do that, there is normalization. What it does is to _expand_ the single codepoint Ä into A + ¨Unless I'm mistaken, this depends on the form used. For example, in NFKC you'd get the single codepoint Ä.May 31 2016On Friday, 27 May 2016 at 18:11:22 UTC, Andrei Alexandrescu wrote:On 5/27/16 10:15 AM, Chris wrote:No, I've tried it. I think dchar[] returns one or you check by grapheme.It has happened to me that characters like "é" return length == 2Would normalization make length 1? -- AndreiMay 28 2016On Fri, May 27, 2016 at 03:47:32PM +0200, ag0aep6g via Digitalmars-d wrote:On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:Exactly.
And we just keep getting stuck on this point. It seems that the message just isn't getting through. The unfounded assumption continues to be made that iterating by code point is somehow "correct" by definition and nobody can challenge it. String handling, especially in the standard library, ought to be (1) efficient where possible, and (2) be as correct as possible (meaning, most corresponding to user expectations -- principle of least surprise). If we can't have both, we should at least have one, right? However, the way autodecoding is currently implemented, we have neither. Firstly, it is beyond clear that autodecoding adds a significant amount of overhead, and because it's automatic, it applies to ALL string processing in D. The only way around it is to fight against the standard library and use workarounds to bypass all that meticulously-crafted autodecoding code, begging the question of why we're even spending the effort on said code in the first place. Secondly, it violates the principle of least surprise when the user, given a string of, say, Korean text, discovers that s.count() *doesn't* return the correct answer. Oh, it's "correct", all right, if your definition of correct is "number of Unicode code points", but to a Korean user, such an answer is completely meaningless because it has little correspondence with what he would perceive as the number of "characters" in the string. It might as well be a random number and it would be just as meaningful. It is just as wrong as s.count() returning the number of code units, except that in the current Euro-centric D community the wrong instances are less often encountered and so are often overlooked. But that doesn't change the fact that code that assumes s.count() returns anything remotely meaningful to the user is buggy. Autodecoding into code points only serves to hide the bugs. As has been said before already countless times, autodecoding, as currently implemented, is neither "correct" nor efficient. Iterating by code point is much faster, but more prone to user mistakes; whereas iterating by grapheme more often corresponds with user expectations but performs quite poorly. The current implementation of autodecoding represents the worst of both worlds: it is both inefficient *and* prone to user mistakes, and worse yet, it serves to conceal such user mistakes by giving the false sense of security that because we're iterating by code points we're somehow magically "correct" by definition. The fact of the matter is that if you're going to write Unicode string processing code, you're gonna hafta to know the dirty nitty gritty of Unicode strings, including the fine distinctions between code units, code points, grapheme clusters, etc.. Since this is required knowledge anyway, why not just let the user worry about how to iterate over the string? Let the user choose what best suits his application, whether it's working directly with code units for speed, or iterating over grapheme clusters for correctness (in terms of visual "characters"), instead of choosing the pessimal middle ground that's neither efficient nor correct? T -- Do not reason with the unreasonable; you lose by definition.They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2Misunderstanding. All examples work properly today because of autodecoding. 
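To make the Korean example concrete, a short sketch; the grapheme figure assumes std.uni's segmentation follows the UAX #29 Hangul rules.

    unittest
    {
        import std.range : walkLength;
        import std.uni : byGrapheme;

        // "한" written with conjoining jamo: three code points, one character
        string hangul = "\u1112\u1161\u11AB";

        assert(hangul.length == 9);                 // UTF-8 code units
        assert(hangul.walkLength == 3);             // code points, i.e. what auto-decoding counts
        assert(hangul.byGrapheme.walkLength == 1);  // grapheme clusters (assuming UAX #29 Hangul rules)
    }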
-- AndreiHowever the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control characters Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.But how is the user supposed to know without being a core contributor to Phobos?May 27 2016On 5/27/16 12:40 PM, H. S. Teoh via Digitalmars-d wrote:Exactly. And we just keep getting stuck on this point. It seems that the message just isn't getting through. The unfounded assumption continues to be made that iterating by code point is somehow "correct" by definition and nobody can challenge it.Which languages are covered by code points, and which languages require graphemes consisting of multiple code points? How does normalization play into this? -- AndreiMay 27 2016On 05/27/2016 08:42 PM, Andrei Alexandrescu wrote:Which languages are covered by code points, and which languages require graphemes consisting of multiple code points? How does normalization play into this? -- AndreiI don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that. I think there are scripts that use combining characters extensively, but Unicode also has stuff like combining arrows. Those can make sense in an otherwise plain English text. For example: 'a' + U+20D7 = a⃗. There is no combined character for that, so normalization can't do anything here.May 27 2016On 5/27/16 3:10 PM, ag0aep6g wrote:I don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that.It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- AndreiMay 27 2016On 05/27/2016 09:30 PM, Andrei Alexandrescu wrote:It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- AndreiI think so, yeah. Due to combining characters, code points are similar to code units: a Unicode thing that you need to know about of when working below the human-perceived character (grapheme) level.May 27 2016On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:On 5/27/16 3:10 PM, ag0aep6g wrote:That's what we've been trying to say all along! :-P They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character". T -- English is useful because it is a mess. Since English is a mess, it maps well onto the problem space, which is also a mess, which we call reality. Similarly, Perl was designed to be a mess, though in the nicest of all possible ways. -- Larry WallI don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that.It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- AndreiMay 27 2016On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:That's what we've been trying to say all along!If that's the case things are pretty dire, autodecoding or not. -- AndreiMay 27 2016On Fri, May 27, 2016 at 04:41:09PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:Like it or not, Unicode ain't merely some glorified form of C's ASCII char arrays. It's about time we faced the reality and dealt with it accordingly. 
Trying to sweep the complexities of Unicode under the rug is not doing us any good. T -- The fact that anyone still uses AOL shows that even the presence of options doesn't stop some people from picking the pessimal one. - Mike EllisMay 27 2016On Friday, May 27, 2016 16:41:09 Andrei Alexandrescu via Digitalmars-d wrote:On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:True enough. Correctly handling Unicode in the general case is ridiculously hard - especially if you want to be efficient. We could do everything at the grapheme level to get the correctness, but we'd be so slow that it would be ridiculous. Fortunately, many string algorithms really don't need to care much about Unicode so long as the strings involved are normalized. For instance, a function like find can usually compare code units without decoding anything (though even then, depending on the normalization, you run the risk of finding a part of a character if it involves combining code points - e.g. searching for e could give you the first part of é if it's encoded with the e followed by the accent). But ultimately, fully correct string handling requires having a far better understanding of Unicode than most programmers have. Even the percentage of programmers here that have that level of understanding isn't all that great - though the fact that D supports UTF-8, UTF-16, and UTF-32 the way that it does has led a number of us to dig further into Unicode and learn it better in ways that we probably wouldn't have if all it had was char. It highlights that there is something that needs to be learned to get this right in a way that most languages don't. - Jonathan M DavisThat's what we've been trying to say all along!If that's the case things are pretty dire, autodecoding or not. -- AndreiMay 31 2016On Friday, 27 May 2016 at 19:43:16 UTC, H. S. Teoh wrote:On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:Code points are *the fundamental unit* of unicode. AFAIK most (all?) algorithms in the unicode spec are defined in terms of code points. Sure, some algorithms also work on the code unit level. That can be used as an optimization, but they are still defined on code points.
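Working at the code unit level is indeed often fine for searching: a whole code point's UTF-8 sequence can never be matched by accident, though matching a bare base letter can land inside a decomposed character, as noted above. An illustrative snippet:

    unittest
    {
        import std.algorithm.searching : canFind;
        import std.utf : byCodeUnit;

        // A whole code point's UTF-8 sequence is never matched by accident:
        assert("Широка".byCodeUnit.canFind("Ш".byCodeUnit));

        // ...but matching a bare base letter can land inside a decomposed character:
        string decomposed = "e\u0301";              // 'e' + combining acute, i.e. "é"
        assert(decomposed.byCodeUnit.canFind('e')); // "found" part of the é
    }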
Code points are also abstracting over the different representations (UTF-...), providing a uniform "interface".On 5/27/16 3:10 PM, ag0aep6g wrote:That's what we've been trying to say all along! :-P They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character".I don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that. It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- AndreiMay 29 2016On Sun, May 29, 2016 at 03:55:22PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:On 05/29/2016 09:42 AM, Tobias M wrote:It depends on what you're trying to accomplish. That's the point we're trying to get at. For some operations, working with code points makes the most sense. But for other operations, it does not. There is no one representation that is best for all situations; it needs to be decided on a case-by-case basis. Which is why forcing everything to decode to code points eventually leads to problems. T -- Customer support: the art of getting your clients to pay for your own incompetence.May 29 2016On 05/29/2016 04:47 PM, H. S. Teoh via Digitalmars-d wrote:It depends on what you're trying to accomplish. That's the point we're trying to get at. For some operations, working with code points makes the most sense. But for other operations, it does not. There is no one representation that is best for all situations; it needs to be decided on a case-by-case basis. Which is why forcing everything to decode to code points eventually leads to problems.I see. Again this all to me sounds like "naked arrays of characters are the wrong choice and should have been encapsulated in a dedicated string type". -- AndreiMay 30 2016On Sunday, May 29, 2016 13:47:32 H. S. Teoh via Digitalmars-d wrote:On Sun, May 29, 2016 at 03:55:22PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:Exactly. And even a given function can't necessarily always be defined to use a specific level of Unicode, because whether that's correct or not depends on what the programmer is actually trying to do with the function. And then there are cases where the programmer knows enough about the data that they're dealing with that they're able to operate at a different level of Unicode than would normally be correct. The most obvious example of that is when you know that your strings are pure ASCII, but it's not the only case.
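A small sketch of the "known to be ASCII" case, where the code-unit level is exact; countDots is a hypothetical helper, not a Phobos function.

    import std.algorithm.searching : all, count;
    import std.ascii : isASCII;
    import std.utf : byCodeUnit;

    // Hypothetical helper: valid only because the input is known to be ASCII,
    // so code units, code points and graphemes all coincide.
    size_t countDots(string s)
    {
        assert(s.byCodeUnit.all!isASCII);   // the assumption, stated explicitly
        return s.byCodeUnit.count('.');
    }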
We should strive to make Phobos operate correctly on strings by default where we can, but there are cases where the programmer needs to know enough to specify the behavior that they want, and deciding for them is just going to lead to behavior that happens to be right some of the time while making it hard for code using Phobos to have the correct behavior the rest of the time. And the default behavior that we currently have is inefficient to boot. - Jonathan M DavisSo now code points are good? -- AndreiIt depends on what you're trying to accomplish. That's the point we're trying to get at. For some operations, working with code points makes the most sense. But for other operations, it does not. There is no one representation that is best for all situations; it needs to be decided on a case-by-case basis. Which is why forcing everything to decode to code points eventually leads to problems.May 31 2016On Friday, 27 May 2016 at 19:30:53 UTC, Andrei Alexandrescu wrote:It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- AndreiIt might help to think of code points as being a kind of byte code for a text-representing VM. It's not meaningless, but it also isn't trivial and relevant metrics can only be seen in application. BTW you don't even have to get into unicode to hit complications. Tab, backspace, carriage return, these are part of ASCII but already complicate questions. http://stackoverflow.com/questions/6792812/the-backspace-escape-character-b-in-c-unexpected-behavior came up on a quick search. Does the backspace character reduce the length of a string? In some contexts, maybe.May 27 2016On Fri, May 27, 2016 at 07:53:30PM +0000, Adam D. Ruppe via Digitalmars-d wrote:On Friday, 27 May 2016 at 19:30:53 UTC, Andrei Alexandrescu wrote:Fun fact: on some old Unix boxen, Backspace + underscore was interpreted to mean "underline the previous character". Probably inherited from the old typewriter days. Scarily enough, some Posix terminals may still interpret this sequence this way! An early precursor of Unicode combining diacritics, perhaps? :-D T -- Everybody talks about it, but nobody does anything about it! -- Mark TwainIt seems code points are kind of useless because they don't really mean anything, would that be accurate? -- AndreiIt might help to think of code points as being a kind of byte code for a text-representing VM. It's not meaningless, but it also isn't trivial and relevant metrics can only be seen in application. BTW you don't even have to get into unicode to hit complications. Tab, backspace, carriage return, these are part of ASCII but already complicate questions. http://stackoverflow.com/questions/6792812/the-backspace-escape-character-b-in-c-unexpected-behavior came up on a quick search. Does the backspace character reduce the length of a string? In some contexts, maybe.May 27 2016On 5/27/16 3:30 PM, Andrei Alexandrescu wrote:On 5/27/16 3:10 PM, ag0aep6g wrote:The only unmistakably correct use I can think of is transcoding from one UTF representation to another. That is, in order to transcode from UTF8 to UTF16, I don't need to know anything about character composition. -SteveI don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that.It seems code points are kind of useless because they don't really mean anything, would that be accurate? 
-- AndreiMay 27 2016On Fri, May 27, 2016 at 02:42:27PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:On 5/27/16 12:40 PM, H. S. Teoh via Digitalmars-d wrote:This is a complicated issue; for a full explanation you'll probably want to peruse the Unicode codices. For example: http://www.unicode.org/faq/char_combmark.html But in brief, it's mostly a number of common European languages have 1-to-1 code point to character mapping, as well as Chinese writing. Outside of this narrow set, you're on shaky ground. Examples (that I can think of, there are many others): - Almost all Korean characters are composed of multiple code points. - The Indic languages (which cover quite a good number of Unicode code pages) have ligatures that require multiple code points. - The Thai block contains a series of combining diacritics for vowels and tones. - Hebrew vowel points require multiple code points; - A good number of native American scripts require combining marks, e.g., Navajo. - International Phonetic Alphabet (primarily only for linguistic uses, but could be widespread because it's relevant everywhere language is spoken). - Classical Greek accents (though this is less common, mostly being used only in academic circles). Even within the realm of European languages and languages that use some version of the Latin script, there is an entire block of code points in Unicode (the U+0300 block) dedicated to combining diacritics. A good number of combinations do not have precomposed characters. Now as far as normalization is concerned, it only helps if a particular combination of diacritics on a base glyph have a precomposed form. A large number of the above languages do not have precomposed characters simply because of the sheer number of combinations. The only reason the CJK block actually includes a huge number of precomposed characters was because the rules for combining the base forms are too complex to encode compositionally. Otherwise, most languages with combining diacritics would not have precomposed characters assigned to their respective blocks. In fact, a good number (all?) of precomposed Latin characters were included in Unicode only because they existed in pre-Unicode days and some form of compatibility was desired back when Unicode was still not yet widely adopted. So basically, besides a small number of languages, the idea of 1 code point == 1 character is pretty unworkable. Especially in this day and age of worldwide connectivity. T -- The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!Exactly. And we just keep getting stuck on this point. It seems that the message just isn't getting through. The unfounded assumption continues to be made that iterating by code point is somehow "correct" by definition and nobody can challenge it.Which languages are covered by code points, and which languages require graphemes consisting of multiple code points? How does normalization play into this? -- AndreiMay 27 2016Am Fri, 27 May 2016 15:47:32 +0200 schrieb ag0aep6g <anonymous example.com>:On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:1: Auto-decoding shall ALWAYS do the proper thing 2: Therefor humans shall read text in units of code points 3: OS X is an anomaly and must be purged from this planet 4: Indonesians shall be converted to a sane alphabet 5: He who useth combining diacritics shall burn in hell 6: We shall live in peace and harmony forevermore Let's give this a rest. 
-- MarcoThey only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2Misunderstanding. All examples work properly today because of autodecoding. -- AndreiHowever the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control characters Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.But how is the user supposed to know without being a core contributor to Phobos?May 30 20164: Indonesians* shall be converted to a sane alphabet*Correction: Koreans (2-4 Hangul syllables (code points) form each letter) -- MarcoMay 30 2016On Friday, May 27, 2016 09:40:21 H. S. Teoh via Digitalmars-d wrote:On Fri, May 27, 2016 at 03:47:32PM +0200, ag0aep6g via Digitalmars-d wrote:Exactly. Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct. More, full characters fit in a single code unit, but they still don't all fit. You have to go to the grapheme level for that. IIRC, Andrei talked in TDPL about how UTF-8 was better than UTF-16, because you figured out when you screwed up Unicode handling more quickly, because very few Unicode characters fit in single UTF-8 code unit, whereas many more fit in a single UTF-16 code unit, making it harder to catch errors with UTF-16. Well, we're making the same mistake but with UTF-32 instead of UTF-16. The code is still wrong, but it's that much harder to catch that it's wrong.On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:Exactly. And we just keep getting stuck on this point. It seems that the message just isn't getting through. The unfounded assumption continues to be made that iterating by code point is somehow "correct" by definition and nobody can challenge it. String handling, especially in the standard library, ought to be (1) efficient where possible, and (2) be as correct as possible (meaning, most corresponding to user expectations -- principle of least surprise). If we can't have both, we should at least have one, right? However, the way autodecoding is currently implemented, we have neither.They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2Misunderstanding. All examples work properly today because of autodecoding. -- AndreiHowever the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control characters Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.But how is the user supposed to know without being a core contributor to Phobos?Firstly, it is beyond clear that autodecoding adds a significant amount of overhead, and because it's automatic, it applies to ALL string processing in D. 
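A minimal sketch of what that opt-out looks like in practice; byCodeUnit from std.utf is one of the workarounds referred to below, and the pure-ASCII sample string is an assumed example where the decoding work buys nothing.

import std.algorithm : canFind;
import std.utf : byCodeUnit;

void main()
{
    string s = "plain ASCII text";

    // The string overloads are specified at the code point level: conceptually
    // the haystack is decoded to dchar as it is searched, whether or not the
    // data could ever need it.
    assert(s.canFind("ASCII"));

    // Wrapping both sides in byCodeUnit opts out: the same algorithm now
    // compares raw char code units and no decoding is involved.
    assert(s.byCodeUnit.canFind("ASCII".byCodeUnit));
}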
The only way around it is to fight against the standard library and use workarounds to bypass all that meticulously-crafted autodecoding code, begging the question of why we're even spending the effort on said code in the first place.The standard library has to fight against itself because of autodecoding! The vast majority of the algorithms in Phobos are special-cased on strings in an attempt to get around autodecoding. That alone should highlight the fact that autodecoding is problematic.The fact of the matter is that if you're going to write Unicode string processing code, you're gonna hafta to know the dirty nitty gritty of Unicode strings, including the fine distinctions between code units, code points, grapheme clusters, etc.. Since this is required knowledge anyway, why not just let the user worry about how to iterate over the string? Let the user choose what best suits his application, whether it's working directly with code units for speed, or iterating over grapheme clusters for correctness (in terms of visual "characters"), instead of choosing the pessimal middle ground that's neither efficient nor correct?There is no solution here that's going to be both correct and efficient. Ideally, we either need to provide a fully correct solution that's dog slow, or we need to provide a solution that's efficient but requires that the programmer understand Unicode to write correct code. Right now, we have a slow solution that's incorrect. - Jonathan M DavisMay 31 2016On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- AndreiMay 31 2016On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:Okay. If you have the letter A, it will fit in one UTF-8 code unit, one UTF-16 code unit, and one UTF-32 code unit (so, one code point). assert("A"c.length == 1); assert("A"w.length == 1); assert("A"d.length == 1); If you have 月, then you get assert("月"c.length == 3); assert("月"w.length == 1); assert("月"d.length == 1); whereas if you have 𐀆, then you get assert("𐀆"c.length == 4); assert("𐀆"w.length == 2); assert("𐀆"d.length == 1); So, with these characters, it's clear that UTF-8 and UTF-16 don't cut it for holding an entire character, but it still looks like UTF-32 does. However, what about characters like é or שׂ? Notice that שׂ takes up more than one code point. assert("שׂ"c.length == 4); assert("שׂ"w.length == 2); assert("שׂ"d.length == 2); It's ש with some sort of dot marker on it that they have in Hebrew, but it's a single character in spite of the fact that it's multiple code points. é is in a similar, though more complicated boat. With D, you'll get assert("é"c.length == 2); assert("é"w.length == 1); assert("é"d.length == 1); because the compiler decides to use the version of é that's a single code point. However, Unicode is set up so that that accent can be its own code point and be applied to any other code point - be it an e, an a, or even something like the number 0. If we normalize é, we can see other versions of it that take up more than one code point. e.g. 
assert("é"d.normalize!NFC.length == 1); assert("é"d.normalize!NFD.length == 2); assert("é"d.normalize!NFKC.length == 1); assert("é"d.normalize!NFKD.length == 2); And you can even put that accent on 0 by doing something like assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d); One or more code units combine to make a single code point, but one or more code points also combine to make a grapheme. So, while there is a definite layer of separation between code units and code points, it's still the case that a single code point is not guaranteed to be a single character. You do indeed have encodings with code units and not code points (though those still have different normalizations, which is kind of like having different encodings), but in terms of correctness, you have the same problem with treating code points as characters that you have as treating code units as characters. You're still not guaranteed that you're operating on full characters and risk chopping them up. It's just that at the code point level, you're generally chopping something up that is visually separable (like an accent from a letter or a superscript on a symbol), whereas with code units, you end up with utter garbage when you chop them incorrectly. By operating at the code point level, we're correct for _way_ more characters than we would be than if we treated char like a full character, but we're still not fully correct, and it's a lot harder to notice when you screw it up, because the number of characters which are handled incorrectly is far smaller. - Jonathan M DavisSaying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- AndreiMay 31 2016On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:Does walkLength yield the same number for all representations?On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:Okay. If you have the letter A, it will fit in one UTF-8 code unit, one UTF-16 code unit, and one UTF-32 code unit (so, one code point). assert("A"c.length == 1); assert("A"w.length == 1); assert("A"d.length == 1); If you have 月, then you get assert("月"c.length == 3); assert("月"w.length == 1); assert("月"d.length == 1); whereas if you have 𐀆, then you get assert("𐀆"c.length == 4); assert("𐀆"w.length == 2); assert("𐀆"d.length == 1); So, with these characters, it's clear that UTF-8 and UTF-16 don't cut it for holding an entire character, but it still looks like UTF-32 does.Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- AndreiHowever, what about characters like é or שׂ? Notice that שׂ takes up more than one code point. assert("שׂ"c.length == 4); assert("שׂ"w.length == 2); assert("שׂ"d.length == 2); It's ש with some sort of dot marker on it that they have in Hebrew, but it's a single character in spite of the fact that it's multiple code points. é is in a similar, though more complicated boat. 
With D, you'll get assert("é"c.length == 2); assert("é"w.length == 1); assert("é"d.length == 1); because the compiler decides to use the version of é that's a single code point.Does walkLength yield the same number for all representations?However, Unicode is set up so that that accent can be its own code point and be applied to any other code point - be it an e, an a, or even something like the number 0. If we normalize é, we can see other versions of it that take up more than one code point. e.g. assert("é"d.normalize!NFC.length == 1); assert("é"d.normalize!NFD.length == 2); assert("é"d.normalize!NFKC.length == 1); assert("é"d.normalize!NFKD.length == 2);Does walkLength yield the same number for all representations?And you can even put that accent on 0 by doing something like assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d); One or more code units combine to make a single code point, but one or more code points also combine to make a grapheme.That's right. D's handling of UTF is at the code unit level (like all of Unicode is portably defined). If you want graphemes use byGrapheme. It seems you destroyed your own argument, which was:Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.You can't claim code units are just a special case of code points. AndreiMay 31 2016On 31.05.2016 20:30, Andrei Alexandrescu wrote:D'sPhobos'handling of UTF is at the code unitcode pointlevel (like all of Unicode is portably defined).May 31 2016On 05/31/2016 02:46 PM, Timon Gehr wrote:On 31.05.2016 20:30, Andrei Alexandrescu wrote:foreach, too. -- AndreiD'sPhobos'May 31 2016On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu wrote:On 05/31/2016 02:46 PM, Timon Gehr wrote:Incorrect. https://dpaste.dzfl.pl/ba7a65d59534On 31.05.2016 20:30, Andrei Alexandrescu wrote:foreach, too. -- AndreiD'sPhobos'Jun 01 2016On 06/01/2016 01:35 PM, ZombineDev wrote:On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu wrote:Try typing the iteration variable with "dchar". -- AndreiOn 05/31/2016 02:46 PM, Timon Gehr wrote:Incorrect. https://dpaste.dzfl.pl/ba7a65d59534On 31.05.2016 20:30, Andrei Alexandrescu wrote:foreach, too. -- AndreiD'sPhobos'Jun 01 2016On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu wrote:Try typing the iteration variable with "dchar". -- AndreiOr you can type it as wchar... But important to note: that's opt in, not automatic.Jun 01 2016On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu wrote:On 06/01/2016 01:35 PM, ZombineDev wrote:I think you are not getting my point. This is not autodecoding. There is nothing auto-magic w.r.t. strings in plain foreach. Typing char, wchar or dchar is the same using byChar, byWchar or byDchar - it is opt-in. The only problems are the front, empty and popFront overloads for narrow strings.On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu wrote:Try typing the iteration variable with "dchar". -- AndreiOn 05/31/2016 02:46 PM, Timon Gehr wrote:Incorrect. https://dpaste.dzfl.pl/ba7a65d59534On 31.05.2016 20:30, Andrei Alexandrescu wrote:foreach, too. -- AndreiD'sPhobos'Jun 01 2016On Wednesday, 1 June 2016 at 19:07:26 UTC, ZombineDev wrote:On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu wrote:in std.range.primitives.On 06/01/2016 01:35 PM, ZombineDev wrote:I think you are not getting my point. This is not autodecoding. There is nothing auto-magic w.r.t. strings in plain foreach. 
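A minimal sketch of that distinction (the single 'ä' literal is an assumed example): with char, the loop walks the code units as they are stored; only asking for dchar makes foreach decode on the fly.

void main()
{
    string s = "\u00E4";            // 'ä': one code point, two UTF-8 code units
    size_t units, points;

    foreach (char c; s)  ++units;   // element type matches the array: no decoding
    foreach (dchar c; s) ++points;  // wider element type requested: foreach decodes

    assert(units == 2);
    assert(points == 1);
}

Leaving the element type off behaves like the char version, which is the "opt-in" behavior being described here.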
Typing char, wchar or dchar is the same using byChar, byWchar or byDchar - it is opt-in. The only problems are the front, empty and popFront overloads for narrow strings...On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu wrote:Try typing the iteration variable with "dchar". -- AndreiOn 05/31/2016 02:46 PM, Timon Gehr wrote:Incorrect. https://dpaste.dzfl.pl/ba7a65d59534On 31.05.2016 20:30, Andrei Alexandrescu wrote:foreach, too. -- AndreiD'sPhobos'Jun 01 2016On 06/01/2016 03:07 PM, ZombineDev wrote:This is not autodecoding. There is nothing auto-magic w.r.t. strings in plain foreach.I understand where you're coming from, but it actually is autodecoding. Consider: byte[] a; foreach (byte x; a) {} foreach (short x; a) {} foreach (int x; a) {} That works by means of a conversion short->int. However: char[] a; foreach (char x; a) {} foreach (wchar x; a) {} foreach (dchar x; a) {} The latter two do autodecoding, not coversion as the rest of the language. AndreiJun 01 2016On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:foreach (dchar x; a) {} The latter two do autodecoding, not coversion as the rest of the language.This seems to be a miscommunication with semantics. This is not auto-decoding at all; you're decoding, but there is nothing "auto" about it. This code is an explicit choice by the programmer to do something. On the other hand, using std.range.primitives.front for narrow strings is auto-decoding because the programmer has not made a choice, the choice is made for the programmer.Jun 01 2016On 06/01/2016 05:30 PM, Jack Stouffer wrote:On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:No, this is autodecoding pure and simple. We can't move the goals whenever we don't like where the ball gets. The usual language rules are not applied for strings - they are autodecoded (i.e. there's code generated that magically decodes UTF surprisingly for beginners, in apparent violation of the language rules, and without any user-visible request) by the foreach statement. -- Andreiforeach (dchar x; a) {} The latter two do autodecoding, not coversion as the rest of the language.This seems to be a miscommunication with semantics. This is not auto-decoding at all; you're decoding, but there is nothing "auto" about it. This code is an explicit choice by the programmer to do something.Jun 01 2016On 01.06.2016 23:48, Andrei Alexandrescu wrote:On 06/01/2016 05:30 PM, Jack Stouffer wrote:It does not share most of the characteristics that make Phobos' autodecoding painful in practice.On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:No, this is autodecoding pure and simple. We can't move the goals whenever we don't like where the ball gets.foreach (dchar x; a) {} The latter two do autodecoding, not coversion as the rest of the language.This seems to be a miscommunication with semantics. This is not auto-decoding at all; you're decoding, but there is nothing "auto" about it. This code is an explicit choice by the programmer to do something.The usual language rules are not applied for strings - they are autodecoded (i.e. there's code generated that magically decodes UTF surprisingly for beginners, in apparent violation of the language rules, and without any user-visible request) by the foreach statement. -- AndreiAgreed. 
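A small sketch of the implicit char-to-dchar conversion that the next remark objects to (the 0xC3 value is an assumed example of a lone UTF-8 lead byte):

void main()
{
    char ascii = 'x';
    dchar ok = ascii;            // implicit conversion, harmless for ASCII
    assert(ok == 'x');

    char lead = 0xC3;            // first code unit of a two-unit UTF-8 sequence
    dchar reinterpreted = lead;
    // Compiles and runs, but the result is the unrelated code point U+00C3,
    // not whatever character the original two-unit sequence encoded.
    assert(reinterpreted == '\u00C3');
}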
(But implicit conversion from char to dchar is a bad language rule.)Jun 02 2016On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:On 06/01/2016 03:07 PM, ZombineDev wrote:Regardless of how different people may call it, it's not what this thread is about. Deprecating front, popFront and empty for narrow strings is what we are talking about here. This has little to do with explicit string transcoding in foreach. I don't think anyone has a problem with it, because it is **opt-in** and easy to change to get the desired behavior. On the other hand, trying to prevent Phobos from autodecoding without typesystem defeating hacks like .representation is an uphill battle right now. Removing range autodecoding will also be beneficial for library writers. For example, instead of writing find specializations for char, wchar and dchar needles, it would be much more productive to focus on optimising searching for T in T[] and specializing on element size and other type properties that generic code should care about. Having to specialize for all the char and string types instead of just any types of that size that can be compared bitwise is like programming in a language with no support for generic programing. And like many others have pointed out, it also about correctness. Only the users can decide if searching at code unit, code point or grapheme level (or something else) is right for their needs. A library that pretends that a single interpretation (i.e. code point) is right for every case is a false friend.This is not autodecoding. There is nothing auto-magic w.r.t. strings in plain foreach.I understand where you're coming from, but it actually is autodecoding. Consider: byte[] a; foreach (byte x; a) {} foreach (short x; a) {} foreach (int x; a) {} That works by means of a conversion short->int. However: char[] a; foreach (char x; a) {} foreach (wchar x; a) {} foreach (dchar x; a) {} The latter two do autodecoding, not coversion as the rest of the language. AndreiJun 01 2016On 06/01/2016 06:09 PM, ZombineDev wrote:Regardless of how different people may call it, it's not what this thread is about.Yes, definitely - but then again we can't after each invalidated claim to go "yeah well but that other point stands".Deprecating front, popFront and empty for narrow strings is what we are talking about here.That will not happen. Walter and I consider the cost excessive and the benefit too small.This has little to do with explicit string transcoding in foreach.It is implicit, not explicit.I don't think anyone has a problem with it, because it is **opt-in** and easy to change to get the desired behavior.It's not opt-in. There is no way to tell foreach "iterate this array by converting char to dchar by the usual language rules, no autodecoding". You can if you e.g. use uint for the iteration variable. Same deal as with .representation.On the other hand, trying to prevent Phobos from autodecoding without typesystem defeating hacks like .representation is an uphill battle right now.Characterizing .representation as a typesystem defeating hack is a stretch. What memory safety issues is it introducing? AndreiJun 01 2016On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:Do you mean you agree that range primitives for strings can be changed to stay (auto)decoding to dchar, but require some form of explicit opt-in?Deprecating front, popFront and empty for narrow strings is what we are talking about here.That will not happen. 
Walter and I consider the cost excessive and the benefit too small.This has little to do with explicit string transcoding in foreach.It is implicit, not explicit.Jun 02 2016On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:On 06/01/2016 06:09 PM, ZombineDev wrote:My claim was not invalidated. I just didn't want to waste time arguing about it, because it is off topic. My point was that foreach is a purely language construct that doesn't know about the std.range.primitives module, therefore doesn't use it and therefore foreach doesn't perform **auto**decoding. It does perform explicit decoding because you need to specify a different type of iteration variable to trigger the behavior. If the variable type is not specified, you won't get any decoding (it will instead iterate over the code units).Regardless of how different people may call it, it's not what this thread is about.Yes, definitely - but then again we can't after each invalidated claim to go "yeah well but that other point stands".On the other hand many people think that the cost of using a language (like C++) that has accumulated excessive number of bad design decisions and pitfalls is too high. Keeping bad design decisions alienates existing users and repulses new ones. I know you are in a difficult decision making position, but imagine telling people ten years from now: A) For the last ten years we worked on fixing every bad design and improving all the good ones. That's why we managed to expand our market share/mind share 10x-100x to what we had before. B) This strange feature you need to know about is here because we chose comparability with old code, over building the best language possible. The language managed to continue growing (but not as fast as we hoped) only because of the other good features. You should use this feature and here's a long list of things you need to consider when avoiding it. The majority of D users ten years from now are not yet D users. That's the target group you need to consider. And given the overwhelming support for fixing this problem by the existing users, you need to reevaluate your cost vs benefit metrics. This theme (breaking code) has come up many times before and I think that instead of complaining about the cost, we should focus on lower it with tooling. The problem I currently see is that there is not enough support for building and improving tools like dfix and leveraging them for language/std lib design process.Deprecating front, popFront and empty for narrow strings is what we are talking about here.That will not happen. Walter and I consider the cost excessive and the benefit too small.You need to opt-in by specifying a the type of the iteration variable and that type needs to be different than the typeof(array[0]). That's opt-in in my book.This has little to do with explicit string transcoding in foreach.It is implicit, not explicit.I don't think anyone has a problem with it, because it is **opt-in** and easy to change to get the desired behavior.It's not opt-in.There is no way to tell foreach "iterate this array by converting char to dchar by the usual language rules, no autodecoding". You can if you e.g. use uint for the iteration variable. Same deal as with .representation.Again, off topic. No sane person wants automatic conversion (bitcast) from char to dchar, because dchar gives the impression of a fully decoded code point, which the result of such cast would certainly not provide.Memory safety is not the only benefit of a type system. 
This goal is only a small subset of the larger goal of preventing logical errors and allowing greater expressiveness. You may as well invent a memory safe subset of D that works only ubyte, ushort, uint, ulong and arrays of those types, but I don't think anyone would want to use such language. Using .representation in parts of your code, makes those parts like the aforementioned language that no one wants to use.On the other hand, trying to prevent Phobos from autodecoding without typesystem defeating hacks like .representation is an uphill battle right now.Characterizing .representation as a typesystem defeating hack is a stretch. What memory safety issues is it introducing?Jun 02 2016... B) This strange feature you need to know about is here because we chose comparability with old code, over building the best language possible. The language managed to continue growing (but not as fast as we hoped) only because of the other good features. You should use this feature and here's a long list of things you need to consider when avoiding it.B) This strange feature is here because we chose compatibility with old code, over building the best language possible. The language managed to continue growing (but not as fast as we hoped) only because of the other good features. You shouldn't use this feature because of this and that potential pitfalls and here's a long list of things you need to consider when avoiding it....Jun 02 2016On 06/02/2016 06:42 AM, ZombineDev wrote:On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:Your claim was obliterated, and now you continue arguing it by adjusting term definitions on the fly, while at the same time awesomely claiming to choose the high road by not wasting time to argue it. I should remember the trick :o). Stand with the points that stand, own those that don't.On 06/01/2016 06:09 PM, ZombineDev wrote:My claim was not invalidated. I just didn't want to waste time arguing about it, because it is off topic. My point was that foreach is a purely language construct that doesn't know about the std.range.primitives module, therefore doesn't use it and therefore foreach doesn't perform **auto**decoding. It does perform explicit decoding because you need to specify a different type of iteration variable to trigger the behavior. If the variable type is not specified, you won't get any decoding (it will instead iterate over the code units).Regardless of how different people may call it, it's not what this thread is about.Yes, definitely - but then again we can't after each invalidated claim to go "yeah well but that other point stands".Definitely. It's a fine line to walk; this particular decision is not that much on the edge at all. We must stay with autodecoding.On the other hand many people think that the cost of using a language (like C++) that has accumulated excessive number of bad design decisions and pitfalls is too high. Keeping bad design decisions alienates existing users and repulses new ones.Deprecating front, popFront and empty for narrow strings is what we are talking about here.That will not happen. Walter and I consider the cost excessive and the benefit too small.I know you are in a difficult decision making position, but imagine telling people ten years from now: A) For the last ten years we worked on fixing every bad design and improving all the good ones. That's why we managed to expand our market share/mind share 10x-100x to what we had before.I think we have underperformed and we need to do radically better. 
I'm on lookout for radical new approaches to things all the time. This is for another discussion though.B) This strange feature you need to know about is here because we chose comparability with old code, over building the best language possible. The language managed to continue growing (but not as fast as we hoped) only because of the other good features. You should use this feature and here's a long list of things you need to consider when avoiding it.There are many components to the decision, not only compatibility with old code.The majority of D users ten years from now are not yet D users. That's the target group you need to consider. And given the overwhelming support for fixing this problem by the existing users, you need to reevaluate your cost vs benefit metrics.It's funny that evidence for the "overwhelming" support is the vote of 35 voters, which was cast in terms of percentages. Math is great. ZombineDev, I've been at the top level in the C++ community for many many years, even after I wanted to exit :o). I'm familiar with how the committee that steers C++ works, perspective that is unique in our community - even Walter lacks it. I see trends and patterns. It is interesting how easily a small but very influential priesthood can alienate itself from the needs of the larger community and get into a frenzy over matters that are simply missing the point. This is what's happening here. We worked ourselves to a foam because the creator of the language started a thread entitled "The Case Against Autodecode", whilst fully understanding there is no way to actually eliminate autodecode. The very definition of a useless debate, the kind he and I had agreed to not initiate anymore. It was a mistake. I'm still metaphorically angry at him for it. I admit I started it by asking the question, but Walter shouldn't have answered. Following that, there was blood in the water; any of us loves to improve something by 2% by completely rewiring the thing. A proneness to doing that is why we self-select to be in this community and forum. Meanwhile, I go to conferences. Train and consult at large companies. Dozens every year, cumulatively thousands of people. I talk about D and ask people what it would take for them to use the language. Invariably I hear a surprisingly small number of reasons: * The garbage collector eliminates probably 60% of potential users right off. * Tooling is immature and of poorer quality compared to the competition. * Safety has holes and bugs. * Hiring people who know D is a problem. * Documentation and tutorials are weak. * There's no web services framework (by this time many folks know of D, but of those a shockingly small fraction has even heard of vibe.d). I have strongly argued with Sönke to bundle vibe.d with dmd over one year ago, and also in this forum. There wasn't enough interest. * (On Windows) if it doesn't have a compelling Visual Studio plugin, it doesn't exist. * Let's wait for the "herd effect" (corporate support) to start. * Not enough advantages over the competition to make up for the weaknesses above. There is a second echelon of arguments related to language proper issues, but those collectively count as much less than the above. And "inefficient/poor/error-prone string handling" has NEVER come up. Literally NEVER, even among people who had some familiarity with D and would otherwise make very informed comments about it. Look at reddit and hackernews, too - admittedly other self-selected communities. Language debates often spring about. 
How often is the point being made that D is wanting because of its string support? Nada.This theme (breaking code) has come up many times before and I think that instead of complaining about the cost, we should focus on lower it with tooling. The problem I currently see is that there is not enough support for building and improving tools like dfix and leveraging them for language/std lib design process.Currently dfix is weak because it doesn't do lookup. So we need to make the front end into a library. Daniel said he wants to be on it, but he has two jobs to worry about so he's short on time. There's only so many hours in the day, and I think the right focus is on attacking the matters above.Taking exception to language rules for iteration with dchar is not opt-in.You need to opt-in by specifying a the type of the iteration variable and that type needs to be different than the typeof(array[0]). That's opt-in in my book.This has little to do with explicit string transcoding in foreach.It is implicit, not explicit.I don't think anyone has a problem with it, because it is **opt-in** and easy to change to get the desired behavior.It's not opt-in.It's very on-topic. It's surprising semantics compared to the rest of the language, for which the user needs to be informed.There is no way to tell foreach "iterate this array by converting char to dchar by the usual language rules, no autodecoding". You can if you e.g. use uint for the iteration variable. Same deal as with .representation.Again, off topic.No sane person wants automatic conversion (bitcast) from char to dchar, because dchar gives the impression of a fully decoded code point, which the result of such cast would certainly not provide.void fun(char c) { if (c < 0x80) { // Look ma I'm not a sane person dchar d = c; // conversion is implicit, too ... } }This sounds like "no comeback here so let's insert a filler". Care to substantiate?Memory safety is not the only benefit of a type system. This goal is only a small subset of the larger goal of preventing logical errors and allowing greater expressiveness.On the other hand, trying to prevent Phobos from autodecoding without typesystem defeating hacks like .representation is an uphill battle right now.Characterizing .representation as a typesystem defeating hack is a stretch. What memory safety issues is it introducing?You may as well invent a memory safe subset of D that works only ubyte, ushort, uint, ulong and arrays of those types, but I don't think anyone would want to use such language. Using .representation in parts of your code, makes those parts like the aforementioned language that no one wants to use.I disagree. AndreiJun 02 2016On 02.06.2016 15:06, Andrei Alexandrescu wrote:On 06/02/2016 06:42 AM, ZombineDev wrote:It's not "on the fly". You two were presumably using different definitions of terms all along.On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:Your claim was obliterated, and now you continue arguing it by adjusting term definitions on the fly, while at the same time awesomely claiming to choose the high road by not wasting time to argue it. I should remember the trick :o). Stand with the points that stand, own those that don't.On 06/01/2016 06:09 PM, ZombineDev wrote:My claim was not invalidated. I just didn't want to waste time arguing about it, because it is off topic. 
My point was that foreach is a purely language construct that doesn't know about the std.range.primitives module, therefore doesn't use it and therefore foreach doesn't perform **auto**decoding. It does perform explicit decoding because you need to specify a different type of iteration variable to trigger the behavior. If the variable type is not specified, you won't get any decoding (it will instead iterate over the code units).Regardless of how different people may call it, it's not what this thread is about.Yes, definitely - but then again we can't after each invalidated claim to go "yeah well but that other point stands".Jun 02 2016On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu wrote:Your claim was obliterated, and now you continue arguing it by adjusting term definitions on the fly, while at the same time awesomely claiming to choose the high road by not wasting time to argue it. I should remember the trick :o). Stand with the points that stand, own those that don't.Definitely. It's a fine line to walk; this particular decision is not that much on the edge at all. We must stay with autodecoding.If you are to stay with autodecoding (and I hope you won't) then please, *please*, at least make it decode to graphemes so that it decodes to something that actually have some kind of meaning of its own.I think we have underperformed and we need to do radically better. I'm on lookout for radical new approaches to things all the time. This is for another discussion though. There are many components to the decision, not only compatibility with old code. It's funny that evidence for the "overwhelming" support is the vote of 35 voters, which was cast in terms of percentages. Math is great. ZombineDev, I've been at the top level in the C++ community for many many years, even after I wanted to exit :o). I'm familiar with how the committee that steers C++ works, perspective that is unique in our community - even Walter lacks it. I see trends and patterns. It is interesting how easily a small but very influential priesthood can alienate itself from the needs of the larger community and get into a frenzy over matters that are simply missing the point. This is what's happening here. We worked ourselves to a foam because the creator of the language started a thread entitled "The Case Against Autodecode", whilst fully understanding there is no way to actually eliminate autodecode. The very definition of a useless debate, the kind he and I had agreed to not initiate anymore. It was a mistake. I'm still metaphorically angry at him for it. I admit I started it by asking the question, but Walter shouldn't have answered. Following that, there was blood in the water; any of us loves to improve something by 2% by completely rewiring the thing. A proneness to doing that is why we self-select to be in this community and forum. Meanwhile, I go to conferences. Train and consult at large companies. Dozens every year, cumulatively thousands of people. I talk about D and ask people what it would take for them to use the language. Invariably I hear a surprisingly small number of reasons: * The garbage collector eliminates probably 60% of potential users right off. * Tooling is immature and of poorer quality compared to the competition. * Safety has holes and bugs. * Hiring people who know D is a problem. * Documentation and tutorials are weak. * There's no web services framework (by this time many folks know of D, but of those a shockingly small fraction has even heard of vibe.d). 
I have strongly argued with Sönke to bundle vibe.d with dmd over one year ago, and also in this forum. There wasn't enough interest. * (On Windows) if it doesn't have a compelling Visual Studio plugin, it doesn't exist. * Let's wait for the "herd effect" (corporate support) to start. * Not enough advantages over the competition to make up for the weaknesses above. There is a second echelon of arguments related to language proper issues, but those collectively count as much less than the above. And "inefficient/poor/error-prone string handling" has NEVER come up. Literally NEVER, even among people who had some familiarity with D and would otherwise make very informed comments about it. Look at reddit and hackernews, too - admittedly other self-selected communities. Language debates often spring about. How often is the point being made that D is wanting because of its string support? Nada.I think the real reason about why this isn't mentioned in the critics you mention is that people don't know about it. Most people don't even imagine it can be as broken as it is. Heck, it even took Walter by surprise after years! This thread is the first real discussion we've had about it with proper deconstruction and very reasonnable arguments against it. The only unreasonnable thing here has been your own arguments. I'd like not to point a finger at you but the fact is that you are the only single one defending autodecoding and not with good arguments. Currently autodecoding relies on chance only. (Yes, I call “hoping the text we're manipulating can be represented by dchars” chance.) This cannot be anymore.Currently dfix is weak because it doesn't do lookup. So we need to make the front end into a library. Daniel said he wants to be on it, but he has two jobs to worry about so he's short on time. There's only so many hours in the day, and I think the right focus is on attacking the matters above....AndreiJun 02 2016On Thursday, 2 June 2016 at 13:55:28 UTC, cym13 wrote:If you are to stay with autodecoding (and I hope you won't) then please, *please*, at least make it decode to graphemes so that it decodes to something that actually have some kind of meaning of its own.That would cause just as much - if not more - code breakage as ditching auto-decoding entirely. It would also be considerably slower and more memory-hungry.Jun 02 2016On 06/02/2016 09:55 AM, cym13 wrote:On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu wrote:That's not going to work. A false impression created in this thread has been that code points are useless and graphemes are da bomb. That's not the case even if we ignore the overwhelming issue of changing semantics of existing code.Your claim was obliterated, and now you continue arguing it by adjusting term definitions on the fly, while at the same time awesomely claiming to choose the high road by not wasting time to argue it. I should remember the trick :o). Stand with the points that stand, own those that don't.Definitely. It's a fine line to walk; this particular decision is not that much on the edge at all. We must stay with autodecoding.If you are to stay with autodecoding (and I hope you won't) then please, *please*, at least make it decode to graphemes so that it decodes to something that actually have some kind of meaning of its own.I think the real reason about why this isn't mentioned in the critics you mention is that people don't know about it. Most people don't even imagine it can be as broken as it is.This should be taken at face value - rampant speculation. 
From my experience that's not how these things work.Heck, it even took Walter by surprise after years! This thread is the first real discussion we've had about it with proper deconstruction and very reasonnable arguments against it. The only unreasonnable thing here has been your own arguments. I'd like not to point a finger at you but the fact is that you are the only single one defending autodecoding and not with good arguments.Fair enough. I accept continuous scrutiny of my competency - it comes with the territory.Currently autodecoding relies on chance only. (Yes, I call “hoping the text we're manipulating can be represented by dchars” chance.) This cannot be anymore.The real ticket out of this is RCStr. It solves a major problem in the language (compulsive GC) and also a minor occasional annoyance (autodecoding). AndreiJun 02 2016On Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu wrote:That's not going to work. A false impression created in this thread has been that code points are uselessThey _are_ useless for almost anything you can do with strings. The only places where they should be used are std.uni and std.regex. Again: What is the justification for using code points, in your opinion? Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?Jun 02 2016On 06/02/2016 01:54 PM, Marc Schütz wrote:On Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu wrote:Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without. * s.any!(c => c == 'ö') works only with autodecoding. It returns always false without. * s.balancedParens('〈', '〉') works only with autodecoding. * s.canFind('ö') works only with autodecoding. It returns always false without. * s.commonPrefix(s1) works only if they both use the same encoding; otherwise it still compiles but silently produces an incorrect result. * s.count('ö') works only with autodecoding. It returns always zero without. * s.countUntil(s1) is really odd - without autodecoding, whether it works at all, and the result it returns, depends on both encodings. With autodecoding it always works and returns a number independent of the encodings. * s.endsWith('ö') works only with autodecoding. It returns always false without. * s.endsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings. * s.find('ö') works only with autodecoding. It never finds it without. * s.findAdjacent is a very interesting one. It works with autodecoding, but without it it just does odd things. * s.findAmong(s1) is also interesting. It works only with autodecoding. * s.findSkip(s1) works only if s and s1 have the same encoding. Otherwise it compiles and runs but produces incorrect results. * s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only if s and s1 have the same encoding. Otherwise they compile and run but produce incorrect results. * s.minCount, s.maxCount are unlikely to be terribly useful but with autodecoding it consistently returns the extremum numeric code unit regardless of representation. Without, they just return encoding-dependent and meaningless numbers. * s.minPos, s.maxPos follow a similar semantics. * s.skipOver(s1) only works with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings. 
* s.startsWith('ö') works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings. * s.startsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings. * s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it will span the entire range. === The intent of autodecoding was to make std.algorithm work meaningfully with strings. As it's easy to see I just went through std.algorithm.searching alphabetically and found issues literally with every primitive in there. It's an easy exercise to go forth with the others. AndreiThat's not going to work. A false impression created in this thread has been that code points are uselessThey _are_ useless for almost anything you can do with strings. The only places where they should be used are std.uni and std.regex. Again: What is the justification for using code points, in your opinion? Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?Jun 02 2016On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s. Would actually work with UTF-16 and only combined 'ö's in s, because the combined character fits in a single UTF-16 code unit.Jun 02 2016ag0aep6g <anonymous example.com> wrote:On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:Works if s is normalized appropriately. No?Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s.Jun 02 2016On 02.06.2016 21:26, Andrei Alexandrescu wrote:ag0aep6g <anonymous example.com> wrote:No. assert(!"ö̶".normalize!NFC.any!(c => c== 'ö'));On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:Works if s is normalized appropriately. No?Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s.Jun 02 2016On 06/02/2016 09:26 PM, Andrei Alexandrescu wrote:ag0aep6g <anonymous example.com> wrote:Works when normalized to precomposed characters, yes. That's not a given, of course. When the user is aware enough to normalize their strings that way, then they should be able to call byDchar explicitly. And of course you can't do s.all!(c => c == 'a⃗'), despite a⃗ looking like one character. Need byGrapheme for that.On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:Works if s is normalized appropriately. No?Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s.Jun 02 2016On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). 
...Your 'ö' examples will NOT work reliably with auto-decoded code points, and for nearly the same reason that they won't work with code units; you would have to use byGrapheme. The fact that you still don't get that, even after a dozen plus attempts by the community to explain the difference, makes you unfit to direct Phobos' Unicode support. Please, either go study Unicode until you really understand it, or delegate this issue to someone else.Jun 02 2016On 06/02/2016 03:34 PM, tsbockman wrote:On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:They do work per spec: find this code point. It would be surprising if 'ö' were found but the string were positioned at a different code point.Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). ...Your 'ö' examples will NOT work reliably with auto-decoded code points, and for nearly the same reason that they won't work with code units; you would have to use byGrapheme.The fact that you still don't get that, even after a dozen plus attempts by the community to explain the difference, makes you unfit to direct Phobos' Unicode support.Well there's gotta be a reason why my basic comprehension is under constant scrutiny whereas yours is safe.Please, either go study Unicode until you really understand it, or delegate this issue to someone else.Would be happy to. To whom would I delegate? AndreiJun 02 2016On Thursday, 2 June 2016 at 20:13:14 UTC, Andrei Alexandrescu wrote:On 06/02/2016 03:34 PM, tsbockman wrote:If there were to be a unicode lieutenant, Dmitry seems to be the obvious choice (if he's interested).[...]They do work per spec: find this code point. It would be surprising if 'ö' were found but the string were positioned at a different code point.[...]Well there's gotta be a reason why my basic comprehension is under constant scrutiny whereas yours is safe.[...]Would be happy to. To whom would I delegate? AndreiJun 02 2016On 06/02/2016 10:13 PM, Andrei Alexandrescu wrote:They do work per spec: find this code point. It would be surprising if 'ö' were found but the string were positioned at a different code point.The "spec" here is how the range primitives for narrow strings are defined, right? I.e., the spec says auto-decode code units to code points. The discussion is about whether the spec is good or bad. No one is arguing that there are bugs in the decoding to code points. People are arguing that auto-decoding to code points is not useful.Jun 02 2016On 06/02/2016 04:23 PM, ag0aep6g wrote:People are arguing that auto-decoding to code points is not useful.And want to return to the point where char[] is but an indiscriminated array, which would take std.algorithm back to the stone age. -- AndreiJun 02 2016On Thursday, 2 June 2016 at 20:30:34 UTC, Andrei Alexandrescu wrote:On 06/02/2016 04:23 PM, ag0aep6g wrote:Just make RCStr the most amazing string type of any standard library ever and everyone will be happy :o)People are arguing that auto-decoding to code points is not useful.And want to return to the point where char[] is but an indiscriminated array, which would take std.algorithm back to the stone age. -- AndreiJun 02 2016On 06/02/2016 04:37 PM, default0 wrote:On Thursday, 2 June 2016 at 20:30:34 UTC, Andrei Alexandrescu wrote:Soon as this thread ends. 
-- AndreiOn 06/02/2016 04:23 PM, ag0aep6g wrote:Just make RCStr the most amazing string type of any standard library ever and everyone will be happy :o)People are arguing that auto-decoding to code points is not useful.And want to return to the point where char[] is but an indiscriminated array, which would take std.algorithm back to the stone age. -- AndreiJun 02 2016On 06/02/2016 10:30 PM, Andrei Alexandrescu wrote:And want to return to the point where char[] is but an indiscriminated array, which would take std.algorithm back to the stone age. -- AndreiI think you'd have to substantiate how that would be worse than auto-decoding. Your examples only show that treating code points as characters falls apart at a higher level than treating code units as characters. But it still falls apart. Failing early is a quality.Jun 02 2016On 06/02/2016 04:47 PM, ag0aep6g wrote:On 06/02/2016 10:30 PM, Andrei Alexandrescu wrote:I gave a long list of std.algorithm uses that perform virtually randomly on char[].And want to return to the point where char[] is but an indiscriminated array, which would take std.algorithm back to the stone age. -- AndreiI think you'd have to substantiate how that would be worse than auto-decoding.Your examples only show that treating code points as characters falls apart at a higher level than treating code units as characters. But it still falls apart. Failing early is a quality.It does not fall apart for code points. AndreiJun 02 2016On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote:It does not fall apart for code points.Yes it does. You've been given plenty examples where it falls apart. Your answer to that was that it operates on code points, not graphemes. Well, duh. Comparing UTF-8 code units against each other works, too. That's not an argument for doing that by default.Jun 02 2016On 6/2/16 5:01 PM, ag0aep6g wrote:On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote:There weren't any.It does not fall apart for code points.Yes it does. You've been given plenty examples where it falls apart.Your answer to that was that it operates on code points, not graphemes.That is correct.Well, duh. Comparing UTF-8 code units against each other works, too. That's not an argument for doing that by default.Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level. AndreiJun 02 2016On 02.06.2016 23:06, Andrei Alexandrescu wrote:As the examples show, the examples would be entirely meaningless at code unit level.So far, I needed to count the number of characters 'ö' inside some string exactly zero times, but I wanted to chain or join strings relatively often.Jun 02 2016On 02.06.2016 23:16, Timon Gehr wrote:On 02.06.2016 23:06, Andrei Alexandrescu wrote:(Obviously this isn't even what the example would do. I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.)As the examples show, the examples would be entirely meaningless at code unit level.So far, I needed to count the number of characters 'ö' inside some string exactly zero times,but I wanted to chain or join strings relatively often.Jun 02 2016On 6/2/16 5:19 PM, Timon Gehr wrote:On 02.06.2016 23:16, Timon Gehr wrote:You may look for a specific dchar, and it'll work. How about findAmong("...") with a bunch of ASCII and Unicode punctuation symbols? -- AndreiOn 02.06.2016 23:06, Andrei Alexandrescu wrote:(Obviously this isn't even what the example would do. 
I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.)As the examples show, the examples would be entirely meaningless at code unit level.So far, I needed to count the number of characters 'ö' inside some string exactly zero times,Jun 02 2016On 02.06.2016 23:23, Andrei Alexandrescu wrote:On 6/2/16 5:19 PM, Timon Gehr wrote:.̂ ̪.̂ (Copy-paste it somewhere else, I think it might not be rendered correctly on the forum.) The point is that if I do: ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")]) no match is returned. If I use your method with dchars, I will get spurious matches. I.e. the suggested method to look for punctuation symbols is incorrect: writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂" (Also, do you have an use case for this?)On 02.06.2016 23:16, Timon Gehr wrote:You may look for a specific dchar, and it'll work. How about findAmong("...") with a bunch of ASCII and Unicode punctuation symbols? -- AndreiOn 02.06.2016 23:06, Andrei Alexandrescu wrote:(Obviously this isn't even what the example would do. I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.)As the examples show, the examples would be entirely meaningless at code unit level.So far, I needed to count the number of characters 'ö' inside some string exactly zero times,Jun 02 2016On 6/2/16 5:43 PM, Timon Gehr wrote:.̂ ̪.̂ (Copy-paste it somewhere else, I think it might not be rendered correctly on the forum.) The point is that if I do: ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")]) no match is returned. If I use your method with dchars, I will get spurious matches. I.e. the suggested method to look for punctuation symbols is incorrect: writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"Nice example.(Also, do you have an use case for this?)Count delimited words. Did you also look at balancedParens? AndreiJun 02 2016On 02.06.2016 23:46, Andrei Alexandrescu wrote:On 6/2/16 5:43 PM, Timon Gehr wrote:Thanks! :o).̂ ̪.̂ (Copy-paste it somewhere else, I think it might not be rendered correctly on the forum.) The point is that if I do: ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")]) no match is returned. If I use your method with dchars, I will get spurious matches. I.e. the suggested method to look for punctuation symbols is incorrect: writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"Nice example. ...On 02.06.2016 22:01, Timon Gehr wrote:(Also, do you have an use case for this?)Count delimited words. Did you also look at balancedParens? Andreiassert("⟨⃖".normalize!NFC.byGrapheme.balancedParens(Grapheme("⟨"),Grapheme("⟩"))); writeln("⟨⃖".balancedParens('⟨','⟩')); // false* s.balancedParens('〈', '〉') works only with autodecoding. ...Doesn't work, e.g. s="⟨⃖". Shouldn't compile.Jun 02 2016On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level.They're simply not possible. Won't compile. There is no single UTF-8 code unit for 'ö', so you can't (easily) search for it in a range for code units. Just like there is no single code point for 'a⃗' so you can't search for it in a range of code points. You can still search for 'a', and 'o', and the rest of ASCII in a range of code units.Jun 02 2016On 06/02/2016 11:24 PM, ag0aep6g wrote:They're simply not possible. Won't compile. 
There is no single UTF-8 code unit for 'ö', so you can't (easily) search for it in a range for code units. Just like there is no single code point for 'a⃗' so you can't search for it in a range of code points. You can still search for 'a', and 'o', and the rest of ASCII in a range of code units.I'm ignoring combining characters there. You can search for 'a' in code units in the same way that you can search for 'ä' in code points. I.e., more or less, depending on how serious you are about combining characters.Jun 02 2016On 6/2/16 5:24 PM, ag0aep6g wrote:On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:They do compile.Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level.They're simply not possible. Won't compile.There is no single UTF-8 code unit for 'ö', so you can't (easily) search for it in a range for code units.Of course you can. Can you search for an int in a short[]? Oh yes you can. Can you search for a dchar in a char[]? Of course you can. Autodecoding also gives it meaning.Just like there is no single code point for 'a⃗' so you can't search for it in a range of code points.Of course you can.You can still search for 'a', and 'o', and the rest of ASCII in a range of code units.You can search for a dchar in a char[] because you can compare an individual dchar with either another dchar (correct, autodecoding) or with a char (incorrect, no autodecoding). As I said: this thread produces an unpleasant amount of arguments in favor of autodecoding. Even I don't like that :o). AndreiJun 02 2016On 6/2/16 5:27 PM, Andrei Alexandrescu wrote:On 6/2/16 5:24 PM, ag0aep6g wrote:Correx, indeed you can't. -- AndreiJust like there is no single code point for 'a⃗' so you can't search for it in a range of code points.Of course you can.Jun 02 2016On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote:On 6/2/16 5:24 PM, ag0aep6g wrote:Yes, you're right, of course they do. char implicitly converts to dchar. I didn't think of that anti-feature.On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:They do compile.Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level.They're simply not possible. Won't compile.As I said: this thread produces an unpleasant amount of arguments in favor of autodecoding. Even I don't like that :o).It's more of an argument against char : dchar, I'd say.Jun 02 2016On 6/2/16 5:35 PM, ag0aep6g wrote:On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote:I do think that's an interesting option in PL design space, but that would be super disruptive. -- AndreiOn 6/2/16 5:24 PM, ag0aep6g wrote:Yes, you're right, of course they do. char implicitly converts to dchar. I didn't think of that anti-feature.On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:They do compile.Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level.They're simply not possible. Won't compile.As I said: this thread produces an unpleasant amount of arguments in favor of autodecoding. Even I don't like that :o).It's more of an argument against char : dchar, I'd say.Jun 02 2016On Thursday, 2 June 2016 at 20:13:14 UTC, Andrei Alexandrescu wrote:On 06/02/2016 03:34 PM, tsbockman wrote:Your examples will pass or fail depending on how (and whether) the 'ö' grapheme is normalized. They only ever succeeds because 'ö' happens to be one of the privileged graphemes that *can* be (but often isn't!) 
represented as a single code point. Many other graphemes have no such representation. Working directly with code points is sometimes useful anyway - but then, working with code units can be, also. Neither will lead to inherently "correct" Unicode processing, and in the absence of a compelling context, your examples fall completely flat as an argument for the inherent superiority of processing at the code unit level.Your 'ö' examples will NOT work reliably with auto-decoded code points, and for nearly the same reason that they won't work with code units; you would have to use byGrapheme.They do work per spec: find this code point. It would be surprising if 'ö' were found but the string were positioned at a different code point.Who said mine is safe? I *know* that I'm not qualified to be in charge of this. Your comprehension is under greater scrutiny because you are proposing to overrule nearly all other active contributors combined.The fact that you still don't get that, even after a dozen plus attempts by the community to explain the difference, makes you unfit to direct Phobos' Unicode support.Well there's gotta be a reason why my basic comprehension is under constant scrutiny whereas yours is safe.If you're serious, I would suggest Dmitry Olshansky. He seems to be our top Unicode expert, based on his contributions to `std.uni` and `std.regex`. But, if he is unwilling/unsuitable for some reason there are other candidates participating in this thread (not me).Please, either go study Unicode until you really understand it, or delegate this issue to someone else.Would be happy to. To whom would I delegate?Jun 02 2016On 06/02/2016 04:36 PM, tsbockman wrote:Your examples will pass or fail depending on how (and whether) the 'ö' grapheme is normalized.And that's fine. Want graphemes, .byGrapheme wags its tail in that corner. Otherwise, you work on code points which is a completely meaningful way to go about things. What's not meaningful is the random results you get from operating on code units.They only ever succeeds because 'ö' happens to be one of the privileged graphemes that *can* be (but often isn't!) represented as a single code point. Many other graphemes have no such representation.Then there's no dchar for them so no problem to start with. s.find(c) ----> "Find code unit c in string s" AndreiJun 02 2016On Thu, Jun 02, 2016 at 04:38:28PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:On 06/02/2016 04:36 PM, tsbockman wrote:[...] This is a ridiculous argument. We might as well say, "there's no single byte UTF-8 that can represent Ш, so that's no problem to start with" -- since we can just define it away by saying s.find(c) == "find byte c in string s", and thereby justify using ASCII as our standard string representation. The point is that dchar is NOT ENOUGH TO REPRESENT A SINGLE CHARACTER in the general case. It is adequate for a subset of characters -- just like ASCII is also adequate for a subset of characters. If you only need to work with ASCII, it suffices to work with ubyte[]. Similarly, if your work is restricted to only languages without combining diacritics, then a range of dchar suffices. But a range of dchar is NOT good enough in the general case, and arguing that it does only makes you look like a fool. Appealing to normalization doesn't change anything either, since only a subset of base character + diacritic combinations will normalize to a single code point. 
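To make that concrete, here is a minimal D sketch (the strings are purely illustrative; it uses std.uni.normalize and std.algorithm's canFind):

import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : normalize, NFC;

void main()
{
    string composed   = "bl\u00F6d";   // 'ö' as the precomposed code point U+00F6
    string decomposed = "blo\u0308d";  // 'ö' as 'o' plus combining diaeresis U+0308

    writeln(composed.canFind('ö'));                  // true
    writeln(decomposed.canFind('ö'));                // false: the code points differ
    writeln(decomposed.normalize!NFC.canFind('ö'));  // true again after NFC

    // Normalization only helps where a precomposed code point exists;
    // 'x' plus combining circumflex has none, so it stays two code points:
    writeln("x\u0302".normalize!NFC.walkLength);     // 2
}
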
If the string has a base character + diacritic combination doesn't have a precomposed code point, it will NOT fit in a dchar. (And keep in mind that the notion of diacritic is still very Euro-centric. In Korean, for example, a single character is composed of multiple parts, each of which occupies 1 code point. While some precomposed combinations do exist, they don't cover all of the possibilities, so normalization won't help you there.) T -- Frank disagreement binds closer than feigned agreement.Your examples will pass or fail depending on how (and whether) the 'ö' grapheme is normalized.And that's fine. Want graphemes, .byGrapheme wags its tail in that corner. Otherwise, you work on code points which is a completely meaningful way to go about things. What's not meaningful is the random results you get from operating on code units.They only ever succeeds because 'ö' happens to be one of the privileged graphemes that *can* be (but often isn't!) represented as a single code point. Many other graphemes have no such representation.Then there's no dchar for them so no problem to start with. s.find(c) ----> "Find code unit c in string s"Jun 02 2016On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.* s.any!(c => c == 'ö') works only with autodecoding. It returns always false without.False. (while this is pretty much the same as 1, one can come up with with as many example as wished by tweaking the same one to produce endless variations).* s.balancedParens('〈', '〉') works only with autodecoding.Not sure, so I'll say OK.* s.canFind('ö') works only with autodecoding. It returns always false without.False.* s.commonPrefix(s1) works only if they both use the same encoding; otherwise it still compiles but silently produces an incorrect result.False.* s.count('ö') works only with autodecoding. It returns always zero without.False.* s.countUntil(s1) is really odd - without autodecoding, whether it works at all, and the result it returns, depends on both encodings. With autodecoding it always works and returns a number independent of the encodings.False.* s.endsWith('ö') works only with autodecoding. It returns always false without.False.* s.endsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.False.* s.find('ö') works only with autodecoding. It never finds it without.False.* s.findAdjacent is a very interesting one. It works with autodecoding, but without it it just does odd things.Not sure so I'll say OK, while I strongly suspect that, like for other, this will only work if string are normalized.* s.findAmong(s1) is also interesting. It works only with autodecoding.False.* s.findSkip(s1) works only if s and s1 have the same encoding. Otherwise it compiles and runs but produces incorrect results.False.* s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only if s and s1 have the same encoding. 
Otherwise they compile and run but produce incorrect results.False.* s.minCount, s.maxCount are unlikely to be terribly useful but with autodecoding it consistently returns the extremum numeric code unit regardless of representation. Without, they just return encoding-dependent and meaningless numbers.Note sure, so I'll say ok.* s.minPos, s.maxPos follow a similar semantics.Note sure, so I'll say ok.* s.skipOver(s1) only works with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.False.* s.startsWith('ö') works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.False.* s.startsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.False.* s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it will span the entire range.False.=== The intent of autodecoding was to make std.algorithm work meaningfully with strings. As it's easy to see I just went through std.algorithm.searching alphabetically and found issues literally with every primitive in there. It's an easy exercise to go forth with the others. AndreiI mean what a trainwreck. Your examples are saying it all doesn't it ? Almost none of them would work without normalizing the string first. And that is the point you've been refusing to hear so far. autodecoding doesn't pay for itself as it is unable to do what it is supposed to do in the general case. Really, there is not much you can do with anything unicode related without first going through normalization. If you want anything more than searching substring or alike, you'll also need a collation, that is locale dependent (for sorting for instance). Supporting unicode, IMO, would be to provide facilities to normalize (preferably lazilly as a range), to manage collations, and so on. Decoding to codepoints just don't cut it. As a result, any algorithm that need to support string need to either fight against the language because it doesn't need decoding, use decoding and assume to be incorrect without normalization or do the correct thing by itself (which is also going to require to work against the language).Jun 02 2016On 06/02/2016 03:34 PM, deadalnix wrote:On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:True. "Are all code points equal to this one?" -- AndreiPretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.False.Jun 02 2016On 02.06.2016 22:13, Andrei Alexandrescu wrote:On 06/02/2016 03:34 PM, deadalnix wrote:I.e. you are saying that 'works' means 'operates on code points'.On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:True. "Are all code points equal to this one?" -- AndreiPretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.False.Jun 02 2016On 06/02/2016 04:17 PM, Timon Gehr wrote:I.e. you are saying that 'works' means 'operates on code points'.Affirmative. -- AndreiJun 02 2016On Thu, Jun 02, 2016 at 04:28:45PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:On 06/02/2016 04:17 PM, Timon Gehr wrote:Again, a ridiculous position. I can use exactly the same line of argument for why we should just standardize on ASCII. 
All I have to do is to define "work" to mean "operates on an ASCII character", and then every ASCII algorithm "works" by definition, so nobody can argue with me. Unfortunately, everybody else's definition of "work" is different from mine, so the argument doesn't hold water. Similarly, you are the only one whose definition of "work" means "operates on code points". Basically nobody else here uses that definition, so while you may be right according to your own made-up tautological arguments, none of your conclusions actually have any bearing in the real world of Unicode handling. Give it up. It is beyond reasonable doubt that autodecoding is a liability. D should be moving away from autodecoding instead of clinging to historical mistakes in the face of overwhelming evidence. (And note, I said *auto*-decoding; decoding by itself obviously is very relevant. But it needs to be opt-in because of its performance and correctness implications. The user needs to be able to choose whether to decode, and how to decode.) T -- Freedom: (n.) Man's self-given right to be enslaved by his own depravity.I.e. you are saying that 'works' means 'operates on code points'.Affirmative. -- AndreiJun 02 2016On Thursday, 2 June 2016 at 20:13:52 UTC, Andrei Alexandrescu wrote:On 06/02/2016 03:34 PM, deadalnix wrote:A:“We should decode to code points” B:“No, decoding to code points is a stupid idea.” A:“No it's not!” B:“Can you show a concrete example where it does something useful?” A:“Sure, look at that!” B:“This isn't working at all, look at all those counter-examples!” A:“It may not work for your examples but look how easy it is to find code points!” *Sigh*On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:True. "Are all code points equal to this one?" -- AndreiPretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.False.Jun 02 2016On 06/02/2016 04:22 PM, cym13 wrote:A:“We should decode to code points” B:“No, decoding to code points is a stupid idea.” A:“No it's not!” B:“Can you show a concrete example where it does something useful?” A:“Sure, look at that!” B:“This isn't working at all, look at all those counter-examples!” A:“It may not work for your examples but look how easy it is to find code points!”With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- AndreiJun 02 2016On 02.06.2016 22:29, Andrei Alexandrescu wrote:On 06/02/2016 04:22 PM, cym13 wrote:No, without it, it operates correctly on code units.A:“We should decode to code points” B:“No, decoding to code points is a stupid idea.” A:“No it's not!” B:“Can you show a concrete example where it does something useful?” A:“Sure, look at that!” B:“This isn't working at all, look at all those counter-examples!” A:“It may not work for your examples but look how easy it is to find code points!”With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- AndreiJun 02 2016On Thursday, 2 June 2016 at 20:29:48 UTC, Andrei Alexandrescu wrote:On 06/02/2016 04:22 PM, cym13 wrote:Allow me to try another angle: - There are different levels of unicode support and you don't want to support them all transparently. That's understandable. - The level you choose to support is the code point level. 
There are many good arguments about why this isn't a good default but you won't change your mind. I don't like that at all and I'm not alone but let's forget the entirety of the vocal D community for a moment. - A huge part of unicode chars can be normalized to fit your definition. That way not everything works (far from it) but a sufficiently big subset works. - On the other hand without normalization it just doesn't make any sense from a user perspective. The ö example has clearly shown that much; you even admitted it yourself by stating that many counter arguments would have worked had the string been normalized. - The most prominent problem is with graphemes that can have different representations, as those that can't be normalized can't be searched as dchars either. - If autodecoding to code points is to stay then, in an effort to find a compromise, normalizing should be done by default. Sure it would take some more time but it wouldn't break any code (I think) and would actually make things more correct. They still wouldn't be correct but I feel that something as crazy as unicode cannot be tackled generically anyway.A:“We should decode to code points” B:“No, decoding to code points is a stupid idea.” A:“No it's not!” B:“Can you show a concrete example where it does something useful?” A:“Sure, look at that!” B:“This isn't working at all, look at all those counter-examples!” A:“It may not work for your examples but look how easy it is to find code points!”With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- AndreiJun 02 2016On 6/2/16 5:38 PM, cym13 wrote:Allow me to try another angle: - There are different levels of unicode support and you don't want to support them all transparently. That's understandable.Cool.- The level you choose to support is the code point level. There are many good arguments about why this isn't a good default but you won't change your mind. I don't like that at all and I'm not alone but let's forget the entirety of the vocal D community for a moment.You mean all 35 of them? It's not about changing my mind! A massive thing is that the code point level handling is the incumbent, and that changing it would need to mark an absolutely Earth-shattering improvement to be worth it!- A huge part of unicode chars can be normalized to fit your definition. That way not everything works (far from it) but a sufficiently big subset works.Cool.- On the other hand without normalization it just doesn't make any sense from a user perspective. The ö example has clearly shown that much; you even admitted it yourself by stating that many counter arguments would have worked had the string been normalized.Yah, operating at code point level does not come free of caveats. It is vastly superior to operating on code units, and did I mention it's the incumbent.- The most prominent problem is with graphemes that can have different representations, as those that can't be normalized can't be searched as dchars either.Yah, I'd say if the program needs graphemes the option is there. Phobos by default deals with code points, which are not perfect but are independent of representation and produce meaningful and consistent results with std.algorithm etc.- If autodecoding to code points is to stay then, in an effort to find a compromise, normalizing should be done by default. Sure it would take some more time but it wouldn't break any code (I think) and would actually make things more correct. 
They still wouldn't be correct but I feel that something as crazy as unicode cannot be tackled generically anyway.Some more work on normalization at strategic points in Phobos would be interesting! AndreiJun 02 2016On Thu, Jun 02, 2016 at 04:29:48PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:On 06/02/2016 04:22 PM, cym13 wrote:With ASCII strings, all of std.algorithm operates correctly on ASCII bytes. So let's standardize on ASCII strings. What a vacuous argument! Basically you're saying "I define code points to be correct. Therefore, I conclude that decoding to code points is correct." Well, duh. Unfortunately such vacuous conclusions have no bearing in the real world of Unicode handling. T -- I am Ohm of Borg. Resistance is voltage over current.A:“We should decode to code points” B:“No, decoding to code points is a stupid idea.” A:“No it's not!” B:“Can you show a concrete example where it does something useful?” A:“Sure, look at that!” B:“This isn't working at all, look at all those counter-examples!” A:“It may not work for your examples but look how easy it is to find code points!”With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- AndreiJun 02 2016On Thursday, 2 June 2016 at 20:13:52 UTC, Andrei Alexandrescu wrote:On 06/02/2016 03:34 PM, deadalnix wrote:The good thing when you define works by whatever it does right now, it is that everything always works and there are literally never any bug. The bad thing is that this is a completely useless definition of work. The sample code won't count the instance of the grapheme 'ö' as some of its encoding won't be counted, which definitively count as doesn't work. When your point need to redefine words in ways that nobody agree with, it is time to admit the point is bogus.On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:True. "Are all code points equal to this one?" -- AndreiPretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.False.Jun 02 2016On 6/2/16 5:20 PM, deadalnix wrote:The good thing when you define works by whatever it does right nowNo, it works as it was designed. -- AndreiJun 02 2016On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:On 6/2/16 5:20 PM, deadalnix wrote:Nobody says it doesn't. Everybody says the design is crap.The good thing when you define works by whatever it does right nowNo, it works as it was designed. -- AndreiJun 02 2016On 6/2/16 5:35 PM, deadalnix wrote:On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:I think I like it more after this thread. -- AndreiOn 6/2/16 5:20 PM, deadalnix wrote:Nobody says it doesn't. Everybody says the design is crap.The good thing when you define works by whatever it does right nowNo, it works as it was designed. -- AndreiJun 02 2016On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu wrote:On 6/2/16 5:35 PM, deadalnix wrote:You start reminding me of the joke with that guy complaining that everybody is going backward on the highway.On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:I think I like it more after this thread. -- AndreiOn 6/2/16 5:20 PM, deadalnix wrote:Nobody says it doesn't. Everybody says the design is crap.The good thing when you define works by whatever it does right nowNo, it works as it was designed. 
-- AndreiJun 02 2016On 6/2/16 5:38 PM, deadalnix wrote:On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu wrote:Touché. (Get it?) -- AndreiOn 6/2/16 5:35 PM, deadalnix wrote:You start reminding me of the joke with that guy complaining that everybody is going backward on the highway.On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:I think I like it more after this thread. -- AndreiOn 6/2/16 5:20 PM, deadalnix wrote:Nobody says it doesn't. Everybody says the design is crap.The good thing when you define works by whatever it does right nowNo, it works as it was designed. -- AndreiJun 02 2016On 6/2/16 5:37 PM, Andrei Alexandrescu wrote:On 6/2/16 5:35 PM, deadalnix wrote:Meh, thinking of it again: I don't like it more, I'd still do it differently given a clean slate (viz. RCStr). But let's say I didn't get many compelling reasons to remove autodecoding from this thread. -- AndreiOn Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:I think I like it more after this thread. -- AndreiOn 6/2/16 5:20 PM, deadalnix wrote:Nobody says it doesn't. Everybody says the design is crap.The good thing when you define works by whatever it does right nowNo, it works as it was designed. -- AndreiJun 02 2016On 06/02/2016 05:37 PM, Andrei Alexandrescu wrote:On 6/2/16 5:35 PM, deadalnix wrote:Well there's a fantastic argument.On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:I think I like it more after this thread. -- AndreiOn 6/2/16 5:20 PM, deadalnix wrote:Nobody says it doesn't. Everybody says the design is crap.The good thing when you define works by whatever it does right nowNo, it works as it was designed. -- AndreiJun 03 2016On 02.06.2016 23:20, deadalnix wrote:The sample code won't count the instance of the grapheme 'ö' as some of its encoding won't be counted, which definitively count as doesn't work.It also has false positives (you can combine 'ö' with some combining character in order to get some strange character that is not an 'ö', and not even NFC helps with that).Jun 02 2016On 6/2/2016 12:34 PM, deadalnix wrote:On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.Jun 02 2016On 06/02/2016 04:27 PM, Walter Bright wrote:On 6/2/2016 12:34 PM, deadalnix wrote:Apparently I'm not the only idiot. -- AndreiOn Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.htmlPretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. 
ö is one such character.Jun 02 2016On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:On 6/2/2016 12:34 PM, deadalnix wrote:To be able to convert back and forth from/to unicode in a lossless manner.On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.Jun 02 2016On 6/2/2016 2:25 PM, deadalnix wrote:On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:Sorry, that makes no sense, as it is saying "they're the same, only different."I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.To be able to convert back and forth from/to unicode in a lossless manner.Jun 02 2016On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:On 6/2/2016 12:34 PM, deadalnix wrote:There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character? It's an interesting idea, but it's not how things are.On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.Jun 02 2016On Thursday, June 02, 2016 22:27:16 John Colvin via Digitalmars-d wrote:On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:Yeah. I'm inclined to think that the fact that there are multiple normalizations was a huge mistake on the part of the Unicode folks, but we're stuck dealing with it. And as horrible as it is for most cases, maybe it _does_ ultimately make sense because of certain use cases; I don't know. But bad idea or not, we're stuck. :( - Jonathan M DavisI wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character? 
It's an interesting idea, but it's not how things are.Jun 02 2016On 6/2/2016 3:27 PM, John Colvin wrote:I didn't say ordering, I said there should be no such thing as "normalization" in Unicode, where two codepoints are considered to be identical to some other codepoint.I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character?Jun 02 2016On Thu, Jun 02, 2016 at 05:19:48PM -0700, Walter Bright via Digitalmars-d wrote:On 6/2/2016 3:27 PM, John Colvin wrote:I think it was a combination of historical baggage and trying to accomodate unusual but still valid use cases. The historical baggage was that Unicode was trying to unify all of the various already-existing codepages out there, and many of those codepages already come with various precomposed characters. To maximize compatibility with existing codepages, Unicode tried to preserve as much of the original mappings as possible within each 256-point block, so these precomposed characters became part of the standard. However, there weren't enough of them -- some people demanded less common character + diacritic combinations, and some languages had writing so complex their characters had to be composed from more basic parts. The original Unicode range was 16-bit, so there wasn't enough room to fit all of the precomposed characters people demanded, plus there were other things people wanted, like multiple diacritics (e.g., in IPA). So the concept of combining diacritics was invented, in part to prevent combinatorial explosion from soaking up the available code point space, in part to allow for novel combinations of diacritics that somebody out there somewhere might want to represent. However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence. (Normalization, of course, also subsumes a few other things, such as collation, but this is one of the factors behind it.) (This is a greatly over-simplified description, of course. At the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs. the real letter mu in the Greek block, or the Cyrillic A which looks and behaves exactly like the Latin A, yet the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R), or the Cyrillic В whose lowercase is в not b, and also had a different sound, but lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (the uppercase is Ь, not B, you see). Then you have the wonderful Indic and Arabic cursive writings, where letterforms mutate depending on the surrounding context, which, if you were to include all variants as distinct code points, would occupy many more pages than they currently do. And also sticky issues like the oft-mentioned Turkish i, which is encoded as a Latin i but behaves differently w.r.t. upper/lowercasing when in Turkish locale -- some cases of this, IIRC, are unfixable bugs in Phobos because we currently do not handle locales. 
So you see, imagining that code points == the solution to Unicode string handling is a joke. Writing correct Unicode handling is *hard*.) As with all sufficiently complex software projects, Unicode represents a compromise between many contradictory factors -- writing systems in the world being the complex, not-very-consistent beasts they are -- so such "dirty" details are somewhat inevitable. T -- Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. KernighanI didn't say ordering, I said there should be no such thing as "normalization" in Unicode, where two codepoints are considered to be identical to some other codepoint.I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character?Jun 03 2016On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence.It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.Jun 03 2016On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposited characters, because that means that previously valid sequences are now invalid.However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence.It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.Jun 03 2016On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via Digitalmars-d wrote:On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:I would have argued that no composited characters should have ever existed regardless of what was done in previous encodings, since they're redundant, and you need the non-composited characters to avoid a combinatorial explosion of characters, so you can't have characters that just have a composited version and be consistent. However, the Unicode folks obviously didn't go that route. But given where we sit now, even though we're stuck with some composited characters, I'd argue that we should at least never add any new ones. But who knows what the Unicode folks are actually going to do. As it is, you probably should normalize strings in many cases where they enter the program, just like ideally, you'd validate them when they enter the program. 
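In D, that boundary step can be a sketch as small as the following (the helper name is invented for illustration; it relies on std.utf.validate and std.uni.normalize):

import std.uni : normalize, NFC;
import std.utf : validate;

// Illustrative input-boundary helper: reject malformed UTF-8, then settle on one
// normalization form (NFC here) so the rest of the program never has to care.
string sanitizeInput(string raw)
{
    validate(raw);            // throws std.utf.UTFException on invalid UTF-8
    return raw.normalize!NFC;
}
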
But regardless, you have to deal with the fact that multiple normalization schemes exist and that there's no guarantee that you're even going to get valid Unicode, let alone Unicode that's normalized the way you want. - Jonathan M DavisOn 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposited characters, because that means that previously valid sequences are now invalid.However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence.It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.Jun 03 2016On Friday, 3 June 2016 at 11:46:50 UTC, Jonathan M Davis wrote:On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via Digitalmars-d wrote:I do exactly this. Validate and normalize.On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:I would have argued that no composited characters should have ever existed regardless of what was done in previous encodings, since they're redundant, and you need the non-composited characters to avoid a combinatorial explosion of characters, so you can't have characters that just have a composited version and be consistent. However, the Unicode folks obviously didn't go that route. But given where we sit now, even though we're stuck with some composited characters, I'd argue that we should at least never add any new ones. But who knows what the Unicode folks are actually going to do. As it is, you probably should normalize strings in many cases where they enter the program, just like ideally, you'd validate them when they enter the program. But regardless, you have to deal with the fact that multiple normalization schemes exist and that there's no guarantee that you're even going to get valid Unicode, let alone Unicode that's normalized the way you want. - Jonathan M DavisOn 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposited characters, because that means that previously valid sequences are now invalid.However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence.It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.Jun 03 2016On Friday, 3 June 2016 at 12:04:39 UTC, Chris wrote:I do exactly this. Validate and normalize.And once you've done this, auto decoding is useless because the same character has the same representation anyway.Jun 05 2016On 6/3/2016 3:10 AM, Vladimir Panteleev wrote:I don't think it would work (or at least, the analogy doesn't hold). 
It would mean that you can't add new precomposited characters, because that means that previously valid sequences are now invalid.So don't add new precomposited characters when a recognized existing sequence exists.Jun 03 2016On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:At the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs. the real letter mu in the Greek block, or the Cyrillic A which looks and behaves exactly like the Latin A, yet the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R), or the Cyrillic В whose lowercase is в not b, and also had a different sound, but lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (the uppercase is Ь, not B, you see).I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs. They should have put me in charge of Unicode. I'd have put a stop to much of the madness :-)Jun 03 2016On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote:On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:That's not right either. Cyrillic letters can look slightly different from their latin lookalikes in some circumstances. I'm sure there are extremely good reasons for not using the latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode.At the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs. the real letter mu in the Greek block, or the Cyrillic A which looks and behaves exactly like the Latin A, yet the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R), or the Cyrillic В whose lowercase is в not b, and also had a different sound, but lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (the uppercase is Ь, not B, you see).I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs.Jun 03 2016On Fri, Jun 03, 2016 at 10:14:15AM +0000, Vladimir Panteleev via Digitalmars-d wrote:On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote:Yeah, lowercase Cyrillic П is п, which looks like lowercase Greek π in some fonts, but in cursive form it looks more like Latin lowercase n. It wouldn't make sense to encode Cyrillic п the same as Greek π or Latin lowercase n just by appearance, since logically it stands as its own character despite its various appearances. But it wouldn't make sense to encode it differently just because you're using a different font! Similarly, lowercase Cyrillic т in some cursive fonts looks like lowercase Latin m. I don't think it would make sense to encode lowercase Т as Latin m just because of that. 
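Unicode does in fact keep such lookalikes apart; a tiny illustrative check in D shows the Latin and Cyrillic 'a' as distinct code points even though most fonts draw them with the same glyph:

import std.stdio : writefln;

void main()
{
    dchar latinA    = 'a'; // U+0061 LATIN SMALL LETTER A
    dchar cyrillicA = 'а'; // U+0430 CYRILLIC SMALL LETTER A
    writefln("U+%04X vs U+%04X, equal? %s",
             cast(uint) latinA, cast(uint) cyrillicA, latinA == cyrillicA);
    // prints: U+0061 vs U+0430, equal? false
}
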
Eventually you have no choice but to encode by logical meaning rather than by appearance, since there are many lookalikes between different languages that actually mean something completely different, and often behaves completely differently. T -- People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANGOn 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:That's not right either. Cyrillic letters can look slightly different from their latin lookalikes in some circumstances. I'm sure there are extremely good reasons for not using the latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode.At the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs. the real letter mu in the Greek block, or the Cyrillic A which looks and behaves exactly like the Latin A, yet the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R), or the Cyrillic В whose lowercase is в not b, and also had a different sound, but lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (the uppercase is Ь, not B, you see).I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs.Jun 03 2016On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:Eventually you have no choice but to encode by logical meaning rather than by appearance, since there are many lookalikes between different languages that actually mean something completely different, and often behaves completely differently.It's almost as if printed documents and books have never existed!Jun 03 2016On Fri, Jun 03, 2016 at 11:43:07AM -0700, Walter Bright via Digitalmars-d wrote:On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:But if we were to encode appearance instead of logical meaning, that would mean the *same* lowercase Cyrillic ь would have multiple, different encodings depending on which font was in use. That doesn't seem like the right solution either. Do we really want Unicode strings to encode font information too?? 'Cos by that argument, serif and sans serif letters should have different encodings, because in languages like Hebrew, a tiny little serif could mean the difference between two completely different letters. And what of the Arabic and Indic scripts? They would need to encode the same letter multiple times, each being a variation of the physical form that changes depending on the surrounding context. Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one? If you say two, then you'd have a problem with how to search for sigma in Greek text, and you'd have to search for either medial sigma or final sigma. But if you say one, then you'd have a problem with having two different letterforms for a single codepoint. Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. 
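Phobos today illustrates the locale-blind half of that: std.uni.toUpper performs plain Unicode case mapping with no locale input, so for example:

import std.stdio : writeln;
import std.uni : toUpper;

void main()
{
    // std.uni.toUpper has no notion of locale, so it cannot do the Turkish mapping.
    writeln("istanbul".toUpper); // ISTANBUL -- a Turkish user would expect a dotted İ up front
}
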
And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today. T -- Let's eat some disquits while we format the biskettes.Eventually you have no choice but to encode by logical meaning rather than by appearance, since there are many lookalikes between different languages that actually mean something completely different, and often behaves completely differently.It's almost as if printed documents and books have never existed!Jun 03 2016On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:But if we were to encode appearance instead of logical meaning, that would mean the *same* lowercase Cyrillic ь would have multiple, different encodings depending on which font was in use.I don't see that consequence at all.That doesn't seem like the right solution either. Do we really want Unicode strings to encode font information too??No.'Cos by that argument, serif and sans serif letters should have different encodings, because in languages like Hebrew, a tiny little serif could mean the difference between two completely different letters.If they are different letters, then they should have a different code point. I don't see why this is such a hard concept.And what of the Arabic and Indic scripts? They would need to encode the same letter multiple times, each being a variation of the physical form that changes depending on the surrounding context. Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one?Two. Again, why is this hard to grasp? If there is meaning in having two different visual representations, then they are two codepoints. If the visual representation is the same, then it is one codepoint. If the difference is only due to font selection, that it is the same codepoint.Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today.The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.Jun 03 2016On Fri, Jun 03, 2016 at 03:35:18PM -0700, Walter Bright via Digitalmars-d wrote:On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:[...][...] It's not a hard concept, except that these different letters have lookalike forms with completely unrelated letters. Again: - Lowercase Latin m looks visually the same as lowercase Cyrillic Т in cursive form. In some font renderings the two are IDENTICAL glyphs, in spite of being completely different, unrelated letters. However, in non-cursive form, Cyrillic lowercase т is visually distinct. - Similarly, lowercase Cyrillic П in cursive font looks like lowercase Latin n, and in some fonts they are identical glyphs. Again, completely unrelated letters, yet they have the SAME VISUAL REPRESENTATION. 
However, in non-cursive font, lowercase Cyrillic П is п, which is visually distinct from Latin n. - These aren't the only ones, either. Other Cyrillic false friends include cursive Д, which in some fonts looks like lowercase Latin g. But in non-cursive font, it's д. Just given the above, it should be clear that going by visual representation is NOT enough to disambiguate between these different letters. By your argument, since lowercase Cyrillic Т is, visually, just m, it should be encoded the same way as lowercase Latin m. But this is untenable, because the letterform changes with a different font. So you end up with the unworkable idea of a font-dependent encoding. Similarly, since lowercase Cyrillic П is n (in cursive font), we should encode it the same way as Latin lowercase n. But again, the letterform changes based on font. Your criteria of "same visual representation" does not work outside of English. What you imagine to be a simple, straightforward concept is far from being simple once you're dealing with the diverse languages and writing systems of the world. Or, to use an example closer to home, uppercase Latin O and the digit 0 are visually identical. Should they be encoded as a single code point or two? Worse, in some fonts, the digit 0 is rendered like Ø (to differentiate it from uppercase O). Does that mean that it should be encoded the same way as the Danish letter Ø? Obviously not, but according to your "visual representation" idea, the answer should be yes. The bottomline is that uppercase O and the digit 0 represent different LOGICAL entities, in spite of their sharing the same visual representation. Eventually you have to resort to representing *logical* entities ("characters") rather than visual appearance, which is a property of the font, and has no place in a digital text encoding.'Cos by that argument, serif and sans serif letters should have different encodings, because in languages like Hebrew, a tiny little serif could mean the difference between two completely different letters.If they are different letters, then they should have a different code point. I don't see why this is such a hard concept.But what should "i".toUpper return? Or are you saying the standard library should not include such a basic function as a case-changing function? T -- Customer support: the art of getting your clients to pay for your own incompetence.Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today.The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.Jun 03 2016On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:It's not a hard concept, except that these different letters have lookalike forms with completely unrelated letters. Again: - Lowercase Latin m looks visually the same as lowercase Cyrillic Т in cursive form. In some font renderings the two are IDENTICAL glyphs, in spite of being completely different, unrelated letters. However, in non-cursive form, Cyrillic lowercase т is visually distinct. 
- Similarly, lowercase Cyrillic П in cursive font looks like lowercase Latin n, and in some fonts they are identical glyphs. Again, completely unrelated letters, yet they have the SAME VISUAL REPRESENTATION. However, in non-cursive font, lowercase Cyrillic П is п, which is visually distinct from Latin n. - These aren't the only ones, either. Other Cyrillic false friends include cursive Д, which in some fonts looks like lowercase Latin g. But in non-cursive font, it's д. Just given the above, it should be clear that going by visual representation is NOT enough to disambiguate between these different letters.It works for books. Unicode invented a problem, and came up with a thoroughly wretched "solution" that we'll be stuck with for generations. One of those bad solutions is have the reader not know what a glyph actually is without pulling back the cover to read the codepoint. It's madness.By your argument, since lowercase Cyrillic Т is, visually, just m, it should be encoded the same way as lowercase Latin m. But this is untenable, because the letterform changes with a different font. So you end up with the unworkable idea of a font-dependent encoding.Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.Or, to use an example closer to home, uppercase Latin O and the digit 0 are visually identical. Should they be encoded as a single code point or two? Worse, in some fonts, the digit 0 is rendered like Ø (to differentiate it from uppercase O). Does that mean that it should be encoded the same way as the Danish letter Ø? Obviously not, but according to your "visual representation" idea, the answer should be yes.Don't confuse fonts with code points. It'd be adequate if Unicode defined a canonical glyph for each code point, and let the font makers do what they wish.Not relevant to my point that Unicode shouldn't decide what "upper case" for all languages means, any more than Unicode should specify a font. Now when you argue that Unicode should make such decisions, note what a spectacularly hopeless job of it they've done.The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.But what should "i".toUpper return?Jun 03 2016On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:Because books don't allow their readers to change the font.It's not a hard concept, except that these different letters have lookalike forms with completely unrelated letters. Again: - Lowercase Latin m looks visually the same as lowercase Cyrillic Т in cursive form. In some font renderings the two are IDENTICAL glyphs, in spite of being completely different, unrelated letters. However, in non-cursive form, Cyrillic lowercase т is visually distinct. - Similarly, lowercase Cyrillic П in cursive font looks like lowercase Latin n, and in some fonts they are identical glyphs. Again, completely unrelated letters, yet they have the SAME VISUAL REPRESENTATION. However, in non-cursive font, lowercase Cyrillic П is п, which is visually distinct from Latin n. - These aren't the only ones, either. Other Cyrillic false friends include cursive Д, which in some fonts looks like lowercase Latin g. But in non-cursive font, it's д. 
Just given the above, it should be clear that going by visual representation is NOT enough to disambiguate between these different letters.It works for books.Unicode invented a problem, and came up with a thoroughly wretched "solution" that we'll be stuck with for generations. One of those bad solutions is have the reader not know what a glyph actually is without pulling back the cover to read the codepoint. It's madness.This madness already exists *without* Unicode. If you have a page with a single glyph 'm' printed on it and show it to an English speaker, he will say it's lowercase M. Show it to a Russian speaker, and he will say it's lowercase Т. So which letter is it, M or Т? The fundamental problem is that writing systems for different languages interpret the same letter forms differently. In English, lowercase g has at least two different forms that we recognize as the same letter. However, to a Cyrillic reader the two forms are distinct, because one of them looks like a Cyrillic letter but the other one looks foreign. So should g be encoded as a single point or two different points? In a similar vein, to a Cyrillic reader the glyphs т and m represent the same letter, but to an English letter they are clearly two different things. If you're going to represent both languages, you cannot get away from needing to represent letters abstractly, rather than visually.It's not a bad font. It's standard practice to print Cyrillic cursive letters with different glyphs. Russian readers can read both without any problem. The same letter is represented by different glyphs, and therefore the abstract letter is a more fundamental unit of meaning than the glyph itself.By your argument, since lowercase Cyrillic Т is, visually, just m, it should be encoded the same way as lowercase Latin m. But this is untenable, because the letterform changes with a different font. So you end up with the unworkable idea of a font-dependent encoding.Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.So should O and 0 share the same glyph or not? They're visually the same thing, even though some fonts render them differently. What should be the canonical shape of O vs. 0? If they are the same shape, then by your argument they must be the same code point, regardless of what font makers do to disambiguate them. Good luck writing a parser that can't tell between an identifier that begins with O vs. a number literal that begins with 0. The very fact that we distinguish between O and 0, independently of what Unicode did/does, is already proof enough that going by visual representation is inadequate.Or, to use an example closer to home, uppercase Latin O and the digit 0 are visually identical. Should they be encoded as a single code point or two? Worse, in some fonts, the digit 0 is rendered like Ø (to differentiate it from uppercase O). Does that mean that it should be encoded the same way as the Danish letter Ø? Obviously not, but according to your "visual representation" idea, the answer should be yes.Don't confuse fonts with code points. It'd be adequate if Unicode defined a canonical glyph for each code point, and let the font makers do what they wish.In other words toUpper and toLower does not belong in the standard library. Great. T -- Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.Not relevant to my point that Unicode shouldn't decide what "upper case" for all languages means, any more than Unicode should specify a font. 
Now when you argue that Unicode should make such decisions, note what a spectacularly hopeless job of it they've done.The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.But what should "i".toUpper return?Jun 03 2016On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:Unicode is not the font.It works for books.Because books don't allow their readers to change the font.This madness already exists *without* Unicode. If you have a page with a single glyph 'm' printed on it and show it to an English speaker, he will say it's lowercase M. Show it to a Russian speaker, and he will say it's lowercase Т. So which letter is it, M or Т?It's not a problem that Unicode can solve. As you said, the meaning is in the context. Unicode has no context, and tries to solve something it cannot. ('m' doesn't always mean m in english, either. It depends on the context.) Ya know, if Unicode actually solved these problems, you'd have a case. But it doesn't, and so you don't :-)If you're going to represent both languages, you cannot get away from needing to represent letters abstractly, rather than visually.Books do visually just fine!So should O and 0 share the same glyph or not? They're visually the same thing,No, they're not. Not even on old typewriters where every key was expensive. Even without the slash, the O tends to be fatter than the 0.The very fact that we distinguish between O and 0, independently of what Unicode did/does, is already proof enough that going by visual representation is inadequate.Except that you right now are using a font where they are different enough that you have no trouble at all distinguishing them without bothering to look it up. And so am I.In other words toUpper and toLower does not belong in the standard library. Great.Unicode and the standard library are two different things.Jun 04 2016On Saturday, 4 June 2016 at 08:12:47 UTC, Walter Bright wrote:On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:Even if a character in different languages share a glyph or look identical though, it makes sense to duplicate them with different code points/units/whatever. Simple functions like isCyrillicLetter() can then do a simple less-than / greater-than comparison instead of having a lookup table to check different numeric representations scattered throughout the Unicode table. Functions like toUpper and toLower become easier to write as well (for SOME languages anyhow), it's simply myletter +/- numlettersinalphabet. Redundancy here is very helpful. Maybe instead of Unicode they should have called it Babel... :) "The Lord said, “If as one people speaking the same language they have begun to do this, then nothing they plan to do will be impossible for them. Come, let us go down and confuse their language so they will not understand each other.”" -JonOn Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:Unicode is not the font.It works for books.Because books don't allow their readers to change the font.This madness already exists *without* Unicode. If you have a page with a single glyph 'm' printed on it and show it to an English speaker, he will say it's lowercase M. Show it to a Russian speaker, and he will say it's lowercase Т. So which letter is it, M or Т?It's not a problem that Unicode can solve. As you said, the meaning is in the context. Unicode has no context, and tries to solve something it cannot. 
('m' doesn't always mean m in english, either. It depends on the context.) Ya know, if Unicode actually solved these problems, you'd have a case. But it doesn't, and so you don't :-)If you're going to represent both languages, you cannot get away from needing to represent letters abstractly, rather than visually.Books do visually just fine!So should O and 0 share the same glyph or not? They're visually the same thing,No, they're not. Not even on old typewriters where every key was expensive. Even without the slash, the O tends to be fatter than the 0.The very fact that we distinguish between O and 0, independently of what Unicode did/does, is already proof enough that going by visual representation is inadequate.Except that you right now are using a font where they are different enough that you have no trouble at all distinguishing them without bothering to look it up. And so am I.In other words toUpper and toLower does not belong in the standard library. Great.Unicode and the standard library are two different things.Jun 05 2016On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.Interestingly enough, I've mentioned earlier here that only people from the US would believe that documents with mixed languages aren't commonplace. I wasn't expecting to be proven right that fast.Jun 05 2016On 6/5/2016 1:07 AM, deadalnix wrote:On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:You'd be in error. I've been casually working on my grandfather's thesis trying to make a web version of it, and it is mixed German, French, and English. I've also made a digital version of an old history book that is mixed English, old English, German, French, Greek, old Greek, and Egyptian hieroglyphs (available on Amazons in your neighborhood!). I've also lived in Germany for 3 years, though that was before computers took over the world.Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.Interestingly enough, I've mentioned earlier here that only people from the US would believe that documents with mixed languages aren't commonplace. I wasn't expecting to be proven right that fast.Jun 05 2016On Friday, 3 June 2016 at 20:53:32 UTC, H. S. Teoh wrote:In Unicode there are 2 different codepoints for lower case sigma ς U+03C2 and σ U+3C3 but only one uppercase Σ U+3A3 sigma. Codepoint U+3A2 is undefined. So your objection is not hypothetic, it is actually an issue for uppercase() and lowercase() functions. Another difficulty besides dotless and dotted i of Turkic, the double letters used in latin transcription of cyrillic text in east and south europe dž, lj, nj and dz, which have an uppercase forme (DŽ, LJ, NJ, DZ) and a titlecase form (Dž, Lj, Nj, Dz).Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one? If you say two, then you'd have a problem with how to search for sigma in Greek text, and you'd have to search for either medial sigma or final sigma. But if you say one, then you'd have a problem with having two different letterforms for a single codepoint.Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. 
And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today.As an anecdote I can tell the story of the accession to the European Union of Romania and Bulgaria in 2007. The issue was that 3 letters used by Romanian and Bulgarian had been forgotten by the Unicode consortium (Ș U+0218, ș U+219, Ț U+21A, ț U+21B and 2 Cyrillic letters that I do not remember). The Romanians used as a replacement Ş, ş, Ţ and ţ (U+15D, U+15E and U+161 and U+162), which look a little bit alike. When the Commission finally managed to force Microsoft to correct the fonts to include them, we could start to correct the data. The transition was finished in 2012 and was only possible because no other language we deal with uses the "wrong" codepoints (Turkish does, but fortunately we only have a handful of them in our db's). So 5 years of ad hoc processing for the substitution of 4 codepoints. BTW: using combining diacritics was out of the question at that time simply because Microsoft Word didn't support it at that time and many documents we encountered still only used codepages (one also has to remember that in a big institution like the EC, the IT is always several years behind the open market, which means that when a product is at release X, the Institution still might use release X-5 years).Jun 04 2016One also has to take into consideration that Unicode is the way it is because it was not invented in an empty space. It had to take the existing into consideration and find compromises allowing its adoption. Even if they had invented the perfect encoding, NO ONE WOULD HAVE USED IT, as it would have fubar'd the existing. As it was invented, it allowed a (relatively smooth) transition. Here are some points that made it even possible that Unicode could be adopted at all: - 16 bits: while that choice was a bit shortsighted, 16 bits is a good compromise between compactness and richness (the BMP suffices to express nearly all living languages). - Using more or less the same arrangement of codepoints as in the different codepages. This made it possible to transform legacy documents with simple scripts (as a matter of fact I wrote a script to repair misencoded Greek documents; it consisted mainly of unich = ch>0x80 ? ch+0x2D0 : ch;). - UTF-8: this was the genius stroke, the encoding that allowed mixing it all without requiring awful acrobatics (Joakim is completely out to lunch on that one; shifting encodings without self-synchronisation are hellish, and that's why Chinese and Japanese adopted Unicode without hesitation: they had enough experience with their legacy encodings). - Letting time for the transition. So all the points that people here criticize were in fact the reason why Unicode could even become the standard it is now.Jun 04 2016On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:It's almost as if printed documents and books have never existed!some old xUSSR books which have some English text sometimes used Cyrillic font to represent English. it was awful, and barely readable. this was done to ease the work of compositors, and the result was unacceptable. do you feel a recognizable pattern here? 
;-)Jun 03 2016On 6/3/2016 5:42 PM, ketmar wrote:sometimes used Cyrillic font to represent English.Nobody here suggested using the wrong font, it's completely irrelevant.Jun 03 2016On Saturday, 4 June 2016 at 02:46:31 UTC, Walter Bright wrote:On 6/3/2016 5:42 PM, ketmar wrote:you suggested that unicode designers should make similar-looking glyphs share the same code, and it reminds me this little story. maybe i misunderstood you, though.sometimes used Cyrillic font to represent English.Nobody here suggested using the wrong font, it's completely irrelevant.Jun 03 2016On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:TIL: books are read by computers.Eventually you have no choice but to encode by logical meaning rather than by appearance, since there are many lookalikes between different languages that actually mean something completely different, and often behaves completely differently.It's almost as if printed documents and books have never existed!Jun 05 2016On 6/5/2016 1:05 AM, deadalnix wrote:TIL: books are read by computers.I should introduce you to a fabulous technology called OCR. :-)Jun 05 2016On 6/3/2016 3:14 AM, Vladimir Panteleev wrote:That's not right either. Cyrillic letters can look slightly different from their latin lookalikes in some circumstances. I'm sure there are extremely good reasons for not using the latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode.How did people ever get by with printed books and documents?Jun 03 2016On 03.06.2016 20:41, Walter Bright wrote:On 6/3/2016 3:14 AM, Vladimir Panteleev wrote:They can disambiguate the letters based on context well enough.That's not right either. Cyrillic letters can look slightly different from their latin lookalikes in some circumstances. I'm sure there are extremely good reasons for not using the latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode.How did people ever get by with printed books and documents?Jun 03 2016On 6/3/2016 11:54 AM, Timon Gehr wrote:On 03.06.2016 20:41, Walter Bright wrote:Characters do not have semantic meaning. Their meaning is always inferred from the context. Unicode's troubles started the moment they stepped beyond their charter.How did people ever get by with printed books and documents?They can disambiguate the letters based on context well enough.Jun 03 2016On Friday, 3 June 2016 at 18:41:36 UTC, Walter Bright wrote:How did people ever get by with printed books and documents?Printed books pick one font and one layout, then is read by people. It doesn't have to be represented in some format where end users can change the font and size etc.Jun 03 2016On Friday, June 03, 2016 03:08:43 Walter Bright via Digitalmars-d wrote:On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:Actually, I would argue that the moment that Unicode is concerned with what the character actually looks like rather than what character it logically is that it's gone outside of its charter. The way that characters actually look is far too dependent on fonts, and aside from display code, code does not care one whit what the character looks like. For instance, take the capital letter I, the lowercase letter l, and the number one. 
In some fonts that are feeling cruel towards folks who actually want to read them, two of those characters - or even all three of them - look identical. But I think that you'll agree that those characters should be represented as distinct characters in Unicode regardless of what they happen to look like in a particular font. Now, take a cyrllic letter that looks similar to a latin letter. If they're logically equivalent such that no code would ever want to distinguish between the two and such that no font would ever even consider representing them differently, then they're truly the same letter, and they should only have one Unicode representation. But if anyone would ever consider them to be logically distinct, then it makes no sense for them to be considered to be the same character by Unicode, because they don't have the same identity. And that distinction is quite clear if any font would ever consider representing the two characters differently, no matter how slight that difference might be. Really, what a character looks like has nothing to do with Unicode. The exact same Unicode is used regardless of how the text is displayed. Rather, what Unicode is doing is providing logical identifiers for characters so that code can operate on them, and display code can then do whatever it does to display those characters, whether they happen to look similar or not. I would think that the fact that non-display code does not care one whit about what a character looks like and that display code can have drastically different visual representations for the same character would make it clear that Unicode is concerned with having identifiers for logical characters and that that is distinct from any visual representation. - Jonathan M DavisAt the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs. the real letter mu in the Greek block, or the Cyrillic A which looks and behaves exactly like the Latin A, yet the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R), or the Cyrillic В whose lowercase is в not b, and also had a different sound, but lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (the uppercase is Ь, not B, you see).I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs. They should have put me in charge of Unicode. I'd have put a stop to much of the madness :-)Jun 03 2016On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:Actually, I would argue that the moment that Unicode is concerned with what the character actually looks like rather than what character it logically is that it's gone outside of its charter. The way that characters actually look is far too dependent on fonts, and aside from display code, code does not care one whit what the character looks like.What I meant was pretty clear. Font is an artistic style that does not change context nor semantic meaning. If a font choice changes the meaning then it is not a font.Jun 03 2016On Friday, 3 June 2016 at 22:38:38 UTC, Walter Bright wrote:If a font choice changes the meaning then it is not a font.Nah, then it is an Awesome Font that is totally Web Scale! 
i wish i was making that up http://fontawesome.io/ i hate that thing But, it is kinda legal: gotta love the Unicode private use area!Jun 03 2016On Friday, June 03, 2016 15:38:38 Walter Bright via Digitalmars-d wrote:On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:Well, maybe I misunderstood what was being argued, but it seemed like you've been arguing that two characters should be considered the same just because they look similar, whereas H. S. Teoh is arguing that two characters can be logically distinct while still looking similar and that they should be treated as distinct in Unicode because they're logically distinct. And if that's what's being argued, then I agree with H. S. Teoh. I expect - at least ideally - for Unicode to contain identifiers for characters that are distinct from whatever their visual representation might be. Stuff like fonts then worries about how to display them, and hopefully don't do stupid stuff like make a capital I look like a lowercase l (though they often do, unfortunately). But if two characters in different scripts - be they latin and cyrillic or whatever - happen to often look the same but would be considered two different characters by humans, then I would expect Unicode to consider them to be different, whereas if no one would reasonably consider them to be anything but exactly the same character, then there should only be one character in Unicode. However, if we really have crazy stuff where subtly different visual representations of the letter g are considered to be one character in English and two in Russian, then maybe those should be three different characters in Unicode so that the English text can clearly be operating on g, whereas the Russian text is doing whatever it does with its two characters that happen to look like g. I don't know. That sort of thing just gets ugly. But I definitely think that Unicode characters should be made up of what the logical characters are and leave the visual representation up to the fonts and the like. Now, how to deal with uppercase vs lowercase and all of that sort of stuff is a completely separate issue IMHO, and that comes down to how the characters are somehow logically associated with one another, and it's going to be very locale-specific such that it's not really part of the core of Unicode's charter IMHO (though I'm not sure that it's bad if there's a set of locale rules that go along with Unicode for those looking to correctly apply such rules - they just have nothing to do with code points and graphemes and how they're represented in code). - Jonathan M DavisActually, I would argue that the moment that Unicode is concerned with what the character actually looks like rather than what character it logically is that it's gone outside of its charter. The way that characters actually look is far too dependent on fonts, and aside from display code, code does not care one whit what the character looks like.What I meant was pretty clear. Font is an artistic style that does not change context nor semantic meaning. If a font choice changes the meaning then it is not a font.Jun 05 2016On 02-Jun-2016 23:27, Walter Bright wrote:On 6/2/2016 12:34 PM, deadalnix wrote:Yeah, Unicode was not meant to be easy it seems. Or this is whatever happens with evolutionary design that started with "everything is a 16-bit character". -- Dmitry OlshanskyOn Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:There are 3 levels of Unicode support. What Andrei is talking about is Level 1. 
http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.Jun 03 2016On 03/06/2016 20:12, Dmitry Olshansky wrote:On 02-Jun-2016 23:27, Walter Bright wrote:Typing as someone who as spent some time creating typefaces, having two representations makes sense, and it didn't start with Unicode, it started with movable type. It is much easier for a font designer to create the two codepoint versions of characters for most instances, i.e. make the base letters and the diacritics once. Then what I often do is make single codepoint versions of the ones I'm likely to use, but only if they need more tweaking than the kerning options of the font format allow. I'll omit the history lesson on how this was similar in the case of movable type. Keyboards for different languages mean that a character that is a single keystroke in one case is two together or in sequence in another. This means that Unicode not only represents completed strings, but also those that are mid composition. The ordering that it uses to ensure that graphemes have a single canonical representation is based on the order that those multi-key characters are entered. I wouldn't call it elegant, but its not inelegant either. Trying to represent all sufficiently similar glyphs with the same codepoint would lead to a layout problem. How would you order them so that strings of any language can be sorted by their local sorting rules, without having to special case algorithms? Also consider ligatures, such as those for "ff", "fi", "ffi", "fl", "ffl" and many, many more. Typographers create these glyphs whenever available kerning tools do a poor job of combining them from the individual glyphs. From the point of view of meaning they should still be represented as individual codepoints, but for display (electronic or print) that sequence needs to be replaced with the single codepoint for the ligature. I think that in order to understand the decisions of the Unicode committee, one has to consider that they are trying to unify the concerns of representing written information from two sides. One side prioritises storage and manipulation, while the other considers aesthetics and design workflow more important. My experience of using Unicode from both sides gives me a different appreciation for the difficulties of reconciling the two. A... P.S. Then they started adding emojis, and I lost all faith in humanity ;)I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.Yeah, Unicode was not meant to be easy it seems. Or this is whatever happens with evolutionary design that started with "everything is a 16-bit character".Jun 04 2016On 02.06.2016 21:05, Andrei Alexandrescu wrote:On 06/02/2016 01:54 PM, Marc Schütz wrote:Doesn't work. Shouldn't compile. (char and wchar shouldn't be comparable.) assert("ö".all!(c => c == 'ö')); // failsOn Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu wrote:Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). 
* s.all!(c => c == 'ö') works only with autodecoding. It returns always false without. ...That's not going to work. A false impression created in this thread has been that code points are uselessThey _are_ useless for almost anything you can do with strings. The only places where they should be used are std.uni and std.regex. Again: What is the justification for using code points, in your opinion? Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?* s.any!(c => c == 'ö') works only with autodecoding. It returns always false without. ...Doesn't work. Shouldn't compile. assert("ö".any!(c => c == 'ö")); // fails assert(!"̃ö⃖".any!(c => c== 'ö')); // fails* s.balancedParens('〈', '〉') works only with autodecoding. ...Doesn't work, e.g. s="⟨⃖". Shouldn't compile.* s.canFind('ö') works only with autodecoding. It returns always false without. ...Doesn't work. Shouldn't compile. assert("ö".canFind!(c => c == 'ö")); // fails* s.commonPrefix(s1) works only if they both use the same encoding; otherwise it still compiles but silently produces an incorrect result. ...Doesn't work. Shouldn't compile.* s.count('ö') works only with autodecoding. It returns always zero without. ....Doesn't work. Shouldn't compile.* s.countUntil(s1) is really odd - without autodecoding, whether it works at all, and the result it returns, depends on both encodings. With autodecoding it always works and returns a number independent of the encodings. ...Doesn't work. Shouldn't compile.* s.endsWith('ö') works only with autodecoding. It returns always false without. ...Doesn't work. Shouldn't compile.* s.endsWith(s1) works only with autodecoding.Doesn't work.Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings. ...Shouldn't compile.* s.find('ö') works only with autodecoding. It never finds it without. ...Doesn't work. Shouldn't compile.* s.findAdjacent is a very interesting one. It works with autodecoding, but without it it just does odd things. ....Doesn't work. Shouldn't compile.* s.findAmong(s1) is also interesting. It works only with autodecoding. ...Doesn't work. Shouldn't compile.* s.findSkip(s1) works only if s and s1 have the same encoding. Otherwise it compiles and runs but produces incorrect results. ...Doesn't work. Shouldn't compile.* s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only if s and s1 have the same encoding.Doesn't work.Otherwise they compile and run but produce incorrect results. ...Shouldn't compile.* s.minCount, s.maxCount are unlikely to be terribly useful but with autodecoding it consistently returns the extremum numeric code unit regardless of representation. Without, they just return encoding-dependent and meaningless numbers. * s.minPos, s.maxPos follow a similar semantics. ...Hardly a point in favour of autodecoding.* s.skipOver(s1) only works with autodecoding.Doesn't work. Shouldn't compile.Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings. ...Shouldn't compile.* s.startsWith('ö') works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings. ...Doesn't work. Shouldn't compile.* s.startsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings. ...Doesn't work. Shouldn't compile.* s.until!(c => c == 'ö') works only with autodecoding. 
Otherwise, it will span the entire range. ...Doesn't work. Shouldn't compile.=== The intent of autodecoding was to make std.algorithm work meaningfully with strings. As it's easy to see I just went through std.algorithm.searching alphabetically and found issues literally with every primitive in there. It's an easy exercise to go forth with the others. ...Basically all of those still don't work with UTF-32 (assuming your goal is to operate on characters). You need to normalize and possibly iterate on graphemes. Also, many of those functions actually have valid uses intentionally operating on code units. The "shouldn't compile" remarks ideally would be handled at the language level: char/wchar/dchar should be incompatible types and char[], wchar[] and dchar[] should be handled like all arrays.Jun 02 2016On Thursday, 2 June 2016 at 20:01:54 UTC, Timon Gehr wrote:Doesn't work. Shouldn't compile. (char and wchar shouldn't be comparable.)In Andrei's original post, he says that s is a string variable. He doesn't say it's a char. I find the weirder thing to be that t below is false, per deadalnix's point. import std.algorithm : all; import std.stdio : writeln; void main() { string s = "ö"; auto t = s.all!(c => c == 'ö'); writeln(t); //prints false } I could imagine getting frustrated that something like the code below throws errors. import std.algorithm : all; import std.stdio : writeln; void main() { import std.uni : byGrapheme; string s = "ö"; auto s2 = s.byGrapheme; auto t2 = s2.all!(c => c == 'ö'); writeln(t2); }Jun 02 2016On 06/02/2016 04:01 PM, Timon Gehr wrote:Doesn't work. Shouldn't compile. (char and wchar shouldn't be comparable.)That would be another language design option, which we don't have the luxury to explore. -- AndreiJun 02 2016On 06/02/2016 04:01 PM, Timon Gehr wrote:assert("ö".all!(c => c == 'ö')); // failsAs expected. Different code units for different folks. That's a different matter than walking blindly through code units. -- AndreiJun 02 2016On 06/02/2016 04:01 PM, Timon Gehr wrote:Basically all of those still don't work with UTF-32 (assuming your goal is to operate on characters).The goal is to operate on code units. -- AndreiJun 02 2016On 06/02/2016 04:26 PM, Andrei Alexandrescu wrote:On 06/02/2016 04:01 PM, Timon Gehr wrote:s/units/points/Basically all of those still don't work with UTF-32 (assuming your goal is to operate on characters).The goal is to operate on code units. -- AndreiJun 02 2016On 06/02/2016 10:26 PM, Andrei Alexandrescu wrote:The goal is to operate on code units. -- AndreiYou sure you got the right word there? The code unit is the smallest building block. A code point is encoded with one or more code units. Also, if you mean code points, that's where people disagree. Operating on code points by default is seen as not particularly useful.Jun 02 2016On 06/02/2016 04:33 PM, ag0aep6g wrote:Operating on code points by default is seen as not particularly useful.By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- AndreiJun 02 2016On Thursday, 2 June 2016 at 20:36:12 UTC, Andrei Alexandrescu wrote:On 06/02/2016 04:33 PM, ag0aep6g wrote:From the standard:Operating on code points by default is seen as not particularly useful.By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- AndreiLevel 1 support works well in many circumstances. However, it does not handle more complex languages or extensions to the Unicode Standard very well. 
Particularly important cases are surrogates, canonical equivalence, word boundaries, grapheme boundaries, and loose matches. (For more information about boundary conditions, see The Unicode Standard, Section 5-15.) Level 2 support matches much more what user expectations are for sequences of Unicode characters. It is still locale independent and easily implementable. However, the implementation may be slower when supporting Level 2, and some expressions may require Level 1 matches. Thus it is usually required to have some sort of syntax that will turn Level 2 support on and off.That doesn't sound like much of an endorsement for defaulting to only level 1 support to me - "it does not handle more complex languages or extensions to the Unicode Standard very well".Jun 02 2016On 06/02/2016 04:47 PM, tsbockman wrote:That doesn't sound like much of an endorsement for defaulting to only level 1 support to me - "it does not handle more complex languages or extensions to the Unicode Standard very well".Code point/Level 1 support sounds like a sweet spot between efficiency/complexity and conviviality. Level 2 is opt-in with byGrapheme. -- AndreiJun 02 2016On Thursday, 2 June 2016 at 20:49:52 UTC, Andrei Alexandrescu wrote:On 06/02/2016 04:47 PM, tsbockman wrote:Actually, according to the document Walter Bright linked level 1 does NOT operate at the code point level:That doesn't sound like much of an endorsement for defaulting to only level 1 support to me - "it does not handle more complex languages or extensions to the Unicode Standard very well".Code point/Level 1 support sounds like a sweet spot between efficiency/complexity and conviviality. Level 2 is opt-in with byGrapheme. -- AndreiLevel 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic 16-bit logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or UTF-32.) ... Level 1 support works well in many circumstances. However, it does not handle more complex languages or extensions to the Unicode Standard very well. Particularly important cases are **surrogates** ...So, level 1 appears to be UTF-16 code units, not code points. To do code points it would have to recognize surrogates, which are specifically mentioned as not supported. Level 2 skips straight to graphemes, and there is no code point level. However, this document is very old - from Unicode 3.0 and the year 2000:While there are no surrogate characters in Unicode 3.0 (outside of private use characters), future versions of Unicode will contain them...Perhaps level 1 has since been redefined?Jun 02 2016On Thursday, 2 June 2016 at 21:00:17 UTC, tsbockman wrote:However, this document is very old - from Unicode 3.0 and the year 2000:I found the latest (unofficial) draft version: http://www.unicode.org/reports/tr18/tr18-18.html Relevant changes: * Level 1 is to be redefined as working on code points, not code units:While there are no surrogate characters in Unicode 3.0 (outside of private use characters), future versions of Unicode will contain them...Perhaps level 1 has since been redefined?A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units.* Level 2 (graphemes) is explicitly described as a "default level":This is still a default level—independent of country or language—but provides much better support for end-user expectations than the raw level 1...* All mention of level 2 being slow has been removed. 
The only reason given for making it toggle-able is for compatibility with level 1 algorithms:Level 2 support matches much more what user expectations are for sequences of Unicode characters. It is still locale-independent and easily implementable. However, for compatibility with Level 1, it is useful to have some sort of syntax that will turn Level 2 support on and off.Jun 02 2016On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- AndreiDo they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that?Jun 02 2016On 06/02/2016 04:52 PM, ag0aep6g wrote:On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:No, but that sounds agreeable to me, especially since it breaks no code of ours. We really should document this better. Kudos to Walter for finding all that Level 1 support. AndreiBy whom? The "support level 1" folks yonder at the Unicode standard? :o) -- AndreiDo they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that?Jun 02 2016On Thursday, 2 June 2016 at 20:52:29 UTC, ag0aep6g wrote:On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:The level 2 support description noted that it should be opt-in because its slow. Arguably it should be easier to operate on code units if you know its safe to do so, but either always working on code units or always working on graphemes as the default seems to be either too broken too often or too slow too often. Now one can argue either consistency for code units (because then we can treat char[] and friends as a slice) or correctness for graphemes but really the more I think about it the more I think there is no good default and you need to learn unicode anyways. The only sad parts here are that 1) we hijacked an array type for strings, which sucks and 2) that we dont have an api that is actually good at teaching the user what it does and doesnt do. The consequence of 1 is that generic code that also wants to deal with strings will want to special-case to get rid of auto-decoding, the consequence of 2 is that we will have tons of not-actually-correct string handling. I would assume that almost all string handling code that is out in the wild is broken anyways (in code I have encountered I have never seen attempts to normalize or do other things before or after comparisons, searching, etc), unless of course, YOU or one of your colleagues wrote it (consider that checking the length of characters is often done and wrong, because .Length is the number of UTF-16 code units in those languages) :o) So really as bad and alarming as "incorrect string handling" by default seems, it in practice of other languages that get used way more than D has not prevented people from writing working (internationalized!) applications in those languages. One could say we should do it better than them, but I would be inclined to believe that RCStr provides our opportunity to do so. Having char[] be what it is is an annoying wart, and maybe at some point we can deprecate/remove that behaviour, but for now Id rather see if RCStr is viable than attempt to change semantics of all string handling code in D.By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- AndreiDo they say that level 1 should be the default, and do they give a rationale for that? 
Would you kindly link or quote that?Jun 02 2016On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote:The level 2 support description noted that it should be opt-in because its slow.1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.Jun 02 2016On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote:1) Right because a special toggleable syntax is definitely not "opt-in". 2) Several people in this thread noted that working on graphemes is way slower (which makes sense, because its yet another processing you need to do after you decoded - therefore more work - therefore slower) than working on code points. 3) Not an argument - doing more work makes code slower. The only thing that changes is what specific operations have what cost (for instance, memory access has a much higher cost now than it had then). Considering the way the process works and judging from what others in this thread have said about it, I will stick with "always decoding to graphemes for all operations is very slow" and indulge in being too lazy to write benchmarks for it to show just how bad it is.The level 2 support description noted that it should be opt-in because its slow.1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.Jun 02 2016On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote:On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:It is not "opt-in" unless it is toggled off by default. The only reason it doesn't talk about toggling in the level 1 section, is because that section is written with the assumption that many programs will *only* support level 1.1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.1) Right because a special toggleable syntax is definitely not "opt-in".2) Several people in this thread noted that working on graphemes is way slower (which makes sense, because its yet another processing you need to do after you decoded - therefore more work - therefore slower) than working on code points.And working on code points is way slower than working on code units (the actual level 1).3) Not an argument - doing more work makes code slower.What do you think I'm arguing for? It's not graphemes-by-default. 
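For concreteness, the three granularities this level 1 / level 2 argument turns on can be seen directly. The following is a minimal, self-contained sketch, assuming std.utf and std.uni as they existed around this time; the variable s is just a local name used for illustration:

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // "ö" written in decomposed (NFD) form: 'o' followed by U+0308 COMBINING DIAERESIS
    string s = "o\u0308";

    assert(s.byCodeUnit.walkLength == 3);  // UTF-8 code units
    assert(s.walkLength == 2);             // code points -- what autodecoding iterates
    assert(s.byDchar.walkLength == 2);     // the same thing, spelled explicitly
    assert(s.byGrapheme.walkLength == 1);  // graphemes -- one user-perceived character
}

Only the last line matches what a reader would call one character, which is the level 2 behaviour that byGrapheme opts into.
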
What I actually want to see: permanently deprecate the auto-decoding range primitives. Force the user to explicitly specify whichever of `by!dchar`, `byCodePoint`, or `byGrapheme` their specific algorithm actually needs. Removing the implicit conversions between `char`, `wchar`, and `dchar` would also be nice, but isn't really necessary I think. That would be a standards-compliant solution (one of several possible). What we have now is non-standard, at least going by the old version Walter linked.Jun 02 2016On Thursday, 2 June 2016 at 21:51:51 UTC, tsbockman wrote:On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote:*sigh* reading comprehension. Needing to write .byGrapheme or similar to enable the behaviour qualifies for what that description was arguing for. I hope you understand that now that I am repeating this for you.On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:It is not "opt-in" unless it is toggled off by default. The only reason it doesn't talk about toggling in the level 1 section, is because that section is written with the assumption that many programs will *only* support level 1.1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.1) Right because a special toggleable syntax is definitely not "opt-in".Never claimed the opposite. Do note however that its specifically talking about UTF-16 code units.2) Several people in this thread noted that working on graphemes is way slower (which makes sense, because its yet another processing you need to do after you decoded - therefore more work - therefore slower) than working on code points.And working on code points is way slower than working on code units (the actual level 1).Unrelated. I was refuting the point you made about the relevance of the performance claims of the unicode level 2 support description, not evaluating your hypothetical design. Please do not take what I say out of context, thank you.3) Not an argument - doing more work makes code slower.What do you think I'm arguing for? It's not graphemes-by-default.Jun 02 2016On Thursday, 2 June 2016 at 22:03:01 UTC, default0 wrote:*sigh* reading comprehension. ... Please do not take what I say out of context, thank you.Earlier you said:The level 2 support description noted that it should be opt-in because its slow.My main point is simply that you mischaracterized what the standard says. Making level 1 opt-in, rather than level 2, would be just as compliant as the reverse. The standard makes no suggestion as to which should be default.Jun 02 2016On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:* s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.The o is inferred as a wchar. The lamda then is inferred to return a wchar. The algorithm can check that the input is char[], and is being tested against a wchar. Therefore, the algorithm can specialize to do the decoding itself. 
No autodecoding necessary, and it does the right thing.Jun 02 2016On 02.06.2016 22:07, Walter Bright wrote:On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:No, the lambda returns a bool.* s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.The o is inferred as a wchar. The lamda then is inferred to return a wchar.The algorithm can check that the input is char[], and is being tested against a wchar. Therefore, the algorithm can specialize to do the decoding itself. No autodecoding necessary, and it does the right thing.It still would not be the right thing. The lambda shouldn't compile. It is not meaningful to compare utf-8 and utf-16 code units directly.Jun 02 2016On 06/02/2016 04:12 PM, Timon Gehr wrote:It is not meaningful to compare utf-8 and utf-16 code units directly.But it is meaningful to compare Unicode code points. -- AndreiJun 02 2016On 02.06.2016 22:28, Andrei Alexandrescu wrote:On 06/02/2016 04:12 PM, Timon Gehr wrote:It is also meaningful to compare two utf-8 code units or two utf-16 code units.It is not meaningful to compare utf-8 and utf-16 code units directly.But it is meaningful to compare Unicode code points. -- AndreiJun 02 2016On 06/02/2016 04:50 PM, Timon Gehr wrote:On 02.06.2016 22:28, Andrei Alexandrescu wrote:By decoding them of course. -- AndreiOn 06/02/2016 04:12 PM, Timon Gehr wrote:It is also meaningful to compare two utf-8 code units or two utf-16 code units.It is not meaningful to compare utf-8 and utf-16 code units directly.But it is meaningful to compare Unicode code points. -- AndreiJun 02 2016On 02.06.2016 22:51, Andrei Alexandrescu wrote:On 06/02/2016 04:50 PM, Timon Gehr wrote:That makes no sense, I cannot decode single code units. BTW, I guess the reason why char converts to wchar converts to dchar is that the lower half of code units in char and the lower half of code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types.On 02.06.2016 22:28, Andrei Alexandrescu wrote:By decoding them of course. -- AndreiOn 06/02/2016 04:12 PM, Timon Gehr wrote:It is also meaningful to compare two utf-8 code units or two utf-16 code units.It is not meaningful to compare utf-8 and utf-16 code units directly.But it is meaningful to compare Unicode code points. -- AndreiJun 02 2016On 6/2/16 5:23 PM, Timon Gehr wrote:On 02.06.2016 22:51, Andrei Alexandrescu wrote:Then you lost me. (I'm sure you're making a good point.) -- AndreiOn 06/02/2016 04:50 PM, Timon Gehr wrote:That makes no sense, I cannot decode single code units. BTW, I guess the reason why char converts to wchar converts to dchar is that the lower half of code units in char and the lower half of code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types.On 02.06.2016 22:28, Andrei Alexandrescu wrote:By decoding them of course. -- AndreiOn 06/02/2016 04:12 PM, Timon Gehr wrote:It is also meaningful to compare two utf-8 code units or two utf-16 code units.It is not meaningful to compare utf-8 and utf-16 code units directly.But it is meaningful to compare Unicode code points. -- AndreiJun 02 2016On 02.06.2016 23:29, Andrei Alexandrescu wrote:On 6/2/16 5:23 PM, Timon Gehr wrote:Basically: bool bad(char c,dchar d){ return c==d; } // ideally shouldn't compile bool good(char c,char d){ return c==d; } // should compileOn 02.06.2016 22:51, Andrei Alexandrescu wrote:Then you lost me. (I'm sure you're making a good point.) 
-- AndreiOn 06/02/2016 04:50 PM, Timon Gehr wrote:That makes no sense, I cannot decode single code units. BTW, I guess the reason why char converts to wchar converts to dchar is that the lower half of code units in char and the lower half of code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types.On 02.06.2016 22:28, Andrei Alexandrescu wrote:By decoding them of course. -- AndreiOn 06/02/2016 04:12 PM, Timon Gehr wrote:It is also meaningful to compare two utf-8 code units or two utf-16 code units.It is not meaningful to compare utf-8 and utf-16 code units directly.But it is meaningful to compare Unicode code points. -- AndreiJun 02 2016On 6/2/2016 1:12 PM, Timon Gehr wrote:On 02.06.2016 22:07, Walter Bright wrote:Thanks for the correction.On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:No, the lambda returns a bool.* s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.The o is inferred as a wchar. The lamda then is inferred to return a wchar.Yes, you have a good point. But we do allow things like: byte b; if (b == 10000) ...The algorithm can check that the input is char[], and is being tested against a wchar. Therefore, the algorithm can specialize to do the decoding itself. No autodecoding necessary, and it does the right thing.It still would not be the right thing. The lambda shouldn't compile. It is not meaningful to compare utf-8 and utf-16 code units directly.Jun 02 2016On 02.06.2016 23:56, Walter Bright wrote:On 6/2/2016 1:12 PM, Timon Gehr wrote:Well, this is a somewhat different case, because 10000 is just not representable as a byte. Every value that fits in a byte fits in an int though. It's different for code units. They are incompatible both ways. E.g. dchar obviously does not fit in a char, and while the lower half of char is compatible with dchar, the upper half is specific to the encoding. dchar cannot represent upper half char code units. You get the code points with the corresponding values instead. E.g.: void main(){ import std.stdio,std.utf; foreach(dchar d;"ö".byCodeUnit) writeln(d); // "Ã", "¶" }... It is not meaningful to compare utf-8 and utf-16 code units directly.Yes, you have a good point. But we do allow things like: byte b; if (b == 10000) ...Jun 02 2016On 6/2/2016 3:11 PM, Timon Gehr wrote:Well, this is a somewhat different case, because 10000 is just not representable as a byte. Every value that fits in a byte fits in an int though. It's different for code units. They are incompatible both ways.Not exactly. (c == 'ö') is always false for the same reason that (b == 1000) is always false. I'm not sure what the right answer is here.Jun 02 2016On 03.06.2016 00:26, Walter Bright wrote:On 6/2/2016 3:11 PM, Timon Gehr wrote:Yes. And _additionally_, some other concerns apply that are not there for byte vs. int. I.e. if b == 10000 is disallowed, then c == d should be disallowed too, but b == 10000 can be allowed even if c == d is disallowed.Well, this is a somewhat different case, because 10000 is just not representable as a byte. Every value that fits in a byte fits in an int though. It's different for code units. They are incompatible both ways.Not exactly. (c == 'ö') is always false for the same reason that (b == 1000) is always false. ...I'm not sure what the right answer is here.char to dchar is a lossy conversion, so it shouldn't happen. 
byte to int is a lossless conversion, so there is no problem a priori.Jun 02 2016On Thursday, 2 June 2016 at 21:56:10 UTC, Walter Bright wrote:Yes, you have a good point. But we do allow things like: byte b; if (b == 10000) ...Why allowing char/wchar/dchar comparisons is wrong: void main() { string s = "Привет"; foreach (c; s) assert(c != 'Ñ'); } From my post from 2014: http://forum.dlang.org/post/knrwiqxhlvqwxqshyqpy forum.dlang.orgJun 02 2016On 06/02/2016 04:07 PM, Walter Bright wrote:On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:The lambda returns bool. -- Andrei* s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.The o is inferred as a wchar. The lamda then is inferred to return a wchar.Jun 02 2016On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:The lambda returns bool. -- AndreiYes, I was wrong about that. But the point still stands with:* s.balancedParens('〈', '〉') works only with autodecoding. * s.canFind('ö') works only with autodecoding. It returns always false without.Can be made to work without autodecoding.Jun 02 2016On 06/02/2016 05:58 PM, Walter Bright wrote:On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms. So you'd need to go through all of std.algorithm and make sure you can special-case your way out of situations that work today. AndreiThe lambda returns bool. -- AndreiYes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.Jun 02 2016On 03.06.2016 00:23, Andrei Alexandrescu wrote:On 06/02/2016 05:58 PM, Walter Bright wrote:The major issue is that it special cases when there's different, more natural semantics available.On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms.The lambda returns bool. -- AndreiYes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.Jun 02 2016On 6/2/2016 3:23 PM, Andrei Alexandrescu wrote:On 06/02/2016 05:58 PM, Walter Bright wrote:The argument to canFind() can be detected as not being a char, then decoded into a sequence of char's, then forwarded to a substring search.> * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.By special casing? Perhaps.I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms. So you'd need to go through all of std.algorithm and make sure you can special-case your way out of situations that work today.That's right. A side effect of that is that the algorithms will go even faster! So it's good. (A substring of codeunits is faster to search than decoding the input stream.)Jun 02 2016On Thursday, June 02, 2016 15:48:03 Walter Bright via Digitalmars-d wrote:On 6/2/2016 3:23 PM, Andrei Alexandrescu wrote:How do you suggest that we handle the normalization issue? 
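Setting normalization aside for a moment, the needle-encoding specialization Walter describes above is straightforward to sketch with today's Phobos. containsCodePoint is a hypothetical helper name, and the sketch deliberately ignores normalization, which is exactly the gap being asked about here:

import std.algorithm.searching : find;
import std.utf : byCodeUnit, encode;

// Encode the dchar needle once, then search code units against code units;
// the haystack is never decoded.
bool containsCodePoint(const(char)[] haystack, dchar needle)
{
    char[4] buf;                         // a UTF-8 sequence is at most 4 code units
    immutable len = encode(buf, needle); // pay the encoding cost once, up front
    return !find(haystack.byCodeUnit, buf[0 .. len].byCodeUnit).empty;
}

unittest
{
    assert(containsCodePoint("blöd", 'ö'));
    assert(!containsCodePoint("bled", 'ö'));
}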
Should we just assume NFC like std.uni.normalize does and provide an optional template argument to indicate a different normalization (like normalize does)? Since without providing a way to deal with the normalization, we're not actually making the code fully correct, just faster. - Jonathan M DavisOn 06/02/2016 05:58 PM, Walter Bright wrote:The argument to canFind() can be detected as not being a char, then decoded into a sequence of char's, then forwarded to a substring search.> * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.By special casing? Perhaps.Jun 02 2016On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:How do you suggest that we handle the normalization issue?Started a new thread for that one.Jun 02 2016On Thursday, June 02, 2016 18:23:19 Andrei Alexandrescu via Digitalmars-d wrote:On 06/02/2016 05:58 PM, Walter Bright wrote:Yeah, I believe that you do have to do some special casing, though it would be special casing on ranges of code units in general and not strings specifically, and a lot of those functions are already special cased on string in an attempt be efficient. In particular, with a function like find or canFind, you'd take the needle and encode it to match the haystack it was passed so that you can do the comparisons via code units. So, you incur the encoding cost once when encoding the needle rather than incurring the decoding cost of each code point or grapheme as you iterate over the haystack. So, you end up with something that's correct and efficient. It's also much friendlier to code that only operates on ASCII. The one issue that I'm not quite sure how we'd handle in that case is normalization (which auto-decoding doesn't handle either), since you'd need to normalize the needle to match the haystack (which also assumes that the haystack was already normalized). Certainly, it's the sort of thing that makes it so that you kind of wish you were dealing with a string type that had the normalization built into it rather than either an array of code units or an arbitrary range of code units. But maybe we could assume the NFC normalization like std.uni.normalize does and provide an optional template argument for the normalization scheme. In any case, while it's not entirely straightforward, it is quite possible to write some algorithms in a way which works on arbitrary ranges of code units and deals with Unicode correctly without auto-decoding or requiring that the user convert it to a range of code points or graphemes in order to properly handle the full range of Unicode. And even if we keep auto-decoding, we pretty much need to fix it so that std.algorithm and friends are Unicode-aware in this manner so that ranges of code units work in general without requiring that you use byGrapheme. So, this sort of thing could have a large impact on RCStr, even if we keep auto-decoding for narrow strings. Other algorithms, however, can't be made to work automatically with Unicode - at least not with the current range paradigm. filter, for instance, really needs to operate on graphemes to filter on characters, but with a range of code units, that would mean operating on groups of code units as a single element, which you can't do with something like a range of char, since that essentially becomes a range of ranges. 
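As a concrete reference point for the code-unit / code-point / grapheme distinction running through this subthread, here is a small self-contained example; it relies only on std.utf.byCodeUnit and std.uni.byGrapheme, both in Phobos:

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "e\u0301";                 // 'e' + U+0301 COMBINING ACUTE ACCENT, i.e. decomposed "é"
    assert(s.byCodeUnit.walkLength == 3); // three UTF-8 code units
    assert(s.walkLength == 2);            // two code points, which is what autodecoding iterates
    assert(s.byGrapheme.walkLength == 1); // one grapheme, i.e. one visible character
}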
It has to be wrapped in a range that's going to provide graphemes - and of course, if you know that you're operating only on ASCII, then you wouldn't want to deal with graphemes anyway, so automatically converting to graphemes would be undesirable. So, for a function like filter, it really does have to be up to the programmer to indicate what level of Unicode they want to operate at. But if we don't make functions Unicode-aware where possible, then we're going to take a performance hit by essentially forcing everyone to use explicit ranges of code points or graphemes even when they should be unnecessary. So, I think that we're stuck with some level of special casing, but it would then be for ranges of code units and code points and not strings. So, it would work efficiently for stuff like RCStr, which the current scheme does not. I think that the reality of the matter is that regardless of whether we keep auto-decoding for narrow strings in place, we need to make Phobos operate on arbitrary ranges of code units and code points, since even stuff like RCStr won't work efficiently otherwise, and stuff like byCodeUnit won't be usuable in as many cases otherwise, because if a generic function isn't Unicode-aware, then in many cases, byCodeUnit will be very wrong, just like byCodePoint would be wrong. So, as far as Phobos goes, I'm not sure that the question of auto-decoding matters much for what we need to do at this point. If we do what we need to do, then Phobos will work whether we have auto-decoding or not (working in a Unicode-aware manner where possible and forcing the user to decide the correct level of Unicode to work at where not), and then it just becomes a question of whether we can or should deprecate auto-decoding once all that's done. - Jonathan M DavisOn 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms. So you'd need to go through all of std.algorithm and make sure you can special-case your way out of situations that work today.The lambda returns bool. -- AndreiYes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.Jun 02 2016Am Thu, 2 Jun 2016 15:05:44 -0400 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:On 06/02/2016 01:54 PM, Marc Sch=C3=BCtz wrote:Andrei, your ignorance is really starting to grind on everyones nerves. If after 350 posts you still don't see why this is incorrect: s.any!(c =3D> c =3D=3D 'o'), you must be actively skipping the informational content of this thread. You are in error, no one agrees with you, and you refuse to see it and in the end we have to assume you will make a decisive vote against any PR with the intent to remove auto-decoding from Phobos. Your so called vocal minority is actually D's panel of Unicode experts who understand that auto-decoding is a false ally and should be on the deprecation track. Remember final-by-default? You promised, that your objection about breaking code means that D2 will only continue to be fixed in a backwards compatible way, be it the implementation of shared or whatever else. Yet months later you opened a thread with the title "inout must go". So that must have been an appeasement back then. 
People don't forget these things easily and RCStr seems to be a similar distraction, considering we haven't looked into borrowing/scoped enough and you promise wonders from it. --=20 MarcoWhich practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units? =20=20 Pretty much everything. s.all!(c =3D> c =3D=3D '=C3=B6')Jun 02 2016On 6/2/2016 3:10 PM, Marco Leise wrote:we haven't looked into borrowing/scoped enoughThat's my fault. As for scoped, the idea is to make scope work analogously to DIP25's 'return ref'. I don't believe we need borrowing, we've worked out another solution that will work for ref counting. Please do not reply to this in this thread - start a new one if you wish to continue with this topic.Jun 02 2016On 06/02/2016 06:10 PM, Marco Leise wrote:Am Thu, 2 Jun 2016 15:05:44 -0400 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:Indeed there seem to be serious questions about my competence, basic comprehension, and now knowledge. I understand it is tempting to assume that a disagreement is caused by the other simply not understanding the matter. Even if that were true it's not worth sacrificing civility over it.On 06/02/2016 01:54 PM, Marc Schütz wrote:Andrei, your ignorance is really starting to grind on everyones nerves.Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?Pretty much everything. s.all!(c => c == 'ö')If after 350 posts you still don't see why this is incorrect: s.any!(c => c == 'o'), you must be actively skipping the informational content of this thread.Is it 'o' with an umlaut or without? At any rate, consider s of type string and x of type dchar. The dchar type is defined as "a Unicode code point", or at least my understanding that has been a reasonable definition to operate with in the D language ever since its first release. Also in the D language, the various string types char[], wchar[] etc. with their respective qualified versions are meant to hold Unicode strings with one of the UTF8, UTF16, and UTF32 encodings. Following these definitions, it stands to reason to infer that the call s.find(c => c == x) means "find the code point x in string s and return the balance of s positioned there". It's prima facie application of the definitions of the entities involved. Is this the only possible or recommended meaning? Most likely not, viz. the subtle cases in which a given grapheme is represented via either one or multiple code points by means of combining characters. Is it the best possible meaning? It's even difficult to define what "best" means (fastest, covering most languages, etc). I'm not claiming that meaning is the only possible, the only recommended, or the best possible. All I'm arguing is that it's not retarded, and within a certain universe confined to operating at code point level (which is reasonable per the definitions of the types involved) it can be considered correct. If at any point in the reasoning above some rampant ignorance comes about, please point it out.You are in error, no one agrees with you, and you refuse to see it and in the end we have to assume you will make a decisive vote against any PR with the intent to remove auto-decoding from Phobos.This seems to assume I have some vesting in the position that makes it independent of facts. That is not the case. 
I do what I think is right to do, and you do what you think is right to do.Your so-called vocal minority is actually D's panel of Unicode experts who understand that auto-decoding is a false ally and should be on the deprecation track.They have failed to convince me. But I am more convinced than before that RCStr should not offer a default mode of iteration. I think its impact is lost in this discussion, because once it's understood RCStr will become D's recommended string type, the entire matter becomes moot.Remember final-by-default? You promised that your objection about breaking code means that D2 will only continue to be fixed in a backwards compatible way, be it the implementation of shared or whatever else. Yet months later you opened a thread with the title "inout must go". So that must have been an appeasement back then. People don't forget these things easily and RCStr seems to be a similar distraction, considering we haven't looked into borrowing/scoped enough and you promise wonders from it.What the hell is this, digging dirt on me? Paying back debts? Please stop that crap. AndreiJun 02 2016On Thu, 2 Jun 2016 18:54:21 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 06/02/2016 06:10 PM, Marco Leise wrote:That's not my general impression, but something is different with this thread.On Thu, 2 Jun 2016 15:05:44 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote: Indeed there seem to be serious questions about my competence, basic comprehension, and now knowledge.On 06/02/2016 01:54 PM, Marc Schütz wrote: Andrei, your ignorance is really starting to grind on everyone's nerves. Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units? Pretty much everything. s.all!(c => c == 'ö') I understand it is tempting to assume that a disagreement is caused by the other simply not understanding the matter. Even if that were true it's not worth sacrificing civility over it.Civility has had us caught in a 36-page-long, tiresome debate with us mostly talking past each other. I was being impolite and can't say I regret it, because I prefer this answer over the rest of the thread. It's more informed, elaborate and conclusive.If after 350 posts you still don't see why this is incorrect: s.any!(c => c == 'o'), you must be actively skipping the informational content of this thread. Is it 'o' with an umlaut or without? At any rate, consider s of type string and x of type dchar. The dchar type is defined as "a Unicode code point", or at least my understanding that has been a reasonable definition to operate with in the D language ever since its first release. Also in the D language, the various string types char[], wchar[] etc. with their respective qualified versions are meant to hold Unicode strings with one of the UTF8, UTF16, and UTF32 encodings. Following these definitions, it stands to reason to infer that the call s.find(c => c == x) means "find the code point x in string s and return the balance of s positioned there". It's prima facie application of the definitions of the entities involved. Is this the only possible or recommended meaning? Most likely not, viz. the subtle cases in which a given grapheme is represented via either one or multiple code points by means of combining characters. Is it the best possible meaning?
It's even difficult to define what "best" means (fastest, covering most languages, etc). I'm not claiming that meaning is the only possible, the only recommended, or the best possible. All I'm arguing is that it's not retarded, and within a certain universe confined to operating at code point level (which is reasonable per the definitions of the types involved) it can be considered correct. If at any point in the reasoning above some rampant ignorance comes about, please point it out.No, it's pretty close now. We can all agree that there is no "best" way, only different use cases. Just defining Phobos to work on code points gives the false illusion that it does the correct thing in all use cases - after all, D claims to support Unicode. But in case you wanted to iterate over visual letters it is incorrect, and it is otherwise slow when you work on ASCII-structured formats (JSON, XML, paths, Warp, ...). Then there is explaining the different default iteration schemes when using foreach vs. the range API (no big deal, just not easily justified) and the cost of implementation when dealing with char[]/wchar[]. From this observation we concluded that decoding should be opt-in and that when we need it, it should be a conscious decision. Unicode is quite complex, and learning about the difference between code points and grapheme clusters when segmenting strings will benefit code quality. As for the question, do multi-code-point graphemes ever appear in the wild? OS X is known to use NFD on its native file system, and there is a hint on Wikipedia that some symbols from Thai or Hindi's Devanagari need them: https://en.wikipedia.org/wiki/UTF-8#Disadvantages Some form of Lithuanian seems to have a use for them, too: http://www.unicode.org/L2/L2012/12026r-n4191-lithuanian.pdf Aside from those there is nothing generally wrong about decomposed letters appearing in strings, even though the use of NFC is encouraged.Your vote outweighs that of many others for better or worse. When a decision needs to be made and the community is divided, we need you or Walter or anyone who is invested in the matter to cast a ruling vote. However, when several dozen people support an idea after discussion, hearing everyone's arguments with practically no objections, and you overrule everyone, tensions build up. I welcome the idea to delegate some of the tasks to smaller groups. No single person is knowledgeable in every area of CS, and both a bus factor of 1 and too big a group can hinder decision making. It would help to know for the future whether you understand your role as one with veto powers, or whether you could arrange to give up responsibility for decisions to the community, and if so under what conditions.[…harsh tone removed…] in the end we have to assume you will make a decisive vote against any PR with the intent to remove auto-decoding from Phobos. This seems to assume I have some vesting in the position that makes it independent of facts. That is not the case. I do what I think is right to do, and you do what you think is right to do.No, that was my actual impression. I must apologize for generalizing it to other people, though. I welcome the RCStr project and hope it will be good. At this time, though, it is not yet fleshed out and we can't tell how fast its adoption will be. Remember that DIPs on scope and RC have had a tendency in the past to go into long debates with unclear outcomes.
Unlike this thread, which may be the first in D's forum history with such a high agreement across the board.Your so-called vocal minority is actually D's panel of Unicode experts who understand that auto-decoding is a false ally and should be on the deprecation track. They have failed to convince me. But I am more convinced than before that RCStr should not offer a default mode of iteration. I think its impact is lost in this discussion, because once it's understood RCStr will become D's recommended string type, the entire matter becomes moot.Remember final-by-default? You promised that your objection about breaking code means that D2 will only continue to be fixed in a backwards compatible way, be it the implementation of shared or whatever else. Yet months later you opened a thread with the title "inout must go". So that must have been an appeasement back then. People don't forget these things easily and RCStr seems to be a similar distraction, considering we haven't looked into borrowing/scoped enough and you promise wonders from it. What the hell is this, digging dirt on me? Paying back debts? Please stop that crap. Andrei-- MarcoJun 03 2016On Thursday, June 02, 2016 15:05:44 Andrei Alexandrescu via Digitalmars-d wrote:The intent of autodecoding was to make std.algorithm work meaningfully with strings. As it's easy to see, I just went through std.algorithm.searching alphabetically and found issues literally with every primitive in there. It's an easy exercise to go forth with the others.It comes down to the question of whether it's better to fail quickly when Unicode is handled incorrectly so that it's obvious that you're doing it wrong, or whether it's better for it to work in a large number of cases so that for a lot of code it "just works" but is still wrong in the general case, and it's a lot less obvious that that's the case, so many folks won't realize that they need to do more in order to have their string handling be Unicode-correct. With code units - especially UTF-8 - it becomes obvious very quickly that treating each element of the string/range as a character is wrong. With code points, you have to work far harder to find examples that are incorrect. So, it's not at all obvious (especially to the lay programmer) that the Unicode handling is incorrect and that their code is wrong - but their code will end up working a large percentage of the time in spite of it being wrong in the general case. So, yes, it's trivial to show how operating on ranges of code units as if they were characters gives incorrect results far more easily than operating on ranges of code points does. But operating on code points as if they were characters is still going to give incorrect results in the general case. Regardless of auto-decoding, the answer is that the programmer needs to understand the Unicode issues and use ranges of code units or code points where appropriate and use ranges of graphemes where appropriate. It's just that if we default to handling code points, then a lot of code will be written which treats those as characters, and it will provide the correct result more often than it would if it treated code units as characters. In any case, I've probably posted too much in this thread already.
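To make the "works in a large number of cases, wrong in the general case" point concrete, a minimal runnable example; it assumes the two literals really are stored in the normalization forms written, precomposed in the first and decomposed in the second:

import std.algorithm.searching : canFind;

void main()
{
    // Precomposed (NFC): 'ö' is a single code point, so a code-point-level
    // search succeeds and the code looks correct.
    assert("blöd".canFind('ö'));

    // Decomposed (NFD): the same visible text is 'o' followed by U+0308, no
    // single code point equals 'ö', and the same call quietly returns false.
    assert(!"blo\u0308d".canFind('ö'));
}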
It's clear that the first step to solving this problem is to improve Phobos so that it handles ranges of code units, code points, and graphemes correctly whether auto-decoding is involved or not, and only then can we consider the possibility of removing auto-decoding (and even then, the answer may still be that we're stuck, because we consider the resulting code breakage to be too great). But whether Phobos retains auto-decoding or not, the Unicode handling stuff in general is the same, and what we need to do to improve the siutation is the same. So, clearly, I need to do a much better job of finding time to work on D so that I can create some PRs to help the situation. Unfortunately, it's far easier to find a few minutes here and there while waiting on other stuff to shoot off a post or two in the newsgroup than it is to find time to substantively work on code. :| - Jonathan M DavisJun 03 2016On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu wrote:Look at reddit and hackernews, too - admittedly other self-selected communities. Language debates often spring about. How often is the point being made that D is wanting because of its string support? Nada.I've been lurking on this thread for a while and was convinced by the arguments that autodecoding should go. Nevertheless, I think this is really the strongest argument you've made against using the community's resources to fix it now. If your position from the beginning were this clear, then I think the thread might not have gone on so long. As someone trained in economics, I get convinced by arguments about scarce resources. It makes more sense to focus on higher value issues. However, the case against autodecoding is clearly popular. At a minimum, it has resulted in a significant amount of time dedicated to forum discussion and has made you metaphorically angry at Walter. Resources spent grumbling about it could be better spent elsewhere. One way to deal with the problem of scarce resources is by reducing the cost of whatever action you want to take. For instance, Adam Ruppe just put up a good post in the Dealing with Autodecode thread https://forum.dlang.org/post/ksasfwpuvpwxjfniupiv forum.dlang.org noting that a compiler switch could easily be added to phobos. Combined with a long deprecation timeline, the cost that it would impose on D users who are not active forum members and might want to complain about the issue would be relatively small. Another problem related to scarce resources is that there is a division of labor in the community. People like yourself and Walter have fewer substitutes for your labor. It makes sense that the top contributors should be focusing on higher value issues where fewer people have the ability to contribute. I don't dispute that. However, there seem to be a number of people who can contribute on this issue and want to contribute. Scarcity of resources seems to be less of an issue here. Finally, when you discussed things people complain about D, you mentioned tooling. In the time I've been following this forum, I haven't seen a single thread focusing on this issue. I don't mean a few comments like "oh D should improve its tooling." I mean a thread dedicated to D's tooling strengths and weaknesses with a goal of creating a plan on what to do to improve things.Currently dfix is weak because it doesn't do lookup. So we need to make the front end into a library. Daniel said he wants to be on it, but he has two jobs to worry about so he's short on time. 
There's only so many hours in the day, and I think the right focus is on attacking the matters above.On a somewhat tangential basis, I was reading about Microsoft's Roslyn a week or so ago. They do something similar where they have a compiler API. I don't have a very good sense of how it works from their overview, but it seems to be an interesting approach.Jun 02 2016On 06/02/2016 10:14 AM, jmh530 wrote:However, the case against autodecoding is clearly popular. At a minimum, it has resulted in a significant amount of time dedicated to forum discussion and has made you metaphorically angry at Walter. Resources spent grumbling about it could be better spent elsewhere.Yah, this is a bummer and one of the larger issues of our community: there's too much talking about doing things and too little doing things. On one hand I want to empower people (as I said at DConf: please get me fired!), and on the other I need to prevent silly things from happening. The quality of some of the code that gets into Phobos when I look the other way is sadly sub-par. Cumulatively that has reduced its quality over time. That (improving the time * talent torque) is the real solution to Phobos' technical debt, of which autodecoding is negligible.One way to deal with the problem of scarce resources is by reducing the cost of whatever action you want to take. For instance, Adam Ruppe just put up a good post in the Dealing with Autodecode thread https://forum.dlang.org/post/ksasfwpuvpwxjfniupiv forum.dlang.org noting that a compiler switch could easily be added to phobos. Combined with a long deprecation timeline, the cost that it would impose on D users who are not active forum members and might want to complain about the issue would be relatively small.This is a very costly solution to a very small problem. I'm here to prevent silly things like this from happening and from bringing back perspective. We've had huge issues with language changes that were much more important and brought much less breakage. The fact that people talk about 132 breakages in Phobos with a straight face is a good sign that the heat of the debate has taken perspective away. I'm sure it will come back in a few weeks. Just need to keep the dam until then. The real ticket out of this is RCStr. It solves a major problem in the language (compulsive GC) and also a minor occasional annoyance (autodecoding). This is what I need to work on, instead of writing long messages to put back sense into people. Many don't realize that the only reason current strings ever work in safe code is because of the GC. char[] is too little encapsulation, so it needs GC as a crutch to be safe. That's the problem with D's strings, not autodecoding. That's why we need to change things. That's what keeps me awake at night. AndreiJun 02 2016On Thursday, 2 June 2016 at 15:02:13 UTC, Andrei Alexandrescu wrote:Yah, this is a bummer and one of the larger issues of our community: there's too much talking about doing things and too little doing things.We wrote a PR to implement the first step in the autodecode deprecation cycle. Granted, it wasn't ready to merge, but you just closed it with a flippant "not gonna happen" despite the *unanimous* agreement that the status quo sucks, and now complain that there's too much talking and too little doing! When we do something, you just shut it down then blame us. What's even the point of trying anymore?Jun 02 2016On Thursday, 2 June 2016 at 15:38:46 UTC, Adam D. 
Ruppe wrote:On Thursday, 2 June 2016 at 15:02:13 UTC, Andrei Alexandrescu wrote:https://www.youtube.com/watch?v=MJiBjfvltQwYah, this is a bummer and one of the larger issues of our community: there's too much talking about doing things and too little doing things.We wrote a PR to implement the first step in the autodecode deprecation cycle. Granted, it wasn't ready to merge, but you just closed it with a flippant "not gonna happen" despite the *unanimous* agreement that the status quo sucks, and now complain that there's too much talking and too little doing! When we do something, you just shut it down then blame us. What's even the point of trying anymore?Jun 02 2016On Thursday, 2 June 2016 at 15:38:46 UTC, Adam D. Ruppe wrote:We wrote a PR to implement the first step in the autodecode deprecation cycle.It outright deprecated popFront - that's not the first step in the migration.Jun 02 2016On Thursday, 2 June 2016 at 15:50:54 UTC, Kagamin wrote:It outright deprecated popFront - that's not the first step in the migration.Which gave us the list of places inside Phobos to fix, only about two hours of work, and proved that the version() method was viable (and REALLY easy to implement).Jun 02 2016On Thursday, 2 June 2016 at 16:02:18 UTC, Adam D. Ruppe wrote:Which gave us the list of places inside Phobos to fix, only about two hours of work, and proved that the version() method was viable (and REALLY easy to implement).Yes, it was a research PR that was never meant to be an implementation of the first step. You used wrong wording that just unnecessarily freaked Andrei out.Jun 02 2016On 06/02/2016 12:45 PM, Kagamin wrote:On Thursday, 2 June 2016 at 16:02:18 UTC, Adam D. Ruppe wrote:I closed it because it wasn't an actual implementation, in full understanding that the discussion in it could continue. -- AndreiWhich gave us the list of places inside Phobos to fix, only about two hours of work, and proved that the version() method was viable (and REALLY easy to implement).Yes, it was a research PR that was never meant to be an implementation of the first step. You used wrong wording that just unnecessarily freaked Andrei out.Jun 02 2016On 6/2/2016 9:02 AM, Adam D. Ruppe wrote:Which gave us the list of places inside Phobos to fix, only about two hours of work, and proved that the version() method was viable (and REALLY easy to implement).Nothing prevents anyone from doing that on their own (it's trivial) in order to find Phobos problems, and pick one or three to fix.Jun 02 2016On 6/2/2016 8:50 AM, Kagamin wrote:It outright deprecated popFront - that's not the first step in the migration.That's right. It's going about things backwards. The first step is to adjust Phobos implementations and documentation so they do not rely on autodecoding. This will take some time and care, particularly with algorithms that support mixed codeunit argument types. (Or perhaps mixed codeunit argument types can be deprecated.) This is not so simple, as they have to be dealt with one by one.Jun 02 2016On Thursday, 2 June 2016 at 20:32:39 UTC, Walter Bright wrote:The first step is to adjust Phobos implementations and documentation so they do not rely on autodecoding.The compiler can help you with that. That's the point of the do not merge PR: it got an actionable list out of the compiler and proved the way forward was viable.Jun 02 2016On 6/2/2016 1:46 PM, Adam D. Ruppe wrote:The compiler can help you with that. 
That's the point of the do not merge PR: it got an actionable list out of the compiler and proved the way forward was viable.What is supposed to be done with "do not merge" PRs other than close them?Jun 02 2016On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:What is supposed to be done with "do not merge" PRs other than close them?Experimentally iterate until something workable comes about. This way it's done publicly and people can collaborate.Jun 02 2016On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:What is supposed to be done with "do not merge" PRs other than close them?Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though). Presumably if someone marks their own PR as "do not merge", it means they're planning to either close it themselves after it has served its purpose, or they plan to fix/finish it and then remove the "do not merge" label. Either way, they shouldn't be closed just because they say "do not merge" (unless they're abandoned or something, obviously).Jun 02 2016On 6/2/16 5:05 PM, tsbockman wrote:On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:Feel free to reopen if it helps, it wasn't closed in anger. -- AndreiWhat is supposed to be done with "do not merge" PRs other than close them?Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though). Presumably if someone marks their own PR as "do not merge", it means they're planning to either close it themselves after it has served its purpose, or they plan to fix/finish it and then remove the "do not merge" label.Jun 02 2016On 6/2/2016 2:05 PM, tsbockman wrote:On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:I've done that, but that doesn't apply here.What is supposed to be done with "do not merge" PRs other than close them?Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though).Presumably if someone marks their own PR as "do not merge", it means they're planning to either close it themselves after it has served its purpose, or they plan to fix/finish it and then remove the "do not merge" label.That doesn't seem to apply here, either.Either way, they shouldn't be closed just because they say "do not merge" (unless they're abandoned or something, obviously).Something like that could not be merged until 132 other PRs are done to fix Phobos. It doesn't belong as a PR.Jun 02 2016On Thursday, 2 June 2016 at 22:20:49 UTC, Walter Bright wrote:On 6/2/2016 2:05 PM, tsbockman wrote:I was just responding to the general question you posed about "do not merge" PRs, not really arguing for that one, in particular, to be re-opened. I'm sure wilzbach is willing to explain if anyone cares to ask him why he did it as a PR, though.Presumably if someone marks their own PR as "do not merge", it means they're planning to either close it themselves after it has served its purpose, or they plan to fix/finish it and then remove the "do not merge" label.That doesn't seem to apply here, either.Either way, they shouldn't be closed just because they say "do not merge" (unless they're abandoned or something, obviously).Something like that could not be merged until 132 other PRs are done to fix Phobos. It doesn't belong as a PR.Jun 02 2016On 06/02/2016 11:38 AM, Adam D. 
Ruppe wrote:On Thursday, 2 June 2016 at 15:02:13 UTC, Andrei Alexandrescu wrote:You mean https://github.com/dlang/phobos/pull/4384, the one with "[do not merge]" in the title? Would you realistically have advised me to merge it? I spent time writing what I thought was a reasonable and reasonably long answer. Allow me to quote it below:Yah, this is a bummer and one of the larger issues of our community: there's too much talking about doing things and too little doing things.We wrote a PR to implement the first step in the autodecode deprecation cycle. Granted, it wasn't ready to merge, but you just closed it with a flippant "not gonna happen" despite the *unanimous* agreement that the status quo sucks, and now complain that there's too much talking and too little doing!wilzbach thanks for running this experiment.Could you please point me at the parts you found flippant in it, or merely unreasonable?Andrei is wrong.Definitely wouldn't be the first time and not the last.We can all see it, and maybe if we demonstrate that a migration path is possible, even actually pretty easy following a simple deprecation path, maybe he can see it too.I'm not sure who "all" is but that's beside the point. Taking a step back, we'd take in a change that breaks Phobos in 132 places only if it was a major language overhaul bringing dramatic improvements to the quality of life for D programmers. An artifact as earth shattering as ranges, or an ownership system that was massively simple and beneficial. For comparison, the recent changes in name lookup broke Phobos in fewer places (I don't have an exact number, but I think they were at most a couple dozen.) Those changes were closing an enormous hole in the language and mark a huge step forward. I'd be really hard pressed to characterize the elimination of autodecoding as enough of an improvement to warrant this kind of breakage. (I do realize there's a difference between breakage and deprecation, but for many folks the distinction is academic.) The better end game here is to improve efficiency of code that uses autodecoding (e.g. per the recent `find()` work), and to make sure `RCStr` is the right design. A string that manages its own memory _and_ does the right things with regard to Unicode is the ticket. Let's focus future efforts on that.When we do something, you just shut it down then blame us. What's even the point of trying anymore?At some point I need to stick with what I think is the better course for D, even if that means disagreeing with you. But I hope you understand this is not "flippant" or teasing people then shutting down their good work. AndreiJun 02 2016On Thursday, 2 June 2016 at 16:12:01 UTC, Andrei Alexandrescu wrote:Would you realistically have advised me to merge it?Not at this time, no, but I also wouldn't advise you to close it and tell us to stop trying if you were actually open to a chance. You closed that and posted this at about the same time: http://forum.dlang.org/post/nii497$2p79$1 digitalmars.com "I'm not going to debate this further" "What's important is that autodecoding is here to stay - there's no realistic way to eliminate it from D." So, what do you seriously expect us to think? We had a migration plan and enough excitement to start working on the code, then within about 15 minutes of each other, you close the study PR and post that the discussion is over and your mistake is here to stay.I'm not sure who "all" is but that's beside the point.This sentence makes me pretty mad too. 
This topic has come up many times and nobody, NOBODY, with the exception of yourself agrees with the current behavior anymore. It is a very frequently asked question among new users, and we have no real justification because there is no technical merit to it.Jun 02 2016On 06/02/2016 02:36 PM, Adam D. Ruppe wrote:We had a migration plan and enough excitement to start working on the codeI don't think the plan is realistic. How can I tell you this without you getting mad at me? Apparently the only way to go is do as you say. -- AndreiJun 02 2016On Thursday, 2 June 2016 at 18:43:54 UTC, Andrei Alexandrescu wrote:I don't think the plan is realistic. How can I tell you this without you getting mad at me?You get out of the way and let the community get to work. Actually delegate, let people take ownership of problems, success and failure alike. If we fail then, at least it will be from our own experience instead of from executive meddling.Jun 02 2016On 06/02/2016 03:13 PM, Adam D. Ruppe wrote:On Thursday, 2 June 2016 at 18:43:54 UTC, Andrei Alexandrescu wrote:That's a good point. We plan to do more of that in the future.I don't think the plan is realistic. How can I tell you this without you getting mad at me?You get out of the way and let the community get to work. Actually delegate, let people take ownership of problems, success and failure alike.If we fail then, at least it will be from our own experience instead of from executive meddling.This applies to high-risk work that is also of commensurately extraordinary value. My assessment is this is not it. If you were in my position you'd also do what you think is the best thing to do, and nobody should feel offended by that. AndreiJun 02 2016On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu wrote:This is what's happening here. We worked ourselves to a foam because the creator of the language started a thread entitled "The Case Against Autodecode", whilst fully understanding there is no way to actually eliminate autodecode.Autodecode doesn't need to be removed from phobos completely, it only needs to be more bearable, like it is in the foreach statement. E.g. byDchar will stay, initial idea is to actually put it to more intensive usage in phobos and user code, no need to remove it.Jun 02 2016On 06/02/2016 10:53 AM, Kagamin wrote:On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu wrote:Yah, and then such code will work with RCStr. -- AndreiThis is what's happening here. We worked ourselves to a foam because the creator of the language started a thread entitled "The Case Against Autodecode", whilst fully understanding there is no way to actually eliminate autodecode.Autodecode doesn't need to be removed from phobos completely, it only needs to be more bearable, like it is in the foreach statement. E.g. byDchar will stay, initial idea is to actually put it to more intensive usage in phobos and user code, no need to remove it.Jun 02 2016On Thursday, 2 June 2016 at 15:06:20 UTC, Andrei Alexandrescu wrote:Yes, do consider Walter's proposal, it will be an enabling technology for RCStr too: the more phobos works with string-like ranges the more it is usable for RCStr.Autodecode doesn't need to be removed from phobos completely, it only needs to be more bearable, like it is in the foreach statement. E.g. byDchar will stay, initial idea is to actually put it to more intensive usage in phobos and user code, no need to remove it.Yah, and then such code will work with RCStr. 
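For readers following along, the opt-in primitives being referred to already exist in std.utf (byChar, byWchar and byDchar; the generalized byUTF!dchar spelling is mentioned just below). A minimal sketch of choosing the iteration level explicitly:

import std.algorithm.searching : canFind;
import std.utf : byChar, byDchar;

void main()
{
    string s = "blöd";
    assert(s.byDchar.canFind('ö'));  // decode to code points on purpose
    assert(s.byChar.canFind('d'));   // stay at the code-unit level on purpose
    assert(!s.byChar.canFind('ö'));  // and accept that 'ö' is not a single code unit
}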
-- AndreiJun 02 2016On 06/02/2016 12:14 PM, Kagamin wrote:On Thursday, 2 June 2016 at 15:06:20 UTC, Andrei Alexandrescu wrote:Walter and I have a unified view on this. Although I'd need to raise the issue that the primitive should be by!dchar, not byDchar. -- AndreiYes, do consider Walter's proposal, it will be an enabling technology for RCStr too: the more phobos works with string-like ranges the more it is usable for RCStr.Autodecode doesn't need to be removed from phobos completely, it only needs to be more bearable, like it is in the foreach statement. E.g. byDchar will stay, initial idea is to actually put it to more intensive usage in phobos and user code, no need to remove it.Yah, and then such code will work with RCStr. -- AndreiJun 02 2016On Thursday, 2 June 2016 at 16:21:33 UTC, Andrei Alexandrescu wrote:On 06/02/2016 12:14 PM, Kagamin wrote:The primitive is byUTF!dchar:On Thursday, 2 June 2016 at 15:06:20 UTC, Andrei Alexandrescu wrote:Walter and I have a unified view on this. Although I'd need to raise the issue that the primitive should be by!dchar, not byDchar. -- AndreiYes, do consider Walter's proposal, it will be an enabling technology for RCStr too: the more phobos works with string-like ranges the more it is usable for RCStr.Autodecode doesn't need to be removed from phobos completely, it only needs to be more bearable, like it is in the foreach statement. E.g. byDchar will stay, initial idea is to actually put it to more intensive usage in phobos and user code, no need to remove it.Yah, and then such code will work with RCStr. -- AndreiJun 02 2016On Thu, Jun 02, 2016 at 09:06:44AM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]ZombineDev, I've been at the top level in the C++ community for many many years, even after I wanted to exit :o). I'm familiar with how the committee that steers C++ works, perspective that is unique in our community - even Walter lacks it. I see trends and patterns. It is interesting how easily a small but very influential priesthood can alienate itself from the needs of the larger community and get into a frenzy over matters that are simply missing the point.Appeal to authority.This is what's happening here. We worked ourselves to a foam because the creator of the language started a thread entitled "The Case Against Autodecode", whilst fully understanding there is no way to actually eliminate autodecode.I think that's a misrepresentation of the situation. I was getting increasingly unhappy with autodecoding myself, completely independently of Walter, and in fact have filed bugs and posted complaints about it long before Walter started his thread. I used to be a supporter of autodecoding, but over time it has become increasingly clear to me that it was a mistake. The fact that you continue to deny this and write it off in the face of similar complaints raised by many active D users is very off-putting, to say the least, and does not inspire confidence. Not to mention the fact that you started this thread yourself with a question about what it is we dislike about autodecoding, yet after having received a multitude of complaints, corrobated by many forum members, you simply write off the whole thing like it was nothing. If you want D to succeed, you need to raise the morale of the community, and this is not the way to raise morale.The very definition of a useless debate, the kind he and I had agreed to not initiate anymore. It was a mistake. 
I'm still metaphorically angry at him for it.On the contrary, I found that Walter's willingness to admit past mistakes very refreshing, even if practically speaking we can't actually get rid of autodecoding today. What he proposed in the other thread is actually a workable step towards reversing the wrong decision behind autodecoding, that doesn't leave existing users out in the cold, and that we might actually be able to pull off if done carefully. I know you probably won't see it the same way, since you still seem convinced that autodecoding was a good idea, but you need to understand that your opinion is not representative in this case. [...]Meanwhile, I go to conferences. Train and consult at large companies. Dozens every year, cumulatively thousands of people. I talk about D and ask people what it would take for them to use the language. Invariably I hear a surprisingly small number of reasons: * The garbage collector eliminates probably 60% of potential users right off.At least we have begun to do something about this. That's good news.* Tooling is immature and of poorer quality compared to the competition.And what have we done about it? How long has it been since dfix existed, yet we still haven't really integrated it into the dmd toolchain?* Safety has holes and bugs.And what have we done about it?* Hiring people who know D is a problem.There are many willing candidates right here. :-P* Documentation and tutorials are weak.And what have we done about this?* There's no web services framework (by this time many folks know of D, but of those a shockingly small fraction has even heard of vibe.d). I have strongly argued with Snke to bundle vibe.d with dmd over one year ago, and also in this forum. There wasn't enough interest.What about linking to it in a prominent place on dlang.org? This isn't a big problem, AFAICT. I don't think it takes months and years to put up a big prominent banner promoting vibe.d on, say, the download page of dlang.org.* (On Windows) if it doesn't have a compelling Visual Studio plugin, it doesn't exist.And what have we done about this? One of the things that I have found a little disappointing with D is that while it has many very promising features, it lacks polish in many small details. Such as the way features interact with each other in corner cases. E.g., the whole can't-use-gc from dtor debacle, the semantics of closures over aggregate members, holes in safe, holes in const/immutable in unions, the whole import mess that took oh-how-many-years to clean up that thankfully was finally improved recently, can't use nogc with Phobos, can't use const/pure/etc. in Object.toString, Object.opEqual, et al (which we've been trying to get of since how many years ago now?), and a whole long list of small irritations that in themselves are nothing, but together add up like a dustball to an overall perception of lack of polish. I'm more sympathetic to Walter's stance of improving the language for *current* users, instead of bending over backwards to please would-be adopters who may never actually adopt the language -- they'd just come back with new excuses of why they can't adopt D yet. If you make existing users happier, they will do all the work of evangelism for you, instead of you having to fight the uphill battle by yourself while bleeding away current users due to poor morale. T -- Why ask rhetorical questions? -- JCJun 02 2016On 06/02/2016 10:48 AM, H. S. 
Teoh via Digitalmars-d wrote:On Thu, Jun 02, 2016 at 09:06:44AM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]You cut the context, which was rampant speculation.ZombineDev, I've been at the top level in the C++ community for many many years, even after I wanted to exit :o). I'm familiar with how the committee that steers C++ works, perspective that is unique in our community - even Walter lacks it. I see trends and patterns. It is interesting how easily a small but very influential priesthood can alienate itself from the needs of the larger community and get into a frenzy over matters that are simply missing the point.Appeal to authority.There is no denying. If I did things all over again, autodecoding would not be in. But also string would not be immutable(char)[] which is the real mistake. Some of the arguments in here have been good, but many (probably the majority) of them were not so much. A good one didn't even come up, Walter told it to me over the phone: the reality of invalid UTF strings forces you to mind the representation more often than you'd want in an ideal world. There is no "writing off". Again, the real solution here is RCStr. We can't continue with immutable(char)[] as our flagship string. Autodecoding is the least of its problems.This is what's happening here. We worked ourselves to a foam because the creator of the language started a thread entitled "The Case Against Autodecode", whilst fully understanding there is no way to actually eliminate autodecode.I think that's a misrepresentation of the situation. I was getting increasingly unhappy with autodecoding myself, completely independently of Walter, and in fact have filed bugs and posted complaints about it long before Walter started his thread. I used to be a supporter of autodecoding, but over time it has become increasingly clear to me that it was a mistake. The fact that you continue to deny this and write it off in the face of similar complaints raised by many active D users is very off-putting, to say the least, and does not inspire confidence. Not to mention the fact that you started this thread yourself with a question about what it is we dislike about autodecoding, yet after having received a multitude of complaints, corrobated by many forum members, you simply write off the whole thing like it was nothing. If you want D to succeed, you need to raise the morale of the community, and this is not the way to raise morale.I don't see it the same way. Yes, I agree my opinion is not representative. I'd also say I'm glad I can do something about this.The very definition of a useless debate, the kind he and I had agreed to not initiate anymore. It was a mistake. I'm still metaphorically angry at him for it.On the contrary, I found that Walter's willingness to admit past mistakes very refreshing, even if practically speaking we can't actually get rid of autodecoding today. What he proposed in the other thread is actually a workable step towards reversing the wrong decision behind autodecoding, that doesn't leave existing users out in the cold, and that we might actually be able to pull off if done carefully. I know you probably won't see it the same way, since you still seem convinced that autodecoding was a good idea, but you need to understand that your opinion is not representative in this case.[...]I've been working on RCStr for the past few days. I'd get a lot more work done if I didn't need to talk sense into people in this thread.Meanwhile, I go to conferences. Train and consult at large companies. 
Dozens every year, cumulatively thousands of people. I talk about D and ask people what it would take for them to use the language. Invariably I hear a surprisingly small number of reasons: * The garbage collector eliminates probably 60% of potential users right off.At least we have begun to do something about this. That's good news.I've spoken to Brian about it. Dfix does not do lookup, which makes it sadly not up for meaningful uses.* Tooling is immature and of poorer quality compared to the competition.And what have we done about it? How long has it been since dfix existed, yet we still haven't really integrated it into the dmd toolchain?Walter and I are working on safe RC.* Safety has holes and bugs.And what have we done about it?Nice.* Hiring people who know D is a problem.There are many willing candidates right here. :-Phttp://tour.dlang.org is a good start.* Documentation and tutorials are weak.And what have we done about this?PR please. I can't babysit everything. I'm preparing for a conference where I'll evangelize for D next week (http://ndcoslo.com/speaker/andrei-alexandrescu/). As I mentioned at DConf, for better or worse this is the kind of stuff I cannot delegate. That kind of work is where the community would really make an impact, not a large debate that I need to worry will lead to some silly rash decision.* There's no web services framework (by this time many folks know of D, but of those a shockingly small fraction has even heard of vibe.d). I have strongly argued with Snke to bundle vibe.d with dmd over one year ago, and also in this forum. There wasn't enough interest.What about linking to it in a prominent place on dlang.org? This isn't a big problem, AFAICT. I don't think it takes months and years to put up a big prominent banner promoting vibe.d on, say, the download page of dlang.org.I'm actively looking for a collaboration.* (On Windows) if it doesn't have a compelling Visual Studio plugin, it doesn't exist.And what have we done about this?One of the things that I have found a little disappointing with D is that while it has many very promising features, it lacks polish in many small details. Such as the way features interact with each other in corner cases. E.g., the whole can't-use-gc from dtor debacle, the semantics of closures over aggregate members, holes in safe, holes in const/immutable in unions, the whole import mess that took oh-how-many-years to clean up that thankfully was finally improved recently, can't use nogc with Phobos, can't use const/pure/etc. in Object.toString, Object.opEqual, et al (which we've been trying to get of since how many years ago now?), and a whole long list of small irritations that in themselves are nothing, but together add up like a dustball to an overall perception of lack of polish.It's a fair perspective. Those annoy me as well. I'll also note every language has such matter, including the mainstream ones. At some point we need to acknowledge they're there but they're small enough to live with. (Some of those you enumerated aren't small, e.g. the holes in safe.)I'm more sympathetic to Walter's stance of improving the language for *current* users, instead of bending over backwards to please would-be adopters who may never actually adopt the language -- they'd just come back with new excuses of why they can't adopt D yet. 
If you make existing users happier, they will do all the work of evangelism for you, instead of you having to fight the uphill battle by yourself while bleeding away current users due to poor morale.We want to improve the language for current AND future users. RCStr is part of that. AndreiJun 02 2016On Thursday, June 02, 2016 09:06:44 Andrei Alexandrescu via Digitalmars-d wrote:Meanwhile, I go to conferences. Train and consult at large companies. Dozens every year, cumulatively thousands of people. I talk about D and ask people what it would take for them to use the language. Invariably I hear a surprisingly small number of reasons:Are folks going to not start using D because of auto-decoding? No, because they won't know anything about it. Many of them don't even know anything about ranges. But it _will_ result in a WTF moment for pretty much everyone. It happens all the time and results in plenty of questions on D.Learn and stackoverflow, because no one expects it, and it causes them problems. Can we sanely remove auto-decoding from Phobos? I don't know. It's entrenched enough that doing so without breaking code is going to be very difficult. But at minimum, we need to mitigate it's effects, and I'm sure that we're going to be sorry in the long run if we don't figure out how to actually excise it. It's already a major wart that causes frequent problems, and it's the sort of thing that's going to make a number of folks unhappy with D in the long run, even if you can convince them to switch to it now while auto-decoding is still in place. Will it make them unhappy enough to switch away from D? Probably not. But it is going to be a constant pain point of the sort that folks frequently complain about with C++ - only this is one that we'll have, and C++ won't. - Jonathan M DavisJun 02 2016On 06/02/2016 11:58 AM, Jonathan M Davis via Digitalmars-d wrote:On Thursday, June 02, 2016 09:06:44 Andrei Alexandrescu via Digitalmars-d wrote:Actually ranges are a major reason for which people look into D. -- AndreiMeanwhile, I go to conferences. Train and consult at large companies. Dozens every year, cumulatively thousands of people. I talk about D and ask people what it would take for them to use the language. Invariably I hear a surprisingly small number of reasons:Are folks going to not start using D because of auto-decoding? No, because they won't know anything about it. Many of them don't even know anything about ranges.Jun 02 2016On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:On 06/01/2016 06:09 PM, ZombineDev wrote:If this doesn't happen, then all this push to change anything in Phobos is completely wasted effort. As long as arrays aren't treated like arrays, we will have to deal with auto-decoding. You can change string literals to be something other than arrays, and then we have a path forward. But as long as char[] is not an array, we have lost the battle of sanity. -SteveDeprecating front, popFront and empty for narrow strings is what we are talking about here.That will not happen. Walter and I consider the cost excessive and the benefit too small.Jun 02 2016On 06/02/2016 09:05 AM, Steven Schveighoffer wrote:On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:Really? "Anything"?On 06/01/2016 06:09 PM, ZombineDev wrote:If this doesn't happen, then all this push to change anything in Phobos is completely wasted effort.Deprecating front, popFront and empty for narrow strings is what we are talking about here.That will not happen. 
Walter and I consider the cost excessive and the benefit too small.As long as arrays aren't treated like arrays, we will have to deal with auto-decoding. You can change string literals to be something other than arrays, and then we have a path forward. But as long as char[] is not an array, we have lost the battle of sanity.Yeah, it's a miracle the language stays glued eh. Your post is a prime example that this thread has lost the battle of sanity. I'll destroy you in person tonight. AndreiJun 02 2016On 6/2/16 9:09 AM, Andrei Alexandrescu wrote:On 06/02/2016 09:05 AM, Steven Schveighoffer wrote:The push to make Phobos only use byDchar (or any other band-aid fixes for this issue) is what I meant by anything. not "anything" anything :)On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:Really? "Anything"?On 06/01/2016 06:09 PM, ZombineDev wrote:If this doesn't happen, then all this push to change anything in Phobos is completely wasted effort.Deprecating front, popFront and empty for narrow strings is what we are talking about here.That will not happen. Walter and I consider the cost excessive and the benefit too small.I mean as far as narrow strings are concerned. To have the language tell me, yes, char[] is an array with a .length member, but hasLength is false? What, str[4] works, but isRandomAccessRange is false? Maybe it's more Orwellian than insane: Phobos is saying 2 + 2 = 5 ;)As long as arrays aren't treated like arrays, we will have to deal with auto-decoding. You can change string literals to be something other than arrays, and then we have a path forward. But as long as char[] is not an array, we have lost the battle of sanity.Yeah, it's a miracle the language stays glued eh.Your post is a prime example that this thread has lost the battle of sanity. I'll destroy you in person tonight.It's the cynicism of talking/debating about this for years and years and not seeing any progress. We can discuss of course, and see who gets destroyed :) And yes, I'm about to kill this thread from my newsreader, since it's wasting too much of my time... -SteveJun 02 2016On 06/02/2016 09:25 AM, Steven Schveighoffer wrote:And yes, I'm about to kill this thread from my newsreader, since it's wasting too much of my time...A good idea for all of us. Could you also please look on my post on our meetup page? Thx! -- AndreiJun 02 2016On 02.06.2016 15:09, Andrei Alexandrescu wrote:It's not a language problem. Just avoid Phobos.You can change string literals to be something other than arrays, and then we have a path forward. But as long as char[] is not an array, we have lost the battle of sanity.Yeah, it's a miracle the language stays glued eh. ...Your post is a prime example that this thread has lost the battle of sanity.He is just saying that the fundamental reason why autodecoding is bad is that it denies that T[] is an array for any T.Jun 02 2016On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:On 06/01/2016 03:07 PM, ZombineDev wrote:This, deep down, point at the fact that conversion from/to char types are ill defined. One should be able to convert from char to byte/ubyte but not the other way around. One should be able to convert from byte to short but not from char to wchar. Once you disable the naive conversions, then the autodecoding in foreach isn't inconsistent anymore.This is not autodecoding. There is nothing auto-magic w.r.t. strings in plain foreach.I understand where you're coming from, but it actually is autodecoding. 
Consider: byte[] a; foreach (byte x; a) {} foreach (short x; a) {} foreach (int x; a) {} That works by means of a conversion short->int. However: char[] a; foreach (char x; a) {} foreach (wchar x; a) {} foreach (dchar x; a) {} The latter two do autodecoding, not coversion as the rest of the language. AndreiJun 02 2016On 02.06.2016 12:38, deadalnix wrote:The current situation is bad: void main(){ import std.utf,std.stdio; foreach(dchar d;"∑") writeln(d); // "∑" foreach(dchar d;"∑".byCodeUnit) writeln(d); // "â", "\210", "\221" } Implicit conversion should not happen, and I'd prefer both of them to behave the same. (I.e. make both a compile-time error or decode for both).This, deep down, point at the fact that conversion from/to char types are ill defined. One should be able to convert from char to byte/ubyte but not the other way around. One should be able to convert from byte to short but not from char to wchar. Once you disable the naive conversions, then the autodecoding in foreach isn't inconsistent anymore.Jun 02 2016On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:wrote:On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-dwalkLength treats a code point like it's a character. My point is that that's incorrect behavior. It will not result in correct string processing in the general case, because a code point is not guaranteed to be a full character. walkLength does not report the length of a character as one in all cases just like length does not report the length of a character as one in all cases. walkLength is counting bigger units than length, but it's still counting pieces of a character rather than counting full characters.Does walkLength yield the same number for all representations?On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- AndreiThe point is that treating a code point like it's a full character is just as wrong as treating a code unit as if it were a full character. It's _not_ guaranteed to be a full character. Treating code points as full characters does give you the correct result in more cases than treating a code unit as a full character gives you the correct result, but it still gives you the wrong result in many cases. If we want to have fully correct behavior without making the programmer deal with all of the Unicode issues themselves, then we need to operate at the grapheme level so that we are operating on full characters (though that obviously comes at a high cost to efficiency). Treating code points as characters like we do right now does not give the correct result in the general case just like treating code units as characters doesn't give the correct result in the general case. Both work some of the time, but neither works all of the time. Autodecoding attempts to hide the fact that it's operating on Unicode but does not actually go far enough to result in correct behavior. So, we pay the cost of decoding without getting the benefit of correctness. 
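To make the cost-versus-correctness point concrete, here is a minimal sketch (not a quote from the thread; it assumes the std.range walkLength and std.uni byGrapheme that Phobos ships) showing the three different answers a single string gives depending on the level you count at:

void main()
{
    import std.range : walkLength;
    import std.uni : byGrapheme;

    string s = "e\u0301";                  // 'e' + combining acute accent: one user-perceived character
    assert(s.length == 3);                 // UTF-8 code units
    assert(s.walkLength == 2);             // code points, which is what autodecoding iterates
    assert(s.byGrapheme.walkLength == 1);  // graphemes, i.e. full characters
}

Only the last count matches what a user would call the length of the text, yet the middle one is what range-based code sees by default.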
- Jonathan M DavisAnd you can even put that accent on 0 by doing something like assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d); One or more code units combine to make a single code point, but one or more code points also combine to make a grapheme.That's right. D's handling of UTF is at the code unit level (like all of Unicode is portably defined). If you want graphemes use byGrapheme. It seems you destroyed your own argument, which was:Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.You can't claim code units are just a special case of code points.May 31 2016On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:What's "correct"? Maybe the user intended to count the number of code points in order to pre-allocate a dchar[] of the correct size. Generally, I don't see how algorithms become magically "incorrect" when applied to utf code units.wrote:On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-dwalkLength treats a code point like it's a character. My point is that that's incorrect behavior. It will not result in correct string processing in the general case, because a code point is not guaranteed to be a full character. ...Does walkLength yield the same number for all representations?On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- AndreiSaying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.walkLength does not report the length of a character as one in all cases just like length does not report the length of a character as one in all cases. walkLength is counting bigger units than length, but it's still counting pieces of a character rather than counting full characters.The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456May 31 2016On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456That's a property of your font and font rendering engine, not Unicode. (Also, it's probably not quite six columns; most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.) -WyattMay 31 2016On 31.05.2016 21:40, Wyatt wrote:On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:Sure. Hence "context". If you are e.g. trying to manually underline some text in console output, for example in a compiler error message, counting the number of characters will not actually be what you want, even though it works reliably for ASCII text.The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456That's a property of your font and font rendering engine, not Unicode.(Also, it's probably not quite six columns; most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.) -WyattIt's precisely six columns in my terminal (also in emacs and in gedit). 
My point was, how can std.algorithm ever guess correctly what you /actually/ intended to do?May 31 2016On Tuesday, May 31, 2016 21:48:36 Timon Gehr via Digitalmars-d wrote:On 31.05.2016 21:40, Wyatt wrote:It can't, which is precisely why having it select for you was a bad design decision. The programmer needs to be making that decision. And the fact that Phobos currently makes that decision for you means that it's often doing the wrong thing, and the fact that it chose to decode code points by default means that it's often eating up unnecessary cycles to boot. - Jonathan M DavisOn Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:Sure. Hence "context". If you are e.g. trying to manually underline some text in console output, for example in a compiler error message, counting the number of characters will not actually be what you want, even though it works reliably for ASCII text.The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456That's a property of your font and font rendering engine, not Unicode.(Also, it's probably not quite six columns; most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.) -WyattIt's precisely six columns in my terminal (also in emacs and in gedit). My point was, how can std.algorithm ever guess correctly what you /actually/ intended to do?May 31 2016On Tue, May 31, 2016 at 07:40:13PM +0000, Wyatt via Digitalmars-d wrote:On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:[...] I believe he was talking about a console terminal that uses 2 columns to render the so-called "double width" characters. The CJK block does contain "double-width" versions of selected blocks (e.g., the ASCII block), to be used with said characters. Of course, using string length to measure string width is a risky venture fraught with pitfalls, because your terminal may not actually render them the way you think it should. Nevertheless, it does serve to highlight why a construct like s.walkLength is essentially buggy, because there is not enough information to determine which length it should return -- length of the buffer in bytes, or the number of code points, or the number of graphemes, or the width of the string. No matter which choice you make, it only works for a subset of cases and is wrong for the other cases. This is a prime illustration of why forcing autodecoding on every string in D is a wrong design. T -- Не дорог подарок, дорога любовь.The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456That's a property of your font and font rendering engine, not Unicode. (Also, it's probably not quite six columns; most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.)May 31 2016On Tuesday, May 31, 2016 21:20:19 Timon Gehr via Digitalmars-d wrote:On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:wrote:On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-dIn the vast majority of cases what folks care about is full characters, which is not what code points are. But the fact that they want different things in different situation just highlights the fact that just converting everything to code points by default is a bad idea. And even worse, code points are usually the worst choice. 
Many operations don't require decoding and can be done at the code unit level, meaning that operating at the code point level is just plain inefficient. And the vast majority of the operations that can't operate at the code point level, then need to operate on full characters, which means that they need to be operating at the grapheme level. Code points are in this weird middle ground that's useful in some cases but usually isn't what you want or need. We need to be able to operate at the code unit level, the code point level, and the grapheme level. But defaulting to the code point level really makes no sense.What's "correct"? Maybe the user intended to count the number of code points in order to pre-allocate a dchar[] of the correct size. Generally, I don't see how algorithms become magically "incorrect" when applied to utf code units.wrote:On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-dwalkLength treats a code point like it's a character. My point is that that's incorrect behavior. It will not result in correct string processing in the general case, because a code point is not guaranteed to be a full character. ...Does walkLength yield the same number for all representations?On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- AndreiSaying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.Well, that's getting into displaying characters which is a whole other can of worms, but it also highlights that assuming that the programmer wants a particular level of unicode is not a particularly good idea and that we should avoid converting for them without being asked, since it risks being inefficient to no benefit. - Jonathan M DaviswalkLength does not report the length of a character as one in all cases just like length does not report the length of a character as one in all cases. walkLength is counting bigger units than length, but it's still counting pieces of a character rather than counting full characters.The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456May 31 2016On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:In the vast majority of cases what folks care about is full characterHow are you so sure? -- AndreiMay 31 2016Am Tue, 31 May 2016 16:56:43 -0400 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:Because a full character is the typical unit of a written language. It's what we visualize in our heads when we think about finding a substring or counting characters. A special case of this is the reduction to ASCII where we can use code units in place of grapheme clusters. -- MarcoIn the vast majority of cases what folks care about is full characterHow are you so sure? -- AndreiMay 31 2016On Tuesday, May 31, 2016 23:36:20 Marco Leise via Digitalmars-d wrote:Am Tue, 31 May 2016 16:56:43 -0400 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:Exactly. How many folks here have written code where the correct thing to do is to search on code points? Under what circumstances is that even useful? 
Code points are a mid-level abstraction between UTF-8/16 and graphemes that are not particularly useful on their own. Yes, by using code points, we eliminate the differences between the encodings, but how much code even operates on multiple string types? Having all of your strings have the same encoding fixes the consistency problem just as well as autodecoding to dchar evereywhere does - and without the efficiency hit. Typically, folks operate on string or char[] unless they're talking to the Windows API, in which case, they need wchar[]. Our general recommendation is that D code operate on UTF-8 except when it needs to operate on a different encoding because of other stuff it has to interact with (like the Win32 API), in which case, ideally it converts those strings to UTF-8 once they get into the D code and operates on them as UTF-8, and anything that has to be output in a different encoding is operated on as UTF-8 until it needs to be outputed, in which case, it's converted to UTF-16 or whatever the target encoding is. Not much of anyone is recommending that you use dchar[] everywhere, but that's essentially what the range API is trying to force. I think that it's very safe to say that the vast majority of string processing either is looking to operate on strings as a whole or on individual, full characters within a string. Code points are neither. While code may play tricks with Unicode to be efficient (e.g. operating at the code unit level where it can rather than decoding to either code points or graphemes), or it might make assumptions about its data being ASCII-only, aside from explicit Unicode processing code, I have _never_ seen code that was actually looking to logically operate on only pieces of characters. While it may operate on code units for efficiency, it's always looking to be logically operating on string as a unit or on whole characters. Anyone looking to operate on code points is going to need to take into account the fact that they're not full characters, just like anyone who operates on code units needs to take into account the fact that they're not whole characters. Operating on code points as if they were characters - which is exactly what D currently does with ranges - is just plain wrong. We need to support operating at the code point level for those rare cases where it's actually useful, but autedecoding makes no sense. It incurs a performance penality without actually giving correct results except in those rare cases where you want code points instead of full characters. And only Unicode experts are ever going to want that. The average programmer who is not super Unicode savvy doesn't even know what code points are. They're clearly going to be looking to operate on strings as sequences of characters, not sequences of code points. I don't see how anyone could expect otherwise. Code points are a mid-level, Unicode abstraction that only those who are Unicode savvy are going to know or care about, let alone want to operate on. - Jonathan M DavisOn 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:Because a full character is the typical unit of a written language. It's what we visualize in our heads when we think about finding a substring or counting characters. A special case of this is the reduction to ASCII where we can use code units in place of grapheme clusters.In the vast majority of cases what folks care about is full characterHow are you so sure? 
-- AndreiMay 31 2016On Wednesday, 1 June 2016 at 02:17:21 UTC, Jonathan M Davis wrote:...This thread is going in circles; the against crowd has stated each of their arguments very clearly at least five times in different ways. The cost/benefit problems with auto decoding are as clear as day. If the evidence already presented in this thread (and in the many others) isn't enough to convince people of that, then I don't think anything else said will have an impact. I don't want to sound like someone telling people not to discuss this anymore, but honestly, what is continuing this thread going to accomplish?May 31 2016On Tuesday, 31 May 2016 at 20:56:43 UTC, Andrei Alexandrescu wrote:On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:He doesn't need to be sure. You are the one advocating for code points, so the burden is on you to present evidence that it's the correct choice.In the vast majority of cases what folks care about is full characterHow are you so sure? -- AndreiJun 01 2016On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote:walkLength treats a code point like it's a character.No, it treats a code point like it's a code point. -- AndreiMay 31 2016On Tuesday, May 31, 2016 15:33:38 Andrei Alexandrescu via Digitalmars-d wrote:On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote:Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up as is so easy to do when operating at the code unit level? Thanks to how Phobos treats strings as ranges of dchar, most D code treats code points as if they were characters. So, whether it's correct or not, a _lot_ of D code is treating walkLength like it returns the number of characters in a string. And if walkLength doesn't provide the number of characters in a string, why would I want to use it under normal circumstances? Why would I want to be operating at the code point level in my code? It's not necessarily a full character, since it's not necessarily a grapheme. So, by using walkLength and front and popFront and whatnot with strings, I'm not getting full characters. I'm still only getting pieces of characters - just like would happen if strings were treated as ranges of code units. I'm just getting bigger pieces of the characters out of the deal. But if they're not full characters, what's the point? I am sure that there is code that is going to want to operate at the code point level, but your average program is either operating on strings as a whole or individual characters. As long as strings are being operated on as a whole, code units are generally plenty, and careful encoding of characters into code units for comparisons means that much of the time that you want to operate on individual characters, you can still operate at the code unit level. But if you can't, then you need the grapheme level, because a code point is not necessarily a full character. So, what is the point of operating on code points in your average D program? walkLength will not always tell me the number of characters in a string. front risks giving me a partial character rather than a whole one. Slicing dchar[] risks chopping up characters just like slicing char[] does. Operating on code points by default does not result in correct string processing. I honestly don't see how autodecoding is defensible. 
We may not be able to get rid of it due to the breakage that doing that would cause, but I fail to see how it is at all desirable that we have autodecoded strings. I can understand how we got it if it's based on a misunderstanding on your part about how Unicode works. We all make mistakes. But I fail to see how autodecoding wasn't a mistake. It's the worst of both worlds - inefficient while still incorrect. At least operating at the code unit level would be fast while being incorrect, and it would be obviously incorrect once you did anything with non-ASCII values, whereas it's easy to miss that ranges of dchar are doing the wrong thing too - Jonathan M DaviswalkLength treats a code point like it's a character.No, it treats a code point like it's a code point. -- AndreiMay 31 2016On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up as is so easy to do when operating at the code unit level?The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units). That's the contract, and it seems meaningful seeing how Unicode is defined in terms of code points as its abstract building block. If user code needs to go lower at the code unit level, they can do so. If user code needs to go upper at the grapheme level, they can do so. If anything this thread strengthens my opinion that autodecoding is a sweet spot. -- AndreiMay 31 2016On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:If user code needs to go upper at the grapheme level, they can If anything this thread strengthens my opinion that autodecoding is a sweet spot. -- AndreiUnicode FAQ disagrees (http://unicode.org/faq/utf_bom.html): "Q: How about using UTF-32 interfaces in my APIs? A: Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16 APIs the low level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the required functionality at the high levels."May 31 2016On Tue, May 31, 2016 at 05:01:17PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:This is basically saying that we operate on dchar[] by default, except that we disguise its detrimental memory usage consequences by compressing to UTF-8/UTF-16 and incurring the cost of decompression every time we access its elements. Perhaps you love the idea of running an OS that stores all files in compressed form and always decompresses upon every syscall to read(), but I prefer a higher-performance system.Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up as is so easy to do when operating at the code unit level?The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units).That's the contract, and it seems meaningful seeing how Unicode is defined in terms of code points as its abstract building block.Where's this contract stated, and when did we sign up for this?If user code needs to go lower at the code unit level, they can do so. 
If user code needs to go upper at the grapheme level, they can do so.Only with much pain by using workarounds to bypass meticulously-crafted autodecoding algorithms in Phobos.If anything this thread strengthens my opinion that autodecoding is a sweet spot. -- AndreiNo, autodecoding is a stalemate that's neither fast nor correct. T -- "Real programmers can write assembly code in any language. :-)" -- Larry WallMay 31 2016On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:_Both_ are low-level representation-specific artifacts.Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up as is so easy to do when operating at the code unit level?The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units).Jun 01 2016On 06/01/2016 06:25 AM, Marc Schütz wrote:On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:Maybe this is a misunderstanding. Representation = how things are laid out in memory. What does associating numbers with various Unicode symbols have to do with representation? -- AndreiOn 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:_Both_ are low-level representation-specific artifacts.Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up as is so easy to do when operating at the code unit level?The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units).Jun 01 2016On 06/01/2016 10:29 AM, Andrei Alexandrescu wrote:On 06/01/2016 06:25 AM, Marc Schütz wrote:As has been explained countless times already, code points are a non-1:1 internal representation of graphemes. Code points don't exist for their own sake, their entire existence is purely as a way to encode graphemes. Whether that technically qualifies as "memory representation" or not is irrelevant: it's still a low-level implementation detail of text.On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:Maybe this is a misunderstanding. Representation = how things are laid out in memory. What does associating numbers with various Unicode symbols have to do with representation? -- AndreiThe point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units)._Both_ are low-level representation-specific artifacts.Jun 01 2016On 06/01/2016 12:41 PM, Nick Sabalausky wrote:As has been explained countless times already, code points are a non-1:1 internal representation of graphemes. Code points don't exist for their own sake, their entire existence is purely as a way to encode graphemes.Of course, thank you.Whether that technically qualifies as "memory representation" or not is irrelevant: it's still a low-level implementation detail of text.The relevance is meandering across the discussion, and it's good to have the same definitions for terms. Unicode code points are abstract notions with meanings attached to them, whereas UTF8/16/32 are concerned with their representation. 
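A small sketch of that distinction (not from the thread; it uses std.utf.codeLength as it exists in Phobos): the same abstract code point takes a different number of code units in each encoding.

void main()
{
    import std.utf : codeLength;

    dchar euro = '\u20AC';                 // one code point: EURO SIGN
    assert(codeLength!char(euro)  == 3);   // three UTF-8 code units
    assert(codeLength!wchar(euro) == 1);   // one UTF-16 code unit
    assert(codeLength!dchar(euro) == 1);   // one UTF-32 code unit
}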
AndreiJun 01 2016On Wednesday, 1 June 2016 at 14:29:58 UTC, Andrei Alexandrescu wrote:On 06/01/2016 06:25 AM, Marc Schütz wrote:Ok, if you define it that way, sure. I was thinking in terms of the actual text: Unicode is a way to represent that text using a variety of low-level representations: UTF8/NFC, UTF8/NFD, unnormalized UTF8, UTF16 big/little endian x normalization, UTF32 x normalization, some other more obscure ones. From that viewpoint, auto decoded char[] (= UTF8) is equivalent to dchar[] (= UTF32). Neither of them is the actual text. Both writing and the memory representation consist of fundamental units. But there is no 1:1 relationship between the units of char[] (UTF8 code units) or auto decoded strings (Unicode code points) on the one hand, and the units of writing (graphemes) on the other.On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:Maybe this is a misunderstanding. Representation = how things are laid out in memory. What does associating numbers with various Unicode symbols have to do with representation? --The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units)._Both_ are low-level representation-specific artifacts.Jun 02 2016On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]Does walkLength yield the same number for all representations?Let's put the question this way. Given the following string, what do *you* think walkLength should return? şŭt̥ḛ́k̠ I think any reasonable person would have to say it should return 5, because there are 5 visual "characters" here. Otherwise, what is even the meaning of walkLength?! For it to return anything other than 5 means that it's a leaky abstraction, because it's leaking low-level "implementation details" of the Unicode representation of this string. However, with the current implementation of autodecoding, walkLength returns 11. Can anyone reasonably argue that it's reasonable for "şŭt̥ḛ́k̠".walkLength to equal 11? What difference does this make if we get rid of autodecoding, and walkLength returns 17 instead? *Both* are wrong. 17 is actually the right answer if you're looking to allocate a buffer large enough to hold this string, because that's the number of bytes it occupies. 5 is the right answer to an end user who knows nothing about Unicode. 11 is an answer that a question that only makes sense to a Unicode specialist, and that no layperson understands. 11 is the answer we currently give. And that, at the cost of across-the-board performance degradation. Yet you're seriously arguing that 11 should be the right answer, by insisting that the current implementation of autodecoding is "correct". It boggles the mind. T -- Today's society is one of specialization: as you grow, you learn more and more about less and less. Eventually, you know everything about nothing.May 31 2016On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]Compiler error. -SteveDoes walkLength yield the same number for all representations?Let's put the question this way. Given the following string, what do *you* think walkLength should return?May 31 2016On 31.05.2016 21:51, Steven Schveighoffer wrote:On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:What about e.g. joiner?On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]Compiler error. 
-SteveDoes walkLength yield the same number for all representations?Let's put the question this way. Given the following string, what do *you* think walkLength should return?May 31 2016On Tue, May 31, 2016 at 10:38:03PM +0200, Timon Gehr via Digitalmars-d wrote:On 31.05.2016 21:51, Steven Schveighoffer wrote:joiner is one of those algorithms that can work perfectly fine *without* autodecoding anything at all. The only time it'd actually need to decode would be if you're joining a set of UTF-8 strings with a UTF-16 delimiter, or some other such combination, which should be pretty rare. After all, within the same application you'd usually only be dealing with a single encoding rather than mixing UTF-8, UTF-16, and UTF-32 willy-nilly. (Unless the code is specifically written for transcoding, in which case decoding is part of the job description, so it should be expected that the programmer ought to know how to do it properly without needing Phobos to do it for him.) Even in the case of s.joiner('Ш'), joiner could easily convert that dchar into a short UTF-8 string and then operate directly on UTF-8. T -- Just because you survived after you did it, doesn't mean it wasn't stupid!On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:What about e.g. joiner?On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]Compiler error. -SteveDoes walkLength yield the same number for all representations?Let's put the question this way. Given the following string, what do *you* think walkLength should return?May 31 2016On 5/31/16 4:38 PM, Timon Gehr wrote:On 31.05.2016 21:51, Steven Schveighoffer wrote:Compiler error. Better than what it does now. -SteveOn 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:What about e.g. joiner?On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]Compiler error.Does walkLength yield the same number for all representations?Let's put the question this way. Given the following string, what do *you* think walkLength should return?May 31 2016On Wednesday, 1 June 2016 at 01:13:17 UTC, Steven Schveighoffer wrote:On 5/31/16 4:38 PM, Timon Gehr wrote:I believe everything that does only concatenation will work correctly. That's why joiner() is one of those algorithms that should accept strings directly without going through any decoding (but it may need to recode the joining element itself, of course).What about e.g. joiner?Compiler error. Better than what it does now.Jun 01 2016On 6/1/16 6:31 AM, Marc Schütz wrote:On Wednesday, 1 June 2016 at 01:13:17 UTC, Steven Schveighoffer wrote:This means that a string is a range. What is it a range of? If you want to make it a range of code units, I think you will lose that battle. If you want to special-case joiner for strings, that's always possible. Or string could be changed to be a range of dchar struct explicitly. Then at least joiner makes sense, and I can reasonably explain why it behaves the way it does. -SteveOn 5/31/16 4:38 PM, Timon Gehr wrote:I believe everything that does only concatenation will work correctly. That's why joiner() is one of those algorithms that should accept strings directly without going through any decoding (but it may need to recode the joining element itself, of course).What about e.g. joiner?Compiler error. 
Better than what it does now.Jun 02 2016On Thursday, 2 June 2016 at 13:11:10 UTC, Steven Schveighoffer wrote:On 6/1/16 6:31 AM, Marc Schütz wrote:No, I don't want to make string a range of anything, I want to provide an additional overload for joiner() that accepts a const(char)[], and returns a range of chars. The remark about the joining element is that ["abc", "xyz"].joiner(","d) should convert ","d to "," first, to match the element type of the elements. But this is purely a convenience; it can also be pushed to the user.I believe everything that does only concatenation will work correctly. That's why joiner() is one of those algorithms that should accept strings directly without going through any decoding (but it may need to recode the joining element itself, of course).This means that a string is a range. What is it a range of? If you want to make it a range of code units, I think you will lose that battle.If you want to special-case joiner for strings, that's always possible.Yes, that's what I want. Sorry if it wasn't clear.Or string could be changed to be a range of dchar struct explicitly. Then at least joiner makes sense, and I can reasonably explain why it behaves the way it does. -SteveJun 02 2016On 02.06.2016 15:48, Marc Schütz wrote:If strings are not ranges, returning a range of chars is inconsistent.No, I don't want to make string a range of anything, I want to provide an additional overload for joiner() that accepts a const(char)[], and returns a range of chars.Jun 02 2016On Thursday, 2 June 2016 at 13:11:10 UTC, Steven Schveighoffer wrote:This means that a string is a range. What is it a range of? If you want to make it a range of code units, I think you will lose that battle.After the first migration step joiner will return a decoded dchar range just like it does now, only code will change internally, there will be no observable semantic difference to the user. Anyway, read Walter's proposal in the thread about dealing with autodecode.Jun 02 2016On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:Let's put the question this way. Given the following string, what do *you* think walkLength should return? şŭt̥ḛ́k̠The number of code units in the string. That's the contract promised and honored by Phobos. -- AndreiMay 31 2016On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:Code points I mean. -- AndreiLet's put the question this way. Given the following string, what do *you* think walkLength should return? şŭt̥ḛ́k̠The number of code units in the string. That's the contract promised and honored by Phobos. -- AndreiMay 31 2016On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:Yes, we know it's the contract. ***That's the problem.*** As everybody is saying, it *SHOULDN'T* be the contract. Why shouldn't it be the contract? Because it's proven itself, both logically (as presented by pretty much everybody other than you in both this and other threads) and empirically (in phobos, warp, and other user code) to be both the least useful and most PITA option.On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:Code points I mean. -- AndreiLet's put the question this way. Given the following string, what do *you* think walkLength should return? şŭt̥ḛ́k̠The number of code units in the string. That's the contract promised and honored by Phobos. 
-- AndreiMay 31 2016On Tuesday, May 31, 2016 20:38:14 Nick Sabalausky via Digitalmars-d wrote:On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:Exactly. Operating at the code point level rarely makes sense. What sorts of algorithms purposefully do that in a typical program? Unless you're doing very specific Unicode stuff or somehow know that your strings don't contain any graphemes that are made up of multiple code points, operating at the code point level is just bug-prone, and unless you're using dchar[] everywhere, it's slow to boot, because you're strings have to be decoded whether the algorithm needs to or not. I think that it's very safe to say that the vast majority of string algorithms are either able to operate at the code unit level without decoding (though possibly encoding another string to match - e.g. with a string comparison or search), or they have to operate at the grapheme level in order to deal with full characters. A code point is borderline useless on its own. It's just a step above the different UTF encodings without actually getting to proper characters. - Jonathan M DavisOn 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:Yes, we know it's the contract. ***That's the problem.*** As everybody is saying, it *SHOULDN'T* be the contract. Why shouldn't it be the contract? Because it's proven itself, both logically (as presented by pretty much everybody other than you in both this and other threads) and empirically (in phobos, warp, and other user code) to be both the least useful and most PITA option.On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:Code points I mean. -- AndreiLet's put the question this way. Given the following string, what do *you* think walkLength should return? şŭt̥ḛ́k̠The number of code units in the string. That's the contract promised and honored by Phobos. -- AndreiMay 31 2016On 05/31/2016 07:21 PM, Andrei Alexandrescu wrote:Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- AndreiYou got the terms mixed up. Code unit is lower level. Code point is higher level. One code point is encoded with one or more code units. char is a UTF-8 code unit. wchar is a UTF-16 code unit. dchar is both a UTF-32 code unit and a code point, because in UTF-32 it's a 1-to-1 relation.May 31 2016On 05/31/2016 03:34 PM, ag0aep6g wrote:On 05/31/2016 07:21 PM, Andrei Alexandrescu wrote:Apologies and thank you. -- AndreiCould you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- AndreiYou got the terms mixed up. Code unit is lower level. Code point is higher level.May 31 2016On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:The standard library has to fight against itself because of autodecoding! The vast majority of the algorithms in Phobos are special-cased on strings in an attempt to get around autodecoding. That alone should highlight the fact that autodecoding is problematic.The way I see it is it's specialization to speed things up without giving up the higher level abstraction. 
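Concretely, the decoded abstraction at issue is the dchar element that Phobos ranges present for narrow strings; a minimal sketch of that behavior (not from the thread, relying only on std.range's documented traits):

void main()
{
    import std.range.primitives : ElementType, front;

    string s = "ä";                                   // two UTF-8 code units, one code point
    static assert(is(ElementType!string == dchar));   // ranges see narrow strings as decoded dchar
    assert(s.front == 'ä');                           // front decodes on the fly
    assert(s.length == 2);                            // but length still counts code units
}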
-- AndreiMay 31 2016On 05/31/2016 01:23 PM, Andrei Alexandrescu wrote:On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:Problem is, that "higher"[1] level abstraction you don't want to give up (ie working on code points) is rarely useful, and yet the default is to pay the price for something which is rarely useful. [1] It's really the mid-level abstraction - grapheme is the high-level one (and more likely useful).The standard library has to fight against itself because of autodecoding! The vast majority of the algorithms in Phobos are special-cased on strings in an attempt to get around autodecoding. That alone should highlight the fact that autodecoding is problematic.The way I see it is it's specialization to speed things up without giving up the higher level abstraction. -- AndreiMay 31 2016On 5/26/2016 9:00 AM, Andrei Alexandrescu wrote:My thesis: the D1 design decision to represent strings as char[] was disastrous and probably one of the largest weaknesses of D1. The decision in D2 to use immutable(char)[] for strings is a vast improvement but still has a number of issues.The mutable vs immutable has nothing to do with autodecoding.On 05/12/2016 04:15 PM, Walter Bright wrote:It's a consequence of autodecoding, not arrays.On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote: 2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.This is a consequence of 1. It is at least partially fixable.Having written high speed string processing code in D, that also deals with unicode (i.e. Warp), the only knowledge of autodecoding needed was how to have it not happen. Autodecoding made it slower than necessary in every case it was used. I found no place in Warp where autodecoding was desirable.4. Autodecoding is slow and has no place in high speed string processing.I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D.Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc.That doesn't work so well. There always seems to be a need for custom string processing. Worse, when pipelining strings, the autodecoding changes the type to dchar, which then needs to be re-encoded into the result. The std.string algorithms I wrote all work much better (i.e. faster) without autodecoding, while maintaining proper Unicode support. I.e. the autodecoding did not benefit the algorithms at all, and if the user is to use standard algorithms instead of custom ones, then autodecoding is not necessary.When needed, iterating every code unit is trivially done through indexing.This implies replacing pipelining with loops, and also falls apart if indexing is redone to index by code points.Also allow me to point that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.I.e. special case the code to avoid autodecoding. The trouble is that the low level code cannot avoid autodecoding, as it happens before the low level code gets it. 
This is conceptually backwards, and winds up requiring every algorithm to special case strings, even when completely unnecessary. (The 'copy' algorithm is an example of utterly unnecessary decoding.) When teaching people how to write algorithms, having to write every one twice, once for ranges and arrays, and a specialization for strings even when decoding is never necessary (such as for 'copy'), is embarrassing.Running my char[] through a pipeline and having it come out sometimes as char[] and sometimes dchar[] and sometimes ubyte[] is hidden and surprising behavior.5. Very few algorithms require decoding.The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.find("abc") s.findSplit("abc") s.findSplit('a') s.count!(c => "!()-;:,.?".canFind(c)) // punctuation However the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control characters Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.A third option is to pass the invalid code units through unmolested, which won't work if autodecoding is used.6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.Agreed. This is probably the most glaring mistake. I think we should open a discussion no fixing this everywhere in the stdlib, even at the cost of breaking code.Requiring code units to be all 100% valid is not workable, nor is redoing them to be ubytes. More on that below.7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autocode does not play with that.If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.Sorry I didn't log the time I spent on it.8. In my work with UTF-8 streams, dealing with autodecode has caused me considerably extra work every time. A convenient timesaver it ain't.Objection. Vague..representation changes the type to ubyte[]. All knowledge that this is a Unicode string then gets lost for the rest of the pipeline.9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.Turning off autodecoding is as easy as inserting .representation after any string.(Not to mention using indexing directly.)Doesn't work if you're pipelining.I found .representation to be unworkable because it changed the type.10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.Even if it is made a special type, the problem of what an index means will remain. 
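The two escape hatches being weighed here behave differently in practice; a short sketch (not from the thread) contrasting std.string.representation, which changes the element type, with std.utf.byCodeUnit, which keeps char and restores random access:

void main()
{
    import std.range : hasLength, isRandomAccessRange;
    import std.string : representation;
    import std.utf : byCodeUnit;

    string s = "salut";

    auto r = s.representation;              // immutable(ubyte)[]: no decoding, but the UTF-8 type info is gone
    static assert(is(typeof(r) == immutable(ubyte)[]));

    auto c = s.byCodeUnit;                  // still char-based, no decoding, and a genuine random-access range
    assert(c.length == 5 && c[0] == 's');
    static assert(isRandomAccessRange!(typeof(c)));

    static assert(!isRandomAccessRange!string && !hasLength!string); // the autodecoded view is neither
}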
Of course, indexing by code point is an O(n) operation, which I submit is surprising and shouldn't be supported as [i] even by a special type (for the same reason that indexing of linked lists is frowned upon). Giving up indexing means giving up efficient slicing, which would be a major downgrade for D.11. Indexing an array produces different results than autodecoding, another glaring special case.This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.They mean code units. This is not ambiguous. How a code unit is different from a ubyte: A. I know you hate bringing up my personal experience, but here goes. I've programmed in C forever. In C, char is used for both small integers and characters. It's always been a source of confusion, and sometimes bugs, to conflate the two: struct S { char field; }; Which is it, a character or a small integer? I have to rely on reading the code. It's a definite improvement in D that they are distinguished, and I feel that improvement every time I have to deal with C/C++ code and see 'char' used as a small integer instead of a character. B. Overloading is different, and that's good. For example, writeln(T[]) produces different results for char[] and ubyte[], and this is unsurprising and expected. It "just works". C. More overloading: writeln('a'); Does anyone want that to print 96? Does anyone really want 'a' to be of type dchar? (The trouble with that is type inference when building up more complex types, as you'll wind up with hidden dchar[] if not careful. My experience with dchar[] is it is almost never desirable, as it is too memory hungry.)May 27 2016On 5/27/16 1:11 PM, Walter Bright wrote:They mean code units.Always valid or potentially invalid as well? -- AndreiMay 27 2016On 5/27/2016 11:27 AM, Andrei Alexandrescu wrote:On 5/27/16 1:11 PM, Walter Bright wrote:Some years ago I would have said always valid. Experience, however, says that Unicode is often dirty and code should be tolerant of that. Consider Unicode in a text editor. You can't have it throwing exceptions, silently changing things to replacement characters, etc., when there's a few invalid sequences in it. You also can't just say "the file isn't Unicode" and refuse to display the Unicode in it. It isn't hard to deal with invalid Unicode in a user friendly manner.They mean code units.Always valid or potentially invalid as well? -- AndreiMay 27 2016On 5/27/16 1:11 PM, Walter Bright wrote:The std.string algorithms I wrote all work much better (i.e. faster) without autodecoding, while maintaining proper Unicode support.Violent agreement is occurring here. We have plenty of those and need more. -- AndreiMay 27 2016On 05/12/2016 10:15 PM, Walter Bright wrote:On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:There are more than 2 choices here, see the related discussion on avoiding redundant unicode validation https://issues.dlang.org/show_bug.cgi?id=14519#c32.I am as unclear about the problems of autodecoding as I am about thenecessityto remove curl. Whenever I ask I hear some arguments that work wellemotionallybut are scant on reason and engineering. Maybe it's time to rehashthem? I justdid so about curl, no solid argument seemed to come together. I'd becurious ofa crisp list of grievances about autodecoding. 
-- Andrei
Here are some that are not matters of opinion. 6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.
May 29 2016
A relevant thread in the Rust bug tracker I remember from three years ago: https://github.com/rust-lang/rust/issues/7043 May it be of inspiration. -- Marco
May 30 2016
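As a closing illustration of point 6, a sketch (not from the thread) of the throwing policy: std.utf.decode raises a UTFException on the first invalid sequence, which is what keeps autodecoding code from being nothrow; the alternative policy discussed would yield the replacement character U+FFFD instead.

void main()
{
    import std.exception : assertThrown;
    import std.utf : UTFException, decode;

    auto bad = "abc" ~ cast(char) 0xFF ~ "def";  // 0xFF can never start a valid UTF-8 sequence
    size_t i = 3;                                // index of the bad byte
    assertThrown!UTFException(decode(bad, i));   // decoding it throws rather than substituting U+FFFD
}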