
digitalmars.D - Why not flag away the mistakes of the past?

reply Taylor Hillegeist <taylorh140 gmail.com> writes:
So I've seen on the forum over the years arguments about 
auto-decoding (mostly) and some other things: things that have 
been considered mistakes, but cannot be corrected because of the 
breaking changes it would create. And I always wonder: why not 
make a solution along the lines of a flag that makes things work 
as they used to, and make the new behavior the default?

dmd --UseAutoDecoding

That way the breaking change would be easily fixable, and the 
mistakes of the past would not be forever. Is it just the cost of 
maintenance?
Mar 06 2018
next sibling parent FeepingCreature <feepingcreature gmail.com> writes:
For what it's worth, I like autodecoding.

I worry we could be in a situation where a moderate number of 
people are strong opponents and a lot of people are weak fans, 
none of whom individually care enough to post. Hopefully the D 
survey results will shed some light on this, though I don't 
remember if it was written to actually ask people's opinion of 
autodecoding or just list it as a possible issue to raise, which 
would fall into the same trap.
Mar 07 2018
prev sibling next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 7 March 2018 at 06:00:30 UTC, Taylor Hillegeist 
wrote:
 So i've seen on the forum over the years arguments about 
 auto-decoding (mostly) and some other things. Things that have 
 been considered mistakes, and cannot be corrected because of 
 the breaking changes it would create. And I always wonder why 
 not make a solution to the tune of a flag that makes things 
 work as they used too, and make the new behavior default.

 dmd --UseAutoDecoding

 That way the breaking change was easily fixable, and the 
 mistakes of the past not forever. Is it just the cost of 
 maintenance?
That's the approach used for most things, but there are a lot of things that rely on auto-decoding, so it would be a big effort to actually implement that.
Mar 07 2018
prev sibling next sibling parent reply Guillaume Piolat <notthat email.com> writes:
On Wednesday, 7 March 2018 at 06:00:30 UTC, Taylor Hillegeist 
wrote:
 That way the breaking change was easily fixable, and the 
 mistakes of the past not forever. Is it just the cost of 
 maintenance?
The auto-decoding problem was mostly that it couldn't be nogc, since it can throw; but in future releases, exception throwing will become nogc-compatible. So it's getting fixed.
Mar 07 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, March 07, 2018 12:53:16 Guillaume Piolat via Digitalmars-d 
wrote:
 On Wednesday, 7 March 2018 at 06:00:30 UTC, Taylor Hillegeist

 wrote:
 That way the breaking change was easily fixable, and the
 mistakes of the past not forever. Is it just the cost of
 maintenance?
auto-decoding problem was mostly that it couldn't be nogc since throwing, but with further releases exception throwing will get nogc. So it's getting fixed.
I'd actually argue that that's the lesser of the problems with auto-decoding. The big problem is that it's auto-decoding. Code points are almost always the wrong level to be operating at. The programmer needs to be in control of whether the code is operating on code units, code points, or graphemes, and because of auto-decoding, we have to constantly avoid using the range primitives for arrays on strings. Tons of range-based code has to special case for strings in order to work around auto-decoding. We're constantly fighting our own API in order to process strings sanely and efficiently.

IMHO, nogc and nothrow don't matter much in comparison. Yes, it would be nice if range-based code operating on strings were nogc and nothrow, but most D code really doesn't care. It uses the GC anyway, and most of the time, no exceptions are thrown, because the strings are valid Unicode. Yes, the fact that the range primitives for strings throw UTFExceptions instead of using the Unicode replacement character is a problem, but that problem is small in comparison to the problems caused by the auto-decoding itself. Even if front and popFront used the variant of decode that used the replacement character, auto-decoding would still be a huge problem.

- Jonathan M Davis
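[Editor's note: the mismatch described here is easy to demonstrate. A minimal sketch, using only standard Phobos (std.range.primitives): the range primitives auto-decode a string into dchar code points, while indexing and .length see the raw UTF-8 code units.]

```d
import std.range.primitives : ElementType, front;

void main()
{
    string s = "résumé";

    // Range primitives auto-decode: the element type is dchar,
    // not the char the array actually stores.
    static assert(is(ElementType!string == dchar));
    assert(s.front == 'r'); // decoded code point

    // Indexing and length do not decode: they see UTF-8 code units.
    assert(s.length == 8); // 8 code units for 6 code points
    assert(s[0] == 'r');
}
```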
Mar 07 2018
next sibling parent reply Nick Treleaven <nick geany.org> writes:
On Wednesday, 7 March 2018 at 13:24:25 UTC, Jonathan M Davis 
wrote:
 I'd actually argue that that's the lesser of the problems with 
 auto-decoding. The big problem is that it's auto-decoding. Code 
 points are almost always the wrong level to be operating at.
For me the fundamental problem is having char[] in the language at all, meaning a Unicode string. Arbitrary slicing and indexing are not Unicode compatible, if we revisit this we need a String type that doesn't support those operations. Plus the issue of string validation - a Unicode string type should be assumed to have valid contents - unsafe data should only be checked at string construction time, so iterating should always be nothrow.
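[Editor's note: a rough sketch of such a type, to make the idea concrete. The name `String` and its API are hypothetical, not a Phobos proposal; it validates once at construction via std.utf.validate and exposes explicit views instead of opIndex/opSlice.]

```d
import std.uni : byGrapheme;
import std.utf : byCodeUnit, validate;

struct String
{
    private immutable(char)[] data;

    this(immutable(char)[] s)
    {
        validate(s); // throws UTFException here, once, not during iteration
        data = s;
    }

    // Explicit views instead of arbitrary indexing/slicing.
    auto codeUnits() { return data.byCodeUnit; }
    auto graphemes() { return data.byGrapheme; }
}

void main()
{
    auto s = String("hello");
    assert(s.codeUnits.length == 5);
}
```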
Mar 07 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, March 07, 2018 13:40:20 Nick Treleaven via Digitalmars-d 
wrote:
 On Wednesday, 7 March 2018 at 13:24:25 UTC, Jonathan M Davis

 wrote:
 I'd actually argue that that's the lesser of the problems with
 auto-decoding. The big problem is that it's auto-decoding. Code
 points are almost always the wrong level to be operating at.
For me the fundamental problem is having char[] in the language at all, meaning a Unicode string. Arbitrary slicing and indexing are not Unicode compatible, if we revisit this we need a String type that doesn't support those operations. Plus the issue of string validation - a Unicode string type should be assumed to have valid contents - unsafe data should only be checked at string construction time, so iterating should always be nothrow.
In principle, char is supposed to be a UTF-8 code unit, and strings are supposed to be validated up front rather than constantly validated, but it's never been that way in practice.

Regardless, having char[] be sliceable is actually perfectly fine and desirable. That's exactly what you want whenever you operate on code units, and it's frequently the case that you want to be operating at the code unit level. But the programmer needs to be able to reasonably control when code units, code points, or graphemes are used, because each has its time and place.

If we had a string type, it would need to provide access to each of those levels and likely would not be directly sliceable at all, because slicing a string is kind of meaningless; in principle, a string is just an opaque piece of character data. It's when you're dealing at the code unit, code point, or grapheme level that you actually start operating on pieces of a string, and that means that the level that you're operating at needs to be defined.

Having char[] be an array of code units works quite well, because then you have efficiency by default. You then need to wrap it in another range type when appropriate to get a range of code points or graphemes, or you need to explicitly decode when appropriate. Whereas right now, what we have is Phobos being "helpful" and constantly decoding for us, such that we get needlessly inefficient code, and it's at the code point level, which is usually not the level you want to operate at. So, you don't have efficiency or correctness.

Ultimately, it really doesn't work to hide the details of Unicode and not have the programmer worry about code units, code points, and graphemes unless you don't care about efficiency. As such, what we really need is to cleanly give the programmer the tools to manage Unicode without the language or library assuming what the programmer wants, especially assuming an inefficient default. The language itself actually does a decent job of that. It's Phobos that dropped the ball on that one, because Andrei didn't know about graphemes and tried to make Phobos Unicode-correct by default. Instead, we get inefficient and incorrect by default.

- Jonathan M Davis
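[Editor's note: the "wrap it in another range type" approach is already available in Phobos. A minimal sketch of choosing the level explicitly, using std.utf.byCodeUnit, std.uni.byGrapheme, and std.range.walkLength:]

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "año"; // 'ñ' is 2 UTF-8 code units, 1 code point, 1 grapheme

    assert(s.byCodeUnit.walkLength == 4); // code units: no decoding at all
    assert(s.walkLength == 3);            // code points: the auto-decoded default
    assert(s.byGrapheme.walkLength == 3); // graphemes: user-perceived characters
}
```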
Mar 07 2018
prev sibling parent reply Guillaume Piolat <notthat email.com> writes:
On Wednesday, 7 March 2018 at 13:24:25 UTC, Jonathan M Davis 
wrote:
 On Wednesday, March 07, 2018 12:53:16 Guillaume Piolat via 
 Digitalmars-d wrote:
 On Wednesday, 7 March 2018 at 06:00:30 UTC, Taylor Hillegeist

 wrote:
 That way the breaking change was easily fixable, and the 
 mistakes of the past not forever. Is it just the cost of 
 maintenance?
auto-decoding problem was mostly that it couldn't be nogc since throwing, but with further releases exception throwing will get nogc. So it's getting fixed.
I'd actually argue that that's the lesser of the problems with auto-decoding. The big problem is that it's auto-decoding. Code points are almost always the wrong level to be operating at. The programmer needs to be in control of whether the code is operating on code units, code points, or graphemes, and because of auto-decoding, we have to constantly avoid using the range primitives for arrays on strings. Tons of range-based code has to special case for strings in order to work around auto-decoding. We're constantly fighting our own API in order to process strings sanely and efficiently.
I'd agree with you; I hate the special casing. However, it seems to me this has been debated to death already, and that auto-decoding was successfully advocated by Alexandrescu et al., surviving the controversy years ago.
Mar 08 2018
next sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Thursday, March 08, 2018 16:34:11 Guillaume Piolat via Digitalmars-d 
wrote:
 On Wednesday, 7 March 2018 at 13:24:25 UTC, Jonathan M Davis

 wrote:
 On Wednesday, March 07, 2018 12:53:16 Guillaume Piolat via

 Digitalmars-d wrote:
 On Wednesday, 7 March 2018 at 06:00:30 UTC, Taylor Hillegeist

 wrote:
 That way the breaking change was easily fixable, and the
 mistakes of the past not forever. Is it just the cost of
 maintenance?
auto-decoding problem was mostly that it couldn't be nogc since throwing, but with further releases exception throwing will get nogc. So it's getting fixed.
I'd actually argue that that's the lesser of the problems with auto-decoding. The big problem is that it's auto-decoding. Code points are almost always the wrong level to be operating at. The programmer needs to be in control of whether the code is operating on code units, code points, or graphemes, and because of auto-decoding, we have to constantly avoid using the range primitives for arrays on strings. Tons of range-based code has to special case for strings in order to work around auto-decoding. We're constantly fighting our own API in order to process strings sanely and efficiently.
I'd agree with you, hate the special casing. However it seems to me this has been debated to death already, and that auto-decoding was successfully advocated by Alexandrescu and al; surviving the controversy years ago.
Most everyone who debated in favor of it early on is very much against it now (and I'm one of them). Experience and a better understanding of Unicode has shown it to be a terrible idea. I question that you will find any significant contributor to Phobos who would choose to have it if we were starting from scratch, and most of the folks who post in the newsgroup agree with that.

The problem is what to do given that we don't want it and that no one has come up with a way to remove it without breaking tons of code in the process or even providing a clean migration path. So, given how difficult it is to remove at this point, you'll find disagreement about how that should be handled, ranging from deciding that we're just stuck with it to wanting to remove it regardless of the cost.

But there seems to be almost universal agreement now (certainly among the folks who might make such a decision) that auto-decoding was a mistake. So, there's agreement that it would ideally go, but there isn't agreement on what we should actually do given the situation that we're in.

- Jonathan M Davis
Mar 08 2018
parent reply Taylor Hillegeist <taylorh140 gmail.com> writes:
On Thursday, 8 March 2018 at 17:14:16 UTC, Jonathan M Davis wrote:
 On Thursday, March 08, 2018 16:34:11 Guillaume Piolat via 
 Digitalmars-d wrote:
 On Wednesday, 7 March 2018 at 13:24:25 UTC, Jonathan M Davis

 wrote:
 On Wednesday, March 07, 2018 12:53:16 Guillaume Piolat via

 Digitalmars-d wrote:
 On Wednesday, 7 March 2018 at 06:00:30 UTC, Taylor 
 Hillegeist

 wrote:
 That way the breaking change was easily fixable, and the 
 mistakes of the past not forever. Is it just the cost of 
 maintenance?
auto-decoding problem was mostly that it couldn't be nogc since throwing, but with further releases exception throwing will get nogc. So it's getting fixed.
I'd actually argue that that's the lesser of the problems with auto-decoding. The big problem is that it's auto-decoding. Code points are almost always the wrong level to be operating at. The programmer needs to be in control of whether the code is operating on code units, code points, or graphemes, and because of auto-decoding, we have to constantly avoid using the range primitives for arrays on strings. Tons of range-based code has to special case for strings in order to work around auto-decoding. We're constantly fighting our own API in order to process strings sanely and efficiently.
I'd agree with you, hate the special casing. However it seems to me this has been debated to death already, and that auto-decoding was successfully advocated by Alexandrescu and al; surviving the controversy years ago.
Most everyone who debated in favor of it early on is very much against it now (and I'm one of them). Experience and a better
I wasn't so much asking about auto-decoding in particular, more about the mentality and methods of breaking changes.

In a way, any change to the compiler is a breaking change when it comes to the configuration.

I for one never expect code to compile on the latest compiler; it has to be the same compiler, same version, for the code base to work as expected.

At one point I envisioned every file with a header that states the version of the compiler required for that module. A sophisticated configuration tool could then compile each module with its respective version, and then one could link. (This could very well be the worst idea ever.)

I'm not saying we should be quick to change... oh no, that would be very bad. But after you sit in the filth of your decisions long and hard and are certain that it is indeed bad, there should be a plan for action and change. And when it comes to change, it should be an evolution, not a revolution.

It is good to avoid the so easily accepted mentality of legacy... Why do you do it that way? "It's because we've always done it that way."

The reason I like D is often that, driven by its community, it innovates and renovates into a language that is honestly really fun to use. (Most of the time.)
Mar 08 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Friday, March 09, 2018 03:16:03 Taylor Hillegeist via Digitalmars-d 
wrote:

 I wasn't so much asking about auto-decoding in particular more
 about the mentality and methods of breaking changes.

 In a way any change to the compiler is a breaking change when it
 comes to the configuration.

 I for one never expect code to compile on the latest compiler, It
 has to be the same compiler same version for the code base to
 work as expected.

 At one point I envisioned every file with a header that states
 the version of the compiler required for that module. A
 sophisticated configuration tool could take and compile each
 module with its respective version and then one could link. (this
 could very well be the worst idea ever)

 I'm not saying we should be quick to change... oh noo that would
 be very bad. But after you set in the filth of your decisions
 long and hard and are certian that it is indeed bad there should
 be a plan for action and change. And when it comes to change it
 should be an evolution not a revolution.

 It is good avoiding the so easily accepted mentality of legacy...
 Why do you do it that way? "It's because we've always done it
 that way."

 The reason I like D is often that driven by its community it
 innovates and renovates into a language that is honestly really
 fun to use. (most of the time)
Any and all changes need to be weighed for their pros and cons. No one likes it when their code breaks, and ideally, programs would work pretty much forever without modification, but some changes are worth the code breakage they cause. Part of the problem is deciding which changes are worth it, and some of that depends on what the migration path would be. Some stuff can be changed with minimal pain, and other stuff can't really be changed without breaking everything. And the more D code that exists, the higher the cost of any change.

The drive to make D perfect and the need to be able to use and rely on D code working in production without having to keep changing it are always in conflict. As Walter likes to say, some folks don't want you to break anything, whereas some folks want breaking changes, and they're frequently the same people. Ideally, any D code that you write would work permanently as-is. Also ideally, any and all problems or pain points with D and its standard library would be fixed. Those two things are in complete contradiction of one another, and it's not always easy to judge how to deal with that.

Sometimes, it means that we're stuck with legacy decisions, because fixing them is too costly, whereas other times, it means that we deprecate something, and some of the D code out there has to be updated, or it won't compile anymore in a release somewhere in the future. Either way, outright breaking code immediately, with no migration process, is pretty much always unacceptable. We'll make breaking changes if we judge the gain to be worth the pain, but we don't want to be constantly breaking people's code, and some changes are large enough that there's arguably no justification for them, because they would simply be too disruptive.

Because of how common string processing is and how integrated auto-decoding is into D's string processing, it is very difficult to come up with a way to change it which isn't simply too disruptive to be justified, even though we want to change it. So, this is a particularly difficult case, and how we're going to end up handling it remains to be seen. Thus far, we've mainly worked on providing better ways to get around it, because we can do that without breaking code, whereas actually removing it is extremely difficult.

- Jonathan M Davis
Mar 08 2018
parent Chris <wendlec tcd.ie> writes:
On Friday, 9 March 2018 at 06:14:05 UTC, Jonathan M Davis wrote:

 We'll make breaking changes if we judge the gain to be worth 
 the pain, but we don't want to be constantly breaking people's 
 code, and some changes are large enough that there's arguably 
 no justification for them, because they would simply be too 
 disruptive. Because of how common string processing is and how 
 integrated auto-decoding is into D's string processing, it is 
 very difficult to come up with a way to change it which isn't 
 simply too disruptive to be justified, even though we want to 
 change it. So, this is a particularly difficult case, and how 
 we're going to end up handling it remains to be seen. Thus far, 
 we've mainly worked on providing better ways to get around it, 
 because we can do that without breaking code, whereas actually 
 removing it is extremely difficult.

 - Jonathan M Davis
It's already been said (by myself and others) that we should actually try to remove it (with a compiler switch) and then see what happens: how much code actually breaks, and based on that experience we can come up with a strategy. I've already said that I'm willing to try it on my code (which is almost 100% string processing). Why not _try_ it? We can still philosophize later.
Mar 09 2018
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Mar 08, 2018 at 10:14:16AM -0700, Jonathan M Davis via Digitalmars-d
wrote:
 On Thursday, March 08, 2018 16:34:11 Guillaume Piolat via Digitalmars-d 
 wrote:
[...]
 I'd agree with you, hate the special casing. However it seems to
 me this has been debated to death already, and that auto-decoding
 was successfully advocated by Alexandrescu and al; surviving the
 controversy years ago.
Most everyone who debated in favor of it early on is very much against it now (and I'm one of them). Experience and a better understanding of Unicode has shown it to be a terrible idea. I question that you will find any significant contributor to Phobos who would choose to have it if we were starting from scratch, and most of the folks who post in the newsgroup agree with that.
[...] Yeah, the only reason autodecoding survived in the beginning was because Andrei (wrongly) thought that a Unicode code point was equivalent to a grapheme. If that had been the case, the cost associated with auto-decoding may have been justifiable. Unfortunately, that is not the case, which greatly diminishes most of the advantages that autodecoding was meant to have. So it ended up being something that incurred a significant performance hit, yet did not offer the advantages it was supposed to.

To fully live up to Andrei's original vision, it would have to include grapheme segmentation as well. Unfortunately, graphemes are of arbitrary length and cannot in general fit in a single dchar (or any fixed-size type), and grapheme segmentation is extremely costly to compute, so doing it by default would kill D's string manipulation performance.

In hindsight, it was obviously a failure and a wrong design decision. Walter is clearly against it after he learned that it comes with a hefty performance cost, and even Andrei himself would admit today that it was a mistake. It's only that he, understandably, does not agree with any change that would disrupt existing code. And that's what we're faced with right now.

T

-- 
Frank disagreement binds closer than feigned agreement.
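[Editor's note: the code point vs. grapheme distinction is easy to demonstrate with a combining sequence, using only std.uni.byGrapheme and std.range.walkLength from Phobos. One user-perceived character can span several code points, which is why decoding to dchar is not enough:]

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: looks like "é",
    // i.e. one user-perceived character.
    string s = "e\u0301";

    assert(s.length == 3);                // 3 UTF-8 code units
    assert(s.walkLength == 2);            // 2 code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // 1 grapheme
}
```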
Mar 08 2018
next sibling parent Henrik <henrik nothing.com> writes:
On Thursday, 8 March 2018 at 17:35:11 UTC, H. S. Teoh wrote:
 On Thu, Mar 08, 2018 at 10:14:16AM -0700, Jonathan M Davis via 
 Digitalmars-d wrote:
 [...]
[...]
 [...]
[...] Yeah, the only reason autodecoding survived in the beginning was because Andrei (wrongly) thought that a Unicode code point was equivalent to a grapheme. If that had been the case, the cost associated with auto-decoding may have been justifiable. Unfortunately, that is not the case, which greatly diminishes most of the advantages that autodecoding was meant to have. So it ended up being something that incurred a significant performance hit, yet did not offer the advantages it was supposed to. To fully live up to Andrei's original vision, it would have to include grapheme segmentation as well. Unfortunately, graphemes are of arbitrary length and cannot in general fit in a single dchar (or any fixed-size type), and grapheme segmentation is extremely costly to compute, so doing it by default would kill D's string manipulation performance. [...]
Which companies are against changing this? They must be powerful indeed if their convenience is important enough to protect such destructive features. Even C++ managed to give up trigraphs against the will of IBM. Surely D can give up something that is even more destructive?
Mar 08 2018
prev sibling parent Guillaume Piolat <notthat email.com> writes:
On Thursday, 8 March 2018 at 17:35:11 UTC, H. S. Teoh wrote:
 Yeah, the only reason autodecoding survived in the beginning 
 was because Andrei (wrongly) thought that a Unicode code point 
 was equivalent to a grapheme.  If that had been the case, the 
 cost associated with auto-decoding may have been justifiable.  
 Unfortunately, that is not the case, which greatly diminishes 
 most of the advantages that autodecoding was meant to have.  So 
 it ended up being something that incurred a significant 
 performance hit, yet did not offer the advantages it was 
 supposed to.  To fully live up to Andrei's original vision, it 
 would have to include grapheme segmentation as well.  
 Unfortunately, graphemes are of arbitrary length and cannot in 
 general fit in a single dchar (or any fixed-size type), and 
 grapheme segmentation is extremely costly to compute, so doing 
 it by default would kill D's string manipulation performance.
I remember something a bit different from the last time it was discussed:

- removing auto-decoding would break a lot of code; it's used in lots of places
- the performance loss can be mitigated with .byCodeUnit every time
- Andrei correctly advocating against breakage

Personally, I do use auto-decoding, often iterating by code point, and use it for fonts and parsers. It's correct for a large subset of languages. You gave us a feature and now we are using it ;)
Mar 09 2018
prev sibling next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 3/7/18 1:00 AM, Taylor Hillegeist wrote:
 So i've seen on the forum over the years arguments about auto-decoding 
 (mostly) and some other things. Things that have been considered 
 mistakes, and cannot be corrected because of the breaking changes it 
 would create. And I always wonder why not make a solution to the tune of 
 a flag that makes things work as they used too, and make the new 
 behavior default.
 
 dmd --UseAutoDecoding
 
 That way the breaking change was easily fixable, and the mistakes of the 
 past not forever. Is it just the cost of maintenance?
Note, autodecoding is NOT a feature of the language, but rather a feature of Phobos.

It would be quite interesting, I think, to create a modified Phobos where autodecoding was optional and see what happens (it could be switched with a -version=autodecoding). It wouldn't take much effort -- just take out the specializations for strings in std.array.

-Steve
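[Editor's note: for reference, the string specialization being discussed boils down to something like the following. This is a simplified sketch, not the actual Phobos source: front decodes a full code point with std.utf.decode, and popFront skips a whole code point's worth of code units.]

```d
import std.utf : decode;

// Sketch of Phobos-style range primitives for narrow strings:
// they yield dchar code points, not the underlying chars.
dchar front(const(char)[] s)
{
    size_t i = 0;
    return decode(s, i); // decodes 1-4 code units into one dchar
}

void popFront(ref const(char)[] s)
{
    size_t i = 0;
    decode(s, i); // advance i past the first code point
    s = s[i .. $];
}

void main()
{
    const(char)[] s = "éx";
    assert(front(s) == 'é'); // one decoded code point (2 code units)
    popFront(s);
    assert(s == "x");
}
```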
Mar 07 2018
parent reply Seb <seb wilzba.ch> writes:
On Wednesday, 7 March 2018 at 14:59:35 UTC, Steven Schveighoffer 
wrote:
 On 3/7/18 1:00 AM, Taylor Hillegeist wrote:
 [...]
Note, autodecoding is NOT a feature of the language, but rather a feature of Phobos. It would be quite interesting I think to create a modified phobos where autodecoding was optional and see what happens (could be switched with a -version=autodecoding). It wouldn't take much effort -- just take out the specializations for strings in std.array. -Steve
Well, I tried that already: https://github.com/dlang/phobos/pull/5513 In short: very easy to do, but not much interest at the time.
Mar 07 2018
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Mar 07, 2018 at 04:29:33PM +0000, Seb via Digitalmars-d wrote:
 On Wednesday, 7 March 2018 at 14:59:35 UTC, Steven Schveighoffer wrote:
 On 3/7/18 1:00 AM, Taylor Hillegeist wrote:
 [...]
Note, autodecoding is NOT a feature of the language, but rather a feature of Phobos. It would be quite interesting I think to create a modified phobos where autodecoding was optional and see what happens (could be switched with a -version=autodecoding). It wouldn't take much effort -- just take out the specializations for strings in std.array. -Steve
Well, I tried that already: https://github.com/dlang/phobos/pull/5513 In short: very easy to do, but not much interest at the time.
Argh... this really struck a nerve. "Not much interest"?! I think a more accurate description is every passerby going "that looks dangerous and I don't have enough time to spare to look into it right now, so better just leave it up to somebody else to stick their neck out and get beheaded by Andrei later", resulting in nobody taking apparent interest in the PR, even though many of us *really* want to see autodecoding go the way of the dodo.

T

-- 
EMACS = Extremely Massive And Cumbersome System
Mar 07 2018
prev sibling parent Dukc <ajieskola gmail.com> writes:
On Wednesday, 7 March 2018 at 16:29:33 UTC, Seb wrote:
 Well, I tried that already:

 https://github.com/dlang/phobos/pull/5513

 In short: very easy to do, but not much interest at the time.
No. The main problem with that (and with the idea of using a compiler flag in general) is that it affects the whole compilation. That means that every single third-party library, not only Phobos, has to work BOTH with and without the switch. IMO, if we find a way to enable or disable autodecoding per module, not per compilation, that will make deprecating it more than worthwhile.
Mar 08 2018
prev sibling parent reply Jon Degenhardt <jond noreply.com> writes:
On Wednesday, 7 March 2018 at 06:00:30 UTC, Taylor Hillegeist 
wrote:
 So i've seen on the forum over the years arguments about 
 auto-decoding (mostly) and some other things. Things that have 
 been considered mistakes, and cannot be corrected because of 
 the breaking changes it would create. And I always wonder why 
 not make a solution to the tune of a flag that makes things 
 work as they used too, and make the new behavior default.

 dmd --UseAutoDecoding

 That way the breaking change was easily fixable, and the 
 mistakes of the past not forever. Is it just the cost of 
 maintenance?
Auto-decoding is a significant issue for the applications I work on (search engines). There is a lot of string manipulation in these environments, and performance matters. Auto-decoding is a meaningful performance hit. Otherwise, Phobos has a very nice collection of algorithms for string manipulation. It would be great to have a way to turn auto-decoding off in Phobos. --Jon
Mar 07 2018
parent reply Seb <seb wilzba.ch> writes:
On Wednesday, 7 March 2018 at 15:26:40 UTC, Jon Degenhardt wrote:
 On Wednesday, 7 March 2018 at 06:00:30 UTC, Taylor Hillegeist 
 wrote:
 [...]
Auto-decoding is a significant issue for the applications I work on (search engines). There is a lot of string manipulation in these environments, and performance matters. Auto-decoding is a meaningful performance hit. Otherwise, Phobos has a very nice collection of algorithms for string manipulation. It would be great to have a way to turn auto-decoding off in Phobos. --Jon
Well, you can use byCodeUnit, which disables auto-decoding. Though it's not well known, and it's rather annoying to have to explicitly add it almost everywhere.
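[Editor's note: a minimal example of the suggestion. byCodeUnit lives in std.utf; its range primitives yield chars, so algorithms run without the decode step, and for ASCII needles the results match the auto-decoded default:]

```d
import std.algorithm.searching : count;
import std.utf : byCodeUnit;

void main()
{
    string s = "banana";

    // Default: count operates on auto-decoded dchar code points.
    assert(s.count('a') == 3);

    // Opt out: count operates on raw UTF-8 code units, no decoding.
    assert(s.byCodeUnit.count('a') == 3);
}
```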
Mar 07 2018
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Mar 07, 2018 at 04:33:25PM +0000, Seb via Digitalmars-d wrote:
 On Wednesday, 7 March 2018 at 15:26:40 UTC, Jon Degenhardt wrote:
[...]
 Auto-decoding is a significant issue for the applications I work on
 (search engines). There is a lot of string manipulation in these
 environments, and performance matters. Auto-decoding is a meaningful
 performance hit. Otherwise, Phobos has a very nice collection of
 algorithms for string manipulation. It would be great to have a way
 to turn auto-decoding off in Phobos.
[...]
 Well you can use byCodeUnit, which disables auto-decoding
 
 Though it's not well-known and rather annoying to explicitly add it
 almost everywhere.
And therein lies the rub: because it's *auto* decoding, rather than just decoding, it's implicit everywhere, adding to the performance hit without the coder being necessarily aware of it. You have to put in the effort to add .byCodeUnit everywhere.

Worse yet, it gives the false sense of security that you're doing Unicode "right", when actually that is *not* true at all, because a code point is not equal to a grapheme (what people normally know as a "character"). But because operating at the code point level *appears* to be correct 80% of the time, bugs in string handling often go unnoticed, unlike operating at the code unit level, where any Unicode handling bugs are immediately obvious as soon as your string contains non-ASCII characters.

So you're essentially paying the price of a significant performance hit for the dubious benefit of non-100%-correct code, but with bugs conveniently obscured so that it's harder to notice them.

Kill autodecoding, I say. Kill it with fire!!

T

-- 
MACINTOSH: Most Applications Crash, If Not, The Operating System Hangs
Mar 07 2018
parent Gary Willoughby <dev nomad.so> writes:
On Wednesday, 7 March 2018 at 17:11:55 UTC, H. S. Teoh wrote:
 Kill autodecoding, I say. Kill it with fire!!


 T
Please!!!
Mar 09 2018
prev sibling parent Jon Degenhardt <jond noreply.com> writes:
On Wednesday, 7 March 2018 at 16:33:25 UTC, Seb wrote:
 On Wednesday, 7 March 2018 at 15:26:40 UTC, Jon Degenhardt 
 wrote:
 On Wednesday, 7 March 2018 at 06:00:30 UTC, Taylor Hillegeist 
 wrote:
 [...]
Auto-decoding is a significant issue for the applications I work on (search engines). There is a lot of string manipulation in these environments, and performance matters. Auto-decoding is a meaningful performance hit. Otherwise, Phobos has a very nice collection of algorithms for string manipulation. It would be great to have a way to turn auto-decoding off in Phobos.
Well you can use byCodeUnit, which disables auto-decoding Though it's not well-known and rather annoying to explicitly add it almost everywhere.
I looked at this once. It didn't appear to be a viable solution, though I forget the details. I can probably resurrect them if that would be helpful.
Mar 07 2018