digitalmars.D.learn - Autodecode?

JN <666total wp.pl> writes:
Related to this thread: 
https://forum.dlang.org/post/xtjzhkvszdiwvrmryubq forum.dlang.org

I don't want to hijack it with my newbie questions. What is 
autodecode and why is it such a big deal? From what I've seen 
it's related to handling Unicode characters? And D has the wrong 
defaults?
Aug 16 2020
aberba <karabutaworld gmail.com> writes:
On Sunday, 16 August 2020 at 20:53:41 UTC, JN wrote:
 Related to this thread: 
 https://forum.dlang.org/post/xtjzhkvszdiwvrmryubq forum.dlang.org

 I don't want to hijack it with my newbie questions. What is 
 autodecode and why is it such a big deal? From what I've seen 
 it's related to handling Unicode characters? And D has the 
 wrong defaults?
https://forum.dlang.org/thread/qitnkf$2736$1 digitalmars.com?page=1
Aug 16 2020
Paul Backus <snarwin gmail.com> writes:
On Sunday, 16 August 2020 at 20:53:41 UTC, JN wrote:
 Related to this thread: 
 https://forum.dlang.org/post/xtjzhkvszdiwvrmryubq forum.dlang.org

 I don't want to hijack it with my newbie questions. What is 
 autodecode and why is it such a big deal? From what I've seen 
 it's related to handling Unicode characters? And D has the 
 wrong defaults?
For built-in arrays, the range primitives (empty, front, popFront, etc.) are implemented as free functions in the standard-library module `std.range.primitives`. [1] For most arrays, these work the way you'd expect: empty checks if the array is empty, front returns `array[0]`, and popFront does `array = array[1..$]`.

But for char[] and wchar[] specifically, `front` and `popFront` work differently. They treat the arrays as UTF-8 or UTF-16 encoded Unicode strings, and return/pop the first *code point* instead of the first *code unit*. In other words, they "automatically decode" the array.

This has a number of annoying consequences. New users get mysterious template errors in the middle of range pipelines complaining about a mismatch between `dchar` (the type of a code point) and `char` (the type of a code unit). Generic code that deals with arrays has to add special cases for char[] and wchar[]. Strings don't work correctly in betterC because Unicode decoding can throw an exception. [2] If you search the forums, you'll find plenty more complaints.

The intent behind autodecoding was to help programmers avoid common Unicode-related errors by doing "the right thing" by default. The problem is that (a) decoding to code points isn't always the right thing, and (b) autodecoding ended up causing a bunch of additional problems of its own.

[1] http://dpldocs.info/experimental-docs/std.range.primitives.html
[2] https://issues.dlang.org/show_bug.cgi?id=20139
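To make the code unit versus code point distinction concrete, here is a minimal sketch of what the range primitives do to a `string`. The helper names (`countCodeUnits`, `countCodePoints`, `dropTwoCodePoints`) are made up for illustration and are not part of Phobos; `byCodeUnit` from `std.utf` is shown as one way to opt out of decoding.

```d
import std.range.primitives : ElementType, front, popFront, walkLength;
import std.utf : byCodeUnit;

/// Hypothetical helper: what the language's .length reports (UTF-8 code units).
size_t countCodeUnits(string s) { return s.length; }

/// Hypothetical helper: what walking the autodecoded range reports (code points).
size_t countCodePoints(string s) { return s.walkLength; }

/// Hypothetical helper: pops two elements through the range interface.
string dropTwoCodePoints(string s)
{
    s.popFront(); // removes one whole code point, however many code units it spans
    s.popFront();
    return s;
}

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    // Indexing the array gives a code unit ...
    static assert(is(typeof(s[0]) == immutable(char)));
    // ... but the range primitives give a decoded code point.
    static assert(is(typeof(s.front) == dchar));
    static assert(is(ElementType!string == dchar));

    assert(countCodeUnits(s) == 6);
    assert(countCodePoints(s) == 5);
    assert(dropTwoCodePoints(s) == "llo");

    // byCodeUnit wraps the string in a range of raw code units (no decoding).
    static assert(is(ElementType!(typeof(s.byCodeUnit)) == immutable(char)));
    assert(s.byCodeUnit.walkLength == 6);
}
```

The mismatch between `s.length` and `s.walkLength` is the kind of thing that tends to surface as a `char` versus `dchar` template error further down a range pipeline.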
Aug 16 2020
Steven Schveighoffer <schveiguy gmail.com> writes:
On 8/16/20 4:53 PM, JN wrote:
 Related to this thread: 
 https://forum.dlang.org/post/xtjzhkvszdiwvrmryubq forum.dlang.org
 
 I don't want to hijack it with my newbie questions. What is autodecode 
 and why is it such a big deal? From what I've seen it's related to 
 handling Unicode characters? And D has the wrong defaults?
Aside from what others have said, autodecode isn't really terrible as a default. But what IS terrible is the inconsistency. Phobos says char[] is not an array, but the language says it is. e.g.:

    import std.range.primitives;

    char[] example;
    static assert(!hasLength!(typeof(example))); // Phobos: no length here!
    auto l = example.length; // dlang: um... yes, there is.
    static assert(!isRandomAccessRange!(typeof(example))); // P: no index support!
    auto e = example[0]; // D: yes, you can index.

And probably my favorite WTF:

    static assert(is(ElementType!(typeof(example)) == dchar)); // P: char[] is a range of dchar!
    foreach(e; example) static assert(is(typeof(e) == char)); // D: nope, it's an array of char

This leads to all kinds of fun stuff. Like try chaining together several char[] arrays, and then converting the result into an array. Surprise! It's a dchar[], and you just wasted a bunch of CPU cycles decoding them, not to mention the RAM to store it.

But then Phobos, while it's telling you that all these things aren't true, goes behind your back and implements all kinds of special cases to deal with these narrow strings *using the array interfaces* because it performs better.

We will be much better off once we are done with autodecoding.

That said, for many cases, autodecoding is just fine! Most of the time, you only care about the entire string and not what it's made of. Many other languages do "autodecoding", and in fact the string type is opaque. But then they give you ways to use it that aren't silly (for example, concatenating 2 strings knows what the underlying types are and figures out the most efficient way possible).

If `string` was a custom type and not an array, we probably wouldn't have so many issues with it.

-Steve
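The chaining surprise can be reproduced in a few lines. This is only a sketch: `joinDecoded` and `joinCodeUnits` are hypothetical helper names, and using `byCodeUnit` from `std.utf` is one possible way to sidestep the decode, not necessarily what the poster had in mind.

```d
import std.array : array;
import std.range : chain;
import std.utf : byCodeUnit;

/// Hypothetical helper: chains through the autodecoding range interface.
/// Each string is seen as a range of dchar, so the result is UTF-32.
dchar[] joinDecoded(string a, string b)
{
    return chain(a, b).array;
}

/// Hypothetical helper: chains the raw code units, skipping the decode step.
auto joinCodeUnits(string a, string b)
{
    return chain(a.byCodeUnit, b.byCodeUnit).array;
}

void main()
{
    // The decoded result stores 4 bytes per element ...
    static assert(typeof(joinDecoded("a", "b")[0]).sizeof == dchar.sizeof);
    // ... while the code-unit result stays at 1 byte per element.
    static assert(typeof(joinCodeUnits("a", "b")[0]).sizeof == char.sizeof);

    assert(joinDecoded("h\u00E9", "llo") == "h\u00E9llo"d);
    assert(joinDecoded("h\u00E9", "llo").length == 5);   // code points
    assert(joinCodeUnits("h\u00E9", "llo") == "h\u00E9llo");
    assert(joinCodeUnits("h\u00E9", "llo").length == 6); // UTF-8 code units
}
```

For plain concatenation the built-in `~` operator on the arrays avoids the issue entirely; the decode only appears once the strings go through the range interface.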
Aug 16 2020