digitalmars.D.learn - Char representation

RazvanN <razvan.nitu1305 gmail.com> writes:
Given the following code:

  char[5] a = ['a', 'b', 'c', 'd', 'e'];
  alias Range = char[];
  writeln(is(ElementType!Range == char));

One would expect the program to print true. In fact, it 
prints false, and I noticed that if Range is char[], wchar[], 
dchar[], string, wstring, or dstring, then 
Unqual!(ElementType!Range) is dchar. I find it odd that the 
internal representation for char and string is dchar. Is this a 
bug?
Nov 22 2016
rikki cattermole <rikki cattermole.co.nz> writes:
On 23/11/2016 2:29 AM, RazvanN wrote:
 [...]
"For example, ElementType!(T[]) is T if T[] isn't a narrow string; if it is, the element type is dchar"[0].

[0] https://dlang.org/phobos/std_range_primitives.html#ElementType
Nov 22 2016
Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 22 November 2016 at 13:29:47 UTC, RazvanN wrote:
 [...]
When seen as a range the element type of a char[] is indeed dchar. This is autodecoding at work.
Nov 22 2016
Adam D. Ruppe <destructionator gmail.com> writes:
On Tuesday, 22 November 2016 at 13:29:47 UTC, RazvanN wrote:
 Is this a bug?
The language is sane. The standard library is not.... alas, it is insane by design, so not a bug.
Nov 22 2016
Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On 22.11.2016 at 14:29, RazvanN via Digitalmars-d-learn wrote:
 [...]
RTFM: https://dlang.org/library/std/range/primitives/element_type.html
Nov 22 2016
Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On 22.11.2016 at 14:29, RazvanN via Digitalmars-d-learn wrote:
 [...]
https://dlang.org/library/std/range/primitives/element_encoding_type.html
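
A minimal sketch of how the two traits linked in this thread differ, assuming a reasonably recent Phobos. It only uses compile-time checks, and the empty main is there just so the file builds as a program:

```d
import std.range.primitives : ElementEncodingType, ElementType;
import std.traits : isNarrowString, Unqual;

// ElementType reports what front returns when the type is used as a range.
// For narrow strings (arrays of char or wchar) that is dchar, due to auto-decoding.
static assert(is(ElementType!(char[]) == dchar));
static assert(is(ElementType!(wchar[]) == dchar));
static assert(is(ElementType!(dchar[]) == dchar));

// ElementEncodingType reports the code unit type actually stored in the array.
static assert(is(ElementEncodingType!(char[]) == char));
static assert(is(ElementEncodingType!(wchar[]) == wchar));
static assert(is(Unqual!(ElementEncodingType!string) == char));

// isNarrowString tells generic code when the two traits disagree.
static assert(isNarrowString!string);
static assert(isNarrowString!wstring);
static assert(!isNarrowString!dstring);

// Arrays of non-character types are not special-cased.
static assert(is(ElementType!(int[]) == int));

void main() {}
```

So for the original snippet, is(ElementEncodingType!Range == char) is the check that yields true.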
Nov 22 2016
Jonathan M Davis via Digitalmars-d-learn writes:
On Tuesday, November 22, 2016 13:29:47 RazvanN via Digitalmars-d-learn 
wrote:
 [...]
You misunderstand. char[] is a dynamic array of char, wchar[] is a dynamic array of wchar, and dchar[] is a dynamic array of dchar. There is nothing funny going on with the internal representation. Rather, the problem is with the range API and the traits that go with it. And it's not a bug; it's a design mistake.

I don't know how much you know about Unicode, but for a quick explanation, you have code units, code points, and graphemes. A grapheme is made up of one or more code points, and a code point is made up of one or more code units.

In the case of UTF-8, a code unit is 8 bits; in UTF-16, a code unit is 16 bits; and in UTF-32, a code unit is 32 bits. Those are represented in D by char, wchar, and dchar respectively. There is no guarantee that a char, wchar, or dchar is a representable character. A code unit is just a piece of a character except in the cases where it happens to be a full character. :|

A code point, on the other hand, actually makes up something composable and printable. It's something like the letter A, or é, or の. It could also be an accent, a superscript, a subscript, etc. In the case of UTF-8 and UTF-16, it can take several code units to form a single code point. In the case of UTF-32, a single code unit is always a code point, because any code point fits in 32 bits. However, that's still not necessarily a full character. After all, an accent or a superscript is not really a character. Rather, it's a modifier for a character.

So, one or more code points can be combined to form graphemes, which _are_ actual characters. Unfortunately, there are several normalization schemes for the order of code points in a grapheme, and some graphemes can be represented as a single code point or as several (most notably, the characters which commonly have accents on them, such as é, come both as single code points and as combined code points).

So, this whole thing gets stupidly complicated. It's even worse when you want to handle it all _efficiently_.
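
As a rough sketch of the three levels described above, the same text can be counted as code units, code points, and graphemes. The sample strings are arbitrary choices for illustration, not something from the thread:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one visible character built from two code points.
    string combined = "e\u0301";

    assert(combined.length == 3);                 // UTF-8 code units: 1 for 'e', 2 for U+0301
    assert(combined.byCodeUnit.walkLength == 3);  // same count through a non-decoding range
    assert(combined.walkLength == 2);             // code points (the string is auto-decoded)
    assert(combined.byGrapheme.walkLength == 1);  // graphemes

    // The precomposed form U+00E9 is one code point, but still two UTF-8 code units.
    string precomposed = "\u00E9";

    assert(precomposed.length == 2);
    assert(precomposed.walkLength == 1);
    assert(precomposed.byGrapheme.walkLength == 1);
}
```

Both strings render as the same character, yet they differ at the code unit and code point levels, which is where the normalization issue comes from.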
Well, when Andrei added ranges to D, he tried to simplify things so that the default was correct and reasonably efficient while allowing for code to specialize where appropriate to get the full efficiency. That's a noble goal, but unfortunately, he didn't know about graphemes at the time. He thought that code points were guaranteed to be full characters and that if you operated at the code point level, you were guaranteed full correctness.

So, in order to avoid errors related to chopping up strings of char or wchar in the middle of code points, he came up with the concept of "narrow" strings - i.e. strings which are made up of char or wchar rather than dchar (so strings where each code unit is not guaranteed to be a code point), and he restricted what narrow strings could do by default per the range API and its associated traits. So, we get fun like this.

    assert(!hasLength!string);
    assert(!hasLength!wstring);
    assert(hasLength!dstring);

    assert(!isRandomAccessRange!string);
    assert(!isRandomAccessRange!wstring);
    assert(isRandomAccessRange!dstring);

    assert(is(ElementType!string == dchar));
    assert(is(ElementType!wstring == dchar));
    assert(is(ElementType!dstring == dchar));

And front, popFront, back, and popBack all automatically decode the code units in a string to code points. So, front and back both return dchar even if the string is a string of char or wchar.

The arrays themselves do not change. However, the way that the traits in std.range.primitives treat them is then fundamentally different from how the language treats them. So, even though

    string str = "hello world";
    for(auto r = str; !r.empty; r.popFront())
    {
        auto e = r.front;
    }

will iterate by dchar

    string str = "hello world";
    foreach(e; str)
    {
    }

will iterate by char. If you want it to iterate by dchar, then you make it explicit.

    string str = "hello world";
    foreach(dchar e; str)
    {
    }

The result of all of this is that by default, when you treat strings as ranges, you operate at the code point level.
This avoids certain bugs where code would otherwise chop up code points by operating on code units, but since it doesn't actually go to the grapheme level, it still isn't actually correct, and it's easier to miss the fact that it's wrong, since more cases work. It's also inefficient, because the code units are always decoded to code points regardless of whether the algorithm in question actually needs to do that or not. It also creates confusion and questions like yours.

Most of us agree at this point that all of this was a mistake and that narrow strings should not have been treated specially. Rather, it should be required for the programmer to wrap them in other ranges to decode code units to code points or graphemes so that the programmer has full control over it. But unfortunately, changing it at this point would be a _huge_ breaking change. So, it's unlikely that we're going to be able to. We hope that we'll find a way, but for now, we're stuck.

To work around this, Phobos tends to special case algorithms on strings in order to avoid the auto-decoding. find would be a prime example of this. As long as the code points are normalized, you can do a find using code units rather than code points. Decoding to code points is just a waste. However, some algorithms such as filter can't do that, because there is no obviously correct solution. The programmer really needs to be the one to decide, so they just always do the auto-decoding.

Traits like isNarrowString and ElementEncodingType can be used to detect when you're dealing with narrow strings and operate on them as strings rather than via the range API, and Phobos uses them heavily. In addition, std.utf has byCodeUnit and by{C,Wc,Dc}har, and std.uni has byGrapheme.
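
A minimal sketch of those wrappers, assuming a reasonably recent Phobos. The sample string and variable names are made up for illustration:

```d
import std.algorithm.searching : count;
import std.range : walkLength;
import std.range.primitives : ElementType;
import std.traits : Unqual;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // 'é' written as the single code point U+00E9 (two UTF-8 code units).
    string s = "h\u00E9llo";

    // byCodeUnit: no decoding; the elements are the stored chars.
    auto units = s.byCodeUnit;
    static assert(is(Unqual!(ElementType!(typeof(units))) == char));
    assert(units.length == 6);

    // byDchar: decoding to code points, but requested explicitly.
    auto points = s.byDchar;
    static assert(is(ElementType!(typeof(points)) == dchar));
    assert(points.walkLength == 5);

    // byGrapheme: iterate over user-perceived characters.
    assert(s.byGrapheme.walkLength == 5);

    // An algorithm applied to the code unit range does not auto-decode.
    assert(s.byCodeUnit.count('l') == 2);
}
```

The point of the wrappers is that the call site states which level the algorithm works at, instead of relying on the default.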
So, a lot of range-based code should really be using those rather than operating on strings directly, though there are a number of parts of Phobos that don't yet fully support arbitrary ranges of char or wchar (since previously, it was assumed that all ranges of character types were ranges of dchar). So, sometimes stuff that should work doesn't (the situation is improving though).

Alternatively, there's std.string.representation that can be used to cast a string of char, wchar, or dchar to an array of ubyte, ushort, or uint with the proper constness, and code can then operate on those integer types, and that won't auto-decode, but that doesn't work very well if you want to use functions intended specifically for strings (e.g. most of std.string doesn't work with arrays of ubyte, ushort, or uint).

So, yes. This is a bit of a mess. It works fairly well overall in spite of the problems, but it's still a mess. And you're far from alone in being confused by it.

- Jonathan M Davis
Nov 22 2016
RazvanN <razvan.nitu1305 gmail.com> writes:
On Tuesday, 22 November 2016 at 14:23:28 UTC, Jonathan M Davis 
wrote:
 On Tuesday, November 22, 2016 13:29:47 RazvanN via 
 Digitalmars-d-learn wrote:
 [...]
You misunderstand. char[] is a dynamic array of char, wchar[] is a dynamic array of wchar, and dchar[] is a dynamic array of dchar. There is nothing funny going on with the internal representation. Rather, the problem is with the range API and the traits that go with it. And it's not a bug; it's a design mistake. [...]
Thank you very much for this great explanation. Things are starting to make sense now.

Razvan Nitu
Nov 22 2016
Kagamin <spam here.lot> writes:
On Tuesday, 22 November 2016 at 13:29:47 UTC, RazvanN wrote:
 [...]
Here's the reading: https://forum.dlang.org/post/nh2o9i$hr0$1 digitalmars.com
Nov 22 2016