
digitalmars.D.learn - ElementType!string

reply "qznc" <qznc web.de> writes:
Apparently, ElementType!string evaluates to dchar. I would have 
expected char. Why is that?
Aug 25 2013
next sibling parent reply "Paolo Invernizzi" <paolo.invernizzi gmail.com> writes:
On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
 Apparently, ElementType!string evaluates to dchar. I would have 
 expected char. Why is that?
I think it is because strings are iterated by ranges as dchar, which is equivalent to iterating over the Unicode code points. If they were iterated by char, you would get the individual pieces of the UTF-8 encoding during the iteration, and that is usually not what a user expects. Note, on the other hand, that static assert(is(typeof(" "[0]) == immutable(char))) holds, so you can iterate over the string by chars using the index.
- Paolo Invernizzi
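The two views described above can be checked at compile time. This is only a minimal sketch: the helper name `unitsByIndex` is made up for illustration, and it assumes a Phobos in which `std.range` re-exports `ElementType`.

```d
import std.range : ElementType;

// Range view: the elements of a string are decoded code points.
static assert(is(ElementType!string == dchar));
// Array view: indexing yields a single UTF-8 code unit.
static assert(is(typeof(" "[0]) == immutable(char)));

// Hypothetical helper: collects the code units of `s` by plain
// indexing, so no decoding takes place.
char[] unitsByIndex(string s)
{
    char[] units;
    foreach (i; 0 .. s.length)
        units ~= s[i];
    return units;
}

void main()
{
    // 'ß' takes two UTF-8 code units, so "Maß" has four in total.
    assert(unitsByIndex("Maß").length == 4);
}
```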
Aug 25 2013
parent reply "qznc" <qznc web.de> writes:
On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
 I think it is because strings are iterated by ranges as dchar, which is
 equivalent to iterating over the Unicode code points.
Thanks, somewhat unintuitive.

This also seems to be the explanation for why the types documentation describes char as "unsigned 8 bit UTF-8", which is different from ubyte, "unsigned 8 bit".

Confirmed by this unittest:

    import std.range : walkLength;

    unittest
    {
        string raw = "Maß";
        assert(raw.length == 4);      // UTF-8 code units
        assert(walkLength(raw) == 3); // code points
    }
Aug 25 2013
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 25 August 2013 at 19:51:50 UTC, qznc wrote:
 Thanks, somewhat unintuitive.
Yes, but un-intuitive... to the un-initiated. By default, it's also safer. A string *is*, conceptually, a sequence of Unicode code points. The fact that it is made of UTF-8 code units is really just a low-level implementation detail.

Thanks to this behavior, things like:

    import std.range : front, popFront;

    string s = "日本語";
    // search for '本'
    for ( ; s.front != '本'; s.popFront()) {}

Well, they *just work* (TM).

Now... If you *know* what you are doing, then by all means, iterate on the UTF-8 code units. But be warned, you must really know what you are doing.

Back to your original subject, you can use:

    ElementEncodingType!S

ElementEncodingType works just like ElementType, but for strings it *really* takes the array's element type. This is usually *not* the default you want.

Also related, foreach naturally iterates on code units by default (for some weird reason). I recommend trying to iterate on dchars.
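To make the difference between the two traits and the two foreach styles concrete, here is a small sketch. The function names `countUnits` and `countPoints` are invented for this example, and it assumes `std.range` re-exports both traits.

```d
import std.range : ElementEncodingType, ElementType;

// ElementType reports the decoded element, ElementEncodingType the
// element actually stored in the array.
static assert(is(ElementType!string == dchar));
static assert(is(ElementEncodingType!string == immutable(char)));

// Hypothetical helper: plain foreach infers `c` as a code unit.
size_t countUnits(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Hypothetical helper: asking for dchar makes foreach decode.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // Each of these three characters takes three UTF-8 code units.
    assert(countUnits("日本語") == 9);
    assert(countPoints("日本語") == 3);
}
```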
Aug 26 2013
prev sibling next sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
 Apparently, ElementType!string evaluates to dchar. I would have 
 expected char. Why is that?
It is mentioned in the documentation of `ElementType`. Use `std.range.ElementEncodingType` or `std.traits.ForeachType` to get `char` and `wchar` when given arrays of those two types.

As for the rationale: `string`, being an alias for `immutable(char)[]`, is an array of UTF-8 code units - an array of `char`s. However, it is indeed a forward range of code points (each represented as a UTF-32 code unit - `dchar`). It's a (slightly controversial) choice that was made to make Unicode-correct code the easiest and most intuitive to write, as code points are much more useful than code units.

Note that it is not a random-access range. UTF-8 is a variable-length encoding, so several code units can be required to encode a single code point. Hence, a non-trivial search is required to get the n'th code point in a UTF-8 or UTF-16 string.

Another name for a code point is "character" (technically, a character is what the code point translates to in the UCS). However, it can be a deceptive name - the units we see on screen when rendered are "graphemes", as Unicode characters can be combining, zero-width etc.

To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to `immutable(ubyte)[]` to operate on that, then cast it back at a later point.
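The cast round trip mentioned above might look roughly like this. It is only a sketch: the function names `asBytes` and `asString` are hypothetical, and casting back to `string` is only sound if the bytes are still valid UTF-8.

```d
import std.range : walkLength;
import std.traits : ForeachType;

// ForeachType reports what a plain foreach would see: a code unit.
static assert(is(ForeachType!string == immutable(char)));

// Hypothetical helper: reinterpret the string as raw bytes. No copy
// is made; both slices refer to the same memory.
immutable(ubyte)[] asBytes(string s)
{
    return cast(immutable(ubyte)[]) s;
}

// Hypothetical helper: reinterpret the bytes as a string again.
// The caller must make sure they are still valid UTF-8.
string asString(immutable(ubyte)[] bytes)
{
    return cast(string) bytes;
}

void main()
{
    string s = "Maß";
    auto bytes = asBytes(s);
    assert(bytes.walkLength == 4); // no decoding on ubyte arrays
    assert(s.walkLength == 3);     // decoded code points
    assert(asString(bytes) == s);
}
```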
Aug 25 2013
parent reply "Jason den Dulk" <public2 jasondendulk.com> writes:
On Sunday, 25 August 2013 at 19:51:50 UTC, qznc wrote:

 Thanks, somewhat unintuitive.
It is a trap for the unwary, but in this case the benefits outweigh the costs.
On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:
 To get a range of UTF-8 or UTF-16 code units, the code units 
 have to be represented as something other than `char` and 
 `wchar`. For example, you can cast your string to 
 immutable(ubyte)[] to operate on that, then cast it back at a 
 later point.
To have to use ubyte would seem to defeat the purpose of having char. If I were to have this:

    import std.traits : isSomeChar;

    auto no_convert(C)(C[] s) if (isSomeChar!C)
    {
        struct No
        {
            private C[] s;
            this(C[] _s) { s = _s; }
            @property bool empty() { return s.length == 0; }
            @property C front() in { assert(s.length != 0); } body { return s[0]; }
            void popFront() in { assert(s.length != 0); } body { s = s[1 .. $]; }
        }
        return No(s);
    }

its element type would be char for strings. Would this still result in conversions if I used it with other algorithms?
Aug 27 2013
next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Tuesday, 27 August 2013 at 11:43:29 UTC, Jason den Dulk wrote:
 Its element type would be char for strings. Would this still result in
 conversions if I used it with other algorithms?
That should work. It's the functions in std.array that make ranges out of arrays by providing empty, front and popFront. As long as you don't use these, everything is fine. Actually, I think that your wrapper should do the conversion and std.array should not, but that train is long gone.
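A quick way to check this claim is to ask `ElementType` about such a wrapper. The sketch below uses a simplified stand-in for the range from the question; the names `CodeUnits` and `codeUnits` are made up here and are not part of Phobos.

```d
import std.range : ElementType, isInputRange, walkLength;
import std.traits : isSomeChar;

// Simplified stand-in for the wrapper in the question: it defines its
// own range primitives, so the decoding ones for arrays are not used.
struct CodeUnits(C) if (isSomeChar!C)
{
    private C[] s;
    @property bool empty() const { return s.length == 0; }
    @property C front() const { return s[0]; }
    void popFront() { s = s[1 .. $]; }
}

// Hypothetical convenience constructor.
auto codeUnits(C)(C[] s) if (isSomeChar!C)
{
    return CodeUnits!C(s);
}

void main()
{
    auto r = codeUnits("Maß");
    static assert(isInputRange!(typeof(r)));
    // The element type is the code unit, not dchar.
    static assert(is(ElementType!(typeof(r)) == immutable(char)));
    // Generic algorithms now walk code units: four for "Maß".
    assert(r.walkLength == 4);
}
```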
Aug 27 2013
prev sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 27 August 2013 at 11:43:29 UTC, Jason den Dulk wrote:
 Its element type would be char for strings. Would this still result in
 conversions if I used it with other algorithms?
It might, but that range of yours is underwhelming: no indexing, no length, no nothing. Why would you want to do *that* though? Is it because you have an ASCII string? In that case, you should be interested in std.encoding.AsciiChar and std.encoding.AsciiString.
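For a code-unit view that does offer indexing and length, Phobos releases newer than this thread provide `std.utf.byCodeUnit`. It was not mentioned in the discussion, so treat the sketch below as an assumption about the reader's toolchain rather than as the advice given here.

```d
import std.range : ElementType, hasLength, isRandomAccessRange;
import std.utf : byCodeUnit;

void main()
{
    // byCodeUnit wraps the string without decoding it.
    auto r = "Maß".byCodeUnit;

    static assert(is(ElementType!(typeof(r)) == immutable(char)));
    static assert(hasLength!(typeof(r)));
    static assert(isRandomAccessRange!(typeof(r)));

    assert(r.length == 4); // code units, not code points
    assert(r[0] == 'M');
}
```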
Aug 27 2013
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
qznc:

 Apparently, ElementType!string evaluates to dchar. I would have 
 expected char. Why is that?
Try also ForeachType :-) Bye, bearophile
Aug 25 2013