digitalmars.D.learn - ElementType!string
- qznc (2/2) Aug 25 2013 Apparently, ElementType!string evaluates to dchar. I would have
- Paolo Invernizzi (10/12) Aug 25 2013 I think is because they are iterated by ranges as dchar, that's
- qznc (9/21) Aug 25 2013 Thanks, somewhat unintuitive.
- monarch_dodra (21/22) Aug 26 2013 Yes, but un-intuitive... to the un-initiated. By default, it's
- Jakob Ovrum (24/26) Aug 25 2013 It is mentioned in the documentation of `ElementType`. Use
- Jason den Dulk (22/28) Aug 27 2013 It is a trap for the unwary, but in this case the benefits
- Tobias Pankrath (6/35) Aug 27 2013 That should work. It's the functions in std.array that make
- monarch_dodra (6/35) Aug 27 2013 It might, but that range of yours is underwhelming: no indexing,
- bearophile (4/6) Aug 25 2013 Try also ForeachType :-)
Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?
Aug 25 2013
On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?

I think it is because strings are iterated by ranges as dchar, which is equivalent to iterating over the Unicode code points. If they were iterated by char, you would get the individual pieces of the UTF-8 encoding during iteration, and that is usually not what a user expects.

Note, on the other hand, that static assert( is(typeof(" "[0]) == immutable(char)) ) holds, so you can still iterate over the string by char using the index.

- Paolo Invernizzi
Aug 25 2013
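A minimal sketch of the distinction described above, assuming a Phobos version in which the range primitives auto-decode narrow strings (the behavior this thread is about). The variable name `s` and the sample text are illustrative only.

```d
import std.range : ElementType, front;

void main()
{
    string s = "Maß";

    // Indexing works on the underlying array of UTF-8 code units.
    static assert(is(typeof(s[0]) == immutable(char)));

    // The range primitives decode, so the range element type is dchar.
    static assert(is(ElementType!string == dchar));
    assert(s.front == 'M');

    // 'ß' (U+00DF) is stored as the two code units 0xC3 0x9F.
    assert(s[2] == 0xC3);
    assert(s[3] == 0x9F);
}
```

Indexing past the first code unit of a multi-byte sequence, as with `s[3]` here, yields a fragment of a code point rather than a character.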
On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
> On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
>> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?
>
> I think it is because strings are iterated by ranges as dchar, which is equivalent to iterating over the Unicode code points. If they were iterated by char, you would get the individual pieces of the UTF-8 encoding during iteration, and that is usually not what a user expects.
>
> Note, on the other hand, that static assert( is(typeof(" "[0]) == immutable(char)) ) holds, so you can still iterate over the string by char using the index.
>
> - Paolo Invernizzi

Thanks, somewhat unintuitive. This also seems to be the explanation for why the types documentation describes char as "unsigned 8 bit UTF-8", which is different from ubyte, "unsigned 8 bit". Confirmed by this unittest:

    string raw = "Maß";
    assert(raw.length == 4);
    assert(walkLength(raw) == 3);
Aug 25 2013
On Sunday, 25 August 2013 at 19:51:50 UTC, qznc wrote:
> Thanks, somewhat unintuitive.

Yes, but un-intuitive... to the un-initiated. By default, it's also safer. A string *is*, conceptually, a sequence of Unicode code points. The fact that it is made of UTF-8 code units is really just a low-level implementation detail. Thanks to this behavior, things like:

    string s = "日本語";
    // search for '本'
    for ( ; s.front != '本' ; s.popFront()) {}

Well, they *just work* (TM).

Now... if you *know* what you are doing, then by all means, iterate on the UTF-8 code units. But be warned: you must really know what you are doing.

Back to your original subject, you can use:

    ElementEncodingType!S

ElementEncodingType works just like ElementType but, for strings, *really* takes the array's element type. This is usually *not* the default you want.

Also related: foreach iterates on code units by default (for some weird reason). I recommend iterating on dchars instead.
Aug 26 2013
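The difference between the two iteration modes mentioned above can be sketched as follows. This is only an illustration, assuming auto-decoding range primitives from `std.range`; the counter names and the copy `r` are made up for the example.

```d
import std.range : empty, front, popFront;

void main()
{
    string s = "日本語";

    // Plain foreach infers the element type from the array: UTF-8 code units.
    size_t codeUnits;
    foreach (c; s)
        ++codeUnits;

    // Asking for dchar makes foreach decode one code point per step.
    size_t codePoints;
    foreach (dchar c; s)
        ++codePoints;

    assert(codeUnits == 9);   // three bytes per character in UTF-8
    assert(codePoints == 3);

    // The range primitives decode as well, so the search compares code points.
    auto r = s;
    while (!r.empty && r.front != '本')
        r.popFront();
    assert(r == "本語");
}
```

The `!r.empty` check is an addition to the loop quoted in the post; without it the search would run off the end when the character is absent.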
On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?

It is mentioned in the documentation of `ElementType`. Use `std.range.ElementEncodingType` or `std.traits.ForeachType` to get `char` and `wchar` when given arrays of those two types.

As for the rationale: `string`, being an alias for `immutable(char)[]`, is an array of UTF-8 code units - an array of `char`s. However, it is indeed a forward range of code points (represented as a UTF-32 code unit - `dchar`). It's a (slightly controversial) choice that was made to make Unicode-correct code the easiest and most intuitive to write, as code points are much more useful than code units.

Note that it is not a random-access range. UTF-8 is a variable length encoding, so several code units can be required to encode a single code point. Hence, a non-trivial search is required to get the n'th code point in a UTF-8 or UTF-16 string.

Another name for a code point is "character" (technically, a character is what the code point translates to in the UCS). However, it can be a deceptive name - the units we see on screen when rendered are "graphemes", as Unicode characters can be combining, zero-width etc.

To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.
Aug 25 2013
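As a rough sketch of the last point above: instead of a hand-written cast, Phobos offers `std.string.representation`, and later Phobos releases also provide `std.utf.byCodeUnit`. Neither is mentioned in the post, so treat their use here as an assumption about the available library version rather than as what was being recommended.

```d
import std.range : ElementType, isForwardRange, isRandomAccessRange, walkLength;
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    string s = "Maß";

    // A string is a forward range of code points, but not random-access.
    static assert(isForwardRange!string);
    static assert(!isRandomAccessRange!string);

    // View the same memory as raw bytes; no decoding takes place.
    immutable(ubyte)[] bytes = s.representation;
    static assert(is(ElementType!(typeof(bytes)) == immutable(ubyte)));
    assert(bytes.length == 4);

    // A range over the code units that keeps char as the element type.
    auto units = s.byCodeUnit;
    static assert(is(ElementType!(typeof(units)) == immutable(char)));
    assert(units.walkLength == 4);

    // The plain string still decodes to three code points.
    assert(s.walkLength == 3);
}
```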
On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
> Thanks, somewhat unintuitive.

It is a trap for the unwary, but in this case the benefits outweigh the costs.

On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:
> To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.

To have to use ubyte would seem to defeat the purpose of having char. If I were to have this:

    auto no_convert(C)(C[] s) if (isSomeChar!C)
    {
        struct No
        {
            private C[] s;
            this(C[] _s) { s = _s; }
            @property bool empty() { return s.length == 0; }
            @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
            void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
        }
        return No(s);
    }

its element type would be char for strings. Would this still result in conversions if I used it with other algorithms?
Aug 27 2013
On Tuesday, 27 August 2013 at 11:43:29 UTC, Jason den Dulk wrote:
> On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
>> Thanks, somewhat unintuitive.
>
> It is a trap for the unwary, but in this case the benefits outweigh the costs.
>
> On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:
>> To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.
>
> To have to use ubyte would seem to defeat the purpose of having char. If I were to have this:
>
>     auto no_convert(C)(C[] s) if (isSomeChar!C)
>     {
>         struct No
>         {
>             private C[] s;
>             this(C[] _s) { s = _s; }
>             @property bool empty() { return s.length == 0; }
>             @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
>             void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
>         }
>         return No(s);
>     }
>
> its element type would be char for strings. Would this still result in conversions if I used it with other algorithms?

That should work. It's the functions in std.array that make ranges out of arrays by providing empty, front and popFront. As long as you don't use these, everything is fine.

Actually, I think that your wrapper should do the conversion and std.array should not, but that train is long gone.
Aug 27 2013
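A small self-contained sketch of why such a wrapper avoids decoding: the element type comes from the wrapper's own `front`, not from the array primitives. The names `CodeUnits` and `codeUnits` are hypothetical and only loosely modelled on the `no_convert` range quoted above.

```d
import std.range : ElementType, isInputRange, walkLength;
import std.traits : isSomeChar;

// Hypothetical wrapper that exposes the code units of a character array.
struct CodeUnits(C) if (isSomeChar!C)
{
    private C[] s;

    @property bool empty() const { return s.length == 0; }

    @property C front() const
    {
        assert(s.length != 0);
        return s[0];
    }

    void popFront()
    {
        assert(s.length != 0);
        s = s[1 .. $];
    }
}

// Helper so that C is deduced from the argument.
auto codeUnits(C)(C[] s) if (isSomeChar!C)
{
    return CodeUnits!C(s);
}

void main()
{
    auto r = codeUnits("Maß");

    static assert(isInputRange!(typeof(r)));
    // Algorithms see the wrapper's front, so the element type stays char.
    static assert(is(ElementType!(typeof(r)) == immutable(char)));

    assert(r.walkLength == 4);      // four UTF-8 code units
    assert("Maß".walkLength == 3);  // three code points when auto-decoded
}
```

Being only an input range, the wrapper gives up indexing, slicing and length, which limits which algorithms will accept it.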
On Tuesday, 27 August 2013 at 11:43:29 UTC, Jason den Dulk wrote:
> On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
>> Thanks, somewhat unintuitive.
>
> It is a trap for the unwary, but in this case the benefits outweigh the costs.
>
> On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:
>> To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.
>
> To have to use ubyte would seem to defeat the purpose of having char. If I were to have this:
>
>     auto no_convert(C)(C[] s) if (isSomeChar!C)
>     {
>         struct No
>         {
>             private C[] s;
>             this(C[] _s) { s = _s; }
>             @property bool empty() { return s.length == 0; }
>             @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
>             void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
>         }
>         return No(s);
>     }
>
> its element type would be char for strings. Would this still result in conversions if I used it with other algorithms?

It might, but that range of yours is underwhelming: no indexing, no length, no nothing.

Why would you want to do *that* though? Is it because you have an ASCII string? In that case, you should be interested in std.encoding.AsciiChar and std.encoding.AsciiString.
Aug 27 2013
qznc:
> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?

Try also ForeachType :-)

Bye,
bearophile
Aug 25 2013
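For comparison, the three traits named in this thread can be checked side by side. This is a compile-time sketch that assumes the current Phobos definitions of `ElementType`, `ElementEncodingType` and `ForeachType`.

```d
import std.range : ElementEncodingType, ElementType;
import std.traits : ForeachType;

// UTF-8 strings: the range view is decoded, the other two are not.
static assert(is(ElementType!string == dchar));
static assert(is(ElementEncodingType!string == immutable(char)));
static assert(is(ForeachType!string == immutable(char)));

// UTF-16 strings follow the same pattern.
static assert(is(ElementType!wstring == dchar));
static assert(is(ElementEncodingType!wstring == immutable(wchar)));
static assert(is(ForeachType!wstring == immutable(wchar)));

void main() {}
```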