digitalmars.D.learn - ElementType!string

"qznc" <qznc web.de> writes:

Apparently, ElementType!string evaluates to dchar. I would have 
expected char. Why is that?

Aug 25 2013

"Paolo Invernizzi" <paolo.invernizzi gmail.com> writes:

On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
 Apparently, ElementType!string evaluates to dchar. I would have 
 expected char. Why is that?

I think it is because strings are iterated by ranges as dchar, 
which is equivalent to iterating over the Unicode code points.

If they were iterated by char, you would get the individual code 
units of the UTF-8 encoding during the iteration, and usually 
that is not what a user expects.

Note, on the other hand, that static assert( is(typeof(" "[0]) == 
immutable(char)) ) holds, so you can still iterate over the 
string by chars using the index.

- Paolo Invernizzi
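
A minimal sketch of the difference described above, assuming a Phobos version in which narrow strings are auto-decoded by the range primitives (the behaviour discussed in this thread). The helper names `firstCodeUnit` and `firstCodePoint` are made up for illustration and are not part of Phobos.

```d
import std.range : front;

/// First UTF-8 code unit, read straight from the array by index.
char firstCodeUnit(string s) { return s[0]; }

/// First code point, decoded by the range primitive `front`.
dchar firstCodePoint(string s) { return s.front; }

void main()
{
    string s = "ß";  // one code point (U+00DF), two UTF-8 code units

    // Indexing sees the underlying array element type.
    static assert(is(typeof(s[0]) == immutable(char)));
    // The range primitive yields a decoded code point.
    static assert(is(typeof(s.front) == dchar));

    assert(s.length == 2);               // length counts code units
    assert(firstCodeUnit(s) == 0xC3);    // first byte of the UTF-8 encoding
    assert(firstCodePoint(s) == 'ß');    // the whole character
}
```

Indexing and `.length` operate on code units, while `front` and `popFront` operate on code points, so the two views of the same string can disagree about what "the first element" is.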

Aug 25 2013

"qznc" <qznc web.de> writes:

On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
 On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
 Apparently, ElementType!string evaluates to dchar. I would 
 have expected char. Why is that?

 I think it is because strings are iterated by ranges as dchar, 
 which is equivalent to iterating over the Unicode code points.

 If they were iterated by char, you would get the individual 
 code units of the UTF-8 encoding during the iteration, and 
 usually that is not what a user expects.

 Note, on the other hand, that static assert( is(typeof(" "[0]) 
 == immutable(char)) ) holds, so you can still iterate over the 
 string by chars using the index.

 - Paolo Invernizzi

Thanks, somewhat unintuitive.

This also seems to be the explanation for why the types 
documentation describes char as "unsigned 8 bit UTF-8", which is 
different from ubyte, "unsigned 8 bit".

Confirmed by this unittest:

import std.range : walkLength;

string raw = "Maß";
assert(raw.length == 4);      // UTF-8 code units
assert(walkLength(raw) == 3); // code points

Aug 25 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Sunday, 25 August 2013 at 19:51:50 UTC, qznc wrote:
 Thanks, somewhat unintuitive.

Yes, but un-intuitive... to the un-initiated. By default, it's 
also safer. A string *is*, conceptually, a sequence of Unicode 
code points. The fact that it is stored as UTF-8 code units is 
really just a low-level implementation detail.

Thanks to this behavior, things like:
import std.range : empty, front, popFront;

string s = "日本語";
// search for '本'; stop at the end if it is not found
for ( ; !s.empty && s.front != '本' ; s.popFront())
{}

Well, they *just work* (TM).

Now... If you *know* what you are doing, then by all means, 
iterate on the UTF8 code units. But be warned, you must really 
know what you are doing.

Back to your original subject, you can use:
ElementEncodingType!S

ElementEncodingType works just like ElementType, except that for 
strings it *really* yields the array's element type (char for 
string, wchar for wstring). This is usually *not* the default 
you want.

Also related: foreach iterates over code units by default (for 
some weird reason). I recommend iterating over dchars instead, 
by giving the loop variable an explicit type: foreach (dchar c; s).
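
The three traits mentioned in this thread can be compared side by side. This is a sketch only, assuming a Phobos version with auto-decoding of narrow strings; `countCodeUnits` and `countCodePoints` are hypothetical helper names introduced for the example.

```d
import std.range : ElementType, ElementEncodingType;
import std.traits : ForeachType;

// Range view: auto-decoded code points.
static assert(is(ElementType!string == dchar));
// Storage view: the array's real element type.
static assert(is(ElementEncodingType!string == immutable(char)));
// What a foreach without an explicit element type yields.
static assert(is(ForeachType!string == immutable(char)));

/// Counts iterations of a plain foreach, which walks code units.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s) ++n;        // c is immutable(char)
    return n;
}

/// Counts iterations when foreach is asked to decode to dchar.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s) ++n;  // the explicit type makes foreach decode
    return n;
}

void main()
{
    assert(countCodeUnits("日本語") == 9);   // three code units per character
    assert(countCodePoints("日本語") == 3);
}
```

The mismatch between `ElementType` and `ForeachType` is the reason generic code over strings has to pick one of the two views explicitly.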

Aug 26 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
 Apparently, ElementType!string evaluates to dchar. I would have 
 expected char. Why is that?

It is mentioned in the documentation of `ElementType`. Use 
`std.range.ElementEncodingType` or `std.traits.ForeachType` to 
get `char` and `wchar` when given arrays of those two types.

As for the rationale:

`string`, being an alias for `immutable(char)[]`, is an array of 
UTF-8 code units - an array of `char`s. However, it is indeed a 
forward range of code points (represented as a UTF-32 code unit - 
`dchar`). It's a (slightly controversial) choice that was made to 
make Unicode-correct code the easiest and most intuitive to 
write, as code points are much more useful than code units.

Note that it is not a random-access range. UTF-8 is a variable 
length encoding, so several code units can be required to encode 
a single code point. Hence, a non-trivial search is required to 
get the n'th code point in a UTF-8 or UTF-16 string.

Another name for a code point is "character" (technically, a 
character is what the code point translates to in the UCS). 
However, it can be a deceptive name - the units we see on screen 
when rendered are "graphemes", as Unicode characters can be 
combining, zero-width etc.

To get a range of UTF-8 or UTF-16 code units, the code units have 
to be represented as something other than `char` and `wchar`. For 
example, you can cast your string to immutable(ubyte)[] to 
operate on that, then cast it back at a later point.
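
As a rough illustration of the last two points, the sketch below views a string as `immutable(ubyte)[]` and casts it back. It uses `std.string.representation` in place of a hand-written cast, which is a substitution for the cast suggested above, and it assumes a Phobos version with auto-decoding. The helper names are invented for the example.

```d
import std.range : ElementType, isRandomAccessRange, walkLength;
import std.string : representation;

/// Number of code points, found by walking the decoded range.
size_t codePointCount(string s) { return walkLength(s); }

/// Number of code units, found via the undecoded byte view.
size_t codeUnitCount(string s) { return walkLength(s.representation); }

void main()
{
    string s = "Maß";

    // Same memory viewed as bytes: no copy and no decoding.
    immutable(ubyte)[] bytes = s.representation;
    static assert(is(ElementType!(typeof(bytes)) == immutable(ubyte)));

    // A string is not a random-access range; the byte view is.
    static assert(!isRandomAccessRange!string);
    static assert(isRandomAccessRange!(typeof(bytes)));

    assert(codePointCount(s) == 3);
    assert(codeUnitCount(s) == 4);

    // Cast back once the byte-level work is done.
    string again = cast(string) bytes;
    assert(again == s);
}
```

Casting back is only safe if the byte-level work left valid UTF-8 behind; nothing in this sketch checks that.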

Aug 25 2013

"Jason den Dulk" <public2 jasondendulk.com> writes:

On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:

 Thanks, somewhat unintuitive.

It is a trap for the unwary, but in this case the benefits 
outweigh the costs.

On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:

 To get a range of UTF-8 or UTF-16 code units, the code units 
 have to be represented as something other than `char` and 
 `wchar`. For example, you can cast your string to 
 immutable(ubyte)[] to operate on that, then cast it back at a 
 later point.

To have to use ubyte would seem to defeat the purpose of having 
char. If I were to have this:

   import std.traits : isSomeChar;

   auto no_convert(C)(C[] s) if (isSomeChar!C)
   {
     struct No
     {
       private C[] s;
       this(C[] _s) { s = _s; }

       @property bool empty() { return s.length == 0; }
       @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
       void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
     }
     return No(s);
   }

its element type would be char for strings. Would this still 
result in conversions if I used it with other algorithms?

Aug 27 2013

"Tobias Pankrath" <tobias pankrath.net> writes:

On Tuesday, 27 August 2013 at 11:43:29 UTC, Jason den Dulk wrote:
 On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi 
 wrote:

 Thanks, somewhat unintuitive.

 It is a trap for the unwary, but in this case the benefits 
 outweigh the costs.

 On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:

 To get a range of UTF-8 or UTF-16 code units, the code units 
 have to be represented as something other than `char` and 
 `wchar`. For example, you can cast your string to 
 immutable(ubyte)[] to operate on that, then cast it back at a 
 later point.

 To have to use ubyte would seem to defeat the purpose of having 
 char. If I were to have this:

   auto no_convert(C)(C[] s) if (isSomeChar!C)
   {
     struct No
     {
       private C[] s;
       this(C[] _s) { s = _s; }

       @property bool empty() { return s.length == 0; }
       @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
       void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
     }
     return No(s);
   }

 its element type would be char for strings. Would this still 
 result in conversions if I used it with other algorithms?

That should work. It's the functions in std.array that make 
ranges out of arrays by providing empty, front and popFront. As 
long as you don't use these, everything is fine.

Actually I think that your wrapper should do the conversion and 
std.array should not, but that train is long gone.
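
One way to check this is to ask Phobos what it sees. The sketch below restates the wrapper from the quoted post in simplified form (contracts dropped, renamed to the hypothetical `noConvert`) and assumes a Phobos version with auto-decoding of narrow strings.

```d
import std.range : ElementType, isInputRange, walkLength;
import std.traits : isSomeChar;

/// Simplified variant of the wrapper quoted above: it exposes the
/// array's code units through its own empty/front/popFront.
auto noConvert(C)(C[] s) if (isSomeChar!C)
{
    static struct No
    {
        private C[] s;
        @property bool empty() { return s.length == 0; }
        @property C front() { return s[0]; }
        void popFront() { s = s[1 .. $]; }
    }
    return No(s);
}

void main()
{
    auto r = noConvert("Maß");

    // The wrapper is an input range whose elements are code units.
    static assert(isInputRange!(typeof(r)));
    static assert(is(ElementType!(typeof(r)) == immutable(char)));

    assert(walkLength(r) == 4);      // walks code units, no decoding
    assert(walkLength("Maß") == 3);  // a bare string is decoded
}
```

Because the wrapper's `front` is a member function, it is used in preference to the free functions that decode arrays of char, which is why the element type stays `immutable(char)` here.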

Aug 27 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Tuesday, 27 August 2013 at 11:43:29 UTC, Jason den Dulk wrote:
 On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi 
 wrote:

 Thanks, somewhat unintuitive.

 It is a trap for the unwary, but in this case the benefits 
 outweigh the costs.

 On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:

 To get a range of UTF-8 or UTF-16 code units, the code units 
 have to be represented as something other than `char` and 
 `wchar`. For example, you can cast your string to 
 immutable(ubyte)[] to operate on that, then cast it back at a 
 later point.

 To have to use ubyte would seem to defeat the purpose of having 
 char. If I were to have this:

   auto no_convert(C)(C[] s) if (isSomeChar!C)
   {
     struct No
     {
       private C[] s;
       this(C[] _s) { s = _s; }

       @property bool empty() { return s.length == 0; }
       @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
       void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
     }
     return No(s);
   }

 its element type would be char for strings. Would this still 
 result in conversions if I used it with other algorithms?

It might, but that range of yours is underwhelming: no indexing, 
no length, no nothing.

Why would you want to do *that* though? Is it because you have an 
ASCII string? In that case, you should be interested in 
std.encoding.AsciiChar and std.encoding.AsciiString.

Aug 27 2013

"bearophile" <bearophileHUGS lycos.com> writes:

qznc:

 Apparently, ElementType!string evaluates to dchar. I would have 
 expected char. Why is that?

Try also ForeachType :-)

Bye,
bearophile

Aug 25 2013
