digitalmars.D.learn - Why is string.front dchar?
- TheFlyingFiddle (3/3) Jan 13 2014 I'm curious, why is the .front property of narrow strings of type
- bearophile (14/17) Jan 13 2014 There was a long discussion on this. It was chosen this way to
- TheFlyingFiddle (5/11) Jan 15 2014 This is why i was confused really since the normal foreach is
- Jakob Ovrum (24/28) Jan 15 2014 Unfortunately, it's not that simple. D arrays/slices have two
- Jonathan M Davis (6/9) Jan 13 2014 It's to promote the correct handling of Unicode. A couple of related que...
- Meta (4/15) Jan 13 2014 Also somewhat related:
- qznc (2/2) Jan 14 2014 And a short overview over Unicode in D:
- Maxim Fomin (9/12) Jan 14 2014 The root of the issue is that string literals containing
- Jakob Ovrum (20/28) Jan 15 2014 This assertion makes all the wrong assumptions.
- Maxim Fomin (46/75) Jan 15 2014 This is wrong. String in D is de facto (by implementation, spec
- Jakob Ovrum (20/63) Jan 20 2014 By implementation they are also UTF strings. String literals use
- Maxim Fomin (27/75) Jan 20 2014 import std.stdio;
- Tobias Pankrath (12/15) Jan 20 2014 No, since this literal can be encoded as utf8 just fine. Keep in
- Jakob Ovrum (6/14) Jan 20 2014 Code units, not code points.
- Tobias Pankrath (2/9) Jan 20 2014 Arg! Of course.
- Timon Gehr (4/9) Jan 22 2014 A character can be made of more than one dchar. (There are also more
- Jakob Ovrum (2/16) Jan 22 2014 No, I believe you are thinking of graphemes.
- Timon Gehr (3/16) Jan 23 2014 Sure. Their existence means it is in general wrong to think of a dchar
- Jakob Ovrum (4/6) Jan 23 2014 As stated, I was specifically talking about the Unicode
I'm curious, why is the .front property of narrow strings of type dchar? And not the underlying character type for the string.
Jan 13 2014
TheFlyingFiddle:I'm curious, why is the .front property of narrow strings of type dchar? And not the underlying character type for the string.There was a long discussion on this. It was chosen this way to allow most range-based algorithms to work correctly on UTF8 and UTF16 strings. In some cases you can use the std.string.representation function to avoid paying the UTF decoding cost, and/or to use some algorithms such as sort(). But for backwards compatibility reasons in this code: foreach (c; "somestring") c is a char, not a dchar. You have to type it explicitly to handle the UTF safely: foreach (dchar c; "somestring") Bye, bearophile
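A minimal sketch of the difference (the multi-byte example string is arbitrary, chosen only for illustration):

```d
import std.stdio;
import std.string : representation;

void main()
{
    string s = "säд"; // 3 code points, 5 UTF-8 code units

    foreach (c; s)       // c is char: iterates over the 5 code units
        write(cast(int) c, ' ');
    writeln();

    foreach (dchar c; s) // decodes: iterates over the 3 code points
        write(c, ' ');
    writeln();

    // representation exposes the raw code units without any decoding
    writeln(s.representation.length); // 5
}
```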
Jan 13 2014
On Tuesday, 14 January 2014 at 01:12:40 UTC, bearophile wrote:TheFlyingFiddle:But for backwards compatibility reasons in this code: foreach (c; "somestring") c is a char, not a dchar. You have to type it explicitly to handle the UTF safely: foreach (dchar c; "somestring")This is why I was confused, really: since the normal foreach yields char, it's weird that string.front is not a char. But if foreach yielding char is only the way it is for legacy reasons, it all makes sense.
Jan 15 2014
On Wednesday, 15 January 2014 at 20:05:32 UTC, TheFlyingFiddle wrote:This is why I was confused, really: since the normal foreach yields char, it's weird that string.front is not a char. But if foreach yielding char is only the way it is for legacy reasons, it all makes sense.Unfortunately, it's not that simple. D arrays/slices have two distinct interfaces - the slice interface and the range interface. The latter is a library convention built on top of the former - thus the existence of the slice interface is necessary. A generic algorithm can choose to work on arrays (array algorithm) or ranges (range algorithm) among other kinds of type federations:

auto algo(E)(E[] t); // array algorithm
auto algo(R)(R r) if (isInputRange!R); // range algorithm

The array algorithm can assume that:

foreach(e; t) static assert(is(typeof(e) == E));

While the range algorithm *cannot* assume that:

foreach(e; r) static assert(is(typeof(e) == ElementType!R));

Because this fails when R is a narrow string (slice of UTF-8 or UTF-16 code units). Thus, the correct way to use foreach over a range in a generic range algorithm is:

foreach(ElementType!R e; r) {}

Swapping the default just swaps which kind of algorithm can make the assumption. The added cost of breaking existing algorithms is a big deal, but as demonstrated, it's not a panacea.
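The two assumptions can be checked with a small, self-contained program (a sketch; the string value is an arbitrary example):

```d
import std.range : ElementType;

void main()
{
    // For narrow strings, the range element type is dchar...
    static assert(is(ElementType!string == dchar));

    string s = "häj";

    // ...but plain foreach over the slice yields char (the slice interface):
    foreach (e; s)
        static assert(is(typeof(e) == char));

    // A generic range algorithm must therefore spell out the element type,
    // which triggers UTF decoding:
    foreach (ElementType!string e; s)
        static assert(is(typeof(e) == dchar));
}
```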
Jan 15 2014
On Monday, January 13, 2014 23:10:03 TheFlyingFiddle wrote:I'm curious, why is the .front property of narrow strings of type dchar? And not the underlying character type for the string.It's to promote the correct handling of Unicode. A couple of related questions and answers: http://stackoverflow.com/questions/12288465/std-algorithm-joinerstring-string-why-result-elements-are-dchar-and-not-ch http://stackoverflow.com/questions/16590650/how-to-read-a-string-character-by-character-as-a-range-in-d - Jonathan M Davis
Jan 13 2014
On Tuesday, 14 January 2014 at 03:01:53 UTC, Jonathan M Davis wrote:On Monday, January 13, 2014 23:10:03 TheFlyingFiddle wrote:Also somewhat related: http://stackoverflow.com/questions/13368728/why-isnt-dchar-the-standard-character-type-in-dI'm curious, why is the .front property of narrow strings of type dchar? And not the underlying character type for the string.It's to promote the correct handling of Unicode. A couple of related questions and answers: http://stackoverflow.com/questions/12288465/std-algorithm-joinerstring-string-why-result-elements-are-dchar-and-not-ch http://stackoverflow.com/questions/16590650/how-to-read-a-string-character-by-character-as-a-range-in-d - Jonathan M Davis
Jan 13 2014
And a short overview of Unicode in D: http://qznc.github.io/d-tut/unicode.html
Jan 14 2014
On Monday, 13 January 2014 at 23:10:04 UTC, TheFlyingFiddle wrote:I'm curious, why is the .front property of narrow strings of type dchar? And not the underlying character type for the string.The root of the issue is that string literals containing characters which do not fit into a single byte are still converted to a char[] array. This is, strictly speaking, not type safe because it allows reinterpreting a 2- or 4-byte code unit as a sequence of 1-byte characters. The string type is in some sense problematic in D. That's why the fact that .front returns dchar is a way to correct the problem; it is not an attempt to introduce confusion.
Jan 14 2014
On Tuesday, 14 January 2014 at 11:42:34 UTC, Maxim Fomin wrote:The root of the issue is that string literals containing characters which do not fit into a single byte are still converted to a char[] array. This is, strictly speaking, not type safe because it allows reinterpreting a 2- or 4-byte code unit as a sequence of 1-byte characters. The string type is in some sense problematic in D. That's why the fact that .front returns dchar is a way to correct the problem; it is not an attempt to introduce confusion.This assertion makes all the wrong assumptions. `char` is a UTF-8 code unit[1], and `string` is an array of immutable UTF-8 code units. The whole point of UTF-8 is the ability to encode code points that need multiple bytes (UTF-8 code units), so the string literal behaviour is perfectly regular. Operations on code units are rare, which is why the standard library instead treats strings as ranges of code points, for correctness by default. However, we must not prevent the user from being able to work on arrays of code units, as many string algorithms can be optimized by not doing full UTF decoding. The standard library does this on many occasions, and there are more to come. Note that the Unicode definition of an unqualified "character" is the translation of a code *point*, which is very different from a *glyph*, which is what people generally associate the word "character" with. Thus, `string` is not an array of characters (i.e. an array where each element is a character), but `dstring` can be said to be. [1] http://dlang.org/type
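The difference between the code-unit view (array) and the code-point view (range) can be made concrete with a short sketch:

```d
import std.stdio;
import std.range : walkLength;

void main()
{
    string s = "säд";
    writeln(s.length);     // 5 - immutable UTF-8 code units (array view)
    writeln(s.walkLength); // 3 - code points (range view, decodes)

    dstring d = "säд"d;
    writeln(d.length);     // 3 - in UTF-32, one code unit per code point
}
```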
Jan 15 2014
On Thursday, 16 January 2014 at 05:56:48 UTC, Jakob Ovrum wrote:On Tuesday, 14 January 2014 at 11:42:34 UTC, Maxim Fomin wrote:This is wrong. String in D is de facto (by implementation; the spec may say whatever is convenient for advertising D) an array of single bytes which can keep UTF-8 code units. The string type in D is by no means always a string in the sense of code points/characters. Sometimes the string type happens to behave like a 'string', but if you put UTF-16 or UTF-32 text into it, it will remind you what the string type really is.The root of the issue is that string literals containing characters which do not fit into a single byte are still converted to a char[] array. This is, strictly speaking, not type safe because it allows reinterpreting a 2- or 4-byte code unit as a sequence of 1-byte characters. The string type is in some sense problematic in D. That's why the fact that .front returns dchar is a way to correct the problem; it is not an attempt to introduce confusion.This assertion makes all the wrong assumptions. `char` is a UTF-8 code unit[1], and `string` is an array of immutable UTF-8 code units. The whole point of UTF-8 is the ability to encode code points that need multiple bytes (UTF-8 code units), so the string literal behaviour is perfectly regular.Operations on code units are rare, which is why the standard library instead treats strings as ranges of code points, for correctness by default. However, we must not prevent the user from being able to work on arrays of code units, as many string algorithms can be optimized by not doing full UTF decoding. The standard library does this on many occasions, and there are more to come.This is an attempt to explain a problematic design as a wise action.Note that the Unicode definition of an unqualified "character" is the translation of a code *point*, which is very different from a *glyph*, which is what people generally associate the word "character" with. Thus, `string` is not an array of characters (i.e. 
an array where each element is a character), but `dstring` can be said to be. [1] http://dlang.org/typeBy the way, the link you provide says char is an unsigned 8-bit type which can keep the value of a UTF-8 code unit. UTF is irrelevant because the problem is in the D implementation. See http://forum.dlang.org/thread/hoopiiobddbapybbwwoc forum.dlang.org (in particular the 2nd page). The root of the issue is that D does not provide a 'utf' type which would correctly handle strings and characters irrespective of the format. Instead, the language pretends that it supports such a type by allowing both the literals "sad" and "säд" to convert to a single-byte char array. And ['s', 'ä', 'д'] is, by the way, neither char[] nor wchar[], not even dchar[], but a sequence of integers, which compounds the oddities in the character types. Problems with the string type can be illustrated by an analogous situation in the domain of integer types. Assume that a user wants a 'number' type which accepts integers, floats and doubles and treats them properly. This would require either a library solution or a new special type in the language, supported by both the compiler and the runtime library, which performs operations at runtime on objects of the number type according to their effective type. The D designers want to support such a feature (to make the language better), but as happens in other situations, the support is only limited: the compiler allows

alias immutable(int)[] number;
number my_number = [0, 3.14, 3.14l];

but there is no support in the runtime library. On the surface, this looks like the language has a type which supports the wanted feature, but if you use it, you will face problems: my_number[2] would give a strange integer instead of the float 3.14, and the length of this array is 4 instead of 3. In addition, this is not a type-safe approach because there is the option to reinterpret a double as two integers. Now, in order to fix this, there are a number of functions in the library which treat the number type properly. 
Such practice (limited and broken language support, an unsafe and illogical type, functions to correct design mistakes) is so deeply embedded that anyone who points out this problem in the newsgroup is treated as a fool and sent off to study the IEEE 754 standard.
Jan 15 2014
On Thursday, 16 January 2014 at 06:59:43 UTC, Maxim Fomin wrote:This is wrong. String in D is de facto (by implementation; the spec may say whatever is convenient for advertising D) an array of single bytes which can keep UTF-8 code units. The string type in D is by no means always a string in the sense of code points/characters. Sometimes the string type happens to behave like a 'string', but if you put UTF-16 or UTF-32 text into it, it will remind you what the string type really is.By implementation they are also UTF strings. String literals use UTF, `char.init` is 0xFF and `wchar.init` is 0xFFFF, foreach over narrow strings with `dchar` iterator variable type does UTF decoding etc. I don't think you know what you're talking about; putting UTF-16 or UTF-32 in `string` is utter madness and not trivially possible. We have `wchar`/`wstring` and `dchar`/`dstring` for UTF-16 and UTF-32, respectively.No, it's not. Please leave crappy, unsubstantiated arguments like this out of these forums.Operations on code units are rare, which is why the standard library instead treats strings as ranges of code points, for correctness by default. However, we must not prevent the user from being able to work on arrays of code units, as many string algorithms can be optimized by not doing full UTF decoding. The standard library does this on many occasions, and there are more to come.This is an attempt to explain a problematic design as a wise action.Not *can*, but *does*. Otherwise it is an error in the program. The specification, compiler implementation (as shown above) and standard library all treat `char` as a UTF-8 code unit. Treat it otherwise at your own peril.[1] http://dlang.org/typeBy the way, the link you provide says char is an unsigned 8-bit type which can keep the value of a UTF-8 code unit.UTF is irrelevant because the problem is in the D implementation. See http://forum.dlang.org/thread/hoopiiobddbapybbwwoc forum.dlang.org (in particular the 2nd page). 
The root of the issue is that D does not provide a 'utf' type which would correctly handle strings and characters irrespective of the format. Instead, the language pretends that it supports such a type by allowing both the literals "sad" and "säд" to convert to a single-byte char array. And ['s', 'ä', 'д'] is, by the way, neither char[] nor wchar[], not even dchar[], but a sequence of integers, which compounds the oddities in the character types.The only problem in the implementation here that you illustrate is that `['s', 'ä', 'д']` is of type `int[]`, which is a bug. It should be `dchar[]`. The length of `char[]` works as intended.Problems with the string type can be illustrated by an analogous situation in the domain of integer types. Assume that a user wants a 'number' type which accepts integers, floats and doubles and treats them properly. This would require either a library solution or a new special type in the language, supported by both the compiler and the runtime library, which performs operations at runtime on objects of the number type according to their effective type. The D designers want to support such a feature (to make the language better), but as happens in other situations, the support is only limited: the compiler allows alias immutable(int)[] number; number my_number = [0, 3.14, 3.14l];I don't understand this example. The compiler does *not* allow that code; try it for yourself.
Jan 20 2014
On Monday, 20 January 2014 at 09:58:07 UTC, Jakob Ovrum wrote:On Thursday, 16 January 2014 at 06:59:43 UTC, Maxim Fomin wrote:

import std.stdio;

void main()
{
    string s = "о";
    writeln(s.length);
}

This compiles and prints 2. This means that the string type is broken. It is broken in the way I was attempting to explain.This is wrong. String in D is de facto (by implementation; the spec may say whatever is convenient for advertising D) an array of single bytes which can keep UTF-8 code units. The string type in D is by no means always a string in the sense of code points/characters. Sometimes the string type happens to behave like a 'string', but if you put UTF-16 or UTF-32 text into it, it will remind you what the string type really is.By implementation they are also UTF strings. String literals use UTF, `char.init` is 0xFF and `wchar.init` is 0xFFFF, foreach over narrow strings with `dchar` iterator variable type does UTF decoding etc. I don't think you know what you're talking about; putting UTF-16 or UTF-32 in `string` is utter madness and not trivially possible. We have `wchar`/`wstring` and `dchar`/`dstring` for UTF-16 and UTF-32, respectively.Note that I provided examples of why the design is problematic. The argument isn't unsubstantiated.This is an attempt to explain a problematic design as a wise action.No, it's not. Please leave crappy, unsubstantiated arguments like this out of these forums.But such treatment is nonsense. It is like treating an integer or floating-point number as a sequence of bytes. You are essentially saying that treating char as a UTF-8 code unit is OK because the language treats char as a UTF-8 code unit.Not *can*, but *does*. Otherwise it is an error in the program. The specification, compiler implementation (as shown above) and standard library all treat `char` as a UTF-8 code unit. 
Treat it otherwise at your own peril.[1] http://dlang.org/typeBy the way, the link you provide says char is an unsigned 8-bit type which can keep the value of a UTF-8 code unit.The only problem in the implementation here that you illustrate is that `['s', 'ä', 'д']` is of type `int[]`, which is a bug. It should be `dchar[]`. The length of `char[]` works as intended.You are saying that the length of char[] works as intended, which is true, but it shows that the design is broken.It does not allow that because it is nonsense. However, it does allow equivalent nonsense in character types.

alias immutable(int)[] number;
number my_number = [0, 3.14, 3.14l]; // does not compile

alias immutable(char)[] string;
string s = "säд"; // compiles, although "säд" should default to wstring or dstring

The same reasons which prevent a sane person from being OK with int[] number = [3.14l] should prevent him from being OK with string s = "säд".Problems with the string type can be illustrated by an analogous situation in the domain of integer types. Assume that a user wants a 'number' type which accepts integers, floats and doubles and treats them properly. This would require either a library solution or a new special type in the language, supported by both the compiler and the runtime library, which performs operations at runtime on objects of the number type according to their effective type. The D designers want to support such a feature (to make the language better), but as happens in other situations, the support is only limited: the compiler allows alias immutable(int)[] number; number my_number = [0, 3.14, 3.14l];I don't understand this example. The compiler does *not* allow that code; try it for yourself.
Jan 20 2014
The same reasons which prevent a sane person from being OK with int[] number = [3.14l] should prevent him from being OK with string s = "säд".No, since this literal can be encoded as UTF-8 just fine. Keep in mind that literals are nothing other than values written directly into the source, and as it happens, your example is a perfect value of type string. (w|d)string.length returning anything other than the number of underlying code points would be inconsistent with the other array types, and making (d|w)string arrays of code points was an (arguably good) design decision. That said: nothing prevents you from writing another string type that abstracts over the actual string encoding. Phobos did it wrong, though, in handling char[] differently from T[].
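The point that the literal is a perfect value of each string type can be checked at compile time (a sketch; the suffixes select the encoding):

```d
void main()
{
    // The same literal encodes fine in all three UTF encodings;
    // .length counts code units of the respective encoding:
    static assert("säд"c.length == 5); // UTF-8:  1 + 2 + 2 bytes
    static assert("säд"w.length == 3); // UTF-16: each fits in one unit
    static assert("säд"d.length == 3); // UTF-32: one unit per code point
}
```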
Jan 20 2014
On Monday, 20 January 2014 at 13:30:11 UTC, Tobias Pankrath wrote:(w|d)string.length returning anything else then the number of underlying code points would be inconsistent to other array types and m aking (d|w)string arrays of code points was a (arguably good) design decision.Code units, not code points. Of course, a single UTF-32 code unit is also a single code point.That said: nothing prevents you from writing another string type that abstracts from the actual string encoding.Such types tend to have absolutely awful performance. It is a minefield of disastrous algorithmic complexity, too (e.g. length).Phobos did it wrong though with handling char[] different from T[].It is only for ranges, and I think it's a good decision.
Jan 20 2014
On Monday, 20 January 2014 at 16:53:32 UTC, Jakob Ovrum wrote:On Monday, 20 January 2014 at 13:30:11 UTC, Tobias Pankrath wrote:Arg! Of course.(w|d)string.length returning anything other than the number of underlying code points would be inconsistent with the other array types, and making (d|w)string arrays of code points was an (arguably good) design decision.Code units, not code points.
Jan 20 2014
On 01/16/2014 06:56 AM, Jakob Ovrum wrote:Note that the Unicode definition of an unqualified "character" is the translation of a code *point*, which is very different from a *glyph*, which is what people generally associate the word "character" with. Thus, `string` is not an array of characters (i.e. an array where each element is a character), but `dstring` can be said to be.A character can be made of more than one dchar. (There are also more exotic examples, eg. IIRC there are cases where three dchars make approximately two characters.)
Jan 22 2014
On Thursday, 23 January 2014 at 01:17:19 UTC, Timon Gehr wrote:On 01/16/2014 06:56 AM, Jakob Ovrum wrote:No, I believe you are thinking of graphemes.Note that the Unicode definition of an unqualified "character" is the translation of a code *point*, which is very different from a *glyph*, which is what people generally associate the word "character" with. Thus, `string` is not an array of characters (i.e. an array where each element is a character), but `dstring` can be said to be.A character can be made of more than one dchar. (There are also more exotic examples, eg. IIRC there are cases where three dchars make approximately two characters.)
Jan 22 2014
On 01/23/2014 02:39 AM, Jakob Ovrum wrote:On Thursday, 23 January 2014 at 01:17:19 UTC, Timon Gehr wrote:Sure. Their existence means it is in general wrong to think of a dchar as one character.On 01/16/2014 06:56 AM, Jakob Ovrum wrote:No, I believe you are thinking of graphemes.Note that the Unicode definition of an unqualified "character" is the translation of a code *point*, which is very different from a *glyph*, which is what people generally associate the word "character" with. Thus, `string` is not an array of characters (i.e. an array where each element is a character), but `dstring` can be said to be.A character can be made of more than one dchar. (There are also more exotic examples, eg. IIRC there are cases where three dchars make approximately two characters.)
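A sketch of this point using std.uni.byGrapheme, with a combining diaeresis ('e' followed by U+0308 renders as a single 'ë'):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining mark

    assert(s.walkLength == 5);            // 5 code points (dchars)
    assert(s.byGrapheme.walkLength == 4); // 4 user-perceived characters
}
```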
Jan 23 2014
On Thursday, 23 January 2014 at 10:25:40 UTC, Timon Gehr wrote:Sure. Their existence means it is in general wrong to think of a dchar as one character.As stated, I was specifically talking about the Unicode definition of a character, which is completely distinct from graphemes.
Jan 23 2014