digitalmars.D - Implicit encoding conversion on string ~= int ?

Marco Leise <Marco.Leise gmx.de> writes:

I've seen some C code that does something like string[i] =
int, which seems to implicitly cast the int to a char.
Now in D, to get it running, I just did string ~= int and
wondered why the resulting code page 850 string looked correct
on the UTF-8 terminal. Then I asserted that 'string' only ever
grows by one byte for each append, and the assertion failed. So
there is a hidden conversion from some charset (probably
Windows or Latin-1?) to a UTF-8 multi-byte string going on.

While it is convenient, this code uses some form of LZ77 and
will from time to time append copies of previous parts of
'string' to it. In that case the byte offsets wouldn't match
any more and the result would be garbage.

Eventually I'd have looked over the code and created the CP850
string in a temporary ubyte[], but in the meantime I wonder
what the rationale behind this automatic conversion is and if
we want to keep it like that. Is this documented behavior?
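
For illustration, a minimal sketch of the ubyte[] approach mentioned above, assuming the decoder emits raw CP850 bytes and that back-references arrive as a (distance, length) pair. The helper names emitLiteral, copyBack and demo are hypothetical and not taken from the code being ported.

```d
import std.stdio : writeln;

// Hypothetical helper: append one raw byte (e.g. a CP850 code).
// Nothing is UTF-8 encoded here, so the buffer grows by exactly one byte.
void emitLiteral(ref ubyte[] output, int value)
{
    output ~= cast(ubyte) value;
}

// Hypothetical helper: LZ77-style copy of `length` bytes,
// starting `distance` bytes before the current end of the buffer.
void copyBack(ref ubyte[] output, size_t distance, size_t length)
{
    foreach (_; 0 .. length)
    {
        immutable ubyte b = output[$ - distance]; // read first, then append
        output ~= b;
    }
}

// Three literals followed by one back-reference.
ubyte[] demo()
{
    ubyte[] output;
    emitLiteral(output, 228);
    emitLiteral(output, 'b');
    emitLiteral(output, 'c');
    copyBack(output, 3, 2);
    return output;
}

void main()
{
    auto bytes = demo();
    assert(bytes.length == 5); // one byte per literal or copied byte
    writeln(bytes);
}
```

Because the buffer holds bytes rather than text, the byte offsets used by the back-references stay valid; any conversion to UTF-8 would be a separate step once decompression is finished.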

-- 
Marco

Jun 23 2013

"Adam D. Ruppe" <destructionator gmail.com> writes:

I think what's happening is the compiler considers chars to be 
integral types (like they were in C), which means some implicit 
conversions between char, int, dchar, and others happen.

So

char[] a;
int b = 1000;
a ~= b;

the "a ~= b" is more like "a ~= cast(dchar) b", and then dchar -> 
char means it may be multibyte encoded, going from utf-32 to 
utf-8.
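
A small sketch of the effect described above, with the int-to-dchar step written as an explicit cast so that the example does not depend on the implicit conversion. The function name appendAsCodePoint is made up for this illustration.

```d
import std.stdio : writeln;

// Appends `value` to `text` as a Unicode code point. The cast spells out
// the int -> dchar step that the post describes as happening implicitly.
char[] appendAsCodePoint(char[] text, int value)
{
    text ~= cast(dchar) value; // a dchar appended to char[] is UTF-8 encoded
    return text;
}

void main()
{
    char[] a;
    a = appendAsCodePoint(a, 1000);
    assert(a.length == 2);               // U+03E8 takes two UTF-8 code units

    ubyte[] expected = [0xCF, 0xA8];
    assert(cast(ubyte[]) a == expected);

    a = appendAsCodePoint(a, 'x');
    assert(a.length == 3);               // ASCII stays a single code unit
    writeln(a.length);
}
```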

Jun 23 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Adam D. Ruppe:

 char[] a;
 int b = 1000;
 a ~= b;

 the "a ~= b" is more like "a ~= cast(dchar) b", and then dchar 
 -> char means it may be multibyte encoded, going from utf-32 to 
 utf-8.

I didn't know that, is this already in Bugzilla?

Bye,
bearophile

Jun 23 2013

Marco Leise <Marco.Leise gmx.de> writes:

On Sun, 23 Jun 2013 18:37:16 +0200,
"bearophile" <bearophileHUGS lycos.com> wrote:

 Adam D. Ruppe:

 char[] a;
 int b = 1000;
 a ~= b;

 the "a ~= b" is more like "a ~= cast(dchar) b", and then dchar
 -> char means it may be multibyte encoded, going from utf-32 to
 utf-8.

No no no, this is not what happens. In my case it was:
string a;
int b = 228;  // CP850 value for 'ä'. Note: fits in a single byte!
a ~= b;

Maybe it goes as follows:
o compiler sees ~= to a string and becomes "aware" of wchar and dchar
  conversions to char
o appended value is only checked for size (type and signedness are lost)
  and maps int to dchar
o this dchar value is now checked for Unicode conformity and fails the test
o the dchar value is now assumed to be Latin-1, Windows-1252 or similar
  and a conversion routine invoked
o the dchar value is converted to utf-8 and...
o appended as a multi-byte string to variable "a".

That still doesn't sound right to me though. What if the dchar value is
not valid Unicode AND >= 256 ?

 I didn't know that, is this already in Bugzilla?

 Bye,
 bearophile

I don't know what exactly is supposed to happen here.

-- 
Marco

Jun 23 2013

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Sunday, 23 June 2013 at 17:12:41 UTC, Marco Leise wrote:
 int b = 228;  // CP850 value for 'ä'. Note: fits in a single 
 byte!

228 (e4 in hex) is also the Unicode code point for ä, which is 
[195, 164] when encoded as UTF-8. see: 
http://www.utf8-chartable.de/unicode-utf8-table.pl?number=512&utf8=dec

While the number 228 would fit in a byte normally, utf-8 uses the 
high bits as markers that this is part of a multibyte sequence 
(this helps with ascii compatibility), so any code point > 127 
will always be a multibyte sequence in utf-8. see: 
http://en.wikipedia.org/wiki/UTF-8#Description
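
The byte values quoted above can be checked with std.utf.encode from Phobos. This is only a sketch; the wrapper function utf8Bytes is a hypothetical name introduced for the example.

```d
import std.utf : encode;

// Returns the UTF-8 code units for a single code point.
ubyte[] utf8Bytes(dchar codePoint)
{
    char[4] buf;                            // UTF-8 needs at most four code units
    immutable n = encode(buf, codePoint);   // n = number of code units written
    return cast(ubyte[]) buf[0 .. n].dup;
}

void main()
{
    ubyte[] del = [0x7F];
    ubyte[] aUmlaut = [195, 164];

    assert(utf8Bytes(0x7F) == del);     // highest code point that fits in one byte
    assert(utf8Bytes(228) == aUmlaut);  // U+00E4 becomes a two-byte sequence
}
```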

Jun 23 2013

Marco Leise <Marco.Leise gmx.de> writes:

On Sun, 23 Jun 2013 19:12:21 +0200,
Marco Leise <Marco.Leise gmx.de> wrote:

 On Sun, 23 Jun 2013 18:37:16 +0200,
 "bearophile" <bearophileHUGS lycos.com> wrote:

 Adam D. Ruppe:

 char[] a;
 int b = 1000;
 a ~= b;

 the "a ~= b" is more like "a ~= cast(dchar) b", and then dchar
 -> char means it may be multibyte encoded, going from utf-32 to
 utf-8.

 No no no, this is not what happens. In my case it was:
 string a;
 int b = 228;  // CP850 value for 'ä'. Note: fits in a single byte!
 a ~= b;

 Maybe it goes as follows:
 o compiler sees ~= to a string and becomes "aware" of wchar and dchar
   conversions to char
 o appended value is only checked for size (type and signedness are lost)
   and maps int to dchar
 o this dchar value is now checked for Unicode conformity and fails the test
 o the dchar value is now assumed to be Latin-1, Windows-1252 or similar
   and a conversion routine invoked
 o the dchar value is converted to utf-8 and...
 o appended as a multi-byte string to variable "a".

 That still doesn't sound right to me though. What if the dchar value is
 not valid Unicode AND >= 256 ?

Actually you were 100% right, Adam. I was distracted by the
fact that the source was CP850.
UTF-32 maps all of Latin-1 in a 1:1 correspondence and most of
CP850 has the same code in Latin-1. So yes, all the compiler
was doing is to append a dchar value.
And with char/ubyte I do find it convenient to mix them
without casting. E.g. "if (someChar < 0x80)" and similar code.

As confusing as it was for me, I agree with "WONT FIX".

-- 
Marco

Jun 23 2013

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Sunday, 23 June 2013 at 16:37:18 UTC, bearophile wrote:
 I didn't know that, is this already in Bugzilla?

I don't know, but if it is, it is probably marked as won't fix 
because I'm pretty sure this has come up before, but it is 
actually by design because a char in C is considered an integral 
type too.

Jun 23 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, June 23, 2013 19:25:41 Adam D. Ruppe wrote:
 On Sunday, 23 June 2013 at 16:37:18 UTC, bearophile wrote:
 I didn't know that, is this already in Bugzilla?

 
 I don't know, but if it is, it is probably marked as won't fix
 because I'm pretty sure this has come up before, but it is
 actually by design because a char in C is considered an integral
 type too.

This is definitely by design. Walter is firmly in the camp that thinks that 
chars are integral types, so they follow all of the various integral 
conversion rules. In some cases this is nice. In others, it's bug-prone, but I 
think that we're stuck with it regardless of whether it's ultimately a good 
idea or not. I don't think that we even succeeded at coming close to 
convincing Walter that _bool_ isn't an integral type and shouldn't be treated 
as such (when it was discussed right before DConf), and that should be a far 
more clear-cut case.

- Jonathan M Davis

Jun 23 2013

Marco Leise <Marco.Leise gmx.de> writes:

On Sun, 23 Jun 2013 17:50:01 -0700,
Jonathan M Davis <jmdavisProg gmx.com> wrote:

 I don't think that we even succeeded at coming close to 
 convincing Walter that _bool_ isn't an integral type and shouldn't be treated 
 as such (when it was discussed right before deconf), and that should be a far 
 more clearcut case.
 
 - Jonathan M Davis

You can take bool to int promotion out of my...

// best way to toggle back and forth between 0 and 1. "!" returns a bool.
value = !value  

// don't ask, I've seen this :)
arr[someBool]

// sometimes the bool has just the value you need
length -= boolRemoveTerminator

-- 
Marco

Jun 23 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Monday, June 24, 2013 07:20:10 Marco Leise wrote:
 On Sun, 23 Jun 2013 17:50:01 -0700,
 Jonathan M Davis <jmdavisProg gmx.com> wrote:
 I don't think that we even succeeded at coming close to
 convincing Walter that _bool_ isn't an integral type and shouldn't be
 treated as such (when it was discussed right before deconf), and that
 should be a far more clearcut case.
 
 - Jonathan M Davis

 
 You can take bool to int promotion out of my...
 
 // best way to toggle forth and back between 0 and 1. "!" returns a bool.
 value = !value
 
 // don't ask, I've seen this :)
 arr[someBool]
 
 // sometimes the bool has just the value you need
 length -= boolRemoveTerminator

And in all those cases, you can cast to int to get the value you want. The 
case that brought up the big discussion on it a couple of months ago was when 
you had

auto foo(bool b) {...}
auto foo(long l) {...}

Which one does foo(1) call? It calls the bool version, because of how the 
integer conversion rules work. IMHO, this is _very_ broken, but Walter's 
response is that the solution is to add the overload

auto foo(int i) {...}

And that does fix the code in question, but it means that bool is _not_ 
strongly typed in D, and you get a variety of weird cases that cause bugs 
because of such implicit conversions. I would strongly argue that the case 
where you want bool to act like an integer is by far the rarer case and that 
casting fixes that problem nicely. Plenty of others agree with me. But no one 
has been able to convince Walter.
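
To make the overload case concrete, here is a small sketch that gives the functions bodies so the chosen overload is visible. The second set, bar, is a made-up name used only to show the suggested int overload next to the original pair; results reflect the behaviour described in this post and could differ in other compiler versions.

```d
import std.stdio : writeln;

// The two overloads from the discussion.
string foo(bool b) { return "bool"; }
string foo(long l) { return "long"; }

// Same overload set plus the int overload suggested as the fix.
string bar(bool b) { return "bool"; }
string bar(long l) { return "long"; }
string bar(int i)  { return "int"; }

void main()
{
    writeln(foo(1)); // the literal 1 fits in bool, and bool is the more specialized match
    writeln(foo(2)); // 2 does not fit in bool, so only the long overload applies
    writeln(bar(1)); // exact match on int
}
```

Casting the argument, as in foo(cast(long) 1), is another way to pick the long overload without adding a third function.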

You can read the thread here:

http://forum.dlang.org/post/klc5r7$3c4$1@digitalmars.com

- Jonathan M Davis

Jun 23 2013

Timon Gehr <timon.gehr gmx.ch> writes:

On 06/24/2013 07:20 AM, Marco Leise wrote:
 On Sun, 23 Jun 2013 17:50:01 -0700,
 Jonathan M Davis <jmdavisProg gmx.com> wrote:

 I don't think that we even succeeded at coming close to
 convincing Walter that _bool_ isn't an integral type and shouldn't be treated
 as such (when it was discussed right before deconf), and that should be a far
 more clearcut case.

 - Jonathan M Davis

 You can take bool to int promotion out of my...

 // best way to toggle forth and back between 0 and 1. "!" returns a bool.
 value = !value

value^=1

 // don't ask, I've seen this :)
 arr[someBool]

 // sometimes the bool has just the value you need
 length -= boolRemoveTerminator

Jun 24 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Monday, 24 June 2013 at 05:20:31 UTC, Marco Leise wrote:
 // best way to toggle forth and back between 0 and 1. "!" 
 returns a bool.
 value = !value

If you're switching between 0 and 1, chances are you should be 
using a bool in the first place.

 // don't ask, I've seen this :)
 arr[someBool]

Ew.

 // sometimes the bool has just the value you need
 length -= boolRemoveTerminator

Ew.

I think it's a big plus that it stops this needless obfuscation.

Jun 25 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Monday, 24 June 2013 at 00:50:17 UTC, Jonathan M Davis wrote:
 On Sunday, June 23, 2013 19:25:41 Adam D. Ruppe wrote:
 On Sunday, 23 June 2013 at 16:37:18 UTC, bearophile wrote:
 I didn't know that, is this already in Bugzilla?

 
 I don't know, but if it is, it is probably marked as won't fix
 because I'm pretty sure this has come up before, but it is
 actually by design because a char in C is considered an 
 integral
 type too.

 This is definitely by design. Walter is definitely in the camp 
 that thinks that
 chars are integral types, so they follow all of the various 
 integral
 conversion rules.

Well, chars are integral values... but are integral values char?

I mean, I only see the promotion making sense one way: Converting 
a char to a uint can make sense for range analysis, but what 
about the other way around?

Same with bool: I can see bool to int making sense, but int to 
bool not so much, which is why a cast is required (except in an 
if).

In C, int to char was important, since char is the "byte" type. 
But D has byte, so I don't see why we'd allow int to byte...

Jun 23 2013

Marco Leise <Marco.Leise gmx.de> writes:

On Mon, 24 Jun 2013 08:03:27 +0200,
"monarch_dodra" <monarchdodra gmail.com> wrote:

 Well, chars are integral values... but are integral values char?
 
 I mean, I only see the promotion making sense one way: Converting 
 a char to an uint can make sense for range analysis, but what 
 about the other way around?
 
 Same with bool: I can see bool to int making sense, but int to 
 bool not so much, which is why a cast is required (except in an 
 if).
 
 In C, int to char was important, since char is the "byte" type. 
 But D has byte, so I don't see why we'd allow int to byte...

+1. This is probably the best thinking. It would allow:

value = !value

and its ilk, but prevent

foo(1)

from going to a bool overload. But should that work?:

char D = 'A' + 3;
char E = 4 + 'A';
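
For reference, a sketch of how these two lines behave under the existing rules, as far as I understand them: the compiler's value range propagation sees that the constant results fit in a char, so no cast is needed, while a run-time int operand requires one. The helper shifted is a hypothetical name.

```d
// With a run-time int the sum is an int of unknown range,
// so narrowing back to char needs an explicit cast.
char shifted(char base, int offset)
{
    return cast(char)(base + offset);
}

void main()
{
    char d = 'A' + 3; // constant expression with value 68, which fits in char
    char e = 4 + 'A';
    assert(d == 'D');
    assert(e == 'E');
    assert(shifted('A', 3) == 'D');

    int n = 3;
    // The same addition with a mutable int is not implicitly convertible to char.
    static assert(!__traits(compiles, { char c = 'A' + n; }));
}
```

Whether a stricter one-way conversion rule would still accept the constant cases is exactly the open question here; the sketch only shows today's behaviour.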

-- 
Marco

Jun 23 2013
