digitalmars.D - Character is only first byte of an UTF-8 sequence

Længlich <nospam void.de> writes:

Hello!

From what I've read about D I think I will like this language much more than
C++, Java and the other well-known languages. But now that I'm using it the
first time, I've got a serious problem with the handling of user input.

The input comes from a TextBox from the DFL (D Forms Library) which seems to
be working fine - except that I cannot do anything sensible with any given
string (char[]). Whenever I try to do something with the string (e.g. concat
it to another one, or use a string function like tolower), I get an "Invalid
UTF-8 sequence" error. When I try to access a character directly (e.g. with a
foreach loop over the string), I only get the first byte of each character.
For example: If the character is 'ä' (i.e. has the UTF-8 encoding C3 A4) and I
cast it to int, the result is 195 - which equals C3. The second byte, A4,
seems to be lost.
If it is an ASCII-character, everything works as desired, but with all higher
characters I have this problem. I tried using dchar instead of char, and I
tried applying all of the converting functions from std.utf, but the problem
did not even change.

So, is there an encoding function which returns the real characters* so that I
can work with them, or do I actually have to work with single bytes (which
would necessarily result in reinventing the squared wheel)?

By the way, I'm using MS Windows XP SP2 in German, and my source code is
UTF-8 with BOM. I'm not sure if one of these facts matters.

Thank you for any feedback and kindest regards,
Længlich

* The encoding doesn't matter to me. I just want to be able to compare them to
other characters without them always being equal to 195.

Sep 02 2007

Længlich <nospam void.invalid> writes:

Oops,

I've just seen that void.de actually exists, so they get the spam now. Is it
possible to edit or remove the E-mail address?

Kindest regards,
Længlich

Sep 02 2007

Deewiant <deewiant.doesnotlike.spam gmail.com> writes:

You might want to read http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD

-- 
Remove ".doesnotlike.spam" from the mail address.

Sep 02 2007

Daniel Keep <daniel.keep.lists gmail.com> writes:

Deewiant wrote:
 You might want to read http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD
 

<3

	-- Daniel

Sep 02 2007

Deewiant <deewiant.doesnotlike.spam gmail.com> writes:

Daniel Keep wrote:
 Deewiant wrote:
 You might want to read http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD

 <3

Seemed to be the best text on Wiki4D.

-- 
Remove ".doesnotlike.spam" from the mail address.

Sep 02 2007

Længlich <nospam void.invalid> writes:

¡Hola!

 Seemed to be the best text on Wiki4D.

Yes, that was the explanation I was searching for. Thank you very much! :-)
Now that I know why it doesn't work I think I can fix it soon.

Thanks again and kindest regards,
Længlich

Sep 02 2007

"Nikita Kalaganov" <riven-mage id.ru> writes:

 http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD

And, IMHO, the solution is simple: chars must be treated by the compiler and
libraries as complete code points.

So, "char" can represent codepoints 0x20-0xFF (Basic latin and Latin-1  
supplement), "wchar" - codepoints from 0x20...0xFFFF (complete basic  
multilingual plane), and "dchar" - all codepoints (including supplementary  
planes).

If your program is 100% latin, use char[]. For multi-language programs use  
wchar[]. Use dchar[] for exotics :)

Conversion from char[] to wchar/dchar and from wchar to dchar is implicit.
Reverse conversions are not always possible(*).

Main problems solved:
1. Slice-able strings.
2. length property contains real "length" of string.
3. Printable.
4. Easy to understand :)

All conversions to and from UTF-8, UTF-16 and UTF-32 should be explicit.

The price is (*).
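
For context on points 1 and 2 of the list above, here is a minimal sketch of how char[] behaves without the proposed change. It assumes a current D2 compiler and Phobos (not the 2007 library), and isValidUtf8 is a hypothetical helper name introduced only for this illustration. It shows that length counts UTF-8 code units and that a slice can cut a multi-byte sequence in half, which is the kind of data that triggers an "Invalid UTF-8 sequence" error.

```d
import std.range : walkLength;
import std.utf : UTFException, validate;

// Hypothetical helper: true if the slice is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "\u00E4b";            // 'ä' (bytes C3 A4) followed by 'b'
    assert(s.length == 3);           // length counts UTF-8 code units
    assert(s.walkLength == 2);       // decoding yields two code points
    assert(!isValidUtf8(s[0 .. 1])); // this slice cuts 'ä' in half
    assert(isValidUtf8(s[2 .. 3]));  // 'b' on its own is fine
}
```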

Sep 03 2007

"Stewart Gordon" <smjg_1998 yahoo.com> writes:

"Længlich" <nospam void.de> wrote in message 
news:fbeldf$1tbn$1 digitalmars.com...
 Hello!

 From what I've read about D I think I will like this language much more 
 then
 C++, Java and the other well-known languages. But now that I'm using it 
 the
 first time, I've got a serious problem with the handling of user input.

 The input comes from a TextBox from the DFL (D Forms Library) which seems 
 to
 be working fine - except the problem that I cannot sensefully access any 
 given
 string (char[]). Whenever I try to do something with the string (e.g. 
 concat
 it to another one, or use a string function like tolower), I get an 
 "Invalid
 UTF-8 sequence" error.

I'm a bit puzzled.  Concatenating arrays shouldn't care about their content.

 When I try to access a character directly (e.g. with a
 foreach loop over the string), I only get the first byte of each 
 character.
 For example: If the character is 'ä' (i.e. has the UTF-8 encoding C3 A4) 
 and I
 cast it to int, the result is 195 - which equals C3. The second byte, A4,
 seems to be lost.

Sounds as though DFL is buggy.  A char is indeed a single byte, but it 
shouldn't be losing the remaining bytes of the character.  Are you sure it's 
actually returning the first UTF-8 byte of each character, and not some 
other encoding like ANSI?

I don't know DFL myself, but meanwhile, please try evaluating
    std.string.format(cast(ubyte[]) text)
on the text retrieved from your TextBox, and then post the result (along 
with what text you typed).  This might help with diagnosing the problem.
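
A sketch of the same diagnostic for a current D2 compiler, where std.format needs an explicit format string. The helper name hexDump and the sample input are assumptions for illustration; the real input would be whatever the TextBox returns.

```d
import std.format : format;
import std.stdio : writeln;

// Hypothetical helper: renders the raw bytes of a string as hex pairs.
string hexDump(const(char)[] text)
{
    // %( ... %) formats each element; the trailing space is the separator.
    return format("%(%02X %)", cast(const(ubyte)[]) text);
}

void main()
{
    // Stand-in for the TextBox content: a single 'ä'.
    writeln(hexDump("\u00E4")); // prints "C3 A4"
}
```

If the dump shows E4 instead of C3 A4 for 'ä', the text is in a single-byte encoding such as Latin-1 rather than UTF-8.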

 If it is an ASCII-character, everything works as desired, but with all 
 higher
 characters I have this problem. I tried using dchar instead of char, and I
 tried applying all of the converting functions from std.utf, but the 
 problem
 did not even change.

You can foreach with dchar over a char[].  Or have you tried that?
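
A minimal sketch of the difference, assuming a current D2 compiler. The helper names codeUnits and codePoints are made up for this example. Iterating with char yields the individual UTF-8 bytes, while iterating with dchar makes the compiler decode whole code points.

```d
import std.stdio : writeln;

// Hypothetical helper: collects what foreach yields when iterating by char.
int[] codeUnits(string text)
{
    int[] result;
    foreach (char c; text)   // one UTF-8 code unit (byte) per step
        result ~= cast(int) c;
    return result;
}

// Hypothetical helper: collects what foreach yields when iterating by dchar.
int[] codePoints(string text)
{
    int[] result;
    foreach (dchar c; text)  // one decoded code point per step
        result ~= cast(int) c;
    return result;
}

void main()
{
    string text = "\u00E4b";   // 'ä' followed by 'b'
    writeln(codeUnits(text));  // [195, 164, 98]
    writeln(codePoints(text)); // [228, 98]
}
```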

<snip>
 * The encoding doesn't matter to me. I just want to be able to compare 
 them to
 other characters without them always being equal to 195.

If you want to compare them _to_ other characters, it would make most sense 
to do so if they are all the same.  If you want to compare them _with_ other 
characters, OTOH....

If different characters are all coming out as 195, with no bytes in between 
to distinguish them, then it's definitely a bug in DFL.

Stewart.

Sep 03 2007

Længlich <nospam void.invalid> writes:

Hi,

 If different characters are all coming out as 195, with no bytes in between 
 to distinguish them, then it's definitely a bug in DFL.

No, it was just because of my misunderstanding of what a »char« is in D. Now
that I know that char[] is much like a byte array and not really like a string
in other languages, I see that no data is lost.
Obviously I just couldn't get the second byte, because it always threw an
exception in my context. But the problem is solved now.

My program has to deal with input in arbitrary languages; I want every possible
character to work fine (even those from higher planes). So I now use dchar for
all my functions, and since this change everything works as desired.
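
One way the dchar approach described above might look, as a sketch for a current D2 compiler and Phobos (the sample string is an assumption, and the actual program is not shown in this thread). Converting once with std.utf.toUTF32 gives an array in which every element is a full code point, including those outside the Basic Multilingual Plane.

```d
import std.utf : toUTF32, toUTF8;

void main()
{
    // 'ä' (2 bytes in UTF-8) and U+1D11E (4 bytes, outside the BMP).
    string input = "\u00E4\U0001D11E";
    dstring wide = toUTF32(input);

    assert(input.length == 6);          // UTF-8 code units
    assert(wide.length == 2);           // one element per code point
    assert(wide[0] == '\u00E4');        // direct comparison now works
    assert(wide[1] == '\U0001D11E');
    assert(toUTF8(wide) == input);      // round trip back to UTF-8
}
```

The trade-off is memory: each dchar takes four bytes, even for ASCII text.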

Thanks to all of you!

Kindest regards,
Længlich

Sep 08 2007
