digitalmars.D - Character is only first byte of an UTF-8 sequence

Længlich <nospam void.de> writes:
Hello!

From what I've read about D I think I will like this language much more than
C++, Java and the other well-known languages. But now that I'm using it for the
first time, I've got a serious problem with the handling of user input.

The input comes from a TextBox from the DFL (D Forms Library) which seems to
be working fine, except for the problem that I cannot meaningfully access any given
string (char[]). Whenever I try to do something with the string (e.g. concat
it to another one, or use a string function like tolower), I get an "Invalid
UTF-8 sequence" error. When I try to access a character directly (e.g. with a
foreach loop over the string), I only get the first byte of each character.
For example: If the character is 'ä' (i.e. has the UTF-8 encoding C3 A4) and I
cast it to int, the result is 195 - which equals C3. The second byte, A4,
seems to be lost.
If it is an ASCII character, everything works as desired, but with all higher
characters I have this problem. I tried using dchar instead of char, and I
tried applying all of the conversion functions from std.utf, but the problem
did not change at all.

So, is there an encoding function which returns the real characters* so that I
can work with them, or do I actually have to work with single bytes (which
would necessarily result in reinventing the squared wheel)?

By the way, I'm using MS Windows XP SP2 in German, and my source code is
UTF-8 with BOM. I'm not sure if one of these facts matters.

Thank you for any feedback and kindest regards,
Længlich

* The encoding doesn't matter to me. I just want to be able to compare them to
other characters without them always being equal to 195.
Sep 02 2007
Længlich <nospam void.invalid> writes:
Oops,

I've just seen that void.de actually exists, so they get the spam now. Is it
possible to edit or remove the E-mail address?

Kindest regards,
Længlich
Sep 02 2007
Deewiant <deewiant.doesnotlike.spam gmail.com> writes:
You might want to read http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD

-- 
Remove ".doesnotlike.spam" from the mail address.
Sep 02 2007
Daniel Keep <daniel.keep.lists gmail.com> writes:
Deewiant wrote:
> You might want to read http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD

<3

-- Daniel
Sep 02 2007
Deewiant <deewiant.doesnotlike.spam gmail.com> writes:
Daniel Keep wrote:
> Deewiant wrote:
>> You might want to read http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD
> <3

Seemed to be the best text on Wiki4D.

-- 
Remove ".doesnotlike.spam" from the mail address.
Sep 02 2007
Længlich <nospam void.invalid> writes:
¡Hola!

> Seemed to be the best text on Wiki4D.

Yes, that was the explanation I was searching for. Thank you very much! :-)

Now that I know why it doesn't work I think I can fix it soon.

Thanks again and kindest regards,
Længlich
Sep 02 2007
"Nikita Kalaganov" <riven-mage id.ru> writes:
> http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD

And, IMHO, the solution is simple: chars must be treated by the compiler and libraries as complete code points. So "char" can represent code points 0x20-0xFF (Basic Latin and Latin-1 Supplement), "wchar" code points 0x20-0xFFFF (the complete Basic Multilingual Plane), and "dchar" all code points (including the supplementary planes).

If your program is 100% Latin, use char[]. For multi-language programs use wchar[]. Use dchar[] for exotics :)

Conversion from char[] to wchar[]/dchar[] and from wchar[] to dchar[] is implicit. Reverse conversions are not always possible (*).

Main problems solved:
1. Sliceable strings.
2. The length property contains the real "length" of the string.
3. Printable.
4. Easy to understand :)

All conversions from/to UTF-8, UTF-16 and UTF-32 should be explicit. The price is (*).
Sep 03 2007
"Stewart Gordon" <smjg_1998 yahoo.com> writes:
"Længlich" <nospam void.de> wrote in message 
news:fbeldf$1tbn$1 digitalmars.com...
> Hello!
>
> From what I've read about D I think I will like this language much more than
> C++, Java and the other well-known languages. But now that I'm using it for the
> first time, I've got a serious problem with the handling of user input.
>
> The input comes from a TextBox from the DFL (D Forms Library) which seems to
> be working fine, except for the problem that I cannot meaningfully access any
> given string (char[]). Whenever I try to do something with the string (e.g.
> concat it to another one, or use a string function like tolower), I get an
> "Invalid UTF-8 sequence" error.

I'm a bit puzzled. Concatenating arrays shouldn't care about their content.

> When I try to access a character directly (e.g. with a
> foreach loop over the string), I only get the first byte of each character.
> For example: If the character is 'ä' (i.e. has the UTF-8 encoding C3 A4) and I
> cast it to int, the result is 195 - which equals C3. The second byte, A4,
> seems to be lost.

Sounds as though DFL is buggy. A char is indeed a single byte, but it shouldn't be losing the remaining bytes of the character. Are you sure it's actually returning the first UTF-8 byte of each character, and not some other encoding like ANSI?

I don't know DFL myself, but meanwhile, please try evaluating std.string.format(cast(ubyte[]) text) on the text retrieved from your TextBox, and then post the result (along with what text you typed). This might help with diagnosing the problem.

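A minimal sketch of the byte dump suggested above, assuming a current D compiler and Phobos rather than the 2007 toolchain (today's `format` needs an explicit format string, so the call looks slightly different from the one quoted). The helper name `dumpBytes` is hypothetical, and the string literal merely stands in for the text retrieved from the TextBox:

```d
import std.stdio : writeln;
import std.string : format, representation;

// Hypothetical helper: render the raw bytes (UTF-8 code units) of a string.
string dumpBytes(string text)
{
    // representation() views the chars as immutable(ubyte)[] without copying.
    return format("%s", text.representation);
}

void main()
{
    // "\u00E4" is 'ä'; the escape avoids depending on the source file's encoding.
    writeln(dumpBytes("\u00E4")); // [195, 164], i.e. C3 A4
}
```

If both 195 and 164 show up, the text is intact UTF-8 and nothing was lost; a lone 228 would instead point to a single-byte encoding such as Latin-1.
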
> If it is an ASCII character, everything works as desired, but with all higher
> characters I have this problem. I tried using dchar instead of char, and I
> tried applying all of the conversion functions from std.utf, but the problem
> did not change at all.

You can foreach with dchar over a char[]. Or have you tried that?

<snip>

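A short sketch of that suggestion, assuming a current D compiler (the helper name `codePoints` is hypothetical). Declaring the loop variable as dchar asks foreach to decode each UTF-8 sequence into one code point, while char yields the individual bytes:

```d
import std.stdio : writeln;

// Hypothetical helper: collect the code points of a UTF-8 string.
dchar[] codePoints(string text)
{
    dchar[] result;
    // Asking for dchar makes foreach decode the UTF-8 sequence on the fly.
    foreach (dchar c; text)
        result ~= c;
    return result;
}

void main()
{
    string text = "\u00E4b"; // 'ä' followed by 'b'

    // Iterating as char yields the individual bytes: 195, 164, 98.
    foreach (char c; text)
        writeln(cast(int) c);

    // Iterating as dchar yields whole characters: 228, 98.
    foreach (dchar c; codePoints(text))
        writeln(cast(int) c);
}
```

The decoding only works on valid UTF-8, so input in another encoding would still fail here.
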
> * The encoding doesn't matter to me. I just want to be able to compare them to
> other characters without them always being equal to 195.

If you want to compare them _to_ other characters, it would make most sense to do so if they are all the same. If you want to compare them _with_ other characters, OTOH....

If different characters are all coming out as 195, with no bytes in between to distinguish them, then it's definitely a bug in DFL.

Stewart.
Sep 03 2007
Længlich <nospam void.invalid> writes:
Hi,

> If different characters are all coming out as 195, with no bytes in between
> to distinguish them, then it's definitely a bug in DFL.

No, it was just because of my misunderstanding of what a »char« is in D. Now that I know that char[] is much like a byte array and not really like a string in other languages, I see that no data is lost. Obviously I just couldn't get the second byte, because it always threw an exception in my context.

But the problem is solved now. My program has to deal with input in arbitrary languages; I want every possible character to work fine (even those from higher planes). So I now use dchar for all my functions, and since this change everything works as desired.

Thanks to all of you!

Kindest regards,
Længlich
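For illustration only, a sketch of what working in dchar can look like, assuming a current D compiler and Phobos; this is not the poster's actual code. `std.utf.toUTF32` converts UTF-8 text into an array with one dchar per code point, so length and indexing then count characters rather than bytes:

```d
import std.stdio : writeln;
import std.utf : toUTF32;

void main()
{
    string utf8 = "\u00E4b";        // 'ä' followed by 'b', stored as UTF-8
    dstring utf32 = toUTF32(utf8);  // one dchar per code point

    writeln(utf8.length);           // 3 (bytes)
    writeln(utf32.length);          // 2 (code points)
    writeln(utf32[0] == '\u00E4');  // true
}
```
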
Sep 08 2007