digitalmars.D - Character is only first byte of an UTF-8 sequence
- Længlich (26/26) Sep 02 2007 Hello!
- Længlich (4/4) Sep 02 2007 Oops,
- Deewiant (3/3) Sep 02 2007 You might want to read http://www.wikiservice.at/d/wiki.cgi?DanielKeep/T...
- Daniel Keep (3/5) Sep 02 2007 <3
- Deewiant (4/8) Sep 02 2007 Seemed to be the best text on Wiki4D.
- Længlich (5/6) Sep 02 2007 Yes, that was the explanation I was searching for. Thank you very much! ...
- Nikita Kalaganov (17/18) Sep 03 2007 And, IMHO, solution is simple - chars must be treated by compiler and
- Stewart Gordon (19/50) Sep 03 2007 I'm a bit puzzled. Concatenating arrays shouldn't care about their cont...
- Længlich (7/9) Sep 08 2007 No, it was just because of my misunderstanding of what a »char« is in ...
Hello!

From what I've read about D, I think I will like this language much more than C++, Java and the other well-known languages. But now that I'm using it for the first time, I've got a serious problem with the handling of user input.

The input comes from a TextBox from the DFL (D Forms Library), which seems to be working fine - except that I cannot sensibly access any given string (char[]). Whenever I try to do something with the string (e.g. concatenate it to another one, or use a string function like tolower), I get an "Invalid UTF-8 sequence" error. When I try to access a character directly (e.g. with a foreach loop over the string), I only get the first byte of each character. For example: if the character is 'ä' (i.e. has the UTF-8 encoding C3 A4) and I cast it to int, the result is 195 - which equals C3. The second byte, A4, seems to be lost. If it is an ASCII character, everything works as desired, but with all higher characters I have this problem.

I tried using dchar instead of char, and I tried applying all of the conversion functions from std.utf, but the problem did not change at all.

So, is there an encoding function which returns the real characters* so that I can work with them, or do I actually have to work with single bytes (which would necessarily result in reinventing the square wheel)?

By the way, I'm using MS Windows XP SP2 in German, and my source code is UTF-8 with BOM. I'm not sure if either of these facts matters.

Thank you for any feedback and kindest regards,
Længlich

* The encoding doesn't matter to me. I just want to be able to compare them to other characters without them always being equal to 195.
Sep 02 2007
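The symptom in the post above can be reproduced with a minimal sketch. It assumes a current D2 compiler (the thread predates it), and the helper name codeUnitValues is made up for illustration; it is not the poster's code. In D, a char is a single UTF-8 code unit, so iterating a string as char visits bytes, and 'ä' shows up as two of them.

```d
import std.stdio;

// Hypothetical helper: returns the numeric value of every UTF-8 code unit.
int[] codeUnitValues(string s)
{
    int[] values;
    foreach (char c; s)          // char = one UTF-8 code unit (one byte)
        values ~= cast(int) c;
    return values;
}

void main()
{
    // "\u00E4" is 'ä'; the escape keeps the example independent of the
    // encoding of the source file.
    writeln(codeUnitValues("\u00E4")); // prints [195, 164], i.e. C3 and A4
}
```

The second byte is not lost; it is simply the next element of the array.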
Oops, I've just seen that void.de actually exists, so they get the spam now. Is it possible to edit or remove the E-mail address? Kindest regards, Længlich
Sep 02 2007
You might want to read http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD -- Remove ".doesnotlike.spam" from the mail address.
Sep 02 2007
Deewiant wrote:
> You might want to read http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD

<3

-- Daniel
Sep 02 2007
Daniel Keep wrote:
> Deewiant wrote:
>> You might want to read http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD
>
> <3

Seemed to be the best text on Wiki4D.

-- Remove ".doesnotlike.spam" from the mail address.
Sep 02 2007
Hello!

> Seemed to be the best text on Wiki4D.

Yes, that was the explanation I was searching for. Thank you very much! :-) Now that I know why it doesn't work, I think I can fix it soon.

Thanks again and kindest regards,
Længlich
Sep 02 2007
> http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD

And, IMHO, the solution is simple - chars must be treated by the compiler and libraries as complete code points. So "char" can represent code points 0x20-0xFF (Basic Latin and Latin-1 Supplement), "wchar" code points 0x20-0xFFFF (the complete Basic Multilingual Plane), and "dchar" all code points (including the supplementary planes).

If your program is 100% Latin, use char[]. For multi-language programs use wchar[]. Use dchar[] for exotics :)

Conversion from char[] to wchar[]/dchar[] and from wchar[] to dchar[] is implicit. Reverse conversions are not always possible (*).

Main problems solved:
1. Slice-able strings.
2. The length property contains the real "length" of the string.
3. Printable.
4. Easy to understand :)

All conversions from/to UTF-8, UTF-16 and UTF-32 should be explicit. The price is (*).
Sep 03 2007
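For contrast with the proposal above, here is a small sketch of how the length property behaves in D as it actually is. It assumes D2 and its std.utf module; the helper name unitCounts is invented for this example. The length of each array counts code units of its encoding, so the same text has three different lengths.

```d
import std.stdio;
import std.utf : toUTF16, toUTF32;

// Hypothetical helper: the array lengths of the same text as
// UTF-8 (char), UTF-16 (wchar) and UTF-32 (dchar).
size_t[3] unitCounts(string s)
{
    return [s.length, toUTF16(s).length, toUTF32(s).length];
}

void main()
{
    // "Bär", a space, and U+1D11E (a symbol outside the Basic Multilingual Plane)
    auto counts = unitCounts("B\u00E4r \U0001D11E");
    writeln(counts[0]); // 9 UTF-8 code units
    writeln(counts[1]); // 6 UTF-16 code units (one surrogate pair)
    writeln(counts[2]); // 5 code points
}
```

Only the dchar[] length matches the number of code points, which is the property the proposal asks of every string type.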
"Længlich" <nospam void.de> wrote in message news:fbeldf$1tbn$1 digitalmars.com...
> Hello! From what I've read about D, I think I will like this language much more than C++, Java and the other well-known languages. But now that I'm using it for the first time, I've got a serious problem with the handling of user input. The input comes from a TextBox from the DFL (D Forms Library), which seems to be working fine - except that I cannot sensibly access any given string (char[]). Whenever I try to do something with the string (e.g. concatenate it to another one, or use a string function like tolower), I get an "Invalid UTF-8 sequence" error.

I'm a bit puzzled. Concatenating arrays shouldn't care about their content.

> When I try to access a character directly (e.g. with a foreach loop over the string), I only get the first byte of each character. For example: if the character is 'ä' (i.e. has the UTF-8 encoding C3 A4) and I cast it to int, the result is 195 - which equals C3. The second byte, A4, seems to be lost.

Sounds as though DFL is buggy. A char is indeed a single byte, but it shouldn't be losing the remaining bytes of the character. Are you sure it's actually returning the first UTF-8 byte of each character, and not some other encoding like ANSI?

I don't know DFL myself, but meanwhile, please try evaluating

std.string.format(cast(ubyte[]) text)

on the text retrieved from your TextBox, and then post the result (along with what text you typed). This might help with diagnosing the problem.

> If it is an ASCII character, everything works as desired, but with all higher characters I have this problem. I tried using dchar instead of char, and I tried applying all of the conversion functions from std.utf, but the problem did not change at all.

You can foreach with dchar over a char[]. Or have you tried that?

<snip>
> * The encoding doesn't matter to me. I just want to be able to compare them to other characters without them always being equal to 195.

If you want to compare them _to_ other characters, it would make most sense to do so if they are all the same. If you want to compare them _with_ other characters, OTOH....

If different characters are all coming out as 195, with no bytes in between to distinguish them, then it's definitely a bug in DFL.

Stewart.
Sep 03 2007
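The two suggestions in the post above, a foreach with dchar over a char[] and a dump of the raw bytes, might look like this in current D2. This is a sketch under assumptions: the helper names codePoints and hexDump are made up, and the format call is a D2 adaptation of the std.string.format(cast(ubyte[]) text) call quoted above, not the original D1 form.

```d
import std.format : format;
import std.stdio;

// Hypothetical helper: collects the code points that foreach decodes.
dchar[] codePoints(const(char)[] text)
{
    dchar[] result;
    foreach (dchar c; text)   // foreach decodes the UTF-8 into whole code points
        result ~= c;
    return result;
}

// Hypothetical helper: the raw bytes of the text as hex, for diagnosis.
string hexDump(const(char)[] text)
{
    return format("%(%02X %)", cast(const(ubyte)[]) text);
}

void main()
{
    string text = "B\u00E4r";          // "Bär"
    writeln(hexDump(text));            // prints 42 C3 A4 72
    writeln(codePoints(text).length);  // prints 3
}
```

If the bytes were not valid UTF-8 (for example text in an ANSI code page), the dchar loop would throw rather than decode, so the hex dump is the safer first step when the encoding is in doubt.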
Hi,

> If different characters are all coming out as 195, with no bytes in between to distinguish them, then it's definitely a bug in DFL.

No, it was just because of my misunderstanding of what a »char« is in D. Now that I know that char[] is much like a byte array and not really like a string in other languages, I see that no data is lost. Obviously I just couldn't get the second byte, because it always threw an exception in my context.

But the problem is solved now. My program has to deal with input in arbitrary languages; I want every possible character to work fine (even those from higher planes). So I now use dchar for all my functions, and since this change everything works as desired.

Thanks to all of you!

Kindest regards,
Længlich
Sep 08 2007
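The approach described in the last post, working in dchar throughout, could be sketched as follows. It assumes D2 and std.utf; the helper name charAt is hypothetical, and the poster's actual code is not shown in the thread. Converting the UTF-8 input once to a dstring gives one array element per code point, so indexing, comparing and slicing work per character.

```d
import std.stdio;
import std.utf : toUTF32, toUTF8;

// Hypothetical helper: the index-th code point of UTF-8 input.
dchar charAt(string input, size_t index)
{
    dstring text = toUTF32(input); // one element per code point
    return text[index];
}

void main()
{
    dstring text = toUTF32("B\u00E4r \U0001D11E");
    writeln(text.length);            // prints 5
    writeln(text[1] == '\u00E4');    // prints true
    // Slicing a dstring cannot cut a code point in half;
    // convert back to UTF-8 only at the output boundary.
    string firstWord = toUTF8(text[0 .. 3]);
    writeln(firstWord.length);       // prints 4 (UTF-8 bytes again)
}
```

The cost is four bytes per character, and a code point is still not always a user-perceived character (combining marks are separate code points), but it does cover the higher planes mentioned above.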