digitalmars.D - Error: invalid UTF-8 sequence
- Carotinho (17/17) Nov 28 2004 Hi all!
- Anders F Björklund (4/9) Nov 28 2004 D only works with Unicode. You need to set your shell to UTF-8.
- Simon Buchan (14/23) Nov 28 2004 I don't think cast works. Unfortunately, the Windows shell can't use
- Anders F Björklund (25/44) Nov 29 2004 It does. The problem is that D just assumes that the shell is UTF-8,
- Ben Hinkle (78/82) Nov 29 2004 The following solution doesn't handle errors well due to some errno
- Carotinho (3/3) Nov 29 2004 I thanks you all, I'll start experiments!
- Ben Hinkle (5/8) Nov 29 2004 oh, even better. you don't need the dll then - just get the .d file that...
- Simon Buchan (11/33) Nov 29 2004 This doesn't let you make UTF-8 into an OEM codepage, though, does it?
- Anders F Björklund (14/70) Nov 30 2004 This code doesn't work everywhere... (POSIX?)
- Ben Hinkle (16/86) Nov 30 2004 That is a bummer. Love those #defines!
- Anders F Björklund (5/7) Nov 30 2004 Yes, that is in the Unicode specification. BE is the default order.
- Kris (4/4) Nov 30 2004 "Ben Hinkle" wrote ...
Hi all! I'm new here and to D. I wrote a simple program:

    import std.stdio;
    import std.stream;

    int main()
    {
        char[] stringa;
        stringa = std.stream.stdin.readLine();
        writefln("%s", stringa);
        return 0;
    }

If I type normal characters, like a, b, c etc., everything is OK. But when I try to type special characters like è, ò, ù, I get

    Error: invalid UTF-8 sequence

when the program tries to write back the string I got. What is this? Thanks in advance!

Carotinho
Nov 28 2004
Carotinho wrote:
> If I type normal characters, like a, b, c etc., everything is OK. But when I try to type special characters like è, ò, ù, I get Error: invalid UTF-8 sequence when the program tries to write back the string I got. What is this?

D only works with Unicode. You need to set your shell to UTF-8.
(Or, the tricky version: you can cast(ubyte[]) and convert it?)

--anders
Nov 28 2004
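As a rough illustration of the "cast(ubyte[]) and convert it" idea above: the sketch below assumes the console delivers ISO-8859-1 (Latin-1) bytes, where every byte value equals its Unicode code point. It is written for current D (D2) rather than the 2004 compiler, and `latin1ToUtf8` is a hypothetical helper name, not a Phobos function.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: treats every input byte as one ISO-8859-1 character
/// and re-encodes it as UTF-8.
string latin1ToUtf8(const(ubyte)[] raw)
{
    string result;
    foreach (b; raw)
        result ~= cast(dchar) b; // appending a dchar to a string encodes it as UTF-8
    return result;
}

void main()
{
    // "è" as a Latin-1 console would deliver it: the single byte 0xE8.
    ubyte[] raw = [0xE8];

    // Interpreted as UTF-8, 0xE8 starts a multi-byte sequence that never
    // continues, so validation fails. This is the error from the question.
    bool valid = true;
    try
        validate(cast(const(char)[]) raw);
    catch (UTFException e)
        valid = false;
    assert(!valid);

    // After conversion the text is valid UTF-8 (two bytes: 0xC3 0xA8).
    string fixed = latin1ToUtf8(raw);
    validate(fixed);
    assert(fixed == "\u00E8");
    assert(fixed.length == 2);
}
```

Other 8-bit code pages (CP-437, CP-1252, MacRoman) need a lookup table instead of the direct cast, since their byte values do not match Unicode code points.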
On Mon, 29 Nov 2004 00:39:29 +0100, Anders F Björklund <afb algonet.se> wrote:
> Carotinho wrote:
>> If I type normal characters, like a, b, c etc., everything is OK. But when I try to type special characters like è, ò, ù, I get Error: invalid UTF-8 sequence when the program tries to write back the string I got. What is this?
> D only works with Unicode. You need to set your shell to UTF-8.
> (Or, the tricky version: you can cast(ubyte[]) and convert it?)
> --anders

I don't think cast works. Unfortunately, the Windows shell can't use UTF. This discussion was referenced somewhere else (maybe digitalmars.D.bugs?)

I have a project that tries to write a file with funky punctuation to the screen... the closest I got was to use read/writeString exclusively, which gives you rubbish for special characters.

There was something mentioned about a Win32 API that converted UTF to codepages and vice versa... sounded promising, but I don't know if it is currently available to D. Look around, you may get lucky. (and if you do, tell the rest of us :D)

--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
Nov 28 2004
Simon Buchan wrote:
>> (Or, the tricky version, you can cast(ubyte[]) and convert it ?)
> I don't think cast works. Unfortunately, the Windows shell can't use UTF. This discussion was referenced somewhere else (maybe digitalmars.D.bugs?)

It does. The problem is that D just assumes that the shell is UTF-8, and feeds you char[] that are *invalid* (as they are native-encoded). If you translate them yourself, I've found it to work just fine...

I don't have a DOS console (echo off allergies), but it does work with a zsh console set to the ISO-8859-1 encoding (instead of UTF-8). Of course, if the console *is* Unicode, then this doesn't work...

Anyway, my test code looked like:

    void main(char[][] args)
    {
        wchar[256] mapping = iso88591.mapping;
        char[] test = cast(char[]) decode_string(cast(ubyte[]) args[1], mapping);
        writefln("%s", test);
        static ubyte[1] z = [ 0 ];
        printf("%s\n", cast(char*) (encode_string(test, mapping) ~ z) );
    }

Usually when you call old C functions, you want ubyte[] and not char[], since they don't handle UTF-8? The D tradition is to pretend that they have the D definition (char *) anyway, since "it is the same bit size". I use ubyte[] for legacy 8-bit encodings, and char[] for Unicode only.

> There was something mentioned about a Win32 API that converted UTF to codepages and vice-versa... sounded promising, but I don't know if it is currently available to D. Look around, you may get lucky. (and if you do, tell the rest of us :D)

There is a Win32-only API, and some open source libraries (iconv, ICU):

http://msdn.microsoft.com/library/en-us/intl/unicode_19mb.asp
http://www.gnu.org/software/libiconv/
http://oss.software.ibm.com/icu/

I might share my own little hack later on too, when I've packaged it up. (It just does the 4 main mappings, not the other 200* that the above do: ISO-8859-1 [Latin-1], CP-437 [DOS], CP-1252 [Win], MacRoman [Mac OS 9].) It's a lot smaller than the real McCoy, and will be under the zlib license:
http://www.opensource.org/licenses/zlib-license.php (my usual license)

If you need the full functionality, look at Mango/ICU or iconv instead?

--anders

PS. I'm not kidding, it really has hundreds (!) of different encodings: http://oss.software.ibm.com/icu/charset/
Nov 29 2004
[snip]
> There was something mentioned about a Win32 API that converted UTF to codepages and vice-versa... sounded promising, but I don't know if it is currently available to D. Look around, you may get lucky. (and if you do, tell the rest of us :D)

The following solution doesn't handle errors well due to some errno confusion I'm trying to figure out, but it is a start. Here's what you can do. Get iconv.dll from the zip file at
http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download
and put it in the same directory as your executable. The attached libiconv.d will load the three functions you need. The attached iconv_example.d shows how to call iconv to convert UTF-8 to UTF-16 little endian.

I'm looking into the errno issues and will probably have to recompile libiconv with DMC or something. But for typical usage the above instructions should work. Also I'd like to put a small wrapper around the low-level API to make it easier to use for the simple cases when the input is complete.
-Ben

[Attachments: libiconv.d, iconv_example.d]
Nov 29 2004
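The attachments did not survive in this archive, so here is a minimal sketch of the same idea: declare the three iconv functions and convert UTF-8 to UTF-16 little endian. It is not Ben's libiconv.d. It assumes current D (D2) on a Linux/glibc system, where `iconv_open`, `iconv` and `iconv_close` are exported by libc under those names; `utf8ToUtf16le` is a hypothetical wrapper name.

```d
// C declarations for the three iconv entry points (POSIX signatures).
extern (C) nothrow @nogc
{
    alias iconv_t = void*;
    iconv_t iconv_open(const(char)* tocode, const(char)* fromcode);
    size_t iconv(iconv_t cd, char** inbuf, size_t* inbytesleft,
                 char** outbuf, size_t* outbytesleft);
    int iconv_close(iconv_t cd);
}

/// Hypothetical wrapper for the simple case where the whole input is available.
ubyte[] utf8ToUtf16le(const(char)[] text)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == cast(iconv_t) -1)
        throw new Exception("iconv_open failed");
    scope (exit) iconv_close(cd);

    // One UTF-8 byte never produces more than two UTF-16 bytes.
    auto output = new ubyte[text.length * 2];

    char* inPtr = cast(char*) text.ptr;
    size_t inLeft = text.length;
    char* outPtr = cast(char*) output.ptr;
    size_t outLeft = output.length;

    // iconv advances both pointers and decrements both counters as it works.
    if (iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft) == cast(size_t) -1)
        throw new Exception("iconv conversion failed");

    return output[0 .. output.length - outLeft];
}

void main()
{
    ubyte[] expected = [0xE8, 0x00]; // U+00E8 as one little-endian UTF-16 unit
    assert(utf8ToUtf16le("\u00E8") == expected);
}
```

On a failure, `iconv` reports the reason through `errno`, which this sketch deliberately ignores.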
I thank you all, I'll start experimenting! For information, I'm running Linux, and even here I'm quite a newbie :)

Byez!
Nov 29 2004
"Carotinho" <carotinobg yahoo.it> wrote in message news:cog796$1rjr$1 digitaldaemon.com...
> I thank you all, I'll start experimenting! For information, I'm running Linux, and even here I'm quite a newbie :) Byez!

Oh, even better. You don't need the DLL then: just get the .d file that declares the iconv functions and you're all set (well, except for figuring out the API and getting the right encodings).
Nov 29 2004
On Mon, 29 Nov 2004 14:14:33 -0500, Ben Hinkle <bhinkle mathworks.com> wrote:
> [snip]
>> There was something mentioned about a Win32 API that converted UTF to codepages and vice-versa... sounded promising, but I don't know if it is currently available to D. Look around, you may get lucky. (and if you do, tell the rest of us :D)
> The following solution doesn't handle errors well due to some errno confusion I'm trying to figure out, but it is a start. Here's what you can do. Get iconv.dll from the zip file at
> http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download
> and put it in the same directory as your executable. The attached libiconv.d will load the three functions you need. The attached iconv_example.d shows how to call iconv to convert utf-8 to utf-16 little endian.
> I'm looking into the errno issues and will probably have to recompile libiconv with DMC or something. But for typical usage the above instructions should work. Also I'd like to put a small wrapper around the low-level API to make it easier to use for the simple cases when the input is complete.
> -Ben

This doesn't let you make UTF-8 into an OEM codepage, though, does it? Linux users should be fine if they set their console to a UTF, but poor Windows users are stuck with weird codepages. (I do have the UTF codepages installed, they have to be, but I don't know how you can tell the console to use them.)

--
"Unhappy Microsoft customers have a funny way of becoming Linux, Salesforce.com and Oracle customers." - www.microsoft-watch.com: "The Year in Review: Microsoft Opens Up"
Nov 29 2004
Ben Hinkle wrote:
> The following solution doesn't handle errors well due to some errno confusion I'm trying to figure out, but it is a start. Here's what you can do. Get iconv.dll from the zip file at
> http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download
> and put it in the same directory as your executable. The attached libiconv.d will load the three functions you need. The attached iconv_example.d shows how to call iconv to convert utf-8 to utf-16 little endian.
> I'm looking into the errno issues and will probably have to recompile libiconv with DMC or something. But for typical usage the above instructions should work. Also I'd like to put a small wrapper around the low-level API to make it easier to use for the simple cases when the input is complete.

This code doesn't work everywhere... (POSIX?) At least not without some more modifications.

> // on POSIX systems iconv is built into libc so loading is automatic

It doesn't work on Mac OS X, unfortunately.

    /usr/bin/ld: Undefined symbols:
    _iconv
    _iconv_close
    _iconv_open
    collect2: ld returned 1 exit status

(It's being loaded from System's /usr/lib/libiconv.dylib)

In /usr/include/iconv.h:

    #define iconv_t libiconv_t
    #ifndef LIBICONV_PLUG
    #define iconv_open libiconv_open
    #define iconv libiconv
    #define iconv_close libiconv_close
    #endif

Annoying, isn't it? So one needs to declare the C functions with the "lib" prefix, and then do wrappers in D for the usual names...

    } else version (darwin) {

      // On Mac OS X, link with -liconv (/usr/lib/libiconv.dylib)
      typedef void *libiconv_t;

      // allocate a converter between charsets fromcode and tocode
      extern (C) libiconv_t libiconv_open (char *tocode, char *fromcode);
      iconv_t iconv_open (char *tocode, char *fromcode)
      { return cast(iconv_t) libiconv_open(tocode, fromcode); }

      // convert inbuf to outbuf and set inbytesleft to unused input and
      // outbuf to unused output and return number of non-reversible
      // conversions or -1 on error.
      extern (C) size_t libiconv (libiconv_t cd, void **inbuf,
                                  size_t *inbytesleft,
                                  void **outbuf,
                                  size_t *outbytesleft);
      size_t iconv (iconv_t cd, void **inbuf,
                    size_t *inbytesleft,
                    void **outbuf,
                    size_t *outbytesleft)
      { return libiconv(cast(libiconv_t) cd, inbuf, inbytesleft, outbuf,
                        outbytesleft); }

      // close converter
      extern (C) int libiconv_close (libiconv_t cd);
      int iconv_close (iconv_t cd)
      { return libiconv_close(cast(libiconv_t) cd); }

    } else {

And the test code assumed that everything is X86:

    version (LittleEndian)
      // convert from utf-8 to utf-16 little endian
      iconv_t cd = iconv_open("UTF-16LE","UTF-8");
    else version (BigEndian)
      // convert from utf-8 to utf-16 big endian
      iconv_t cd = iconv_open("UTF-16BE","UTF-8");

That's actually one of the biggest drawbacks of UTF-16...

Besides those little flaws, the code works just fine :-)

--anders
Nov 30 2004
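The endianness point above can be sketched in isolation: choose the UTF-16 encoding name at compile time so that it matches how `wchar` values are laid out in memory. This is an illustration for current D (D2); `nativeUtf16` is a hypothetical name, and the string it holds is what one would pass as the `tocode` argument of `iconv_open`.

```d
import std.system : Endian, endian;

// Pick the iconv encoding name that matches the machine's byte order.
version (LittleEndian)
    enum string nativeUtf16 = "UTF-16LE";
else version (BigEndian)
    enum string nativeUtf16 = "UTF-16BE";
else
    static assert(false, "unknown byte order");

void main()
{
    // Look at how one UTF-16 code unit is actually stored in memory.
    wchar unit = 0x00E8;
    auto bytes = cast(const(ubyte)*) &unit;
    const lowByteFirst = bytes[0] == 0xE8;

    // The compile-time choice agrees with the in-memory layout
    // and with Phobos' runtime constant.
    assert(lowByteFirst == (nativeUtf16 == "UTF-16LE"));
    assert(lowByteFirst == (endian == Endian.littleEndian));
}
```

With a name chosen this way, the bytes iconv writes can be reinterpreted as `wchar[]` on either kind of machine.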
"Anders F Björklund" <afb algonet.se> wrote in message news:coi25c$1ijr$1 digitaldaemon.com...
> Ben Hinkle wrote:
>> The following solution doesn't handle errors well due to some errno confusion I'm trying to figure out, but it is a start. Here's what you can do. Get iconv.dll from the zip file at
>> http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download
>> and put it in the same directory as your executable. The attached libiconv.d will load the three functions you need. The attached iconv_example.d shows how to call iconv to convert utf-8 to utf-16 little endian.
>> I'm looking into the errno issues and will probably have to recompile libiconv with DMC or something. But for typical usage the above instructions should work. Also I'd like to put a small wrapper around the low-level API to make it easier to use for the simple cases when the input is complete.
>
> This code doesn't work everywhere... (POSIX?) At least not without some more modifications.
>
>> // on POSIX systems iconv is built into libc so loading is automatic
>
> It doesn't work on Mac OS X, unfortunately.
>
>     /usr/bin/ld: Undefined symbols:
>     _iconv
>     _iconv_close
>     _iconv_open
>     collect2: ld returned 1 exit status
>
> (It's being loaded from System's /usr/lib/libiconv.dylib)
>
> In /usr/include/iconv.h:
>
>     #define iconv_t libiconv_t
>     #ifndef LIBICONV_PLUG
>     #define iconv_open libiconv_open
>     #define iconv libiconv
>     #define iconv_close libiconv_close
>     #endif
>
> Annoying, isn't it?

That is a bummer. Love those #defines!

> So one needs to declare the C functions with the "lib" prefix, and then do wrappers in D for the usual names...
>
>     } else version (darwin) {
>
>       // On Mac OS X, link with -liconv (/usr/lib/libiconv.dylib)
>       typedef void *libiconv_t;
>
>       // allocate a converter between charsets fromcode and tocode
>       extern (C) libiconv_t libiconv_open (char *tocode, char *fromcode);
>       iconv_t iconv_open (char *tocode, char *fromcode)
>       { return cast(iconv_t) libiconv_open(tocode, fromcode); }
>
>       // convert inbuf to outbuf and set inbytesleft to unused input and
>       // outbuf to unused output and return number of non-reversible
>       // conversions or -1 on error.
>       extern (C) size_t libiconv (libiconv_t cd, void **inbuf,
>                                   size_t *inbytesleft,
>                                   void **outbuf,
>                                   size_t *outbytesleft);
>       size_t iconv (iconv_t cd, void **inbuf,
>                     size_t *inbytesleft,
>                     void **outbuf,
>                     size_t *outbytesleft)
>       { return libiconv(cast(libiconv_t) cd, inbuf, inbytesleft, outbuf,
>                         outbytesleft); }
>
>       // close converter
>       extern (C) int libiconv_close (libiconv_t cd);
>       int iconv_close (iconv_t cd)
>       { return libiconv_close(cast(libiconv_t) cd); }
>
>     } else {

Maybe I'll try using std.loader for this case, too, and have iconv be a function pointer. Hmm...

> And the test code assumed that everything is X86:
>
>     version (LittleEndian)
>       // convert from utf-8 to utf-16 little endian
>       iconv_t cd = iconv_open("UTF-16LE","UTF-8");
>     else version (BigEndian)
>       // convert from utf-8 to utf-16 big endian
>       iconv_t cd = iconv_open("UTF-16BE","UTF-8");
>
> That's actually one of the biggest drawbacks of UTF-16...

That's true. I was being lazy with the example. When I tried just plain-old "UTF-16" I think it used big-endian.

> Besides those little flaws, the code works just fine :-)
> --anders

Thanks for the update. I obviously hadn't tried on the Mac.
Nov 30 2004
Ben Hinkle wrote:
> That's true. I was being lazy with the example. When I tried just plain-old "UTF-16" I think it used big-endian.

Yes, that is in the Unicode specification. BE is the default order. Unless there is a BOM present to classify it as LE instead, that is...

See http://www.unicode.org/faq/utf_bom.html

--anders
Nov 30 2004
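The rule described above (big endian unless a byte order mark says otherwise) can be sketched as a small check on the first two bytes of a buffer. It assumes current D (D2) and data already known to be UTF-16; `ByteOrder` and `utf16ByteOrder` are hypothetical names used only for this illustration.

```d
enum ByteOrder { bigEndian, littleEndian }

/// Hypothetical helper: decides how plain "UTF-16" data should be read.
/// U+FEFF stored as FE FF marks big endian, stored as FF FE marks little
/// endian. Without a BOM the default is big endian.
ByteOrder utf16ByteOrder(const(ubyte)[] data)
{
    if (data.length >= 2)
    {
        if (data[0] == 0xFF && data[1] == 0xFE)
            return ByteOrder.littleEndian;
        if (data[0] == 0xFE && data[1] == 0xFF)
            return ByteOrder.bigEndian;
    }
    return ByteOrder.bigEndian; // no BOM present
}

void main()
{
    // "è" (U+00E8) with a little-endian BOM, a big-endian BOM, and no BOM.
    assert(utf16ByteOrder([0xFF, 0xFE, 0xE8, 0x00]) == ByteOrder.littleEndian);
    assert(utf16ByteOrder([0xFE, 0xFF, 0x00, 0xE8]) == ByteOrder.bigEndian);
    assert(utf16ByteOrder([0x00, 0xE8]) == ByteOrder.bigEndian);
}
```

When the encoding is named explicitly as UTF-16LE or UTF-16BE, no BOM is needed and this check does not apply.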
"Ben Hinkle" <bhinkle mathworks.com> wrote ... | Maybe I'll try using std.loader for this case, too, and have iconv be a | function pointer. Hmm... That won't work with dmd 0.107 ...
Nov 30 2004