digitalmars.D - INVALID UTF-8 SEQUENCE!
- Martin (4/4) Aug 18 2004 I just downloaded the new dmd compiler and it tells me INVALID UTF-8 SEQ...
- J C Calvarese (12/16) Aug 18 2004 What format is your file saved in?
- Arcane Jill (10/14) Aug 18 2004 I have a sneaking suspicion you might find it will work just fine if you...
- Martin (8/26) Aug 18 2004 I think I changed the 0.93 version to the 0.98. In 0.93 my files compile...
- Arcane Jill (41/50) Aug 18 2004 That's not possible. In your original post you said "I use non.english
- Martin (9/62) Aug 18 2004 Yes you are probably right, it is some kind of extended ascii, in this c...
- Walter (11/19) Aug 18 2004 case I
- Martin (4/24) Aug 19 2004 I think I will use the \xXX. My workaround solution was much uglyer, so ...
- Arcane Jill (21/23) Aug 19 2004 Sorry, Walter - that's not right! You should not be encouraging the use ...
- Martin (13/36) Aug 19 2004 I think I will move to UTF-8 with my next version of the program. I can'...
- Arcane Jill (15/19) Aug 19 2004 But I don't think you can make demands on what encoding in which the POS...
- Walter (5/7) Aug 19 2004 this
- Walter (7/15) Aug 19 2004 \xXX
- Arcane Jill (49/55) Aug 19 2004 I think that your statement might need some clarifying. Web servers by
- Walter (8/14) Aug 18 2004 having
- Arcane Jill (3/18) Aug 19 2004 Could be fun. So what are CUJ and DDJ? Could someone give me some URLs?
- Jonathan Leffler (9/16) Aug 19 2004 CUJ = C User's Journal (or possibly Users'?)
- Walter (13/25) Aug 19 2004 you
- Nick (8/12) Aug 18 2004 If you are on linux you can convert from latin1 to utf8 with the command
- Walter (7/12) Aug 18 2004 just
I just downloaded the new dmd compiler and it tells me INVALID UTF-8 SEQUENCE when I compile. This is when I use non-English characters in my strings (like ÜÄÖ). But I need to use them. The old version was just fine, why this change? Most of the C compilers accept them, why not D?
Aug 18 2004
In article <cfvh55$2d5s$1 digitaldaemon.com>, Martin says...

> I just downloaded the new dmd compiler and it tells me INVALID UTF-8 SEQUENCE when I compile. This is when I use non-English characters in my strings (like ÜÄÖ). But I need to use them. The old version was just fine, why this change? Most of the C compilers accept them, why not D?

What format is your file saved in?

From http://www.digitalmars.com/d/lex.html:

Source Text

D source text can be in one of the following formats:

* ASCII
* UTF-8
* UTF-16BE
* UTF-16LE
* UTF-32BE
* UTF-32LE

jcc7
Aug 18 2004
In article <cfvh55$2d5s$1 digitaldaemon.com>, Martin says...

> I just downloaded the new dmd compiler and it tells me INVALID UTF-8 SEQUENCE when I compile. This is when I use non-English characters in my strings (like ÜÄÖ). But I need to use them. The old version was just fine, why this change?

I have a sneaking suspicion you might find it will work just fine if you save your source file in UTF-8 before trying to compile it. (Save As...). So far as I know, the D compiler has not changed in this regard (except that it can now auto-detect UTF-16 and UTF-32).

> Most of the C compilers accept them, why not D?

Actually, I think most C compilers simply allow a string to consist of an arbitrary sequence of bytes without any interpretation whatsoever - which just happens to appear to work whenever the source file encoding is the same as the run-time encoding.

Arcane Jill
Aug 18 2004
In article <cfvjdr$2dr2$1 digitaldaemon.com>, Arcane Jill says...

Thank you for your answer!

> I have a sneaking suspicion you might find it will work just fine if you save your source file in UTF-8 before trying to compile it. (Save As...).

I am using the GNU Midnight Commander text editor; it only saves ASCII.

> So far as I know, the D compiler has not changed in this regard (except that it can now auto-detect UTF-16 and UTF-32).

I think I changed from version 0.93 to 0.98. In 0.93 my files compiled fine; in 0.98 I get an error. I changed back to 0.93, because I need to use these characters.

So how do I tell dmd that the source is an ASCII file?

Thank you!
Aug 18 2004
In article <cfvlcp$2eob$1 digitaldaemon.com>, Martin says...

> Thank you for your answer!

Err... Don't thank me yet. Save that until the problem's actually solved!

>> I have a sneaking suspicion you might find it will work just fine if you save your source file in UTF-8 before trying to compile it. (Save As...).
> I am using the gnu midnight commander text editor, it only saves ascii.

That's not possible. In your original post you said "I use non-English characters in my strings (like ÜÄÖ)". If that statement is true, you /cannot/ be using ASCII, since these characters do not even /exist/ in ASCII. If your text contains any of the characters 'Ü', 'Ä' or 'Ö' then you are /not/ using ASCII. Period.

Unfortunately, I am not familiar with this text editor, so I don't know how to determine the encoding it uses, or how to change it. I may have a fix for you even so, however. (Read on...)

>> So far as I know, the D compiler has not changed in this regard (except that it can now auto-detect UTF-16 and UTF-32).
> I think I changed the 0.93 version to the 0.98. In 0.93 my files compiled fine, in 0.98 I get an error. I changed back to 0.93, because I need to use these characters.

Okay. Now, first off, the following compiles fine for me: using DMD v0.98, with the file saved as UTF-8. However, when I resaved the file as ISO-8859-1 (which is an invalid thing to do) then the compiler (correctly) gave me the compile-time error message: "Invalid UTF-8 sequence".

I believe that in the earlier version to which you refer (0.93) there was a bug, which was fixed in 0.96 - according to the change log: "Invalid UTF characters in string literals now diagnosed."

In other words, DMD 0.93 failed to diagnose the invalid UTF-8 characters in your source file, and so the file compiled -- but it compiled incorrectly. The error would not have been detected until runtime - and even then only IF you passed your string to a UTF conversion routine. If you passed your invalid string straight to printf(), for example, the v0.93 compiler wouldn't even have noticed. But you can bet your life that even if your program had appeared to run correctly on your machine, it would not necessarily have worked on anyone else's.

So tell me - what operating system are you using? The word "gnu" makes me suspect Linux, in which case I believe you need to set the environment variable CHARSET to the value UTF-8. (But I'm a Windows user, so I could be wrong - I'm hoping someone will leap in here and correct me if so). Anyway, once you've set your environment variable, everything should work with the latest DMD - and this time, it will work for everyone, not just for you.

> So how to I tell the dmd that the source is an ascii file?

There's a problem here - which is that you and I are not speaking the same language. An ASCII file is a file which DOES NOT CONTAIN any characters having codepoints outside the range 0x00 to 0x7F. DMD is perfectly happy with ASCII files, but your files are not ASCII. Sorry to be pedantic. Your file is /probably/ ISO-8859-1 (aka "Latin 1"). But it's not ASCII.

Arcane Jill
Aug 18 2004
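The definition of ASCII in the post above (no byte outside 0x00 to 0x7F) can be checked mechanically. What follows is a minimal sketch in present-day D, not code from the thread and not written for the 2004 compiler under discussion; the helper name `isPureAscii` and the sample bytes are assumptions made for illustration.

```d
import std.stdio : writeln;

/// Returns true if every byte lies in the ASCII range 0x00 .. 0x7F.
/// (Hypothetical helper, named for this example only.)
bool isPureAscii(const(ubyte)[] data)
{
    foreach (b; data)
    {
        if (b > 0x7F)
            return false;
    }
    return true;
}

void main()
{
    // The two letters "OK", which are plain ASCII.
    immutable(ubyte)[] plain = [0x4F, 0x4B];
    // The letter Ö as the single byte 0xD6, the way ISO-8859-1 stores it.
    immutable(ubyte)[] latin1 = [0xD6];

    writeln("plain is ASCII: ", isPureAscii(plain));
    writeln("latin1 is ASCII: ", isPureAscii(latin1));
}
```

Running the same check over the bytes of a source file would show whether an editor that "only saves ASCII" really does.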
Yes, you are probably right, it is some kind of extended ASCII; in this case I think it is ISO-8859-1.

My problem is that the web server I am writing this software for uses the same encoding. With the old version everything worked fine. Everyone who used the server saw the characters correctly.

So can I tell dmd to use ISO-8859-1, or just not to check the things it shouldn't be checking?
Aug 18 2004
"Martin" <Martin_member pathlink.com> wrote in message news:cg0ggt$16f3$1 digitaldaemon.com...Yes you are probably right, it is some kind of extended ascii, in thiscase Ithink that yes it is ISO-8859-1. My problem is, that the webserver that I am wrting this software for, usesthesame encoding. With the old version everything worked fine. Everyone that used the serversawthe characters right. So can I tell the dmd to use ISO-8859-1, or just not to check the thingsitshouldn't be checking?There's no way to do that right now. One of the problems with using such charsets in source code is the source code is then non-portable. Someone can just change a seemingly unrelated system setting, and poof, your builds fail. You can also use \xXX to specify the characters, though that is ugly enough to be unusable.
Aug 18 2004
In article <cg0n3l$1ln6$1 digitaldaemon.com>, Walter says...

> You can also use \xXX to specify the characters, though that is ugly enough to be unusable.

I think I will use the \xXX. My workaround solution was much uglier, so I am quite happy with this one. Thanks!
Aug 19 2004
In article <cg0n3l$1ln6$1 digitaldaemon.com>, Walter says...

> You can also use \xXX to specify the characters, though that is ugly enough to be unusable.

Sorry, Walter - that's not right! You should not be encouraging the use of \xXX in this context. This is wrong. Martin needs to be using \uXXXX, not \xXX. Instead of \xD6, he needs to use \u00D6. (Martin, I hope you're listening). Sticking \x's into a string literal is just another way to create an invalid UTF-8 sequence.

See this code: This will output: thereby proving that s1 contains an Invalid UTF-8 sequence! (But s2 is correct).

Remember - \x is used to insert literal bytes. \u inserts characters. All you've done is provide a way to get pre DMD-0.96 behavior out of a DMD-0.96+ compiler.

Arcane Jill
Aug 19 2004
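The code and output from the post above did not survive in this archive. The following is a hedged reconstruction of the idea only, not Jill's original program. It is written against present-day D and Phobos (`std.utf.validate`, `UTFException`) and assumes the compiler accepts \x byte escapes in string literals, as the language specification describes. The names `s1` and `s2` come from the post; `isValidUtf8` is a hypothetical helper.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Reports whether s is well-formed UTF-8.
/// (Hypothetical helper, named for this example only.)
bool isValidUtf8(string s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    string s1 = "\xD6";    // one raw byte, 0xD6: a lead byte with no continuation
    string s2 = "\u00D6";  // the character U+00D6, stored as the bytes 0xC3 0x96

    writeln("s1 valid: ", isValidUtf8(s1));
    writeln("s2 valid: ", isValidUtf8(s2));
    writeln("s2 length in bytes: ", s2.length);
}
```

The point carried over from the post is the distinction itself: \x places a byte in the literal, while \u places a character and lets the compiler produce the UTF-8 encoding.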
I think I will move to UTF-8 with the next version of my program. I can't do it right now, because it needs some rewriting.

The UTF-8 output is not the problem; it's more the UTF-8 input. I need to read the POST data from the user's browser in order to process it. The problem with UTF-8 is that a character can be 1, 2, 3 or even 4 bytes long. I do a lot of text processing and I would need to rewrite, or at least look over, all these functions. But I have a deadline coming...

I wrote my last web application in C++, didn't use UTF-8, and it works fine. I am only writing applications for Estonian people. But you are probably right, I need to move to UTF-8, just not before my next version.

Martin
Aug 19 2004
In article <cg1q02$1c1h$1 digitaldaemon.com>, Martin says...

> The UTF-8 output is not the problem, it's more like UTF-8 input. I need to read the POST data from users browser, to process it.

But I don't think you can make demands on the encoding in which the POST data is going to be presented, can you? You simply have to recognize it, and decode it. If the data is in ISO-whatever, you must decode that; if the data is in MAC-ROMAN, you must decode that; if the data is in UTF-8, you must decode that. And so on.

> The problem with UTF-8 is that a character can be 1,2,3 or even 4 bytes long.

Indeed, but D has lots of handy functions to convert them. And the problem with ISO-8859-1 (Latin-1) is that characters beyond \u00FF are completely unrepresentable. Like, AT ALL. If someone wants to use a lowercase c with an acute accent ('\u0107'), you're completely screwed. UTF-8 is the solution.

> I wrote my last web with C++, didn't use UTF-8, and it works fine.

But only if /you/ compile it. If someone else, with a different default encoding, were to compile the same source code, it may fail badly.

But it's nice to see you're writing for a non-English audience. I'm sure this trend will continue.

Arcane Jill
Aug 19 2004
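As a rough illustration of the decoding step described above, here is a sketch in present-day D that turns ISO-8859-1 input bytes into a UTF-8 string and then compares the byte count with the character count. It relies on the fact that every Latin-1 byte value equals the Unicode code point of its character. The function name `latin1ToUtf8` and the sample data are assumptions; `std.utf.count` is the Phobos function that counts code points.

```d
import std.stdio : writeln;
import std.utf : count;

/// Converts ISO-8859-1 bytes to a UTF-8 string.
/// (Hypothetical helper, named for this example only.)
string latin1ToUtf8(const(ubyte)[] input)
{
    string result;
    foreach (b; input)
    {
        // Appending a dchar to a string encodes it as UTF-8.
        result ~= cast(dchar) b;
    }
    return result;
}

void main()
{
    // "Öö" as it might arrive in ISO-8859-1 form data (made-up sample).
    immutable(ubyte)[] posted = [0xD6, 0xF6];
    string text = latin1ToUtf8(posted);

    writeln("bytes: ", text.length);       // UTF-8 code units
    writeln("characters: ", count(text));  // code points
}
```

Text-processing functions that index by byte are the ones that would need review after such a conversion, since `length` and the character count no longer agree.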
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cg1rba$1fnh$1 digitaldaemon.com...But it's nice to see you're writing for a non-English audience. I'm surethistrend will continue.And that's great, because it helps us identify and shake out the problems with the internationalization support.
Aug 19 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cg1of6$18ss$1 digitaldaemon.com...In article <cg0n3l$1ln6$1 digitaldaemon.com>, Walter says...\xXXYou can also use \xXX to specify the characters, though that is ugly enough to be unusable.Sorry, Walter - that's not right! You should not be encouraging the use ofin this context. This is wrong. Martin needs to be using \uXXXX, not \xXX. Instead of \xD6, he needs to use \u00D6. (Martin, I hope you'relistening).Sticking \x's into a string literal is just another way to create aninvalidUTF-8 sequence. See this code:True, but if they're used to create a ubyte[] sequence (not a char[] sequence) it should work.
Aug 19 2004
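A small sketch of the ubyte[] approach mentioned above, assuming present-day D and Phobos rather than the 2004 compiler. The bytes are written as an array literal instead of \x escapes, and the variable name `greeting` is made up for the example. Because the data is typed as bytes rather than as a string, nothing claims it is UTF-8.

```d
import std.stdio : stdout;

// "ÜÄÖ" followed by a newline, as raw ISO-8859-1 bytes.
immutable(ubyte)[] greeting = [0xDC, 0xC4, 0xD6, 0x0A];

void main()
{
    // rawWrite sends the bytes unchanged, with no text-level conversion.
    stdout.rawWrite(greeting);
}
```

Whether the reader sees the right letters still depends on the receiving side interpreting the bytes as ISO-8859-1.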
In article <cg0ggt$16f3$1 digitaldaemon.com>, Martin says...

> My problem is, that the webserver that I am writing this software for, uses the same encoding.

I think that your statement might need some clarifying. Web servers by definition need to do transcoding. Most programs need a concept of a "run-time encoding" (so they can do printf(), etc.), but the run-time encoding of a web server is no longer limited to that of one particular machine - a web server has to deal with machines all over the internet, each possibly with its own local encoding.

The "Accept-Charset" field in an HTTP request can act as a request from the browser to the server that the web content be delivered in a particular encoding. For example:

When the page is delivered, a web server sends back:

If the encoding is not specified then HTML is supposed to default to ISO-8859-1, but XML (including XHTML) is supposed to default to UTF-8. A web server which doesn't do UTF-8, or which doesn't do transcoding, is all but useless.

That said, you may still be able to get away with it. If you send all your web content in a particular encoding, then, as long as it is marked as such, the user's browser /may/ be able to reinterpret the page (the Accept-Charset request header is supposed to advise you of what the browser can or can't deal with).

So, when you say "the webserver ... uses the same encoding [ISO-8859-1]", I'm still not clear what it uses that encoding /for/. It's the default for HTML, but are you saying your server emits no other encoding? Not even UTF-8? That would be weird. Any chance you could clarify?

> With the old version everything worked fine. Everyone that used the server saw the characters right.

Providing your server emitted "Content-type: text/html; charset=ISO-8859-1" in its response headers, (or just "Content-type: text/html" since ISO-8859-1 is the default for HTML - but that's dangerous, since not all browsers obey the W3C spec), that is likely to be true. But still, you're relying on a parochial character set, and it /is/ possible that some viewers of your server simply won't have that encoding in their browser.

> So can I tell the dmd to use ISO-8859-1, or just not to check the things it shouldn't be checking?

No. You *MUST* save your DMD source files in either ASCII or UTF-8 before attempting to compile them. If you wish to emit output in ISO-8859-1 then you must ISO-8859-1-encode the output at runtime (which is easy - I can show you how to do that).

But why is saving your source file as UTF-8 hard? I've never heard of a modern text editor which can't do it, but if you've discovered one, why not just change to a different text editor?

Nonetheless - if you really can't figure out how to save in UTF-8 (which would be surprising for someone writing a web server, with all the transcoding understanding required thereby), then your only remaining choice is to save as ASCII. You can do this by replacing your non-ASCII characters either by Unicode escape sequences (if you want DMD to interpret them) or HTML entities (if you want the users' browsers to interpret them). So replace as follows:

Hope that helps.

Arcane Jill
Aug 19 2004
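The post above offers to show how to encode output as ISO-8859-1 at runtime, but no such code appears in the thread. The sketch below is one possible way to do it in present-day D, not the method Jill had in mind. The function name `utf8ToLatin1` and the choice of '?' as a substitute for unrepresentable characters are assumptions. The string literal uses \u escapes, so the source file itself stays pure ASCII.

```d
import std.stdio : stdout;

/// Encodes a UTF-8 string as ISO-8859-1 bytes. Characters above U+00FF
/// cannot be represented in Latin-1 and are replaced with '?'.
/// (Hypothetical helper, named for this example only.)
ubyte[] utf8ToLatin1(string text)
{
    ubyte[] result;
    // Iterating with dchar decodes the UTF-8 string one code point at a time.
    foreach (dchar c; text)
    {
        result ~= (c <= 0xFF) ? cast(ubyte) c : cast(ubyte) '?';
    }
    return result;
}

void main()
{
    // "ÜÄÖ" and a newline, written with Unicode escapes.
    string page = "\u00DC\u00C4\u00D6\n";
    stdout.rawWrite(utf8ToLatin1(page));
}
```

With this split, strings stay valid UTF-8 inside the program and the legacy encoding appears only at the output boundary, alongside a matching charset in the Content-type header.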
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfvqq8$2jhu$1 digitaldaemon.com...There's a problem here - which is that you and I are not speaking the same language. An ASCII file is a file which DOES NOT CONTAIN any charactershavingcodepoints outside the range 0x00 to 0x7F. DMD is perfectly happy withASCIIfiles, but your files are not ASCII. Sorry to be pedantic. Your file is /probably/ ISO-8859-1 (aka "Latin 1").Butit's not ASCII.You write well and understand the issues involved. Can I suggest that you write an article about this for, say, CUJ or DDJ? Such an article exploring this topic is sorely needed.
Aug 18 2004
In article <cg0gsg$16u8$2 digitaldaemon.com>, Walter says...

> "Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfvqq8$2jhu$1 digitaldaemon.com...
>> There's a problem here - which is that you and I are not speaking the same language. An ASCII file is a file which DOES NOT CONTAIN any characters having codepoints outside the range 0x00 to 0x7F. DMD is perfectly happy with ASCII files, but your files are not ASCII. Sorry to be pedantic. Your file is /probably/ ISO-8859-1 (aka "Latin 1"). But it's not ASCII.
> You write well and understand the issues involved. Can I suggest that you write an article about this for, say, CUJ or DDJ? Such an article exploring this topic is sorely needed.

Could be fun. So what are CUJ and DDJ? Could someone give me some URLs?

Jill
Aug 19 2004
Arcane Jill wrote:

> In article <cg0gsg$16u8$2 digitaldaemon.com>, Walter:
>> You write well and understand the issues involved. Can I suggest that you write an article about this for, say, CUJ or DDJ? Such an article exploring this topic is sorely needed.
> Could be fun. So what are CUJ and DDJ? Could someone give me some URLs?

CUJ = C User's Journal (or possibly Users'?)
http://www.cuj.com/ (where there's no apostrophe in sight)

DDJ = Dr Dobb's Journal
http://www.ddj.com/

--
Jonathan Leffler #include <disclaimer.h>
Email: jleffler earthlink.net, jleffler us.ibm.com
Guardian of DBD::Informix v2003.04 -- http://dbi.perl.org/
Aug 19 2004
"Jonathan Leffler" <jleffler earthlink.net> wrote in message news:cg1n1p$13qg$1 digitaldaemon.com...Arcane Jill wrote:youIn article <cg0gsg$16u8$2 digitaldaemon.com>, Walter:You write well and understand the issues involved. Can I suggest thatexploringwrite an article about this for, say, CUJ or DDJ? Such an articleYes, they're the two main print publications that C/C++ programmers read. The D articles published by them have been well received, and the publisher (CMP Media) has indicated they want more. And besides, they even pay for articles! Getting published in CUJ or DDJ is fairly prestigious, and will look good on any resume. Many of the top highly paid C++ professionals built their reputation early on by writing articles. Many companies also have a policy of giving a bonus to engineering employees who get published in a magazine, that's worth checking out. So it's really an everybody wins kind of situation.CUJ = C User's Journal (or possibly Users'?) http://www.cuj.com/ (where there's no apostrophe in sight) DDJ = Dr Dobb's Journal http://www.ddj.com/this topic is sorely needed.Could be fun. So what are CUJ and DDJ? Could someone give me some URLs?
Aug 19 2004
In article <cfvlcp$2eob$1 digitaldaemon.com>, Martin says...

> Thank you for your answer!
>> I have a sneaking suspicion you might find it will work just fine if you save your source file in UTF-8 before trying to compile it. (Save As...).
> I am using the gnu midnight commander text editor, it only saves ascii.

If you are on linux you can convert from latin1 to utf8 with the command

    iconv -f latin1 -t utf8 file.d > newfile.d
    dmd newfile.d

You will probably be doing that a lot, so it's best if you can put it in a script or something.

Hope this helps :)

Nick
Aug 18 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cfvjdr$2dr2$1 digitaldaemon.com...justMost of the C compilers accept them, why not D?Actually, I think most C compilers simply allow a string to consist of an arbitrary sequence of bytes without any interpretation whatsoever - whichhappens to appear to work whenever the source file encoding is the same astherun-time encoding.It doesn't always work, some of the code pages include multibyte sequences where " can be the second byte :-(. That's why DMC has special switches for such. This is just the sort of thing I want to move away from.
Aug 18 2004