digitalmars.D.learn - UTF8 Encoding: Again
- jicman (31/31) Aug 09 2005 Greetings!
- Stefan (11/45) Aug 09 2005 Seems correct to me: 'août' is 61 6f c3 bb 74 in binary which will be
- Stefan (6/15) Aug 09 2005 Just realized that Notepad on XP works even without the BOM now.
- jicman (10/27) Aug 09 2005 Thanks for the help, Stefan. The problem is much deeper than that. I g...
- Stefan Zobel (6/15) Aug 09 2005 Are you sure you're reading/interpreting the file as UTF8 (and it actual...
- jicman (14/25) Aug 09 2005 Well, here is a question: do I have to change the data that I work with?...
- Derek Parnell (19/40) Aug 09 2005 Technically, if it contains accented characters it is *not* ASCII. It is
- Regan Heath (19/40) Aug 09 2005 Yes and No.
- jicman (4/125) Aug 10 2005 WOW! I thought everything was over about the subject, and then, I hit t...
- Carlos Santander (5/49) Aug 09 2005 It must be something with the editor or with the console, because I just...
- Carlos Santander (4/8) Aug 10 2005 That should've been 0.15
Greetings! Sorry about this, but I have found a wall with UTF8, again. Perhaps some of you may be able to help me jump it or break through it. I have this code:

   char[] GetMonthDigit(char[] mon)
   {
      char[][char[]] sMon;
      sMon["août"] = "08";
      return sMon[mon];
   }

If I call this from within the program,

   char[] mon = GetMonthDigit("août");

mon will have "08". However, if the argument comes from another text file, say an ASCII file, it fails. So, if you take a look at this small piece of code,

   import std.stream;

   void main()
   {
      char[] fn = "test.log";
      char[] txt = "août";
      File log = new File(fn, FileMode.Append);
      log.writeLine(txt);
      log.close();
   }

this will create a file called test.log (if it does not exist) after compile and run, and it will write, supposedly, the content of txt. However, when you open the test.log file, its content is

   aoÃ»t

Hmmmm... I tried setting the editor settings to UTF8, and others, but nothing has worked. Any ideas how I can fix this?

thanks,

josé
Aug 09 2005
In article <ddb1bt$bcp$1 digitaldaemon.com>, jicman says...
> ... However, when you open the test.log file, its content is aoÃ»t

Seems correct to me: 'août' is 61 6f c3 bb 74 as UTF8 bytes, which will be interpreted as 'aoÃ»t' in ASCII mode. So I think the editor that you use to view the file is the problem. A lot of editors (e.g. Notepad) need a 'magic' byte order mark (BOM) at the beginning of the file to be able to recognize a UTF8 file. For example, the (Notepad) BOM for UTF8 is ef bb bf. Delete those bytes from a UTF8 file with a hex editor and it'll give you the same output.

Anyway, you have to make sure that you read/interpret your UTF8 file as UTF8, not ASCII.

HTH,
Stefan
Aug 09 2005
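A minimal sketch of the BOM idea Stefan describes, reusing the std.stream code from the original post. FileMode.OutNew and the use of the U+FEFF character to produce the BOM bytes are assumptions here, not something taken from the thread or tested against the compiler of that era:

   import std.stream;

   void main()
   {
      char[] txt = "août";
      // OutNew creates/truncates the file, so the BOM lands at the very start.
      File log = new File("test.log", FileMode.OutNew);
      // U+FEFF encodes in UTF8 as the three bytes ef bb bf, i.e. the BOM.
      log.writeString("\uFEFF");
      log.writeLine(txt);
      log.close();
   }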
In article <ddb5u6$fmn$1 digitaldaemon.com>, Stefan says...
> A lot of editors (e.g. Notepad) need a 'magic' byte order mark (BOM) at the beginning of the file to be able to recognize a UTF8 file. For example, the (Notepad) BOM for UTF8 is ef bb bf. Delete those bytes from a UTF8 file with a hex editor and it'll give you the same output.

Just realized that Notepad on XP works even without the BOM now. You can reproduce it with WordPad, however. Any decent XML editor should be able to display correctly without BOMs.

Best regards,
Stefan
Aug 09 2005
Stefan says...
> Just realized that Notepad on XP works even without the BOM now. You can reproduce it with WordPad, however. Any decent XML editor should be able to display correctly without BOMs.

Thanks for the help, Stefan. The problem is much deeper than that. I get an input string from a file which contains certain words, say "août", that I need to test on. The problem is that even though I have a test in the program, which displays correctly, the test never matches, because the input data, "août", and the test data inside the program, "août", never match. I guess I still don't understand UTF. :-o I will have to rewrite this whole check function because of this.

Thanks for the help.

josé
Aug 09 2005
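One quick way to see why the two strings never match even though they display identically is to dump their bytes. A small sketch, assuming std.stdio's writef is available (as later posts in this thread imply):

   import std.stdio;

   void main()
   {
      char[] fromSource = "août";   // a D string literal is stored as UTF8
      foreach (ubyte b; cast(ubyte[]) fromSource)
         writef("%02x ", b);        // prints: 61 6f c3 bb 74
      writefln("");
      // A Latin-1/CP1252 text file holds the same word as 61 6f fb 74,
      // so a byte-for-byte comparison against the literal can never succeed.
   }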
In article <ddb989$jkm$1 digitaldaemon.com>, jicman says...
> Thanks for the help, Stefan. The problem is much deeper than that. I get an input string from a file which contains certain words, say "août", that I need to test on. The problem is that even though I have a test in the program, which displays correctly, the test never matches, because the input data, "août", and the test data inside the program, "août", never match.

Are you sure you're reading/interpreting the file as UTF8 (and that it actually is UTF8 encoded)?

Nevertheless, good luck! If there's something to learn from your investigations, let us other D newbies know ;-)

Best regards,
Stefan
Aug 09 2005
Stefan Zobel says...
> Are you sure you're reading/interpreting the file as UTF8 (and that it actually is UTF8 encoded)?

Well, here is a question: do I have to change the data that I work with? For example, I have 200+ files with text data which contain ASCII data with many different accented characters. Do I need to change this input data to UTF8 to be able to work with it? I know that I have to save the source code files as UTF8, but do I also have to change the other text files that I work with to UTF8? That is my problem. I am saving the source files ok, but the input that I read from text files does not match the source code. Again, do I need to change that input data to UTF8?

> Nevertheless, good luck! If there's something to learn from your investigations, let us other D newbies know ;-)

There is nothing to learn. ;-) All I am going to do is to change any character higher than 127 to '+'. :-) That's how I have been able to work with this UTF stuff. :-)

thanks,
josé
Aug 09 2005
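A minimal sketch of that '+' workaround, assuming the file has already been read in as raw bytes (the function name is just illustrative, not from the thread):

   // Replace every byte above 127 with '+' so the result is plain ASCII
   // and therefore also valid UTF8.
   char[] stripHighBytes(ubyte[] raw)
   {
      char[] result = new char[raw.length];
      foreach (int i, ubyte b; raw)
         result[i] = (b > 127) ? '+' : cast(char) b;
      return result;
   }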
On Wed, 10 Aug 2005 02:57:12 +0000 (UTC), jicman wrote:
> Well, here is a question: do I have to change the data that I work with? For example, I have 200+ files with text data which contain ASCII data with many different accented characters. Do I need to change this input data to UTF8 to be able to work with it?

Technically, if it contains accented characters it is *not* ASCII. It is some other form of character encoding. For example, my Windows XP has Code Page 850 set for the DOS console ( http://en.wikipedia.org/wiki/Code_page_850 ).

You would need to find out which character encoding standard was used in your file, then read the file in as a stream of *bytes*, not chars, and convert each of the byte values into the equivalent Unicode character. You could then use UTF8 "char[]", UTF16 "wchar[]", or UTF32 "dchar[]" as your preferred coding in your program.

Also have a look at http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues for further help.

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
10/08/2005 1:03:43 PM
Aug 09 2005
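If the files turn out to be ISO 8859-1 (Latin-1), the byte-to-Unicode step Derek describes is trivial, because each Latin-1 byte value equals its Unicode code point. A sketch under that assumption (std.file.read and the file name are illustrative, not from the thread):

   import std.file;
   import std.utf;

   // Latin-1 byte value == Unicode code point, so widen each byte to a dchar
   // and let std.utf re-encode the result as UTF8.
   char[] latin1ToUTF8(ubyte[] raw)
   {
      dchar[] tmp = new dchar[raw.length];
      foreach (int i, ubyte b; raw)
         tmp[i] = b;
      return toUTF8(tmp);
   }

   void main()
   {
      ubyte[] raw = cast(ubyte[]) std.file.read("input.log");  // hypothetical name
      char[] text = latin1ToUTF8(raw);
      // text is now valid UTF8 and can be compared against literals like "août".
   }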
On Wed, 10 Aug 2005 02:57:12 +0000 (UTC), jicman <jicman_member pathlink.com> wrote:
> Well, here is a question: do I have to change the data that I work with? For example, I have 200+ files with text data which contain ASCII data with many different accented characters. Do I need to change this input data to UTF8 to be able to work with it?

Yes and no.

As Derek said, if it has characters above 127 it's not ASCII; see:
http://www.columbia.edu/kermit/csettables.html

I suspect your data is "Microsoft Windows Code Page 1252" or "ISO 8859-1 Latin Alphabet 1", which are very similar. To figure it out, open the text file in a binary editor, check the value of an accented character, and compare it to the tables in the link above.

You can read and write these non-UTF characters into a char[] etc., provided you don't use writef or any other routine that actually checks whether the characters are valid UTF; i.e. writeString works, but writef will give an exception. If you want to compare the data to a static string, you'll need to convert the data to UTF.

I have a small module (attached as cp1252.d) which will convert Windows code page 1252 into UTF8, 16, and 32 and back again (though the "back again" direction is totally untested). I needed it for much the same thing as you do. This code is public domain.

Regan
Aug 09 2005
WOW! I thought everything was over about the subject, and then I hit the newsgroup again and BOOM, all of these nice responses. :-) Thanks folks.

jic

Regan Heath says...
> I suspect your data is "Microsoft Windows Code Page 1252" or "ISO 8859-1 Latin Alphabet 1", which are very similar. To figure it out, open the text file in a binary editor, check the value of an accented character, and compare it to the tables in the link above.
>
> If you want to compare the data to a static string, you'll need to convert the data to UTF. I have a small module (attached as cp1252.d) which will convert Windows code page 1252 into UTF8, 16, and 32 and back again (though the "back again" direction is totally untested). I needed it for much the same thing as you do. This code is public domain.

[attachment: cp1252.d]

   module cp1252;

   import std.utf;

   char[] cp1252toUTF8(ubyte[] raw)
   {
      return toUTF8(cp1252toUTF16(raw));
   }

   wchar[] cp1252toUTF16(ubyte[] raw)
   {
      wchar[] result;
      result.length = raw.length;
      foreach (int i, ubyte b; raw)
      {
         // 0x00-0x7F map to themselves; everything from 0x80 up
         // (including the euro sign at 0x80) goes through the table.
         if (b < 0x80)
            result[i] = b;
         else
            result[i] = table[b];
      }
      return result;
   }

   dchar[] cp1252toUTF32(ubyte[] raw)
   {
      return toUTF32(cp1252toUTF16(raw));
   }

   // CP1252 byte value -> Unicode code point (0x0000 marks unassigned slots).
   ushort[] table =
   [
      0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
      0x0008, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x000E, 0x000F,
      0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017,
      0x0018, 0x0019, 0x001A, 0x001B, 0x001C, 0x001D, 0x001E, 0x001F,
      0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
      0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
      0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
      0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
      0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047,
      0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
      0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
      0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
      0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067,
      0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
      0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077,
      0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x007F,
      0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
      0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
      0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
      0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178,
      0x00A0, 0x00A1, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7,
      0x00A8, 0x00A9, 0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
      0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7,
      0x00B8, 0x00B9, 0x00BA, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x00BF,
      0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x00C7,
      0x00C8, 0x00C9, 0x00CA, 0x00CB, 0x00CC, 0x00CD, 0x00CE, 0x00CF,
      0x00D0, 0x00D1, 0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
      0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
      0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x00E7,
      0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
      0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x00F7,
      0x00F8, 0x00F9, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x00FF
   ];

   /* UNTESTED! */
   ubyte[] UTF16toCP1252(char[] raw)
   {
      return UTF16toCP1252(toUTF16(raw));
   }

   ubyte[] UTF16toCP1252(wchar[] raw)
   {
      ubyte[] result;
      result.length = raw.length;
      foreach (int i, wchar c; raw)
      {
         foreach (int j, ushort s; table)
         {
            if (c == s)
            {
               // The CP1252 byte is the table index, not the table value.
               result[i] = cast(ubyte) j;
               goto found;
            }
         }
         throw new Exception("Data cannot be encoded");
         found: ;
      }
      return result;
   }

   ubyte[] UTF16toCP1252(dchar[] raw)
   {
      return UTF16toCP1252(toUTF16(raw));
   }
Aug 10 2005
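A short usage sketch for Regan's module, assuming the input really is CP1252 (std.file.read and the file name are illustrative, not from the thread):

   import std.file;
   import cp1252;

   void main()
   {
      // Convert the raw CP1252 bytes once; after that, ordinary UTF8
      // comparisons against literals such as "août" will match.
      ubyte[] raw = cast(ubyte[]) std.file.read("input.log");  // hypothetical name
      char[] text = cp1252toUTF8(raw);
   }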
jicman wrote:
> Hmmmm... I tried setting the editor settings to UTF8, and others, but nothing has worked. Any ideas how I can fix this?

It must be something with the editor or with the console, because I just tried it with gdc-0.13 on Mac and it worked ok.

--
Carlos Santander Bernal
Aug 09 2005
Carlos Santander wrote:
> It must be something with the editor or with the console, because I just tried it with gdc-0.13 on Mac and it worked ok.

That should've been gdc-0.15.

--
Carlos Santander Bernal
Aug 10 2005