digitalmars.D.learn - Reading ASCII file with some codes above 127 (exten ascii)
- Paul (8/8) May 13 2012 I am reading a file that has a few extended ASCII codes (e.g.
- Era Scarecrow (42/51) May 13 2012 Same here. I've ended up writing a custom array converter that
- Graham Fawcett (11/20) May 14 2012 This seems to work:
- Paul (2/24) May 17 2012 Awesome! Thanks a million!
- Paul (30/52) May 23 2012 I thought I was in good shape with your above suggestion. I does
- Graham Fawcett (9/68) May 23 2012 I tried the program and it seemed to work for me.
- Paul (16/90) May 23 2012 Hmmm. I'm not communicating well.
- Graham Fawcett (15/110) May 23 2012 To make sure we're on the same page -- ASCII is a 7-bit encoding,
- Paul (2/117) May 23 2012 Exactly.
- H. S. Teoh (12/23) May 23 2012 The safest way is probably to read it as binary data (i.e. byte[]), then
- Paul (7/19) May 23 2012 You mean something like Era has done in the first reply?
- Graham Fawcett (10/135) May 23 2012 This works, though it's ugly:
- Paul (22/31) May 23 2012 Awesome! What a lesson! Thannk you!
- era scarecrow (13/15) May 24 2012 My solution may have a flaw in it's lookup table; namely if I
- Era Scarecrow (8/13) May 15 2016 Well after taking to heart about a gc-less solution and doing a
- Regan Heath (11/46) May 25 2012 The only thing which would worry me about this code is the cast(char[]) ...
I am reading a file that has a few extended ASCII codes (e.g. the degree symbol). Depending on how I read the file in and what I do with it, the error shows up at different points. I'm pretty sure it all boils down to these extended ASCII codes. Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file? I've messed with the std.encoding module but really can't figure out what I need to do. There must be a simple solution to this.
May 13 2012
On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
> I am reading a file that has a few extended ASCII codes (e.g. the degree symbol). [...] Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file? I've messed with the std.encoding module but really can't figure out what I need to do. There must be a simple solution to this.

Same here. I ended up writing a custom array converter: if the input contains any codes of 128 or above, it converts them and returns a new array. Maybe this is wrong, but for me it works.

    import std.utf;
    import std.ascii;

    //conversion table of ascii (latin-1?) to unicode for text compares.
    //only 128-255
    private immutable wchar[] extAscii = [
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
        0x00A0, 0x00A1, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7,
        0x00A8, 0x00A9, 0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
        0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7,
        0x00B8, 0x00B9, 0x00BA, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x00BF,
        0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x00C7,
        0x00C8, 0x00C9, 0x00CA, 0x00CB, 0x00CC, 0x00CD, 0x00CE, 0x00CF,
        0x00D0, 0x00D1, 0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
        0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
        0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x00E7,
        0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
        0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x00F7,
        0x00F8, 0x00F9, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x00FF];

    /**since I can't find a good explanation of conversion, this is custom made.
       if it doesn't need to be converted, it returns the original buffer*/
    char[] ascii2char(ubyte[] input) {
        char[] o;

        foreach(i, b; input) {
            if (b & 0x80) {
                //first high byte seen: start from the plain-ASCII prefix
                if (!o.length)
                    o = cast(char[]) input[0 .. i];
                //append the table entry as UTF-8
                encode(o, extAscii[b - 0x80]);
            } else if (o.length)
                o ~= b;
        }

        return o.length ? o : cast(char[]) input;
    }
May 13 2012
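A note on the table above: its entries for 0x80 to 0x9F are the Windows-1252 ("ANSI") assignments, while in strict ISO 8859-1 every byte value is simply its own Unicode code point. For a file that really is Latin-1, a minimal sketch without any table could look like the following. The name `latin1ToUtf8` is hypothetical and not from the thread; only `std.utf.encode` is an existing Phobos call.

```d
import std.utf : encode;

/// Hypothetical helper (not from the thread): converts ISO 8859-1 bytes
/// to UTF-8. Each Latin-1 byte value is also its Unicode code point,
/// so no lookup table is needed.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    result.reserve(input.length);
    foreach (b; input)
        encode(result, cast(dchar) b); // appends one or two UTF-8 code units
    return result.idup;
}

void main()
{
    // 0xB0 is the degree sign in Latin-1; 0x46 is 'F'.
    assert(latin1ToUtf8([0xB0, 0x46]) == "\u00B0F");
}
```

If the input may contain Windows-1252 punctuation such as curly quotes or the euro sign, a table like the one in the post is still needed for the 0x80 to 0x9F range.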
On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
> I am reading a file that has a few extended ASCII codes (e.g. the degree symbol). [...] Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file? I've messed with the std.encoding module but really can't figure out what I need to do. There must be a simple solution to this.

This seems to work:

    import std.stdio, std.file, std.encoding;

    void main() {
        auto latin = cast(Latin1String) read("/tmp/hi.8859");
        string s;
        transcode(latin, s);
        writeln(s);
    }

Graham
May 14 2012
On Monday, 14 May 2012 at 12:58:20 UTC, Graham Fawcett wrote:
> This seems to work: [...]

Awesome! Thanks a million!
May 17 2012
On Monday, 14 May 2012 at 12:58:20 UTC, Graham Fawcett wrote:
> This seems to work: [...]

I thought I was in good shape with your above suggestion. It does help me read and process text. But when I go to print it out I have problems.

Here is my input file:

    °F

Here is my code:

    import std.stdio;
    import std.string;
    import std.file;
    import std.encoding;

    // Main function
    void main(){
        auto fout = File("out.txt","w");
        auto latinS = cast(Latin1String) read("in.txt");
        string uniS;
        transcode(latinS, uniS);
        foreach(line; uniS.splitLines()){
            transcode(line, latinS);
            fout.writeln(line);
            fout.writeln(latinS);
        }
    }

Here is the output:

    Â°F
    [cast(immutable(Latin1Char))176, cast(immutable(Latin1Char))70]

If I print the Unicode string I get an extra weird character. If I print the Unicode string retranslated to Latin1, I get weird pseudo-code. Can you help?
May 23 2012
On Wednesday, 23 May 2012 at 15:48:20 UTC, Paul wrote:
> [...] If I print the Unicode string I get an extra weird character. If I print the Unicode string retranslated to Latin1, I get weird pseudo-code. Can you help?

I tried the program and it seemed to work for me. What program are you using to read "out.txt"? Are you sure it supports UTF-8, and knows to open the file as UTF-8? (This looks suspiciously like a tool's attempt to misinterpret a UTF-8 string as Latin-1.) If you're on a Unix system, what does "file in.txt out.txt" report?

Graham
May 23 2012
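The "extra weird character" can be reproduced in a few lines. The sketch below takes the UTF-8 encoding of "°F" (bytes C2 B0 46) and deliberately decodes those bytes as if they were Latin-1, which is what a viewer does when it opens a UTF-8 file with the wrong encoding. The function name `misreadAsLatin1` is made up for illustration.

```d
import std.encoding : Latin1String, transcode;

/// Hypothetical helper: shows what a Latin-1 viewer displays when it is
/// handed UTF-8 bytes. Every byte is taken as one Latin-1 character.
string misreadAsLatin1(string utf8)
{
    auto bytes = cast(immutable(ubyte)[]) utf8;   // raw UTF-8 code units
    string shown;
    transcode(cast(Latin1String) bytes, shown);   // wrong on purpose
    return shown;
}

void main()
{
    // "°F" is C2 B0 46 in UTF-8; read as Latin-1 that is "Â", "°", "F".
    assert(misreadAsLatin1("\u00B0F") == "\u00C2\u00B0F");
}
```

So the file written by `fout.writeln(line)` is plausibly fine as UTF-8; it only looks wrong in an editor that assumes a single-byte encoding.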
On Wednesday, 23 May 2012 at 18:04:56 UTC, Graham Fawcett wrote:
> I tried the program and it seemed to work for me. What program are you using to read "out.txt"? Are you sure it supports UTF-8, and knows to open the file as UTF-8? [...]

Hmmm. I'm not communicating well. I want to read and write ASCII. The only reason I'm converting to Unicode is because D needs it (as I understand). Yes, if I open Â°F in notepad++ and tell notepad++ that it is UTF-8, it shows °F.

I want to:
1) Read an ASCII file that may have codes above 127.
2) Convert to Unicode so D funcs like .splitLines() can work with it.
3) Convert back to ASCII so that stuff like °F writes out as it was read in.

If I open in.txt and out.txt in an ASCII editor, °F should look the same in both files with the editor encoding the files as ANSI/ASCII. I thought my program was doing just that.

Thanks for your assistance.
May 23 2012
On Wednesday, 23 May 2012 at 18:43:04 UTC, Paul wrote:
> Hmmm. I'm not communicating well. I want to read and write ASCII. The only reason I'm converting to Unicode is because D needs it (as I understand). [...]

To make sure we're on the same page -- ASCII is a 7-bit encoding, and any character above 127 is by definition not an ASCII character. At that point we're talking about an encoding other than ASCII, such as UTF-8 or Latin-1.

If you're reading a file that has bytes > 127, you really have no choice but to specify (assume?) an encoding, Latin-1 for example. There's no guarantee your input file is Latin-1, though, and garbage-in will result in garbage-out.

So I think what you're trying to do is
1. read a Latin-1 file into Unicode (internally in D)
2. do splitLines(), etc., generating some result
3. convert the result back to Latin-1, and output it.

Is that right?

Graham
May 23 2012
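The three steps above can be sketched without any file I/O, which makes it easy to check that the bytes survive the round trip. This is only an illustration under the assumption that the input really is Latin-1; the name `roundTrip` is hypothetical, and line terminators are normalized to a single "\n" here.

```d
import std.encoding : Latin1String, transcode;
import std.string : splitLines;

/// Hypothetical helper: Latin-1 bytes in, Latin-1 bytes out, with the
/// line processing done on a UTF-8 string in between.
immutable(ubyte)[] roundTrip(immutable(ubyte)[] latin1Bytes)
{
    // 1. Decode: Latin-1 -> UTF-8.
    string text;
    transcode(cast(Latin1String) latin1Bytes, text);

    immutable(ubyte)[] output;
    // 2. Process as Unicode text (here: just split into lines).
    foreach (line; text.splitLines())
    {
        // 3. Encode: UTF-8 -> Latin-1, then append the raw bytes.
        Latin1String encoded;
        transcode(line, encoded);
        output ~= cast(immutable(ubyte)[]) encoded;
        output ~= 0x0A; // newline byte
    }
    return output;
}

void main()
{
    immutable(ubyte)[] input = [0xB0, 0x46, 0x0A]; // "°F\n" in Latin-1
    assert(roundTrip(input) == input);
}
```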
On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
> So I think what you're trying to do is
> 1. read a Latin-1 file into Unicode (internally in D)
> 2. do splitLines(), etc., generating some result
> 3. convert the result back to Latin-1, and output it.
> Is that right?

Exactly.
May 23 2012
On Wed, May 23, 2012 at 09:09:27PM +0200, Paul wrote:
> On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
[...]
>> So I think what you're trying to do is
>> 1. read a Latin-1 file into Unicode (internally in D)
>> 2. do splitLines(), etc., generating some result
>> 3. convert the result back to Latin-1, and output it.
>> Is that right?
> Exactly.

The safest way is probably to read it as binary data (i.e. byte[]), then do the conversion into UTF-8, then process it, and finally convert it back to Latin-1 (in binary form) and output it.

D assumes Unicode internally; if you try to read a Latin-1 file as char[], you may be running into some implicit UTF conversions that are corrupting the data. Best use byte[] for reading/writing, and do conversions to/from UTF-8 internally for processing.

T

-- 
Doubt is a self-fulfilling prophecy.
May 23 2012
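A minimal sketch of the "bytes in, bytes out" approach described above, assuming the input is Latin-1 and small enough to load whole. The function name `copyViaUtf8` and the temporary file names are made up for illustration; `std.file.read` and `std.file.write` handle the data as raw bytes, so nothing interprets it as UTF-8 on the way in or out.

```d
import std.encoding : Latin1String, transcode;
import std.file : read, remove, tempDir, write;
import std.path : buildPath;

/// Hypothetical helper: reads `inPath` as raw Latin-1 bytes, converts to
/// UTF-8 for processing, then writes Latin-1 bytes to `outPath`.
void copyViaUtf8(string inPath, string outPath)
{
    auto raw = cast(immutable(ubyte)[]) read(inPath); // no interpretation yet
    string text;
    transcode(cast(Latin1String) raw, text);          // Latin-1 -> UTF-8

    // ... process `text` as ordinary Unicode here ...

    Latin1String encoded;
    transcode(text, encoded);                         // UTF-8 -> Latin-1
    write(outPath, encoded);                          // raw bytes out
}

void main()
{
    auto inPath = buildPath(tempDir, "latin1_demo_in.txt");
    auto outPath = buildPath(tempDir, "latin1_demo_out.txt");
    immutable(ubyte)[] data = [0xB0, 0x46, 0x0A];     // "°F\n" in Latin-1
    write(inPath, data);

    copyViaUtf8(inPath, outPath);
    assert(cast(immutable(ubyte)[]) read(outPath) == data);

    remove(inPath);
    remove(outPath);
}
```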
> The safest way is probably to read it as binary data (i.e. byte[]), then do the conversion into UTF-8, then process it, and finally convert it back to Latin-1 (in binary form) and output it. D assumes Unicode internally; if you try to read a Latin-1 file as char[], you may be running into some implicit UTF conversions that are corrupting the data. Best use byte[] for reading/writing, and do conversions to/from UTF-8 internally for processing.
> T

You mean something like Era has done in the first reply? If that is so, I have to say I'm really surprised. To write D so it natively expects and outputs Unicode is one thing, but not providing a clean, simple way to read extended ASCII chars (i.e. Latin1) and write them back out seems like an oversight. I think I am (actually, Graham is) close. Thanks for your feedback, HS.
May 23 2012
On Wednesday, 23 May 2012 at 19:09:29 UTC, Paul wrote:
> On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
>> [...] Is that right?
> Exactly.

This works, though it's ugly:

    foreach(line; uniS.splitLines()) {
        transcode(line, latinS);
        fout.writeln((cast(char[]) latinS));
    }

The Latin1String type, at the storage level, is a ubyte[]. By casting to char[], you can get a similar-to-string thing that writeln() can handle.

Graham
May 23 2012
> This works, though it's ugly: [...] The Latin1String type, at the storage level, is a ubyte[]. By casting to char[], you can get a similar-to-string thing that writeln() can handle.
> Graham

Awesome! What a lesson! Thank you!

So if anyone is following this thread, here's my code now. This reads a text file (encoded in Latin1, which is basic ASCII plus the extended codes above 127), allows D to work with it in Unicode, and then spits it back out as Latin1.

I wonder about the speed between this method and Era's home-spun solution?

    import std.stdio;
    import std.string;
    import std.file;
    import std.encoding;

    // Main function
    void main(){
        auto fout = File("out.txt","w");
        auto latinS = cast(Latin1String) read("in.txt");
        string uniS;
        transcode(latinS, uniS);
        foreach(line; uniS.splitLines()){
            transcode(line, latinS);
            fout.writeln((cast(char[]) latinS));
        }
    }
May 23 2012
On Wednesday, 23 May 2012 at 21:02:27 UTC, Paul wrote:
> I wonder about the speed between this method and Era's home-spun solution?

My solution may have a flaw in its lookup table, namely if I got one of the codes wrong. I used regex and a site to reference them all, so I hope it's right. I can't remember for sure, but I think it was from http://www.alanwood.net/demos/ansi.html

The main reason I wrote it was that there were no good explanations, in the documentation or anywhere else, of how to use std.encoding and transcode. This meant I was stuck and needed some simple solution.

I'm not sure if my solution is going to be faster, but it does do minimal object allocation/resizing/abstraction, and tries not to make a new string if it doesn't have to.

Who knows? Perhaps it will be added to Phobos once the table is verified.
May 24 2012
On Thursday, 24 May 2012 at 19:47:06 UTC, era scarecrow wrote:
> On Wednesday, 23 May 2012 at 21:02:27 UTC, Paul wrote:
>> I wonder about the speed between this method and Era's home-spun solution?
> Who knows? Perhaps it will be added to Phobos once the table is verified.

Well, after taking to heart the idea of a GC-less solution and writing it as an input range, I rewrote the entire thing. Of course, to make it even faster and simpler, a full lookup table conversion is used instead. Further reduction has turned it into a very tiny, simple filter. Curiously, looking at it again, very few of the codes actually require special attention.

If there's still any interest in this I can release it.
May 15 2016
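The rewritten code was never posted in this thread, so the following is only a guess at the general shape of a lazy, allocation-free conversion, not Era's implementation. It assumes strict ISO 8859-1 input, where the byte value is the code point; a Windows-1252 variant would index a 256-entry table inside the lambda instead. The name `latin1Chars` is hypothetical.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : map;

/// Hypothetical sketch: lazily views Latin-1 bytes as Unicode code points.
/// Nothing is copied; each dchar is produced on demand.
auto latin1Chars(const(ubyte)[] input)
{
    // In ISO 8859-1 the byte value is the code point.
    return input.map!(b => cast(dchar) b);
}

void main()
{
    immutable(ubyte)[] raw = [0xB0, 0x46];      // "°F" in Latin-1
    assert(equal(latin1Chars(raw), "\u00B0F")); // compared code point by code point
}
```

Because the result is a range of `dchar`, it can be passed to range-based algorithms directly, or collected into a string only where one is actually needed.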
On Wed, 23 May 2012 22:02:25 +0100, Paul <phshaffer gmail.com> wrote:
>> This works, though it's ugly: [...]
> Awesome! What a lesson! Thank you!
> So if anyone is following this thread, here's my code now. [...]
>     foreach(line; uniS.splitLines()){
>         transcode(line, latinS);
>         fout.writeln((cast(char[]) latinS));
>     }

The only thing which would worry me about this code is the cast(char[]) in the final writeln. I know some parts of Phobos verify that char data is correct UTF-8, and this line casts Latin-1 to char[], which can potentially create invalid UTF-8 data.

That said, I had a really quick look at the Phobos code for File.writeln and I'm not sure whether this function does any UTF-8 validation.

I would be happier if the Latin-1 was written as a stream of bytes with no assumed interpretation, IMO.

R
May 25 2012
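A minimal sketch of that suggestion, using `File.rawWrite` so the Latin-1 data is written as plain bytes and never passes through a `char[]`. The helper name `writeLatin1Line` is hypothetical. The `main` below also shows why the cast is a concern: the Latin-1 bytes for "°F" are not valid UTF-8, so `std.utf.validate` rejects them.

```d
import std.encoding : Latin1String, transcode;
import std.exception : assertThrown;
import std.stdio : File;
import std.utf : UTFException, validate;

/// Hypothetical helper: encodes one line as Latin-1 and writes the bytes
/// as-is, followed by a newline byte.
void writeLatin1Line(File fout, string line)
{
    Latin1String latinS;
    transcode(line, latinS);   // UTF-8 -> Latin-1
    fout.rawWrite(latinS);     // bytes, no text interpretation
    fout.rawWrite("\n");
}

void main()
{
    Latin1String latinS;
    transcode("\u00B0F", latinS);               // Latin-1 bytes: B0 46

    // Reinterpreted as char[], the data is invalid UTF-8:
    // 0xB0 is a continuation byte with no lead byte before it.
    auto asChars = cast(const(char)[]) latinS;
    assertThrown!UTFException(validate(asChars));

    auto fout = File("out.txt", "wb");
    writeLatin1Line(fout, "\u00B0F");
}
```

In the loop from the thread, this would replace the `fout.writeln((cast(char[]) latinS))` line with a single call to `writeLatin1Line(fout, line)`.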