digitalmars.D.learn - Reading ASCII file with some codes above 127 (exten ascii)

"Paul" <phshaffer gmail.com> writes:

I am reading a file that has a few extended ASCII codes (e.g. 
degree symbol). Depending on how I read the file in and what I do 
with it, the error shows up at different points.  I'm pretty sure 
it all boils down to these extended ASCII codes.

Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file? 
I've messed with the std.encoding module but really can't figure 
out what I need to do.

There must be a simple solution to this.

May 13 2012

"Era Scarecrow" <rtcvb32 yahoo.com> writes:

  Same here. I've ended up writing a custom array converter that, 
if there are any 128+ codes, converts them and returns a new array. 
Maybe this is wrong, but for me it works.

import std.utf;
import std.ascii;

// Conversion table of ASCII (Latin-1?) to Unicode for text compares.
// Covers only 128-255.
private immutable wchar[] extAscii = [
   0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
   0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
   0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
   0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
   0x00A0, 0x00A1, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7,
   0x00A8, 0x00A9, 0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
   0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7,
   0x00B8, 0x00B9, 0x00BA, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x00BF,
   0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x00C7,
   0x00C8, 0x00C9, 0x00CA, 0x00CB, 0x00CC, 0x00CD, 0x00CE, 0x00CF,
   0x00D0, 0x00D1, 0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
   0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
   0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x00E7,
   0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
   0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x00F7,
   0x00F8, 0x00F9, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x00FF];

/** Since I can't find a good explanation of conversion, this is custom made.
    If the input doesn't need to be converted, it returns the original buffer. */
char[] ascii2char(ubyte[] input) {
   char[] o;

   foreach(i, b; input) {
     if (b & 0x80) {
       if (!o.length)
         o = cast(char[]) input[0 .. i];

       encode(o, extAscii[b - 0x80]);
     } else if (o.length)
       o ~= b;
   }

   return o.length ? o : cast(char[]) input;
}

May 13 2012

"Graham Fawcett" <fawcett uwindsor.ca> writes:

This seems to work:


import std.stdio, std.file, std.encoding;

void main()
{
     auto latin = cast(Latin1String) read("/tmp/hi.8859");
     string s;
     transcode(latin, s);
     writeln(s);
}


Graham

May 14 2012

"Paul" <phshaffer gmail.com> writes:

Awesome! Thanks a million!

May 17 2012

"Paul" <phshaffer gmail.com> writes:

I thought I was in good shape with your above suggestion.  It does 
help me read and process text.  But when I go to print it out I 
have problems.

Here is my input file:
°F

Here is my code:
import std.stdio;
import std.string;
import std.file;
import std.encoding;

// Main function
void main(){
     auto fout = File("out.txt","w");
     auto latinS = cast(Latin1String) read("in.txt");
     string uniS;
     transcode(latinS, uniS);
     foreach(line; uniS.splitLines()){
        transcode(line, latinS);
        fout.writeln(line);
        fout.writeln(latinS);
     }
}

Here is the output:
Â°F
[cast(immutable(Latin1Char))176, cast(immutable(Latin1Char))70]

If I print the Unicode string I get an extra weird character.  If 
I print the Unicode string retranslated to Latin1, I get weird 
pseudo-code.
Can you help?

May 23 2012

"Graham Fawcett" <fawcett uwindsor.ca> writes:

I tried the program and it seemed to work for me.

What program are you using to read "out.txt"? Are you sure it 
supports UTF-8, and knows to open the file as UTF-8? (This looks 
suspiciously like a tool's attempt to misinterpret a UTF-8 string 
as Latin-1.)

If you're on a Unix system, what does "file in.txt out.txt" 
report?
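
To see why the degree sign turns into two characters, here is a minimal sketch. It is not taken from the programs in this thread, and the helper name latin1ToUtf8 is made up for illustration. It assumes the input bytes really are Latin-1 and uses only transcode from std.encoding. The degree sign is a single byte (0xB0) in Latin-1 but two bytes (0xC2 0xB0) in UTF-8, and a viewer that decodes those two bytes as Latin-1 shows "Â°".

```d
import std.encoding : Latin1String, transcode;
import std.stdio : writeln;
import std.string : representation;

/// Hypothetical helper: decode Latin-1 bytes into a UTF-8 string.
string latin1ToUtf8(immutable(ubyte)[] bytes)
{
    string result;
    transcode(cast(Latin1String) bytes, result);
    return result;
}

void main()
{
    // "°F" as stored in a Latin-1 file: 0xB0 is the degree sign, 0x46 is 'F'.
    immutable(ubyte)[] latinBytes = [0xB0, 0x46];
    string utf8 = latin1ToUtf8(latinBytes);

    // In UTF-8 the degree sign (U+00B0) occupies two bytes, 0xC2 0xB0.
    immutable(ubyte)[] expected = [0xC2, 0xB0, 0x46];
    assert(utf8.representation == expected);

    // A viewer that assumes Latin-1 treats each of those bytes as one character.
    string misread = latin1ToUtf8(utf8.representation);
    assert(misread == "\u00C2\u00B0F"); // the "Â°F" seen in out.txt
    writeln(misread);
}
```

So the UTF-8 output file is likely fine; it only looks wrong when the viewer guesses the wrong encoding.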

Graham

May 23 2012

"Paul" <phshaffer gmail.com> writes:

Hmmm.  I'm not communicating well.
I want to read and write ASCII.  The only reason I'm converting 
to Unicode is because D needs it (as I understand).

Yes if I open Â°F in notepad++ and tell notepad++ that it is 
UTF-8, it shows °F.

I want to:
1) Read an ascii file that may have codes above 127.
2) Convert to unicode so D funcs like .splitLines() can work with 
it.
3) Convert back to ascii so that stuff like °F writes out as it 
was read in.

If I open in.txt and out.txt in an ascii editor, °F should look 
the same in both files with the editor encoding the files as 
ANSI/ASCII.  I thought my program was doing just that.
Thanks for your assistance.

May 23 2012

"Graham Fawcett" <fawcett uwindsor.ca> writes:

To make sure we're on the same page -- ASCII is a 7-bit encoding, 
and any character above 127 is by definition not an ASCII 
character. At that point we're talking about an encoding other 
than ASCII, such as UTF-8 or Latin-1.

If you're reading a file that has bytes > 127, you really have no 
choice but to specify (assume?) an encoding, Latin-1 for example. 
There's no guarantee your input file is Latin-1, though, and 
garbage-in will result in garbage-out.

So I think what you're trying to do is

1. read a Latin-1 file, into unicode (internally in D)
2. do splitLines(), etc., generating some result
3. Convert the result back to latin-1, and output it.

Is that right?
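
The three steps listed above can be sketched entirely in memory as follows. This is only an illustration under the assumption that the input is Latin-1; the function name processLatin1 is hypothetical, and the "processing" is just splitLines().

```d
import std.encoding : Latin1String, transcode;
import std.string : splitLines;

/// Hypothetical helper: Latin-1 bytes in, Latin-1 bytes out, Unicode in between.
immutable(ubyte)[] processLatin1(immutable(ubyte)[] input)
{
    // 1. Latin-1 -> Unicode (a UTF-8 string inside D).
    string text;
    transcode(cast(Latin1String) input, text);

    immutable(ubyte)[] output;
    // 2. Work on the text with ordinary string functions.
    foreach (line; text.splitLines())
    {
        // 3. Unicode -> Latin-1 for each resulting line.
        Latin1String latinLine;
        transcode(line, latinLine);
        output ~= cast(immutable(ubyte)[]) latinLine;
        output ~= 0x0A; // splitLines drops the terminator, so add '\n' back
    }
    return output;
}

void main()
{
    immutable(ubyte)[] input = [0xB0, 0x46, 0x0A, 0x41]; // "°F\nA" in Latin-1
    immutable(ubyte)[] expected = [0xB0, 0x46, 0x0A, 0x41, 0x0A];
    assert(processLatin1(input) == expected);
}
```

Note that this sketch normalizes every line ending to a single '\n', which may or may not be what you want for files with CR LF endings.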
Graham

May 23 2012

"Paul" <phshaffer gmail.com> writes:

Exactly.

May 23 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, May 23, 2012 at 09:09:27PM +0200, Paul wrote:
 On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:

[...]
So I think what you're trying to do is

1. read a Latin-1 file, into unicode (internally in D)
2. do splitLines(), etc., generating some result
3. Convert the result back to latin-1, and output it.

Is that right?
Graham

 
 Exactly.

The safest way is probably to read it as binary data (i.e. byte[]), then
do the conversion into UTF8, then process it, and finally convert it
back to latin-1 (in binary form) and output it.

D assumes Unicode internally; if you try to read a Latin-1 file as
char[], you may be running into some implicit UTF conversions that are
corrupting the data. Best use byte[] for reading/writing, and do
conversions to/from UTF-8 internally for processing.
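
A minimal sketch of that approach, assuming the input file is Latin-1: the file is read and written as raw bytes with std.file.read and std.file.write, and text only exists as UTF-8 in between. The function name copyLatin1Lines, the temporary file names, and the stripRight() processing step are made up for illustration.

```d
import std.encoding : Latin1String, transcode;
import std.file : read, remove, tempDir, write;
import std.path : buildPath;
import std.string : splitLines, stripRight;

/// Hypothetical helper: bytes in, UTF-8 for processing, bytes out.
void copyLatin1Lines(string inPath, string outPath)
{
    // Read raw bytes; nothing is decoded at this point.
    auto raw = cast(immutable(ubyte)[]) read(inPath);

    // Decode to UTF-8 for processing.
    string text;
    transcode(cast(Latin1String) raw, text);

    // Example processing step: drop trailing whitespace on each line.
    string processed;
    foreach (line; text.splitLines())
        processed ~= line.stripRight() ~ "\n";

    // Encode back to Latin-1 and write the bytes unchanged.
    Latin1String latin;
    transcode(processed, latin);
    write(outPath, cast(const(ubyte)[]) latin);
}

void main()
{
    auto inPath = buildPath(tempDir, "latin1_in.txt");
    auto outPath = buildPath(tempDir, "latin1_out.txt");
    scope (exit) { remove(inPath); remove(outPath); }

    immutable(ubyte)[] input = [0xB0, 0x46, 0x20, 0x0A]; // "°F \n" in Latin-1
    write(inPath, input);
    copyLatin1Lines(inPath, outPath);

    immutable(ubyte)[] expected = [0xB0, 0x46, 0x0A];
    assert(cast(const(ubyte)[]) read(outPath) == expected);
}
```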


T

-- 
Doubt is a self-fulfilling prophecy.

May 23 2012

"Paul" <phshaffer gmail.com> writes:

You mean something like Era has done in the first reply?

If that is so I have to say I'm really surprised.  To write D so 
it natively expects and outputs Unicode is one thing, but not 
making a clean simple way to read extended ASCII chars (i.e. 
Latin1) and write them back out seems like an oversight.

I think I am (actually, Graham is) close.
Thanks for your feedback HS.

May 23 2012

"Graham Fawcett" <fawcett uwindsor.ca> writes:

This works, though it's ugly:


     foreach(line; uniS.splitLines()) {
        transcode(line, latinS);
        fout.writeln((cast(char[]) latinS));
     }

The Latin1String type, at the storage level, is a ubyte[]. By 
casting to char[], you can get a similar-to-string thing that 
writeln() can handle.

Graham

May 23 2012

"Paul" <phshaffer gmail.com> writes:

Awesome!  What a lesson! Thank you!

So if anyone is following this thread, here's my code now.  This 
reads a text file (encoded in Latin1, which is basic ASCII with 
extended ASCII codes), allows D to work with it in Unicode, and 
then spits it back out as Latin1.

I wonder how the speed of this method compares with Era's 
home-spun solution?

import std.stdio;
import std.string;
import std.file;
import std.encoding;

// Main function
void main(){
     auto fout = File("out.txt","w");
     auto latinS = cast(Latin1String) read("in.txt");
     string uniS;
     transcode(latinS, uniS);
     foreach(line; uniS.splitLines()){
        transcode(line, latinS);
        fout.writeln((cast(char[]) latinS));
     }
}

May 23 2012

"era scarecrow" <rtcvb32 yahoo.com> writes:

On Wednesday, 23 May 2012 at 21:02:27 UTC, Paul wrote:
 I wonder about the speed between this method and Era's 
 home-spun solution?

  My solution may have a flaw in its lookup table, namely if I 
got one of the codes wrong. I used regex and a site to reference 
them all, so I hope it's right. I can't remember for sure, but I 
think it was from http://www.alanwood.net/demos/ansi.html

  The main reason I wrote it was that there was no good explanation 
in the documentation or anywhere else of how to use std.encoding 
and transcode. This meant I was stuck and needed some simple 
solution. I'm not sure if my solution is going to be faster, but 
it does do minimal object allocation/resizing/abstraction, and 
tries not to make a new string if it doesn't have to.

  Who knows? Perhaps it will be added to Phobos once the table is 
verified.

May 24 2012

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Thursday, 24 May 2012 at 19:47:06 UTC, era scarecrow wrote:
 On Wednesday, 23 May 2012 at 21:02:27 UTC, Paul wrote:
 I wonder about the speed between this method and Era's 
 home-spun solution?

  Who knows? Perhaps it will be added to phobos once the table 
 is verified.

  Well, after taking to heart the idea of a GC-less solution and 
writing it as an input range, I rewrote the entire thing. Of course, 
to make it even faster/simpler, a full lookup table conversion is 
used instead. Further reduction has made a very tiny, simple filter.

  Curiously, looking at it again, there are actually very few codes 
that really require special attention. If there's still any 
interest in this I can release it.
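
The rewritten filter itself was not posted in this thread. What follows is only a guess at the general shape of such a lazy, allocation-free decoder, not Era's code. It assumes that the 32 values for bytes 0x80-0x9F from the table earlier in the thread (which match Windows-1252) are the only ones needing a lookup, since every other byte already equals its Unicode code point. The names highCodes, decodeByte and decoded are hypothetical.

```d
import std.algorithm.iteration : map;
import std.stdio : writeln;

// Bytes 0x80-0x9F, copied from the first four rows of the table earlier in
// the thread. Plain ISO 8859-1 would instead map these bytes to the C1
// control codes with the same numeric values.
private immutable wchar[32] highCodes = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178];

/// Hypothetical helper: map one input byte to its Unicode code point.
dchar decodeByte(ubyte b) pure nothrow @nogc @safe
{
    if (b >= 0x80 && b <= 0x9F)
        return highCodes[b - 0x80];
    return cast(dchar) b; // ASCII and 0xA0-0xFF keep their numeric value
}

/// Lazily decodes a byte slice; no intermediate array is allocated.
auto decoded(const(ubyte)[] input)
{
    return input.map!decodeByte;
}

void main()
{
    immutable(ubyte)[] raw = [0xB0, 0x46, 0x20, 0x80]; // "°F €" in Windows-1252
    string s;
    foreach (dchar c; decoded(raw))
        s ~= c; // appending a dchar to a string encodes it as UTF-8
    assert(s == "\u00B0F \u20AC");
    writeln(s);
}
```

The range returned by decoded() does no allocation by itself; the string built in main() is only there to show the result.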

May 15 2016

"Regan Heath" <regan netmail.co.nz> writes:

On Wed, 23 May 2012 22:02:25 +0100, Paul <phshaffer gmail.com> wrote:

 import std.stdio;
 import std.string;
 import std.file;
 import std.encoding;

 // Main function
 void main(){
      auto fout = File("out.txt","w");
      auto latinS = cast(Latin1String) read("in.txt");
      string uniS;
      transcode(latinS, uniS);
      foreach(line; uniS.splitLines()){
         transcode(line, latinS);
         fout.writeln((cast(char[]) latinS));
      }
 }

The only thing which would worry me about this code is the cast(char[]) in 
the final writeln. I know some parts of Phobos verify that char data is 
correct UTF-8, and this line casts Latin-1 to char[], which can potentially 
create invalid UTF-8 data.  That said, I had a really quick look at the 
Phobos code for File.writeln and I'm not sure whether this function does 
any UTF-8 validation.  I would be happier if the Latin-1 was written as a 
stream of bytes with no assumed interpretation, IMO.
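
A minimal sketch of that idea: it first shows that the casted data fails std.utf.validate, then writes the Latin-1 bytes with File.rawWrite so they are never treated as text. The helper name writeLatin1Line is made up for illustration, and the file name out.txt simply follows the thread. Whether File.writeln itself validates is not checked here.

```d
import std.encoding : Latin1String, transcode;
import std.file : read;
import std.stdio : File;
import std.utf : UTFException, validate;

/// Hypothetical helper: write one line as Latin-1 bytes, bypassing text output.
void writeLatin1Line(File f, string line)
{
    Latin1String latin;
    transcode(line, latin);
    f.rawWrite(latin); // raw bytes, no interpretation as UTF-8
    f.rawWrite("\n");
}

void main()
{
    Latin1String latin;
    transcode("\u00B0F", latin); // Latin-1 bytes 0xB0 0x46

    // Reinterpreted as char[], the data is not valid UTF-8: 0xB0 is a
    // continuation byte with no lead byte in front of it.
    bool rejected = false;
    try
        validate(cast(const(char)[]) latin);
    catch (UTFException e)
        rejected = true;
    assert(rejected);

    auto fout = File("out.txt", "wb");
    writeLatin1Line(fout, "\u00B0F");
    fout.close();

    immutable(ubyte)[] expected = [0xB0, 0x46, 0x0A];
    assert(cast(const(ubyte)[]) read("out.txt") == expected);
}
```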

R

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/

May 25 2012
