digitalmars.D.learn - ANSI to UTF8 problem

jicman <cabrera_ _wrc.xerox.com> writes:
Greetings.

I have this program,

import std.stdio;
import juno.base.text;
import std.file;
import std.windows.charset;
import std.utf;

int main(char[][] args)
{
  char[] ansi = r"c:\ansi.txt";
  char[] utf8 = r"c:\utf8.txt";
  try
  {
    char[] t = cast(char[]) read(ansi);
    write(utf8, std.windows.charset.fromMBSz(t.ptr,0));
    writefln(" converted to UTF8.");
  }
  catch (UtfException e)
  {
    writefln(" is not ANSI");
    return 1;
  }
  return(0);
}

the ansi.txt file contains,

josé
áéíóúñÑ

the utf8.txt file when opened with Wordpad looks like this:

josé
áéíóúñÑ

The file did change from ANSI to UTF8; however, it displays incorrectly in
Wordpad. The real problem is that an application I am trying to feed these
UTF8 files to displays them the same wrong way Wordpad does.

Any help would be greatly appreciated.

thanks,

josé
Aug 16 2010
"Nick Sabalausky" <a a.a> writes:
The utf8.txt file is probably missing the UTF-8 BOM (I'm not familiar with fromMBSz: I *assume* it doesn't add the BOM, but maybe I'm wrong?). Without that BOM, Wordpad is probably assuming it's "ASCII with some codepage" instead of UTF8.

Open utf8.txt in a hex editor (I like XVI32). If it doesn't start with EF BB BF then that's probably the problem, and you'll need to change:

write(utf8, std.windows.charset.fromMBSz(t.ptr,0));

to:

write(utf8, x"EF BB BF" ~ std.windows.charset.fromMBSz(t.ptr,0));
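
The hex-editor check can also be done in code. What follows is only a minimal sketch, written for D2 rather than the D1 style of the program in this thread. The names utf8Bom, hasUtf8Bom and withUtf8Bom are made up for illustration and are not part of Phobos. It works on raw bytes, so it does not depend on what fromMBSz returns.

```d
// Hypothetical helpers (not part of Phobos): detect and prepend the UTF-8 BOM.

// The UTF-8 byte order mark is the three bytes EF BB BF.
immutable ubyte[] utf8Bom = [0xEF, 0xBB, 0xBF];

/// True if `data` begins with the UTF-8 BOM.
bool hasUtf8Bom(const(ubyte)[] data)
{
    return data.length >= utf8Bom.length
        && data[0 .. utf8Bom.length] == utf8Bom;
}

/// Returns `data` with a BOM in front, unless it already starts with one.
const(ubyte)[] withUtf8Bom(const(ubyte)[] data)
{
    if (hasUtf8Bom(data))
        return data;
    const(ubyte)[] bom = utf8Bom;
    return bom ~ data;
}

void main()
{
    // "josé" encoded as UTF-8: the é is the two bytes C3 A9, which is
    // why a reader that assumes a Windows codepage shows two characters.
    const(ubyte)[] text = [0x6A, 0x6F, 0x73, 0xC3, 0xA9];

    assert(!hasUtf8Bom(text));

    auto marked = withUtf8Bom(text);
    assert(hasUtf8Bom(marked));
    assert(marked.length == text.length + 3);

    // Running it again must not add a second BOM.
    assert(withUtf8Bom(marked).length == marked.length);
}
```

To apply this to a file, one could pass the bytes from std.file.read (cast to const(ubyte)[]) through withUtf8Bom and hand the result to std.file.write. Checking for an existing BOM first keeps the conversion safe to run twice on the same file.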
Aug 16 2010
jicman <cabrera_ _wrc.xerox.com> writes:
DOH! Yep! Thanks, Nick.

josé
Aug 16 2010