digitalmars.D.learn - reading an unicode file


jicman <cabrera_ _wrc.xerox.com> writes:

Greetings!

I am reading this file into a char[][] array and all the data is broken down
by a space.  So, if a line of data read has,

hi there folks!

the string contains,

h i  t h e r e  f o l k s !

I know this has to do with UTF8 and unicode, but how do I fix that?

Any help would be greatly appreciated.

thanks,

josé

May 10 2007

Bill Baxter <dnewsgroup billbaxter.com> writes:

jicman wrote:
 Greetings!
 
 I am reading this file into a char[][] array and all the data is broken down
 by a space.  So, if a line of data read has,
 
 hi there folks!
 
 the string contains,
 
 h i  t h e r e  f o l k s !
 
 I know this has to do with UTF8 and unicode, but how do I fix that?

Yeah, the file is probably UCS-2 (UTF-16) rather than UTF-8, meaning every
character is 2 bytes (with a few exceptions). The things between the
characters are probably not spaces, but rather null characters (zero bytes).
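
To make the zero bytes visible, here is a minimal sketch in present-day D (D2). The helper name utf16leBytes is hypothetical, and little-endian byte order is an assumption about the file in question.

```d
import std.stdio : writefln;

/// Hypothetical helper: lays out UTF-16 code units as little-endian bytes,
/// the way a UTF-16LE file stores them on disk.
ubyte[] utf16leBytes(const(wchar)[] text)
{
    ubyte[] result;
    foreach (wchar unit; text)
    {
        result ~= cast(ubyte)(unit & 0xFF); // low byte first
        result ~= cast(ubyte)(unit >> 8);   // high byte: zero for plain ASCII
    }
    return result;
}

void main()
{
    // Every ASCII letter is followed by a zero byte.
    writefln("%(%02x %)", utf16leBytes("hi"w)); // prints: 68 00 69 00
}
```

Read as char data, each of those zero bytes shows up as a gap between the letters.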

 Any help would be greatly appreciated.

Try reading it as binary and using the std.utf functions to convert?
Or maybe read it as wchars with the functions in std.stream (then convert to
UTF-8 if necessary with the std.utf functions).

Never done this stuff myself, but that's where I'd look.
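
A rough sketch of the read-as-binary route, written for present-day D (D2) rather than the 2007 library. The function name utf16leToUtf8 is hypothetical, and the input is assumed to be UTF-16LE with no BOM; std.utf.toUTF8 is the only conversion call used. The bytes could come from std.file.read, but they are given inline here so the example stands alone.

```d
import std.stdio : writeln;
import std.utf : toUTF8;

/// Hypothetical helper: decodes UTF-16LE bytes (assumed to have no BOM)
/// into a UTF-8 string. A trailing odd byte is ignored.
string utf16leToUtf8(const(ubyte)[] raw)
{
    // Assemble each little-endian byte pair into one 16-bit code unit.
    auto units = new wchar[raw.length / 2];
    foreach (i, ref unit; units)
        unit = cast(wchar)(raw[2 * i] | (raw[2 * i + 1] << 8));

    // toUTF8 re-encodes the UTF-16 code units as UTF-8.
    return toUTF8(units);
}

void main()
{
    // "hi é" as it would sit in a UTF-16LE file.
    immutable ubyte[] raw = [0x68, 0x00, 0x69, 0x00, 0x20, 0x00, 0xE9, 0x00];
    writeln(utf16leToUtf8(raw));
}
```

Unlike deleting the zero bytes, this keeps accented characters intact, because each 16-bit unit is converted as a whole.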

--bb

May 10 2007

jicman <cabrera_ _wrc.xerox.com> writes:

Thanks BB.

I should stop using char and do more wchars.  But that is a whole new world for
me. :-)

Interesting enough, I did this command to the string,

char[] n = std.string.replace(s,"\000","");

and now strings show correctly.  The problem is that I work with accented
characters, which will probably break something. I am going to have to look into
this, but for now, it's working for this task.

Thanks for the help.

May 10 2007

Bill Baxter <dnewsgroup billbaxter.com> writes:

jicman wrote:
 Thanks BB.
 
 I should stop using char and do more wchars.  But that is a whole new world for
 me. :-)
 
 Interesting enough, I did this command to the string,
 
 char[] n = std.string.replace(s,"\000","");
 
 and now strings show correctly.  The problem is that I work with accented
 characters, which will probably break something. I am going to have to look into
 this, but for now, it's working for this task.

Yep. That is probably going to break in horrible ways when you start to 
encounter more than just plain 7-bit ASCII.
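
A small illustration of the failure, sketched in present-day D (D2). The helpers stripNulBytes and isValidUtf8 are hypothetical names; stripNulBytes only imitates what the replace() call does to the raw bytes, and the input is assumed to be UTF-16LE.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: imitates replace(s, "\000", "") on raw file bytes.
char[] stripNulBytes(const(ubyte)[] raw)
{
    char[] result;
    foreach (b; raw)
        if (b != 0)
            result ~= cast(char) b;
    return result;
}

/// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "café" in UTF-16LE: é is U+00E9, stored as the bytes E9 00.
    immutable ubyte[] raw = [0x63, 0x00, 0x61, 0x00, 0x66, 0x00, 0xE9, 0x00];
    auto stripped = stripNulBytes(raw);

    // The lone byte E9 left behind is not a valid UTF-8 sequence.
    writeln(isValidUtf8(stripped)); // prints: false
}
```

Plain ASCII survives the stripping only because its UTF-16 high byte is always zero and its low byte happens to equal its UTF-8 encoding.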

--bb

May 10 2007

Chris Nicholson-Sauls <ibisbasenji gmail.com> writes:

jicman wrote:
 Thanks BB.
 
 I should stop using char and do more wchars.  But that is a whole new world for
 me. :-)
 
 Interesting enough, I did this command to the string,
 
 char[] n = std.string.replace(s,"\000","");
 
 and now strings show correctly.  The problem is that I work with accented
 characters, which will probably break something. I am going to have to look into
 this, but for now, it's working for this task.
 
 Thanks for the help.

Even though it increases sizes, I find using dchar provides vast convenience in
cases where you /know/ you want|need to support various sorts of character
outside ASCII.  Of course you ought to experiment to see if wchar is fine for
your use case.
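
The trade-off can be seen by counting code units for the same text in each encoding. This is a minimal sketch in present-day D (D2); the sample text is an arbitrary choice for illustration.

```d
import std.stdio : writeln;

void main()
{
    // The same three characters: 'a', 'é' (U+00E9), and U+1D11E,
    // a musical symbol outside the Basic Multilingual Plane.
    string  s = "a\u00E9\U0001D11E";  // UTF-8
    wstring w = "a\u00E9\U0001D11E"w; // UTF-16
    dstring d = "a\u00E9\U0001D11E"d; // UTF-32

    // .length counts code units, not characters.
    writeln(s.length); // 7 (1 + 2 + 4 bytes)
    writeln(w.length); // 4 (1 + 1 + a surrogate pair)
    writeln(d.length); // 3 (one dchar per code point)

    // Only with dchar does indexing land on a whole code point every time.
    assert(d[2] == '\U0001D11E');
}
```

With wchar, indexing is only safe as long as no character needs a surrogate pair, which is the experiment suggested above.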

-- Chris Nicholson-Sauls

May 11 2007

Dejan Lekic <dejan.lekic gmail.com> writes:

By reading the BOM of the file you should be able to detect which text format to
use. More about the BOM: http://unicode.org/faq/utf_bom.html#BOM .
So, I would first check which BOM it is, then use the appropriate readLine() or
readLineW() InputStream method to read the file line by line, or, if you prefer
just to read until EOF, the appropriate read() method.
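
The BOM check itself can be sketched without any stream classes. The following is written for present-day D (D2), where std.stream is no longer part of Phobos; the TextFormat enum and the detectBom function are hypothetical names, and the byte patterns are the ones listed in the Unicode FAQ linked above.

```d
import std.stdio : writeln;

/// Hypothetical result type for the sketch.
enum TextFormat { unknown, utf8, utf16le, utf16be, utf32le, utf32be }

/// Hypothetical helper: inspects the first bytes of a file for a BOM.
TextFormat detectBom(const(ubyte)[] head)
{
    // UTF-32 must be tested before UTF-16, because the UTF-32LE BOM
    // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). A UTF-16LE file
    // whose first character is U+0000 would be misdetected here.
    if (head.length >= 4)
    {
        if (head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return TextFormat.utf32le;
        if (head[0] == 0x00 && head[1] == 0x00 && head[2] == 0xFE && head[3] == 0xFF)
            return TextFormat.utf32be;
    }
    if (head.length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        return TextFormat.utf8;
    if (head.length >= 2)
    {
        if (head[0] == 0xFF && head[1] == 0xFE)
            return TextFormat.utf16le;
        if (head[0] == 0xFE && head[1] == 0xFF)
            return TextFormat.utf16be;
    }
    // No BOM: the encoding has to be known or guessed some other way.
    return TextFormat.unknown;
}

void main()
{
    // First bytes of a UTF-16LE file containing "h...".
    immutable ubyte[] head = [0xFF, 0xFE, 0x68, 0x00];
    writeln(detectBom(head)); // prints: utf16le
}
```

Once the format is known, the BOM bytes are skipped and the rest of the file is decoded accordingly.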

May 10 2007
