digitalmars.D.learn - reading UTF-8 formatted file into dynamic array

thofis <thofis gmail.com> writes:
Hi,

I'm trying to read a text file into a dynamic array but got stuck with
non-ASCII characters. While trying, I looked at the word count example
(http://www.digitalmars.com/d/wc.html).

Basically, it uses:
char[] input;
// read file into input[]
input = cast(char[])std.file.read(args[i]);

But it seems that this cast does not fill _input_ properly. Any non-ASCII
character is split into multiple characters. (The BOM is also contained in
_input_.)

So... how do I get a UTF-8 file properly mapped into a dynamic array?

TIA

Bye
Jul 24 2007
Carlos Santander <csantander619 gmail.com> writes:
thofis wrote:
<snip>
 But it seems that this cast does not fill _input_ properly. Any non-ASCII
 character is split into multiple characters.
That's just the nature of UTF-8 and how it's handled in D.
 (The BOM is also contained in _input_.)
 
 So... how do I get a UTF-8 file properly mapped into a dynamic array?
 
std.stream contains a BOM Stream class or some such. Try that instead of
std.file.read.

-- 
Carlos Santander Bernal
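A minimal sketch of stripping the UTF-8 BOM by hand after std.file.read, in
case the exact std.stream class is hard to track down; it assumes a current D
compiler and Phobos rather than the 2007-era library this thread discusses:

import std.file : read;
import std.stdio : writeln;

void main(string[] args)
{
    // Read the raw bytes and reinterpret them as UTF-8 code units,
    // exactly as the word-count example does. args[1] is the file path.
    auto input = cast(char[]) read(args[1]);

    // Strip a leading UTF-8 byte order mark (bytes EF BB BF) if present.
    if (input.length >= 3 && input[0 .. 3] == "\xEF\xBB\xBF")
        input = input[3 .. $];

    writeln("read ", input.length, " UTF-8 code units");
}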
Jul 24 2007
"Stewart Gordon" <smjg_1998 yahoo.com> writes:
"thofis" <thofis gmail.com> wrote in message 
news:f8559q$2ld9$1 digitalmars.com...
<snip>
 But it seems that this cast does not fill _input_ properly. Any non-ASCII
 character is split into multiple characters.
 (The BOM is also contained in _input_.)
Welcome to UTF-8.
 So... how do I get a UTF-8 file properly mapped into a dynamic array?
It is properly mapped. If you examine the file in a hex editor/dumper and
compare it with what your program has made of it, you'll see that the file has
been read byte for byte. You'll also notice that input.length matches the file
size reported by the OS.

In UTF-8, every non-ASCII character occupies more than one byte. There is no
splitting into multiple _characters_ - when the file is saved as UTF-8 in the
first place, any character above U+007F is split into multiple _bytes_, but it
is still only one _character_.

If you want to work with the file data in a strict character-by-character
format, you can do any of the following after reading the file:

- convert it to UTF-32, using std.utf.toUTF32
- use std.utf.decode to extract characters from the read-in UTF-8 text
- use foreach with a dchar variable

Stewart.
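A minimal sketch of those three options, assuming a present-day D compiler and
Phobos (the modules have changed since 2007, so the exact names may have
differed at the time):

import std.file : read;
import std.stdio : writeln;
import std.utf : decode, toUTF32;

void main(string[] args)
{
    auto input = cast(char[]) read(args[1]);

    // Option 1: convert the whole buffer to UTF-32, one dchar per character.
    auto wide = toUTF32(input);

    // Option 2: decode one character at a time from the UTF-8 buffer.
    size_t i = 0;
    while (i < input.length)
    {
        dchar c = decode(input, i); // advances i past the decoded character
        // ... work with c ...
    }

    // Option 3: let foreach do the decoding by iterating with a dchar variable.
    foreach (dchar c; input)
    {
        // ... work with c ...
    }

    writeln(wide.length, " characters in ", input.length, " UTF-8 code units");
}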
Jul 24 2007