digitalmars.D.learn - invalid utf-8 sequence

james <Jamesg4 gmail.com> writes:

I'm writing an indexer, but I'm having a problem: on some files, reading
gives this error:

Error 4: invalid UTF-8 sequence

Is there a way to fix it?

Jan 06 2009

"Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:

On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4 gmail.com> wrote:
 I'm writing an indexer, but I'm having a problem: on some files, reading
 gives this error:

 Error 4: invalid UTF-8 sequence

 Is there a way to fix it?

You're probably reading a file that's encoded in some non-Unicode
encoding, like Latin-1.  You could read in the file data as byte[]
instead of as char[], but that still doesn't deal with the problem
that you have characters in your file that are outside the ASCII
range.  If you know what encoding your file uses, you could do some
transformations on it to turn it into valid Unicode, or you could just
ignore characters outside the ASCII range :P

Jan 06 2009
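
The two options described above (transcode from a known encoding, or ignore everything outside ASCII) can be sketched roughly as follows. This is a minimal sketch, not a tested solution for the original poster's setup: it assumes the file really is ISO-8859-1 (Latin-1), it is written against current D2 Phobos rather than the D1.0 Phobos mentioned in the thread, and the function names `latin1ToUtf8` and `stripNonAscii` are made up for illustration.

```d
/// Convert ISO-8859-1 (Latin-1) bytes to UTF-8.
/// Assumption: the input really is Latin-1. Each Latin-1 byte value equals
/// its Unicode code point, so bytes below 0x80 pass through unchanged and
/// the rest become two-byte UTF-8 sequences.
char[] latin1ToUtf8(const(ubyte)[] data)
{
    char[] result;
    result.reserve(data.length);
    foreach (b; data)
    {
        if (b < 0x80)
        {
            result ~= cast(char) b;
        }
        else
        {
            result ~= cast(char)(0xC0 | (b >> 6));   // lead byte: 110000xx
            result ~= cast(char)(0x80 | (b & 0x3F)); // continuation: 10xxxxxx
        }
    }
    return result;
}

/// The cruder alternative: drop every byte outside the ASCII range.
char[] stripNonAscii(const(ubyte)[] data)
{
    char[] result;
    foreach (b; data)
    {
        if (b < 0x80)
            result ~= cast(char) b;
    }
    return result;
}
```

With the raw bytes of a file in hand (for example from `std.file.read`, cast to `ubyte[]`), the result of `latin1ToUtf8` can be handled as ordinary UTF-8 text. Note that many web pages labeled as Latin-1 are really Windows-1252, which differs in the 0x80 to 0x9F range, so a table lookup for those bytes may be needed in practice.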

james <Jamesg4 gmail.com> writes:

Jarrett Billingsley Wrote:

 You're probably reading a file that's encoded in some non-Unicode
 encoding, like Latin-1.  You could read in the file data as byte[]
 instead of as char[], but that still doesn't deal with the problem
 that you have characters in your file that are outside the ASCII
 range.  If you know what encoding your file uses, you could do some
 transformations on it to turn it into valid Unicode, or you could just
 ignore characters outside the ASCII range :P

Is there any library or function that can automatically convert these unknown
HTML charsets into UTF-8?

Jan 06 2009

"Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:

On Tue, Jan 6, 2009 at 9:20 PM, james <Jamesg4 gmail.com> wrote:
 Is there any library or function that can automatically convert these unknown
 HTML charsets into UTF-8?

Not that I know of, for D anyway.

Jan 06 2009

james <Jamesg4 gmail.com> writes:

Jarrett Billingsley Wrote:

 Not that I know of, for D anyway.

I just found out about 'UnicodeFile' in Tango, but I'm using D 1.0 and
Phobos. Maybe I should write one of my own.

Jan 06 2009

"Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:

On Tue, Jan 6, 2009 at 10:34 PM, james <Jamesg4 gmail.com> wrote:
 Not that I know of, for D anyway.

 I just found out about 'UnicodeFile' in Tango, but I'm using D 1.0 and
 Phobos. Maybe I should write one of my own.

It wouldn't help you anyway.  UnicodeFile reads.. uh, Unicode files.
Your file is _not_ Unicode.

Jan 06 2009
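
Whether a given file is valid UTF-8 at all can be checked before deciding how to read it. The following is a minimal sketch, assuming D2 Phobos, where `std.utf.validate` throws a `UTFException` on malformed input; the helper name `isValidUtf8` is hypothetical, and D1.0 Phobos reports the failure with a different exception class.

```d
import std.utf : validate, UTFException;

/// Returns true if `data` is well-formed UTF-8, false otherwise.
bool isValidUtf8(const(ubyte)[] data)
{
    try
    {
        // validate() walks the whole array and throws on the first bad sequence.
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}
```

A file for which this returns false is the kind that triggers the "invalid UTF-8 sequence" error when treated as `char[]`, and it needs to be transcoded from its real encoding first.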

Stewart Gordon <smjg_1998 yahoo.com> writes:

james wrote:
 Is there any library or function that can automatically convert these unknown
 HTML charsets into UTF-8?

You mean that tries to work out what character set a file is in and then 
translates it?

(What is the current state of the art of character set detection 
heuristics?)

Stewart.

Jan 07 2009
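
As a rough illustration of what "working out the character set" can mean at its simplest, here is a deliberately basic heuristic: look for a byte order mark, then test whether the data is valid UTF-8, and otherwise assume some single-byte encoding. It is a sketch under stated assumptions and nowhere near the state of the art: it targets D2 Phobos, the `Guess` enum and `guessEncoding` function are invented for this example, and it cannot tell one single-byte encoding (Latin-1, Windows-1252, and so on) from another.

```d
import std.utf : validate, UTFException;

/// Hypothetical result type for this sketch.
enum Guess { utf8, utf16le, utf16be, singleByte }

/// Very basic encoding guess: BOM first, then UTF-8 validation,
/// then fall back to "some single-byte encoding".
Guess guessEncoding(const(ubyte)[] data)
{
    // Byte order marks. (FF FE could also start a UTF-32LE BOM;
    // this sketch ignores UTF-32.)
    if (data.length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Guess.utf8;
    if (data.length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Guess.utf16le;
    if (data.length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Guess.utf16be;

    // No BOM: text that validates as UTF-8 is very likely UTF-8
    // (pure ASCII also ends up here, which is harmless).
    try
    {
        validate(cast(const(char)[]) data);
        return Guess.utf8;
    }
    catch (UTFException e)
    {
        // Not UTF-8. Which single-byte encoding it is cannot be decided
        // here; for HTML, a declared charset is the next thing to check.
        return Guess.singleByte;
    }
}
```

For HTML input specifically, a charset declared in the HTTP headers or in a `meta` tag is usually a better source than guessing from the bytes, with a heuristic like this one as a fallback.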

Stewart Gordon <smjg_1998 yahoo.com> writes:

james wrote:
 I'm writing an indexer, but I'm having a problem: on some files, reading
 gives this error:

 Error 4: invalid UTF-8 sequence

 Is there a way to fix it?

Probably, but since you've decided not to post your code, nobody can 
tell you for sure what that way is.

Moreover, what is giving this error - the compiler, or your compiled 
program?

Stewart.

Jan 06 2009
