digitalmars.D.learn - invalid utf-8 sequence

james <Jamesg4 gmail.com> writes:
I'm writing an indexer, but I'm having a problem: on some files, when I
read them, I get this error:

Error 4: invalid UTF-8 sequence

Is there a way to fix it?
Jan 06 2009
"Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:
You're probably reading a file that's encoded in some non-Unicode encoding, like Latin-1. You could read in the file data as byte[] instead of as char[], but that still doesn't deal with the problem that you have characters in your file that are outside the ASCII range. If you know what encoding your file uses, you could do some transformations on it to turn it into valid Unicode, or you could just ignore characters outside the ASCII range :P
Jan 06 2009
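
A minimal sketch of the "transformations" suggested above, assuming the file really is Latin-1 (ISO-8859-1), which the thread does not confirm. Latin-1 byte values coincide with the first 256 Unicode code points, so each byte maps to one or two UTF-8 code units. The function names `latin1ToUtf8` and `asciiOnly` are made up for illustration, and the code targets a current D2 compiler rather than the D 1.0 setup mentioned in the thread:

```d
import std.stdio : writeln;

/// Converts Latin-1 (ISO-8859-1) bytes to UTF-8.
/// Assumption: the input really is Latin-1; nothing here checks that.
string latin1ToUtf8(const(ubyte)[] raw)
{
    char[] result;
    result.reserve(raw.length);
    foreach (b; raw)
    {
        if (b < 0x80)
        {
            result ~= cast(char) b;                  // ASCII: copy unchanged
        }
        else
        {
            result ~= cast(char)(0xC0 | (b >> 6));   // lead byte
            result ~= cast(char)(0x80 | (b & 0x3F)); // continuation byte
        }
    }
    return result.idup;
}

/// The cruder alternative: drop every byte outside the ASCII range.
string asciiOnly(const(ubyte)[] raw)
{
    char[] result;
    foreach (b; raw)
    {
        if (b < 0x80)
            result ~= cast(char) b;
    }
    return result.idup;
}

void main()
{
    // 0xE9 is 'é' in Latin-1; on its own it is not valid UTF-8.
    ubyte[] raw = [0x63, 0x61, 0x66, 0xE9];
    writeln(latin1ToUtf8(raw)); // "café", encoded as UTF-8
    writeln(asciiOnly(raw));    // "caf"
}
```

The raw bytes of a file can be obtained with `cast(ubyte[]) std.file.read(name)`, so the data is never treated as `char[]` before it has been converted.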
james <Jamesg4 gmail.com> writes:
Is there any library or function that can automatically convert these unknown HTML charsets into UTF-8?
Jan 06 2009
"Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:
Not that I know of, for D anyway.
Jan 06 2009
james <Jamesg4 gmail.com> writes:
I just found out about 'UnicodeFile' in Tango, but I'm using D 1.0 and Phobos. Maybe I should write one of my own.
Jan 06 2009
"Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:
It wouldn't help you anyway. UnicodeFile reads... uh, Unicode files. Your file is _not_ Unicode.
Jan 06 2009
Stewart Gordon <smjg_1998 yahoo.com> writes:
james wrote:
 Is there any library or function that can automatically convert these unknown HTML charsets into UTF-8?
You mean one that tries to work out what character set a file is in and then translates it? (What is the current state of the art in character set detection heuristics?)

Stewart.
Jan 07 2009
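
As a rough illustration of the simplest kind of detection heuristic, the sketch below checks whether the bytes are well-formed UTF-8 and, if not, falls back to treating them as Latin-1. It is nowhere near the state of the art: the Latin-1 fallback is an assumption, since any byte sequence "decodes" as Latin-1 whether or not that is the real encoding, and for HTML a charset declared in the HTTP headers or a meta tag would normally be consulted first. `isValidUtf8` and `guessAndConvert` are hypothetical names; `std.utf.validate` and `UTFException` are from D2 Phobos, not the D 1.0 library used in the thread:

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Returns true if `raw` is a well-formed UTF-8 byte sequence.
bool isValidUtf8(const(ubyte)[] raw)
{
    try
    {
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Two-way guess only: UTF-8 if the bytes validate, otherwise Latin-1.
/// Assumption: no other encodings occur in the input.
string guessAndConvert(const(ubyte)[] raw)
{
    if (isValidUtf8(raw))
        return cast(string) raw.idup;

    // Fallback: re-encode each Latin-1 byte as UTF-8.
    char[] result;
    foreach (b; raw)
    {
        if (b < 0x80)
        {
            result ~= cast(char) b;
        }
        else
        {
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}

void main()
{
    ubyte[] utf8   = [0x63, 0x61, 0x66, 0xC3, 0xA9]; // "café" in UTF-8
    ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9];       // "café" in Latin-1
    writeln(guessAndConvert(utf8));   // passed through unchanged
    writeln(guessAndConvert(latin1)); // re-encoded as UTF-8
}
```

Both inputs end up as the same UTF-8 string. Text in other single-byte encodings (Windows-1252, ISO-8859-2, and so on) would also take the fallback path and come out wrong, which is why real detectors rely on statistics or declared metadata.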
Stewart Gordon <smjg_1998 yahoo.com> writes:
james wrote:
 Is there a way to fix it?
Probably, but since you've decided not to post your code, nobody can tell you for sure what that way is. Moreover, what is giving this error - the compiler, or your compiled program?

Stewart.
Jan 06 2009