digitalmars.D.learn - invalid utf-8 sequence
- james (3/3) Jan 06 2009 I'm writing an indexer, but I'm having a problem because on some files, whe...
- Jarrett Billingsley (8/11) Jan 06 2009 You're probably reading a file that's encoded in some non-Unicode
- james (2/17) Jan 06 2009 Is there any library or function that can automatically convert these un...
- Jarrett Billingsley (2/19) Jan 06 2009 Not that I know of, for D anyway.
- james (2/24) Jan 06 2009 I just found out about a function 'UnicodeFile' in Tango, but I'm using D...
- Jarrett Billingsley (3/5) Jan 06 2009 It wouldn't help you anyway. UnicodeFile reads.. uh, Unicode files.
- Stewart Gordon (6/24) Jan 07 2009 You mean that tries to work out what character set a file is in and then...
- Stewart Gordon (6/11) Jan 06 2009 Probably, but since you've decided not to post your code, nobody can
I'm writing an indexer, but I'm having a problem: on some files, reading gives this error:

Error 4: invalid UTF-8 sequence

Is there a way to fix it?
Jan 06 2009
On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4 gmail.com> wrote:
> I'm writing an indexer, but I'm having a problem: on some files, reading gives this error:
> Error 4: invalid UTF-8 sequence
> Is there a way to fix it?

You're probably reading a file that's encoded in some non-Unicode encoding, like Latin-1. You could read in the file data as byte[] instead of as char[], but that still doesn't deal with the problem that you have characters in your file that are outside the ASCII range. If you know what encoding your file uses, you could do some transformations on it to turn it into valid Unicode, or you could just ignore characters outside the ASCII range :P
Jan 06 2009
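The two options Jarrett mentions can be sketched roughly as follows. This is a minimal sketch, assuming the file really is Latin-1 (ISO-8859-1); `latin1ToUtf8` and `stripNonAscii` are hypothetical helper names, not Phobos functions. The code makes no library calls, so it should not depend much on which Phobos version is installed.

```d
// Hypothetical helper (not part of Phobos): converts Latin-1 bytes to UTF-8.
// Assumption: the input really is ISO-8859-1.
char[] latin1ToUtf8(ubyte[] data)
{
    char[] result;
    foreach (ubyte b; data)
    {
        if (b < 0x80)
        {
            // ASCII bytes are identical in Latin-1 and UTF-8.
            result ~= cast(char) b;
        }
        else
        {
            // Latin-1 bytes 0x80..0xFF are code points U+0080..U+00FF,
            // which UTF-8 encodes as two bytes: 110000xx 10xxxxxx.
            result ~= cast(char) (0xC0 | (b >> 6));
            result ~= cast(char) (0x80 | (b & 0x3F));
        }
    }
    return result;
}

// Hypothetical helper for the "just ignore them" option:
// drops every byte outside the ASCII range.
char[] stripNonAscii(ubyte[] data)
{
    char[] result;
    foreach (ubyte b; data)
    {
        if (b < 0x80)
            result ~= cast(char) b;
    }
    return result;
}

void main()
{
    // "café" in Latin-1: the last byte, 0xE9, is not valid UTF-8 on its own.
    ubyte[] raw = [cast(ubyte) 0x63, 0x61, 0x66, 0xE9];

    char[] text = latin1ToUtf8(raw);
    assert(text.length == 5);              // 0xE9 became two bytes
    assert(text[3] == cast(char) 0xC3);
    assert(text[4] == cast(char) 0xA9);

    assert(stripNonAscii(raw) == "caf");   // the accented letter is lost
}
```

In a real program the bytes would come from the file, for example `cast(ubyte[]) std.file.read(name)`, rather than from a literal. Stripping non-ASCII bytes is lossy, so for an indexer the conversion is probably the better choice whenever the encoding is known.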
Jarrett Billingsley Wrote:
> You're probably reading a file that's encoded in some non-Unicode encoding, like Latin-1. You could read in the file data as byte[] instead of as char[], but that still doesn't deal with the problem that you have characters in your file that are outside the ASCII range. If you know what encoding your file uses, you could do some transformations on it to turn it into valid Unicode, or you could just ignore characters outside the ASCII range :P

Is there any library or function that can automatically convert these unknown HTML charsets into UTF-8?
Jan 06 2009
On Tue, Jan 6, 2009 at 9:20 PM, james <Jamesg4 gmail.com> wrote:
> Is there any library or function that can automatically convert these unknown HTML charsets into UTF-8?

Not that I know of, for D anyway.
Jan 06 2009
Jarrett Billingsley Wrote:
> Not that I know of, for D anyway.

I just found out about a function 'UnicodeFile' in Tango, but I'm using D1.0 and Phobos, so maybe I should write one of my own.
Jan 06 2009
On Tue, Jan 6, 2009 at 10:34 PM, james <Jamesg4 gmail.com> wrote:
> I just found out about a function 'UnicodeFile' in Tango, but I'm using D1.0 and Phobos, so maybe I should write one of my own.

It wouldn't help you anyway. UnicodeFile reads.. uh, Unicode files. Your file is _not_ Unicode.
Jan 06 2009
james wrote:
> Is there any library or function that can automatically convert these unknown HTML charsets into UTF-8?

You mean one that tries to work out what character set a file is in and then translates it? (What is the current state of the art of character set detection heuristics?)

Stewart.
Jan 07 2009
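The cheapest detection heuristic is probably just to test whether the bytes are well-formed UTF-8. A minimal sketch, assuming `std.utf.validate` throws on malformed input (the exception class has had different names across Phobos versions, so the sketch catches the generic `Exception`); `isValidUtf8` is a hypothetical helper name:

```d
import std.utf;

// Hypothetical helper: returns true if the bytes form well-formed UTF-8.
// Assumption: std.utf.validate throws an exception on malformed input.
bool isValidUtf8(ubyte[] data)
{
    try
    {
        validate(cast(char[]) data);
        return true;
    }
    catch (Exception e)
    {
        return false;
    }
}

void main()
{
    ubyte[] ascii  = [cast(ubyte) 0x63, 0x61, 0x66];              // "caf"
    ubyte[] utf8   = [cast(ubyte) 0x63, 0x61, 0x66, 0xC3, 0xA9];  // "café" in UTF-8
    ubyte[] latin1 = [cast(ubyte) 0x63, 0x61, 0x66, 0xE9];        // "café" in Latin-1

    assert(isValidUtf8(ascii));    // pure ASCII is also valid UTF-8
    assert(isValidUtf8(utf8));
    assert(!isValidUtf8(latin1));  // 0xE9 starts a sequence that never completes
}
```

This only answers "UTF-8 or not". When the check fails, the file is in some legacy encoding, and which one still has to be guessed or taken from elsewhere, for instance the charset declared in the HTML itself.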
james wrote:
> I'm writing an indexer, but I'm having a problem: on some files, reading gives this error:
> Error 4: invalid UTF-8 sequence
> Is there a way to fix it?

Probably, but since you've decided not to post your code, nobody can tell you for sure what that way is. Moreover, what is giving this error - the compiler, or your compiled program?

Stewart.
Jan 06 2009