digitalmars.D.learn - Need to do some "dirty" UTF-8 handling

reply "Nick Sabalausky" <a a.a> writes:
Sometimes I need to bring data into a string, and need to be able to treat 
it as an actual "string", but don't actually care if the entire thing is 
technically valid UTF-8 or not, don't care if invalid bytes don't get 
preserved right, and can't have any utf exceptions being thrown regardless 
of the input. Yea, I know that's sloppy, but sometimes that's good enough 
and proper handling may be far more trouble than what's needed. (For 
example: Processing HTML from arbitrary URLs. It's pretty much guaranteed 
you'll come across stuff that's wrong or even has the encoding type 
improperly set. But it's usually more important for the process to succeed 
than for it to be perfectly accurate.)

Far as I can tell, this seems to currently be impossible with Phobos (unless 
you're *extremely* meticulous about watching what your entire codebase does 
with the data), which is a major pain when such a need arises.

Anyone have a good workaround? For instance, maybe a function that'll take 
in a byte array and convert *all* invalid UTF-8 sequences to a user-selected 
valid character?
Jun 25 2011
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sat, 25 Jun 2011 12:00:43 +0300, Nick Sabalausky <a a.a> wrote:

 Anyone have a good workaround? For instance, maybe a function that'll take
 in a byte array and convert *all* invalid UTF-8 sequences to a user-selected
 valid character?
I tend to do this a lot, for various reasons. In my experience, a great part of the string-handling functions in Phobos will work just fine with strings containing invalid UTF-8 - you can generally use your intuition about whether a function will need to look at individual characters inside the string. Note, though, that there's currently a bug in D2/Phobos (6064) which causes std.array.join (and possibly other functions) to treat strings as not something that can be joined by concatenation, and do a character-by-character copy (which is both needlessly inefficient and will choke on invalid UTF-8).

When I really need to pass arbitrary data through string-handling functions, I use these functions:

    /// convert any data to valid UTF-8, so D's string functions can properly work on it
    string rawToUTF8(string s)
    {
        dstring d;
        foreach (char c; s)
            d ~= c;
        return toUTF8(d);
    }

    string UTF8ToRaw(string r)
    {
        string s;
        foreach (dchar c; r)
        {
            assert(c < '\u0100');
            // cast so the byte is appended as-is; appending a dchar
            // to a string would re-encode it as UTF-8
            s ~= cast(char) c;
        }
        return s;
    }

( from https://github.com/CyberShadow/Team15/blob/master/Utils.d#L514 )

Of course, it would be nice if it were possible to only convert INVALID UTF-8 sequences. According to Wikipedia, the invalid Unicode code points U+DC80..U+DCFF are often used for encoding invalid byte sequences. I'd guess that a proper implementation will need to guarantee that a roundtrip will always return the same data as the input, so it'd have to "escape" the invalid code points used for escaping as well.

--
Best regards,
 Vladimir                            mailto:vladimir thecybershadow.net
Jun 25 2011
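A rough sketch of the round-trip idea described above, in case it helps. The names bytesToUTF8 and utf8ToBytes are made up for illustration and are not the functions from the linked file. The sketch takes and returns ubyte arrays rather than strings, and casts each code point back to a single byte, because appending a dchar to a string would re-encode it as UTF-8.

```d
import std.utf : toUTF8, validate;

/// Hypothetical variant of rawToUTF8: maps each byte to the code point with
/// the same value (0x00..0xFF) and encodes the result as UTF-8.
string bytesToUTF8(const(ubyte)[] raw)
{
    dstring d;
    foreach (b; raw)
        d ~= cast(dchar) b;
    return toUTF8(d);
}

/// Hypothetical inverse: only accepts strings produced by bytesToUTF8.
immutable(ubyte)[] utf8ToBytes(string s)
{
    immutable(ubyte)[] raw;
    foreach (dchar c; s) // decoding is safe here because s is valid UTF-8
    {
        assert(c < 0x100, "not produced by bytesToUTF8");
        raw ~= cast(ubyte) c; // one byte per code point, no re-encoding
    }
    return raw;
}

unittest
{
    immutable(ubyte)[] data = [0x61, 0xFF, 0xC3, 0x28, 0x62]; // not valid UTF-8
    string safe = bytesToUTF8(data);
    validate(safe);                    // does not throw
    assert(utf8ToBytes(safe) == data); // round trip restores the bytes
}
```

Note that this inflates every byte above 0x7F to two bytes and turns already-valid multi-byte sequences into mojibake for the duration of the round trip, which is the price of never throwing.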
parent reply "Nick Sabalausky" <a a.a> writes:
"Vladimir Panteleev" <vladimir thecybershadow.net> wrote in message 
news:op.vxmuvzqbtuzx1w cybershadow.mshome.net...
 string s;
 foreach (dchar c; r)
That doesn't throw on an invalid sequence?
Jun 25 2011
parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sat, 25 Jun 2011 23:17:37 +0300, Nick Sabalausky <a a.a> wrote:

 "Vladimir Panteleev" <vladimir thecybershadow.net> wrote in message
 news:op.vxmuvzqbtuzx1w cybershadow.mshome.net...
 string s;
 foreach (dchar c; r)
That doesn't throw on an invalid sequence?
You use rawToUTF8 to convert an arbitrary array of chars to valid UTF-8. You use UTF8ToRaw to convert the output of rawToUTF8 back to the original string.

--
Best regards,
 Vladimir                            mailto:vladimir thecybershadow.net
Jun 26 2011
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
Convert it to a ubyte[] (or immutable(ubyte)[])? Anything that actually treats it as a string instead of an array of bytes _must_ treat it as UTF-8 since it has to decode to determine what the characters are. So, I don't think that there's really any way around that. A string must be valid UTF-8.

But if you really don't care about the string's contents, then you can just cast it to an array of ubyte and plenty of functions will work with it - nothing terribly string specific of course, but I don't see how you could possibly expect to do much string-specific with invalid data anyway.

- Jonathan M Davis
Jun 25 2011
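To make the byte-array suggestion concrete, here is a small sketch. containsBytes is a made-up helper, and it assumes a reasonably recent Phobos where std.string.representation is available; a plain cast(immutable(ubyte)[]) would do the same job. Because both operands are ubyte arrays, the search compares raw bytes and never tries to decode anything.

```d
import std.algorithm.searching : canFind;
import std.string : representation;

/// Hypothetical helper: true if `needle` occurs in `haystack`,
/// comparing raw bytes only, so invalid UTF-8 cannot cause a throw.
bool containsBytes(string haystack, string needle)
{
    // representation() reinterprets a string as immutable(ubyte)[]
    return haystack.representation.canFind(needle.representation);
}

unittest
{
    // "\xFF" can never appear in valid UTF-8
    assert(containsBytes("ab\xFFcd", "\xFFc"));
    assert(!containsBytes("ab\xFFcd", "xyz"));
}
```

Wrapping the conversion in a helper like this keeps the casts in one place instead of spreading them across the codebase.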
parent "Nick Sabalausky" <a a.a> writes:
Using immutable(ubyte)[] just causes an enormous amount of type-related problems, largely involving the need to throw around a bunch of casts absolutely everywhere, including every single time any of the byte arrays needs to come in contact with an actual string (for instance, a string literal, for comparing, searching or anything else). It might be the "correct" thing, but in many cases (anything that doesn't need to be perfect, or can't realistically be perfect) it's far more trouble than it's actually worth.

Like I said, "For instance, maybe a function that'll take in a byte array and convert *all* invalid UTF-8 sequences to a user-selected valid character?" In such a case, *there would be no invalid data* in the actual string.
Jun 25 2011
prev sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
I've had a similar requirement some time ago. I've had to copy and
modify the phobos function std.utf.decode for a custom text editor
because the function throws when it finds an invalid code point. This
is way too slow for my needs. I'm actually displaying invalid code
points with special marks (just like Scintilla), so I need decoding to
work as fast as possible.

The new function simply replaces throwing exceptions with flagging a boolean.
Jun 25 2011
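The modified decoder itself isn't shown in the thread, so the following is only a guess at the general shape: a hand-written decoder that reports malformed input through a boolean instead of throwing. decodeNoThrow and its parameters are hypothetical names, and the policy of skipping exactly one byte on error is an assumption, not something taken from Phobos.

```d
/// Hypothetical helper (not a Phobos function): decodes one code point from
/// `s` at `index` and advances `index`. On malformed input it sets `ok` to
/// false, consumes a single byte, and returns U+FFFD.
/// Precondition: index < s.length.
dchar decodeNoThrow(const(char)[] s, ref size_t index, out bool ok)
    @safe pure nothrow @nogc
{
    immutable ubyte first = cast(ubyte) s[index];
    ok = true;

    if (first < 0x80) // ASCII fast path
    {
        ++index;
        return cast(dchar) first;
    }

    size_t len; // total length of the sequence, 0 if the lead byte is bad
    uint c;
    if      ((first & 0xE0) == 0xC0) { len = 2; c = first & 0x1F; }
    else if ((first & 0xF0) == 0xE0) { len = 3; c = first & 0x0F; }
    else if ((first & 0xF8) == 0xF0) { len = 4; c = first & 0x07; }

    bool good = len != 0 && index + len <= s.length;
    if (good)
    {
        foreach (i; 1 .. len)
        {
            immutable ubyte b = cast(ubyte) s[index + i];
            if ((b & 0xC0) != 0x80) { good = false; break; }
            c = (c << 6) | (b & 0x3F);
        }
    }

    // Reject overlong forms, surrogates, and values past U+10FFFF.
    static immutable uint[5] minValue = [0, 0, 0x80, 0x800, 0x1_0000];
    if (good && (c < minValue[len] || c > 0x10FFFF
                 || (c >= 0xD800 && c <= 0xDFFF)))
        good = false;

    if (!good)
    {
        ok = false;
        ++index; // resynchronise one byte at a time
        return '\uFFFD';
    }
    index += len;
    return cast(dchar) c;
}

unittest
{
    string s = "a\xFF\xC3\xA9"; // 'a', a stray 0xFF, then U+00E9
    size_t i = 0;
    bool ok;
    assert(decodeNoThrow(s, i, ok) == 'a' && ok);
    assert(decodeNoThrow(s, i, ok) == 0xFFFD && !ok);
    assert(decodeNoThrow(s, i, ok) == 0xE9 && ok && i == s.length);
}
```

A caller such as a text editor can check the flag after each call and draw its own marker for the offending byte.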
parent reply "Nick Sabalausky" <a a.a> writes:
I think I may end up doing something like that :/

I was hoping to be able to do something vaguely sensible like this:

    string newStr;
    foreach(dchar dc; str)
    {
        if(isValidDchar(dc))
            newStr ~= dc;
        else
            newStr ~= 'X';
    }
    str = newStr;

But that just blows up in my face.
Jun 25 2011
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
std.encoding to the rescue? It looks like a well-established module that was forgotten for some reason. And here I'm wondering what a function named sanitize could do :)

--
Dmitry Olshansky
Jun 25 2011
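For reference, a minimal sketch of what sanitize does with a UTF-8 string: invalid sequences come back as U+FFFD. The sanitizeWith wrapper and its replacement parameter are made up here to approximate the "user-selected valid character" from the original question. One caveat of this shortcut is that any genuine U+FFFD already present in the input gets replaced too.

```d
import std.array : replace;
import std.encoding : sanitize;
import std.utf : validate;

/// Hypothetical wrapper: sanitize `s`, then swap the U+FFFD markers that
/// sanitize inserts for a caller-chosen replacement string.
string sanitizeWith(string s, string replacement = "?")
{
    return sanitize(s).replace("\uFFFD", replacement);
}

unittest
{
    // \xF0\x80 is a truncated, malformed sequence
    string cleaned = sanitize("hello \xF0\x80world");
    assert(cleaned == "hello \xEF\xBF\xBDworld"); // U+FFFD encoded as UTF-8
    validate(cleaned);                            // no longer throws

    assert(sanitizeWith("hello \xF0\x80world", "X") == "hello Xworld");
}
```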
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
std.encoding is also likely going away. It was an experiment of sorts which Andrei considers a failure. We need something to replace it, but as I understand it, it doesn't solve all of the problems that it's supposed to, and those it does solve, it doesn't necessarily solve in the best way. So, an improved replacement is going to need to be devised, but I wouldn't expect std.encoding to stick around in the long run.

- Jonathan M Davis
Jun 25 2011
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
Ahh, I didn't even notice that module. Even if it's imperfect and goes away, it looks like it'll at least get the job done for me. And the encoding conversions should even give me an easy way to save at least some of the invalid chars (which wasn't really a requirement of mine, but it'll still be nice).
Jun 25 2011
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
 Ahh, I didn't even notice that module.
Same here. It was just a couple of days(!) ago that I somehow managed to find decode in the wrong place (in std.encoding instead of std.utf). It looked useful, but I had never heard of it. Seriously, how many totally irrelevant old modules do we have around here? (hint: std.gregorian!)
 Even if it's imperfect and goes away, it looks like it'll at least get the
 job done for me. And the encoding conversions should even give me an easy
 way to save at least some of the invalid chars (which wasn't really a
 requirement of mine, but it'll still be nice).
Yeah, given the amount of necessary work in the Phobos realm it could hang around for quite some time ;)

--
Dmitry Olshansky
Jun 25 2011
next sibling parent "Nick Sabalausky" <a a.a> writes:
Yea, and even when it does go, I can just copy it and include it manually (although it'll probably need some work once typedef goes away).

This seems to get the job done well enough for me, and even manages to save some of the intended chars:

    // With std.utf and std.encoding imported:
    string src = ...;
    bool valid = true;
    try
        validate(src);
    catch(UtfException e)
        valid = false;

    if(!valid)
    {
        auto tmpStr = sanitize( cast(Windows1252String) src );
        transcode(tmpStr, src);
    }
Jun 25 2011
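The same approach wrapped up as a function, as a sketch only. toValidUTF8 is a made-up name, and the code assumes a current Phobos, where the exception class is spelled UTFException rather than UtfException. Input that fails UTF-8 validation is reinterpreted as Windows-1252, sanitized, and transcoded to UTF-8.

```d
import std.encoding : Windows1252String, sanitize, transcode;
import std.utf : UTFException, validate;

/// Hypothetical helper: returns `src` unchanged if it is valid UTF-8,
/// otherwise treats its bytes as Windows-1252 and converts them to UTF-8.
string toValidUTF8(string src)
{
    try
    {
        validate(src); // throws UTFException on malformed input
        return src;
    }
    catch (UTFException)
    {
        // Reinterpret the bytes; sanitize replaces the few byte values
        // that Windows-1252 leaves undefined.
        auto tmp = sanitize(cast(Windows1252String) src);
        string result;
        transcode(tmp, result);
        return result;
    }
}

unittest
{
    assert(toValidUTF8("plain") == "plain");
    // A lone 0xE9 is invalid UTF-8 but is U+00E9 in Windows-1252.
    assert(toValidUTF8("caf\xE9") == "caf\xC3\xA9");
}
```

The fallback is all-or-nothing: a single bad byte causes the whole string to be reinterpreted, so valid multi-byte UTF-8 elsewhere in the same input will come out as mojibake.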
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
Oh, it'll probably be around for a while. It'll take time before a replacement is devised. After all, std.stream is still around, isn't it? And there's actually supposedly a plan regarding its replacement's implementation. There's no such thing with regards to std.encoding. I just thought that I should point out that it's likely to be replaced at some point (hopefully with something much better).

- Jonathan M Davis
Jun 25 2011