digitalmars.D.learn - Need to do some "dirty" UTF-8 handling
- Nick Sabalausky (16/16) Jun 25 2011 Sometimes I need to bring data into a string, and need to be able to tre...
- Vladimir Panteleev (41/46) Jun 25 2011 I tend to do this a lot, for various reasons. By my experience, a great ...
- Nick Sabalausky (3/5) Jun 25 2011 That doesn't throw on an invalid sequence?
- Vladimir Panteleev (7/13) Jun 26 2011 You use rawToUTF8 to convert an arbitrary array of chars to valid UTF-8....
- Jonathan M Davis (10/29) Jun 25 2011 Convert it to a ubyte[] (or immutable(ubyte)[])? Anything that actually ...
- Nick Sabalausky (14/52) Jun 25 2011 Using immutable(ubyte)[] just causes an enormous amount of type-related
- Andrej Mitrovic (7/7) Jun 25 2011 I've had a similar requirement some time ago. I've had to copy and
- Nick Sabalausky (14/22) Jun 25 2011 I think I may end up doing something like that :/
- Dmitry Olshansky (6/29) Jun 25 2011 std.encoding to the rescue?
- Jonathan M Davis (8/45) Jun 25 2011 It's also likely going away. It was an experiment of sorts which Andrei
- Nick Sabalausky (7/40) Jun 25 2011 Ahh, I didn't even notice that module.
- Dmitry Olshansky (9/51) Jun 25 2011 Same here, It's just a couple of days(!) ago I somehow managed to find
- Nick Sabalausky (18/70) Jun 25 2011 Yea, and even when it does go, I can just copy it and include it manuall...
- Jonathan M Davis (8/67) Jun 25 2011 Oh, it'll probably be around for a while. It'll take time before a repla...
Sometimes I need to bring data into a string, and need to be able to treat it as an actual "string", but don't actually care if the entire thing is technically valid UTF-8 or not, don't care if invalid bytes don't get preserved right, and can't have any utf exceptions being thrown regardless of the input. Yea, I know that's sloppy, but sometimes that's good enough and proper handling may be far more trouble than what's needed. (For example: Processing HTML from arbitrary URLs. It's pretty much guaranteed you'll come across stuff that's wrong or even has the encoding type improperly set. But it's usually more important for the process to succeed than for it to be perfectly accurate.) Far as I can tell, this seems to currently be impossible with Phobos (unless you're *extremely* meticulous about watching what your entire codebase does with the data), which is a major pain when such a need arises. Anyone have a good workaround? For instance, maybe a function that'll take in a byte array and convert *all* invalid UTF-8 sequences to a user-selected valid character?
Jun 25 2011
On Sat, 25 Jun 2011 12:00:43 +0300, Nick Sabalausky <a a.a> wrote:
> Anyone have a good workaround? For instance, maybe a function that'll take in a byte array and convert *all* invalid UTF-8 sequences to a user-selected valid character?

I tend to do this a lot, for various reasons. In my experience, a great part of the string-handling functions in Phobos will work just fine with strings containing invalid UTF-8 - you can generally use your intuition about whether a function will need to look at individual characters inside the string. Note, though, that there's currently a bug in D2/Phobos (6064) which causes std.array.join (and possibly other functions) to treat strings as something that cannot be joined by concatenation, and to do a character-by-character copy instead (which is both needlessly inefficient and will choke on invalid UTF-8).

When I really need to pass arbitrary data through string-handling functions, I use these functions:

    import std.utf : toUTF8;

    /// convert any data to valid UTF-8, so D's string functions can properly work on it
    string rawToUTF8(string s)
    {
        dstring d;
        // treat every byte as the code point with the same value
        foreach (char c; s)
            d ~= c;
        return toUTF8(d);
    }

    /// undo rawToUTF8
    string UTF8ToRaw(string r)
    {
        string s;
        foreach (dchar c; r)
        {
            assert(c < '\u0100');
            // cast back to one byte; appending the dchar itself would re-encode it as UTF-8
            s ~= cast(char) c;
        }
        return s;
    }

( from https://github.com/CyberShadow/Team15/blob/master/Utils.d#L514 )

Of course, it would be nice if it were possible to convert only INVALID UTF-8 sequences. According to Wikipedia, the invalid Unicode code points U+DC80..U+DCFF are often used for encoding invalid byte sequences. I'd guess that a proper implementation will need to guarantee that a roundtrip will always return the same data as the input, so it'd have to "escape" the invalid code points used for escaping as well.

-- 
Best regards,
 Vladimir    mailto:vladimir thecybershadow.net
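The escaping idea from the last paragraph could look roughly like the sketch below. It is only an illustration and not code from this thread: the names escapeInvalid and unescape are made up, it assumes a current Phobos in which std.utf.decode throws UTFException, and it returns UTF-32 because U+DC80..U+DCFF are surrogate code points that cannot be stored in a valid UTF-8 string.

```d
import std.utf : decode, encode, UTFException;

/// Hypothetical helper: decode `raw` to UTF-32, mapping every byte that
/// cannot be decoded to the code point 0xDC00 + byte (U+DC80..U+DCFF).
dstring escapeInvalid(const(char)[] raw)
{
    dchar[] result;
    size_t i = 0;
    while (i < raw.length)
    {
        size_t next = i; // work on a copy so `i` is untouched if decode throws
        try
        {
            result ~= decode(raw, next);
            i = next;
        }
        catch (UTFException e)
        {
            // Only bytes >= 0x80 can fail to decode, so this lands in U+DC80..U+DCFF.
            result ~= cast(dchar)(0xDC00 + cast(ubyte) raw[i]);
            ++i;
        }
    }
    return result.idup;
}

/// Hypothetical inverse: turn escaped code points back into their raw bytes
/// and re-encode everything else as UTF-8.
immutable(ubyte)[] unescape(dstring text)
{
    immutable(ubyte)[] result;
    foreach (dchar c; text)
    {
        if (c >= 0xDC80 && c <= 0xDCFF)
        {
            result ~= cast(ubyte)(c - 0xDC00);
        }
        else
        {
            char[4] buf;
            immutable len = encode(buf, c);
            result ~= cast(const(ubyte)[]) buf[0 .. len];
        }
    }
    return result;
}
```

As far as I can tell, decode rejects UTF-8-encoded surrogates, so valid input should never produce the escape code points by itself and the bytes of such a sequence would simply be escaped one by one. Treat that as an assumption to verify rather than a guarantee.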
Jun 25 2011
"Vladimir Panteleev" <vladimir thecybershadow.net> wrote in message news:op.vxmuvzqbtuzx1w cybershadow.mshome.net...
> string s;
> foreach (dchar c; r)

That doesn't throw on an invalid sequence?
Jun 25 2011
On Sat, 25 Jun 2011 23:17:37 +0300, Nick Sabalausky <a a.a> wrote:
> "Vladimir Panteleev" <vladimir thecybershadow.net> wrote in message news:op.vxmuvzqbtuzx1w cybershadow.mshome.net...
>> string s;
>> foreach (dchar c; r)
>
> That doesn't throw on an invalid sequence?

You use rawToUTF8 to convert an arbitrary array of chars to valid UTF-8. You use UTF8ToRaw to convert the output of rawToUTF8 back to the original string.

-- 
Best regards,
 Vladimir    mailto:vladimir thecybershadow.net
Jun 26 2011
On 2011-06-25 02:00, Nick Sabalausky wrote:
> [...] Far as I can tell, this seems to currently be impossible with Phobos (unless you're *extremely* meticulous about watching what your entire codebase does with the data), which is a major pain when such a need arises.
>
> Anyone have a good workaround? For instance, maybe a function that'll take in a byte array and convert *all* invalid UTF-8 sequences to a user-selected valid character?

Convert it to a ubyte[] (or immutable(ubyte)[])? Anything that actually treats it as a string instead of an array of bytes _must_ treat it as UTF-8 since it has to decode to determine what the characters are. So, I don't think that there's really any way around that. A string must be valid UTF-8.

But if you really don't care about the string's contents, then you can just cast it to an array of ubyte and plenty of functions will work with it - nothing terribly string specific of course, but I don't see how you could possibly expect to do much string-specific with invalid data anyway.

- Jonathan M Davis
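A rough sketch of the byte-array route, for illustration only. The helper names are hypothetical, and it assumes a current Phobos where std.string.representation is available. Because the data is typed as bytes, nothing here decodes UTF-8, so nothing can throw on malformed input, and string literals can be compared against the bytes without a cast at the call site.

```d
import std.algorithm.searching : canFind, startsWith;
import std.string : representation;

/// Hypothetical helper: does the raw page contain the given marker?
bool containsMarker(const(ubyte)[] page, string marker)
{
    // representation() views the string as immutable(ubyte)[] without copying.
    return page.canFind(marker.representation);
}

/// Hypothetical helper: does the raw page begin with the given marker?
bool beginsWith(const(ubyte)[] page, string marker)
{
    return page.startsWith(marker.representation);
}
```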
Jun 25 2011
"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message news:mailman.1214.1309008317.14074.digitalmars-d-learn puremagic.com...
> Convert it to a ubyte[] (or immutable(ubyte)[])? [...] But if you really don't care about the string's contents, then you can just cast it to an array of ubyte and plenty of functions will work with it - nothing terribly string specific of course, but I don't see how you could possibly expect to do much string-specific with invalid data anyway.

Using immutable(ubyte)[] just causes an enormous amount of type-related problems, largely involving the need to throw around a bunch of casts absolutely everywhere, including every single time any of the byte arrays needs to come in contact with an actual string (for instance, a string literal, for comparing, searching or anything else). It might be the "correct" thing, but in many cases (anything that doesn't need to be perfect, or can't realistically be perfect) it's far more trouble than it's actually worth.

Like I said, "For instance, maybe a function that'll take in a byte array and convert *all* invalid UTF-8 sequences to a user-selected valid character?" In such a case, *there would be no invalid data* in the actual string.
Jun 25 2011
I've had a similar requirement some time ago. I've had to copy and modify the phobos function std.utf.decode for a custom text editor because the function throws when it finds an invalid code point. This is way too slow for my needs. I'm actually displaying invalid code points with special marks (just like Scintilla), so I need decoding to work as fast as possible. The new function simply replaces throwing exceptions with flagging a boolean.
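A decoder of that kind might look something like the following. This is a hand-written sketch and not the modified std.utf.decode described above; the name decodeLenient and its exact behaviour on bad input (consume one byte, hand back its value) are assumptions made for the example.

```d
/// Hypothetical non-throwing decoder: returns the code point starting at s[i]
/// and advances i past it. On malformed input it sets `invalid`, consumes
/// exactly one byte and returns that byte's value so the caller can mark it.
/// The caller must ensure i < s.length.
dchar decodeLenient(const(char)[] s, ref size_t i, out bool invalid)
{
    immutable ubyte first = cast(ubyte) s[i];
    if (first < 0x80)
    {
        ++i;
        return cast(dchar) first; // plain ASCII
    }

    size_t len; // total sequence length in bytes (0 = not a valid lead byte)
    uint cp;    // code point accumulated so far
    uint min;   // smallest code point that legitimately needs `len` bytes
    if ((first & 0xE0) == 0xC0)      { len = 2; cp = first & 0x1F; min = 0x80; }
    else if ((first & 0xF0) == 0xE0) { len = 3; cp = first & 0x0F; min = 0x800; }
    else if ((first & 0xF8) == 0xF0) { len = 4; cp = first & 0x07; min = 0x1_0000; }

    bool ok = len != 0 && s.length - i >= len;
    for (size_t k = 1; ok && k < len; ++k)
    {
        immutable ubyte b = cast(ubyte) s[i + k];
        ok = (b & 0xC0) == 0x80; // must be a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates and values beyond U+10FFFF.
    ok = ok && cp >= min && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

    if (!ok)
    {
        invalid = true;
        ++i;
        return cast(dchar) first;
    }
    i += len;
    return cast(dchar) cp;
}
```

Since no exception is ever constructed, the cost of a bad byte is a few comparisons, which is the point of flagging a boolean instead of throwing.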
Jun 25 2011
"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message news:mailman.1215.1309019944.14074.digitalmars-d-learn puremagic.com...
> I've had a similar requirement some time ago. I've had to copy and modify the phobos function std.utf.decode for a custom text editor because the function throws when it finds an invalid code point. [...] The new function simply replaces throwing exceptions with flagging a boolean.

I think I may end up doing something like that :/

I was hoping to be able to do something vaguely sensible like this:

    string newStr;
    foreach(dchar dc; str)
    {
        if(isValidDchar(dc))
            newStr ~= dc;
        else
            newStr ~= 'X';
    }
    str = newStr;

But that just blows up in my face.
Jun 25 2011
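The loop in the post above fails because foreach over dchar has to decode the string, and the decoder throws on the first malformed sequence, before isValidDchar is ever consulted. One way to get the intended effect is to call std.utf.decode by hand and catch the exception per sequence. The sketch below assumes a current Phobos (where the exception is spelled UTFException); replaceInvalid is a made-up name, and relying on exceptions makes it slow on input with many bad bytes.

```d
import std.utf : decode, UTFException;

/// Hypothetical helper: copy `raw`, replacing every undecodable byte
/// with `substitute`. The result is valid UTF-8 as long as `substitute` is.
string replaceInvalid(const(char)[] raw, dchar substitute = 'X')
{
    string result;
    size_t i = 0;
    while (i < raw.length)
    {
        size_t next = i; // work on a copy so `i` is untouched if decode throws
        try
        {
            immutable dchar c = decode(raw, next);
            result ~= c;
            i = next;
        }
        catch (UTFException e)
        {
            result ~= substitute;
            ++i; // skip one bad byte and resynchronise on the next
        }
    }
    return result;
}
```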
On 26.06.2011 1:49, Nick Sabalausky wrote:
> I think I may end up doing something like that :/
>
> I was hoping to be able to do something vaguely sensible like this:
>
>     string newStr;
>     foreach(dchar dc; str)
>     {
>         if(isValidDchar(dc))
>             newStr ~= dc;
>         else
>             newStr ~= 'X';
>     }
>     str = newStr;
>
> But that just blows up in my face.

std.encoding to the rescue? It looks like a well established module that was forgotten for some reason. And here I'm wondering what a function named sanitize could do :)

-- 
Dmitry Olshansky
Jun 25 2011
On 2011-06-25 15:17, Dmitry Olshansky wrote:
> std.encoding to the rescue? It looks like a well established module that was forgotten for some reason.

It's also likely going away. It was an experiment of sorts which Andrei considers a failure. We need something to replace it, but as I understand it, it doesn't solve all of the problems that it's supposed to, and those it does solve, it doesn't necessarily solve in the best way. So, an improved replacement is going to need to be devised, but I wouldn't expect std.encoding to stick around in the long run.

- Jonathan M Davis
Jun 25 2011
"Dmitry Olshansky" <dmitry.olsh gmail.com> wrote in message news:iu5n32$2vjd$1 digitalmars.com...
> std.encoding to the rescue? It looks like a well established module that was forgotten for some reason. And here I'm wondering what a function named sanitize could do :)

Ahh, I didn't even notice that module.

Even if it's imperfect and goes away, it looks like it'll at least get the job done for me. And the encoding conversions should even give me an easy way to save at least some of the invalid chars (which wasn't really a requirement of mine, but it'll still be nice).
Jun 25 2011
On 26.06.2011 3:25, Nick Sabalausky wrote:
> Ahh, I didn't even notice that module.

Same here. It was just a couple of days(!) ago that I somehow managed to find decode in the wrong place (in std.encoding instead of std.utf). And it looked useful, but I had never heard about it. Seriously, how many totally irrelevant old modules do we have around here? (hint: std.gregorian!)

> Even if it's imperfect and goes away, it looks like it'll at least get the job done for me. And the encoding conversions should even give me an easy way to save at least some of the invalid chars (which wasn't really a requirement of mine, but it'll still be nice).

Yeah, given the amount of necessary work in the Phobos realm it could hang around for quite some time ;)

-- 
Dmitry Olshansky
Jun 25 2011
"Dmitry Olshansky" <dmitry.olsh gmail.com> wrote in message news:iu5tan$ets$1 digitalmars.com...
> Yeah, given the amount of necessary work in the Phobos realm it could hang around for quite sometime ;)

Yea, and even when it does go, I can just copy it and include it manually (although it'll probably need some work once typedef goes away).

This seems to get the job done well enough for me, and even manages to save some of the intended chars:

    // With std.utf and std.encoding imported:
    string src = ...;

    bool valid = true;
    try
        validate(src);
    catch(UtfException e)
        valid = false;

    if(!valid)
    {
        auto tmpStr = sanitize( cast(Windows1252String) src );
        transcode(tmpStr, src);
    }
Jun 25 2011
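For reference, the snippet from the post above can be wrapped into a function along these lines. This is a sketch under assumptions: toValidUTF8 is a made-up name, it targets a current Phobos (where the exception type is UTFException rather than UtfException), and it keeps the post's guess that anything which is not valid UTF-8 is Windows-1252.

```d
import std.encoding : sanitize, transcode, Windows1252String;
import std.utf : validate, UTFException;

/// Hypothetical wrapper: return `src` unchanged if it is valid UTF-8,
/// otherwise reinterpret its bytes as Windows-1252 and transcode to UTF-8.
string toValidUTF8(string src)
{
    try
    {
        validate(src);
        return src;
    }
    catch (UTFException e)
    {
        // Not valid UTF-8: fall through to the legacy-encoding path.
    }

    // sanitize() replaces bytes that Windows-1252 leaves undefined.
    auto legacy = sanitize(cast(Windows1252String) src);
    string result;
    transcode(legacy, result);
    return result;
}
```

The design trade-off is that one malformed byte switches the whole input to the Windows-1252 interpretation, so correctly encoded multi-byte characters elsewhere in the same string will come out as mojibake. That is acceptable only under the "succeed rather than be accurate" requirement stated at the top of the thread.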
On 2011-06-25 17:04, Dmitry Olshansky wrote:
> Yeah, given the amount of necessary work in the Phobos realm it could hang around for quite sometime ;)

Oh, it'll probably be around for a while. It'll take time before a replacement is devised. After all, std.stream is still around, isn't it? And there's actually supposedly a plan regarding its replacement's implementation. There's no such thing with regards to std.encoding. I just thought that I should point out that it's likely to be replaced at some point (hopefully with something much better).

- Jonathan M Davis
Jun 25 2011