digitalmars.D - UTF-8 to dchar conversion
- Arcane Jill (90/90) Jul 28 2004 For Sean...
- Arcane Jill (7/7) Jul 28 2004 In article , Arcane Jill says...
- Arcane Jill (7/7) Jul 28 2004 In article , Arcane Jill says...
- Arcane Jill (86/86) Jul 28 2004 Aaargh!
- parabolis (17/112) Jul 28 2004 This function does not verify any non-first byte in a UTF-8
- Arcane Jill (14/22) Jul 28 2004 Well spotted. Okay, so replace
- Sean Kelly (4/4) Jul 28 2004 The routines themselves were left unaltered from the original UTF functi...
- Walter (7/7) Jul 28 2004 One aspect to consider when writing fast conversion code is the frequenc...
- Arcane Jill (72/79) Jul 28 2004 Good point.
- Walter (8/88) Jul 29 2004 Does your version also reject UTF-8 sequences that produce the correct
- Arcane Jill (20/22) Jul 29 2004 Theoretically, yes. Two-byte sequences starting with 0xC0 and 0xD0 are c...
- Arcane Jill (3/5) Jul 29 2004 should read
- Walter (4/5) Jul 29 2004 It would be nice to have a comprehensive set of test data for these thin...
For Sean... I noticed your std.utf update on the bugs forum. Using delegates is obviously sensible, but I noticed the routine looked a tad on the slow side. Here's a faster algorithm - it doesn't use delegates, but I'm sure you could do some mixing and matching to get the best of both. Here's my fast converter: (and no nasty gotos either!) Jill
Jul 28 2004
In article <ce91ga$jnj$1 digitaldaemon.com>, Arcane Jill says... Ah, bugger! should read: That'll teach me to post code without testing it first! Jill
Jul 28 2004
In article <ce91t7$jrt$1 digitaldaemon.com>, Arcane Jill says... And should read (Aren't you glad I'm not writing real code myself just now. Just think how many bugs it would end up with! Still - the /principle/ is sound.)
Jul 28 2004
Aaargh! Found even more bugs. Fixed them. Let's just start again. HERE's the fast UTF-8 routine... (If there are any more bugs after this, someone else can find them).
Jul 28 2004
This function does not verify any non-first byte in a UTF-8 sequence actually starts with 10xxxxxx... So it accepts 0xC1,0xBF (correct) and 0xC1,0xFF (incorrect)

You also probably wanted isValidDchar(c) instead of isValidDchar(s) and s = s[len..s.length]; instead of p = p[len..p.length];

(I also noticed you used uint exclusively... :P)

Out of curiosity why did you define the LENGTH and the START_CALC arrays?

Arcane Jill wrote:
> For Sean... I noticed your std.utf update on the bugs forum. Using delegates is obviously sensible, but I noticed the routine looked a tad on the slow side. Here's a faster algorithm - it doesn't use delegates, but I'm sure you could do some mixing and matching to get the best of both. Here's my fast converter: (and no nasty gotos either!)
>
> Jill
Jul 28 2004
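The continuation-byte check raised in the post above can be sketched as follows. This is a minimal illustration in C (rather than D, so that it compiles on its own), not the code from the thread; `is_continuation` and `tail_is_valid` are hypothetical names. It tests only the 10xxxxxx shape of the trailing bytes and says nothing about whether the lead byte itself is acceptable.

```c
#include <stdbool.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Check bytes 1..len-1 of a sequence whose lead byte is seq[0].
 * The caller is assumed to have derived len from the lead byte and
 * to have confirmed that len bytes are available. */
static bool tail_is_valid(const unsigned char *seq, size_t len)
{
    for (size_t i = 1; i < len; i++) {
        if (!is_continuation(seq[i]))
            return false;
    }
    return true;
}
```

With this check in place, a tail byte such as 0xBF passes and a tail byte such as 0xFF is rejected, which is the distinction the post asks for.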
In article <ce9483$kq0$1 digitaldaemon.com>, parabolis says...
> This function does not verify any non-first byte in a UTF-8 sequence actually starts with 10xxxxxx... So it accepts 0xC1,0xBF (correct) and 0xC1,0xFF (incorrect)

Well spotted. Okay, so replace with

Thanks very much for pointing that out. I appreciate it.

> You also probably wanted

Yeah, there were some typos in the original post. I fixed them in the repost.

> Out of curiosity why did you define the LENGTH and the START_CALC arrays?

Because they're the fast lookup tables.

Jill
Jul 28 2004
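To show what "fast lookup tables" can mean for UTF-8 decoding, here is a small C sketch (C rather than D, so that it compiles on its own). It is an assumption about the general technique, not a reconstruction of the LENGTH and START_CALC arrays from the thread, whose contents are not shown. `SEQ_LENGTH`, `LEAD_MASK`, `seq_length` and `lead_payload` are hypothetical names. The length table is indexed by the top five bits of the lead byte, so it is deliberately coarse: it does not single out lead bytes such as 0xC0 or 0xC1.

```c
#include <stdint.h>

/* Hypothetical table: sequence length indexed by (lead byte >> 3).
 * A zero entry means the byte cannot start a sequence. */
static const uint8_t SEQ_LENGTH[32] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x00-0x7F: ASCII        */
    0, 0, 0, 0, 0, 0, 0, 0,                         /* 0x80-0xBF: continuation */
    2, 2, 2, 2,                                     /* 0xC0-0xDF: 110xxxxx     */
    3, 3,                                           /* 0xE0-0xEF: 1110xxxx     */
    4,                                              /* 0xF0-0xF7: 11110xxx     */
    0                                               /* 0xF8-0xFF: not allowed  */
};

/* Hypothetical table: mask selecting the payload bits of the lead
 * byte, indexed by sequence length. */
static const uint8_t LEAD_MASK[5] = { 0x00, 0x7F, 0x1F, 0x0F, 0x07 };

/* Number of bytes in the sequence started by lead, or 0 if invalid. */
static unsigned seq_length(unsigned char lead)
{
    return SEQ_LENGTH[lead >> 3];
}

/* Payload bits carried by the lead byte (0 for an invalid lead). */
static uint32_t lead_payload(unsigned char lead)
{
    return (uint32_t)(lead & LEAD_MASK[seq_length(lead)]);
}
```

The point of the tables is that classifying a lead byte becomes a shift and an array read rather than a chain of comparisons.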
The routines themselves were left unaltered from the original UTF functions. I'll play with your suggestions and see if I can get it all working though. If the code can be made faster then that's fine with me :) Sean
Jul 28 2004
One aspect to consider when writing fast conversion code is the frequency of various characters. Characters do not have a flat random distribution. I'd wager that the overwhelming majority of them will be ASCII. Thus, a fast converter would first just test for ASCII, and save the more complex processing for non-ASCII. Your routine does numerous unnecessary operations on ASCII chars, so while it may be faster if the data is random, it would be slower on text data.
Jul 28 2004
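The ASCII-first idea described above can be sketched like this. It is a minimal C illustration (C rather than D, so that it compiles on its own) and not code from the thread; `decode_utf8` is a hypothetical name. To keep the fast path in focus, the slow path checks only sequence shape: it does not reject overlong forms, surrogates, or values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode n bytes of UTF-8 from s into out, which must have room for
 * n code points. Returns the number of code points written, or
 * (size_t)-1 if the input is malformed. */
static size_t decode_utf8(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t i = 0, count = 0;

    while (i < n) {
        unsigned char b = s[i];

        /* Fast path: one comparison handles every ASCII byte. */
        if (b < 0x80) {
            out[count++] = b;
            i++;
            continue;
        }

        /* Slow path: work out the length from the lead byte. */
        size_t len;
        uint32_t c;
        if ((b & 0xE0) == 0xC0)      { len = 2; c = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; c = b & 0x07; }
        else return (size_t)-1;      /* stray continuation or bad lead */

        if (len > n - i)
            return (size_t)-1;       /* sequence runs past the input */

        for (size_t k = 1; k < len; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80)
                return (size_t)-1;   /* not a 10xxxxxx byte */
            c = (c << 6) | (uint32_t)(t & 0x3F);
        }

        out[count++] = c;
        i += len;
    }
    return count;
}
```

On mostly-ASCII text the loop rarely leaves the first branch, so the cost of the multi-byte handling is paid only where it is needed.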
In article <ce98eo$n71$1 digitaldaemon.com>, Walter says...
> One aspect to consider when writing fast conversion code is the frequency of various characters. Characters do not have a flat random distribution. I'd wager that the overwhelming majority of them will be ASCII. Thus, a fast converter would first just test for ASCII, and save the more complex processing for non-ASCII. Your routine does numerous unnecessary operations on ASCII chars, so while it may be faster if the data is random, it would be slower on text data.

Good point. Here's a new version then, which tests for ASCII first. (It also makes the lookup tables half the size!)

Jill
Jul 28 2004
Does your version also reject UTF-8 sequences that produce the correct value, but are not the shortest possible sequence?

"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cea792$14f4$1 digitaldaemon.com...
> In article <ce98eo$n71$1 digitaldaemon.com>, Walter says...
> > One aspect to consider when writing fast conversion code is the frequency of various characters. Characters do not have a flat random distribution. I'd wager that the overwhelming majority of them will be ASCII. Thus, a fast converter would first just test for ASCII, and save the more complex processing for non-ASCII. Your routine does numerous unnecessary operations on ASCII chars, so while it may be faster if the data is random, it would be slower on text data.
>
> Good point. Here's a new version then, which tests for ASCII first. (It also makes the lookup tables half the size!)
>
> Jill
Jul 29 2004
In article <cebj7l$1mro$1 digitaldaemon.com>, Walter says...
> Does your version also reject UTF-8 sequences that produce the correct value, but are not the shortest possible sequence?

Theoretically, yes. Two-byte sequences starting with 0xC0 and 0xD0 are caught by the relevant zero entries in the LENGTH table (at offsets 0x40 and 0x41); Overlong three and four byte sequences are ruled out by the test: and overlong five or more byte sequences (indeed, /all/ five or more byte sequences) are ruled out, again, by zeroes in the LENGTH table (at offset 0x78 to 0x7F).

I have to confess, though, I have not tested this. I wrote it and posted it without testing it, which is bad form, I know, but it's the first D I've written since the funeral and I'm just getting back into practice. I figured you wouldn't want to use it as-is anyway, because you'll want all that delegate stuff with get() and put() instead of just assuming everyone wants a string.

That said, I can't /see/ any bugs in it, and it's quite short so there are not many places for them to hide. (So, if you use this, or a variant of it, keep the unit tests in).

If you want UTF conversion to /really/ zip along, you could consider dropping to assembler. Just a thought.

Jill
Jul 29 2004
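As a rough illustration of the overlong-sequence discussion, the following C sketch (C rather than D, so that it compiles on its own) decodes a single sequence and rejects non-shortest forms. It is not the code from the thread. The `LENGTH` table layout is modeled on the description in the post (indexed by lead byte minus 0x80, with zero meaning "cannot start a sequence"), while `MIN_VALUE` and `decode_one` are hypothetical names, and the minimum-value comparison is an assumed stand-in for the test that did not survive in the post. In UTF-8 as currently specified, the two-byte lead bytes that can only produce overlong forms are 0xC0 and 0xC1 (offsets 0x40 and 0x41 in such a table); 0xD0 is an ordinary lead byte, so the table here keeps it valid. The sketch also zeroes 0xF5 and above, which goes slightly beyond the post's description.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table indexed by (lead byte - 0x80); 0 = invalid lead. */
static const uint8_t LENGTH[128] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  /* 0x80-0x8F: continuation bytes  */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  /* 0x90-0x9F                      */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  /* 0xA0-0xAF                      */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  /* 0xB0-0xBF                      */
    0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,  /* 0xC0, 0xC1 are always overlong */
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,  /* 0xD0-0xDF                      */
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,  /* 0xE0-0xEF                      */
    4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0   /* 0xF5 and above never valid     */
};

/* Smallest code point that needs a sequence of the given length. */
static const uint32_t MIN_VALUE[5] = { 0, 0, 0x80, 0x800, 0x10000 };

/* Decode one code point from s (n bytes available) into *out.
 * Returns the number of bytes consumed, or 0 on any error. */
static size_t decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                      /* ASCII */
        *out = s[0];
        return 1;
    }

    size_t len = LENGTH[s[0] - 0x80];
    if (len == 0 || len > n)
        return 0;                           /* bad lead or truncated */

    uint32_t c = s[0] & (0x7Fu >> len);     /* payload bits of the lead */
    for (size_t k = 1; k < len; k++) {
        if ((s[k] & 0xC0) != 0x80)
            return 0;                       /* not a 10xxxxxx byte */
        c = (c << 6) | (uint32_t)(s[k] & 0x3F);
    }

    if (c < MIN_VALUE[len])
        return 0;                           /* overlong encoding */
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                           /* out of range or surrogate */

    *out = c;
    return len;
}
```

Two-byte overlong forms never reach the value comparison because the table rejects their lead bytes outright; the comparison is what catches three- and four-byte sequences that encode a value a shorter sequence could have carried.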
In article <cebljb$1nu9$1 digitaldaemon.com>, Arcane Jill says...

Textual typo correction:
> (at offsets 0x40 and 0x41);
should read
> (at offsets 0x40 and 0x50);
Jul 29 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cebljb$1nu9$1 digitaldaemon.com...
> I have to confess, though, I have not tested this.

It would be nice to have a comprehensive set of test data for these things. Are there any on the UTF sites you look at?
Jul 29 2004