digitalmars.D - UTF-8 to dchar conversion

Arcane Jill <Arcane_member pathlink.com> writes:
For Sean...

I noticed your std.utf update on the bugs forum. Using delegates is obviously
sensible, but I noticed the routine looked a tad on the slow side. Here's a
faster algorithm - it doesn't use delegates, but I'm sure you could do some
mixing and matching to get the best of both. Here's my fast converter:

(and no nasty gotos either!)
Jill
Jul 28 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce91ga$jnj$1 digitaldaemon.com>, Arcane Jill says...

Ah, bugger!



should read:



That'll teach me to post code without testing it first!
Jill
Jul 28 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce91t7$jrt$1 digitaldaemon.com>, Arcane Jill says...

And



should read



(Aren't you glad I'm not writing real code myself just now. Just think how many
bugs it would end up with! Still - the /principle/ is sound.)
Jul 28 2004
Arcane Jill <Arcane_member pathlink.com> writes:
Aaargh!

Found even more bugs. Fixed them. Let's just start again. HERE's the fast UTF-8
routine... (If there are any more bugs after this, someone else can find them).

Jul 28 2004
parabolis <parabolis softhome.net> writes:
This function does not verify any non-first byte in a UTF-8 
sequence actually starts with 10xxxxxx... So it accepts
   0xC1,0xBF  (correct)
   and
   0xC1,0xFF (incorrect)
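
The continuation-byte check described above can be sketched as follows. This is a C illustration, not the D routine from the thread (which is not preserved here), and the helper names `is_continuation` and `trailing_bytes_ok` are hypothetical. The notes after the block use 0xC2 as the lead byte rather than 0xC1, because 0xC1 can only begin an overlong form and would be rejected for a separate reason.

```c
#include <stdbool.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Check that the len-1 bytes after the lead byte s[0] are all
   continuation bytes. The caller derives len (2..4) from the lead
   byte and guarantees that s holds at least len bytes. */
static bool trailing_bytes_ok(const unsigned char *s, size_t len)
{
    for (size_t i = 1; i < len; i++) {
        if (!is_continuation(s[i]))
            return false;
    }
    return true;
}
```

With this in place, the pair 0xC2,0xBF passes and 0xC2,0xFF is rejected, since 0xFF does not match 10xxxxxx.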

You also probably wanted
     isValidDchar(c)
         instead of
     isValidDchar(s)
         and
     s = s[len..s.length];
         instead of
     p = p[len..p.length];

(I also noticed you used uint exclusively... :P)

Out of curiosity why did you define the LENGTH and the 
START_CALC arrays?

Jul 28 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce9483$kq0$1 digitaldaemon.com>, parabolis says...
This function does not verify any non-first byte in a UTF-8
sequence actually starts with 10xxxxxx... So it accepts
   0xC1,0xBF  (correct)
   and
   0xC1,0xFF (incorrect)

Well spotted. Okay, so replace

with

Thanks very much for pointing that out. I appreciate it.

You also probably wanted

Yeah, there were some typos in the original post. I fixed them in the repost.

Out of curiosity why did you define the LENGTH and the
START_CALC arrays?

Because they're the fast lookup tables.

Jill
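
To illustrate what lookup tables of this kind might look like, here is a minimal C sketch. The contents of the original LENGTH and START_CALC arrays are not preserved in this thread, so the table names `seq_length` and `lead_mask`, their layout, and the `init_tables` helper are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical tables indexed by the first byte of a sequence. */
static uint8_t seq_length[256]; /* sequence length; 0 = cannot start a sequence */
static uint8_t lead_mask[256];  /* mask selecting the payload bits of the lead byte */

/* Fill both tables once at startup. */
static void init_tables(void)
{
    for (int b = 0; b < 256; b++) {
        if (b < 0x80)                    { seq_length[b] = 1; lead_mask[b] = 0x7F; }
        else if (b >= 0xC2 && b <= 0xDF) { seq_length[b] = 2; lead_mask[b] = 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { seq_length[b] = 3; lead_mask[b] = 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { seq_length[b] = 4; lead_mask[b] = 0x07; }
        else                             { seq_length[b] = 0; lead_mask[b] = 0x00; }
    }
}
```

A decoder built on such tables would read the first byte once, fetch the length and the starting value (`s[0] & lead_mask[s[0]]`) by indexing, and avoid a chain of comparisons per character.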
Jul 28 2004
Sean Kelly <sean f4.ca> writes:
The routines themselves were left unaltered from the original UTF functions.
I'll play with your suggestions and see if I can get it all working though.  If
the code can be made faster then that's fine with me :)


Sean
Jul 28 2004
"Walter" <newshound digitalmars.com> writes:
One aspect to consider when writing fast conversion code is the frequency of
various characters. Characters do not have a flat random distribution. I'd
wager that the overwhelming majority of them will be ASCII. Thus, a fast
converter would first just test for ASCII, and save the more complex
processing for non-ASCII. Your routine does numerous unnecessary operations
on ASCII chars, so while it may be faster if the data is random, it would be
slower on text data.
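
The ASCII-first structure suggested above can be sketched like this. It is a C illustration under stated assumptions, not code from the thread: the function name `utf8_to_utf32`, its signature, and the error convention of returning `(size_t)-1` are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode n bytes of UTF-8 from src into dst, which must have room for
   n code points. Returns the number of code points written, or
   (size_t)-1 if the input is malformed. */
static size_t utf8_to_utf32(const unsigned char *src, size_t n, uint32_t *dst)
{
    size_t i = 0, out = 0;
    while (i < n) {
        unsigned char b = src[i];
        if (b < 0x80) {              /* fast path: one compare per ASCII byte */
            dst[out++] = b;
            i++;
            continue;
        }
        /* Slow path: multi-byte sequence. */
        size_t len;
        uint32_t c, min;
        if (b >= 0xC2 && b <= 0xDF)      { len = 2; c = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0)     { len = 3; c = b & 0x0F; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { len = 4; c = b & 0x07; min = 0x10000; }
        else return (size_t)-1;      /* stray continuation or invalid lead byte */
        if (n - i < len)
            return (size_t)-1;       /* sequence truncated by end of input */
        for (size_t k = 1; k < len; k++) {
            unsigned char t = src[i + k];
            if ((t & 0xC0) != 0x80)
                return (size_t)-1;   /* not a 10xxxxxx continuation byte */
            c = (c << 6) | (t & 0x3F);
        }
        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return (size_t)-1;
        dst[out++] = c;
        i += len;
    }
    return out;
}
```

For mostly-ASCII text the loop body reduces to the first `if`, so the more expensive decoding and validation only run on the comparatively rare non-ASCII characters.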
Jul 28 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce98eo$n71$1 digitaldaemon.com>, Walter says...
One aspect to consider when writing fast conversion code is the frequency of
various characters. Characters do not have a flat random distribution. I'd
wager that the overwhelming majority of them will be ASCII. Thus, a fast
converter would first just test for ASCII, and save the more complex
processing for non-ASCII. Your routine does numerous unnecessary operations
on ASCII chars, so while it may be faster if the data is random, it would be
slower on text data.

Good point. Here's a new version then, which tests for ASCII first. (It also
makes the lookup tables half the size!)

Jill
Jul 28 2004
"Walter" <newshound digitalmars.com> writes:
Does your version also reject UTF-8 sequences that produce the correct
value, but are not the shortest possible sequence?
Jul 29 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cebj7l$1mro$1 digitaldaemon.com>, Walter says...
Does your version also reject UTF-8 sequences that produce the correct
value, but are not the shortest possible sequence?

Theoretically, yes. Two-byte sequences starting with 0xC0 and 0xD0 are caught
by the relevant zero entries in the LENGTH table (at offsets 0x40 and 0x41);
overlong three and four byte sequences are ruled out by the test:

and overlong five or more byte sequences (indeed, /all/ five or more byte
sequences) are ruled out, again, by zeroes in the LENGTH table (at offset 0x78
to 0x7F).

I have to confess, though, I have not tested this. I wrote it and posted it
without testing it, which is bad form, I know, but it's the first D I've
written since the funeral and I'm just getting back into practice. I figured
you wouldn't want to use it as-is anyway, because you'll want all that
delegate stuff with get() and put() instead of just assuming everyone wants a
string.

That said, I can't /see/ any bugs in it, and it's quite short so there are not
many places for them to hide. (So, if you use this, or a variant of it, keep
the unit tests in).

If you want UTF conversion to /really/ zip along, you could consider dropping
to assembler. Just a thought.

Jill
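
The test for overlong three and four byte sequences is not preserved in this post. One common way to express such a test, sketched here in C as an assumption rather than as the original code, is to compare the decoded value against the smallest code point that genuinely needs a sequence of that length. The names `min_code_point` and `is_shortest_form` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Smallest code point that requires a sequence of each length.
   Index = sequence length in bytes; index 0 is unused. */
static const uint32_t min_code_point[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };

/* True if code point c, decoded from a sequence of len bytes, could not
   have been encoded in fewer bytes. Lengths outside 1..4 are rejected. */
static bool is_shortest_form(unsigned len, uint32_t c)
{
    return len >= 1 && len <= 4 && c >= min_code_point[len];
}
```

For example, the bytes 0xE0,0x80,0xAF decode to U+002F with length 3, and `is_shortest_form(3, 0x2F)` returns false because U+002F fits in a single byte.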
Jul 29 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cebljb$1nu9$1 digitaldaemon.com>, Arcane Jill says...

Textual typo correction:

(at offsets 0x40 and 0x41);
should read
(at offsets 0x40 and 0x50);
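
The offsets under discussion can be checked with a small C sketch. It assumes the halved table covers only bytes 0x80 to 0xFF and is indexed by the lead byte minus 0x80; that layout is an inference from the offsets 0x78 to 0x7F quoted for five or more byte sequences, not something the surviving text confirms. The names `length_hi`, `init_length_hi` and `table_offset` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical 128-entry length table for bytes 0x80..0xFF only.
   ASCII never reaches the table because it is handled beforehand. */
static uint8_t length_hi[128];

static void init_length_hi(void)
{
    for (int b = 0x80; b < 0x100; b++) {
        uint8_t len = 0; /* continuation bytes, 0xC0, 0xC1, 0xF5..0xFF stay 0 */
        if (b >= 0xC2 && b <= 0xDF)      len = 2;
        else if (b >= 0xE0 && b <= 0xEF) len = 3;
        else if (b >= 0xF0 && b <= 0xF4) len = 4;
        length_hi[b - 0x80] = len;
    }
}

/* Offset of a lead byte in the halved table. Precondition: lead >= 0x80. */
static unsigned table_offset(unsigned char lead)
{
    return lead - 0x80u;
}
```

Under this indexing, 0xC0 and 0xC1 map to offsets 0x40 and 0x41, and 0xD0 maps to 0x50. In UTF-8 the two-byte lead bytes that can only produce overlong forms are 0xC0 and 0xC1, whereas 0xD0 begins valid sequences for U+0400 to U+043F, so a zero entry at offset 0x50 would reject legitimate input.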
Jul 29 2004
"Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cebljb$1nu9$1 digitaldaemon.com...
 I have to confess, though, I have not tested this.

It would be nice to have a comprehensive set of test data for these things. Are
there any on the UTF sites you look at?
Jul 29 2004