digitalmars.D - UTF-8 to dchar conversion

Arcane Jill <Arcane_member pathlink.com> writes:

For Sean...

I noticed your std.utf update on the bugs forum. Using delegates is obviously
sensible, but I noticed the routine looked a tad on the slow side. Here's a
faster algorithm - it doesn't use delegates, but I'm sure you could do some
mixing and matching to get the best of both. Here's my fast converter:

[code listing missing from the archive]

(and no nasty gotos either!)
Jill

Jul 28 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <ce91ga$jnj$1 digitaldaemon.com>, Arcane Jill says...

Ah, bugger!

[code snippet missing from the archive]

should read:

[code snippet missing from the archive]

That'll teach me to post code without testing it first!
Jill

Jul 28 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <ce91t7$jrt$1 digitaldaemon.com>, Arcane Jill says...

And

[code snippet missing from the archive]

should read

[code snippet missing from the archive]

(Aren't you glad I'm not writing real code myself just now. Just think how many
bugs it would end up with! Still - the /principle/ is sound.)

Jul 28 2004

Arcane Jill <Arcane_member pathlink.com> writes:

Aaargh!

Found even more bugs. Fixed them. Let's just start again. HERE's the fast UTF-8
routine... (If there are any more bugs after this, someone else can find them).

[code listing missing from the archive]

Jul 28 2004

parabolis <parabolis softhome.net> writes:

This function does not verify any non-first byte in a UTF-8 
sequence actually starts with 10xxxxxx... So it accepts
   0xC1,0xBF  (correct)
   and
   0xC1,0xFF (incorrect)
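
A minimal sketch of the missing check, written in present-day D and not taken from the posted routine: every byte after the first must match the bit pattern 10xxxxxx, which is a single mask-and-compare. The names isContinuationByte and hasValidTail are hypothetical, and len stands for whatever sequence length the first byte implies. This covers trailing bytes only; under the current UTF-8 definition 0xC1 is not a usable first byte either, which is a separate check.

```d
/// Returns true if `b` has the bit pattern 10xxxxxx required of every
/// byte after the first in a multi-byte UTF-8 sequence.
bool isContinuationByte(ubyte b) pure nothrow @nogc @safe
{
    return (b & 0xC0) == 0x80;
}

/// Checks that bytes 1 .. len of `seq` are all continuation bytes.
/// `len` is the sequence length implied by the first byte (a hypothetical
/// parameter; a real decoder would take it from its own length lookup).
bool hasValidTail(const(ubyte)[] seq, size_t len) pure nothrow @nogc @safe
{
    if (len == 0 || seq.length < len)
        return false;               // nothing to check, or input truncated
    foreach (b; seq[1 .. len])
        if (!isContinuationByte(b))
            return false;
    return true;
}
```

With this in place the pair 0xC1,0xFF fails on its second byte, while 0xC1,0xBF passes the trailing-byte test.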

You also probably wanted
     isValidDchar(c)
         instead of
     isValidDchar(s)
         and
     s = s[len..s.length];
         instead of
     p = p[len..p.length];

(I also noticed you used uint exclusively... :P)

Out of curiosity why did you define the LENGTH and the 
START_CALC arrays?

Jul 28 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <ce9483$kq0$1 digitaldaemon.com>, parabolis says...
> This function does not verify any non-first byte in a UTF-8
> sequence actually starts with 10xxxxxx... So it accepts
>    0xC1,0xBF  (correct)
>    and
>    0xC1,0xFF (incorrect)

Well spotted. Okay, so replace

[code snippet missing from the archive]

with

[code snippet missing from the archive]

Thanks very much for pointing that out. I appreciate it.


> You also probably wanted

Yeah, there were some typos in the original post. I fixed them in the repost.

> Out of curiosity why did you define the LENGTH and the
> START_CALC arrays?

Because they're the fast lookup tables.
Jill
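
To illustrate what a fast lookup table buys here, the following is a hypothetical sketch and not a reconstruction of the LENGTH or START_CALC arrays, whose contents did not survive in this archive. It assumes a 256-entry table indexed by the first byte, filled according to the current UTF-8 definition (four bytes at most, nothing above U+10FFFF). SEQ_LENGTH, buildLengthTable and sequenceLength are made-up names.

```d
/// Sequence length implied by each possible first byte; 0 marks a byte
/// that cannot start a sequence. Evaluated once, at compile time.
immutable ubyte[256] SEQ_LENGTH = buildLengthTable();

private ubyte[256] buildLengthTable() pure nothrow @safe
{
    ubyte[256] t;                           // every entry starts out as 0
    foreach (i; 0x00 .. 0x80) t[i] = 1;     // 0xxxxxxx: ASCII
    // 0x80 .. 0xBF are continuation bytes and stay 0
    // 0xC0 and 0xC1 could only begin overlong forms and stay 0
    foreach (i; 0xC2 .. 0xE0) t[i] = 2;     // 110xxxxx
    foreach (i; 0xE0 .. 0xF0) t[i] = 3;     // 1110xxxx
    foreach (i; 0xF0 .. 0xF5) t[i] = 4;     // 11110xxx, up to U+10FFFF
    // 0xF5 .. 0xFF stay 0
    return t;
}

/// One indexed load replaces a chain of bit tests on the first byte.
size_t sequenceLength(ubyte first) pure nothrow @nogc @safe
{
    return SEQ_LENGTH[first];
}
```

The design point is that classification of the first byte and rejection of impossible first bytes both fall out of the same array access.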

Jul 28 2004

Sean Kelly <sean f4.ca> writes:

The routines themselves were left unaltered from the original UTF functions.
I'll play with your suggestions and see if I can get it all working though.  If
the code can be made faster then that's fine with me :)


Sean

Jul 28 2004

"Walter" <newshound digitalmars.com> writes:

One aspect to consider when writing fast conversion code is the frequency of
various characters. Characters do not have a flat random distribution. I'd
wager that the overwhelming majority of them will be ASCII. Thus, a fast
converter would first just test for ASCII, and save the more complex
processing for non-ASCII. Your routine does numerous unnecessary operations
on ASCII chars, so while it may be faster if the data is random, it would be
slower on text data.
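
The ASCII-first idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the std.utf code under discussion: toUTF32Fast is a hypothetical name, and everything that is not ASCII is simply handed to Phobos's std.utf.decode, which throws UTFException on malformed input.

```d
import std.utf : decode;

/// Converts UTF-8 to UTF-32. ASCII bytes are copied straight through;
/// anything else takes the slower general-purpose path.
dchar[] toUTF32Fast(const(char)[] s)
{
    auto result = new dchar[s.length];  // upper bound: one dchar per byte
    size_t n = 0;                       // number of dchars written so far
    size_t i = 0;                       // read position in s
    while (i < s.length)
    {
        immutable char c = s[i];
        if (c < 0x80)
        {
            result[n++] = c;            // fast path: ASCII maps to itself
            ++i;
        }
        else
        {
            result[n++] = decode(s, i); // slow path; decode advances i
        }
    }
    return result[0 .. n];
}
```

For mostly-ASCII text the loop body reduces to one comparison and one store per character, which is the case the post argues should be optimised first.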

Jul 28 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <ce98eo$n71$1 digitaldaemon.com>, Walter says...
> One aspect to consider when writing fast conversion code is the frequency of
> various characters. Characters do not have a flat random distribution. I'd
> wager that the overwhelming majority of them will be ASCII. Thus, a fast
> converter would first just test for ASCII, and save the more complex
> processing for non-ASCII. Your routine does numerous unnecessary operations
> on ASCII chars, so while it may be faster if the data is random, it would be
> slower on text data.

Good point.

Here's a new version then, which tests for ASCII first. (It also makes the
lookup tables half the size!)

[code listing missing from the archive]

Jill

Jul 28 2004

"Walter" <newshound digitalmars.com> writes:

Does your version also reject UTF-8 sequences that produce the correct
value, but are not the shortest possible sequence?

Jul 29 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cebj7l$1mro$1 digitaldaemon.com>, Walter says...
> Does your version also reject UTF-8 sequences that produce the correct
> value, but are not the shortest possible sequence?

Theoretically, yes. Two-byte sequences starting with 0xC0 and 0xD0 are caught by
the relevant zero entries in the LENGTH table (at offsets 0x40 and 0x41);
overlong three- and four-byte sequences are ruled out by the test:

[code snippet missing from the archive]

and overlong sequences of five or more bytes (indeed, /all/ sequences of five or
more bytes) are ruled out, again, by zeroes in the LENGTH table (at offsets 0x78
to 0x7F).
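
One common way to express a shortest-form test, sketched here as an assumption because the test from the post is not preserved: decode the value, then compare it against the smallest code point that genuinely needs a sequence of that length. MIN_VALUE, FIRST_MASK and decodeShortest are hypothetical names. The sketch assumes the length and the trailing bytes have already been checked, and it does not look at surrogates or values above U+10FFFF. As a side note, in UTF-8 as currently defined the two first bytes that can only begin overlong two-byte forms are 0xC0 and 0xC1; 0xD0 begins ordinary two-byte sequences.

```d
/// Smallest code point that genuinely needs a sequence of each length
/// (index = length in bytes; index 0 is unused).
immutable uint[5] MIN_VALUE = [0, 0x0, 0x80, 0x800, 0x1_0000];

/// Decodes one sequence of 1 to 4 bytes whose trailing bytes are assumed
/// to be valid continuation bytes. Returns false for overlong forms.
bool decodeShortest(const(ubyte)[] seq, out dchar c) pure nothrow @nogc @safe
{
    immutable len = seq.length;
    if (len < 1 || len > 4)
        return false;
    // Payload bits carried by the first byte: 7, 5, 4 or 3 of them.
    static immutable ubyte[5] FIRST_MASK = [0, 0x7F, 0x1F, 0x0F, 0x07];
    uint v = seq[0] & FIRST_MASK[len];
    foreach (b; seq[1 .. $])
        v = (v << 6) | (b & 0x3F);  // six payload bits per trailing byte
    if (v < MIN_VALUE[len])
        return false;               // a shorter encoding exists: overlong
    c = cast(dchar) v;
    return true;
}
```

For example, the bytes 0xE0,0x80,0xAF decode to 0x2F, which is below the three-byte minimum of 0x800 and is therefore rejected.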

I have to confess, though, I have not tested this. I wrote it and posted it
without testing it, which is bad form, I know, but it's the first D I've written
since the funeral and I'm just getting back into practice. I figured you
wouldn't want to use it as-is anyway, because you'll want all that delegate
stuff with get() and put() instead of just assuming everyone wants a string.
That said, I can't /see/ any bugs in it, and it's quite short so there are not
many places for them to hide. (So, if you use this, or a variant of it, keep the
unit tests in).

If you want UTF conversion to /really/ zip along, you could consider dropping to
assembler. Just a thought.

Jill

Jul 29 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cebljb$1nu9$1 digitaldaemon.com>, Arcane Jill says...

Textual typo correction:

(at offsets 0x40 and 0x41);

should read

(at offsets 0x40 and 0x50);

Jul 29 2004

"Walter" <newshound digitalmars.com> writes:

"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cebljb$1nu9$1 digitaldaemon.com...
> I have to confess, though, I have not tested this.

It would be nice to have a comprehensive set of test data for these things.
Are there any on the UTF sites you look at?
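
As a starting point, and far from comprehensive, a small table of byte sequences with the verdict the current UTF-8 definition gives each could look like the sketch below. Utf8Case, CASES, isWellFormed and countMismatches are hypothetical names; isWellFormed simply wraps Phobos's std.utf.validate so that the table can be run against an existing decoder, and any other decoder can be plugged in through the function pointer.

```d
import std.utf : validate, UTFException;

struct Utf8Case
{
    immutable(ubyte)[] bytes;
    bool wellFormed;
    string note;
}

immutable Utf8Case[] CASES = [
    Utf8Case([0x41],                         true,  "ASCII"),
    Utf8Case([0xC3, 0xA9],                   true,  "two bytes, U+00E9"),
    Utf8Case([0xE2, 0x82, 0xAC],             true,  "three bytes, U+20AC"),
    Utf8Case([0xF0, 0x9F, 0x98, 0x80],       true,  "four bytes, U+1F600"),
    Utf8Case([0x80],                         false, "lone continuation byte"),
    Utf8Case([0xC3],                         false, "truncated sequence"),
    Utf8Case([0xC3, 0xFF],                   false, "bad continuation byte"),
    Utf8Case([0xC0, 0xAF],                   false, "overlong, two bytes"),
    Utf8Case([0xE0, 0x80, 0xAF],             false, "overlong, three bytes"),
    Utf8Case([0xED, 0xA0, 0x80],             false, "surrogate U+D800"),
    Utf8Case([0xF4, 0x90, 0x80, 0x80],       false, "above U+10FFFF"),
    Utf8Case([0xF8, 0x88, 0x80, 0x80, 0x80], false, "five-byte form"),
];

/// Reference verdict obtained from Phobos.
bool isWellFormed(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Runs a decoder's accept/reject decision against every case and
/// returns how many verdicts disagree with the table.
size_t countMismatches(bool function(const(ubyte)[]) accepts)
{
    size_t mismatches = 0;
    foreach (c; CASES)
        if (accepts(c.bytes) != c.wellFormed)
            ++mismatches;
    return mismatches;
}
```

Calling countMismatches(&isWellFormed) checks the table against Phobos; passing a pointer to a hand-written validator checks that validator instead.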

Jul 29 2004
