
digitalmars.D - Why do you decode ? (Seriously)

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
Intrigued by a familiar topic in std.lexer, I've split it out.
It's not as easy a question as it seems.

Before you start with the usual "because a codepoint has semantic meaning, 
a codeunit is just bytes, ya-da, ya-da", let me explain something.

A codepoint is indeed a complete piece of symbolic information represented 
as a number in the [0, 0x10FFFF] range.
A few such pieces make up a user-perceived character; not that many people 
bother with this, as the "few" is awfully often equal to 1.
So far nothing new.

My point is - people decode UTF-8 to dchar only to be able to:
a) compare it directly with compiler's built-in '<someunicodechar>'
b) call one of isAlpha, isSpace, ... that take dchar

In other words:
	Decoding should be required only when one wants to store it in a new 
form. Otherwise, if used for direct consumption, it's pointless extra work.

Now take a look at this snippet:

char[] input = ...;
size_t idx = ...;
size_t len = stride(input, idx); // code units in the sequence at idx
uint u8word = *cast(uint*)(input.ptr + idx); // unaligned read; assumes a
// little-endian target and at least 4 readable bytes at idx
//u8word contains full UTF-8 sequence
u8word &= len == 4 ? uint.max : (1u << (8 * len)) - 1; //mask out extra bytes
// (shifting a uint by 32 is undefined, hence the len == 4 special case)
//now u8word is a complete UTF-8 sequence in one uint
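For illustration, here is a bounds-safe rendition of the same extraction in C (the trick is language-agnostic). It avoids both the unaligned read and reading past the end of the sequence; `utf8_word` is a made-up helper name:

```c
#include <stdint.h>
#include <stddef.h>

/* Assemble len (1..4) UTF-8 code units starting at s into one uint32_t,
   first code unit in the low-order byte -- the same layout the
   cast-and-mask snippet above produces on a little-endian target,
   but without unaligned reads or over-reading the buffer. */
uint32_t utf8_word(const unsigned char *s, size_t len)
{
    uint32_t w = 0;
    for (size_t i = 0; i < len; ++i)
        w |= (uint32_t)s[i] << (8 * i);
    return w;
}
```

The byte loop is endian-independent, so the result matches the little-endian memory layout on any platform.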


Barring its hacky nature, I claim that the number obtained is in no way 
worse than a distilled codepoint. It is a number that maps 1:1 to any 
codepoint in the range [0, 0x10FFFF]. Let me call it a UTF-8 word.
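The 1:1 claim is easy to check: the code point can always be recovered from the UTF-8 word by the usual bit surgery. A sketch in C (no validation, assumes a well-formed sequence; `word_to_cp` is a hypothetical name):

```c
#include <stdint.h>

/* Recover the code point from a UTF-8 word (first code unit in the
   low-order byte). Sketch only: assumes the word holds a well-formed
   UTF-8 sequence. */
uint32_t word_to_cp(uint32_t w)
{
    uint32_t b0 = w & 0xFF;
    if (b0 < 0x80)                 /* 1 byte: ASCII as-is */
        return b0;
    if (b0 < 0xE0)                 /* 2 bytes */
        return ((b0 & 0x1F) << 6) | ((w >> 8) & 0x3F);
    if (b0 < 0xF0)                 /* 3 bytes */
        return ((b0 & 0x0F) << 12)
             | (((w >> 8) & 0x3F) << 6)
             | ((w >> 16) & 0x3F);
    return ((b0 & 0x07) << 18)     /* 4 bytes */
         | (((w >> 8) & 0x3F) << 12)
         | (((w >> 16) & 0x3F) << 6)
         | ((w >> 24) & 0x3F);
}
```

E.g. 'Ы' is U+042B and encodes as D0 AB, so the word is 0xABD0 and decoding it yields 0x42B again.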

So why do we use dchar and not the UTF-8 word, as it's as good as dchar and 
faster to obtain? The reasons, as above, are:
a) the compiler doesn't generate UTF-8 words in any built-in way (and thus 
there is no special type)
b) there are no functions that will do isAlpha on this beast.

Because of the above, it currently requires doing some manual work instead 
of compiler magic.

Reminding that I'm (no big wonder) doing the "Improve Unicode support 
for D" GSOC project, I think I can easily help with point b. To that 
end the solution is flexible enough to do the same with a UTF-16 word (not 
that it's relevant).

Now just throw in a template:

template utf8Word(dchar ch)
{
	enum utf8Word = genUtf8(ch);
}

//sketch
uint genUtf8(dchar c)
{
     if (c <= 0x7F)
         return c;
     if (c <= 0x7FF)
         return 0xC0 | (c >> 6) | ((0x80 | (c & 0x3F)) << 8);
     if (c <= 0xFFFF)
     {
         assert(!(0xD800 <= c && c <= 0xDFFF)); // no surrogates
         return 0xE0 | (c >> 12) | ((0x80 | ((c >> 6) & 0x3F)) << 8)
		| ((0x80 | (c & 0x3F)) << 16);
     }
     if (c <= 0x10FFFF)
     {
	uint r = 0x80 | (c & 0x3F); //going backwards ;)
	r <<= 8;
	r |= 0x80 | ((c >> 6) & 0x3F);
	r <<= 8;
	r |= 0x80 | ((c >> 12) & 0x3F);
	r <<= 8;
         r |= 0xF0 | (c >> 18);
         return r;
     }
     assert(0); // not a valid code point
}
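For a quick sanity check, the same encoder can be transcribed to C and tested against known sequences; 'Ы' is U+042B, whose UTF-8 form is D0 AB, so the expected word is 0xABD0 (the name `gen_utf8` is illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* Encode code point c as a UTF-8 word, first code unit in the low byte.
   Sketch only; c must be a valid Unicode scalar value. */
uint32_t gen_utf8(uint32_t c)
{
    if (c <= 0x7F)
        return c;
    if (c <= 0x7FF)
        return 0xC0 | (c >> 6) | ((0x80 | (c & 0x3F)) << 8);
    if (c <= 0xFFFF) {
        assert(!(0xD800 <= c && c <= 0xDFFF)); /* no surrogates */
        return 0xE0 | (c >> 12)
             | ((0x80 | ((c >> 6) & 0x3F)) << 8)
             | ((0x80 | (c & 0x3F)) << 16);
    }
    assert(c <= 0x10FFFF);
    return 0xF0 | (c >> 18)
         | ((0x80 | ((c >> 12) & 0x3F)) << 8)
         | ((0x80 | ((c >> 6) & 0x3F)) << 16)
         | ((uint32_t)(0x80 | (c & 0x3F)) << 24);
}
```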

And zup-puff! Stuff like the following works:

switch(u8word)
{
case utf8Word!'Ы':
	...
}
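The same pattern works in C, where case labels must likewise be compile-time constants (the names here are made up for illustration; 'Ы' is U+042B, UTF-8 D0 AB, hence the word 0xABD0):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical classifier: UTF-8 words for known characters are
   compile-time constants, so they serve directly as case labels. */
#define U8WORD_YERU 0xABD0u   /* 'Ы' = U+042B = D0 AB */

const char *classify(uint32_t u8word)
{
    switch (u8word) {
    case U8WORD_YERU: return "cyrillic yeru";
    default:          return "other";
    }
}
```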

And the only thing lacking is a special type, so that you can't mistake 
it for just some arbitrary number.

-- 
Dmitry Olshansky
Aug 02 2012
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/2/12 12:47 PM, Dmitry Olshansky wrote:
 char[] input = ...;
 size_t idx = ...;
 size_t len = stride(input, idx);
 uint u8word = *cast(uint*)(input.ptr+idx);
 //u8word contains full UTF-8 sequence
 u8word &= (1<<(8*len)) -1; //mask out extra bytes
 //now u8word is a complete UTF-8 sequence in one uint


 Barring its hacky nature, I claim that the number obtained is in no way
 worse then distilled codepoint. It is a number that maps 1:1 any
 codepoint in range [0..0x10FFFF]. Let me call it UTF-8 word.
I like a lot this idea of a "minimally decoded" character that's isomorphic with UTF-32 but much cheaper to extract. (We could use ulong if they add 5- and 6-byte characters.) I wonder if people came up with this and gave it a name. If not, I'd say we call such a number an "olsh".

Andrei
Aug 02 2012
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 11:42 AM, Andrei Alexandrescu wrote:
 I like a lot this idea of an "minimally decoded" character that's isomorphic
 with UTF-32 but much cheaper to extract. (We could use ulong if they add 5- and
 6-byte characters). I wonder if people came up with this and gave it a name. If
 not, I'd say we call such a number an "olsh".
Yeah, it's too bad the inventors of UTF8 didn't think of this.
Aug 02 2012
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 02-Aug-12 22:42, Andrei Alexandrescu wrote:
 On 8/2/12 12:47 PM, Dmitry Olshansky wrote:
 char[] input = ...;
 size_t idx = ...;
 size_t len = stride(input, idx);
 uint u8word = *cast(uint*)(input.ptr+idx);
 //u8word contains full UTF-8 sequence
 u8word &= (1<<(8*len)) -1; //mask out extra bytes
 //now u8word is a complete UTF-8 sequence in one uint


 Barring its hacky nature, I claim that the number obtained is in no way
 worse then distilled codepoint. It is a number that maps 1:1 any
 codepoint in range [0..0x10FFFF]. Let me call it UTF-8 word.
I like a lot this idea of a "minimally decoded" character that's isomorphic with UTF-32 but much cheaper to extract. (We could use ulong if they add 5- and 6-byte characters.)
The good news is that there *used to be* 5- and 6-byte sequences. Now there are only up to 4. That's probably why such a technique has not been deployed widely yet. I don't think such a decision is easy to roll back.
I wonder if people came up with
 this and gave it a name. If not, I'd say we call such a number an "olsh".
Cool, though it'd better be olsh8 so that we can use olsh16 for UTF-16 :)

-- 
Dmitry Olshansky
Aug 02 2012
prev sibling parent reply Artur Skawina <art.08.09 gmail.com> writes:
On 08/02/12 18:47, Dmitry Olshansky wrote:
 char[] input = ...;
 size_t idx = ...;
 size_t len = stride(input, idx);
 uint u8word = *cast(uint*)(input.ptr+idx);
 So why do we use dchar and not UTF-8 word, as it's as good as dchar and faster
to obtain? 
Iff unaligned accesses happen to be legal on the platform _and_ iff doing them is faster than the (not that complex) decoding.

artur
Aug 02 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Aug-12 00:40, Artur Skawina wrote:
 On 08/02/12 18:47, Dmitry Olshansky wrote:
 char[] input = ...;
 size_t idx = ...;
 size_t len = stride(input, idx);
 uint u8word = *cast(uint*)(input.ptr+idx);
 So why do we use dchar and not UTF-8 word, as it's as good as dchar and faster
to obtain?
Iff unaligned accesses happen to be legal on the platform _and_ iff doing them is faster than the (not that complex) decoding.
You read memory either way; suppose you read it byte by byte vs "1 or 2 words (if unaligned)" at once. And take a look at std.utf, I'd say it is rather involved. In any case there is a minimum of:

	mask out upper control bits
	shift to proper position
	or with the result register
	[repeat per byte]
	return result

Of course, I'm biased by x86, but it is my understanding that unaligned access support is more or less understood to be a good feature. ARM v6+ seems to have it. And I suspect there is a way to recode the above to be more word-aligned friendly (e.g. via adding an explicit leftover word).

-- 
Dmitry Olshansky
Aug 02 2012