digitalmars.D - Why do you decode ? (Seriously)
- Dmitry Olshansky (77/77) Aug 02 2012 Intrigued by a familiar topic in std.lexer. I've split it out.
- Andrei Alexandrescu (6/16) Aug 02 2012 I like a lot this idea of an "minimally decoded" character that's
- Walter Bright (2/6) Aug 02 2012 Yeah, it's too bad the inventors of UTF8 didn't think of this.
- Dmitry Olshansky (7/25) Aug 02 2012 The good news is that there *used to be* 5 and 6-bytes. Now there is
- Artur Skawina (4/9) Aug 02 2012 Iff unaligned accesses happen to be legal on the platform _and_ iff doin...
- Dmitry Olshansky (14/22) Aug 02 2012 You read memory either way, suppose you read it byte by byte vs "1 or 2
Intrigued by a familiar topic in std.lexer. I've split it out.

It's not as easy a question as it seems. Before you start the usual "because codepoint has semantic meaning, codeunit is just bytes ya-da, ya-da", let me explain something.

A codepoint is indeed a complete piece of symbolic information represented as a number in the [0, 0x10FFFF] range. A few such pieces make up a user-perceived character, not that many people bother with this, as the "few" awfully often equals 1.

So far nothing new. My point is: people decode UTF-8 to dchar only to be able to:
a) compare it directly with the compiler's built-in '<someunicodechar>'
b) call one of isAlpha, isSpace, ... that take dchar

In other words: decoding should be required only when one wants to store it in a new form. Otherwise, if used for direct consumption, it's pointless extra work.

Now take a look at this snippet:

    char[] input = ...;
    size_t idx = ...;
    size_t len = stride(input, idx);
    uint u8word = *cast(uint*)(input.ptr+idx);
    //u8word contains full UTF-8 sequence
    u8word &= (1UL<<(8*len)) - 1; //mask out extra bytes
    //now u8word is a complete UTF-8 sequence in one uint

Barring its hacky nature, I claim that the number obtained is in no way worse than the distilled codepoint. It is a number that maps 1:1 to any codepoint in range [0..0x10FFFF]. Let me call it UTF-8 word.

So why do we use dchar and not UTF-8 word, as it's as good as dchar and faster to obtain? The reasons, as above, are:
a) the compiler doesn't generate UTF-8 words in any built-in way (and thus no special type)
b) there are no functions that will do isAlpha on this beast.

Because of the above, it currently requires doing some manual work instead of compiler magic.

Reminding that I'm (no big wonder) doing the "Improve Unicode support for D" GSOC project, I think I can easily help with point b. To that end, the solution is flexible enough to do the same with UTF-16 word (not that it's relevant).
Now just throw in a template:

    template utf8Word(dchar ch)
    {
        enum utf8Word = genUtf8(ch);
    }

    //sketch
    uint genUtf8(dchar c)
    {
        if (c <= 0x7F)
            return c;
        if (c <= 0x7FF)
            return 0xC0 | (c >> 6) | ((0x80 | (c & 0x3F))<<8);
        if (c <= 0xFFFF)
        {
            assert(!(0xD800 <= c && c <= 0xDFFF));
            return 0xE0 | (c >> 12) | ((0x80 | ((c >> 6) & 0x3F))<<8)
                | ((0x80 | (c & 0x3F))<<16);
        }
        if (c <= 0x10FFFF)
        {
            uint r = 0x80 | (c & 0x3F); //going backwards ;)
            r <<= 8;
            r |= 0x80 | ((c >> 6) & 0x3F);
            r <<= 8;
            r |= 0x80 | ((c >> 12) & 0x3F);
            r <<= 8;
            r |= 0xF0 | (c >> 18);
            return r;
        }
        assert(0); //not a valid codepoint
    }

And zup-puff! Stuff like the following works:

    switch(u8word)
    {
        case utf8Word!'Ы':
            ...
    }

And the only thing lacking is a special type so that you can't mistake it for just some arbitrary number.

-- Dmitry Olshansky
Aug 02 2012
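For readers who want to try the idea, here is a minimal D sketch of building a UTF-8 word without the pointer cast. It assembles the word one code unit at a time, so it never reads past the end of the array and gives the same value on any byte order: the first code unit lands in the low 8 bits, which is what the cast in the post yields on a little-endian machine. The names `packCodeUnits`, `utf8WordAt` and the string-based `utf8Word` template are hypothetical and not part of Phobos; the only library call used is `std.utf.stride`. As a deliberate variation on the post, the template here takes a string literal (which the compiler already stores as UTF-8) instead of a dchar.

```d
import std.utf : stride;

// Hypothetical helper: packs 1 to 4 UTF-8 code units into a uint,
// first code unit in the low 8 bits. Simple enough to run in CTFE.
uint packCodeUnits(const(char)[] units)
{
    assert(units.length >= 1 && units.length <= 4, "expected 1 to 4 code units");
    uint word = 0;
    foreach (i, unit; units)
        word |= cast(uint) unit << (8 * i);
    return word;
}

// Hypothetical helper: the UTF-8 word of the sequence that starts at s[idx].
uint utf8WordAt(const(char)[] s, size_t idx)
{
    immutable len = stride(s, idx); // number of code units in this sequence
    assert(idx + len <= s.length, "truncated UTF-8 sequence");
    return packCodeUnits(s[idx .. idx + len]);
}

// Compile-time constant; the caller is assumed to pass a literal
// holding exactly one code point.
enum uint utf8Word(string literal) = packCodeUnits(literal);

void main()
{
    import std.stdio : writefln;

    string text = "a\u042B\u20AC"; // 'a', U+042B (the letter from the post), U+20AC
    for (size_t idx = 0; idx < text.length; idx += stride(text, idx))
    {
        immutable word = utf8WordAt(text, idx);
        switch (word)
        {
            case utf8Word!"a":
                writefln("0x%X is 'a'", word);
                break;
            case utf8Word!"\u042B":
                writefln("0x%X is U+042B", word);
                break;
            default:
                writefln("0x%X is something else", word);
                break;
        }
    }
}
```

The byte-by-byte loop gives up the single-load speed that motivates the post; it is meant as a reference to check a faster version against, not as a replacement for it.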
On 8/2/12 12:47 PM, Dmitry Olshansky wrote:
> char[] input = ...;
> size_t idx = ...;
> size_t len = stride(input, idx);
> uint u8word = *cast(uint*)(input.ptr+idx);
> //u8word contains full UTF-8 sequence
> u8word &= (1UL<<(8*len)) - 1; //mask out extra bytes
> //now u8word is a complete UTF-8 sequence in one uint
>
> Barring its hacky nature, I claim that the number obtained is in no way worse than the distilled codepoint. It is a number that maps 1:1 to any codepoint in range [0..0x10FFFF]. Let me call it UTF-8 word.

I like a lot this idea of a "minimally decoded" character that's isomorphic with UTF-32 but much cheaper to extract. (We could use ulong if they add 5- and 6-byte characters). I wonder if people came up with this and gave it a name. If not, I'd say we call such a number an "olsh".

Andrei
Aug 02 2012
On 8/2/2012 11:42 AM, Andrei Alexandrescu wrote:
> I like a lot this idea of a "minimally decoded" character that's isomorphic with UTF-32 but much cheaper to extract. (We could use ulong if they add 5- and 6-byte characters). I wonder if people came up with this and gave it a name. If not, I'd say we call such a number an "olsh".

Yeah, it's too bad the inventors of UTF8 didn't think of this.
Aug 02 2012
On 02-Aug-12 22:42, Andrei Alexandrescu wrote:
> On 8/2/12 12:47 PM, Dmitry Olshansky wrote:
>> char[] input = ...;
>> size_t idx = ...;
>> size_t len = stride(input, idx);
>> uint u8word = *cast(uint*)(input.ptr+idx);
>> //u8word contains full UTF-8 sequence
>> u8word &= (1UL<<(8*len)) - 1; //mask out extra bytes
>> //now u8word is a complete UTF-8 sequence in one uint
>>
>> Barring its hacky nature, I claim that the number obtained is in no way worse than the distilled codepoint. It is a number that maps 1:1 to any codepoint in range [0..0x10FFFF]. Let me call it UTF-8 word.
>
> I like a lot this idea of a "minimally decoded" character that's isomorphic with UTF-32 but much cheaper to extract. (We could use ulong if they add 5- and 6-byte characters).

The good news is that there *used to be* 5- and 6-byte sequences. Now there is only up to 4. That's probably why such a technique has not been deployed widely yet. I don't think such a decision is easy to roll back.

> I wonder if people came up with this and gave it a name. If not, I'd say we call such a number an "olsh".

Cool, though it'd better be olsh8 so that we can use olsh16 for UTF16 :)

-- Dmitry Olshansky
Aug 02 2012
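One possible reading of the "olsh16" mentioned above, as a hedged D sketch: pack the one or two UTF-16 code units of a sequence into a uint, first code unit in the low 16 bits. The name `utf16WordAt` is hypothetical and the layout is an assumption chosen to mirror the UTF-8 case; `std.utf.stride` is the only library call.

```d
import std.utf : stride;

// Hypothetical helper: the "UTF-16 word" of the sequence starting at s[idx].
// A BMP character occupies the low 16 bits; a surrogate pair fills all 32.
uint utf16WordAt(const(wchar)[] s, size_t idx)
{
    immutable len = stride(s, idx); // 1, or 2 for a surrogate pair
    assert(idx + len <= s.length, "truncated surrogate pair");
    uint word = s[idx];
    if (len == 2)
        word |= cast(uint) s[idx + 1] << 16;
    return word;
}

void main()
{
    import std.stdio : writefln;

    wstring text = "\u042B\U0001F600"w; // one BMP character, one surrogate pair
    for (size_t idx = 0; idx < text.length; idx += stride(text, idx))
        writefln("0x%08X", utf16WordAt(text, idx));
}
```

For BMP characters the UTF-16 word equals the code point itself, so the packed form only differs from dchar for surrogate pairs.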
On 08/02/12 18:47, Dmitry Olshansky wrote:
> char[] input = ...;
> size_t idx = ...;
> size_t len = stride(input, idx);
> uint u8word = *cast(uint*)(input.ptr+idx);
>
> So why do we use dchar and not UTF-8 word, as it's as good as dchar and faster to obtain?

Iff unaligned accesses happen to be legal on the platform _and_ iff doing them is faster than the (not that complex) decoding.

artur
Aug 02 2012
On 03-Aug-12 00:40, Artur Skawina wrote:
> On 08/02/12 18:47, Dmitry Olshansky wrote:
>> char[] input = ...;
>> size_t idx = ...;
>> size_t len = stride(input, idx);
>> uint u8word = *cast(uint*)(input.ptr+idx);
>>
>> So why do we use dchar and not UTF-8 word, as it's as good as dchar and faster to obtain?
>
> Iff unaligned accesses happen to be legal on the platform _and_ iff doing them is faster than the (not that complex) decoding.

You read memory either way; suppose you read it byte by byte vs "1 or 2 words (if unaligned)" at once. And take a look at std.utf, I'd say it is rather involved. In any case there is a minimum of:

    mask out upper control bits
    shift to proper position
    OR with result register
    [repeat per byte]
    return result

Of course, I'm biased by x86, but it is my understanding that unaligned support is more or less understood to be a good feature. ARMv6+ seems to have it. And I suspect there is a way to recode the above to be more word-aligned friendly (e.g. via adding an explicit leftover word).

-- Dmitry Olshansky
Aug 02 2012
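The minimum per-byte work listed in the reply above (mask, shift, OR, repeat) can be sketched in D as follows. This is an illustrative decoder under stated assumptions, not the std.utf implementation: the name `decodeSketch` is hypothetical, and it skips most of the validation a real decoder performs (overlong forms, surrogates, values above 0x10FFFF).

```d
// Hypothetical minimal decoder: returns the code point whose UTF-8
// sequence starts at s[idx]. Input is assumed to be mostly well formed.
dchar decodeSketch(const(char)[] s, size_t idx)
{
    immutable uint lead = s[idx];
    if (lead < 0x80)
        return cast(dchar) lead; // ASCII: nothing to decode

    // Sequence length from the lead byte: 110xxxxx, 1110xxxx or 11110xxx.
    immutable size_t len = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2;
    assert(lead >= 0xC0 && idx + len <= s.length, "malformed or truncated sequence");

    // Mask out the upper control bits of the lead byte.
    uint result = lead & (0x7F >> len);

    // Per continuation byte: mask, shift the result, OR the payload in.
    foreach (i; 1 .. len)
    {
        immutable uint unit = s[idx + i];
        assert((unit & 0xC0) == 0x80, "expected a continuation byte");
        result = (result << 6) | (unit & 0x3F);
    }
    return cast(dchar) result;
}

void main()
{
    import std.stdio : writefln;

    string text = "a\u042B\u20AC\U0001F600";
    size_t idx = 0;
    while (idx < text.length)
    {
        immutable cp = decodeSketch(text, idx);
        writefln("U+%04X", cast(uint) cp);
        // Advance by the encoded length of the code point just decoded.
        idx += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }
}
```

Compared with the UTF-8 word, each code unit costs an extra mask, shift and OR, which is the overhead the thread is weighing against a single (possibly unaligned) load plus one mask.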