digitalmars.D - UTF8 and unary encoding
While looking at https://en.wikipedia.org/wiki/Unary_coding I found that UTF8 uses unary encoding for the length of multibyte sequences. Investigating further at https://en.wikipedia.org/wiki/UTF-8 reveals that indeed "The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence. When reading from a stream, a reader can process all fully received sequences without first having to wait for either the leading byte of a next sequence or an end-of-stream indication."

We don't use that explicitly; instead, we load each byte of multi-sequences. Who'd be interested in looking whether Phobos' primitives can be faster with multibyte-rich text?

Andrei
Sep 12 2016
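For readers following along, the unary length prefix quoted above can be sketched in a few lines of C. This is only an illustration of the encoding rule, not Phobos code: the names `leading_ones` and `utf8_seq_len` are hypothetical, and the function deliberately does not validate the sequence (it accepts overlong lead bytes such as 0xC0 and out-of-range ones such as 0xF5).

```c
#include <stddef.h>

/* Count the high-order 1 bits of a byte, i.e. the unary-coded prefix. */
int leading_ones(unsigned char b)
{
    int n = 0;
    while (n < 8 && (b & (0x80u >> n)))
        n++;
    return n;
}

/* Hypothetical helper: length in bytes of the UTF-8 sequence that starts
 * with `lead`, derived only from the lead byte.
 * Returns 0 if `lead` cannot start a sequence (a 10xxxxxx continuation
 * byte, or five or more leading 1s). No further validation is done. */
int utf8_seq_len(unsigned char lead)
{
    int ones = leading_ones(lead);
    if (ones == 0)
        return 1;            /* 0xxxxxxx: single-byte (ASCII) */
    if (ones >= 2 && ones <= 4)
        return ones;         /* 110xxxxx, 1110xxxx, 11110xxx */
    return 0;                /* continuation byte or invalid lead */
}
```

A single-byte character has no leading 1s, so the count equals the sequence length only for multibyte sequences; exactly one leading 1 marks a continuation byte, which is why it is rejected here.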
On Monday, September 12, 2016 07:37:05 Andrei Alexandrescu via Digitalmars-d wrote:
> While looking at https://en.wikipedia.org/wiki/Unary_coding I found that UTF8 uses unary encoding for the length of multibyte sequences. Investigating further at https://en.wikipedia.org/wiki/UTF-8 reveals that indeed "The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence. When reading from a stream, a reader can process all fully received sequences without first having to wait for either the leading byte of a next sequence or an end-of-stream indication." We don't use that explicitly; instead, we load each byte of multi-sequences. Who'd be interested in looking whether Phobos' primitives can be faster with multibyte-rich text?

Aren't we already doing that with stride? It reads the number of bytes in a code point from the first code unit and then if we're dealing with a random access range of char or an array of char, then we skip that many code units without reading them.

The fact that we auto-decode in many cases does mean that all of the bytes are read in a number of cases where they wouldn't need to be if we were dealing with ranges of char, but in the cases where we aren't auto-decoding, we should already be taking advantage of this in general via stride (though obviously, there could be specific places where the code is not skipping bytes like it should). Or am I misunderstanding what you're talking about doing here?

- Jonathan M Davis
Sep 12 2016
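The skipping behaviour described in the reply above can be illustrated with a small C sketch that counts code points in a byte array while reading only the lead byte of each sequence. It is a sketch of the idea under stated assumptions, not the actual implementation of Phobos' stride: `lead_stride` and `count_code_points` are hypothetical names, the input is assumed to be mostly well-formed UTF-8, and an invalid lead byte is simply stepped over one byte at a time so the loop always makes progress.

```c
#include <stddef.h>

/* Hypothetical helper: how many bytes to advance, judged from the lead
 * byte alone. Invalid leads and stray continuation bytes advance by 1. */
size_t lead_stride(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;                             /* not a valid lead byte */
}

/* Count code points in s[0..len) by jumping from lead byte to lead byte.
 * Continuation bytes are skipped without being loaded. A sequence that is
 * truncated at the end of the buffer still counts as one code point. */
size_t count_code_points(const char *s, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) {
        i += lead_stride((unsigned char)s[i]);
        n++;
    }
    return n;
}
```

Because the index can jump by up to four bytes, this only works with random access to the underlying code units, which matches the condition in the reply (a random access range of char or an array of char). Decoding the code point's value, as auto-decoding does, would still require reading every byte.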
On 9/12/16 11:59 AM, Jonathan M Davis via Digitalmars-d wrote:
> Aren't we already doing that with stride? It reads the number of bytes in a code point from the first code unit and then if we're dealing with a random access range of char or an array of char, then we skip that many code units without reading them.

Oh, ok. I'd either forgotten or the code has been improved since I last looked at it. Thanks!

-- Andrei
Sep 12 2016