
digitalmars.D.learn - How to detect start of Unicode symbol and count amount of graphemes

reply "Uranuz" <neuranuz gmail.com> writes:
I have a struct StringStream that I use to go through and parse an 
input string. The string could be of string, wstring or dstring type. 
I implement a function popChar that reads a code unit from the stream. 
I want to have a *debug* mode for the parser (via a CT switch), where 
I could get information about lineIndex, codeUnitIndex and 
graphemeIndex. So I don't want to use the *front* primitive, because 
it autodecodes everywhere, but I do want to get info about the index 
of the *user perceived character* in debug mode (so decoding is needed 
there).

The question is how to detect that I go from one Unicode grapheme to 
another when iterating over a string, wstring or dstring by code unit. 
Is it simple, or is it an attempt to reimplement a big piece of 
existing std library code?

As a result I should just increment the internal graphemeIndex.

A short version of the implementation that I want follows:

struct StringStream(String)
{
    String str;
    size_t index;
    size_t graphemeIndex;

    auto popChar()
    {
        auto codeUnit = str[index]; // read the current code unit first
        index++;
        if( ??? ) // How to detect the start of a new grapheme?
        {
            graphemeIndex++;
        }
        return codeUnit;
    }
}

Sorry for the very simple question. I just have a mess in my head 
about Unicode and D strings.
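
One possible way to fill in the ??? is sketched below with std.uni.graphemeStride, which returns the number of code units spanned by the grapheme cluster starting at a given index. This assumes String is an array of char, wchar or dchar, and is only a sketch, not a finished implementation:

import std.uni : graphemeStride;

struct StringStream(String)
{
    String str;
    size_t index;
    size_t graphemeIndex;
    size_t nextGraphemeStart; // code-unit index where the next grapheme begins

    auto popChar()
    {
        auto codeUnit = str[index];
        if( index == nextGraphemeStart )
        {
            // graphemeStride gives the length, in code units, of the
            // grapheme cluster starting at this index.
            nextGraphemeStart += graphemeStride(str, index);
            graphemeIndex++;
        }
        index++;
        return codeUnit;
    }
}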
Oct 05 2014
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 5 October 2014 at 08:27:58 UTC, Uranuz wrote:
 I have struct StringStream that I use to go through and parse 
 input string. String could be of string, wstring or dstring 
 type. I implement function popChar that reads codeUnit from 
 Stream. I want to have *debug* mode of parser (via CT switch), 
 where I could get information about lineIndex, codeUnitIndex, 
 graphemeIndex. So I don't want to use *front* primitive because 
 it autodecodes everywhere, but I want to get info about the index of 
 *user perceived character* in debug mode (so decoding is needed 
 here).

 Question is how to detect that I go from one Unicode grapheme 
 to another when iterating on string, wstring, dstring by code 
 unit? Is it simple, or is it an attempt to reimplement a big piece 
 of existing std library code?
You can use std.uni.byGrapheme to iterate by graphemes:

AFAIK, graphemes are not "self synchronizing", but code points are. You can pop code units until you reach the beginning of a new code point. From there, you can iterate by graphemes, though your first grapheme might be off.
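
A minimal sketch of iterating by graphemes (assuming std.uni.byGrapheme; the sample string is arbitrary):

import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    auto s = "a\u0301bc";      // 'a' + combining acute accent, then 'b' and 'c'
    size_t graphemeIndex;
    foreach (g; s.byGrapheme)  // one step per user-perceived character
    {
        writeln("grapheme #", graphemeIndex, " holds ", g.length, " code point(s)");
        graphemeIndex++;
    }
    // Prints three graphemes; the first one holds two code points.
}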
Oct 05 2014
parent reply "Uranuz" <neuranuz gmail.com> writes:
 You can use std.uni.byGrapheme to iterate by graphemes:


 AFAIK, graphemes are not "self synchronizing", but codepoints 
 are. You can pop code units until you reach the beginning of a 
 new codepoint. From there, you can iterate by graphemes, though 
 your first grapheme might be off.
Maybe there is some idea how to just detect the first code unit of a grapheme without the overhead of using the Grapheme struct? I just tried to check if ch < 128 (for UTF-8), but this doesn't work. How can I check whether a byte is a continuation of the code for a single code point, or whether a new sequence has started?
Oct 05 2014
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2014-10-05 14:09, Uranuz wrote:

 Maybe there is some idea how to just detect first code unit of grapheme
 without overhead for using Grapheme struct? I just tried to check if ch
 < 128 (for UTF-8). But this doesn't work. How to check if a byte is
 continuation of code for single code point or if new sequence started?
Have a look here [1]. For example, if you have a code point that is between U+0080 and U+07FF, you know that you need two bytes to encode that whole code point.

[1] http://en.wikipedia.org/wiki/UTF-8#Description

-- 
/Jacob Carlborg
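
As a sketch of the same idea with Phobos doing the table lookup, std.utf.stride returns how many code units the code point starting at a given index occupies (the sample string is arbitrary):

import std.stdio : writeln;
import std.utf : stride;

void main()
{
    string s = "a\u0439\u20ac"; // 1-byte, 2-byte and 3-byte UTF-8 sequences
    size_t i = 0;
    while (i < s.length)
    {
        auto len = stride(s, i); // code units in the sequence starting at i
        writeln("code point starts at byte ", i, " and is ", len, " byte(s) long");
        i += len;
    }
}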
Oct 05 2014
parent reply "Uranuz" <neuranuz gmail.com> writes:
 Have a look here [1]. For example, if you have a code point that 
 is between U+0080 and U+07FF, you know that you need two bytes to 
 get that whole code point.

 [1] http://en.wikipedia.org/wiki/UTF-8#Description
Thanks. I solved it myself already for UTF-8 encoding. I chose an approach using a bitmask. Maybe it is not the best for efficiency, but it works)

( str[index] & 0b10000000 ) == 0 ||
( str[index] & 0b11100000 ) == 0b11000000 ||
( str[index] & 0b11110000 ) == 0b11100000 ||
( str[index] & 0b11111000 ) == 0b11110000

If it is true, it means that the first byte of a sequence has been found and I can count them. Am I right that this equals the number of graphemes, or are there some exceptions to this rule?

For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?
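
For reference, an equivalent and shorter test is "the code unit is not a continuation unit". A small sketch follows (startsCodePoint is an illustrative helper, not a Phobos function); for UTF-16, the only continuation units are trailing (low) surrogates in the range 0xDC00-0xDFFF:

import std.stdio : writeln;

// Illustrative helpers: does this code unit start a new code point?
bool startsCodePoint(char c)   // UTF-8: continuation bytes look like 0b10xx_xxxx
{
    return (c & 0b1100_0000) != 0b1000_0000;
}

bool startsCodePoint(wchar c)  // UTF-16: low surrogates continue a surrogate pair
{
    return c < 0xDC00 || c > 0xDFFF;
}

void main()
{
    string s = "a\u0439\u20ac";
    foreach (i, char c; s)
        writeln("byte ", i, " starts a code point: ", startsCodePoint(c));
}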
Oct 06 2014
next sibling parent ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Mon, 06 Oct 2014 17:28:43 +0000
Uranuz via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

 If it is true, it means that the first byte of a sequence has been
 found and I can count them. Am I right that this equals the number
 of graphemes, or are there some exceptions to this rule?
a lot. take for example RIGHT-TO-LEFT MARK, which is not a grapheme at all. and not a "composite" for that matter. ah, those joys of unicode!
Oct 06 2014
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Mon, Oct 06, 2014 at 05:28:43PM +0000, Uranuz via Digitalmars-d-learn wrote:
Have a look here [1]. For example, if you have a code point that is between
U+0080 and U+07FF, you know that you need two bytes to get that whole
code point.

[1] http://en.wikipedia.org/wiki/UTF-8#Description
Thanks. I solved it myself already for UTF-8 encoding. I chose an approach using a bitmask. Maybe it is not the best for efficiency, but it works)

( str[index] & 0b10000000 ) == 0 ||
( str[index] & 0b11100000 ) == 0b11000000 ||
( str[index] & 0b11110000 ) == 0b11100000 ||
( str[index] & 0b11111000 ) == 0b11110000

If it is true, it means that the first byte of a sequence has been found and I can count them. Am I right that this equals the number of graphemes, or are there some exceptions to this rule?

For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?
This looks wrong to me. Are you sure this finds *all* possible graphemes? Keep in mind that combining diacritic sequences are treated as a single grapheme; for example the sequence 'A' U+0301 U+0302 U+0303. There are several different codepoint ranges that have the combining diacritic property, and they are definitely more complicated than what you have here.

Furthermore, there are more complicated things like the Devanagari sequences (e.g., KA + VIRAMA + TA + VOWEL SIGN U), that your code certainly doesn't look like it would handle correctly.

As somebody else has said, it's generally a bad idea to work with Unicode byte sequences yourself, because Unicode is complicated, and many apparently-simple concepts actually require a lot of care to get it right.

T

-- 
It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall
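
To make that concrete, a small sketch (using std.uni.byGrapheme and std.utf.count) with exactly that sequence, which is one grapheme but four code points:

import std.stdio : writeln;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    auto s = "A\u0301\u0302\u0303"; // 'A' + three combining diacritics
    writeln("code points: ", s.count);                 // 4
    writeln("graphemes:   ", s.byGrapheme.walkLength); // 1
}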
Oct 06 2014
parent reply Jacob Carlborg <doob me.com> writes:
On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:

 This looks wrong to me. Are you sure this finds *all* possible
 graphemes?
No, the data I gave was to detect a complete code unit. Graphemes are something else; I think Uranuz is mixing up the Unicode terms.

-- 
/Jacob Carlborg
Oct 06 2014
parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Tue, Oct 07, 2014 at 08:28:49AM +0200, Jacob Carlborg via
Digitalmars-d-learn wrote:
 On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:
 
This looks wrong to me. Are you sure this finds *all* possible
graphemes?
No, the data I gave was to detect a complete code unit. Graphemes are something else, I think Uranuz is mixing up the Unicode terms.
[...]

Ahhh, OK, then it makes sense.

T

-- 
People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like. Otherwise the programs they write will be pretty weird. -- D. Knuth
Oct 07 2014
prev sibling parent "anonymous" <anonymous example.com> writes:
On Monday, 6 October 2014 at 17:28:45 UTC, Uranuz wrote:
 ( str[index] & 0b10000000 ) == 0 ||
 ( str[index] & 0b11100000 ) == 0b11000000 ||
 ( str[index] & 0b11110000 ) == 0b11100000 ||
 ( str[index] & 0b11111000 ) == 0b11110000

 If it is true, it means that the first byte of a sequence has 
 been found and I can count them. Am I right that this equals 
 the number of graphemes, or are there some exceptions to this rule?

 For UTF-32 the number of code units is just equal to the number 
 of graphemes. And what about UTF-16? Is it possible to detect 
 the first code unit of an encoding sequence?
I think your idea of graphemes is off.

A grapheme is made up of one or more code points. This is the same for all UTF encodings.

A code point is made up of one or more code units. UTF-8: between 1 and 4, I think; UTF-16: 1 or 2; UTF-32: always 1.

A code unit is made up of a fixed number of bytes. UTF-8: 1, UTF-16: 2, UTF-32: 4.

So, the number of UTF-8 bytes in a sequence has no relation to graphemes. The number of leading ones in a UTF-8 start byte is equal to the total number of bytes in that sequence. I.e. when you see a 0b1110_0000 byte, the following two bytes should be continuation bytes (0b10xx_xxxx), and the three of them together encode a *code point*.

And in UTF-32, the number of code units is equal to the number of *code points*, not graphemes.
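
A small sketch of those three levels for one UTF-8 string (using std.utf.count for code points and std.uni.byGrapheme for graphemes; the sample string is arbitrary):

import std.stdio : writeln;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    string s = "no\u0308l"; // "nöl", with the umlaut as a combining mark
    writeln("code units (bytes): ", s.length);                 // 5
    writeln("code points:        ", s.count);                  // 4
    writeln("graphemes:          ", s.byGrapheme.walkLength);  // 3
}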
Oct 06 2014
prev sibling parent "Kagamin" <spam here.lot> writes:
On Sunday, 5 October 2014 at 12:09:34 UTC, Uranuz wrote:
 Maybe there is some idea how to just detect first code unit of 
 grapheme without overhead for using Grapheme struct? I just 
 tried to check if ch < 128 (for UTF-8). But this doesn't work. How 
 to check if byte is continuation of code for single code point 
 or if new sequence started?
Are you trying to split strings? If you want to optimize grapheme handling, try checking whether the next 10 code units are ASCII symbols; when that fails, fall back to graphemes.
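
A rough sketch of that kind of fast path (countGraphemes is an illustrative helper, not a Phobos function). Note that an ASCII unit can still begin a multi-code-point grapheme if a combining mark follows, and CR+LF forms a single grapheme, so the fast path below also checks the following unit:

import std.uni : decodeGrapheme;

// Illustrative helper: count graphemes with an ASCII fast path.
size_t countGraphemes(string s)
{
    size_t n;
    while (s.length)
    {
        // Fast path: an ASCII unit (other than CR) followed by another
        // ASCII unit (or the end of the string) is a complete grapheme.
        if (s[0] < 0x80 && s[0] != '\r' && (s.length == 1 || s[1] < 0x80))
            s = s[1 .. $];
        else
            decodeGrapheme(s); // slow path: pops one full grapheme off the front
        ++n;
    }
    return n;
}

void main()
{
    import std.stdio : writeln;
    writeln(countGraphemes("hello, A\u0301 world")); // the combining mark takes the slow path
}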
Oct 06 2014
prev sibling parent "Nicolas F." <ddev fratti.ch> writes:
Unicode is hard to deal with properly, as how you deal with it is
very context-dependent.

One grapheme is a visible character and consists of one or more
codepoints. One codepoint is one mapping of a byte sequence to a
meaning, and consists of one or more bytes.

This you do not want to deal with yourself, as knowing which
codepoints form graphemes is hard. Thankfully, std.uni exists.
Specifically, look at decodeGrapheme: it pops one grapheme from
an input range and returns it.
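
A minimal sketch of using it (the sample string is arbitrary):

import std.stdio : writeln;
import std.uni : decodeGrapheme;

void main()
{
    auto rest = "e\u0301x!"; // 'e' + combining acute accent, then "x!"
    size_t graphemeIndex;
    while (rest.length)
    {
        auto g = decodeGrapheme(rest); // pops one full grapheme off the front
        writeln("grapheme #", graphemeIndex, " has ", g.length, " code point(s)");
        graphemeIndex++;
    }
}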

Never write code that deals with Unicode on the byte level. It will
always be wrong.
Oct 06 2014