
digitalmars.D.learn - How to detect start of Unicode symbol and count amount of graphemes

reply "Uranuz" <neuranuz gmail.com> writes:
I have a struct StringStream that I use to go through and parse an 
input string. The string could be of string, wstring or dstring type. 
I implement a function popChar that reads a code unit from the stream. 
I want to have a *debug* mode for the parser (via a CT switch), where 
I could get information about lineIndex, codeUnitIndex and 
graphemeIndex. So I don't want to use the *front* primitive, because 
it autodecodes everywhere, but I do want to get info about the index 
of the *user perceived character* in debug mode (so decoding is needed 
there).

The question is how to detect that I go from one Unicode grapheme to 
another when iterating over a string, wstring or dstring by code unit. 
Is it simple, or is it an attempt to reimplement a big piece of 
existing std library code?

As a result I should just increment the internal graphemeIndex.

A short version of the implementation that I want follows:

struct StringStream(String)
{
    String str;
    size_t index;
    size_t graphemeIndex;

    auto popChar()
    {
        auto codeUnit = str[index]; // read the current code unit first
        index++;
        if( ??? ) // How to detect the start of a new grapheme?
        {
            graphemeIndex++;
        }
        return codeUnit;
    }
}

Sorry for the very simple question. I just have a mess in my head 
about Unicode and D strings.
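
One possible way to fill in the ??? is sketched below with std.uni.graphemeStride, which returns the number of code units spanned by the grapheme cluster starting at a given index. This assumes String is an array of char, wchar or dchar, and is only a sketch, not a finished implementation:

import std.uni : graphemeStride;

struct StringStream(String)
{
    String str;
    size_t index;
    size_t graphemeIndex;
    size_t nextGraphemeStart; // code-unit index where the next grapheme begins

    auto popChar()
    {
        auto codeUnit = str[index];
        if( index == nextGraphemeStart )
        {
            // graphemeStride gives the length, in code units, of the
            // grapheme cluster starting at this index.
            nextGraphemeStart += graphemeStride(str, index);
            graphemeIndex++;
        }
        index++;
        return codeUnit;
    }
}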
Oct 05 2014
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 5 October 2014 at 08:27:58 UTC, Uranuz wrote:
 I have struct StringStream that I use to go through and parse 
 input string. String could be of string, wstring or dstring 
 type. I implement function popChar that reads codeUnit from 
 Stream. I want to have *debug* mode of parser (via CT switch), 
 where I could get information about lineIndex, codeUnitIndex, 
 graphemeIndex. So I don't want to use *front* primitive because 
 it autodecodes everywhere, but I want to get info about the index of 
 *user perceived character* in debug mode (so decoding is needed 
 here).

 Question is how to detect that I go from one Unicode grapheme 
 to another when iterating on string, wstring, dstring by code 
 unit? Is it simple, or is it an attempt to reimplement a big piece 
 of existing std library code?
You can use std.uni.byGrapheme to iterate by graphemes:

AFAIK, graphemes are not "self synchronizing", but code points are. You can pop code units until you reach the beginning of a new code point. From there, you can iterate by graphemes, though your first grapheme might be off.
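
A minimal sketch of iterating by graphemes (assuming std.uni.byGrapheme; the sample string is arbitrary):

import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    auto s = "a\u0301bc";      // 'a' + combining acute accent, then 'b' and 'c'
    size_t graphemeIndex;
    foreach (g; s.byGrapheme)  // one step per user-perceived character
    {
        writeln("grapheme #", graphemeIndex, " holds ", g.length, " code point(s)");
        graphemeIndex++;
    }
    // Prints three graphemes; the first one holds two code points.
}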
Oct 05 2014
parent reply "Uranuz" <neuranuz gmail.com> writes:
 You can use std.uni.byGrapheme to iterate by graphemes:


 AFAIK, graphemes are not "self synchronizing", but codepoints 
 are. You can pop code units until you reach the beginning of a 
 new codepoint. From there, you can iterate by graphemes, though 
 your first grapheme might be off.
Maybe there is some idea how to just detect the first code unit of a grapheme without the overhead of using the Grapheme struct? I just tried to check if ch < 128 (for UTF-8), but this doesn't work. How can I check whether a byte is a continuation of the code for a single code point, or whether a new sequence has started?
Oct 05 2014
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2014-10-05 14:09, Uranuz wrote:

 Maybe there is some idea how to just detect first code unit of grapheme
 without overhead for using Grapheme struct? I just tried to check if ch
 < 128 (for UTF-8). But this doesn't work. How to check if a byte is
 continuation of code for single code point or if new sequence started?
Have a look here [1]. For example, if you have a code point that is between U+0080 and U+07FF, you know that you need two bytes to encode that whole code point.

[1] http://en.wikipedia.org/wiki/UTF-8#Description

-- 
/Jacob Carlborg
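
As a sketch of the same idea with Phobos doing the table lookup, std.utf.stride returns how many code units the code point starting at a given index occupies (the sample string is arbitrary):

import std.stdio : writeln;
import std.utf : stride;

void main()
{
    string s = "a\u0439\u20ac"; // 1-byte, 2-byte and 3-byte UTF-8 sequences
    size_t i = 0;
    while (i < s.length)
    {
        auto len = stride(s, i); // code units in the sequence starting at i
        writeln("code point starts at byte ", i, " and is ", len, " byte(s) long");
        i += len;
    }
}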
Oct 05 2014
parent reply "Uranuz" <neuranuz gmail.com> writes:
 Have a look here [1]. For example, if you have a code point that 
 is between U+0080 and U+07FF, you know that you need two bytes to 
 get that whole code point.

 [1] http://en.wikipedia.org/wiki/UTF-8#Description
Thanks. I solved it myself already for UTF-8 encoding. I chose an approach using a bitmask. Maybe it is not the best for efficiency, but it works)

( str[index] & 0b10000000 ) == 0 ||
( str[index] & 0b11100000 ) == 0b11000000 ||
( str[index] & 0b11110000 ) == 0b11100000 ||
( str[index] & 0b11111000 ) == 0b11110000

If it is true, it means that the first byte of a sequence has been found and I can count them. Am I right that this equals the number of graphemes, or are there some exceptions to this rule?

For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?
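
For reference, an equivalent and shorter test is "the code unit is not a continuation unit". A small sketch follows (startsCodePoint is an illustrative helper, not a Phobos function); for UTF-16, the only continuation units are trailing (low) surrogates in the range 0xDC00-0xDFFF:

import std.stdio : writeln;

// Illustrative helpers: does this code unit start a new code point?
bool startsCodePoint(char c)   // UTF-8: continuation bytes look like 0b10xx_xxxx
{
    return (c & 0b1100_0000) != 0b1000_0000;
}

bool startsCodePoint(wchar c)  // UTF-16: low surrogates continue a surrogate pair
{
    return c < 0xDC00 || c > 0xDFFF;
}

void main()
{
    string s = "a\u0439\u20ac";
    foreach (i, char c; s)
        writeln("byte ", i, " starts a code point: ", startsCodePoint(c));
}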
Oct 06 2014
next sibling parent ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Mon, 06 Oct 2014 17:28:43 +0000
Uranuz via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

 If it is true, it means that the first byte of a sequence has been
 found and I can count them. Am I right that this equals the number
 of graphemes, or are there some exceptions to this rule?
a lot. take for example RIGHT-TO-LEFT MARK, which is not a grapheme at all. and not a "composite" for that matter. ah, those joys of unicode!
Oct 06 2014
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Mon, Oct 06, 2014 at 05:28:43PM +0000, Uranuz via Digitalmars-d-learn wrote:
Have a look here [1]. For example, if you have a code point that is between
U+0080 and U+07FF, you know that you need two bytes to get that whole
code point.

[1] http://en.wikipedia.org/wiki/UTF-8#Description
Thanks. I solved it myself already for UTF-8 encoding. I chose an approach using a bitmask. Maybe it is not the best for efficiency, but it works)

( str[index] & 0b10000000 ) == 0 ||
( str[index] & 0b11100000 ) == 0b11000000 ||
( str[index] & 0b11110000 ) == 0b11100000 ||
( str[index] & 0b11111000 ) == 0b11110000

If it is true, it means that the first byte of a sequence has been found and I can count them. Am I right that this equals the number of graphemes, or are there some exceptions to this rule?

For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?
This looks wrong to me. Are you sure this finds *all* possible graphemes? Keep in mind that combining diacritic sequences are treated as a single grapheme; for example the sequence 'A' U+0301 U+0302 U+0303. There are several different codepoint ranges that have the combining diacritic property, and they are definitely more complicated than what you have here.

Furthermore, there are more complicated things like the Devanagari sequences (e.g., KA + VIRAMA + TA + VOWEL SIGN U), that your code certainly doesn't look like it would handle correctly.

As somebody else has said, it's generally a bad idea to work with Unicode byte sequences yourself, because Unicode is complicated, and many apparently-simple concepts actually require a lot of care to get it right.

T

-- 
It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall
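
To make that concrete, a small sketch (using std.uni.byGrapheme and std.utf.count) with exactly that sequence, which is one grapheme but four code points:

import std.stdio : writeln;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    auto s = "A\u0301\u0302\u0303"; // 'A' + three combining diacritics
    writeln("code points: ", s.count);                 // 4
    writeln("graphemes:   ", s.byGrapheme.walkLength); // 1
}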
Oct 06 2014
parent reply Jacob Carlborg <doob me.com> writes:
On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:

 This looks wrong to me. Are you sure this finds *all* possible
 graphemes?
No, the data I gave was to detect a complete code unit. Graphemes are something else; I think Uranuz is mixing up the Unicode terms.

-- 
/Jacob Carlborg
Oct 06 2014
parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Tue, Oct 07, 2014 at 08:28:49AM +0200, Jacob Carlborg via
Digitalmars-d-learn wrote:
 On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:
 
This looks wrong to me. Are you sure this finds *all* possible
graphemes?
No, the data I gave was to detect a complete code unit. Graphemes are something else, I think Uranuz is mixing up the Unicode terms.
[...]

Ahhh, OK, then it makes sense.

T

-- 
People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like. Otherwise the programs they write will be pretty weird. -- D. Knuth
Oct 07 2014
prev sibling parent "anonymous" <anonymous example.com> writes:
On Monday, 6 October 2014 at 17:28:45 UTC, Uranuz wrote:
 ( str[index] & 0b10000000 ) == 0 ||
 ( str[index] & 0b11100000 ) == 0b11000000 ||
 ( str[index] & 0b11110000 ) == 0b11100000 ||
 ( str[index] & 0b11111000 ) == 0b11110000

 If it is true, it means that the first byte of a sequence has 
 been found and I can count them. Am I right that this equals 
 the number of graphemes, or are there some exceptions to this rule?

 For UTF-32 the number of code units is just equal to the number 
 of graphemes. And what about UTF-16? Is it possible to detect 
 the first code unit of an encoding sequence?
I think your idea of graphemes is off.

A grapheme is made up of one or more code points. This is the same for all UTF encodings.

A code point is made up of one or more code units. UTF-8: between 1 and 4, I think; UTF-16: 1 or 2; UTF-32: always 1.

A code unit is made up of a fixed number of bytes. UTF-8: 1, UTF-16: 2, UTF-32: 4.

So, the number of UTF-8 bytes in a sequence has no relation to graphemes. The number of leading ones in a UTF-8 start byte is equal to the total number of bytes in that sequence. I.e. when you see a 0b1110_0000 byte, the following two bytes should be continuation bytes (0b10xx_xxxx), and the three of them together encode a *code point*.

And in UTF-32, the number of code units is equal to the number of *code points*, not graphemes.
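
A small sketch of those three levels for one UTF-8 string (using std.utf.count for code points and std.uni.byGrapheme for graphemes; the sample string is arbitrary):

import std.stdio : writeln;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    string s = "no\u0308l"; // "nöl", with the umlaut as a combining mark
    writeln("code units (bytes): ", s.length);                 // 5
    writeln("code points:        ", s.count);                  // 4
    writeln("graphemes:          ", s.byGrapheme.walkLength);  // 3
}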
Oct 06 2014
prev sibling parent "Kagamin" <spam here.lot> writes:
On Sunday, 5 October 2014 at 12:09:34 UTC, Uranuz wrote:
 Maybe there is some idea how to just detect first code unit of 
 grapheme without overhead for using Grapheme struct? I just 
 tried to check if ch < 128 (for UTF-8). But this doesn't work. How 
 to check if byte is continuation of code for single code point 
 or if new sequence started?
Are you trying to split strings? If you want to optimize grapheme handling, try checking whether the next 10 code units are ASCII symbols; when that fails, fall back to graphemes.
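
A rough sketch of that kind of fast path (countGraphemes is an illustrative helper, not a Phobos function). Note that an ASCII unit can still begin a multi-code-point grapheme if a combining mark follows, and CR+LF forms a single grapheme, so the fast path below also checks the following unit:

import std.uni : decodeGrapheme;

// Illustrative helper: count graphemes with an ASCII fast path.
size_t countGraphemes(string s)
{
    size_t n;
    while (s.length)
    {
        // Fast path: an ASCII unit (other than CR) followed by another
        // ASCII unit (or the end of the string) is a complete grapheme.
        if (s[0] < 0x80 && s[0] != '\r' && (s.length == 1 || s[1] < 0x80))
            s = s[1 .. $];
        else
            decodeGrapheme(s); // slow path: pops one full grapheme off the front
        ++n;
    }
    return n;
}

void main()
{
    import std.stdio : writeln;
    writeln(countGraphemes("hello, A\u0301 world")); // the combining mark takes the slow path
}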
Oct 06 2014
prev sibling parent "Nicolas F." <ddev fratti.ch> writes:
Unicode is hard to deal with properly, as how you deal with it is
very context-dependent.

One grapheme is a visible character and consists of one or more
codepoints. One codepoint is one mapping of a byte sequence to a
meaning, and consists of one or more bytes.

This you do not want to deal with yourself, as knowing which
codepoints form graphemes is hard. Thankfully, std.uni exists.
Specifically, look at decodeGrapheme: it pops one grapheme from
an input range and returns it.
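
A minimal sketch of using it (the sample string is arbitrary):

import std.stdio : writeln;
import std.uni : decodeGrapheme;

void main()
{
    auto rest = "e\u0301x!"; // 'e' + combining acute accent, then "x!"
    size_t graphemeIndex;
    while (rest.length)
    {
        auto g = decodeGrapheme(rest); // pops one full grapheme off the front
        writeln("grapheme #", graphemeIndex, " has ", g.length, " code point(s)");
        graphemeIndex++;
    }
}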

Never write code that deals with Unicode on the byte level. It will
always be wrong.
Oct 06 2014