digitalmars.D.learn - Accessing non-binary Unicode properties with std.uni

Chloé Kekoa (9/10) Sep 28 2020 The documentation of std.uni [1] says that the unicode struct

Dukc (4/14) Sep 29 2020 It seems the East Asian width is Unicode standard 13.0, while

H. S. Teoh (409/426) Sep 29 2020 [...]

Chloé Kekoa (2/7) Sep 29 2020 Thank you. Analyzing the data file seems simple enough. :)

H. S. Teoh (8/16) Sep 29 2020 If you're daring, you can try parsing it at compile-time... but in this

Chloé Kekoa <rightfold+ck gmail.com> writes:

The documentation of std.uni [1] says that the unicode struct 
provides sets for several binary properties. I am looking for a 
way to query non-binary properties of a character. Is that 
possible with std.uni or do I need to use a third-party library?

I am specifically interested in the East_Asian_Width property [2] 
(which has six allowed values). Trying to access 
std.uni.unicode.East_Asian_Width results in the error message:

 No unicode set by name East_Asian_Width was found.

[1]: https://dlang.org/library/std/uni.html
[2]: https://www.unicode.org/reports/tr11/tr11-38.html
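
For reference, here is a minimal sketch of what the `unicode` lookup does offer, assuming the names `Cyrillic`, `Lu` and `Currency_Symbol` that appear in the std.uni documentation. Each lookup yields a code point set, so it can only answer yes/no membership questions, which may be why a six-valued property has no place in this interface:

```d
import std.uni : unicode;

void main()
{
    // Compile-time lookup by name: the result is a set of code points.
    auto cyrillic = unicode.Cyrillic;
    assert(cyrillic['я']);        // membership test: in the set
    assert(!cyrillic['z']);       // not in the set

    // General categories are exposed the same way.
    assert(unicode.Lu['A']);      // Lu = Uppercase_Letter

    // Run-time lookup by string name.
    auto currency = unicode("Currency_Symbol");
    assert(currency['$']);
}
```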

Sep 28 2020

Dukc <ajieskola gmail.com> writes:

On Monday, 28 September 2020 at 18:23:43 UTC, Chloé Kekoa wrote:
 The documentation of std.uni [1] says that the unicode struct 
 provides sets for several binary properties. I am looking for a 
 way to query non-binary properties of a character. Is that 
 possible with std.uni or do I need to use a third-party library?

 I am specifically interested in the East_Asian_Width property 
 [2] (which has six allowed values). Trying to access 
 std.uni.unicode.East_Asian_Width results in the error message:

 No unicode set by name East_Asian_Width was found.

 [1]: https://dlang.org/library/std/uni.html
 [2]: https://www.unicode.org/reports/tr11/tr11-38.html

It seems the East Asian width is Unicode standard 13.0, while 
Phobos implements 6.2. So it seems like a case for a third-party 
library :(.

Sep 29 2020

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Sep 29, 2020 at 04:22:18PM +0000, Dukc via Digitalmars-d-learn wrote:
 On Monday, 28 September 2020 at 18:23:43 UTC, Chloé Kekoa wrote:
 The documentation of std.uni [1] says that the unicode struct
 provides sets for several binary properties. I am looking for a way
 to query non-binary properties of a character. Is that possible with
 std.uni or do I need to use a third-party library?
 
 I am specifically interested in the East_Asian_Width property [2]
 (which has six allowed values). Trying to access
 std.uni.unicode.East_Asian_Width results in the error message:
 
 No unicode set by name East_Asian_Width was found.

 
 [1]: https://dlang.org/library/std/uni.html
 [2]: https://www.unicode.org/reports/tr11/tr11-38.html

 
 It seems the East Asian width is Unicode standard 13.0, while Phobos
 implements 6.2. So it seems like a case for a third-party library :(.

[...]

OTOH, the relevant Unicode data file that contains East_Asian_Width data
(EastAsianWidth.txt) is relatively straightforward to parse.  In one of
my projects, I wrote a little helper program to parse this file and
generate a function that tells me if a given dchar is wide or narrow.

Here's the generated function (just copy-n-paste this into your code, no
need for yet another external library dependency):

	bool isWide(dchar ch) @safe pure nothrow @nogc
	{
	    if (ch < 63744)
	    {
		if (ch < 12880)
		{
		    if (ch < 11904)
		    {
			if (ch < 4352) return false;
			if (ch < 4448) return true;
			if (ch == 9001 || ch == 9002) return true;
			return false;
		    }
		    else if (ch < 12351) return true;
		    else
		    {
			if (ch < 12353) return false;
			if (ch < 12872) return true;
			return false;
		    }
		}
		else if (ch < 19904) return true;
		else
		{
		    if (ch < 43360)
		    {
			if (ch < 19968) return false;
			if (ch < 42183) return true;
			return false;
		    }
		    else if (ch < 43389) return true;
		    else
		    {
			if (ch < 44032) return false;
			if (ch < 55204) return true;
			return false;
		    }
		}
	    }
	    else if (ch < 64256) return true;
	    else
	    {
		if (ch < 65504)
		{
		    if (ch < 65072)
		    {
			if (ch < 65040) return false;
			if (ch < 65050) return true;
			return false;
		    }
		    else if (ch < 65132) return true;
		    else
		    {
			if (ch < 65281) return false;
			if (ch < 65377) return true;
			return false;
		    }
		}
		else if (ch < 65511) return true;
		else
		{
		    if (ch < 127488)
		    {
			if (ch == 110592 || ch == 110593) return true;
			return false;
		    }
		    else if (ch < 127570) return true;
		    else
		    {
			if (ch < 131072) return false;
			if (ch < 262142) return true;
			return false;
		    }
		}
	    }
	}
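
As a rough illustration of how such a predicate might be used, here is a sketch that counts terminal columns for a string. To keep it short and self-contained it does not repeat the generated function; instead it builds a `CodepointSet` from four of the half-open intervals that appear in the function above. The helper name `displayWidth` is made up for this example, and it assumes every non-wide code point takes exactly one column (combining marks and control characters are not handled).

```d
import std.uni : CodepointSet;

// Hypothetical helper: wide code points count as 2 columns, all others as 1.
size_t displayWidth(const(char)[] s, CodepointSet wide)
{
    size_t cols = 0;
    foreach (dchar ch; s)            // decodes UTF-8 into code points
        cols += wide[ch] ? 2 : 1;
    return cols;
}

void main()
{
    // A few intervals [a, b) copied from the generated isWide above:
    // Hangul Jamo, CJK ideographs onwards, Hangul syllables, fullwidth forms.
    auto wide = CodepointSet(4352, 4448,
                             19968, 42183,
                             44032, 55204,
                             65281, 65377);

    assert(displayWidth("abc", wide) == 3);
    assert(displayWidth("日本語", wide) == 6);
    assert(displayWidth("a日b", wide) == 4);
}
```

With the full generated function pasted in, `wide[ch]` would simply become `isWide(ch)`.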

Here's the utility that generated this code:

	/**
	 * Simple program to parse EastAsianWidth.txt to extract some useful info.
	 */
	
	import std.algorithm;
	import std.conv;
	import std.range;
	import std.regex;
	import std.stdio;
	
	struct CodeRange
	{
	    dchar start, end;
	
	    bool overlaps(CodeRange cr)
	    {
	        return ((start >= cr.start && start < cr.end) ||
	                (end >= cr.start && end < cr.end));
	    }
	
	    unittest
	    {
	        assert(CodeRange(1,11).overlaps(CodeRange(11,12)));
	        assert(!CodeRange(1,10).overlaps(CodeRange(11,12)));
	    }
	
	    void merge(CodeRange cr)
	    {
	        start = min(start, cr.start);
	        end = max(end, cr.end);
	    }
	
	    unittest
	    {
	        auto cr = CodeRange(10,20);
	        cr.merge(CodeRange(20,30));
	        assert(cr == CodeRange(10,30));
	    }
	
	    void toString(scope void delegate(const(char)[]) sink)
	    {
	        import std.format : formattedWrite;
	        sink.formattedWrite("%04X", start);
	        if (end > start+1)
	            sink.formattedWrite("..%04X", end-1);
	    }
	}
	
	struct Entry
	{
	    CodeRange range;
	    string width;
	
	    void toString(scope void delegate(const(char)[]) sink)
	    {
	        import std.format : formattedWrite;
	        sink.formattedWrite("%s;%s", range, width);
	    }
	}
	
	/**
	 * Returns: An input range of Entry objects.
	 */
	auto parse(R)(R input)
	    if (isInputRange!R && is(ElementType!R : const(char)[]))
	{
	    // For our purposes, we don't need to distinguish between explicit/implicit
	    // narrowness, and ambiguous cases can just default to narrow. So we map
	    // the original width to its equivalent using the following equivalence
	    // table.
	    string[string] equivs = [
	        "Na" : "N",
	        "N"  : "N",
	        "H"  : "N",
	        "A"  : "N",
	        "W"  : "W",
	        "F"  : "W"
	    ];
	
	    auto reEmpty = regex(`^\s*$`);
	    auto reSingle = regex(`^([0-9A-F]+);(N|A|H|W|F|Na)\b`);
	    auto reRange = regex(`^([0-9A-F]+)\.\.([0-9A-F]+);(N|A|H|W|F|Na)\b`);
	
	    struct Result
	    {
	        R     range;
	        Entry front;
	        bool  empty;
	
	        this(R _range)
	        {
	            range = _range;
	            next(); // get things started
	        }
	
	        void next()
	        {
	            while (!range.empty)
	            {
	                auto line = range.front;
	
	                if (auto m = line.match(reSingle))
	                {
	                    auto width = equivs[m.captures[2]];
	                    dchar ch = cast(dchar) m.captures[1].to!int(16);
	                    front = Entry(CodeRange(ch, ch+1), width);
	                    empty = false;
	                    return;
	                }
	                else if (auto m = line.match(reRange))
	                {
	                    auto width = equivs[m.captures[3]];
	                    dchar start = cast(dchar) m.captures[1].to!int(16);
	                    dchar end = cast(dchar) m.captures[2].to!int(16) + 1;
	                    front = Entry(CodeRange(start, end), width);
	                    empty = false;
	                    return;
	                }

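	                else if (!line.match(reEmpty) && !line.startsWith("#"))
	                {
	                    // Assumed guard: blank lines and '#' comment lines
	                    // are skipped; any other line is a parse error.
	                    import std.string : format;
	                    throw new Exception("Couldn't parse line:\n%s"
	                                        .format(line));
	                }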
	
	                range.popFront();
	            }
	            empty = true;
	        }
	
	        void popFront()
	        {
	            range.popFront();
	            next();
	        }
	    }
	    static assert(isInputRange!Result);
	
	    return Result(input);
	}
	
	void outputByWidthType(R)(R input)
	    if (isInputRange!R && is(ElementType!R : const(char)[]))
	{
	    CodeRange[][string] widths;
	    string lastWidth;
	
	    void addRange(Entry entry)
	    {
	        auto range = entry.range;
	        auto width = entry.width;
	        auto ranges = width in widths;
	        if (ranges && ranges.length > 0 && width == lastWidth)
	        {
	            (*ranges)[$-1].merge(range);
	        }
	        else
	            widths[width] ~= range;
	
	        lastWidth = width;
	    }
	
	    foreach (entry; input.parse())
	    {
	         addRange(entry);
	    }
	
	    foreach (width; widths.byKey())
	    {

	        foreach (range; widths[width])
	        {
	            writefln("%s;%s", range, width);
	        }
	        writeln();
	    }
	}
	
	/**
	 * Returns: An input range of Entry objects.
	 */
	auto mergeConsecutive(R)(R input)
	    if (isInputRange!R && is(ElementType!R : Entry))
	{
	    struct Result
	    {
	        R     range;
	        bool  empty;
	        Entry front;
	        Entry current;
	
	        this(R _range)
	        {
	            range = _range;
	            next();
	        }
	
	        void next()
	        {
	            while (!range.empty)
	            {
	                auto e = range.front;
	                if (current.width != e.width)
	                {
	                    if (current.width != "")
	                    {
	                        empty = false;
	                        front = current;
	
	                        current = e;
	                        range.popFront();
	
	                        //writefln("Yielding: %s", front);
	                        return;
	                    }
	                    current = e;
	                }
	                else
	                {
	                    //writefln("Merging: %s with %s", current, e);
	                    current.range.merge(e.range);
	                }
	
	                range.popFront();
	            }
	
	            if (current.width != "")
	            {
	                empty = false;
	                front = current;
	            }
	            else
	                empty = true;
	        }
	
	        void popFront()
	        {
	            if (range.empty)
	                empty = true; // on last element
	            else
	                next();
	        }
	    }
	
	    return Result(input);
	}
	
	void outputByCodePoint(R)(R input)
	    if (isInputRange!R && is(ElementType!R : const(char)[]))
	{
	    writefln("%(%s\n%)", input.parse().mergeConsecutive());
	}
	
	void tally(R)(R input)
	    if (isInputRange!R && is(ElementType!R : const(char)[]))
	{
	    int totalW, totalN;
	
	    foreach (e; input.parse().mergeConsecutive())
	    {
	        if (e.width=="W")
	            totalW += (e.range.end - e.range.start);
	        else if (e.width=="N")
	            totalN += (e.range.end - e.range.start);
	        else
	            assert(0);
	    }
	    writefln("Tally: W=%d N=%d\n", totalW, totalN);
	}
	
	void genRecogCode(R)(R input)
	    if (isInputRange!R && is(ElementType!R : const(char)[]))
	{
	    import std.uni;
	
	    CodepointSet wideChars;
	    foreach (e; input.parse().mergeConsecutive())
	    {
	        if (e.width=="W")
	            wideChars.add(e.range.start, e.range.end);
	    }
	
	    writeln(wideChars.toSourceCode("isWide"));
	}
	
	int main(string[] args)
	{
	    if (args.length < 2)
	    {
	        assert(args.length > 0);
	        stderr.writefln("Usage: %s (bywidth|bypoint|tally|gencode)", args[0]);
	        return 1;
	    }
	
	    auto input = File("ext/EastAsianWidth.txt", "r").byLine();
	
	    auto cmd = args[1];
	    switch (cmd)
	    {
	        case "bywidth":
	            outputByWidthType(input);
	            break;
	
	        case "bypoint":
	            outputByCodePoint(input);
	            break;
	
	        case "tally":
	            tally(input);
	            break;
	
	        case "gencode":
	            genRecogCode(input);
	            break;
	
	        default:
	            stderr.writefln("Unknown command: %s", cmd);
	            return 1;
	    }
	    return 0;
	}
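
The utility above collapses the six values into wide/narrow. If all six values are needed, as in the original question, one possible approach is to keep the value string per range and look it up with a binary search. The following is only a sketch: the names `WidthRange`, `parseLine` and `widthOf` are invented here, the sample lines merely follow the `XXXX;V` / `XXXX..YYYY;V` format that the regexes above expect, the table is assumed to be sorted by code point, and returning "N" for unlisted code points is a simplifying assumption.

```d
import std.algorithm.searching : findSplit;
import std.conv : to;
import std.string : strip;

struct WidthRange { uint first, last; string width; } // inclusive bounds

/// Parses one data line; returns false for blank and '#' comment lines.
bool parseLine(const(char)[] line, out WidthRange r)
{
    auto data = line.findSplit("#")[0].strip;   // drop any trailing comment
    if (data.length == 0) return false;

    auto fields = data.findSplit(";");
    if (!fields) throw new Exception("Malformed line: " ~ line.idup);

    auto cps = fields[0].strip.findSplit("..");
    r.first = cps[0].to!uint(16);
    if (cps) r.last = cps[2].to!uint(16);        // "XXXX..YYYY"
    else     r.last = r.first;                   // single code point
    r.width = fields[2].strip.idup;
    return true;
}

/// Binary search over a table sorted by code point.
string widthOf(const WidthRange[] table, dchar ch)
{
    size_t lo = 0, hi = table.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (ch < table[mid].first) hi = mid;
        else if (ch > table[mid].last) lo = mid + 1;
        else return table[mid].width;
    }
    return "N"; // assumption: unlisted code points are treated as Neutral
}

void main()
{
    // Illustrative input only; the real file is much longer.
    auto lines = ["# comment", "", "0020;Na", "1100..115F;W",
                  "3000;F # trailing comment", "FF61..FFBE;H"];
    WidthRange[] table;
    foreach (line; lines)
    {
        WidthRange r;
        if (parseLine(line, r)) table ~= r;
    }

    assert(table.length == 4);
    assert(table.widthOf(' ') == "Na");
    assert(table.widthOf('\u1100') == "W");
    assert(table.widthOf('\u3000') == "F");
    assert(table.widthOf('\uFF61') == "H");
    assert(table.widthOf('\u0378') == "N"); // not in the sample: fallback
}
```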


T

-- 
People tell me that I'm skeptical, but I don't believe them.

Sep 29 2020

Chloé Kekoa <rightfold+ck gmail.com> writes:

On Tuesday, 29 September 2020 at 17:04:51 UTC, H. S. Teoh wrote:
 OTOH, the relevant Unicode data file that contains 
 East_Asian_Width data (EastAsianWidth.txt) is relatively 
 straightforward to parse.  In one of my projects, I wrote a 
 little helper program to parse this file and generate a 
 function that tells me if a given dchar is wide or narrow.

Thank you. Analyzing the data file seems simple enough. :)

Sep 29 2020

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Sep 29, 2020 at 06:14:45PM +0000, Chloé Kekoa via Digitalmars-d-learn
wrote:
 On Tuesday, 29 September 2020 at 17:04:51 UTC, H. S. Teoh wrote:
 OTOH, the relevant Unicode data file that contains East_Asian_Width
 data (EastAsianWidth.txt) is relatively straightforward to parse.
 In one of my projects, I wrote a little helper program to parse this
 file and generate a function that tells me if a given dchar is wide
 or narrow.

 
 Thank you. Analyzing the data file seems simple enough. :)

If you're daring, you can try parsing it at compile-time... but in this
case, it's kinda pointless, since the data file doesn't change, so
statically generating the desired code as a separate step seems a more
logical thing to do.
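
For the curious, a compile-time version might look roughly like the sketch below. It runs a small hand-written parser through CTFE on an embedded sample string; all names (`WidthRange`, `parseHex`, `buildTable`, `sample`) are invented for the example, and the parser assumes uppercase hex and no trailing comments. With the real file one would presumably feed it `import("EastAsianWidth.txt")` instead, which needs the compiler's `-J` switch pointing at the directory containing the file.

```d
struct WidthRange { uint first, last; string width; } // inclusive bounds

uint parseHex(string s)
{
    uint v = 0;
    foreach (char c; s)
    {
        uint digit;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else assert(0, "bad hex digit");
        v = v * 16 + digit;
    }
    return v;
}

WidthRange[] buildTable(string data)
{
    WidthRange[] table;
    size_t pos = 0;
    while (pos < data.length)
    {
        // Slice out one line.
        size_t eol = pos;
        while (eol < data.length && data[eol] != '\n') ++eol;
        string line = data[pos .. eol];
        pos = eol + 1;

        if (line.length == 0 || line[0] == '#') continue;

        // Split "FIRST[..LAST];WIDTH".
        size_t semi = 0;
        while (semi < line.length && line[semi] != ';') ++semi;
        assert(semi < line.length, "missing ';'");
        string cps = line[0 .. semi];
        size_t dot = 0;
        while (dot < cps.length && cps[dot] != '.') ++dot;

        WidthRange r;
        r.first = parseHex(cps[0 .. dot]);
        r.last  = dot < cps.length ? parseHex(cps[dot + 2 .. $]) : r.first;
        r.width = line[semi + 1 .. $];
        table ~= r;
    }
    return table;
}

// Illustrative sample data, not the real file.
enum sample = "# sample\n0020;Na\n1100..115F;W\n3000;F\n";

// A module-level initializer must be a compile-time constant,
// so buildTable is evaluated by CTFE here.
immutable WidthRange[] table = buildTable(sample);

static assert(table.length == 3);
static assert(table[1].first == 0x1100 && table[1].last == 0x115F);
static assert(table[1].width == "W");

void main()
{
    assert(table[2].first == 0x3000 && table[2].width == "F");
}
```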


T

-- 
Curiosity kills the cat. Moral: don't be the cat.

Sep 29 2020
