digitalmars.D - [GSOC] New unicode module beta, with Grapheme support!

Dmitry Olshansky <dmitry.olsh gmail.com> writes:
Well, officially the final bell has rung, marking the end of GSOC.

Meaning it's about time to show the project to the community.
This time around I sadly have some unresolved issues. Part of these are 
my fault, others are well known bugs in phobos/compiler.

Still, there is a lot of cool stuff in there that I'd love to tell you about:

  - all functions isXXX and toUpper/toLower of the old std.uni interface 
suddenly became faster and/or smarter

  - icmp function that does proper case-insensitive string comparison 
and matches e.g. German ß (Sulzbacher form) as equal to 'ss' (full 
case folding rules)

  - performance maniacs can use faster/simpler one: sicmp that maps only 
1:1 codepoints (simple casefolding rules)

  - extended grapheme cluster support: decode operation (decodeGrapheme) 
& slightly simpler a-la std.utf.stride to only get the length in 
codeunits (graphemeStride)

- normalization: currently only NFD & NFKD, with some issues (see below), 
and I still need to triple-check the correctness. NFC & NFKC are coming 
soon

- decomposition (and composition is coming): either Canonical or 
Compatibility; also yields a Grapheme with the decomposed codepoints
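
To make the comparison, grapheme and normalization items above concrete, here is a minimal sketch. It assumes the API as it exists in present-day Phobos std.uni; the beta uni module announced here may differ in names or signatures, and the sample strings are only illustrative.

```d
// Sketch only: uses today's std.uni. The 2012 beta was imported as "uni"
// and its API may not match exactly.
import std.uni : icmp, sicmp, graphemeStride, decodeGrapheme, normalize, NFD;

void main()
{
    // Full case folding: 'ß' expands to "ss", so these compare equal.
    assert(icmp("Rußland", "Russland") == 0);
    // Simple (1:1) folding cannot expand 'ß', so sicmp sees a difference.
    assert(sicmp("Rußland", "Russland") != 0);

    // 'A' followed by COMBINING RING ABOVE forms one grapheme cluster.
    string city = "A\u030Arhus";
    size_t first = graphemeStride(city, 0);
    assert(first == 3);              // 1 code unit for 'A' + 2 for U+030A
    assert(city[first .. $] == "rhus");

    // decodeGrapheme pops one whole cluster off the front of the input.
    string rest = city;
    auto g = decodeGrapheme(rest);
    assert(g.length == 2);           // two code points in the cluster
    assert(rest == "rhus");

    // Canonical decomposition: U+00F3 becomes 'o' + COMBINING ACUTE ACCENT.
    assert(normalize!NFD("abc\u00F3def") == "abco\u0301def");
}
```

Note that graphemeStride only measures the cluster, while decodeGrapheme also advances the input, much like std.utf.stride versus std.utf.decode.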

And the last but not least, library users get access to all the power 
toys used to construct the above algorithms:
     1) codepoint sets with full & fast set ops
     2) highly customizable multi-stage lookup table (aka Trie) with 
easy helpers to construct optimal multi-level dchar-->bool tables
     3) a ton of predefined Unicode sets: see general property, block or 
script
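
As a rough illustration of items 1 and 3, the sketch below builds a codepoint set by hand, combines it with a predefined script set, and queries the result. It is written against present-day std.uni (CodepointSet and the unicode helper), which is an assumption; the beta may spell these differently.

```d
// Sketch only: present-day std.uni names, not necessarily the beta's.
import std.uni : CodepointSet, unicode;

void main()
{
    // Sets are built from half-open [a, b) intervals of code points.
    auto digits = CodepointSet('0', '9' + 1);
    // Predefined sets are looked up by property, block or script name.
    auto cyrillic = unicode.Cyrillic;

    auto both = digits | cyrillic;      // union
    assert(both['5']);
    assert(both['Я']);
    assert(!both['x']);

    auto lettersOnly = both - digits;   // difference
    assert(!lettersOnly['5']);
    assert(lettersOnly['Я']);
}
```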

Caveats:
     - the NFC & NFKC normalizations are in the works; I'll try to get 
them done sometime later this week.

     - more than that, normalization depends on a patched Phobos and still 
often fails due to the bug 
http://d.puremagic.com/issues/show_bug.cgi?id=4584.

Patched Phobos is here: 
https://github.com/blackwhale/phobos/tree/stable-sort

     - no 64bit currently. Somehow I managed to break my _fresh_ 64bit 
installation of dmd (it fails both on Phobos unit tests & anything in my 
project), thus x64 lacks the bulk of the generated tables and is 
unsupported right now. Any help is appreciated.

Grab sources + tests, benchmarks, tools and sample data from:
https://github.com/blackwhale/gsoc-bench-2012/zipball/beta

And the sketchy DDoc:
http://blackwhale.github.com/phobos/std_uni.html

The first step to usage is "import uni;" instead of "import std.uni;" and 
adding uni.d to your command line.

Note: icmp may conflict with its brain dead twin from std.algorithm (or 
was that std.string?); use the usual tricks to disambiguate as necessary.
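
Two of the usual tricks are a renamed selective import and a static import. A minimal sketch, written against present-day std.uni rather than the beta uni module; the alias name unicodeIcmp is made up for illustration:

```d
// Renamed selective import: bind the Unicode-aware icmp to a distinct name.
import std.uni : unicodeIcmp = icmp;
// Static import: the symbol is reachable only by its fully qualified name.
static import std.uni;

void main()
{
    assert(unicodeIcmp("Rußland", "Russland") == 0);
    assert(std.uni.icmp("HELLO", "hello") == 0);
}
```

With the beta, the same pattern would presumably be applied to the uni module name instead.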

I'd enjoy some feedback, as way back in 2010 I recall a lot of 
Unicode-aware people longing for grapheme support. A short list of Ali 
Çehreli, Fawzi Mohamed and Michel Fortin comes to mind; maybe others will 
chime in.

P.S. Consider it as "ready for comments" as opposed to "ready for review".

P.P.S. Volunteers who'd like to test x64 are welcome to run
  rdmd gen_uni.d
and report back (maybe it's my local setup problem).


-- 
Olshansky Dmitry
Aug 22 2012
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 23, 2012 at 01:31:55AM +0400, Dmitry Olshansky wrote:
 Well, officially the final bell has rung, marking the end of GSOC.
 
 Meaning it's about time to show the project to the community.
 This time around I sadly have some unresolved issues. Part of these
 are my fault, others are well known bugs in phobos/compiler.
 
 Still there is a lot of cool stuff in there that I'd love to tell about:
[...snip list of awesome stuff...]

This is awesome!! Finally, the beginnings of better Unicode support.

I'm busy with other stuff currently, so I'll have to give more detailed
comments later, but from a quick glance over your list, I just have to
say +1, well done!


T

-- 
Doubt is a self-fulfilling prophecy.
Aug 22 2012
"bearophile" <bearophileHUGS lycos.com> writes:
Dmitry Olshansky:

We still don't know how much long-term success D will have, and 
not everyone needs advanced Unicode management, but I think this 
module will help the D success :-)

Are the CPU SIMD instructions going to help the performance of 
some code in this module?


     - more than that, normalization depends on a patched Phobos 
 and still often fails due to the bug 
 http://d.puremagic.com/issues/show_bug.cgi?id=4584.
I have written a comment there. And I have other things to say, but
they are a little OT in this thread, so I'll open another thread soon.

Bye and good work,
bearophile
Aug 22 2012
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 23-Aug-12 02:01, bearophile wrote:
 Dmitry Olshansky:

 We still don't know how much long-term success D will have, and not
 everyone needs advanced Unicode management, but I think this module will
 help the D success :-)

 Are the CPU SIMD instructions going to help the performance of some code
 in this module?
Not sure if there is something that does:

    auto a[] = [1, 3, 5, 8, 10, 20 ...]; //array of [a,b) intervals
    int b = 9;
    assert(b in a); //conceptual 'in', checks if b is in one of the intervals
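
The conceptual 'in' above can be sketched as a plain scalar binary search over the sorted boundary array. This is only an illustration of the idea, not the module's actual lookup (the announcement describes multi-stage tables for that); the helper name inIntervals and the sample values are hypothetical.

```d
import std.range : assumeSorted;

/// Hypothetical helper: tests membership in half-open intervals stored as a
/// flat sorted array of boundaries [a0, b0, a1, b1, ...].
bool inIntervals(const(uint)[] bounds, uint x)
{
    auto sorted = assumeSorted(bounds);
    // Number of boundaries <= x, found by binary search.
    size_t notAbove = bounds.length - sorted.upperBound(x).length;
    // An odd count means the last boundary passed was an interval start.
    return (notAbove & 1) == 1;
}

void main()
{
    uint[] a = [1, 3, 5, 8, 10, 20]; // [1,3), [5,8), [10,20)
    assert(inIntervals(a, 6));
    assert(!inIntervals(a, 8));  // end points are exclusive
    assert(!inIntervals(a, 0));
    assert(inIntervals(a, 19));
}
```

The parity test works because starts sit at even indices and ends at odd ones, so no separate pairing step is needed.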
     - more than that, normalization depends on a patched Phobos and still
 often fails due to the bug
 http://d.puremagic.com/issues/show_bug.cgi?id=4584.
 I have written a comment there. And I have other things to say, but they
 are a little OT in this thread, so I'll open another thread soon.

 Bye and good work,

Thanks.

-- 
Olshansky Dmitry
Aug 22 2012
Jacob Carlborg <doob me.com> writes:
On 2012-08-22 23:31, Dmitry Olshansky wrote:
 Well, officially the final bell has rung, marking the end of GSOC.
Cool.
 P.P.S. Volunteers who'd like to test x64 are welcome to run
   rdmd gen_uni.d
 and report back (maybe it's my local setup problem).
On Mac OS X, using DMD 2.060 64bit, the assert at line 568 is triggered.
The last part of the output is:

2FA1D ---> 2A600
2fa1d -~-> 2a600
core.exception.AssertError gen_uni(568): Assertion failure
----------------
5   gen_uni   0x00000001000a06ea _d_assertm + 38
6   gen_uni   0x00000001000013f7 void gen_uni.__assert(int) + 23
7   gen_uni   0x000000010000377c void gen_uni.writeTries().int __foreachbody8149(ref dchar, ref ushort) + 124
8   gen_uni   0x000000010009f406 _aaApply2 + 106
9   gen_uni   0x0000000100003497 void gen_uni.writeTries() + 687
10  gen_uni   0x00000001000013b7 _Dmain + 1131
11  gen_uni   0x00000001000a108a extern (C) int rt.dmain2.main(int, char**).void runMain() + 34
12  gen_uni   0x00000001000a0a41 extern (C) int rt.dmain2.main(int, char**).void tryExec(scope void delegate()) + 45
13  gen_uni   0x00000001000a10d4 extern (C) int rt.dmain2.main(int, char**).void runAll() + 56
14  gen_uni   0x00000001000a0a41 extern (C) int rt.dmain2.main(int, char**).void tryExec(scope void delegate()) + 45
15  gen_uni   0x00000001000a09cb main + 235
16  gen_uni   0x0000000100000f44 start + 52
17  ???       0x0000000000000001 0x0 + 1
----------------

I also piped it to a file which resulted in this:

http://pastebin.com/xjg68CdG

-- 
/Jacob Carlborg
Aug 22 2012
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 23-Aug-12 10:34, Jacob Carlborg wrote:
 On 2012-08-22 23:31, Dmitry Olshansky wrote:
 Well, officially the final bell has rung, marking the end of GSOC.
Cool.
 P.P.S. Volunteers who'd like to test x64 are welcome to run
   rdmd gen_uni.d
 and report back (maybe it's my local setup problem).
On Mac OS X, using DMD 2.060 64bit, the assert at line 568 is triggered. The last part of the output is:
Great... at least the error is the same :)
So the only question that remains is why the Phobos tests fail for me, but doh.
Guess I'll have to go the old painful way of debugging.
 2FA1D ---> 2A600
 2fa1d -~->  2a600
 core.exception.AssertError gen_uni(568): Assertion failure
 ----------------
 5   gen_uni                             0x00000001000a06ea _d_assertm + 38
 6   gen_uni                             0x00000001000013f7 void
 gen_uni.__assert(int) + 23
 7   gen_uni                             0x000000010000377c void
 gen_uni.writeTries().int __foreachbody8149(ref dchar, ref ushort) + 124
 8   gen_uni                             0x000000010009f406 _aaApply2 + 106
 9   gen_uni                             0x0000000100003497 void
 gen_uni.writeTries() + 687
 10  gen_uni                             0x00000001000013b7 _Dmain + 1131
 11  gen_uni                             0x00000001000a108a extern (C)
 int rt.dmain2.main(int, char**).void runMain() + 34
 12  gen_uni                             0x00000001000a0a41 extern (C)
 int rt.dmain2.main(int, char**).void tryExec(scope void delegate()) + 45
 13  gen_uni                             0x00000001000a10d4 extern (C)
 int rt.dmain2.main(int, char**).void runAll() + 56
 14  gen_uni                             0x00000001000a0a41 extern (C)
 int rt.dmain2.main(int, char**).void tryExec(scope void delegate()) + 45
 15  gen_uni                             0x00000001000a09cb main + 235
 16  gen_uni                             0x0000000100000f44 start + 52
 17  ???                                 0x0000000000000001 0x0 + 1
 ----------------

 I also piped it to a file which resulted in this:

 http://pastebin.com/xjg68CdG
-- Olshansky Dmitry
Aug 23 2012
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 23-Aug-12 10:34, Jacob Carlborg wrote:
 On 2012-08-22 23:31, Dmitry Olshansky wrote:
 Well, officially the final bell has rung, marking the end of GSOC.
Cool.
 P.P.S. Volunteers who'd like to test x64 are welcome to run
   rdmd gen_uni.d
 and report back (maybe it's my local setup problem).
 On Mac OS X, using DMD 2.060 64bit, the assert at line 568 is triggered.
 The last part of the output is:

 2FA1D ---> 2A600
 2fa1d -~-> 2a600
 core.exception.AssertError gen_uni(568): Assertion failure
Found one pesky bug ... not in my code though ^)
http://d.puremagic.com/issues/show_bug.cgi?id=8583

And it appears to be the only problem with x64.
The beta zipball should now have a proper x64 version too.
https://github.com/blackwhale/gsoc-bench-2012/zipball/beta

-- 
Olshansky Dmitry
Aug 24 2012