digitalmars.D - Proposed Phobos equivalent of wcswidth()
- H. S. Teoh (148/148) Jan 13 2018 This past week, while reviewing Phobos PR #6008, I started experimenting
- Jack Stouffer (3/6) Jan 15 2018 std.utf.displayWidth
- Simen =?UTF-8?B?S2rDpnLDpXM=?= (4/5) Jan 15 2018 +1
- H. S. Teoh (6/10) Jan 15 2018 [...]
- Jack Stouffer (8/17) Jan 15 2018 The way I understand it is that std.uni is (supposed to be) for
- H. S. Teoh (14/32) Jan 15 2018 Are you sure? I thought std.utf was specifically dealing with UTF-*
- Jonathan M Davis (6/30) Jan 15 2018 Your understanding of the division more or less matches mine, though I'm...
- WhatMeForget (3/9) Jan 15 2018 std.utf.bikeshed
- Kagamin (2/2) Jan 15 2018 columnWidth as it only makes sense for column-oriented text
- Dominikus Dittes Scherkl (7/9) Jan 15 2018 I think displayWidth is better, because "width" is directly
with an optimized D equivalent of wcswidth(). For more details, see: https://issues.dlang.org/show_bug.cgi?id=7054 https://issues.dlang.org/show_bug.cgi?id=17810 as well as the discussion on: https://github.com/dlang/phobos/pull/6008 Anyway, the TL;DR summary is this: given a format() spec like "%20s", in order to insert the correct number of spaces to pad the string to 20 characters (or rather, 20 spaces in the output), we need to compute the displayed length of the string in monospace font. Unfortunately, given the complexities of Unicode, this is far from trivial: - In C, the C library doesn't even pretend to know Unicode, so the padding is just based on the number of bytes the string occupies. Obviously, for anything non-ASCII the output will be wrong (misaligned). - In D, in the original naïve implementation, we try to be a little smarter by counting the number of dchars. Unfortunately, this is also wrong, because of combining diacritics like U+0301 which modify the preceding character and do not advance the cursor. instead. However, this is *still* wrong, because of the existence of zero-width characters (don't you just love Unicode?!), and also because of "wide" or "full-width" East Asian block characters as specified by Unicode TR11 (and scarily enough, the new Emoji blocks are included in this "wide" category), which on a text console generally occupies 2 positions per grapheme rather than 1. Eventually, the solution boils down to implementing the equivalent of Posix wcswidth(). But a naïve implementation of this is extremely inefficient, because segmenting a Unicode string by grapheme and *then* computing its width is non-trivial. So inefficient that it's just too slow to use in format(), especially if most strings you'd pass to format() are ASCII-only or mostly ASCII. Thankfully, std.uni provides (some of) the tools to optimize this. The basic idea is this: we don't actually care to segment graphemes; all we want to do is to know, given some string s, how many display positions it will occupy, so that we can insert the right number of spaces. The actual grapheme segmentation and typesetting is the terminal's job, and none of format()'s business. So we can cut some corners while still producing the right results. Basically my current solution consists of: - Parsing EastAsianWidth.txt published by the Unicode Consortium to precompute a table of wide/full-width characters (W and F) -- this is not done at runtime or compile-time, but as a separate step to generate the source code of the table, since otherwise it's either too slow at runtime or would slow down Phobos compilation too much, plus it depends on an external file which is not practical; - Combining this table with Unicode category Grapheme_extend, plus a bunch of hand-coded zero-width characters to produce a mapping of every dchar to 0, 1, or 2. All characters that extend a grapheme, like a combining diacritic, maps to 0. All characters designated as Wide or Full-width (excluding grapheme extenders) map to 2. Everything else maps to 1. - Compiling this table into a 3-level Trie (std.uni.Trie) for O(1) runtime lookup per dchar. - Computing the display width, then, is just a matter of iterating over dchars in the string and summing the values looked up in the trie. Of course, no matter how optimized a width lookup is, it's still pretty slow for an ASCII-only string, which is 90% of the use cases of format(). So to improve this common case, the additional optimization is to scan the string for ASCII-only bytes, and just incrementing the width since we know ASCII characters are always 1 column wide. Only when we encounter a non-ASCII byte that we bother with UTF-8 decoding and the table lookup. Here's my current implementation: https://github.com/quickfur/strwidth Here's my current benchmark results: - walkLength is literally passing the string to std.range.walkLength, which is basically counting the number of code points in the string. As mentioned before, this does not produce the correct width. - byGraphemeWalk is the next step up, to count the number of graphemes using std.uni.byGrapheme. Unfortunately, this is still not fully correct. - graphemeStrideWalk is a slight optimization of byGraphemeWalk, by not actually decoding the grapheme, but just computing the stride. It also has the virtue of being usable in CTFE. Performance-wise, it's not that much different from byGraphemeWalk. - width0 is the first "correct" string width computation, but with a naïve, slow implementation. It serves as a baseline to compare the next implementations. - width1 is the trie-optimized version of width0. It shows significant improvement over width0, but is still very slow for ASCII strings compared with walkLength. - width2 includes the optimization of skipping ASCII-only parts of the string, bypassing decoding and trie lookup. - width3 is a variant of width2 that also skips trie lookup for characters below U+300, the first block that contains widths not equal to 1 (combining diacritics). The benchmark results don't show significant performance differences with width2, however. [walkLength] (100000 iterations): ASCII strings of 32 bytes: 24 ms, 483 μs, and 4 hnsecs Unicode strings of 32 bytes: 16 ms, 829 μs, and 2 hnsecs ASCII strings of 128 bytes: 94 ms, 538 μs, and 9 hnsecs Unicode strings of 128 bytes: 59 ms, 395 μs, and 6 hnsecs ASCII strings of 1024 bytes: 672 ms, 73 μs, and 7 hnsecs Unicode strings of 1024 bytes: 376 ms, 634 μs, and 7 hnsecs [byGraphemeWalk] (100000 iterations): ASCII strings of 32 bytes: 953 ms, 353 μs, and 5 hnsecs Unicode strings of 32 bytes: 353 ms, 934 μs, and 2 hnsecs ASCII strings of 128 bytes: 3 secs, 862 ms, 747 μs, and 7 hnsecs Unicode strings of 128 bytes: 1 sec, 302 ms, 128 μs, and 6 hnsecs ASCII strings of 1024 bytes: 30 secs, 993 ms, 404 μs, and 1 hnsec Unicode strings of 1024 bytes: 9 secs, 900 ms, 579 μs, and 1 hnsec [graphemeStrideWalk] (100000 iterations): ASCII strings of 32 bytes: 816 ms, 190 μs, and 9 hnsecs Unicode strings of 32 bytes: 307 ms, 141 μs, and 9 hnsecs ASCII strings of 128 bytes: 3 secs, 317 ms, 661 μs, and 7 hnsecs Unicode strings of 128 bytes: 1 sec, 138 ms, 172 μs, and 7 hnsecs ASCII strings of 1024 bytes: 26 secs, 903 ms, and 898 μs Unicode strings of 1024 bytes: 8 secs, 804 ms, 48 μs, and 1 hnsec [width0] (100000 iterations): ASCII strings of 32 bytes: 1 sec, 62 ms, 163 μs, and 5 hnsecs Unicode strings of 32 bytes: 397 ms and 517 μs ASCII strings of 128 bytes: 4 secs, 228 ms, 269 μs, and 2 hnsecs Unicode strings of 128 bytes: 1 sec, 448 ms, 926 μs, and 5 hnsecs ASCII strings of 1024 bytes: 33 secs, 797 ms, 690 μs, and 5 hnsecs Unicode strings of 1024 bytes: 11 secs, 206 ms, 915 μs, and 8 hnsecs [width1] (100000 iterations): ASCII strings of 32 bytes: 173 ms, 509 μs, and 8 hnsecs Unicode strings of 32 bytes: 80 ms, 730 μs, and 9 hnsecs ASCII strings of 128 bytes: 692 ms and 117 μs Unicode strings of 128 bytes: 296 ms, 89 μs, and 1 hnsec ASCII strings of 1024 bytes: 5 secs, 486 ms, 328 μs, and 3 hnsecs Unicode strings of 1024 bytes: 2 secs, 208 ms, 846 μs, and 9 hnsecs [width2] (100000 iterations): ASCII strings of 32 bytes: 12 ms, 791 μs, and 9 hnsecs Unicode strings of 32 bytes: 69 ms, 51 μs, and 5 hnsecs ASCII strings of 128 bytes: 50 ms, 643 μs, and 3 hnsecs Unicode strings of 128 bytes: 283 ms, 572 μs, and 3 hnsecs ASCII strings of 1024 bytes: 319 ms, 527 μs, and 4 hnsecs Unicode strings of 1024 bytes: 2 secs, 200 ms, 473 μs, and 3 hnsecs [width3] (100000 iterations): ASCII strings of 32 bytes: 12 ms, 927 μs, and 5 hnsecs Unicode strings of 32 bytes: 67 ms, 952 μs, and 9 hnsecs ASCII strings of 128 bytes: 50 ms, 322 μs, and 7 hnsecs Unicode strings of 128 bytes: 283 ms, 628 μs, and 6 hnsecs ASCII strings of 1024 bytes: 331 ms, 921 μs, and 9 hnsecs Unicode strings of 1024 bytes: 2 secs, 239 ms, 415 μs, and 5 hnsecs Given these results, it appears that width2 is probably the best way to go. And now the obligatory bikeshed: what should the Phobos equivalent of wcswidth be called? The current plan is just width(), but the name is too simplistic and too likely to collide with user-defined identifiers. Any suggestions? Bring on the rainbow bikeshed! :-P T -- Without geometry, life would be pointless. -- VS
Jan 13 2018
On Saturday, 13 January 2018 at 17:26:52 UTC, H. S. Teoh wrote:...Thanks for taking the time to do this.And now the obligatory bikeshed: what should the Phobos equivalent of wcswidth be called?std.utf.displayWidth
Jan 15 2018
On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:std.utf.displayWidth+1 -- Simen
Jan 15 2018
On Mon, Jan 15, 2018 at 02:14:56PM +0000, Simen Kjærås via Digitalmars-d wrote:On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:[...] Why std.utf rather than std.uni, though? T -- ASCII stupid question, getty stupid ANSI.std.utf.displayWidth+1
Jan 15 2018
On Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote:On Mon, Jan 15, 2018 at 02:14:56PM +0000, Simen Kjærås via Digitalmars-d wrote:The way I understand it is that std.uni is (supposed to be) for functions on individual unicode units (be they code units/points or graphemes) and std.utf is for functions which handle operating on unicode strings. Obviously there are exceptions. I think "they" put graphemeStride in std.uni because Grapheme was defined there and it seemed reasonable at the time. But, generally I think utf stuff should go into std.utf.On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:[...] Why std.utf rather than std.uni, though?std.utf.displayWidth+1
Jan 15 2018
On Mon, Jan 15, 2018 at 06:20:16PM +0000, Jack Stouffer via Digitalmars-d wrote:On Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote:Are you sure? I thought std.utf was specifically dealing with UTF-* encodings, i.e., code units and conversions to/from code points, and std.uni was supposed to be for implementing Unicode algorithms and Unicode compliance in general, i.e., stuff that works at the code point level.On Mon, Jan 15, 2018 at 02:14:56PM +0000, Simen Kjærås via Digitalmars-d wrote:The way I understand it is that std.uni is (supposed to be) for functions on individual unicode units (be they code units/points or graphemes) and std.utf is for functions which handle operating on unicode strings.On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:[...] Why std.utf rather than std.uni, though?std.utf.displayWidth+1Obviously there are exceptions. I think "they" put graphemeStride in std.uni because Grapheme was defined there and it seemed reasonable at the time. But, generally I think utf stuff should go into std.utf.But displayWidth isn't really directly related to UTF (i.e., the encoding of Unicode code points). It seems to me to be more to do with processing Unicode in general, though, granted, the optimizations I implemented are kinda in a grey zone between dealing with Unicode proper (i.e., with code points) vs. working with code units. T -- Klein bottle for rent ... inquire within. -- Stephen Mulraney
Jan 15 2018
On Monday, January 15, 2018 10:37:14 H. S. Teoh via Digitalmars-d wrote:On Mon, Jan 15, 2018 at 06:20:16PM +0000, Jack Stouffer via Digitalmars-dwrote:Your understanding of the division more or less matches mine, though I'm not sure that the line is entirely clearcut. I would definitely think that std.uni was the more appropriate place for such a function. - Jonathan M DavisOn Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote:Are you sure? I thought std.utf was specifically dealing with UTF-* encodings, i.e., code units and conversions to/from code points, and std.uni was supposed to be for implementing Unicode algorithms and Unicode compliance in general, i.e., stuff that works at the code point level.On Mon, Jan 15, 2018 at 02:14:56PM +0000, Simen Kjærås via Digitalmars-d wrote:The way I understand it is that std.uni is (supposed to be) for functions on individual unicode units (be they code units/points or graphemes) and std.utf is for functions which handle operating on unicode strings.On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:[...] Why std.utf rather than std.uni, though?std.utf.displayWidth+1
Jan 15 2018
On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:On Saturday, 13 January 2018 at 17:26:52 UTC, H. S. Teoh wrote:std.utf.bikeshed Never heard that phrase before. Nice one :)...Thanks for taking the time to do this.And now the obligatory bikeshed: what should the Phobos equivalent of wcswidth be called?std.utf.displayWidth
Jan 15 2018
columnWidth as it only makes sense for column-oriented text display.
Jan 15 2018
On Monday, 15 January 2018 at 15:08:24 UTC, Kagamin wrote:columnWidth as it only makes sense for column-oriented text display.I think displayWidth is better, because "width" is directly linked to hozizontal direction (else it would be called hight), and setting text in colums would still take additional steps to be set correct. Also "display" indicates that it has nothing to do with the string length, which is good to avoid confusion.
Jan 15 2018