digitalmars.D - Reducing the cost of autodecoding
- Andrei Alexandrescu (43/43) Oct 12 2016 So we've had a good run with making popFront smaller. In ASCII
- Stefan Koch (5/48) Oct 12 2016 This will only work really efficiently with some state on the
- Andrei Alexandrescu (4/5) Oct 12 2016 Remember the ASCII part is the bothersome one. There's only two
- safety0ff (12/15) Oct 12 2016 My measurements:
- Stefan Koch (2/18) Oct 12 2016 where did you apply the branch hints ?
- safety0ff (2/3) Oct 12 2016 Code: http://pastebin.com/CFCpUftW
- Andrei Alexandrescu (11/26) Oct 12 2016 Wait, so going through the bytes made almost no difference? Or did you
- Stefan Koch (4/12) Oct 12 2016 I was about to suggest the same.
- Stefan Koch (3/21) Oct 12 2016 We should probably introduce a new module for stuff like this.
- Andrei Alexandrescu (5/7) Oct 12 2016 Yah, shouldn't go in object.d as it's fairly niche. On the other hand
- Stefan Koch (6/14) Oct 12 2016 maybe core.intrinsics ?
- Johan Engelen (31/39) Oct 13 2016 There could be some kind of "expect" theme, or a
- Jacob Carlborg (9/13) Oct 13 2016 I think it should be a new module. I think core.intrinsics, as Stefan
- safety0ff (4/6) Oct 12 2016 It made little difference: LDC compiled into AVX2 vectorized
- safety0ff (6/8) Oct 12 2016 Measurements without -mcpu=native:
- Andrei Alexandrescu (5/14) Oct 12 2016 So we should be able to reduce overhead by means of proper code
- Stefan Koch (7/23) Oct 12 2016 AVX for ascii ?
- Andrei Alexandrescu (3/25) Oct 12 2016 Oh ok, so it's that checksum in particular that got optimized. Bad
- safety0ff (4/6) Oct 13 2016 Also, I suspect a benchmark with a larger loop body might not
- Stefan Koch (40/43) Oct 14 2016 I disagree in longer loops code compactness is as important as in
- Stefan Koch (41/85) Oct 14 2016 Disregard all that code.
- Patrick Schluter (38/128) Oct 15 2016 Looks very verbose to me. I had found in the BSD codebase a very
- Patrick Schluter (5/5) Oct 15 2016 Oooops, I should not post after drinking 2 glasses of
- Andrei Alexandrescu (3/5) Oct 15 2016 https://ldc.acomirei.ru
- Uplink_Coder (27/27) Oct 15 2016 It can also be written like this producing smaller code.
- Patrick Schluter (4/31) Oct 15 2016 Just a question. Do encoding errors not have to be detected or is
- safety0ff (4/6) Oct 15 2016 AFAIK they have to be detected, otherwise it would be a
- Patrick Schluter (9/9) Oct 15 2016 At least with that lookup table below, you can detect isolated
- Patrick Schluter (5/14) Oct 15 2016 192 and 193 can never appear in a UTF-8 text, they are overlongs
- Uplink_Coder (4/13) Oct 15 2016 If you use 0 instead of 1 the length check will suffice for
- Stefan Koch (43/59) Oct 15 2016 __gshared static immutable ubyte[] charWidthTab = [2, 2, 2, 2, 2,
- Patrick Schluter (11/73) Oct 15 2016 What does "debug if" do ? Because when I replace it with a simple
- Patrick Schluter (46/46) Oct 16 2016 Here my version. It's probably not the shortest (100 ligns of
- Uplink_Coder (5/52) Oct 16 2016 This looks quite slow.
- NoBrainer (5/8) Oct 16 2016 A: What's 2 + 2?
- Patrick Schluter (35/40) Oct 16 2016 I know but it has to be correct before being fast.
- Patrick Schluter (5/46) Oct 16 2016 Of course, this line is wrong, should shift by 18 not 16 :
- Patrick Schluter (47/49) Oct 16 2016 Ok now the looped version which doesn't need the lookup table.
- ketmar (4/4) Oct 16 2016 On Saturday, 15 October 2016 at 18:40:11 UTC, Uplink_Coder wrote:
- safety0ff (7/16) Oct 14 2016 You must have misunderstood:
- Ilya Yaroshenko (3/16) Oct 12 2016 For fair test line should be feet into L1 cache. --Ilya
- Ilya Yaroshenko (3/23) Oct 12 2016 EDITL For fair test the line should be in the L1 cache. --Ilya
- Johan Engelen (9/11) Oct 12 2016 "-O3 -enable-inlining" is synonymous with "-O3" :-)
- Dmitry Olshansky (43/50) Oct 25 2016 I'm sooo late to the party. Still a neat trick with UTF lookup table is
- Steven Schveighoffer (13/27) Oct 25 2016 Half of the entries in your table are 0 (every char that starts with
- Stefam Koch (3/39) Oct 25 2016 The whole point of a LUT to begin with is to reduce instructions.
- Steven Schveighoffer (4/9) Oct 25 2016 Yes, I know. But as I understand it, using 64-bit math on 32-bit systems...
- safety0ff (37/38) Oct 25 2016 Some food for thought:
- safety0ff (4/6) Oct 25 2016 Unfortunately it also changes the API of popFront to throw on
So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions make sure that the impact on instruction cache in larger applications is not high.

Now it's time to look at the end-to-end cost of autodecoding. I wrote this simple microbenchmark:

=====
import std.range;
alias myPopFront = std.range.popFront;
alias myFront = std.range.front;

void main(string[] args)
{
    import std.algorithm, std.array, std.stdio;
    char[] line = "0123456789".dup.repeat(50_000_000).join;
    ulong checksum;
    if (args.length == 1)
    {
        while (line.length)
        {
            version(autodecode)
            {
                checksum += line.myFront;
                line.myPopFront;
            }
            else
            {
                checksum += line[0];
                line = line[1 .. $];
            }
        }
        version(autodecode)
            writeln("autodecode ", checksum);
        else
            writeln("bytes ", checksum);
    }
    else
        writeln("overhead");
}
=====

On my machine, with "ldc2 -release -O3 -enable-inlining" I get something like 0.54s overhead, 0.81s with no autodecoding, and 1.12s with autodecoding.

Your mission, should you choose to accept it, is to define a combination front/popFront that reduces the gap.

Andrei
Oct 12 2016
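As an aside to the post above: std.utf already exposes a fused decode-and-advance, decodeFront, so a third baseline for the same loop could be written against it. A minimal sketch, assuming the std.utf API of that era; checksumDecodeFront is a made-up helper, not something from the thread.

    ulong checksumDecodeFront(char[] line)
    {
        import std.utf : decodeFront;
        ulong checksum;
        while (line.length)
            checksum += line.decodeFront(); // decodes one code point and pops its code units
        return checksum;
    }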
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
[snip]
> Your mission, should you choose to accept it, is to define a combination front/popFront that reduces the gap.

This will only work really efficiently with some state on the stack, if we are to support Unicode.
Oct 12 2016
On 10/12/2016 12:03 PM, Stefan Koch wrote:
> This will only work really efficiently with some state on the stack.

Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei
Oct 12 2016
On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote:
> Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei

My measurements:
ldc -O3 -boundscheck=off -release -mcpu=native -enable-inlining
ldc version 1.0.0

overhead 0.350s
bytes 0.385s
current autodecoding 0.915s (with new LUT popFront)
copy-pasting std.utf decoding functions into current file 0.840s
adding ASCII branch hints (llvm_expect) 0.770s

With the branch hints LDC moves the non-Ascii code outside of the loop and creates a really tight loop body.
Oct 12 2016
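For readers unfamiliar with the hint being discussed, here is roughly what such a branch hint looks like in a front-style ASCII fast path. This is a minimal sketch (not the pastebin code), assuming LDC's ldc.intrinsics.llvm_expect accepts a bool condition; frontWithHint is a hypothetical name.

    version (LDC)
    {
        import ldc.intrinsics : llvm_expect;

        dchar frontWithHint(const(char)[] s)
        {
            // Hint that the ASCII branch is overwhelmingly likely, so the
            // optimizer keeps the non-ASCII path out of the hot loop body.
            if (llvm_expect(s[0] < 0x80, true))
                return s[0];
            import std.utf : decode;
            size_t index = 0;
            return decode(s, index); // cold path: full UTF-8 decode
        }
    }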
On Wednesday, 12 October 2016 at 20:02:16 UTC, safety0ff wrote:
[snip]
> adding ASCII branch hints (llvm_expect) 0.770s

Where did you apply the branch hints?
Oct 12 2016
On Wednesday, 12 October 2016 at 20:07:19 UTC, Stefan Koch wrote:
> Where did you apply the branch hints?

Code: http://pastebin.com/CFCpUftW
Oct 12 2016
On Wednesday, 12 October 2016 at 20:24:54 UTC, safety0ff wrote:
> Code: http://pastebin.com/CFCpUftW

Line 25 doesn't look trusted: reads past the end of an empty string.
Oct 13 2016
On Thursday, 13 October 2016 at 14:51:50 UTC, Kagamin wrote:
> On Wednesday, 12 October 2016 at 20:24:54 UTC, safety0ff wrote:
>> Code: http://pastebin.com/CFCpUftW
> Line 25 doesn't look trusted: reads past the end of an empty string.

Length is checked in the loop that calls this function. In Phobos, length is only checked with an assertion.
Oct 13 2016
On 10/12/2016 04:02 PM, safety0ff wrote:
> My measurements:
> ldc -O3 -boundscheck=off -release -mcpu=native -enable-inlining
> ldc version 1.0.0
>
> overhead 0.350s
> bytes 0.385s

Wait, so going through the bytes made almost no difference? Or did you subtract the overhead already?

> current autodecoding 0.915s (with new LUT popFront)
> copy-pasting std.utf decoding functions into current file 0.840s
> adding ASCII branch hints (llvm_expect) 0.770s
>
> With the branch hints LDC moves the non-Ascii code outside of the loop and creates a really tight loop body.

I think we should define two aliases "likely" and "unlikely" with default implementations:

    bool likely(bool b) { return b; }
    bool unlikely(bool b) { return b; }

They'd go in druntime. Then implementers can hook them into their intrinsics. Works?

Andrei
Oct 12 2016
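A hedged sketch of how those two defaults might be hooked per compiler — the module placement and the LDC mapping are assumptions for illustration, not anything agreed on in the thread:

    version (LDC)
    {
        import ldc.intrinsics : llvm_expect;
        pragma(inline, true) bool likely(bool b)   { return llvm_expect(b, true);  }
        pragma(inline, true) bool unlikely(bool b) { return llvm_expect(b, false); }
    }
    else
    {
        // Default pass-through implementations, as proposed above.
        pragma(inline, true) bool likely(bool b)   { return b; }
        pragma(inline, true) bool unlikely(bool b) { return b; }
    }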
On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei Alexandrescu wrote:
[snip]
> They'd go in druntime. Then implementers can hook them into their intrinsics. Works?

I was about to suggest the same. I can prepare a PR.
Oct 12 2016
On Wednesday, 12 October 2016 at 23:59:15 UTC, Stefan Koch wrote:
> I was about to suggest the same. I can prepare a PR.

We should probably introduce a new module for stuff like this. object.d is already filled with too many unrelated things.
Oct 12 2016
On 10/12/2016 08:11 PM, Stefan Koch wrote:
> We should probably introduce a new module for stuff like this. object.d is already filled with too many unrelated things.

Yah, shouldn't go in object.d as it's fairly niche. On the other hand defining a new module for two functions seems excessive unless we have a good theme. On the third hand we may find an existing module that's topically close. Thoughts? -- Andrei
Oct 12 2016
On Thursday, 13 October 2016 at 01:26:17 UTC, Andrei Alexandrescu wrote:
> Yah, shouldn't go in object.d as it's fairly niche. On the other hand defining a new module for two functions seems excessive unless we have a good theme. On the third hand we may find an existing module that's topically close. Thoughts? -- Andrei

Maybe core.intrinsics? Or code.codelayout? We can control the layout at object-file level. We should be able to expose some of that functionality.
Oct 12 2016
On Thursday, 13 October 2016 at 01:26:17 UTC, Andrei Alexandrescu wrote:
> Yah, shouldn't go in object.d as it's fairly niche. On the other hand defining a new module for two functions seems excessive unless we have a good theme. On the third hand we may find an existing module that's topically close. Thoughts? -- Andrei

There could be some kind of "expect" theme, or a microoptimization theme. Functions that have no observable effects and that provide hints for optimization (possibly compiler-dependent implementations of those functions).

Besides providing the expected value of an expression, other "expect"/microopt functionality is checking explicitly for function pointer values (to inline a likely function), which I wrote about in [1]:

```
/// Calls `fptr(args)`, optimize for the case when fptr points to Likely().
pragma(inline, true)
auto is_likely(alias Likely, Fptr, Args...)(Fptr fptr, Args args)
{
    return (fptr == &Likely) ? Likely(args) : fptr(args);
}

// ...
void function() fptr = get_function_ptr();
fptr.is_likely!likely_function();
```

A similar function can be made for "expecting" a class type for virtual calls [1].

Other microopt thingies that come to mind are:
- cache prefetching
- function attributes for hot/cold functions

cheers,
  Johan

[1] https://johanengelen.github.io/ldc/2016/04/13/PGO-in-LDC-virtual-calls.html
Oct 13 2016
On 2016-10-13 03:26, Andrei Alexandrescu wrote:
> Yah, shouldn't go in object.d as it's fairly niche. On the other hand defining a new module for two functions seems excessive unless we have a good theme. On the third hand we may find an existing module that's topically close. Thoughts? -- Andrei

I think it should be a new module. I think core.intrinsics, as Stefan suggested, sounds like a good idea. I don't think having a module with only two functions is a problem, assuming we expect more of these functions. We already have that case with core.attribute [1], which only has _one_ attribute defined.

[1] https://github.com/dlang/druntime/blob/master/src/core/attribute.d#L54

--
/Jacob Carlborg
Oct 13 2016
On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei Alexandrescu wrote:
> Wait, so going through the bytes made almost no difference? Or did you subtract the overhead already?

It made little difference: LDC compiled it into AVX2 vectorized addition (vpmovzxbq & vpaddq).
Oct 12 2016
On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
> It made little difference: LDC compiled it into AVX2 vectorized addition (vpmovzxbq & vpaddq).

Measurements without -mcpu=native:
overhead 0.336s
bytes 0.610s
without branch hints 0.852s
code pasted 0.766s
Oct 12 2016
On 10/12/2016 08:41 PM, safety0ff wrote:
> Measurements without -mcpu=native:
> overhead 0.336s
> bytes 0.610s
> without branch hints 0.852s
> code pasted 0.766s

So we should be able to reduce overhead by means of proper code arrangement and interplay of inlining and outlining. The prize, however, would be to get the AVX instructions for ASCII going. Is that possible? -- Andrei
Oct 12 2016
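Not AVX, but a portable word-at-a-time (SWAR) check is one way to make the ASCII case coarser than one byte per iteration: scan eight bytes at once for a set high bit and only fall back to per-code-point decoding when a non-ASCII byte appears. A rough sketch, not from the thread; asciiPrefixLength is a hypothetical helper and it ignores alignment concerns.

    size_t asciiPrefixLength(const(char)[] s)
    {
        enum ulong highBits = 0x8080_8080_8080_8080;
        size_t i = 0;
        // Eight bytes at a time: any byte >= 0x80 sets its high bit.
        while (i + 8 <= s.length)
        {
            ulong chunk = *cast(const(ulong)*) (s.ptr + i); // unaligned read, fine on x86
            if (chunk & highBits)
                break;
            i += 8;
        }
        // Finish byte-wise up to the first non-ASCII byte.
        while (i < s.length && s[i] < 0x80)
            ++i;
        return i;
    }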
On Thursday, 13 October 2016 at 01:27:35 UTC, Andrei Alexandrescu wrote:
> So we should be able to reduce overhead by means of proper code arrangement and interplay of inlining and outlining. The prize, however, would be to get the AVX instructions for ASCII going. Is that possible? -- Andrei

AVX for ASCII? What are you referring to? Most text processing is terribly incompatible with SIMD. SSE 4.2 has a few instructions that do help, but as far as I am aware it is not yet widespread.
Oct 12 2016
On 10/12/2016 09:35 PM, Stefan Koch wrote:
> AVX for ASCII? What are you referring to? Most text processing is terribly incompatible with SIMD. SSE 4.2 has a few instructions that do help, but as far as I am aware it is not yet widespread.

Oh ok, so it's that checksum in particular that got optimized. Bad benchmark! Bad! -- Andrei
Oct 12 2016
On Thursday, 13 October 2016 at 01:36:44 UTC, Andrei Alexandrescu wrote:
> Oh ok, so it's that checksum in particular that got optimized. Bad benchmark! Bad! -- Andrei

Also, I suspect a benchmark with a larger loop body might not benefit as significantly from branch hints as this one.
Oct 13 2016
On Thursday, 13 October 2016 at 21:49:22 UTC, safety0ff wrote:
>> Bad benchmark! Bad! -- Andrei
> Also, I suspect a benchmark with a larger loop body might not benefit as significantly from branch hints as this one.

I disagree; in longer loops code compactness is as important as in small ones.

This is about the smallest inline version of decode I could come up with:

    __gshared static immutable ubyte[] charWidthTab = [
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1
    ];

    dchar myFront(ref char[] str) pure nothrow
    {
        dchar c = cast(dchar) str[0];
        if ((c & 128))
        {
            if (c & 64)
                final switch (charWidthTab[c - 192])
                {
                case 2:
                    c |= ((str[1] & 0x80) >> 5);
                    break;
                case 3:
                    c |= ((str[1] & 0x80) >> 4);
                    c |= ((str[2] & 0x80) >> 10);
                    break;
                case 4:
                    c |= ((str[1] & 0x80) >> 3);
                    c |= ((str[2] & 0x80) >> 9);
                    c |= ((str[3] & 0x80) >> 15);
                    break;
                case 5, 6, 1:
                    goto Linvalid;
                }
            else
            Linvalid:
                c = dchar.init;
        }
        return c;
    }
Oct 14 2016
On Friday, 14 October 2016 at 20:47:39 UTC, Stefan Koch wrote:
[snip]

Disregard all that code. It is horribly wrong!

This is more correct (though for some reason it does not pass the unittests):

    dchar myFront(ref char[] str) pure
    {
        dchar c = cast(dchar) str.ptr[0];
        if (c & 128)
        {
            if (c & 64)
            {
                auto l = charWidthTab.ptr[c - 192];
                if (str.length < l)
                    goto Linvalid;

                final switch (l)
                {
                case 2:
                    c = ((c & ~(64 | 128)) << 6);
                    c |= (str.ptr[1] & ~0x80);
                    break;
                case 3:
                    c = ((c & ~(32 | 64 | 128)) << 12);
                    c |= ((str.ptr[1] & ~0x80) << 6);
                    c |= ((str.ptr[2] & ~0x80));
                    break;
                case 4:
                    c = ((c & ~(16 | 32 | 64 | 128)) << 18);
                    c |= ((str.ptr[1] & ~0x80) << 12);
                    c |= ((str.ptr[2] & ~0x80) << 6);
                    c |= ((str.ptr[3] & ~0x80));
                    break;
                case 5, 6, 1:
                    goto Linvalid;
                }
            }
            else
            Linvalid:
                throw new Exception("yadayada");
        }
        return c;
    }
Oct 14 2016
On Saturday, 15 October 2016 at 00:50:08 UTC, Stefan Koch wrote:
> Disregard all that code. It is horribly wrong!
> This is more correct (though for some reason it does not pass the unittests):
[snip]

Looks very verbose to me. I had found in the BSD codebase a very clever utf-8 conversion function in C; maybe it can be used here. Sorry if I do not participate in the testing, as I don't have a proper compilation environment here at home.

Here is the routine I use at work (it's in C), posted here for inspiration.

    DEFINE_INLINE uint_t xctomb(char *r, wchar_t wc)
    {
        uint_t u8l = utf8len(wc);
        switch(u8l)
        {
        /* Note: code falls through cases! */
        case 4: r[3] = 0x80 | (wc & 0x3f); wc >>= 6; wc |= 0x10000;
        case 3: r[2] = 0x80 | (wc & 0x3f); wc >>= 6; wc |= 0x800;
        case 2: r[1] = 0x80 | (wc & 0x3f); wc >>= 6; wc |= 0xc0;
        case 1: r[0] = wc;
        }
        return u8l;
    }

utf8len being

    DEFINE_INLINE uint_t utf8len(wchar_t wc)
    {
        if(wc < 0x80)
            return 1;
        else if(wc < 0x800)
            return 2;
        else if(wc < 0x10000)
            return 3;
        else
            return 4;
    }

The code generated on SPARC with gcc 3.4.6 was really good. On x86_64 with gcc 5.1 it was also not bad. I have not tried a lot of alternatives, as UTF-8 coding is not a bottleneck on our project. There's also no check for length 5 and 6 as they are not possible on our system, but here it has to be added. (The DEFINE_INLINE macro is either extern inline or inline depending on some macro magic that is not of importance here.)
Oct 15 2016
Oooops, I should not post after drinking 2 glasses of Châteauneuf-du-Pape. That function does exactly the opposite of what popFront does: it converts from dchar to multibyte, not from multibyte to dchar as you did. Sorry for the inconvenience.
Oct 15 2016
On 10/15/2016 12:42 PM, Patrick Schluter wrote:
> Sorry if I do not participate in the testing, as I don't have a proper compilation environment here at home.

https://ldc.acomirei.ru

Andrei
Oct 15 2016
It can also be written like this, producing smaller code but at the cost of slower decoding.

    dchar myFront(ref char[] str) pure
    {
        dchar c = cast(dchar) str.ptr[0];
        if (c & 128)
        {
            if (c & 64)
            {
                int idx = 0;
                int l = charWidthTab.ptr[c - 192];
                if (str.length < l)
                    goto Linvalid;

                c = 0;
                l--;
                while(l)
                {
                    l--;
                    c |= str.ptr[idx++];
                    c <<= 6;
                }
                c |= str.ptr[idx];
            }
            else
            Linvalid:
                throw new Exception("yadayada");
        }
        return c;
    }
Oct 15 2016
On Saturday, 15 October 2016 at 18:40:11 UTC, Uplink_Coder wrote:
> It can also be written like this, producing smaller code but at the cost of slower decoding.
[snip]

Just a question: do encoding errors not have to be detected, or is validity of the string guaranteed? Wrong continuation bytes or overlong encodings are not detected by this routine.
Oct 15 2016
On Saturday, 15 October 2016 at 19:00:12 UTC, Patrick Schluter wrote:
> Just a question: do encoding errors not have to be detected, or is validity of the string guaranteed?

AFAIK they have to be detected, otherwise it would be a regression.
Oct 15 2016
At least with the lookup table below, you can detect isolated continuation bytes (192 and 193) and invalid codes (above 244).

    __gshared static immutable ubyte[] charWidthTab = [
        1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
    ];

Lengths 5 and 6 need not be tested specifically for your goto.
Oct 15 2016
On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter wrote:
> At least with the lookup table below, you can detect isolated continuation bytes (192 and 193) and invalid codes (above 244).
[snip]

192 and 193 can never appear in a UTF-8 text; they are overlongs, not continuation bytes. Continuation bytes are the characters between 128 and 191, and those are not allowed, so they should be checked.
Oct 15 2016
On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter wrote:
> At least with the lookup table below, you can detect isolated continuation bytes (192 and 193) and invalid codes (above 244).
[snip]
> Lengths 5 and 6 need not be tested specifically for your goto.

If you use 0 instead of 1, the length check will suffice for throwing on invalid.
Oct 15 2016
On Saturday, 15 October 2016 at 19:42:03 UTC, Uplink_Coder wrote:
> If you use 0 instead of 1, the length check will suffice for throwing on invalid.

    __gshared static immutable ubyte[] charWidthTab = [
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0
    ];

    dchar myFront2(ref char[] str) pure
    {
        auto c1 = str.ptr[0];
        if (c1 & 128)
        {
            if (c1 & 64)
            {
                int idx = 0;
                int l = charWidthTab.ptr[c1 - 192];
                if (str.length < l)
                    goto Linvalid;

                dchar c = 0;
                l--;
                while (l)
                {
                    l--;
                    immutable cc = str.ptr[idx++];
                    debug if (cc & 64)
                        goto Linvalid;
                    c |= cc;
                    c <<= 6;
                }
                c |= str.ptr[idx];
                return c;
            }
        Linvalid:
            throw new Exception("yadayada");
        }
        else
        {
            return c1;
        }
    }

This code proves to be the fastest so far, on UTF and non-UTF text. It's also fairly small.
Oct 15 2016
On Saturday, 15 October 2016 at 21:21:22 UTC, Stefan Koch wrote:
[snip]
>                 immutable cc = str.ptr[idx++];
>                 debug if (cc & 64)
>                     goto Linvalid;

What does "debug if" do? Because when I replace it with a simple "if", the code generated by "LDC 1.1.0-beta2 -release -O3 -boundscheck=off" is 15 lines shorter.

That was my first question, but I think I see the issue now. By removing the debug, the condition is compiled in and the compiler short-circuits the whole loop and goes to Linvalid. The error is that cc is loaded the first time with the same value as c1, because idx = 0 and it is post-incremented. It should be pre-incremented, and c would have to be initialised with the data bits of c1, not 0.
Oct 15 2016
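For context on the question above (this is standard D semantics, not something settled in the thread): `debug <statement>` is compiled in only when the compiler's debug switch is given (e.g. -debug with DMD), so in a plain release build the continuation-byte check simply disappears. A tiny illustration; the function name is made up:

    void checkContinuation(char cc)
    {
        // Present only when building with -debug; compiled out otherwise.
        debug if (cc & 0x40)
            throw new Exception("invalid continuation byte");
    }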
Here is my version. It's probably not the shortest (100 lines of assembly with LDC) but it is correct and has the following properties:
- Performance proportional to the encoding length
- Detects invalid byte sequences
- Detects overlong encodings
- Detects invalid code points

I put the exception in to be comparable to other routines, but Unicode specifies that it is preferable not to abort on encoding errors (to avoid denial of service attacks).

    dchar myFront2(ref char[] str)
    {
        dchar c0 = str.ptr[0];
        if(c0 < 0x80)
        {
            return c0;
        }
        else if(str.length > 1)
        {
            dchar c1 = str.ptr[1];
            if(c0 < 0xE0 && (c1 & 0xC0) == 0x80)
            {
                c1 = ((c0 & 0x1F) << 6)|(c1 & 0x3F);
                if(c1 < 0x80) goto Linvalid;
                return c1;
            }
            else if(str.length > 2)
            {
                dchar c2 = str.ptr[2];
                if(c0 < 0xF0 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 0x80)
                {
                    c2 = ((c0 & 0x0F) << 12)|((c1 & 0x3F) << 6)|(c2 & 0x3F);
                    if(c2 < 0x800) goto Linvalid;
                    return c2;
                }
                else if(str.length > 3)
                {
                    dchar c3 = str.ptr[3];
                    if(c0 < 0xF5 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 0x80 && (c3 & 0xC0) == 0x80)
                    {
                        c3 = ((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 & 0x3F) << 6)|(c3 & 0x3F);
                        if(c3 < 0x10000 || c3 > 0x10ffff) goto Linvalid;
                        return c3;
                    }
                }
            }
        }
    Linvalid:
        throw new Exception("yadayada");
        //assert(myFront2(['\xC2','\xA2'])==0xA3);
    }
Oct 16 2016
On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter wrote:
> Here is my version. It's probably not the shortest (100 lines of assembly with LDC) but it is correct and has the following properties:
[snip]

This looks quite slow. We already have a correct version in utf.decodeImpl. The goal here was to find a small and fast alternative.
Oct 16 2016
On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
> This looks quite slow. We already have a correct version in utf.decodeImpl. The goal here was to find a small and fast alternative.

A: What's 2 + 2?
B: 3
A: That's wrong.
B: But incredibly fast.
Oct 16 2016
On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
> This looks quite slow. We already have a correct version in utf.decodeImpl. The goal here was to find a small and fast alternative.

I know, but it has to be correct before being fast. The code is simple and the checks can easily be removed. Here is the version without the overlong, invalid sequence and code point checks.

    dchar myFront3(ref char[] str)
    {
        dchar c0 = str.ptr[0];
        if(c0 < 0x80)
        {
            return c0;
        }
        else if(str.length > 1)
        {
            dchar c1 = str.ptr[1];
            if(c0 < 0xE0)
            {
                return ((c0 & 0x1F) << 6)|(c1 & 0x3F);
            }
            else if(str.length > 2)
            {
                dchar c2 = str.ptr[2];
                if(c0 < 0xF0)
                {
                    return ((c0 & 0x0F) << 12)|((c1 & 0x3F) << 6)|(c2 & 0x3F);
                }
                else if(str.length > 3)
                {
                    dchar c3 = str.ptr[3];
                    if(c0 < 0xF5)
                    {
                        return ((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 & 0x3F) << 6)|(c3 & 0x3F);
                    }
                }
            }
        }
    Linvalid:
        throw new Exception("yadayada");
    }

Next step will be to loop for lengths 2, 3, 4, with or without your table.
Oct 16 2016
On Sunday, 16 October 2016 at 10:05:37 UTC, Patrick Schluter wrote:
[snip]
>                     return ((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 & 0x3F) << 6)|(c3 & 0x3F);

Of course, this line is wrong; it should shift by 18, not 16:

    return ((c0 & 0x07) << 18)|((c1 & 0x3F) << 12)|((c2 & 0x3F) << 6)|(c3 & 0x3F);
Oct 16 2016
On Sunday, 16 October 2016 at 10:05:37 UTC, Patrick Schluter wrote:
> Next step will be to loop for lengths 2, 3, 4, with or without your table.

Ok, now the looped version which doesn't need the lookup table. This one assembles into 72 lines of assembly (25 lines only for the exception code).

    dchar myFront5(ref char[] str)
    {
        byte c0 = str[0];
        if(c0 >= 0)
        {
            return c0;
        }
        else
        {
            if(((c0!=-64) & (c0!=-63)) && c0 <= -12)
            {
                auto l = str.length;
                int idx = 1;
                dchar mask0 = 0;
                dchar c1 = c0 & 0x3f;
                int lim = -64;
                while(l--)
                {
                    if(c0 < lim)
                    {
                        if(c1 > 0x10FFFF)
                            break;
                        return c1;
                    }
                    lim >>= 1;
                    c1 = ((c1 << 6) | (str.ptr[idx++] & 0x3F)) & ~mask0;
                    mask0 = 1 << (idx*6 + 7-idx);
                }
            }
        }
    Linvalid:
        throw new Exception("yadayada");
    }

Explanations of the tricks used:

1. The first character is read as a signed byte with sign extension; this allows comparing it to the lim variable, which is used to do mainly what the lookup table was doing.
2. There are 2 loop escapes: when l, the variable holding the length of the string, is 0, which means that the string is too short, or when lim is bigger than the 1st character (see 1.).
3. Using the same 0x3f mask to extract the data bits from all bytes in the loop has the drawback of adding spurious bits coming from the 1st byte for 3 and 4 long sequences. The mask0 variable is used to "shoot" that bit in the next loop iteration.
4. 2 simple tests allow to eliminate a lot of invalid cases (overlongs on 3 and 4 sequences are not tested yet, though; I think there's a simple way of doing it, but I'm too tired now to explore it).
Oct 16 2016
On Saturday, 15 October 2016 at 18:40:11 UTC, Uplink_Coder wrote: have you seen this[1] link? it is almost what you're doing, but with some more nice properties. [1] http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
Oct 16 2016
On Friday, 14 October 2016 at 20:47:39 UTC, Stefan Koch wrote:
> I disagree; in longer loops code compactness is as important as in small ones.

You must have misunderstood: my thought was simply that with a larger loop body, LLVM might not make such dramatic rearrangement of the basic blocks. Take your straw man elsewhere :-/

> This is more correct (though for some reason it does not pass the unittests):

You're only validating the first byte; the current code validates all of them.
Oct 14 2016
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
[snip]
>     char[] line = "0123456789".dup.repeat(50_000_000).join;

For fair test line should be feet into L1 cache. --Ilya
Oct 12 2016
On Wednesday, 12 October 2016 at 16:07:39 UTC, Ilya Yaroshenko wrote:
> For fair test line should be feet into L1 cache. --Ilya

EDIT: For a fair test the line should be in the L1 cache. --Ilya
Oct 12 2016
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
> On my machine, with "ldc2 -release -O3 -enable-inlining"

"-O3 -enable-inlining" is synonymous with "-O3" :-)

With LDC 1.1.0-beta3, you can try with "-enable-cross-module-inlining". It won't cross-module inline everything (notably: nested functions), but it's a start. If you see large performance differences, let me know.

cheers,
  Johan
Oct 12 2016
On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
> So we've had a good run with making popFront smaller. In ASCII microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. $]. Smaller functions make sure that the impact on instruction cache in larger applications is not high.
[snip]
> Your mission, should you choose to accept it, is to define a combination front/popFront that reduces the gap.

I'm sooo late to the party. Still, a neat trick with a UTF lookup table is to cram it into a single word with bit packing. Since we only ever consider the top 5 bits of the leading byte, of which the top one can be safely discarded with a quick < 0x80 check, this leaves us at 4 bits -> a 16-entry table, which fits nicely in a 64-bit word with 4 bits per entry.

    // construct in-word lookup table
    ulong bitTable()
    {
        import std.utf;
        ulong table;
        for(size_t i = 128; i < 256; i += 8)
        {
            char[1] c = [ cast(char)i ];
            try
            {
                ulong s = std.utf.stride(c[0..1], 0);
                table |= s << (4*((i-128)>>3));
            }
            catch(Exception)
            {
                // keep zeros
            }
        }
        return table;
    }

    uint myStride()(char c)
    {
        enum table = bitTable;
        return (table >> (4*((c & 0x7f) >> 3))) & 0xF;
    }

    void myPopFront()(ref char[] s)
    {
        char c = s[0];
        if(c < 0x80)
            s = s[1..$];
        else
        {
            uint step = myStride(c);
            if(!step)
                throw new Exception("blah");
            s = s[step..$];
        }
    }

Accordingly, myFront could be modified to use the same technique. Can't check perf with LDC at the moment sadly.
Oct 25 2016
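A quick hedged usage check for the sketch above (it assumes the bitTable/myStride/myPopFront definitions as given; the string and asserts are illustrative only):

    unittest
    {
        char[] s = "a€b".dup;   // 'a' (1 byte) + U+20AC '€' (3 bytes) + 'b'
        s.myPopFront();         // ASCII fast path: drops 1 byte
        assert(s.length == 4);
        s.myPopFront();         // multi-byte path: lead byte 0xE2 has stride 3
        assert(s == "b");
    }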
On 10/25/16 12:57 PM, Dmitry Olshansky wrote:
> Still, a neat trick with a UTF lookup table is to cram it into a single word with bit packing. Since we only ever consider the top 5 bits of the leading byte, of which the top one can be safely discarded with a quick < 0x80 check, this leaves us at 4 bits -> a 16-entry table, which fits nicely in a 64-bit word with 4 bits per entry.
[snip]

Half of the entries in your table are 0 (every char that starts with 10). In addition, you only need 2 bits to store the length (max byte sequence is 4, subtract 1 from the sequence length to get 2 bits). So I think you can fit this into a 16-bit word (8 entries of 2 bits each). You just need to pre-check for the invalid sequences (you no longer have to check for a 0 result):

    if(c < 0x80)
        s = s[1 .. $];
    else if(c < 0xc0 || c > 0xf7)
        throw new Exception("blah");
    else
        s = s[1 + myStride(c) .. $];

Should behave better on a 32-bit system. You could also store 3 bits to avoid the extra addition.

-Steve
Oct 25 2016
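A hedged sketch of that 2-bit packing (the constant is computed here from the UTF-8 lead-byte ranges, and the names extraBytes/popFrontPacked are made up; like the original, it does not verify that enough bytes actually remain in the slice):

    // stride-1 for lead-byte groups 0xC0..0xF7, 2 bits per group of 8 bytes:
    // 1,1,1,1 (0xC0-0xDF), 2,2 (0xE0-0xEF), 3 (0xF0-0xF7), unused (0xF8-0xFF)
    enum ushort packedStrides = 0x3A55;

    uint extraBytes(char c) pure nothrow @nogc @safe
    {
        // caller guarantees 0xC0 <= c <= 0xF7
        return (packedStrides >> (((c - 0xC0) >> 3) * 2)) & 0x3;
    }

    void popFrontPacked(ref char[] s)
    {
        immutable char c = s[0];
        if (c < 0x80)
            s = s[1 .. $];
        else if (c < 0xC0 || c > 0xF7)
            throw new Exception("invalid UTF-8 lead byte");
        else
            s = s[1 + extraBytes(c) .. $];
    }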
On Tuesday, 25 October 2016 at 19:16:07 UTC, Steven Schveighoffer wrote:
[snip]
> Should behave better on a 32-bit system. You could also store 3 bits to avoid the extra addition.

The whole point of a LUT to begin with is to reduce instructions.
Oct 25 2016
On 10/25/16 3:32 PM, Stefam Koch wrote:
>> Should behave better on a 32-bit system. You could also store 3 bits to avoid the extra addition.
>
> The whole point of a LUT to begin with is to reduce instructions.

Yes, I know. But as I understand it, using 64-bit math on 32-bit systems is expensive also. Maybe not in this case...

-Steve
Oct 25 2016
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
> Now it's time to look at the end-to-end cost of autodecoding.

Some food for thought:
- front necessarily needs to compute the number of bytes to advance.
- We can't change the API to share data between front and popFront; however, we can create a situation where a pure function gets duplicate calls removed by the compiler.

Since we require that the ASCII test gets inlined into the caller of front/popFront to improve ASCII performance, we have a situation similar to this:

    import std.typecons : Tuple;

    alias Result = Tuple!(dchar, "codepoint", int, "advance");

    auto decode(const char[] str) pure
    {
        pragma(inline, true);
        if (str[0] < 0x80)
            return Result(str[0], 1);
        else
            return decodeNonAscii(str);
    }

    dchar front(const char[] str) pure
    {
        pragma(inline, true);
        return str.decode.codepoint;
    }

    void popFront(ref const(char)[] str)
    {
        pragma(inline, true);
        str = str[str.decode.advance .. $];
    }

When used in front/popFront pairs, the duplicated decode calls get merged and we don't do any duplicate work (unlike the current situation).

Unfortunately, it's not possible to achieve the best code generation due to missed optimizations by the compilers (I haven't tried GDC; see the LDC subforum). Once this is possible, only decodeNonAscii remains to be optimized, perhaps using the DFA method linked by ketmar.

P.S. I am aware that this pessimises popFront for code which only counts codepoints without inspecting them.
Oct 25 2016
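A small usage sketch of the idea (assuming the front/popFront/decode shown above, plus some implementation of decodeNonAscii returning a Result): in the usual iteration pattern the two decode calls sit right next to each other, and because decode is pure the optimizer is allowed to fold them into one.

    ulong checksum(const(char)[] s)
    {
        ulong sum;
        while (s.length)
        {
            sum += s.front;   // decode call #1
            s.popFront();     // decode call #2, expected to be merged with #1
        }
        return sum;
    }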
On Tuesday, 25 October 2016 at 21:46:30 UTC, safety0ff wrote:
> P.S. I am aware that this pessimises popFront for code which only counts codepoints without inspecting them.

Unfortunately it also changes the API of popFront to throw on invalid characters. So the example would need to be reworked.
Oct 25 2016