digitalmars.D - Can you shrink it further?
- Andrei Alexandrescu (4/4) Oct 09 2016 Had a bit of a good time with popFront for UTF, got to this:
- Stefan Koch (3/8) Oct 09 2016 With -Oz both functions look almost the same on ldc and gdc
- Andrei Alexandrescu (11/18) Oct 09 2016 The big ldc link I posted:
- safety0ff (14/15) Oct 09 2016 Less clever seemed to work for me:
- Andrei Alexandrescu (3/17) Oct 09 2016 Oh, forgot to mention: the initial/short path should only check for
- Stefan Koch (19/21) Oct 10 2016 void popFront1(ref char[] s) @trusted pure nothrow {
- Stefan Koch (18/23) Oct 10 2016 Since in this case stability of min is concern, you can shave of
- Stefan Koch (2/4) Oct 10 2016 In this case the stability of min is +NO+ concern.
- Stefan Koch (36/36) Oct 10 2016 This version has 24 instructions but these have a smaller
- Andrei Alexandrescu (31/67) Oct 10 2016 That looks good. I'm just worried about the jump forward - ideally the
- Stefan Koch (34/43) Oct 10 2016 If you want to skip a byte it's easy to do as well.
- Lurker (13/59) Oct 10 2016 Pardon me asking, but why all these gotos instead of else ifs:
- Stefan Koch (5/17) Oct 10 2016 No it does not.
- Andrei Alexandrescu (3/47) Oct 10 2016 Affirmative. That's identical to the code in "[ ... ]" :o). Generated
- Stefan Koch (9/66) Oct 10 2016 It was not identical.
- Matthias Bentrup (10/62) Oct 11 2016 A branch-free version:
- Stefan Koch (6/15) Oct 11 2016 You still need to special case c < 128
- Stefan Koch (4/22) Oct 11 2016 Also the code produces conditional set instructions which have a
- Temtaime (8/32) Oct 11 2016 void popFront1(ref char[] s) @trusted pure nothrow
- Stefan Koch (3/10) Oct 11 2016 Yours runs with 790 us best time.
- Stefan Koch (5/17) Oct 11 2016 CORRECTION this is not bsr's fault.
- Temtaime (9/28) Oct 11 2016 Sorry this was also a type in the code.
- Stefan Koch (2/10) Oct 11 2016 162 us
- Ethan Watson (20/34) Oct 11 2016 The branching, it hurts my eyes!
- Andrei Alexandrescu (42/49) Oct 11 2016 Thanks. This does a lot of work on the frequent path c < 0x80:
- Andrei Alexandrescu (15/17) Oct 11 2016 What inputs did you test it on?
- Stefan Koch (10/28) Oct 11 2016 https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/bln...
- Stefan Koch (21/24) Oct 11 2016 On my arguably a bit dated laptop:
- Andrei Alexandrescu (16/24) Oct 11 2016 Interesting. 0x80 should be special-cased and you forgot to check the
- Stefan Koch (4/26) Oct 11 2016 It's much slower.
- Matthias Bentrup (25/52) Oct 11 2016 This is the result I'd like to get, but I can't find a way to
- Stefan Koch (4/27) Oct 11 2016 This takes 180us.
- Andrei Alexandrescu (3/24) Oct 11 2016 Did you take a look at the codegen on http://ldc.acomirei.ru? It's huge....
- Matthias Bentrup (38/64) Oct 12 2016 Here are three branch-less variants that use the sign instead of
- Stefan Koch (4/43) Oct 12 2016 All three are slower than baseline, for my test-case.
- Matthias Bentrup (2/7) Oct 12 2016 The blns.txt file mentioned upthread.
- Stefan Koch (41/50) Oct 12 2016 And what were your timings ?
- Stefan Koch (5/5) Oct 12 2016 I just confirmed that branching version is faster then
- Andrei Alexandrescu (22/26) Oct 12 2016 Nice. I like that the table is NOT looked up on the ASCII path, so it
- Stefan Koch (5/36) Oct 12 2016 btw.
- Stefan Koch (11/18) Oct 12 2016 When I write code so i can keep the
- Andrei Alexandrescu (4/20) Oct 12 2016 Thanks! I'd say make sure there is exactly 0% loss on performance
- Stefan Koch (8/12) Oct 12 2016 I measured again.
- Andrei Alexandrescu (2/12) Oct 12 2016 No need. 1% for dmd is negligible. 25% would raise an eyebrow. -- Andrei
- Stefan Koch (4/6) Oct 12 2016 Alright then
- Andrei Alexandrescu (4/6) Oct 12 2016 I'd say: (a) test for speed of ASCII-only text; (b) make it small.
- Andrei Alexandrescu (3/13) Oct 12 2016 This is really small although it does two jumps on the frequent path.
- Andrei Alexandrescu (5/36) Oct 11 2016 Looked at this, still seems to generate a jump forward with ldc. Also,
- David Nadlinger (5/6) Oct 11 2016 ldc.intrinsics.llvm_expect might help to influence basic block
- Stefan Koch (7/13) Oct 11 2016 I forgot that the fall-trough did no longer end in Lend;
- Andrei Alexandrescu (3/5) Oct 11 2016 http://stoke.stanford.edu you mean? That would be cool. Keep us posted!
- Stefan Koch (5/10) Oct 11 2016 Yep I mean that one.
- safety0ff (29/29) Oct 12 2016 My current favorites:
- safety0ff (2/3) Oct 12 2016 Didn't see the LUT implementation, nvm!
- Andrei Alexandrescu (3/6) Oct 12 2016 Yah, that's pretty clever. Better yet, I suspect we can reuse the
- Stefan Koch (7/14) Oct 13 2016 The first results from stoke are in.
- Stefan Koch (54/70) Oct 13 2016 Also I doubt that it is correct :(
Had a bit of a good time with popFront for UTF, got to this: https://github.com/dlang/phobos/pull/4848. I suspect there are cleverer things that can be done. Can anyone make it tighter? Take a look at the disassembly referred to in the pull request. Thanks! -- Andrei
Oct 09 2016
On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu wrote:Had a bit of a good time with popFront for UTF, got to this: https://github.com/dlang/phobos/pull/4848. I suspect there are cleverer things that can be done. Can anyone make it tighter? Take a look at the disassembly referred to in the pull request. Thanks! -- AndreiWith -Oz both functions look almost the same on ldc and gdc
Oct 09 2016
On 10/09/2016 07:05 PM, Stefan Koch wrote:On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu wrote:The big ldc link I posted: http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3 uses -O3, not -Oz. Changing the flag to -Oz seems to cause no change in the generated code. I see different code generated for popFront1 (proposed) and popFront2 (candidate), e.g. popFront1 has 27 asm lines whereas popFront2 has 34. (BTW: is there an easy way to see the size of the generated functions?) Under what conditions did you notice almost identical code? Thanks, AndreiHad a bit of a good time with popFront for UTF, got to this: https://github.com/dlang/phobos/pull/4848. I suspect there are cleverer things that can be done. Can anyone make it tighter? Take a look at the disassembly referred to in the pull request. Thanks! -- AndreiWith -Oz both functions look almost the same on ldc and gdc
Oct 09 2016
On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu wrote:I suspect there are cleverer things that can be done.Less clever seemed to work for me: void popFront1(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80 || c >= 0xFE) { s = s.ptr[1 .. s.length]; } else { import core.bitop; size_t i = 7u - bsr(~c); // N.B. changed uint to size_t import std.algorithm; s = s.ptr[min(i, s.length) .. s.length]; } }
Oct 09 2016
On 10/09/2016 10:55 PM, safety0ff wrote:On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu wrote:Oh, forgot to mention: the initial/short path should only check for ASCII, i.e. c < 0x80. -- AndreiI suspect there are cleverer things that can be done.Less clever seemed to work for me: void popFront1(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80 || c >= 0xFE) { s = s.ptr[1 .. s.length]; } else { import core.bitop; size_t i = 7u - bsr(~c); // N.B. changed uint to size_t import std.algorithm; s = s.ptr[min(i, s.length) .. s.length]; } }
Oct 09 2016
On Monday, 10 October 2016 at 03:55:17 UTC, Andrei Alexandrescu wrote:Oh, forgot to mention: the initial/short path should only check for ASCII, i.e. c < 0x80. -- Andreivoid popFront1(ref char[] s) trusted pure nothrow { immutable c = s[0]; size_t char_length = 1; if (!(c & 0x80)) { goto Lend; } else { import core.bitop; uint i = 7u - bsr(~c | 1u); import std.algorithm; if (i > 6u) goto Lend; char_length = min(i, s.length); } Lend : s = s.ptr[char_length .. s.length]; } This one removes one unnecessary ret. It will also probably be better for the branch-predictor.
Oct 10 2016
On Monday, 10 October 2016 at 15:17:05 UTC, Stefan Koch wrote:On Monday, 10 October 2016 at 03:55:17 UTC, Andrei Alexandrescu wrote:Since in this case stability of min is concern, you can shave of another 2 instructions by writing the comparison by hand void popFront1(ref char[] s) trusted pure nothrow { immutable c = s[0]; size_t char_length = 1; if (!(c & 0x80)) { goto Lend; } else { import core.bitop; uint i = 7u - bsr(~c | 1u); import std.algorithm; if (i > 6u) goto Lend; char_length = i < s.length ? i : s.length; } Lend : s = s.ptr[char_length .. s.length]; }Oh, forgot to mention: the initial/short path should only check for ASCII, i.e. c < 0x80. -- Andrei
Oct 10 2016
On Monday, 10 October 2016 at 15:37:09 UTC, Stefan Koch wrote:Since in this case stability of min is concern, you can shave of another 2 instructions by writing the comparison by handIn this case the stability of min is +NO+ concern.
Oct 10 2016
This version has 24 instructions but these have a smaller encoding then and are generally inexpensive With inline asm and conditional moves it would be possible to reduce this even further to ~20 instructions. void popFront1(ref char[] s) trusted pure nothrow { immutable c = s[0]; size_t char_length = 1; if (c < 127) { goto Lend; } else { if ((c & 0b1100_0000) == 0b1000_0000) { // This is invalid as a first char goto Lerror; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } //These characters are also no longer valid Lerror : assert(0); } Lend : s = s.ptr[char_length .. s.length]; }
Oct 10 2016
That looks good. I'm just worried about the jump forward - ideally the case c < 127 would simply entail a quick return. I tried a fix, but it didn't do what I wanted in ldc. We shouldn't assert(0) if wrong - just skip one byte. Also, are we right to not worry about 5- and 6-byte sequences? The docs keep on threatening with it, and then immediately mention those are not valid. void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; } goto Lend; } } Andrei On 10/10/16 9:39 PM, Stefan Koch wrote:This version has 24 instructions but these have a smaller encoding then and are generally inexpensive With inline asm and conditional moves it would be possible to reduce this even further to ~20 instructions. void popFront1(ref char[] s) trusted pure nothrow { immutable c = s[0]; size_t char_length = 1; if (c < 127) { goto Lend; } else { if ((c & 0b1100_0000) == 0b1000_0000) { // This is invalid as a first char goto Lerror; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } //These characters are also no longer valid Lerror : assert(0); } Lend : s = s.ptr[char_length .. s.length]; }
Oct 10 2016
On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei Alexandrescu wrote:That looks good. I'm just worried about the jump forward - ideally the case c < 127 would simply entail a quick return. I tried a fix, but it didn't do what I wanted in ldc. We shouldn't assert(0) if wrong - just skip one byte. Also, are we right to not worry about 5- and 6-byte sequences? The docs keep on threatening with it, and then immediately mention those are not valid. [ ... ] AndreiIf you want to skip a byte it's easy to do as well. void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if ((c & b01100_0000) == 0b1000_0000) { //just skip one in case this is not the beginning of a code-point char goto Lend; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } } }
Oct 10 2016
On Tuesday, 11 October 2016 at 03:00:45 UTC, Stefan Koch wrote:On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei Alexandrescu wrote:Pardon me asking, but why all these gotos instead of else ifs: if (c < 192) { char_length = 2; } else if (c < 240) { char_length = 3; } else if (...) { } Does it have any effect on generated code (I don't think it should)?That looks good. I'm just worried about the jump forward - ideally the case c < 127 would simply entail a quick return. I tried a fix, but it didn't do what I wanted in ldc. We shouldn't assert(0) if wrong - just skip one byte. Also, are we right to not worry about 5- and 6-byte sequences? The docs keep on threatening with it, and then immediately mention those are not valid. [ ... ] AndreiIf you want to skip a byte it's easy to do as well. void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if ((c & b01100_0000) == 0b1000_0000) { //just skip one in case this is not the beginning of a code-point char goto Lend; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } } }
Oct 10 2016
On Tuesday, 11 October 2016 at 03:18:24 UTC, Lurker wrote:Pardon me asking, but why all these gotos instead of else ifs: if (c < 192) { char_length = 2; } else if (c < 240) { char_length = 3; } else if (...) { } Does it have any effect on generated code (I don't think it should)?No it does not. I wrote the gotos because that is how I am used to thinking about code. If else should work fine here.
Oct 10 2016
On 10/10/16 11:00 PM, Stefan Koch wrote:On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei Alexandrescu wrote:Affirmative. That's identical to the code in "[ ... ]" :o). Generated code still does a jmp forward though. -- AndreiThat looks good. I'm just worried about the jump forward - ideally the case c < 127 would simply entail a quick return. I tried a fix, but it didn't do what I wanted in ldc. We shouldn't assert(0) if wrong - just skip one byte. Also, are we right to not worry about 5- and 6-byte sequences? The docs keep on threatening with it, and then immediately mention those are not valid. [ ... ] AndreiIf you want to skip a byte it's easy to do as well. void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if ((c & b01100_0000) == 0b1000_0000) { //just skip one in case this is not the beginning of a code-point char goto Lend; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } } }
Oct 10 2016
On Tuesday, 11 October 2016 at 03:58:59 UTC, Andrei Alexandrescu wrote:On 10/10/16 11:00 PM, Stefan Koch wrote:It was not identical. ((c & b01100_0000) == 0b1000_0000)) Can be true in all of the 3 following cases. If we do not do a jmp to return here, we cannot guarantee that we will not skip over the next valid char. Thereby corrupting already corrupt strings even more. For best performance we need to leave the gotos in there.On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei Alexandrescu wrote:Affirmative. That's identical to the code in "[ ... ]" :o). Generated code still does a jmp forward though. -- AndreiThat looks good. I'm just worried about the jump forward - ideally the case c < 127 would simply entail a quick return. I tried a fix, but it didn't do what I wanted in ldc. We shouldn't assert(0) if wrong - just skip one byte. Also, are we right to not worry about 5- and 6-byte sequences? The docs keep on threatening with it, and then immediately mention those are not valid. [ ... ] AndreiIf you want to skip a byte it's easy to do as well. void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if ((c & b01100_0000) == 0b1000_0000) { //just skip one in case this is not the beginning of a code-point char goto Lend; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } } }
Oct 10 2016
On Tuesday, 11 October 2016 at 04:05:47 UTC, Stefan Koch wrote:On Tuesday, 11 October 2016 at 03:58:59 UTC, Andrei Alexandrescu wrote:A branch-free version: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[char_length .. s.length]; } Theoretically the char_length could be computed with three sub and addc instructions, but no compiler is smart enough to detect that.On 10/10/16 11:00 PM, Stefan Koch wrote:It was not identical. ((c & b01100_0000) == 0b1000_0000)) Can be true in all of the 3 following cases. If we do not do a jmp to return here, we cannot guarantee that we will not skip over the next valid char. Thereby corrupting already corrupt strings even more. For best performance we need to leave the gotos in there.On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei Alexandrescu wrote:Affirmative. That's identical to the code in "[ ... ]" :o). Generated code still does a jmp forward though. -- Andrei[...]If you want to skip a byte it's easy to do as well. void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if ((c & b01100_0000) == 0b1000_0000) { //just skip one in case this is not the beginning of a code-point char goto Lend; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } } }
Oct 11 2016
On Tuesday, 11 October 2016 at 07:30:26 UTC, Matthias Bentrup wrote:A branch-free version: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[char_length .. s.length]; } Theoretically the char_length could be computed with three sub and addc instructions, but no compiler is smart enough to detect that.You still need to special case c < 128 as well as the follow chars. also smaller c's are more common the bigger ones making the branching version faster on average.
Oct 11 2016
On Tuesday, 11 October 2016 at 08:03:40 UTC, Stefan Koch wrote:On Tuesday, 11 October 2016 at 07:30:26 UTC, Matthias Bentrup wrote:Also the code produces conditional set instructions which have a higher latency. And worse throughput.A branch-free version: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[char_length .. s.length]; } Theoretically the char_length could be computed with three sub and addc instructions, but no compiler is smart enough to detect that.You still need to special case c < 128 as well as the follow chars. also smaller c's are more common the bigger ones making the branching version faster on average.
Oct 11 2016
On Tuesday, 11 October 2016 at 08:17:52 UTC, Stefan Koch wrote:On Tuesday, 11 October 2016 at 08:03:40 UTC, Stefan Koch wrote:void popFront1(ref char[] s) trusted pure nothrow { import core.bitop, std.algorithm; auto v = bsr(~s[0] | 1); s = s[clamp(v, 1, v > 6 ? 1 : $)..$]; } Seems to be less if i'm not wrong.On Tuesday, 11 October 2016 at 07:30:26 UTC, Matthias Bentrup wrote:Also the code produces conditional set instructions which have a higher latency. And worse throughput.A branch-free version: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[char_length .. s.length]; } Theoretically the char_length could be computed with three sub and addc instructions, but no compiler is smart enough to detect that.You still need to special case c < 128 as well as the follow chars. also smaller c's are more common the bigger ones making the branching version faster on average.
Oct 11 2016
On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:void popFront1(ref char[] s) trusted pure nothrow { import core.bitop, std.algorithm; auto v = bsr(~s[0] | 1); s = s[clamp(v, 1, v > 6 ? 1 : $)..$]; } Seems to be less if i'm not wrong.Yours runs with 790 us best time. bsr is a real timetaker :)
Oct 11 2016
On Tuesday, 11 October 2016 at 08:57:46 UTC, Stefan Koch wrote:On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:CORRECTION this is not bsr's fault. It's most likely clamp. I am compiling with dmd and dmd is not as good in optimizing when templates are in the mix.void popFront1(ref char[] s) trusted pure nothrow { import core.bitop, std.algorithm; auto v = bsr(~s[0] | 1); s = s[clamp(v, 1, v > 6 ? 1 : $)..$]; } Seems to be less if i'm not wrong.Yours runs with 790 us best time. bsr is a real timetaker :)
Oct 11 2016
On Tuesday, 11 October 2016 at 09:13:10 UTC, Stefan Koch wrote:On Tuesday, 11 October 2016 at 08:57:46 UTC, Stefan Koch wrote:Sorry this was also a type in the code. void popFront7(ref char[] s) trusted pure nothrow { import core.bitop; auto v = 7 - bsr(~s[0] | 1); s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$]; } Please check this.On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:CORRECTION this is not bsr's fault. It's most likely clamp. I am compiling with dmd and dmd is not as good in optimizing when templates are in the mix.void popFront1(ref char[] s) trusted pure nothrow { import core.bitop, std.algorithm; auto v = bsr(~s[0] | 1); s = s[clamp(v, 1, v > 6 ? 1 : $)..$]; } Seems to be less if i'm not wrong.Yours runs with 790 us best time. bsr is a real timetaker :)
Oct 11 2016
On Tuesday, 11 October 2016 at 09:45:11 UTC, Temtaime wrote:Sorry this was also a type in the code. void popFront7(ref char[] s) trusted pure nothrow { import core.bitop; auto v = 7 - bsr(~s[0] | 1); s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$]; } Please check this.162 us
Oct 11 2016
On Tuesday, 11 October 2016 at 10:01:41 UTC, Stefan Koch wrote:On Tuesday, 11 October 2016 at 09:45:11 UTC, Temtaime wrote:The branching, it hurts my eyes! Something like the following should give correct (assuming I haven't written bad logic) branchless results with architecture-optimised max calls. Note that the minus/plus 1 operation on the third line will ensure with the sign multiplication that values of 7 will map to 1, whereas for all other values it's an extra operation. But the advantage is that you're not sticking three branches in close proximity to each other, so you will never get a branch predictor fail. (Of note, any performance test for these functions should test with data designed to fail the branching code I quoted, keeping in mind that desktop Intel processors have a four-state branch predictor. I've not performance tested it myself, but this will certainly run faster on the AMD Jaguar processors than a version with branching checks.) int v = 7 - bsr( ~s[0] | 1 ); int sign = ( (v - 7) >> 31 ); v = ( v - 1 ) * sign + 1; str = str[ min( v, s.length ) .. $ ];Sorry this was also a type in the code. void popFront7(ref char[] s) trusted pure nothrow { import core.bitop; auto v = 7 - bsr(~s[0] | 1); s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$]; } Please check this.162 us
Oct 11 2016
On 10/11/2016 05:45 AM, Temtaime wrote:void popFront7(ref char[] s) trusted pure nothrow { import core.bitop; auto v = 7 - bsr(~s[0] | 1); s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$]; } Please check this.Thanks. This does a lot of work on the frequent path c < 0x80: pure nothrow trusted void example.popFront7(ref char[]): movq 8(%rdi), %rax movzbl (%rax), %ecx xorq $254, %rcx orq $1, %rcx bsrq %rcx, %rcx notl %ecx addl $8, %ecx cmpl $6, %ecx jg .LBB0_2 testl %ecx, %ecx je .LBB0_2 movslq %ecx, %rdx movq (%rdi), %rcx cmpq %rcx, %rdx cmovaq %rcx, %rdx jmp .LBB0_3 .LBB0_2: movq (%rdi), %rcx movl $1, %edx .LBB0_3: addq %rdx, %rax subq %rdx, %rcx movq %rcx, (%rdi) movq %rax, 8(%rdi) retq So I changed it to: void popFront7(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s.ptr[1 .. s.length]; } else { import core.bitop; uint v = 7 - bsr(~c | (c > 0xfd) << 6u); s = s.ptr[v > s.length ? s.length : v .. s.length]; } } That's about as large as the baseline. Andrei
Oct 11 2016
On 10/11/2016 04:57 AM, Stefan Koch wrote:Yours runs with 790 us best time. bsr is a real timetaker :)What inputs did you test it on? Here's what I think would be a good set of requirements: * The ASCII case should be short and fast: a comparison and a branch, followed by return. This would improve a very common case and address the main issue with autodecoding. * For the multibyte case, the main requirement is the code must be small. This is because it gets inlined all over the place and seldom used. * For the multibyte case, the fewer bytes in the encoding the less work. This is because more frequent multi-byte characters have generally lower codes. Currently front() - the other time spender in autodecoding - issues a function call on the multibyte case. That makes the code of front() itself small, at the cost of more expensive multibyte handling. Andrei
Oct 11 2016
On Tuesday, 11 October 2016 at 14:16:54 UTC, Andrei Alexandrescu wrote:On 10/11/2016 04:57 AM, Stefan Koch wrote:https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txtYours runs with 790 us best time. bsr is a real timetaker :)What inputs did you test it on?Here's what I think would be a good set of requirements: * The ASCII case should be short and fast: a comparison and a branch, followed by return. This would improve a very common case and address the main issue with autodecoding.Already done* For the multibyte case, the main requirement is the code must be small. This is because it gets inlined all over the place and seldom used. * For the multibyte case, the fewer bytes in the encoding the less work. This is because more frequent multi-byte characters have generally lower codes.That is why I had the branches, generally only the first one is takenCurrently front() - the other time spender in autodecoding - issues a function call on the multibyte case. That makes the code of front() itself small, at the cost of more expensive multibyte handling.I think at some point we have to cache the length of the last decoded char, Otherwise we are throwing work away. However that will only work within a RangeWrapper-Struct
Oct 11 2016
On Tuesday, 11 October 2016 at 08:17:52 UTC, Stefan Koch wrote:Also the code produces conditional set instructions which have a higher latency. And worse throughput.On my arguably a bit dated laptop: popFront3 performs 109 us best time and popFront4 performs with 265 us best time Testcode : void main() { import std.datetime : StopWatch; import std.stdio; foreach(_;0 .. 255) { char[] test1 = (import("blns.txt")).dup; StopWatch sw; sw.start; while(test1.length) popFront(test1); sw.stop; writeln("pf1 took ", sw.peek.usecs, "us"); sw.reset(); } } blns.txt is taken from https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt
Oct 11 2016
On 10/11/2016 03:30 AM, Matthias Bentrup wrote:A branch-free version: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[char_length .. s.length]; } Theoretically the char_length could be computed with three sub and addc instructions, but no compiler is smart enough to detect that.Interesting. 0x80 should be special-cased and you forgot to check the bounds. So: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s.ptr[1 .. s.length]; } else { uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[l <= s.length ? l : s.length .. s.length]; } } This generated 27 instructions, i.e. same as the baseline. See: http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3 But it doesn't seem to check for all errors. Andrei
Oct 11 2016
On Tuesday, 11 October 2016 at 14:24:56 UTC, Andrei Alexandrescu wrote:On 10/11/2016 03:30 AM, Matthias Bentrup wrote:It's much slower. Because of the flag-storing instructions.A branch-free version: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[char_length .. s.length]; }void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s.ptr[1 .. s.length]; } else { uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[l <= s.length ? l : s.length .. s.length]; } } This generated 27 instructions, i.e. same as the baseline. See: http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3 But it doesn't seem to check for all errors. Andrei
Oct 11 2016
On Tuesday, 11 October 2016 at 14:24:56 UTC, Andrei Alexandrescu wrote:On 10/11/2016 03:30 AM, Matthias Bentrup wrote:This is the result I'd like to get, but I can't find a way to write it without inline assembly :( void popFrontAsmIntel(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s[1 .. $]; } else { uint l = void; asm pure nothrow nogc { mov EAX, 1; mov BL, 0xf8-1; sub BL, c; cmp BL, 0xf8-0xc0; adc EAX, 0; cmp BL, 0xf8-0xe0; adc EAX, 0; cmp BL, 0xf8-0xf0; adc EAX, 0; mov l, EAX; } s = s[l <= $ ? l : $ .. $]; } }A branch-free version: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[char_length .. s.length]; } Theoretically the char_length could be computed with three sub and addc instructions, but no compiler is smart enough to detect that.Interesting. 0x80 should be special-cased and you forgot to check the bounds. So: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s.ptr[1 .. s.length]; } else { uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[l <= s.length ? l : s.length .. s.length]; } } This generated 27 instructions, i.e. same as the baseline. See: http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3 But it doesn't seem to check for all errors. Andrei
Oct 11 2016
On Tuesday, 11 October 2016 at 14:49:28 UTC, Matthias Bentrup wrote:This is the result I'd like to get, but I can't find a way to write it without inline assembly :( void popFrontAsmIntel(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s[1 .. $]; } else { uint l = void; asm pure nothrow nogc { mov EAX, 1; mov BL, 0xf8-1; sub BL, c; cmp BL, 0xf8-0xc0; adc EAX, 0; cmp BL, 0xf8-0xe0; adc EAX, 0; cmp BL, 0xf8-0xf0; adc EAX, 0; mov l, EAX; } s = s[l <= $ ? l : $ .. $]; } }This takes 180us. Baseline takes 124us.
Oct 11 2016
On 10/11/2016 10:49 AM, Matthias Bentrup wrote:void popFrontAsmIntel(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s[1 .. $]; } else { uint l = void; asm pure nothrow nogc { mov EAX, 1; mov BL, 0xf8-1; sub BL, c; cmp BL, 0xf8-0xc0; adc EAX, 0; cmp BL, 0xf8-0xe0; adc EAX, 0; cmp BL, 0xf8-0xf0; adc EAX, 0; mov l, EAX; } s = s[l <= $ ? l : $ .. $]; } }Did you take a look at the codegen on http://ldc.acomirei.ru? It's huge. -- Andrei
Oct 11 2016
On Tuesday, 11 October 2016 at 15:01:47 UTC, Andrei Alexandrescu wrote:On 10/11/2016 10:49 AM, Matthias Bentrup wrote:Here are three branch-less variants that use the sign instead of the carry bit. The last one is the fastest on my machine, although it mixes the rare error case and the common 1-byte case into one branch. void popFront1(ref char[] s) trusted pure nothrow { immutable c = cast(byte)s[0]; if (c >= 0) { s = s[1 .. $]; } else if (c < -8) { uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31); import std.algorithm; s = s[min(i, $) .. $]; } else { s = s[1 .. $]; } } void popFront1a(ref char[] s) trusted pure nothrow { immutable c = cast(byte)s[0]; if (c >= 0) {Three s = s[1 .. $]; } else { uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16void popFrontAsmIntel(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s[1 .. $]; } else { uint l = void; asm pure nothrow nogc { mov EAX, 1; mov BL, 0xf8-1; sub BL, c; cmp BL, 0xf8-0xc0; adc EAX, 0; cmp BL, 0xf8-0xe0; adc EAX, 0; cmp BL, 0xf8-0xf0; adc EAX, 0; mov l, EAX; } s = s[l <= $ ? l : $ .. $]; } }Did you take a look at the codegen on http://ldc.acomirei.ru? It's huge. -- Andreiimport std.algorithm; s = s[min(i, $) .. $]; } } void popFront1b(ref char[] s) trusted pure nothrow { immutable c = cast(byte)s[0]; if (c >= -8) { s = s[1 .. $]; } else { uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31); import std.algorithm; s = s[min(i, $) .. $]; } }31)) & (c + 8 >> 31));
Oct 12 2016
On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup wrote:Here are three branch-less variants that use the sign instead of the carry bit. The last one is the fastest on my machine, although it mixes the rare error case and the common 1-byte case into one branch. void popFront1(ref char[] s) trusted pure nothrow { immutable c = cast(byte)s[0]; if (c >= 0) { s = s[1 .. $]; } else if (c < -8) { uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31); import std.algorithm; s = s[min(i, $) .. $]; } else { s = s[1 .. $]; } } void popFront1a(ref char[] s) trusted pure nothrow { immutable c = cast(byte)s[0]; if (c >= 0) {Three s = s[1 .. $]; } else { uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31)) & (c + 8 >> 31)); import std.algorithm; s = s[min(i, $) .. $]; } } void popFront1b(ref char[] s) trusted pure nothrow { immutable c = cast(byte)s[0]; if (c >= -8) { s = s[1 .. $]; } else { uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31); import std.algorithm; s = s[min(i, $) .. $]; } }All three are slower than baseline, for my test-case. What did you test it against.
Oct 12 2016
On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch wrote:On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup wrote:The blns.txt file mentioned upthread.[...]All three are slower than baseline, for my test-case. What did you test it against.
Oct 12 2016
On Wednesday, 12 October 2016 at 10:15:17 UTC, Matthias Bentrup wrote:On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch wrote:And what were your timings ? BTW. the code you posted would not be a proper replacement for utf8 popFront since our UTF8-popFront does handle depreacted sequences correctly. making it equivalent to this code : void pfnew(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 192) { Lend : s = s.ptr[char_length .. s.length]; } else { if (c < 192+32) { char_length = 2; } else if (c < 192+32+16) { char_length = 3; } else if (c < 192+32+16+8) { char_length = 4; } else if (c < 192+32+16+8+4) { char_length = 5; } else if (c < 192+32+16+8+4+2) { char_length = 6; } char_length = char_length > s.length ? s.length : char_length; goto Lend; } }On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup wrote:The blns.txt file mentioned upthread.[...]All three are slower than baseline, for my test-case. What did you test it against.
Oct 12 2016
I just confirmed that branching version is faster then table-lookup. please test it our for yourself http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj The table-lookup does produce the smallest code though.
Oct 12 2016
On 10/12/2016 06:56 AM, Stefan Koch wrote:I just confirmed that branching version is faster then table-lookup. please test it our for yourself http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj The table-lookup does produce the smallest code though.Nice. I like that the table is NOT looked up on the ASCII path, so it stays cold in ASCII text. However, there's a problem with this pattern: size_t char_length = 1; immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[char_length .. s.length]; return ; } as opposed to: immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[1 .. s.length]; return ; } In the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei
Oct 12 2016
On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote:On 10/12/2016 06:56 AM, Stefan Koch wrote:btw. We could get rid of of a sub if we changed slices from holding a pointer and a length to holding two pointers.I just confirmed that branching version is faster then table-lookup. please test it our for yourself http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj The table-lookup does produce the smallest code though.Nice. I like that the table is NOT looked up on the ASCII path, so it stays cold in ASCII text. However, there's a problem with this pattern: size_t char_length = 1; immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[char_length .. s.length]; return ; } as opposed to: immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[1 .. s.length]; return ; } In the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei
Oct 12 2016
On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote:On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote:When I write code so i can keep the str = str[1 .. str.length] It leads to lager code overall and decreases performance. both in table lookup and in the multi-branch code. update: for ldc the table-lookup code is faster 27% then the current popFront. on dmd the table-lookup is 2% faster then the current popFront. for ldc the branching code is 23% faster then the current popFront for dmd the branching code is 20% faster then the current popFrontIn the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei
Oct 12 2016
On 10/12/2016 09:39 AM, Stefan Koch wrote:On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote:Thanks! I'd say make sure there is exactly 0% loss on performance compared to the popFront in the ASCII case, and if so make a PR with the table version. -- AndreiOn Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote:When I write code so i can keep the str = str[1 .. str.length] It leads to lager code overall and decreases performance. both in table lookup and in the multi-branch code. update: for ldc the table-lookup code is faster 27% then the current popFront. on dmd the table-lookup is 2% faster then the current popFront. for ldc the branching code is 23% faster then the current popFront for dmd the branching code is 20% faster then the current popFrontIn the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei
Oct 12 2016
On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei Alexandrescu wrote:On 10/12/2016 09:39 AM, Stefan Koch wrote:Thanks! I'd say make sure there is exactly 0% loss on performance compared to the popFront in the ASCII case, and if so make a PR with the table version. -- AndreiI measured again. The table version has a DECREASES the performance for dmd by 1%. I think we should keep performance for dmd in mind. I could add the table version in a version (LDC) block. I cannot currently measure on gdc. I'd appreciate more perf data.
Oct 12 2016
On 10/12/2016 10:39 AM, Stefan Koch wrote:On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei Alexandrescu wrote:No need. 1% for dmd is negligible. 25% would raise an eyebrow. -- AndreiOn 10/12/2016 09:39 AM, Stefan Koch wrote:Thanks! I'd say make sure there is exactly 0% loss on performance compared to the popFront in the ASCII case, and if so make a PR with the table version. -- AndreiI measured again. The table version has a DECREASES the performance for dmd by 1%. I think we should keep performance for dmd in mind. I could add the table version in a version (LDC) block.
Oct 12 2016
On Wednesday, 12 October 2016 at 14:46:32 UTC, Andrei Alexandrescu wrote:No need. 1% for dmd is negligible. 25% would raise an eyebrow. -- AndreiAlright then PR: https://github.com/dlang/phobos/pull/4849
Oct 12 2016
On 10/12/2016 05:23 AM, Stefan Koch wrote:All three are slower than baseline, for my test-case. What did you test it against.I'd say: (a) test for speed of ASCII-only text; (b) make it small. That's all we need. Nobody worries about 10-20% in multibyte-heavy text. -- Andrei
Oct 12 2016
On 10/12/2016 04:56 AM, Matthias Bentrup wrote:void popFront1b(ref char[] s) trusted pure nothrow { immutable c = cast(byte)s[0]; if (c >= -8) { s = s[1 .. $]; } else { uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31); import std.algorithm; s = s[min(i, $) .. $]; } }This is really small although it does two jumps on the frequent path. Perhaps an improvement over the current implementation. Nice work! -- Andrei
Oct 12 2016
On 10/10/2016 11:00 PM, Stefan Koch wrote:void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if ((c & b01100_0000) == 0b1000_0000) { //just skip one in case this is not the beginning of a code-point char goto Lend; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } } }Looked at this, still seems to generate a jump forward with ldc. Also, why do you leave a fallthrough path? Progress needs to be made on all paths, otherwise we have infinite loops. Andrei
Oct 11 2016
On Tuesday, 11 October 2016 at 15:08:34 UTC, Andrei Alexandrescu wrote:Looked at this, still seems to generate a jump forward with ldc.ldc.intrinsics.llvm_expect might help to influence basic block layout. — David
Oct 11 2016
On Tuesday, 11 October 2016 at 15:08:34 UTC, Andrei Alexandrescu wrote:On 10/10/2016 11:00 PM, Stefan Koch wrote:I forgot that the fall-trough did no longer end in Lend; That forward jump to Lend is a very common and therefore predicted branch. I will now run this problem through STOKE. Let's see what it comes up with :)[...]Looked at this, still seems to generate a jump forward with ldc. Also, why do you leave a fallthrough path? Progress needs to be made on all paths, otherwise we have infinite loops.
Oct 11 2016
On 10/11/16 11:15 AM, Stefan Koch wrote:I will now run this problem through STOKE. Let's see what it comes up with :)http://stoke.stanford.edu you mean? That would be cool. Keep us posted! -- Andrei
Oct 11 2016
On Tuesday, 11 October 2016 at 16:13:45 UTC, Andrei Alexandrescu wrote:On 10/11/16 11:15 AM, Stefan Koch wrote:Yep I mean that one. It will take a while to work out the right cost-functions. I'll do a PR as soon as this bears fruit.I will now run this problem through STOKE. Let's see what it comes up with :)http://stoke.stanford.edu you mean? That would be cool. Keep us posted! -- Andrei
Oct 11 2016
My current favorites: void popFront(ref char[] s) trusted pure nothrow { immutable byte c = s[0]; if (c >= -2) { s = s.ptr[1 .. s.length]; } else { import core.bitop; size_t i = 7u - bsr(~c); import std.algorithm; s = s.ptr[min(i, s.length) .. s.length]; } } I also experimented with explicit speculation: void popFront(ref char[] s) trusted pure nothrow { immutable byte c = s[0]; s = s.ptr[1 .. s.length]; if (c < -2) { import core.bitop; size_t i = 6u - bsr(~c); import std.algorithm; s = s.ptr[min(i, s.length) .. s.length]; } } LDC and GDC both compile these to 23 instructions. DMD does worse than with my other code. You can influence GDC's block layout with __builtin_expect. I notice that many other snippets posted use uint instead of size_t in the multi-byte branch. This generates extra instructions for me.
Oct 12 2016
On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:[Snip]Didn't see the LUT implementation, nvm!
Oct 12 2016
On 10/12/2016 01:05 PM, safety0ff wrote:On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:Yah, that's pretty clever. Better yet, I suspect we can reuse the look-up table for front() as well. -- Andrei[Snip]Didn't see the LUT implementation, nvm!
Oct 12 2016
On Wednesday, 12 October 2016 at 17:59:51 UTC, Andrei Alexandrescu wrote:On 10/12/2016 01:05 PM, safety0ff wrote:The first results from stoke are in. It turns out stoke likes to produce garbage :( It's smallest result so far has around 100 instructions. However it might get better if I give it a few more hours to explore.On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:Yah, that's pretty clever. Better yet, I suspect we can reuse the look-up table for front() as well. -- Andrei[Snip]Didn't see the LUT implementation, nvm!
Oct 13 2016
On Friday, 14 October 2016 at 04:21:28 UTC, Stefan Koch wrote:On Wednesday, 12 October 2016 at 17:59:51 UTC, Andrei Alexandrescu wrote:Also I doubt that it is correct :( testb $0x8, 0x200aa9(%rip) movl $0x6, %eax prefetchnta 0x200a9d(%rip) je .L_400650 mulb -0x4(%rsp) movb $0xfa, -0x5(%rsp) vmovd (%rax), %xmm6 pmovzxbd -0x5(%rsp), %xmm11 1 psrad $0xf9, %xmm6 movl $0xef, %esp pextrd $0xfe, %xmm6, (%rax) .L_4005b0: vrsqrtps 0x200a69(%rip), %ymm13 vzeroall incl %edi cmpb %ah, %dl cmpq %rdi, %rdi jbe .L_400640 ja .L_4005f0 pcmpeqq -0x4(%rsp), %xmm10 sbbb %ah, 0x200a4d(%rip) jmpq .L_400643 .L_4005f0: ja .L_40060c jmpq .L_400643 .L_40060c: ja .L_400628 minsd 0x200a3c(%rip), %xmm10 jmpq .L_400643 .L_400628: vmovsldup %ymm3, %ymm3 vrcpps %ymm12, %ymm7 vrsqrtps -0x4(%rsp), %xmm0 fldl2t vmovmskpd %xmm8, %r10 vrcpps %xmm6, %xmm13 rcrw $0xf7, %ax jbe .L_400643 sbbq $0x40, %rax xorb $0xfe, 0x200a0d(%rip) adcw $0xf0, %r10w .L_400640: vmaskmovpd %xmm4, %xmm10, 0x2009ff(%rip) pabsb %xmm12, %xmm15 .L_400643: jne .L_4005b0 .L_400650: retq I am not quite sure what this does. But I am certain it has nothing to do with UTF-8 decoding :) Oh btw using an end pointer instead of a length reduces the table version to 12 instructions.On 10/12/2016 01:05 PM, safety0ff wrote:The first results from stoke are in. It turns out stoke likes to produce garbage :( It's smallest result so far has around 100 instructions. However it might get better if I give it a few more hours to explore.On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:Yah, that's pretty clever. Better yet, I suspect we can reuse the look-up table for front() as well. -- Andrei[Snip]Didn't see the LUT implementation, nvm!
Oct 13 2016