digitalmars.D - Can you shrink it further?

Andrei Alexandrescu (4/4) Oct 09 2016 Had a bit of a good time with popFront for UTF, got to this:

Stefan Koch (3/8) Oct 09 2016 With -Oz both functions look almost the same on ldc and gdc

Andrei Alexandrescu (11/18) Oct 09 2016 The big ldc link I posted:

safety0ff (14/15) Oct 09 2016 Less clever seemed to work for me:

Andrei Alexandrescu (3/17) Oct 09 2016 Oh, forgot to mention: the initial/short path should only check for

Stefan Koch (19/21) Oct 10 2016 void popFront1(ref char[] s) @trusted pure nothrow {

Stefan Koch (18/23) Oct 10 2016 Since in this case stability of min is concern, you can shave of

Stefan Koch (2/4) Oct 10 2016 In this case the stability of min is +NO+ concern.

Stefan Koch (36/36) Oct 10 2016 This version has 24 instructions but these have a smaller

Andrei Alexandrescu (31/67) Oct 10 2016 That looks good. I'm just worried about the jump forward - ideally the

Stefan Koch (34/43) Oct 10 2016 If you want to skip a byte it's easy to do as well.

Lurker (13/59) Oct 10 2016 Pardon me asking, but why all these gotos instead of else ifs:

Stefan Koch (5/17) Oct 10 2016 No it does not.

Andrei Alexandrescu (3/47) Oct 10 2016 Affirmative. That's identical to the code in "[ ... ]" :o). Generated

Stefan Koch (9/66) Oct 10 2016 It was not identical.

Matthias Bentrup (10/62) Oct 11 2016 A branch-free version:

Stefan Koch (6/15) Oct 11 2016 You still need to special case c < 128

Stefan Koch (4/22) Oct 11 2016 Also the code produces conditional set instructions which have a

Temtaime (8/32) Oct 11 2016 void popFront1(ref char[] s) @trusted pure nothrow

Stefan Koch (3/10) Oct 11 2016 Yours runs with 790 us best time.

Stefan Koch (5/17) Oct 11 2016 CORRECTION this is not bsr's fault.

Temtaime (9/28) Oct 11 2016 Sorry this was also a type in the code.

Stefan Koch (2/10) Oct 11 2016 162 us

Ethan Watson (20/34) Oct 11 2016 The branching, it hurts my eyes!

Andrei Alexandrescu (42/49) Oct 11 2016 Thanks. This does a lot of work on the frequent path c < 0x80:

Andrei Alexandrescu (15/17) Oct 11 2016 What inputs did you test it on?

Stefan Koch (10/28) Oct 11 2016 https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/bln...

Stefan Koch (21/24) Oct 11 2016 On my arguably a bit dated laptop:

Andrei Alexandrescu (16/24) Oct 11 2016 Interesting. 0x80 should be special-cased and you forgot to check the

Stefan Koch (4/26) Oct 11 2016 It's much slower.
Matthias Bentrup (25/52) Oct 11 2016 This is the result I'd like to get, but I can't find a way to

Stefan Koch (4/27) Oct 11 2016 This takes 180us.
Andrei Alexandrescu (3/24) Oct 11 2016 Did you take a look at the codegen on http://ldc.acomirei.ru? It's huge....

Matthias Bentrup (38/64) Oct 12 2016 Here are three branch-less variants that use the sign instead of

Stefan Koch (4/43) Oct 12 2016 All three are slower than baseline, for my test-case.

Matthias Bentrup (2/7) Oct 12 2016 The blns.txt file mentioned upthread.

Stefan Koch (41/50) Oct 12 2016 And what were your timings ?

Stefan Koch (5/5) Oct 12 2016 I just confirmed that branching version is faster then

Andrei Alexandrescu (22/26) Oct 12 2016 Nice. I like that the table is NOT looked up on the ASCII path, so it

Stefan Koch (5/36) Oct 12 2016 btw.

Stefan Koch (11/18) Oct 12 2016 When I write code so i can keep the

Andrei Alexandrescu (4/20) Oct 12 2016 Thanks! I'd say make sure there is exactly 0% loss on performance

Stefan Koch (8/12) Oct 12 2016 I measured again.

Andrei Alexandrescu (2/12) Oct 12 2016 No need. 1% for dmd is negligible. 25% would raise an eyebrow. -- Andrei

Stefan Koch (4/6) Oct 12 2016 Alright then

Andrei Alexandrescu (4/6) Oct 12 2016 I'd say: (a) test for speed of ASCII-only text; (b) make it small.

Andrei Alexandrescu (3/13) Oct 12 2016 This is really small although it does two jumps on the frequent path.

Andrei Alexandrescu (5/36) Oct 11 2016 Looked at this, still seems to generate a jump forward with ldc. Also,

David Nadlinger (5/6) Oct 11 2016 ldc.intrinsics.llvm_expect might help to influence basic block
Stefan Koch (7/13) Oct 11 2016 I forgot that the fall-trough did no longer end in Lend;

Andrei Alexandrescu (3/5) Oct 11 2016 http://stoke.stanford.edu you mean? That would be cool. Keep us posted!

Stefan Koch (5/10) Oct 11 2016 Yep I mean that one.

safety0ff (29/29) Oct 12 2016 My current favorites:

safety0ff (2/3) Oct 12 2016 Didn't see the LUT implementation, nvm!

Andrei Alexandrescu (3/6) Oct 12 2016 Yah, that's pretty clever. Better yet, I suspect we can reuse the

Stefan Koch (7/14) Oct 13 2016 The first results from stoke are in.

Stefan Koch (54/70) Oct 13 2016 Also I doubt that it is correct :(

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Had a bit of a good time with popFront for UTF, got to this: 
https://github.com/dlang/phobos/pull/4848. I suspect there are cleverer 
things that can be done. Can anyone make it tighter? Take a look at the 
disassembly referred to in the pull request. Thanks! -- Andrei

Oct 09 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu 
wrote:
 Had a bit of a good time with popFront for UTF, got to this: 
 https://github.com/dlang/phobos/pull/4848. I suspect there are 
 cleverer things that can be done. Can anyone make it tighter? 
 Take a look at the disassembly referred to in the pull request. 
 Thanks! -- Andrei

With -Oz both functions look almost the same on ldc and gdc

Oct 09 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/09/2016 07:05 PM, Stefan Koch wrote:
On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu wrote:
Had a bit of a good time with popFront for UTF, got to this:
https://github.com/dlang/phobos/pull/4848. I suspect there are
cleverer things that can be done. Can anyone make it tighter? Take a
look at the disassembly referred to in the pull request. Thanks! --
Andrei

With -Oz both functions look almost the same on ldc and gdc

The big ldc link I posted:

http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3

uses -O3, not -Oz. Changing the flag to -Oz seems to cause no change in
the generated code. I see different code generated for popFront1
(proposed) and popFront2 (candidate), e.g. popFront1 has 27 asm lines
whereas popFront2 has 34. (BTW: is there an easy way to see the size of
the generated functions?)

Under what conditions did you notice almost identical code?

Thanks,

Andrei

Oct 09 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu 
wrote:
 I suspect there are cleverer things that can be done.

Less clever seemed to work for me:

void popFront1(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80 || c >= 0xFE) {
     s = s.ptr[1 .. s.length];
   } else {
     import core.bitop;
     size_t i = 7u - bsr(~c); // N.B. changed uint to size_t
     import std.algorithm;
     s = s.ptr[min(i, s.length) .. s.length];
   }
}

Oct 09 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/09/2016 10:55 PM, safety0ff wrote:
 On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu wrote:
 I suspect there are cleverer things that can be done.

 Less clever seemed to work for me:

 void popFront1(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80 || c >= 0xFE) {
     s = s.ptr[1 .. s.length];
   } else {
     import core.bitop;
     size_t i = 7u - bsr(~c); // N.B. changed uint to size_t
     import std.algorithm;
     s = s.ptr[min(i, s.length) .. s.length];
   }
 }

Oh, forgot to mention: the initial/short path should only check for 
ASCII, i.e. c < 0x80. -- Andrei

Oct 09 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Monday, 10 October 2016 at 03:55:17 UTC, Andrei Alexandrescu 
wrote:
 Oh, forgot to mention: the initial/short path should only check 
 for ASCII, i.e. c < 0x80. -- Andrei

void popFront1(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   size_t char_length = 1;
   if (!(c & 0x80)) {
     goto Lend;
   } else {
     import core.bitop;
     uint i = 7u - bsr(~c | 1u);
     import std.algorithm;
     if (i > 6u) goto Lend;
     char_length = min(i, s.length);
   }
   Lend :
   s = s.ptr[char_length .. s.length];
}


This one removes one unnecessary ret.
It will also probably be better for the branch-predictor.

Oct 10 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Monday, 10 October 2016 at 15:17:05 UTC, Stefan Koch wrote:
 On Monday, 10 October 2016 at 03:55:17 UTC, Andrei Alexandrescu 
 wrote:
 Oh, forgot to mention: the initial/short path should only 
 check for ASCII, i.e. c < 0x80. -- Andrei


Since in this case stability of min is concern, you can shave of 
another 2 instructions by writing the comparison by hand

void popFront1(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   size_t char_length = 1;
   if (!(c & 0x80)) {
     goto Lend;
   } else {
     import core.bitop;
     uint i = 7u - bsr(~c | 1u);
     import std.algorithm;
     if (i > 6u) goto Lend;
     char_length = i < s.length ? i : s.length;
   }
   Lend :
   s = s.ptr[char_length .. s.length];
}

Oct 10 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Monday, 10 October 2016 at 15:37:09 UTC, Stefan Koch wrote:
 Since in this case stability of min is concern, you can shave 
 of another 2 instructions by writing the comparison by hand

In this case the stability of min is +NO+ concern.

Oct 10 2016

Stefan Koch <uplink.coder googlemail.com> writes:

This version has 24 instructions but these have a smaller 
encoding then and are generally inexpensive
With inline asm and conditional moves it would be possible to 
reduce this even further
to ~20 instructions.

void popFront1(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   size_t char_length = 1;
   if (c < 127)
   {
     goto Lend;
   } else {
     if ((c & 0b1100_0000) == 0b1000_0000)
     {
       // This is invalid as a first char
       goto Lerror;
     }
     if (c < 192)
     {
       char_length = 2;
       goto Lend;
     }
     if (c < 240) {
       char_length = 3;
       goto Lend;
     }
     if (c < 248) {
        char_length = 4;
       goto Lend;
     }

     //These characters are also no longer valid
     Lerror : assert(0);
   }
   Lend :
   s = s.ptr[char_length .. s.length];
}

Oct 10 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

That looks good. I'm just worried about the jump forward - ideally the 
case c < 127 would simply entail a quick return. I tried a fix, but it 
didn't do what I wanted in ldc. We shouldn't assert(0) if wrong - just 
skip one byte. Also, are we right to not worry about 5- and 6-byte 
sequences? The docs keep on threatening with it, and then immediately 
mention those are not valid.

void popFront3(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1;
   if (c < 127)
   {
   Lend :
     s = s.ptr[char_length .. s.length];
   } else {
     if (c < 192)
     {
       char_length = 2;
       goto Lend;
     }
     if (c < 240) {
       char_length = 3;
       goto Lend;
     }
     if (c < 248) {
        char_length = 4;
     }
     goto Lend;
   }
}


Andrei

On 10/10/16 9:39 PM, Stefan Koch wrote:
 This version has 24 instructions but these have a smaller encoding then
 and are generally inexpensive
 With inline asm and conditional moves it would be possible to reduce
 this even further
 to ~20 instructions.

 void popFront1(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   size_t char_length = 1;
   if (c < 127)
   {
     goto Lend;
   } else {
     if ((c & 0b1100_0000) == 0b1000_0000)
     {
       // This is invalid as a first char
       goto Lerror;
     }
     if (c < 192)
     {
       char_length = 2;
       goto Lend;
     }
     if (c < 240) {
       char_length = 3;
       goto Lend;
     }
     if (c < 248) {
        char_length = 4;
       goto Lend;
     }

     //These characters are also no longer valid
     Lerror : assert(0);
   }
   Lend :
   s = s.ptr[char_length .. s.length];
 }

Oct 10 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei Alexandrescu 
wrote:
 That looks good. I'm just worried about the jump forward - 
 ideally the case c < 127 would simply entail a quick return. I 
 tried a fix, but it didn't do what I wanted in ldc. We 
 shouldn't assert(0) if wrong - just skip one byte. Also, are we 
 right to not worry about 5- and 6-byte sequences? The docs keep 
 on threatening with it, and then immediately mention those are 
 not valid.

 [ ... ]

 Andrei

If you want to skip a byte it's easy to do as well.

void popFront3(ref char[] s)  trusted pure nothrow {
    immutable c = s[0];
    uint char_length = 1;
    if (c < 127)
    {
    Lend :
      s = s.ptr[char_length .. s.length];
    } else {
      if ((c & b01100_0000) == 0b1000_0000)
      {
        //just skip one in case this is not the beginning of a 
code-point char
        goto Lend;
      }
      if (c < 192)
      {
        char_length = 2;
        goto Lend;
      }
      if (c < 240)
      {
        char_length = 3;
        goto Lend;
      }
      if (c < 248)
      {
        char_length = 4;
        goto Lend;
      }
    }
  }

Oct 10 2016

Lurker <lurker gmail.com> writes:

On Tuesday, 11 October 2016 at 03:00:45 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei 
 Alexandrescu wrote:
 That looks good. I'm just worried about the jump forward - 
 ideally the case c < 127 would simply entail a quick return. I 
 tried a fix, but it didn't do what I wanted in ldc. We 
 shouldn't assert(0) if wrong - just skip one byte. Also, are 
 we right to not worry about 5- and 6-byte sequences? The docs 
 keep on threatening with it, and then immediately mention 
 those are not valid.

 [ ... ]

 Andrei

 If you want to skip a byte it's easy to do as well.

 void popFront3(ref char[] s)  trusted pure nothrow {
    immutable c = s[0];
    uint char_length = 1;
    if (c < 127)
    {
    Lend :
      s = s.ptr[char_length .. s.length];
    } else {
      if ((c & b01100_0000) == 0b1000_0000)
      {
        //just skip one in case this is not the beginning of a 
 code-point char
        goto Lend;
      }
      if (c < 192)
      {
        char_length = 2;
        goto Lend;
      }
      if (c < 240)
      {
        char_length = 3;
        goto Lend;
      }
      if (c < 248)
      {
        char_length = 4;
        goto Lend;
      }
    }
  }

Pardon me asking, but why all these gotos instead of else ifs:

   if (c < 192)
   {
     char_length = 2;
   }
   else if (c < 240)
   {
     char_length = 3;
   } else if (...) {
   }

Does it have any effect on generated code (I don't think it 
should)?

Oct 10 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 03:18:24 UTC, Lurker wrote:
 Pardon me asking, but why all these gotos instead of else ifs:

   if (c < 192)
   {
     char_length = 2;
   }
   else if (c < 240)
   {
     char_length = 3;
   } else if (...) {
   }

 Does it have any effect on generated code (I don't think it 
 should)?

No it does not.
I wrote the gotos because that is how I am used to thinking about 
code.
If else should work fine here.

Oct 10 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/10/16 11:00 PM, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei Alexandrescu wrote:
 That looks good. I'm just worried about the jump forward - ideally the
 case c < 127 would simply entail a quick return. I tried a fix, but it
 didn't do what I wanted in ldc. We shouldn't assert(0) if wrong - just
 skip one byte. Also, are we right to not worry about 5- and 6-byte
 sequences? The docs keep on threatening with it, and then immediately
 mention those are not valid.

 [ ... ]

 Andrei

 If you want to skip a byte it's easy to do as well.

 void popFront3(ref char[] s)  trusted pure nothrow {
    immutable c = s[0];
    uint char_length = 1;
    if (c < 127)
    {
    Lend :
      s = s.ptr[char_length .. s.length];
    } else {
      if ((c & b01100_0000) == 0b1000_0000)
      {
        //just skip one in case this is not the beginning of a code-point
 char
        goto Lend;
      }
      if (c < 192)
      {
        char_length = 2;
        goto Lend;
      }
      if (c < 240)
      {
        char_length = 3;
        goto Lend;
      }
      if (c < 248)
      {
        char_length = 4;
        goto Lend;
      }
    }
  }

Affirmative. That's identical to the code in "[ ... ]" :o). Generated 
code still does a jmp forward though. -- Andrei

Oct 10 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 03:58:59 UTC, Andrei Alexandrescu 
wrote:
 On 10/10/16 11:00 PM, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei 
 Alexandrescu wrote:
 That looks good. I'm just worried about the jump forward - 
 ideally the
 case c < 127 would simply entail a quick return. I tried a 
 fix, but it
 didn't do what I wanted in ldc. We shouldn't assert(0) if 
 wrong - just
 skip one byte. Also, are we right to not worry about 5- and 
 6-byte
 sequences? The docs keep on threatening with it, and then 
 immediately
 mention those are not valid.

 [ ... ]

 Andrei

 If you want to skip a byte it's easy to do as well.

 void popFront3(ref char[] s)  trusted pure nothrow {
    immutable c = s[0];
    uint char_length = 1;
    if (c < 127)
    {
    Lend :
      s = s.ptr[char_length .. s.length];
    } else {
      if ((c & b01100_0000) == 0b1000_0000)
      {
        //just skip one in case this is not the beginning of a 
 code-point
 char
        goto Lend;
      }
      if (c < 192)
      {
        char_length = 2;
        goto Lend;
      }
      if (c < 240)
      {
        char_length = 3;
        goto Lend;
      }
      if (c < 248)
      {
        char_length = 4;
        goto Lend;
      }
    }
  }

 Affirmative. That's identical to the code in "[ ... ]" :o). 
 Generated code still does a jmp forward though. -- Andrei

It was not identical.
((c & b01100_0000) == 0b1000_0000))
Can be true in all of the 3 following cases.
If we do not do a jmp to return here, we cannot guarantee that we 
will not skip over the next valid char.
Thereby corrupting already corrupt strings even more.

For best performance we need to leave the gotos in there.

Oct 10 2016

Matthias Bentrup <matthias.bentrup googlemail.com> writes:

On Tuesday, 11 October 2016 at 04:05:47 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 03:58:59 UTC, Andrei 
 Alexandrescu wrote:
 On 10/10/16 11:00 PM, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei 
 Alexandrescu wrote:
[...]

 If you want to skip a byte it's easy to do as well.

 void popFront3(ref char[] s)  trusted pure nothrow {
    immutable c = s[0];
    uint char_length = 1;
    if (c < 127)
    {
    Lend :
      s = s.ptr[char_length .. s.length];
    } else {
      if ((c & b01100_0000) == 0b1000_0000)
      {
        //just skip one in case this is not the beginning of a 
 code-point
 char
        goto Lend;
      }
      if (c < 192)
      {
        char_length = 2;
        goto Lend;
      }
      if (c < 240)
      {
        char_length = 3;
        goto Lend;
      }
      if (c < 248)
      {
        char_length = 4;
        goto Lend;
      }
    }
  }

 Affirmative. That's identical to the code in "[ ... ]" :o). 
 Generated code still does a jmp forward though. -- Andrei

 It was not identical.
 ((c & b01100_0000) == 0b1000_0000))
 Can be true in all of the 3 following cases.
 If we do not do a jmp to return here, we cannot guarantee that 
 we will not skip over the next valid char.
 Thereby corrupting already corrupt strings even more.

 For best performance we need to leave the gotos in there.

A branch-free version:

void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
}

Theoretically the char_length could be computed with three sub 
and addc instructions, but no compiler is smart enough to detect 
that.

Oct 11 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 07:30:26 UTC, Matthias Bentrup 
wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }

 Theoretically the char_length could be computed with three sub 
 and addc instructions, but no compiler is smart enough to 
 detect that.

You still need to special case c < 128
as well as the follow chars.

also smaller c's are more common the bigger ones making the 
branching version faster on average.

Oct 11 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 08:03:40 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 07:30:26 UTC, Matthias Bentrup 
 wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }

 Theoretically the char_length could be computed with three sub 
 and addc instructions, but no compiler is smart enough to 
 detect that.

 You still need to special case c < 128
 as well as the follow chars.

 also smaller c's are more common the bigger ones making the 
 branching version faster on average.

Also the code produces conditional set instructions which have a 
higher latency.
And worse throughput.

Oct 11 2016

Temtaime <temtaime gmail.com> writes:

On Tuesday, 11 October 2016 at 08:17:52 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 08:03:40 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 07:30:26 UTC, Matthias Bentrup 
 wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }

 Theoretically the char_length could be computed with three 
 sub and addc instructions, but no compiler is smart enough to 
 detect that.

 You still need to special case c < 128
 as well as the follow chars.

 also smaller c's are more common the bigger ones making the 
 branching version faster on average.

 Also the code produces conditional set instructions which have 
 a higher latency.
 And worse throughput.

void popFront1(ref char[] s)  trusted pure nothrow
{
   import core.bitop, std.algorithm;
   auto v = bsr(~s[0] | 1);
   s = s[clamp(v, 1, v > 6 ? 1 : $)..$];
}

Seems to be less if i'm not wrong.

Oct 11 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:
 void popFront1(ref char[] s)  trusted pure nothrow
 {
   import core.bitop, std.algorithm;
   auto v = bsr(~s[0] | 1);
   s = s[clamp(v, 1, v > 6 ? 1 : $)..$];
 }

 Seems to be less if i'm not wrong.


Yours runs with 790 us best time.
bsr is a real timetaker :)

Oct 11 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 08:57:46 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:
 void popFront1(ref char[] s)  trusted pure nothrow
 {
   import core.bitop, std.algorithm;
   auto v = bsr(~s[0] | 1);
   s = s[clamp(v, 1, v > 6 ? 1 : $)..$];
 }

 Seems to be less if i'm not wrong.


 Yours runs with 790 us best time.
 bsr is a real timetaker :)

CORRECTION this is not bsr's fault.
It's most likely clamp.
I am compiling with dmd and dmd is not as good in optimizing when 
templates are in the mix.

Oct 11 2016

Temtaime <temtaime gmail.com> writes:

On Tuesday, 11 October 2016 at 09:13:10 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 08:57:46 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:
 void popFront1(ref char[] s)  trusted pure nothrow
 {
   import core.bitop, std.algorithm;
   auto v = bsr(~s[0] | 1);
   s = s[clamp(v, 1, v > 6 ? 1 : $)..$];
 }

 Seems to be less if i'm not wrong.


 Yours runs with 790 us best time.
 bsr is a real timetaker :)

 CORRECTION this is not bsr's fault.
 It's most likely clamp.
 I am compiling with dmd and dmd is not as good in optimizing 
 when templates are in the mix.

Sorry this was also a type in the code.

void popFront7(ref char[] s)  trusted pure nothrow
{
   import core.bitop;
   auto v = 7 - bsr(~s[0] | 1);
   s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$];
}

Please check this.

Oct 11 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 09:45:11 UTC, Temtaime wrote:
 Sorry this was also a type in the code.

 void popFront7(ref char[] s)  trusted pure nothrow
 {
   import core.bitop;
   auto v = 7 - bsr(~s[0] | 1);
   s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$];
 }

 Please check this.

162 us

Oct 11 2016

Ethan Watson <gooberman gmail.com> writes:

On Tuesday, 11 October 2016 at 10:01:41 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 09:45:11 UTC, Temtaime wrote:
 Sorry this was also a type in the code.

 void popFront7(ref char[] s)  trusted pure nothrow
 {
   import core.bitop;
   auto v = 7 - bsr(~s[0] | 1);
   s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 
 1)..$];
 }

 Please check this.

 162 us

The branching, it hurts my eyes!

Something like the following should give correct (assuming I 
haven't written bad logic) branchless results with 
architecture-optimised max calls. Note that the minus/plus 1 
operation on the third line will ensure with the sign 
multiplication that values of 7 will map to 1, whereas for all 
other values it's an extra operation. But the advantage is that 
you're not sticking three branches in close proximity to each 
other, so you will never get a branch predictor fail. (Of note, 
any performance test for these functions should test with data 
designed to fail the branching code I quoted, keeping in mind 
that desktop Intel processors have a four-state branch predictor. 
I've not performance tested it myself, but this will certainly 
run faster on the AMD Jaguar processors than a version with 
branching checks.)

int v = 7 - bsr( ~s[0] | 1 );
int sign = ( (v - 7) >> 31 );
v = ( v - 1 ) * sign + 1;
str = str[ min( v, s.length ) .. $ ];

Oct 11 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/11/2016 05:45 AM, Temtaime wrote:
 void popFront7(ref char[] s)  trusted pure nothrow
 {
   import core.bitop;
   auto v = 7 - bsr(~s[0] | 1);
   s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$];
 }

 Please check this.

Thanks. This does a lot of work on the frequent path c < 0x80:

pure nothrow  trusted void example.popFront7(ref char[]):
         movq    8(%rdi), %rax
         movzbl  (%rax), %ecx
         xorq    $254, %rcx
         orq     $1, %rcx
         bsrq    %rcx, %rcx
         notl    %ecx
         addl    $8, %ecx
         cmpl    $6, %ecx
         jg      .LBB0_2
         testl   %ecx, %ecx
         je      .LBB0_2
         movslq  %ecx, %rdx
         movq    (%rdi), %rcx
         cmpq    %rcx, %rdx
         cmovaq  %rcx, %rdx
         jmp     .LBB0_3
.LBB0_2:
         movq    (%rdi), %rcx
         movl    $1, %edx
.LBB0_3:
         addq    %rdx, %rax
         subq    %rdx, %rcx
         movq    %rcx, (%rdi)
         movq    %rax, 8(%rdi)
         retq

So I changed it to:

void popFront7(ref char[] s)  trusted pure nothrow
{
   immutable c = s[0];
   if (c < 0x80) {
     s = s.ptr[1 .. s.length];
   } else {
     import core.bitop;
     uint v = 7 - bsr(~c | (c > 0xfd) << 6u);
     s = s.ptr[v > s.length ? s.length : v .. s.length];
   }
}

That's about as large as the baseline.


Andrei

Oct 11 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/11/2016 04:57 AM, Stefan Koch wrote:
 Yours runs with 790 us best time.
 bsr is a real timetaker :)

What inputs did you test it on?

Here's what I think would be a good set of requirements:

* The ASCII case should be short and fast: a comparison and a branch, 
followed by return. This would improve a very common case and address 
the main issue with autodecoding.

* For the multibyte case, the main requirement is the code must be 
small. This is because it gets inlined all over the place and seldom used.

* For the multibyte case, the fewer bytes in the encoding the less work. 
This is because more frequent multi-byte characters have generally lower 
codes.

Currently front() - the other time spender in autodecoding - issues a 
function call on the multibyte case. That makes the code of front() 
itself small, at the cost of more expensive multibyte handling.


Andrei

Oct 11 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 14:16:54 UTC, Andrei Alexandrescu 
wrote:
 On 10/11/2016 04:57 AM, Stefan Koch wrote:
 Yours runs with 790 us best time.
 bsr is a real timetaker :)

 What inputs did you test it on?

https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt

 Here's what I think would be a good set of requirements:

 * The ASCII case should be short and fast: a comparison and a 
 branch, followed by return. This would improve a very common 
 case and address the main issue with autodecoding.

Already done
 * For the multibyte case, the main requirement is the code must 
 be small. This is because it gets inlined all over the place 
 and seldom used.

 * For the multibyte case, the fewer bytes in the encoding the 
 less work. This is because more frequent multi-byte characters 
 have generally lower codes.

That is why I had the branches, generally only the first one is 
taken
 Currently front() - the other time spender in autodecoding - 
 issues a function call on the multibyte case. That makes the 
 code of front() itself small, at the cost of more expensive 
 multibyte handling.

I think at some point we have to cache the length of the last 
decoded char,
Otherwise we are throwing work away.

However that will only work within a RangeWrapper-Struct

Oct 11 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 08:17:52 UTC, Stefan Koch wrote:
 Also the code produces conditional set instructions which have 
 a higher latency.
 And worse throughput.

On my arguably a bit dated laptop:
popFront3 performs 109 us best time
and popFront4 performs with 265 us best time

Testcode :
void main()
{
import std.datetime : StopWatch;
import std.stdio;
foreach(_;0 .. 255) {
char[] test1 = (import("blns.txt")).dup;
StopWatch sw;
sw.start;
while(test1.length) popFront(test1);
sw.stop;
writeln("pf1 took ", sw.peek.usecs, "us");
sw.reset();
}
}

blns.txt is taken from
https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt

Oct 11 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/11/2016 03:30 AM, Matthias Bentrup wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }

 Theoretically the char_length could be computed with three sub and addc
 instructions, but no compiler is smart enough to detect that.

Interesting. 0x80 should be special-cased and you forgot to check the 
bounds. So:

void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80) {
     s = s.ptr[1 .. s.length];
   } else {
     uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248);
     s = s.ptr[l <= s.length ? l : s.length .. s.length];
   }
}

This generated 27 instructions, i.e. same as the baseline. See:

http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3

But it doesn't seem to check for all errors.


Andrei

Oct 11 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 14:24:56 UTC, Andrei Alexandrescu 
wrote:
 On 10/11/2016 03:30 AM, Matthias Bentrup wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80) {
     s = s.ptr[1 .. s.length];
   } else {
     uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248);
     s = s.ptr[l <= s.length ? l : s.length .. s.length];
   }
 }

 This generated 27 instructions, i.e. same as the baseline. See:

 http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3

 But it doesn't seem to check for all errors.


 Andrei

It's much slower.
Because of the flag-storing instructions.

Oct 11 2016

Matthias Bentrup <matthias.bentrup googlemail.com> writes:

On Tuesday, 11 October 2016 at 14:24:56 UTC, Andrei Alexandrescu 
wrote:
 On 10/11/2016 03:30 AM, Matthias Bentrup wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }

 Theoretically the char_length could be computed with three sub 
 and addc
 instructions, but no compiler is smart enough to detect that.

 Interesting. 0x80 should be special-cased and you forgot to 
 check the bounds. So:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80) {
     s = s.ptr[1 .. s.length];
   } else {
     uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248);
     s = s.ptr[l <= s.length ? l : s.length .. s.length];
   }
 }

 This generated 27 instructions, i.e. same as the baseline. See:

 http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3

 But it doesn't seem to check for all errors.


 Andrei

This is the result I'd like to get, but I can't find a way to 
write it without inline assembly :(

void popFrontAsmIntel(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80) {
     s = s[1 .. $];
   } else {
     uint l = void;
     asm pure nothrow  nogc {
       mov EAX, 1;
       mov BL, 0xf8-1;
       sub BL, c;
       cmp BL, 0xf8-0xc0;
       adc EAX, 0;
       cmp BL, 0xf8-0xe0;
       adc EAX, 0;
       cmp BL, 0xf8-0xf0;
       adc EAX, 0;
       mov l, EAX;
     }
     s = s[l <= $ ? l : $ .. $];
   }
}

Oct 11 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 14:49:28 UTC, Matthias Bentrup 
wrote:
 This is the result I'd like to get, but I can't find a way to 
 write it without inline assembly :(

 void popFrontAsmIntel(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80) {
     s = s[1 .. $];
   } else {
     uint l = void;
     asm pure nothrow  nogc {
       mov EAX, 1;
       mov BL, 0xf8-1;
       sub BL, c;
       cmp BL, 0xf8-0xc0;
       adc EAX, 0;
       cmp BL, 0xf8-0xe0;
       adc EAX, 0;
       cmp BL, 0xf8-0xf0;
       adc EAX, 0;
       mov l, EAX;
     }
     s = s[l <= $ ? l : $ .. $];
   }
 }

This takes 180us.
Baseline takes 124us.

Oct 11 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/11/2016 10:49 AM, Matthias Bentrup wrote:
 void popFrontAsmIntel(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80) {
     s = s[1 .. $];
   } else {
     uint l = void;
     asm pure nothrow  nogc {
       mov EAX, 1;
       mov BL, 0xf8-1;
       sub BL, c;
       cmp BL, 0xf8-0xc0;
       adc EAX, 0;
       cmp BL, 0xf8-0xe0;
       adc EAX, 0;
       cmp BL, 0xf8-0xf0;
       adc EAX, 0;
       mov l, EAX;
     }
     s = s[l <= $ ? l : $ .. $];
   }
 }

Did you take a look at the codegen on http://ldc.acomirei.ru? It's huge. 
-- Andrei

Oct 11 2016

Matthias Bentrup <matthias.bentrup googlemail.com> writes:

On Tuesday, 11 October 2016 at 15:01:47 UTC, Andrei Alexandrescu 
wrote:
 On 10/11/2016 10:49 AM, Matthias Bentrup wrote:
 void popFrontAsmIntel(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80) {
     s = s[1 .. $];
   } else {
     uint l = void;
     asm pure nothrow  nogc {
       mov EAX, 1;
       mov BL, 0xf8-1;
       sub BL, c;
       cmp BL, 0xf8-0xc0;
       adc EAX, 0;
       cmp BL, 0xf8-0xe0;
       adc EAX, 0;
       cmp BL, 0xf8-0xf0;
       adc EAX, 0;
       mov l, EAX;
     }
     s = s[l <= $ ? l : $ .. $];
   }
 }

 Did you take a look at the codegen on http://ldc.acomirei.ru? 
 It's huge. -- Andrei

Here are three branch-less variants that use the sign instead of 
the carry bit.

The last one is the fastest on my machine, although it mixes the 
rare error case and the common 1-byte case into one branch.

void popFront1(ref char[] s)  trusted pure nothrow {
   immutable c = cast(byte)s[0];
   if (c >= 0) {
     s = s[1 .. $];
   } else if (c < -8) {
     uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
     import std.algorithm;
     s = s[min(i, $) .. $];
   } else {
     s = s[1 .. $];
   }
}

void popFront1a(ref char[] s)  trusted pure nothrow {
   immutable c = cast(byte)s[0];
   if (c >= 0) {Three
     s = s[1 .. $];
   } else {
     uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 
 31)) & (c + 8 >> 31));


     import std.algorithm;
     s = s[min(i, $) .. $];
   }
}

void popFront1b(ref char[] s)  trusted pure nothrow {
   immutable c = cast(byte)s[0];
   if (c >= -8) {
     s = s[1 .. $];
   } else {
     uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
     import std.algorithm;
     s = s[min(i, $) .. $];
   }
}

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup 
wrote:
 Here are three branch-less variants that use the sign instead 
 of the carry bit.

 The last one is the fastest on my machine, although it mixes 
 the rare error case and the common 1-byte case into one branch.

 void popFront1(ref char[] s)  trusted pure nothrow {
   immutable c = cast(byte)s[0];
   if (c >= 0) {
     s = s[1 .. $];
   } else if (c < -8) {
     uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 
 31);
     import std.algorithm;
     s = s[min(i, $) .. $];
   } else {
     s = s[1 .. $];
   }
 }

 void popFront1a(ref char[] s)  trusted pure nothrow {
   immutable c = cast(byte)s[0];
   if (c >= 0) {Three
     s = s[1 .. $];
   } else {
     uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 
 16 >> 31)) & (c + 8 >> 31));
     import std.algorithm;
     s = s[min(i, $) .. $];
   }
 }

 void popFront1b(ref char[] s)  trusted pure nothrow {
   immutable c = cast(byte)s[0];
   if (c >= -8) {
     s = s[1 .. $];
   } else {
     uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 
 31);
     import std.algorithm;
     s = s[min(i, $) .. $];
   }
 }

All three are slower than baseline, for my test-case.
What did you test it against.

Oct 12 2016

Matthias Bentrup <matthias.bentrup googlemail.com> writes:

On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup 
 wrote:
 [...]

 All three are slower than baseline, for my test-case.
 What did you test it against.

The blns.txt file mentioned upthread.

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Wednesday, 12 October 2016 at 10:15:17 UTC, Matthias Bentrup 
wrote:
 On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch 
 wrote:
 On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias 
 Bentrup wrote:
 [...]

 All three are slower than baseline, for my test-case.
 What did you test it against.

 The blns.txt file mentioned upthread.

And what were your timings ?

BTW. the code you posted would not be a proper replacement for 
utf8 popFront since our UTF8-popFront does handle depreacted 
sequences correctly.

making it equivalent to this code :
void pfnew(ref char[] s)  trusted pure nothrow
{
     immutable c = s[0];
     uint char_length = 1;
     if (c < 192)
     {
Lend :
         s = s.ptr[char_length .. s.length];
     } else {
         if (c < 192+32)
         {
             char_length = 2;
         }
         else if (c < 192+32+16)
         {
             char_length = 3;
         }
         else if (c < 192+32+16+8)
         {
             char_length = 4;
         }
         else if (c < 192+32+16+8+4)
         {
             char_length = 5;
         }
         else if (c < 192+32+16+8+4+2)
         {
             char_length = 6;
         }

         char_length = char_length > s.length ? s.length : 
char_length;
         goto Lend;
     }
}

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

I just confirmed that branching version is faster then 
table-lookup.

please test it our for yourself
http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj

The table-lookup does produce the smallest code though.

Oct 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/12/2016 06:56 AM, Stefan Koch wrote:
 I just confirmed that branching version is faster then table-lookup.

 please test it our for yourself
 http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj

 The table-lookup does produce the smallest code though.

Nice. I like that the table is NOT looked up on the ASCII path, so it 
stays cold in ASCII text. However, there's a problem with this pattern:

     size_t char_length = 1;
     immutable c = s[0];
     if (c < 192)
     {
         Lend :
         s = s.ptr[char_length .. s.length];
         return ;
     }

as opposed to:

     immutable c = s[0];
     if (c < 192)
     {
         Lend :
         s = s.ptr[1 .. s.length];
         return ;
     }

In the second case, the compiler generates an inc for bumping the 
pointer and a dec for decreasing the length (small instructions). If the 
variable char_length is used, add/sub must be used (larger). -- Andrei

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei 
Alexandrescu wrote:
 On 10/12/2016 06:56 AM, Stefan Koch wrote:
 I just confirmed that branching version is faster then 
 table-lookup.

 please test it our for yourself
 http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj

 The table-lookup does produce the smallest code though.

 Nice. I like that the table is NOT looked up on the ASCII path, 
 so it stays cold in ASCII text. However, there's a problem with 
 this pattern:

     size_t char_length = 1;
     immutable c = s[0];
     if (c < 192)
     {
         Lend :
         s = s.ptr[char_length .. s.length];
         return ;
     }

 as opposed to:

     immutable c = s[0];
     if (c < 192)
     {
         Lend :
         s = s.ptr[1 .. s.length];
         return ;
     }

 In the second case, the compiler generates an inc for bumping 
 the pointer and a dec for decreasing the length (small 
 instructions). If the variable char_length is used, add/sub 
 must be used (larger). -- Andrei

btw.
We could get rid of of a sub if we changed slices from holding a 
pointer and a length to holding two pointers.

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei 
 Alexandrescu wrote:
 In the second case, the compiler generates an inc for bumping 
 the pointer and a dec for decreasing the length (small 
 instructions). If the variable char_length is used, add/sub 
 must be used (larger). -- Andrei


When I write code so i can keep the
str = str[1 .. str.length]
It leads to lager code overall and decreases performance.
both in table lookup and in the multi-branch code.

update:
for ldc the table-lookup code is faster 27% then the current 
popFront.
on dmd the table-lookup is 2% faster then the current popFront.

for ldc the branching code is 23% faster then the current popFront
for dmd the branching code is 20% faster then the current popFront

Oct 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/12/2016 09:39 AM, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote:
 In the second case, the compiler generates an inc for bumping the
 pointer and a dec for decreasing the length (small instructions). If
 the variable char_length is used, add/sub must be used (larger). --
 Andrei


 When I write code so i can keep the
 str = str[1 .. str.length]
 It leads to lager code overall and decreases performance.
 both in table lookup and in the multi-branch code.

 update:
 for ldc the table-lookup code is faster 27% then the current popFront.
 on dmd the table-lookup is 2% faster then the current popFront.

 for ldc the branching code is 23% faster then the current popFront
 for dmd the branching code is 20% faster then the current popFront

Thanks! I'd say make sure there is exactly 0% loss on performance 
compared to the popFront in the ASCII case, and if so make a PR with the 
table version. -- Andrei

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei 
Alexandrescu wrote:
 On 10/12/2016 09:39 AM, Stefan Koch wrote:

 Thanks! I'd say make sure there is exactly 0% loss on 
 performance compared to the popFront in the ASCII case, and if 
 so make a PR with the table version. -- Andrei

I measured again.
The table version has a DECREASES the performance for dmd by 1%.
I think we should keep performance for dmd in mind.
I could add the table version in a version (LDC) block.

I cannot currently measure on gdc.
I'd appreciate more perf data.

Oct 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/12/2016 10:39 AM, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei Alexandrescu wrote:
 On 10/12/2016 09:39 AM, Stefan Koch wrote:

 Thanks! I'd say make sure there is exactly 0% loss on performance
 compared to the popFront in the ASCII case, and if so make a PR with
 the table version. -- Andrei

 I measured again.
 The table version has a DECREASES the performance for dmd by 1%.
 I think we should keep performance for dmd in mind.
 I could add the table version in a version (LDC) block.

No need. 1% for dmd is negligible. 25% would raise an eyebrow. -- Andrei

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Wednesday, 12 October 2016 at 14:46:32 UTC, Andrei 
Alexandrescu wrote:
 No need. 1% for dmd is negligible. 25% would raise an eyebrow. 
 -- Andrei

Alright then
PR: https://github.com/dlang/phobos/pull/4849

Oct 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/12/2016 05:23 AM, Stefan Koch wrote:
 All three are slower than baseline, for my test-case.
 What did you test it against.

I'd say: (a) test for speed of ASCII-only text; (b) make it small. 
That's all we need. Nobody worries about 10-20% in multibyte-heavy text. 
-- Andrei

Oct 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/12/2016 04:56 AM, Matthias Bentrup wrote:
 void popFront1b(ref char[] s)  trusted pure nothrow {
   immutable c = cast(byte)s[0];
   if (c >= -8) {
     s = s[1 .. $];
   } else {
     uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
     import std.algorithm;
     s = s[min(i, $) .. $];
   }
 }

This is really small although it does two jumps on the frequent path. 
Perhaps an improvement over the current implementation. Nice work! -- Andrei

Oct 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/10/2016 11:00 PM, Stefan Koch wrote:
 void popFront3(ref char[] s)  trusted pure nothrow {
    immutable c = s[0];
    uint char_length = 1;
    if (c < 127)
    {
    Lend :
      s = s.ptr[char_length .. s.length];
    } else {
      if ((c & b01100_0000) == 0b1000_0000)
      {
        //just skip one in case this is not the beginning of a code-point
 char
        goto Lend;
      }
      if (c < 192)
      {
        char_length = 2;
        goto Lend;
      }
      if (c < 240)
      {
        char_length = 3;
        goto Lend;
      }
      if (c < 248)
      {
        char_length = 4;
        goto Lend;
      }
    }
  }

Looked at this, still seems to generate a jump forward with ldc. Also, 
why do you leave a fallthrough path? Progress needs to be made on all 
paths, otherwise we have infinite loops.


Andrei

Oct 11 2016

David Nadlinger <code klickverbot.at> writes:

On Tuesday, 11 October 2016 at 15:08:34 UTC, Andrei Alexandrescu 
wrote:
 Looked at this, still seems to generate a jump forward with ldc.

ldc.intrinsics.llvm_expect might help to influence basic block 
layout.

  — David

Oct 11 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 15:08:34 UTC, Andrei Alexandrescu 
wrote:
 On 10/10/2016 11:00 PM, Stefan Koch wrote:
 [...]

 Looked at this, still seems to generate a jump forward with 
 ldc. Also, why do you leave a fallthrough path? Progress needs 
 to be made on all paths, otherwise we have infinite loops.

I forgot that the fall-trough did no longer end in Lend;
That forward jump to Lend is a very common and therefore 
predicted branch.


I will now run this problem through STOKE.
Let's see what it comes up with :)

Oct 11 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/11/16 11:15 AM, Stefan Koch wrote:
 I will now run this problem through STOKE.
 Let's see what it comes up with :)

http://stoke.stanford.edu you mean? That would be cool. Keep us posted! 
-- Andrei

Oct 11 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 11 October 2016 at 16:13:45 UTC, Andrei Alexandrescu 
wrote:
 On 10/11/16 11:15 AM, Stefan Koch wrote:
 I will now run this problem through STOKE.
 Let's see what it comes up with :)

 http://stoke.stanford.edu you mean? That would be cool. Keep us 
 posted! -- Andrei

Yep I mean that one.
It will take a while to work out the right cost-functions.
I'll do a PR as soon as this bears fruit.

Oct 11 2016

safety0ff <safety0ff.dev gmail.com> writes:

My current favorites:

void popFront(ref char[] s)  trusted pure nothrow {
   immutable byte c = s[0];
   if (c >= -2) {
     s = s.ptr[1 .. s.length];
   } else {
     import core.bitop;
     size_t i = 7u - bsr(~c);
     import std.algorithm;
     s = s.ptr[min(i, s.length) .. s.length];
   }
}

I also experimented with explicit speculation:

void popFront(ref char[] s)  trusted pure nothrow {
   immutable byte c = s[0];
   s = s.ptr[1 .. s.length];
   if (c < -2) {
     import core.bitop;
     size_t i = 6u - bsr(~c);
     import std.algorithm;
     s = s.ptr[min(i, s.length) .. s.length];
   }
}


LDC and GDC both compile these to 23 instructions.
DMD does worse than with my other code.

You can influence GDC's block layout with __builtin_expect.

I notice that many other snippets posted use uint instead of 
size_t in the multi-byte branch. This generates extra 
instructions for me.

Oct 12 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:
 [Snip]

Didn't see the LUT implementation, nvm!

Oct 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/12/2016 01:05 PM, safety0ff wrote:
 On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:
 [Snip]

 Didn't see the LUT implementation, nvm!

Yah, that's pretty clever. Better yet, I suspect we can reuse the 
look-up table for front() as well. -- Andrei

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Wednesday, 12 October 2016 at 17:59:51 UTC, Andrei 
Alexandrescu wrote:
 On 10/12/2016 01:05 PM, safety0ff wrote:
 On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:
 [Snip]

 Didn't see the LUT implementation, nvm!

 Yah, that's pretty clever. Better yet, I suspect we can reuse 
 the look-up table for front() as well. -- Andrei

The first results from stoke are in.
It turns out stoke likes to produce garbage :(
It's smallest result so far has around 100 instructions.
However it might get better if I give it a few more hours to 
explore.

Oct 13 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Friday, 14 October 2016 at 04:21:28 UTC, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 17:59:51 UTC, Andrei 
 Alexandrescu wrote:
 On 10/12/2016 01:05 PM, safety0ff wrote:
 On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff 
 wrote:
 [Snip]

 Didn't see the LUT implementation, nvm!

 Yah, that's pretty clever. Better yet, I suspect we can reuse 
 the look-up table for front() as well. -- Andrei

 The first results from stoke are in.
 It turns out stoke likes to produce garbage :(
 It's smallest result so far has around 100 instructions.
 However it might get better if I give it a few more hours to 
 explore.

Also I doubt that it is correct :(

testb $0x8, 0x200aa9(%rip)
movl $0x6, %eax
prefetchnta 0x200a9d(%rip)
je .L_400650
mulb -0x4(%rsp)
movb $0xfa, -0x5(%rsp)
vmovd (%rax), %xmm6
pmovzxbd -0x5(%rsp), %xmm11      1
psrad $0xf9, %xmm6
movl $0xef, %esp
pextrd $0xfe, %xmm6, (%rax)
.L_4005b0:
vrsqrtps 0x200a69(%rip), %ymm13
vzeroall
incl %edi
cmpb %ah, %dl
cmpq %rdi, %rdi
jbe .L_400640
ja .L_4005f0
pcmpeqq -0x4(%rsp), %xmm10
sbbb %ah, 0x200a4d(%rip)
jmpq .L_400643
.L_4005f0:
ja .L_40060c
jmpq .L_400643
.L_40060c:
ja .L_400628
minsd 0x200a3c(%rip), %xmm10
jmpq .L_400643
.L_400628:
vmovsldup %ymm3, %ymm3
vrcpps %ymm12, %ymm7
vrsqrtps -0x4(%rsp), %xmm0
fldl2t
vmovmskpd %xmm8, %r10
vrcpps %xmm6, %xmm13
rcrw $0xf7, %ax
jbe .L_400643
sbbq $0x40, %rax
xorb $0xfe, 0x200a0d(%rip)
adcw $0xf0, %r10w
.L_400640:
vmaskmovpd %xmm4, %xmm10, 0x2009ff(%rip)
pabsb %xmm12, %xmm15
.L_400643:
jne .L_4005b0
.L_400650:
retq

I am not quite sure what this does.
But I am certain it has nothing to do with UTF-8 decoding :)

Oh btw using an end pointer instead of a length reduces the table 
version to 12 instructions.

Oct 13 2016

D Programming

C/C++ Programming

Other

digitalmars.D - Can you shrink it further?