
digitalmars.D - Table lookups - this is pretty definitive

reply Walter Bright <newshound2 digitalmars.com> writes:
Try this benchmark comparing various classification schemes:
---------------------------------
import core.stdc.stdlib;
import core.stdc.string;

import std.algorithm;
import std.array;
import std.ascii;
import std.datetime;
import std.range;
import std.stdio;
import std.traits;

bool isIdentifierChar0(ubyte c)
{
     return isAlphaNum(c) || c == '_' || c == '$';
}

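// Branchy version: chained range checks that accept exactly 0-9, A-Z, a-z, '_' and '$'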
bool isIdentifierChar1(ubyte c)
{
     return ((c >= '0' || c == '$') &&
             (c <= '9' || c >= 'A')  &&
             (c <= 'Z' || c >= 'a' || c == '_') &&
             (c <= 'z'));
}

immutable bool[256] tab2;
static this()
{
     for (size_t u = 0; u < 0x100; ++u)
     {
         tab2[u] = isIdentifierChar0(cast(ubyte)u);
     }
}

bool isIdentifierChar2(ubyte c)
{
     return tab2[c];
}

/*********************************************/

int f0()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar0(cast(ubyte)u);
     }
     return x;
}

int f1()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar1(cast(ubyte)u);
     }
     return x;
}

int f2()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar2(cast(ubyte)u);
     }
     return x;
}

void main()
{
     auto r = benchmark!(f0, f1, f2)(10_000);
     writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs, r[2].msecs);
}
Apr 01 2014
next sibling parent reply "w0rp" <devw0rp gmail.com> writes:
Milliseconds 22 14 10
Yeah, table lookup is the way to go for ASCII, it seems.
Apr 01 2014
parent "w0rp" <devw0rp gmail.com> writes:
On Tuesday, 1 April 2014 at 18:41:03 UTC, w0rp wrote:
 Milliseconds 22 14 10
 Yeah, table lookup is the way to go for ASCII, it seems.

Milliseconds 13 8 2

After I remember to use -inline -release -O.
Apr 01 2014
prev sibling next sibling parent reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
01.04.2014 22:35, Walter Bright writes:
 Try this benchmark comparing various classification schemes:
 [...]
Some general benchmark notes:

1. The first function passed to `benchmark` is always slower (e.g. pass `f2` first and see).

2. Unexpected program behaviour changes. Let's use `benchmark!(f1, f1, f1)(1_000_000)`:

Milliseconds 928 889 888

Then copy `isAlphaNum` into the file (benchmark still shows the same results) and change `dchar c` to `ubyte c`; the result changes:

Milliseconds 849 815 827

The difference is significant, yet `isAlphaNum` is not even called in the benchmark.

Also `f0` is faster than `f1`, benchmark!(f0, f0, f0)(1_000_000):

Milliseconds 731 694 702

And `f2` wins, which is the only unsurprising thing here, benchmark!(f2, f2, f2)(1_000_000):

Milliseconds 227 184 184

Compiler used: dmd -O -inline -release

-- 
Денис В. Шеломовский
Denis V. Shelomovskij
Apr 01 2014
next sibling parent reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
01.04.2014 23:22, Denis Shelomovskij writes:
 01.04.2014 22:35, Walter Bright writes:
 Try this benchmark comparing various classification schemes:
 [...]
 Compiler used: dmd -O -inline -release
Few more words about `dmd`: "Hey, silly, you still use `== x` to compare with `x`? Use table lookups! And never cast `size_t` to `ubyte`!" See:
---
import std.datetime;
import std.stdio;

immutable bool[256] tab2;
static this()
{
    foreach(size_t u; 0 .. 0x100)
        tab2[u] = u == '_';
}

/*********************************************/

int f0()
{
    int x;
    foreach(uint u; 0 .. 0x100)
        x += u == '_';
    return x;
}

int f2()
{
    int x;
    foreach(uint u; 0 .. 0x100)
        x += tab2[cast(ubyte)u];
    return x;
}

int f2plus()
{
    int x;
    foreach(size_t u; 0 .. 0x100)
        x += tab2[u];
    return x;
}

void main()
{
    auto r = benchmark!(f0, f0, f0)(1_000_000);
    writefln("Milliseconds %s %s %s (f0)", r[0].msecs, r[1].msecs, r[2].msecs);
    r = benchmark!(f2, f2, f2)(1_000_000);
    writefln("Milliseconds %s %s %s (f2)", r[0].msecs, r[1].msecs, r[2].msecs);
    r = benchmark!(f2plus, f2plus, f2plus)(1_000_000);
    writefln("Milliseconds %s %s %s (f2plus)", r[0].msecs, r[1].msecs, r[2].msecs);
}
---
Milliseconds 323 274 281 (f0)
Milliseconds 185 185 185 (f2)
Milliseconds 105 97 105 (f2plus)
---
P.S. I don't want to say anything with these results. I just have no idea what happens here and I don't want to investigate dmd's way of optimizing things now.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij
Apr 01 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 12:38 PM, Denis Shelomovskij wrote:
 I don't want to say anything with these results. I just have no idea what
 happens here and I don't want to investigate dmd's way of optimizing things
now.
This is why I recommend, when going all out in optimizing, that at some point you've gotta look at the assembler output of (insert your favorite compiler).
Apr 01 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 12:22 PM, Denis Shelomovskij wrote:
 Compiler used: dmd -O -inline -release
For me, rewriting main() as:

void main()
{
    {
        auto r = benchmark!(f0, f1, f2)(100_000);
        writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs, r[2].msecs);
    }
    {
        auto r = benchmark!(f0, f1, f2)(100_000);
        writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs, r[2].msecs);
    }
}

Gives:

Milliseconds 139 99 15
Milliseconds 122 93 13
Apr 01 2014
prev sibling next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:
 [...]
It's not really possible to tell anything from that benchmark, especially with fancy modern optimisers and branch predictors.

1) ldc and gdc optimise away some/all of the tests, respectively.

2) Once isIdentifierCharX is inlined, the compiler can determine what the results will be beforehand.

3) Even without 1) and 2), the results are unrealistically (very!) correlated, leading to a branch-prediction bonanza. I'm sure you know how significant that can be on modern CPUs. It is also very friendly to pre-fetching of the lookup table.*

Perhaps have the same benchmark, but working on realistic data from a file (a sketch follows below).

*In my experience, the cost of using lookup tables often only appears in real code, where cache pressure and predictability become worse. This is especially true of lazy, range-based code. If several layers of a range use different lookup tables, you can quickly end up ruining cache performance compared to an eager approach.
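A minimal sketch of such a file-driven benchmark, assuming a representative "input.txt" supplied by the user (the table is rebuilt here to keep the sketch self-contained):

import std.datetime, std.file, std.stdio;

immutable bool[256] tab;
static this()
{
    foreach (u; 0 .. 0x100)
    {
        auto c = cast(ubyte)u;
        tab[u] = ('0' <= c && c <= '9') || ('A' <= c && c <= 'Z')
              || ('a' <= c && c <= 'z') || c == '_' || c == '$';
    }
}

void main()
{
    // run-time data defeats constant folding and decorrelates the branches
    auto data = cast(ubyte[])read("input.txt");
    StopWatch sw;
    sw.start();
    int hits;
    foreach (c; data)
        hits += tab[c];
    sw.stop();
    writefln("%s identifier chars in %s ms", hits, sw.peek().msecs);
}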
Apr 01 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 12:23 PM, John Colvin wrote:
 It's not really possible to tell anything from that benchmark, especially with
 fancy modern optimisers and branch predictors.
I disagree, read on.
 1) ldc and gdc both optimise away some/all of the tests respectively.
 2) once isIdentifierCharX is inlined, the compiler can determine what the
 results will be before-hand.
This is always an issue with benchmarking in order to find the best way to write an algorithm. I just adjust it so the particular compiler used can't throw it away.
 3) Even without 1) and 2), the results are unrealistically (very!) correlated,
 leading to a branch-prediction bonanza. I'm sure you know how significant that
 can be on modern CPUs. It is also very friendly to pre-fetching of the lookup
 table*
This is exactly why I wanted to test it.
 Perhaps have the same benchmark, but working on realistic data from a file.
Switching to it in Warp produced a very significant speedup on real files. https://github.com/facebook/warp/pull/5
Apr 01 2014
parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
 Switching to it in Warp produced a very significant speedup on 
 real files.

 https://github.com/facebook/warp/pull/5
That's to be expected, as warp will call it often and it will stay in the cache. For a program that calls it less often, the other methods may be faster. Basically, it depends on the use case.

Glad to see it's helped in warp. Any idea how it compares to clang with this improvement?
Apr 01 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 2:30 PM, Peter Alexander wrote:
 Any idea how it is comparing to clang with this improvement?
No data yet. Stay tuned.
Apr 01 2014
prev sibling next sibling parent reply "JR" <zorael gmail.com> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 immutable bool[256] tab2;
 static this()
 {
     for (size_t u = 0; u < 0x100; ++u)
     {
         tab2[u] = isIdentifierChar0(cast(ubyte)u);
     }
 }
Table lookups are awesome.

While tab2 there lends itself to being populated in a static this(), it would really lower the bar if we could define static immutable AA literals at compile-time, or at least CTFE-fill them. Having to rely on static this() for such is unfortunate and hacky.

Relevant is http://forum.dlang.org/thread/oxnwtojtdymopgvanbwz forum.dlang.org in which I get a speed increase of >3.5x by doing string-to-enum translation/mapping using someEnum[string] AA lookups, instead of using someString.to!someEnum (a sketch of that mapping follows below).

(Given CtAAs, std.conv.to could then generate such tables for various translations where cheap -- assuming we still like AAs?)
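A minimal sketch of that kind of mapping, with a hypothetical Color enum standing in for someEnum; note the AA still has to be filled in a module constructor, which is exactly the complaint above:

import std.conv : to;
import std.traits : EnumMembers;

enum Color { red, green, blue }

immutable Color[string] colorOf;
shared static this()
{
    Color[string] tmp;
    foreach (m; EnumMembers!Color)
        tmp[m.to!string] = m;      // "red" -> Color.red, etc.
    colorOf = cast(immutable) tmp;
}

// usage: auto c = colorOf["green"];   // instead of "green".to!Color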
Apr 01 2014
next sibling parent reply Artur Skawina <art.08.09 gmail.com> writes:
On 04/01/14 21:35, JR wrote:
 On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 immutable bool[256] tab2;
 static this()
 {
     for (size_t u = 0; u < 0x100; ++u)
     {
         tab2[u] = isIdentifierChar0(cast(ubyte)u);
     }
 }
 Table lookups are awesome. While tab2 there lends itself to being populated in a static this() it would really lower the bar if we could define static immutable AA literals at compile-time. Or at least CTFE-fill them. Having to rely on static this() for such is unfortunate and hacky.
You mention AAs, so you may already realize this, but there is no need for a mod ctor in the above example:

   immutable bool[256] tab2 = {
       bool t[256];
       for (size_t u = 0; u < 0x100; ++u)
           t[u] = isIdentifierChar0(cast(ubyte)u);
       return t;
   }();

"Static immutable AAs" can be built at CT likewise, it's just that their type will be different from "std" RT AAs.

artur
Apr 01 2014
parent "bearophile" <bearophileHUGS lycos.com> writes:
Artur Skawina:

 You mention AAs, so you may already realize this, but there is 
 no
 need for a mod ctor in the above example:

    immutable bool[256] tab2 = {
       bool t[256];
       for (size_t u = 0; u < 0x100; ++u)
           t[u] = isIdentifierChar0(cast(ubyte)u);
       return t;
    }();
A recent improvement in foreach loops allows removing the cast from your code, but I have found a new compiler bug:
https://d.puremagic.com/issues/show_bug.cgi?id=12504

Bye,
bearophile
Apr 01 2014
prev sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 4/1/14, Artur Skawina <art.08.09 gmail.com> wrote:
 You mention AAs, so you may already realize this, but there is no
 need for a mod ctor in the above example:

    immutable bool[256] tab2 = {
       bool t[256];
       for (size_t u = 0; u < 0x100; ++u)
           t[u] = isIdentifierChar0(cast(ubyte)u);
       return t;
    }();
Also (untested):

immutable bool[256] tab2 =
    iota(0, 0x100)
    .map!(i => isIdentifierChar0(cast(ubyte)i))
    .array;
Apr 01 2014
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 1:57 PM, Andrej Mitrovic wrote:
 immutable bool[256] tab2 =
      iota(0, 0x100)
      .map!(i => isIdentifierChar0(cast(ubyte)i))
      .array;
Yup, that works. Purty nice!
Apr 01 2014
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Andrej Mitrovic:

 Also (untested):

 immutable bool[256] tab2 =
     iota(0, 0x100)
     .map!(i => isIdentifierChar0(cast(ubyte)i))
     .array;
Eventually this should work, and avoid the nasty cast:

immutable bool[256] tab3 =
    ubyte
    .max
    .iota!"[]"
    .map!isIdentifierChar0
    .array;

Bye,
bearophile
Apr 01 2014
parent reply "Brad Anderson" <eco gnuk.net> writes:
On Tuesday, 1 April 2014 at 22:32:28 UTC, bearophile wrote:
 Eventually this should work, and avoid the nasty cast:

 immutable bool[256] tab3 =
     ubyte
     .max
     .iota!"[]"
I can't decide if this part is brilliant or terrible. Clever, either way.
Apr 02 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Brad Anderson:

 immutable bool[256] tab3 =
    ubyte
    .max
    .iota!"[]"
I can't decide if this part is brilliant or terrible. Clever, either way.
That iota enhancement has nothing terrible: the "[]" syntax is already used in Phobos for std.random.uniform, and it's quite needed; otherwise iota can't generate the right bound of a type (like the last ubyte, last byte, last char, last int, etc.).

Bye,
bearophile
Apr 02 2014
parent "Brad Anderson" <eco gnuk.net> writes:
On Wednesday, 2 April 2014 at 21:58:33 UTC, bearophile wrote:
 Brad Anderson:

 immutable bool[256] tab3 =
   ubyte
   .max
   .iota!"[]"
I can't decide if this part is brilliant or terrible. Clever, either way.
 That iota enhancement has nothing terrible: the "[]" syntax is already used in Phobos for std.random.uniform, and it's quite needed; otherwise iota can't generate the right bound of a type (like the last ubyte, last byte, last char, last int, etc.).

 Bye,
 bearophile
I was more referring to ubyte.max.iota :)
Apr 02 2014
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 1:57 PM, Andrej Mitrovic wrote:
 immutable bool[256] tab2 =
      iota(0, 0x100)
      .map!(i => isIdentifierChar0(cast(ubyte)i))
      .array;
Incorporated your idea: https://github.com/facebook/warp/pull/5/files
Apr 01 2014
parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 Incorporated your idea:
A bit of statistics: I have counted about 260 "cast(" in the whole of the D files (the total is 7752 CLOC lines of D code, plus 1008 blank lines and 905 lines of comments; I have not counted the lines of unittests). The files with the highest number of casts are:

macros.d, with 86 casts
constexpr.d, with 39 casts
directive.d, with 30 casts
lexer.d, with 28 casts
stringlit.d, with 18 casts
id.d, with 16 casts
context.d, with 11 casts
skip.d, with 9 casts

Bye,
bearophile
Apr 01 2014
prev sibling next sibling parent "Tove" <tove fransson.se> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:
 ---------------------------------
Original program...
3 4 1

10x as many iterations...
36 47 19

I think this benchmark is flawed...

1) Apparently there are rounding issues... assuming linear scaling, 1.9 ms yields ~1 ms? Probably better to use usecs...

2) Table lookups will nearly always win in isolated 1-function tests... but not necessarily in real apps.
Apr 01 2014
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
01-Apr-2014 22:35, Walter Bright writes:
 Try this benchmark comparing various classification schemes:
 [...]
Would strongly suggest using 2 kinds of data - randomized and some text. And, of course, it must be loaded at run-time.

-- 
Dmitry Olshansky
Apr 01 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 12:46 PM, Dmitry Olshansky wrote:
 Would strongly suggest to use 2 kinds of data - randomized and some text. And,
 of course, it must be loaded at run-time.
See Vladimir's post.
Apr 01 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Apr-2014 02:09, Walter Bright writes:
 On 4/1/2014 12:46 PM, Dmitry Olshansky wrote:
 Would strongly suggest to use 2 kinds of data - randomized and some
 text. And,
 of course, it must be loaded at run-time.
See Vladimir's post.
Cool. All the more confidence in using multi-stage tables for Unicode stuff ;)

-- 
Dmitry Olshansky
Apr 02 2014
prev sibling next sibling parent reply "Ivan Kazmenko" <gassa mail.ru> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:
 ---------------------------------
10_000 iterations:
Milliseconds 7 10 2
Milliseconds 5 5 1
Milliseconds 8 10 2

100_000 iterations:
Milliseconds 78 93 21
Milliseconds 78 94 21
Milliseconds 78 92 21

1_000_000 iterations:
Milliseconds 584 614 146
Milliseconds 541 612 147
Milliseconds 502 609 146
Milliseconds 529 613 147
Milliseconds 539 606 147

This is DMD 32-bit 2.065.0 on an Intel Core i7. The options are -O -release -inline -noboundscheck.

A couple of comments:

1. The benchmark seems rather artificial to me. The natural order of input is fixed, which is nice for both the branch predictor (long streaks of each branch) and the array access (sequential).

2. The latter function wins, no wonder though, since there's no branching. (I tried to augment it by actually using the cache in between, but it still wins against my attempts so far.)

Ivan Kazmenko.
Apr 01 2014
parent "Ivan Kazmenko" <gassa mail.ru> writes:
On Tuesday, 1 April 2014 at 21:34:29 UTC, Ivan Kazmenko wrote:
 On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 2. The latter function wins, no wonder though, since there's no 
 branching.  (I tried to augment it by actually using the cache 
 in between, but it still wins against my attempts so far.)
The "it" being augmented is the benchmark, the "it" winning being the function isIdentifierChar2. This modification of the benchmark pretends to use a stack-allocated array alongside calling the funcions of interest: http://dpaste.dzfl.pl/1ed247445308 The timings are: Milliseconds 623 590 229 Milliseconds 629 592 231 Milliseconds 619 596 233 Milliseconds 524 592 283 Funnily, on DPaste, the picture changes: Milliseconds 1864 1349 1303 Perhaps there is no "-O -release -inline -noboundscheck" tuning there. Ivan Kazmenko.
Apr 01 2014
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:
I modified it a bit to use the Phobos source code as the test data:
http://dump.thecybershadow.net/7a685ffb8588ec836e3ae0dba14e0618/test.d

My results (DMD git HEAD, -O -inline -release):

x86: 41070 29642 5169
x64: 42975 29759 5822

I've been using table lookups in my library for a while, e.g. for ASCII case conversions:
https://github.com/CyberShadow/ae/blob/master/utils/text.d#L323-L341
or escaping JSON strings:
https://github.com/CyberShadow/ae/blob/master/utils/json.d#L150-L188
Apr 01 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 3:00 PM, Vladimir Panteleev wrote:
 I modified it a bit to use the Phobos source code as the test data:
 http://dump.thecybershadow.net/7a685ffb8588ec836e3ae0dba14e0618/test.d
Nice. I especially like the use of dirEntries to read all the text. That should go into the phobos documentation!
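Along the same lines, a minimal sketch of the dirEntries approach, assuming a "phobos" source directory relative to the working directory:

import std.algorithm : joiner, map;
import std.array : array;
import std.file : dirEntries, SpanMode, read;

void main()
{
    // concatenate every .d file under the directory into one run-time buffer
    auto corpus = dirEntries("phobos", "*.d", SpanMode.depth)
        .map!(e => cast(ubyte[])read(e.name))
        .joiner
        .array;
    // feed corpus to the classification functions under test
}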
Apr 01 2014
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Do I see it right that it really comes down to one use of
isIdentifierStart(c) in one huge switch-case? Since "warp"
seems to be interested only in ASCII properties since it is
processing C code, could the switch-case be optimized into a
256 entries large jump table?

If a table lookup (char->bool) is faster than multiple
comparisons, why shouldn't this apply to char->address as well?
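
For illustration, a minimal sketch of such a char-to-address table; onIdentChar/onOther are hypothetical stand-ins for real lexer actions:

alias Handler = int function(ubyte);

int onIdentChar(ubyte c) { return 1; }  // stand-in actions
int onOther(ubyte c)     { return 0; }

immutable Handler[256] dispatch;
static this()
{
    foreach (u; 0 .. 0x100)
    {
        auto c = cast(ubyte)u;
        bool ident = ('0' <= c && c <= '9') || ('A' <= c && c <= 'Z')
                  || ('a' <= c && c <= 'z') || c == '_' || c == '$';
        dispatch[u] = ident ? &onIdentChar : &onOther;
    }
}

// usage: auto r = dispatch[c](c);  // one indexed load + indirect call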

-- 
Marco
Apr 01 2014
parent reply "Daniel N" <ufo orbiting.us> writes:
On Wednesday, 2 April 2014 at 03:47:58 UTC, Marco Leise wrote:
 Do I see it right that it really comes down to one use of
 isIdentifierStart(c) in one huge switch-case? Since "warp"
 seems to be interested only in ASCII properties since it is
 processing C code, could the switch-case be optimized into a
 256 entries large jump table?

 If a table lookup (char->bool) is faster than multiple
 comparisons, why shouldn't this apply to char->address as well?
I was thinking the same; sure would be nice if computed gotos were standardized (as suggested by Bearophile many times).
Apr 02 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Apr-2014 12:27, Daniel N writes:
 On Wednesday, 2 April 2014 at 03:47:58 UTC, Marco Leise wrote:
 Do I see it right that it really comes down to one use of
 isIdentifierStart(c) in one huge switch-case? Since "warp"
 seems to be interested only in ASCII properties since it is
 processing C code, could the switch-case be optimized into a
 256 entries large jump table?

 If a table lookup (char->bool) is faster than multiple
 comparisons, why shouldn't this apply to char->address as well?
I was thinking the same, sure would be nice if computed goto:s were standardized. (as suggested by Bearophile many times)
_Explicit_ tail call (aka switch-call) would have been cleaner and more modular. It covers all relevant cases of computed goto such as threaded-code interpreters and also recursion-heavy functional code.

-- 
Dmitry Olshansky
Apr 02 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Dmitry Olshansky:

 _Explicit_ tail call (aka switch-call) would have been cleaner 
 and more modular. It covers all relevant cases of computed-goto 
 such as threaded-code interpreters and also recursion-heavy 
 functional code.
It looks like an additive change. Can you show a possible syntax and semantics for D? If it's so good it could also become a DIP later.

Bye,
bearophile
Apr 02 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Apr-2014 15:26, bearophile writes:
 Dmitry Olshansky:

 _Explicit_ tail call (aka switch-call) would have been cleaner and
 more modular. It covers all relevant cases of computed-goto such as
 threaded-code interpreters and also recursion-heavy functional code.
It looks like an additive change. Can you show a possible syntax and semantics for D? If it's so good it could also become a DIP later.
Easily. In the grammar, a new kind of statement:

SwitchCallStatement:
    goto CallExpression;

Semantics: same as that of a function call expression, except:

a) One can switch-call only to a function with the same return type.

b) Anything below this statement is unreachable, i.e. treat it as a return.

c) The stack frame of the current function is overwritten with that of the switched-to function. In other words, after a switch-call the stack looks as if the switched-to function had been called instead in the first place.

The last point is what modern tail-call optimizations do when one has 2 functions that tail-recurse to each other: they overwrite stack frames on such tail calls. However, this only happens with a certain optimization switch, which IMO makes the thing totally unreliable. In fact I even observed that I could implement a threaded-code interpreter by relying on tail-call optimization with LDC, but only when compiling with all optimizations enabled. Unoptimized builds simply blow up the stack. This "semantics are tied to optimization" is a situation that I hate about C/C++ and would hope to address in D. A minimal illustration follows.

-- 
Dmitry Olshansky
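A minimal illustration in today's D (standard code, not the proposed syntax) of why the guarantee matters:

// Mutually tail-recursive: each call site is a switch-call candidate.
bool isEven(uint n) { return n == 0 ? true  : isOdd(n - 1); }
bool isOdd (uint n) { return n == 0 ? false : isEven(n - 1); }

// isEven(100_000_000) overflows the stack unless the optimizer happens to
// turn both tail calls into jumps; `goto isOdd(n - 1);` would guarantee it.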
Apr 02 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 12:03 PM, Dmitry Olshansky wrote:
 SwitchCallStatement:
      goto CallExpression;

 Semantics:
 Same as that of a function call expression except:
 a) Can Switch-Call only to a function with the same return type.
 b) Anything below this statement is unreachable, i.e. treat it as return.
 c) Stack frame of the current function is overwritten with that of switched-to
 function. In other words after a switch-call stack looks as if the switched-to
 function was called instead in the first place.

 The last point is what modern tail-call optimizations do when one has 2
 functions that tail-recurse to each other. They overwrite stackframes on such
 tail calls.
How is this different from simply recognizing "return foo(arg);" as being a goto?
 However this only happens with certain optimization switch which IMO makes the
 thing totally unreliable. In fact I even observed that I could implement
 threaded-code interpreter by relying on tail-call optimization with LDC, but
 only when compiling with all optimizations enabled. Unoptimized builds simply
 blowup stack. This "semantics are tied to optimization" is a situation that I
 hate about C/C++ and would hope to address in D.
Doesn't seem terribly different than the "force inline" feature discussion.
Apr 02 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Apr-2014 00:39, Walter Bright writes:
 On 4/2/2014 12:03 PM, Dmitry Olshansky wrote:
 SwitchCallStatement:
      goto CallExpression;

 Semantics:
 Same as that of a function call expression except:
 a) Can Switch-Call only to a function with the same return type.
 b) Anything below this statement is unreachable, i.e. treat it as return.
 c) Stack frame of the current function is overwritten with that of
 switched-to
 function. In other words after a switch-call stack looks as if the
 switched-to
 function was called instead in the first place.

 The last point is what modern tail-call optimizations do when one has 2
 functions that tail-recurse to each other. They overwrite stackframes
 on such
 tail calls.
How is this different from simply recognizing "return foo(arg);" as being a goto?
If we can alter semantics of return foo(arg); to _always_ be a goto, guarantee a jump and reuse of the stack, then I'm all for it.
 However this only happens with certain optimization switch which IMO
 makes the
 thing totally unreliable. In fact I even observed that I could implement
 threaded-code interpreter by relying on tail-call optimization with
 LDC, but
 only when compiling with all optimizations enabled. Unoptimized builds
 simply
 blowup stack. This "semantics are tied to optimization" is a situation
 that I
 hate about C/C++ and would hope to address in D.
Doesn't seem terribly different than the "force inline" feature discussion.
Yes, it shows the same problem: semantics and optimization are often mixed up. In this case, however, a program would segfault if the relevant optimization was not performed; it wouldn't just be many times slower.

-- 
Dmitry Olshansky
Apr 02 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 1:54 PM, Dmitry Olshansky wrote:
 If we can alter semantics of
 return foo(arg);
 to _always_ be a goto, guarantee a jump and reuse of the stack,  then I'm all
 for it.
I don't see why not. Note that we couldn't do this for extern(C) functions, or variadics, or caller functions with parameters that need destruction, or parameters that may refer to any locals. That last constraint might be a doozy, however. Why not submit an enhancement request to bugzilla?
Apr 02 2014
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 I don't see why not. Note that we couldn't do this for 
 extern(C) functions, or variadics, or caller functions with 
 parameters that need destruction, or parameters that may refer 
 to any locals. That last constraint might be a doozy, however.

 Why not submit an enhancement request to bugzilla?
So what happens if I (by mistake or by ignorance) try to use this on a function that doesn't satisfy one of those requirements? D can't give an error in that case because it's a big breaking change. So does D in that case just silently not perform the tail call as the programmer meant?

Bye,
bearophile
Apr 02 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 3:06 PM, bearophile wrote:
 Walter Bright:

 I don't see why not. Note that we couldn't do this for extern(C) functions, or
 variadics, or caller functions with parameters that need destruction, or
 parameters that may refer to any locals. That last constraint might be a
 doozy, however.

 Why not submit an enhancement request to bugzilla?
 So what happens if I (by mistake or by ignorance) try to use this on a function that doesn't satisfy one of those requirements?
Then it won't tail call it.
 D can't give an error in that case because it's a big breaking change. So does D in that case just silently not perform the tail call as the programmer meant?
That's right.
Apr 02 2014
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 D can't give an error in that case because it's a big breaking change. So does D in that case just silently not perform the tail call as the programmer meant?
That's right.
Then this feature needs a specific and explicit syntax, or it has _no_ point at all. (Note: I am currently not asking for this feature in D. I have just shown one of its minimal requirements.)

Bye,
bearophile
Apr 02 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 6:21 PM, bearophile wrote:
 Then this feature needs a specific and explicit syntax, or it has _no_ point at
 all.
That's like saying inlining has no point because it doesn't have a particular syntax.
Apr 02 2014
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 That's like saying inlining has no point because it doesn't 
 have a particular syntax.
Hopefully Dmitry Olshansky can explain why you are probably wrong regarding the proposed feature.

Bye,
bearophile
Apr 03 2014
prev sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 04/03/14 04:40, Walter Bright wrote:
 On 4/2/2014 6:21 PM, bearophile wrote:
 Then this feature needs a specific and explicit syntax, or it has _no_ point at
 all.
That's like saying inlining has no point because it doesn't have a particular syntax.
Weak inlining *hints* have no point - C learned this the hard way, with "inline" being useless today and everybody having to use "always_inline/forceinline".

The issue isn't about optimizations, which any decent compiler can deal with; it's about the programmer being able to rely on avoiding certain undesirable effects, like excessive stack consumption, unnecessary clobbering of the few precious cpu registers available, etc.

What Bearophile is saying is that an optional tail call optimization being mandated by a spec would be completely useless -- TCO can be legally done by the compiler anyway; documenting that it could happen does not help the programmer in any way - (s)he still has to assume that the optimization might not happen. A syntax like "goto return f();" which mandates TCO would make the situation different.

[Defining computed gotos would be a better idea, as the tail-call transformation would significantly raise the requirements for a compliant D compiler. OTOH computed gotos would lead to more readable D code, are easier to implement, and can easily be made safe in D (by making the type of label-addresses unique per-function and avoiding function-cloning).]

artur
Apr 03 2014
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Apr-2014 05:05, Walter Bright writes:
 On 4/2/2014 3:06 PM, bearophile wrote:
 Walter Bright:

 I don't see why not. Note that we couldn't do this for extern(C)
 functions, or
 variadics, or caller functions with parameters that need destruction, or
 parameters that may refer to any locals. That last constraint might be a
 doozy, however.

 Why not submit an enhancement request to bugzilla?
So what does it happen if I (by mistake or by ignorance) try to use this on a function that doesn't satisfy one of those requirements?
Then it won't tail call it.
 D can't give an error in that case because it's a big breaking change.
 So is D in that case just
 silently not performing the tail call as the programmer meant?
That's right.
From this it seems to me that once again you've walled yourself out of this question and are going to frame it as an optimization. Tail call has 2 sides to it:

a) An optimization for functions that return with a call expression. I'd leave this case an optimization because it also has an undesirable effect on the call stack, hindering debugging.

b) An explicit instruction to switch execution to another function. This doesn't have any unexpected effects, and failing to make a tail call will change the semantics of the code. This is critical for certain kinds of code, which simply don't work if the tail call is not a jump.

I agree that nothing prevents doing the optimization of

return foo(args);

in case a) if optimization is enabled, and this is something GCC/LLVM already do. But more importantly there should be an explicit tool for case b), with the syntax:

goto foo(args);

that will fail to compile if a tail call is not possible. Because the alternative - double-checking on every compiler that "return foo(args)" in all the right places is optimized the right way - is fragile, and would turn out to be an _ongoing_ waste of time.

-- 
Dmitry Olshansky
Apr 03 2014
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Wed, 02 Apr 2014 18:05:32 -0700, Walter Bright <newshound2 digitalmars.com> wrote:

 On 4/2/2014 3:06 PM, bearophile wrote:
 Walter Bright:

 I don't see why not. Note that we couldn't do this for extern(C) functions, or
 variadics, or caller functions with parameters that need destruction, or
 parameters that may refer to any locals. That last constraint might be a
 doozy, however.

 Why not submit an enhancement request to bugzilla?
So what happens if I (by mistake or by ignorance) try to use this on a function that doesn't satisfy one of those requirements?
Then it won't tail call it.
How would `return foo(args)' be used in the C preprocessor example? Would it be implemented with a jump table like this?

char c = ...;
return caseTable[c](args);

What are the benefits over "goto"? I figure with "goto" the semantics are clearer and you can easily access the variables on the stack. Does the TCO version enable any more use-cases? (Note: I consider "far less convoluted code" a valid use-case. :) )

-- 
Marco
Apr 05 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Apr-2014 01:51, Walter Bright writes:
 On 4/2/2014 1:54 PM, Dmitry Olshansky wrote:
 If we can alter semantics of
 return foo(arg);
 to _always_ be a goto, guarantee a jump and reuse of the stack,  then
 I'm all
 for it.
I don't see why not. Note that we couldn't do this for extern(C) functions, or variadics, or caller functions with parameters that need destruction, or parameters that may refer to any locals. That last constraint might be a doozy, however. Why not submit an enhancement request to bugzilla?
I'd gladly author an enhancement once I'm certain of what exactly to state in it.

Always optimizing return foo(args) to a jump may have an impact on the debugging experience (curious call stacks) and cost people hours of wasted time chasing shadows. Adding an explicit form for it seems to me far more appealing and honest.

-- 
Dmitry Olshansky
Apr 03 2014
prev sibling next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:

 bool isIdentifierChar1(ubyte c)
 {
     return ((c >= '0' || c == '$') &&
             (c <= '9' || c >= 'A')  &&
             (c <= 'Z' || c >= 'a' || c == '_') &&
             (c <= 'z'));
 }
I'd like to point out this is quite a complicated function to begin with, so it doesn't generalize to all the isXXX-in-ASCII functions, for which the tests would be fairly simpler.

In any case (on my win32 machine), I can go from 810 msecs to 500 msecs using this function instead:

bool isIdentifierChar1(ubyte c)
{
    return c <= 'z' &&
        ( 'a' <= c ||
          ('0' <= c && (c <= '9' || c == '_' || ('A' <= c && c <= 'Z'))) ||
          c == '$');
}

That said, I'm abusing the fact that 50% of your bench is for chars over 0x80. If I loop only over actual ASCII you can find in text (0x20 - 0x80), then those numbers "only" go from "320" => "300". Only slightly better, but still a win.

*BUT*, if your functions were to accept any arbitrary codepoint, it would absolutely murder.
Apr 02 2014
next sibling parent reply "Daniel N" <ufo orbiting.us> writes:
 On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:

 bool isIdentifierChar1(ubyte c)
 {
    return ((c >= '0' || c == '$') &&
            (c <= '9' || c >= 'A')  &&
            (c <= 'Z' || c >= 'a' || c == '_') &&
            (c <= 'z'));
 }
Considering SSE4.2 was released already back in 2008 and was designed to accelerate this very use-case... Maybe it's more beneficial to add a few new code-paths... SSE4.2, NEON, etc.
Apr 02 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Apr-2014 20:52, Daniel N writes:
 On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:
 [...]
 Considering SSE4.2 was released already back in 2008 and was designed to accelerate this very use-case... Maybe it's more beneficial to add a few new code-paths... SSE4.2, NEON, etc.
Ideally we'd need:

a) A collection of meaningful (high-level compared to std.simd) primitives that can be accelerated by SIMD, in Phobos. Such as searching, bit-counting etc.

b) A way to automatically use whatever is best and supported by the hardware. GCC has had zero-overhead support for that since around 4.6 (?). A rough sketch of such run-time selection follows.

-- 
Dmitry Olshansky
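A rough sketch of b), assuming core.cpuid's sse42 query; countScalar/countSSE are hypothetical stand-ins for real implementations:

import core.cpuid : sse42;

alias CountFn = size_t function(const(ubyte)[] data);

size_t countScalar(const(ubyte)[] data) { return data.length; } // stand-in bodies
size_t countSSE(const(ubyte)[] data)    { return data.length; }

immutable CountFn count;
shared static this()
{
    // pick the best supported implementation once, at startup
    count = sse42() ? &countSSE : &countScalar;
}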
Apr 02 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 12:38 PM, Dmitry Olshansky wrote:
 Ideally we'd need:
 a) A collection of meaningful (high-level compared to std.simd) primitives that
 can be accelerated by SIMD, in Phobos. Such as searching, bit-counting etc.
memchr is already optimized by SIMD, and std.algorithm.find uses memchr. In general, enormous effort has been poured into optimizing C standard library functions. If an algorithm can be implemented using such functions, we ought to.
Apr 02 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Apr-2014 00:46, Walter Bright writes:
 On 4/2/2014 12:38 PM, Dmitry Olshansky wrote:
 Ideally we'd need:
 a) A collection of meaningful (high-level compared to std.simd)
 primitives that
 can be accelerated by SIMD, in Phobos. Such as searching, bit-counting
 etc.
memchr is already optimized by SIMD, and std.algorithm.find uses memchr. In general, enormous effort has been poured into optimizing C standard library functions. If an algorithm can be implemented using such functions, we ought to.
This is all good and well, but... memchr works at the level of a single byte, leaving nothing for 2-, 4- and 8-byte-wide needles. Not type-safe. No fallback for CTFE. Obviously the C standard is lacking many things, hence the endless row of extensions.

-- 
Dmitry Olshansky
Apr 02 2014
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 2 April 2014 at 21:18:19 UTC, Dmitry Olshansky wrote:
 03-Apr-2014 00:46, Walter Bright writes:
 On 4/2/2014 12:38 PM, Dmitry Olshansky wrote:
 memchr is already optimized by SIMD, and std.algorithm.find 
 uses memchr.
 In general, enormous effort has been poured into optimizing C 
 standard
 library functions. If an algorithm can be implemented using 
 such
 functions, we ought to.
This is all good and well, but... memchr is working on the level of a single byte, leaving nothing for 2, 4, 8 byte-wide needles. Not type-safe. No fallback for CTFE.
I found myself needing "wmemchr" (and optionally "dmemchr") for speeding up find for wstrings and dstrings.

I realized "wmemchr" exists, but its element type is platform defined: ushort (w) on Windows, and uint (d) on Linux. So it's hard to use, and definitely not the "W" version I'd expect in D code.

If I were to request the actual "memchr"/"wmemchr"/"dmemchr" functions in some "core.???" module, is this something we'd want, and would somebody know how to provide an efficient implementation? A naive sketch follows.
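A naive, scalar sketch of the requested family, just to pin down a shape (memchrT is a hypothetical name; a real version would use SIMD):

// One generic routine covering memchr/wmemchr/dmemchr element widths.
const(T)* memchrT(T)(const(T)* haystack, T needle, size_t n)
{
    foreach (i; 0 .. n)
        if (haystack[i] == needle)
            return haystack + i;
    return null;
}

// usage: auto p = memchrT("hello world"d.ptr, 'w', 11);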
Apr 02 2014
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/2/14, 11:19 PM, monarch_dodra wrote:
 If I were to request the actual "memchr"/"wmemchr"/"dmemchr" functions
 in some "core.???" module, is this something we'd want, and would
 somebody know how to provide an efficient implementation?
Yes. Probably checking stuff with SIMD and a few other approaches would be great.

Andrei
Apr 03 2014
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 3 April 2014 at 19:39:57 UTC, Andrei Alexandrescu wrote:
 On 4/2/14, 11:19 PM, monarch_dodra wrote:
 If I were to request the actual "memchr"/"wmemchr"/"dmemchr" 
 functions
 in some "core.???" module, is this something we'd want, and 
 would
 somebody know how to provide an efficient implementation?
Yes. Probably checking stuff with SIMD and a few other approaches would be great. Andrei
I filed it here: https://d.puremagic.com/issues/show_bug.cgi?id=12515 Now, we just need someone to actually do it :)
Apr 03 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Apr-2014 10:19, monarch_dodra writes:

 If I were to request the actual "memchr"/"wmemchr"/"dmemchr" functions
 in some "core.???" module, is this something we'd want, and would
 somebody know how to provide an efficient implementation?
Something I've wanted in D since about 2011. Should have put in an enhancement request or tipped bearophile ;)

-- 
Dmitry Olshansky
Apr 03 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 4:49 AM, monarch_dodra wrote:
 That said, I'm abusing the fact that 50% of your bench is for chars over 0x80.
 If I loop only on actual ASCII you can find in text, (0x20 - 0X80), then those
 numbers "only" go from "320" => "300". Only slightly better, but still a win.
Surely a better approach would be to do a statistical analysis of character frequency in a representative corpus, and tune the compares that way. But the table lookup offered such a dramatic improvement, I didn't bother.
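Such a frequency count is only a few lines (a sketch, assuming a representative "corpus.txt"):

import std.file : read;
import std.stdio : writefln;

void main()
{
    size_t[256] freq;
    foreach (c; cast(ubyte[])read("corpus.txt"))
        ++freq[c];
    // order the comparison chain by descending count of each byte
    foreach (c; 0 .. 256)
        if (freq[c])
            writefln("0x%02x: %s", c, freq[c]);
}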
Apr 02 2014
prev sibling next sibling parent reply Iain Buclaw <ibuclaw gdcproject.org> writes:
On 1 April 2014 19:35, Walter Bright <newshound2 digitalmars.com> wrote:
 Try this benchmark comparing various classification schemes:
GDC turns switches into table lookups, and speeds are comparable on my system, though your little hack appears faster in release builds. Tests are with 1 million runs.

---
Non-release run:
Milliseconds 1797 1642 1501 1463

Non-release run, no bounds checking:
Milliseconds 1730 1611 1483 1512

Release run:
Milliseconds 1753 1633 1471 1528

Optimised run:
Milliseconds 0 0 0 0
---

It is interesting that switches seem to take longer when bounds checking is turned off, and even longer when in release mode. However, this is not very conclusive as I can't test with optimisations turned on. Unfortunately for your small program, gcc is too smart and optimises everything away.

---
bool isIdentifierChar3(ubyte c)
{
    switch(c)
    {
        case '0': .. case '9':
        case 'A': .. case 'Z':
        case 'a': .. case 'z':
        case '$':
        case '_':
            return true;
        default:
            return false;
    }
}
Apr 07 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
 ... Unfortunately for your small program, gcc is too smart and optimises everything away.
Haha, that's funny. But why don't you put the data in a separate compilation unit?
Apr 07 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2014 2:41 AM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> wrote:
 ... Unfortunately for your small program, gcc is too smart and optimises everything away.

 Haha, that's funny. But why don't you put the data in a separate compilation unit?
True, it's pretty easy to defeat that optimization.
Apr 07 2014
prev sibling next sibling parent reply Martin Nowak <code dawg.eu> writes:
On 04/01/2014 08:35 PM, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:
Counter example: https://github.com/D-Programming-Language/phobos/commit/8599d6b4acde0eff886876a56f54ecd8e6b528e9
Apr 07 2014
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 7 April 2014 at 11:35:37 UTC, Martin Nowak wrote:
 On 04/01/2014 08:35 PM, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:
Counter example: https://github.com/D-Programming-Language/phobos/commit/8599d6b4acde0eff886876a56f54ecd8e6b528e9
It's all a matter of how complicated the "non-table" implementation is, I guess. Table lookup has a "steep entry cost", but is constant, regardless of complexity. Table lookups were "recently" removed from std.ascii too. The implementation Walter is benching "against" is particularly complicated here.
Apr 07 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2014 7:11 AM, monarch_dodra wrote:
 It's all a matter of how complicated the "non-table" implementation is I guess.
Yup. Memory lookups can be expensive. Once it's in the cache, it's cheap.
Apr 07 2014
prev sibling parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
I added a lookup scheme of my own. It's not as fast as Walter's (in fact 
it's the slowest without -inline -release -O) but it uses 1 bit per 
entry in the table instead of a whole byte, so you can have lots and lots 
of different tables. I'm even reasonably sure that it works correctly!


===============================
import core.stdc.stdlib;
import core.stdc.string;

import std.algorithm;
import std.array;
import std.ascii;
import std.datetime;
import std.range;
import std.stdio;
import std.traits;

bool isIdentifierChar0(ubyte c)
{
     return isAlphaNum(c) || c == '_' || c == '$';
}

bool isIdentifierChar1(ubyte c)
{
     return ((c >= '0' || c == '$') &&
             (c <= '9' || c >= 'A')  &&
             (c <= 'Z' || c >= 'a' || c == '_') &&
             (c <= 'z'));
}

immutable bool[256] tab2;
static this()
{
     for (size_t u = 0; u < 0x100; ++u)
     {
         tab2[u] = isIdentifierChar0(cast(ubyte)u);
     }
}

bool isIdentifierChar2(ubyte c)
{
     return tab2[c];
}

immutable ulong[4] tab3;
static this()
{
     for (size_t u = 0; u < 0x100; ++u)
     {
         if (isIdentifierChar0(cast(ubyte)u))
         {
             auto sub = u >>> 6;
             auto b = u & 0x3f;
             auto mask = 0x01L << b;
             tab3[sub] |= mask;
         }
     }
}

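// 256 bits in four 64-bit words: c >>> 6 picks the word, c & 0x3f picks the bit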
bool isIdentifierChar3(ubyte c)
{
	auto sub = c >>> 6;
	c &= 0x3f;
	auto mask = 0x01L << c;
	return (tab3[sub] & mask) > 0;
}

int f0()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar0(cast(ubyte)u);
     }
     return x;
}

int f1()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar1(cast(ubyte)u);
     }
     return x;
}

int f2()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar2(cast(ubyte)u);
     }
     return x;
}

int f3()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar3(cast(ubyte)u);
     }
     return x;
}

void main()
{
     auto r = benchmark!(f0, f1, f2, f3)(10_000);
     writefln("Milliseconds %s %s %s %s", r[0].msecs, r[1].msecs, 
r[2].msecs, r[3].msecs);
}
Apr 17 2014
parent reply "ixid" <nuaccount gmail.com> writes:
On Thursday, 17 April 2014 at 16:27:26 UTC, Alix Pexton wrote:
 I added a lookup scheme of my own, its not as fast as Walters 
 (in fact its the slowest without -inline - release -O) but it 
 uses 1 bit per entry in the table instead of a whole byte so 
 you can have lots and lots of different tables. I'm even 
 reasonably sure that it works correctly!


 [...]
I feel like there must be a way of making a fast bit lookup, but my version is only moderate in speed. You can get all the bits you need in two 64-bit registers or one SSE register. I haven't tried bt; does that work with a 64-bit register?
Apr 17 2014
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 17 April 2014 at 18:07:24 UTC, ixid wrote:
 I feel like there must be a way of making a fast bit look up 
 but my version is only moderate in speed. You can get all the 
 bits you need on two 64 bit registers or one SSE register. I 
 haven't tried bt, does that work with a 64 bit register?
? Note it can be applied to the table in general, rather than to the bytes themselves. E.g.:

ubyte[256] buf;
auto b = bt(cast(size_t*)buf.ptr, 428);
Apr 17 2014
parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 17/04/2014 8:41 PM, monarch_dodra wrote:
 On Thursday, 17 April 2014 at 18:07:24 UTC, ixid wrote:
 I feel like there must be a way of making a fast bit look up but my
 version is only moderate in speed. You can get all the bits you need
 on two 64 bit registers or one SSE register. I haven't tried bt, does
 that work with a 64 bit register?
 ? Note it can be applied to the table in general, rather than to the bytes themselves. E.g.:

 ubyte[256] buf;
 auto b = bt(cast(size_t*)buf.ptr, 428);
I didn't think to look in core.bitop for a faster way to check bits, I'll check if it closes the gap to Walter's version. A...
Apr 18 2014
parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
I tested this against the others...
(with "-inline -release -O" of course)

===

uint[8] tab4; // bitop functions only work on uints
static this()
{
     for (size_t u = 0; u < 0x100; ++u)
     {
         if (isIdentifierChar0(cast(ubyte)u))
         {
             bts(tab4.ptr, u);
         }
     }
}

bool isIdentifierChar4(ubyte c)
{
	return bt(tab4.ptr, c) != 0;
}

===

it takes about 3 times as long (about the same as the original 
isIdentifierChar1, which used lots of &&s and ||s).

So 2 shifts, 2 ands and a compare to zero, beats core.bitop.

for clarity, the output of my benchmark for all 5 version was

Milliseconds 21 15 2 5 15

the 2 (which on some runs was a 3) is Walter's table of bools, the 5 is 
my table of ulongs and the 15 on the end is core.bitop.

Short of delving into the asm, which I'm trying to avoid, any other 
suggestions for speed-ups?

A...
Apr 18 2014
parent Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
PS

I discovered that core.bitop.bts() won't accept a pointer to immutable 
even when inside a static this. Not sure if that counts as a bug or not ><
Apr 18 2014