digitalmars.D - Table lookups - this is pretty definitive
- Walter Bright (67/67) Apr 01 2014 Try this benchmark comparing various classification schemes:
- w0rp (1/2) Apr 01 2014 Yeah, table lookup is the way to go for ASCII, it seems.
- w0rp (2/5) Apr 01 2014 After I remember to use -inline -release -O
- Denis Shelomovskij (20/87) Apr 01 2014 Some regular benchmark notes:
- Denis Shelomovskij (59/154) Apr 01 2014 Few more words about `dmd`:
- Walter Bright (3/5) Apr 01 2014 This is why I recommend, when going all out in optimizing, that at some ...
- Walter Bright (16/17) Apr 01 2014 For me, rewriting main() as:
- John Colvin (19/87) Apr 01 2014 It's not really possible to tell anything from that benchmark,
- Walter Bright (7/17) Apr 01 2014 This is always an issue with benchmarking in order to find the best way ...
- Peter Alexander (6/9) Apr 01 2014 That's to be expected as warp will call it often and it will stay
- Walter Bright (2/3) Apr 01 2014 No data yet. Stay tuned.
- JR (14/22) Apr 01 2014 Table lookups are awesome.
- Artur Skawina (12/25) Apr 01 2014 You mention AAs, so you may already realize this, but there is no
- bearophile (6/15) Apr 01 2014 A recent improvement in foreach loops allows to remove the cast
- Andrej Mitrovic (6/14) Apr 01 2014 Also (untested):
- Walter Bright (2/6) Apr 01 2014 Yup, that works. Purty nice!
- bearophile (10/15) Apr 01 2014 Eventually this should work, and avoid the nasty cast:
- Brad Anderson (3/8) Apr 02 2014 I can't decide if this part is brilliant or terrible. Clever,
- bearophile (7/13) Apr 02 2014 That iota enhancement has nothing terrible: the "[]" syntax is
- Brad Anderson (2/16) Apr 02 2014 I was more referring to ubyte.max.iota :)
- Walter Bright (3/7) Apr 01 2014 Incorporated your idea:
- bearophile (16/17) Apr 01 2014 A bit of statistics. I have counted about 260 "cast(" in the
- Tove (8/10) Apr 01 2014 Original program... 3 4 1
- Dmitry Olshansky (5/33) Apr 01 2014 Would strongly suggest to use 2 kinds of data - randomized and some
- Walter Bright (2/4) Apr 01 2014 See Vladimir's post.
- Dmitry Olshansky (5/10) Apr 02 2014 Cool. All the more confidence in using multi-stage tables for Unicode
- Ivan Kazmenko (26/28) Apr 01 2014 10_000 iterations:
- Ivan Kazmenko (16/20) Apr 01 2014 The "it" being augmented is the benchmark, the "it" winning being
- Vladimir Panteleev (12/13) Apr 01 2014 I modified it a bit to use the Phobos source code as the test
- Walter Bright (3/5) Apr 01 2014 Nice. I especially like the use of dirEntries to read all the text. That...
- Marco Leise (9/9) Apr 01 2014 Do I see it right that it really comes down to one use of
- Daniel N (3/10) Apr 02 2014 I was thinking the same, sure would be nice if computed goto:s
- Dmitry Olshansky (6/17) Apr 02 2014 _Explicit_ tail call (aka switch-call) would have been cleaner and more
- bearophile (6/10) Apr 02 2014 It looks like an additive change. Can you show a possible syntax
- Dmitry Olshansky (23/29) Apr 02 2014 Easily. In grammar, a new kind of statement:
- Walter Bright (3/21) Apr 02 2014 Doesn't seem terribly different than the "force inline" feature discussi...
- Dmitry Olshansky (10/41) Apr 02 2014 If we can alter semantics of
- Walter Bright (6/10) Apr 02 2014 I don't see why not. Note that we couldn't do this for extern(C) functio...
- bearophile (8/13) Apr 02 2014 So what does it happen if I (by mistake or by ignorance) try to
- Walter Bright (3/14) Apr 02 2014 That's right.
- bearophile (7/11) Apr 02 2014 Then this feature needs a specific and explicit syntax, or it has
- Walter Bright (3/5) Apr 02 2014 That's like saying inlining has no point because it doesn't have a parti...
- bearophile (5/7) Apr 03 2014 Hopefully Dmitry Olshansky can explain why you are probably wrong
- Artur Skawina (19/24) Apr 03 2014 Weak inlining *hints* have no point - C learned this the hard way, with ...
- Dmitry Olshansky (24/43) Apr 03 2014 From this it seems to me that once again you've walled yourself out of
- Marco Leise (13/27) Apr 05 2014 How would `return foo(args)' be used in the C preprocessor example?
- Dmitry Olshansky (9/20) Apr 03 2014 I'd gladly author an enhancement once I'm certain of what exactly to
- monarch_dodra (20/28) Apr 02 2014 I'd like to point out this is quite a complicated function to
- Daniel N (4/14) Apr 02 2014 Considering SSE4.2 was released already back in 2008 and was
- Dmitry Olshansky (9/23) Apr 02 2014 Ideally we'd need:
- Walter Bright (4/7) Apr 02 2014 memchr is already optimized by SIMD, and std.algorithm.find uses memchr....
- Dmitry Olshansky (8/18) Apr 02 2014 This is all good and well, but...
- monarch_dodra (12/25) Apr 02 2014 I found myself needing "wmemchr" (and optionally "dmemchr") for
- Andrei Alexandrescu (4/7) Apr 03 2014 Yes. Probably checking stuff with SIMD and a few other approaches would
- monarch_dodra (5/14) Apr 03 2014 I filed it here:
- Dmitry Olshansky (5/8) Apr 03 2014 Something I wanted in D since about 2011. Should have put an enhancement...
- Walter Bright (4/7) Apr 02 2014 Surely a better approach would be to do a statistical analysis of charac...
- Iain Buclaw (34/35) Apr 07 2014 GDC turns switches into table lookups, and speeds are comparable on my
- "Ola Fosheim Grøstad" (2/4) Apr 07 2014 Haha, that's funny. But why don't you put the data in a separate
- Walter Bright (3/6) Apr 07 2014 True, it's pretty easy to defeat that optimization.
- Martin Nowak (3/4) Apr 07 2014 Counter example:
- monarch_dodra (7/11) Apr 07 2014 It's all a matter of how complicated the "non-table"
- Walter Bright (2/3) Apr 07 2014 Yup. Memory lookups can be expensive. Once it's in the cache, it's cheap...
- Alix Pexton (100/100) Apr 17 2014 I added a lookup scheme of my own, its not as fast as Walters (in fact
- ixid (5/106) Apr 17 2014 I feel like there must be a way of making a fast bit look up but
- monarch_dodra (7/11) Apr 17 2014 http://dlang.org/phobos/core_bitop.html#.bt
- Alix Pexton (4/15) Apr 18 2014 I didn't think to look in core.bitop for a faster way to check bits,
- Alix Pexton (29/29) Apr 18 2014 I tested this against the others...
- Alix Pexton (3/3) Apr 18 2014 PS
Try this benchmark comparing various classification schemes:
---------------------------------
import core.stdc.stdlib;
import core.stdc.string;
import std.algorithm;
import std.array;
import std.ascii;
import std.datetime;
import std.range;
import std.stdio;
import std.traits;

bool isIdentifierChar0(ubyte c)
{
    return isAlphaNum(c) || c == '_' || c == '$';
}

bool isIdentifierChar1(ubyte c)
{
    return ((c >= '0' || c == '$') &&
            (c <= '9' || c >= 'A') &&
            (c <= 'Z' || c >= 'a' || c == '_') &&
            (c <= 'z'));
}

immutable bool[256] tab2;
static this()
{
    for (size_t u = 0; u < 0x100; ++u)
    {
        tab2[u] = isIdentifierChar0(cast(ubyte)u);
    }
}

bool isIdentifierChar2(ubyte c)
{
    return tab2[c];
}

/*********************************************/

int f0()
{
    int x;
    for (uint u = 0; u < 0x100; ++u)
    {
        x += isIdentifierChar0(cast(ubyte)u);
    }
    return x;
}

int f1()
{
    int x;
    for (uint u = 0; u < 0x100; ++u)
    {
        x += isIdentifierChar1(cast(ubyte)u);
    }
    return x;
}

int f2()
{
    int x;
    for (uint u = 0; u < 0x100; ++u)
    {
        x += isIdentifierChar2(cast(ubyte)u);
    }
    return x;
}

void main()
{
    auto r = benchmark!(f0, f1, f2)(10_000);
    writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs, r[2].msecs);
}
---------------------------------
Apr 01 2014
> Milliseconds 22 14 10

Yeah, table lookup is the way to go for ASCII, it seems.
Apr 01 2014
On Tuesday, 1 April 2014 at 18:41:03 UTC, w0rp wrote:
> Milliseconds 22 14 10
>
> Yeah, table lookup is the way to go for ASCII, it seems.

Milliseconds 13 8 2

After I remember to use -inline -release -O
Apr 01 2014
On 01.04.2014 22:35, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:
> [...]

Some regular benchmark notes:

1. The first function passed to `benchmark` is always slower (e.g. pass `f2` first and see).

2. Unexpected program behaviour changes. Let's use `benchmark!(f1, f1, f1)(1_000_000)`:

Milliseconds 928 889 888

Then copy `isAlphaNum` into the file (benchmark still shows the same results) and change `dchar c` to `ubyte c`; the result changes:

Milliseconds 849 815 827

The difference is significant, but `isAlphaNum` is not even called in the benchmark.

Also `f0` is faster than `f1`, benchmark!(f0, f0, f0)(1_000_000):

Milliseconds 731 694 702

And `f2` wins; it's the only obvious result, benchmark!(f2, f2, f2)(1_000_000):

Milliseconds 227 184 184

Compiler used: dmd -O -inline -release

--
Denis V. Shelomovskij
Apr 01 2014
On 01.04.2014 23:22, Denis Shelomovskij wrote:
> [...]

Few more words about `dmd`: "Hey, silly, you still use `== x` to compare with `x`? Use table lookups! And never cast `size_t` to `ubyte`!" See:
---
import std.datetime;
import std.stdio;

immutable bool[256] tab2;
static this()
{
    foreach (size_t u; 0 .. 0x100)
        tab2[u] = u == '_';
}

/*********************************************/

int f0()
{
    int x;
    foreach (uint u; 0 .. 0x100)
        x += u == '_';
    return x;
}

int f2()
{
    int x;
    foreach (uint u; 0 .. 0x100)
        x += tab2[cast(ubyte)u];
    return x;
}

int f2plus()
{
    int x;
    foreach (size_t u; 0 .. 0x100)
        x += tab2[u];
    return x;
}

void main()
{
    auto r = benchmark!(f0, f0, f0)(1_000_000);
    writefln("Milliseconds %s %s %s (f0)", r[0].msecs, r[1].msecs, r[2].msecs);
    r = benchmark!(f2, f2, f2)(1_000_000);
    writefln("Milliseconds %s %s %s (f2)", r[0].msecs, r[1].msecs, r[2].msecs);
    r = benchmark!(f2plus, f2plus, f2plus)(1_000_000);
    writefln("Milliseconds %s %s %s (f2plus)", r[0].msecs, r[1].msecs, r[2].msecs);
}
---
Milliseconds 323 274 281 (f0)
Milliseconds 185 185 185 (f2)
Milliseconds 105 97 105 (f2plus)

P.S. I don't want to say anything with these results. I just have no idea what happens here, and I don't want to investigate dmd's way of optimizing things now.

--
Denis V. Shelomovskij
Apr 01 2014
On 4/1/2014 12:38 PM, Denis Shelomovskij wrote:
> I don't want to say anything with these results. I just have no idea what
> happens here and I don't want to investigate dmd's way of optimizing things now.

This is why I recommend, when going all out in optimizing, that at some point you've gotta look at the assembler output of (insert your favorite compiler).
Apr 01 2014
On 4/1/2014 12:22 PM, Denis Shelomovskij wrote:
> Compiler used: dmd -O -inline -release

For me, rewriting main() as:

void main()
{
    {
        auto r = benchmark!(f0, f1, f2)(100_000);
        writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs, r[2].msecs);
    }
    {
        auto r = benchmark!(f0, f1, f2)(100_000);
        writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs, r[2].msecs);
    }
}

Gives:

Milliseconds 139 99 15
Milliseconds 122 93 13
Apr 01 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:
> [...]

It's not really possible to tell anything from that benchmark, especially with fancy modern optimisers and branch predictors.

1) ldc and gdc both optimise away some/all of the tests respectively.

2) once isIdentifierCharX is inlined, the compiler can determine what the results will be before-hand.

3) Even without 1) and 2), the results are unrealistically (very!) correlated, leading to a branch-prediction bonanza. I'm sure you know how significant that can be on modern CPUs. It is also very friendly to pre-fetching of the lookup table*

Perhaps have the same benchmark, but working on realistic data from a file.

*In my experience, the cost of using lookup tables often only appears in real code, where cache pressure and predictability become worse. This is especially true of lazy, range-based code. If several layers of a range use different lookup tables you can quickly end up ruining cache performance compared to an eager approach.
Apr 01 2014
On 4/1/2014 12:23 PM, John Colvin wrote:
> It's not really possible to tell anything from that benchmark, especially with
> fancy modern optimisers and branch predictors.

I disagree, read on.

> 1) ldc and gdc both optimise away some/all of the tests respectively.
> 2) once isIdentifierCharX is inlined, the compiler can determine what the
> results will be before-hand.

This is always an issue with benchmarking in order to find the best way to write an algorithm. I just adjust it so the particular compiler used can't throw it away.

> 3) Even without 1) and 2), the results are unrealistically (very!) correlated,
> leading to a branch-prediction bonanza. I'm sure you know how significant that
> can be on modern CPUs. It is also very friendly to pre-fetching of the lookup
> table*

This is exactly why I wanted to test it.

> Perhaps have the same benchmark, but working on realistic data from a file.

Switching to it in Warp produced a very significant speedup on real files:
https://github.com/facebook/warp/pull/5
Apr 01 2014
> Switching to it in Warp produced a very significant speedup on real files.
> https://github.com/facebook/warp/pull/5

That's to be expected, as warp will call it often and it will stay in the cache. For a program that calls it less often the other methods may be faster. Basically, it depends on the use case. Glad to see it's helped in warp. Any idea how it compares to clang with this improvement?
Apr 01 2014
On 4/1/2014 2:30 PM, Peter Alexander wrote:
> Any idea how it compares to clang with this improvement?

No data yet. Stay tuned.
Apr 01 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> immutable bool[256] tab2;
> static this()
> {
>     for (size_t u = 0; u < 0x100; ++u)
>     {
>         tab2[u] = isIdentifierChar0(cast(ubyte)u);
>     }
> }

Table lookups are awesome.

While tab2 there lends itself to being populated in a static this(), it would really lower the bar if we could define static immutable AA literals at compile-time. Or at least CTFE-fill them. Having to rely on static this() for such is unfortunate and hacky.

Relevant is http://forum.dlang.org/thread/oxnwtojtdymopgvanbwz forum.dlang.org in which I get a speed increase of >3.5x by doing string-to-enum translation/mapping using someEnum[string] AA lookups, instead of using someString.to!someEnum.

(Given CtAAs, std.conv.to could then generate such for various translations where cheap -- assuming we still like AAs?)
Apr 01 2014
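For readers following along, JR's string-to-enum mapping can be sketched as below. This is my own illustrative code (hypothetical `Color` enum), not JR's library; as he notes, the AA still has to be filled in a module constructor because immutable AA literals are not supported:

```
import std.conv : to;
import std.traits : EnumMembers;

enum Color { red, green, blue }

// Built once at startup; lookups then replace the slower "name".to!Color.
immutable Color[string] colorByName;

shared static this()
{
    Color[string] t;
    foreach (m; EnumMembers!Color)   // unrolled over the enum's members
        t[m.to!string] = m;
    colorByName = cast(immutable)t;
}
```

A lookup is then `colorByName["green"]`, a single hash probe instead of the generic `std.conv.to` machinery.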
On 04/01/14 21:35, JR wrote:
> While tab2 there lends itself to being populated in a static this() it would
> really lower the bar if we could define static immutable AA literals at
> compile-time. Or at least CTFE-fill them. Having to rely on static this() for
> such is unfortunate and hacky.

You mention AAs, so you may already realize this, but there is no need for a mod ctor in the above example:

   immutable bool[256] tab2 = {
      bool t[256];
      for (size_t u = 0; u < 0x100; ++u)
         t[u] = isIdentifierChar0(cast(ubyte)u);
      return t;
   }();

"Static immutable AAs" can be built at CT likewise, it's just that their type will be different from "std" RT AAs.

artur
Apr 01 2014
Artur Skawina:
> You mention AAs, so you may already realize this, but there is no need for a
> mod ctor in the above example:
> [...]

A recent improvement in foreach loops allows the cast to be removed from your code, but I have found a new compiler bug:
https://d.puremagic.com/issues/show_bug.cgi?id=12504

Bye,
bearophile
Apr 01 2014
On 4/1/14, Artur Skawina <art.08.09 gmail.com> wrote:
> You mention AAs, so you may already realize this, but there is no need for a
> mod ctor in the above example:
> [...]

Also (untested):

immutable bool[256] tab2 = iota(0, 0x100)
    .map!(i => isIdentifierChar0(cast(ubyte)i))
    .array;
Apr 01 2014
On 4/1/2014 1:57 PM, Andrej Mitrovic wrote:
> immutable bool[256] tab2 = iota(0, 0x100)
>     .map!(i => isIdentifierChar0(cast(ubyte)i))
>     .array;

Yup, that works. Purty nice!
Apr 01 2014
Andrej Mitrovic:
> Also (untested):
>
> immutable bool[256] tab2 = iota(0, 0x100)
>     .map!(i => isIdentifierChar0(cast(ubyte)i))
>     .array;

Eventually this should work, and avoid the nasty cast:

immutable bool[256] tab3 = ubyte
    .max
    .iota!"[]"
    .map!isIdentifierChar0
    .array;

Bye,
bearophile
Apr 01 2014
On Tuesday, 1 April 2014 at 22:32:28 UTC, bearophile wrote:
> Eventually this should work, and avoid the nasty cast:
>
> immutable bool[256] tab3 = ubyte
>     .max
>     .iota!"[]"

I can't decide if this part is brilliant or terrible. Clever, either way.
Apr 02 2014
Brad Anderson:
> I can't decide if this part is brilliant or terrible. Clever, either way.

That iota enhancement has nothing terrible: the "[]" syntax is already used in Phobos for std.random.uniform, and it's quite needed, otherwise iota can't generate the right bound of a type (like the last ubyte, last byte, last char, last int, etc).

Bye,
bearophile
Apr 02 2014
On Wednesday, 2 April 2014 at 21:58:33 UTC, bearophile wrote:
> That iota enhancement has nothing terrible [...]

I was more referring to ubyte.max.iota :)
Apr 02 2014
On 4/1/2014 1:57 PM, Andrej Mitrovic wrote:
> immutable bool[256] tab2 = iota(0, 0x100)
>     .map!(i => isIdentifierChar0(cast(ubyte)i))
>     .array;

Incorporated your idea:
https://github.com/facebook/warp/pull/5/files
Apr 01 2014
Walter Bright:
> Incorporated your idea:

A bit of statistics. I have counted about 260 "cast(" in the whole set of d files (the total is 7752 CLOC lines of D code, plus 1008 blank lines and 905 lines of comments; I have not counted the lines of unittests). The files with the highest number of casts are:

macros.d, with 86 casts
constexpr.d, with 39 casts
directive.d, with 30 casts
lexer.d, with 28 casts
stringlit.d, with 18 casts
id.d, with 16 casts
context.d, with 11 casts
skip.d, with 9 casts

Bye,
bearophile
Apr 01 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:
> [...]

Original program...       3 4 1
10x as many iterations... 36 47 19

I think this benchmark is flawed...
1) Apparently there are rounding issues... assuming linear scaling, 1.9 ms yields ~1 ms? Probably better to use usecs...
2) Table lookups will nearly always win in isolated 1-function tests... but not necessarily in real apps.
Apr 01 2014
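Reporting in microseconds, as Tove suggests, is a small change to the original benchmark's main(); `benchmark` returns TickDuration values, which expose `usecs` alongside `msecs` (untested sketch):

```
void main()
{
    auto r = benchmark!(f0, f1, f2)(10_000);
    // usecs instead of msecs avoids the rounding problem at small counts
    writefln("Microseconds %s %s %s", r[0].usecs, r[1].usecs, r[2].usecs);
}
```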
On 01-Apr-2014 22:35, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:
> [...]

Would strongly suggest to use 2 kinds of data - randomized and some text. And, of course, it must be loaded at run-time.

--
Dmitry Olshansky
Apr 01 2014
On 4/1/2014 12:46 PM, Dmitry Olshansky wrote:
> Would strongly suggest to use 2 kinds of data - randomized and some text. And,
> of course, it must be loaded at run-time.

See Vladimir's post.
Apr 01 2014
On 02-Apr-2014 02:09, Walter Bright wrote:
> See Vladimir's post.

Cool. All the more confidence in using multi-stage tables for Unicode stuff ;)

--
Dmitry Olshansky
Apr 02 2014
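For readers unfamiliar with the multi-stage tables Dmitry mentions: the idea is to split a code point into parts so that identical 256-entry blocks are shared, instead of storing one flat flag per code point. A minimal two-stage sketch under my own assumptions (illustrative names; this is not how std.uni actually implements its tables):

```
import std.uni : isAlpha;

// Stage 1 maps the high byte of a code point to a block id; stage 2 stores
// the 256-entry blocks themselves. Identical blocks are stored only once.
immutable ushort[256] stage1;
immutable bool[256][] stage2;

shared static this()
{
    ushort[256] s1;
    bool[256][] blocks;
    foreach (hi; 0 .. 256)
    {
        bool[256] block;
        foreach (lo; 0 .. 256)
            block[lo] = isAlpha(cast(dchar)(hi << 8 | lo));
        size_t id = blocks.length;
        foreach (i, ref b; blocks)      // share an identical existing block
            if (b == block) { id = i; break; }
        if (id == blocks.length)
            blocks ~= block;
        s1[hi] = cast(ushort)id;
    }
    stage1 = s1;
    stage2 = cast(immutable)blocks;
}

// Two dependent loads instead of one, but the tables stay tiny compared to a
// flat bool[0x10000], because most blocks are all-false and get shared.
bool isAlphaBMP(dchar c)
{
    return c < 0x10000 && stage2[stage1[c >> 8]][c & 0xFF];
}
```

The same layering extends to three stages for the full 0x110000 code-point range, trading one more load for further size reduction.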
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:
> [...]

10_000 iterations:
Milliseconds 7 10 2
Milliseconds 5 5 1
Milliseconds 8 10 2

100_000 iterations:
Milliseconds 78 93 21
Milliseconds 78 94 21
Milliseconds 78 92 21

1_000_000 iterations:
Milliseconds 584 614 146
Milliseconds 541 612 147
Milliseconds 502 609 146
Milliseconds 529 613 147
Milliseconds 539 606 147

This is DMD 32-bit 2.065.0 on an Intel Core-i7. The options are -O -release -inline -noboundscheck.

A bit of comments:

1. The benchmark seems rather artificial to me. The natural order of input is fixed, which is nice for both the branch predictor (long streaks of each branch) and the array access (sequential).

2. The latter function wins, no wonder though, since there's no branching. (I tried to augment it by actually using the cache in between, but it still wins against my attempts so far.)

Ivan Kazmenko.
Apr 01 2014
On Tuesday, 1 April 2014 at 21:34:29 UTC, Ivan Kazmenko wrote:
> 2. The latter function wins, no wonder though, since there's no branching. (I
> tried to augment it by actually using the cache in between, but it still wins
> against my attempts so far.)

The "it" being augmented is the benchmark; the "it" winning is the function isIdentifierChar2.

This modification of the benchmark pretends to use a stack-allocated array alongside calling the functions of interest:
http://dpaste.dzfl.pl/1ed247445308

The timings are:
Milliseconds 623 590 229
Milliseconds 629 592 231
Milliseconds 619 596 233
Milliseconds 524 592 283

Funnily, on DPaste, the picture changes:
Milliseconds 1864 1349 1303

Perhaps there is no "-O -release -inline -noboundscheck" tuning there.

Ivan Kazmenko.
Apr 01 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:

I modified it a bit to use the Phobos source code as the test data:
http://dump.thecybershadow.net/7a685ffb8588ec836e3ae0dba14e0618/test.d

My results (DMD git HEAD, -O -inline -release):

x86: 41070 29642 5169
x64: 42975 29759 5822

I've been using table lookups in my library for a while, e.g. for ASCII case conversions:
https://github.com/CyberShadow/ae/blob/master/utils/text.d#L323-L341
or escaping JSON strings:
https://github.com/CyberShadow/ae/blob/master/utils/json.d#L150-L188
Apr 01 2014
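The CTFE-initializer pattern shown earlier in the thread works for case-conversion tables like Vladimir's as well. A minimal sketch (my own illustrative code, not his actual library code):

```
// The initializer lambda runs at compile time, so the table is baked into
// the binary; no module constructor or runtime initialization is needed.
immutable char[256] asciiLowerTab = {
    char[256] t;
    foreach (c; 0 .. 256)
        t[c] = (c >= 'A' && c <= 'Z') ? cast(char)(c + ('a' - 'A'))
                                      : cast(char)c;
    return t;
}();

char asciiToLower(char c) { return asciiLowerTab[c]; }
```

Each conversion is then one indexed load, with no range checks or branches in the hot path.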
On 4/1/2014 3:00 PM, Vladimir Panteleev wrote:
> I modified it a bit to use the Phobos source code as the test data:
> http://dump.thecybershadow.net/7a685ffb8588ec836e3ae0dba14e0618/test.d

Nice. I especially like the use of dirEntries to read all the text. That should go into the phobos documentation!
Apr 01 2014
Do I see it right that it really comes down to one use of isIdentifierStart(c) in one huge switch-case?

Since "warp" seems to be interested only in ASCII properties, as it is processing C code, could the switch-case be optimized into a 256-entry jump table? If a table lookup (char->bool) is faster than multiple comparisons, why shouldn't this apply to char->address as well?

--
Marco
Apr 01 2014
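Marco's char->address idea can be sketched as a 256-entry table of function pointers built with the same CTFE-initializer pattern seen earlier in the thread (hypothetical handler names; warp's lexer is not actually structured this way):

```
// One handler per leading byte: the big switch becomes a single indexed
// indirect call through the table.
alias Handler = void function(ref const(char)[] input);

void lexIdentifier(ref const(char)[] input) { /* consume identifier */ }
void lexNumber(ref const(char)[] input)     { /* consume number */ }
void lexOther(ref const(char)[] input)      { /* punctuation etc. */ }

immutable Handler[256] dispatch = {
    Handler[256] t;
    foreach (c; 0 .. 256)
    {
        if (c == '_' || c == '$'
            || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
            t[c] = &lexIdentifier;
        else if (c >= '0' && c <= '9')
            t[c] = &lexNumber;
        else
            t[c] = &lexOther;
    }
    return t;
}();

void lexOne(ref const(char)[] input)
{
    dispatch[cast(ubyte)input[0]](input); // char -> address, one lookup
}
```

Whether this beats a compiler-generated switch jump table depends on the target; the indirect call is harder for the branch predictor than the direct jumps a switch can produce.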
On Wednesday, 2 April 2014 at 03:47:58 UTC, Marco Leise wrote:
> If a table lookup (char->bool) is faster than multiple comparisons, why
> shouldn't this apply to char->address as well?

I was thinking the same; it sure would be nice if computed gotos were standardized (as suggested by Bearophile many times).
Apr 02 2014
On 02-Apr-2014 12:27, Daniel N wrote:
> I was thinking the same; it sure would be nice if computed gotos were
> standardized (as suggested by Bearophile many times).

An _explicit_ tail call (aka switch-call) would have been cleaner and more modular. It covers all relevant cases of computed goto, such as threaded-code interpreters, and also recursion-heavy functional code.

--
Dmitry Olshansky
Apr 02 2014
Dmitry Olshansky:
> _Explicit_ tail call (aka switch-call) would have been cleaner and more
> modular. It covers all relevant cases of computed-goto such as threaded-code
> interpreters and also recursion-heavy functional code.

It looks like an additive change. Can you show a possible syntax and semantics for D? If it's so good it could also become a DIP later.

Bye,
bearophile
Apr 02 2014
On 02-Apr-2014 15:26, bearophile wrote:
> It looks like an additive change. Can you show a possible syntax and semantics
> for D? If it's so good it could also become a DIP later.

Easily. In the grammar, a new kind of statement:

SwitchCallStatement:
    goto CallExpression;

Semantics: same as that of a function call expression, except:

a) One can switch-call only to a function with the same return type.

b) Anything below this statement is unreachable, i.e. treat it as a return.

c) The stack frame of the current function is overwritten with that of the switched-to function. In other words, after a switch-call the stack looks as if the switched-to function had been called instead in the first place.

The last point is what modern tail-call optimizations do when one has 2 functions that tail-recurse to each other: they overwrite stack frames on such tail calls. However, this only happens at a certain optimization level, which IMO makes the thing totally unreliable. In fact, I even observed that I could implement a threaded-code interpreter by relying on tail-call optimization with LDC, but only when compiling with all optimizations enabled. Unoptimized builds simply blow up the stack.

This "semantics are tied to optimization" is a situation that I hate about C/C++ and would hope to address in D.

--
Dmitry Olshansky
Apr 02 2014
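Under Dmitry's proposal, a threaded-code interpreter's dispatch might read like this. This is a hypothetical syntax sketch of the proposed feature, not valid D today:

```
// Each opcode handler ends with a switch-call to the next handler. The
// current stack frame is reused, so dispatch never grows the stack,
// regardless of optimization level.
void opInc(Op* ip, ref int acc)
{
    ++acc;
    goto handlers[ip[1].code](ip + 1, acc);  // guaranteed frame reuse
}
```

Without the guarantee, the same code written as `return handlers[...](...)` only stays flat if the backend happens to apply tail-call optimization.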
On 4/2/2014 12:03 PM, Dmitry Olshansky wrote:
> c) Stack frame of the current function is overwritten with that of switched-to
> function. In other words after a switch-call stack looks as if the switched-to
> function was called instead in the first place.
>
> The last point is what modern tail-call optimizations do when one has 2
> functions that tail-recurse to each other. They overwrite stackframes on such
> tail calls.

How is this different from simply recognizing "return foo(arg);" as being a goto?

> However this only happens with certain optimization switch which IMO makes the
> thing totally unreliable. [...] This "semantics are tied to optimization" is a
> situation that I hate about C/C++ and would hope to address in D.

Doesn't seem terribly different than the "force inline" feature discussion.
Apr 02 2014
On 03-Apr-2014 00:39, Walter Bright wrote:
> How is this different from simply recognizing "return foo(arg);" as being a
> goto?

If we can alter the semantics of

return foo(arg);

to _always_ be a goto, guaranteeing a jump and reuse of the stack, then I'm all for it.

> Doesn't seem terribly different than the "force inline" feature discussion.

Yes, it shows the same problem: semantics and optimization are often mixed up. In this case, however, a program would segfault if the relevant optimization was not performed; it wouldn't just be many times slower.

--
Dmitry Olshansky
Apr 02 2014
On 4/2/2014 1:54 PM, Dmitry Olshansky wrote:
> If we can alter the semantics of return foo(arg); to _always_ be a goto,
> guaranteeing a jump and reuse of the stack frame, then I'm all for it.

I don't see why not. Note that we couldn't do this for extern(C) functions, or variadics, or caller functions with parameters that need destruction, or parameters that may refer to any locals. That last constraint might be a doozy, however.

Why not submit an enhancement request to bugzilla?
Apr 02 2014
Walter Bright:
> I don't see why not. Note that we couldn't do this for extern(C)
> functions, or variadics, or caller functions with parameters that need
> destruction, or parameters that may refer to any locals. That last
> constraint might be a doozy, however. Why not submit an enhancement
> request to bugzilla?

So what happens if I (by mistake or out of ignorance) try to use this on a function that doesn't satisfy one of those requirements? D can't give an error in that case because it would be a big breaking change. So does D in that case just silently not perform the tail call the programmer meant?

Bye,
bearophile
Apr 02 2014
On 4/2/2014 3:06 PM, bearophile wrote:
> So what happens if I (by mistake or out of ignorance) try to use this on
> a function that doesn't satisfy one of those requirements?

Then it won't tail call it.

> D can't give an error in that case because it would be a big breaking
> change. So does D in that case just silently not perform the tail call
> the programmer meant?

That's right.
Apr 02 2014
Walter Bright:
>> D can't give an error in that case because it would be a big breaking
>> change. So does D in that case just silently not perform the tail call
>> the programmer meant?
>
> That's right.

Then this feature needs a specific and explicit syntax, or it has _no_ point at all. (Note: I am currently not asking for this feature in D. I have just shown one of its minimal requirements.)

Bye,
bearophile
Apr 02 2014
On 4/2/2014 6:21 PM, bearophile wrote:
> Then this feature needs a specific and explicit syntax, or it has _no_
> point at all.

That's like saying inlining has no point because it doesn't have a particular syntax.
Apr 02 2014
Walter Bright:
> That's like saying inlining has no point because it doesn't have a
> particular syntax.

Hopefully Dmitry Olshansky can explain why you are probably wrong regarding the proposed feature.

Bye,
bearophile
Apr 03 2014
On 04/03/14 04:40, Walter Bright wrote:
> On 4/2/2014 6:21 PM, bearophile wrote:
>> Then this feature needs a specific and explicit syntax, or it has _no_
>> point at all.
>
> That's like saying inlining has no point because it doesn't have a
> particular syntax.

Weak inlining *hints* have no point - C learned this the hard way, with "inline" being useless today, and everybody having to use "always_inline/forceinline". The issue isn't about optimizations, which any decent compiler can deal with; it's about the programmer being able to rely on avoiding certain undesirable effects, like excessive stack consumption, unnecessary clobbering of the few precious CPU registers available, etc.

What Bearophile is saying is that an optional tail-call optimization mandated by a spec would be completely useless -- TCO can be legally done by the compiler anyway; documenting that it could happen does not help the programmer in any way - (s)he still has to assume that the optimization might not happen. A syntax like "goto return f();" which mandates TCO would make the situation different.

[Defining computed gotos would be a better idea, as the tail-call transformation would significantly raise the requirements for a compliant D compiler. OTOH computed gotos would lead to more readable D code, are easier to implement, and can easily be made safe in D (by making the type of label addresses unique per function and avoiding function cloning).]

artur
Apr 03 2014
03-Apr-2014 05:05, Walter Bright wrote:
> On 4/2/2014 3:06 PM, bearophile wrote:
>> So what happens if I (by mistake or out of ignorance) try to use this
>> on a function that doesn't satisfy one of those requirements?
>
> Then it won't tail call it.

From this it seems to me that once again you've walled yourself out of this question and are going to frame it as an optimization. Tail-call has 2 sides to it:

a) An optimization for functions that return with a call expression. I'd leave this case an optimization because it also has an undesirable effect on the call stack, hindering debugging.

b) An explicit instruction to switch execution to another function. This doesn't have any unexpected effects, and failing to make a tail call will change the semantics of the code. This is critical for certain kinds of code; it simply doesn't work if the tail call is not a jump.

I agree that nothing prevents doing the optimization of return foo(args); in case a) if optimization is enabled, and this is something GCC/LLVM already do. But more importantly there should be an explicit tool for case b), with syntax:

goto foo(args);

that will fail to compile if a tail call is not possible. Because the alternative of double-checking on every compiler that "return foo(args)" in all the right places is optimized the right way is fragile, and would turn out to be an _ongoing_ waste of time.

-- 
Dmitry Olshansky
Apr 03 2014
Am Wed, 02 Apr 2014 18:05:32 -0700, Walter Bright <newshound2 digitalmars.com> wrote:
> On 4/2/2014 3:06 PM, bearophile wrote:
>> So what happens if I (by mistake or out of ignorance) try to use this
>> on a function that doesn't satisfy one of those requirements?
>
> Then it won't tail call it.

How would `return foo(args)' be used in the C preprocessor example? Would it be implemented like with a jump table?:

char c = ...;
return caseTable[c](args);

What are the benefits over "goto"? I figure with "goto" the semantics are clearer and you can easily access the variables on the stack. Does the TCO version enable any more use-cases? (Note: I consider "far less convoluted code" a valid use-case. :) )

-- 
Marco
Apr 05 2014
03-Apr-2014 01:51, Walter Bright wrote:
> On 4/2/2014 1:54 PM, Dmitry Olshansky wrote:
>> If we can alter the semantics of return foo(arg); to _always_ be a
>> goto, guaranteeing a jump and reuse of the stack frame, then I'm all
>> for it.
>
> I don't see why not. [...] Why not submit an enhancement request to
> bugzilla?

I'd gladly author an enhancement once I'm certain of what exactly to state in it. Always optimizing return foo(args) to a jump may have an impact on the debugging experience (curious call stacks) and cost people hours of wasted time chasing shadows. Adding an explicit form for it seems to me far more appealing and honest.

-- 
Dmitry Olshansky
Apr 03 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:
>
> bool isIdentifierChar1(ubyte c)
> {
>     return ((c >= '0' || c == '$') &&
>             (c <= '9' || c >= 'A') &&
>             (c <= 'Z' || c >= 'a' || c == '_') &&
>             (c <= 'z'));
> }

I'd like to point out that this is quite a complicated function to begin with, so it doesn't generalize to all the isXXX ASCII tests, whose conditions would be fairly simpler. In any case, (on my win32 machine) I can go from 810 msecs to 500 msecs using this function instead:

bool isIdentifierChar1(ubyte c)
{
    return c <= 'z' &&
        ( 'a' <= c ||
          ('0' <= c && (c <= '9' || c == '_' || ('A' <= c && c <= 'Z'))) ||
          c == '$');
}

That said, I'm abusing the fact that 50% of your bench is for chars over 0x80. If I loop only on actual ASCII you can find in text (0x20 - 0x80), then those numbers "only" go from "320" => "300". Only slightly better, but still a win. *BUT*, if your functions were to accept any arbitrary codepoint, it would absolutely murder.
Apr 02 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:
> [isIdentifierChar1 snipped]

Considering SSE4.2 was released already back in 2008 and was designed to accelerate this very use-case... Maybe it's more beneficial to add a few new code-paths... SSE4.2, NEON, etc.
Apr 02 2014
02-Apr-2014 20:52, Daniel N wrote:
> Considering SSE4.2 was released already back in 2008 and was designed to
> accelerate this very use-case... Maybe it's more beneficial to add a few
> new code-paths... SSE4.2, NEON, etc.

Ideally we'd need:

a) A collection of meaningful (high-level compared to std.simd) primitives that can be accelerated by SIMD, in Phobos. Such as searching, bit-counting etc.

b) A way to automatically use whatever is best and supported by the hardware. GCC has had zero-overhead stuff for that since around 4.6 (?).

-- 
Dmitry Olshansky
Apr 02 2014
On 4/2/2014 12:38 PM, Dmitry Olshansky wrote:
> Ideally we'd need:
> a) A collection of meaningful (high-level compared to std.simd)
> primitives that can be accelerated by SIMD, in Phobos. Such as
> searching, bit-counting etc.

memchr is already optimized by SIMD, and std.algorithm.find uses memchr. In general, enormous effort has been poured into optimizing C standard library functions. If an algorithm can be implemented using such functions, we ought to.
Apr 02 2014
03-Apr-2014 00:46, Walter Bright wrote:
> memchr is already optimized by SIMD, and std.algorithm.find uses memchr.
> In general, enormous effort has been poured into optimizing C standard
> library functions. If an algorithm can be implemented using such
> functions, we ought to.

This is all good and well, but... memchr works on the level of a single byte, leaving nothing for 2-, 4-, or 8-byte-wide needles. Not type-safe. No fallback for CTFE. Obviously the C standard is lacking many things, hence the endless row of extensions.

-- 
Dmitry Olshansky
Apr 02 2014
On Wednesday, 2 April 2014 at 21:18:19 UTC, Dmitry Olshansky wrote:
> This is all good and well, but... memchr works on the level of a single
> byte, leaving nothing for 2-, 4-, or 8-byte-wide needles. Not type-safe.
> No fallback for CTFE.

I found myself needing "wmemchr" (and optionally "dmemchr") for speeding up find for wstrings and dstrings. I realized "wmemchr" exists, but it is platform defined: ushort (w) on Windows, and uint (d) on Linux. So it's hard to use, and definitely not the "W" version I'd expect in D code.

If I were to request the actual "memchr"/"wmemchr"/"dmemchr" functions in some "core.???" module, is this something we'd want, and would somebody know how to provide an efficient implementation?
Apr 02 2014
On 4/2/14, 11:19 PM, monarch_dodra wrote:
> If I were to request the actual "memchr"/"wmemchr"/"dmemchr" functions
> in some "core.???" module, is this something we'd want, and would
> somebody know how to provide an efficient implementation?

Yes. Probably checking stuff with SIMD and a few other approaches would be great.

Andrei
Apr 03 2014
On Thursday, 3 April 2014 at 19:39:57 UTC, Andrei Alexandrescu wrote:
> Yes. Probably checking stuff with SIMD and a few other approaches would
> be great.

I filed it here: https://d.puremagic.com/issues/show_bug.cgi?id=12515

Now, we just need someone to actually do it :)
Apr 03 2014
03-Apr-2014 10:19, monarch_dodra wrote:
> If I were to request the actual "memchr"/"wmemchr"/"dmemchr" functions
> in some "core.???" module, is this something we'd want, and would
> somebody know how to provide an efficient implementation?

Something I've wanted in D since about 2011. Should have put in an enhancement request or tipped off bearophile ;)

-- 
Dmitry Olshansky
Apr 03 2014
On 4/2/2014 4:49 AM, monarch_dodra wrote:
> That said, I'm abusing the fact that 50% of your bench is for chars over
> 0x80. If I loop only on actual ASCII you can find in text (0x20 - 0x80),
> then those numbers "only" go from "320" => "300". Only slightly better,
> but still a win.

Surely a better approach would be to do a statistical analysis of character frequency in a representative corpus, and tune the compares that way. But the table lookup offered such a dramatic improvement, I didn't bother.
Apr 02 2014
On 1 April 2014 19:35, Walter Bright <newshound2 digitalmars.com> wrote:
> Try this benchmark comparing various classification schemes:

GDC turns switches into table lookups, and speeds are comparable on my system, though your little hack appears faster in release builds. Tests are with 1 million runs.

---
Non-release run:
Milliseconds 1797 1642 1501 1463

Non-release run, no bounds checking:
Milliseconds 1730 1611 1483 1512

Release run:
Milliseconds 1753 1633 1471 1528

Optimised run:
Milliseconds 0 0 0 0
---

It is interesting that switches seem to take longer when bounds checking is turned off, and even longer in release mode. However this is not very conclusive, as I can't test with optimisations turned on. Unfortunately for your small program, gcc is too smart and optimises everything away.

---
bool isIdentifierChar3(ubyte c)
{
    switch (c)
    {
        case '0': .. case '9':
        case 'A': .. case 'Z':
        case 'a': .. case 'z':
        case '$':
        case '_':
            return true;
        default:
            return false;
    }
}
Apr 07 2014
> [...] as I can't test with optimisations turned on. Unfortunately for
> your small program, gcc is too smart and optimises everything away.

Haha, that's funny. But why don't you put the data in a separate compilation unit?
Apr 07 2014
On 4/7/2014 2:41 AM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> wrote:
> Haha, that's funny. But why don't you put the data in a separate
> compilation unit?

True, it's pretty easy to defeat that optimization.
Apr 07 2014
On 04/01/2014 08:35 PM, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:

Counter example:
https://github.com/D-Programming-Language/phobos/commit/8599d6b4acde0eff886876a56f54ecd8e6b528e9
Apr 07 2014
On Monday, 7 April 2014 at 11:35:37 UTC, Martin Nowak wrote:
> Counter example:
> https://github.com/D-Programming-Language/phobos/commit/8599d6b4acde0eff886876a56f54ecd8e6b528e9

It's all a matter of how complicated the "non-table" implementation is, I guess. Table lookup has a "steep entry cost", but is constant regardless of complexity. Table lookups were "recently" removed from std.ascii too. The implementation Walter is benching against is particularly complicated here.
Apr 07 2014
On 4/7/2014 7:11 AM, monarch_dodra wrote:
> It's all a matter of how complicated the "non-table" implementation is,
> I guess.

Yup. Memory lookups can be expensive. Once it's in the cache, it's cheap.
Apr 07 2014
I added a lookup scheme of my own. It's not as fast as Walter's (in fact it's the slowest without -inline -release -O), but it uses 1 bit per entry in the table instead of a whole byte, so you can have lots and lots of different tables. I'm even reasonably sure that it works correctly!

===============================

import core.stdc.stdlib;
import core.stdc.string;

import std.algorithm;
import std.array;
import std.ascii;
import std.datetime;
import std.range;
import std.stdio;
import std.traits;

bool isIdentifierChar0(ubyte c)
{
    return isAlphaNum(c) || c == '_' || c == '$';
}

bool isIdentifierChar1(ubyte c)
{
    return ((c >= '0' || c == '$') &&
            (c <= '9' || c >= 'A') &&
            (c <= 'Z' || c >= 'a' || c == '_') &&
            (c <= 'z'));
}

immutable bool[256] tab2;
static this()
{
    for (size_t u = 0; u < 0x100; ++u)
    {
        tab2[u] = isIdentifierChar0(cast(ubyte)u);
    }
}

bool isIdentifierChar2(ubyte c)
{
    return tab2[c];
}

immutable ulong[4] tab3;
static this()
{
    for (size_t u = 0; u < 0x100; ++u)
    {
        if (isIdentifierChar0(cast(ubyte)u))
        {
            auto sub = u >>> 6;
            auto b = u & 0x3f;
            auto mask = 0x01L << b;
            tab3[sub] |= mask;
        }
    }
}

bool isIdentifierChar3(ubyte c)
{
    auto sub = c >>> 6;
    c &= 0x3f;
    auto mask = 0x01L << c;
    return (tab3[sub] & mask) > 0;
}

int f0()
{
    int x;
    for (uint u = 0; u < 0x100; ++u)
    {
        x += isIdentifierChar0(cast(ubyte)u);
    }
    return x;
}

int f1()
{
    int x;
    for (uint u = 0; u < 0x100; ++u)
    {
        x += isIdentifierChar1(cast(ubyte)u);
    }
    return x;
}

int f2()
{
    int x;
    for (uint u = 0; u < 0x100; ++u)
    {
        x += isIdentifierChar2(cast(ubyte)u);
    }
    return x;
}

int f3()
{
    int x;
    for (uint u = 0; u < 0x100; ++u)
    {
        x += isIdentifierChar3(cast(ubyte)u);
    }
    return x;
}

void main()
{
    auto r = benchmark!(f0, f1, f2, f3)(10_000);
    writefln("Milliseconds %s %s %s %s",
             r[0].msecs, r[1].msecs, r[2].msecs, r[3].msecs);
}
Apr 17 2014
On Thursday, 17 April 2014 at 16:27:26 UTC, Alix Pexton wrote:
> I added a lookup scheme of my own. It's not as fast as Walter's (in fact
> it's the slowest without -inline -release -O), but it uses 1 bit per
> entry in the table instead of a whole byte, so you can have lots and
> lots of different tables. I'm even reasonably sure that it works
> correctly!
> [benchmark code snipped]

I feel like there must be a way of making a fast bit lookup, but my version is only moderate in speed. You can get all the bits you need in two 64-bit registers or one SSE register. I haven't tried bt, does that work with a 64-bit register?
Apr 17 2014
On Thursday, 17 April 2014 at 18:07:24 UTC, ixid wrote:
> I feel like there must be a way of making a fast bit lookup, but my
> version is only moderate in speed. You can get all the bits you need in
> two 64-bit registers or one SSE register. I haven't tried bt, does that
> work with a 64-bit register?

bt? Note it can be applied to the table in general, rather than the bytes themselves. E.g.:

ubyte[256] buf;
auto b = bt(buf.ptr, 428);
Apr 17 2014
On 17/04/2014 8:41 PM, monarch_dodra wrote:
> bt? Note it can be applied to the table in general, rather than the
> bytes themselves. E.g.:
>
> ubyte[256] buf;
> auto b = bt(buf.ptr, 428);

I didn't think to look in core.bitop for a faster way to check bits. I'll check if it closes the gap to Walter's version.

A...
Apr 18 2014
I tested this against the others... (with "-inline -release -O" of course)

===

uint[8] tab4; // the bitop functions only work on uints
static this()
{
    for (size_t u = 0; u < 0x100; ++u)
    {
        if (isIdentifierChar0(cast(ubyte)u))
        {
            bts(tab4.ptr, u);
        }
    }
}

bool isIdentifierChar4(ubyte c)
{
    return bt(tab4.ptr, c) != 0;
}

===

It takes about 3 times as long (about the same as the original isIdentifierChar1 that used lots of &&s and ||s). So 2 shifts, 2 ands and a compare to zero beats core.bitop.

For clarity, the output of my benchmark for all 5 versions was

Milliseconds 21 15 2 5 15

The 2 (which on some runs was a 3) is Walter's table of bools, the 5 is my table of ulongs, and the 15 on the end is core.bitop. Short of delving into the asm, which I'm trying to avoid, any other suggestions for speed-ups?

A...
Apr 18 2014
PS I discovered that core.bitop.bts() won't accept a pointer to immutable even when inside a static this. Not sure if that counts as a bug or not ><
Apr 18 2014