digitalmars.D - Counting bits in a ulong
- H. S. Teoh (97/97) Jun 03 2020 So recently I needed to implement a function to count the number of 1
- Paul Backus (5/14) Jun 03 2020 Interesting stuff. Did you compare your implementations with the
- ketmar (5/17) Jun 03 2020 actually, that branch has very good prediction rate. it should be predic...
- Patrick Schluter (8/14) Jun 04 2020 You should try to use the popcnt instruction of the cpu (in asm
- H. S. Teoh (11/24) Jun 04 2020 [...]
- Patrick Schluter (14/41) Jun 04 2020 Cool, I just wanted to comment as I have myself a Phenom II X6
- wjoe (10/33) Jun 04 2020 Did you run the benchmark on different CPU architectures ?
- H. S. Teoh (29/38) Jun 04 2020 On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalma...
- Dominikus Dittes Scherkl (14/44) Jun 04 2020 wikipedia recommends this:
- ttk (25/27) Jun 05 2020 This made me wonder how well a lookup table approach would
- H. S. Teoh (29/59) Jun 05 2020 You might be running into alignment issues, possibly. For one thing, why
- H. S. Teoh (157/170) Jun 05 2020 [...]
- sebasg (3/7) Jun 05 2020 Jesus. Shouldn't you be able to generate that instead of
- ttk (9/17) Jun 08 2020 A fair point. Duplicating it in emacs was literally five
- H. S. Teoh (11/23) Jun 08 2020 No templates needed, just use static foreach. ;-) (See my other reply
So recently I needed to implement a function to count the number of 1 bits in a ulong inside a (very) hot inner loop. There's of course the naïve method (`val` is the input ulong):

    uint count;
    foreach (i; 0 .. 64)
    {
        if (val & (1L << i))
            count++;
    }
    return count;

Knowing that this code is inside a hotspot, I wrote a parallel bit counter based on the idea from [1]:

    // Parallel bit count.
    // We count upper/lower halves separately, interspersed to maximize
    // hyperthreading. >:-)
    uint v1 = cast(uint)(val >> 32);
    uint v2 = cast(uint)val & 0xFFFF_FFFF;

    v1 = v1 - ((v1 >> 1) & 0x5555_5555);
    v2 = v2 - ((v2 >> 1) & 0x5555_5555);
    v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
    v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
    v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
    v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
    v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
    v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
    v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
    v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);

    return cast(ulong)v1 + cast(ulong)v2;

[1] https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

The basic idea behind this trick is to add bits in pairs, then in 4's, then in 8's, and so on, inside the ulong itself. I furthermore split it into upper and lower halves and interleaved the two halves to cut the data dependency between instructions and (hopefully) increase the opportunity for hyperthreading / increase out-of-order instruction execution. This is completely branchless and in theory should maximize throughput. Unfortunately it comes at the cost of a high instruction count.

The third alternative is known as Kernighan's trick, which involves a loop that executes exactly as many times as the number of 1 bits:

    uint count;
    for (count = 0; val; count++)
    {
        val &= val - 1;
    }
    return count;

(How it does this is left as an exercise for the reader. It's really quite clever.) This comes with the advantage of a very low instruction count, but it does have a branch that isn't easily predictable, so actual execution time is marginally slower.

I did some measurements with real data (not just a micro benchmark, this is the actual algorithm with the bitcount function in its hotspot), and as expected the naïve bitcount performed the worst, about 2x slower. However, between Kernighan and the parallel bit counter, the results depend on the kind of input given. For program inputs where there is a large number of 1's in the average ulong to be counted, the parallel bit counter performs better. Here's one example of such an input case:

    Naïve:
        real    0m4.159s
        user    0m4.172s
        sys     0m0.024s

    Kernighan:
        real    0m2.114s
        user    0m2.129s
        sys     0m0.020s

    Parallel:
        real    0m2.024s
        user    0m2.039s
        sys     0m0.028s

As you can see, the parallel count is only slightly (albeit consistently) better than Kernighan. However, when the input tends to generate ulongs with a low saturation of 1's, Kernighan wins out by quite a large margin:

    Naïve:
        real    0m5.661s
        user    0m5.706s
        sys     0m0.054s

    Kernighan:
        real    0m3.080s
        user    0m3.123s
        sys     0m0.058s

    Parallel:
        real    0m3.805s
        user    0m3.844s
        sys     0m0.056s

This illustrates how optimization is often input-dependent, and what works well in one case may work not so well in another case. And even an overall better choice (in this case Kernighan's trick) will in some niche cases perform less well.
This goes to reinforce the point Andrei recently made, that sometimes if you optimize for a micro-benchmark, you can end up with code that looks good on the benchmark but not so good on real-world data -- you end up optimizing for the benchmark rather than for the real-world use case. Real-world optimization isn't so straightforward as minimizing a single number; it's often a multi-dimensional problem. :-P

T

-- 
Amateurs built the Ark; professionals built the Titanic.
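(As a quick cross-check of the counters above, here is a minimal sketch of the Kernighan loop tested against core.bitop.popcnt -- the function name and test values are illustrative, not from the post:)

    import core.bitop : popcnt;

    // Kernighan's trick: `val &= val - 1` clears the lowest set bit, so the
    // loop body runs exactly once per 1 bit in the input.
    uint countBitsKernighan(ulong val)
    {
        uint count;
        for (count = 0; val; count++)
            val &= val - 1;
        return count;
    }

    unittest
    {
        foreach (v; [0UL, 1, 0b1011_0000, 0x8000_0000_0000_0000, ulong.max])
            assert(countBitsKernighan(v) == popcnt(v));
    }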
Jun 03 2020
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the number of 1 bits in a ulong inside a (very) hot inner loop. There's of course the naïve method (`val` is the input ulong):
> [...]
> This illustrates how optimization is often input-dependent, and what works well in one case may work not so well in another case. And even an overall better choice (in this case Kernighan's trick) will in some niche cases perform less well.

Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here.
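(For reference, a minimal sketch of using core.bitop.popcnt -- the test value here is just an example:)

    import core.bitop : popcnt;

    void main()
    {
        ulong val = 0xF0F0_0000_0000_00FFUL;
        // popcnt has overloads for uint and ulong; compilers can lower it to
        // the hardware instruction when the target CPU supports it.
        assert(popcnt(val) == 16);
    }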
Jun 03 2020
H. S. Teoh wrote:
> The third alternative is known as Kernighan's trick, which involves a loop that executes exactly as many times as the number of 1 bits:
>
>     uint count;
>     for (count = 0; val; count++)
>     {
>         val &= val - 1;
>     }
>     return count;
>
> (How it does this is left as an exercise for the reader. It's really quite clever.) This comes with the advantage of a very low instruction count, but it does have a branch that isn't easily predictable, so actual execution time is marginally slower.

actually, that branch has a very good prediction rate. it should be predicted most of the time, and when the predictor fails, it mostly doesn't matter, because we're done with the loop. coincidentally, this is what your benchmarking results demonstrate. ;-)
Jun 03 2020
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the number of 1 bits in a ulong inside a (very) hot inner loop. There's of course the naïve method (`val` is the input ulong):
> [...]

You should try to use the popcnt instruction of the CPU (in asm or with intrinsics). It's at least 3 or 4 orders of magnitude (i.e. between 1000 and 10000 times) faster than any of these tricks. I made the change last year in my work codebase since we've left the SPARC servers (they had a very slow implementation of popcount). All modern CPUs have the instruction.
Jun 04 2020
On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via Digitalmars-d wrote:
> On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote:
> [...]
>> Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here.
> [...]
> On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote:
> [...]
>> You should try to use the popcnt instruction of the CPU (in asm or with intrinsics). It's at least 3 or 4 orders of magnitude (i.e. between 1000 and 10000 times) faster than any of these tricks.
>
> As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the instruction. :-(
[...]

Actually, I was wrong, my CPU *does* have this instruction, but I needed to pass the right -mcpu flag to LDC before it will emit it. After turning it on, I got a huge performance boost; it completely moved the function that calls popcnt out of the hotspot into the background! Now it's other functions that have become the bottleneck. Cool stuff!

T

-- 
What do you mean the Internet isn't filled with subliminal messages? What about all those buttons marked "submit"??
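(For reference, enabling this with LDC looks roughly like the following invocation -- the source file name is a placeholder and the exact flags depend on the setup:)

    ldc2 -O3 -mcpu=native popcount.d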
Jun 04 2020
On Thursday, 4 June 2020 at 17:38:48 UTC, H. S. Teoh wrote:
> On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via Digitalmars-d wrote:
>> On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote:
>> [...]
>>> Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here.
>> [...]
>> On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote:
>> [...]
>>> You should try to use the popcnt instruction of the CPU (in asm or with intrinsics). It's at least 3 or 4 orders of magnitude (i.e. between 1000 and 10000 times) faster than any of these tricks.
>>
>> As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the instruction. :-(
> [...]
>
> Actually, I was wrong, my CPU *does* have this instruction, but I needed to pass the right -mcpu flag to LDC before it will emit it. After turning it on, I got a huge performance boost; it completely moved the function that calls popcnt out of the hotspot into the background! Now it's other functions that have become the bottleneck. Cool stuff!

Cool, I just wanted to comment as I have myself a Phenom II X6 1090T and was wondering.

Btw, if you want to use the Kernighan function for values with a lot of set bits, count the inverted value. That's what I had in my code (gcc C):

    DEFINE_INLINE int PopCountFewClr(uint64_t w)
    {
    #ifdef __POPCNT__
        return __builtin_popcountll(w);
    #else
        return CHAR_BIT*sizeof w - PopCountFewSet(~w);
    #endif
    }
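(A rough D rendition of the same idea -- the names mirror the C snippet and are illustrative; PopCountFewSet is the Kernighan-style counter from earlier in the thread:)

    // Fast when most bits are CLEAR: the loop runs once per set bit.
    uint popCountFewSet(ulong w)
    {
        uint count;
        for (count = 0; w; count++)
            w &= w - 1;
        return count;
    }

    // Fast when most bits are SET: count the few zero bits of ~w instead.
    uint popCountFewClr(ulong w)
    {
        return 64 - popCountFewSet(~w);
    }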
Jun 04 2020
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> I did some measurements with real data (not just a micro benchmark, this is the actual algorithm with the bitcount function in its hotspot), and as expected the naïve bitcount performed the worst, about 2x slower. However, between Kernighan and the parallel bit counter, the results depend on the kind of input given. For program inputs where there is a large number of 1's in the average ulong to be counted, the parallel bit counter performs better. Here's one example of such an input case:
>
>     Naïve:
>         real    0m4.159s
>         user    0m4.172s
>         sys     0m0.024s
>
>     Kernighan:
>         real    0m2.114s
>         user    0m2.129s
>         sys     0m0.020s
>
>     Parallel:
>         real    0m2.024s
>         user    0m2.039s
>         sys     0m0.028s
>
> As you can see, the parallel count is only slightly (albeit consistently) better than Kernighan.

Did you run the benchmark on different CPU architectures?

I recently benchmarked code which performed a calculation one of two ways: the first calculated for real, the other used lookup tables. I ran the test on an i7 Haswell and an i5 Skylake. The i5 performed faster with lookup tables; the i7 performed worse with lookup tables. The performance impact was similar to your first benchmark, Parallel vs. Kernighan.
Jun 04 2020
On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote:
[...]
> Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here.

On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote:
[...]
> You should try to use the popcnt instruction of the CPU (in asm or with intrinsics). It's at least 3 or 4 orders of magnitude (i.e. between 1000 and 10000 times) faster than any of these tricks. I made the change last year in my work codebase since we've left the SPARC servers (they had a very slow implementation of popcount). All modern CPUs have the instruction.

Haha, I learn something new every day. Didn't even know there was such a CPU instruction as popcnt. :-)

But what's more interesting is the result of using core.bitop.popcnt: as it turns out, it runs faster than Kernighan when the input ulongs have a high bit count, but *slower* than Kernighan when the inputs have low bit density. This was very puzzling to me, since a CPU instruction is supposed to be orders of magnitude faster, as Patrick said, so I looked at the disassembly.

As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the instruction. :-( So the LDC intrinsic reverted to a software implementation, which is essentially the parallel bitcount I wrote, except better. Evidently, I miscalculated the impact of data dependency between instructions, or more likely, the AMD CPU doesn't have hyperthreading, so splitting the ulong into upper/lower halves did more harm than good for the calculation. But nonetheless, even the better implementation of parallel bitcount loses out to Kernighan when the input does not have a high bit density.

I do have a VPS that runs on an Intel CPU, but apparently also an older one that doesn't support SSE4 (either that, or the hypervisor filtered that out for whatever reason, but I doubt this).

So again we see the complex landscape of optimization: not only does the optimality depend on input data, it also depends on the hardware. :-)

T

-- 
Век живи - век учись. А дураком помрёшь. (Live a century, learn a century -- and you'll still die a fool.)
Jun 04 2020
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the number of 1 bits in a ulong inside a (very) hot inner loop. There's of course the naïve method (`val` is the input ulong):
>
>     uint count;
>     foreach (i; 0 .. 64)
>     {
>         if (val & (1L << i))
>             count++;
>     }
>     return count;
>
> Knowing that this code is inside a hotspot, I wrote a parallel bit counter based on the idea from [1]:
>
>     // Parallel bit count.
>     // We count upper/lower halves separately, interspersed to maximize
>     // hyperthreading. >:-)
>     uint v1 = cast(uint)(val >> 32);
>     uint v2 = cast(uint)val & 0xFFFF_FFFF;
>
>     v1 = v1 - ((v1 >> 1) & 0x5555_5555);
>     v2 = v2 - ((v2 >> 1) & 0x5555_5555);
>     v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
>     v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
>     v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
>     v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
>     v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
>     v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
>     v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
>     v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);

wikipedia recommends this:

    x -= (x >> 1) & m1;
    x = (x & m2) + ((x >> 2) & m2);
    x = (x + (x >> 4)) & m4;
    x += x >> 8;    // put count of each 16 bits into their lowest 8 bits
    x += x >> 16;   // put count of each 32 bits into their lowest 8 bits
    x += x >> 32;   // put count of each 64 bits into their lowest 8 bits
    return x & 0x7f;

so in the last 4 lines you do too much work. better use the intrinsics:

    popcnt(val >> 32) + popcnt(val & uint.max)
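(Spelled out with the mask constants that the snippet above leaves implicit, a full 64-bit version of that Wikipedia routine might look like this -- a sketch, not code from the thread:)

    ulong popcount64(ulong x)
    {
        enum ulong m1 = 0x5555_5555_5555_5555;  // one bit of every pair
        enum ulong m2 = 0x3333_3333_3333_3333;  // two bits of every nibble
        enum ulong m4 = 0x0F0F_0F0F_0F0F_0F0F;  // four bits of every byte
        x -= (x >> 1) & m1;              // count of each 2 bits into those 2 bits
        x = (x & m2) + ((x >> 2) & m2);  // count of each 4 bits into those 4 bits
        x = (x + (x >> 4)) & m4;         // count of each 8 bits into those 8 bits
        x += x >> 8;                     // count of each 16 bits into their lowest 8 bits
        x += x >> 16;                    // count of each 32 bits into their lowest 8 bits
        x += x >> 32;                    // count of each 64 bits into their lowest 8 bits
        return x & 0x7f;
    }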
Jun 04 2020
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the number of 1 bits in a ulong inside a (very) hot inner loop.

This made me wonder how well a lookup table approach would compare to these logical methods (and it was an excuse to blow off work a bit to fiddle with D).

The popcnt instruction is going to blow away all of these, of course, but it still seems like a worthwhile exercise, if only to get a feel for how purely in-register logic performance compares to cache access performance on modern "big" CPUs.

    ttk kirov:/home/ttk/prog/D$ ./popcount
    Starting tests, which take about 10 seconds each
    Naive: 5596740 iter/sec
    Parallel: 166352970 iter/sec
    Kernighan: 47668800 iter/sec
    Lookup 8-bit: 467161540 iter/sec
    Lookup 16-bit: 826179570 iter/sec

It surprised me a bit that the 16-bit lookup wasn't closer to 2.0x the performance of the 8-bit lookup, since both lookup tables fit well within the L1 cache.

Source is here .. a bit large, since I manually unrolled loops 100x to minimize the impact of looping logic (particularly the now() call):

    http://ciar.org/h/popcount.d

I ran this on an Intel Core i7-9750M 2.6GHz, with 192KB L1 / 1.5MB L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O popcount.d" on Slackware-current.
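(For context, the lookup-table approach boils down to something like the following sketch -- the table here is built at compile time, unlike the linked source, and the names are mine:)

    // 256-entry table: number of set bits in every possible byte value.
    static immutable ubyte[256] BYTE_POPCOUNT = () {
        ubyte[256] t;
        foreach (i; 0 .. 256)
        {
            uint v = i;
            ubyte c;
            for (; v; c++)
                v &= v - 1;     // Kernighan, evaluated at compile time (CTFE)
            t[i] = c;
        }
        return t;
    }();

    uint popcountLookup8(ulong val)
    {
        uint count;
        foreach (i; 0 .. 8)
            count += BYTE_POPCOUNT[(val >> (8 * i)) & 0xFF];
        return count;
    }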
Jun 05 2020
On Fri, Jun 05, 2020 at 07:12:43PM +0000, ttk via Digitalmars-d wrote:
> On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
>> So recently I needed to implement a function to count the number of 1 bits in a ulong inside a (very) hot inner loop.
>
> This made me wonder how well a lookup table approach would compare to these logical methods (and it was an excuse to blow off work a bit to fiddle with D).

Hooray for having an excuse to play with D! ;-)

> The popcnt instruction is going to blow away all of these, of course, but it still seems like a worthwhile exercise, if only to get a feel for how purely in-register logic performance compares to cache access performance on modern "big" CPUs.
>
>     ttk kirov:/home/ttk/prog/D$ ./popcount
>     Starting tests, which take about 10 seconds each
>     Naive: 5596740 iter/sec
>     Parallel: 166352970 iter/sec
>     Kernighan: 47668800 iter/sec
>     Lookup 8-bit: 467161540 iter/sec
>     Lookup 16-bit: 826179570 iter/sec
>
> It surprised me a bit that the 16-bit lookup wasn't closer to 2.0x the performance of the 8-bit lookup, since both lookup tables fit well within the L1 cache.

You might be running into alignment issues, possibly. For one thing, why does LOOKUP_1 have 257 elements instead of 256? That last byte is never accessed. Similarly, LOOKUP_2 doesn't need 65537 elements; that last element is redundant.

> Source is here .. a bit large, since I manually unrolled loops 100x to minimize the impact of looping logic (particularly the now() call):
>
>     http://ciar.org/h/popcount.d

There's no need to manually unroll the code: just use static foreach, since the code is identical every iteration! :-P

> I ran this on an Intel Core i7-9750M 2.6GHz, with 192KB L1 / 1.5MB L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O popcount.d" on Slackware-current.

I don't trust performance results with dmd, so I tested it on `ldc2 -O3` instead:

    Starting tests, which take about 10 seconds each
    Naive: 25665230 iter/sec
    Parallel: 124692900 iter/sec
    Kernighan: 40680010 iter/sec
    Lookup 8-bit: 1532505390 iter/sec
    Lookup 16-bit: 1690195340 iter/sec

Something is up with the 16-bit lookups. Why is it only barely faster than the 8-bit lookup?

Disassembling shows that the code was *not* duplicated 100 times; apparently LDC's aggressive optimizer noticed that `count` is reassigned every unrolled iteration, so all except the last are no-ops. You need to do something with `count` that isn't immediately killed by the next unrolled iteration, otherwise your painstakingly copy-n-pasted iterations will all be elided by the optimizer. :-P OR'ing `count` into ACCUMULATOR after every count is probably what you need to do here.

T

-- 
"640K ought to be enough" -- Bill G. (allegedly), 1984.
"The Internet is not a primary goal for PC usage" -- Bill G., 1995.
"Linux has no impact on Microsoft's strategy" -- Bill G., 1999.
Jun 05 2020
On Fri, Jun 05, 2020 at 12:45:36PM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
> You might be running into alignment issues, possibly. For one thing, why does LOOKUP_1 have 257 elements instead of 256? That last byte is never accessed. Similarly, LOOKUP_2 doesn't need 65537 elements; that last element is redundant.
[...]
> Disassembling shows that the code was *not* duplicated 100 times; apparently LDC's aggressive optimizer noticed that `count` is reassigned every unrolled iteration, so all except the last are no-ops. You need to do something with `count` that isn't immediately killed by the next unrolled iteration, otherwise your painstakingly copy-n-pasted iterations will all be elided by the optimizer. :-P OR'ing `count` into ACCUMULATOR after every count is probably what you need to do here.
[...]

So I made the above changes, and here are the new measurements:

    Starting tests, which take about 10 seconds each
    Naive: 25272530 iter/sec
    Parallel: 133259870 iter/sec
    Kernighan: 36447830 iter/sec
    Lookup 8-bit: 133837870 iter/sec
    Lookup 16-bit: 283343300 iter/sec

Now the 16-bit lookup makes a lot more sense! :-D (I also confirmed via disassembly that the code is indeed unrolled 100 times, and not just elided.) Interestingly, the parallel version seems to perform almost just as well as the 8-bit lookup.

Here's the fixed code (it's a lot shorter :-P):

------------------------------------------------------
import std.stdio;
import std.datetime.systime;

ubyte[257] LOOKUP_1;    // 8-bit lookup table
ubyte[65537] LOOKUP_2;  // 16-bit lookup table

const ulong STEPPER = 0x110D0B0705030201;  // val increment step, to minimize cache effects
ulong ACCUMULATOR = 0;  // might keep optimizer from noticing we never use count and tossing it
double TEST_DURATION = 10.0;

// High-resolution time()-like, copied from VS_Common_D:
double now()
{
    double tm = Clock.currStdTime();
    return (tm / 10000000.0) - 62135596800.0;
}

ubyte popcount_naive(ulong val)
{
    ubyte count;
    foreach (i; 0 .. 64)
        if (val & (1L << i))
            count++;
    return count;
}

ulong microbench_naive()
{
    ulong iter_count = 0;
    ulong val = 0;
    uint count = 0;
    double elapsed;
    double t0 = now();
    while ((elapsed = now() - t0) < TEST_DURATION)
    {
        // unrolling 100x should be enough to amortize overhead of calling now():
        static foreach (_; 0 .. 100)
        {
            foreach (i; 0 .. 64)
                if (val & (1L << i))
                    count++;
            val += STEPPER;
            ACCUMULATOR ^= count;
        }
        iter_count += 100;
    }
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_parallel()
{
    ulong iter_count = 0;
    ulong val = 0;
    uint count = 0;
    uint v1;
    uint v2;
    double elapsed;
    double t0 = now();
    while ((elapsed = now() - t0) < TEST_DURATION)
    {
        // unrolling 100x should be enough to amortize overhead of calling now():
        static foreach (_; 0 .. 100)
        {
            v1 = cast(uint)(val >> 32);
            v2 = cast(uint)val & 0xFFFF_FFFF;
            v1 = v1 - ((v1 >> 1) & 0x5555_5555);
            v2 = v2 - ((v2 >> 1) & 0x5555_5555);
            v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
            v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
            v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
            v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
            v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
            v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
            v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
            v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);
            count += cast(ulong)v1 + cast(ulong)v2;
            val += STEPPER;
            ACCUMULATOR ^= count;
        }
        iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_kernighan()
{
    ulong iter_count = 0;
    ulong val = 0;
    ulong tval = 0;
    uint count = 0;
    double elapsed;
    double t0 = now();
    while ((elapsed = now() - t0) < TEST_DURATION)
    {
        // unrolling 100x should be enough to amortize overhead of calling now():
        static foreach (_; 0 .. 100)
        {
            for (count = 0; val; count++)
                val &= val - 1;
            val = tval += STEPPER;
            ACCUMULATOR ^= count;
        }
        iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_lookup_8bit()
{
    ulong iter_count = 0;
    ulong val = 0;
    ubyte *av = cast(ubyte*) &val;
    uint count = 0;
    double elapsed;
    double t0 = now();
    while ((elapsed = now() - t0) < TEST_DURATION)
    {
        // unrolling 100x should be enough to amortize overhead of calling now():
        // writeln("top: iter = ", iter_count, ", val = ", val, ", av = ", av, ", av0 = ", av[0], ", av1 = ", av[1], ", av2 = ", av[2], ", av3 = ", av[3], ", av4 = ", av[4], ", av5 = ", av[5], ", av6 = ", av[6], ", av7 = ", av[7]);
        static foreach (_; 0 .. 100)
        {
            count = LOOKUP_1[av[0]] + LOOKUP_1[av[1]] + LOOKUP_1[av[2]] + LOOKUP_1[av[3]]
                  + LOOKUP_1[av[4]] + LOOKUP_1[av[5]] + LOOKUP_1[av[6]] + LOOKUP_1[av[7]];
            val += STEPPER;
            ACCUMULATOR ^= count;
        }
        iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_lookup_16bit()
{
    ulong iter_count = 0;
    ulong val = 0;
    ushort *av = cast(ushort*) &val;
    uint count = 0;
    double elapsed;
    double t0 = now();
    while ((elapsed = now() - t0) < TEST_DURATION)
    {
        // unrolling 100x should be enough to amortize overhead of calling now():
        static foreach (_; 0 .. 100)
        {
            count = LOOKUP_2[av[0]] + LOOKUP_2[av[1]] + LOOKUP_2[av[2]] + LOOKUP_2[av[3]];
            val += STEPPER;
            ACCUMULATOR ^= count;
        }
        iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

int main()
{
    // initialize the lookup tables:
    foreach (i; 0 .. 256)
        LOOKUP_1[i] = popcount_naive(i);
    foreach (i; 0 .. 65536)
        LOOKUP_2[i] = popcount_naive(i);

    writeln("Starting tests, which take about ", TEST_DURATION, " seconds each");
    writeln("Naive: ", microbench_naive(), " iter/sec");
    writeln("Parallel: ", microbench_parallel(), " iter/sec");
    writeln("Kernighan: ", microbench_kernighan(), " iter/sec");
    writeln("Lookup 8-bit: ", microbench_lookup_8bit(), " iter/sec");
    writeln("Lookup 16-bit: ", microbench_lookup_16bit(), " iter/sec");

    // trick optimizer into thinking ACCUMULATOR, and thus count, is important:
    return cast(int)(ACCUMULATOR / STEPPER);
}
------------------------------------------------------

T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi
Jun 05 2020
On Friday, 5 June 2020 at 19:12:43 UTC, ttk wrote:
> Source is here .. a bit large, since I manually unrolled loops 100x to minimize the impact of looping logic (particularly the now() call):
>
>     http://ciar.org/h/popcount.d

Jesus. Shouldn't you be able to generate that instead of copy-pasting? It's D, after all.
Jun 05 2020
On Saturday, 6 June 2020 at 04:19:44 UTC, sebasg wrote:
> On Friday, 5 June 2020 at 19:12:43 UTC, ttk wrote:
>> Source is here .. a bit large, since I manually unrolled loops 100x to minimize the impact of looping logic (particularly the now() call):
>>
>>     http://ciar.org/h/popcount.d
>
> Jesus. Shouldn't you be able to generate that instead of copy-pasting? It's D, after all.

A fair point. Duplicating it in emacs was literally five keystrokes, but I should rewrite it to use templates anyway (more excuse to fiddle with D!).

Also, I figured out why the 16-bit lookup wasn't closer to 2.0x the performance of the 8-bit lookup. The L1 cache is 192KB, but that's divided across its six cores, so this single-threaded program only got 32KB of L1 to play with. The 8-bit lookup table fit in that, but only half of the 16-bit lookup table did.
Jun 08 2020
On Mon, Jun 08, 2020 at 06:08:55PM +0000, ttk via Digitalmars-d wrote:
> On Saturday, 6 June 2020 at 04:19:44 UTC, sebasg wrote:
> [...]
>> Jesus. Shouldn't you be able to generate that instead of copy-pasting? It's D, after all.
>
> A fair point. Duplicating it in emacs was literally five keystrokes, but I should rewrite it to use templates anyway (more excuse to fiddle with D!).

No templates needed, just use static foreach. ;-) (See my other reply to your code.)

> Also, I figured out why the 16-bit lookup wasn't closer to 2.0x the performance of the 8-bit lookup. The L1 cache is 192KB, but that's divided across its six cores, so this single-threaded program only got 32KB of L1 to play with. The 8-bit lookup table fit in that, but only half of the 16-bit lookup table did.

I think your results are skewed; your copy-n-pasted block of code repeatedly overwrites `count` without assigning its value anywhere else, so the optimizer deleted all except the last block of code! See my other reply.

T

-- 
Tech-savvy: euphemism for nerdy.
Jun 08 2020