digitalmars.D - A few experiments with partial unrolling
- Andrei Alexandrescu (108/108) Dec 25 2010 I was looking at http://d.puremagic.com/issues/show_bug.cgi?id=4725 and
- bearophile (4/6) Dec 25 2010 But this is true for a microbenchmark. In a real program the code half p...
- Andrei Alexandrescu (6/10) Dec 25 2010 Yah, that's what I think unroll should be a generic function leaving it
- Andrei Alexandrescu (6/9) Dec 25 2010 [snip]
I was looking at http://d.puremagic.com/issues/show_bug.cgi?id=4725 and thought I'd experiment a bit with a generic unrolling routine (code at the end of this). So I wrote a general parameterized unroll function that accepts a function, a number of times to unroll, and a random-access range.

The idea was to experiment with unroll and see how it works for sums. I also wrote two baselines - a brute force one and one that unrolls but is specialized for addition.

The results are interesting. First, partial unrolling does help a lot - the brute force baseline takes 1.2s on my machine and unrollSum takes only 0.2s. Second, there is an issue with inlining: unroll takes 0.37s, almost twice as much as the hand-unrolled version. Third, it looks like larger unrolling limits are better - I only got a plateau at 128!

Andrei

import std.algorithm, std.conv, std.date, std.functional, std.range,
    std.stdio, std.string, std.traits;

ElementType!R unroll(alias fun, size_t times, R)(R r)
if (isRandomAccessRange!R && hasLength!R)
{
    // times must be a power of two greater than 1
    static assert(times > 1 && !(times & (times - 1)));
    // Generates n copies of s, substituting 0, 1, ... n-1 for indexPlaceholder
    static string repeat(string s, string indexPlaceholder, size_t n)
    {
        string r = "";
        foreach (i; 0 .. n)
        {
            r ~= replace(s, indexPlaceholder, .to!string(i));
        }
        return r;
    }
    alias binaryFun!fun f;
    typeof(return) result = 0;
    // Round length down to a multiple of times
    immutable adjusted = r.length & ~cast(size_t) (times - 1);
    size_t i = 0;
    for (; i < adjusted; i += times)
    {
        ElementType!R[times - 1] temp = void;
        // First level of the reduction tree: pairwise over the input block
        mixin(repeat("temp[X] = f(r[i + (X + X)], r[i + (X + X + 1)]);",
            "X", times / 2));
        // Remaining levels: pairwise over the partial results
        mixin(repeat("temp[X + times / 2] = f(temp[X + X], temp[X + X + 1]);",
            "X", times / 2 - 1));
        result = f(result, temp[$ - 1]);
    }
    for (; i != r.length; ++i)
    {
        // Scalar tail; uses f rather than += so that unroll stays generic
        result = f(result, r[i]);
    }
    return result;
}

ElementType!R unrollSum(size_t times, R)(R r)
if (isRandomAccessRange!R && hasLength!R)
{
    static assert(times > 1 && !(times & (times - 1)));
    static string repeat(string s, string indexPlaceholder, size_t n)
    {
        string r = "";
        foreach (i; 0 .. n)
        {
            r ~= replace(s, indexPlaceholder, .to!string(i));
        }
        return r;
    }
    typeof(return) result = 0;
    immutable adjusted = r.length & ~cast(size_t) (times - 1);
    size_t i = 0;
    for (; i < adjusted; i += times)
    {
        ElementType!R[times - 1] temp = void;
        mixin(repeat("temp[X] = r[i + (X + X)] + r[i + (X + X + 1)];",
            "X", times / 2));
        mixin(repeat("temp[X + times / 2] = temp[X + X] + temp[X + X + 1];",
            "X", times / 2 - 1));
        result += temp[$ - 1];
    }
    for (; i != r.length; ++i)
    {
        result += r[i];
    }
    return result;
}

ElementType!R sum0(R)(R r) if (isInputRange!R)
{
    typeof(return) result = 0;
    for (; !r.empty; r.popFront())
    {
        result += r.front;
    }
    return result;
}

void main(string[] args)
{
    enum times = 8;
    auto a = array(iota(100_000_001));
    double t = getUTCtime();
    if (args.length == 1)
    {
        writeln("baseline");
        writeln(sum0(a));
    }
    else if (args[1] == "unroll")
    {
        writeln(args[1]);
        writeln(unroll!("a + b", times)(a));
    }
    else if (args[1] == "unrollSum")
    {
        writeln(args[1]);
        writeln(unrollSum!(times)(a));
    }
    else
    {
        assert(0);
    }
    writeln((getUTCtime() - t) / ticksPerSecond);
}
Dec 25 2010
Andrei:

> Third, it looks like larger unrolling limits are better - I only got a plateau at 128!

But this is true for a microbenchmark. In a real program the code half of the CPU L1 cache is quite limited, so the more code you have to push through that little cache (the code of many different functions), the more cache misses you get, and this slows the program down. This is why too much unrolling or too much inlining is bad, and it's why I have unrolled my sum() only once.

Bye,
bearophile
Dec 25 2010
On 12/25/10 3:07 PM, bearophile wrote:

> Andrei:
>> Third, it looks like larger unrolling limits are better - I only got a plateau at 128!
>
> But this is true for a microbenchmark. In a real program the code half of the CPU L1 cache is quite limited, so the more code you have to push through that little cache (the code of many different functions), the more cache misses you get, and this slows the program down. This is why too much unrolling or too much inlining is bad, and it's why I have unrolled my sum() only once.

Yah, that's why I think unroll should be a generic function, leaving it to the user to choose the parameters. Before that I'd like to generalize the function a bit more - right now it can only do reduce-type workloads on associative functions.

Andrei
Dec 25 2010
On 12/25/10 2:54 PM, Andrei Alexandrescu wrote:

> I was looking at http://d.puremagic.com/issues/show_bug.cgi?id=4725 and thought I'd experiment a bit with a generic unrolling routine (code at the end of this).

[snip]

Ignore my previous numbers, I forgot -O -release -inline... with those flags, the improvements are still considerable, from 220ms to 72ms, and unroll does as well as unrollSum.

Andrei
Dec 25 2010