## digitalmars.D - A few experiments with partial unrolling

• Andrei Alexandrescu (108/108) Dec 25 2010 I was looking at http://d.puremagic.com/issues/show_bug.cgi?id=4725 and
• bearophile (4/6) Dec 25 2010 But this is true for a microbenchmark. In a real program the code half p...
• Andrei Alexandrescu (6/10) Dec 25 2010 Yah, that's what I think unroll should be a generic function leaving it
• Andrei Alexandrescu (6/9) Dec 25 2010 [snip]
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```I was looking at http://d.puremagic.com/issues/show_bug.cgi?id=4725 and
thought I'd experiment a bit with a generic unrolling routine (code at
the end of this). So I wrote a general parameterized unroll function
that accepts a function, a number of times to unroll, and a
random-access range. The idea was to experiment with unroll and see how
it works for sums. I also wrote two baselines - a brute force one and
one that unrolls but is specialized for addition.

The results are interesting. First, partial unrolling does help a lot -
the brute force baseline takes 1.2s on my machine and unrollSum takes
only 0.2s.

Second, there is an issue with inlining: unroll takes 0.37s, almost
twice as much as the hand-unrolled version.

Third, it looks like larger unrolling limits are better - I only hit a
plateau at 128!

Andrei

#!/usr/bin/env rdmd
import std.algorithm, std.conv, std.date, std.functional, std.range,
std.stdio, std.string, std.traits;

ElementType!R unroll(alias fun, size_t times, R)(R r)
if (isRandomAccessRange!R && hasLength!R)
{
    static assert(times > 1 && !(times & (times - 1)));

    static string repeat(string s, string indexPlaceholder, size_t n)
    {
        string r = "";
        foreach (i; 0 .. n)
        {
            r ~= replace(s, indexPlaceholder, .to!string(i));
        }
        return r;
    }

    alias binaryFun!fun f;
    typeof(return) result = 0;
    immutable adjusted = r.length & ~cast(size_t) (times - 1);
    size_t i = 0;
    for (; i < adjusted; i += times)
    {
        ElementType!R[times - 1] temp = void;
        mixin(repeat("temp[X] = f(r[i + (X + X)], r[i + (X + X + 1)]);",
                "X", times / 2));
        mixin(repeat("temp[X + times / 2] = f(temp[X + X], temp[X + X + 1]);",
                "X", times / 2 - 1));
        result = f(result, temp[$ - 1]);
    }
    for (; i != r.length; ++i)
    {
        result += r[i];
    }
    return result;
}

ElementType!R unrollSum(size_t times, R)(R r)
if (isRandomAccessRange!R && hasLength!R)
{
    static assert(times > 1 && !(times & (times - 1)));

    static string repeat(string s, string indexPlaceholder, size_t n)
    {
        string r = "";
        foreach (i; 0 .. n)
        {
            r ~= replace(s, indexPlaceholder, .to!string(i));
        }
        return r;
    }

    typeof(return) result = 0;
    immutable adjusted = r.length & ~cast(size_t) (times - 1);
    size_t i = 0;
    for (; i < adjusted; i += times)
    {
        ElementType!R[times - 1] temp = void;
        mixin(repeat("temp[X] = r[i + (X + X)] + r[i + (X + X + 1)];",
                "X", times / 2));
        mixin(repeat("temp[X + times / 2] = temp[X + X] + temp[X + X + 1];",
                "X", times / 2 - 1));
        result += temp[$ - 1];
    }
    for (; i != r.length; ++i)
    {
        result += r[i];
    }
    return result;
}

ElementType!R sum0(R)(R r) if (isInputRange!R)
{
    typeof(return) result = 0;
    for (; !r.empty; r.popFront()) {
        result += r.front;
    }
    return result;
}

void main(string[] args)
{
    enum times = 8;
    auto a = array(iota(100_000_001));
    double t = getUTCtime();
    if (args.length == 1) {
        writeln("baseline");
        writeln(sum0(a));
    } else if (args[1] == "unroll") {
        writeln(args[1]);
        writeln(unroll!("a + b", times)(a));
    } else if (args[1] == "unrollSum") {
        writeln(args[1]);
        writeln(unrollSum!(times)(a));
    } else {
        assert(0);
    }
    writeln((getUTCtime() - t) / ticksPerSecond);
}
```
Dec 25 2010
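[Editor's note: for readers less familiar with D's string mixins, the loop body that `unrollSum` generates at compile time can be sketched directly in C. This is a minimal illustration fixed at `times = 8`; the `unroll_sum8` name and the `long` accumulator are assumptions of the sketch, not part of the thread's code.]

```c
#include <stddef.h>

/* 8-way partial unroll with a pairwise "tree" combine, mirroring what
   the D mixins expand to: four independent pair sums, then a combine,
   then a fold into the running accumulator. */
static long unroll_sum8(const int *a, size_t n)
{
    enum { TIMES = 8 };
    long result = 0;
    /* round n down to a multiple of 8, as with `length & ~(times - 1)` */
    size_t adjusted = n & ~(size_t)(TIMES - 1);
    size_t i = 0;
    for (; i < adjusted; i += TIMES) {
        /* first level: independent pair sums (no serial dependency) */
        long t0 = (long)a[i + 0] + a[i + 1];
        long t1 = (long)a[i + 2] + a[i + 3];
        long t2 = (long)a[i + 4] + a[i + 5];
        long t3 = (long)a[i + 6] + a[i + 7];
        /* second level: combine the pairs, then fold into the accumulator */
        result += (t0 + t1) + (t2 + t3);
    }
    for (; i != n; ++i)  /* leftover elements past the unrolled part */
        result += a[i];
    return result;
}
```

The point of the tree shape is that the first-level additions carry no dependency on each other, so a superscalar CPU can issue them in parallel, unlike the strictly serial `result += a[i]` chain in the baseline.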
bearophile <bearophileHUGS lycos.com> writes:
```Andrei:

> Third, it looks like larger unrolling limits is better - I only got a
> plateau at 128!

But this is true only for a microbenchmark. In a real program the code half
of the CPU L1 cache (the instruction cache) is quite limited, so the more
code you have to push through that little cache (code from different
functions), the more cache misses you have, and this slows down the code.
This is why too much unrolling or too much inlining is bad, and this is why
I have unrolled my sum() only once.

Bye,
bearophile
```
Dec 25 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 12/25/10 3:07 PM, bearophile wrote:
> Andrei:
>
>> Third, it looks like larger unrolling limits is better - I only got a
>> plateau at 128!
>
> But this is true for a microbenchmark. In a real program the code half of
> the CPU L1 cache is quite limited, so the more code you have to push through
> that little cache (code of different functions), the more cache misses you
> have, and this slows down the code. This is why too much unrolling or too
> much inlining is bad, and this is why I have unrolled my sum() only once.

Yah, that's why I think unroll should be a generic function, leaving it
to the user to choose the parameters. Before that I'd like to generalize
the function a bit more - right now it can only handle reduce-type
workloads over associative functions.

Andrei
```
Dec 25 2010
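[Editor's note: the generalization Andrei describes - an unrolled reduce parameterized over the combining function - can be sketched in C with a function pointer. This is a hypothetical illustration (the `unrolled_reduce4` helper, the 4-way factor, and the `seed` parameter are assumptions of the sketch); it relies on the combining function being associative, exactly the restriction noted above.]

```c
#include <stddef.h>

typedef long (*combine_fn)(long, long);

/* Unrolled reduce: tree-combine each block of 4, then fold the block
   result into the accumulator with the same user-supplied function.
   `seed` must be an identity element for `f` (e.g. 0 for addition). */
static long unrolled_reduce4(const int *a, size_t n, combine_fn f, long seed)
{
    long result = seed;
    size_t adjusted = n & ~(size_t)3;  /* round down to a multiple of 4 */
    size_t i = 0;
    for (; i < adjusted; i += 4) {
        long t0 = f(a[i + 0], a[i + 1]);  /* independent pair combines */
        long t1 = f(a[i + 2], a[i + 3]);
        result = f(result, f(t0, t1));
    }
    for (; i != n; ++i)                   /* leftover elements */
        result = f(result, a[i]);
    return result;
}
```

The tree regrouping is only valid because `f` is assumed associative - `(a+b)+(c+d)` equals `((a+b)+c)+d` - which is why the generic version cannot accept arbitrary binary functions.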
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 12/25/10 2:54 PM, Andrei Alexandrescu wrote:
> I was looking at http://d.puremagic.com/issues/show_bug.cgi?id=4725 and
> thought I'd experiment a bit with a generic unrolling routine (code at
> the end of this).

[snip]

Ignore my previous numbers, I forgot -O -release -inline... with those,
improvements are still considerable, from 220ms to 72ms, and unroll does
as well as unrollSum.

Andrei
```
Dec 25 2010