
digitalmars.D - Compiler performance with my ridiculous Binderoo code

reply Ethan Watson <gooberman gmail.com> writes:
I've been keeping in contact with Stefan and providing him 
example code to test with his CTFE engine. He's been saying for a 
while that templates are slow. So I decided to finally work out 
just how slow we're talking about here.

I can't show the exact code I'm running with, but needless to say 
this particular test case crashes the 32-bit dmd.exe that comes 
with the official downloads. I've had to build my own 64-bit 
version... which also eventually crashes but only after consuming 
8 gigabytes of memory.

Using Visual Studio 2015's built in sample-based profiler, I 
decided to see just what the compiler was doing on a release 
build with the problem code.

http://pastebin.com/dcwwCp28

This is a copy of the call tree where DMD spends most of its time. 
If you don't know how to read these profiles, the key thing to 
notice is that it's 130+ functions deep in the call stack. Plenty 
of template instances, plenty of static if, plenty of static 
foreach... in other words, I'm doing quite a bit to tax the 
compiler.

Which got me thinking. I've been rolling things over to 
CTFE-generated string mixins lately in anticipation of the 
speedups Stefan will get us. But there's one bit of template code 
that I haven't touched at all.

https://github.com/Remedy-Entertainment/binderoo/blob/master/binderoo_client/d/src/binderoo/binding/serialise.d

This is a simple set of templated functions that parses objects 
and serialises them to JSON (the reason I'm not just using 
std.json is because I need custom handling for pointers and 
reference types). But it turns out this is the killer. As part 
of binding an object for Binderoo's rapid iteration purposes, it 
generates a serialise/deserialise call that instantiates for each 
type found. If I turn that code generation off, the code 
compiles. If I remove an auto-generated file containing 1000+ 
structs, whose tree-like instances are embedded in the only 
object I apply Binderoo's binding to in that entire module, it 
compiles in 45% of the time (12 seconds versus 26 seconds).
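For readers who haven't seen the pattern, here's a minimal sketch of what such a recursive serialiser looks like. The names and output format are illustrative, not Binderoo's actual code; the point is that every distinct field type in the object graph forces a fresh template instantiation.

```d
// Hypothetical sketch of the pattern described above: a templated
// serialise that recurses into every member, so each distinct type
// encountered produces its own template instantiation.
import std.conv : to;

string serialise(T)(ref T obj)
{
    static if (is(T == struct))
    {
        string result = "{";
        // This foreach is unrolled at compile time; every field type
        // seen here instantiates serialise!FieldType anew.
        foreach (i, ref field; obj.tupleof)
        {
            result ~= `"` ~ __traits(identifier, T.tupleof[i]) ~ `":`
                ~ serialise(field) ~ ",";
        }
        return result ~ "}";
    }
    else
    {
        return obj.to!string;
    }
}

struct Inner { int x; }
struct Outer { Inner a; double b; }

void main()
{
    auto o = Outer(Inner(1), 2.5);
    assert(serialise(o) == `{"a":{"x":1,},"b":2.5,}`);
}
```

(The stray trailing commas are left in to keep the sketch short.)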

The hot path without that 1000+ struct file actually goes through 
the AttribDeclaration.semantic and 
UserAttributeDeclaration.semantic code path, with the OS itself 
doing the most work for a single function thanks to 
Outbuffer::writeString needing to realloc string memory in 
dmd\backend\outbuf.c.

The hot path with that 1000+ struct file spends the most time in 
TemplateInstance.semantic, specifically with calls to 
TemplateDeclaration.findExistingInstance, 
TemplateInstance.tryExpandMembers, and 
TemplateInstance.findBestMatch taking up 90%+ of its time. 
findExistingInstance spends most of its time in arrayObjectMatch 
in dtemplate.d, which subsequently spends most of its time in the 
match function in the same file (which calls virtuals on 
RootObject to do comparisons).
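As a rough mental model of that hot path (this is an illustration of the shape of the work, not dmd's actual code), the matching walks both argument lists and dispatches a virtual comparison per element, recursing when an argument is itself a nested instance:

```d
// Sketch: pairwise argument matching through a virtual call, in the
// spirit of arrayObjectMatch/match on RootObject in dtemplate.d.
class Node
{
    bool matches(Node other) { return this is other; }
}

class Leaf : Node
{
    int value;
    this(int v) { value = v; }
    override bool matches(Node other)
    {
        auto o = cast(Leaf) other;
        return o !is null && o.value == value;
    }
}

class Seq : Node
{
    Node[] args;
    this(Node[] a) { args = a; }
    override bool matches(Node other)
    {
        auto o = cast(Seq) other;
        if (o is null || o.args.length != args.length)
            return false;
        foreach (i, arg; args)           // pairwise comparison,
            if (!arg.matches(o.args[i])) // recursing into nested seqs
                return false;
        return true;
    }
}

void main()
{
    // AliasSeq!(AliasSeq!(1), 2) compared against an equal shape:
    auto a = new Seq([new Seq([new Leaf(1)]), new Leaf(2)]);
    auto b = new Seq([new Seq([new Leaf(1)]), new Leaf(2)]);
    assert(a.matches(b));
    assert(!a.matches(new Seq([new Leaf(2)])));
}
```

With deeply nested instances, each lookup repeats this walk over the whole subtree, which is where the time goes.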

At the very least, I now have an idea of which parts of the 
compiler I'm taxing and can attempt to write around that. But I'm 
also tempted to go in and optimise those parts of the compiler.
Dec 11 2016
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 16:26:29 UTC, Ethan Watson wrote:
  But
 I'm also tempted to go in and optimise those parts of the 
 compiler.
I already did what I could to optimize those parts. Whatever you manage to squeeze out, it's not going to do much good. The templates you are using are by their nature recursive, and the template matching is recursive as well, leading to a rather nasty complexity curve that looks like O(n!). The template system needs to be fixed as a whole. I will do this as soon as CTFE is done, and I can already say that it will take time. CTFE is straightforward compared to what I will have to do with the templates.
Dec 11 2016
prev sibling parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Sunday, 11 December 2016 at 16:26:29 UTC, Ethan Watson wrote:
 At the very least, I now have an idea of which parts of the 
 compiler I'm taxing and can attempt to write around that. But 
 I'm also tempted to go in and optimise those parts of the 
 compiler.
Have a look at this issue: https://issues.dlang.org/show_bug.cgi?id=16513 In a nutshell: Dmd puts template instances in an AA using a terrible hash function. When looking for a match it does an expensive comparison for each collision. The hash function is easily fixed but it does not obviate the expensive comparison, so it trades hashing time for comparison time. Therefore the performance impact of fixing it is unclear. Martin Nowak is suggesting that using the mangled name is the way forward.
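To make the failure mode concrete, here's a toy version of what the bug report describes (the hash below is invented for illustration; dmd's real one sums symbol pointers and integer values): distinct argument lists that hash identically force the table to fall back on a full, expensive comparison per collision.

```d
// Toy illustration: a hash that merely sums its inputs collapses many
// distinct instances into one bucket, so lookups degenerate into
// chains of element-by-element comparisons.
struct Inst { long[] args; }

size_t badHash(const Inst i)
{
    size_t h = 0;
    foreach (a; i.args)
        h += a;   // sums only, in the spirit of the reported bug
    return h;
}

void main()
{
    auto a = Inst([1, 2, 3]);
    auto b = Inst([3, 2, 1]);
    auto c = Inst([0, 0, 6]);
    // All three collide, so a table keyed on badHash must compare
    // argument lists in full to tell them apart.
    assert(badHash(a) == badHash(b) && badHash(b) == badHash(c));
    assert(a.args != b.args);
}
```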
Dec 11 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 17:04:24 UTC, safety0ff wrote:
 On Sunday, 11 December 2016 at 16:26:29 UTC, Ethan Watson wrote:
 At the very least, I now have an idea of which parts of the 
 compiler I'm taxing and can attempt to write around that. But 
 I'm also tempted to go in and optimise those parts of the 
 compiler.
 Have a look at this issue: https://issues.dlang.org/show_bug.cgi?id=16513 In a nutshell: Dmd puts template instances in an AA using a terrible hash function. When looking for a match it does an expensive comparison for each collision. The hash function is easily fixed but it does not obviate the expensive comparison, so it trades hashing time for comparison time. Therefore the performance impact of fixing it is unclear. Martin Nowak is suggesting that using the mangled name is the way forward.
That means you have to compute the mangled name, which is crazy expensive. And you can't cache the parent part of the mangle because it is all freshly generated by the template. It seems like I'm failing to express the problem properly, so let me try again. AliasSeqs that are appended to look like this: AliasSeq!(AliasSeq!(AliasSeq!(...))). An AliasSeq that is "appended to" n times will produce (n^2) instances with ((n-1)^2) sub-instances in them, all the way till n is 0. When you want to compare their parameter types you need to go through all of them and recursively call findTemplateInstance.
Dec 11 2016
parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Sunday, 11 December 2016 at 17:20:24 UTC, Stefan Koch wrote:
 That means you have to compute the mangled name which is crazy 
 expensive.
 And you can't cache the parent part of mangle because it all 
 freshly generated by the template.
How often would the mangle be needed later in compilation anyway? I don't know too much about dmd internals. It seemed that Martin Nowak had a viable proof of concept, but the bugzilla discussion is quite terse.
 It seems like I fail to express the problem properly so let me 
 try again.

 AliasSeqs that are appended to look like this:
 AliasSeq!(AliasSeq!(AliasSeq!(...)))
 An AliasSeq that is "appended to" n times will produce (n^2) 
 instances with ((n-1)^2) sub-instances in them, all the way till n is 0.

 When you want to compare their parameter types you need to go 
 through all of them and recursively call findTemplateInstance.
I don't see the n^2 instances, I'm likely missing an implementation detail. However, I understand the quadratic nature of comparing: AliasSeq!(AliasSeq!(AliasSeq!(...))) to: AliasSeq!(AliasSeq!(...)) I also don't see why you'd need to do this comparison often (assuming good hash functions are used.) The comment "Hash collisions will happen!" seems to overestimate the current implementation IMO. All in all, it would probably be best to have some degenerate code so that the issue can be investigated on a common footing.
Dec 11 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 18:08:04 UTC, safety0ff wrote:
 On Sunday, 11 December 2016 at 17:20:24 UTC, Stefan Koch wrote:
 That means you have to compute the mangled name which is crazy 
 expensive.
 And you can't cache the parent part of mangle because it all 
 freshly generated by the template.
 How often would the mangle be needed regardless later on in compilation? I don't know too much about dmd internals. It seemed that Martin Nowak had a viable proof of concept, but the bugzilla discussion is quite terse.
 It seems like I fail to express the problem properly so let me 
 try again.

 AliasSeqs that are appended to look like this:
 AliasSeq!(AliasSeq!(AliasSeq!(...)))
 An AliasSeq that is "appended to" n times will produce (n^2) 
 instances with ((n-1)^2) sub-instances in them, all the way till n is 0.

 When you want to compare their parameter types you need to go 
 through all of them and recursively call findTemplateInstance.
 I don't see the n^2 instances, I'm likely missing an implementation detail. However, I understand the quadratic nature of comparing: AliasSeq!(AliasSeq!(AliasSeq!(...))) to: AliasSeq!(AliasSeq!(...)) I also don't see why you'd need to do this comparison often (assuming good hash functions are used.) The comment "Hash collisions will happen!" seems to overestimate the current implementation IMO. All in all, it would probably be best to have some degenerate code so that the issue can be investigated on a common footing.
Just use this little program to simulate the process. It assumes you always hit the L1 cache and that there are no hash collisions.

int templateTime(uint n, uint k = 300)
{
    uint result = k;
    foreach (_; 0 .. n - 1)
        result += templateTime(n / 2, k);
    return result;
}

void main(string[] args)
{
    import std.stdio;
    import std.conv : to;

    uint n = to!uint(args[1]);
    writeln("An AliasSeq with ", n, " elements will take around ",
        templateTime(n), " us to instantiate");
}
Dec 11 2016
parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Sunday, 11 December 2016 at 19:00:23 UTC, Stefan Koch wrote:
 Just use this little program to simulate the process.
That's not really useful for understanding and making progress on the issue. I had a patch with improved hash functions which I stashed away since it seemed the mangle approach would be the way forward. I also have no test code to benchmark it either (i.e. the degenerate case I was asking for.)
Dec 11 2016
next sibling parent Ethan Watson <gooberman gmail.com> writes:
On Sunday, 11 December 2016 at 19:40:21 UTC, safety0ff wrote:
 That's not really useful for understanding and making progress 
 on the issue.
Uh, it kinda does, actually. It's highlighting that a better hash function will only have a minor effect. The time sink is the number of instantiations that are happening. But it is worth pointing out that, yes, you won't see the problem with simple tests. Binderoo very quickly builds up recursive template instantiations that bring the problem out from the crypt and into the light.
Dec 11 2016
prev sibling next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 19:40:21 UTC, safety0ff wrote:
 On Sunday, 11 December 2016 at 19:00:23 UTC, Stefan Koch wrote:
 Just use this little program to simulate the process.
 That's not really useful for understanding and making progress on the issue. I had a patch with improved hash functions which I stashed away since it seemed the mangle approach would be the way forward. I also have no test code to benchmark it either (i.e. the degenerate case I was asking for.)
iota(1000).aliasSeqOf. That is your degenerate case.
Dec 11 2016
prev sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 19:40:21 UTC, safety0ff wrote:
 On Sunday, 11 December 2016 at 19:00:23 UTC, Stefan Koch wrote:
 Just use this little program to simulate the process.
 That's not really useful for understanding and making progress on the issue. I had a patch with improved hash functions which I stashed away since it seemed the mangle approach would be the way forward. I also have no test code to benchmark it either (i.e. the degenerate case I was asking for.)
Try static immutable x = aliasSeqOf!(iota(4000))
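For anyone wanting to reproduce this, that one-liner expands to something like the following complete file (the sequence is wrapped in an array literal here so the program also runs; the interesting cost is in compiling it, not running it):

```d
// degenerate.d -- compile with: dmd degenerate.d
// aliasSeqOf turns the range into a 4000-element compile-time
// sequence; the compile time of this file is the benchmark.
import std.range : iota;
import std.meta : aliasSeqOf;

static immutable x = [aliasSeqOf!(iota(4000))];

void main()
{
    assert(x.length == 4000);
    assert(x[0] == 0 && x[3999] == 3999);
}
```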
Jan 02 2017
prev sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Sun, 11 Dec 2016 18:08:04 +0000, safety0ff wrote:
 However, I understand the quadratic nature of comparing:
 AliasSeq!(AliasSeq!(AliasSeq!(...)))
 to:
 AliasSeq!(AliasSeq!(...))
That's one option. Here's another: Template instantiations are interned as they are constructed (or at least should be). You must construct their arguments before you instantiate the template. Therefore you can do a reference equality check instead of checking the values of the parameters. Specifically, we have: AliasSeq!(AliasSeq!(1, 2), 3) To construct this, the compiler first constructs AliasSeq!(1, 2). It's already been instantiated, so instead of creating it anew, it grabs a pointer to the existing instantiation -- we'll call it 0xC0FE. Now it looks if AliasSeq!(0xC0FE, 3) has already been instantiated. It hasn't, so we instantiate it and add it to the cache. I looked at the compiler's implementation, and it seems like it *maybe* could take better advantage of interning. The difference is a handful of casts, some jumps, and some virtual function calls. Which isn't *great* but won't yield more than a modicum of improved performance.
 I also don't see why you'd need to do this comparison often (assuming
 good hash functions are used.)
The current hash function simply sums symbol pointers and integer values in the template arguments. It ignores all other values. So if I have a simple template: template Just(T, T value) { enum T Just = value; } All instantiations with T=double will hash to the same value. When I tested this with 5,000 instantiations, it took twice as long with doubles as with longs (1.07s vs 0.50s). At 10,000, over three times as long (6.89s vs 2.05s). Short strings perform about the same as doubles. TemplateInstance uses druntime's associative arrays, which are standard hashtables. The interaction has some wonky parts: * It starts out with eight buckets. * It grows by a factor of four. * Hashes tend to be sums of pointers. * Pointers are aligned. This means you're in collision city most of the time. I tried addressing that, but I'm seeing no improvement, so I obviously did something wrong. (And then I messed up an invocation of `find` and deleted my entire dmd source tree, including the .git directory...)
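A fix along the lines being discussed looks something like this sketch (the mixer is the usual boost-style hash combine, not dmd's code): fold doubles and strings into the hash instead of ignoring them, so instantiations like Just!(double, 1.5) and Just!(double, 2.5) land in different buckets.

```d
// Sketch of a hash that covers the argument kinds the current one
// ignores. The mixing step is a standard boost-style combine.
size_t mixHash(size_t h, size_t v)
{
    return h ^ (v + 0x9e3779b9 + (h << 6) + (h >> 2));
}

size_t hashArg(T)(T arg)
{
    static if (is(T == double))
    {
        // hash the bit pattern, both halves so 32-bit targets work too
        ulong bits = *cast(ulong*) &arg;
        size_t h = mixHash(0, cast(size_t) (bits & 0xFFFF_FFFF));
        return mixHash(h, cast(size_t) (bits >> 32));
    }
    else static if (is(T == string))
    {
        size_t h = 0;
        foreach (c; arg)
            h = mixHash(h, c);
        return h;
    }
    else
        return mixHash(0, cast(size_t) arg);
}

void main()
{
    // Unlike a sum-of-pointers hash, distinct doubles now hash apart.
    assert(hashArg(1.5) != hashArg(2.5));
    assert(hashArg("foo") == hashArg("foo"));  // and deterministically
    assert(hashArg(42L) != hashArg(43L));
}
```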
 All in all, it would probably be best to have some degenerate code so
 that the issue can be investigated on a common footing.
I agree entirely.
Dec 11 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 22:48:56 UTC, Chris Wright wrote:
 On Sun, 11 Dec 2016 18:08:04 +0000, safety0ff wrote:
 [...]
That's one option. Here's another: Template instantiations are interned as they are constructed (or at least should be). You must construct their arguments before you instantiate the template. Therefore you can do a reference equality check instead of checking the values of the parameters. [...]
Collisions are not the problem. In the worst case they add a linear factor. If you truly get the interning right, it will reduce the complexity from O(n!) to O(n^2.6). However, it will pessimize the overall performance of everything that is not a deeply recursive template.
Dec 12 2016