
digitalmars.D - Compiler performance with my ridiculous Binderoo code

reply Ethan Watson <gooberman gmail.com> writes:
I've been keeping in contact with Stefan and providing him 
example code to test with his CTFE engine. He's been saying for a 
while that templates are slow. So I decided to finally work out 
just how slow we're talking about here.

I can't show the exact code I'm running with, but needless to say 
this particular test case crashes the 32-bit dmd.exe that comes 
with the official downloads. I've had to build my own 64-bit 
version... which also eventually crashes but only after consuming 
8 gigabytes of memory.

Using Visual Studio 2015's built in sample-based profiler, I 
decided to see just what the compiler was doing on a release 
build with the problem code.

http://pastebin.com/dcwwCp28

This is a copy of the call tree where DMD spends most of its time. 
If you don't know how to read these profiles, the key thing to 
notice is that it's 130+ functions deep in the call stack. Plenty 
of template instances, plenty of static if, plenty of static 
foreach... in other words, I'm doing quite a bit to tax the 
compiler.

Which got me thinking. I've been rolling things over to 
CTFE-generated string mixins lately in anticipation of the 
speedups Stefan will get us. But there's one bit of template code 
that I haven't touched at all.

https://github.com/Remedy-Entertainment/binderoo/blob/master/binderoo_client/d/src/binderoo/binding/serialise.d

This is a simple set of templated functions that parses objects 
and serialises them to JSON (the reason I'm not just using 
std.json is because I need custom handling for pointers and 
reference types). But it turns out this is the killer. As part 
of binding an object for Binderoo's rapid iteration purposes, it 
generates a serialise/deserialise call that instantiates for each 
type found. If I turn that code generation off, the code 
compiles. If I remove an auto-generated file containing 1000+ 
structs, whose tree-like instances are embedded in the only 
object I apply Binderoo's binding to in that entire module, it 
compiles in 45% of the time (12 seconds versus 26 seconds).
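For readers who haven't seen the pattern, here's a minimal sketch of what such a recursive serialiser looks like. The names and output format are illustrative, not Binderoo's actual code; the point is that every distinct field type in the object graph forces a fresh template instantiation.

```d
// Hypothetical sketch of the pattern described above: a templated
// serialise that recurses into every member, so each distinct type
// encountered produces its own template instantiation.
import std.conv : to;

string serialise(T)(ref T obj)
{
    static if (is(T == struct))
    {
        string result = "{";
        // This foreach is unrolled at compile time; every field type
        // seen here instantiates serialise!FieldType anew.
        foreach (i, ref field; obj.tupleof)
        {
            result ~= `"` ~ __traits(identifier, T.tupleof[i]) ~ `":`
                ~ serialise(field) ~ ",";
        }
        return result ~ "}";
    }
    else
    {
        return obj.to!string;
    }
}

struct Inner { int x; }
struct Outer { Inner a; double b; }

void main()
{
    auto o = Outer(Inner(1), 2.5);
    assert(serialise(o) == `{"a":{"x":1,},"b":2.5,}`);
}
```

(The stray trailing commas are left in to keep the sketch short.)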

The hot path without that 1000+ struct file actually goes through 
the AttribDeclaration.semantic and 
UserAttributeDeclaration.semantic code path, with the OS itself 
doing the most work for a single function thanks to 
Outbuffer::writeString needing to realloc string memory in 
dmd\backend\outbuf.c.

The hot path with that 1000+ struct file spends the most time in 
TemplateInstance.semantic, specifically with calls to 
TemplateDeclaration.findExistingInstance, 
TemplateInstance.tryExpandMembers, and 
TemplateInstance.findBestMatch taking up 90%+ of its time. 
findExistingInstance spends most of its time in arrayObjectMatch 
in dtemplate.d, which subsequently spends most of its time in the 
match function in the same file (which calls virtuals on 
RootObject to do comparisons).
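As a rough mental model of that hot path (this is an illustration of the shape of the work, not dmd's actual code), the matching walks both argument lists and dispatches a virtual comparison per element, recursing when an argument is itself a nested instance:

```d
// Sketch: pairwise argument matching through a virtual call, in the
// spirit of arrayObjectMatch/match on RootObject in dtemplate.d.
class Node
{
    bool matches(Node other) { return this is other; }
}

class Leaf : Node
{
    int value;
    this(int v) { value = v; }
    override bool matches(Node other)
    {
        auto o = cast(Leaf) other;
        return o !is null && o.value == value;
    }
}

class Seq : Node
{
    Node[] args;
    this(Node[] a) { args = a; }
    override bool matches(Node other)
    {
        auto o = cast(Seq) other;
        if (o is null || o.args.length != args.length)
            return false;
        foreach (i, arg; args)           // pairwise comparison,
            if (!arg.matches(o.args[i])) // recursing into nested seqs
                return false;
        return true;
    }
}

void main()
{
    // AliasSeq!(AliasSeq!(1), 2) compared against an equal shape:
    auto a = new Seq([new Seq([new Leaf(1)]), new Leaf(2)]);
    auto b = new Seq([new Seq([new Leaf(1)]), new Leaf(2)]);
    assert(a.matches(b));
    assert(!a.matches(new Seq([new Leaf(2)])));
}
```

With deeply nested instances, each lookup repeats this walk over the whole subtree, which is where the time goes.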

At the very least, I now have an idea of which parts of the 
compiler I'm taxing and can attempt to write around that. But I'm 
also tempted to go in and optimise those parts of the compiler.
Dec 11 2016
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 16:26:29 UTC, Ethan Watson wrote:
  But
 I'm also tempted to go in and optimise those parts of the 
 compiler.
I already did what I could to optimize those parts. Whatever you manage to squeeze out, it's not going to do much good. The templates you are using are by their nature recursive, and the template matching is recursive as well, leading to a rather nasty complexity curve that looks like O(n!). The template system needs to be fixed as a whole. I will do this as soon as CTFE is done, and I can already say that it will take time. CTFE is straightforward compared to what I will have to do with the templates.
Dec 11 2016
prev sibling parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Sunday, 11 December 2016 at 16:26:29 UTC, Ethan Watson wrote:
 At the very least, I now have an idea of which parts of the 
 compiler I'm taxing and can attempt to write around that. But 
 I'm also tempted to go in and optimise those parts of the 
 compiler.
Have a look at this issue: https://issues.dlang.org/show_bug.cgi?id=16513 In a nutshell: Dmd puts template instances in an AA using a terrible hash function. When looking for a match it does an expensive comparison for each collision. The hash function is easily fixed but it does not obviate the expensive comparison, so it trades hashing time for comparison time. Therefore the performance impact of fixing it is unclear. Martin Nowak is suggesting that using the mangled name is the way forward.
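To make the failure mode concrete, here's a toy version of what the bug report describes (the hash below is invented for illustration; dmd's real one sums symbol pointers and integer values): distinct argument lists that hash identically force the table to fall back on a full, expensive comparison per collision.

```d
// Toy illustration: a hash that merely sums its inputs collapses many
// distinct instances into one bucket, so lookups degenerate into
// chains of element-by-element comparisons.
struct Inst { long[] args; }

size_t badHash(const Inst i)
{
    size_t h = 0;
    foreach (a; i.args)
        h += a;   // sums only, in the spirit of the reported bug
    return h;
}

void main()
{
    auto a = Inst([1, 2, 3]);
    auto b = Inst([3, 2, 1]);
    auto c = Inst([0, 0, 6]);
    // All three collide, so a table keyed on badHash must compare
    // argument lists in full to tell them apart.
    assert(badHash(a) == badHash(b) && badHash(b) == badHash(c));
    assert(a.args != b.args);
}
```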
Dec 11 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 17:04:24 UTC, safety0ff wrote:
 On Sunday, 11 December 2016 at 16:26:29 UTC, Ethan Watson wrote:
 At the very least, I now have an idea of which parts of the 
 compiler I'm taxing and can attempt to write around that. But 
 I'm also tempted to go in and optimise those parts of the 
 compiler.
 Have a look at this issue: https://issues.dlang.org/show_bug.cgi?id=16513 In a nutshell: Dmd puts template instances in an AA using a terrible hash function. When looking for a match it does an expensive comparison for each collision. The hash function is easily fixed but it does not obviate the expensive comparison, so it trades hashing time for comparison time. Therefore the performance impact of fixing it is unclear. Martin Nowak is suggesting that using the mangled name is the way forward.
That means you have to compute the mangled name, which is crazy expensive. And you can't cache the parent part of the mangle because it is all freshly generated by the template. It seems like I'm failing to express the problem properly, so let me try again. AliasSeqs that are appended to look like this: AliasSeq!(AliasSeq!(AliasSeq!(...))). An AliasSeq that is "appended to" n times will produce (n^2) instances with ((n-1)^2) sub-instances in them, all the way till n is 0. When you want to compare their parameter types you need to go through all of them and recursively call findTemplateInstance.
Dec 11 2016
parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Sunday, 11 December 2016 at 17:20:24 UTC, Stefan Koch wrote:
 That means you have to compute the mangled name which is crazy 
 expensive.
 And you can't cache the parent part of mangle because it all 
 freshly generated by the template.
How often would the mangle be needed later in compilation anyway? I don't know too much about dmd internals. It seemed that Martin Nowak had a viable proof of concept, but the bugzilla discussion is quite terse.
 It seems like I fail to express the problem properly so let me 
 try again.

 AliasSeqs that are appended to look like this:
 AliasSeq!(AliasSeq!(AliasSeq!(...)))
 An AliasSeq that is "appended to" n times will produce (n^2) 
 instances with ((n-1)^2) sub-instances in them, all the way till n is 0.

 When you want to compare their parameter types you need to go 
 through all of them and recursively call findTemplateInstance.
I don't see the n^2 instances, I'm likely missing an implementation detail. However, I understand the quadratic nature of comparing: AliasSeq!(AliasSeq!(AliasSeq!(...))) to: AliasSeq!(AliasSeq!(...)) I also don't see why you'd need to do this comparison often (assuming good hash functions are used.) The comment "Hash collisions will happen!" seems to overestimate the current implementation IMO. All in all, it would probably be best to have some degenerate code so that the issue can be investigated on a common footing.
Dec 11 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 18:08:04 UTC, safety0ff wrote:
 On Sunday, 11 December 2016 at 17:20:24 UTC, Stefan Koch wrote:
 That means you have to compute the mangled name which is crazy 
 expensive.
 And you can't cache the parent part of mangle because it all 
 freshly generated by the template.
 How often would the mangle be needed regardless later on in compilation? I don't know too much about dmd internals. It seemed that Martin Nowak had a viable proof of concept, but the bugzilla discussion is quite terse.
 It seems like I fail to express the problem properly so let me 
 try again.

 AliasSeqs that are appended to look like this:
 AliasSeq!(AliasSeq!(AliasSeq!(...)))
 An AliasSeq that is "appended to" n times will produce (n^2) 
 instances with ((n-1)^2) sub-instances in them, all the way till n is 0.

 When you want to compare their parameter types you need to go 
 through all of them and recursively call findTemplateInstance.
 I don't see the n^2 instances, I'm likely missing an implementation detail. However, I understand the quadratic nature of comparing: AliasSeq!(AliasSeq!(AliasSeq!(...))) to: AliasSeq!(AliasSeq!(...)) I also don't see why you'd need to do this comparison often (assuming good hash functions are used.) The comment "Hash collisions will happen!" seems to overestimate the current implementation IMO. All in all, it would probably be best to have some degenerate code so that the issue can be investigated on a common footing.
Just use this little program to simulate the process. It assumes you always hit the L1 cache and that there are no hash collisions.

int templateTime(uint n, uint k = 300)
{
    uint result = k;
    foreach (_; 0 .. n - 1)
        result += templateTime(n / 2, k);
    return result;
}

void main(string[] args)
{
    import std.stdio;
    import std.conv : to;

    uint n = to!uint(args[1]);
    writeln("An AliasSeq with ", n, " elements will take around ",
        templateTime(n), " us to instantiate");
}
Dec 11 2016
parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Sunday, 11 December 2016 at 19:00:23 UTC, Stefan Koch wrote:
 Just use this little program to simulate the process.
That's not really useful for understanding and making progress on the issue. I had a patch with improved hash functions which I stashed away since it seemed the mangle approach would be the way forward. I also have no test code to benchmark it either (i.e. the degenerate case I was asking for.)
Dec 11 2016
next sibling parent Ethan Watson <gooberman gmail.com> writes:
On Sunday, 11 December 2016 at 19:40:21 UTC, safety0ff wrote:
 That's not really useful for understanding and making progress 
 on the issue.
Uh, it kinda does, actually. It's highlighting that a better hash function will only have a minor effect. The time sink is the number of instantiations that are happening. But it is worth pointing out that, yes, you won't see the problem with simple tests. Binderoo very quickly builds up recursive template instantiations that bring the problem out from the crypt and into the light.
Dec 11 2016
prev sibling next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 19:40:21 UTC, safety0ff wrote:
 On Sunday, 11 December 2016 at 19:00:23 UTC, Stefan Koch wrote:
 Just use this little program to simulate the process.
 That's not really useful for understanding and making progress on the issue. I had a patch with improved hash functions which I stashed away since it seemed the mangle approach would be the way forward. I also have no test code to benchmark it either (i.e. the degenerate case I was asking for.)
iota(1000).aliasSeqOf. That is your degenerate case.
Dec 11 2016
prev sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 19:40:21 UTC, safety0ff wrote:
 On Sunday, 11 December 2016 at 19:00:23 UTC, Stefan Koch wrote:
 Just use this little program to simulate the process.
 That's not really useful for understanding and making progress on the issue. I had a patch with improved hash functions which I stashed away since it seemed the mangle approach would be the way forward. I also have no test code to benchmark it either (i.e. the degenerate case I was asking for.)
Try static immutable x = aliasSeqOf!(iota(4000))
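For anyone wanting to reproduce this, that one-liner expands to something like the following complete file (the sequence is wrapped in an array literal here so the program also runs; the interesting cost is in compiling it, not running it):

```d
// degenerate.d -- compile with: dmd degenerate.d
// aliasSeqOf turns the range into a 4000-element compile-time
// sequence; the compile time of this file is the benchmark.
import std.range : iota;
import std.meta : aliasSeqOf;

static immutable x = [aliasSeqOf!(iota(4000))];

void main()
{
    assert(x.length == 4000);
    assert(x[0] == 0 && x[3999] == 3999);
}
```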
Jan 02 2017
prev sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Sun, 11 Dec 2016 18:08:04 +0000, safety0ff wrote:
 However, I understand the quadratic nature of comparing:
 AliasSeq!(AliasSeq!(AliasSeq!(...)))
 to:
 AliasSeq!(AliasSeq!(...))
That's one option. Here's another: Template instantiations are interned as they are constructed (or at least should be). You must construct their arguments before you instantiate the template. Therefore you can do a reference equality check instead of checking the values of the parameters. Specifically, we have: AliasSeq!(AliasSeq!(1, 2), 3) To construct this, the compiler first constructs AliasSeq!(1, 2). It's already been instantiated, so instead of creating it anew, it grabs a pointer to the existing instantiation -- we'll call it 0xC0FE. Now it looks if AliasSeq!(0xC0FE, 3) has already been instantiated. It hasn't, so we instantiate it and add it to the cache. I looked at the compiler's implementation, and it seems like it *maybe* could take better advantage of interning. The difference is a handful of casts, some jumps, and some virtual function calls. Which isn't *great* but won't yield more than a modicum of improved performance.
 I also don't see why you'd need to do this comparison often (assuming
 good hash functions are used.)
The current hash function simply sums symbol pointers and integer values in the template arguments. It ignores all other values. So if I have a simple template: template Just(T, T value) { enum T Just = value; } All instantiations with T=double will hash to the same value. When I tested this with 5,000 instantiations, it took twice as long with doubles as with longs (1.07s vs 0.50s). At 10,000, over three times as long (6.89s vs 2.05s). Short strings perform about the same as doubles. TemplateInstance uses druntime's associative arrays, which are standard hashtables. The interaction has some wonky parts: * It starts out with eight buckets. * It grows by a factor of four. * Hashes tend to be sums of pointers. * Pointers are aligned. This means you're in collision city most of the time. I tried addressing that, but I'm seeing no improvement, so I obviously did something wrong. (And then I messed up an invocation of `find` and deleted my entire dmd source tree, including the .git directory...)
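A fix along the lines being discussed looks something like this sketch (the mixer is the usual boost-style hash combine, not dmd's code): fold doubles and strings into the hash instead of ignoring them, so instantiations like Just!(double, 1.5) and Just!(double, 2.5) land in different buckets.

```d
// Sketch of a hash that covers the argument kinds the current one
// ignores. The mixing step is a standard boost-style combine.
size_t mixHash(size_t h, size_t v)
{
    return h ^ (v + 0x9e3779b9 + (h << 6) + (h >> 2));
}

size_t hashArg(T)(T arg)
{
    static if (is(T == double))
    {
        // hash the bit pattern, both halves so 32-bit targets work too
        ulong bits = *cast(ulong*) &arg;
        size_t h = mixHash(0, cast(size_t) (bits & 0xFFFF_FFFF));
        return mixHash(h, cast(size_t) (bits >> 32));
    }
    else static if (is(T == string))
    {
        size_t h = 0;
        foreach (c; arg)
            h = mixHash(h, c);
        return h;
    }
    else
        return mixHash(0, cast(size_t) arg);
}

void main()
{
    // Unlike a sum-of-pointers hash, distinct doubles now hash apart.
    assert(hashArg(1.5) != hashArg(2.5));
    assert(hashArg("foo") == hashArg("foo"));  // and deterministically
    assert(hashArg(42L) != hashArg(43L));
}
```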
 All in all, it would probably be best to have some degenerate code so
 that the issue can be investigated on a common footing.
I agree entirely.
Dec 11 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 11 December 2016 at 22:48:56 UTC, Chris Wright wrote:
 On Sun, 11 Dec 2016 18:08:04 +0000, safety0ff wrote:
 [...]
That's one option. Here's another: Template instantiations are interned as they are constructed (or at least should be). You must construct their arguments before you instantiate the template. Therefore you can do a reference equality check instead of checking the values of the parameters. [...]
Collisions are not the problem. In the worst case they add a linear factor. If you truly get the interning right, it will reduce the complexity from O(n!) to O(n^2.6). However, it will pessimize the overall performance of everything that is not a deeply recursive template.
Dec 12 2016