digitalmars.D - Compiler performance with my ridiculous Binderoo code
- Ethan Watson (53/53) Dec 11 2016 I've been keeping in contact with Stefan and providing him
- Stefan Koch (13/15) Dec 11 2016 I already did what I could to optimize those parts.
- safety0ff (13/17) Dec 11 2016 Have a look at this issue:
- Stefan Koch (13/31) Dec 11 2016 That means you have to compute the mangled name which is crazy
- safety0ff (18/30) Dec 11 2016 How often would the mangle be needed regardless later on in
- Stefan Koch (19/53) Dec 11 2016 Just use this little program to simulate the process.
- safety0ff (7/8) Dec 11 2016 That's not really useful for understanding and making progress on
- Ethan Watson (8/10) Dec 11 2016 Uh, it kinda does actually. It's highlighting that a better hash
- Stefan Koch (3/12) Dec 11 2016 iota(1000).aliasSeqOf.
- Stefan Koch (2/11) Jan 02 2017 Try static immutable x = aliasSeqOf!(iota(4000))
- Chris Wright (39/47) Dec 11 2016 That's one option. Here's another:
- Stefan Koch (7/16) Dec 12 2016 Collisions are not the problem.
I've been keeping in contact with Stefan and providing him with example code to test his CTFE engine. He's been saying for a while that templates are slow, so I decided to finally work out just how slow we're talking about here.

I can't show the exact code I'm running with, but needless to say this particular test case crashes the 32-bit dmd.exe that comes with the official downloads. I've had to build my own 64-bit version... which also eventually crashes, but only after consuming 8 gigabytes of memory.

Using Visual Studio 2015's built-in sample-based profiler, I decided to see just what the compiler was doing on a release build with the problem code.

http://pastebin.com/dcwwCp28

This is a copy of the calltree where DMD spends most of its time. If you don't know how to read these profiles, the key thing to notice is that it's 130+ functions deep in the callstack. Plenty of template instances, plenty of static if, plenty of static foreach... I'm doing quite a bit to tax the compiler, in other words.

Which got me thinking. I've been rolling things over to CTFE-generated string mixins lately in anticipation of the speedups Stefan will get us. But there's one bit of template code that I have not touched at all.

https://github.com/Remedy-Entertainment/binderoo/blob/master/binderoo_client/d/src/binderoo/binding/serialise.d

This is a simple set of templated functions that parses objects and serialises them to JSON (the reason I'm not just using std.json is that I need custom handling for pointers and reference types). But it turns out this is the killer. As part of binding an object for Binderoo's rapid iteration purposes, it generates a serialise/deserialise call that instantiates for each type found. If I turn that code generation off, the code compiles.
If I remove a file that has 1000+ auto-generated structs, with tree-like instances embedded in the only object I apply Binderoo's binding to in that entire module, it compiles in 45% of the time (12 seconds versus 26 seconds).

The hot path without that 1000+ struct file actually goes through the AttribDeclaration.semantic and UserAttributeDeclaration.semantic code path, with the OS itself doing the most work for a single function thanks to Outbuffer::writeString needing to realloc string memory in dmd\backend\outbuf.c.

The hot path with that 1000+ struct file spends the most time in TemplateInstance.semantic, specifically with calls to TemplateDeclaration.findExistingInstance, TemplateInstance.tryExpandMembers, and TemplateInstance.findBestMatch taking up 90%+ of its time. findExistingInstance spends most of its time in arrayObjectMatch in dtemplate.d, which subsequently spends most of its time in the match function in the same file (which calls virtuals on RootObject to do comparisons).

At the very least, I now have an idea of which parts of the compiler I'm taxing and can attempt to write around that. But I'm also tempted to go in and optimise those parts of the compiler.
Dec 11 2016
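[Editorial note: the hot path Ethan profiles can be sketched in miniature. The following is a hypothetical Python model of how arrayObjectMatch and the virtual match calls recurse through nested template instances — class and function names mirror the dmd symbols mentioned above but the code itself is purely illustrative, not dmd's implementation.]

```python
# Toy model of dmd's template-argument comparison (illustrative names only).
class RootObject:
    def match(self, other):
        # fallback: identity, standing in for the virtual comparison call
        return self is other

class IntArg(RootObject):
    def __init__(self, value):
        self.value = value
    def match(self, other):
        return isinstance(other, IntArg) and self.value == other.value

class TemplateInstance(RootObject):
    def __init__(self, args):
        self.args = args
    def match(self, other):
        # the expensive case: recurse through the whole nested argument list
        return (isinstance(other, TemplateInstance)
                and array_object_match(self.args, other.args))

def array_object_match(a, b):
    """Element-wise comparison of two template argument lists."""
    return len(a) == len(b) and all(x.match(y) for x, y in zip(a, b))

def tower(depth):
    """Each 'append' to an AliasSeq wraps the previous instance as an
    argument, producing nesting `depth` levels deep."""
    inst = TemplateInstance([IntArg(0)])
    for i in range(1, depth + 1):
        inst = TemplateInstance([inst, IntArg(i)])
    return inst

# One cache probe against a structurally equal candidate walks all levels,
# and a mismatch near the leaves walks almost all of them too.
print(array_object_match(tower(100).args, tower(100).args))  # True
print(array_object_match(tower(100).args, tower(99).args))   # False
```

The point of the sketch: a single findExistingInstance-style probe is linear in the nesting depth, so once many deeply nested candidates end up in the same bucket, the costs compound.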
On Sunday, 11 December 2016 at 16:26:29 UTC, Ethan Watson wrote:But I'm also tempted to go in and optimise those parts of the compiler.I already did what I could to optimize those parts. Whatever you manage to squeeze out, it's not going to do much good. The templates you are using are by their nature recursive, and the template matching is recursive as well, leading to a rather nasty complexity curve that looks like O(n!). The template system needs to be fixed as a whole. I will do this as soon as CTFE is done, and I can already say that it will take time. CTFE is straightforward when compared to what I will have to do with the templates.
Dec 11 2016
On Sunday, 11 December 2016 at 16:26:29 UTC, Ethan Watson wrote:At the very least, I now have an idea of which parts of the compiler I'm taxing and can attempt to write around that. But I'm also tempted to go in and optimise those parts of the compiler.Have a look at this issue: https://issues.dlang.org/show_bug.cgi?id=16513 In a nutshell: Dmd puts template instances in an AA using a terrible hash function. When looking for a match it does an expensive comparison for each collision. The hash function is easily fixed but it does not obviate the expensive comparison, so it trades hashing time for comparison time. Therefore the performance impact of fixing it is unclear. Martin Nowak is suggesting that using the mangled name is the way forward.
Dec 11 2016
On Sunday, 11 December 2016 at 17:04:24 UTC, safety0ff wrote:On Sunday, 11 December 2016 at 16:26:29 UTC, Ethan Watson wrote:That means you have to compute the mangled name, which is crazy expensive. And you can't cache the parent part of the mangle, because it is all freshly generated by the template. It seems I failed to express the problem properly, so let me try again. AliasSeqs that are appended to look like this: AliasSeq!(AliasSeq!(AliasSeq!(...))) An AliasSeq that is "appended to" n times will produce n^2 instances, with (n-1)^2 sub-instances in them, all the way down until n is 0. When you want to compare their parameter types you need to go through all of them and recursively call findTemplateInstance.At the very least, I now have an idea of which parts of the compiler I'm taxing and can attempt to write around that. But I'm also tempted to go in and optimise those parts of the compiler.Have a look at this issue: https://issues.dlang.org/show_bug.cgi?id=16513 In a nutshell: Dmd puts template instances in an AA using a terrible hash function. When looking for a match it does an expensive comparison for each collision. The hash function is easily fixed but it does not obviate the expensive comparison, so it trades hashing time for comparison time. Therefore the performance impact of fixing it is unclear. Martin Nowak is suggesting that using the mangled name is the way forward.
Dec 11 2016
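[Editorial note: the growth argument in the post above can be made concrete with a small back-of-the-envelope model, purely illustrative Python: each append nests one level deeper, and if every instantiation's cache lookup walks its entire nested argument list, the total work over n appends is already quadratic before hash collisions multiply it further.]

```python
def total_nodes_walked(n):
    """Append to an AliasSeq n times; assume each instantiation triggers
    one full structural walk of the nesting built so far (a sequence of
    depth d costs d + 1 node visits to walk)."""
    return sum(d + 1 for d in range(1, n + 1))

for n in (10, 100, 1000):
    print(n, total_nodes_walked(n))
```

If a poor hash then forces each lookup to also compare against every earlier instance sharing its bucket, each of those comparisons is itself a full walk, and the curve steepens again — which is where the "nasty complexity curve" impression comes from.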
On Sunday, 11 December 2016 at 17:20:24 UTC, Stefan Koch wrote:That means you have to compute the mangled name, which is crazy expensive. And you can't cache the parent part of the mangle, because it is all freshly generated by the template.How often would the mangle be needed regardless later on in compilation? I don't know too much about dmd internals. It seemed that Martin Nowak had a viable proof of concept, but the bugzilla discussion is quite terse.It seems I failed to express the problem properly, so let me try again. AliasSeqs that are appended to look like this: AliasSeq!(AliasSeq!(AliasSeq!(...))) An AliasSeq that is "appended to" n times will produce n^2 instances, with (n-1)^2 sub-instances in them, all the way down until n is 0. When you want to compare their parameter types you need to go through all of them and recursively call findTemplateInstance.I don't see the n^2 instances, I'm likely missing an implementation detail. However, I understand the quadratic nature of comparing: AliasSeq!(AliasSeq!(AliasSeq!(...))) to: AliasSeq!(AliasSeq!(...)) I also don't see why you'd need to do this comparison often (assuming good hash functions are used). The comment "Hash collisions will happen!" seems to overestimate the current implementation IMO. All in all, it would probably be best to have some degenerate code so that the issue can be investigated on a common footing.
Dec 11 2016
On Sunday, 11 December 2016 at 18:08:04 UTC, safety0ff wrote:On Sunday, 11 December 2016 at 17:20:24 UTC, Stefan Koch wrote:Just use this little program to simulate the process. It assumes you always hit the L1 cache and that there are no hash collisions.

uint templateTime(uint n, uint k = 300)
{
    uint result = k;
    foreach (_; 0 .. n - 1)
        result += templateTime(n / 2, k);
    return result;
}

void main(string[] args)
{
    import std.stdio;
    import std.conv : to;

    uint n = to!uint(args[1]);
    writeln("AliasSeq with ", n, " elements will take around ",
        templateTime(n), " us to instantiate");
}

That means you have to compute the mangled name, which is crazy expensive. And you can't cache the parent part of the mangle, because it is all freshly generated by the template.How often would the mangle be needed regardless later on in compilation? I don't know too much about dmd internals. It seemed that Martin Nowak had a viable proof of concept, but the bugzilla discussion is quite terse.It seems I failed to express the problem properly, so let me try again. AliasSeqs that are appended to look like this: AliasSeq!(AliasSeq!(AliasSeq!(...))) An AliasSeq that is "appended to" n times will produce n^2 instances, with (n-1)^2 sub-instances in them, all the way down until n is 0. When you want to compare their parameter types you need to go through all of them and recursively call findTemplateInstance.I don't see the n^2 instances, I'm likely missing an implementation detail. However, I understand the quadratic nature of comparing: AliasSeq!(AliasSeq!(AliasSeq!(...))) to: AliasSeq!(AliasSeq!(...)) I also don't see why you'd need to do this comparison often (assuming good hash functions are used). The comment "Hash collisions will happen!" seems to overestimate the current implementation IMO. All in all, it would probably be best to have some degenerate code so that the issue can be investigated on a common footing.
Dec 11 2016
On Sunday, 11 December 2016 at 19:00:23 UTC, Stefan Koch wrote:Just use this little program to simulate the process.That's not really useful for understanding and making progress on the issue. I had a patch with improved hash functions which I stashed away since it seemed the mangle approach would be the way forward. I also have no test code to benchmark it either (i.e. the degenerate case I was asking for.)
Dec 11 2016
On Sunday, 11 December 2016 at 19:40:21 UTC, safety0ff wrote:That's not really useful for understanding and making progress on the issue.Uh, it kinda does actually. It's highlighting that a better hash function will only have a minor effect. The time sink is the number of instantiations that are happening. But it is worth pointing out that, yes, you won't see the problem with simple tests. Binderoo very quickly builds up recursive template instantiations that bring the problem out from the crypt and into the light.
Dec 11 2016
On Sunday, 11 December 2016 at 19:40:21 UTC, safety0ff wrote:On Sunday, 11 December 2016 at 19:00:23 UTC, Stefan Koch wrote:iota(1000).aliasSeqOf. That is your degenerate case.Just use this little program to simulate the process.That's not really useful for understanding and making progress on the issue. I had a patch with improved hash functions which I stashed away since it seemed the mangle approach would be the way forward. I also have no test code to benchmark it either (i.e. the degenerate case I was asking for.)
Dec 11 2016
On Sunday, 11 December 2016 at 19:40:21 UTC, safety0ff wrote:On Sunday, 11 December 2016 at 19:00:23 UTC, Stefan Koch wrote:Try static immutable x = aliasSeqOf!(iota(4000))Just use this little program to simulate the process.That's not really useful for understanding and making progress on the issue. I had a patch with improved hash functions which I stashed away since it seemed the mangle approach would be the way forward. I also have no test code to benchmark it either (i.e. the degenerate case I was asking for.)
Jan 02 2017
On Sun, 11 Dec 2016 18:08:04 +0000, safety0ff wrote:However, I understand the quadratic nature of comparing: AliasSeq!(AliasSeq!(AliasSeq!(...))) to: AliasSeq!(AliasSeq!(...))That's one option. Here's another: Template instantiations are interned as they are constructed (or at least should be). You must construct their arguments before you instantiate the template. Therefore you can do a reference equality check instead of checking the values of the parameters. Specifically, we have: AliasSeq!(AliasSeq!(1, 2), 3) To construct this, the compiler first constructs AliasSeq!(1, 2). It's already been instantiated, so instead of creating it anew, it grabs a pointer to the existing instantiation -- we'll call it 0xC0FE. Now it looks if AliasSeq!(0xC0FE, 3) has already been instantiated. It hasn't, so we instantiate it and add it to the cache. I looked at the compiler's implementation, and it seems like it *maybe* could take better advantage of interning. The difference is a handful of casts, some jumps, and some virtual function calls. Which isn't *great* but won't yield more than a modicum of improved performance.I also don't see why you'd need to do this comparison often (assuming good hash functions are used.)The current hash function simply sums symbol pointers and integer values in the template arguments. It ignores all other values. So if I have a simple template: template Just(T, T value) { enum T Just = value; } All instantiations with T=double will hash to the same value. When I tested this with 5,000 instantiations, it took twice as long with doubles as with longs (1.07s vs 0.50s). At 10,000, over three times as long (6.89s vs 2.05s). Short strings perform about the same as doubles. TemplateInstance uses druntime's associative arrays, which are standard hashtables. The interaction has some wonky parts: * It starts out with eight buckets. * It grows by a factor of four. * Hashes tend to be sums of pointers. * Pointers are aligned. 
This means you're in collision city most of the time. I tried addressing that, but I'm seeing no improvement, so I obviously did something wrong. (And then I messed up an invocation of `find` and deleted my entire dmd source tree, including the .git directory...)

All in all, it would probably be best to have some degenerate code so that the issue can be investigated on a common footing.

I agree entirely.
Dec 11 2016
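[Editorial note: the interning scheme Chris describes is easy to prototype. A hypothetical Python sketch, not dmd's implementation: because arguments are interned before the enclosing instance is constructed, the cache key can refer to nested instances by identity, so lookups never recurse into argument trees.]

```python
_cache = {}

class Instance:
    def __init__(self, name, args):
        self.name = name
        self.args = args

def instantiate(name, *args):
    # Nested Instance arguments are already interned, so their id() is a
    # valid stand-in for structural equality -- no recursive comparison.
    key = (name,) + tuple(id(a) if isinstance(a, Instance) else a
                          for a in args)
    inst = _cache.get(key)
    if inst is None:
        inst = Instance(name, args)
        _cache[key] = inst
    return inst

# AliasSeq!(AliasSeq!(1, 2), 3): the inner instance is constructed first
# (the 0xC0FE of Chris's example); the outer key stores only its pointer.
inner = instantiate("AliasSeq", 1, 2)
outer = instantiate("AliasSeq", inner, 3)

# Re-instantiation is an O(1) hash probe plus pointer compares.
assert instantiate("AliasSeq", 1, 2) is inner
assert instantiate("AliasSeq", instantiate("AliasSeq", 1, 2), 3) is outer
```

Stefan's reply below raises the trade-off: every instantiation now pays the interning bookkeeping, which can pessimize code that is not deeply recursive.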
On Sunday, 11 December 2016 at 22:48:56 UTC, Chris Wright wrote:On Sun, 11 Dec 2016 18:08:04 +0000, safety0ff wrote:Collisions are not the problem. In the worst case they add a linear factor. If you truly get the interning right, it will reduce the complexity from O(n!) to O(n^2.6). However, it will pessimize the overall performance of everything that is not a deeply recursive template.[...]That's one option. Here's another: Template instantiations are interned as they are constructed (or at least should be). You must construct their arguments before you instantiate the template. Therefore you can do a reference equality check instead of checking the values of the parameters. [...]
Dec 12 2016