digitalmars.D - Optimization fun
- H. S. Teoh via Digitalmars-d (50/50) Nov 06 2014 So today, I was playing around with profiling and optimizing my sliding
- Walter Bright (6/35) Nov 06 2014 I've argued this point for over a year, apparently without convincing an...
- H. S. Teoh via Digitalmars-d (20/65) Nov 06 2014 I dunno, my impression is that some people agree with it, but consider
- Sean Kelly (13/28) Nov 06 2014 We've talked before about keeping statistics of how much memory
- thedeemon (5/11) Nov 07 2014 When you grow a hash table it periodically reallocates bucket
- H. S. Teoh via Digitalmars-d (18/28) Nov 07 2014 I realize that, it's just that even after accounting for that, the
- Ary Borenszweig (6/54) Nov 07 2014 Hi,
- H. S. Teoh via Digitalmars-d (13/16) Nov 07 2014 [...]
- Ary Borenszweig (3/17) Nov 07 2014 Then I don't think I'll be able to do any useful benchmark/profile with
So today, I was playing around with profiling and optimizing my sliding block puzzle solver, and found some interesting things:

1) The GC could use some serious improvement: it just so happens that the solver's algorithm only ever needs to allocate memory, never release it (it keeps a hash of visited states that only grows, never shrinks). The profiler indicated that one of the hotspots is the GC. So I added an option to the program to call GC.disable() if a command line option is given. The result is a 40%-50% reduction in running time.

2) Auto-decoding is evil: the solver currently uses a naïve char[] representation for the puzzle board, and uses countUntil() extensively to locate things on the board. The original code had:

	auto offset = currentBoard.countUntil(ch);

which, of course, triggers autodecoding (even though there is actually no reason to do so: ch is a char, and therefore countUntil could've used strchr instead). Simply changing that to:

	auto offset = currentBoard.representation.countUntil(ch);

gave an additional 20%-30% performance boost.

3) However, countUntil remains at the top of the profiler's list of hotspots. DMD's optimizer was rather disappointing here, so I looked at gdc's output instead. Digging into the disassembly, I saw that it was using a foreach loop over the char[], but for some reason even gdc -O3 failed to simplify that loop significantly (I'm not sure why -- maybe what threw off the optimizer is the array indexing + length check operation, where in C one would normally just bump the pointer instead?). Eventually, I tried writing a manual loop:

	auto ptr = current.ptr;
	auto c = current.length;
	while (c > 0 && *ptr != m.id)
		ptr++;
	cert.headOffset = ptr - current.ptr;

This pushed gdc far enough in the right direction that it finally produced a 3-instruction inner loop. Strangely enough, the assembly had no check for (c > 0)... could it be because there is an assert following the loop claiming that that will never happen, therefore gdc elided the check? I thought Walter hadn't gotten around to implementing optimizations based on assert yet??

Anyway, with this optimization, I managed to shave off another 3%-5% of the total running time.

On that note, at one point I was wondering why gdc -O3 didn't generate a "rep scasb" for the inner loop instead of manually incrementing the loop pointer; but a little research revealed that I'm about 6-7 years out of date: since about 2008, gcc's optimizer has not used "rep scasb" because on modern hardware it has atrocious performance -- much worse than a manually-written C loop! So much for the "built-in" string processing that used to be touted in the old days. Yet another proof of the rule that overly-complex CPU instruction sets rarely actually perform better.

Anyway, after everything is said and done, a puzzle that used to take about 20 seconds to solve now only takes about 6 seconds. W00t!

T

-- 
Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert
Nov 06 2014
On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote:
> 2) Auto-decoding is evil: the solver currently uses a naïve char[]
> representation for the puzzle board, and uses countUntil() extensively
> to locate things on the board. The original code had:
>
>     auto offset = currentBoard.countUntil(ch);
>
> which, of course, triggers autodecoding (even though there is actually
> no reason to do so: ch is a char, and therefore countUntil could've
> used strchr instead). Simply changing that to:
>
>     auto offset = currentBoard.representation.countUntil(ch);
>
> gave an additional 20%-30% performance boost.

I've argued this point for over a year, apparently without convincing anyone :-(

BTW, use .byChar. instead of .representation. because the former leaves it typed as 'char' and the latter converts it to 'ubyte'.

> 3) However, countUntil remains at the top of the profiler's list of
> hotspots. [...] I thought Walter hasn't gotten around to implementing
> optimizations based on assert yet?? Anyway, with this optimization, I
> managed to shave off another 3%-5% of the total running time.

Walter hasn't gotten around to implementing optimizations in GDC, that's fer sure! :-)
Nov 06 2014
On Thu, Nov 06, 2014 at 03:15:04PM -0800, Walter Bright via Digitalmars-d wrote:
> On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote:
>> 2) Auto-decoding is evil: the solver currently uses a naïve char[]
>> representation for the puzzle board, and uses countUntil()
>> extensively to locate things on the board. [...] Simply changing that
>> to:
>>
>>     auto offset = currentBoard.representation.countUntil(ch);
>>
>> gave an additional 20%-30% performance boost.
>
> I've argued this point for over a year, apparently without convincing
> anyone :-(

I dunno, my impression is that some people agree with it, but consider that the cost of changing it now is a little too prohibitive due to the extensive code breakage it will cause. But the primary obstacle is that Andrei is not convinced, so he's going to veto any change in this direction regardless. You two will have to tussle it out to resolve this. ;-)

> BTW, use .byChar. instead of .representation. because the former
> leaves it typed as 'char' and the latter converts it to 'ubyte'.

I was going to use .byChar, but since I'm doing performance testing on gdc, and (my version of) gdc hasn't caught up to the latest Phobos yet, I have to settle for .representation for now. For my purposes it's harmless, since I'm not actually doing anything with the ubyte representation once countUntil is done; the return value is used to index the original char[].

>> 3) However, countUntil remains at the top of the profiler's list of
>> hotspots. [...] I thought Walter hasn't gotten around to implementing
>> optimizations based on assert yet?? Anyway, with this optimization, I
>> managed to shave off another 3%-5% of the total running time.
>
> Walter hasn't gotten around to implementing optimizations in GDC,
> that's fer sure! :-)

LOL... I wasn't sure whether said optimizations were going to be in the front end, or in the backend only. It might not be a bad idea to put it in the front end where applicable, then all compilers will benefit from it.

T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher
Nov 06 2014
On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via Digitalmars-d wrote:
> So today, I was playing around with profiling and optimizing my
> sliding block puzzle solver, and found some interesting things:
>
> 1) The GC could use some serious improvement: it just so happens that
> the solver's algorithm only ever needs to allocate memory, never
> release it (it keeps a hash of visited states that only grows, never
> shrinks). The profiler indicated that one of the hotspots is the GC.
> So I added an option to the program to call GC.disable() if a command
> line option is given. The result is a 40%-50% reduction in running
> time.

We've talked before about keeping statistics of how much memory has been reclaimed by collections, and using that to influence whether a collection should be run before going to the OS for more. Something like this should definitely get a bugzilla ticket. For now, you might want to consider guessing at how much memory you'll need and calling GC.reserve() as well. This should improve the performance of your app even further.

> 2) Auto-decoding is evil

Yep, it's the worst. Most string operations don't even need auto-decoding to work properly. I honestly have no idea why we have this feature.
Nov 06 2014
On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via Digitalmars-d wrote:
> 1) The GC could use some serious improvement: it just so happens that
> the solver's algorithm only ever needs to allocate memory, never
> release it (it keeps a hash of visited states that only grows, never
> shrinks).

When you grow a hash table it periodically reallocates bucket arrays it uses internally, so some garbage to collect appears anyway even if you only add elements.
Nov 07 2014
On Fri, Nov 07, 2014 at 09:28:56AM +0000, thedeemon via Digitalmars-d wrote:
> On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via
> Digitalmars-d wrote:
>> 1) The GC could use some serious improvement: it just so happens that
>> the solver's algorithm only ever needs to allocate memory, never
>> release it (it keeps a hash of visited states that only grows, never
>> shrinks).
>
> When you grow a hash table it periodically reallocates bucket arrays
> it uses internally, so some garbage to collect appears anyway even if
> you only add elements.

I realize that, it's just that even after accounting for that, the memory consumption rate is much higher than expected, and sure enough further investigation revealed unnecessary GC load caused by reallocating arrays where (re)using an existing one would suffice.

In the case of the function that used to return an array, which would allocate a new array every call (and there are hundreds of thousands of calls for this particular test case), after implementing buffer reuse, even though it does need to reallocate the buffer sometimes when the array runs out of space, it only reallocates 3-4 times for the entire run.

I'm not sure what the overhead for the hash table would be, but I'm expecting it shouldn't be more than, say, 50-100 reallocations of the bucket array (I didn't measure this). So it shouldn't add up to *that* much.

T

-- 
There is no gravity. The earth sucks.
Nov 07 2014
On 11/6/14, 7:58 PM, H. S. Teoh via Digitalmars-d wrote:
> So today, I was playing around with profiling and optimizing my
> sliding block puzzle solver, and found some interesting things:
> [...]
> Anyway, after everything is said and done, a puzzle that used to take
> about 20 seconds to solve now only takes about 6 seconds. W00t!

Hi,

Is the code public? I'd like to port it to other languages and see how they behave, see if this is a general problem or just something specific to D.

Thanks!
Nov 07 2014
On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via Digitalmars-d wrote:
[...]
> Is the code public? I'd like to port it to other languages and see how
> they behave, see if this is a general problem or just something
> specific to D.
[...]

I haven't posted the code anywhere yet, but I could certainly send you a copy if you want. However, the test input I'm using is not public, so you'll have to craft your own input files for testing purposes. The public input files I do have are too simple to pose a significant challenge to the solver (plus, they default to BFS instead of A*, so you might get different results from what I described here).

T

-- 
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, / The day and hour soon are coming / When all the IT folks say "Gosh!" / It isn't from a clever lawsuit / That Windowsland will finally fall, / But thousands writing open source code / Like mice who nibble through a wall. -- The Linux-nationale by Greg Baker
Nov 07 2014
On 11/7/14, 4:16 PM, H. S. Teoh via Digitalmars-d wrote:
> On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via
> Digitalmars-d wrote:
> [...]
>> Is the code public? I'd like to port it to other languages and see
>> how they behave, see if this is a general problem or just something
>> specific to D.
> [...]
>
> I haven't posted the code anywhere yet, but I could certainly send you
> a copy if you want. However, the test input I'm using is not public,
> so you'll have to craft your own input files for testing purposes. The
> public input files I do have are too simple to pose a significant
> challenge to the solver (plus, they default to BFS instead of A*, so
> you might get different results from what I described here).

Then I don't think I'll be able to do any useful benchmark/profile with that. Thanks anyway :-)
Nov 07 2014