digitalmars.D - rt_finalize WTFs?
- dsimcha (59/59) Dec 04 2011 I'm at my traditional pastime of trying to speed up D's garbage
- Martin Nowak (15/74) Dec 04 2011 Not for the try block. With errors being not recoverable you don't need ...
- dsimcha (9/88) Dec 04 2011 Thanks for the benchmark. I ended up deciding to just create a second
- Jonathan M Davis (7/15) Dec 04 2011 It's best to avoid code duplication in general, but I'd argue that if on...
- bearophile (6/16) Dec 04 2011 That if(p) seems fit to become a static if on bool template argument (bu...
- Rainer Schuetze (8/65) Dec 05 2011 Last time I looked at it, the try/catch/finally block was rather
- Vladimir Panteleev (10/14) Dec 05 2011 This sounds like a good idea. Just make sure that all code paths that le...
- Martin Nowak (23/95) Dec 05 2011 Just an unconditional jump into the finally body here, but
- dsimcha (21/36) Dec 05 2011 The reason the gain wasn't huge is because on the benchmark I have that ...
- Vladimir Panteleev (6/14) Dec 05 2011 A tree, with a few bits of the address space per level. It becomes bound...
- Martin Nowak (126/178) Dec 05 2011 Probably write a tool that can replay recorded allocations profiles.
- Iain Buclaw (9/14) Dec 05 2011 Indeed. The current implementation kicking around is to scan
- Jacob Carlborg (4/16) Dec 05 2011 /proc doesn't exist on Mac OS X.
- Martin Nowak (2/10) Dec 05 2011 Actually 64-bit should use a hashtable for the upper 32-bit and then
- dsimcha (5/17) Dec 05 2011 Why do you expect this to be faster than a binary search? I'm not sayin...
- Martin Nowak (15/34) Dec 05 2011 It's the tightest loop in the garbage collector.
- Steven Schveighoffer (12/70) Dec 05 2011 I think it might be a good idea to move those two operations to the clea...
- Robert Clipsham (6/8) Dec 05 2011 Have you thought about pushing for the inclusion of CDGC at all/working
- Trass3r (1/6) Dec 05 2011 So true, it's been rotting in that branch.
- dsimcha (11/18) Dec 05 2011 IIRC CDGC includes two major enhancements:
- Trass3r (3/13) Dec 05 2011 What's the status of that anyway? Every patch in that bugzilla entry is ...
- Sean Kelly (7/14) Dec 06 2011 I have an updated and win32-compilable version of CDGC in a branch. It's...
- Sean Kelly (22/68) Dec 07 2011 collector again, and I've stumbled on the fact that rt_finalize is =
- bearophile (4/6) Dec 09 2011 Are you going to create a pull request soon? So people will be able to t...
I'm at my traditional pastime of trying to speed up D's garbage collector again, and I've stumbled on the fact that rt_finalize is taking up a ridiculous share of the time (~30% of total runtime) on a benchmark where huge numbers of classes **that don't have destructors** are being created and collected. Here's the code to this function, from lifetime.d:

extern (C) void rt_finalize(void* p, bool det = true)
{
    debug(PRINTF) printf("rt_finalize(p = %p)\n", p);

    if (p) // not necessary if called from gc
    {
        ClassInfo** pc = cast(ClassInfo**)p;

        if (*pc)
        {
            ClassInfo c = **pc;
            byte[] w = c.init;

            try
            {
                if (det || collectHandler is null || collectHandler(cast(Object)p))
                {
                    do
                    {
                        if (c.destructor)
                        {
                            fp_t fp = cast(fp_t)c.destructor;
                            (*fp)(cast(Object)p); // call destructor
                        }
                        c = c.base;
                    } while (c);
                }
                if ((cast(void**)p)[1]) // if monitor is not null
                    _d_monitordelete(cast(Object)p, det);
                (cast(byte*) p)[0 .. w.length] = w[]; // WTF?
            }
            catch (Throwable e)
            {
                onFinalizeError(**pc, e);
            }
            finally // WTF?
            {
                *pc = null; // zero vptr
            }
        }
    }
}

Getting rid of the stuff I've marked with //WTF? comments (namely the finally block and the re-initializing of the memory occupied by the finalized object) speeds things up by ~15% on the benchmark in question. Why do we care what state the blob of memory is left in after we finalize it? I can kind of see that we want to clear things if delete/clear was called manually, so that the object is left in a state that doesn't look valid. However, this has significant performance costs, IIRC it is already done in clear(), and delete is supposed to be deprecated. Furthermore, I'd like to get rid of the finally block entirely, since I assume its presence and its effect on the generated code are causing the slowdown, not the body, which just assigns a pointer. Is there any good reason to keep this code around?
Dec 04 2011
On Mon, 05 Dec 2011 02:46:27 +0100, dsimcha <dsimcha yahoo.com> wrote:
> [full rt_finalize listing snipped]
> Furthermore, I'd like to get rid of the finally block entirely [...]
> Is there any good reason to keep this code around?

Not for the try block.
Since errors are not recoverable, you don't need to care about zeroing the vtbl, or you could just copy that code into the catch handler. This seems to cause fewer spilled variables.

The most expensive part is the call through the memcpy PLT stub; replace it with something inlineable. Zeroing is not much faster than copying init[] for small classes. At least zeroing should be worth it, unless the GC would not scan the memory otherwise.

gcbench/tree1 => 41.8s
=> https://gist.github.com/1432117 =>
gcbench/tree1 => 33.4s

Please add useful benchmarks to druntime.

martin
Dec 04 2011
Thanks for the benchmark. I ended up deciding to just create a second function, rt_finalize_gc, that gets rid of a whole bunch of cruft that isn't necessary in the GC case. I think it's worth the small amount of code duplication it creates. Here are the results of my efforts so far: https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 . I've got one other good idea that I think will shave a few seconds off the Tree1 benchmark if I don't run into any unforeseen obstacles in implementing it.

On 12/4/2011 10:07 PM, Martin Nowak wrote:
> [quoted rt_finalize discussion snipped]
> Most expensive is the call to a memcpy PLT, replace it with something
> inlineable. [...]
> Please add useful benchmarks to druntime.
Dec 04 2011
On Sunday, December 04, 2011 23:41:08 dsimcha wrote:
> Thanks for the benchmark. I ended up deciding to just create a second
> function, rt_finalize_gc, that gets rid of a whole bunch of cruft that
> isn't necessary in the GC case. I think it's worth the small amount of
> code duplication it creates. [...]

It's best to avoid code duplication in general, but I'd argue that if only a small amount of code duplication is required to get a big gain from the GC, it's well worth it. In general, anything we can do to improve the GC is worthwhile, since it's _the_ place where D risks being inefficient in comparison to languages such as C++.

- Jonathan M Davis
Dec 04 2011
dsimcha:
> I've stumbled on the fact that rt_finalize is taking up a ridiculous
> share of the time (~30% of total runtime) on a benchmark where huge
> numbers of classes **that don't have destructors** are being created
> and collected.

DMD or LDC/GDC compiler?

> extern (C) void rt_finalize(void* p, bool det = true)
> {
>     debug(PRINTF) printf("rt_finalize(p = %p)\n", p);
>     if (p) // not necessary if called from gc
>     {

That if(p) seems fit to become a static if on a bool template argument (but it needs to become D code or two C functions):

void rt_finalize(bool byGC)(void* p, bool det = true) { ...

Bye,
bearophile
Dec 04 2011
Last time I looked at it, the try/catch/finally block was rather expensive because it always invokes the exception handler unwinding mechanism, even if no exception occurs. Try moving the try/catch block out of the loops that call rt_finalize. (Maybe just remove it: onFinalizeError just rethrows, and it seems no one is using the information that the exception is thrown from inside the finalizer.)

On 05.12.2011 02:46, dsimcha wrote:
> [full rt_finalize listing snipped]
> Is there any good reason to keep this code around?
Dec 05 2011
On Mon, 05 Dec 2011 10:14:00 +0200, Rainer Schuetze <r.sagitario gmx.de> wrote:
> Try moving the try/catch block out of the loops that call rt_finalize.

This sounds like a good idea. Just make sure that all code paths that lead to rt_finalize (e.g. delete) can recover from an exception.

> (maybe just remove it, onFinalizeError just rethrows, and it seems no
> one is using the information that the exception is thrown from inside
> the finalizer)

The point of onFinalizeError is that it throws an *error* (it is unrecoverable). Currently, the GC cannot recover from an exception thrown by a finalizer, which is why onFinalizeError exists.

--
Best regards,
Vladimir mailto:vladimir thecybershadow.net
Dec 05 2011
On Mon, 05 Dec 2011 09:14:00 +0100, Rainer Schuetze <r.sagitario gmx.de> wrote:
> Last time I looked at it, the try/catch/finally block was rather
> expensive because it always invokes the exception handler unwinding
> mechanism, even if no exception occurs. Try moving the try/catch block
> out of the loops that call rt_finalize.
> [nested quote of the original rt_finalize post snipped]

Just an unconditional jump into the finally body here, but it still affects register assignment. Moving the exception handler out to the sweep scope seems promising; it can save lots of register saves.

I appreciate the recursion during mark; I wanted to do this myself some time ago but expected a little more gain.

Some more ideas:
- Do a major refactoring of the GC code, making it less reluctant to changes. Adding sanity checks or unit tests would be great. This probably reveals some obfuscated performance issues.
- Add more realistic GC benchmarks; this just requires adding to druntime/test/gcbench using the new runbench. The tree1 benchmark mainly uses homogeneous classes, so it is very synthetic.
- There is one binary search pool lookup for every scanned address in range. There should be a lot to gain here, but it's difficult. It needs a multilevel mixture of bitset/hashtab.
- Reduce the GC roots range. I will have to work on this for shared library support anyhow.

martin
Dec 05 2011
== Quote from Martin Nowak (dawg dawgfoto.de)'s article
> I appreciate the recursion during mark, wanted to do this myself
> sometime ago but expected a little more gain.

The reason the gain wasn't huge is because on the benchmark I have that involves a deep heap graph, sweeping time dominates marking time. The performance gain for the mark phase only (which is important b/c this is when the world needs to be stopped) is ~20-30%.

> Some more ideas:
> - Do a major refactoring of the GC code, making it less reluctant to
>   changes. Adding sanity checks or unit tests would be great. This
>   probably reveals some obfuscated performance issues.

Not just obfuscated ones. I've wanted to fix an obvious perf bug for two years and haven't done it because the necessary refactoring would be unbelievably messy and I'm too afraid I'll break something. Basically, malloc() sets the bytes between the size you requested and the size of the block actually allocated to zero to prevent false pointers. This is reasonable. The problem is that it does so **while holding the GC's lock**. Fixing it for just the case when malloc() is called by the user is easy. The problem is fixing it when malloc() gets called from realloc(), calloc(), etc.

> - Add more realistic GC benchmarks, just requires adding to
>   druntime/test/gcbench using the new runbench. The tree1 mainly uses
>   homogeneous classes, so this is very synthesized.

I'll crowdsource this. I can't think of any good benchmarks that are < a few hundred lines w/ no dependencies but aren't pretty synthetic.

> - There is one binary search pool lookup for every scanned address in
>   range. Should be a lot to gain here, but it's difficult. It needs a
>   multilevel mixture of bitset/hashtab.

I understand the problem, but please elaborate on the proposed solution. You've basically got a bunch of pools, each of which represents a range of memory addresses, not a single address (so a basic hashtable is out). You need to know which range some pointer fits in. How would you beat binary search/O(log N) for this?

> - Reduce the GC roots range. I will have to work on this for shared
>   library support anyhow.

Please clarify what you mean by "reduce" the roots range.

Thanks for the feedback/suggestions.
Dec 05 2011
On Mon, 05 Dec 2011 23:07:09 +0200, dsimcha <dsimcha yahoo.com> wrote:
> I understand the problem, but please elaborate on the proposed solution.
> [...] How would you beat binary search/O(log N) for this?

A tree, with a few bits of the address space per level. It becomes bound to the size of the address space, not the number of pools.

--
Best regards,
Vladimir mailto:vladimir thecybershadow.net
Dec 05 2011
On Mon, 05 Dec 2011 22:07:09 +0100, dsimcha <dsimcha yahoo.com> wrote:
> I'll crowdsource this. I can't think of any good benchmarks that are < a
> few hundred lines w/ no dependencies but aren't pretty synthetic.

Probably write a tool that can replay recorded allocation profiles. But it will be difficult to fake internal pointers if they are not recorded. If we were able to enable recording in the runtime, people having performance issues with the GC could then send a file that can be used for optimizations.

> I understand the problem, but please elaborate on the proposed solution.
> You've basically got a bunch of pools, each of which represents a range
> of memory addresses, not a single address (so a basic hashtable is out).
> You need to know which range some pointer fits in. How would you beat
> binary search/O(log N) for this?

It is possible to use a hashtable or a bitset because each pool occupies at least PAGESIZE. One would need to use addresses relative to the minimum address, but that still needs 256K entries per GByte, so the maintenance overhead is likely too big.

More promising is to put pool address ranges in a trie.

addr[7]          [... . ...]
                /     |     \
addr[6]  [... . ...]       [... . ...]
          / | \             /   |   \
addr[5]  pool:8            [... . ...]
                            /   |   \
addr[4]               pool:8 [....] pool:5

Lookup is bounded constant time, with at most 8 (4) memory accesses. Maintenance is only required when adding/removing pools, though it is a little sophisticated.

-----
alias Node Trie;

struct Node
{
    union
    {
        Type _type;
        Pool* _pool;
        Node[16]* _nodes16;   // uses upper 3 bits per level
        ...
        Node[256]* _nodes256; // uses full 8 bits per level
    }

    // use lower 3 bits to encode type, given guaranteed 8-byte alignment of malloc
    enum NTypeBits = 3;
    enum TypeMask = (1 << NTypeBits) - 1;
    enum PtrMask = ~TypeMask;
    enum Type { Empty, Pool, Nodes16, Nodes256 }
    static assert(Type.max <= TypeMask);

    Pool* getPool(void* p)
    {
        return getPoolImpl((cast(ubyte*)&p) + size_t.sizeof - 1);
    }

    Pool* getPoolImpl(ubyte* p)
    {
        Node* node = void;
        final switch (type)
        {
            case Type.Empty:
                return null;
            case Type.Pool:
                return pool;
            case Type.Nodes16:
                node = nodes16[*p >> 5];
                break;
            case Type.Nodes256:
                node = nodes256[*p];
                break;
        }
        if (node.type == Type.Pool)
            return node.pool;
        else
            return node.getPoolImpl(p - 1);
    }

    void insertPool(Pool* npool, uint level = 0)
    {
        final switch (type)
        {
            case Type.Empty:
                pool = npool;
                break;
            case Type.Pool:
                Pool* opool = pool;
                if (opool != npool)
                {
                    nodes16 = new Node[16];
                    foreach (i; 0 .. 16)
                    {
                        // non-trivial code here
                        // Needs to figure out into which subnode opool and npool
                        // should be inserted. They can use pool.minAddr and pool.maxAddr
                        nodes16[i].insertPool(opool, level + 1);
                        nodes16[i].insertPool(npool, level + 1);
                    }
                }
                break;
            case Type.Nodes16:
                // count empty nodes and decide whether
                // to switch to more child nodes.
        }
    }

    void removePool(Pool* npool, uint level = 0)
    {
        // reverse of above
    }

    @property Type type() { return cast(Type)(_type & TypeMask); }
    @property void type(Type t) { assert(t <= TypeMask); _type |= t; }

    @property Pool* pool() { return cast(Pool*)(cast(size_t)_pool & PtrMask); }
    @property void pool(Pool* p) { _pool = p; type = Type.Pool; }

    ... for nodesN
}
------

> Please clarify what you mean by "reduce" the roots range.

We currently add something from etext to _end (rt.memory) as a static root. This probably contains read-only sections, data from other languages and (unlikely) other unrelated sections. I think some help from the compiler will be necessary to support shared libraries anyhow, so we should be able to get a precise range.
Dec 05 2011
On 5 December 2011 22:34, Martin Nowak <dawg dawgfoto.de> wrote:
> We currently add something from etext to _end (rt.memory) as static
> root. This probably contains read-only sections, data from other
> languages and (unlikely) other unrelated sections. I think some help
> from the compiler will be necessary to support shared libraries anyhow,
> so we should be able to get a precise range.

Indeed. The current implementation kicking around is to scan /proc/self/maps on init (this is true for the runtime that ships with GDC at least). Which is not the most pleasant of things to do - not to mention it only supports Unix-flavoured systems - but it works well enough when linking against shared D libraries.

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Dec 05 2011
On 2011-12-05 23:45, Iain Buclaw wrote:
> Indeed. The current implementation kicking around is to scan
> /proc/self/maps on init (this is true for the runtime that ships with
> GDC at least). Which is not the most pleasant of things to do - not to
> mention only supports Unix-flavoured systems, but it works well enough
> for when linking against shared D libraries.

/proc doesn't exist on Mac OS X.

--
/Jacob Carlborg
Dec 05 2011
> More promising is to put pool address ranges in a trie.
>
> addr[7]          [... . ...]
>                 /     |     \
> addr[6]  [... . ...]       [... . ...]
>           / | \             /   |   \
> addr[5]  pool:8            [... . ...]
>                             /   |   \
> addr[4]               pool:8 [....] pool:5

Actually, 64-bit should use a hashtable for the upper 32 bits and then the 32-bit trie for the lower.
Dec 05 2011
== Quote from Martin Nowak (dawg dawgfoto.de)'s article
> Actually 64-bit should use a hashtable for the upper 32-bit and then
> the 32-bit trie for the lower.

Why do you expect this to be faster than a binary search? I'm not saying it won't be, just that it's not a home run that deserves a high priority as an optimization. You still have a whole bunch of indirections, probably more than you would ever have for binary search.
Dec 05 2011
On Tue, 06 Dec 2011 00:16:01 +0100, dsimcha <dsimcha yahoo.com> wrote:
> Why do you expect this to be faster than a binary search? I'm not saying
> it won't be, just that it's not a home run that deserves a high priority
> as an optimization. You still have a whole bunch of indirections,
> probably more than you would ever have for binary search.

It's the tightest loop in the garbage collector. It will end with few array accesses, and the locality of the memory being pointed to is paired with tree locality, so you'd have a good chance of finding the nodes in the cache. What this gives you is that it scales very well to many pools/regions, so you can better tweak the pool granularity against pool sizes.

Also:
for (p < pend) if (*p in lastPool)
can be:
for (p < pend) if (*p in lastNode)

It is definitely no home run, but I'll try it out when I find some time.
Dec 05 2011
On Sun, 04 Dec 2011 20:46:27 -0500, dsimcha <dsimcha yahoo.com> wrote:
> [full rt_finalize listing snipped]
> Furthermore, I'd like to get rid of the finally block entirely, since I
> assume its presence and the effect on the generated code is causing the
> slowdown, not the body, which just assigns a pointer.

I think it might be a good idea to move those two operations to the clear function. Currently, clear(obj) simply calls rt_finalize(obj).

It does seem rather strange to have that finally there; why do we care that the vptr is zeroed on an exception/error? I'd at least move that code up to where the first WTF is. But I think we do need to keep the reinitialization of the object and the zeroing of the vptr for clear, since it's part of clear's charter.

I'd support any effort to speed up the GC; it's definitely the worst offender for performance. Looks like there's quite a few good ideas in this thread.

-Steve
Dec 05 2011
On 05/12/2011 01:46, dsimcha wrote:

> I'm at my traditional pastime of trying to speed up D's garbage
> collector again

Have you thought about pushing for the inclusion of CDGC at
all/working on the tweaks needed to make it the main GC?

-- 
Robert
http://octarineparrot.com/
Dec 05 2011
On 05/12/2011 01:46, dsimcha wrote:

>> I'm at my traditional pastime of trying to speed up D's garbage
>> collector again
>
> Have you thought about pushing for the inclusion of CDGC at
> all/working on the tweaks needed to make it the main GC?

So true, it's been rotting in that branch.
Dec 05 2011
On 12/5/2011 6:39 PM, Trass3r wrote:

>>> I'm at my traditional pastime of trying to speed up D's garbage
>>> collector again
>>
>> Have you thought about pushing for the inclusion of CDGC at
>> all/working on the tweaks needed to make it the main GC?
>
> So true, it's been rotting in that branch.

IIRC CDGC includes two major enhancements:

1. The snapshot GC for Linux. (Does this work on OSX/FreeBSD/anything
Posix, or just Linux? I'm a bit skeptical about whether a snapshot GC
is really that great an idea, given its propensity to waste memory on
long collect cycles with a lot of mutation.)

2. I think there was some precise heap scanning-related stuff in it. I
originally tried to implement precise heap scanning a couple of years
ago, but it went nowhere for reasons too complicated to explain here.
Given this experience, I'm not inclined to try again until the compiler
has extensions for generating pointer offset information.
Dec 05 2011
> IIRC CDGC includes two major enhancements:
>
> 1. The snapshot GC for Linux. (Does this work on OSX/FreeBSD/anything
> Posix, or just Linux? I'm a bit skeptical about whether a snapshot GC
> is really that great an idea, given its propensity to waste memory on
> long collect cycles with a lot of mutation.)

I guess, at least it uses fork IIRC.

> 2. I think there was some precise heap scanning-related stuff in it.
> I originally tried to implement precise heap scanning a couple of
> years ago, but it went nowhere for reasons too complicated to explain
> here. Given this experience, I'm not inclined to try again until the
> compiler has extensions for generating pointer offset information.

What's the status of that anyway? Every patch in that bugzilla entry is
marked obsolete and there's no real final answer.
Dec 05 2011
I have an updated and win32-compilable version of CDGC in a branch.
It's missing some features from the current GC though (it's based on
the Tango GC, which has remained relatively static for the past years
while druntime's GC has improved).

Sent from my iPhone

On Dec 5, 2011, at 3:39 PM, Trass3r <un known.com> wrote:

>>> I'm at my traditional pastime of trying to speed up D's garbage
>>> collector again
>>
>> Have you thought about pushing for the inclusion of CDGC at
>> all/working on the tweaks needed to make it the main GC?
>
> So true, it's been rotting in that branch.
Dec 06 2011
On Dec 4, 2011, at 5:46 PM, dsimcha wrote:

> I'm at my traditional pastime of trying to speed up D's garbage
> collector again, and I've stumbled on the fact that rt_finalize is
> taking up a ridiculous share of the time (~30% of total runtime) on a
> benchmark where huge numbers of classes **that don't have
> destructors** are being created and collected. Here's the code to
> this function, from lifetime.d:
>
> extern (C) void rt_finalize(void* p, bool det = true)
> {
>     debug(PRINTF) printf("rt_finalize(p = %p)\n", p);
>
>     if (p) // not necessary if called from gc
>     {
>         ClassInfo** pc = cast(ClassInfo**)p;
>
>         if (*pc)
>         {
>             ClassInfo c = **pc;
>             byte[] w = c.init;
>
>             try
>             {
>                 if (det || collectHandler is null ||
>                     collectHandler(cast(Object)p))
>                 {
>                     do
>                     {
>                         if (c.destructor)
>                         {
>                             fp_t fp = cast(fp_t)c.destructor;
>                             (*fp)(cast(Object)p); // call destructor
>                         }
>                         c = c.base;
>                     } while (c);
>                 }
>                 if ((cast(void**)p)[1]) // if monitor is not null
>                     _d_monitordelete(cast(Object)p, det);
>                 (cast(byte*) p)[0 .. w.length] = w[]; // WTF?
>             }
>             catch (Throwable e)
>             {
>                 onFinalizeError(**pc, e);
>             }
>             finally // WTF?
>             {
>                 *pc = null; // zero vptr
>             }
>         }
>     }
> }
>
> Getting rid of the stuff I've marked with //WTF? comments (namely the
> finally block and the re-initializing of the memory occupied by the
> finalized object) speeds things up by ~15% on the benchmark in
> question. Why do we care what state the blob of memory is left in
> after we finalize it? I can kind of see that we want to clear things
> if delete/clear was called manually and we want to leave the object
> in a state that doesn't look valid. However, this has significant
> performance costs and IIRC is already done in clear(), and delete is
> supposed to be deprecated. Furthermore, I'd like to get rid of the
> finally block entirely, since I assume its presence and its effect on
> the generated code is causing the slowdown, not the body, which just
> assigns a pointer.
Now that having multiple in-flight exceptions is actually legal, we can
probably just throw out the try/catch entirely. The try/catch and call
to onFinalizeError is a holdover from when it was effectively illegal
to throw from a finalizer.
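If both the try/catch and the finally went away, rt_finalize could collapse to straight-line code. A hypothetical sketch of that shape, reusing the names from lifetime.d (the restructuring is mine, not a proposed patch):

```d
// Sketch: rt_finalize with no exception handling scaffolding. A
// throwing destructor simply propagates, which is legal now that
// multiple in-flight exceptions are allowed. The vptr is zeroed on
// the straight-line path, so no finally block is needed.
extern (C) void rt_finalize(void* p, bool det = true)
{
    if (p is null) return;
    ClassInfo** pc = cast(ClassInfo**)p;
    if (*pc is null) return;

    if (det || collectHandler is null || collectHandler(cast(Object)p))
    {
        for (ClassInfo c = **pc; c !is null; c = c.base)
        {
            if (c.destructor)
                (cast(fp_t)c.destructor)(cast(Object)p); // call destructor
        }
    }
    if ((cast(void**)p)[1]) // if monitor is not null
        _d_monitordelete(cast(Object)p, det);

    *pc = null; // zero vptr
}
```

Note one behavioral difference: if a destructor throws, the vptr is never zeroed, which is arguably fine since the exception is now the caller's problem rather than onFinalizeError's.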
Dec 07 2011
dsimcha:

> I'm at my traditional pastime of trying to speed up D's garbage
> collector again,

Are you going to create a pull request soon? So people will be able to
try it with their D programs and spot potential troubles...

Bye,
bearophile
Dec 09 2011