digitalmars.D - TempAlloc and druntime GC
- dsimcha (45/45) Jan 18 2009 For those of you who aren't familiar with TempAlloc, it's a region-based
- Sean Kelly (11/33) Jan 19 2009 The way I'd do this at the druntime level is generate a Thread.Context s...
- dsimcha (24/57) Jan 19 2009 Can you elaborate on this idea? I don't understand it, so I can't comme...
- Sean Kelly (25/79) Jan 19 2009 The data range of every execution context in druntime is tracked by the ...
- dsimcha (25/43) Jan 19 2009 yet more
- Daniel Keep (34/47) Jan 19 2009 Just a random thought here; what if you had this:
- dsimcha (6/18) Jan 19 2009 Never mind as far as this. I realized that the reason why the try/catch...
For those of you who aren't familiar with TempAlloc, it's a region-based memory allocator that's essentially a second stack for allocating memory in a last in, first out manner. It is based on an idea proposed on this newsgroup a few months back by Andrei under the name SuperStack, and was later implemented by me. The goal is to provide fast thread-local pointer-bumping memory allocation for objects whose lifetime is bound to a scope, while allowing the whole heap, instead of the relatively limited call stack space, to be used. According to some benchmarks I've done, TempAlloc can be anywhere from 30% to tens of times faster than the druntime GC allocator, depending on several factors, the biggest being lock contention for the druntime allocator.

TempAlloc's biggest weakness is that nothing allocated by TempAlloc is scanned by the GC, making it impossible to safely store the only reference to GC-allocated objects inside TempAlloc-allocated space. I've been using TempAlloc in my own personal projects for a while now and have not found this to be a significant limitation in practice. Usually, if I'm using TempAlloc, the whole point is to avoid generating GC heap activity within the function I'm working in, meaning that I don't create any new GC-allocated objects in that function. If the objects were created before calling that function, there's another reference to them somewhere. However, for people who are not as familiar with how TempAlloc or the GC is implemented, it may be easy to screw up, making TempAlloc a shoe with a gun built in and probably unsuitable for inclusion in Phobos or Tango. I've been thinking about how to deal with this issue and make TempAlloc safe enough for inclusion in the standard library.
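To make the mechanism concrete, here is a rough sketch in C of a "second stack" region allocator in the spirit of what's described above. All names and details are illustrative, not TempAlloc's actual implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal sketch of a region-based "second stack" allocator.
 * Illustrative only; TempAlloc's real code differs. */
#define REGION_SIZE (4 * 1024 * 1024)   /* TempAlloc uses 4 MB regions */

typedef struct {
    char*  base;   /* start of the region, obtained from the C heap */
    size_t used;   /* bump-pointer offset into the region */
} Region;

static int region_init(Region* r) {
    r->base = malloc(REGION_SIZE);
    r->used = 0;
    return r->base != NULL;
}

/* Bump allocation: round the request up to a 16-byte multiple and
 * advance the offset.  Note: offsets are 16-byte multiples, but (as
 * discussed later in this thread) malloc's base alignment may be weaker. */
static void* region_alloc(Region* r, size_t n) {
    n = (n + 15) & ~(size_t)15;
    if (r->used + n > REGION_SIZE)
        return NULL;               /* real code would chain a new region */
    void* p = r->base + r->used;
    r->used += n;
    return p;
}

/* LIFO free: give back the most recent allocation by its (rounded) size.
 * No per-object metadata is needed in this sketch. */
static void region_free_last(Region* r, size_t n) {
    n = (n + 15) & ~(size_t)15;
    assert(r->used >= n);
    r->used -= n;
}
```

Because allocation is a bounds check plus an addition, and the state is thread-local, there is no lock and no per-allocation header, which is where the speedup over a general-purpose GC heap comes from.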
Since it works by allocating large (currently 4 MB) regions from the C heap and then sub-allocating these regions to its clients, scanning these whole chunks is simply not a reasonable option, as it would lead to way too many false pointers. Using the GC's addRange/removeRange interface at every TempAlloc allocation is also not reasonable, as this operation requires a lock and would negate any performance gain from using TempAlloc instead of the GC heap. The only alternative I see is to make the GC aware of TempAlloc at a very low level. I've looked at the GC code, and this doesn't look terribly hard to do. Using some bit-twiddling tricks, I can do the bookkeeping to determine which TempAlloc-allocated objects should be scanned by the GC w/o any additional space overhead.

The downside to this is that TempAlloc would be very tightly coupled to druntime. Furthermore, if Sean and the other druntime maintainers decided they didn't like my patch, my effort would be wasted. Therefore, I'd like to get some feedback before I bother doing anything. Do you think TempAlloc is universally useful enough to belong in the standard library? If so, do you believe it is only suitable for inclusion if it can be made safe w/ respect to storing the only reference to GC-allocated objects? And if so, is it worth tightly coupling it to the GC, and should I start working on this and submit a patch to druntime?
Jan 18 2009
== Quote from dsimcha (dsimcha yahoo.com)'s article
> I've been thinking about how to deal with this issue and make TempAlloc safe
> enough for inclusion in the standard library. [snip] Using some bit twiddling
> tricks, I can do the bookkeeping to determine which TempAlloc allocated
> objects should be scanned by the GC w/o any additional space overhead.

The way I'd do this at the druntime level is to generate a Thread.Context struct for each distinct block used by TempAlloc. It shouldn't be too difficult to create an API to provide this functionality. I'd have to do something roughly along these lines anyway to get thread support working for DLLs in Win32.

> The downside to this is that TempAlloc would be very tightly coupled to
> druntime. [snip] If so, is it worth tightly coupling it to the GC, and should
> I start working on this and submit a patch to druntime?

For any new runtime features, the easiest thing is just to submit a ticket or to contact me and tell me what you need. At first blush, I don't see any problem with providing the functions you'd need to make this work. It may eventually make sense to put TempAlloc, or something like it, into core.memory anyway (there's a reason I didn't name that module core.gc).

Sean
Jan 19 2009
== Quote from Sean Kelly (sean invisibleduck.org)'s article
> The way I'd do this at the druntime level is generate a Thread.Context struct
> for each distinct block used by TempAlloc. It shouldn't be too difficult to
> create an API to provide this functionality. I'd have to do something roughly
> along these lines anyway to get thread support working for DLLs in Win32.

Can you elaborate on this idea? I don't understand it, so I can't comment on whether it would work.

It seems like you want to create some kind of more abstract interface to solve the more general problem of interfacing things with the GC, rather than just integrating TempAlloc directly at a low level. One way this could be done (not necessarily the best way, but the best I've thought of so far) is to make it possible to register class objects conforming to some specific range interface with the GC (an explicit runtime interface w/ virtual functions, not an implicit compile-time interface, since we care about the ABI here). These objects would return a pointer at each iteration, and the GC would mark whatever was pointed to by that pointer. This would allow me to use some hacks to keep TempAlloc efficient, while keeping those hacks well-encapsulated and out of the core GC code.

In TempAlloc's case specifically, every time a new thread's TempAlloc state was initialized, it would register one of these objects with the GC. This object would contain the thread's ID internally, and would be set to return empty immediately if the thread was already dead. (TempAlloc-allocated data is absolutely NOT meant to be shared between threads.) Otherwise, it would scan through the currently allocated elements and provide the GC with the pointers located there, through the range interface. I'm not sure whether this would also work for the DLL stuff, since I know nothing about that problem. However, if it would solve the other cases you have in mind, I think it's an elegant solution.
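The registered-object idea can be sketched in C, with function pointers standing in for the virtual range interface. All names here are hypothetical; druntime exposed no such API at the time:

```c
#include <stddef.h>

/* Hypothetical sketch: each registered "root range" object exposes
 * empty/front/popFront through function pointers, mimicking a D class
 * implementing a runtime range interface.  During its mark phase the GC
 * walks each registered range and marks whatever the yielded pointers
 * reference. */
typedef struct RootRange RootRange;
struct RootRange {
    int   (*empty)(RootRange*);     /* done iterating (or owning thread dead)? */
    void* (*front)(RootRange*);     /* current pointer for the GC to mark */
    void  (*popFront)(RootRange*);  /* advance to the next pointer */
};

/* A trivial implementation over an array of pointers. */
typedef struct {
    RootRange base;    /* must be first so the struct can pose as a RootRange */
    void**    ptrs;
    size_t    len, i;
} ArrayRange;

static int   ar_empty(RootRange* r)    { ArrayRange* a = (ArrayRange*)r; return a->i >= a->len; }
static void* ar_front(RootRange* r)    { ArrayRange* a = (ArrayRange*)r; return a->ptrs[a->i]; }
static void  ar_popFront(RootRange* r) { ArrayRange* a = (ArrayRange*)r; a->i++; }

/* What the GC's mark phase would do with a registered range: visit every
 * pointer it yields.  Here "marking" just counts non-null visits. */
static size_t gc_scan_range(RootRange* r) {
    size_t marked = 0;
    for (; !r->empty(r); r->popFront(r))
        if (r->front(r) != NULL)
            marked++;
    return marked;
}
```

The point of the indirection is that the allocator's internal bookkeeping (tag bits, dead-thread checks, and so on) stays behind the range interface, out of the core GC code.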
Jan 19 2009
== Quote from dsimcha (dsimcha yahoo.com)'s article
> Can you elaborate on this idea? I don't understand it, so I can't comment on
> whether it would work.

The data range of every execution context in druntime is tracked by the GC via a linked list of structs that look roughly like this:

struct Context
{
    void* begin;
    void* end;
    Context* next;
}

The stack of each kernel thread has one, as does each user thread (Fiber). Currently, this is all internal, but I could expose an API along these lines:

void attachContext( Context* );
void detachContext( Context* );

Each TempAlloc range would have its own Context struct which it would attach on creation and detach upon destruction. You're still paying for a mutex call to attach and detach the struct, but the begin and end pointers can be modified at will, which gets rid of the problem with addRange, etc. I suppose I could add this to the interface of GC as well, instead of sticking yet more logic in core.thread.

> It seems like you want to create some kind of more abstract interface to
> solve the more general problem of interfacing stuff with the GC, rather than
> just integrating TempAlloc directly at a low level. [snip]

That's basically how the Context stuff works above.

> In TempAlloc's case specifically, every time a new thread's TempAlloc state
> was initialized, it would register one of these objects with the GC. [snip]

Hm... this may be more tricky than I had outlined above. If I were to do this, I'd probably expose it from GC rather than allow access to the existing context mechanism. However... I /could/ possibly add an onExit hook for the Thread object. I'll have to think about whether that could cause any problems.

Sean
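A minimal sketch of this Context list in C: the names follow Sean's description, but the locking and actual GC integration are elided, so treat it as an illustration only:

```c
#include <stddef.h>

/* Sketch of the Context list described above: the GC conservatively scans
 * [begin, end) of every attached context.  In real code, attachContext and
 * detachContext would take the GC mutex; begin/end can then be updated
 * freely by the owning thread without further locking. */
typedef struct Context {
    void*           begin;
    void*           end;
    struct Context* next;
} Context;

static Context* g_contexts = NULL;   /* head of the global list */

static void attachContext(Context* c) {
    /* real code: acquire GC mutex here */
    c->next = g_contexts;
    g_contexts = c;
}

static void detachContext(Context* c) {
    /* real code: acquire GC mutex here */
    for (Context** p = &g_contexts; *p != NULL; p = &(*p)->next) {
        if (*p == c) {
            *p = c->next;
            return;
        }
    }
}

/* What the mark phase sees: total bytes the GC would scan. */
static size_t scannableBytes(void) {
    size_t n = 0;
    for (Context* c = g_contexts; c != NULL; c = c->next)
        n += (size_t)((char*)c->end - (char*)c->begin);
    return n;
}
```

The key property is that only attach and detach are synchronized; growing or shrinking an already-attached range is just an unsynchronized store to begin or end, which is what makes it cheaper than addRange per allocation.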
Jan 19 2009
== Quote from Sean Kelly (sean invisibleduck.org)'s article
> The data range of every execution context in druntime is tracked by the GC
> via a linked list of structs that look roughly like this: [snip] I suppose I
> could add this to the interface of GC as well, instead of sticking yet more
> logic in core.thread.

The problem with this is that I'd ideally like to control whether each TempAlloc-allocated object is scanned by the GC individually, rather than scanning the entire used portion of the TempAlloc stack. Having to create and store a Context struct for every object would be incredibly inefficient, and scanning the entire used portion would be incredibly conservative. At an implementation level, TempAlloc already uses an array of void*s internally to track what it has allocated (only one per object, in an array, not a linked list) so that objects can be freed properly. Since TempAlloc allocates with 16-byte alignment, I was planning to avoid any extra overhead by storing the scan/noscan bits in the low-order bits of these pointers.

One more note: it would also be greatly appreciated if you would expose a version of GC.malloc that returns null on failure instead of throwing an exception. For performance, TempAlloc treats out of memory as non-recoverable. This is necessary because scope statements are often used to manage freeing, so the performance hit from having TempAlloc-related functions be throwing would be too large. Also, making TempAlloc functions non-throwing by catching OutOfMemoryErrors and aborting is itself expensive. Furthermore, I just realized (literally while writing this post) that C's malloc (which I use only b/c it doesn't throw) doesn't always return 16-byte aligned pointers like druntime's allocator does. Ideally, I'd like to stop using the C heap and start using the druntime heap, but I can't do that until there is a non-throwing implementation of GC.malloc.
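The pointer-tagging bookkeeping described above can be sketched in C: with 16-byte alignment, the low four bits of every stored pointer are guaranteed zero, so a scan/noscan flag fits in bit 0 at no space cost. This is a sketch of the technique, not TempAlloc's actual code:

```c
#include <stdint.h>

/* Sketch: store a scan/noscan flag in bit 0 of a 16-byte-aligned pointer.
 * The allocator's internal array of void*s can carry the flag for free;
 * the real address is recovered by masking off the alignment bits. */
#define SCAN_BIT ((uintptr_t)1)

static void* tag_scan(void* p, int shouldScan) {
    return (void*)((uintptr_t)p | (shouldScan ? SCAN_BIT : 0));
}

static int should_scan(void* tagged) {
    return ((uintptr_t)tagged & SCAN_BIT) != 0;
}

static void* untag(void* tagged) {
    /* mask off all four alignment bits to recover the real address */
    return (void*)((uintptr_t)tagged & ~(uintptr_t)15);
}
```

At collection time, the GC (or the registered range object) would yield only the entries whose scan bit is set, after untagging them, so unscanned objects cost nothing extra.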
Jan 19 2009
dsimcha wrote:
> [snip] The problem with this is that I'd ideally like to control whether each
> TempAlloc-allocated object is scanned by the GC individually, rather than
> scanning the entire used portion of the TempAlloc stack. [snip] Since
> TempAlloc allocates using 16-byte alignment, I was planning to avoid any
> extra overhead by storing the scan/noscan bits in the low order bits of these
> pointers. [snip]

Just a random thought here; what if you had this:

struct Context
{
    void* begin;
    void* end;
    Context* next;
}

struct FatContext
{
    void* begin;
    void* end;
    Context* next;
    uint* bitmap;
}

void attachFatContext( FatContext* ctx )
{
    assert( (cast(size_t) ctx.next & 3) == 0 );
    ctx.next = cast(Context*)(cast(size_t) ctx.next | 1);
    attachContext( cast(Context*) ctx );
}

FatContext* isFatContext( Context* ctx )
{
    return (cast(size_t) ctx.next & 1) ? cast(FatContext*) ctx : null;
}

Context* nextContext( Context* ctx )
{
    return cast(Context*)(cast(size_t) ctx.next & ~cast(size_t) 3);
}

Now, we assume that bitmap points to the first element of an array of uints holding (end-begin)/16 bits. You'd only need one bit per 16 bytes, and it'd let you change the scan/noscan status of any area of memory without having to talk to the GC every time.

-- Daniel
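The bitmap half of this idea can be sketched in C: one bit per 16-byte slot of the region, packed into 32-bit words, so the scan/noscan status of any address can be flipped by the owning thread without taking the GC lock. Names are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the FatContext bitmap: bit i covers bytes [begin + 16*i,
 * begin + 16*(i+1)).  The bitmap must hold (end - begin) / 16 bits. */
static size_t bit_index(void* begin, void* p) {
    return (size_t)((char*)p - (char*)begin) / 16;
}

static void set_scan(uint32_t* bitmap, void* begin, void* p, int scan) {
    size_t i = bit_index(begin, p);
    if (scan)
        bitmap[i / 32] |=  (uint32_t)1 << (i % 32);
    else
        bitmap[i / 32] &= ~((uint32_t)1 << (i % 32));
}

static int get_scan(const uint32_t* bitmap, void* begin, void* p) {
    size_t i = bit_index(begin, p);
    return (bitmap[i / 32] >> (i % 32)) & 1;
}
```

During marking, the GC would consult get_scan for each 16-byte slot in [begin, end) and only follow pointers found in slots whose bit is set.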
Jan 19 2009
== Quote from dsimcha (dsimcha yahoo.com)'s article
> One more note: It would also be greatly appreciated if you would expose a
> version of GC.malloc that returns null on failure instead of throwing an
> exception. [snip] Ideally, I'd like to stop using the C heap and start using
> the druntime heap, but I can't do that until there is a non-throwing
> implementation of GC.malloc.

Never mind about this. I realized that the reason the try/catch blocks were costing so much performance is that I was doing them inline and, in some cases, putting them around a whole bunch of code. This was probably killing a lot of compiler optimizations. When I instead did it the non-stupid way and wrote a wrapper for malloc, there was no longer a significant performance penalty.
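A C analogue of that wrapper approach: centralize the failure handling (and, while at it, the 16-byte alignment fix-up mentioned earlier in the thread) in one out-of-line function, so hot call sites carry no error-handling code of their own. In C the "catch OutOfMemoryError and abort" step becomes a NULL check plus abort; this is a hedged sketch, not the actual TempAlloc wrapper:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Out-of-line allocation wrapper: treats OOM as non-recoverable, and pads
 * and rounds up so the returned pointer is 16-byte aligned even if
 * malloc's own alignment guarantee is weaker.  The raw pointer is stashed
 * just below the aligned one so it can be freed later. */
#define ALIGN 16

static void* xmalloc_aligned(size_t n) {
    void* raw = malloc(n + ALIGN + sizeof(void*));
    if (raw == NULL) {
        fputs("out of memory\n", stderr);
        abort();                    /* all failure handling lives here */
    }
    uintptr_t aligned = ((uintptr_t)raw + sizeof(void*) + ALIGN - 1)
                        & ~(uintptr_t)(ALIGN - 1);
    ((void**)aligned)[-1] = raw;    /* remember what to pass to free() */
    return (void*)aligned;
}

static void xfree_aligned(void* p) {
    if (p != NULL)
        free(((void**)p)[-1]);
}
```

Keeping the abort path out of line is exactly the point dsimcha makes above: the caller's fast path stays branch-light and optimizable, because the compiler never sees a try/catch (or error branch) at the call site.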
Jan 19 2009