digitalmars.D - Memory Management in D: Request for Comment
- dsimcha (52/52) Nov 02 2009 During my occasional forays into the dark side of Python and Java, I am ...
- Jason House (2/62) Nov 03 2009
- dsimcha (4/5) Nov 03 2009 Probably be better to submit it as a patch to Bugzilla if it's reasonabl...
- Jason House (3/9) Nov 03 2009 I did hack gcx.d, but only to make the global mutex public. The implemen...
- downs (2/2) Nov 03 2009 I submitted a patch a while back for constant-time removeRange. Can we d...
- dsimcha (8/10) Nov 03 2009 that one up and/or implement something similar? It's rather useful for t...
- Leandro Lucarella (18/55) Nov 03 2009 The Boehm GC has a lot to learn from. I'm planning to add concurrent
- BCS (11/19) Nov 04 2009 I think there are general malloc functions that are lock free. On the sa...
- dsimcha (11/30) Nov 04 2009 But x86 atomic ops, while not as slow as locks, are pretty slow. Some b...
- BCS (15/52) Nov 04 2009 IIRC a mutex/lock has to have a CAS at some point so if you can get a fu...
- Sean Kelly (3/10) Nov 04 2009 I actually started working on this a while back and then set it aside to...
- dsimcha (5/14) Nov 04 2009 care of some other things. I'm planning to revisit it once the concurre...
- Andrei Alexandrescu (3/18) Nov 04 2009 Performance improvements are fortunately not tied to the language spec.
- dsimcha (3/21) Nov 04 2009 But the concurrency model is.
- Andrei Alexandrescu (7/28) Nov 04 2009 Sorry, I misunderstood a bit. Anyway, the process is a continuum, so you...
During my occasional forays into the dark side of Python and Java, I am often amazed at the extent to which memory management in these languages "just works". D should be like this for all but the most low-level programming tasks, and was intended to be.

It seems like most of the other regulars here have carved out niches that don't involve improving memory management. My attempts at adding precise heap scanning to the GC got me thinking about other ways to improve memory management in D. I love most aspects of D, but the constant memory management issues make it feel like much less of a high-level language than it should. I'm thinking of making this my niche around here, as I already know more about the problem than I ever wanted to know and I'm sick of having memory management not work properly. Here are some things that I'd like comments on:

1. The Hans Boehm GC uses a blacklisting scheme whereby it avoids allocating memory pages that currently have false pointers pointing into them. (If a page is not allocated but looks like it has a pointer into it, then we can assume this is a false pointer.) If I dip into the GC code again and implement something like this, we'll be one step closer to making D memory management "just work" and making false pointers a thing of the past.

2. I've mentioned this a few times here before, but I wrote a second stack-based memory allocator called TempAlloc, which allows stack-based memory allocation that is not necessarily bound to function calls and automatically falls back on heap allocation for very large objects, rather than killing your program with a "stack overflow" error message. I've also written some implementations of common data structures (so far arrays, hash tables and sets; I'll probably add binary trees at some point) that are optimized for it. (See http://svn.dsource.org/projects/dstats/docs/alloc.html.) The biggest problem with TempAlloc is that it is not scanned by the GC, meaning that you can't store heap-allocated data in it unless you have another reference somewhere else. I don't know how to remedy this, partly because the stacks are thread-local and I don't know how to remove a range from the GC upon a thread terminating, even if I hack the GC to give it the features it needs to properly scan TempAlloc. Advice would be appreciated. Other than GC scanning, is there anything else you would like to see added to TempAlloc? Do you think it's general enough to be included in core.memory, or is it too niche?

3. This one is an order of magnitude less likely than the other two to actually get implemented, at least by me, but how about thread-local allocators so you can call malloc() without taking a lock? I vaguely remember Sean saying he was working on that a while back, but I never heard anything about it again. It's probably best to wait for shared to be implemented for this so that unshared objects can also be collected w/o stopping the world, but we should start at least discussing this now.

4. I submitted a patch a while back to allow the GC to ignore interior pointers for specific objects. (http://d.puremagic.com/issues/show_bug.cgi?id=2927) This would be useful if you have, for example, a large array that never escapes a class and the class always maintains a pointer to the head of the array as long as it's alive. This way, when the class dies, the array dies too even if there are false pointers to its interior. Few people have commented on this. Is there any reason why it's not a good idea? Yes, it's somewhat unsafe if you're not careful, but when the alternative is horrible memory leaks, sometimes unsafeness is a necessary evil.
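[For readers unfamiliar with the idea, the core of a TempAlloc-style allocator can be sketched in a few lines. This is a hypothetical C rendering of the concept only, not the dstats implementation: a fixed region serves LIFO allocations via a bump pointer, and anything too large quietly falls back on the heap instead of overflowing. Names, sizes, and the header-byte bookkeeping are all invented for the sketch, and alignment is ignored for brevity.]

```c
#include <stddef.h>
#include <stdlib.h>

/* Sketch of a TempAlloc-style region: LIFO allocations from a fixed
 * buffer, with heap fallback for large requests.  Illustrative only. */

enum { REGION_SIZE = 4096, LARGE_THRESHOLD = 1024 };

typedef struct {
    unsigned char buf[REGION_SIZE];
    size_t used;                  /* bump pointer into buf */
} TempRegion;

/* A one-byte header records whether the block came from the heap, so
 * temp_free knows what to do.  A real allocator would keep this out of
 * band; the header keeps the sketch short.  Alignment is ignored. */
void *temp_alloc(TempRegion *r, size_t n) {
    size_t total = n + 1;
    if (total > LARGE_THRESHOLD || r->used + total > REGION_SIZE) {
        unsigned char *p = malloc(total);   /* heap fallback */
        if (!p) return NULL;
        p[0] = 1;                           /* heap-allocated marker */
        return p + 1;
    }
    unsigned char *p = r->buf + r->used;
    r->used += total;
    p[0] = 0;                               /* region-allocated marker */
    return p + 1;
}

/* Region allocations must be freed in LIFO order. */
void temp_free(TempRegion *r, void *ptr, size_t n) {
    unsigned char *p = (unsigned char *)ptr - 1;
    if (p[0])
        free(p);
    else
        r->used -= n + 1;   /* pop the most recent allocation */
}
```

The GC-scanning problem described above is exactly that `buf` here is invisible to the collector unless it is registered as a root range per thread.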
Nov 02 2009
Please add weak references! I can email you a druntime compatible implementation.

dsimcha Wrote:
> During my occasional forays into the dark side of Python and Java, I am often amazed at the extent to which memory management in these languages "just works". [...]
Nov 03 2009
== Quote from Jason House (jason.james.house gmail.com)'s article
> Please add weak references! I can email you a druntime compatible implementation.

Probably be better to submit it as a patch to Bugzilla if it's reasonably well-tested and working already. On the other hand, I am curious to see how this was done. Did it involve hacking gcx.d or was there a more graceful way?
Nov 03 2009
dsimcha Wrote:
> == Quote from Jason House (jason.james.house gmail.com)'s article
>> Please add weak references! I can email you a druntime compatible implementation.
> Probably be better to submit it as a patch to Bugzilla if it's reasonably well-tested and working already. On the other hand, I am curious to see how this was done. Did it involve hacking gcx.d or was there a more graceful way?

I did hack gcx.d, but only to make the global mutex public. The implementation is relatively simple. My ticket submission failed when I tried to submit it, so I emailed Sean directly. That was months ago, and I never heard back. I'll resubmit a ticket... Theoretically, a weak ref might naturally tie into your precise scanning changes. Admittedly, it's the dereferencing that is tricky to get right without hacking gcx.d.
Nov 03 2009
I submitted a patch a while back for constant-time removeRange. Can we dredge that one up and/or implement something similar? It's rather useful for things like stackthreads that need to add and remove lots of ranges :) (Search the NG for "Feature req.+Patch: O(1) removeRange")
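[The details of downs' patch aren't shown in the thread, but the standard way to get a constant-time removeRange is a slot-map: keep the ranges in a dense array the collector scans, hand callers a stable handle, and on removal swap the last entry into the hole. A hypothetical C sketch of that trick (not the gcx.d code, and note that the real druntime API keys removal on the range's base pointer, which would additionally need a hash to stay O(1)):]

```c
#include <stddef.h>

/* O(1) add/remove of GC ranges via a slot-map.  Hypothetical layout,
 * not the actual gcx.d data structures. */

enum { MAX_RANGES = 64 };

typedef struct { void *lo, *hi; } Range;

typedef struct {
    Range  ranges[MAX_RANGES];      /* dense array the GC would scan */
    size_t handle_of[MAX_RANGES];   /* which handle owns ranges[i] */
    size_t slot[MAX_RANGES];        /* handle -> index into ranges */
    size_t free_handles[MAX_RANGES];
    size_t n, nfree;
} RangeSet;

void rs_init(RangeSet *s) {
    s->n = 0;
    s->nfree = MAX_RANGES;
    for (size_t i = 0; i < MAX_RANGES; i++)
        s->free_handles[i] = MAX_RANGES - 1 - i;
}

size_t rs_add(RangeSet *s, void *lo, void *hi) {   /* O(1) */
    size_t h = s->free_handles[--s->nfree];
    size_t i = s->n++;
    s->ranges[i].lo = lo;
    s->ranges[i].hi = hi;
    s->handle_of[i] = h;
    s->slot[h] = i;
    return h;
}

void rs_remove(RangeSet *s, size_t h) {            /* O(1) */
    size_t i = s->slot[h];
    size_t last = --s->n;
    s->ranges[i] = s->ranges[last];      /* swap last entry into the hole */
    s->handle_of[i] = s->handle_of[last];
    s->slot[s->handle_of[i]] = i;        /* fix the moved entry's handle */
    s->free_handles[s->nfree++] = h;
}
```

The swap means removal never shifts the array, which is what makes per-context-switch add/remove of fiber stack ranges plausible.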
Nov 03 2009
== Quote from downs (default_357-line yahoo.de)'s article
> I submitted a patch a while back for constant-time removeRange. Can we dredge that one up and/or implement something similar? It's rather useful for things like stackthreads that need to add and remove lots of ranges :)
> (Search the NG for "Feature req.+Patch: O(1) removeRange")

I guess removeRange is currently O(N)? I haven't looked. Anyhow, I think the bigger problem in practice, i.e. until N is unrealistically large, is that removeRange requires taking a lock. O(1) would be nice, though. What does StackThreads (I assume this is equivalent to fibers) do that it requires so many add/remove ranges?
Nov 03 2009
dsimcha wrote:
> What does StackThreads (I assume this is equivalent to fibers) do that it requires so many add/remove ranges?

The virtual stacks I use are being allocated via mmap. Every stack is a range. With every context switch, the relevant ranges need to be modified: the range that is being switched to needs to be removed (because it becomes the "main" stack, which has special GC handling), and the stack range that's being switched from needs to be added.
Nov 03 2009
== Quote from downs (default_357-line yahoo.de)'s article
> With every context switch, the relevant ranges need to be modified - the range that is being switched to needs to be removed (because it becomes the "main" stack which has special GC handling), and the stack range that's being switched from needs to be added.

IDK, that sounds a bit expensive even with O(1) removeRange. Are you using some home-grown impl. of StackThreads/fibers, Tango's, or D2 druntime's? If it's Tango's or druntime's, I haven't looked, but I would think the GC understands stuff about them. If it's your own, have you considered switching and maybe submitting whatever extra features your impl. has to Tango and/or druntime?
Nov 03 2009
dsimcha wrote:
> Are you using some home-grown impl. of StackThreads/fibers, Tango's, or D2 druntime's? [...] If it's your own, have you considered switching and maybe submitting whatever extra features your impl. has to Tango and/or druntime?

Well .... D1 Phobos :) I have some philosophical disagreements with the Tango project, much as I admit their technical superiority, and D2 is still too beta. But at least I don't use tools.stackthreads that much these days, so it's fine :)
Nov 04 2009
dsimcha, on November 3 at 05:46, you wrote:
> 1. In the Hans Boehm GC, they use a blacklisting scheme whereby they avoid allocating memory pages that currently have false pointers pointing into them. [...]

The Boehm GC has a lot to learn from. I'm planning to add concurrent collection in a slightly different way than the Boehm GC, which has at least a couple of ways of doing so. I know it might look like vaporware at this point, but I have to do it eventually; I need to finish that to finally get my degree ;). I was just a little busy lately.

> 3. This one is an order of magnitude less likely than the other two to actually get implemented, at least by me, but how about thread-local allocators so you can call malloc() without taking a lock? [...]

I think avoiding the lock might be possible using atomic ops. I think it should be easier than using thread-local storage, but I'm not entirely sure if it's really possible.

> 4. I submitted a patch a while back to allow the GC to ignore interior pointers for specific objects. (http://d.puremagic.com/issues/show_bug.cgi?id=2927) [...]

I think I commented on it, and I think it's a good idea to have it. It's not any more unsafe than having NO_SCAN or casting a pointer to a size_t.

-- 
Leandro Lucarella (AKA luca) http://llucax.com.ar/
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
"Everyone in this world is a 'grasa'; I make no distinction of sex or race. It's just that some enjoy it and others can't help it."
Nov 03 2009
Hello dsimcha,

> 3. This one is an order of magnitude less likely than the other two to actually get implemented, at least by me, but how about thread-local allocators so you can call malloc() without taking a lock? [...]

I think there are general malloc functions that are lock free. On the same note, I've been toying with an idea for a memory heap structure that would be totally lock free (think CAS) but has some downsides (like it won't handle a non-contiguous memory space, and expanding the memory could get very expensive). Both of these could be non-issues in some cases: short-lived job-specific heaps, memory-mapped (shared?) data(base) files, resource-limited systems, etc. I've been meaning to get around to implementing it but never had time. I'd be willing to collaborate or, if even that doesn't get me moving, just hand over the whole project in exchange for a "coauthor" line in the comments.
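[BCS doesn't spell out the design, but the simplest shape a "totally lock free (think CAS)" heap can take is a bump allocator over one contiguous region whose only shared state is an atomic offset advanced by compare-and-swap. This C11 sketch is one reading of the idea, not BCS's actual design; contiguity is exactly the limitation he names, since a single offset only makes sense within one unbroken address range, and growing the region is the hard part.]

```c
#include <stdatomic.h>
#include <stddef.h>

/* Lock-free bump allocation over a contiguous region: threads race to
 * claim [old, old+n) by CAS on a shared offset.  Illustrative sketch;
 * there is no free(), and expansion is unsupported by construction. */

enum { HEAP_SIZE = 1 << 16 };

static unsigned char heap[HEAP_SIZE];
static _Atomic size_t heap_top = 0;

void *cas_alloc(size_t n) {
    size_t old = atomic_load(&heap_top);
    for (;;) {
        if (old + n > HEAP_SIZE)
            return NULL;   /* out of space; growing the region is the hard part */
        /* On failure, 'old' is reloaded with the current top and we retry. */
        if (atomic_compare_exchange_weak(&heap_top, &old, old + n))
            return heap + old;   /* won the race for this slice */
    }
}
```

An uncontested call costs one CAS, which matches BCS's later point that an uncontested mutex also bottoms out in at least one CAS.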
Nov 04 2009
== Quote from BCS (none anon.com)'s article
> I've been toying with an idea for a memory heap structure that would be totally lock free (think CAS) but has some downsides (like it won't handle a non-contiguous memory space, and expanding the memory could get very expensive). [...]

But x86 atomic ops, while not as slow as locks, are pretty slow. Some benchmarks I did a while back indicated that, on Windows, atomic increment is only about twice as fast as an uncontested synchronized { var++; }. I think using thread-local storage and having each thread-local heap interact with the global heap only occasionally, for very large transactions, is a much better bet than using atomic CAS.

I'm also a little bit confused about what you're suggesting, specifically about the non-contiguous memory space. Does this mean that things would have to be freed LIFO? If so, what use cases would you be covering that TempAlloc doesn't already cover? If not, what does it mean?
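[The "thread-local heap with occasional large transactions against the global heap" scheme can be sketched concretely: each thread keeps a private free list of fixed-size blocks and only takes the global lock when the list runs dry, refilling a whole batch at once so the lock cost is amortized. Everything here (block size, batch size, the malloc-backed global pool) is invented for illustration; it is not Sean's or anyone's actual allocator.]

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Thread-local free list refilled in batches from a locked global pool.
 * Common-case alloc/free touch no lock; the lock is paid once per BATCH. */

enum { BLOCK_SIZE = 64, BATCH = 32 };

typedef struct Block { struct Block *next; } Block;

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* One locked "large transaction": hand out BATCH blocks at once. */
static Block *global_refill(void) {
    pthread_mutex_lock(&global_lock);
    Block *head = NULL;
    for (int i = 0; i < BATCH; i++) {
        Block *b = malloc(BLOCK_SIZE);
        b->next = head;
        head = b;
    }
    pthread_mutex_unlock(&global_lock);
    return head;
}

static _Thread_local Block *local_free = NULL;
static _Thread_local int local_count = 0;

void *tl_alloc(void) {                 /* lock-free in the common case */
    if (!local_free) {
        local_free = global_refill();  /* rare: one lock per BATCH allocs */
        local_count = BATCH;
    }
    Block *b = local_free;
    local_free = b->next;
    local_count--;
    return b;
}

/* Always lock-free: the block goes back on the local list.  A real
 * allocator would return surplus to the global heap when local_count
 * grows too large, and handle cross-thread frees. */
void tl_free(void *p) {
    Block *b = p;
    b->next = local_free;
    local_free = b;
    local_count++;
}
```

Note the hard part this sketch dodges is exactly the one raised later in the thread: freeing a block from a different thread than allocated it.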
Nov 04 2009
Hello dsimcha,

> But x86 atomic ops, while not as slow as locks, are pretty slow. Some benchmarks I did a while back indicated that, on Windows, atomic increment is only about twice as fast as an uncontested synchronized { var++; }.

IIRC a mutex/lock has to have a CAS at some point, so if you can get a function to work with a single CAS, at worst an uncontested action will be as good as an uncontested mutex/lock. Once contention starts happening, it depends on how much your retry costs. Another thing to take into account is the overhead vs. granularity tradeoff you get with locks, which can sometimes be skipped with direct use of CAS.

> I think using thread-local storage and having each thread-local heap interact with the global heap only occasionally, for very large transactions, is a much better bet than using atomic CAS.

I got started thinking about this when I came across a comment that some particular malloc didn't have issues with freeing memory from a different thread than it was malloced from.

> I'm also a little bit confused about what you're suggesting, specifically about the non-contiguous memory space. Does this mean that things would have to be freed LIFO? If so, what use cases would you be covering that TempAlloc doesn't already cover? If not, what does it mean?

IIRC, when the current GC runs out of space, it uses mmap (on Linux) to get more address space mapped to something. This wouldn't work for my idea because you can't count on mmap returning a block of RAM at any given location; it is free to leave gaping holes in the places you can allocate at, and I wouldn't be able to handle that.
Nov 04 2009
dsimcha Wrote:
> 3. This one is an order of magnitude less likely than the other two to actually get implemented, at least by me, but how about thread-local allocators so you can call malloc() without taking a lock? I vaguely remember Sean saying he was working on that a while back, but I never heard anything about it again. [...]

I actually started working on this a while back and then set it aside to take care of some other things. I'm planning to revisit it once the concurrency model is reasonably sorted out, since such a GC is pretty much useless until then anyway.
Nov 04 2009
== Quote from Sean Kelly (sean invisibleduck.org)'s article
> I actually started working on this a while back and then set it aside to take care of some other things. I'm planning to revisit it once the concurrency model is reasonably sorted out, since such a GC is pretty much useless until then anyway.

Does this mean that this has officially, for all practical purposes, been put off until D3?
Nov 04 2009
dsimcha wrote:
> Does this mean that this has officially, for all practical purposes, been put off until D3?

Performance improvements are fortunately not tied to the language spec.

Andrei
Nov 04 2009
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
> Performance improvements are fortunately not tied to the language spec.

But the concurrency model is.
Nov 04 2009
dsimcha wrote:
> But the concurrency model is.

Sorry, I misunderstood a bit. Anyway, the process is a continuum, so you shouldn't expect that the concurrency model will be available only on the day of D2's release. Also, there will be many releases of D2 following the "gold" release. It's not like once we deploy D2 we just start working on D3.

Andrei
Nov 04 2009