digitalmars.D - Concurrent GC (for Windows)
- Rainer Schuetze (9/9) Jun 03 2014 Hi,
- Benjamin Thaut (13/22) Jun 03 2014 This sounds quite promising.
- Rainer Schuetze (4/28) Jun 04 2014 Yes, write barriers are a good addition to this, but is way more
- Benjamin Thaut (5/8) Jun 04 2014 I assume they are hard to implement because the backend no longer has
- Jacob Carlborg (5/13) Jun 03 2014 How does other GC's handle these memory problems? I'm thinking of Java
- Paulo Pinto (2/19) Jun 04 2014 Note that there are native compilers for Java and C#.
- Jacob Carlborg (4/5) Jun 04 2014 My question still remains :)
- Rainer Schuetze (11/25) Jun 04 2014 All more sophisticated GCs have write or read barriers. That makes it
- Jacob Carlborg (4/6) Jun 05 2014 Right, now I start to remember.
- Martin Nowak (5/9) Jun 15 2014 That's pausing the thread which triggered the collection, right?
- Rainer Schuetze (6/15) Jun 15 2014 Yes, the world is not stopped during sweeping. The GC lock still blocks
- Dmitry Olshansky (26/35) Jun 11 2014 I'm not sure if how it would perform but what about the following idea
- Rainer Schuetze (26/62) Jun 11 2014 Cool stuff! I remember trying something similar, but IIRC forcing the
- Dmitry Olshansky (22/62) Jun 12 2014 Yes, exactly, but I forgot the recipe to convert COFF/OMF import librari...
- Rainer Schuetze (16/49) Jun 13 2014 There could also be the fallback to VirtualQuery if QueryWorkingSetEx
- Dmitry Olshansky (8/25) Jun 13 2014 I thought that's what I do by keeping original view mapped.
- Rainer Schuetze (12/28) Jun 14 2014 I added COW to the concurrent GC (
- Dmitry Olshansky (7/37) Jun 14 2014 No idea. It may also depend on OS version, given the fuzz about Win7
- Rainer Schuetze (5/12) Jun 15 2014 The benchmarks have been recently merged into druntime. Compile and run
- Martin Nowak (3/8) Jun 15 2014 Is there a chance that we can pause mutator threads when they trigger to...
Hi,

more GC talk: the last couple of days, I've been experimenting with implementing a concurrent GC on Windows inspired by Leandro's CDGC. Here's a report on my experiments:

http://rainers.github.io/visuald/druntime/concurrentgc.html

tl;dr: there is a working(?) partially concurrent GC here: https://github.com/rainers/druntime/tree/concurrent_gc2 but it eats a whole lot of memory.

Rainer
Jun 03 2014
On 03.06.2014 09:35, Rainer Schuetze wrote:
> Hi, more GC talk: the last couple of days, I've been experimenting
> with implementing a concurrent GC on Windows inspired by Leandro's
> CDGC. Here's a report on my experiments:
> http://rainers.github.io/visuald/druntime/concurrentgc.html
> tl;dr: there is a working(?) partially concurrent GC here:
> https://github.com/rainers/druntime/tree/concurrent_gc2
> but it eats a whole lot of memory.

This sounds quite promising. Did you think about using a write barrier instead of doing the check per page? The way you propose it, all writes, no matter whether relevant to the GC or not, will cause the GC to reconsider the page. If we implement a write barrier in the front end that only gets called when a write through a pointer to a pointer happens, the GC could rescan only those memory blocks that actually have changed pointers. The GC might also be able to remember the old pointer value before the write is executed, resulting in a consistent snapshot of the heap, which would eliminate the need to rescan after the background thread is finished.

Kind Regards
Benjamin Thaut
Jun 03 2014
On 04.06.2014 07:44, Benjamin Thaut wrote:
> Am 03.06.2014 09:35, schrieb Rainer Schuetze:
>> [...]
> This sounds quite promising. Did you think about using a write barrier
> instead of doing the check per page? [...] If we implement a write
> barrier in the front end that only gets called when a write through a
> pointer to a pointer happens, the GC could rescan only those memory
> blocks that actually have changed pointers. [...]

Yes, write barriers are a good addition to this, but they are way more difficult to implement and will meet more resistance from Walter. It's still something I'd like to look into.
Jun 04 2014
On 04.06.2014 22:26, Rainer Schuetze wrote:
> Yes, write barriers are a good addition to this, but they are way more
> difficult to implement and will meet more resistance from Walter. It's
> still something I'd like to look into.

I assume they are hard to implement because the backend no longer has type information when generating the writes? I remember that Walter stated a few weeks ago that he was considering implementing write barriers.
Jun 04 2014
On 03/06/14 09:35, Rainer Schuetze wrote:
> Hi, more GC talk: [...]
> tl;dr: there is a working(?) partially concurrent GC here:
> https://github.com/rainers/druntime/tree/concurrent_gc2
> but it eats a whole lot of memory.

How do other GCs handle these memory problems? I'm thinking of Java.

-- 
/Jacob Carlborg
Jun 03 2014
On Wednesday, 4 June 2014 at 06:51:04 UTC, Jacob Carlborg wrote:
> How do other GCs handle these memory problems? I'm thinking of Java.

Note that there are native compilers for Java and C#.
Jun 04 2014
On 2014-06-04 09:14, Paulo Pinto wrote:
> Note that there are native compilers for Java and C#.

My question still remains :)

-- 
/Jacob Carlborg
Jun 04 2014
On 04.06.2014 08:51, Jacob Carlborg wrote:
>> tl;dr: there is a working(?) partially concurrent GC here:
>> https://github.com/rainers/druntime/tree/concurrent_gc2
>> but it eats a whole lot of memory.
> How do other GCs handle these memory problems? I'm thinking of Java.

All more sophisticated GCs have write or read barriers. That makes it much easier to keep track of modifications during concurrent collection.

One reason for the huge memory footprint is that the application is allowed to continue to allocate during collection, but it will do this by getting more memory from the OS. It needs to be throttled somehow to avoid going too far with this.

Most of the remaining pause time is spent sweeping garbage. I'm thinking about deferring sweeping into allocations by running finalizers just before reusing the memory for another object. This can throttle allocations a bit while at the same time reducing pauses.
Jun 04 2014
On 2014-06-04 22:37, Rainer Schuetze wrote:All more sophisticated GCs have write or read barriers. That makes it much easier to keep track of modifications during concurrent collection.Right, now I start to remember. -- /Jacob Carlborg
Jun 05 2014
On 06/04/2014 10:37 PM, Rainer Schuetze wrote:
> Most of the remaining pause time is spent sweeping garbage. I'm
> thinking about deferring sweeping into allocations by running
> finalizers just before reusing the memory for another object. This can
> throttle allocations a bit while at the same time reducing pauses.

That's pausing the thread which triggered the collection, right? Other threads don't need to wait for sweeping. Dividing the cost of running finalizers among many allocations is a charming idea.
Jun 15 2014
On 15.06.2014 23:30, Martin Nowak wrote:
> That's pausing the thread which triggered the collection, right?
> Other threads don't need to wait for sweeping.

Yes, the world is not stopped during sweeping. The GC lock still blocks any thread that tries to allocate, though.

> Dividing the cost of running finalizers among many allocations is a
> charming idea.

Another nice property: if memory is recycled during malloc, you can run the finalizers without the GC lock held, just before returning, so there are fewer restrictions on what can happen in a destructor.
Jun 15 2014
On 03-Jun-2014 11:35, Rainer Schuetze wrote:
> [...]
> tl;dr: there is a working(?) partially concurrent GC here:
> https://github.com/rainers/druntime/tree/concurrent_gc2
> but it eats a whole lot of memory.
> Rainer

I'm not sure how it would perform, but what about the following idea for COW snapshotting:

1. Don't use a separate process and shared view; use a background thread.
2. Allocate heap space with MapViewOfFile, but private, not shared.

During collection, create a snapshot (during stop-the-world) as follows:

1. Make a new read-write view of the heap with MapViewOfFile; this is what the background thread will use to scan memory.
2. Save the base address of the original view, and unmap it.
3. MapViewOfFileEx with FILE_MAP_COPY and the old base address (that gives us back the same heap, but in CoW mode).

... move on with the application and let the background thread do the marking. Once marking is done, stop the world again and:

1. VirtualQuery (or QueryWorkingSetEx) over the application's CoW view of the heap. Find the ranges of pages that have flipped to PAGE_READWRITE; those are the ones that were modified and duplicated. If new allocations are served from it, this will also include newly allocated pages.
2. Copy these ranges of pages over to the normal read-write view.
3. Close the CoW mapping (thereby freeing the duplicates) and remap it again as a read-write view at the same address with MapViewOfFileEx.

I wasn't able to get QueryWorkingSetEx to behave, but I believe there must be a better way than VirtualQuery-ing every page range to death.

See the sketch of the idea here: https://gist.github.com/DmitryOlshansky/5e32057e047425480f0e

-- 
Dmitry Olshansky
Jun 11 2014
On 11.06.2014 18:59, Dmitry Olshansky wrote:
> [...]
> I'm not sure how it would perform, but what about the following idea
> for COW snapshotting:
> [...]
> See the sketch of the idea here:
> https://gist.github.com/DmitryOlshansky/5e32057e047425480f0e

Cool stuff! I remember trying something similar, but IIRC forcing the same address with MapViewOfFile somehow failed (maybe this was across processes). I tried your version on both Win32 and Win64 successfully, though.

I implemented the QueryWorkingSetEx version like this (you need a converted psapi.lib for Win32):

    enum PAGES = 512; // SIZE / 4096;
    PSAPI_WORKING_SET_EX_INFORMATION[PAGES] info;
    foreach(i, ref inf; info)
        inf.VirtualAddress = heap + i * 4096;
    if (!QueryWorkingSetEx(GetCurrentProcess(), info.ptr, info.sizeof))
        throw new Exception(format("Could not query info (%d).\n", GetLastError()));
    foreach(i, ref inf; info)
        writefln("flags page %d: %x", i, inf.VirtualAttributes);

and you can check the "shared" field to get copied pages. This function is not supported on XP, though.

A short benchmark shows that VirtualQuery needs 55/42 ms for your test on Win32/Win64 on my mobile i7, while QueryWorkingSetEx takes about 17 ms for both. If I add the actual copy into heap2 (i.e. every fourth page of 512 MB is copied), I get 80-90 ms more.

The numbers are not great, but I guess the usual memory usage and number of modified pages will be much lower. I'll see if I can integrate this into the concurrent implementation.
Jun 11 2014
On 12-Jun-2014 10:34, Rainer Schuetze wrote:
> [snip]
> Cool stuff! I remember trying something similar, but IIRC forcing the
> same address with MapViewOfFile somehow failed (maybe this was across
> processes). I tried your version on both Win32 and Win64 successfully,
> though.
>
> I implemented the QueryWorkingSetEx version like this (you need a
> converted psapi.lib for Win32):

Yes, exactly, but I forgot the recipe to convert COFF/OMF import libraries.

> This function is not supported on XP, though.

I wouldn't worry about it; it's not like XP users are growing in numbers. Also, it looks like only the 64-bit version is good to go, as on 32-bit it would reduce the usable memory by half.

> A short benchmark shows that VirtualQuery needs 55/42 ms for your test
> on Win32/Win64 on my mobile i7, while QueryWorkingSetEx takes about
> 17 ms for both.

Seems in line with my measurements. Strictly speaking, 1/2 of the pages, interleaved, should give an estimate of the worst case. Together with remapping (freeing the duplicated pages), it doesn't go beyond 250 ms on 640 MB of heap.

> If I add the actual copy into heap2 (i.e. every fourth page of 512 MB
> is copied), I get 80-90 ms more.

Aye... this is a lot. Also, for me it turns out that unmapping the CoW view in the last step takes most of the time. It might help to split the full heap into multiple views. Also, using VirtualProtect during the first step to turn a mapping into a CoW one is faster than unmap/map (by a factor of 2).

One thing that may help is saving a pointer to the end of the used heap at the moment of the scan, then remapping only this portion as CoW.

Last issue I see is the adjustment of pointers: in a GC, the mapped view is mapped at a new address, so it would need to fix them up during scanning.

> The numbers are not great, but I guess the usual memory usage and
> number of modified pages will be much lower. I'll see if I can
> integrate this into the concurrent implementation.

Wish you luck, I'm still not sure if it will help.

-- 
Dmitry Olshansky
Jun 12 2014
On 13.06.2014 02:38, Dmitry Olshansky wrote:
> Yes, exactly, but I forgot the recipe to convert COFF/OMF import
> libraries.

Grab coffimplib.exe.

> I wouldn't worry about it; it's not like XP users are growing in
> numbers.

There could also be a fallback to VirtualQuery if QueryWorkingSetEx doesn't exist.

> Aye... this is a lot. Also, for me it turns out that unmapping the CoW
> view in the last step takes most of the time.

Maybe the memory actually needs to be flushed to the file if no mapping exists. If that is the case, we could avoid it by creating another temporary mapping.

> It might help to split the full heap into multiple views.

The current GC uses pools of at most 32 MB, so that already exists.

> One thing that may help is saving a pointer to the end of the used
> heap at the moment of the scan, then remapping only this portion as
> CoW.

The pool architecture already does this if scanning just ignores new pools during the collection. Another optimization is to segregate the heap into memory with references and memory with plain data (NO_SCAN) at page/pool granularity. NO_SCAN pages won't need CoW. My hope is that this reduces the necessary duplicated memory considerably.

> Last issue I see is the adjustment of pointers: in a GC, the mapped
> view is mapped at a new address, so it would need to fix them up
> during scanning.

Agreed, that's a slight additional cost for scanning, but I don't think it will be too difficult to implement.
Jun 13 2014
On 13-Jun-2014 12:22, Rainer Schuetze wrote:
> Grab coffimplib.exe.

Cool, thanks.

> Maybe the memory actually needs to be flushed to the file if no
> mapping exists. If that is the case, we could avoid it by creating
> another temporary mapping.

I thought that's what I do by keeping the original view mapped. The time taken is roughly proportional to the number of pages modified in the CoW view; since these are private copies, I think it might take this long to recycle them. It's ~1 ms if no changes are made.

-- 
Dmitry Olshansky
Jun 13 2014
On 13.06.2014 12:08, Dmitry Olshansky wrote:
> I thought that's what I do by keeping the original view mapped. The
> time taken is roughly proportional to the number of pages modified in
> the CoW view; since these are private copies, I think it might take
> this long to recycle them. It's ~1 ms if no changes are made.

I added CoW to the concurrent GC ( https://github.com/rainers/druntime/tree/concurrent_gc2 ) and it seems more stable with respect to pause times than the version that reads write watches.

I also see considerable time spent unmapping the CoW view, about as much as copying the data.

One issue with calling OS functions while the world is stopped: if these functions require a lock on some synchronization object, and this object is held by another thread that is suspended, we create a deadlock. Do you know if MapViewOfFileEx and UnmapViewOfFile are lock-free? The same can happen with GetWriteWatch and ResetWriteWatch, though.
Jun 14 2014
On 14-Jun-2014 20:34, Rainer Schuetze wrote:
> I added CoW to the concurrent GC (
> https://github.com/rainers/druntime/tree/concurrent_gc2 ) and it seems
> more stable with respect to pause times than the version that reads
> write watches.

How do I run the benchmarks, and what are the newest results for you?

> One issue with calling OS functions while the world is stopped: [...]
> Do you know if MapViewOfFileEx and UnmapViewOfFile are lock-free?

No idea. It may also depend on the OS version; given the fuss about Win7 being the first with a lock-free thread scheduler, there may be giant locks in memory mapping in at least some versions.

> The same can happen with GetWriteWatch and ResetWriteWatch, though.

-- 
Dmitry Olshansky
Jun 14 2014
On 14.06.2014 20:17, Dmitry Olshansky wrote:
> How do I run the benchmarks, and what are the newest results for you?

The benchmarks have been recently merged into druntime. Compile and run runbench.d in the benchmark folder. I have updated the report and added new benchmarks at http://rainers.github.io/visuald/druntime/concurrentgc.html
Jun 15 2014
On 06/12/2014 08:34 AM, Rainer Schuetze wrote:
> If I add the actual copy into heap2 (i.e. every fourth page of 512 MB
> is copied), I get 80-90 ms more. The numbers are not great, but I
> guess the usual memory usage and number of modified pages will be much
> lower. I'll see if I can integrate this into the concurrent
> implementation.

Is there a chance that we can pause mutator threads when they trigger too many page copies?
Jun 15 2014