
digitalmars.D - D GC theory

reply "Sativa" <Sativa Indica.org> writes:
How hard would it be to modify D's GC to do the following two 
things:

1. Scan the heap in the BG on another thread/cpu for 
compactification.

2. Move blocks of memory in predictable (time-wise) chunks that 
prevent locking the main thread for any period of time.

e.g., in the first step the GC finds some blocks of memory that 
need to be freed/compacted. In the 2nd step it starts 
freeing/compacting them in predictable pieces by limiting the time 
it takes while working.

The point is, that maybe the GC is run more often but in smaller 
and predictable steps.

That is, the GC should be able to calculate how long it will take 
to free/compact a block. If it takes too long then it simply does 
it in stages.

This way, there is essentially very predictable and consistent 
cpu usage with the GC running, but never any major lag spikes that 
would throw real-time behavior out the window.

It would seem that such a "feature" would be easy to implement by 
modifying the existing GC code to be "incremental".

I'd prefer a constant 1-5% cpu usage given to the GC if it didn't 
blow up for no reason. This way, it being very predictable, just 
means one has to get a slightly faster cpu to compensate, or 
optimize the code slightly.

It would be analogous to game programming.

1. We can have the GC steal, say, 1 fps to do its work...

2. Or we can keep the GC asleep, doing nothing, until it gets so 
much work that it has to pause the entire engine for half a second, 
momentarily dropping the fps by half. This might happen 
every minute or so... but it is still unacceptable for most gamers. 
(assuming 30-60 fps)


I'd prefer the first case.
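In game-loop terms, the first case might look something like the sketch below. The incremental entry point `collectStep` is hypothetical; druntime's GC exposes no such call today, so it stands in for whatever a time-budgeted collector would offer.

```d
import core.time : Duration, msecs;

// hypothetical: perform at most `budget` worth of collection work and
// return; repeated calls would eventually complete a full GC cycle
void collectStep(Duration budget)
{
    // not implemented in druntime; placeholder for the sketch
}

// run `frames` frames, giving the GC a bounded slice in each one;
// returns the number of GC slices performed
int runFrames(int frames)
{
    enum gcSlice = 1.msecs;      // roughly "1 fps" worth of time
    int slices;
    foreach (i; 0 .. frames)
    {
        // update(); render();   // the real per-frame work would go here
        collectStep(gcSlice);    // bounded GC work, every frame
        ++slices;
    }
    return slices;
}

void main()
{
    runFrames(60);
}
```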
Feb 23 2015
next sibling parent "Sativa" <Sativa Indica.org> writes:
Basically, I am simply wondering if the GC can "throttle" itself 
so as to reduce the *maximum* time the program has to wait.
Feb 23 2015
prev sibling next sibling parent reply "weaselcat" <weaselcat gmail.com> writes:
On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
 How hard would it be to modify D's GC to do the following two 
 things:

 1. Scan the heap in the BG on another thread/cpu for 
 compactification.

 2. Move blocks of memory in predictable(timewise) chunks that 
 prevents locking the main thread for any period of time.

 e.g., in the first step the GC finds some blocks of memory that 
 need to be freed/compacted. In the 2nd step it starts 
 freeing/compacting it in predictable pieces by limiting the 
 time it takes while working.

 The point is, that maybe the GC is ran more often but in 
 smaller and predictable steps.

 That is, the GC should be able to calculate how long it will 
 take to free/compact a block. If it takes too long then it 
 simply does it in stages.

 This way, there is essentially a very predictable and 
 consistent cpu usage with the GC running but never any major 
 lag spikes that are going to throw real time behavior out the 
 window.

 It would seem that such a "Feature" would be easy to implement 
 by modifying the existing GC code to be "incremental".

 I'd prefer a constant 1-5% cpu usage given to the GC if it 
 didn't blow up for no reason. This way, it being very 
 predictable, just mean one has to get a slightly faster cpu to 
 compensate or optimize the code slightly.
Hi, D's GC actually is predictable. It will not collect unless you allocate. You can also disable it and manually collect. I use it this way and essentially use it as a smart freelist.
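For reference, the pattern described here uses the real druntime API in `core.memory`; the every-tenth-frame collection point below is just an example of choosing your own safe points.

```d
import core.memory : GC;

void doFrame()
{
    // allocations here still go through the GC, but with collections
    // disabled they cannot trigger a pause mid-frame
    auto scratch = new ubyte[](4096);
}

// returns the number of explicit collections performed
int runWithManualGC(int frames)
{
    int collections;
    GC.disable();                // no implicit collections from here on
    foreach (i; 0 .. frames)
    {
        doFrame();
        if (i % 10 == 9)         // collect only at points we choose
        {
            GC.collect();
            ++collections;
        }
    }
    GC.enable();
    return collections;
}

void main()
{
    runWithManualGC(100);
}
```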
Feb 23 2015
parent "Sativa" <Sativa Indica.org> writes:
On Monday, 23 February 2015 at 22:11:48 UTC, weaselcat wrote:
 On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
 How hard would it be to modify D's GC to do the following two 
 things:

 1. Scan the heap in the BG on another thread/cpu for 
 compactification.

 2. Move blocks of memory in predictable(timewise) chunks that 
 prevents locking the main thread for any period of time.

 e.g., in the first step the GC finds some blocks of memory 
 that need to be freed/compacted. In the 2nd step it starts 
 freeing/compacting it in predictable pieces by limiting the 
 time it takes while working.

 The point is, that maybe the GC is ran more often but in 
 smaller and predictable steps.

 That is, the GC should be able to calculate how long it will 
 take to free/compact a block. If it takes too long then it 
 simply does it in stages.

 This way, there is essentially a very predictable and 
 consistent cpu usage with the GC running but never any major 
 lag spikes that are going to throw real time behavior out the 
 window.

 It would seem that such a "Feature" would be easy to implement 
 by modifying the existing GC code to be "incremental".

 I'd prefer a constant 1-5% cpu usage given to the GC if it 
 didn't blow up for no reason. This way, it being very 
 predictable, just mean one has to get a slightly faster cpu to 
 compensate or optimize the code slightly.
Hi, D's GC actually is predictable. It will not collect unless you allocate. You can also disable it and manually collect. I use it this way and essentially use it as a smart freelist.
That isn't the problem. The problem is that when it does collect, the time it takes is unpredictable. It could take 1us or 10m.

If there is a cap on how long the GC can run in any particular time interval, then its time complexity is simple, predictable, and can be better used for RT applications. Effectively I would rather the GC run every second and spend a maximum of 1ms cleaning up (not necessarily finishing) instead of running whenever and potentially taking seconds to clean up.

It's all about how to distribute the GC's running time. As it stands now, it can run at any time and take as long as it wants. I'd rather have it running "continuously" but not take as long as it wants. By allowing it to run continuously, in short bursts, one should get the same long-term behavior but not have the spikes in cpu usage which prevent its usefulness in RT applications.
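The unpredictability being described is easy to observe with the real druntime API: time each explicit collection and the pauses vary with how much work the heap has accumulated. The allocation sizes below are arbitrary, chosen only to give the collector something to do.

```d
import core.memory : GC;
import core.time : Duration, MonoTime;
import std.stdio : writeln;

// allocate some garbage, then measure how long one collection takes
Duration measurePause()
{
    auto garbage = new ubyte[][](1000);
    foreach (ref g; garbage)
        g = new ubyte[](1024);

    immutable t0 = MonoTime.currTime;
    GC.collect();
    return MonoTime.currTime - t0;
}

void main()
{
    foreach (i; 0 .. 5)
        writeln("pause ", i, ": ", measurePause());
}
```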
Feb 23 2015
prev sibling next sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

 How hard would it be to modify D's GC to do the following two things:

 1. Scan the heap in the BG on another thread/cpu for compactification.
needs read/write barriers added to generated code. a major slowdown for ALL memory access.
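To make the cost concrete: a sketch of the card-marking write barrier a compiler would have to emit around every pointer store before partial scans become possible. The card size, table layout, and `storePointer` helper are all invented for illustration; they are not how any real D GC works today.

```d
// hypothetical card table covering the GC heap: one byte per 512-byte card
enum cardShift = 9;
__gshared ubyte[] cardTable;
__gshared void* heapBase;

// what every `*slot = value` on a pointer field would compile to:
// the store itself, plus bookkeeping on every single pointer write
void storePointer(void** slot, void* value)
{
    *slot = value;
    // mark the card holding `slot` dirty so a concurrent/partial
    // collector knows this region must be rescanned
    immutable card = (cast(size_t)slot - cast(size_t)heapBase) >> cardShift;
    cardTable[card] = 1;
}

void main()
{
    static void*[64] fakeHeap;       // stand-in for a GC heap region
    heapBase  = fakeHeap.ptr;
    cardTable = new ubyte[](1024);

    int dummy;
    storePointer(&fakeHeap[3], &dummy);
    // the card containing slot 3 is now marked dirty
    assert(cardTable[0] == 1);
}
```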
 2. Move blocks of memory in predictable(timewise) chunks that prevents
 locking the main thread for any period of time.
as you can't do "partial scans" without barriers... see above.
 e.g., in the first step the GC finds some blocks of memory that need to
 be freed/compacted. In the 2nd step it starts freeing/compacting it in
 predictable pieces by limiting the time it takes while working.
it's not freeing that takes time, it's scanning that takes time. after scanning is complete there is not much left to do. ah, well, if you have thousands of small objects, this can be slow too, but with such a use case partial frees will lead to OOM situations very fast (as your code has to allocate like crazy to get into such a situation).

tl;dr: you can't do this in a compiled language without introducing a major slowdown for indirect memory access. and indirect memory access is used *everywhere*. you will not like the results.
Feb 23 2015
next sibling parent reply "Kagamin" <spam here.lot> writes:
On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
 On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

 How hard would it be to modify D's GC to do the following two 
 things:
 
 1. Scan the heap in the BG on another thread/cpu for 
 compactification.
needs read/write barriers added to generated code. a major slowdown for ALL memory access.
Only modifications of pointers that introduce new cross-block dependencies (so that the GC knows to recheck the new dependency). Other memory accesses go without slowdown.
Feb 24 2015
next sibling parent reply "Sativa" <Sativa Indica.org> writes:
On Tuesday, 24 February 2015 at 08:39:02 UTC, Kagamin wrote:
 On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
 On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

 How hard would it be to modify D's GC to do the following two 
 things:
 
 1. Scan the heap in the BG on another thread/cpu for 
 compactification.
needs read/write barriers added to generated code. a major slowdown for ALL memory access.
Only modifications of pointers, which introduce new cross-block dependencies (so that GC knows to recheck the new dependency). Other memory access goes without slowdown.
But this type of thinking is the reason why the current GC is in the state it is.

The compiler knows which pointers are "free" and which ones are "bound". Bound pointers are pointers that are not assigned freely by the user; e.g., a pointer to an array whose address is never arbitrarily set by the user is bound. The compiler knows where and how the pointer is assigned, and most pointers are this way. Bound pointers are pointers the GC can easily clean up because it knows when and how they are used. If all pointers of a program were bound, the GC could work in the background and never pause the program to clean up (essentially, the compiler would need to insert special code).

Free pointers are more difficult, as they can, say, be randomly initiated and point to anywhere on the heap, and have to be accessed in a locked way (to prevent them changing in the middle of some GC operation). But if one distinguishes bound and free pointers (easily done with a bit in the pointer) and has the compiler keep track of when free pointers are used (by setting a dirty bit when they are written to), then one can more easily scan the heap in the background.

In fact, one could potentially get away from all synching issues by doing the following: whenever free pointers are used, a simple spin lock is taken. The spin lock checks a flag in the free-pointers table that signals that a pointer is being changed by the code. When this is true, the free-pointers table is in a state of flux and can't be relied on.

In the meantime, the GC can build up information about the heap for the bound pointers. It can figure out what needs to be changed, set up buffering (which can be done using bits in the pointer), etc., all in the background, because the bound pointers are "stable" and change deterministically. When the free-pointers table's dirty flag is unset, it means the free pointers are not changing in the program and the GC can lock the table using another flag.

When the flag is set, the spin lock kicks in and pauses the program while the GC is working on the free-pointers table (or, to be more efficient, the program can yield to some other background task code). By having multiple tables of free pointers one can reduce the overhead: the GC looks at one piece at a time and locks only a fraction of the code at any point in time. The compiler can distribute the locks vs. pages in an optimized way through profiling.
Feb 24 2015
next sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
On Tue, 24 Feb 2015 15:22:01 +0000, Sativa wrote:

 Free pointers are more difficult as they can, say, be randomly initiated
 and point to anywhere on the heap and have to be looked in a locked way.
 (to prevent them changing in the middle of some GC operation)
and they can live in a CPU register... ooops.
Feb 24 2015
prev sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 24 February 2015 at 15:22:02 UTC, Sativa wrote:
 Only modifications of pointers, which introduce new 
 cross-block dependencies (so that GC knows to recheck the new 
 dependency). Other memory access goes without slowdown.
But this type of thinking is the reason why the current GC is in the state it is.
There are limitations you have in a systems language that you don't in a VM language. That being said, D has a type system that allows for various optimizations when it comes to the GC, so I don't think it is hopeless, but we need to plug the remaining holes.
Feb 24 2015
prev sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 24 February 2015 at 08:39:02 UTC, Kagamin wrote:
 On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
 On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

 How hard would it be to modify D's GC to do the following two 
 things:
 
 1. Scan the heap in the BG on another thread/cpu for 
 compactification.
needs read/write barriers added to generated code. a major slowdown for ALL memory access.
Only modifications of pointers, which introduce new cross-block dependencies (so that GC knows to recheck the new dependency). Other memory access goes without slowdown.
That is not quite true. You don't know beforehand which ones these are, so you always need to at least check whether you need to do the work.
Feb 24 2015
prev sibling parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
 On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

 How hard would it be to modify D's GC to do the following two 
 things:
 
 1. Scan the heap in the BG on another thread/cpu for 
 compactification.
needs read/write barriers added to generated code. a major slowdown for ALL memory access.
It's possible to (ab)use the MMU as a write barrier. Sociomantic's GC implementation does this by forking and running the scan in the child, then communicating information back about what can be freed. Video: http://dconf.org/talks/lucarella.html

IIRC, at the end the speaker was asked about the performance overhead of the COW that is involved, but he couldn't give any numbers.
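The fork trick, in rough outline (POSIX only). The marking and reporting steps are stubbed out here; Sociomantic's actual implementation is considerably more involved and, notably, does not block the parent waiting for the child as this sketch does.

```d
import core.sys.posix.unistd : fork, _exit;
import core.sys.posix.sys.wait : waitpid;

// returns the child's wait status (0 when the stubbed scan completed)
int backgroundCollect()
{
    const pid = fork();
    if (pid == 0)
    {
        // child: the MMU gives it a copy-on-write snapshot of the heap,
        // so it can mark in parallel while the parent keeps mutating
        // its own pages
        // markHeap(); reportUnreachableBlocksToParent();  // stubs
        _exit(0);
    }
    if (pid < 0)
        return -1;                // fork failed

    int status;
    waitpid(pid, &status, 0);     // a real GC would not block here
    // parent: sweep the blocks the child reported as unreachable
    return status;
}

void main()
{
    backgroundCollect();
}
```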
Feb 24 2015
next sibling parent reply "Ola Fosheim Grøstad" writes:
On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz wrote:
 It's possible to (ab)use the MMU as a write barrier.
Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... The cost of a page miss is 1000 cycles + overhead for copying. And that's assuming that you have enough memory and that the TLB isn't affected...
Feb 24 2015
next sibling parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Tuesday, 24 February 2015 at 15:14:03 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz wrote:
 It's possible to (ab)use the MMU as a write barrier.
Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... The cost of a page miss is 1000 cycles + overhead for copying. And that's assuming that you have enough memory and that the TLB isn't affected...
Yes, but if the program has a small working set, you pay that price only a few times (once per page), and in any case, only during a running background scan, rather than always.
Feb 24 2015
parent reply "Ola Fosheim Grøstad" writes:
On Tuesday, 24 February 2015 at 16:05:06 UTC, Marc Schütz wrote:
 On Tuesday, 24 February 2015 at 15:14:03 UTC, Ola Fosheim 
 Grøstad wrote:
 On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz 
 wrote:
 It's possible to (ab)use the MMU as a write barrier.
Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33
(Typo, MFENCE is 33...)
 cycles... The cost of a page miss is 1000 cycles + overhead 
 for copying.

 And that's assuming that you have enough memory and that the 
 TLB isn't affected...
Yes, but if the program has a small working set, you pay that price only a few times (once per page), and in any case, only during a running background scan, rather than always.
But if you can control where the code is running when you run the GC scan, then you might as well have a separate code path for concurrent GC too (i.e. two versions of the code): one with fences and one without... If you lose the TLB, then you also lose all caches too, IIRC.
Feb 24 2015
parent "Marc Schütz" <schuetzm gmx.net> writes:
On Tuesday, 24 February 2015 at 17:07:03 UTC, Ola Fosheim Grøstad 
wrote:
 But if you can control where the code is running when you run 
 the GC scan then you might as well have a separate code path 
 for concurrent GC too (i.e. two versions of the code). One with 
 fences and one without...
I'd say this is impractical. You could only reasonably expect this to work at certain checkpoints, say, an event loop with short calls into the "actual" application code. You cannot simply switch right in the middle of a deep call. And with Sociomantic's GC, you don't have any more "control" over the GC than you have with the current one in druntime.
Feb 24 2015
prev sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 24 February 2015 at 15:14:03 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz wrote:
 It's possible to (ab)use the MMU as a write barrier.
Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... The cost of a page miss is 1000 cycles + overhead for copying. And that's assuming that you have enough memory and that the TLB isn't affected...
That's why this is used for immutable, or mostly immutable data, but generally avoided in the case of mutable data.
Feb 24 2015
prev sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
On Tue, 24 Feb 2015 14:45:16 +0000, Marc Schütz wrote:

 It's possible to (ab)use the MMU as a write barrier. Sociomantic's GC
 implementation does this by forking and running the scan in the child,
 then communicating information back about what can be freed. Video:
sure, it's possible sometimes. i'd like to see forking GC in druntime. but all in all it's just a clever hack. concurrent GC in the same process can't be efficiently done with MMU only: MMU protection has too big granularity and page faults are too slow.
Feb 24 2015
prev sibling parent "Casper Færgemand" <shorttail hotmail.com> writes:
On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
 1. Scan the heap in the BG on another thread/cpu for 
 compactification.
You need a precise GC to do this, because you need to know what is a pointer. Otherwise, if you have two variables, say a size_t and a pointer, that contain the same raw value, then on compacting you could overwrite the size_t despite it just being a number and not a pointer. Also, you need to ban certain pointer shenanigans (like xor fun stuff), though the current GC doesn't work when those are in use anyway, so I guess it's not too bad.
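The ambiguity can be shown directly: to a collector without precise type information, the two variables below are indistinguishable words of memory.

```d
void main()
{
    auto block = new int[](4);
    int* p = block.ptr;               // a real pointer into the heap
    size_t n = cast(size_t)block.ptr; // just a number that happens to
                                      // hold the same bit pattern

    // A compacting GC that treated every pointer-looking word as a
    // pointer would "fix up" n when it moved the block, silently
    // corrupting an integer. Precise type info is what lets it
    // update p but leave n alone.
    assert(n == cast(size_t)p);
}
```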
Feb 24 2015