digitalmars.D - D GC theory
- Sativa (32/32) Feb 23 2015 How hard would it be to modify D's GC to do the following two
- Sativa (2/2) Feb 23 2015 Basically, I am simply wondering if the GC can "throttle" itself
- weaselcat (5/30) Feb 23 2015 Hi,
- Sativa (17/55) Feb 23 2015 That isn't the problem. The problem is that when it does collect,
- ketmar (12/20) Feb 23 2015 needs read/write barriers added to generated code. a major slowdown for=...
- Kagamin (4/13) Feb 24 2015 Only modifications of pointers, which introduce new cross-block
- Sativa (44/59) Feb 24 2015 But this type of thinking is the reason why the current GC is in
- ketmar (2/5) Feb 24 2015 and they can live in CPU register... ooops.=
- deadalnix (5/10) Feb 24 2015 There are limitation you have in a system language, that you
- deadalnix (4/19) Feb 24 2015 That is not quite true. You don't know before hand which one are
- "Marc Schütz" <schuetzm gmx.net> (8/17) Feb 24 2015 It's possible to (ab)use the MMU as a write barrier.
- "Ola Fosheim Grøstad" (6/7) Feb 24 2015 Uhm...
- "Marc Schütz" <schuetzm gmx.net> (5/12) Feb 24 2015 Yes, but if the program has a small working set, you pay that
- "Ola Fosheim Grøstad" (7/24) Feb 24 2015 But if you can control where the code is running when you run the
- "Marc Schütz" <schuetzm gmx.net> (8/12) Feb 24 2015 I'd say this is impractical. You could only reasonably expect
- deadalnix (4/11) Feb 24 2015 That's why this is used for immutable, or mostly immutable data,
- ketmar (5/8) Feb 24 2015 sure, it's possible sometimes. i'd like to see forking GC in druntime.
- "Casper Færgemand" <shorttail hotmail.com> (9/11) Feb 24 2015 You need a precise GC to do this, because you need to know
How hard would it be to modify D's GC to do the following two things:

1. Scan the heap in the BG on another thread/cpu for compactification.

2. Move blocks of memory in predictable (timewise) chunks that prevent locking the main thread for any period of time.

e.g., in the first step the GC finds some blocks of memory that need to be freed/compacted. In the 2nd step it starts freeing/compacting them in predictable pieces by limiting the time it takes while working.

The point is that maybe the GC is run more often, but in smaller and predictable steps. That is, the GC should be able to calculate how long it will take to free/compact a block. If it takes too long, then it simply does it in stages.

This way, there is essentially a very predictable and consistent cpu usage with the GC running, but never any major lag spikes that are going to throw real-time behavior out the window.

It would seem that such a "feature" would be easy to implement by modifying the existing GC code to be "incremental". I'd prefer a constant 1-5% cpu usage given to the GC if it didn't blow up for no reason. This way, it being very predictable, just means one has to get a slightly faster cpu to compensate, or optimize the code slightly.

It would be analogous to game programming:

1. We can have the GC steal, say, 1 fps to do its work...

2. Or we can keep the GC asleep doing nothing until it gets so much work it has to pause the entire engine for a 1/2 second, dropping the fps down by half momentarily. This might happen every minute or so... but it is still unacceptable for most gamers. (assuming 30-60 fps)

I'd prefer the first case.
Feb 23 2015
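The budgeted, resumable collection proposed above can be sketched as a toy mark phase over an explicit work list. This is not druntime's GC; `Node`, `markSlice`, and the time budget are all invented for illustration, and it deliberately ignores the mutator-interference problem (write barriers) raised later in the thread:

```d
import core.time : Duration, MonoTime, msecs;

class Node { Node[] children; bool marked; }

// Mark reachable nodes starting from the given work list, but stop once
// `budget` elapses; the caller gets back the remaining work list so the
// next slice can resume exactly where this one left off.
Node[] markSlice(Node[] workList, Duration budget)
{
    auto deadline = MonoTime.currTime + budget;
    while (workList.length && MonoTime.currTime < deadline)
    {
        auto n = workList[$ - 1];
        workList = workList[0 .. $ - 1];
        if (n.marked) continue;
        n.marked = true;
        workList ~= n.children;   // push children for a later iteration
    }
    return workList;
}

void main()
{
    auto leaf = new Node;
    auto root = new Node;
    root.children = [leaf];

    // Drive marking in bounded slices; between slices the mutator
    // would be allowed to run (omitted here).
    Node[] pending = [root];
    while (pending.length)
        pending = markSlice(pending, 1.msecs);

    assert(root.marked && leaf.marked);
}
```

The key property is that each call to `markSlice` does a bounded amount of work, trading one long pause for many short ones.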
Basically, I am simply wondering if the GC can "throttle" itself as to reduce the *maximum* time the program has to wait.
Feb 23 2015
On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
> How hard would it be to modify D's GC to do the following two things:
>
> 1. Scan the heap in the BG on another thread/cpu for compactification.
> 2. Move blocks of memory in predictable(timewise) chunks that prevents locking the main thread for any period of time.
> [...]

Hi,
D's GC actually is predictable. It will not collect unless you allocate. You can also disable it and manually collect. I use it this way and essentially use it as a smart freelist.
Feb 23 2015
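The disable-and-collect-manually pattern described above uses druntime's `core.memory.GC` API; a minimal sketch:

```d
import core.memory : GC;

void main()
{
    GC.disable();            // suppress implicit collections (the runtime
                             // may still collect if memory is exhausted)
    scope (exit) GC.enable();

    foreach (i; 0 .. 1_000)
        cast(void) new ubyte[](64);   // allocate freely, no GC pauses here

    GC.collect();            // pay the collection cost at a moment we choose
}
```

This gives control over *when* a collection happens, though not over how long an individual collection takes.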
On Monday, 23 February 2015 at 22:11:48 UTC, weaselcat wrote:
> Hi,
> D's GC actually is predictable. It will not collect unless you allocate. You can also disable it and manually collect. I use it this way and essentially use it as a smart freelist.

That isn't the problem. The problem is that when it does collect, the time it takes is unpredictable. It could take 1us or 10m.

If there is a cap on how long the GC can run at any particular time interval, then its time complexity is simple, predictable, and can be better used for RT applications.

Effectively I would rather the GC run every second and spend a maximum of 1ms doing cleaning up (not necessarily finishing) instead of running whenever and potentially taking seconds to clean up.

It's all about how to actually distribute the GC's running time. As it stands now, it can run anytime and take as long as it wants. I'd rather have it run "continuously" but not take as long as it wants. By allowing it to run continuously, in short bursts, one should get the same long-term behavior but not the spikes in cpu usage which prevent its usefulness in RT applications.
Feb 23 2015
On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:
> How hard would it be to modify D's GC to do the following two things:
>
> 1. Scan the heap in the BG on another thread/cpu for compactification.

needs read/write barriers added to generated code. a major slowdown for ALL memory access.

> 2. Move blocks of memory in predictable(timewise) chunks that prevents locking the main thread for any period of time.

as you can't do "partial scans" without barriers... see above.

> e.g., in the first step the GC finds some blocks of memory that need to be freed/compacted. In the 2nd step it starts freeing/compacting it in predictable pieces by limiting the time it takes while working.

it's not freeing that takes time, it's scanning that takes time. after scanning is complete there is not much left to do. ah, well, if you have thousands of small objects, this can be slow too, but with such a use case partial frees will lead to OOM situations very fast (as your code would have to allocate like crazy to get into such a situation).

tl;dr: you can't do this in a compiled language without introducing a major slowdown for indirect memory access. and indirect memory access is used *everywhere*. you will not like the results.
Feb 23 2015
On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
> needs read/write barriers added to generated code. a major slowdown for ALL memory access.

Only modifications of pointers, which introduce new cross-block dependencies (so that the GC knows to recheck the new dependency). Other memory access goes without slowdown.
Feb 24 2015
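The point above, that only pointer stores need instrumentation, is the classic remembered-set write barrier. A toy stand-alone sketch (the `storePointer` helper stands in for a compiler-inserted barrier; none of this is druntime code):

```d
// Obj.next is the only pointer field; Obj.value is plain data.
struct Obj { int value; Obj* next; }

size_t barrierHits;   // counts how often the barrier fired

// Stand-in for a compiler-inserted write barrier: record the store,
// then perform it. Only pointer stores are routed through here.
void storePointer(ref Obj* slot, Obj* val)
{
    ++barrierHits;   // a real barrier would log the slot's address into
                     // a remembered set for a background scan to recheck
    slot = val;
}

void main()
{
    auto a = new Obj;
    auto b = new Obj;

    storePointer(a.next, b);   // pointer write: goes through the barrier
    a.value = 42;              // non-pointer write: full speed, no barrier

    assert(barrierHits == 1 && a.next is b);
}
```

Data writes like `a.value = 42` pay nothing; only the (much rarer) pointer writes carry the bookkeeping cost.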
On Tuesday, 24 February 2015 at 08:39:02 UTC, Kagamin wrote:
> Only modifications of pointers, which introduce new cross-block dependencies (so that GC knows to recheck the new dependency). Other memory access goes without slowdown.

But this type of thinking is the reason why the current GC is in the state it is.

The compiler knows which pointers are "free" and which ones are "bound". Bound pointers are pointers that are not assigned freely by the user. e.g., a pointer to an array whose address is never arbitrarily set by the user is bound. The compiler knows where and how the pointer is assigned. Most pointers are this way.

Bound pointers are pointers the GC can easily clean up because it knows when and how they are used. In this way, if all pointers of a program were bound, the GC could work in the background and never pause the program to clean up. (essentially the compiler would need to insert special code)

Free pointers are more difficult as they can, say, be randomly initiated and point to anywhere on the heap, and have to be looked at in a locked way. (to prevent them changing in the middle of some GC operation)

But if one distinguishes bound and free pointers (easily done with a bit in the pointer) and has the compiler keep track of when free pointers are used (by having a dirty bit when they are written to), then one can more easily scan the heap in the background.

In fact, one could potentially get away from all synching issues by doing the following: Whenever free pointers are used, a simple spin lock is used. The spin lock checks a flag in the free pointers table that signals that a pointer is being changed by the code. When this is true, the free pointers table is in a state of flux and can't be relied on.

In the mean time, the GC can build up information about the heap for the bound pointers. It can figure out what needs to be changed, set up buffering (which can be done using bits in the pointer), etc., all in the background, because the bound pointers are "stable" and change deterministically.

When the free pointers table's dirty flag is unset, it means that the free pointers are not changing in the program and the GC can lock the table using another flag. When the flag is set, the spin lock kicks in and pauses the program while the GC is working on the free pointers table. (or, to be more efficient, the program can yield to some other background task code)

By having multiple tables of free pointers one can reduce the overhead. The GC looks at one piece at a time and locks only a fraction of the code at any point in time. The compiler can distribute the locks vs pages in an optimized way through profiling.
Feb 24 2015
On Tue, 24 Feb 2015 15:22:01 +0000, Sativa wrote:
> Free pointers are more difficult as they can, say, be randomly initiated and point to anywhere on the heap and have to be looked at in a locked way. (to prevent them changing in the middle of some GC operation)

and they can live in a CPU register... ooops.
Feb 24 2015
On Tuesday, 24 February 2015 at 15:22:02 UTC, Sativa wrote:
>> Only modifications of pointers, which introduce new cross-block dependencies (so that GC knows to recheck the new dependency). Other memory access goes without slowdown.
>
> But this type of thinking is the reason why the current GC is in the state it is.

There are limitations you have in a system language that you don't in a VM language. That being said, D has a type system that allows for various optimizations when it comes to the GC, so I don't think it is hopeless, but we need to plug the remaining holes.
Feb 24 2015
On Tuesday, 24 February 2015 at 08:39:02 UTC, Kagamin wrote:
> Only modifications of pointers, which introduce new cross-block dependencies (so that GC knows to recheck the new dependency). Other memory access goes without slowdown.

That is not quite true. You don't know beforehand which ones these are, so you always need to, at least, check if you need to do the work.
Feb 24 2015
On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
> needs read/write barriers added to generated code. a major slowdown for ALL memory access.

It's possible to (ab)use the MMU as a write barrier. Sociomantic's GC implementation does this by forking and running the scan in the child, then communicating information back about what can be freed.

Video: http://dconf.org/talks/lucarella.html

IIRC, at the end the speaker was asked about the performance overhead of the COW that is involved, but he couldn't give any numbers.
Feb 24 2015
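The fork-based approach described above leans on the kernel's copy-on-write pages instead of compiled-in barriers. A minimal POSIX-only sketch of the idea (not Sociomantic's implementation; a real collector would mark/scan in the child and report back what is garbage):

```d
import core.sys.posix.sys.wait : waitpid;
import core.sys.posix.unistd : fork, _exit;

void main()
{
    // Heap data the mutator will overwrite after the fork.
    auto data = new int[1024];
    data[] = 1;

    auto pid = fork();
    if (pid == 0)
    {
        // Child ("the collector"): works on a copy-on-write snapshot of
        // the heap exactly as it was at fork() time.
        long sum = 0;
        foreach (x; data) sum += x;
        _exit(sum == data.length ? 0 : 1);  // exit 0 iff snapshot intact
    }

    // Parent ("the mutator"): keeps mutating freely; the kernel's COW
    // machinery acts as the write barrier, so the child is unaffected.
    data[] = 99;

    int status;
    waitpid(pid, &status, 0);
    assert(status == 0);   // child saw only the pre-fork values
}
```

The trade-off, as noted downthread, is page-granularity: the first write to each dirtied page pays a fault plus a page copy, rather than a few cycles per pointer store.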
On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz wrote:
> It's possible to (ab)use the MMU as a write barrier.

Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... The cost of a page miss is 1000 cycles + overhead for copying. And that's assuming that you have enough memory and that the TLB isn't affected...
Feb 24 2015
On Tuesday, 24 February 2015 at 15:14:03 UTC, Ola Fosheim Grøstad wrote:
> Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... The cost of a page miss is 1000 cycles + overhead for copying. And that's assuming that you have enough memory and that the TLB isn't affected...

Yes, but if the program has a small working set, you pay that price only a few times (once per page), and in any case, only during a running background scan, rather than always.
Feb 24 2015
On Tuesday, 24 February 2015 at 16:05:06 UTC, Marc Schütz wrote:
>> Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... The cost of a page miss is 1000 cycles + overhead for copying. And that's assuming that you have enough memory and that the TLB isn't affected...
>
> Yes, but if the program has a small working set, you pay that price only a few times (once per page), and in any case, only during a running background scan, rather than always.

(Typo, MFENCE is 33...)

But if you can control where the code is running when you run the GC scan, then you might as well have a separate code path for concurrent GC too (i.e. two versions of the code). One with fences and one without...

If you lose the TLB, then you also lose all caches too IIRC.
Feb 24 2015
On Tuesday, 24 February 2015 at 17:07:03 UTC, Ola Fosheim Grøstad wrote:
> But if you can control where the code is running when you run the GC scan then you might as well have a separate code path for concurrent GC too (i.e. two versions of the code). One with fences and one without...

I'd say this is impractical. You could only reasonably expect this to work at certain checkpoints, say, an event loop with short calls into the "actual" application code. You cannot simply switch right in the middle of a deep call. And with Sociomantic's GC, you don't have any more "control" over the GC than you have with the current one in druntime.
Feb 24 2015
On Tuesday, 24 February 2015 at 15:14:03 UTC, Ola Fosheim Grøstad wrote:
> Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... The cost of a page miss is 1000 cycles + overhead for copying. And that's assuming that you have enough memory and that the TLB isn't affected...

That's why this is used for immutable, or mostly immutable data, but generally avoided in the case of mutable data.
Feb 24 2015
On Tue, 24 Feb 2015 14:45:16 +0000, Marc Schütz wrote:
> It's possible to (ab)use the MMU as a write barrier. Sociomantic's GC implementation does this by forking and running the scan in the child, then communicating information back about what can be freed. Video:

sure, it's possible sometimes. i'd like to see forking GC in druntime. but all in all it's just a clever hack. concurrent GC in the same process can't be efficiently done with MMU only: MMU protection has too big granularity and page faults are too slow.
Feb 24 2015
On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
> 1. Scan the heap in the BG on another thread/cpu for compactification.

You need a precise GC to do this, because you need to know what is a pointer. Otherwise, if you have two variables, say a size_t and a pointer, that contain the same raw value, on compacting you could overwrite the size_t despite it just being a number and not a pointer.

Also you need to ban certain pointer shenanigans (like xor fun stuff), though the current GC doesn't work when those are in use anyway, so I guess it's not too bad.
Feb 24 2015
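The size_t-vs-pointer ambiguity in that last post can be shown directly. A conservative scanner sees only raw words, so the integer below is indistinguishable from the pointer it happens to equal (stand-alone illustration):

```d
void main()
{
    auto p = new int;
    *p = 42;

    // To the program this is just an integer, but its bit pattern is
    // identical to the pointer's.
    size_t lookalike = cast(size_t) p;
    assert(cast(int*) lookalike is p);

    // A conservative scanner must treat `lookalike` as a possible
    // reference, so it cannot move (compact) the block *p lives in:
    // it has no way to update `lookalike` if it did.
    // A precise GC has type information for every slot, knows the
    // slot is not a pointer, and is free to move the block.
}
```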