
digitalmars.D - D GC theory

reply "Sativa" <Sativa Indica.org> writes:
How hard would it be to modify D's GC to do the following two 
things:

1. Scan the heap in the BG on another thread/cpu for 
compactification.

2. Move blocks of memory in predictable (time-wise) chunks that 
prevent locking the main thread for any period of time.

e.g., in the first step the GC finds some blocks of memory that 
need to be freed/compacted. In the 2nd step it starts 
freeing/compacting them in predictable pieces by limiting the time 
it takes while working.

The point is, that maybe the GC is run more often but in smaller 
and predictable steps.

That is, the GC should be able to calculate how long it will take 
to free/compact a block. If it takes too long then it simply does 
it in stages.

This way, there is essentially very predictable and consistent 
cpu usage with the GC running, but never any major lag spikes that 
would throw real-time behavior out the window.

It would seem that such a "feature" would be easy to implement by 
modifying the existing GC code to be "incremental".

I'd prefer a constant 1-5% cpu usage given to the GC if it didn't 
blow up for no reason. This way, it being very predictable, just 
means one has to get a slightly faster cpu to compensate, or 
optimize the code slightly.

It would be analogous to game programming.

1. We can have the GC steal, say, 1 fps to do its work...

2. Or we can keep the GC asleep, doing nothing, until it gets so 
much work that it has to pause the entire engine for half a second, 
momentarily dropping the fps by half. This might happen 
every minute or so... but it is still unacceptable for most gamers. 
(assuming 30-60 fps)


I'd prefer the first case.
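In game-loop terms, the first case might look something like the sketch below. The incremental entry point `collectStep` is hypothetical; druntime's GC exposes no such call today, so it stands in for whatever a time-budgeted collector would offer.

```d
import core.time : Duration, msecs;

// hypothetical: perform at most `budget` worth of collection work and
// return; repeated calls would eventually complete a full GC cycle
void collectStep(Duration budget)
{
    // not implemented in druntime; placeholder for the sketch
}

// run `frames` frames, giving the GC a bounded slice in each one;
// returns the number of GC slices performed
int runFrames(int frames)
{
    enum gcSlice = 1.msecs;      // roughly "1 fps" worth of time
    int slices;
    foreach (i; 0 .. frames)
    {
        // update(); render();   // the real per-frame work would go here
        collectStep(gcSlice);    // bounded GC work, every frame
        ++slices;
    }
    return slices;
}

void main()
{
    runFrames(60);
}
```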
Feb 23 2015
next sibling parent "Sativa" <Sativa Indica.org> writes:
Basically, I am simply wondering if the GC can "throttle" itself 
so as to reduce the *maximum* time the program has to wait.
Feb 23 2015
prev sibling next sibling parent reply "weaselcat" <weaselcat gmail.com> writes:
On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
 How hard would it be to modify D's GC to do the following two 
 things:

 1. Scan the heap in the BG on another thread/cpu for 
 compactification.

 2. Move blocks of memory in predictable(timewise) chunks that 
 prevents locking the main thread for any period of time.

 e.g., in the first step the GC finds some blocks of memory that 
 need to be freed/compacted. In the 2nd step it starts 
 freeing/compacting it in predictable pieces by limiting the 
 time it takes while working.

 The point is, that maybe the GC is ran more often but in 
 smaller and predictable steps.

 That is, the GC should be able to calculate how long it will 
 take to free/compact a block. If it takes too long then it 
 simply does it in stages.

 This way, there is essentially a very predictable and 
 consistent cpu usage with the GC running but never any major 
 lag spikes that are going to throw real time behavior out the 
 window.

 It would seem that such a "Feature" would be easy to implement 
 by modifying the existing GC code to be "incremental".

 I'd prefer a constant 1-5% cpu usage given to the GC if it 
 didn't blow up for no reason. This way, it being very 
 predictable, just mean one has to get a slightly faster cpu to 
 compensate or optimize the code slightly.
Hi, D's GC actually is predictable. It will not collect unless you allocate. You can also disable it and manually collect. I use it this way and essentially use it as a smart freelist.
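For reference, the pattern described here uses the real druntime API in `core.memory`; the every-tenth-frame collection point below is just an example of choosing your own safe points.

```d
import core.memory : GC;

void doFrame()
{
    // allocations here still go through the GC, but with collections
    // disabled they cannot trigger a pause mid-frame
    auto scratch = new ubyte[](4096);
}

// returns the number of explicit collections performed
int runWithManualGC(int frames)
{
    int collections;
    GC.disable();                // no implicit collections from here on
    foreach (i; 0 .. frames)
    {
        doFrame();
        if (i % 10 == 9)         // collect only at points we choose
        {
            GC.collect();
            ++collections;
        }
    }
    GC.enable();
    return collections;
}

void main()
{
    runWithManualGC(100);
}
```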
Feb 23 2015
parent "Sativa" <Sativa Indica.org> writes:
On Monday, 23 February 2015 at 22:11:48 UTC, weaselcat wrote:
 On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
 How hard would it be to modify D's GC to do the following two 
 things:

 1. Scan the heap in the BG on another thread/cpu for 
 compactification.

 2. Move blocks of memory in predictable(timewise) chunks that 
 prevents locking the main thread for any period of time.

 e.g., in the first step the GC finds some blocks of memory 
 that need to be freed/compacted. In the 2nd step it starts 
 freeing/compacting it in predictable pieces by limiting the 
 time it takes while working.

 The point is, that maybe the GC is ran more often but in 
 smaller and predictable steps.

 That is, the GC should be able to calculate how long it will 
 take to free/compact a block. If it takes too long then it 
 simply does it in stages.

 This way, there is essentially a very predictable and 
 consistent cpu usage with the GC running but never any major 
 lag spikes that are going to throw real time behavior out the 
 window.

 It would seem that such a "Feature" would be easy to implement 
 by modifying the existing GC code to be "incremental".

 I'd prefer a constant 1-5% cpu usage given to the GC if it 
 didn't blow up for no reason. This way, it being very 
 predictable, just mean one has to get a slightly faster cpu to 
 compensate or optimize the code slightly.
Hi, D's GC actually is predictable. It will not collect unless you allocate. You can also disable it and manually collect. I use it this way and essentially use it as a smart freelist.
That isn't the problem. The problem is that when it does collect, the time it takes is unpredictable. It could take 1us or 10m.

If there is a cap on how long the GC can run in any particular time interval, then its time complexity is simple, predictable, and can be better used for RT applications. Effectively I would rather the GC run every second and spend a maximum of 1ms cleaning up (not necessarily finishing) instead of running whenever and potentially taking seconds to clean up.

It's all about how to distribute the GC's running time. As it stands now, it can run at any time and take as long as it wants. I'd rather have it running "continuously" but not take as long as it wants. By allowing it to run continuously, in short bursts, one should get the same long-term behavior but not have the spikes in cpu usage which prevent its usefulness in RT applications.
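The unpredictability being described is easy to observe with the real druntime API: time each explicit collection and the pauses vary with how much work the heap has accumulated. The allocation sizes below are arbitrary, chosen only to give the collector something to do.

```d
import core.memory : GC;
import core.time : Duration, MonoTime;
import std.stdio : writeln;

// allocate some garbage, then measure how long one collection takes
Duration measurePause()
{
    auto garbage = new ubyte[][](1000);
    foreach (ref g; garbage)
        g = new ubyte[](1024);

    immutable t0 = MonoTime.currTime;
    GC.collect();
    return MonoTime.currTime - t0;
}

void main()
{
    foreach (i; 0 .. 5)
        writeln("pause ", i, ": ", measurePause());
}
```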
Feb 23 2015
prev sibling next sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

 How hard would it be to modify D's GC to do the following two things:

 1. Scan the heap in the BG on another thread/cpu for compactification.
needs read/write barriers added to generated code. a major slowdown for ALL memory access.
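To make the cost concrete: a sketch of the card-marking write barrier a compiler would have to emit around every pointer store before partial scans become possible. The card size, table layout, and `storePointer` helper are all invented for illustration; they are not how any real D GC works today.

```d
// hypothetical card table covering the GC heap: one byte per 512-byte card
enum cardShift = 9;
__gshared ubyte[] cardTable;
__gshared void* heapBase;

// what every `*slot = value` on a pointer field would compile to:
// the store itself, plus bookkeeping on every single pointer write
void storePointer(void** slot, void* value)
{
    *slot = value;
    // mark the card holding `slot` dirty so a concurrent/partial
    // collector knows this region must be rescanned
    immutable card = (cast(size_t)slot - cast(size_t)heapBase) >> cardShift;
    cardTable[card] = 1;
}

void main()
{
    static void*[64] fakeHeap;       // stand-in for a GC heap region
    heapBase  = fakeHeap.ptr;
    cardTable = new ubyte[](1024);

    int dummy;
    storePointer(&fakeHeap[3], &dummy);
    // the card containing slot 3 is now marked dirty
    assert(cardTable[0] == 1);
}
```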
 2. Move blocks of memory in predictable(timewise) chunks that prevents
 locking the main thread for any period of time.
as you can't do "partial scans" without barriers... see above.
 e.g., in the first step the GC finds some blocks of memory that need to
 be freed/compacted. In the 2nd step it starts freeing/compacting it in
 predictable pieces by limiting the time it takes while working.
it's not freeing that takes time, it's scanning that takes time. after scanning is complete there is not much left to do. ah, well, if you have thousands of small objects, this can be slow too, but with such a use case partial frees will lead to OOM situations very fast (as your code has to allocate like crazy to get into such a situation).

tl;dr: you can't do this in a compiled language without introducing a major slowdown for indirect memory access. and indirect memory access is used *everywhere*. you will not like the results.
Feb 23 2015
next sibling parent reply "Kagamin" <spam here.lot> writes:
On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
 On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

 How hard would it be to modify D's GC to do the following two 
 things:
 
 1. Scan the heap in the BG on another thread/cpu for 
 compactification.
needs read/write barriers added to generated code. a major slowdown for ALL memory access.
Only modifications of pointers that introduce new cross-block dependencies (so that the GC knows to recheck the new dependency). Other memory accesses go without slowdown.
Feb 24 2015
next sibling parent reply "Sativa" <Sativa Indica.org> writes:
On Tuesday, 24 February 2015 at 08:39:02 UTC, Kagamin wrote:
 On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
 On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

 How hard would it be to modify D's GC to do the following two 
 things:
 
 1. Scan the heap in the BG on another thread/cpu for 
 compactification.
needs read/write barriers added to generated code. a major slowdown for ALL memory access.
Only modifications of pointers, which introduce new cross-block dependencies (so that GC knows to recheck the new dependency). Other memory access goes without slowdown.
But this type of thinking is the reason why the current GC is in the state it is.

The compiler knows which pointers are "free" and which ones are "bound". Bound pointers are pointers that are not assigned freely by the user; e.g., a pointer to an array whose address is never arbitrarily set by the user is bound. The compiler knows where and how the pointer is assigned, and most pointers are this way. Bound pointers are pointers the GC can easily clean up because it knows when and how they are used. If all pointers of a program were bound, the GC could work in the background and never pause the program to clean up (essentially, the compiler would need to insert special code).

Free pointers are more difficult, as they can, say, be randomly initiated and point to anywhere on the heap, and have to be accessed in a locked way (to prevent them changing in the middle of some GC operation). But if one distinguishes bound and free pointers (easily done with a bit in the pointer) and has the compiler keep track of when free pointers are used (by setting a dirty bit when they are written to), then one can more easily scan the heap in the background.

In fact, one could potentially get away from all synching issues by doing the following: whenever free pointers are used, a simple spin lock is taken. The spin lock checks a flag in the free-pointers table that signals that a pointer is being changed by the code. When this is true, the free-pointers table is in a state of flux and can't be relied on.

In the meantime, the GC can build up information about the heap for the bound pointers. It can figure out what needs to be changed, set up buffering (which can be done using bits in the pointer), etc., all in the background, because the bound pointers are "stable" and change deterministically. When the free-pointers table's dirty flag is unset, it means the free pointers are not changing in the program and the GC can lock the table using another flag.

When the flag is set, the spin lock kicks in and pauses the program while the GC is working on the free-pointers table (or, to be more efficient, the program can yield to some other background task code). By having multiple tables of free pointers one can reduce the overhead: the GC looks at one piece at a time and locks only a fraction of the code at any point in time. The compiler can distribute the locks vs. pages in an optimized way through profiling.
Feb 24 2015
next sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
On Tue, 24 Feb 2015 15:22:01 +0000, Sativa wrote:

 Free pointers are more difficult as they can, say, be randomly initiated
 and point to anywhere on the heap and have to be looked in a locked way.
 (to prevent them changing in the middle of some GC operation)
and they can live in a CPU register... ooops.
Feb 24 2015
prev sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 24 February 2015 at 15:22:02 UTC, Sativa wrote:
 Only modifications of pointers, which introduce new 
 cross-block dependencies (so that GC knows to recheck the new 
 dependency). Other memory access goes without slowdown.
But this type of thinking is the reason why the current GC is in the state it is.
There are limitations you have in a systems language that you don't in a VM language. That being said, D has a type system that allows for various optimizations when it comes to the GC, so I don't think it is hopeless, but we need to plug the remaining holes.
Feb 24 2015
prev sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 24 February 2015 at 08:39:02 UTC, Kagamin wrote:
 On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
 On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

 How hard would it be to modify D's GC to do the following two 
 things:
 
 1. Scan the heap in the BG on another thread/cpu for 
 compactification.
needs read/write barriers added to generated code. a major slowdown for ALL memory access.
Only modifications of pointers, which introduce new cross-block dependencies (so that GC knows to recheck the new dependency). Other memory access goes without slowdown.
That is not quite true. You don't know beforehand which ones these are, so you always need to at least check whether you need to do the work.
Feb 24 2015
prev sibling parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
 On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

 How hard would it be to modify D's GC to do the following two 
 things:
 
 1. Scan the heap in the BG on another thread/cpu for 
 compactification.
needs read/write barriers added to generated code. a major slowdown for ALL memory access.
It's possible to (ab)use the MMU as a write barrier. Sociomantic's GC implementation does this by forking and running the scan in the child, then communicating information back about what can be freed. Video: http://dconf.org/talks/lucarella.html

IIRC, at the end the speaker was asked about the performance overhead of the COW that is involved, but he couldn't give any numbers.
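The fork trick, in rough outline (POSIX only). The marking and reporting steps are stubbed out here; Sociomantic's actual implementation is considerably more involved and, notably, does not block the parent waiting for the child as this sketch does.

```d
import core.sys.posix.unistd : fork, _exit;
import core.sys.posix.sys.wait : waitpid;

// returns the child's wait status (0 when the stubbed scan completed)
int backgroundCollect()
{
    const pid = fork();
    if (pid == 0)
    {
        // child: the MMU gives it a copy-on-write snapshot of the heap,
        // so it can mark in parallel while the parent keeps mutating
        // its own pages
        // markHeap(); reportUnreachableBlocksToParent();  // stubs
        _exit(0);
    }
    if (pid < 0)
        return -1;                // fork failed

    int status;
    waitpid(pid, &status, 0);     // a real GC would not block here
    // parent: sweep the blocks the child reported as unreachable
    return status;
}

void main()
{
    backgroundCollect();
}
```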
Feb 24 2015
next sibling parent reply "Ola Fosheim Grøstad" writes:
On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz wrote:
 It's possible to (ab)use the MMU as a write barrier.
Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... The cost of a page miss is 1000 cycles + overhead for copying. And that's assuming that you have enough memory and that the TLB isn't affected...
Feb 24 2015
next sibling parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Tuesday, 24 February 2015 at 15:14:03 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz wrote:
 It's possible to (ab)use the MMU as a write barrier.
Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... The cost of a page miss is 1000 cycles + overhead for copying. And that's assuming that you have enough memory and that the TLB isn't affected...
Yes, but if the program has a small working set, you pay that price only a few times (once per page), and in any case, only during a running background scan, rather than always.
Feb 24 2015
parent reply "Ola Fosheim Grøstad" writes:
On Tuesday, 24 February 2015 at 16:05:06 UTC, Marc Schütz wrote:
 On Tuesday, 24 February 2015 at 15:14:03 UTC, Ola Fosheim 
 Grøstad wrote:
 On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz 
 wrote:
 It's possible to (ab)use the MMU as a write barrier.
Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33
(Typo, MFENCE is 33...)
 cycles... The cost of a page miss is 1000 cycles + overhead 
 for copying.

 And that's assuming that you have enough memory and that the 
 TLB isn't affected...
Yes, but if the program has a small working set, you pay that price only a few times (once per page), and in any case, only during a running background scan, rather than always.
But if you can control where the code is running when you run the GC scan, then you might as well have a separate code path for concurrent GC too (i.e. two versions of the code): one with fences and one without... If you lose the TLB, then you also lose all caches too, IIRC.
Feb 24 2015
parent "Marc Schütz" <schuetzm gmx.net> writes:
On Tuesday, 24 February 2015 at 17:07:03 UTC, Ola Fosheim Grøstad 
wrote:
 But if you can control where the code is running when you run 
 the GC scan then you might as well have a separate code path 
 for concurrent GC too (i.e. two versions of the code). One with 
 fences and one without...
I'd say this is impractical. You could only reasonably expect this to work at certain checkpoints, say, an event loop with short calls into the "actual" application code. You cannot simply switch right in the middle of a deep call. And with Sociomantic's GC, you don't have any more "control" over the GC than you have with the current one in druntime.
Feb 24 2015
prev sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 24 February 2015 at 15:14:03 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz wrote:
 It's possible to (ab)use the MMU as a write barrier.
Uhm... The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... The cost of a page miss is 1000 cycles + overhead for copying. And that's assuming that you have enough memory and that the TLB isn't affected...
That's why this is used for immutable, or mostly immutable data, but generally avoided in the case of mutable data.
Feb 24 2015
prev sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
On Tue, 24 Feb 2015 14:45:16 +0000, Marc Schütz wrote:

 It's possible to (ab)use the MMU as a write barrier. Sociomantic's GC
 implementation does this by forking and running the scan in the child,
 then communicating information back about what can be freed. Video:
sure, it's possible sometimes. i'd like to see forking GC in druntime. but all in all it's just a clever hack. concurrent GC in the same process can't be efficiently done with MMU only: MMU protection has too big granularity and page faults are too slow.
Feb 24 2015
prev sibling parent "Casper Færgemand" <shorttail hotmail.com> writes:
On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
 1. Scan the heap in the BG on another thread/cpu for 
 compactification.
You need a precise GC to do this, because you need to know what is a pointer. Otherwise, if you have two variables, say a size_t and a pointer, that contain the same raw value, then on compacting you could overwrite the size_t despite it just being a number and not a pointer. Also, you need to ban certain pointer shenanigans (like xor fun stuff), though the current GC doesn't work when those are in use anyway, so I guess it's not too bad.
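The ambiguity can be shown directly: to a collector without precise type information, the two variables below are indistinguishable words of memory.

```d
void main()
{
    auto block = new int[](4);
    int* p = block.ptr;               // a real pointer into the heap
    size_t n = cast(size_t)block.ptr; // just a number that happens to
                                      // hold the same bit pattern

    // A compacting GC that treated every pointer-looking word as a
    // pointer would "fix up" n when it moved the block, silently
    // corrupting an integer. Precise type info is what lets it
    // update p but leave n alone.
    assert(n == cast(size_t)p);
}
```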
Feb 24 2015