digitalmars.D - A different, precise TLS garbage collector?
- Etienne (21/21) Nov 16 2014 I always wondered why we would use the shared keyword on GC allocations
- Sean Kelly (3/3) Nov 16 2014 We'll have to change the way "immutable" is treated for
- Etienne Cimon (11/14) Nov 16 2014 Exactly, I'm not sure how DMD currently handles immutable but it should
- Etienne Cimon (7/7) Nov 16 2014 This GC model also seems to work fine for locally-allocated __gshared
- Xinok (4/26) Nov 16 2014 How about immutable data which is implicitly shareable? Granted
- Etienne Cimon (7/10) Nov 16 2014 Immutable data would proxy through malloc and would not be scanned as it...
- "Ola Fosheim Grøstad" (8/13) Nov 16 2014 If you go for thread local garbage collection then there is no
- Sean Kelly (4/6) Nov 16 2014 Yes. There are a lot of little "gotchas" with thread-local
- Etienne (6/13) Nov 16 2014 I can't even think of a situation when this would be necessary.
- "Ola Fosheim Grøstad" (14/19) Nov 16 2014 There is a reason for why "elegant" GC languages pick one primary
- Etienne Cimon (10/14) Nov 16 2014 I'm not sure what this means, wouldn't the fiber stacks be saved on the
- "Ola Fosheim Grøstad" (13/21) Nov 16 2014 If you want performant, low latency fibers you want load
- Kagamin (2/2) Nov 17 2014 Previous thread:
- Etienne (20/22) Nov 17 2014 Looks somewhat similar but the idea of a shared GC will defeat the
I always wondered why we would use the shared keyword on GC allocations if only the stack can be optimized for TLS storage.

After thinking about how shared objects should work with the GC, it's become obvious that the GC should be optimized for local data. Anything shared would have to be manually managed, because the biggest slowdown of all is stopping the world to facilitate concurrency.

With a precise GC on the way, it's become easy to filter out allocations from shared objects. Simply proxy them through malloc and get rid of the locks. Make the GC thread-local, and you can expect it to scale with the number of processors. Any thread-local data would already have to be duplicated into a shared object to be used from another thread, and its lifetime is easy to manage manually.

SomeTLS variable = new SomeTLS("Data");
shared SomeTLS variable2 = cast(shared) variable.dupShared();
Tid tid = spawn(&doSomething, variable2);
variable = receive!variable2(tid).dupLocal();
delete variable2;

Programming with a syntax that makes use of shared objects, and forces manual management on those, seems to make "stop the world" a thing of the past.

Any thoughts?
Nov 16 2014
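The dispatch Etienne describes could be sketched roughly like this. The names `ThreadLocalGC` and `routeAlloc` are invented for illustration, and malloc stands in for real collector bookkeeping; druntime's actual allocator hooks look different:

```d
import core.stdc.stdlib : malloc;

// Hypothetical per-thread collector: one instance per thread, so the
// local allocation path never needs a lock.
struct ThreadLocalGC
{
    void* allocate(size_t size)
    {
        // a real implementation would register the block with this
        // thread's mark/sweep bookkeeping; malloc stands in here
        return malloc(size);
    }
}

ThreadLocalGC tlsGC; // module-level variables are thread-local in D

// shared/immutable allocations bypass the local collector entirely
// and are managed manually, as the proposal suggests
void* routeAlloc(size_t size, bool isShared)
{
    return isShared ? malloc(size) : tlsGC.allocate(size);
}
```

The point of the split is that only the shared path ever needs synchronization, so the common (thread-local) path stays lock-free.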
We'll have to change the way "immutable" is treated for allocations. Which I think is a good thing. Just because something can be shared doesn't mean that I intend to share it.
Nov 16 2014
On 2014-11-16 10:20, Sean Kelly wrote:
> We'll have to change the way "immutable" is treated for allocations.
> Which I think is a good thing. Just because something can be shared
> doesn't mean that I intend to share it.

Exactly. I'm not sure how DMD currently handles immutable, but it should automatically be mangled into the global namespace in the application data.

If this seems feasible to everyone, I wouldn't mind forking the precise GC into a thread-local library, without any "stop the world" slowdown. A laptop with 4 cores running a multi-threaded application would (theoretically) run through the marking/collection process 4 times faster, and allocate unbelievably faster due to no locks :)

The only problem is having to manually allocate shared objects, which seems fine because most of the time they'd be deallocated in a shared static ~this anyway.
Nov 16 2014
This GC model also seems to work fine for locally-allocated __gshared objects. Since they're registered locally but available globally, they'll be collected once the thread that created them is gone. Also, when an object is cast(shared) before being sent to another thread, it's usually still in scope when the other thread returns. So the chances that existing code would be broken by a thread-local GC seem very thin.
Nov 16 2014
On Sunday, 16 November 2014 at 13:58:19 UTC, Etienne wrote:
> With a precise GC on the way, it's become easy to filter out
> allocations from shared objects. Simply proxy them through malloc and
> get rid of the locks. Make the GC thread-local, and you can expect it
> to scale with the number of processors.
> [...]

How about immutable data, which is implicitly shareable? Granted, you can destroy/free the data asynchronously, but you would still need to check all threads for references to that data.
Nov 16 2014
On 2014-11-16 10:21, Xinok wrote:
> How about immutable data, which is implicitly shareable? Granted, you
> can destroy/free the data asynchronously, but you would still need to
> check all threads for references to that data.

Immutable data would proxy through malloc and would not be scanned, as it can only contain immutable data that cannot be deleted nor scanned. This is also shared by every thread without any locking.

Currently, immutable data is global in storage but may be local in access rights, I think? I would have assumed it would automatically be placed in the process's .rdata segment.
Nov 16 2014
On Sunday, 16 November 2014 at 13:58:19 UTC, Etienne wrote:
> After thinking about how shared objects should work with the GC, it's
> become obvious that the GC should be optimized for local data.
> Anything shared would have to be manually managed, because the biggest
> slowdown of all is stopping the world to facilitate concurrency.

If you go for thread-local garbage collection, then you might as well be more general and support per-data-structure garbage collection too. That's more useful: it can be used for collecting cycles in graphs. Just let the application initiate collection when there are no references pointing into the structure.

But keep in mind that you also have to account for fibers that move between threads.
Nov 16 2014
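The fiber hazard Ola raises can be made concrete with a small sketch. Assume a hypothetical scheduler that load-balances fibers across threads (druntime itself does not migrate fibers for you):

```d
import core.thread : Fiber;

void work()
{
    auto buf = new int[](64); // allocated on the current thread's
                              // local heap under the proposed model
    Fiber.yield();            // a migrating scheduler may resume this
                              // fiber on a *different* thread...
    buf[0] = 42;              // ...which now holds, and writes through,
                              // a pointer into another thread's heap
}
```

A collector that scans only its own thread's stacks would no longer see `buf` as reachable once the fiber has moved, so either fibers must stay pinned to their creating thread, or each local collector must also scan the stacks of fibers that originated there.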
On Sunday, 16 November 2014 at 17:38:54 UTC, Ola Fosheim Grøstad wrote:
> But keep in mind that you also have to account for fibers that move
> between threads.

Yes. There are a lot of little "gotchas" with thread-local allocation.
Nov 16 2014
On Sunday, 16 November 2014 at 17:40:30 UTC, Sean Kelly wrote:
> On Sunday, 16 November 2014 at 17:38:54 UTC, Ola Fosheim Grøstad wrote:
>> But keep in mind that you also have to account for fibers that move
>> between threads.
>
> Yes. There are a lot of little "gotchas" with thread-local allocation.

I can't even think of a situation where this would be necessary. It sounds like all I would need is to take the precise GC and store each instance in the thread data; I'd probably only need the rtinfo to see whether an allocation is shared, so it can be proxied to malloc. Am I missing something?
Nov 16 2014
On Sunday, 16 November 2014 at 19:13:27 UTC, Etienne wrote:
> I can't even think of a situation where this would be necessary. It
> sounds like all I would need is to take the precise GC and store each
> instance in the thread data; I'd probably only need the rtinfo to see
> whether an allocation is shared, so it can be proxied to malloc. Am I
> missing something?

There is a reason why "elegant" GC languages pick one primary type of concurrency. If you say that all code runs on a fiber and that there is no such thing as thread-local, then you can tie the local GC partition to the fiber and collect it on any thread.

If you say that functions called from a fiber sometimes call into global statespace, sometimes into thread statespace, and sometimes into fiber statespace… then you need to figure out ownership on all allocations. Does the allocated object belong to a global database, a thread-local database, or a fiber cache which is flushed automatically when moving to a new thread? Or is it an extension of the fiber statespace that should be transparent to threads?
Nov 16 2014
On 2014-11-16 19:32, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> wrote:
> Does the allocated object belong to a global database, a thread-local
> database, or a fiber cache which is flushed automatically when moving
> to a new thread? Or is it an extension of the fiber statespace that
> should be transparent to threads?

I'm not sure what this means; wouldn't the fiber stacks be saved in the thread-local space when they yield? In turn, they become part of the thread-local stack space, I guess.

Overall, I'd put all the GC allocations through malloc the same way it is done right now. I don't see anything that needs to be done other than making multiple thread-local GC instances and removing the locks. I'm sure I'll find obstacles, but I don't see them right now. Do you know of any that I should look out for?
Nov 16 2014
On Monday, 17 November 2014 at 00:44:13 UTC, Etienne Cimon wrote:
> I'm not sure what this means; wouldn't the fiber stacks be saved in
> the thread-local space when they yield? In turn, they become part of
> the thread-local stack space, I guess.

If you want performant, low-latency fibers, you want load balancing, so they should not be affiliated with a thread but live in a pool. That said, fibers aren't a low-level construct like threads, so I am not sure they belong in system-level programming anyway.

> I'm sure I'll find obstacles, but I don't see them right now. Do you
> know of any that I should look out for?

Not if you work hard to ensure referential integrity. I personally would find it more useful with an optional GC where the programmer takes responsibility for collecting when the situation is right (specifying root pointers, stack, etc.). I think the language should limit itself to generating the information that enables precise collection, then leave the rest to the programmer…
Nov 16 2014
Previous thread: http://forum.dlang.org/post/dnxgbumzenupviqymhrg forum.dlang.org
Nov 17 2014
On 2014-11-17 9:45 AM, Kagamin wrote:
> Previous thread:
> http://forum.dlang.org/post/dnxgbumzenupviqymhrg forum.dlang.org

Looks somewhat similar, but the idea of a shared GC would defeat the purpose and end up complicating things. After another review of the problem, I've come up with some new observations:

- The shared data needs to be manually managed for a thread-local GC in order to scale with the number of CPU cores.
- Anything instantiated as `new shared X` will have to proxy into a shared allocator interface or malloc.
- All __gshared instances containing mutable indirections will cause undefined behavior.
- All thread-local instances moved between threads using cast(shared) and carrying indirections will cause undefined behavior.
- Immutable object values must not be allocated on the GC, and must be defined only in a shared static this constructor, to ensure the values are available to all threads at all times.

The only necessity is shared/non-shared type information during allocation and deallocation. The __gshared and cast(shared) issues are certainly the most daunting. This is why this GC would have to be optional, through a version(ThreadLocalGC).
Nov 17 2014
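The last observation above, keeping immutable values off the GC heap by building them in module construction, would look roughly like this (the variable name is illustrative):

```d
// Immutable data set up once per process, before any worker thread
// (and therefore before any thread-local GC instance) exists.
immutable string[] defaultFlags;

shared static this()
{
    // `shared static this` runs exactly once per process, unlike a
    // plain `static this`, which runs once per thread
    defaultFlags = ["verbose", "color"];
}
```

Since the array is fully constructed before any thread is spawned, every thread can read it without locking and no collector ever needs to scan or free it.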