digitalmars.D - Interior pointers and fast GC
- Chris Wright (150/150) Jan 13 2017 Interior pointers are a barrier to performant garbage collection.
- Rainer Schuetze (19/29) Jan 14 2017 [...]
- Nick Treleaven (11/14) Jan 21 2017 Makes me wonder about a GC'd language where each pointer is
- Shachar Shemesh (6/17) Jan 22 2017 You are neglecting to account for cache locality performance cost. An
- deadalnix (13/15) Jan 21 2017 1. Split the heap in chunk of size n being a power of 2, say 4M.
- Araq (8/20) Jan 21 2017 It's an O(1) that requires a hash table lookup in general because
- Chris Wright (24/30) Jan 21 2017 Overexplaining because I'm tired.
- Araq (6/20) Jan 21 2017 Sure, O(log #pools) + O(1) is "much better" than O(1).
- Chris Wright (16/37) Jan 22 2017 Much better than what's available today, unless I misunderstood the
- deadalnix (7/14) Jan 22 2017 Huge allocs are always handled specifically by allocators. The
Interior pointers are a barrier to performant garbage collection. Here, I'm brainstorming about the problem and not really coming to any conclusions.

The world today
===============

D's garbage collector is significantly less performant than Java's, and this is partly the result of D's features. Interior pointers are costly.

What's the deal with interior pointers?
---------------------------------------

A heap object (for this discussion) is a series of bytes on the GC heap. An interior pointer is a pointer to some memory location within a heap object, but not to the beginning of the heap object.

A garbage collector that never has to deal with interior pointers has several options for storing metadata related to a heap object:

* In a segment of data immediately before the heap object
* In a hashmap
* In a binary search tree, a prefix tree, or the like

The first strategy is O(1) and has good data locality relative to the data pointed at. Eclipse OMR assumes you use this strategy (though it supports other methods).

The second strategy is O(1) in the expected case. By default, it has poor data locality -- though you can improve that with careful management of virtual memory.

The third strategy is O(log N) in the expected case and has similar data locality to the hashtable.

Obviously, we prefer the first strategy. Unfortunately, given an interior pointer, you can't identify the base of its heap object in constant time. I believe D's garbage collector uses a binary search tree.

D allows unrestricted interior pointers. This makes it difficult to experiment with other garbage collectors, except for those designed for D or C++.

Addressof
---------

D lets you take the address of anything:

    class Paddock {
        uint goats, sheep;
    }
    auto paddock = new Paddock;
    auto goatPtr = &paddock.goats;

The creation of interior pointers this way can be detected at compile time, if that benefits us.

Arrays
------

A D array is a struct containing a pointer and a length. D encourages array slicing:

    import std.array : split;

    auto str = "hello world!";
    foreach (word; str.split(' ')) {
        // This is an interior pointer
        assert(word.ptr >= str.ptr);
        assert(word.ptr < str.ptr + str.length);
    }

Array slicing is splendid and worthwhile, and eliminating it would be a huge cost. We can at least potentially detect non-pathological cases of this.

Closures
--------

Closures can create interior pointers, depending on the compiler's implementation:

    class Paddock {
        uint goats, sheep;
        void trackCattle() {
            // cattleEnter: some event with a connect() method (details elided)
            cattleEnter.connect((flock) {
                if (flock.isGoats) goats += flock.size;
                if (flock.isSheep) sheep += flock.size;
            });
        }
    }

The compiler might include a copy of the `this` pointer in the closure, or it might include direct pointers to `goats` and `sheep`. Both produce equivalent results aside from the creation of interior pointers.

When the closure is defined in a struct method, it is impossible to ensure that the contents of the closure do not include an interior pointer -- the struct itself might be a member of another heap object.

What options do we have?
========================

If we want to reduce interior pointers, we have a number of options. We would likely need several to get reasonable results.

Forbid user-generated interior pointers
---------------------------------------

We might simply forbid taking the address of a member of a heap object, turning it into a compile-time error, or merely leaving it as undefined behavior.

This is unlikely to be feasible. D's lifeblood is array slicing, and that produces interior pointers left and right.
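Worse, a slice can be rebuilt from a raw pointer, at which point nothing records where the underlying allocation begins. A minimal sketch of that pathological case (the names are illustrative):

    auto buf = new ubyte[](1024);
    ubyte* p = buf.ptr + 100;   // interior pointer into buf's allocation
    auto view = p[0 .. 64];     // a perfectly legal slice built from it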
Augment arrays
--------------

We can augment arrays to eliminate interior pointers for common operations:

    struct Array(T) {
        private T* baseptr;
        size_t offset;
        size_t length;
        T* ptr() { return baseptr + offset; }
    }

It is possible to construct an array from a pointer, and any pointer might be an interior pointer, but this strategy would at least reduce the number of interior pointers we have floating around.

Change struct this pointer to pointer+offset
--------------------------------------------

We convert struct `this` pointers to a pointer plus an offset in order to eliminate interior pointers from closures involving struct members. This probably requires adding a hidden field to each struct, which breaks C compatibility. It also requires a lot of bookkeeping, and I'm not sure it's always possible. Uninitialized struct values break this.

Mark allocations that might contain interior pointers
-----------------------------------------------------

Since we can detect many things that might create interior pointers at compile time, we can explicitly mark them as such. Conversely, we can prove that many things do not contain interior pointers, and we could mark those.

When we scan a heap object that is marked as not containing interior pointers, we can use a fast path to locate the metadata associated with that heap object. Otherwise, we revert to the current, slow path.

This will probably be useless unless we take steps to reduce the number of interior pointers. By default, if an aggregate contains a pointer or an array, it contains a possible interior pointer. If we augment arrays to avoid interior pointers, this might be useful.

This is always an optimization because we can tell immediately (O(1) time, reading values that are already in memory) whether to use the slow or fast path.

Mark potential interior pointers
--------------------------------

By managing virtual memory explicitly, we can use the high order bit of a pointer into the GC heap to indicate whether that is potentially an interior pointer. We can potentially memory map multiple preselected virtual address ranges to the same physical range (I'm uncertain about this). However, on systems where that is not possible, every pointer read requires invoking a runtime function.

This is not an option on 32-bit systems, where we have limited address space to use. It would reduce us to ~1.5GB of GC heap (since the OS typically reserves 0.5GB of address space). ASLR adds more work.

If implemented, this would give us an O(1) means for determining whether to use the slow case or not, assuming we also managed to identify the construction of all potential interior pointers.

This is also problematic with C/C++ interop and inline assembly. You can trivially construct interior pointers in either, and they would be unmarked. If pointer reads require invoking a runtime function, that is a barrier to passing pointers into C/C++ code.

Hashtable with search tree fallback
-----------------------------------

We can use a hashtable of allocations first and fall back to a binary search tree if necessary. This has us maintaining two search data structures when we'd prefer zero. However, it's the most easily implemented version.

The utility of this is proportional to the relative frequency of interior pointers. If they are rare, this strategy pays off. If they are common, this strategy adds overhead for little gain. I anticipate that this would be slower with current DMD.
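A minimal sketch of that last option, assuming the hashtable is keyed on exact base addresses so that only interior pointers pay for the fallback (names are illustrative, and a sorted array stands in for the tree):

    struct Alloc { void* base; size_t size; }

    Alloc*[void*] byBase;   // hashtable: exact base address -> allocation
    Alloc*[] sorted;        // stand-in for the search tree, sorted by base

    Alloc* findAlloc(void* p) {
        if (auto a = p in byBase)   // fast path: p is a base pointer, O(1)
            return *a;
        // slow path: binary search for the allocation containing p
        size_t lo = 0, hi = sorted.length;
        while (lo < hi) {
            auto mid = (lo + hi) / 2;
            auto a = sorted[mid];
            if (p < a.base)
                hi = mid;
            else if (cast(ubyte*) p >= cast(ubyte*) a.base + a.size)
                lo = mid + 1;
            else
                return a;           // p is interior to this allocation
        }
        return null;                // not a GC pointer at all
    }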
Jan 13 2017
On 14.01.2017 05:37, Chris Wright wrote:
> Interior pointers are a barrier to performant garbage collection.
> [...]
> * In a binary search tree, a prefix tree, or the like
> [...]
> The third strategy is O(log N) in the expected case and has similar
> data locality to the hashtable.
>
> Obviously, we prefer the first strategy. Unfortunately, given an
> interior pointer, you can't identify the base of its heap object in
> constant time. I believe D's garbage collector uses a binary search
> tree.
>
> D allows unrestricted interior pointers. This makes it difficult to
> experiment with other garbage collectors, except for those designed
> for D or C++.

The garbage collected memory is organized as pools of growing size. For each pool there are lookup tables for each page that has fixed-size memory blocks. That way, finding the base of an allocation is a matter of finding the pool (currently O(log #pools); by N I assume you mean the number of allocations), but the rest is constant time. IIRC an application that uses 1 GB of memory has about 20 pools, which means the time to figure out the base address and the size of an allocation is more or less constant.

In addition, you need to look up the pool anyway to figure out if the pointer points to non-managed memory (stack, global data, malloc'd memory).

So, I don't think interior pointers are a performance problem of D's GC. Other GCs might not be able to deal with these, though.

The much larger problem with trying more sophisticated GCs is the lack of a fast write barrier. If we get one with compiler-assisted (automatic) reference counting, we might as well try this for other types of GC.
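A simplified sketch of the lookup described above (field names are hypothetical; the real GC's pool structures differ in detail):

    struct Pool {
        void* base;
        size_t size;
        uint[] blockSizeOfPage;   // per-page table: block size used in that page
    }

    enum pageShift = 12;          // 4 KB pages

    // Find the pool containing p; with ~20 pools this is essentially
    // constant time (the GC searches rather than scanning linearly).
    Pool* findPool(Pool[] pools, void* p) {
        foreach (ref pool; pools)
            if (pool.base <= p
                    && cast(ubyte*) p < cast(ubyte*) pool.base + pool.size)
                return &pool;
        return null;   // non-managed: stack, global data, malloc'd memory
    }

    // O(1) once the pool is known: round down within the page's block size.
    void* findBase(Pool* pool, void* p) {
        auto off       = cast(size_t) p - cast(size_t) pool.base;
        auto pageStart = (off >> pageShift) << pageShift;
        auto blockSize = pool.blockSizeOfPage[off >> pageShift];
        return cast(ubyte*) pool.base + pageStart
             + (off - pageStart) / blockSize * blockSize;
    }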
Jan 14 2017
On Saturday, 14 January 2017 at 15:30:42 UTC, Rainer Schuetze wrote:
> In addition, you need to look up the pool anyway to figure out if the
> pointer points to non-managed memory (stack, global data, malloc'd
> memory).

Makes me wonder about a GC'd language where each pointer is actually a member of a struct which also has a base allocation pointer. The base pointer could either only be set for managed allocations, or the low bit of the base address could be used to indicate such instead*. This would make for faster GC scans, but would also cause slowdowns when copying pointers. Pointer arithmetic would be just as efficient, though.

(*) With this scheme, pointer arithmetic can be safe in assert mode.
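A sketch of what such a fat pointer might look like (hypothetical; D used for illustration):

    struct FatPtr(T) {
        T* ptr;       // the pointer the program actually dereferences
        void* base;   // base of the owning allocation; null for unmanaged
                      // memory (or tag the low bit, as suggested above)

        // Arithmetic only touches ptr, so it stays as cheap as today;
        // a GC scan reads base directly, with no pool or tree lookup.
        FatPtr opBinary(string op : "+")(size_t n) {
            auto r = FatPtr(ptr + n, base);
            // In assert mode we can at least check that we haven't walked
            // below the allocation's start (checking the end needs a size).
            assert(base is null || cast(void*) r.ptr >= base);
            return r;
        }
    }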
Jan 21 2017
On 21/01/17 17:30, Nick Treleaven wrote:
> On Saturday, 14 January 2017 at 15:30:42 UTC, Rainer Schuetze wrote:
>> In addition, you need to look up the pool anyway to figure out if
>> the pointer points to non-managed memory (stack, global data,
>> malloc'd memory).
>
> Makes me wonder about a GC'd language where each pointer is actually
> a member of a struct which also has a base allocation pointer. The
> base pointer could either only be set for managed allocations, or the
> low bit of the base address could be used to indicate such instead*.
> This would make for faster GC scans, but would also cause slowdowns
> when copying pointers. Pointer arithmetic would be just as efficient,
> though.
>
> (*) With this scheme, pointer arithmetic can be safe in assert mode.

You are neglecting to account for the cache locality performance cost. An array of pointers would, under this plan, store half as many pointers in a single cache line. This means handling this array would be, give or take, half as fast. That's a huge price to pay.

Shachar
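The arithmetic behind that objection, spelled out (assuming 64-bit pointers and 64-byte cache lines):

    struct FatPtr { void* ptr; void* base; }

    // 64 / 8 = 8 plain pointers per cache line; 64 / 16 = 4 fat ones.
    static assert(FatPtr.sizeof == 2 * (void*).sizeof);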
Jan 22 2017
On Saturday, 14 January 2017 at 04:37:01 UTC, Chris Wright wrote:
> Unfortunately, given an interior pointer, you can't identify the base
> of its heap object in constant time.

1. Split the heap into chunks of size n, n being a power of 2, say 4M. Align them to 4M.
2. Find the chunk an alloc is part of in O(1) by masking the lower bits (22 bits to mask in our 4M case).
3. Have a table of page descriptors in the chunk header. Look up the page the alloc is in, in O(1).
4a. If the alloc is large (flag in the page descriptor), find the base pointer in O(1).
4b. If the alloc is small, compute the index of the item in the page from the size class in the page descriptor (one addition, one multiply and one shift), in O(1).

Start on a false premise, end up nowhere.
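A sketch of steps 1-4 (constants and field names are illustrative, blocks are assumed to start at page boundaries, and real allocators replace the division in step 4b with a magic-number multiply and shift):

    enum chunkBits = 22;                    // 4M chunks, 4M-aligned
    enum pageBits  = 12;                    // 4K pages
    enum chunkMask = ~((1UL << chunkBits) - 1);

    struct PageDescriptor {
        bool   large;       // step 4a flag
        uint   sizeClass;   // block size for small-object pages
        size_t baseOffset;  // for large allocs: offset of base within chunk
    }

    struct Chunk {
        PageDescriptor[1 << (chunkBits - pageBits)] pages;   // chunk header
        // ... allocation data follows ...
    }

    void* findBase(void* p) {
        auto addr  = cast(size_t) p;
        auto chunk = cast(Chunk*)(addr & chunkMask);                 // step 2
        auto pd    = &chunk.pages[(addr & ~chunkMask) >> pageBits];  // step 3
        if (pd.large)                                                // step 4a
            return cast(void*)((addr & chunkMask) + pd.baseOffset);
        auto pageStart = (addr >> pageBits) << pageBits;             // step 4b
        auto index     = (addr - pageStart) / pd.sizeClass;
        return cast(void*)(pageStart + index * pd.sizeClass);
    }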
Jan 21 2017
On Saturday, 21 January 2017 at 17:42:46 UTC, deadalnix wrote:
> 1. Split the heap into chunks of size n, n being a power of 2, say
> 4M. Align them to 4M.
> 2. Find the chunk an alloc is part of in O(1) by masking the lower
> bits (22 bits to mask in our 4M case).
> 3. Have a table of page descriptors in the chunk header. Look up the
> page the alloc is in, in O(1).
> 4a. If the alloc is large (flag in the page descriptor), find the
> base pointer in O(1).
> 4b. If the alloc is small, compute the index of the item in the page
> from the size class in the page descriptor (one addition, one
> multiply and one shift), in O(1).
>
> Start on a false premise, end up nowhere.

It's an O(1) that requires a hash table lookup in general, because allocations can exceed the chunk size, and so you cannot just mask the pointer and look at the chunk header -- it might not be a chunk header at all. Know any production GCs that use hash table lookups for pointer assignments? Me neither. Ok ok, maybe Go does; it's the only language with GC that embraces interior pointers, as stupid as that is.
Jan 21 2017
On Sun, 22 Jan 2017 05:02:43 +0000, Araq wrote:
> It's an O(1) that requires a hash table lookup in general, because
> allocations can exceed the chunk size, and so you cannot just mask
> the pointer and look at the chunk header -- it might not be a chunk
> header at all. Know any production GCs that use hash table lookups
> for pointer assignments? Me neither. Ok ok, maybe Go does; it's the
> only language with GC that embraces interior pointers, as stupid as
> that is.

Overexplaining because I'm tired.

If you have fixed-size allocations, you get much better time: O(log #pools) to find the pool, O(1) to find the offset in that pool. The cost is you multiply the number of pools in existence by the number of allocation sizes you decide to support.

This obviously helps for structs and classes. It also helps for small arrays: a small array is a fixed-size allocation that can potentially be reallocated.

For instance, I decide to support allocations of 16, 32, and 256 bytes. The string "A British tar is a soaring soul" fits into a 32-byte allocation, so I put it in the 32-byte pool. If I append the string " as free as a mountain bird" to it, that no longer fits, so I copy it into the 256-byte pool and append there.

If you then take a slice from the original string, "a soaring soul", I can find the relevant pool by examining every pool (or by traversing a search tree). The pool says it's a 32-byte pool, so floor-32 of the pointer is the base pointer.

Large heap objects (anything above the maximum supported size, which will be about one OS page) don't afford this convenience -- they aren't of any particular size, so you still need to search through every allocation. So the performance of D's GC is rather strongly dependent on how large arrays tend to get.

Unless I'm misunderstanding the GC yet again.
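The floor operation, concretely (a sketch; with power-of-two block sizes the division becomes a mask):

    // Base of the allocation containing p, given a pool of fixed-size blocks.
    void* baseOf(void* poolStart, void* p, size_t blockSize) {
        auto off = cast(size_t) p - cast(size_t) poolStart;
        return cast(ubyte*) poolStart + off / blockSize * blockSize;
        // e.g. blockSize 32: the slice's pointer rounds down to the start
        // of its 32-byte block.
    }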
Jan 21 2017
On Sunday, 22 January 2017 at 06:28:35 UTC, Chris Wright wrote:
> On Sun, 22 Jan 2017 05:02:43 +0000, Araq wrote:
>> It's an O(1) that requires a hash table lookup in general, because
>> allocations can exceed the chunk size, and so you cannot just mask
>> the pointer and look at the chunk header -- it might not be a chunk
>> header at all. Know any production GCs that use hash table lookups
>> for pointer assignments? Me neither. Ok ok, maybe Go does; it's the
>> only language with GC that embraces interior pointers, as stupid as
>> that is.
>
> Overexplaining because I'm tired.
>
> If you have fixed-size allocations, you get much better time: O(log
> #pools) to find the pool, O(1) to find the offset in that pool. The
> cost is you multiply the number of pools in existence by the number
> of allocation sizes you decide to support.

Sure, O(log #pools) + O(1) is "much better" than O(1).

To shorten the discussion: you're looking for a variant of "card marking". No overhead for interior pointers if you do it right, I think; never worked out the details though.

http://stackoverflow.com/questions/19154607/how-actually-card-table-and-writer-barrier-works
Jan 21 2017
On Sun, 22 Jan 2017 06:44:47 +0000, Araq wrote:
> On Sunday, 22 January 2017 at 06:28:35 UTC, Chris Wright wrote:
>> On Sun, 22 Jan 2017 05:02:43 +0000, Araq wrote:
>>> It's an O(1) that requires a hash table lookup in general, because
>>> allocations can exceed the chunk size, and so you cannot just mask
>>> the pointer and look at the chunk header -- it might not be a chunk
>>> header at all. Know any production GCs that use hash table lookups
>>> for pointer assignments? Me neither. Ok ok, maybe Go does; it's the
>>> only language with GC that embraces interior pointers, as stupid as
>>> that is.
>>
>> Overexplaining because I'm tired.
>>
>> If you have fixed-size allocations, you get much better time: O(log
>> #pools) to find the pool, O(1) to find the offset in that pool. The
>> cost is you multiply the number of pools in existence by the number
>> of allocation sizes you decide to support.
>
> Sure, O(log #pools) + O(1) is "much better" than O(1).

Much better than what's available today, unless I misunderstood the current GC again. A tree lookup with < 100 elements in the tree won't be much slower than a hashtable lookup. If you control virtual memory more strictly (and have virtual memory to burn -- so not on 32-bit), you can put a better constant on the hashtable lookup.

> To shorten the discussion: you're looking for a variant of "card
> marking". No overhead for interior pointers if you do it right, I
> think; never worked out the details though.

Card marking is about write barriers. Write barriers are about incremental collection: caching part of your past GC run information so you can update it using only those pages of memory that have changed since last time. It's low granularity because it's essentially free to read an entire page of memory once you've read one byte. It's also low granularity because one common strategy is to mprotect(3) the pages as read-only and install a fault handler, which is an expensive way of doing things that you want to repeat fewer times if possible.

It's not what I'm looking for.
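For readers who haven't met the mechanism: a card-marking write barrier looks roughly like this (a sketch; the card size and names are illustrative):

    enum cardShift = 9;            // 512-byte cards, one table byte per card
    __gshared ubyte[] cardTable;   // indexed by card number
    __gshared void* heapBase;      // start of the managed heap

    // Every pointer store in mutator code is compiled into this.
    void writeBarrier(void** slot, void* value) {
        *slot = value;
        auto card = (cast(size_t) slot - cast(size_t) heapBase) >> cardShift;
        cardTable[card] = 1;       // dirty: the collector rescans only dirty cards
    }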
Jan 22 2017
On Sunday, 22 January 2017 at 05:02:43 UTC, Araq wrote:
> It's an O(1) that requires a hash table lookup in general, because
> allocations can exceed the chunk size, and so you cannot just mask
> the pointer and look at the chunk header -- it might not be a chunk
> header at all. Know any production GCs that use hash table lookups
> for pointer assignments? Me neither. Ok ok, maybe Go does; it's the
> only language with GC that embraces interior pointers, as stupid as
> that is.

Huge allocs are always handled specially by allocators. The usual technique is via a radix tree. But it doesn't really matter for the discussion at hand: huge allocs are not numerous. If you have 4G of RAM, by definition, you have fewer than a thousand of them with the above-mentioned scheme. The whole lookup data structure can fit in cache.
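A sketch of such a radix-tree lookup for huge allocations (two levels; layout and constants are illustrative):

    enum chunkBits = 22;              // the 4M chunks from the earlier scheme
    enum l1Bits = 10, l2Bits = 10;    // two levels cover 42 bits of address

    struct HugeAlloc { void* base; size_t size; }

    // Sparse two-level tree; every chunk a huge alloc spans points at its
    // entry, so interior pointers into any of its chunks resolve too.
    __gshared HugeAlloc*[][1 << l1Bits] root;

    HugeAlloc* lookupHuge(void* p) {
        auto key  = cast(size_t) p >> chunkBits;          // chunk number
        auto leaf = root[(key >> l2Bits) & ((1 << l1Bits) - 1)];
        return leaf is null ? null : leaf[key & ((1 << l2Bits) - 1)];
    }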
Jan 22 2017