digitalmars.D - Atomic Ref Counting
- dsimcha (75/75) Nov 21 2010 This message originally appeared on the Phobos list, but I've grown impa...
- Jonathan M Davis (4/94) Nov 21 2010 I'm all for atomic ref counting, personally. It simplifies all kinds of ...
- Andrei Alexandrescu (12/18) Nov 21 2010 Thanks for taking the time to build a credible baseline. A lot of this
- dsimcha (5/7) Nov 22 2010 Right, but this isn't a 2.9x slowdown in real world code. It's a 2.9x s...
- Andrei Alexandrescu (6/13) Nov 22 2010 I think the cost is close to the real cost of copying a refcounted
- dsimcha (7/21) Nov 22 2010 Ok, if the destructor issue needs to be solved for the general case anyh...
- Johann MacDonagh (10/85) Nov 23 2010 Did you try the inline ASM block without explicit register preservation?...
- Johann MacDonagh (6/15) Nov 23 2010 Ah nvm. Results in a 6% or so savings. It's definitely the lock prefix
- Don (2/5) Nov 24 2010 Simplifies the syntax considerably. rep is treated in the same way.
This message originally appeared on the Phobos list, but I've grown impatient waiting for responses and decided it needs to see a wider audience:

Now that I have time, I did some benchmarks of atomic ref counting. It is somewhat expensive, but not obscenely so. Here's my benchmark:

import std.stdio, std.datetime, std.container;

void main() {
    Array!uint arr;
    arr.length = 4;

    auto sw = StopWatch(autoStart);
    foreach(i; 0..10_000_000) {
        auto arr2 = arr;
    }
    writeln(sw.peek.milliseconds);
}

Here's a diff of std.typecons:

Index: typecons.d
===================================================================
--- typecons.d  (revision 2181)
+++ typecons.d  (working copy)
@@ -2328,7 +2328,16 @@
     this(this)
     {
         if (!RefCounted.isInitialized) return;
-        ++RefCounted._store._count;
+        //++RefCounted._store._count;
+        auto p = &(RefCounted._store._count);
+        asm {
+            push EAX;
+            mov EAX, p;
+            lock;
+            inc dword ptr[EAX];
+            pop EAX;
+        }
+
         debug(RefCounted) if (RefCounted.debugging)
              writeln(typeof(this).stringof,
                 " ", cast(void*) RefCounted._store, ": bumped refcount to ",
@@ -2345,7 +2354,23 @@
     {
         if (!RefCounted._store) return;
         assert(RefCounted._store._count > 0);
-        if (--RefCounted._store._count)
+
+        // Can't just do lock; dec; here because I need the value afterward.
+        size_t oldVal = void;
+        auto p = &(RefCounted._store._count);
+        asm {
+            push EAX;
+            push EBX;
+            mov EAX, size_t.max;  // + size_t.max is equivalent to - 1.
+            mov EBX, p;
+            lock;
+            xadd [EBX], EAX;
+            mov oldVal, EAX;
+            pop EBX;
+            pop EAX;
+        }
+
+        if (oldVal > 1)
         {
             debug(RefCounted) if (RefCounted.debugging)
                  writeln(typeof(this).stringof,

Note the direct use of ASM. Between function call overhead and the fact that core.atomic uses a CAS loop, using it instead would have about doubled the overhead. I'm not sure how we'll factor this when/if this hits production code.
Anyhow, the timings are:

Atomic:     258 milliseconds total (25.8 nanoseconds per iteration)
Non-atomic:  88 milliseconds total (8.8 nanoseconds per iteration)

I think this is a small enough overhead that we should just make all reference counting atomic. This would make the race condition pointed out by Michael Fortin completely moot IIUC. It would also make it possible to safely share reference counted containers across threads in shared/synchronized wrappers, etc. Furthermore, the cost of atomic reference counting can be mitigated by the standard tricks for dealing with expensive-to-copy objects, though that shouldn't be necessary except in the most extreme conditions.
Nov 21 2010
On Sunday 21 November 2010 17:59:35 dsimcha wrote:
[snip]
I'm all for atomic ref counting, personally. It simplifies all kinds of stuff. And as long as the overhead is small enough, I don't really see a downside.

- Jonathan M Davis
Nov 21 2010
On 11/21/10 7:59 PM, dsimcha wrote:
[snip]
> I think this is a small enough overhead that we should just make all
> reference counting atomic. This would make the race condition pointed
> out by Michael Fortin completely moot IIUC. It would also make it
> possible to safely share reference counted containers across threads
> in shared/synchronized wrappers, etc.

Thanks for taking the time to build a credible baseline. A lot of this depends on the perspective. We're looking at a 2.9x slowdown after all.

After discussing this with Walter, our conclusion was that we need to solve the problem of calling destructors from a different thread in general, not only for reference counting. The fact that you could have arbitrary race conditions in destructors definitely is a problem, one that needs a solution. We currently think the solution should be that the GC guarantees destructor calls in the same thread as the constructor.

Andrei
Nov 21 2010
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
> Thanks for taking the time to build a credible baseline. A lot of this
> depends on the perspective. We're looking at a 2.9x slowdown after all.

Right, but this isn't a 2.9x slowdown in real world code. It's a 2.9x slowdown in a synthetic benchmark that does nothing but increment and decrement reference counts. Real world code would likely have a loop body and stuff. That's why I consider it minor.
Nov 22 2010
On 11/22/10 7:27 AM, dsimcha wrote:
> Right, but this isn't a 2.9x slowdown in real world code. It's a 2.9x
> slowdown in a synthetic benchmark that does nothing but increment and
> decrement reference counts. Real world code would likely have a loop
> body and stuff. That's why I consider it minor.

I think the cost is close to the real cost of copying a refcounted object. Experience with refcounted languages has shown that copies do impact the bottom line considerably. I wouldn't hasten to discount a 2.9x factor.

Andrei
Nov 22 2010
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
> [snip]
> I think the cost is close to the real cost of copying a refcounted
> object. Experience with refcounted languages has shown that copies do
> impact the bottom line considerably. I wouldn't hasten to discount a
> 2.9x factor.

Ok, if the destructor issue needs to be solved for the general case anyhow, then how about making it configurable, with a default of non-atomic? If you want to share a ref counted object across threads and be responsible for your own higher level synchronization issues, you use atomic ref counting. If you want to create a shared wrapper, you make it use atomic under the hood. If you're not sharing the structure across threads, you stick with the default.
Nov 22 2010
On 11/21/2010 8:59 PM, dsimcha wrote:
[snip]
Did you try the inline ASM block without explicit register preservation? Walter replied to your message before (http://www.digitalmars.com/d/archives/digitalmars/D/Register_Preservation_in_Inline_ASM_Blocks_122470.html) and said the compiler will work around any registers you use in inline blocks. I tested it out and I can confirm he wasn't lying ;)

I'm not by a machine with dmd, otherwise I'd try it myself. Try removing the explicit push/pops of the registers and use ecx instead of ebx in the second diff (use of ebx forces dmd to push it in the prologue because it's callee-save). Might save a considerable amount of cycles.
Nov 23 2010
On 11/23/2010 8:13 PM, Johann MacDonagh wrote:
[snip]

Ah nvm, tried it myself. Removing the explicit push/pops results in a 6% or so savings. It's definitely the lock prefix slowing it down.

Mostly OT, but what was the rationale for requiring the lock prefix to be a separate statement in inline ASM? NASM and MASM keep it inline with the statement it affects.
Nov 23 2010
Johann MacDonagh wrote:
> Mostly OT, but what was the rationale for requiring the lock prefix to
> be a separate statement in inline ASM? NASM and MASM keep it inline
> with the statement it affects.

It simplifies the syntax considerably. rep is treated in the same way.
Nov 24 2010