
digitalmars.D - Atomic Ref Counting

dsimcha <dsimcha yahoo.com> writes:
This message originally appeared on the Phobos list, but I've grown impatient
waiting for responses and decided it needs to see a wider audience:

Now that I have time, I did some benchmarks of atomic ref counting.  It is
somewhat expensive, but not obscenely so.  Here's my benchmark:

import std.stdio, std.datetime, std.container;

void main() {
    Array!uint arr;
    arr.length = 4;

    auto sw = StopWatch(autoStart);
    foreach(i; 0..10_000_000) {
        auto arr2 = arr;
    }

    writeln(sw.peek.milliseconds);
}

Here's a diff of std.typecons:

Index: typecons.d
===================================================================
--- typecons.d    (revision 2181)
+++ typecons.d    (working copy)
@@ -2328,7 +2328,16 @@
     this(this)
     {
         if (!RefCounted.isInitialized) return;
-        ++RefCounted._store._count;
+        //++RefCounted._store._count;
+        auto p = &(RefCounted._store._count);
+        asm {
+            push EAX;
+            mov EAX, p;
+            lock;
+            inc dword ptr[EAX];
+            pop EAX;
+        }
+
         debug(RefCounted) if (RefCounted.debugging)
                  writeln(typeof(this).stringof,
                 " ", cast(void*) RefCounted._store, ": bumped refcount to ",
@@ -2345,7 +2354,23 @@
     {
         if (!RefCounted._store) return;
         assert(RefCounted._store._count > 0);
-        if (--RefCounted._store._count)
+
+        // Can't just do lock; dec; here because I need the value afterwards.
+        size_t oldVal = void;
+        auto p = &(RefCounted._store._count);
+        asm {
+            push EAX;
+            push EBX;
+            mov EAX, size_t.max;  // + size_t.max is equivalent to - 1.
+            mov EBX, p;
+            lock;
+            xadd [EBX], EAX;
+            mov oldVal, EAX;
+            pop EBX;
+            pop EAX;
+        }
+
+        if (oldVal > 1)
         {
             debug(RefCounted) if (RefCounted.debugging)
                      writeln(typeof(this).stringof,

Note the direct use of ASM.  Between function call overhead and the fact that
core.atomic uses a CAS loop, using it instead would have about doubled the
overhead.  I'm not sure how we'll factor this when/if this hits production code.
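For comparison, here is a minimal sketch of what the same two operations
would look like written against core.atomic (the helper names and the
shared typing of the count are assumptions for illustration, not part of
the patch above):

import core.atomic;

// Hypothetical helpers, not the actual RefCounted code.
void bump(ref shared size_t count) {
    atomicOp!"+="(count, 1);  // core.atomic does this with a CAS loop
}

bool release(ref shared size_t count) {
    // atomicOp returns the new value, so the caller can tell whether
    // this was the last reference.
    return atomicOp!"-="(count, 1) == 0;
}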

Anyhow, the timings are:

Atomic:  258 milliseconds total (25.8 nanoseconds per iteration)
Non-atomic:  88 milliseconds  (8.8 nanoseconds per iteration)

I think this is a small enough overhead that we should just make all reference
counting atomic.  This would make the race condition pointed out by Michael
Fortin completely moot IIUC.  It would also make it possible to safely share
reference counted containers across threads in shared/synchronized wrappers,
etc.

Furthermore, the cost of atomic reference counting can be mitigated by the
standard tricks for dealing with expensive to copy objects, though it
shouldn't be necessary except in the most extreme conditions.
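As a toy illustration of one such trick (not part of the patch), binding the
container by ref avoids the postblit and therefore all of the refcount
traffic at the call site:

import std.container;

// Taking the Array by ref means no copy is made, so neither the atomic
// increment nor the decrement ever runs for this call.
size_t sumElements(ref Array!uint arr) {
    size_t total;
    foreach (x; arr[])   // arr[] ranges over the elements; arr itself is not copied
        total += x;
    return total;
}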
Nov 21 2010
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday 21 November 2010 17:59:35 dsimcha wrote:
[snip]
 Atomic:  258 milliseconds total (25.8 nanoseconds per iteration)
 Non-atomic:  88 milliseconds  (8.8 nanoseconds per iteration)

 I think this is a small enough overhead that we should just make all
 reference counting atomic.  This would make the race condition pointed out
 by Michael Fortin completely moot IIUC.  It would also make it possible to
 safely share reference counted containers across threads in
 shared/synchronized wrappers, etc.
I'm all for atomic ref counting, personally. It simplifies all kinds of stuff. And as long as the overhead is small enough, I don't really see a downside.

- Jonathan M Davis
Nov 21 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 11/21/10 7:59 PM, dsimcha wrote:
[snip]
 Atomic:  258 milliseconds total (25.8 nanoseconds per iteration)
 Non-atomic:  88 milliseconds  (8.8 nanoseconds per iteration)

 I think this is a small enough overhead that we should just make all reference
 counting atomic.  This would make the race condition pointed out by Michael
 Fortin completely moot IIUC.  It would also make it possible to safely share
 reference counted containers across threads in shared/synchronized wrappers,
etc.
Thanks for taking the time to build a credible baseline. A lot of this depends on the perspective. We're looking at a 2.9x slowdown after all.

After discussing this with Walter, our conclusion was that we need to solve the problem of calling destructors from a different thread in general, not only for reference counting. The fact that you could have arbitrary race conditions in destructors definitely is a problem, one that needs a solution. We currently think the solution should be that the GC guarantees destructor calls in the same thread as the constructor.

Andrei
Nov 21 2010
dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 Thanks for taking the time to build a credible baseline. A lot of this
 depends on the perspective. We're looking at a 2.9x slowdown after all.
Right, but this isn't a 2.9x slowdown in real world code. It's a 2.9x slowdown in a synthetic benchmark that does nothing but increment and decrement reference counts. Real world code would likely have a loop body and stuff. That's why I consider it minor.
Nov 22 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 11/22/10 7:27 AM, dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 Thanks for taking the time to build a credible baseline. A lot of this
 depends on the perspective. We're looking at a 2.9x slowdown after all.
Right, but this isn't a 2.9x slowdown in real world code. It's a 2.9x slowdown in a synthetic benchmark that does nothing but increment and decrement reference counts. Real world code would likely have a loop body and stuff. That's why I consider it minor.
I think the cost is close to the real cost of copying a refcounted object. Experience with refcounted languages has shown that copies do impact the bottom line considerably. I wouldn't hasten to discount a 2.9x factor.

Andrei
Nov 22 2010
dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 On 11/22/10 7:27 AM, dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 Thanks for taking the time to build a credible baseline. A lot of this
 depends on the perspective. We're looking at a 2.9x slowdown after all.
Right, but this isn't a 2.9x slowdown in real world code. It's a 2.9x slowdown in a synthetic benchmark that does nothing but increment and decrement reference counts. Real world code would likely have a loop body and stuff. That's why I consider it minor.
 I think the cost is close to the real cost of copying a refcounted object. Experience with refcounted languages has shown that copies do impact the bottom line considerably. I wouldn't hasten to discount a 2.9x factor. Andrei
Ok, if the destructor issue needs to be solved for the general case anyhow, then how about making it configurable with a default of non-atomic? If you want to share a ref counted object across threads and be responsible for your own higher level synchronization issues, you use atomic ref counting. If you want to create a shared wrapper, you make it use atomic under the hood. If you're not sharing the structure across threads, you stick with the default.
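A minimal sketch of what that knob might look like (the names Atomicity and
MyRefCounted are illustrative only, not a proposed Phobos API):

import core.atomic;

enum Atomicity { nonAtomic, atomic }

struct MyRefCounted(T, Atomicity atomicity = Atomicity.nonAtomic) {
    private static struct Store { T payload; size_t count; }
    private Store* _store;

    this(this) {
        if (_store is null) return;
        static if (atomicity == Atomicity.atomic)
            atomicOp!"+="(*cast(shared(size_t)*) &_store.count, 1); // locked
        else
            ++_store.count;                                         // plain
    }
    // ~this() would mirror the same static if for the decrement.
}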
Nov 22 2010
Johann MacDonagh <johann.macdonagh..no spam..gmail.com> writes:
On 11/21/2010 8:59 PM, dsimcha wrote:
[snip]
 Note the direct use of ASM.  Between function call overhead and the fact
 that core.atomic uses a CAS loop, using it instead would have about doubled
 the overhead.  I'm not sure how we'll factor this when/if this hits
 production code.
Did you try the inline ASM block without explicit register preservation? Walter replied to your message before (http://www.digitalmars.com/d/archives/digitalmars/D/Register_Preservation_in_Inline_ASM_Blocks_122470.html) and said the compiler will work around any registers you use in inline asm blocks. I tested it out and I can confirm he wasn't lying ;)

I'm not by a machine with dmd, otherwise I'd try it myself. Try removing the explicit push/pops of the registers and use ECX instead of EBX in the second diff (using EBX forces dmd to push it in the prologue because it's callee-save). Might save a considerable amount of cycles.
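For concreteness, a sketch of the decrement written the way Johann suggests
(a hypothetical helper, untested: no explicit push/pop, and ECX instead of
the callee-save EBX):

size_t atomicDecrement(size_t* p) {
    size_t oldVal = void;
    asm {
        mov EAX, size_t.max;   // adding size_t.max is the same as subtracting 1
        mov ECX, p;
        lock;
        xadd [ECX], EAX;       // EAX receives the count as it was before the decrement
        mov oldVal, EAX;
    }
    return oldVal;             // the last reference was released if this returns 1
}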
Nov 23 2010
Johann MacDonagh <johann.macdonagh..no spam..gmail.com> writes:
On 11/23/2010 8:13 PM, Johann MacDonagh wrote:
 Did you try the inline ASM block without explicit register preservation?
 Walter replied to your message before
 (http://www.digitalmars.com/d/archives/digitalmars/D/Register_Preservation_in_Inline_ASM_Blocks_122470.html)
 and said the compiler will work around any registers you use in inline
 blocks. I tested it out and I can confirm he wasn't lying ;)

 I'm not by a machine with dmd otherwise I'd try myself. Try removing the
 explicit push/pops of the registers and use ecx instead of ebx in the
 second diff (use of it forces dmd to push ebx in the prologue because
 its callee save). Might save a considerable amount of cycles.
Ah nvm. Removing them results in only a 6% or so savings. It's definitely the lock prefix slowing it down.

Mostly OT, but what was the rationale for requiring the lock prefix to be a separate statement in inline ASM? NASM and MASM keep it inline with the statement it affects.
Nov 23 2010
Don <nospam nospam.com> writes:
Johann MacDonagh wrote:
 Mostly OT, what was the rationale for requiring the lock prefix being a 
 separate statement in inline ASM? NASM and MASM keep it inline with the 
 statement it affects.
Simplifies the syntax considerably. rep is treated in the same way.
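For illustration (an assumed minimal example, not code from the thread),
DMD's inline assembler writes the prefix as its own statement in front of
the instruction it modifies:

void prefixDemo(uint* counter) {
    asm {
        mov EAX, counter;
        lock;                  // MASM/NASM would write "lock inc dword ptr [eax]"
        inc dword ptr [EAX];
        // rep, repe, repne, etc. are written the same way.
    }
}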
Nov 24 2010