
digitalmars.D - Atomic Ref Counting

dsimcha <dsimcha yahoo.com> writes:
This message originally appeared on the Phobos list, but I've grown impatient
waiting for responses and decided it needs to see a wider audience:

Now that I have time, I did some benchmarks of atomic ref counting.  It is
somewhat expensive, but not obscenely so.  Here's my benchmark:

import std.stdio, std.datetime, std.container;

void main() {
    Array!uint arr;
    arr.length = 4;

    auto sw = StopWatch(autoStart);
    foreach(i; 0..10_000_000) {
        auto arr2 = arr;
    }

    writeln(sw.peek.milliseconds);
}

Here's a diff of std.typecons:

Index: typecons.d
===================================================================
--- typecons.d    (revision 2181)
+++ typecons.d    (working copy)
@@ -2328,7 +2328,16 @@
     this(this)
     {
         if (!RefCounted.isInitialized) return;
-        ++RefCounted._store._count;
+        //++RefCounted._store._count;
+        auto p = &(RefCounted._store._count);
+        asm {
+            push EAX;
+            mov EAX, p;
+            lock;
+            inc dword ptr[EAX];
+            pop EAX;
+        }
+
         debug(RefCounted) if (RefCounted.debugging)
                  writeln(typeof(this).stringof,
                 " ", cast(void*) RefCounted._store, ": bumped refcount to ",
@@ -2345,7 +2354,23 @@
     {
         if (!RefCounted._store) return;
         assert(RefCounted._store._count > 0);
-        if (--RefCounted._store._count)
+
+        // Can't just do lock; dec; here because I need the value afterwards.
+        size_t oldVal = void;
+        auto p = &(RefCounted._store._count);
+        asm {
+            push EAX;
+            push EBX;
+            mov EAX, size_t.max;  // + size_t.max is equivalent to - 1.
+            mov EBX, p;
+            lock;
+            xadd [EBX], EAX;
+            mov oldVal, EAX;
+            pop EBX;
+            pop EAX;
+        }
+
+        if (oldVal > 1)
         {
             debug(RefCounted) if (RefCounted.debugging)
                      writeln(typeof(this).stringof,

Note the direct use of ASM.  Between function call overhead and the fact that
core.atomic uses a CAS loop, using it instead would have about doubled the
overhead.  I'm not sure how we'll factor this when/if this hits production code.
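For comparison, here is a minimal sketch of what the same two operations
would look like written against core.atomic (the helper names and the
shared typing of the count are assumptions for illustration, not part of
the patch above):

import core.atomic;

// Hypothetical helpers, not the actual RefCounted code.
void bump(ref shared size_t count) {
    atomicOp!"+="(count, 1);  // core.atomic does this with a CAS loop
}

bool release(ref shared size_t count) {
    // atomicOp returns the new value, so the caller can tell whether
    // this was the last reference.
    return atomicOp!"-="(count, 1) == 0;
}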

Anyhow, the timings are:

Atomic:  258 milliseconds total (25.8 nanoseconds per iteration)
Non-atomic:  88 milliseconds  (8.8 nanoseconds per iteration)

I think this is a small enough overhead that we should just make all reference
counting atomic.  This would make the race condition pointed out by Michael
Fortin completely moot IIUC.  It would also make it possible to safely share
reference counted containers across threads in shared/synchronized wrappers,
etc.

Furthermore, the cost of atomic reference counting can be mitigated by the
standard tricks for dealing with expensive to copy objects, though it
shouldn't be necessary except in the most extreme conditions.
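As a toy illustration of one such trick (not part of the patch), binding the
container by ref avoids the postblit and therefore all of the refcount
traffic at the call site:

import std.container;

// Taking the Array by ref means no copy is made, so neither the atomic
// increment nor the decrement ever runs for this call.
size_t sumElements(ref Array!uint arr) {
    size_t total;
    foreach (x; arr[])   // arr[] ranges over the elements; arr itself is not copied
        total += x;
    return total;
}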
Nov 21 2010
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday 21 November 2010 17:59:35 dsimcha wrote:
[snip]
 Atomic:  258 milliseconds total (25.8 nanoseconds per iteration)
 Non-atomic:  88 milliseconds  (8.8 nanoseconds per iteration)

 I think this is a small enough overhead that we should just make all
 reference counting atomic.  This would make the race condition pointed out
 by Michael Fortin completely moot IIUC.  It would also make it possible to
 safely share reference counted containers across threads in
 shared/synchronized wrappers, etc.
I'm all for atomic ref counting, personally. It simplifies all kinds of stuff. And as long as the overhead is small enough, I don't really see a downside.

- Jonathan M Davis
Nov 21 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 11/21/10 7:59 PM, dsimcha wrote:
[snip]
 Atomic:  258 milliseconds total (25.8 nanoseconds per iteration)
 Non-atomic:  88 milliseconds  (8.8 nanoseconds per iteration)

 I think this is a small enough overhead that we should just make all reference
 counting atomic.  This would make the race condition pointed out by Michael
 Fortin completely moot IIUC.  It would also make it possible to safely share
 reference counted containers across threads in shared/synchronized wrappers,
etc.
Thanks for taking the time to build a credible baseline. A lot of this depends on the perspective. We're looking at a 2.9x slowdown after all.

After discussing this with Walter, our conclusion was that we need to solve the problem of calling destructors from a different thread in general, not only for reference counting. The fact that you could have arbitrary race conditions in destructors definitely is a problem, one that needs a solution. We currently think the solution should be that the GC guarantees destructor calls in the same thread as the constructor.

Andrei
Nov 21 2010
dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 Thanks for taking the time to build a credible baseline. A lot of this
 depends on the perspective. We're looking at a 2.9x slowdown after all.
Right, but this isn't a 2.9x slowdown in real world code. It's a 2.9x slowdown in a synthetic benchmark that does nothing but increment and decrement reference counts. Real world code would likely have a loop body and stuff. That's why I consider it minor.
Nov 22 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 11/22/10 7:27 AM, dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 Thanks for taking the time to build a credible baseline. A lot of this
 depends on the perspective. We're looking at a 2.9x slowdown after all.
Right, but this isn't a 2.9x slowdown in real world code. It's a 2.9x slowdown in a synthetic benchmark that does nothing but increment and decrement reference counts. Real world code would likely have a loop body and stuff. That's why I consider it minor.
I think the cost is close to the real cost of copying a refcounted object. Experience with refcounted languages has shown that copies do impact the bottom line considerably. I wouldn't hasten to discount a 2.9x factor.

Andrei
Nov 22 2010
dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 On 11/22/10 7:27 AM, dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 Thanks for taking the time to build a credible baseline. A lot of this
 depends on the perspective. We're looking at a 2.9x slowdown after all.
Right, but this isn't a 2.9x slowdown in real world code. It's a 2.9x slowdown in a synthetic benchmark that does nothing but increment and decrement reference counts. Real world code would likely have a loop body and stuff. That's why I consider it minor.
 I think the cost is close to the real cost of copying a refcounted object. Experience with refcounted languages has shown that copies do impact the bottom line considerably. I wouldn't hasten to discount a 2.9x factor. Andrei
Ok, if the destructor issue needs to be solved for the general case anyhow, then how about making it configurable with a default of non-atomic? If you want to share a ref counted object across threads and be responsible for your own higher level synchronization issues, you use atomic ref counting. If you want to create a shared wrapper, you make it use atomic under the hood. If you're not sharing the structure across threads, you stick with the default.
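A minimal sketch of what that knob might look like (the names Atomicity and
MyRefCounted are illustrative only, not a proposed Phobos API):

import core.atomic;

enum Atomicity { nonAtomic, atomic }

struct MyRefCounted(T, Atomicity atomicity = Atomicity.nonAtomic) {
    private static struct Store { T payload; size_t count; }
    private Store* _store;

    this(this) {
        if (_store is null) return;
        static if (atomicity == Atomicity.atomic)
            atomicOp!"+="(*cast(shared(size_t)*) &_store.count, 1); // locked
        else
            ++_store.count;                                         // plain
    }
    // ~this() would mirror the same static if for the decrement.
}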
Nov 22 2010
Johann MacDonagh <johann.macdonagh..no spam..gmail.com> writes:
On 11/21/2010 8:59 PM, dsimcha wrote:
[snip]
 Note the direct use of ASM.  Between function call overhead and the fact
 that core.atomic uses a CAS loop, using it instead would have about doubled
 the overhead.  I'm not sure how we'll factor this when/if this hits
 production code.
Did you try the inline ASM block without explicit register preservation? Walter replied to your message before (http://www.digitalmars.com/d/archives/digitalmars/D/Register_Preservation_in_Inline_ASM_Blocks_122470.html) and said the compiler will work around any registers you use in inline asm blocks. I tested it out and I can confirm he wasn't lying ;)

I'm not by a machine with dmd, otherwise I'd try it myself. Try removing the explicit push/pops of the registers and use ECX instead of EBX in the second diff (using EBX forces dmd to push it in the prologue because it's callee-save). Might save a considerable amount of cycles.
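For concreteness, a sketch of the decrement written the way Johann suggests
(a hypothetical helper, untested: no explicit push/pop, and ECX instead of
the callee-save EBX):

size_t atomicDecrement(size_t* p) {
    size_t oldVal = void;
    asm {
        mov EAX, size_t.max;   // adding size_t.max is the same as subtracting 1
        mov ECX, p;
        lock;
        xadd [ECX], EAX;       // EAX receives the count as it was before the decrement
        mov oldVal, EAX;
    }
    return oldVal;             // the last reference was released if this returns 1
}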
Nov 23 2010
Johann MacDonagh <johann.macdonagh..no spam..gmail.com> writes:
On 11/23/2010 8:13 PM, Johann MacDonagh wrote:
 Did you try the inline ASM block without explicit register preservation?
 Walter replied to your message before
 (http://www.digitalmars.com/d/archives/digitalmars/D/Register_Preservation_in_Inline_ASM_Blocks_122470.html)
 and said the compiler will work around any registers you use in inline
 blocks. I tested it out and I can confirm he wasn't lying ;)

 I'm not by a machine with dmd otherwise I'd try myself. Try removing the
 explicit push/pops of the registers and use ecx instead of ebx in the
 second diff (use of it forces dmd to push ebx in the prologue because
 its callee save). Might save a considerable amount of cycles.
Ah nvm. Removing them results in only a 6% or so savings. It's definitely the lock prefix slowing it down.

Mostly OT, but what was the rationale for requiring the lock prefix to be a separate statement in inline ASM? NASM and MASM keep it inline with the statement it affects.
Nov 23 2010
Don <nospam nospam.com> writes:
Johann MacDonagh wrote:
 Mostly OT, what was the rationale for requiring the lock prefix being a 
 separate statement in inline ASM? NASM and MASM keep it inline with the 
 statement it affects.
Simplifies the syntax considerably. rep is treated in the same way.
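For illustration (an assumed minimal example, not code from the thread),
DMD's inline assembler writes the prefix as its own statement in front of
the instruction it modifies:

void prefixDemo(uint* counter) {
    asm {
        mov EAX, counter;
        lock;                  // MASM/NASM would write "lock inc dword ptr [eax]"
        inc dword ptr [EAX];
        // rep, repe, repne, etc. are written the same way.
    }
}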
Nov 24 2010