digitalmars.D - rt_finalize WTFs?

dsimcha (59/59) Dec 04 2011 I'm at my traditional passtime of trying to speed up D's garbage

Martin Nowak (15/74) Dec 04 2011 Not for the try block. With errors being not recoverable you don't need ...

dsimcha (9/88) Dec 04 2011 Thanks for the benchmark. I ended up deciding to just create a second

Jonathan M Davis (7/15) Dec 04 2011 It's best to avoid code duplication in general, but I'd argue that if on...

bearophile (6/16) Dec 04 2011 That if(p) seems fit to become a static if on bool template argument (bu...
Rainer Schuetze (8/65) Dec 05 2011 Last time I looked at it, the try/catch/finally block was rather

Vladimir Panteleev (10/14) Dec 05 2011 This sounds like a good idea. Just make sure that all code paths that le...
Martin Nowak (23/95) Dec 05 2011 Just an unconditional jump into the finally body here, but

dsimcha (21/36) Dec 05 2011 The reason the gain wasn't huge is because on the benchmark I have that ...

Vladimir Panteleev (6/14) Dec 05 2011 A tree, with a few bits of the address space per level. It becomes bound...
Martin Nowak (126/178) Dec 05 2011 Probably write a tool that can replay recorded allocations profiles.
Iain Buclaw (9/14) Dec 05 2011 Indeed. The current implementation kicking around is to scan

Jacob Carlborg (4/16) Dec 05 2011 /proc doesn't exist on Mac OS X.

Martin Nowak (2/10) Dec 05 2011 Actually 64-bit should use a hashtable for the upper 32-bit and then

dsimcha (5/17) Dec 05 2011 Why do you expect this to be faster than a binary search? I'm not sayin...

Martin Nowak (15/34) Dec 05 2011 It's the tightest loop in the garbage collector.

Steven Schveighoffer (12/70) Dec 05 2011 I think it might be a good idea to move those two operations to the clea...
Robert Clipsham (6/8) Dec 05 2011 Have you thought about pushing for the inclusion of CDGC at all/working

Trass3r (1/6) Dec 05 2011 So true, it's been rotting in that branch.

dsimcha (11/18) Dec 05 2011 IIRC CDGC includes two major enhancements:

Trass3r (3/13) Dec 05 2011 What's the status of that anyway? Every patch in that bugzilla entry is ...

Sean Kelly (7/14) Dec 06 2011 I have an updated and win32-compilable version of CDGC in a branch. It's...

Sean Kelly (22/68) Dec 07 2011 collector again, and I've stumbled on the fact that rt_finalize is =
bearophile (4/6) Dec 09 2011 Are you going to create a pull request soon? So people will be able to t...

dsimcha <dsimcha yahoo.com> writes:

I'm at my traditional passtime of trying to speed up D's garbage 
collector again, and I've stumbled on the fact that rt_finalize is 
taking up a ridiculous share of the time (~30% of total runtime) on a 
benchmark where huge numbers of classes **that don't have destructors** 
are being created and collected.  Here's the code to this function, from 
lifetime.d:

extern (C) void rt_finalize(void* p, bool det = true)
{
     debug(PRINTF) printf("rt_finalize(p = %p)\n", p);

     if (p) // not necessary if called from gc
     {
         ClassInfo** pc = cast(ClassInfo**)p;

         if (*pc)
         {
             ClassInfo c = **pc;
             byte[]    w = c.init;

             try
             {
                 if (det || collectHandler is null || 
collectHandler(cast(Object)p))
                 {
                     do
                     {
                         if (c.destructor)
                         {
                             fp_t fp = cast(fp_t)c.destructor;
                             (*fp)(cast(Object)p); // call destructor
                         }
                         c = c.base;
                     } while (c);
                 }
                 if ((cast(void**)p)[1]) // if monitor is not null
                     _d_monitordelete(cast(Object)p, det);
                 (cast(byte*) p)[0 .. w.length] = w[];  // WTF?
             }
             catch (Throwable e)
             {
                 onFinalizeError(**pc, e);
             }
             finally  // WTF?
             {
                 *pc = null; // zero vptr
             }
         }
     }
}

Getting rid of the stuff I've marked with //WTF? comments (namely the 
finally block and the re-initializing of the memory occupied by the 
finalized object) speeds things up by ~15% on the benchmark in question. 
  Why do we care what state the blob of memory is left in after we 
finalize it?  I can kind of see that we want to clear things if 
delete/clear was called manually and we want to leave the object in a 
state that doesn't look valid.  However, this has significant 
performance costs and IIRC is already done in clear() and delete is 
supposed to be deprecated.  Furthermore, I'd like to get rid of the 
finally block entirely, since I assume its presence and the effect on 
the generated code is causing the slowdown, not the body, which just 
assigns a pointer.

Is there any good reason to keep this code around?

Dec 04 2011

"Martin Nowak" <dawg dawgfoto.de> writes:

On Mon, 05 Dec 2011 02:46:27 +0100, dsimcha <dsimcha yahoo.com> wrote:

 I'm at my traditional passtime of trying to speed up D's garbage  
 collector again, and I've stumbled on the fact that rt_finalize is  
 taking up a ridiculous share of the time (~30% of total runtime) on a  
 benchmark where huge numbers of classes **that don't have destructors**  
 are being created and collected.  Here's the code to this function, from  
 lifetime.d:

 extern (C) void rt_finalize(void* p, bool det = true)
 {
      debug(PRINTF) printf("rt_finalize(p = %p)\n", p);

      if (p) // not necessary if called from gc
      {
          ClassInfo** pc = cast(ClassInfo**)p;

          if (*pc)
          {
              ClassInfo c = **pc;
              byte[]    w = c.init;

              try
              {
                  if (det || collectHandler is null ||  
 collectHandler(cast(Object)p))
                  {
                      do
                      {
                          if (c.destructor)
                          {
                              fp_t fp = cast(fp_t)c.destructor;
                              (*fp)(cast(Object)p); // call destructor
                          }
                          c = c.base;
                      } while (c);
                  }
                  if ((cast(void**)p)[1]) // if monitor is not null
                      _d_monitordelete(cast(Object)p, det);
                  (cast(byte*) p)[0 .. w.length] = w[];  // WTF?
              }
              catch (Throwable e)
              {
                  onFinalizeError(**pc, e);
              }
              finally  // WTF?
              {
                  *pc = null; // zero vptr
              }
          }
      }
 }

 Getting rid of the stuff I've marked with //WTF? comments (namely the  
 finally block and the re-initializing of the memory occupied by the  
 finalized object) speeds things up by ~15% on the benchmark in question.  
   Why do we care what state the blob of memory is left in after we  
 finalize it?  I can kind of see that we want to clear things if  
 delete/clear was called manually and we want to leave the object in a  
 state that doesn't look valid.  However, this has significant  
 performance costs and IIRC is already done in clear() and delete is  
 supposed to be deprecated.  Furthermore, I'd like to get rid of the  
 finally block entirely, since I assume its presence and the effect on  
 the generated code is causing the slowdown, not the body, which just  
 assigns a pointer.

 Is there any good reason to keep this code around?

Not for the try block. With errors being not recoverable you don't need to  
care
about zeroing the vtbl or you could just copy the code into the catch  
handler.
This seems to cause less spilled variables.

Most expensive is the call to a memcpy PLT, replace it with something  
inlineable.
Zeroing is not much faster than copying init[] for small classes.

At least zeroing should be worth it, unless the GC would not scan the  
memory otherwise.

gcbench/tree1 => 41.8s => https://gist.github.com/1432117 => gcbench/tree1  
=> 33.4s

Please add useful benchmarks to druntime.

martin

Dec 04 2011

dsimcha <dsimcha yahoo.com> writes:

Thanks for the benchmark.  I ended up deciding to just create a second 
function, rt_finalize_gc, that gets rid of a whole bunch of cruft that 
isn't necessary in the GC case.  I think it's worth the small amount of 
code duplication it creates.  Here are the results of my efforts so far: 
  https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 . 
I've got one other good idea that I think will shave a few seconds off 
the Tree1 benchmark if I don't run into any unforeseen obstacles in 
implementing it.

On 12/4/2011 10:07 PM, Martin Nowak wrote:
 On Mon, 05 Dec 2011 02:46:27 +0100, dsimcha <dsimcha yahoo.com> wrote:

 I'm at my traditional passtime of trying to speed up D's garbage
 collector again, and I've stumbled on the fact that rt_finalize is
 taking up a ridiculous share of the time (~30% of total runtime) on a
 benchmark where huge numbers of classes **that don't have
 destructors** are being created and collected. Here's the code to this
 function, from lifetime.d:

 extern (C) void rt_finalize(void* p, bool det = true)
 {
 debug(PRINTF) printf("rt_finalize(p = %p)\n", p);

 if (p) // not necessary if called from gc
 {
 ClassInfo** pc = cast(ClassInfo**)p;

 if (*pc)
 {
 ClassInfo c = **pc;
 byte[] w = c.init;

 try
 {
 if (det || collectHandler is null || collectHandler(cast(Object)p))
 {
 do
 {
 if (c.destructor)
 {
 fp_t fp = cast(fp_t)c.destructor;
 (*fp)(cast(Object)p); // call destructor
 }
 c = c.base;
 } while (c);
 }
 if ((cast(void**)p)[1]) // if monitor is not null
 _d_monitordelete(cast(Object)p, det);
 (cast(byte*) p)[0 .. w.length] = w[]; // WTF?
 }
 catch (Throwable e)
 {
 onFinalizeError(**pc, e);
 }
 finally // WTF?
 {
 *pc = null; // zero vptr
 }
 }
 }
 }

 Getting rid of the stuff I've marked with //WTF? comments (namely the
 finally block and the re-initializing of the memory occupied by the
 finalized object) speeds things up by ~15% on the benchmark in
 question. Why do we care what state the blob of memory is left in
 after we finalize it? I can kind of see that we want to clear things
 if delete/clear was called manually and we want to leave the object in
 a state that doesn't look valid. However, this has significant
 performance costs and IIRC is already done in clear() and delete is
 supposed to be deprecated. Furthermore, I'd like to get rid of the
 finally block entirely, since I assume its presence and the effect on
 the generated code is causing the slowdown, not the body, which just
 assigns a pointer.

 Is there any good reason to keep this code around?

 Not for the try block. With errors being not recoverable you don't need
 to care
 about zeroing the vtbl or you could just copy the code into the catch
 handler.
 This seems to cause less spilled variables.

 Most expensive is the call to a memcpy PLT, replace it with something
 inlineable.
 Zeroing is not much faster than copying init[] for small classes.

 At least zeroing should be worth it, unless the GC would not scan the
 memory otherwise.

 gcbench/tree1 => 41.8s => https://gist.github.com/1432117 =>
 gcbench/tree1 => 33.4s

 Please add useful benchmarks to druntime.

 martin

Dec 04 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, December 04, 2011 23:41:08 dsimcha wrote:
 Thanks for the benchmark.  I ended up deciding to just create a second
 function, rt_finalize_gc, that gets rid of a whole bunch of cruft that
 isn't necessary in the GC case.  I think it's worth the small amount of
 code duplication it creates.  Here are the results of my efforts so far:
   https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 .
 I've got one other good idea that I think will shave a few seconds off
 the Tree1 benchmark if I don't run into any unforeseen obstacles in
 implementing it.

It's best to avoid code duplication in general, but I'd argue that if only a 
small amount of code duplication is required to get a big gain from the GC, 
it's well worth it. In general, anything we can do to improve the GC is 
worthwhile, since it's _the_ place where D risks being inefficient in
comparison 
to languages such as C++.

- Jonathan M Davis

Dec 04 2011

bearophile <bearophileHUGS lycos.com> writes:

dsimcha:

 I've stumbled on the fact that rt_finalize is 
 taking up a ridiculous share of the time (~30% of total runtime) on a 
 benchmark where huge numbers of classes **that don't have destructors** 
 are being created and collected.

DMD or LDC/GDC compiler?


 extern (C) void rt_finalize(void* p, bool det = true)
 {
      debug(PRINTF) printf("rt_finalize(p = %p)\n", p);
 
      if (p) // not necessary if called from gc
      {

That if(p) seems fit to become a static if on bool template argument (but it
needs to become D code or two C functions):

void rt_finalize(bool byGC)(void* p, bool det = true) { ...

Bye,
bearophile

Dec 04 2011

Rainer Schuetze <r.sagitario gmx.de> writes:

Last time I looked at it, the try/catch/finally block was rather 
expensive because it always invokes the exception handler unwinding 
mechanism, even if no exception occurs.
Try moving the try/catch block out of the loops that call rt_finalize. 
(maybe just remove it, onFinalizeError just rethrows, and it seems noone 
is using the information that the exception is thrown from inside the 
finalizer)

On 05.12.2011 02:46, dsimcha wrote:
 I'm at my traditional passtime of trying to speed up D's garbage
 collector again, and I've stumbled on the fact that rt_finalize is
 taking up a ridiculous share of the time (~30% of total runtime) on a
 benchmark where huge numbers of classes **that don't have destructors**
 are being created and collected. Here's the code to this function, from
 lifetime.d:

 extern (C) void rt_finalize(void* p, bool det = true)
 {
 debug(PRINTF) printf("rt_finalize(p = %p)\n", p);

 if (p) // not necessary if called from gc
 {
 ClassInfo** pc = cast(ClassInfo**)p;

 if (*pc)
 {
 ClassInfo c = **pc;
 byte[] w = c.init;

 try
 {
 if (det || collectHandler is null || collectHandler(cast(Object)p))
 {
 do
 {
 if (c.destructor)
 {
 fp_t fp = cast(fp_t)c.destructor;
 (*fp)(cast(Object)p); // call destructor
 }
 c = c.base;
 } while (c);
 }
 if ((cast(void**)p)[1]) // if monitor is not null
 _d_monitordelete(cast(Object)p, det);
 (cast(byte*) p)[0 .. w.length] = w[]; // WTF?
 }
 catch (Throwable e)
 {
 onFinalizeError(**pc, e);
 }
 finally // WTF?
 {
 *pc = null; // zero vptr
 }
 }
 }
 }

 Getting rid of the stuff I've marked with //WTF? comments (namely the
 finally block and the re-initializing of the memory occupied by the
 finalized object) speeds things up by ~15% on the benchmark in question.
 Why do we care what state the blob of memory is left in after we
 finalize it? I can kind of see that we want to clear things if
 delete/clear was called manually and we want to leave the object in a
 state that doesn't look valid. However, this has significant performance
 costs and IIRC is already done in clear() and delete is supposed to be
 deprecated. Furthermore, I'd like to get rid of the finally block
 entirely, since I assume its presence and the effect on the generated
 code is causing the slowdown, not the body, which just assigns a pointer.

 Is there any good reason to keep this code around?

Dec 05 2011

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Mon, 05 Dec 2011 10:14:00 +0200, Rainer Schuetze <r.sagitario gmx.de>  
wrote:

 Try moving the try/catch block out of the loops that call rt_finalize.

This sounds like a good idea. Just make sure that all code paths that lead  
to rt_finalize (e.g. delete) can recover from an exception.

 (maybe just remove it, onFinalizeError just rethrows, and it seems noone  
 is using the information that the exception is thrown from inside the  
 finalizer)

The point of onFinalizeError is that it throws an *error* (it is  
unrecoverable). Currently, the GC cannot recover from an exception thrown  
by a finalizer, which is why onFinalizeError exists.

-- 
Best regards,
  Vladimir                            mailto:vladimir thecybershadow.net

Dec 05 2011

"Martin Nowak" <dawg dawgfoto.de> writes:

On Mon, 05 Dec 2011 09:14:00 +0100, Rainer Schuetze <r.sagitario gmx.de>  
wrote:

 Last time I looked at it, the try/catch/finally block was rather  
 expensive because it always invokes the exception handler unwinding  
 mechanism, even if no exception occurs.
 Try moving the try/catch block out of the loops that call rt_finalize.  
 (maybe just remove it, onFinalizeError just rethrows, and it seems noone  
 is using the information that the exception is thrown from inside the  
 finalizer)

Just an unconditional jump into the finally body here, but
it still affects register assignment.
Install an exception handler at sweep scope would save quite some
Moving the exception handler to the sweep scope seems promising,
can save lots of register saves.

I appreciate the recursion during mark, wanted to do this myself
sometime ago but expected a little more gain.

Some more ideas:

  - Do a major refactoring of the GC code, making it less reluctant
    to changes. Adding sanity checks or unit tests would be great.
    This probably reveals some obfuscated performance issues.

  - Add more realistic GC benchmarks, just requires adding to
    druntime/test/gcbench using the new runbench. The tree1 mainly
    uses homogeneous classes, so this is very synthesized.

  - There is one binary search pool lookup for every scanned address in  
range.
    Should be a lot to gain here, but it's difficult. It needs a multilevel
    mixture of bitset/hashtab.

  - Reduce the GC roots range. I will have to work on this for
    shared library support anyhow.

martin

 On 05.12.2011 02:46, dsimcha wrote:
 I'm at my traditional passtime of trying to speed up D's garbage
 collector again, and I've stumbled on the fact that rt_finalize is
 taking up a ridiculous share of the time (~30% of total runtime) on a
 benchmark where huge numbers of classes **that don't have destructors**
 are being created and collected. Here's the code to this function, from
 lifetime.d:

 extern (C) void rt_finalize(void* p, bool det = true)
 {
 debug(PRINTF) printf("rt_finalize(p = %p)\n", p);

 if (p) // not necessary if called from gc
 {
 ClassInfo** pc = cast(ClassInfo**)p;

 if (*pc)
 {
 ClassInfo c = **pc;
 byte[] w = c.init;

 try
 {
 if (det || collectHandler is null || collectHandler(cast(Object)p))
 {
 do
 {
 if (c.destructor)
 {
 fp_t fp = cast(fp_t)c.destructor;
 (*fp)(cast(Object)p); // call destructor
 }
 c = c.base;
 } while (c);
 }
 if ((cast(void**)p)[1]) // if monitor is not null
 _d_monitordelete(cast(Object)p, det);
 (cast(byte*) p)[0 .. w.length] = w[]; // WTF?
 }
 catch (Throwable e)
 {
 onFinalizeError(**pc, e);
 }
 finally // WTF?
 {
 *pc = null; // zero vptr
 }
 }
 }
 }

 Getting rid of the stuff I've marked with //WTF? comments (namely the
 finally block and the re-initializing of the memory occupied by the
 finalized object) speeds things up by ~15% on the benchmark in question.
 Why do we care what state the blob of memory is left in after we
 finalize it? I can kind of see that we want to clear things if
 delete/clear was called manually and we want to leave the object in a
 state that doesn't look valid. However, this has significant performance
 costs and IIRC is already done in clear() and delete is supposed to be
 deprecated. Furthermore, I'd like to get rid of the finally block
 entirely, since I assume its presence and the effect on the generated
 code is causing the slowdown, not the body, which just assigns a  
 pointer.

 Is there any good reason to keep this code around?

Dec 05 2011

dsimcha <dsimcha yahoo.com> writes:

== Quote from Martin Nowak (dawg dawgfoto.de)'s article
 I appreciate the recursion during mark, wanted to do this myself
 sometime ago but expected a little more gain.

The reason the gain wasn't huge is because on the benchmark I have that
involves a
deep heap graph, sweeping time dominates marking time.  The performance gain for
the mark phase only (which is important b/c this is when the world needs to be
stopped) is ~20-30%.

 Some more ideas:
   - Do a major refactoring of the GC code, making it less reluctant
     to changes. Adding sanity checks or unit tests would be great.
     This probably reveals some obfuscated performance issues.

Not just obfuscated ones.  I've wanted to fix an obvious perf bug for two years
and haven't done it because the necessary refactoring would be unbelievably
messy
and I'm too afraid I'll break something.  Basically, malloc() sets the bytes
between the size you requested and the size of the block actually allocated to
zero to prevent false pointers.  This is reasonable.  The problem is that it
does
so **while holding the GC's lock**.  Fixing it for just the case when malloc()
is
called by the user is also easy.  The problem is fixing it when malloc() gets
called from realloc(), calloc(), etc.

   - Add more realistic GC benchmarks, just requires adding to
     druntime/test/gcbench using the new runbench. The tree1 mainly
     uses homogeneous classes, so this is very synthesized.

I'll crowdsource this.  I can't think of any good benchmarks that are < a few
hundred lines w/ no dependencies but aren't pretty synthetic.

   - There is one binary search pool lookup for every scanned address in
 range.
     Should be a lot to gain here, but it's difficult. It needs a multilevel
     mixture of bitset/hashtab.

I understand the problem, but please elaborate on the proposed solution.  You've
basically got a bunch of pools, each of which represents a range of memory
addresses, not a single address (so a basic hashtable is out).  You need to know
which range some pointer fits in.  How would you beat binary search/O(log N)
for this?

   - Reduce the GC roots range. I will have to work on this for
     shared library support anyhow.

Please clarify what you mean by "reduce" the roots range.

Thanks for the feedback/suggestions.

Dec 05 2011

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Mon, 05 Dec 2011 23:07:09 +0200, dsimcha <dsimcha yahoo.com> wrote:

 I understand the problem, but please elaborate on the proposed  
 solution.  You've
 basically got a bunch of pools, each of which represents a range of  
 memory
 addresses, not a single address (so a basic hashtable is out).  You need  
 to know
 which range some pointer fits in.  How would you beat binary  
 search/O(log N) for this?

A tree, with a few bits of the address space per level. It becomes bound  
to the size of the address space, not the number of pools.

-- 
Best regards,
  Vladimir                            mailto:vladimir thecybershadow.net

Dec 05 2011

"Martin Nowak" <dawg dawgfoto.de> writes:

On Mon, 05 Dec 2011 22:07:09 +0100, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Martin Nowak (dawg dawgfoto.de)'s article
 I appreciate the recursion during mark, wanted to do this myself
 sometime ago but expected a little more gain.

 The reason the gain wasn't huge is because on the benchmark I have that  
 involves a
 deep heap graph, sweeping time dominates marking time.  The performance  
 gain for
 the mark phase only (which is important b/c this is when the world needs  
 to be
 stopped) is ~20-30%.

 Some more ideas:
   - Do a major refactoring of the GC code, making it less reluctant
     to changes. Adding sanity checks or unit tests would be great.
     This probably reveals some obfuscated performance issues.

 Not just obfuscated ones.  I've wanted to fix an obvious perf bug for  
 two years
 and haven't done it because the necessary refactoring would be  
 unbelievably messy
 and I'm too afraid I'll break something.  Basically, malloc() sets the  
 bytes
 between the size you requested and the size of the block actually  
 allocated to
 zero to prevent false pointers.  This is reasonable.  The problem is  
 that it does
 so **while holding the GC's lock**.  Fixing it for just the case when  
 malloc() is
 called by the user is also easy.  The problem is fixing it when malloc()  
 gets
 called from realloc(), calloc(), etc.

   - Add more realistic GC benchmarks, just requires adding to
     druntime/test/gcbench using the new runbench. The tree1 mainly
     uses homogeneous classes, so this is very synthesized.

 I'll crowdsource this.  I can't think of any good benchmarks that are <  
 a few
 hundred lines w/ no dependencies but aren't pretty synthetic.

Probably write a tool that can replay recorded allocations profiles.
But it will be difficult to fake internal pointers if they are not  
recorded.
If we were able to enable recording in the runtime, people having  
performance
issues with the GC could then send a file that can be used for  
optimizations.

   - There is one binary search pool lookup for every scanned address in
 range.
     Should be a lot to gain here, but it's difficult. It needs a  
 multilevel
     mixture of bitset/hashtab.

 I understand the problem, but please elaborate on the proposed  
 solution.  You've
 basically got a bunch of pools, each of which represents a range of  
 memory
 addresses, not a single address (so a basic hashtable is out).  You need  
 to know
 which range some pointer fits in.  How would you beat binary  
 search/O(log N) for this?

It is possible to use a hashtable or a bitset because each pool occupies
at least PAGESIZE.
One would need to use addresses relative to the minimum address
but that still needs 256K entries per GByte, so the maintenance
overhead is likely too big.

More promising is to put pool addresses ranges in a trie.

addr[7]          [...      .     ...]
                      /     |    \
addr[6]     [...   .   ...]    [...   .   ...]
                /   |    \         /   |   \
addr[5]     pool:8             [...   .  ...]
                               /   |   \
addr[4]                  pool:8 [....] pool:5

Lookup is bounded constant time, with at maximum 8(4) memory accesses.
Maintenance is only required when adding/removing pools, little  
sophisticated though.

-----

alias Node Trie;

struct Node
{
    union
    {
       Type           _type;
       Pool*          _pool;
       Node[16]*   _nodes16; // uses upper 3 bits per level
       ...
       Node[256]* _nodes256; // uses  full 8 bits per level
    }

    // use lower 3 bits to encode type, given guaranteed 8-byte alignment  
of malloc
    enum NTypeBits = 3;
    enum TypeMask = (1 << NTypeBits) - 1);
    enum PtrMask = ~TypeMask;

    enum Type { Empty, Pool, Nodes16, Nodes256 };
    static assert(Type.max <= TypeMask);

    Pool* getPool(void* p)
    {
        return getPoolImpl((cast(ubyte*)&p) + size_t.sizeof - 1);
    }

    Pool* getPoolImpl(ubyte* p)
    {
        Node* node = void;

        final switch (type)
        {
        case Type.Empty:
           return null;
        case Type.Pool:
           return pool;
        case Type.Nodes16:
           node = nodes16[*p >> 5];
           break;
        case Type.Nodes256:
           node = nodes256[*p];
           break;
        }

        if (node.type == Type.Pool)
            return node.pool;
        else
            return node.getPoolImpl(p - 1);
    }

    void insertPool(Pool* npool, uint level=0)
    {
        final switch (type)
        {
        case Type.Empty:
            pool = npool;

        case Type.Pool:
            Pool *opool = pool;
            if (opool != npool)
            {
                nodes16 = new Node[16];

                foreach (i; 0 .. 16)
                {
                    // non-trivial code here
                    // Needs to figure out into which subnode opool and  
npool
                    // should be inserted. They can use pool.minAddr and  
pool.maxAddr

                    nodes16[i].insertPool(opool, level + 1);
                    nodes16[i].insertPool(npool, level + 1);
                }
            }
        }

        case Type.Nodes16:
            // count empty nodes and decide whether
            // to switch to more child nodes.
        }
    }

    void removePool(Pool* npool, uint level=0)
    {
        // reverse of above
    }

     property Type type()
    {
        return cast(Type)(_type & TypeMask);
    }

     property void type(Type t)
    {
        assert(t <= TypeMask);
        _type |= t;
    }

     property Pool* pool()
    {
        return cast(Pool*)(cast(size_t)_pool & PtrMask);
    }

     property void pool(Pool* p)
    {
        _pool = p;
        type = Type.Pool;
    }

    ... for nodesN
}

-----

   - Reduce the GC roots range. I will have to work on this for
     shared library support anyhow.

 Please clarify what you mean by "reduce" the roots range.

We currently add something from etext to _end (rt.memory) as static root.
This probably contains read-only sections, data from other languages
and (unlikely) other unrelated sections.
I think some help from the compiler will be necessary to support
shared libraries anyhow, so we should be able to get a precise range.

 Thanks for the feedback/suggestions.

Dec 05 2011

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 5 December 2011 22:34, Martin Nowak <dawg dawgfoto.de> wrote:
 We currently add something from etext to _end (rt.memory) as static root.
 This probably contains read-only sections, data from other languages
 and (unlikely) other unrelated sections.
 I think some help from the compiler will be necessary to support
 shared libraries anyhow, so we should be able to get a precise range.

Indeed.  The current implementation kicking around is to scan
/proc/self/maps on init (this is true for the runtime that ships with
GDC at least).  Which is not the most pleasant of things to do - not
to mention only supports Unix-flavoured systems, but it works well
enough for when linking against shared D libraries.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Dec 05 2011

Jacob Carlborg <doob me.com> writes:

On 2011-12-05 23:45, Iain Buclaw wrote:
 On 5 December 2011 22:34, Martin Nowak<dawg dawgfoto.de>  wrote:
 We currently add something from etext to _end (rt.memory) as static root.
 This probably contains read-only sections, data from other languages
 and (unlikely) other unrelated sections.
 I think some help from the compiler will be necessary to support
 shared libraries anyhow, so we should be able to get a precise range.

 Indeed.  The current implementation kicking around is to scan
 /proc/self/maps on init (this is true for the runtime that ships with
 GDC at least).  Which is not the most pleasant of things to do - not
 to mention only supports Unix-flavoured systems, but it works well
 enough for when linking against shared D libraries.

/proc doesn't exist on Mac OS X.

-- 
/Jacob Carlborg

Dec 05 2011

"Martin Nowak" <dawg dawgfoto.de> writes:

 More promising is to put pool addresses ranges in a trie.

 addr[7]          [...      .     ...]
                       /     |    \
 addr[6]     [...   .   ...]    [...   .   ...]
                 /   |    \         /   |   \
 addr[5]     pool:8             [...   .  ...]
                                /   |   \
 addr[4]                  pool:8 [....] pool:5

Actually 64-bit should use a hashtable for the upper 32-bit and then
the the 32-bit trie for lower.

Dec 05 2011

dsimcha <dsimcha yahoo.com> writes:

== Quote from Martin Nowak (dawg dawgfoto.de)'s article
 More promising is to put pool addresses ranges in a trie.

 addr[7]          [...      .     ...]
                       /     |    \
 addr[6]     [...   .   ...]    [...   .   ...]
                 /   |    \         /   |   \
 addr[5]     pool:8             [...   .  ...]
                                /   |   \
 addr[4]                  pool:8 [....] pool:5

 Actually 64-bit should use a hashtable for the upper 32-bit and then
 the the 32-bit trie for lower.

Why do you expect this to be faster than a binary search?  I'm not saying it
won't
be, just that it's not a home run that deserves a high priority as an
optimization.  You still have a whole bunch of indirections, probably more than
you would ever have for binary search.

Dec 05 2011

"Martin Nowak" <dawg dawgfoto.de> writes:

On Tue, 06 Dec 2011 00:16:01 +0100, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Martin Nowak (dawg dawgfoto.de)'s article
 More promising is to put pool addresses ranges in a trie.

 addr[7]          [...      .     ...]
                       /     |    \
 addr[6]     [...   .   ...]    [...   .   ...]
                 /   |    \         /   |   \
 addr[5]     pool:8             [...   .  ...]
                                /   |   \
 addr[4]                  pool:8 [....] pool:5

 Actually 64-bit should use a hashtable for the upper 32-bit and then
 the the 32-bit trie for lower.

 Why do you expect this to be faster than a binary search?  I'm not  
 saying it won't
 be, just that it's not a home run that deserves a high priority as an
 optimization.  You still have a whole bunch of indirections, probably  
 more than
 you would ever have for binary search.

It's the tightest loop in the garbage collector.
It will end with few array accesses and the locality of
memory being pointed too is paired by tree locality,
thus you'd have good chances to finding them in the cache.

What this gives you is that it scales very good to many pools/regions.
Thus you can better tweak the pool granularity against pool sizes.

Also:
for (p < pend)
   if (*p in lastPool)
can be:
for (p < pend)
   if (*p in lastNode)

It is definitely no home run, but I'll try it
out when I find some time.

Dec 05 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sun, 04 Dec 2011 20:46:27 -0500, dsimcha <dsimcha yahoo.com> wrote:

 I'm at my traditional passtime of trying to speed up D's garbage  
 collector again, and I've stumbled on the fact that rt_finalize is  
 taking up a ridiculous share of the time (~30% of total runtime) on a  
 benchmark where huge numbers of classes **that don't have destructors**  
 are being created and collected.  Here's the code to this function, from  
 lifetime.d:

 extern (C) void rt_finalize(void* p, bool det = true)
 {
      debug(PRINTF) printf("rt_finalize(p = %p)\n", p);

      if (p) // not necessary if called from gc
      {
          ClassInfo** pc = cast(ClassInfo**)p;

          if (*pc)
          {
              ClassInfo c = **pc;
              byte[]    w = c.init;

              try
              {
                  if (det || collectHandler is null ||  
 collectHandler(cast(Object)p))
                  {
                      do
                      {
                          if (c.destructor)
                          {
                              fp_t fp = cast(fp_t)c.destructor;
                              (*fp)(cast(Object)p); // call destructor
                          }
                          c = c.base;
                      } while (c);
                  }
                  if ((cast(void**)p)[1]) // if monitor is not null
                      _d_monitordelete(cast(Object)p, det);
                  (cast(byte*) p)[0 .. w.length] = w[];  // WTF?
              }
              catch (Throwable e)
              {
                  onFinalizeError(**pc, e);
              }
              finally  // WTF?
              {
                  *pc = null; // zero vptr
              }
          }
      }
 }

 Getting rid of the stuff I've marked with //WTF? comments (namely the  
 finally block and the re-initializing of the memory occupied by the  
 finalized object) speeds things up by ~15% on the benchmark in question.  
   Why do we care what state the blob of memory is left in after we  
 finalize it?  I can kind of see that we want to clear things if  
 delete/clear was called manually and we want to leave the object in a  
 state that doesn't look valid.  However, this has significant  
 performance costs and IIRC is already done in clear() and delete is  
 supposed to be deprecated.  Furthermore, I'd like to get rid of the  
 finally block entirely, since I assume its presence and the effect on  
 the generated code is causing the slowdown, not the body, which just  
 assigns a pointer.

I think it might be a good idea to move those two operations to the clear  
function.  Currently, clear(obj) simply calls rt_finalize(obj).

It does seem rather strange to have that finally there, why do we care on  
an exception/error that we zero the vptr?  I'd at least put that code up  
where the first WTF is.

But I think we do need to have the reinitialization of the object and  
zeroing of vptr for clear, since it's part of clear's charter.

I'd support any effort to speed up the GC, it's definitely the worst  
offender for performance.  Looks like there's quite a few good ideas in  
this thread.

-Steve

Dec 05 2011

Robert Clipsham <robert octarineparrot.com> writes:

On 05/12/2011 01:46, dsimcha wrote:
 I'm at my traditional passtime of trying to speed up D's garbage
 collector again

Have you thought about pushing for the inclusion of CDGC at all/working 
on the tweaks needed to make it the main GC?

-- 
Robert
http://octarineparrot.com/

Dec 05 2011

Trass3r <un known.com> writes:

 On 05/12/2011 01:46, dsimcha wrote:
 I'm at my traditional passtime of trying to speed up D's garbage
 collector again

 Have you thought about pushing for the inclusion of CDGC at all/working  
 on the tweaks needed to make it the main GC?

So true, it's been rotting in that branch.

Dec 05 2011

dsimcha <dsimcha yahoo.com> writes:

On 12/5/2011 6:39 PM, Trass3r wrote:
 On 05/12/2011 01:46, dsimcha wrote:
 I'm at my traditional passtime of trying to speed up D's garbage
 collector again

 Have you thought about pushing for the inclusion of CDGC at
 all/working on the tweaks needed to make it the main GC?

 So true, it's been rotting in that branch.

IIRC CDGC includes two major enhancements:

1.  The snapshot GC for Linux.  (Does this work on OSX/FreeBSD/anything 
Posix, or just Linux?  I'm a bit skeptical about whether a snapshot GC 
is really that great an idea given its propensity to waste memory on 
long collect cycles with a lot of mutation.)

2.  I think there was some precise heap scanning-related stuff in it.  I 
originally tried to implement precise heap scanning a couple years ago, 
but it went nowhere for reasons too complicated to explain here.  Given 
this experience, I'm not inclined to try again until the compiler has 
extensions for generating pointer offset information.

Dec 05 2011

Trass3r <un known.com> writes:

 IIRC CDGC includes two major enhancements:

 1.  The snapshot GC for Linux.  (Does this work on OSX/FreeBSD/anything  
 Posix, or just Linux?  I'm a bit skeptical about whether a snapshot GC  
 is really that great an idea given its propensity to waste memory on  
 long collect cycles with a lot of mutation.)

I guess, at least it uses fork IIRC.


 2.  I think there was some precise heap scanning-related stuff in it.  I  
 originally tried to implement precise heap scanning a couple years ago,  
 but it went nowhere for reasons too complicated to explain here.  Given  
 this experience, I'm not inclined to try again until the compiler has  
 extensions for generating pointer offset information.

What's the status of that anyway? Every patch in that bugzilla entry is  
marked obsolete and there's no real final answer.

Dec 05 2011

Sean Kelly <sean invisibleduck.org> writes:

I have an updated and win32-compilable version of CDGC in a branch. It's mis=
sing some features from the current GC though (it's based on the Tango GC wh=
ich has remained relatively static for the past years while druntime's GC ha=
s improved).

Sent from my iPhone

On Dec 5, 2011, at 3:39 PM, Trass3r <un known.com> wrote:

 On 05/12/2011 01:46, dsimcha wrote:
 I'm at my traditional passtime of trying to speed up D's garbage
 collector again

=20
 Have you thought about pushing for the inclusion of CDGC at all/working o=


n the tweaks needed to make it the main GC?
=20
 So true, it's been rotting in that branch.

Dec 06 2011

Sean Kelly <sean invisibleduck.org> writes:

On Dec 4, 2011, at 5:46 PM, dsimcha wrote:

 I'm at my traditional passtime of trying to speed up D's garbage =

collector again, and I've stumbled on the fact that rt_finalize is =
taking up a ridiculous share of the time (~30% of total runtime) on a =
benchmark where huge numbers of classes **that don't have destructors** =
are being created and collected.  Here's the code to this function, from =
lifetime.d:
=20
 extern (C) void rt_finalize(void* p, bool det =3D true)
 {
    debug(PRINTF) printf("rt_finalize(p =3D %p)\n", p);
=20
    if (p) // not necessary if called from gc
    {
        ClassInfo** pc =3D cast(ClassInfo**)p;
=20
        if (*pc)
        {
            ClassInfo c =3D **pc;
            byte[]    w =3D c.init;
=20
            try
            {
                if (det || collectHandler is null || =

collectHandler(cast(Object)p))
                {
                    do
                    {
                        if (c.destructor)
                        {
                            fp_t fp =3D cast(fp_t)c.destructor;
                            (*fp)(cast(Object)p); // call destructor
                        }
                        c =3D c.base;
                    } while (c);
                }
                if ((cast(void**)p)[1]) // if monitor is not null
                    _d_monitordelete(cast(Object)p, det);
                (cast(byte*) p)[0 .. w.length] =3D w[];  // WTF?
            }
            catch (Throwable e)
            {
                onFinalizeError(**pc, e);
            }
            finally  // WTF?
            {
                *pc =3D null; // zero vptr
            }
        }
    }
 }
=20
 Getting rid of the stuff I've marked with //WTF? comments (namely the =

finally block and the re-initializing of the memory occupied by the =
finalized object) speeds things up by ~15% on the benchmark in question. =
 Why do we care what state the blob of memory is left in after we =
finalize it?  I can kind of see that we want to clear things if =
delete/clear was called manually and we want to leave the object in a =
state that doesn't look valid.  However, this has significant =
performance costs and IIRC is already done in clear() and delete is =
supposed to be deprecated.  Furthermore, I'd like to get rid of the =
finally block entirely, since I assume its presence and the effect on =
the generated code is causing the slowdown, not the body, which just =
assigns a pointer.

Now that having multiple in-flight exceptions is actually legal, we can =
probably just throw out the try/catch entirely.  The try/catch and call =
to onFinalizeError is a holdover from when it was effectively illegal to =
throw from a finalizer.=

Dec 07 2011

bearophile <bearophileHUGS lycos.com> writes:

dsimcha:

 I'm at my traditional passtime of trying to speed up D's garbage 
 collector again,

Are you going to create a pull request soon? So people will be able to try it
with their D programs and spot potential troubles...

Bye,
bearophile

Dec 09 2011

D Programming

C/C++ Programming

Other

digitalmars.D - rt_finalize WTFs?