
digitalmars.D.learn - Performance of GC.collect() for single block of `byte`s

reply Per Nordlöw <per.nordlow gmail.com> writes:
The code

import std.stdio;

void main(string[] args)
{
     import std.datetime.stopwatch : benchmark;
     import core.time : Duration;
     import core.memory : GC;

     immutable benchmarkCount = 1;

     foreach (const i; 0 .. 10)
     {
         const byteCount = i*100_000_000;
         const array = new byte[byteCount]; // one Gig
         const Duration[1] results = benchmark!(GC.collect)(benchmarkCount);
         writefln("%s bytes: Calling GC.collect() took %s nsecs after %s",
                  byteCount,
                  cast(double)results[0].total!"nsecs"/benchmarkCount,
                  array.ptr);
     }
}

prints when release-compiled with ldc

0 bytes: Calling GC.collect() took 600 nsecs after null
100000000 bytes: Calling GC.collect() took 83000 nsecs after 7F785ED44010
200000000 bytes: Calling GC.collect() took 114600 nsecs after 7F784CF29010
300000000 bytes: Calling GC.collect() took 277600 nsecs after 7F7832201010
400000000 bytes: Calling GC.collect() took 400400 nsecs after 7F780E5CC010
500000000 bytes: Calling GC.collect() took 449700 nsecs after 7F77E1A8A010
600000000 bytes: Calling GC.collect() took 481200 nsecs after 7F780E5CC010
700000000 bytes: Calling GC.collect() took 529800 nsecs after 7F77E1A8A010
800000000 bytes: Calling GC.collect() took 547600 nsecs after 7F779A16E010
900000000 bytes: Calling GC.collect() took 925500 nsecs after 7F7749891010

Why is the overhead so big for a single allocation of an array whose 
elements contain no indirections (and which the GC therefore doesn't 
need to scan for pointers)?
Sep 24 2018
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/24/18 5:49 AM, Per Nordlöw wrote:
 The code
 
 import std.stdio;
 
 void main(string[] args)
 {
      import std.datetime.stopwatch : benchmark;
      import core.time : Duration;
      import core.memory : GC;
 
      immutable benchmarkCount = 1;
 
      foreach (const i; 0 .. 10)
      {
          const byteCount = i*100_000_000;
          const array = new byte[byteCount]; // one Gig
           const Duration[1] results = benchmark!(GC.collect)(benchmarkCount);
           writefln("%s bytes: Calling GC.collect() took %s nsecs after %s",
                    byteCount,
                    cast(double)results[0].total!"nsecs"/benchmarkCount,
                    array.ptr);
      }
 }
 
 prints when release-compiled with ldc
 
 0 bytes: Calling GC.collect() took 600 nsecs after null
 100000000 bytes: Calling GC.collect() took 83000 nsecs after 7F785ED44010
 200000000 bytes: Calling GC.collect() took 114600 nsecs after 7F784CF29010
 300000000 bytes: Calling GC.collect() took 277600 nsecs after 7F7832201010
 400000000 bytes: Calling GC.collect() took 400400 nsecs after 7F780E5CC010
 500000000 bytes: Calling GC.collect() took 449700 nsecs after 7F77E1A8A010
 600000000 bytes: Calling GC.collect() took 481200 nsecs after 7F780E5CC010
 700000000 bytes: Calling GC.collect() took 529800 nsecs after 7F77E1A8A010
 800000000 bytes: Calling GC.collect() took 547600 nsecs after 7F779A16E010
 900000000 bytes: Calling GC.collect() took 925500 nsecs after 7F7749891010
 
 Why is the overhead so big for a single allocation of an array whose 
 elements contain no indirections (and which the GC therefore doesn't 
 need to scan for pointers)?
It's not scanning the blocks. But it is scanning the stack.

Each time you are increasing the space it must search for a given *target*. It also must *collect* any previous items at the end of the scan. Note that a collection is going to mark every single page and bitset that is contained in the item being collected (which gets increasingly larger).

What you are timing is the incremental work being done by an ever-growing task. And really, this isn't that much (at the end, you are still under 1 ms).

One thing that may cause some headache here is that when you have a large block, it is part of a long chain of pages. I'm not sure I remember correctly, but it's a logarithmic search to see if a pointer points at a live GC block, and a linear search to find the front of it. It may also have changed since I looked at it last.

-Steve
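One way to confirm that the block itself is not scanned is to query its block attributes with core.memory.GC.getAttr. A minimal sketch, assuming the default conservative GC, where arrays of plain byte are allocated with the NO_SCAN attribute:

import core.memory : GC;
import std.stdio : writeln;

void main()
{
    auto array = new byte[100_000_000];

    // Query the GC block attributes of the array's memory block.
    const attrs = GC.getAttr(array.ptr);

    // NO_SCAN means the collector never looks for pointers inside the block,
    // so the cost discussed above is marking/sweeping metadata and scanning
    // the stack and other roots, not scanning the bytes themselves.
    writeln("NO_SCAN set: ", (attrs & GC.BlkAttr.NO_SCAN) != 0);
}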
Sep 24 2018
next sibling parent reply Per Nordlöw <per.nordlow gmail.com> writes:
On Monday, 24 September 2018 at 14:31:45 UTC, Steven 
Schveighoffer wrote:
 Why is the overhead so big for a single allocation of an array whose 
 elements contain no indirections (and which the GC therefore doesn't 
 need to scan for pointers)?
It's not scanning the blocks. But it is scanning the stack.
Ok, I modified the code to be

import std.stdio;

void* mallocAndFreeBytes(size_t byteCount)()
{
    import core.memory : pureMalloc, pureFree;
    void* ptr = pureMalloc(byteCount);
    pureFree(ptr);
    return ptr;                 // for side-effects
}

void main(string[] args)
{
    import std.datetime.stopwatch : benchmark;
    import core.time : Duration;

    immutable benchmarkCount = 1;

    // GC
    static foreach (const i; 0 .. 31)
    {
        {
            enum byteCount = 2^^i;

            const Duration[1] resultsC = benchmark!(mallocAndFreeBytes!(i))(benchmarkCount);
            writef("%s bytes: mallocAndFreeBytes: %s nsecs",
                   byteCount,
                   cast(double)resultsC[0].total!"nsecs"/benchmarkCount);

            import core.memory : GC;
            auto dArray = new byte[byteCount]; // one Gig
            const Duration[1] resultsD = benchmark!(GC.collect)(benchmarkCount);
            writefln(" GC.collect(): %s nsecs after %s",
                     cast(double)resultsD[0].total!"nsecs"/benchmarkCount,
                     dArray.ptr);
            dArray = null;
        }
    }
}

I still believe these numbers are absolutely horrible:

1 bytes: mallocAndFreeBytes: 400 nsecs GC.collect(): 21600 nsecs after 7F1ECC0B1000
2 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 20800 nsecs after 7F1ECC0B1010
4 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 20500 nsecs after 7F1ECC0B1000
8 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 20300 nsecs after 7F1ECC0B1010
16 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 23200 nsecs after 7F1ECC0B2000
32 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 19600 nsecs after 7F1ECC0B1000
64 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 17800 nsecs after 7F1ECC0B2000
128 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 16600 nsecs after 7F1ECC0B1000
256 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 16200 nsecs after 7F1ECC0B2000
512 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 15900 nsecs after 7F1ECC0B1000
1024 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 15700 nsecs after 7F1ECC0B2000
2048 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 14600 nsecs after 7F1ECC0B1010
4096 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 14400 nsecs after 7F1ECC0B2010
8192 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 14200 nsecs after 7F1ECC0B4010
16384 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 14100 nsecs after 7F1ECC0B7010
32768 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 14200 nsecs after 7F1ECC0BC010
65536 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 14200 nsecs after 7F1ECC0C5010
131072 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 14200 nsecs after 7F1ECC0D6010
262144 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 14200 nsecs after 7F1ECC0F7010
524288 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 17500 nsecs after 7F1ECAC14010
1048576 bytes: mallocAndFreeBytes: 200 nsecs GC.collect(): 18000 nsecs after 7F1ECAC95010
2097152 bytes: mallocAndFreeBytes: 500 nsecs GC.collect(): 18700 nsecs after 7F1ECAD96010
4194304 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 20000 nsecs after 7F1ECA514010
8388608 bytes: mallocAndFreeBytes: 400 nsecs GC.collect(): 61000 nsecs after 7F1EC9913010
16777216 bytes: mallocAndFreeBytes: 24900 nsecs GC.collect(): 27100 nsecs after 7F1EC8112010
33554432 bytes: mallocAndFreeBytes: 800 nsecs GC.collect(): 36600 nsecs after 7F1EC5111010
67108864 bytes: mallocAndFreeBytes: 600 nsecs GC.collect(): 57900 nsecs after 7F1EBF110010
134217728 bytes: mallocAndFreeBytes: 500 nsecs GC.collect(): 98300 nsecs after 7F1EB310F010
268435456 bytes: mallocAndFreeBytes: 700 nsecs GC.collect(): 175700 nsecs after 7F1E9B10E010
536870912 bytes: mallocAndFreeBytes: 600 nsecs GC.collect(): 326900 nsecs after 7F1E6B10D010
1073741824 bytes: mallocAndFreeBytes: 900 nsecs GC.collect(): 641500 nsecs after 7F1E0B04B010

How is it possible for the GC to be 500-1000 times slower than a malloc-free call for a single array containing just bytes with no indirections for such a simple function!!!?

I really don't understand this...
Sep 28 2018
parent Per Nordlöw <per.nordlow gmail.com> writes:
On Friday, 28 September 2018 at 09:14:18 UTC, Per Nordlöw wrote:
 How is it possible for the GC to be 500-1000 times slower than 
 a malloc-free call for a single array containing just bytes 
 with no indirections for such a simple function!!!?

 I really don't understand this...
I changed the code to not make use of the GC when printing:

import core.stdc.stdio: printf;

void* mallocAndFreeBytes(size_t byteCount)()
{
    import core.memory : pureMalloc, pureFree;
    void* ptr = pureMalloc(byteCount);
    pureFree(ptr);
    return ptr;                 // for side-effects
}

void main(string[] args)
{
    import std.datetime.stopwatch : benchmark;
    import core.time : Duration;

    immutable benchmarkCount = 1;

    // GC
    static foreach (const size_t i; 0 .. 32)
    {
        {
            enum byteCount = 2UL^^i;

            const Duration[1] resultsC = benchmark!(mallocAndFreeBytes!(i))(benchmarkCount);
            printf("%ld bytes: mallocAndFreeBytes: %ld nsecs",
                   byteCount,
                   cast(size_t)(cast(double)resultsC[0].total!"nsecs"/benchmarkCount));

            import core.memory : GC;
            auto dArray = new byte[byteCount]; // one Gig
            const Duration[1] resultsD = benchmark!(GC.collect)(benchmarkCount);
            printf(" GC.collect(): %ld nsecs after %p\n",
                   cast(size_t)(cast(double)resultsD[0].total!"nsecs"/benchmarkCount),
                   dArray.ptr);
            dArray = null;
        }
    }
}

I still get terrible numbers:

1 bytes: mallocAndFreeBytes: 600 nsecs GC.collect(): 29600 nsecs after 0x7fbab535b000
2 bytes: mallocAndFreeBytes: 500 nsecs GC.collect(): 28600 nsecs after 0x7fbab535b010
4 bytes: mallocAndFreeBytes: 400 nsecs GC.collect(): 27700 nsecs after 0x7fbab535b000
8 bytes: mallocAndFreeBytes: 400 nsecs GC.collect(): 27600 nsecs after 0x7fbab535b010
16 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 32100 nsecs after 0x7fbab535c000
32 bytes: mallocAndFreeBytes: 400 nsecs GC.collect(): 27100 nsecs after 0x7fbab535b000
64 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 48500 nsecs after 0x7fbab535c000
128 bytes: mallocAndFreeBytes: 400 nsecs GC.collect(): 23300 nsecs after 0x7fbab535b000
256 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 22300 nsecs after 0x7fbab535c000
512 bytes: mallocAndFreeBytes: 400 nsecs GC.collect(): 21800 nsecs after 0x7fbab535b000
1024 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 21900 nsecs after 0x7fbab535c000
2048 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 26300 nsecs after 0x7fbab3ebe010
4096 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 25100 nsecs after 0x7fbab3ebf010
8192 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 24500 nsecs after 0x7fbab3ec1010
16384 bytes: mallocAndFreeBytes: 400 nsecs GC.collect(): 24700 nsecs after 0x7fbab3ec4010
32768 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 24600 nsecs after 0x7fbab3ec9010
65536 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 24600 nsecs after 0x7fbab3ed2010
131072 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 25000 nsecs after 0x7fbab3ee3010
262144 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 25000 nsecs after 0x7fbab3f04010
524288 bytes: mallocAndFreeBytes: 400 nsecs GC.collect(): 25200 nsecs after 0x7fbab3f45010
1048576 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 25800 nsecs after 0x7fbab3fc6010
2097152 bytes: mallocAndFreeBytes: 300 nsecs GC.collect(): 17200 nsecs after 0x7fbab37be010
4194304 bytes: mallocAndFreeBytes: 500 nsecs GC.collect(): 25700 nsecs after 0x7fbab39bf010
8388608 bytes: mallocAndFreeBytes: 400 nsecs GC.collect(): 65500 nsecs after 0x7fbab2bbd010
16777216 bytes: mallocAndFreeBytes: 1100 nsecs GC.collect(): 47200 nsecs after 0x7fbab13bc010
33554432 bytes: mallocAndFreeBytes: 800 nsecs GC.collect(): 50300 nsecs after 0x7fbaae3bb010
67108864 bytes: mallocAndFreeBytes: 800 nsecs GC.collect(): 63800 nsecs after 0x7fbaa83ba010
134217728 bytes: mallocAndFreeBytes: 600 nsecs GC.collect(): 100000 nsecs after 0x7fba9c3b9010
268435456 bytes: mallocAndFreeBytes: 1000 nsecs GC.collect(): 176100 nsecs after 0x7fba843b8010
536870912 bytes: mallocAndFreeBytes: 1000 nsecs GC.collect(): 415500 nsecs after 0x7fba543b7010
1073741824 bytes: mallocAndFreeBytes: 800 nsecs GC.collect(): 649900 nsecs after 0x7fb9f42f5010
2147483648 bytes: mallocAndFreeBytes: 1200 nsecs GC.collect(): 973800 nsecs after 0x7fb934112010

It seems to scale kind of linearly with byteCount above 16 MB... So it seems like it's scanning the allocated block of bytes even though the element type of the array is a value type. Why?

If I zero the pointer just after allocation I get a GC.collect() taking a constant 100 ns, so it can't be related to the stack.
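For reference, a minimal sketch of the variant described in the last sentence (an assumed reconstruction, not the exact code that was run): the only reference to the block is nulled right after allocation, before the collection is timed.

import core.memory : GC;
import std.datetime.stopwatch : benchmark;
import std.stdio : writefln;

void main()
{
    enum byteCount = 2UL^^30;          // 1 GiB, as in the largest runs above
    auto dArray = new byte[byteCount];

    dArray = null;                     // zero the pointer just after allocation

    // Time a single collection with no live D reference to the big block.
    const results = benchmark!(GC.collect)(1);
    writefln("GC.collect() took %s nsecs", results[0].total!"nsecs");
}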
Sep 28 2018
prev sibling next sibling parent Per Nordlöw <per.nordlow gmail.com> writes:
On Monday, 24 September 2018 at 14:31:45 UTC, Steven 
Schveighoffer wrote:
 ever growing task. And really, this isn't that much (at the 
 end, you are still less than 1ms).
Compared to GCs like Go's, this is an enormous latency for a single allocation of value elements.
Sep 28 2018
prev sibling parent reply Per Nordlöw <per.nordlow gmail.com> writes:
On Monday, 24 September 2018 at 14:31:45 UTC, Steven 
Schveighoffer wrote:
 It's not scanning the blocks. But it is scanning the stack.

 Each time you are increasing the space it must search for a 
 given *target*. It also must *collect* any previous items at 
 the end of the scan. Note that a collection is going to mark 
 every single page and bitset that is contained in the item 
 being collected (which gets increasingly larger).
Is this because of the potentially (many) slices referencing this large block?

I assume the GC doesn't scan the `byte`-array for pointer values in this case, but that happens for `void`-arrays and class/pointer-arrays, right?

Couldn't that scan be optimized by adding a bitset that indicates which pages need to be scanned? Is it common for GCs to treat large objects in this way?
Sep 28 2018
parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 28/09/2018 14:21, Per Nordlöw wrote:
 On Monday, 24 September 2018 at 14:31:45 UTC, Steven Schveighoffer wrote:
 It's not scanning the blocks. But it is scanning the stack.

 Each time you are increasing the space it must search for a given 
 *target*. It also must *collect* any previous items at the end of the 
 scan. Note that a collection is going to mark every single page and 
 bitset that is contained in the item being collected (which gets 
 increasingly larger).
Is this because of the potentially (many) slices referencing this large block?

I assume the GC doesn't scan the `byte`-array for pointer values in this case, but that happens for `void`-arrays and class/pointer-arrays, right?

Couldn't that scan be optimized by adding a bitset that indicates which pages need to be scanned? Is it common for GCs to treat large objects in this way?
A profiler reveals that most of the time is spent in "sweeping" the memory, i.e. looking for allocations no longer referenced. The existing implementation checks every page which causes a linear growth of required CPU resources with used memory.

This version https://github.com/rainers/druntime/tree/gc_opt_sweep takes advantage of the known size of allocations to skip unnecessary checks. The last commit also adds support for keeping track of the size of blocks of consecutive free pages. With this your example has more or less constant collection time (note that most of the program time is spent setting the array to zero, though not measured, and that the allocation often triggers a collection, too).

I also noticed a rather serious bug for huge allocations: https://issues.dlang.org/show_bug.cgi?id=19281
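To make the difference concrete, here is a small hypothetical sketch (the PageKind, pageTable and pageCounts names are invented for illustration; this is not the actual druntime or patch code) contrasting a sweep that visits every page with one that skips ahead by the recorded size of each allocation:

// Hypothetical pool metadata, loosely following the discussion above.
enum PageKind : ubyte { free, firstOfBlock, continuation }

struct Pool
{
    PageKind[] pageTable; // one entry per page
    size_t[] pageCounts;  // pages per block, indexed by the block's first page
    bool[] marked;        // mark bit per first page, set during the mark phase
}

// Naive sweep: visits every single page, so the cost grows linearly with the
// total number of pages in use.
size_t sweepEveryPage(ref Pool pool)
{
    size_t freed;
    foreach (i, kind; pool.pageTable)
        if (kind == PageKind.firstOfBlock && !pool.marked[i])
            ++freed; // a real sweep would free the block here
    return freed;
}

// Size-aware sweep: jumps over all pages of a block in one step, so the cost
// depends on the number of allocations rather than the number of pages.
size_t sweepBySize(ref Pool pool)
{
    size_t freed;
    for (size_t i = 0; i < pool.pageTable.length; )
    {
        if (pool.pageTable[i] == PageKind.firstOfBlock)
        {
            if (!pool.marked[i])
                ++freed; // a real sweep would free the block here
            i += pool.pageCounts[i]; // skip the whole block at once
        }
        else
            ++i; // free page; tracking runs of free pages would skip these too
    }
    return freed;
}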
Oct 01 2018
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 10/1/18 3:21 AM, Rainer Schuetze wrote:
 
 
 On 28/09/2018 14:21, Per Nordlöw wrote:
 On Monday, 24 September 2018 at 14:31:45 UTC, Steven Schveighoffer wrote:
 It's not scanning the blocks. But it is scanning the stack.

 Each time you are increasing the space it must search for a given 
 *target*. It also must *collect* any previous items at the end of the 
 scan. Note that a collection is going to mark every single page and 
 bitset that is contained in the item being collected (which gets 
 increasingly larger).
Is this because of the potentially (many) slices referencing this large block?

I assume the GC doesn't scan the `byte`-array for pointer values in this case, but that happens for `void`-arrays and class/pointer-arrays, right?

Couldn't that scan be optimized by adding a bitset that indicates which pages need to be scanned? Is it common for GCs to treat large objects in this way?
A profiler reveals that most of the time is spent in "sweeping" the memory, i.e. looking for allocations no longer referenced. The existing implementation checks every page which causes a linear growth of required CPU resources with used memory.
Ouch! I hadn't thought of that. But we aren't checking actual *pages*, right, we are checking bits to see where allocations are present?

I also remember that the way the bitsets work, they are always allocated for every 16 bytes, regardless of the block size for that page/pool. I didn't love that feature but maybe it is fixed by now.

I would imagine that checking 64 pages at once should be possible by logic-anding the allocated and unmarked bits to check things as quickly as possible.
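A minimal sketch of that word-at-a-time idea (hypothetical bit layout and names, not druntime's actual data structures): with one allocated bit and one mark bit per page packed into ulong words, a single AND tests 64 pages, and whole words can be skipped when the result is zero.

import core.bitop : bsf; // bit scan forward: index of the lowest set bit

// allocated[w] and marked[w] hold the bits for pages 64*w .. 64*w+63.
// Returns the indices of pages that are allocated but unmarked, i.e. sweepable.
size_t[] findSweepablePages(const ulong[] allocated, const ulong[] marked)
{
    size_t[] result;
    foreach (w; 0 .. allocated.length)
    {
        // One AND checks 64 pages at once; a zero word is skipped entirely.
        ulong candidates = allocated[w] & ~marked[w];
        while (candidates != 0)
        {
            const bit = bsf(candidates);  // lowest remaining candidate
            result ~= w * 64 + bit;
            candidates &= candidates - 1; // clear that bit and continue
        }
    }
    return result;
}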
 
 This version https://github.com/rainers/druntime/tree/gc_opt_sweep takes 
 advantage of the known size of allocations to skip unnecessary checks. 
 The last commit also adds support for keeping track of the size of 
 blocks of consecutive free pages. With this your example has more or 
 less constant collection time (note that most of the program time is 
 spent setting the array to zero, though not measured, and that the 
 allocation often triggers a collection, too).
Yes, especially when you are increasing the allocation size each time.
 I also noticed a rather serious bug for huge allocations: 
 https://issues.dlang.org/show_bug.cgi?id=19281
-Steve
Oct 01 2018
parent Rainer Schuetze <r.sagitario gmx.de> writes:
On 01/10/2018 15:51, Steven Schveighoffer wrote:
 On 10/1/18 3:21 AM, Rainer Schuetze wrote:
 A profiler reveals that most of the time is spent in "sweeping" the 
 memory, i.e. looking for allocations no longer referenced. The 
 existing implementation checks every page which causes a linear growth 
 of required CPU resources with used memory.
Ouch! I hadn't thought of that.  But we aren't checking actual *pages*, right, we are checking bits to see where allocations are present?
It is not touching the pages, but a byte array classifying each page as free, used, or first of a used block.
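For illustration, a hypothetical version of such a classification table (invented names, not druntime's actual enums), including the backwards walk needed to find the first page of a block from an interior pointer (the linear search for the front of a block mentioned earlier in the thread):

// Hypothetical per-page classification for a pool of "large" objects.
enum PageClass : ubyte { free, blockStart, blockContinuation }

// Walk backwards from a page (e.g. one hit by an interior pointer) to the
// first page of its block.
size_t findBlockStart(const PageClass[] pages, size_t pageIndex)
{
    while (pages[pageIndex] == PageClass.blockContinuation)
        --pageIndex;
    return pageIndex;
}

unittest
{
    // One free page followed by a three-page block.
    const pages = [PageClass.free, PageClass.blockStart,
                   PageClass.blockContinuation, PageClass.blockContinuation];
    assert(findBlockStart(pages, 3) == 1);
}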
 I also remember that the way the bitsets work, they are always allocated 
 for every 16 bytes, regardless of the block size for that page/pool. I 
 didn't love that feature but maybe it is fixed by now.
Not for the pool for "large" objects; it uses bits per page there. For small-object pools, it's still as you remember.
 I would imagine that checking 64 pages at once should be possible by 
 logic-anding the allocated and unmarked bits to check things as quickly 
 as possible.
With my patch, the number of pages of allocated blocks and regions of consecutive free pages are known (by consulting another array), so no bit-scanning is necessary.
Oct 03 2018