
digitalmars.D.learn - In-place extension of arrays only for certain alignment?

reply Ali Çehreli <acehreli yahoo.com> writes:
Related to my DConf 2022 lightning talk, I am noticing that D runtime's 
in-place array extension optimization is available only for array data 
that are at certain memory alignments.

I used the following program to test it. In addition to repeatedly 
adding an element to an array,

- 'version = neither' does not drop any element (this is "good" as well)

- 'version = bad' case drops the front element (a moving window of 1)

- 'version = good' case drops elements only when element data will hit a 
certain alignment value.

Is this expected?

import std.stdio;
import std.range;

// PLEASE UNCOMMENT ONLY ONE:
// version = bad;        // No in-place extension
// version = good;       // In-place extension happens
// version = neither;    // In-place extension happens

mixin assertSingleVersionOf!("bad", "good", "neither");

void main() {
   struct S {
     ubyte[4] data;  // Try different reasonable sizes.
                     // For example, 5 will not work even
                     // for the "good" case.
   }

   S[] arr;

   foreach (i; 0 .. 100_000) {
     const oldCap = arr.capacity;
     const oldPtr = arr.ptr;

     arr ~= S.init;

     if (arr.capacity != oldCap) {
       // The array needed to be extended...
       if (arr.ptr == oldPtr) {
         // ... but the pointer did not change
         writefln!"In-place extension -- element: %,s  capacity: %,s -> 
%,s  ptr: %s"(
           i, oldCap, arr.capacity, arr.ptr);
       }
     }

     version (neither) {
       // Do not remove any element; just extend
     }

     version (bad) {
       // Dropping 1 element inhibits in-place extension
       // (Many values other than 1 will inhibit as well.)
       arr = arr[1..$];

       // Even this does not help
       arr.assumeSafeAppend();
     }

     version (good) {
       // Dropping front elements equaling 2048 bytes works.
       // (Likely a GC magic constant.)

       enum magic = 2048;
       enum elementsPerPage = magic / S.sizeof;

       if (arr.length == elementsPerPage) {
         arr = arr[elementsPerPage..$];
       }
     }
   }
}

// A useful template that has nothing to do with this problem.
mixin template assertSingleVersionOf(args...) {
   import std.format : format;

   static assert (1 >= {
     size_t count = 0;
     static foreach (arg; args) {
       static assert (is (typeof(arg) == string));
       mixin (format!q{
         version (%s) {
           ++count;
         }
       }(arg));
     }
     return count;
   }(), format!"Please pick only one or none of %(%s, %)"([args]));
}

Ali
Aug 16 2022
next sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 8/16/22 2:11 PM, Ali Çehreli wrote:
 Related to my DConf 2022 lightning talk, I am noticing that D runtime's 
 in-place array extension optimization is available only for array data 
 that are at certain memory alignments.
 
No, it's based on 2 factors:

1. Is it a page-size-or-greater block?
2. Is there a free page after it?

The smaller blocks are stored in page-sized pools and *cannot* be combined 
together. e.g. you can't have 2 16-byte blocks joined to form a 32-byte block.

The reason why your `bad` version fails is because when it must reallocate, 
it still is only allocating 1 element.

Now a page is 4096 bytes typically. So why does the 2048 amount work? Because 
in order to store a 2048-byte array, you need 2048 bytes AND the metadata 
(e.g. typeinfo for destruction and append capacity). This requires a full page.

Note that once you reach page size, the optimization can happen, *even if you 
are only appending to an end slice*.

Here's something to try: when the capacity is less than the "magic" number of 
elements, `reserve` that number of elements. Then the "drop one element each 
loop" should use the optimization.

-Steve
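A minimal sketch of that suggestion (it reuses S and the 2048-byte constant 
from the original program; whether each reserve can extend in place depends 
on free pages following the block, so this only illustrates the idea):

```d
import std.stdio;

void main() {
    static struct S { ubyte[4] data; }      // as in the original test

    enum magic = 2048;
    enum elementsPerPage = magic / S.sizeof;

    S[] arr;
    size_t relocations;

    foreach (i; 0 .. 100_000) {
        const oldPtr = arr.ptr;

        // Steve's suggestion: top the capacity back up whenever it
        // drops below a page's worth of elements.
        if (arr.capacity < elementsPerPage)
            arr.reserve(elementsPerPage);

        if (oldPtr !is null && arr.ptr !is oldPtr)
            ++relocations;                   // reserve had to move the data

        arr ~= S.init;                       // should not reallocate on its own
        arr = arr[1 .. $];                   // sliding window of 1, as in version(bad)
    }

    writefln!"times the data was moved: %,s"(relocations);
}
```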
Aug 16 2022
parent reply Ali Çehreli <acehreli yahoo.com> writes:
Thank you for the quick response.

On 8/16/22 12:31, Steven Schveighoffer wrote:
 On 8/16/22 2:11 PM, Ali Çehreli wrote:
 Related to my DConf 2022 lightning talk, I am noticing that D
 runtime's in-place array extension optimization is available only for
 array data that are at certain memory alignments.
 No, it's based on 2 factors:

 1. Is it a page-size-or-greater block?
I assume the length of the new block.
 2. Is there a free page after it?
Makes sense.
 The reason why your `bad` version fails is because when it must
 reallocate, it still is only allocating 1 element.
That part I still don't understand. The same block of e.g. 16 bytes still has 
room. Why not use that remaining portion?

Here is a simpler test with better-than-expected results. Array c is what I 
want to do. In this test it works and I don't even need to call 
assumeSafeAppend() (this must be what you call "end slice"):

import std.stdio;

void main() {
   ubyte[] a;
   a ~= 0;
   a.length = 0;
   a.assumeSafeAppend();         // Needed
   assert(a.capacity == 15);

   // Essentially, the same as above
   ubyte[] b;
   b ~= 0;
   b = b[0..0];
   b.assumeSafeAppend();        // Needed
   assert(b.capacity == 15);

   ubyte[] c;
   c ~= 0;
   c = c[1..$];
   // c.assumeSafeAppend();
   assert(c.capacity == 14);
}
 metadata (e.g. typeinfo for destruction and append capacity).
I think it is 16 bytes total (on 64 bits): a void* plus a size_t. And I see 
this when I print .ptr: the change is always 0x10.
 Note that once you reach page size, the optimization can happen, *even
 if you are only appending to an end slice*. Here's something to try:
 when the capacity is less than the "magic" number of elements, `reserve`
 that number of elements. Then the "drop one element each loop" should
 use the optimization.
.reserve sounds promising but it will sometimes allocate memory and move 
elements even if I will not really need e.g. more than just one more element. 
(In my case, I may not know how many will be needed.) In other words, I 
really don't know how much to reserve.

What I seem to need is this function:

void increaseCapacityWithoutAllocating(T)(ref T[] arr) {
   // ...
}

Only the runtime seems to be able to implement that function. Can I call 
something in the runtime similar to how assumeSafeAppend() calls 
_d_arrayshrinkfit() in object.d?
 -Steve
Ali
Aug 16 2022
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 8/16/22 4:53 PM, Ali Çehreli wrote:
 On 8/16/22 12:31, Steven Schveighoffer wrote:
  >
  > No, it's based on 2 factors:
  >
  > 1. Is it a page-size-or-greater block?
 
 I assume the length of the new block.
No, the length of the *existing* block.

Everything in the memory allocator is in terms of pages. A pool is a section 
of pages. The large blocks are a *multiple* of pages, whereas the small 
blocks are pages that are *divided* into same-sized chunks.

When you want to allocate a block, if it's a half-page or less, then it goes 
into the small pool, where you can't glue 2 blocks together. If it's greater 
than 1/2 page, then it needs page+ sized blocks. These are too big really to 
set aside, so the allocator will just grab some pool that has enough free 
pages, and return it. These blocks can be stitched together depending on what 
is needed. Once freed, they can also be split up into pages. It's here where 
the capacity can grow without copying.
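A small, hypothetical demonstration of the difference (not from the thread): 
GC.extend on a small-pool block is expected to fail, while a page-sized block 
may be extendable if a free page happens to follow it:

```d
import core.memory : GC;
import std.stdio;

void main()
{
    // A small allocation lands in a chunked ("small") pool, so it can
    // never be glued to a neighbor; extend is expected to return 0.
    auto small = GC.malloc(16);
    writeln("extend 16-byte block:   ", GC.extend(small, 16, 16));

    // A page-sized allocation can be extended in place, but only if a
    // free page follows it in the pool; 0 is also possible here.
    auto big = GC.malloc(4096);
    writeln("extend 4096-byte block: ", GC.extend(big, 4096, 4096));
}
```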
  > The reason why your `bad` version fails is because when it must
  > reallocate, it still is only allocating 1 element.
 
 That part I still don't understand. The same block of e.g. 16 bytes 
 still has room. Why not use that remaining portion?
It does until it doesn't. Then it needs to reallocate.

Note that if you remove the front element, then you are pointing at a *slice* 
of the array.

Maybe some ASCII art? `A` is "used by the current slice", `x` is "allocated, 
but not referenced by the array". `.` is "part of the block but not used 
(i.e. can grow into this)". `M` is "metadata"

```d
auto arr = new ubyte[10];      // AAAA AAAA AA.. ...M
arr = arr[1 .. $];             // xAAA AAAA AA.. ...M
arr ~= 0;                      // xAAA AAAA AAA. ...M
arr ~= 0;                      // xAAA AAAA AAAA ...M
arr = arr[3 .. $];             // xxxx AAAA AAAA ...M
arr ~= cast(ubyte[])[1, 2, 3]; // xxxx AAAA AAAA AAAM // full!
arr ~= 1;                      // AAAA AAAA AAAA ...M // reallocated!
                               // arr.ptr has changed
```

Note in the last instance, it has now moved into a different 16-byte block -- 
the original is still intact, but just not pointed at. It will be garbage 
collected eventually. But you can see that it only allocates enough to hold 
what the slice already contains + the new elements.
 
 Here is a simpler test with better-than-expected results. Array c is 
 what I want to do. In this test it works and I don't even need to call 
 assumeSafeAppend() (this must be what you call "end slice"):
 
 import std.stdio;
 
 void main() {
    ubyte[] a;
    a ~= 0;
    a.length = 0;
    a.assumeSafeAppend();         // Needed
    assert(a.capacity == 15);
 
    // Essentially, the same as above
    ubyte[] b;
    b ~= 0;
    b = b[0..0];
    b.assumeSafeAppend();        // Needed
    assert(b.capacity == 15);
 
    ubyte[] c;
    c ~= 0;
    c = c[1..$];
    // c.assumeSafeAppend();
    assert(c.capacity == 14);
 }
Yes, the "appendability" of an array slice depends on whether it *ends* at the end of the "used" space. Otherwise, if it ends earlier, appending would stomp on memory possibly referenced elsewhere. If it ends later, then you did something wrong in your code, and now the situation is corrupted.
  > metadata (e.g. typeinfo for destruction and append capacity).
 
 I think it is 16 bytes total (on 64 bits): void* + size_t and I see this 
 when I print .ptr: The change is always 0x10.
Metadata isn't always stored. Plus, it's optimized for the block size. For example, any block that is 256 bytes or less only needs a single byte to store the "used" space. The TypeInfo needed for destruction of array elements is only stored if needed (e.g. an array of `int` does not need one).
 .reserve sounds promising but it will sometimes allocate memory and move 
 elements even if I will not really need e.g. more than just one more 
 element. (In my case, I may not know how many will be needed.) In other 
 words, I really don't know how much to reserve.
What is your focus? Why do you really want this "optimization" of gluing together items to happen?
 
 What I seem to need is this function:
 
 void increaseCapacityWithoutAllocating(T)(ref T[] arr) {
    // ...
 }
You can make one with:

https://dlang.org/phobos/core_memory.html#.GC.extend

-Steve
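A hypothetical sketch of such a helper built on GC.extend (the name and 
signature are illustrative; it assumes arr.ptr is still the base address of 
the GC block, i.e. the slice has not been advanced past the front, and that 
the block is page-sized or larger, since small blocks cannot be extended):

```d
import core.memory : GC;

/// Try to grow the block holding `arr` in place by `moreElements`.
/// Returns true if the GC found adjacent free space, false otherwise.
bool increaseCapacityWithoutAllocating(T)(ref T[] arr, size_t moreElements = 1)
{
    if (arr.ptr is null)
        return false;

    const moreBytes = moreElements * T.sizeof;
    // GC.extend returns the new total block size on success, 0 on failure
    // (e.g. for small blocks, or when no free page follows the block).
    return GC.extend(arr.ptr, moreBytes, moreBytes) != 0;
}
```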
Aug 16 2022
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 8/16/22 19:33, Steven Schveighoffer wrote:

 Everything in the memory allocator is in terms of pages. A pool is a
 section of pages. The large blocks are a *multiple* of pages, whereas
 the small blocks are pages that are *divided* into same-sized chunks.
Thank you. I am appreciating this discussion a lot.
 When you want to allocate a block, if it's a half-page or less, then it
 goes into the small pool, where you can't glue 2 blocks together.
That's how it should be, because we wouldn't want the extra work of tracking 
which blocks are glued together.
  > The reason why your `bad` version fails is because when it must
  > reallocate, it still is only allocating 1 element.
I will get back to this 1 element allocation below.
 That part I still don't understand. The same block of e.g. 16 bytes
 still has room. Why not use that remaining portion?
It does until it doesn't. Then it needs to reallocate.
I think my problem was caused by my test data being 16 bytes and (I think) 
the runtime picked memory from the 16-byte pool without any hope of future 
growth into an adjacent block.

Using a 16-byte block sounds like a good strategy at first because nobody 
knows whether an array will get more than one element.

However, if my guess is correct (i.e. the first element of size 16 bytes is 
placed on a 16-byte block), then the next allocation will always allocate 
memory for the second element.

One might argue that dynamic arrays are likely to have more than a single 
element, so the initial block should be at least twice the element size. This 
would cut memory allocation by 1 count for all arrays. And in my case of 
1-element arrays, allocation count would be halved. (Because I pay for every 
single append right now.)

Of course, I can understand that there can be applications where a large 
number of arrays (e.g. a 2D array) may have zero elements or one element, in 
which case my proposal of allocating the first element on e.g. a 32-byte 
block would be wasteful. I think such cases are rare, and we incur 1 extra 
allocation penalty for all arrays with that strategy.
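One way to check this guess (an illustrative probe, not from the original 
post; the element sizes are just examples) is to look at the capacity the 
runtime reports right after the first append, which hints at the size class 
of the block that was chosen:

```d
import std.stdio;

void probe(T)()
{
    T[] a;
    a ~= T.init;
    // capacity == 1 would mean the first element filled its block;
    // a larger value means the block has room to grow in place.
    writefln!"%s (%s bytes): capacity after first append = %s"(
        T.stringof, T.sizeof, a.capacity);
}

void main()
{
    static struct S4  { ubyte[4]  data; }
    static struct S16 { ubyte[16] data; }

    probe!ubyte();   // 1-byte elements
    probe!S4();      // 4-byte elements, as in the original test
    probe!S16();     // 16-byte elements, the case discussed here
}
```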
 Maybe some ASCII art? `A` is "used by the current slice", `x` is
 "allocated, but not referenced by the array". `.` is "part of the block
 but not used (i.e. can grow into this)". `M` is "metadata"

 ```d
 auto arr = new ubyte[10];      // AAAA AAAA AA.. ...M
 arr = arr[1 .. $];             // xAAA AAAA AA.. ...M
 arr ~= 0;                      // xAAA AAAA AAA. ...M
 arr ~= 0;                      // xAAA AAAA AAAA ...M
 arr = arr[3 .. $];             // xxxx AAAA AAAA ...M
 arr ~= cast(ubyte[])[1, 2, 3]; // xxxx AAAA AAAA AAAM // full!
 arr ~= 1;                      // AAAA AAAA AAAA ...M // reallocated!
 arr.ptr has changed
 ```
That all makes sense. I didn't think the meta data would be at the end but I sense it's related to the "end slice", so it's a better place there. (?)
 Metadata isn't always stored. Plus, it's optimized for the block size.
 For example, any block that is 256 bytes or less only needs a single
 byte to store the "used" space.
That's pretty interesting and smart.
 What is your focus? Why do you really want this "optimization" of gluing
 together items to happen?
This is about what you and I talked about in the past and something I 
mentioned in my DConf lightning talk this year. I am imagining a FIFO-like 
cache where elements of a source range are stored. There is a sliding window 
involved. I want to drop the unused front elements because they are expensive 
in 2 ways:

1) If the elements are not needed anymore, I have to move my slice forward so 
that the GC collects their pages.

2) If they are not needed anymore, I don't want to even copy them to a new 
block because this would be expensive, and in the case of an infinite source 
range, impossible.

The question is when to apply this dropping of old front elements. When I 
need to add one more element to the array, I can detect whether this *may* 
allocate by the expression 'arr.length == arr.capacity' but not really 
though, because the runtime may give me adjacent room without allocation. So 
I can delay the "drop the front elements" computation because there will be 
no actual copying at this time.

But the bigger issue is, because I drop elements my array never gets large 
enough to take advantage of this optimization and there is an allocation for 
every single append.
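A minimal sketch of that "delay the drop until an append may reallocate" 
idea (the helper name and the keep parameter are illustrative); as noted 
above, with a tiny window this alone does not remove the per-append 
allocations:

```d
// Only pay for dropping the front when the next append may reallocate.
void appendWithLazyDrop(T)(ref T[] window, T element, size_t keep)
{
    if (window.length == window.capacity)
    {
        // The next append *may* move the data, so forget the stale
        // front elements now; a reallocation then copies only the
        // kept part.
        if (window.length > keep)
            window = window[$ - keep .. $];
    }
    window ~= element;
}
```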
 https://dlang.org/phobos/core_memory.html#.GC.extend
Ok, that sounds very useful. In addition to "I can detect when it *may* 
allocate", I can detect whether there is adjacent free room. (I can ask for 
just a 1-element extension; I tested; and it works.) (I guess this GC.extend 
has to grab a GC lock.)

However, for that to work, I seem to need the initial block pointer that the 
GC knows about. (My sliding array's .ptr does not work, so I have to save the 
initial arr.ptr.)

Conclusion:

1) Although understanding the inner workings of the runtime is very useful 
and core.memory has interesting functions, it feels like too much fragile 
work to get exactly what I want. I should manage my own memory (likely still 
backed by the GC).

2) I argue that the initial allocation should be more than 1 element so that 
we skip 1 allocation for most arrays and 50% for my window-of-1 sliding 
window case.

Ali
Aug 17 2022
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 8/17/22 2:40 PM, Ali Çehreli wrote:
 On 8/16/22 19:33, Steven Schveighoffer wrote:
 
 Using a 16-byte block sounds like a good strategy at first because 
 nobody knows whether an array will get more than one element.
 
 However, if my guess is correct (i.e. the first element of size of 
 16-bytes is placed on a 16-byte block), then the next allocation will 
 always allocate memory for the second element.
A 16-byte element must be put in a 32-byte block, because you still need one 
byte for the metadata.
 
 One might argue that dynamic arrays are likely to have more than a 
 single element, so the initial block should at least be twice the 
 element size. This would cut memory allocation by 1 count for all 
 arrays. And in my case of 1-element arrays, allocation count would be 
 halved. (Because I pay for every single append right now.)
So yes, if you have a 32-byte block for 16-byte elements, it means you can 
only fit one element in the block. If you are using a sliding window 
approach, where you remove the first element and then append another, you 
will in effect reallocate on every append.

Using the append/popFront mechanism to implement your sliding window is going 
to perform badly. Appending is not designed to make this situation perform 
well.
 That all makes sense. I didn't think the meta data would be at the end 
 but I sense it's related to the "end slice", so it's a better place 
 there. (?)
It's for alignment. If I put 1 byte at the front, this means I have to always 
skip 7 or 15 more bytes (depending on alignment requirements).

BUT, I put the metadata at the front on big (page+ size) blocks, because I 
can both afford to skip 16 bytes in a block of 4096, and if you extend the 
block, there is no need to move the metadata later. Consider that the 
metadata lookup cache could be out of date if it had to move.
  > What is your focus? Why do you really want this "optimization" of gluing
  > together items to happen?
 
 This is about what you and I talked about in the past and something I 
 mentioned in my DConf lightning talk this year. I am imagining a 
 FIFO-like cache where elements of a source range are stored. There is a 
 sliding window involved. I want to drop the unused front elements 
 because they are expensive in 2 ways:
 
 1) If the elements are not needed anymore, I have to move my slice 
 forward so that the GC collects their pages.
 
 2) If they are not needed anymore, I don't want to even copy them to a 
 new block because this would be expensive, and in the case of an 
 infinite source range, impossible.
Ah! I actually have a solution for this in iopipe -- a ring buffer. 
Basically, you map the same physical pages of memory sequentially. It allows 
you to simply change the pointer, and never need to copy anything.

See this code for an example (I only have it for Posix, but Windows has 
similar features, I have to add them):

https://github.com/schveiguy/iopipe/blob/6a8c10d2858f92978d72c55eecc7ad55fcc207e2/source/iopipe/buffer.d#L306
 The question is when to apply this dropping of old front elements. When 
 I need to add one more element to the array, I can detect whether this 
 *may* allocate by the expression 'arr.length == arr.capacity' but not 
 really though, because the runtime may give me adjacent room without 
 allocation. So I can delay the "drop the front elements" computation 
 because there will be no actual copying at this time.
Even if you could do this, this doesn't help, because at the end of the pool 
you need to reallocate into another pool (or back to the beginning of the 
pool), because there can be no free pages after the last page in the pool 
(you can't merge pools together).
  > https://dlang.org/phobos/core_memory.html#.GC.extend
 
 Ok, that sounds very useful. In addition to "I can detect when it *may* 
 allocate", I can detect whether there is adjacent free room. (I can ask 
 for just 1 element extension; I tested; and it works.) (I guess this 
 GC.extend has to grab a GC lock.)
 
 However, for that to work, I seem to need the initial block pointer that 
 the GC knows about. (My sliding array's .ptr not work, so I have to save 
 the initial arr.ptr).
You can get this via `GC.query`, but that means 2 calls into the GC.
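For illustration, a sketch of those 2 calls (the function name is 
hypothetical): GC.query recovers the block base from the advanced slice 
pointer, then GC.extend is attempted on that base:

```d
import core.memory : GC;

// Recover the block base for a slice whose front has been advanced,
// then try to extend that block -- the 2 GC calls mentioned above.
bool tryExtendSlidingArray(T)(T[] arr, size_t moreBytes)
{
    auto info = GC.query(arr.ptr);   // finds the block containing the pointer
    if (info.base is null)
        return false;                // not (or no longer) a GC block
    return GC.extend(info.base, moreBytes, moreBytes) != 0;
}
```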
 Conclusion:
 
 1) Although understanding the inner workings of the runtime is very 
 useful and core.memory has interesting functions, it feels too much 
 fragile work to get exactly what I want. I should manage my own memory 
 (likely still backed by the GC).
 
 2) I argue that the initial allocation should be more than 1 element so 
 that we skip 1 allocation for most arrays and 50% for my window-of-1 
 sliding window case.
So 2 things here:

1. I highly recommend trying out the ring buffer solution to see if it helps. 
The disadvantage here is that you need to tie up at least a page of memory.

2. All my tests using the ring buffer showed little to no performance 
improvement over just copying back to the front of the buffer. So consider 
just copying the data back to the front of an already allocated block.

IIRC, your data does not need to be sequential in *physical memory*, which 
means you can use a ring buffer that is segmented instead of virtually 
mapped, and that can be of any size.

-Steve
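A minimal sketch of option 2, copying live data back to the front of one 
preallocated buffer (the names and sizes are illustrative, not code from 
iopipe):

```d
// A fixed buffer reused as a sliding window. When the window reaches
// the end of the buffer, still-needed elements are copied back to the
// front instead of reallocating.
struct SlidingWindow(T)
{
    private T[] buffer;     // allocated once (GC-backed is fine)
    private size_t start;   // index of the first live element
    private size_t end;     // one past the last live element

    this(size_t capacity) { buffer = new T[](capacity); }

    void put(T value)
    {
        assert(end - start < buffer.length, "window full; drop elements first");
        if (end == buffer.length)
        {
            // Out of room at the back: slide the live data to the front.
            const liveCount = end - start;
            foreach (i; 0 .. liveCount)
                buffer[i] = buffer[start + i];
            start = 0;
            end = liveCount;
        }
        buffer[end++] = value;
    }

    void dropFront(size_t n = 1)
    {
        assert(n <= end - start);
        start += n;
    }

    inout(T)[] window() inout { return buffer[start .. end]; }
}
```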
Aug 17 2022
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 8/17/22 18:31, Steven Schveighoffer wrote:

 1. I highly recommend trying out the ring buffer solution to see if it
 helps. The disadvantage here is that you need to tie up at least a page
 of memory.
I started to think holding on to multiple pages of memory should not matter 
anyway. If really needed, an array of Appenders could be used; when really 
really needed, they may come from a free list.

Aside: I looked at Appender's implementation and saw that extending is one of 
its concerns as well.
 2. All my tests using the ring buffer showed little to no performance
 improvement over just copying back to the front of the buffer. So
 consider just copying the data back to the front of an already allocated
 block.
That makes sense as well. One worry would be types with copy constructors. (I am not sure whether D is still a language where structs can freely be moved around.)
 IIRC, your data does not need to be sequential in *physical memory*,
 which means you can use a ring buffer that is segmented instead of
 virtually mapped, and that can be of any size.
I thought about that as well. But I would like the sizes of blocks (Appenders?) be equal in size so that opIndex still can provide O(1) guarantee. (Compute the block + an offset.)
 -Steve
Ali
Aug 17 2022
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 8/17/22 10:09 PM, Ali Çehreli wrote:
  > IIRC, your data does not need to be sequential in *physical memory*,
  > which means you can use a ring buffer that is segmented instead of
  > virtually mapped, and that can be of any size.
 
 I thought about that as well. But I would like the sizes of blocks 
 (Appenders?) be equal in size so that opIndex still can provide O(1) 
 guarantee. (Compute the block + an offset.)
It's still O(1). You only have 2 slices to worry about.

Perhaps next beerconf we can discuss!

-Steve
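For illustration, an O(1) opIndex over such a two-slice view (a hypothetical 
type, not Steve's or iopipe's code):

```d
// A ring buffer viewed as two slices of the same storage: the older
// elements at the tail and the newer ones wrapped around to the head.
struct TwoSliceView(T)
{
    T[] first;   // older part
    T[] second;  // newer, wrapped-around part

    size_t length() const { return first.length + second.length; }

    ref T opIndex(size_t i)
    {
        // One comparison decides which slice holds element i: O(1).
        return i < first.length ? first[i] : second[i - first.length];
    }
}
```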
Aug 17 2022
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 8/17/22 19:27, Steven Schveighoffer wrote:
 On 8/17/22 10:09 PM, Ali Çehreli wrote:
  > IIRC, your data does not need to be sequential in *physical memory*,
  > which means you can use a ring buffer that is segmented instead of
  > virtually mapped, and that can be of any size.

 I thought about that as well. But I would like the sizes of blocks
 (Appenders?) be equal in size so that opIndex still can provide O(1)
 guarantee. (Compute the block + an offset.)
It's still O(1). You only have 2 slices to worry about.
Sometimes 2... I wanted to leave the sliding window width dynamic.

So, there will be M buffers, not 2. If their lengths are not equal, opIndex 
must be O(M). M is expected to be small but still... M buffers of 'pageSize - 
(meta data).sizeof' each.

BeerConf... Sure... :)

Ali
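A sketch of the equal-sized-blocks indexing described above (the names are 
illustrative): with a fixed block length, opIndex is one division and one 
modulus, independent of M:

```d
// M buffers of identical length; element i lives at
// blocks[i / blockLength][i % blockLength], so indexing stays O(1).
struct SegmentedWindow(T)
{
    T[][] blocks;         // M buffers, all of the same length
    size_t blockLength;   // e.g. a page's worth of elements
    size_t frontOffset;   // elements already dropped from the first block

    ref T opIndex(size_t i)
    {
        const j = i + frontOffset;
        return blocks[j / blockLength][j % blockLength];
    }
}
```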
Aug 17 2022
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 8/18/22 1:16 AM, Ali Çehreli wrote:
 On 8/17/22 19:27, Steven Schveighoffer wrote:
  > On 8/17/22 10:09 PM, Ali Çehreli wrote:
  >>  > IIRC, your data does not need to be sequential in *physical memory*,
  >>  > which means you can use a ring buffer that is segmented instead of
  >>  > virtually mapped, and that can be of any size.
  >>
  >> I thought about that as well. But I would like the sizes of blocks
  >> (Appenders?) be equal in size so that opIndex still can provide O(1)
  >> guarantee. (Compute the block + an offset.)
  >
  > It's still O(1). You only have 2 slices to worry about.
 
 Sometimes 2... I wanted to leave the sliding window width dynamic.
 
 So, there will be M buffers, not 2. If their lengths are not equal, 
 opIndex must be O(M). M is expected to be small but still...
Once you need more size, you can reallocate (it should stabilize). A ring buffer of N values where N doesn't change should only require one buffer, 2 slices of that buffer. -Steve
Aug 19 2022
prev sibling parent Salih Dincer <salihdb hotmail.com> writes:
On Tuesday, 16 August 2022 at 18:11:54 UTC, Ali Çehreli wrote:
 ```d
     version (good) {
       // Dropping front elements equaling 2048 bytes works.
       // (Likely a GC magic constant.)

       enum magic = 2048;
       enum elementsPerPage = magic / S.sizeof;

       if (arr.length == elementsPerPage) {
         arr = arr[elementsPerPage..$];
       }
     }
 ```
First I tried it on an older version (2.0.87). And with version(good) I got 
interesting results:

```d
(length == 1, capacity: 2031 -> 6127) i = 4079, 8175
(length == 2, capacity: 1015 -> 3063) i = 2039, 4087, 6135, 8183
(length == 3, null) i > 10.000 = ?
(length == 4, capacity: 507 -> 1531) i = 1019, 2043, 3067, 4091, 5115, 6139, 7163, 8187, 9211
(length == 5, 6, 7..., null) i > 10.000 = ?
```

For some reason the capacity changed frequently when data.length == 4. And I 
didn't see a result in 3, 5, 6, 7 since I tested less than 10k loops.

I got more interesting results with the extra version:

```d
version(extra) arr.length++;
/* 2087

291 elements, capacity: 582 -> 1167, ptr: 7F1306346010
1169 elements, capacity: 2338 -> 4093, ptr: 7F1306346010
2339 elements, capacity: 4678 -> 8189, ptr: 7F1306346010
4387 elements, capacity: 8774 -> 14626, ptr: 7F1306346010
7313 elements, capacity: 14626 -> 23403, ptr: 7F1306346010

Process finished.
*/
```

version(neither) is concurrent with i:

```d
(length == 1, capacity: i) i = 4079, 8175..16367
(length == 2) i = 2039, 4087, 8183..14327
(length == 3, capacity: i) i = 1359, 2725, 5455, 9551..16378
(length == 4, capacity: i) i = 1019, 2043, 4091, 7163..12283
(length == 5, capacity: i) i = 815, 1635, 3273, 5731, 9827..16380
```

SDB 79
Aug 16 2022