digitalmars.D.learn - Question about CPU caches and D context pointers
- Etienne (55/55) Feb 17 2014 I've had this question at the back of my mind and I know it's
- Tobias Pankrath (7/11) Feb 17 2014 I am by far no expert. But all your buffers reside on the heap
- Dicebot (10/10) Feb 18 2014 None of your buffers are on stack in both examples. As those are
- "Casper Færgemand" <shorttail hotmail.com> (11/14) Feb 18 2014 That.
I've had this question at the back of my mind and I know it's probably related to back-end optimizations, but I'm taking a chance to see if anyone knows anything. I know how insignificant the speed difference may be, but keep in mind this is to further my low-level understanding. Here's an example to illustrate the question, because it's quite complicated (to me):

struct Contents {
    ubyte[] m_buffer;

    void initialize() {                 // structs can't have a no-arg this()
        m_buffer.reserve(4092);
    }

    void rcv(string str) {
        m_buffer ~= cast(const(ubyte)[]) str;
    }

    void flush() {
        // send_4092_bytes_of_data_to_final_heap_buffer();
        m_buffer.length = 0;
        m_buffer.assumeSafeAppend();    // reuse the same allocation
    }
}

vs..

void rcv(string str) {
    send_small_bytes_of_data_to_final_heap_buffer(str);
}

The first case is the struct. When entering the rcv() function, I know the pointer and length of m_buffer are on the stack at that point. That's pretty fast to access because the CPU keeps them in the L1 cache through the whole routine. However, it's not obvious to me whether the memory m_buffer points to will stay in the CPU cache across 5 or so consecutive calls to this same routine in the same thread. Also note that it flushes to another buffer, so there are more heap round-trips between buffers if the CPU cache isn't efficient.

The second case (context-less) just sends the string straight through to the final allocation procedure (another buffer); the string stays a function parameter, so it's on the stack and thus in the CPU cache through every call frame until the malloc takes place (one heap round-trip regardless of any optimization).

So, is there any chance the region m_buffer points to stays in the CPU cache across thousands of consecutive calls to the struct's rcv(), or do I have to keep the data on the stack and send it straight to the allocator? Is there an easy way to visualize how the CPU cache empties or fills itself, or to guarantee heap data stays in it without using the stack?
I'm sorry if the question seems complicated, I read everything Ulrich Drepper had to say in What every programmer should know about memory, and I still have a bit of a hard time visualizing the question myself.
Feb 17 2014
On Tuesday, 18 February 2014 at 03:15:59 UTC, Etienne wrote:

> I'm sorry if the question seems complicated, I read everything Ulrich Drepper had to say in What Every Programmer Should Know About Memory, and I still have a bit of a hard time visualizing the question myself.

I am by far no expert, but all your buffers reside on the heap, and introducing an extra copy into m_buffer is another chance to trash the cache. I wouldn't worry too much about the stack data; it's only a few pointer/length pairs, which would even fit into registers.

cachegrind is a Linux tool for cache simulation.
Feb 17 2014
None of your buffers are on stack in both examples. As those are dynamic arrays you only get pointer + length as value and data itself resides on heap in some unknown location. It can be in cache too, of course, if it has been used actively, but it can't be verified based on this simple snippet. To get best cache friendliness one may use dynamic-size structs akin to C but, of course, this does not go well with necessity to grow the buffer dynamically for given instance. Also note that your first snippet uses ~= operator which will cause new heap allocations invalidating any old cache.
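The reallocation point can be demonstrated directly: a minimal sketch (assuming the standard D GC runtime) using `reserve` so that repeated `~=` stays within one allocation instead of moving the data and invalidating whatever was cached:

```d
import std.stdio;

void main()
{
    ubyte[] buf;
    buf.reserve(4092);            // one up-front heap allocation
    buf ~= 0;                     // first append lands in the reserved block
    auto p = buf.ptr;

    foreach (i; 1 .. 4092)
        buf ~= cast(ubyte) i;     // subsequent appends stay in place

    // no reallocation: the data kept its (possibly still cached) address
    assert(buf.ptr is p);
    writeln(buf.length);
}
```

Without the `reserve`, the runtime may periodically move the buffer to a larger block as it grows, and each move touches fresh, cold memory.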
Feb 18 2014
On Tuesday, 18 February 2014 at 08:11:04 UTC, Dicebot wrote:

> None of your buffers are on stack in both examples. As those are dynamic arrays you only get pointer + length as value and data itself resides on heap in some unknown location.

That.

struct S {}
class C {}

S[] s1;  // fat pointer to an array of structs
S[2] s2; // array of two structs, plus a length?
C[] c1;  // fat pointer to an array of pointers
C[2] c2; // array of two pointers, plus a length?

I tested some prime sieves in both C++ and D. They worked fastest with dynamic arrays sized to match the L1 cache. I presume the instructions are located elsewhere in the CPU.
Feb 18 2014
On Tuesday, 18 February 2014 at 18:13:24 UTC, Casper Færgemand wrote:

> S[2] s2; // array of two structs, plus a length?

Storing a length is not needed for static arrays because it is known, well, statically.

> I tested some prime sieves in both C++ and D. They worked fastest with dynamic arrays sized to match the L1 cache. I presume the instructions are located elsewhere in the CPU.

Yes, the instruction cache is completely separate from the data cache.
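To make the layout concrete, here is a small sketch of the four declarations above (assuming a 64-bit target, where pointers are 8 bytes; the numbers would halve on 32-bit):

```d
import std.stdio;

struct S {}
class C {}

void main()
{
    // a slice is a (length, ptr) pair: 16 bytes on 64-bit, for any element type
    static assert((S[]).sizeof == 16);
    static assert((C[]).sizeof == 16);

    // a static array stores only its elements; the length is compile-time info
    static assert(S[2].sizeof == 2 * S.sizeof);       // empty struct: 1 byte each
    static assert(C[2].sizeof == 2 * (void*).sizeof); // two class references

    writeln("all layout checks passed");
}
```

So neither static-array declaration carries a runtime length; only the slices do.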
Feb 18 2014
On 2014-02-18 1:13 PM, "Casper Færgemand" <shorttail hotmail.com> wrote:

> I tested some prime sieves in both C++ and D. They worked fastest with dynamic arrays sized to match the L1 cache. I presume the instructions are located elsewhere in the CPU.

Does that mean ubyte[] buf = new ubyte[4092] is more likely to end up in the CPU cache than ubyte[4092] buf, or the inverse?
Feb 18 2014
On Tuesday, 18 February 2014 at 19:55:20 UTC, Etienne wrote:

> Does that mean ubyte[] buf = new ubyte[4092] is more likely to end up in the CPU cache than ubyte[4092] buf, or the inverse?

It is irrelevant. If the data is used continuously it will end up in the cache anyway (assuming it fits); its location does not matter. What is important is whether it is already prefetched upon first access, which is more likely for static arrays, as they reside in the same memory page as their aggregate. That will save you a cache miss when you change buffer contexts frequently. But if you use the same buffer all the time during a continuous code flow, it shouldn't matter where exactly it is located, as it simply won't be evicted from the cache by newer data.
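The "same memory page as their aggregate" point can be checked in a few lines; a sketch, with struct names made up for illustration:

```d
import std.stdio;

struct Inline  { ubyte[64] data; }  // payload embedded in the struct itself
struct Pointed { ubyte[] data; }    // struct holds only a 16-byte slice header

void main()
{
    Inline a;                        // payload lives at the struct's address
    auto b = Pointed(new ubyte[64]); // payload lives elsewhere on the heap

    // the static array's bytes sit at the struct's own address, so
    // fetching the struct's cache line brings the payload along with it
    assert(cast(void*) a.data.ptr is cast(void*) &a);

    // the dynamic array's bytes are behind an extra indirection
    assert(cast(void*) b.data.ptr !is cast(void*) &b);

    writeln("inline payload shares its aggregate's address");
}
```

This is why the first touch of a static-array member is more likely to be a cache hit, even though steady-state reuse of either buffer behaves the same.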
Feb 18 2014