digitalmars.D.learn - Question about CPU caches and D context pointers
- Etienne (55/55) Feb 17 2014 I've had this question at the back of my mind and I know it's
- Tobias Pankrath (7/11) Feb 17 2014 I am by far no expert. But all your buffers reside on the heap
- Dicebot (10/10) Feb 18 2014 None of your buffers are on stack in both examples. As those are
- "Casper Færgemand" <shorttail hotmail.com> (11/14) Feb 18 2014 That.
I've had this question at the back of my mind and I know it's probably related to back-end optimizations, but I'm taking a chance to see if anyone knows anything. I know how insignificant the speed difference may be, but keep in mind this is to further my low-level understanding. Here's an example to illustrate the question, because it's quite complicated (to me):

struct Contents {
    ubyte[] m_buffer;

    void initialize() {                 // structs can't have a no-arg this()
        m_buffer.reserve(4092);
    }

    void rcv(string str) {
        m_buffer ~= cast(const(ubyte)[]) str;
    }

    void flush() {
        // send_4092_bytes_of_data_to_final_heap_buffer();
        m_buffer.length = 0;
        m_buffer.assumeSafeAppend();    // reuse the same allocation
    }
}

vs..

void rcv(string str) {
    send_small_bytes_of_data_to_final_heap_buffer(str);
}

The first case is the struct. When entering the rcv() function, I know the pointer and length of m_buffer are on the stack at that point. That's pretty fast to access because the CPU keeps them in the L1 cache through the whole routine. However, it's not obvious to me whether the memory m_buffer points to will stay in the CPU cache across 5 or so consecutive calls to this same routine in the same thread. Also note that it flushes to another buffer, so there are more heap round-trips between buffers if the CPU cache isn't efficient.

The second case (context-less) just sends the string straight through to the final allocation procedure (another buffer); the string stays a function parameter, so it's on the stack and thus in the CPU cache through every call frame until the malloc takes place (one heap round-trip regardless of any optimization).

So, is there any chance the region m_buffer points to stays in the CPU cache across thousands of consecutive calls to the struct's rcv(), or do I have to keep the data on the stack and send it straight to the allocator? Is there an easy way to visualize how the CPU cache empties or fills itself, or to guarantee heap data stays in it without using the stack?
I'm sorry if the question seems complicated, I read everything Ulrich Drepper had to say in What every programmer should know about memory, and I still have a bit of a hard time visualizing the question myself.
Feb 17 2014
On Tuesday, 18 February 2014 at 03:15:59 UTC, Etienne wrote:

> I'm sorry if the question seems complicated, I read everything Ulrich Drepper had to say in What Every Programmer Should Know About Memory, and I still have a bit of a hard time visualizing the question myself.

I am by far no expert, but all your buffers reside on the heap, and introducing an extra copy into m_buffer is another chance to trash the cache. I wouldn't worry too much about the stack data; it's only a few pointer/length pairs, which would even fit into registers.

cachegrind is a Linux tool for cache simulation.
Feb 17 2014
None of your buffers are on stack in both examples. As those are dynamic arrays you only get pointer + length as value and data itself resides on heap in some unknown location. It can be in cache too, of course, if it has been used actively, but it can't be verified based on this simple snippet. To get best cache friendliness one may use dynamic-size structs akin to C but, of course, this does not go well with necessity to grow the buffer dynamically for given instance. Also note that your first snippet uses ~= operator which will cause new heap allocations invalidating any old cache.
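The reallocation point can be demonstrated directly: a minimal sketch (assuming the standard D GC runtime) using `reserve` so that repeated `~=` stays within one allocation instead of moving the data and invalidating whatever was cached:

```d
import std.stdio;

void main()
{
    ubyte[] buf;
    buf.reserve(4092);            // one up-front heap allocation
    buf ~= 0;                     // first append lands in the reserved block
    auto p = buf.ptr;

    foreach (i; 1 .. 4092)
        buf ~= cast(ubyte) i;     // subsequent appends stay in place

    // no reallocation: the data kept its (possibly still cached) address
    assert(buf.ptr is p);
    writeln(buf.length);
}
```

Without the `reserve`, the runtime may periodically move the buffer to a larger block as it grows, and each move touches fresh, cold memory.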
Feb 18 2014
On Tuesday, 18 February 2014 at 08:11:04 UTC, Dicebot wrote:

> None of your buffers are on stack in both examples. As those are dynamic arrays you only get pointer + length as value and data itself resides on heap in some unknown location.

That.

struct S {}
class C {}

S[] s1;  // fat pointer to an array of structs
S[2] s2; // array of two structs, plus a length?
C[] c1;  // fat pointer to an array of pointers
C[2] c2; // array of two pointers, plus a length?

I tested some prime sieves in both C++ and D. They worked fastest with dynamic arrays sized to match the L1 cache. I presume the instructions are located elsewhere in the CPU.
Feb 18 2014
On Tuesday, 18 February 2014 at 18:13:24 UTC, Casper Færgemand wrote:

> S[2] s2; // array of two structs, plus a length?

Storing a length is not needed for static arrays because it is known, well, statically.

> I tested some prime sieves in both C++ and D. They worked fastest with dynamic arrays sized to match the L1 cache. I presume the instructions are located elsewhere in the CPU.

Yes, the instruction cache is completely separate from the data cache.
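To make the layout concrete, here is a small sketch of the four declarations above (assuming a 64-bit target, where pointers are 8 bytes; the numbers would halve on 32-bit):

```d
import std.stdio;

struct S {}
class C {}

void main()
{
    // a slice is a (length, ptr) pair: 16 bytes on 64-bit, for any element type
    static assert((S[]).sizeof == 16);
    static assert((C[]).sizeof == 16);

    // a static array stores only its elements; the length is compile-time info
    static assert(S[2].sizeof == 2 * S.sizeof);       // empty struct: 1 byte each
    static assert(C[2].sizeof == 2 * (void*).sizeof); // two class references

    writeln("all layout checks passed");
}
```

So neither static-array declaration carries a runtime length; only the slices do.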
Feb 18 2014
On 2014-02-18 1:13 PM, "Casper Færgemand" <shorttail hotmail.com> wrote:

> I tested some prime sieves in both C++ and D. They worked fastest with dynamic arrays sized to match the L1 cache. I presume the instructions are located elsewhere in the CPU.

Does that mean ubyte[] buf = new ubyte[4092] is more likely to end up in the CPU cache than ubyte[4092] buf, or the inverse?
Feb 18 2014
On Tuesday, 18 February 2014 at 19:55:20 UTC, Etienne wrote:

> Does that mean ubyte[] buf = new ubyte[4092] is more likely to end up in the CPU cache than ubyte[4092] buf, or the inverse?

It is irrelevant. If the data is used continuously it will end up in the cache anyway (assuming it fits); its location does not matter. What is important is whether it is already prefetched upon first access, which is more likely for static arrays, as they reside in the same memory page as their aggregate. That will save you a cache miss when you change buffer contexts frequently. But if you use the same buffer all the time during a continuous code flow, it shouldn't matter where exactly it is located, as it simply won't be evicted from the cache by newer data.
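The "same memory page as their aggregate" point can be checked in a few lines; a sketch, with struct names made up for illustration:

```d
import std.stdio;

struct Inline  { ubyte[64] data; }  // payload embedded in the struct itself
struct Pointed { ubyte[] data; }    // struct holds only a 16-byte slice header

void main()
{
    Inline a;                        // payload lives at the struct's address
    auto b = Pointed(new ubyte[64]); // payload lives elsewhere on the heap

    // the static array's bytes sit at the struct's own address, so
    // fetching the struct's cache line brings the payload along with it
    assert(cast(void*) a.data.ptr is cast(void*) &a);

    // the dynamic array's bytes are behind an extra indirection
    assert(cast(void*) b.data.ptr !is cast(void*) &b);

    writeln("inline payload shares its aggregate's address");
}
```

This is why the first touch of a static-array member is more likely to be a cache hit, even though steady-state reuse of either buffer behaves the same.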
Feb 18 2014