digitalmars.D - Determining cache sizes -- request for testing

Don (19/19) Sep 10 2008 To implement efficient memory-intensive operations (memcpy, array

Denis Koroskin (11/30) Sep 10 2008 Vendor string: AuthenticAMD
Tomas Lindquist Olsen (10/10) Sep 10 2008 Vendor string: AuthenticAMD
Tomas Lindquist Olsen (10/10) Sep 10 2008 Vendor string: GenuineIntel
Don (3/3) Sep 10 2008 That's already shown up a heap of bugs. Aargh.

torhu (12/12) Sep 10 2008 This is a thunderbird 1.4. According to cpu-z, the L1 cache is really

Manuel König (69/92) Sep 10 2008 Should the Pentium4 HyperThreading Technology count as two cores?
bearophile (17/20) Sep 10 2008 Your code is surely useful, but take a look at "cache oblivious data str...

bearophile (4/5) Sep 10 2008 I don't know if my CPU has HyperThreading, I presume not.
Don (12/22) Sep 10 2008 Correct.

bearophile (11/13) Sep 10 2008 I don't understand. And I have L2 cache.

Don (12/26) Sep 10 2008 (1) Unfortunately, I don't think it's possible to determine the amount

Manfred_Nowak (7/9) Sep 10 2008 Because without help of the OS, one will only get the available virtual

TomD (10/10) Sep 10 2008 Vendor string: GenuineIntel
dsimcha (264/286) Sep 10 2008 Seems to get things wrong for quad core Intels. No, I'm not rich enough...
"Jérôme M. Berger" (75/78) Sep 10 2008 I get:
Julio César Carrascal Urquijo (12/21) Sep 10 2008 Vendor string: AuthenticAMD
Craig Black (10/10) Sep 10 2008 Don,

Don (10/20) Sep 11 2008 I don't think that's possible.

Bruno Medeiros (14/37) Sep 23 2008 Vendor string: AuthenticAMD

Don <nospam nospam.com.au> writes:

To implement efficient memory-intensive operations (memcpy, array 
operations, matrix multiplication, etc), you really need to know the 
sizes of the data caches.
Although most modern CPUs provide methods to determine the sizes of 
their built-in caches, it's a complete pig's breakfast. There are 
multiple complicated methods, and documentation is scant.
I've written some code to make this mess usable, and provide what you 
really want. For each level of cache, the code provides size in KB, ways 
of associativity, and the cache line size.

The attached code should eventually become part of std.cpuid, and an 
equivalent module in Tango. But, it needs significant further testing.

Please compile and run the code, and report the results. Any results 
would be useful, but particularly valuable would be:
(1) Multicore AMD machines;
(2) Early AMD machines (K6 or earlier);
(3) Early Intel machines;
(4) Anything from another manufacturer;
(5) Any crashes or obvious bugs.

Public domain.
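Don's attached D code is not reproduced in this archive. As a rough illustration of one of the "multiple complicated methods" he mentions, here is a minimal C sketch (C only because the archive contains no code to match) that decodes Intel's CPUID leaf 4 "deterministic cache parameters" from raw register values into the three numbers the post promises: size, ways and line size. The struct and function names are hypothetical, not part of std.cpuid or Tango, and AMD processors report their caches through other CPUID functions.

```c
#include <stdint.h>

/* One sub-leaf of Intel CPUID leaf 4 ("deterministic cache parameters"),
   decoded from the raw EAX/EBX/ECX values CPUID returned. */
typedef struct {
    unsigned type;       /* 0 = no more caches, 1 = data, 2 = instruction, 3 = unified */
    unsigned level;      /* 1 = L1, 2 = L2, ... */
    unsigned ways;       /* ways of associativity */
    unsigned partitions; /* physical line partitions */
    unsigned line_size;  /* bytes per line */
    uint64_t sets;       /* number of sets */
    uint64_t size_kb;    /* total size in KB */
} leaf4_cache;

leaf4_cache decode_leaf4(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    leaf4_cache c;
    c.type       = eax & 0x1F;                /* EAX[4:0]        */
    c.level      = (eax >> 5) & 0x7;          /* EAX[7:5]        */
    c.ways       = ((ebx >> 22) & 0x3FF) + 1; /* EBX[31:22] + 1  */
    c.partitions = ((ebx >> 12) & 0x3FF) + 1; /* EBX[21:12] + 1  */
    c.line_size  = (ebx & 0xFFF) + 1;         /* EBX[11:0] + 1   */
    c.sets       = (uint64_t)ecx + 1;         /* ECX + 1         */
    /* size = ways * partitions * line size * sets */
    c.size_kb = (uint64_t)c.ways * c.partitions * c.line_size * c.sets / 1024;
    return c;
}
```

In a real program the registers would come from executing CPUID with EAX=4 and ECX=0, 1, 2 and so on until `type` comes back 0; with GCC or Clang, `<cpuid.h>` provides the `__cpuid_count` macro for that. Leaf 4 only exists when CPUID leaf 0 reports a maximum standard leaf of at least 4.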

Sep 10 2008

"Denis Koroskin" <2korden gmail.com> writes:

Vendor string:    AuthenticAMD
Processor string: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
Signature:        Family=15 Model=35 Stepping=2
Features:         MMX FXSR SSE SSE2 SSE3 3DNow! 3DNow!+ MMX+ AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=F Model=3 Stepping=2
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=512K, ways=16 linesize=0
Level 3 size=4194303K, ways=1 linesize=0
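This report shows `linesize=0` for the L2 cache. On AMD processors the L2 descriptor comes from ECX of CPUID function 0x80000006, with the line size in the low byte. A minimal C decoding sketch follows; the names and the fully-associative sentinel are assumptions of this sketch, only the associativity codes listed in the switch are translated, and the thread does not say what produced the zero, so this is not a diagnosis of Don's code.

```c
#include <stdint.h>

/* AMD CPUID function 0x80000006, ECX: L2 cache descriptor.
   [31:16] size in KB, [15:12] encoded associativity,
   [11:8] lines per tag, [7:0] line size in bytes. */
typedef struct { unsigned size_kb, ways, lines_per_tag, line_size; } amd_l2;

#define AMD_FULLY_ASSOC 0xFFFFu /* sentinel chosen for this sketch */

/* Translate the 4-bit associativity code. 0 means "disabled"; codes not
   listed here are also returned as 0 by this sketch. */
static unsigned amd_l2_ways(unsigned code)
{
    switch (code) {
    case 0x1: return 1;   /* direct mapped */
    case 0x2: return 2;
    case 0x4: return 4;
    case 0x6: return 8;
    case 0x8: return 16;
    case 0xF: return AMD_FULLY_ASSOC;
    default:  return 0;
    }
}

amd_l2 decode_amd_l2(uint32_t ecx)
{
    amd_l2 c;
    c.size_kb       = (ecx >> 16) & 0xFFFF;
    c.ways          = amd_l2_ways((ecx >> 12) & 0xF);
    c.lines_per_tag = (ecx >> 8) & 0xF;
    c.line_size     = ecx & 0xFF;
    return c;
}
```

The test value below is constructed for a 512 KB, 16-way cache with 64-byte lines, matching the size and ways in this report; it was not read from Denis's machine.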

Sep 10 2008

Tomas Lindquist Olsen <tomas famolsen.dk> writes:

Vendor string:    AuthenticAMD
Processor string: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
Signature:        Family=15 Model=43 Stepping=1
Features:         MMX FXSR SSE SSE2 SSE3 3DNow! 3DNow!+ MMX+ AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=F Model=B Stepping=1
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=512K, ways=16 linesize=0
Level 3 size=4194303K, ways=1 linesize=0

Sep 10 2008

Tomas Lindquist Olsen <tomas famolsen.dk> writes:

Vendor string:    GenuineIntel
Processor string: Intel(R) Celeron(R) CPU          550    2.00GHz
Signature:        Family=6 Model=22 Stepping=1
Features:         MMX FXSR SSE SSE2 SSE3 SSSE3 AMD64
Multithreading:   1 threads / 1 cores

Family=6 Model=6 Stepping=1
Data caches:
Level 1 size=32K, ways=8 linesize=64
Level 2 size=1024K, ways=4 linesize=64
Level 3 size=4194303K, ways=1 linesize=64

Sep 10 2008

Don <nospam nospam.com.au> writes:

That's already shown up a heap of bugs. Aargh.
Here's an improved version. But I still don't understand why it's 
reporting no L3 cache on the AMD64 machines.

Sep 10 2008

torhu <no spam.invalid> writes:

This is a Thunderbird 1.4.  According to CPU-Z, the L1 cache is really 
64+64 KB, with a 64-byte line size.  The rest looks correct.

Vendor string:    AuthenticAMD
Processor string: AMD Athlon(tm) processor
Signature:        Family=6 Model=4 Stepping=4
Features:         MMX FXSR 3DNow! 3DNow!+ MMX+
Multithreading:   1 threads / 1 cores

Family=6 Model=4 Stepping=4
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=256K, ways=16 linesize=64
Level 3 size=4194303K, ways=1 linesize=64
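torhu's correction (a 64 KB L1 data cache with 64-byte lines, versus the reported 8K, 2-way, 32-byte) concerns the L1 data-cache descriptor, which AMD reports in ECX of CPUID function 0x80000005. The minimal C sketch below decodes that register; the names are hypothetical, and the test value is constructed to match what CPU-Z reports rather than read from torhu's machine.

```c
#include <stdint.h>

/* AMD CPUID function 0x80000005, ECX: L1 data-cache descriptor.
   [31:24] size in KB, [23:16] associativity (0xFF = fully associative),
   [15:8] lines per tag, [7:0] line size in bytes. */
typedef struct { unsigned size_kb, ways, lines_per_tag, line_size; } amd_l1d;

amd_l1d decode_amd_l1d(uint32_t ecx)
{
    amd_l1d c;
    c.size_kb       = (ecx >> 24) & 0xFF;
    c.ways          = (ecx >> 16) & 0xFF;
    c.lines_per_tag = (ecx >> 8) & 0xFF;
    c.line_size     = ecx & 0xFF;
    return c;
}
```

Unlike the L2 descriptor, the L1 associativity byte is the literal number of ways, so no lookup table is needed here.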

Sep 10 2008

Manuel König <manuelk89 gmx.net> writes:

Should the Pentium 4's Hyper-Threading Technology count as two cores?
$ ./cache
Vendor string:    GenuineIntel
Processor string: Intel(R) Pentium(R) 4 CPU 3.00GHz
Signature:        Family=15 Model=2 Stepping=9
Features:         MMX FXSR SSE SSE2 HTT
Multithreading:   1 threads / 1 cores

Family=F Model=2 Stepping=9
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=512K, ways=8 linesize=64
Level 3 size=4194303K, ways=1 linesize=64

$  cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping        : 9
cpu MHz         : 2992.567
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips        : 5990.83
clflush size    : 64
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping        : 9
cpu MHz         : 2992.567
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips        : 5986.89
clflush size    : 64
power management:
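On the Hyper-Threading question: the HTT feature flag only says the package exposes more than one logical processor, and it is also set on multi-core parts without Hyper-Threading. One common way to tell the cases apart on Intel is to compare the logical-processor count from CPUID leaf 1 with the core count from leaf 4, as in this minimal C sketch. The function names are hypothetical; both fields are "maximum addressable IDs" and can exceed the real counts, and AMD reports cores through a different function.

```c
#include <stdint.h>

/* Logical processors per package from CPUID leaf 1: EBX[23:16] holds the
   maximum number of addressable logical-processor IDs, meaningful only
   when the HTT flag (EDX bit 28) is set. */
unsigned logical_per_package(uint32_t leaf1_ebx, uint32_t leaf1_edx)
{
    if (((leaf1_edx >> 28) & 1) == 0)
        return 1;
    unsigned n = (leaf1_ebx >> 16) & 0xFF;
    return n ? n : 1;
}

/* Cores per package on Intel from CPUID leaf 4: EAX[31:26] + 1. Leaf 4
   does not exist when the maximum standard leaf is below 4, so assume a
   single core in that case. */
unsigned cores_per_package(uint32_t max_leaf, uint32_t leaf4_eax)
{
    if (max_leaf < 4)
        return 1;
    return ((leaf4_eax >> 26) & 0x3F) + 1;
}

/* Hardware threads sharing each core (1 = no Hyper-Threading). */
unsigned threads_per_core(uint32_t max_leaf, uint32_t leaf1_ebx,
                          uint32_t leaf1_edx, uint32_t leaf4_eax)
{
    unsigned logical = logical_per_package(leaf1_ebx, leaf1_edx);
    unsigned cores   = cores_per_package(max_leaf, leaf4_eax);
    return logical > cores ? logical / cores : 1;
}
```

The /proc/cpuinfo dump above shows `cpuid level : 2` and `siblings : 2`, so under this sketch the Pentium 4 comes out as one core with two threads, while a dual-core part without Hyper-Threading comes out at one thread per core.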

Sep 10 2008

bearophile <bearophileHUGS lycos.com> writes:

Don:
 To implement efficient memory-intensive operations (memcpy, array 
 operations, matrix multiplication, etc), you really need to know the 
 sizes of the data caches.

Your code is surely useful, but take a look at "cache oblivious data
structures":
http://en.wikipedia.org/wiki/Cache-oblivious_algorithm

I presume your code doesn't work at compile time, so I presume you have to run
your code once, save the results on disk, and then re-start the compilation to
use those values to compute the tuning compile-time constants of the data
structures :-)

On this Intel the data looks almost correct (I have used the second version of
your code), but there isn't level 3 cache, and the RAM size is 2 GB.

Vendor string:    GenuineIntel
Processor string: Intel(R) Pentium(R) Dual  CPU  E2180    2.00GHz
Signature:        Family=6 Model=15 Stepping=13
Features:         MMX FXSR SSE SSE2 SSE3 SSSE3 AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=6 Model=F Stepping=D
Data caches:
Level 1 size=16K, ways=8 linesize=64
Level 2 size=512K, ways=4 linesize=64
Level 3 size=4194303K, ways=1 linesize=64

Bye,
bearophile

Sep 10 2008

bearophile <bearophileHUGS lycos.com> writes:

bearophile:
 Features:         MMX FXSR SSE SSE2 SSE3 SSSE3 AMD64 HTT

I don't know if my CPU has HyperThreading, I presume not.

Bye,
bearophile

Sep 10 2008

Don <nospam nospam.com.au> writes:

bearophile wrote:
 Don:
 To implement efficient memory-intensive operations (memcpy, array 
 operations, matrix multiplication, etc), you really need to know the 
 sizes of the data caches.

 
 Your code is surely useful, but take a look at "cache oblivious data
 structures":
 http://en.wikipedia.org/wiki/Cache-oblivious_algorithm

Yes, it's a good approach, but it doesn't help for something like memcpy.

 I presume your code doesn't work at compile time,

Correct.

 so I presume you have to run your code once, save the results on disk, and
 then re-start the compilation to use those values to compute the tuning
 compile-time constants of the data structures :-)

No. That wouldn't be much use.
The cache size is simply used as a parameter at run-time. It's only the 
linesize which has a major impact on optimal code -- but it's 32 or 64 
bytes on every system which I know of. So it's possible to deal with it 
at compile time, too.
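Handling the line size at compile time might look like this minimal C sketch: a single assumed constant (64 bytes here, since anything aligned to 64 is also aligned to 32), a compile-time check that it is a power of two, and a helper that rounds sizes up to whole lines. The constant and helper names are hypothetical.

```c
#include <stddef.h>

/* Assumed line size, fixed at compile time (hypothetical constant).
   64 also satisfies 32-byte alignment, since 32 divides 64. */
#define ASSUMED_LINE_SIZE ((size_t)64)

_Static_assert((ASSUMED_LINE_SIZE & (ASSUMED_LINE_SIZE - 1)) == 0,
               "line size must be a power of two");

/* Round a byte count up to a whole number of cache lines. */
static inline size_t round_up_to_line(size_t n)
{
    return (n + ASSUMED_LINE_SIZE - 1) & ~(ASSUMED_LINE_SIZE - 1);
}
```

The power-of-two check is what makes the mask trick in `round_up_to_line` valid.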

You can, in fact, just plug the L1 cache size into your cache-oblivious 
algorithm as the cut-off level, significantly improving performance.
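As a hedged illustration of using the L1 size as the cut-off of a cache-oblivious algorithm, here is a recursive matrix transpose in C. It halves the longer side until a tile's source and destination data fit in an `l1_bytes` budget, then falls back to a plain loop. This is a sketch under that assumption, not code from the thread; `l1_bytes` would come from a run-time cache query, and the function names are hypothetical.

```c
#include <stddef.h>

/* Transpose the rows x cols tile at (r0, c0) of the row-major
   n_rows x n_cols matrix src into the n_cols x n_rows matrix dst. */
static void transpose_block(const double *src, double *dst,
                            size_t n_rows, size_t n_cols,
                            size_t r0, size_t c0, size_t rows, size_t cols,
                            size_t l1_bytes)
{
    /* Leaf: source and destination tiles together fit in the L1 budget
       (or the tile is a single element), so use the simple loop. */
    int fits = 2 * rows * cols * sizeof(double) <= l1_bytes;
    if (fits || (rows <= 1 && cols <= 1)) {
        for (size_t r = r0; r < r0 + rows; ++r)
            for (size_t c = c0; c < c0 + cols; ++c)
                dst[c * n_rows + r] = src[r * n_cols + c];
        return;
    }
    /* Otherwise split the longer side in half and recurse on both halves. */
    if (rows >= cols) {
        size_t h = rows / 2;
        transpose_block(src, dst, n_rows, n_cols, r0, c0, h, cols, l1_bytes);
        transpose_block(src, dst, n_rows, n_cols, r0 + h, c0, rows - h, cols, l1_bytes);
    } else {
        size_t h = cols / 2;
        transpose_block(src, dst, n_rows, n_cols, r0, c0, rows, h, l1_bytes);
        transpose_block(src, dst, n_rows, n_cols, r0, c0 + h, rows, cols - h, l1_bytes);
    }
}

void transpose(const double *src, double *dst,
               size_t n_rows, size_t n_cols, size_t l1_bytes)
{
    transpose_block(src, dst, n_rows, n_cols, 0, 0, n_rows, n_cols, l1_bytes);
}
```

Without the `fits` test the recursion would go all the way down to single elements; the L1 size only decides where it stops, so the result is the same for any budget.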

 On this Intel the data looks almost correct (I have used the second version of
 your code), but there isn't level 3 cache, and the RAM size is 2 GB.

The value shown for L3 should be greater than the memory size, if there 
is no L2 cache. (you never fall out of the L3 cache). So it's correct.

Sep 10 2008

bearophile <bearophileHUGS lycos.com> writes:

Don:
 The value shown for L3 should be greater than the memory size, if there 
 is no L2 cache. (you never fall out of the L3 cache). So it's correct.

I don't understand. And I have L2 cache.
The results I expect from your code running on my PC are:

L1: 32 + 32 KB
L2: 1024 KB
L3: 0 MB
RAM: 2 GB

Or if you want an output more usable by an algorithm, it can output a dynamic
array of longs:

Memory levels ==> [65536, 1048576, 2147483648]

Bye,
bearophile

Sep 10 2008

Don <nospam nospam.com.au> writes:

bearophile wrote:
 Don:
 The value shown for L3 should be greater than the memory size, if there 
 is no L2 cache. (you never fall out of the L3 cache). So it's correct.

 
 I don't understand. And I have L2 cache.

Oops, that should have been "if there is no L3 cache".

 The results I expect from your code running on my PC are:
 
 L1: 32 + 32 KB
 L2: 1024 KB
 L3: 0 MB
 RAM: 2 GB

(1) Unfortunately, I don't think it's possible to determine the amount 
of RAM without help from the OS. So I give the last level a size of 
uint.max bytes. Perhaps that is too confusing.

(2) The cache values are per core.
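The `4194303K` shown for L3 in every report is uint.max bytes divided by 1024 with integer division, the stand-in for a level that does not exist. If you want the "L3: 0 MB" form bearophile asked for, a caller can map the sentinel back to zero. A minimal C sketch with hypothetical names:

```c
#include <stdint.h>

/* A missing level is reported as uint.max bytes; in whole KB (integer
   division by 1024) that is 4194303. */
#define ABSENT_LEVEL_KB (UINT32_MAX / 1024u)

/* Map the "no such level" sentinel to 0 KB; pass real sizes through. */
uint32_t effective_size_kb(uint32_t reported_kb)
{
    return reported_kb == ABSENT_LEVEL_KB ? 0 : reported_kb;
}
```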

 Or if you want an output more usable by an algorithm, it can output a dynamic
 array of longs:
 
 Memory levels ==> [65536, 1048576, 2147483648]

Perhaps. I haven't decided on a final interface. Note, though, that the 
relevant size of the cache depends on what you are doing. For example, 
if you are operating on 3 arrays, you need to divide the cache size by 
3. But if the cache level has an associativity less than 3, you 
risk cache thrashing.
So generally you need to make your own table of cache sizes anyway.
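Don's rule of thumb can be written down directly. This minimal C sketch (hypothetical names) divides one cache level between `narrays` concurrently used arrays, rounds each share down to whole lines, and flags the case where the level has fewer ways than arrays. Treat the thrashing check as a heuristic only: whether conflicts actually happen depends on the arrays' addresses.

```c
#include <stdbool.h>
#include <stddef.h>

/* One cache level as a cache query might report it (hypothetical type). */
typedef struct {
    size_t   size_bytes;
    unsigned ways;
    unsigned line_size;
} cache_level;

/* Bytes of this level each of `narrays` concurrently used arrays may use,
   rounded down to a whole number of cache lines. */
size_t per_array_budget(cache_level c, unsigned narrays)
{
    if (narrays == 0 || c.line_size == 0)
        return 0;
    size_t share = c.size_bytes / narrays;
    return share - share % c.line_size;
}

/* Heuristic: with fewer ways than arrays, blocks of different arrays that
   map to the same set can keep evicting each other. */
bool thrash_risk(cache_level c, unsigned narrays)
{
    return c.ways < narrays;
}
```

For example, a 32 KB, 8-way L1 with 64-byte lines gives each of 3 arrays 10880 bytes with no warning, while a 64 KB, 2-way L1 gives 21824 bytes each but is flagged.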

Sep 10 2008

"Manfred_Nowak" <svv1999 hotmail.com> writes:

Don wrote:

  I don't think it's possible to determine the amount 
 of RAM without help from the OS.

Because without help of the OS, one will only get the available virtual 
memory, which is provided by the OS?

-manfred

-- 
If life is going to exist in this Universe, then the one thing it 
cannot afford to have is a sense of proportion. (Douglas Adams)
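Whatever the exact reason, the amount of physical RAM has to come from the OS. On Unix-like systems one way to ask is `sysconf`; the minimal C sketch below assumes a C library that supports `_SC_PHYS_PAGES` (glibc does, but it is an extension rather than base POSIX). Windows would need a different call, and the function name is hypothetical.

```c
#include <stdint.h>
#include <unistd.h>

/* Physical RAM in bytes as reported by the OS, or 0 if unavailable. */
uint64_t physical_ram_bytes(void)
{
    long pages     = sysconf(_SC_PHYS_PAGES); /* extension, see above */
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages <= 0 || page_size <= 0)
        return 0;
    return (uint64_t)pages * (uint64_t)page_size;
}
```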

Sep 10 2008

TomD <t_demmer nospam.web.de> writes:

Vendor string:    GenuineIntel
Processor string: Intel(R) Core(TM)2 CPU         T5500    1.66GHz
Signature:        Family=6 Model=15 Stepping=2
Features:         MMX FXSR SSE SSE2 SSE3 SSSE3 AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=6 Model=F Stepping=2
Data caches:
Level 1 size=16K, ways=8 linesize=64
Level 2 size=1024K, ways=8 linesize=64
Level 3 size=4194303K, ways=1 linesize=64

Sep 10 2008

dsimcha <dsimcha yahoo.com> writes:

Seems to get things wrong for quad core Intels.  No, I'm not rich enough to own
one myself, this is my work computer.

Vendor string:    GenuineIntel
Processor string: Intel(R) Core(TM)2 Quad CPU             2.66GHz
Signature:        Family=6 Model=15 Stepping=7
Features:         MMX FXSR SSE SSE2 SSE3 SSSE3 AMD64 HTT
Multithreading:   4 threads / 4 cores

Family=6 Model=F Stepping=7
Data caches:
Level 1 size=8K, ways=8 linesize=64
Level 2 size=1024K, ways=16 linesize=64
Level 3 size=4194303K, ways=1 linesize=64

According to CPU-Z, the L2 cache is 2 x 4096 KB.  I'm pretty sure that the
quad-core Intels are basically two dual-cores stuck together, meaning that they
have two L2 caches, each shared between two of the cores.  Also, the L1 cache info
is wrong if CPU-Z is right: I actually have 32K of L1 cache per core.  For your
convenience, I've attached my CPU-Z HTML output.
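The sharing dsimcha describes (two L2 caches, each shared by two cores) is the kind of information Intel's CPUID leaf 4 exposes in EAX bits 25:14: the maximum number of logical-processor IDs sharing a cache, minus one. Since Don says the values are meant to be per core, a shared level could be divided by that count. A minimal C sketch with hypothetical names; because the field is an upper bound, the result is a conservative estimate.

```c
#include <stdint.h>

/* Intel CPUID leaf 4, EAX[25:14] + 1: the maximum number of addressable
   logical-processor IDs sharing this cache (an upper bound). */
unsigned leaf4_max_sharing(uint32_t leaf4_eax)
{
    return ((leaf4_eax >> 14) & 0xFFF) + 1;
}

/* Share of one shared cache level available to each sharer, in KB. */
uint32_t per_sharer_kb(uint32_t size_kb, uint32_t leaf4_eax)
{
    return size_kb / leaf4_max_sharing(leaf4_eax);
}
```

With a 4096 KB unified L2 shared by two cores, each core would be credited with 2048 KB.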

Sep 10 2008

"Jérôme M. Berger" <jeberger free.fr> writes:


	I get:
 ./cache

unknown CPU

Family=0 Model=0 Stepping=0
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=4194303K, ways=1 linesize=32
Level 3 size=4194303K, ways=1 linesize=32

 gdc --version

gdc (GCC) 4.1.2 20070214 ( gdc 0.24, using dmd 1.030)

 cat /proc/cpuinfo

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 75
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
stepping        : 2
cpu MHz         : 1800.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 3618.77
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 75
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
stepping        : 2
cpu MHz         : 1800.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 3618.77
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
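This build printed "unknown CPU" even though /proc/cpuinfo shows AuthenticAMD; the thread does not say why (this binary was built with gdc rather than dmd). Vendor detection rests on CPUID leaf 0, which returns the 12-character vendor string in EBX, EDX and ECX, in that order, four bytes per register with the lowest byte first. A minimal C sketch of assembling and comparing it from raw register values (hypothetical names):

```c
#include <stdint.h>
#include <string.h>

/* CPUID leaf 0 returns the vendor string in EBX, EDX, ECX (in that order),
   four ASCII characters per register, lowest byte first. */
void vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    for (int r = 0; r < 3; ++r)
        for (int b = 0; b < 4; ++b)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xFF);
    out[12] = '\0';
}

/* Nonzero if the decoded vendor string matches `name` exactly. */
int vendor_is(const char vendor[13], const char *name)
{
    return strcmp(vendor, name) == 0;
}
```

Getting the register order wrong (EBX, ECX, EDX is a common slip) yields "AuthcAMDenti", which would fail every vendor comparison.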


Sep 10 2008

Julio César Carrascal Urquijo <jcarrascal gmail.com> writes:

Hello Don,

 Please compile and run the code, and report the results. Any results
 would be useful, but particularly valuable would be:
 (1) Multicore AMD machines;
 (2) Early AMD machines (K6 or earlier).
 (3) Early Intel machines;
 (4) anything from another manufacturer.
 (5) any crashes or obvious bugs.
 Public domain.
 

Vendor string:    AuthenticAMD
Processor string: AMD Athlon(tm) 64 X2 Dual-Core Processor TK-55
Signature:        Family=15 Model=104 Stepping=1
Features:         MMX FXSR SSE SSE2 SSE3 3DNow! 3DNow!+ MMX+ AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=F Model=8 Stepping=1
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=256K, ways=16 linesize=0
Level 3 size=4194303K, ways=1 linesize=0


The same data is reported by CPU-Z.

Sep 10 2008

"Craig Black" <craigblack2 cox.net> writes:

Don,

Very good work on all this stuff!!  I didn't realize that this was possible 
to do.  Once you get the cache sizes it would be beneficial to know what 
regions of memory are currently loaded into cache and RAM.  Do you know if 
this can be done?

If such information was available, it would be possible to write a "memory 
optimizer" that would work on special data structures that could be moved 
around on the heap.  Objects that are accessed frequently could be moved 
around to improve locality of reference.

-Craig

Sep 10 2008

Don <nospam nospam.com.au> writes:

Craig Black wrote:
 Don,
 
 Very good work on all this stuff!!  I didn't realize that this was 
 possible to do.  Once you get the cache sizes it would be beneficial to 
 know what regions of memory are currently loaded into cache and RAM.  Do 
 you know if this can be done?

I don't think that's possible.

One thing you can do, though, is use the performance counters to measure 
how many cache misses you're getting. (There are performance counters 

cache, etc). Requires a small kernel mode driver, though, so can't be 
used for client code. But it's what I use for development -- you can 
learn a lot with it.

 If such information was available, it would be possible to write a 
 "memory optimizer" that would work on special data structures that could 
 be moved around on the heap.  Objects that are accessed frequently could 
 be moved around to improve locality of reference.

Nice idea. Still, the most important things can be done at compile time.
(especially, making sure that arrays of structs are sensibly arranged).
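As one illustration of arranging arrays of structs at compile time, here is a minimal C sketch of a hot/cold split. The types and functions are hypothetical; the point is only that a loop reading two fields should not drag unrelated fields through the cache with them. Assuming 64-byte lines, one `particle_aos` (72 bytes on typical 64-bit ABIs) spans more than a line, while a line holds four `particle_hot` entries.

```c
#include <stddef.h>

/* Array of structs: the update loop only needs x and v, but every cache
   line it pulls in also carries name and id. */
typedef struct {
    double x, v;
    char   name[48];
    int    id;
} particle_aos;

/* Hot/cold split: the fields the loop touches live in their own array;
   rarely used data goes in a parallel array indexed the same way. */
typedef struct { double x, v; } particle_hot;
typedef struct { char name[48]; int id; } particle_cold;

/* Advance every particle by one time step; return the sum of positions. */
double advance_aos(particle_aos *p, size_t n, double dt)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        p[i].x += p[i].v * dt;
        sum += p[i].x;
    }
    return sum;
}

double advance_hot(particle_hot *p, size_t n, double dt)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        p[i].x += p[i].v * dt;
        sum += p[i].x;
    }
    return sum;
}
```

Both loops compute the same result; only the number of cache lines they touch differs.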

Sep 11 2008

Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:

Vendor string:    AuthenticAMD
Processor string: AMD Athlon(tm) 64 X2 Dual Core Processor 5000+
Signature:        Family=15 Model=107 Stepping=2
Features:         MMX FXSR SSE SSE2 SSE3 3DNow! 3DNow!+ MMX+ AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=F Model=6B Stepping=2
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=512K, ways=16 linesize=64
Level 3 size=4194303K, ways=1 linesize=64

-- 
Bruno Medeiros - Software Developer, MSc. in CS/E graduate
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

Sep 23 2008
