digitalmars.D.learn - Branch Prediction strange results

• bearophile (14/14) Nov 11 2008 I have found an interesting small article about optimization, so I've tr...
• Kagamin (14/14) Nov 12 2008 if (i % 4 == 1) {
• Kagamin (1/1) Nov 12 2008 You didn't run your code, lol.
• bearophile (29/30) Nov 12 2008 Thank you for spotting the silly bug, I'll fix it now. (It seems it's mo...
• Don (4/22) Nov 12 2008 Are you running it on a Pentium 4? Pentium 4 has *horrific* branch
• bearophile (75/78) Nov 12 2008 Sorry, I am using a Core2 @ 2GHz.
bearophile <bearophileHUGS lycos.com> writes:
```I have found an interesting small article about optimization, so I've tried the
code in C and D, and I have found strange results (the D code shows timings
opposite of the article).
This is the article, look at the "Branch Prediction" section:
http://www.ddj.com/184405848

The C code:
And its asm (MinGW 4.2.1):

The similar D code:
Its asm (DMD 1.036):

There is also about 2X performance difference.

Bye,
bearophile
```
Nov 11 2008
Kagamin <spam here.lot> writes:
```            if (i % 4 == 1) {
if (i % 4 == 0) {
counter1++;
} else {
counter2++;
}
} else {
if (i % 4 == 2) {
counter3++;
} else {
counter4++;
}
}
this is incorrect
```
Nov 12 2008
Kagamin <spam here.lot> writes:
```You didn't run your code, lol.
```
Nov 12 2008
bearophile <bearophileHUGS lycos.com> writes:
```Kagamin:
this is incorrect

Thank you for spotting the silly bug, I'll fix it now. (It seems it's more easy
to leave bugs in such kind of code because it does nothing useful).

But note that in both programs:
#define FIRST
static if (1) {
So the first part only is run in both D and C code, not the wrong one...

So the code is like this:

#include "stdio.h"

int main() {
int counter0 = 0, counter1 = 0, counter2 = 0, counter3 = 0;
int i = 300000000;
while (i--) {
// 0.63 s
if (i % 4 == 0) {
counter0++;
} else if (i % 4 == 1) {
counter1++;
} else if (i % 4 == 2) {
counter2++;
} else {
counter3++;
}
}

printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
return 0;
}

So the problem and timings of the first part I have shown are correct still :-)

Bye,
bearophile
```
Nov 12 2008
Don <nospam nospam.com> writes:
```bearophile wrote:
I have found an interesting small article about optimization, so I've tried
the code in C and D, and I have found strange results (the D code shows timings
opposite of the article).
This is the article, look at the "Branch Prediction" section:
http://www.ddj.com/184405848

The C code:
And its asm (MinGW 4.2.1):

The similar D code:
Its asm (DMD 1.036):

There is also about 2X performance difference.

Bye,
bearophile

Are you running it on a Pentium 4? Pentium 4 has *horrific* branch
misprediction (minimum 24 cycles, 45 uops). No other processor is nearly
as bad, eg it's 15 cycles on Core2; it was just 4 cycles on PMMX.
```
Nov 12 2008
bearophile <bearophileHUGS lycos.com> writes:
```Don:
Are you running it on a Pentium 4? Pentium 4 has *horrific* branch
misprediction (minimum 24 cycles, 45 uops). No other processor is nearly
as bad, eg it's 15 cycles on Core2; it was just 4 cycles on PMMX.

Sorry, I am using a Core2   2GHz.
The fixed C code with timings:

#include "stdio.h"

//#define FIRST

int main() {
int counter0 = 0, counter1 = 0, counter2 = 0, counter3 = 0;
int i = 300000000;
while (i--) {
#ifdef FIRST
// 0.63 s
if (i % 4 == 0) {
counter0++;
} else if (i % 4 == 1) {
counter1++;
} else if (i % 4 == 2) {
counter2++;
} else {
counter3++;
}
#else
// 0.66 s
if (i & 2) {
if (i & 1) {
counter3++;
} else {
counter2++;
}
} else {
if (i & 1) {
counter1++;
} else {
counter0++;
}
}
#endif
}

printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
return 0;
}

Fixed D code with timings:

void main() {
int counter0, counter1, counter2, counter3;

int i = 300000000;
while (i--)
static if (0) { // 1.24 s
if (i % 4 == 0) {
counter0++;
} else if (i % 4 == 1) {
counter1++;
} else if (i % 4 == 2) {
counter2++;
} else {
counter3++;
}
} else { // 1.01 s
if (i & 2) {
if (i & 1) {
counter3++;
} else {
counter2++;
}
} else {
if (i & 1) {
counter1++;
} else {
counter0++;
}
}
}

printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
}

As you can see the C version (GCC 4.2.1-dw2) is twice faster than the D one,
and it shows the scan as faster than the binary search, as says the article I