digitalmars.D - Switch implementation

bearophile (430/430) Sep 28 2010 Through Reddit I have found a small article about reverse engineering th...

Pelle (3/11) Sep 28 2010 Interesting. Do you have any performance comparisons? The input is

Iain Buclaw (68/84) Sep 28 2010 changes, this is the D program and the asm from the two compilers:

bearophile (6/7) Sep 28 2010 You are mostly testing compiler speed, while my worry was about runtime....

Iain Buclaw (6/13) Sep 28 2010 something useful to say:

bearophile (106/108) Sep 28 2010 If your purpose is to test runtime speed, use a more natural number of c...

bioinfornatics (55/55) Sep 28 2010 with ldc and tango (up to date)

bioinfornatics (2/2) Sep 28 2010 I have a AMD Phenom(tm) II X4 955 Processor and this script works only o...
bearophile (5/13) Sep 28 2010 LDC has llvm back-end, that doesn't share that dmd problem.

Iain Buclaw (44/56) Sep 28 2010 --snip--

bearophile (5/15) Sep 28 2010 Probably because gcc/gdc is not using a jump table here, but a binary se...

Jonathan M Davis (18/25) Sep 28 2010 The real question though isn't worst case performance but rather average...

retard (4/12) Sep 28 2010 Instead of O(n) linear search or O(ln n) binary search, why not use O(1)...

bearophile (4/6) Sep 28 2010 I don't exactly know. But you must take into account the constants too, ...

Robert Jacques (4/15) Sep 28 2010 Well there are 28 labeled cases and ~16kb of jump table address space.

bearophile (4/6) Sep 28 2010 32 kb are enough to fill the code part of the L1 cache on most CPUs.

retard (5/10) Sep 28 2010 28 cases and 32 kB of space seems like a waste. I'd use some sort of

bearophile <bearophileHUGS lycos.com> writes:

Through Reddit I have found a small article about reverse engineering the
switch statement:
http://www.codeproject.com/KB/cpp/switch.aspx

I have compiled a test program with GCC and then with DMD with minimal changes,
this is the D program and the asm from the two compilers:


import std.c.stdio: puts;
import std.c.stdlib: atoi;

void f1() { puts("f1 called"); }
void f2() { puts("f2 called"); }
void f3() { puts("f3 called"); }

void main() {
    int i = atoi("3");

    switch (i) {
        case 140: f1(); break;
        case 300: f1(); break;
        case 1280: f1(); break;
        case 1540: f1(); break;
        case 1660: f1(); break;
        case 1770: f2(); break;
        case 2150: f2(); break;
        case 2190: f1(); break;
        case 2530: f2(); break;
        case 2560: f2(); break;
        case 2590: f1(); break;
        case 2660: f1(); break;
        case 2720: f2(); break;
        case 3010: f1(); break;
        case 3100: f1(); break;
        case 3390: f2(); break;
        case 3760: f1(); break;
        case 3970: f2(); break;
        case 4050: f1(); break;
        case 4140: f1(); break;
        case 4360: f2(); break;
        case 4540: f1(); break;
        case 4600: f2(); break;
        case 4720: f2(); break;
        case 4730: f2(); break;
        case 4740: f2(); break;
        case 4880: f2(); break;
        case 4950: f1(); break;

        default: f3();
    }
}


/*
------------------------------
DMD optimized build:

__Dmain comdat
        push    EBX
        mov EAX,offset FLAT:_DATA[024h]
        push    EAX
        call    near ptr _atoi
        add ESP,4
        cmp EAX,08Ch
        je  L115
        cmp EAX,012Ch
        je  L115
        cmp EAX,0500h
        je  L115
        cmp EAX,0604h
        je  L115
        cmp EAX,067Ch
        je  L115
        cmp EAX,06EAh
        je  L105
        cmp EAX,0866h
        je  L105
        cmp EAX,088Eh
        je  L115
        cmp EAX,09E2h
        je  L105
        cmp EAX,0A00h
        je  L105
        cmp EAX,0A1Eh
        je  L115
        cmp EAX,0A64h
        je  L115
        cmp EAX,0AA0h
        je  L105
        cmp EAX,0BC2h
        je  L115
        cmp EAX,0C1Ch
        je  L115
        cmp EAX,0D3Eh
        je  L105
        cmp EAX,0EB0h
        je  L115
        cmp EAX,0F82h
        je  L105
        cmp EAX,0FD2h
        je  L115
        cmp EAX,0102Ch
        je  L115
        cmp EAX,01108h
        je  L105
        cmp EAX,011BCh
        je  L115
        cmp EAX,011F8h
        je  L105
        cmp EAX,01270h
        je  L105
        cmp EAX,0127Ah
        je  L105
        cmp EAX,01284h
        je  L105
        cmp EAX,01310h
        je  L105
        cmp EAX,01356h
        je  L115
        jmp short   L125
L105:       mov ECX,offset FLAT:_DATA[0Ch]
        push    ECX
        call    near ptr _puts
        add ESP,4
        jmp short   L133
L115:       mov EDX,offset FLAT:_DATA
        push    EDX
        call    near ptr _puts
        add ESP,4
        jmp short   L133
L125:       mov EBX,offset FLAT:_DATA[018h]
        push    EBX
        call    near ptr _puts
        add ESP,4
L133:       pop EBX
        xor EAX,EAX
        ret

----------------------------------

GCC 4.5.1 -O3

_main:
    pushl   %ebp
    movl    %esp, %ebp
    andl    $-16, %esp
    subl    $16, %esp
    call    ___main
    movl    $LC3, (%esp)
    call    _atoi
    cmpl    $3010, %eax
    je  L33
    jle L43
    cmpl    $4360, %eax
    je  L32
    .p2align 4,,6
    jle L44
    cmpl    $4730, %eax
    je  L32
    .p2align 4,,6
    jle L45
    cmpl    $4880, %eax
    je  L32
    cmpl    $4950, %eax
    je  L33
    cmpl    $4740, %eax
    jne L5
    .p2align 4,,7
L32:
    movl    $LC1, (%esp)
    call    _puts
    xorl    %eax, %eax
    leave
LCFI16:
    ret
    .p2align 4,,7
L45:
LCFI17:
    cmpl    $4600, %eax
    je  L32
    cmpl    $4720, %eax
    je  L32
    cmpl    $4540, %eax
    jne L5
    .p2align 4,,7
L33:
    movl    $LC0, (%esp)
    call    _puts
L41:
    xorl    %eax, %eax
    leave
LCFI18:
    ret
    .p2align 4,,7
L43:
LCFI19:
    cmpl    $2150, %eax
    je  L32
    .p2align 4,,4
    jle L46
    cmpl    $2560, %eax
    je  L32
    .p2align 4,,6
    jle L47
    cmpl    $2660, %eax
    je  L33
    cmpl    $2720, %eax
    je  L32
    cmpl    $2590, %eax
    jne L5
    jmp L33
    .p2align 4,,7
L44:
    cmpl    $3760, %eax
    je  L33
    .p2align 4,,6
    jle L48
    cmpl    $4050, %eax
    je  L33
    cmpl    $4140, %eax
    je  L33
    cmpl    $3970, %eax
    jne L5
    jmp L32
    .p2align 4,,7
L46:
    cmpl    $1280, %eax
    je  L33
    .p2align 4,,6
    jle L49
    cmpl    $1660, %eax
    je  L33
    cmpl    $1770, %eax
    je  L32
    cmpl    $1540, %eax
    jne L5
    jmp L33
    .p2align 4,,7
L47:
    cmpl    $2190, %eax
    je  L33
    cmpl    $2530, %eax
    je  L32
    .p2align 4,,7
L5:
    movl    $LC2, (%esp)
    call    _puts
    jmp L41
    .p2align 4,,7
L49:
    cmpl    $140, %eax
    je  L33
    cmpl    $300, %eax
    jne L5
    jmp L33
    .p2align 4,,7
L48:
    cmpl    $3100, %eax
    je  L33
    cmpl    $3390, %eax
    jne L5
    jmp L32

---------------------------

llvm-gcc V.2.7, -O3

_main:
    pushl   %ebp
    movl    %esp, %ebp
    subl    $8, %esp
    call    ___main
    movl    $L_.str3, (%esp)
    call    _atoi
    cmpl    $299, %eax
    jg  LBB4_4
    cmpl    $140, %eax
    jne LBB4_56
LBB4_2:
    movl    $L_.str2, (%esp)
LBB4_3:
    call    _puts
    xorl    %eax, %eax
    addl    $8, %esp
    popl    %ebp
    ret
LBB4_4:
    cmpl    $1279, %eax
    jg  LBB4_6
    cmpl    $300, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_6:
    cmpl    $1539, %eax
    jg  LBB4_8
    cmpl    $1280, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_8:
    cmpl    $1659, %eax
    jg  LBB4_10
    cmpl    $1540, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_10:
    cmpl    $1769, %eax
    jg  LBB4_12
    cmpl    $1660, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_12:
    cmpl    $2149, %eax
    jg  LBB4_15
    cmpl    $1770, %eax
    jne LBB4_56
LBB4_14:
    movl    $L_.str1, (%esp)
    jmp LBB4_3
LBB4_15:
    cmpl    $4949, %eax
    jg  LBB4_55
    cmpl    $4879, %eax
    jg  LBB4_54
    cmpl    $2189, %eax
    jg  LBB4_19
    cmpl    $2150, %eax
    je  LBB4_14
    jmp LBB4_56
LBB4_19:
    cmpl    $2529, %eax
    jg  LBB4_21
    cmpl    $2190, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_21:
    cmpl    $2559, %eax
    jg  LBB4_23
    cmpl    $2530, %eax
    je  LBB4_14
    jmp LBB4_56
LBB4_23:
    cmpl    $2589, %eax
    jg  LBB4_25
    cmpl    $2560, %eax
    je  LBB4_14
    jmp LBB4_56
LBB4_25:
    cmpl    $2659, %eax
    jg  LBB4_27
    cmpl    $2590, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_27:
    cmpl    $2719, %eax
    jg  LBB4_29
    cmpl    $2660, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_29:
    cmpl    $3009, %eax
    jg  LBB4_31
    cmpl    $2720, %eax
    je  LBB4_14
    jmp LBB4_56
LBB4_31:
    cmpl    $3099, %eax
    jg  LBB4_33
    cmpl    $3010, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_33:
    cmpl    $3389, %eax
    jg  LBB4_35
    cmpl    $3100, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_35:
    cmpl    $3759, %eax
    jg  LBB4_37
    cmpl    $3390, %eax
    je  LBB4_14
    jmp LBB4_56
LBB4_37:
    cmpl    $3969, %eax
    jg  LBB4_39
    cmpl    $3760, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_39:
    cmpl    $4049, %eax
    jg  LBB4_41
    cmpl    $3970, %eax
    je  LBB4_14
    jmp LBB4_56
LBB4_41:
    cmpl    $4139, %eax
    jg  LBB4_43
    cmpl    $4050, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_43:
    cmpl    $4359, %eax
    jg  LBB4_45
    cmpl    $4140, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_45:
    cmpl    $4539, %eax
    jg  LBB4_47
    cmpl    $4360, %eax
    je  LBB4_14
    jmp LBB4_56
LBB4_47:
    cmpl    $4599, %eax
    jg  LBB4_49
    cmpl    $4540, %eax
    je  LBB4_2
    jmp LBB4_56
LBB4_49:
    cmpl    $4719, %eax
    jg  LBB4_51
    cmpl    $4600, %eax
    je  LBB4_14
    jmp LBB4_56
LBB4_51:
    cmpl    $4720, %eax
    je  LBB4_14
    cmpl    $4730, %eax
    je  LBB4_14
    cmpl    $4740, %eax
    je  LBB4_14
    jmp LBB4_56
LBB4_54:
    cmpl    $4880, %eax
    je  LBB4_14
    jmp LBB4_56
LBB4_55:
    cmpl    $4950, %eax
    je  LBB4_2
LBB4_56:
    movl    $L_.str, (%esp)
    jmp LBB4_3
---------------------------------
*/

gcc and llvm-gcc use a binary search, dmd a linear one.

I have done a similar interesting test (similar to switch5.cpp of the article
author) where a good implementation of the switch is a small table for part of
the cases and a binary tree for the other cases.

Bye,
bearophile

Sep 28 2010

Pelle <pelle.mansson gmail.com> writes:

On 09/28/2010 07:33 PM, bearophile wrote:
 Through Reddit I have found a small article about reverse engineering the
switch statement:
 http://www.codeproject.com/KB/cpp/switch.aspx

 I have compiled a test program with GCC and then with DMD with minimal
changes, this is the D program and the asm from the two compilers:

 ... listing ...

 gcc and llvm-gcc use a binary search, dmd a linear one.

 I have done a similar interesting test (similar to switch5.cpp of the article
author) where a good implementation of the switch is a small table for part of
the cases and a binary tree for the other cases.

 Bye,
 bearophile

Interesting. Do you have any performance comparisons? The input is 
fairly small, I'm not sure a binary search will help a lot.

Sep 28 2010

Iain Buclaw <ibuclaw ubuntu.com> writes:

== Quote from Pelle (pelle.mansson gmail.com)'s article
 On 09/28/2010 07:33 PM, bearophile wrote:
 Through Reddit I have found a small article about reverse engineering the


switch statement:
 http://www.codeproject.com/KB/cpp/switch.aspx

 I have compiled a test program with GCC and then with DMD with minimal


changes, this is the D program and the asm from the two compilers:
 ... listing ...

 gcc and llvm-gcc use a binary search, dmd a linear one.

 I have done a similar interesting test (similar to switch5.cpp of the article


author) where a good implementation of the switch is a small table for part of
the
cases and a binary tree for the other cases.
 Bye,
 bearophile

 Interesting. Do you have any performance comparisons? The input is
 fairly small, I'm not sure a binary search will help a lot.

Out of curiousity, I thought I might give a good stress test a try.

Specs of my machine:

Samsung N110, 2.1GHz Intel Atom, 2GB Memory.

Code looks like this: It's a switch statement with 20,000 cases.

import std.c.stdio: puts;
import std.c.stdlib: atoi;

void f1() { puts("f1 called"); }
void f2() { puts("f2 called"); }
void f3() { puts("f3 called"); }

void main() {
   int i = atoi("20000");

   switch (i) {
      case 1: f1(); break;
      case 2: f2(); break;
      case 3: f3(); break;
// Turtles all the down...
      case 19999: f1(); break;
      case 20000: f2(); break;
      default: f3(); break;
   }
}


Compiling with gdc -O3. objdump output shows us the binary search bearophile
mentioned:

lea    0x0(%esi),%esi
jmp    *0x80a9350(,%eax,4)
nop
call   8049430 <_D4test2f2FZv>
lea    0x0(%esi),%esi
jmp    8049491 <_Dmain+0x21>
lea    0x0(%esi),%esi
call   8049410 <_D4test2f3FZv>
lea    0x0(%esi),%esi
jmp    8049491 <_Dmain+0x21>
lea    0x0(%esi),%esi
call   8049430 <_D4test2f2FZv>
...

Time it takes to compile:
real	2m0.061s
user	1m46.567s
sys	0m1.176s

Time it takes to run:
real	0m0.007s
user	0m0.004s
sys	0m0.004s


What is interesting (for me) is that the compile time *could* be quicker if the
semantic for CaseStatement is little smarter at checking for duplicate cases.

The current implementation is (in pseudo code), is a very linear

for (int i = 0; i < num_searched_cases; i++)
    check_is_dupe();

So if we have 20,000 cases in a statement, the compiler will iterate over the
loop
like so:
0
0 1
0 1 2
0 1 2 3
...
0 .. 19999
0 .. 20000

Which is slow on any machine...

I would post results for DMD, but that is still compiling 20 minutes in...
Currently writing out function main to object file...

This is certainly a first for me, but this one user case shows that DMD *can* be
woefully slower than GDC at something. :3

Sep 28 2010

bearophile <bearophileHUGS lycos.com> writes:

Iain Buclaw:
 Out of curiousity, I thought I might give a good stress test a try.

You are mostly testing compiler speed, while my worry was about runtime. I
didn't even know DMD is able to digest switch statements with more than 256
cases.

I have opened an enhancement request, add a comment to it if you think you have
something useful to say:
http://d.puremagic.com/issues/show_bug.cgi?id=4952

Bye,
bearophile

Sep 28 2010

Iain Buclaw <ibuclaw ubuntu.com> writes:

== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Iain Buclaw:
 Out of curiousity, I thought I might give a good stress test a try.

 You are mostly testing compiler speed, while my worry was about runtime. I

didn't even know DMD is able to digest switch statements with more than 256
cases.
 I have opened an enhancement request, add a comment to it if you think you have

something useful to say:
 http://d.puremagic.com/issues/show_bug.cgi?id=4952
 Bye,
 bearophile

Oh, it was my *original* intention to test runtime speed. However, the time to
compile just stood out little more like a sore thumb than what I anticipated.

Iain

Sep 28 2010

bearophile <bearophileHUGS lycos.com> writes:

Iain Buclaw:

 Oh, it was my *original* intention to test runtime speed. However, the time to
 compile just stood out little more like a sore thumb than what I anticipated.

If your purpose is to test runtime speed, use a more natural number of cases
like 10 or 20 or even 50 :-)

So I have done a better test, see below for the code.

Timings, NLOOPS = 100_000, best of 6, seconds:
  DMD: 7.70
  GCC: 2.42

gcc  4.5.1, -Wall -O3 -s
dmd 2.049, -O -release -inline

--------------------------------

// D code
import std.c.stdio: printf;

enum int NLOOPS = 100000;

int c1, c2, c3;

void f1() { c1++; }
void f2() { c2++; }
void f3() { c3++; }

int main() {
    int i, j;
    for (i = 0; i < NLOOPS; i++) {
        for (j = 0; j < 5000; j++) {
            switch (j) {
                case 140: f1(); break;
                case 300: f1(); break;
                case 1280: f1(); break;
                case 1540: f1(); break;
                case 1660: f1(); break;
                case 1770: f2(); break;
                case 2150: f2(); break;
                case 2190: f1(); break;
                case 2530: f2(); break;
                case 2560: f2(); break;
                case 2590: f1(); break;
                case 2660: f1(); break;
                case 2720: f2(); break;
                case 3010: f1(); break;
                case 3100: f1(); break;
                case 3390: f2(); break;
                case 3760: f1(); break;
                case 3970: f2(); break;
                case 4050: f1(); break;
                case 4140: f1(); break;
                case 4360: f2(); break;
                case 4540: f1(); break;
                case 4600: f2(); break;
                case 4720: f2(); break;
                case 4730: f2(); break;
                case 4740: f2(); break;
                case 4880: f2(); break;
                case 4950: f1(); break;

                default: f3();
            }
        }
    }

    printf("%d %d %d\n", c1, c2, c3);
    return 0;
}

--------------------------------

// C code
#include "stdio.h"

#define NLOOPS 100000

int c1, c2, c3;

void f1() { c1++; }
void f2() { c2++; }
void f3() { c3++; }

int main() {
    int i, j;
    for (i = 0; i < NLOOPS; i++) {
        for (j = 0; j < 5000; j++) {
            switch (j) {
                case 140: f1(); break;
                case 300: f1(); break;
                case 1280: f1(); break;
                case 1540: f1(); break;
                case 1660: f1(); break;
                case 1770: f2(); break;
                case 2150: f2(); break;
                case 2190: f1(); break;
                case 2530: f2(); break;
                case 2560: f2(); break;
                case 2590: f1(); break;
                case 2660: f1(); break;
                case 2720: f2(); break;
                case 3010: f1(); break;
                case 3100: f1(); break;
                case 3390: f2(); break;
                case 3760: f1(); break;
                case 3970: f2(); break;
                case 4050: f1(); break;
                case 4140: f1(); break;
                case 4360: f2(); break;
                case 4540: f1(); break;
                case 4600: f2(); break;
                case 4720: f2(); break;
                case 4730: f2(); break;
                case 4740: f2(); break;
                case 4880: f2(); break;
                case 4950: f1(); break;

                default: f3();
            }
        }
    }

    printf("%d %d %d\n", c1, c2, c3);
    return 0;
}

Bye,
bearophile

Sep 28 2010

bioinfornatics <bioinfornatics fedoraproject.org> writes:

with ldc and tango (up to date)
$ ldc -O5 -release -enable-inlining test.d
$ time ./test
1500000 1300000 497200000

real	0m4.376s
user	0m4.373s
sys	0m0.001s



D Code
____________________________________
import tango.stdc.stdio: printf;
int NLOOPS = 100000;

int c1, c2, c3;

void f1() { c1++; }
void f2() { c2++; }
void f3() { c3++; }

int main() {
    int i, j;
    for (i = 0; i < NLOOPS; i++) {
        for (j = 0; j < 5000; j++) {
            switch (j) {
                case 140: f1(); break;
                case 300: f1(); break;
                case 1280: f1(); break;
                case 1540: f1(); break;
                case 1660: f1(); break;
                case 1770: f2(); break;
                case 2150: f2(); break;
                case 2190: f1(); break;
                case 2530: f2(); break;
                case 2560: f2(); break;
                case 2590: f1(); break;
                case 2660: f1(); break;
                case 2720: f2(); break;
                case 3010: f1(); break;
                case 3100: f1(); break;
                case 3390: f2(); break;
                case 3760: f1(); break;
                case 3970: f2(); break;
                case 4050: f1(); break;
                case 4140: f1(); break;
                case 4360: f2(); break;
                case 4540: f1(); break;
                case 4600: f2(); break;
                case 4720: f2(); break;
                case 4730: f2(); break;
                case 4740: f2(); break;
                case 4880: f2(); break;
                case 4950: f1(); break;

                default: f3();
            }
        }
    }

    printf("%d %d %d\n", c1, c2, c3);
    return 0;
}

Sep 28 2010

bioinfornatics <bioinfornatics fedoraproject.org> writes:

I have a AMD Phenom(tm) II X4 955 Processor and this script works only on one
core of my quad
(800Mhz)

Sep 28 2010

bearophile <bearophileHUGS lycos.com> writes:

bioinfornatics:

 with ldc and tango (up to date)
 $ ldc -O5 -release -enable-inlining test.d
 $ time ./test
 1500000 1300000 497200000
 
 real	0m4.376s
 user	0m4.373s
 sys	0m0.001s

LDC has llvm back-end, that doesn't share that dmd problem.
But to give us meaningful data, I suggest you to time the C program too on your
machine, so we can compare :-)

Bye,
bearophile

Sep 28 2010

Iain Buclaw <ibuclaw ubuntu.com> writes:

== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Iain Buclaw:
 Oh, it was my *original* intention to test runtime speed. However, the time to
 compile just stood out little more like a sore thumb than what I anticipated.

 If your purpose is to test runtime speed, use a more natural number of cases

like 10 or 20 or even 50 :-)
 So I have done a better test, see below for the code.
 Timings, NLOOPS = 100_000, best of 6, seconds:
   DMD: 7.70
   GCC: 2.42
 gcc  4.5.1, -Wall -O3 -s
 dmd 2.049, -O -release -inline

--snip--
 Bye,
 bearophile

Well, this should give you a good comparison between case jump tables on and off
in GCC.

gdc-4.4 -O0
real	0m9.994s
user	0m9.957s
sys	0m0.028s

gdc-4.4 -O0 -fno-jump-tables
real	0m10.222s
user	0m10.197s
sys	0m0.012s

gdc-4.4 -O0 -funroll-loops
real	0m10.004s
user	0m9.965s
sys	0m0.032s

gdc-4.4 -O0 -funroll-loops -fno-jump-tables
real	0m10.011s
user	0m9.981s
sys	0m0.020s

gdc-4.4 -O3
real	0m7.107s
user	0m7.092s
sys	0m0.008s

gdc-4.4 -O3 -fno-jump-tables
real	0m7.136s
user	0m7.112s
sys	0m0.008s

gdc-4.4 -O3 -funroll-loops
real	0m7.127s
user	0m7.104s
sys	0m0.008s

gdc-4.4 -O3 -funroll-loops -fno-jump-tables
real	0m7.237s
user	0m7.184s
sys	0m0.044s

Differences are pretty minimal...


In comparison to DMD:

dmd -O
real	0m15.049s
user	0m14.989s
sys	0m0.044s


Iain

Sep 28 2010

bearophile <bearophileHUGS lycos.com> writes:

Iain Buclaw:

 gdc-4.4 -O3 -fno-jump-tables
 real	0m7.136s
 user	0m7.112s
 sys	0m0.008s

...
 gdc-4.4 -O3 -funroll-loops -fno-jump-tables
 real	0m7.237s
 user	0m7.184s
 sys	0m0.044s
 
 Differences are pretty minimal...

Probably because gcc/gdc is not using a jump table here, but a binary search
tree :-)

Bye,
bearophile

Sep 28 2010

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Tuesday, September 28, 2010 15:46:01 Iain Buclaw wrote:
 Out of curiousity, I thought I might give a good stress test a try.
 
 Specs of my machine:
 
 Samsung N110, 2.1GHz Intel Atom, 2GB Memory.
 
 Code looks like this: It's a switch statement with 20,000 cases.

The real question though isn't worst case performance but rather average 
performance. 20,000 cases is totally unrealistic. 100 cases would be rare. 10
is 
probably the most that you get in most code, though obviously there are cases 
where you'd get quite a few more. If dmd produced code that was very efficient
for 
the average case but horrible for the worst case, I think that that would be 
fare better than producing code that was mediocre for the average case and good 
for the worst case. Of course, if it was determined that a particular algorithm 
worked well in smaller cases and another in larger cases, then you could just 
have the compiler use the algorithm that works best for the number of case 
statements that you have, but regardless, while how fast switch statements are 
with an insane numbers of case statements may be interested, it's nowhere near 
as relevant as how fast they are with a relatively small number. Whether
there's 
any relation between the speed with a small number of case statements and the 
speed with a large number is something that would have to be verified before 
20,000 cases becomes particularly relevant, much as it would be theoretically 
nice if having 20,000 case statements were efficient.

- Jonathan M Davis

Sep 28 2010

retard <re tard.com.invalid> writes:

Tue, 28 Sep 2010 13:33:16 -0400, bearophile wrote:

 Through Reddit I have found a small article about reverse engineering
 the switch statement: http://www.codeproject.com/KB/cpp/switch.aspx
 
 I have compiled a test program with GCC and then with DMD with minimal
 changes, this is the D program and the asm from the two compilers:
 

[snip]
 
 gcc and llvm-gcc use a binary search, dmd a linear one.

Instead of O(n) linear search or O(ln n) binary search, why not use O(1) 
jump tables in this case?

Sep 28 2010

bearophile <bearophileHUGS lycos.com> writes:

retard:

 Instead of O(n) linear search or O(ln n) binary search, why not use O(1) 
 jump tables in this case?

I don't exactly know. But you must take into account the constants too, it's
not just a matter of worst-case computational complexity. Probably when the
density of a large jump table becomes too much low, its experimental
performance on modern CPUs gets worse than a binary search among few entries.
But I am not sure, I have not written&run benchmarks on this.

Bye,
bearophile

Sep 28 2010

"Robert Jacques" <sandford jhu.edu> writes:

On Tue, 28 Sep 2010 20:45:27 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 retard:

 Instead of O(n) linear search or O(ln n) binary search, why not use O(1)
 jump tables in this case?

 I don't exactly know. But you must take into account the constants too,  
 it's not just a matter of worst-case computational complexity. Probably  
 when the density of a large jump table becomes too much low, its  
 experimental performance on modern CPUs gets worse than a binary search  
 among few entries. But I am not sure, I have not written&run benchmarks  
 on this.

 Bye,
 bearophile

Well there are 28 labeled cases and ~16kb of jump table address space.  
(32kb on 64-bit platforms)

Sep 28 2010

bearophile <bearophileHUGS lycos.com> writes:

Robert Jacques:
 Well there are 28 labeled cases and ~16kb of jump table address space.  
 (32kb on 64-bit platforms)

32 kb are enough to fill the code part of the L1 cache on most CPUs.

Bye,
bearophile

Sep 28 2010

retard <re tard.com.invalid> writes:

Tue, 28 Sep 2010 22:00:06 -0400, bearophile wrote:

 Robert Jacques:
 Well there are 28 labeled cases and ~16kb of jump table address space.
 (32kb on 64-bit platforms)

 
 32 kb are enough to fill the code part of the L1 cache on most CPUs.

28 cases and 32 kB of space seems like a waste. I'd use some sort of 
hashing before the jump to eliminate unnecessary blank slots. However, 
even if this kind of solution was cache friendly, I have no idea how this 
would affect the operation of modern branch predictors.

Sep 28 2010

D Programming

C/C++ Programming

Other

digitalmars.D - Switch implementation