digitalmars.D - Switch implementation
- bearophile (430/430) Sep 28 2010 Through Reddit I have found a small article about reverse engineering th...
- Pelle (3/11) Sep 28 2010 Interesting. Do you have any performance comparisons? The input is
- Iain Buclaw (68/84) Sep 28 2010 changes, this is the D program and the asm from the two compilers:
- bearophile (6/7) Sep 28 2010 You are mostly testing compiler speed, while my worry was about runtime....
- Iain Buclaw (6/13) Sep 28 2010 something useful to say:
- bearophile (106/108) Sep 28 2010 If your purpose is to test runtime speed, use a more natural number of c...
- bioinfornatics (55/55) Sep 28 2010 with ldc and tango (up to date)
- bioinfornatics (2/2) Sep 28 2010 I have a AMD Phenom(tm) II X4 955 Processor and this script works only o...
- bearophile (5/13) Sep 28 2010 LDC has llvm back-end, that doesn't share that dmd problem.
- Iain Buclaw (44/56) Sep 28 2010 --snip--
- bearophile (5/15) Sep 28 2010 Probably because gcc/gdc is not using a jump table here, but a binary se...
- Jonathan M Davis (18/25) Sep 28 2010 The real question though isn't worst case performance but rather average...
- retard (4/12) Sep 28 2010 Instead of O(n) linear search or O(ln n) binary search, why not use O(1)...
- bearophile (4/6) Sep 28 2010 I don't exactly know. But you must take into account the constants too, ...
- Robert Jacques (4/15) Sep 28 2010 Well there are 28 labeled cases and ~16kb of jump table address space.
- bearophile (4/6) Sep 28 2010 32 kb are enough to fill the code part of the L1 cache on most CPUs.
- retard (5/10) Sep 28 2010 28 cases and 32 kB of space seems like a waste. I'd use some sort of
Through Reddit I have found a small article about reverse engineering the switch statement: http://www.codeproject.com/KB/cpp/switch.aspx I have compiled a test program with GCC and then with DMD with minimal changes, this is the D program and the asm from the two compilers: import std.c.stdio: puts; import std.c.stdlib: atoi; void f1() { puts("f1 called"); } void f2() { puts("f2 called"); } void f3() { puts("f3 called"); } void main() { int i = atoi("3"); switch (i) { case 140: f1(); break; case 300: f1(); break; case 1280: f1(); break; case 1540: f1(); break; case 1660: f1(); break; case 1770: f2(); break; case 2150: f2(); break; case 2190: f1(); break; case 2530: f2(); break; case 2560: f2(); break; case 2590: f1(); break; case 2660: f1(); break; case 2720: f2(); break; case 3010: f1(); break; case 3100: f1(); break; case 3390: f2(); break; case 3760: f1(); break; case 3970: f2(); break; case 4050: f1(); break; case 4140: f1(); break; case 4360: f2(); break; case 4540: f1(); break; case 4600: f2(); break; case 4720: f2(); break; case 4730: f2(); break; case 4740: f2(); break; case 4880: f2(); break; case 4950: f1(); break; default: f3(); } } /* ------------------------------ DMD optimized build: __Dmain comdat push EBX mov EAX,offset FLAT:_DATA[024h] push EAX call near ptr _atoi add ESP,4 cmp EAX,08Ch je L115 cmp EAX,012Ch je L115 cmp EAX,0500h je L115 cmp EAX,0604h je L115 cmp EAX,067Ch je L115 cmp EAX,06EAh je L105 cmp EAX,0866h je L105 cmp EAX,088Eh je L115 cmp EAX,09E2h je L105 cmp EAX,0A00h je L105 cmp EAX,0A1Eh je L115 cmp EAX,0A64h je L115 cmp EAX,0AA0h je L105 cmp EAX,0BC2h je L115 cmp EAX,0C1Ch je L115 cmp EAX,0D3Eh je L105 cmp EAX,0EB0h je L115 cmp EAX,0F82h je L105 cmp EAX,0FD2h je L115 cmp EAX,0102Ch je L115 cmp EAX,01108h je L105 cmp EAX,011BCh je L115 cmp EAX,011F8h je L105 cmp EAX,01270h je L105 cmp EAX,0127Ah je L105 cmp EAX,01284h je L105 cmp EAX,01310h je L105 cmp EAX,01356h je L115 jmp short L125 L105: mov ECX,offset FLAT:_DATA[0Ch] push ECX call near ptr _puts add ESP,4 jmp short L133 L115: mov EDX,offset FLAT:_DATA push EDX call near ptr _puts add ESP,4 jmp short L133 L125: mov EBX,offset FLAT:_DATA[018h] push EBX call near ptr _puts add ESP,4 L133: pop EBX xor EAX,EAX ret ---------------------------------- GCC 4.5.1 -O3 _main: pushl %ebp movl %esp, %ebp andl $-16, %esp subl $16, %esp call ___main movl $LC3, (%esp) call _atoi cmpl $3010, %eax je L33 jle L43 cmpl $4360, %eax je L32 .p2align 4,,6 jle L44 cmpl $4730, %eax je L32 .p2align 4,,6 jle L45 cmpl $4880, %eax je L32 cmpl $4950, %eax je L33 cmpl $4740, %eax jne L5 .p2align 4,,7 L32: movl $LC1, (%esp) call _puts xorl %eax, %eax leave LCFI16: ret .p2align 4,,7 L45: LCFI17: cmpl $4600, %eax je L32 cmpl $4720, %eax je L32 cmpl $4540, %eax jne L5 .p2align 4,,7 L33: movl $LC0, (%esp) call _puts L41: xorl %eax, %eax leave LCFI18: ret .p2align 4,,7 L43: LCFI19: cmpl $2150, %eax je L32 .p2align 4,,4 jle L46 cmpl $2560, %eax je L32 .p2align 4,,6 jle L47 cmpl $2660, %eax je L33 cmpl $2720, %eax je L32 cmpl $2590, %eax jne L5 jmp L33 .p2align 4,,7 L44: cmpl $3760, %eax je L33 .p2align 4,,6 jle L48 cmpl $4050, %eax je L33 cmpl $4140, %eax je L33 cmpl $3970, %eax jne L5 jmp L32 .p2align 4,,7 L46: cmpl $1280, %eax je L33 .p2align 4,,6 jle L49 cmpl $1660, %eax je L33 cmpl $1770, %eax je L32 cmpl $1540, %eax jne L5 jmp L33 .p2align 4,,7 L47: cmpl $2190, %eax je L33 cmpl $2530, %eax je L32 .p2align 4,,7 L5: movl $LC2, (%esp) call _puts jmp L41 .p2align 4,,7 L49: cmpl $140, %eax je L33 cmpl $300, %eax jne L5 jmp L33 .p2align 4,,7 L48: cmpl $3100, %eax je L33 cmpl $3390, %eax jne L5 jmp L32 --------------------------- llvm-gcc V.2.7, -O3 _main: pushl %ebp movl %esp, %ebp subl $8, %esp call ___main movl $L_.str3, (%esp) call _atoi cmpl $299, %eax jg LBB4_4 cmpl $140, %eax jne LBB4_56 LBB4_2: movl $L_.str2, (%esp) LBB4_3: call _puts xorl %eax, %eax addl $8, %esp popl %ebp ret LBB4_4: cmpl $1279, %eax jg LBB4_6 cmpl $300, %eax je LBB4_2 jmp LBB4_56 LBB4_6: cmpl $1539, %eax jg LBB4_8 cmpl $1280, %eax je LBB4_2 jmp LBB4_56 LBB4_8: cmpl $1659, %eax jg LBB4_10 cmpl $1540, %eax je LBB4_2 jmp LBB4_56 LBB4_10: cmpl $1769, %eax jg LBB4_12 cmpl $1660, %eax je LBB4_2 jmp LBB4_56 LBB4_12: cmpl $2149, %eax jg LBB4_15 cmpl $1770, %eax jne LBB4_56 LBB4_14: movl $L_.str1, (%esp) jmp LBB4_3 LBB4_15: cmpl $4949, %eax jg LBB4_55 cmpl $4879, %eax jg LBB4_54 cmpl $2189, %eax jg LBB4_19 cmpl $2150, %eax je LBB4_14 jmp LBB4_56 LBB4_19: cmpl $2529, %eax jg LBB4_21 cmpl $2190, %eax je LBB4_2 jmp LBB4_56 LBB4_21: cmpl $2559, %eax jg LBB4_23 cmpl $2530, %eax je LBB4_14 jmp LBB4_56 LBB4_23: cmpl $2589, %eax jg LBB4_25 cmpl $2560, %eax je LBB4_14 jmp LBB4_56 LBB4_25: cmpl $2659, %eax jg LBB4_27 cmpl $2590, %eax je LBB4_2 jmp LBB4_56 LBB4_27: cmpl $2719, %eax jg LBB4_29 cmpl $2660, %eax je LBB4_2 jmp LBB4_56 LBB4_29: cmpl $3009, %eax jg LBB4_31 cmpl $2720, %eax je LBB4_14 jmp LBB4_56 LBB4_31: cmpl $3099, %eax jg LBB4_33 cmpl $3010, %eax je LBB4_2 jmp LBB4_56 LBB4_33: cmpl $3389, %eax jg LBB4_35 cmpl $3100, %eax je LBB4_2 jmp LBB4_56 LBB4_35: cmpl $3759, %eax jg LBB4_37 cmpl $3390, %eax je LBB4_14 jmp LBB4_56 LBB4_37: cmpl $3969, %eax jg LBB4_39 cmpl $3760, %eax je LBB4_2 jmp LBB4_56 LBB4_39: cmpl $4049, %eax jg LBB4_41 cmpl $3970, %eax je LBB4_14 jmp LBB4_56 LBB4_41: cmpl $4139, %eax jg LBB4_43 cmpl $4050, %eax je LBB4_2 jmp LBB4_56 LBB4_43: cmpl $4359, %eax jg LBB4_45 cmpl $4140, %eax je LBB4_2 jmp LBB4_56 LBB4_45: cmpl $4539, %eax jg LBB4_47 cmpl $4360, %eax je LBB4_14 jmp LBB4_56 LBB4_47: cmpl $4599, %eax jg LBB4_49 cmpl $4540, %eax je LBB4_2 jmp LBB4_56 LBB4_49: cmpl $4719, %eax jg LBB4_51 cmpl $4600, %eax je LBB4_14 jmp LBB4_56 LBB4_51: cmpl $4720, %eax je LBB4_14 cmpl $4730, %eax je LBB4_14 cmpl $4740, %eax je LBB4_14 jmp LBB4_56 LBB4_54: cmpl $4880, %eax je LBB4_14 jmp LBB4_56 LBB4_55: cmpl $4950, %eax je LBB4_2 LBB4_56: movl $L_.str, (%esp) jmp LBB4_3 --------------------------------- */ gcc and llvm-gcc use a binary search, dmd a linear one. I have done a similar interesting test (similar to switch5.cpp of the article author) where a good implementation of the switch is a small table for part of the cases and a binary tree for the other cases. Bye, bearophile
Sep 28 2010
On 09/28/2010 07:33 PM, bearophile wrote:Through Reddit I have found a small article about reverse engineering the switch statement: http://www.codeproject.com/KB/cpp/switch.aspx I have compiled a test program with GCC and then with DMD with minimal changes, this is the D program and the asm from the two compilers: ... listing ... gcc and llvm-gcc use a binary search, dmd a linear one. I have done a similar interesting test (similar to switch5.cpp of the article author) where a good implementation of the switch is a small table for part of the cases and a binary tree for the other cases. Bye, bearophileInteresting. Do you have any performance comparisons? The input is fairly small, I'm not sure a binary search will help a lot.
Sep 28 2010
== Quote from Pelle (pelle.mansson gmail.com)'s articleOn 09/28/2010 07:33 PM, bearophile wrote:switch statement:Through Reddit I have found a small article about reverse engineering thechanges, this is the D program and the asm from the two compilers:http://www.codeproject.com/KB/cpp/switch.aspx I have compiled a test program with GCC and then with DMD with minimalauthor) where a good implementation of the switch is a small table for part of the cases and a binary tree for the other cases.... listing ... gcc and llvm-gcc use a binary search, dmd a linear one. I have done a similar interesting test (similar to switch5.cpp of the articleOut of curiousity, I thought I might give a good stress test a try. Specs of my machine: Samsung N110, 2.1GHz Intel Atom, 2GB Memory. Code looks like this: It's a switch statement with 20,000 cases. import std.c.stdio: puts; import std.c.stdlib: atoi; void f1() { puts("f1 called"); } void f2() { puts("f2 called"); } void f3() { puts("f3 called"); } void main() { int i = atoi("20000"); switch (i) { case 1: f1(); break; case 2: f2(); break; case 3: f3(); break; // Turtles all the down... case 19999: f1(); break; case 20000: f2(); break; default: f3(); break; } } Compiling with gdc -O3. objdump output shows us the binary search bearophile mentioned: lea 0x0(%esi),%esi jmp *0x80a9350(,%eax,4) nop call 8049430 <_D4test2f2FZv> lea 0x0(%esi),%esi jmp 8049491 <_Dmain+0x21> lea 0x0(%esi),%esi call 8049410 <_D4test2f3FZv> lea 0x0(%esi),%esi jmp 8049491 <_Dmain+0x21> lea 0x0(%esi),%esi call 8049430 <_D4test2f2FZv> ... Time it takes to compile: real 2m0.061s user 1m46.567s sys 0m1.176s Time it takes to run: real 0m0.007s user 0m0.004s sys 0m0.004s What is interesting (for me) is that the compile time *could* be quicker if the semantic for CaseStatement is little smarter at checking for duplicate cases. The current implementation is (in pseudo code), is a very linear for (int i = 0; i < num_searched_cases; i++) check_is_dupe(); So if we have 20,000 cases in a statement, the compiler will iterate over the loop like so: 0 0 1 0 1 2 0 1 2 3 ... 0 .. 19999 0 .. 20000 Which is slow on any machine... I would post results for DMD, but that is still compiling 20 minutes in... Currently writing out function main to object file... This is certainly a first for me, but this one user case shows that DMD *can* be woefully slower than GDC at something. :3Bye, bearophileInteresting. Do you have any performance comparisons? The input is fairly small, I'm not sure a binary search will help a lot.
Sep 28 2010
Iain Buclaw:Out of curiousity, I thought I might give a good stress test a try.You are mostly testing compiler speed, while my worry was about runtime. I didn't even know DMD is able to digest switch statements with more than 256 cases. I have opened an enhancement request, add a comment to it if you think you have something useful to say: http://d.puremagic.com/issues/show_bug.cgi?id=4952 Bye, bearophile
Sep 28 2010
== Quote from bearophile (bearophileHUGS lycos.com)'s articleIain Buclaw:didn't even know DMD is able to digest switch statements with more than 256 cases.Out of curiousity, I thought I might give a good stress test a try.You are mostly testing compiler speed, while my worry was about runtime. II have opened an enhancement request, add a comment to it if you think you havesomething useful to say:http://d.puremagic.com/issues/show_bug.cgi?id=4952 Bye, bearophileOh, it was my *original* intention to test runtime speed. However, the time to compile just stood out little more like a sore thumb than what I anticipated. Iain
Sep 28 2010
Iain Buclaw:Oh, it was my *original* intention to test runtime speed. However, the time to compile just stood out little more like a sore thumb than what I anticipated.If your purpose is to test runtime speed, use a more natural number of cases like 10 or 20 or even 50 :-) So I have done a better test, see below for the code. Timings, NLOOPS = 100_000, best of 6, seconds: DMD: 7.70 GCC: 2.42 gcc 4.5.1, -Wall -O3 -s dmd 2.049, -O -release -inline -------------------------------- // D code import std.c.stdio: printf; enum int NLOOPS = 100000; int c1, c2, c3; void f1() { c1++; } void f2() { c2++; } void f3() { c3++; } int main() { int i, j; for (i = 0; i < NLOOPS; i++) { for (j = 0; j < 5000; j++) { switch (j) { case 140: f1(); break; case 300: f1(); break; case 1280: f1(); break; case 1540: f1(); break; case 1660: f1(); break; case 1770: f2(); break; case 2150: f2(); break; case 2190: f1(); break; case 2530: f2(); break; case 2560: f2(); break; case 2590: f1(); break; case 2660: f1(); break; case 2720: f2(); break; case 3010: f1(); break; case 3100: f1(); break; case 3390: f2(); break; case 3760: f1(); break; case 3970: f2(); break; case 4050: f1(); break; case 4140: f1(); break; case 4360: f2(); break; case 4540: f1(); break; case 4600: f2(); break; case 4720: f2(); break; case 4730: f2(); break; case 4740: f2(); break; case 4880: f2(); break; case 4950: f1(); break; default: f3(); } } } printf("%d %d %d\n", c1, c2, c3); return 0; } -------------------------------- // C code #include "stdio.h" #define NLOOPS 100000 int c1, c2, c3; void f1() { c1++; } void f2() { c2++; } void f3() { c3++; } int main() { int i, j; for (i = 0; i < NLOOPS; i++) { for (j = 0; j < 5000; j++) { switch (j) { case 140: f1(); break; case 300: f1(); break; case 1280: f1(); break; case 1540: f1(); break; case 1660: f1(); break; case 1770: f2(); break; case 2150: f2(); break; case 2190: f1(); break; case 2530: f2(); break; case 2560: f2(); break; case 2590: f1(); break; case 2660: f1(); break; case 2720: f2(); break; case 3010: f1(); break; case 3100: f1(); break; case 3390: f2(); break; case 3760: f1(); break; case 3970: f2(); break; case 4050: f1(); break; case 4140: f1(); break; case 4360: f2(); break; case 4540: f1(); break; case 4600: f2(); break; case 4720: f2(); break; case 4730: f2(); break; case 4740: f2(); break; case 4880: f2(); break; case 4950: f1(); break; default: f3(); } } } printf("%d %d %d\n", c1, c2, c3); return 0; } Bye, bearophile
Sep 28 2010
with ldc and tango (up to date) $ ldc -O5 -release -enable-inlining test.d $ time ./test 1500000 1300000 497200000 real 0m4.376s user 0m4.373s sys 0m0.001s D Code ____________________________________ import tango.stdc.stdio: printf; int NLOOPS = 100000; int c1, c2, c3; void f1() { c1++; } void f2() { c2++; } void f3() { c3++; } int main() { int i, j; for (i = 0; i < NLOOPS; i++) { for (j = 0; j < 5000; j++) { switch (j) { case 140: f1(); break; case 300: f1(); break; case 1280: f1(); break; case 1540: f1(); break; case 1660: f1(); break; case 1770: f2(); break; case 2150: f2(); break; case 2190: f1(); break; case 2530: f2(); break; case 2560: f2(); break; case 2590: f1(); break; case 2660: f1(); break; case 2720: f2(); break; case 3010: f1(); break; case 3100: f1(); break; case 3390: f2(); break; case 3760: f1(); break; case 3970: f2(); break; case 4050: f1(); break; case 4140: f1(); break; case 4360: f2(); break; case 4540: f1(); break; case 4600: f2(); break; case 4720: f2(); break; case 4730: f2(); break; case 4740: f2(); break; case 4880: f2(); break; case 4950: f1(); break; default: f3(); } } } printf("%d %d %d\n", c1, c2, c3); return 0; }
Sep 28 2010
I have a AMD Phenom(tm) II X4 955 Processor and this script works only on one core of my quad (800Mhz)
Sep 28 2010
bioinfornatics:with ldc and tango (up to date) $ ldc -O5 -release -enable-inlining test.d $ time ./test 1500000 1300000 497200000 real 0m4.376s user 0m4.373s sys 0m0.001sLDC has llvm back-end, that doesn't share that dmd problem. But to give us meaningful data, I suggest you to time the C program too on your machine, so we can compare :-) Bye, bearophile
Sep 28 2010
== Quote from bearophile (bearophileHUGS lycos.com)'s articleIain Buclaw:like 10 or 20 or even 50 :-)Oh, it was my *original* intention to test runtime speed. However, the time to compile just stood out little more like a sore thumb than what I anticipated.If your purpose is to test runtime speed, use a more natural number of casesSo I have done a better test, see below for the code. Timings, NLOOPS = 100_000, best of 6, seconds: DMD: 7.70 GCC: 2.42 gcc 4.5.1, -Wall -O3 -s dmd 2.049, -O -release -inline--snip--Bye, bearophileWell, this should give you a good comparison between case jump tables on and off in GCC. gdc-4.4 -O0 real 0m9.994s user 0m9.957s sys 0m0.028s gdc-4.4 -O0 -fno-jump-tables real 0m10.222s user 0m10.197s sys 0m0.012s gdc-4.4 -O0 -funroll-loops real 0m10.004s user 0m9.965s sys 0m0.032s gdc-4.4 -O0 -funroll-loops -fno-jump-tables real 0m10.011s user 0m9.981s sys 0m0.020s gdc-4.4 -O3 real 0m7.107s user 0m7.092s sys 0m0.008s gdc-4.4 -O3 -fno-jump-tables real 0m7.136s user 0m7.112s sys 0m0.008s gdc-4.4 -O3 -funroll-loops real 0m7.127s user 0m7.104s sys 0m0.008s gdc-4.4 -O3 -funroll-loops -fno-jump-tables real 0m7.237s user 0m7.184s sys 0m0.044s Differences are pretty minimal... In comparison to DMD: dmd -O real 0m15.049s user 0m14.989s sys 0m0.044s Iain
Sep 28 2010
Iain Buclaw:gdc-4.4 -O3 -fno-jump-tables real 0m7.136s user 0m7.112s sys 0m0.008s...gdc-4.4 -O3 -funroll-loops -fno-jump-tables real 0m7.237s user 0m7.184s sys 0m0.044s Differences are pretty minimal...Probably because gcc/gdc is not using a jump table here, but a binary search tree :-) Bye, bearophile
Sep 28 2010
On Tuesday, September 28, 2010 15:46:01 Iain Buclaw wrote:Out of curiousity, I thought I might give a good stress test a try. Specs of my machine: Samsung N110, 2.1GHz Intel Atom, 2GB Memory. Code looks like this: It's a switch statement with 20,000 cases.The real question though isn't worst case performance but rather average performance. 20,000 cases is totally unrealistic. 100 cases would be rare. 10 is probably the most that you get in most code, though obviously there are cases where you'd get quite a few more. If dmd produced code that was very efficient for the average case but horrible for the worst case, I think that that would be fare better than producing code that was mediocre for the average case and good for the worst case. Of course, if it was determined that a particular algorithm worked well in smaller cases and another in larger cases, then you could just have the compiler use the algorithm that works best for the number of case statements that you have, but regardless, while how fast switch statements are with an insane numbers of case statements may be interested, it's nowhere near as relevant as how fast they are with a relatively small number. Whether there's any relation between the speed with a small number of case statements and the speed with a large number is something that would have to be verified before 20,000 cases becomes particularly relevant, much as it would be theoretically nice if having 20,000 case statements were efficient. - Jonathan M Davis
Sep 28 2010
Tue, 28 Sep 2010 13:33:16 -0400, bearophile wrote:Through Reddit I have found a small article about reverse engineering the switch statement: http://www.codeproject.com/KB/cpp/switch.aspx I have compiled a test program with GCC and then with DMD with minimal changes, this is the D program and the asm from the two compilers:[snip]gcc and llvm-gcc use a binary search, dmd a linear one.Instead of O(n) linear search or O(ln n) binary search, why not use O(1) jump tables in this case?
Sep 28 2010
retard:Instead of O(n) linear search or O(ln n) binary search, why not use O(1) jump tables in this case?I don't exactly know. But you must take into account the constants too, it's not just a matter of worst-case computational complexity. Probably when the density of a large jump table becomes too much low, its experimental performance on modern CPUs gets worse than a binary search among few entries. But I am not sure, I have not written&run benchmarks on this. Bye, bearophile
Sep 28 2010
On Tue, 28 Sep 2010 20:45:27 -0400, bearophile <bearophileHUGS lycos.com> wrote:retard:Well there are 28 labeled cases and ~16kb of jump table address space. (32kb on 64-bit platforms)Instead of O(n) linear search or O(ln n) binary search, why not use O(1) jump tables in this case?I don't exactly know. But you must take into account the constants too, it's not just a matter of worst-case computational complexity. Probably when the density of a large jump table becomes too much low, its experimental performance on modern CPUs gets worse than a binary search among few entries. But I am not sure, I have not written&run benchmarks on this. Bye, bearophile
Sep 28 2010
Robert Jacques:Well there are 28 labeled cases and ~16kb of jump table address space. (32kb on 64-bit platforms)32 kb are enough to fill the code part of the L1 cache on most CPUs. Bye, bearophile
Sep 28 2010
Tue, 28 Sep 2010 22:00:06 -0400, bearophile wrote:Robert Jacques:28 cases and 32 kB of space seems like a waste. I'd use some sort of hashing before the jump to eliminate unnecessary blank slots. However, even if this kind of solution was cache friendly, I have no idea how this would affect the operation of modern branch predictors.Well there are 28 labeled cases and ~16kb of jump table address space. (32kb on 64-bit platforms)32 kb are enough to fill the code part of the L1 cache on most CPUs.
Sep 28 2010