digitalmars.D - Virtual methods
- bearophile (723/733) Feb 03 2010 This post is long, I hope the web interface doesn't cut it down.
- Don (7/8) Feb 03 2010 So probably this can't be useful before D2 is out of alpha state.
- Walter Bright (50/56) Feb 03 2010 Here's what's going on. The virtual function calls are the calls to
- bearophile (8/18) Feb 03 2010 There are Leaf instances too.
- Walter Bright (9/11) Feb 03 2010 That's not safe because there are always opaque modules that are binary
- bearophile (8/11) Feb 03 2010 This sounds like:
- Walter Bright (7/21) Feb 03 2010 A link step is required to figure out what modules wind up in the final
- bearophile (6/8) Feb 04 2010 OK. People can live with a compiler that doesn't perform all possible op...
- Walter Bright (4/13) Feb 04 2010 Believe it or not, you can do profile guided optimizations with the dmd
- dsimcha (4/17) Feb 04 2010 Is that still a relevant optimization? It probably made sense circa 199...
- bearophile (4/7) Feb 04 2010 It's useful still, but just a little, there are far more important optim...
- BCS (7/19) Feb 04 2010 It's still valid (to an extent) for two reasons; careful placement of co...
- Walter Bright (13/21) Feb 04 2010 I'll add to the other replies, and say that:
- Trass3r (1/9) Feb 04 2010 What about C++ code?
- bearophile (5/6) Feb 04 2010 I think D classes are not compatible with C++ ones. And even if that bec...
- retard (3/5) Feb 03 2010 Interesting. So a novice programmer can produce much faster code in Java...
- Walter Bright (5/12) Feb 03 2010 Ever since Java implementations supported JITs, there were some
This post is long, I hope the web interface doesn't cut it down. This post is about a few simple experiments I've done in D on virtual calls and how to improve their performance. So probably this can't be useful before D2 is out of alpha state. For people working at Sun this stuff is probably very old and very naive, but I think currently no D compilers are optimizing this, and if you are not interested please ignore this post.

The performance problem caused by virtual calls is not the small amount of time they take, but mostly the inlining they prevent. I have created a small synthetic benchmark, not realistic but useful to measure this. The program creates a binary tree with 932_465 nodes, and then scans it 600 times, computing the sum of the values stored in the leaves and incrementing those leaf values each time. The tree contains two different kinds of nodes, Internal and Leaf, so the walk over the tree is done with a virtual function. HotSpot shows a significant performance improvement over the C++ code, probably caused by de-virtualization of the virtual calls. GCC Profile Guided Optimization doesn't produce speedups on this program.

// sumtree_d1.d
// D code, works in D1 on Tango and Phobos
// not hard to adapt to D2

const bool USE_TRICKS = false;

version (Tango) {
    import tango.stdc.stdio: printf;
    static if (USE_TRICKS) {
        import tango.stdc.stdlib: exit;
        import tango.core.Memory: GC;
        alias GC.disable disable;
    }
} else {
    import std.c.stdio: printf;
    static if (USE_TRICKS) {
        import std.c.stdlib: exit;
        import std.gc: disable;
    }
}

struct Random { // portable generator
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}

abstract class Node {
    abstract int walkSum();
}

final class Leaf : Node {
    int value;
    this(int x) { this.value = x; }
    override int walkSum() { return this.value++; }
}

final class Internal : Node {
    Node left, right;
    this(Node l, Node r) { this.left = l; this.right = r; }
    override int walkSum() {
        // createTree() never generates null pointers, but we test
        // anyway to be more general with other tree generators
        return ((this.left is null) ? 0 : this.left.walkSum()) +
               ((this.right is null) ? 0 : this.right.walkSum());
    }
}

struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1),
                                createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}

void main() {
    const int NLOOPS = 600;
    static if (USE_TRICKS) {
        disable();
    }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", root.walkSum());
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += root.walkSum();
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    static if (USE_TRICKS) {
        exit(0);
    }
}

I have translated that code to Java and C++; ask if you want to see those versions. I have created more D versions (numbered 2-4 in the timings below): they replace the classes with structs plus a kind tag, which probably improves code inlining, and the last one represents internal nodes using pointers tagged with one bit (so the struct memory is allocated from the C heap).
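The exact source of versions 2-4 isn't shown here, but the struct-plus-tag idea can be sketched roughly like this (the names are hypothetical and the real versions may differ; the kind field replaces the vtable pointer, so the walk becomes a free function with a plain branch that the compiler can inline):

enum Kind { internal, leaf }

struct SNode {
    Kind kind;              // replaces the vtable pointer; the tagged-pointer
                            // variant instead stores this bit inside the pointer
    union {
        struct {
            SNode* left, right;  // used when kind == Kind.internal
        }
        int value;               // used when kind == Kind.leaf
    }
}

int walkSum(SNode* n) {
    if (n is null)
        return 0;
    if (n.kind == Kind.leaf)
        return n.value++;     // same semantics as Leaf.walkSum()
    // a plain branch instead of an indirect call, so this recursion
    // is a candidate for inlining
    return walkSum(n.left) + walkSum(n.right);
}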
The total memory allocated is less (the tags require no extra memory) and this speeds up the allocation a little. The walk is about as fast.

In the D code I have also added a global boolean constant USE_TRICKS that enables/disables two dirty tricks that essentially save some of the time used by the D GC to allocate and deallocate the tree. Such tricks don't speed up the tree walks.

The profiling of the C++ program shows that most of the time is spent inside Internal::walkSum().

Code compiled with:
LLVM 2.6
gcc version 4.2.1 (Based on Apple Inc. build 5649) (LLVM build)
llvm-g++ -Wall -O3 -s -Wl,--enable-stdcall-fixup -fomit-frame-pointer -march=native -ffast-math
Java build 1.7.0-ea-b76
java -server SumTree

CPU: Celeron 2.33 GHz, on Ubuntu running on VirtualBox running on Windows Vista 32 bit.

The current benchmark with seed=142465964 allocates 466_232 Internal nodes and 466_234 Leaf nodes.

Timings (just tree building), best of 3, seconds:
  C++:          0.21
  Java:         0.83
  D 1 notricks: 0.49
  D 2 notricks: 0.46
  D 3 notricks: 0.45
  D 4 notricks: 0.23
  D 1 tricks:   0.24
  D 2 tricks:   0.23
  D 3 tricks:   0.23
  D 4 tricks:   0.22

Timings (full program), best of 3, seconds:
  C++:          15.64
  Java:          8.74
  D 1 notricks: 10.30
  D 2 notricks: 14.26
  D 3 notricks: 14.06
  D 4 notricks: 13.57
  D 1 tricks:    9.77
  D 2 tricks:   13.58
  D 3 tricks:   13.32
  D 4 tricks:   13.06

As you can see from the timings it's not a matter of GC and memory allocations: the building phase in C++ is actually faster than the allocation in the Java version. So probably here HotSpot is optimizing the virtual calls.

Internal.walkSum:
    pushl %edi
    pushl %esi
    subl $4, %esp
    movl 8(%eax), %ecx
    testl %ecx, %ecx
    movl %eax, %esi
    je .LBB5_5
    movl (%ecx), %edx
    movl %ecx, %eax
    call *24(%edx)
.LBB5_2:
    movl %eax, %edi
    movl 12(%esi), %eax
    testl %eax, %eax
    je .LBB5_6
    movl (%eax), %ecx
    call *24(%ecx)
.LBB5_4:
    addl %edi, %eax
    addl $4, %esp
    popl %esi
    popl %edi
    ret
.LBB5_5:
    xorl %eax, %eax
    jmp .LBB5_2
.LBB5_6:
    xorl %eax, %eax
    jmp .LBB5_4

------------------

Internal.walkSum:
    pushl %ebp
    pushl %ebx
    pushl %edi
    pushl %esi
    subl $4, %esp
    movl 4(%eax), %esi
    testl %esi, %esi
    movl %eax, %edi
    je .LBB7_12
    cmpl $0, (%esi)
    jne .LBB7_13
    movl 4(%esi), %eax
    testl %eax, %eax
    je .LBB7_14
    call _D2s24Node7walkSumMFZi
.LBB7_4:
    movl %eax, %ebp
    movl 8(%esi), %eax
    testl %eax, %eax
    je .LBB7_15
    cmpl $0, (%eax)
    jne .LBB7_16
    call _D2s28Internal7walkSumMFZi
    movl %eax, %ebx
.LBB7_7:
    addl %ebp, %ebx
.LBB7_8:
    movl 8(%edi), %eax
    testl %eax, %eax
    je .LBB7_17
    cmpl $0, (%eax)
    jne .LBB7_18
    call _D2s28Internal7walkSumMFZi
    movl %eax, %ecx
.LBB7_11:
    movl %ecx, %eax
    addl %ebx, %eax
    addl $4, %esp
    popl %esi
    popl %edi
    popl %ebx
    popl %ebp
    ret
.LBB7_12:
    xorl %ebx, %ebx
    jmp .LBB7_8
.LBB7_13:
    movl 4(%esi), %ebx
    leal 1(%ebx), %eax
    movl %eax, 4(%esi)
    jmp .LBB7_8
.LBB7_14:
    xorl %eax, %eax
    jmp .LBB7_4
.LBB7_15:
    xorl %ebx, %ebx
    jmp .LBB7_7
.LBB7_16:
    movl 4(%eax), %ebx
    leal 1(%ebx), %ecx
    movl %ecx, 4(%eax)
    jmp .LBB7_7
.LBB7_17:
    xorl %ecx, %ecx
    jmp .LBB7_11
.LBB7_18:
    movl 4(%eax), %ecx
    leal 1(%ecx), %edx
    movl %edx, 4(%eax)
    jmp .LBB7_11

------------------

walkSum:
    pushl %ebx
    pushl %edi
    pushl %esi
    testl %eax, %eax
    jne .LBB6_3
    xorl %eax, %eax
.LBB6_2:
    popl %esi
    popl %edi
    popl %ebx
    ret
.LBB6_3:
    movl %eax, %esi
    cmpl $0, (%esi)
    jne .LBB6_8
    movl 4(%esi), %eax
    call _D2s37walkSumFPS2s34NodeZi
    movl 8(%esi), %esi
    testl %esi, %esi
    movl %eax, %edi
    je .LBB6_9
    cmpl $0, (%esi)
    jne .LBB6_10
    movl 4(%esi), %eax
    call _D2s37walkSumFPS2s34NodeZi
    movl %eax, %ebx
    movl 8(%esi), %eax
    call _D2s37walkSumFPS2s34NodeZi
    addl %ebx, %eax
.LBB6_7:
    addl %edi, %eax
    jmp .LBB6_2
.LBB6_8:
    movl 4(%esi), %eax
    leal 1(%eax), %ecx
    movl %ecx, 4(%esi)
    jmp .LBB6_2
.LBB6_9:
    xorl %eax, %eax
    jmp .LBB6_7
.LBB6_10:
    movl 4(%esi), %eax
    leal 1(%eax), %ecx
    movl %ecx, 4(%esi)
    jmp .LBB6_7

------------------

walkSum:
    pushl %ebx
    pushl %edi
    pushl %esi
    movl %eax, %esi
    andl $4294967294, %esi
    testl %esi, %esi
    jne .LBB13_3
    xorl %eax, %eax
.LBB13_2:
    popl %esi
    popl %edi
    popl %ebx
    ret
.LBB13_3:
    testb $1, %al
    movl (%esi), %eax
    je .LBB13_8
    movl %eax, %edi
    andl $4294967294, %edi
    testl %edi, %edi
    je .LBB13_9
    testb $1, %al
    movl (%edi), %eax
    je .LBB13_10
    call _D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
    movl %eax, %ebx
    movl 4(%edi), %eax
    call _D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
    addl %ebx, %eax
.LBB13_7:
    movl %eax, %edi
    movl 4(%esi), %eax
    call _D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
    addl %edi, %eax
    jmp .LBB13_2
.LBB13_8:
    leal 1(%eax), %ecx
    movl %ecx, (%esi)
    jmp .LBB13_2
.LBB13_9:
    xorl %eax, %eax
    jmp .LBB13_7
.LBB13_10:
    leal 1(%eax), %ecx
    movl %ecx, (%edi)
    jmp .LBB13_7

------------------

st, C++, gcc:

Internal.walkSum:
    subl $12, %esp
    movl %ebx, 4(%esp)
    movl %esi, 8(%esp)
    xorl %ebx, %ebx
    movl 16(%esp), %esi
    movl 4(%esi), %edx
    testl %edx, %edx
    je .L5
    movl (%edx), %eax
    movl %edx, (%esp)
    call *(%eax)
    movl %eax, %ebx
.L5:
    movl 8(%esi), %edx
    xorl %eax, %eax
    testl %edx, %edx
    je .L7
    movl (%edx), %eax
    movl %edx, (%esp)
    call *(%eax)
.L7:
    addl %ebx, %eax
    movl 8(%esp), %esi
    movl 4(%esp), %ebx
    addl $12, %esp
    ret

------------------

st, C++, llvm-gcc:

Internal.walkSum:
    pushl %edi
    pushl %esi
    subl $4, %esp
    movl 16(%esp), %esi
    movl 4(%esi), %eax
    testl %eax, %eax
    je LBB2_5
    movl (%eax), %ecx
    movl %eax, (%esp)
    call *(%ecx)
LBB2_2:
    movl %eax, %edi
    movl 8(%esi), %eax
    testl %eax, %eax
    je LBB2_6
    movl (%eax), %ecx
    movl %eax, (%esp)
    call *(%ecx)
LBB2_4:
    addl %edi, %eax
    addl $4, %esp
    popl %esi
    popl %edi
    ret
LBB2_5:
    xorl %eax, %eax
    jmp LBB2_2
LBB2_6:
    xorl %eax, %eax
    jmp LBB2_4

I have collected the asm generated by HotSpot for Internal.walkSum, but I am not able to read it; it's too messy for me.

I have created an alternative version of the first D program, sumtree_d1b, that uses the __vptr to avoid the virtual table and to allow LDC to inline the code better. The performance is better than sumtree_d1, but not good enough to beat the Java version yet.

// sumtree_d1b.d

const bool USE_TRICKS = false;

version (Tango) {
    import tango.stdc.stdio: printf;
    static if (USE_TRICKS) {
        import tango.stdc.stdlib: exit;
        import tango.core.Memory: GC;
        alias GC.disable disable;
    }
} else {
    import std.c.stdio: printf;
    static if (USE_TRICKS) {
        import std.c.stdlib: exit;
        import std.gc: disable;
    }
}

struct Random {
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}

//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2

abstract class Node {}

final class Leaf : Node {
    int value;
    this(int x) { this.value = x; }
}

final class Internal : Node {
    Node left, right;
    this(Node l, Node r) { this.left = l; this.right = r; }
}

struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1),
                                createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}

int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}

void main() {
    const int NLOOPS = 600;
    static if (USE_TRICKS) {
        disable();
    }

    {
        scope Leaf l = new Leaf(0);
        leaf__vptr = l.__vptr;
    }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", walkTree(root));
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += walkTree(root);
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    static if (USE_TRICKS) {
        exit(0);
    }
}

Timings (full program), best of 3, seconds:
  Java:          8.83 (-server)
  D 1  notricks: 10.44
  D 1b notricks:  9.73
  D 1  tricks:   10.18
  D 1b tricks:    9.45

In this program there are only two possible classes of nodes, so to manually remove virtual calls I have stored the pointer to the virtual table of one of the two classes, the Leaf. (Here unfortunately I can't use conditional compilation to let the compiler pick the right declaration according to the language version, because the D2 version is not syntactically valid D1):

//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2

D docs state that user code shall not use the pointer to the virtual table (probably because it's an implementation detail that can change according to the way virtual calls are implemented; for example dispatch trees don't have such a pointer. Here I have essentially created a small dispatch tree manually).

At the beginning of main() leaf__vptr gets initialized (I don't know a way to initialize this value that doesn't require the creation of a class instance):

{
    scope Leaf l = new Leaf(0);
    leaf__vptr = l.__vptr;
}

Then this free function scans the tree (it can be written iteratively too, but I think the performance is not that different on modern CPUs; this can be tested):

int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}

The double cast like:

cast(Leaf)cast(void*)root

is necessary because a single cast in D is done dynamically; the first cast to void* punches a hole in the type system, so the pair becomes a static cast. I think it's similar to a static_cast in C++.
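To illustrate the difference (a minimal sketch, reusing the Node and Leaf classes above):

Leaf asLeafChecked(Node root) {
    // A direct class-to-class cast is dynamic: it consults the object's
    // ClassInfo at run time and yields null if root is not actually a Leaf.
    return cast(Leaf)root;
}

Leaf asLeafUnchecked(Node root) {
    // Casting through void* first erases the class type, so the second
    // cast is a compile-time reinterpretation with no runtime check; it is
    // only safe here because the __vptr comparison already proved the type.
    return cast(Leaf)cast(void*)root;
}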
The asm of walkTree() shows no virtual calls and it also shows some inlining.

walkTree of sumtree_d1b.d:

_D8sumtree27walkTreeFC8sumtree24NodeZi:
    pushl %ebx
    pushl %edi
    pushl %esi
    movl _D8sumtree210leaf__vptrPPv, %ecx
    cmpl %ecx, (%eax)
    movl %eax, %esi
    je .LBB8_12
    movl 8(%esi), %eax
    testl %eax, %eax
    je .LBB8_13
    call _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_3:
    movl %eax, %edi
    movl 12(%esi), %esi
    testl %esi, %esi
    je .LBB8_14
    movl _D8sumtree210leaf__vptrPPv, %eax
    cmpl %eax, (%esi)
    je .LBB8_15
    movl 8(%esi), %eax
    testl %eax, %eax
    je .LBB8_16
    call _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_7:
    movl %eax, %ebx
    movl 12(%esi), %eax
    testl %eax, %eax
    je .LBB8_17
    call _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_9:
    addl %ebx, %eax
.LBB8_10:
    addl %edi, %eax
.LBB8_11:
    popl %esi
    popl %edi
    popl %ebx
    ret
.LBB8_12:
    movl 8(%esi), %eax
    leal 1(%eax), %ecx
    movl %ecx, 8(%esi)
    jmp .LBB8_11
.LBB8_13:
    xorl %eax, %eax
    jmp .LBB8_3
.LBB8_14:
    xorl %eax, %eax
    jmp .LBB8_10
.LBB8_15:
    movl 8(%esi), %eax
    leal 1(%eax), %ecx
    movl %ecx, 8(%esi)
    jmp .LBB8_10
.LBB8_16:
    xorl %eax, %eax
    jmp .LBB8_7
.LBB8_17:
    xorl %eax, %eax
    jmp .LBB8_9

A possible cause of the slower performance of sumtree_d1b compared to the Java version is cache effects. The garbage collector of Java HotSpot can lay objects out in memory in a better way: in the "Eden" part of the memory allocated by the HotSpot GC the cache coherence is very high. While the current D GC has to avoid long-term memory fragmentation, the Eden doesn't share this problem; it can be used as a stack. So I have created a different version of the walkTree() function that prints the memory positions of the nodes. Then I have used Python to find the frequencies of the distances between node pointers during the walkTree() in sumtree_d1b:

[(-16, 232285), (-32, 115525), (-48, 57657), (-64, 28535), (-80, 14365),
 (-96, 7266), (-112, 3447), (-128, 1721), (8160, 922), (-144, 917),
 (8176, 901), (8144, 688), (8128, 460), (-160, 431), (8112, 305),
 (-176, 218), (8096, 145), (-192, 111), (8080, 99), (-208, 61),
 (8064, 52), (8048, 28), (-224, 25), (8032, 20), (-240, 10), (8016, 10),
 (-272, 6), (8000, 5), (-288, 4), (7968, 3), (-256, 3), (-336, 1),
 (5497120, 1), (7952, 1), (16368, 1), (-1032304, 1), (-320, 1), (7936, 1)]

Those distances are quite jumpy, and they show that the minimal memory block size is 16, so a Leaf object uses 16 bytes just like an Internal object.

So I have overloaded the new() operator in the abstract Node class, allocating a single block of memory from the C heap at the beginning and then slicing it into pieces during node allocations, using it as a stack, as in the Eden of the Java GC. The pointer that new() returns isn't aligned to 16 bytes as the D specs require, so this program is just for experiments (I don't know why it has to be aligned to 16 bytes, maybe just for SSE matters; I think I have received no direct answer about this. An alignment to 8 bytes can be useful for tag bits added to pointers by the garbage collector when the GC runs). It uses just an exit() instead of raising the correct OutOfMemoryError. The bufsize is found experimentally and hard-coded; a real program can't know this value, so it can for example allocate memory in chunks, with a dynamic array of pointers to the memory chunks plus a current position pointer in the last chunk and a pointer to the last position in the last chunk.
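For illustration, a rough sketch of that chunked scheme (the names and the chunk size are made up, and alignment is ignored, as in the benchmark):

version (Tango) import tango.stdc.stdlib: exit, cmalloc = malloc;
else import std.c.stdlib: exit, cmalloc = malloc;

struct ChunkAllocator {
    static const size_t chunkSize = 1 << 20; // 1 MB per chunk, arbitrary
    ubyte*[] chunks; // every chunk allocated so far (to be freed at shutdown)
    ubyte* pos;      // bump pointer into the last chunk
    ubyte* end;      // one past the end of the last chunk

    // assumes sz is much smaller than chunkSize, true for tree nodes
    void* allocate(size_t sz) {
        if (pos is null || pos + sz > end) {
            // current chunk exhausted: grab a new one from the C heap
            ubyte* p = cast(ubyte*)cmalloc(chunkSize);
            if (p is null)
                exit(1); // out of memory
            chunks ~= p;
            pos = p;
            end = p + chunkSize;
        }
        ubyte* result = pos;
        pos += sz;
        return result;
    }
}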
The resulting sumtree_d1c.d code is faster than the Java version:

// sumtree_d1c.d

version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.stdlib: exit, cmalloc = malloc, cfree = free;
} else {
    import std.c.stdio: printf;
    import std.c.stdlib: exit, cmalloc = malloc, cfree = free;
}

struct Random {
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}

//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2

abstract class Node {
    static void[] buffer;
    static int bufindex;
    static const int bufsize = (466_232 * 16) + (466_234 * 12);

    static this() {
        void* p = cmalloc(bufsize);
        if (!p)
            exit(1); // out of memory
        buffer = p[0 .. bufsize];
    }

    static ~this() {
        if (buffer.length) {
            cfree(buffer.ptr);
            buffer = null;
        }
    }

    new(size_t sz) {
        void* p = &buffer[bufindex];
        bufindex += sz;
        if (bufindex <= buffer.length)
            return p;
        else {
            printf("out of memory");
            exit(1);
        }
        assert(0);
    }

    delete(void* p) {
        assert(0);
    }
}

final class Leaf : Node {
    int value;
    this(int x) { this.value = x; }
}

final class Internal : Node {
    Node left, right;
    this(Node l, Node r) { this.left = l; this.right = r; }
}

struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1),
                                createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}

int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}

void main() {
    const int NLOOPS = 600;

    {
        scope Leaf l = new Leaf(0);
        leaf__vptr = l.__vptr;
    }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0 (Internal, Leaves: 466232 466234)
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", walkTree(root));
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += walkTree(root);
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    exit(0);
}

Timings (full program), best of 3, seconds:
  Java:          8.83 (-server)
  D 1  notricks: 10.44
  D 1b notricks:  9.73
  D 1  tricks:   10.18
  D 1b tricks:    9.45
  D 1c tricks:    8.25
  D 1d tricks:    8.12

The frequencies of the distances between node pointers during the walkTree() in sumtree_d1d show a much nicer layout in memory, which probably improves cache coherence:

t = file("adds.txt").read()
t2 = map(int, t.split())
t3 = [t2[i+1] - t2[i] for i in xrange(len(t2)-1)]
from util import frequency
frequency(t3, mode="freq")

[(12, 233188), (28, 116447), (44, 58345), (60, 28995), (76, 14672),
 (92, 7411), (108, 3546), (124, 1773), (140, 945), (156, 451), (172, 228),
 (188, 115), (204, 61), (220, 28), (236, 11), (268, 6), (284, 4), (252, 4),
 (316, 1), (332, 1)]

The sumtree_d1d.d version is the same as the sumtree_d1c.d version, but it swaps the null tests in walkTree(), because the CPU is faster if the "then" clause of the if is the most likely, and in this tree there are no null pointers:

int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left !is null) ? walkTree(iroot.left) : 0) +
               ((iroot.right !is null) ? walkTree(iroot.right) : 0);
    }
}

The Java code doesn't seem to enjoy this change, probably because the runtime profiling performs that optimization already. With a less cheating program, overloading the new() in a correct and flexible way, this code can probably still be faster than the Java code (but I can't be sure before testing it).

I have implemented in D the well known "Richards" benchmark (used in the famous V8 JavaScript benchmarks of Google; the D version is about 8 times faster if well written and well compiled with ldc). I have done a similar experiment of manually removing the single virtual call present in the Richards benchmark. That virtual call is among 4 possible classes, and after the removal the code is a little slower. So that manual optimization is probably worthwhile only when a virtual call has just one or two possible classes to choose from. Ask me if you want the Richards benchmark in D, and various translations, variations, timings, etc.

I have read papers that say that devirtualization, if well implemented, can also be useful in other situations. For example, even if there are many possible calls but only one of them is much more frequent than all the other ones, then at the call point you can put an if to test for the most frequent class and perform a static call to it, otherwise you perform a dynamic call. This usually speeds up the code.
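A sketch of that guarded call, reusing the classes of sumtree_d1.d (a compiler would emit this form automatically from profile data; the hand-written version is only for illustration):

int walkSumGuarded(Node n) {
    if (n.classinfo is Internal.classinfo) {
        // most frequent receiver: a static, inlinable call
        // (Internal is final, so this call needs no vtable)
        return (cast(Internal)cast(void*)n).walkSum();
    }
    // anything else: the ordinary virtual call
    return n.walkSum();
}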
There are many ways to improve this situation in D; I think we can start with something simple:

1) Give dmd a function (if not already present, but I think it's not) that for a call point returns an array of the linearized inheritance tree. If it contains only one class (besides Object) then the call can become static. In D methods are virtual by default, so this simple optimization is probably able to remove many useless virtual calls, and there's no need to mark everything final just to get static calls, keeping classes more flexible.

2) If not already present, the profiler built into DMD can be extended to collect statistics at the virtual call points. Collecting full statistics as I have done with an associative array is probably not that useful; the data structure can be much simpler, something like a static fixed-size array about 4 or 5 items long, where each item is an ID-frequency pair that gets incremented by 1. Such an array can use a move-to-front strategy. Beside these little arrays there can be a variable that counts the total calls for that virtual call point. In D a linear scan of such a small array is simple to implement and faster than an associative array access. If there are more than 4-5 different calls, the extra ones don't get recorded; the total call count at the locus is useful to find how many calls are not registered in the array. Later the compiler can use those frequencies at each call point: if there are only 2 different calls it can use a tiny dispatch tree; if there is a call much more frequent than the other ones it can use a static call for the frequent one and a dynamic call for the others; otherwise it can use a dynamic call.

Later more things can be improved, but I think this is enough to fix most of the situation.
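For illustration, the per-call-site record of point 2 could look roughly like this (a hypothetical sketch; the slot count and the ClassInfo-based IDs are assumptions, not a dmd design):

struct CallSiteStats {
    static const int N = 5;  // 4-5 slots, as suggested above
    ClassInfo[N] ids;        // receiver classes seen at this call point
    uint[N] counts;          // hit count of each recorded class
    uint total;              // every call, recorded or not

    void record(ClassInfo ci) {
        total++;
        for (int i = 0; i < N; i++) {
            if (ids[i] is ci || ids[i] is null) {
                uint c = (ids[i] is null) ? 1 : counts[i] + 1;
                // move-to-front: shift the entries before slot i down one
                // place, so the hottest receiver tends to sit in slot 0
                for (int j = i; j > 0; j--) {
                    ids[j] = ids[j-1];
                    counts[j] = counts[j-1];
                }
                ids[0] = ci;
                counts[0] = c;
                return;
            }
        }
        // more than N distinct receivers: such calls appear only in
        // 'total', so total minus the recorded counts gives the number
        // of unregistered calls
    }
}

Bye,
bearophile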
Feb 03 2010
bearophile wrote:
> This post is about a few simple experiments I've done in D on virtual
> calls and how to improve their performance.
> So probably this can't be useful before D2 is out of alpha state.
> [snip]

This is an excellent piece of research. As you say, it's not top priority right now, but it's important that this stuff doesn't get lost. It'd be great if you could make a 'performance' page on the D Wiki, and link your findings to it. It will be very valuable later on.
Feb 03 2010
bearophile wrote:
> This post is about a few simple experiments I've done in D on virtual
> calls and how to improve their performance. So probably this can't be
> useful before D2 is out of alpha state. For people working at Sun this
> stuff is probably very old and very naive, but I think currently no D
> compilers are optimizing this, and if you are not interested please
> ignore this post.

Here's what's going on. The virtual function calls are the calls to Internal.walkSum() via an expression typed as Node.walkSum(). The Java JIT has access to the entire program, so it can do whole program analysis and determine that the *only* instances of Node that ever exist are actually instances of Internal, therefore all virtual calls to Node.walkSum() can be replaced with direct calls to (and inlining of) Internal.walkSum(). This is called "direct call optimization". I'm pretty sure that Java compilers did this optimization from the beginning. It is not advanced technology, nor even terribly clever. Even C++ compilers did it 10 years earlier (when the static type of the instance was known, i.e. it was a value type).

So why doesn't D do it? Because D's optimization runs when compiling a module. It does not have access to the whole program, therefore it cannot know that the only instances of Node are Internal, and cannot make the optimization. (The C++ style optimization can't be done because D classes are not value types.)

Making Internal final and Internal.walkSum() final does not help, because the types of left and right are Node, not Internal. Another module might import this one, have another class derived from Node, and store instances of it in left and right. This would invalidate the direct call optimization.

However, if you change the types of left and right from Node to Internal, then the compiler knows that walkSum() is final (because Internal being final means there are no overrides of walkSum()), and hence the direct call optimization can be (and is) performed. You can easily verify this by running obj2asm on the output of the compiler.

You can also see it in the e2ir.c source of dmd:

if (!fd->isVirtual() ||
    directcall ||
    fd->isFinal())
{
    // make static call
    ec = el_var(sfunc);
}
else
{
    // make virtual call
    elem *ev;
    unsigned vindex;

    assert(ethis);
    ev = el_same(&ethis);
    ev = el_una(OPind, TYnptr, ev);
    vindex = fd->vtblIndex;

    // Build *(ev + vindex * 4)
    ec = el_bin(OPadd, TYnptr, ev, el_long(TYint, vindex * 4));
    ec = el_una(OPind, TYnptr, ec);
    ec = el_una(OPind, tybasic(sfunc->Stype->Tty), ec);
}
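For illustration, a condensed sketch of that type change, reusing the benchmark's names (it leaves out the Leaf case, so it only shows the mechanism):

final class Internal : Node {
    Internal left, right;  // concrete type instead of Node
    this(Internal l, Internal r) { this.left = l; this.right = r; }
    override int walkSum() {
        // Internal is final and the fields are typed Internal, so no
        // override can hide behind left/right: dmd emits direct calls
        // here, which can then be inlined.
        return ((left is null) ? 0 : left.walkSum()) +
               ((right is null) ? 0 : right.walkSum());
    }
}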
Feb 03 2010
Walter Bright:
> The virtual function calls are the calls to Internal.walkSum() via an
> expression typed as Node.walkSum().

That expression typed as Node.walkSum() also calls Leaf.walkSum().

> The Java JIT has access to the entire program, so it can do whole
> program analysis and determine that the *only* instances of Node that
> ever exist are actually instances of Internal,

There are Leaf instances too.

> So why doesn't D do it? Because D's optimization runs when compiling a
> module. It does not have access to the whole program,

With LDC you can use a level of optimization (LTO) that spans all the modules you are compiling at once. There are times when you want to optimize a final release of a program as well as possible. And anyway, this whole benchmark program is made of a single little module.

> However, if you change the types of left and right from Node to
> Internal,

I can't, because there are the Leaf objects too.

Thank you, bye,
bearophile
Feb 03 2010
Sorry, I had missed the Leaf node.

bearophile wrote:
> With LDC you can use a level of optimization (LTO) that spans all the
> modules you are compiling at once.

That's not safe because there are always opaque modules that are binary only. They may derive from the class in the module - and the optimizer cannot prove they don't.

Hardly any programmers are knowledgeable enough about compiler optimizations to be comfortable using unsafe optimizations correctly. Most of the time, you just get mad users and a reputation for a "buggy" optimizer.
Feb 03 2010
Walter Bright:
> That's not safe because there are always opaque modules that are binary
> only. They may derive from the class in the module - and the optimizer
> cannot prove they don't.

This sounds like: Lasciate ogni speranza, voi ch'entrate ("Abandon all hope, ye who enter").

Let's try again. I think there can be two types of opaque binary-only modules:
- Ones compiled from C code, or with a C interface. Such modules can't contain classes, so there is no problem with virtual functions.
- Ones compiled from D code, or that use the D code interface, that can contain and use classes and virtual calls. The language can require such modules to contain a data structure that tells the compiler what's safe to do.

Bye,
bearophile
Feb 03 2010
bearophile wrote:
> Walter Bright:
>> That's not safe because there are always opaque modules that are
>> binary only. They may derive from the class in the module - and the
>> optimizer cannot prove they don't.
>
> Let's try again. I think there can be two types of opaque binary-only
> modules:
> - Ones compiled from C code, or with a C interface. Such modules can't
> contain classes, so there is no problem with virtual functions.
> - Ones compiled from D code, or that use the D code interface, that can
> contain and use classes and virtual calls. The language can require
> such modules to contain a data structure that tells the compiler what's
> safe to do.

A link step is required to figure out what modules wind up in the final program. So, in order to get this information into the front end where it belongs, you essentially need to compile it, link it, feed the results of the link back into the compiler, recompile, and relink. Not impossible, and a "pseudo-link" could be done after the semantic phase of compilation, but these things would be a lot of work to implement.
Feb 03 2010
Walter Bright:
> Not impossible, and a "pseudo-link" could be done after the semantic
> phase of compilation, but these things would be a lot of work to
> implement.

OK. People can live with a compiler that doesn't perform all possible optimizations from its first version 1.0, but I think it's important that the language is not designed to make such improvements / optimizations impossible (or too hard) to implement. (This is also why some time ago I suggested that the D type system tell apart pinned GC pointers from normal raw unpinned pointers: if in the future D has success, someone is then free to implement a mostly-moving GC, even if right now no one is able or interested in doing it. You told me that idea makes the language too hairy.)

Partially unrelated note: LLVM plans to add a profile guided optimization level too, to be used with -O6, as GCC does. LLVM also plans to do something that GCC is not currently able to do: the level -O7, lifelong optimization. This is like in Java: the profile guided optimization happens at normal runtime too, so you need to keep the LLVM compiler along with your shipped binary. The D2 language can use that technology too in the future. With care and work, this will allow de-virtualizing functions at runtime in D2/D3.
http://llvm.org/releases/1.4/docs/CommandGuide/html/llvmc.html

Bye,
bearophile
Feb 04 2010
bearophile wrote:
> Partially unrelated note: LLVM plans to add a profile guided
> optimization level too, to be used with -O6, as GCC does. LLVM also
> plans to do something that GCC is not currently able to do: the level
> -O7, lifelong optimization. This is like in Java: the profile guided
> optimization happens at normal runtime too, so you need to keep the
> LLVM compiler along with your shipped binary. The D2 language can use
> that technology too in the future. With care and work, this will allow
> de-virtualizing functions at runtime in D2/D3.
> http://llvm.org/releases/1.4/docs/CommandGuide/html/llvmc.html

Believe it or not, you can do profile guided optimizations with the dmd profiler and optlink. The profiler will emit a file to be passed to the linker which will set the layout of functions to minimize paging.
Feb 04 2010
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
> bearophile wrote:
>> Partially unrelated note: LLVM plans to add a profile guided
>> optimization level too, to be used with -O6, as GCC does. LLVM also
>> plans to do something that GCC is not currently able to do: the level
>> -O7, lifelong optimization. This is like in Java: the profile guided
>> optimization happens at normal runtime too, so you need to keep the
>> LLVM compiler along with your shipped binary. The D2 language can use
>> that technology too in the future. With care and work, this will allow
>> de-virtualizing functions at runtime in D2/D3.
>> http://llvm.org/releases/1.4/docs/CommandGuide/html/llvmc.html
> Believe it or not, you can do profile guided optimizations with the dmd
> profiler and optlink. The profiler will emit a file to be passed to the
> linker which will set the layout of functions to minimize paging.

Is that still a relevant optimization? It probably made sense circa 1995, when computers had much less memory, but now I would think that binaries are so much smaller than a typical computer's memory that they almost never get paged out.
Feb 04 2010
dsimcha:
> Is that still a relevant optimization? It probably made sense circa
> 1995, when computers had much less memory, but now I would think that
> binaries are so much smaller than a typical computer's memory that they
> almost never get paged out.

It's still useful, but just a little; there are far more important optimizations (LTO, other forms of profiling that allow partial unrolling even when the number of cycles isn't known at compile time, inlining of ex-virtual functions, and so on). Grouping often-used functions is useful because they can be kept better in L2 cache, and sometimes they can even be kept in L1 :-)

Bye,
bearophile
Feb 04 2010
Hello dsimcha,

> == Quote from Walter Bright (newshound1 digitalmars.com)'s article
>> Believe it or not, you can do profile guided optimizations with the
>> dmd profiler and optlink. The profiler will emit a file to be passed
>> to the linker which will set the layout of functions to minimize
>> paging.
>
> Is that still a relevant optimization? It probably made sense circa
> 1995, when computers had much less memory, but now I would think that
> binaries are so much smaller than a typical computer's memory that they
> almost never get paged out.

It's still valid (to an extent) for two reasons: careful placement of code can make things load faster by causing a more ideal page load order (all code starts paged out), and if you turn your head and squint, the CPU cache looks like paged memory.

-- 
<IXOYE><
Feb 04 2010
dsimcha wrote:
> == Quote from Walter Bright (newshound1 digitalmars.com)'s article
>> Believe it or not, you can do profile guided optimizations with the
>> dmd profiler and optlink. The profiler will emit a file to be passed
>> to the linker which will set the layout of functions to minimize
>> paging.
>
> Is that still a relevant optimization? It probably made sense circa
> 1995, when computers had much less memory, but now I would think that
> binaries are so much smaller than a typical computer's memory that they
> almost never get paged out.

I'll add to the other replies, and say that:

1. If your program has rarely used features, those can be grouped together in such a way that they never load into memory, and so are essentially costless.

2. Initialization code can be separated from the working loop, such that the init code pages are loaded, run, and discarded, and never need loading again.

3. File systems like to read files sequentially. So if your code is organized so the next code to execute is next on the disk image, it will load faster.

4. Strongly coupled functions (functions that call each other a lot) should be next to each other, so they wind up in the same cache.
Feb 04 2010
> Let's try again. I think there can be two types of opaque binary-only
> modules:
> - Ones compiled from C code, or with a C interface. Such modules can't
> contain classes, so there is no problem with virtual functions.
> - Ones compiled from D code, or that use the D code interface, that can
> contain and use classes and virtual calls. The language can require
> such modules to contain a data structure that tells the compiler what's
> safe to do.

What about C++ code?
Feb 04 2010
Trass3r:
> What about C++ code?

I think D classes are not compatible with C++ ones. And even if that becomes possible, you can use the safest approach when a C++-compiled opaque module uses classes and doesn't have the required data structure that all D modules must have.

Anyway, Walter has not commented on that solution, so maybe it is a wrong idea. He suggests solving the problem with a pre-link stage, which he says is possible but requires a lot of work to implement. I think he is not much interested in this whole topic.

Bye,
bearophile
Feb 04 2010
Wed, 03 Feb 2010 12:37:56 -0500, bearophile wrote:
> Later more things can be improved, but I think this is enough to fix
> most of the situation.

Interesting. So a novice programmer can produce much faster code in Java than a semi-pro in D in this case.
Feb 03 2010
retard wrote:
> Wed, 03 Feb 2010 12:37:56 -0500, bearophile wrote:
>> Later more things can be improved, but I think this is enough to fix
>> most of the situation.
>
> Interesting. So a novice programmer can produce much faster code in
> Java than a semi-pro in D in this case.

Ever since Java implementations supported JITs, there have been some benchmarks that ran faster in Java than under the best C++ optimizing compilers. This is because Java JITs have access to the whole program, and native compilers generally do not.
Feb 03 2010