digitalmars.D - Virtual methods
This post is long; I hope the web interface doesn't cut it down.
This post is about a few simple experiments I've done in D on virtual calls
and how to improve their performance, so probably none of this can be useful
before D2 is out of its alpha state. For people working at Sun this stuff is
probably very old and very naive, but I think currently no D compiler
performs these optimizations; if you are not interested, please ignore this
post.
The performance problem caused by virtual calls is not so much the small
amount of time they take, but mostly the inlining they prevent.
I have created a small synthetic benchmark, not realistic but useful to
measure this effect. The program creates a binary tree with 932_465 nodes,
and then scans it 600 times, computing the sum of the values stored in the
leaves and incrementing each leaf value on every scan.
The tree contains two different kinds of nodes, Internal and Leaf, so the
walk of the tree is done with a virtual function. HotSpot shows a
significant performance improvement over the C++ code, probably caused by
de-virtualization of the virtual calls. GCC Profile Guided Optimization
doesn't produce speedups on this program.
// sumtree_d1.d
// D code, works in D1 with both Tango and Phobos;
// not hard to adapt to D2

const bool USE_TRICKS = false;

version (Tango) {
    import tango.stdc.stdio: printf;
    static if (USE_TRICKS) {
        import tango.stdc.stdlib: exit;
        import tango.core.Memory: GC;
        alias GC.disable disable;
    }
} else {
    import std.c.stdio: printf;
    static if (USE_TRICKS) {
        import std.c.stdlib: exit;
        import std.gc: disable;
    }
}

struct Random { // portable generator
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}

abstract class Node {
    abstract int walkSum();
}

final class Leaf : Node {
    int value;
    this(int x) { this.value = x; }
    override int walkSum() { return this.value++; }
}

final class Internal : Node {
    Node left, right;
    this(Node l, Node r) {
        this.left = l;
        this.right = r;
    }
    override int walkSum() {
        // createTree() never generates null pointers, but we test
        // anyway to be more general with other tree generators
        return ((this.left is null) ? 0 : this.left.walkSum()) +
               ((this.right is null) ? 0 : this.right.walkSum());
    }
}

struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1),
                                createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}

void main() {
    const int NLOOPS = 600;
    static if (USE_TRICKS) { disable(); }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", root.walkSum());
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += root.walkSum();
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    static if (USE_TRICKS) { exit(0); }
}
I have translated that code to Java and C++; ask if you want to see those
versions.
I have created more D versions:
- D version 2 replaces the classes with structs, telling Internal and Leaf
  nodes apart with the struct tag.
- D version 3 walks the tree with a free function instead of a method; this
  probably improves code inlining.
- D version 4 tells leaves apart from internal nodes using pointers tagged
  with one bit (so the struct memory is allocated from the C heap). The
  total memory allocated is less (the tags require no extra memory) and this
  speeds up the allocation a little. The walk is about as fast. (A sketch of
  these representations follows.)
In the D code I have also added a global boolean constant USE_TRICKS that
enables/disables two dirty tricks, which essentially save some of the time
used by the D GC to allocate and deallocate the tree. Such tricks don't
speed up the tree walks.
The profiling of the C++ program shows that most of the time is spent inside
Internal::walkSum().
Code compiled with:

LLVM 2.6, gcc version 4.2.1 (Based on Apple Inc. build 5649) (LLVM build):
llvm-g++ -Wall -O3 -s -Wl,--enable-stdcall-fixup -fomit-frame-pointer
-march=native -ffast-math

Java build 1.7.0-ea-b76:
java -server SumTree

CPU: Celeron 2.33 GHz, on Ubuntu running on VirtualBox running on Windows
Vista 32 bit.
The current benchmark with seed=142465964 allocates 466_232 Internal nodes, and
466_234 Leaves.
Timings (just tree building), best of 3, seconds:
C++: 0.21
Java: 0.83
D 1 notricks: 0.49
D 2 notricks: 0.46
D 3 notricks: 0.45
D 4 notricks: 0.23
D 1 tricks: 0.24
D 2 tricks: 0.23
D 3 tricks: 0.23
D 4 tricks: 0.22
Timings (full program), best of 3, seconds:
C++: 15.64
Java: 8.74
D 1 notricks: 10.30
D 2 notricks: 14.26
D 3 notricks: 14.06
D 4 notricks: 13.57
D 1 tricks: 9.77
D 2 tricks: 13.58
D 3 tricks: 13.32
D 4 tricks: 13.06
As you can see from the timings, it's not a matter of GC and memory
allocations: the building phase is actually faster in C++ than in Java, yet
the full Java program is faster. So probably HotSpot is optimizing away the
virtual calls during the walk.
Here is the asm of the tree-walk routine for each version (the D versions
compiled through LLVM, then the C++ version with gcc and llvm-gcc). First,
Internal.walkSum of sumtree_d1, where the calls go through the vtable:
pushl %edi
pushl %esi
subl $4, %esp
movl 8(%eax), %ecx
testl %ecx, %ecx
movl %eax, %esi
je .LBB5_5
movl (%ecx), %edx
movl %ecx, %eax
call *24(%edx)
.LBB5_2:
movl %eax, %edi
movl 12(%esi), %eax
testl %eax, %eax
je .LBB5_6
movl (%eax), %ecx
call *24(%ecx)
.LBB5_4:
addl %edi, %eax
addl $4, %esp
popl %esi
popl %edi
ret
.LBB5_5:
xorl %eax, %eax
jmp .LBB5_2
.LBB5_6:
xorl %eax, %eax
jmp .LBB5_4
------------------
Internal.walkSum of sumtree_d2 (tag-dispatched structs; the calls are direct
and the leaf case is inlined):
pushl %ebp
pushl %ebx
pushl %edi
pushl %esi
subl $4, %esp
movl 4(%eax), %esi
testl %esi, %esi
movl %eax, %edi
je .LBB7_12
cmpl $0, (%esi)
jne .LBB7_13
movl 4(%esi), %eax
testl %eax, %eax
je .LBB7_14
call _D2s24Node7walkSumMFZi
.LBB7_4:
movl %eax, %ebp
movl 8(%esi), %eax
testl %eax, %eax
je .LBB7_15
cmpl $0, (%eax)
jne .LBB7_16
call _D2s28Internal7walkSumMFZi
movl %eax, %ebx
.LBB7_7:
addl %ebp, %ebx
.LBB7_8:
movl 8(%edi), %eax
testl %eax, %eax
je .LBB7_17
cmpl $0, (%eax)
jne .LBB7_18
call _D2s28Internal7walkSumMFZi
movl %eax, %ecx
.LBB7_11:
movl %ecx, %eax
addl %ebx, %eax
addl $4, %esp
popl %esi
popl %edi
popl %ebx
popl %ebp
ret
.LBB7_12:
xorl %ebx, %ebx
jmp .LBB7_8
.LBB7_13:
movl 4(%esi), %ebx
leal 1(%ebx), %eax
movl %eax, 4(%esi)
jmp .LBB7_8
.LBB7_14:
xorl %eax, %eax
jmp .LBB7_4
.LBB7_15:
xorl %ebx, %ebx
jmp .LBB7_7
.LBB7_16:
movl 4(%eax), %ebx
leal 1(%ebx), %ecx
movl %ecx, 4(%eax)
jmp .LBB7_7
.LBB7_17:
xorl %ecx, %ecx
jmp .LBB7_11
.LBB7_18:
movl 4(%eax), %ecx
leal 1(%ecx), %edx
movl %edx, 4(%eax)
jmp .LBB7_11
------------------
walkSum of sumtree_d3 (free function over the tagged structs):
pushl %ebx
pushl %edi
pushl %esi
testl %eax, %eax
jne .LBB6_3
xorl %eax, %eax
.LBB6_2:
popl %esi
popl %edi
popl %ebx
ret
.LBB6_3:
movl %eax, %esi
cmpl $0, (%esi)
jne .LBB6_8
movl 4(%esi), %eax
call _D2s37walkSumFPS2s34NodeZi
movl 8(%esi), %esi
testl %esi, %esi
movl %eax, %edi
je .LBB6_9
cmpl $0, (%esi)
jne .LBB6_10
movl 4(%esi), %eax
call _D2s37walkSumFPS2s34NodeZi
movl %eax, %ebx
movl 8(%esi), %eax
call _D2s37walkSumFPS2s34NodeZi
addl %ebx, %eax
.LBB6_7:
addl %edi, %eax
jmp .LBB6_2
.LBB6_8:
movl 4(%esi), %eax
leal 1(%eax), %ecx
movl %ecx, 4(%esi)
jmp .LBB6_2
.LBB6_9:
xorl %eax, %eax
jmp .LBB6_7
.LBB6_10:
movl 4(%esi), %eax
leal 1(%eax), %ecx
movl %ecx, 4(%esi)
jmp .LBB6_7
------------------
walkSum of sumtree_d4 (one-bit tagged pointers; note the tag masking with
andl/testb):
pushl %ebx
pushl %edi
pushl %esi
movl %eax, %esi
andl $4294967294, %esi
testl %esi, %esi
jne .LBB13_3
xorl %eax, %eax
.LBB13_2:
popl %esi
popl %edi
popl %ebx
ret
.LBB13_3:
testb $1, %al
movl (%esi), %eax
je .LBB13_8
movl %eax, %edi
andl $4294967294, %edi
testl %edi, %edi
je .LBB13_9
testb $1, %al
movl (%edi), %eax
je .LBB13_10
call _D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
movl %eax, %ebx
movl 4(%edi), %eax
call _D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
addl %ebx, %eax
.LBB13_7:
movl %eax, %edi
movl 4(%esi), %eax
call _D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
addl %edi, %eax
jmp .LBB13_2
.LBB13_8:
leal 1(%eax), %ecx
movl %ecx, (%esi)
jmp .LBB13_2
.LBB13_9:
xorl %eax, %eax
jmp .LBB13_7
.LBB13_10:
leal 1(%eax), %ecx
movl %ecx, (%edi)
jmp .LBB13_7
------------------
Internal::walkSum of the C++ version, compiled with gcc:
Internal.walkSum:
subl $12, %esp
movl %ebx, 4(%esp)
movl %esi, 8(%esp)
xorl %ebx, %ebx
movl 16(%esp), %esi
movl 4(%esi), %edx
testl %edx, %edx
je .L5
movl (%edx), %eax
movl %edx, (%esp)
call *(%eax)
movl %eax, %ebx
.L5:
movl 8(%esi), %edx
xorl %eax, %eax
testl %edx, %edx
je .L7
movl (%edx), %eax
movl %edx, (%esp)
call *(%eax)
.L7:
addl %ebx, %eax
movl 8(%esp), %esi
movl 4(%esp), %ebx
addl $12, %esp
ret
------------------
Internal::walkSum of the C++ version, compiled with llvm-gcc:
Internal.walkSum:
pushl %edi
pushl %esi
subl $4, %esp
movl 16(%esp), %esi
movl 4(%esi), %eax
testl %eax, %eax
je LBB2_5
movl (%eax), %ecx
movl %eax, (%esp)
call *(%ecx)
LBB2_2:
movl %eax, %edi
movl 8(%esi), %eax
testl %eax, %eax
je LBB2_6
movl (%eax), %ecx
movl %eax, (%esp)
call *(%ecx)
LBB2_4:
addl %edi, %eax
addl $4, %esp
popl %esi
popl %edi
ret
LBB2_5:
xorl %eax, %eax
jmp LBB2_2
LBB2_6:
xorl %eax, %eax
jmp LBB2_4
I have collected the asm generated by HotSpot for Internal.walkSum, but I am
not able to read it; it's too messy for me.
I have created an alternative version of the first D program, sumtree_d1b,
that uses the __vptr to avoid virtual dispatch and to allow LDC to inline
the code better. The performance is better than sumtree_d1, but not yet good
enough to beat the Java version.
// sumtree_d1b.d
const bool USE_TRICKS = false;

version (Tango) {
    import tango.stdc.stdio: printf;
    static if (USE_TRICKS) {
        import tango.stdc.stdlib: exit;
        import tango.core.Memory: GC;
        alias GC.disable disable;
    }
} else {
    import std.c.stdio: printf;
    static if (USE_TRICKS) {
        import std.c.stdlib: exit;
        import std.gc: disable;
    }
}

struct Random {
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}

//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2

abstract class Node {}

final class Leaf : Node {
    int value;
    this(int x) { this.value = x; }
}

final class Internal : Node {
    Node left, right;
    this(Node l, Node r) {
        this.left = l;
        this.right = r;
    }
}

struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1),
                                createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}

int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}

void main() {
    const int NLOOPS = 600;
    static if (USE_TRICKS) { disable(); }

    {
        scope Leaf l = new Leaf(0);
        leaf__vptr = l.__vptr;
    }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", walkTree(root));
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += walkTree(root);
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    static if (USE_TRICKS) { exit(0); }
}
Timings (full program), best of 3, seconds:
Java: 8.83 (-server)
D 1 notricks: 10.44
D 1b notricks: 9.73
D 1 tricks: 10.18
D 1b tricks: 9.45
In this program there are only two possible classes of nodes, so to manually
remove the virtual calls I have stored the pointer to the virtual table of
one of the two classes, Leaf. (Here, unfortunately, I can't use conditional
compilation to let the compiler pick the right declaration according to the
language version, because the D2 version is not syntactically valid D1.)
//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2
The D docs state that user code shall not use the pointer to the virtual
table, probably because it's an implementation detail that can change
according to the way virtual calls are implemented; for example, dispatch
trees don't have such a pointer. (Here I have essentially created a small
dispatch tree manually.) At the beginning of main() the leaf__vptr gets
initialized (I don't know a way to initialize this value that doesn't
require creating a class instance):
    {
        scope Leaf l = new Leaf(0);
        leaf__vptr = l.__vptr;
    }
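As an aside, the vtable should also be reachable through the classinfo,
without constructing a throwaway instance; this is an untested sketch
(ClassInfo.vtbl, a void*[] in object.d, is the class's virtual table):

    // Untested alternative: take the vtable pointer from the classinfo,
    // avoiding the dummy Leaf instance:
    leaf__vptr = cast(immutable(void*)*)Leaf.classinfo.vtbl.ptr;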
Then this free function scans the tree (it can be written iteratively too,
but I think the performance is not that different on modern CPUs; this can
be tested):
int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}
The double cast:

cast(Leaf)cast(void*)root

is necessary because a single class cast in D is done dynamically (with a
runtime check), so the first cast to void* punches a hole in the type system
and the pair becomes a static cast. I think it's similar to the C++
expression:

static_cast<Leaf*>(root)
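For illustration, a minimal sketch of the difference between the two kinds
of cast (hypothetical snippet, reusing the benchmark classes):

    Node n = new Leaf(1);
    Leaf a = cast(Leaf)n;             // dynamic cast: checks the run-time
                                      // type, yields null if n isn't a Leaf
    Leaf b = cast(Leaf)cast(void*)n;  // static cast: no check, just a retyping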
The asm of walkTree() shows no virtual calls and it also shows some inlining:
walkTree of sumtree_d1b.d:
_D8sumtree27walkTreeFC8sumtree24NodeZi:
pushl %ebx
pushl %edi
pushl %esi
movl _D8sumtree210leaf__vptrPPv, %ecx
cmpl %ecx, (%eax)
movl %eax, %esi
je .LBB8_12
movl 8(%esi), %eax
testl %eax, %eax
je .LBB8_13
call _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_3:
movl %eax, %edi
movl 12(%esi), %esi
testl %esi, %esi
je .LBB8_14
movl _D8sumtree210leaf__vptrPPv, %eax
cmpl %eax, (%esi)
je .LBB8_15
movl 8(%esi), %eax
testl %eax, %eax
je .LBB8_16
call _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_7:
movl %eax, %ebx
movl 12(%esi), %eax
testl %eax, %eax
je .LBB8_17
call _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_9:
addl %ebx, %eax
.LBB8_10:
addl %edi, %eax
.LBB8_11:
popl %esi
popl %edi
popl %ebx
ret
.LBB8_12:
movl 8(%esi), %eax
leal 1(%eax), %ecx
movl %ecx, 8(%esi)
jmp .LBB8_11
.LBB8_13:
xorl %eax, %eax
jmp .LBB8_3
.LBB8_14:
xorl %eax, %eax
jmp .LBB8_10
.LBB8_15:
movl 8(%esi), %eax
leal 1(%eax), %ecx
movl %ecx, 8(%esi)
jmp .LBB8_10
.LBB8_16:
xorl %eax, %eax
jmp .LBB8_7
.LBB8_17:
xorl %eax, %eax
jmp .LBB8_9
A possible cause of the slower performance of sumtree_d1b compared to the
Java version is cache effects. The garbage collector of Java HotSpot can lay
objects out in memory in a better way: in the "Eden" part of the memory
managed by the HotSpot GC, cache coherence is very high. The current D GC
has to avoid long-term memory fragmentation, while the Eden doesn't share
this problem: it can be used as a stack.
So I have created a different version of the walkTree() function that prints
the memory positions of the nodes. Then I have used Python to find the
frequencies of the distances between successive node pointers during the
walkTree() of sumtree_d1b:
t = file("adds.txt").read()    # Python 2; adds.txt holds the printed addresses
t2 = map(int, t.split())
t3 = [t2[i+1] - t2[i] for i in xrange(len(t2)-1)]  # deltas between nodes
from util import frequency     # small custom helper that counts occurrences
frequency(t3, mode="freq")
[(-16, 232285), (-32, 115525), (-48, 57657), (-64, 28535), (-80, 14365), (-96,
7266), (-112, 3447), (-128, 1721), (8160, 922), (-144, 917), (8176, 901),
(8144, 688), (8128, 460), (-160, 431), (8112, 305), (-176, 218), (8096, 145),
(-192, 111), (8080, 99), (-208, 61), (8064, 52), (8048, 28), (-224, 25), (8032,
20), (-240, 10), (8016, 10), (-272, 6), (8000, 5), (-288, 4), (7968, 3), (-256,
3), (-336, 1), (5497120, 1), (7952, 1), (16368, 1), (-1032304, 1), (-320, 1),
(7936, 1)]
Those are quite jumpy, and they show that the minimal memory block size is
16 bytes, so a Leaf object uses 16 bytes just like an Internal object (on 32
bits a Leaf needs vptr + monitor + int = 12 bytes, which the GC rounds up to
its 16-byte minimum block).
So I have overloaded the new() operator in the abstract Node class,
allocating a single block of memory from the C heap at the start and then
slicing pieces off it during node allocations, using it as a stack, as in
the Eden of the Java GC. The pointer that new() returns isn't aligned to 16
bytes as the D specs require, so this program is just an experiment (I don't
know why it has to be aligned to 16 bytes, maybe just for SSE matters; I
think I have received no direct answer about this. An alignment to 8 bytes
can be useful for tag bits added to pointers by the garbage collector when
the GC runs). It uses just an exit() instead of raising the correct
OutOfMemoryError. The bufsize is found experimentally and is hard-coded. A
real program can't know this value, so it could for example allocate memory
in chunks, with a dynamic array of pointers to the memory chunks, plus a
current position pointer in the last chunk and a pointer to the end of the
last chunk (a sketch follows).
The resulting sumtree_d1c.d code is faster than the Java version:
// sumtree_d1c.d
version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.stdlib: exit, cmalloc = malloc, cfree = free;
} else {
    import std.c.stdio: printf;
    import std.c.stdlib: exit, cmalloc = malloc, cfree = free;
}

struct Random {
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}

//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2

abstract class Node {
    static void[] buffer;
    static int bufindex;
    static const int bufsize = (466_232 * 16) + (466_234 * 12);

    static this() {
        void* p = cmalloc(bufsize);
        if (!p)
            exit(1); // out of memory
        buffer = p[0 .. bufsize];
    }

    static ~this() {
        if (buffer.length) {
            cfree(buffer.ptr);
            buffer = null;
        }
    }

    new(size_t sz) {
        void* p = &buffer[bufindex];
        bufindex += sz;
        if (bufindex <= buffer.length)
            return p;
        else {
            printf("out of memory");
            exit(1);
        }
        assert(0); // not reached; exit() doesn't return
    }

    delete(void* p) {
        assert(0);
    }
}

final class Leaf : Node {
    int value;
    this(int x) { this.value = x; }
}

final class Internal : Node {
    Node left, right;
    this(Node l, Node r) {
        this.left = l;
        this.right = r;
    }
}

struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1),
                                createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}

int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}

void main() {
    const int NLOOPS = 600;

    {
        scope Leaf l = new Leaf(0);
        leaf__vptr = l.__vptr;
    }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0 (Internal, Leaves: 466232 466234)
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", walkTree(root));
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += walkTree(root);
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    exit(0);
}
Timings (full program), best of 3, seconds:
Java: 8.83 (-server)
D 1 notricks: 10.44
D 1b notricks: 9.73
D 1 tricks: 10.18
D 1b tricks: 9.45
D 1c tricks: 8.25
D 1d tricks: 8.12
The frequencies of the distances between node pointers during the walkTree()
of sumtree_d1d show much more regular positions in memory, which probably
improves cache coherence:
t = file("adds.txt").read()    # Python 2; same analysis script as above
t2 = map(int, t.split())
t3 = [t2[i+1] - t2[i] for i in xrange(len(t2)-1)]  # deltas between nodes
from util import frequency
frequency(t3, mode="freq")
[(12, 233188), (28, 116447), (44, 58345), (60, 28995), (76, 14672), (92, 7411),
(108, 3546), (124, 1773), (140, 945), (156, 451), (172, 228), (188, 115), (204,
61), (220, 28), (236, 11), (268, 6), (284, 4), (252, 4), (316, 1), (332, 1)]
The sumtree_d1d.d version is the same as sumtree_d1c.d, but it swaps the
null tests in walkTree(), because the CPU is faster when the "then" clause
of the if is the most likely one, and in this tree there are no null
pointers:
int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left !is null) ? walkTree(iroot.left) : 0) +
               ((iroot.right !is null) ? walkTree(iroot.right) : 0);
    }
}
The Java code doesn't seem to enjoy this change, probably because the
runtime profiling performs that optimization already.
With a less cheating program, overloading new() in a correct and flexible
way, this code can probably still be faster than the Java code (but I can't
be sure before testing it).
I have implemented in D the well known "Richards" benchmark (used in the
famous V8 JavaScript benchmarks by Google; the D version is about 8 times
faster if well written and well compiled with LDC). I have done a similar
experiment of manually removing the single virtual call present in the
Richards benchmark. That virtual call dispatches among 4 possible classes,
and after the removal the code is a little slower. So that manual
optimization is probably worthwhile only when a virtual call has just one or
two possible classes to choose from. Ask me if you want the Richards
benchmark in D, and various translations, variations, timings, etc.
I have read papers that say that devirtualization, if well implemented, can
also be useful in other situations. For example, even if there are many
possible targets but one of them is much more frequent than all the others,
then at the call site you can put an if to test for the most frequent class
and perform a static call to it, otherwise you perform a dynamic call. This
usually speeds up the code.
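For illustration, a sketch of such a guarded call site, reusing the classes
of sumtree_d1 and assuming profiling has shown the receiver is almost always
a Leaf (hypothetical code):

    int guardedWalkSum(Node n) {
        if (n.classinfo is Leaf.classinfo) {
            // The guard has proved the exact type, so this becomes a
            // static, inlinable call (Leaf is final):
            return (cast(Leaf)cast(void*)n).walkSum();
        }
        // Rare case: fall back to the normal virtual call.
        return n.walkSum();
    }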
There are many ways to improve this situation in D; I think we can start
with something simple:
1) Give dmd a routine (if not already present, but I think it's not) that,
for a call site, returns an array with the linearized inheritance tree. If
it contains only one class (besides Object), then the call can become
static. In D methods are virtual by default, so this simple optimization is
probably able to remove many useless virtual calls, with no need to mark
everything final, keeping classes more flexible.
2) If not already present, the profiler built into dmd could collect
statistics at the virtual call sites. Collecting full statistics, as I have
done with an associative array, is probably not that useful; the data
structure can be much simpler, something like a fixed-size static array
about 4 or 5 items long, where each item is an ID-frequency pair whose count
gets incremented by 1. Such an array can use a move-to-front strategy, and
beside these little arrays there can be a variable that counts the total
calls for that virtual call site. In D a linear scan of such a small array
is simple to implement and is faster than an associative array access. If
there are more than 4-5 different targets, the extra ones simply don't get
recorded; the total call count at the site is useful to find how many calls
are not registered in the array. Later the compiler can use those
frequencies at each call site: if there are only 2 different targets, it can
use a tiny dispatch tree; if one target is much more frequent than the
others, it can use a static call for the frequent one and a dynamic call for
the others; otherwise it can use a dynamic call.
Later more things can be improved, but I think this is enough to fix most of
the situation.
Bye,
bearophile
Feb 03 2010
Don wrote:

bearophile wrote:
> This post is about a few simple experiments I've done in D about
> virtual calls and how to improve their performance. So probably this
> can't be useful before D2 is out of alpha state.
> [snip]

This is an excellent piece of research. As you say, it's not top priority
right now, but it's important that this stuff doesn't get lost. It'd be
great if you could make a 'performance' page on the D Wiki, and link your
findings to it. It will be very valuable later on.
Feb 03 2010
Walter Bright wrote:

bearophile wrote:
> This post is about a few simple experiments I've done in D about
> virtual calls and how to improve their performance. So probably this
> can't be useful before D2 is out of alpha state. For people working at
> Sun this stuff is probably very old and very naive, but I think
> currently no D compilers are optimizing this, and if you are not
> interested please ignore this post.

Here's what's going on. The virtual function calls are the calls to
Internal.walkSum() via an expression typed as Node.walkSum(). The Java JIT
has access to the entire program, so it can do whole program analysis and
determine that the *only* instances of Node that ever exist are actually
instances of Internal, therefore all virtual calls to Node.walkSum() can be
replaced with direct calls to (and inlining of) Internal.walkSum(). This is
called "direct call optimization".

I'm pretty sure that Java compilers did this optimization from the
beginning. It is not advanced technology, nor even terribly clever. Even C++
compilers did it 10 years earlier (when the static type of the instance was
known, i.e. it was a value type).

So why doesn't D do it? Because D's optimization runs when compiling a
module. It does not have access to the whole program, therefore it cannot
know that the only instances of Node are Internal, and cannot make the
optimization. (The C++ style optimization can't be done because D classes
are not value types.)

Making Internal final and Internal.walkSum() final does not help, because
the types of left and right are Node, not Internal. Another module might
import this one, have another class derived from Node, and store an instance
of it in left and right. This would invalidate the direct call optimization.

However, if you change the types of left and right from Node to Internal,
then the compiler knows that walkSum() is final (because Internal being
final means there are no overrides of walkSum()), and hence the direct call
optimization can be (and is) performed. You can easily verify this by
running obj2asm on the output of the compiler. You can also see it in the
e2ir.c source of dmd:

    if (!fd->isVirtual() || directcall || fd->isFinal())
    {   // make static call
        ec = el_var(sfunc);
    }
    else
    {   // make virtual call
        elem *ev;
        unsigned vindex;

        assert(ethis);
        ev = el_same(&ethis);
        ev = el_una(OPind, TYnptr, ev);
        vindex = fd->vtblIndex;

        // Build *(ev + vindex * 4)
        ec = el_bin(OPadd, TYnptr, ev, el_long(TYint, vindex * 4));
        ec = el_una(OPind, TYnptr, ec);
        ec = el_una(OPind, tybasic(sfunc->Stype->Tty), ec);
    }
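For illustration, a sketch of that last change applied to the benchmark
classes (hypothetical; as bearophile points out in the next message, it
doesn't actually fit this benchmark, because left and right must also hold
Leaf nodes):

    final class Internal : Node {
        Internal left, right;   // typed as the final class, not as Node
        this(Internal l, Internal r) { this.left = l; this.right = r; }
        override int walkSum() {
            // Internal is final, so no override of walkSum() can exist
            // and these become direct, inlinable calls:
            return ((this.left is null) ? 0 : this.left.walkSum()) +
                   ((this.right is null) ? 0 : this.right.walkSum());
        }
    }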
Feb 03 2010
bearophile wrote:

Walter Bright:
> The virtual function calls are the calls to Internal.walkSum() via an
> expression typed as Node.walkSum().

That expression typed as Node.walkSum() also calls Leaf.walkSum().

> The Java JIT has access to the entire program, so it can do whole
> program analysis and determine that the *only* instances of Node that
> ever exist are actually instances of Internal,

There are Leaf instances too.

> So why doesn't D do it? Because D's optimization runs when compiling a
> module. It does not have access to the whole program,

With LDC you can use a level of optimization (LTO) that spans all the
modules you are compiling at once. There are times when you want to optimize
a final release of a program as well as possible. And anyway, this whole
benchmark program is made of a single little module.

> However, if you change the types of left and right from Node to
> Internal,

I can't, because there are the Leaf objects too.

Thank you, bye,
bearophile
Feb 03 2010
Walter Bright wrote:

Sorry, I had missed the Leaf node.

bearophile wrote:
> With LDC you can use a level of optimization (LTO) that spans all the
> modules you are compiling at once.

That's not safe because there are always opaque modules that are binary
only. They may derive from the class in the module - and the optimizer
cannot prove they don't. Hardly any programmers are knowledgeable enough
about compiler optimizations to be comfortable using unsafe optimizations
correctly. Most of the time, you just get mad users and a reputation for a
"buggy" optimizer.
Feb 03 2010
bearophile wrote:

Walter Bright:
> That's not safe because there are always opaque modules that are binary
> only. They may derive from the class in the module - and the optimizer
> cannot prove they don't.

This sounds like: Lasciate ogni speranza, voi ch'entrate ("Abandon all hope,
ye who enter").

Let's try again. I think there can be two types of opaque binary-only
modules:
- Ones compiled from C code, or with a C interface. Such modules can't
  contain classes, so there is no problem with virtual functions.
- Ones compiled from D code, or that use the D code interface, that can
  contain and use classes and virtual calls. The language can require such
  modules to contain a data structure that tells the compiler what's safe
  to do.

Bye,
bearophile
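For illustration only, the kind of information such a data structure might
carry (purely hypothetical, nothing like this exists):

    // Purely hypothetical per-module metadata for the optimizer:
    struct ModuleClassFacts {
        string[] classesDefined;      // classes this binary module defines
        string[] classesSubclassed;   // external classes it derives from
    }
    // A call through class C could be devirtualized only if no linked
    // module lists C (or one of C's descendants) in classesSubclassed.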
Feb 03 2010
Walter Bright wrote:

bearophile wrote:
> Walter Bright:
>> That's not safe because there are always opaque modules that are
>> binary only. They may derive from the class in the module - and the
>> optimizer cannot prove they don't.
> This sounds like: Lasciate ogni speranza, voi ch'entrate.
> Let's try again. I think there can be two types of opaque binary-only
> modules:
> - Ones compiled from C code, or with a C interface. Such modules can't
>   contain classes, so there is no problem with virtual functions.
> - Ones compiled from D code, or that use the D code interface, that can
>   contain and use classes and virtual calls. The language can require
>   such modules to contain a data structure that tells the compiler
>   what's safe to do.

A link step is required to figure out what modules wind up in the final
program. So, in order for this to get the information into the front end
where it belongs, you essentially need to compile it, link it, feed the
results of the link back into the compiler, recompile, and relink. Not
impossible, and a "pseudo-link" could be done after the semantic phase of
compilation, but these things would be a lot of work to implement.
Feb 03 2010
bearophile wrote:

Walter Bright:
> Not impossible, and a "pseudo-link" could be done after the semantic
> phase of compilation, but these things would be a lot of work to
> implement.

OK. People can live with a compiler that doesn't perform all possible
optimizations from its first 1.0 version, but I think it's important that
the language is not designed in a way that makes such improvements /
optimizations impossible (or too hard) to implement. (This is also why some
time ago I suggested that the D type system tell apart pinned GC pointers
from normal raw unpinned pointers, so that if D has success in the future
someone is free to implement a mostly-moving GC, even if now no one is able
or interested in doing it. You told me that idea makes the language too
hairy.)

Partially unrelated note: LLVM plans to add a profile guided optimization
level too, to be used with -O6, as GCC already does. LLVM also plans to do
something that GCC is not currently able to do, the level -O7 lifelong
optimization. This is like in Java: the profile guided optimization happens
at normal runtime too, so you need to ship the LLVM compiler along with your
binary. The D2 language can use that technology too in the future. With care
and work, this would allow devirtualizing functions at runtime in D2/D3.
http://llvm.org/releases/1.4/docs/CommandGuide/html/llvmc.html

Bye,
bearophile
Feb 04 2010
Walter Bright wrote:

bearophile wrote:
> Partially unrelated note: LLVM plans to add a profile guided
> optimization level too, to be used with -O6, as GCC too does. LLVM also
> plans to do something that GCC is not currently able to do, the level
> -O7 lifelong optimization. This is like in Java, the profile guided
> optimization happens at normal runtime too, so you need to keep the
> llvm compiler along with your shipped binary. D2 language in future can
> use that technology too. With care and work, this will allow
> de-virtualize functions at runtime in D2/D3.
> http://llvm.org/releases/1.4/docs/CommandGuide/html/llvmc.html

Believe it or not, you can do profile guided optimizations with the dmd
profiler and optlink. The profiler will emit a file to be passed to the
linker which will set the layout of functions to minimize paging.
Feb 04 2010
dsimcha wrote:

== Quote from Walter Bright (newshound1 digitalmars.com)'s article
> Believe it or not, you can do profile guided optimizations with the dmd
> profiler and optlink. The profiler will emit a file to be passed to the
> linker which will set the layout of functions to minimize paging.

Is that still a relevant optimization? It probably made sense circa 1995
when computers had much less memory, but now I would think that binaries are
so much smaller than a typical computer's memory that they almost never get
paged out.
Feb 04 2010
bearophile wrote:

dsimcha:
> Is that still a relevant optimization? It probably made sense circa
> 1995 when computers had much less memory, but now I would think that
> binaries are so much smaller than a typical computer's memory that they
> almost never get paged out.

It's still useful, but just a little; there are far more important
optimizations (LTO, other forms of profiling that allow for partial
unrolling even when you don't know the number of cycles at compile time,
inlining of ex-virtual functions, and so on). Grouping often-used functions
is useful because they can be kept better in L2 cache, and sometimes they
can even be kept in L1 :-)

Bye,
bearophile
Feb 04 2010
BCS wrote:

Hello dsimcha,

> == Quote from Walter Bright (newshound1 digitalmars.com)'s article
>> Believe it or not, you can do profile guided optimizations with the
>> dmd profiler and optlink. The profiler will emit a file to be passed
>> to the linker which will set the layout of functions to minimize
>> paging.
> Is that still a relevant optimization? It probably made sense circa
> 1995 when computers had much less memory, but now I would think that
> binaries are so much smaller than a typical computer's memory that they
> almost never get paged out.

It's still valid (to an extent) for two reasons: careful placement of code
can make things load faster by causing a more ideal page load order (all
code starts paged out), and, if you turn your head and squint, the CPU cache
looks like paged memory.

-- 
<IXOYE><
Feb 04 2010
Walter Bright wrote:

dsimcha wrote:
> == Quote from Walter Bright (newshound1 digitalmars.com)'s article
>> Believe it or not, you can do profile guided optimizations with the
>> dmd profiler and optlink. The profiler will emit a file to be passed
>> to the linker which will set the layout of functions to minimize
>> paging.
> Is that still a relevant optimization? It probably made sense circa
> 1995 when computers had much less memory, but now I would think that
> binaries are so much smaller than a typical computer's memory that they
> almost never get paged out.

I'll add to the other replies, and say that:

1. If your program has rarely used features, those can be grouped together
in such a way that they never load into memory, and so are essentially
costless.
2. Initialization code can be separated from the working loop, such that the
init code pages are loaded, run, and discarded, and never need loading
again.
3. File systems like to read files sequentially. So if your code is
organized so the next code to execute is next on the disk image, it will
load faster.
4. Strongly coupled functions (functions that call each other a lot) should
be next to each other, so they wind up in the same cache lines.
Feb 04 2010
Trass3r wrote:

bearophile wrote:
> Let's try again. I think there can be two types of opaque binary-only
> modules:
> - Ones compiled from C code, or with a C interface. Such modules can't
>   contain classes, so there is no problem with virtual functions.
> - Ones compiled from D code, or that use the D code interface, that can
>   contain and use classes and virtual calls. The language can require
>   such modules to contain a data structure that tells the compiler
>   what's safe to do.

What about C++ code?
Feb 04 2010
bearophile wrote:

Trass3r:
> What about C++ code?

I think D classes are not compatible with C++ ones. And even if that becomes
possible, you can use the safest approach when a C++-compiled opaque module
uses classes and doesn't have that required data structure that all D
modules must have.

Anyway, Walter has not commented on that solution, so maybe it's a wrong
idea. He suggests solving the problem with a pre-link stage, which he says
is possible but requires a lot of work to implement. I think he is not much
interested in this whole topic.

Bye,
bearophile
Feb 04 2010
retard wrote:

Wed, 03 Feb 2010 12:37:56 -0500, bearophile wrote:
> Later more things can be improved, but I think this is enough to fix
> most of the situation.

Interesting. So a novice programmer can produce much faster code in Java
than a semi-pro in D in this case.
Feb 03 2010
Walter Bright wrote:

retard wrote:
> Wed, 03 Feb 2010 12:37:56 -0500, bearophile wrote:
>> Later more things can be improved, but I think this is enough to fix
>> most of the situation.
> Interesting. So a novice programmer can produce much faster code in
> Java than a semi-pro in D in this case.

Ever since Java implementations supported JITs, there have been some
benchmarks that ran faster in Java than with the best C++ optimizing
compilers. This is because Java JITs have access to the whole program, and
native compilers generally do not.
Feb 03 2010