
digitalmars.D - Virtual methods

reply bearophile <bearophileHUGS lycos.com> writes:
This post is long, I hope the web interface doesn't cut it down.

This post is about a few simple experiments I've done in D about virtual calls
and how to improve their performance. So probably this can't be useful before
D2 is out of its alpha state. For people working at Sun this stuff is probably
very old and very naive, but I think no D compiler currently performs these
optimizations; if you are not interested, please ignore this post.

The performance problem caused by virtual calls is not the small amount of time
they require, but mostly the inlining they prevent.

I have created a small synthetic benchmark, not realistic but useful for
measuring this. The program creates a binary tree with 932_465 nodes, then
scans it 600 times, computing the sum of the values stored in the leaves and
incrementing each leaf value on every scan.

The tree contains two different kinds of nodes, Internal and Leaf, so walking
the tree is done through a virtual function. HotSpot shows a significant
performance improvement over the C++ code, probably thanks to de-virtualizing
the virtual calls. GCC's Profile Guided Optimization produces no speedup on
this program.


// sumtree_d1.d
// D code, works in D1 on Tango and Phobos
// not hard to adapt to D2
const bool USE_TRICKS = false;

version (Tango) {
    import tango.stdc.stdio: printf;
    static if (USE_TRICKS) {
        import tango.stdc.stdlib: exit;
        import tango.core.Memory: GC;
        alias GC.disable disable;
    }
} else {
    import std.c.stdio: printf;
    static if (USE_TRICKS) {
        import std.c.stdlib: exit;
        import std.gc: disable;
    }
}


struct Random { // portable generator
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}


abstract class Node {
    abstract int walkSum();
}


final class Leaf : Node {
    int value;

    this(int x) { this.value = x; }

    override int walkSum() { return this.value++; }
}


final class Internal : Node {
    Node left, right;

    this(Node l, Node r) {
        this.left = l;
        this.right = r;
    }

    override int walkSum() {
        // createTree() never generates null pointers, but we test
        // anyway to be more general with other tree generators
        return ((this.left is null) ? 0 : this.left.walkSum()) +
               ((this.right is null) ? 0 : this.right.walkSum());
    }
}


struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1),
                                createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}


void main() {
    const int NLOOPS = 600;
    static if (USE_TRICKS) { disable(); }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", root.walkSum());
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += root.walkSum();
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    static if (USE_TRICKS) { exit(0); }
}


I have translated that code to Java and C++, ask if you want to see those
versions.

I have created three more D versions:

- [...] the struct tag.

- [...] probably improves code inlining.

- internal nodes using pointers tagged with one bit (so the struct memory is
allocated from the C heap). The total memory allocated is smaller (the tags
require no extra memory), and this speeds up the allocation a little. The walk
is about as fast.

In the D code I have also added a global constant boolean USE_TRICKS that
enables/disables two dirty tricks, which essentially save some of the time used
by the D GC to allocate and deallocate the tree. These tricks don't speed up
the tree walks.

The profiling of the C++ program shows that most of the time is spent inside
Internal::walkSum().

Code compiled with:

LLVM 2.6 gcc version 4.2.1 (Based on Apple Inc. build 5649) (LLVM build)
llvm-g++ -Wall -O3 -s -Wl,--enable-stdcall-fixup -fomit-frame-pointer
-march=native -ffast-math

Java build 1.7.0-ea-b76
java -server SumTree

CPU: Celeron 2.33 GHz, on Ubuntu running on VirtualBox running on Windows Vista
32 bit.

The current benchmark with seed=142465964 allocates 466_232 Internal nodes, and
466_234 Leaves.


Timings (just tree building), best of 3, seconds:
  C++:           0.21
  Java:          0.83
  D 1 notricks:  0.49
  D 2 notricks:  0.46
  D 3 notricks:  0.45
  D 4 notricks:  0.23
  D 1 tricks:    0.24
  D 2 tricks:    0.23
  D 3 tricks:    0.23
  D 4 tricks:    0.22
  
Timings (full program), best of 3, seconds:
  C++:          15.64
  Java:          8.74
  D 1 notricks: 10.30
  D 2 notricks: 14.26
  D 3 notricks: 14.06
  D 4 notricks: 13.57
  D 1 tricks:    9.77
  D 2 tricks:   13.58
  D 3 tricks:   13.32
  D 4 tricks:   13.06


As you can see from the timings, it's not a matter of GC and memory
allocations: the building phase is actually faster in C++ than in Java. So
HotSpot is probably optimizing the virtual calls here.



Internal.walkSum

	pushl	%edi
	pushl	%esi
	subl	$4, %esp
	movl	8(%eax), %ecx
	testl	%ecx, %ecx
	movl	%eax, %esi
	je	.LBB5_5
	movl	(%ecx), %edx
	movl	%ecx, %eax
	call	*24(%edx)
.LBB5_2:
	movl	%eax, %edi
	movl	12(%esi), %eax
	testl	%eax, %eax
	je	.LBB5_6
	movl	(%eax), %ecx
	call	*24(%ecx)
.LBB5_4:
	addl	%edi, %eax
	addl	$4, %esp
	popl	%esi
	popl	%edi
	ret
.LBB5_5:
	xorl	%eax, %eax
	jmp	.LBB5_2
.LBB5_6:
	xorl	%eax, %eax
	jmp	.LBB5_4

------------------


Internal.walkSum:

	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	subl	$4, %esp
	movl	4(%eax), %esi
	testl	%esi, %esi
	movl	%eax, %edi
	je	.LBB7_12
	cmpl	$0, (%esi)
	jne	.LBB7_13
	movl	4(%esi), %eax
	testl	%eax, %eax
	je	.LBB7_14
	call	_D2s24Node7walkSumMFZi
.LBB7_4:
	movl	%eax, %ebp
	movl	8(%esi), %eax
	testl	%eax, %eax
	je	.LBB7_15
	cmpl	$0, (%eax)
	jne	.LBB7_16
	call	_D2s28Internal7walkSumMFZi
	movl	%eax, %ebx
.LBB7_7:
	addl	%ebp, %ebx
.LBB7_8:
	movl	8(%edi), %eax
	testl	%eax, %eax
	je	.LBB7_17
	cmpl	$0, (%eax)
	jne	.LBB7_18
	call	_D2s28Internal7walkSumMFZi
	movl	%eax, %ecx
.LBB7_11:
	movl	%ecx, %eax
	addl	%ebx, %eax
	addl	$4, %esp
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	ret
.LBB7_12:
	xorl	%ebx, %ebx
	jmp	.LBB7_8
.LBB7_13:
	movl	4(%esi), %ebx
	leal	1(%ebx), %eax
	movl	%eax, 4(%esi)
	jmp	.LBB7_8
.LBB7_14:
	xorl	%eax, %eax
	jmp	.LBB7_4
.LBB7_15:
	xorl	%ebx, %ebx
	jmp	.LBB7_7
.LBB7_16:
	movl	4(%eax), %ebx
	leal	1(%ebx), %ecx
	movl	%ecx, 4(%eax)
	jmp	.LBB7_7
.LBB7_17:
	xorl	%ecx, %ecx
	jmp	.LBB7_11
.LBB7_18:
	movl	4(%eax), %ecx
	leal	1(%ecx), %edx
	movl	%edx, 4(%eax)
	jmp	.LBB7_11

------------------



walkSum:

	pushl	%ebx
	pushl	%edi
	pushl	%esi
	testl	%eax, %eax
	jne	.LBB6_3
	xorl	%eax, %eax
.LBB6_2:
	popl	%esi
	popl	%edi
	popl	%ebx
	ret
.LBB6_3:
	movl	%eax, %esi
	cmpl	$0, (%esi)
	jne	.LBB6_8
	movl	4(%esi), %eax
	call	_D2s37walkSumFPS2s34NodeZi
	movl	8(%esi), %esi
	testl	%esi, %esi
	movl	%eax, %edi
	je	.LBB6_9
	cmpl	$0, (%esi)
	jne	.LBB6_10
	movl	4(%esi), %eax
	call	_D2s37walkSumFPS2s34NodeZi
	movl	%eax, %ebx
	movl	8(%esi), %eax
	call	_D2s37walkSumFPS2s34NodeZi
	addl	%ebx, %eax
.LBB6_7:
	addl	%edi, %eax
	jmp	.LBB6_2
.LBB6_8:
	movl	4(%esi), %eax
	leal	1(%eax), %ecx
	movl	%ecx, 4(%esi)
	jmp	.LBB6_2
.LBB6_9:
	xorl	%eax, %eax
	jmp	.LBB6_7
.LBB6_10:
	movl	4(%esi), %eax
	leal	1(%eax), %ecx
	movl	%ecx, 4(%esi)
	jmp	.LBB6_7

------------------



walkSum:

	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	%eax, %esi
	andl	$4294967294, %esi
	testl	%esi, %esi
	jne	.LBB13_3
	xorl	%eax, %eax
.LBB13_2:
	popl	%esi
	popl	%edi
	popl	%ebx
	ret
.LBB13_3:
	testb	$1, %al
	movl	(%esi), %eax
	je	.LBB13_8
	movl	%eax, %edi
	andl	$4294967294, %edi
	testl	%edi, %edi
	je	.LBB13_9
	testb	$1, %al
	movl	(%edi), %eax
	je	.LBB13_10
	call	_D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
	movl	%eax, %ebx
	movl	4(%edi), %eax
	call	_D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
	addl	%ebx, %eax
.LBB13_7:
	movl	%eax, %edi
	movl	4(%esi), %eax
	call	_D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
	addl	%edi, %eax
	jmp	.LBB13_2
.LBB13_8:
	leal	1(%eax), %ecx
	movl	%ecx, (%esi)
	jmp	.LBB13_2
.LBB13_9:
	xorl	%eax, %eax
	jmp	.LBB13_7
.LBB13_10:
	leal	1(%eax), %ecx
	movl	%ecx, (%edi)
	jmp	.LBB13_7

------------------

C++, gcc:
Internal.walkSum:

	subl	$12, %esp
	movl	%ebx, 4(%esp)
	movl	%esi, 8(%esp)
	xorl	%ebx, %ebx
	movl	16(%esp), %esi
	movl	4(%esi), %edx
	testl	%edx, %edx
	je	.L5
	movl	(%edx), %eax
	movl	%edx, (%esp)
	call	*(%eax)
	movl	%eax, %ebx
.L5:
	movl	8(%esi), %edx
	xorl	%eax, %eax
	testl	%edx, %edx
	je	.L7
	movl	(%edx), %eax
	movl	%edx, (%esp)
	call	*(%eax)
.L7:
	addl	%ebx, %eax
	movl	8(%esp), %esi
	movl	4(%esp), %ebx
	addl	$12, %esp
	ret

------------------

C++, llvm-gcc:
Internal.walkSum:

	pushl	%edi
	pushl	%esi
	subl	$4, %esp
	movl	16(%esp), %esi
	movl	4(%esi), %eax
	testl	%eax, %eax
	je	LBB2_5
	movl	(%eax), %ecx
	movl	%eax, (%esp)
	call	*(%ecx)
LBB2_2:
	movl	%eax, %edi
	movl	8(%esi), %eax
	testl	%eax, %eax
	je	LBB2_6
	movl	(%eax), %ecx
	movl	%eax, (%esp)
	call	*(%ecx)
LBB2_4:
	addl	%edi, %eax
	addl	$4, %esp
	popl	%esi
	popl	%edi
	ret
LBB2_5:
	xorl	%eax, %eax
	jmp	LBB2_2
LBB2_6:
	xorl	%eax, %eax
	jmp	LBB2_4


I have collected the asm generated by HotSpot for Internal.walkSum, but I am
not able to read it; it's too messy for me.

I have created an alternative version of the first D program, sumtree_d1b,
that uses the __vptr directly to avoid virtual dispatch and to allow LDC to
inline the code better. The performance is better than sumtree_d1, but not yet
good enough to beat the Java version.


// sumtree_d1b.d
const bool USE_TRICKS = false;

version (Tango) {
    import tango.stdc.stdio: printf;
    static if (USE_TRICKS) {
        import tango.stdc.stdlib: exit;
        import tango.core.Memory: GC;
        alias GC.disable disable;
    }
} else {
    import std.c.stdio: printf;
    static if (USE_TRICKS) {
        import std.c.stdlib: exit;
        import std.gc: disable;
    }
}


struct Random {
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}


//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2


abstract class Node {}


final class Leaf : Node {
    int value;

    this(int x) { this.value = x; }
}


final class Internal : Node {
    Node left, right;

    this(Node l, Node r) {
        this.left = l;
        this.right = r;
    }
}


struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1),
                                createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}


int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}


void main() {
    const int NLOOPS = 600;
    static if (USE_TRICKS) { disable(); }

    {
        scope Leaf l = new Leaf(0);
        leaf__vptr = l.__vptr;
    }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", walkTree(root));
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += walkTree(root);
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    static if (USE_TRICKS) { exit(0); }
}


Timings (full program), best of 3, seconds:
  Java:           8.83 (-server)
  D 1 notricks:  10.44
  D 1b notricks:  9.73
  D 1 tricks:    10.18
  D 1b tricks:    9.45


In this program there are only two possible classes of nodes, so to manually
remove the virtual calls I have stored the pointer to the virtual table of one
of the two classes, Leaf. (Unfortunately I can't use conditional compilation
here to select the right declaration according to the language version,
because the D2 version is not syntactically valid D1.)

//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2


The D docs state that user code shall not use the pointer to the virtual table,
probably because it's an implementation detail that can change with the way
virtual calls are implemented; dispatch trees, for example, have no such
pointer. (Here I have essentially created a small dispatch tree manually.)

At the beginning of main() the leaf__vptr gets initialized (I don't know a
way to initialize this value that doesn't require creating a class instance):

{
    scope Leaf l = new Leaf(0);
    leaf__vptr = l.__vptr;
}


Then this free function scans the tree (it could be written iteratively too,
but I think the performance is not that different on modern CPUs; this can be
tested):

int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}

The double cast, like:

cast(Leaf)cast(void*)root

is necessary because a single class cast in D is done dynamically, with a
runtime check; the first cast to void* punches a hole in the type system, so
the pair becomes a static cast. I think it's similar to this C++ code:

static_cast<Leaf*>(root)


The asm of walkTree() shows no virtual calls and it also shows some inlining:

walkTree of sumtree_d1b.d:
_D8sumtree27walkTreeFC8sumtree24NodeZi:
    pushl   %ebx
    pushl   %edi
    pushl   %esi
    movl    _D8sumtree210leaf__vptrPPv, %ecx
    cmpl    %ecx, (%eax)
    movl    %eax, %esi
    je  .LBB8_12
    movl    8(%esi), %eax
    testl   %eax, %eax
    je  .LBB8_13
    call    _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_3:
    movl    %eax, %edi
    movl    12(%esi), %esi
    testl   %esi, %esi
    je  .LBB8_14
    movl    _D8sumtree210leaf__vptrPPv, %eax
    cmpl    %eax, (%esi)
    je  .LBB8_15
    movl    8(%esi), %eax
    testl   %eax, %eax
    je  .LBB8_16
    call    _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_7:
    movl    %eax, %ebx
    movl    12(%esi), %eax
    testl   %eax, %eax
    je  .LBB8_17
    call    _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_9:
    addl    %ebx, %eax
.LBB8_10:
    addl    %edi, %eax
.LBB8_11:
    popl    %esi
    popl    %edi
    popl    %ebx
    ret
.LBB8_12:
    movl    8(%esi), %eax
    leal    1(%eax), %ecx
    movl    %ecx, 8(%esi)
    jmp .LBB8_11
.LBB8_13:
    xorl    %eax, %eax
    jmp .LBB8_3
.LBB8_14:
    xorl    %eax, %eax
    jmp .LBB8_10
.LBB8_15:
    movl    8(%esi), %eax
    leal    1(%eax), %ecx
    movl    %ecx, 8(%esi)
    jmp .LBB8_10
.LBB8_16:
    xorl    %eax, %eax
    jmp .LBB8_7
.LBB8_17:
    xorl    %eax, %eax
    jmp .LBB8_9


A possible cause of the slower performance of sumtree_d1b compared to the Java
version is cache effects. The garbage collector of Java HotSpot can lay
objects out in memory in a better way: in the "Eden" region of the memory
managed by the HotSpot GC, cache coherence is very high. The current D GC has
to avoid long-term memory fragmentation, while the Eden doesn't share this
problem and can be used like a stack.

So I have created a different version of the walkTree() function that prints
the memory positions of the nodes. Then I have used Python to find the
frequencies of the distances between successive node pointers during the
walkTree() in sumtree_d1b:

 t = file("adds.txt").read()
 t2 = map(int, t.split())
 t3 = [t2[i+1] - t2[i] for i in xrange(len(t2)-1)]
 from util import frequency
 frequency(t3, mode="freq")
[(-16, 232285), (-32, 115525), (-48, 57657), (-64, 28535), (-80, 14365),
(-96, 7266), (-112, 3447), (-128, 1721), (8160, 922), (-144, 917), (8176, 901),
(8144, 688), (8128, 460), (-160, 431), (8112, 305), (-176, 218), (8096, 145),
(-192, 111), (8080, 99), (-208, 61), (8064, 52), (8048, 28), (-224, 25),
(8032, 20), (-240, 10), (8016, 10), (-272, 6), (8000, 5), (-288, 4), (7968, 3),
(-256, 3), (-336, 1), (5497120, 1), (7952, 1), (16368, 1), (-1032304, 1),
(-320, 1), (7936, 1)]

Those distances are quite jumpy, and they show that the minimal memory block
size is 16 bytes, so a Leaf object uses 16 bytes just like an Internal object.

So I have overloaded the new() operator in the abstract Node class, allocating
a single block of memory from the C heap at the beginning and then slicing it
into pieces during node allocations, using it as a stack, like the Eden of the
Java GC.

The pointer that new() returns isn't aligned to 16 bytes as the D specs
require, so this program is just for experiments. (I don't know why it has to
be aligned to 16 bytes, maybe just for SSE matters; I think I have received no
direct answer about this. An alignment of 8 bytes can be useful for tag bits
added to pointers by the garbage collector when the GC runs.) It just calls
exit() instead of raising the correct OutOfMemoryError.

The bufsize is found experimentally and hard-coded. A real program can't know
this value, so it could for example allocate memory in chunks, with a dynamic
array of pointers to the memory chunks, plus a current position pointer in the
last chunk and a pointer to the end of the last chunk.
The resulting sumtree_d1c.d code is faster than the Java version:


// sumtree_d1c.d
version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.stdlib: exit, cmalloc = malloc, cfree = free;
} else {
    import std.c.stdio: printf;
    import std.c.stdlib: exit, cmalloc = malloc, cfree = free;
}


struct Random {
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}


//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2


abstract class Node {
    static void[] buffer;
    static int bufindex;
    static const int bufsize = (466_232 * 16) + (466_234 * 12);

    static this() {
        void *p = cmalloc(bufsize);
        if (!p)
            exit(1); // out of memory
        buffer = p[0 .. bufsize];
    }

    static ~this() {
        if (buffer.length) {
            cfree(buffer.ptr);
            buffer = null;
        }
    }

    new(size_t sz) {
        void *p = &buffer[bufindex];
        bufindex += sz;
        if (bufindex <= buffer.length)
            return p;
        else {
            printf("out of memory");
            exit(1);
        }
        assert(0);
    }

    delete(void* p) { assert(0); }
}


final class Leaf : Node {
    int value;

    this(int x) { this.value = x; }
}


final class Internal : Node {
    Node left, right;

    this(Node l, Node r) {
        this.left = l;
        this.right = r;
    }
}


struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1),
                                createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}


int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}


void main() {
    const int NLOOPS = 600;

    {
        scope Leaf l = new Leaf(0);
        leaf__vptr = l.__vptr;
    }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0 (Internal, Leaves: 466232 466234)
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", walkTree(root));
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += walkTree(root);
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    exit(0);
}


Timings (full program), best of 3, seconds:
  Java:           8.83 (-server)
  D 1 notricks:  10.44
  D 1b notricks:  9.73
  D 1 tricks:    10.18
  D 1b tricks:    9.45
  D 1c tricks:    8.25
  D 1d tricks:    8.12


The frequencies of the distances between node pointers during the walkTree()
in sumtree_d1d show a much nicer layout in memory, which probably improves
cache coherence:
 t = file("adds.txt").read()
 t2 = map(int, t.split())
 t3 = [t2[i+1] - t2[i] for i in xrange(len(t2)-1)]
 from util import frequency
 frequency(t3, mode="freq")
[(12, 233188), (28, 116447), (44, 58345), (60, 28995), (76, 14672), (92, 7411),
(108, 3546), (124, 1773), (140, 945), (156, 451), (172, 228), (188, 115),
(204, 61), (220, 28), (236, 11), (268, 6), (284, 4), (252, 4), (316, 1),
(332, 1)]

The sumtree_d1d.d version is the same as sumtree_d1c.d, but it swaps the null
tests in walkTree(), because the CPU is faster when the "then" clause of the
if is the most likely one, and in this tree there are no null pointers:

int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left !is null) ? walkTree(iroot.left) : 0) +
               ((iroot.right !is null) ? walkTree(iroot.right) : 0);
    }
}

The Java code doesn't seem to enjoy this change, probably because the runtime
profiling performs that optimization already.

With a less cheating program, overloading new() in a correct and flexible way,
this code can still be faster than the Java code (but I can't be sure before
testing it).

I have implemented in D the well known "Richards" benchmark (used in the
famous V8 JavaScript benchmarks by Google; the D version is about 8 times
faster if well written and well compiled with ldc). I have done a similar
experiment of manually removing the single virtual call present in the
Richards benchmark. That virtual call dispatches among 4 possible classes,
and after the removal the code is a little slower. So this manual optimization
is probably worthwhile only when a virtual call has just one or two possible
classes to choose from. Ask me if you want the Richards benchmark in D, and
various translations, variations, timings, etc.

I have read papers saying that devirtualization, if well implemented, can also
be useful in other situations.
For example, even if there are many possible targets but one of them is much
more frequent than all the others, then at the call point you can put an if
that tests for the most frequent class and performs a static call to it,
falling back to a dynamic call otherwise. This usually speeds up the code.

There are many ways to improve this situation in D; I think we can start with
something simple:

1) Give dmd a function (I think it's not present yet) that for a call point
returns an array of the linearized inheritance tree. If it contains only one
class (besides Object) then the call can become static. In D methods are
virtual by default, so this simple optimization is probably able to remove
many useless virtual calls, with no need to mark everything final, keeping
classes more flexible.

2) If not already present, the profiler built into DMD can be extended to
collect statistics at the virtual call points. Collecting full statistics, as
I have done with an associative array, is probably not that useful; the data
structure can be much simpler, something like a static fixed-size array about
4 or 5 items long, where each item is an ID-frequency pair whose count gets
incremented by 1. Such an array can use a move-to-front strategy. Beside this
little array there can be a variable that counts the total calls for that
virtual call site. In D a linear scan of such a small array is simple to
implement and faster than an associative array access. If there are more than
4-5 different target classes, the extra ones don't get recorded; the total
call count at the site tells how many calls are not registered in the array.
Later the compiler can use those frequencies at each call point: if there are
only 2 different targets it can use a tiny dispatch tree; if one target is
much more frequent than the others it can use a static call for the frequent
one and a dynamic call for the rest; otherwise it uses a plain dynamic call.
Later more things can be improved, but I think this is enough to fix most of
the situation.

Bye,
bearophile
Feb 03 2010
next sibling parent Don <nospam nospam.com> writes:
bearophile wrote:
 This post is about few simple experiments I've done in D about virtual calls
and how to improve their performance.
 So probably this can't be useful before D2 is out of alpha state.
[snip]

This is an excellent piece of research. As you say, it's not top priority
right now, but it's important that this stuff doesn't get lost.

It'd be great if you could make a 'performance' page on the D Wiki and link
your findings to it. It will be very valuable later on.
Feb 03 2010
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 This post is about few simple experiments I've done in D about
 virtual calls and how to improve their performance. So probably this
 can't be useful before D2 is out of alpha state. For people working
 at Sun this stuff is probably very old and very naive, but I think
 currently no D compilers are optimizing this, and if you are not
 interested please ignore this post.
Here's what's going on. The virtual function calls are the calls to
Internal.walkSum() via an expression typed as Node.walkSum().

The Java JIT has access to the entire program, so it can do whole program
analysis and determine that the *only* instances of Node that ever exist are
actually instances of Internal, therefore all virtual calls to Node.walkSum()
can be replaced with direct calls to (and inlining of) Internal.walkSum().
This is called "direct call optimization". I'm pretty sure that Java compilers
have done this optimization from the beginning. It is not advanced technology,
nor even terribly clever. Even C++ compilers did it 10 years earlier (when the
static type of the instance was known, i.e. when it was a value type).

So why doesn't D do it? Because D's optimization runs when compiling a module.
It does not have access to the whole program, therefore it cannot know that
the only instances of Node are Internal, and cannot make the optimization.
(The C++ style optimization can't be done because D classes are not value
types.)

Making Internal final and Internal.walkSum() final does not help, because the
types of left and right are Node, not Internal. Another module might import
this one, define another class derived from Node, and store instances of it in
left and right. That would invalidate the direct call optimization.

However, if you change the types of left and right from Node to Internal, then
the compiler knows that walkSum() is final (because Internal being final means
there are no overrides of walkSum()), and hence the direct call optimization
can be (and is) performed. You can easily verify this by running obj2asm on
the output of the compiler.
You can also see it in the e2ir.c source of dmd:

    if (!fd->isVirtual() || directcall || fd->isFinal())
    {   // make static call
        ec = el_var(sfunc);
    }
    else
    {   // make virtual call
        elem *ev;
        unsigned vindex;

        assert(ethis);
        ev = el_same(&ethis);
        ev = el_una(OPind, TYnptr, ev);
        vindex = fd->vtblIndex;

        // Build *(ev + vindex * 4)
        ec = el_bin(OPadd,TYnptr,ev,el_long(TYint, vindex * 4));
        ec = el_una(OPind,TYnptr,ec);
        ec = el_una(OPind,tybasic(sfunc->Stype->Tty),ec);
    }
Feb 03 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 The virtual function calls are the calls to 
 Internal.walkSum() via an expression typed as Node.walkSum().
That expression typed as Node.walkSum() also calls Leaf.walkSum().
 The Java 
 JIT has access to the entire program, so it can do whole program 
 analysis and determine that the *only* instances of Node that ever exist 
 are actually instances Internal,
There are Leaf instances too.
 So why doesn't D do it? Because D's optimization runs when compiling a 
 module. It does not have access to the whole program,
With LDC you can use a level of optimization (LTO) that spans all the modules you are compiling at once. There are times when you want to optimize a final release of a program as well as possible. And anyway, this whole benchmark program is made of a single little module.
 However, if you change the types of left and right from Node to 
 Internal,
I can't, because there are the Leaf objects too.

Thank you, bye,
bearophile
Feb 03 2010
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Sorry, I had missed the Leaf node.

bearophile wrote:
 With LDC you can use a level of optimization (LTO) that spans all the
 modules you are compiling at once.
That's not safe because there are always opaque modules that are binary only.
They may derive from the class in the module, and the optimizer cannot prove
they don't.

Hardly any programmers are knowledgeable enough about compiler optimizations
to be comfortable using unsafe optimizations correctly. Most of the time, you
just get mad users and a reputation for a "buggy" optimizer.
Feb 03 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 That's not safe because there are always opaque modules that are binary 
 only. They may derive from the class in the module - and the optimizer 
 cannot prove they don't.
This sounds like: "Lasciate ogni speranza, voi ch'entrate" ("Abandon all hope,
ye who enter"). Let's try again.

I think there can be two types of opaque binary-only modules:

- Ones compiled from C code, or with a C interface. Such modules can't contain
classes, so there is no problem with virtual functions.

- Ones compiled from D code, or that use the D code interface, that can
contain and use classes and virtual calls. The language can require such
modules to contain a data structure that tells the compiler what's safe to do.

Bye,
bearophile
Feb 03 2010
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 That's not safe because there are always opaque modules that are
 binary only. They may derive from the class in the module - and the
 optimizer cannot prove they don't.
 This sounds like: Lasciate ogni speranza, voi ch'entrate. Let's try
 again. I think there can be two types of opaque binary only modules:
 - Ones compiled from C code, or with a C interface. Such modules can't
 contain classes, so there is no problem with virtual functions.
 - Ones compiled from D code, or that use the D code interface, that can
 contain and use classes and virtual calls. The language can require such
 modules to contain a data structure that tells the compiler what's safe
 to do.
A link step is required to figure out which modules wind up in the final
program. So, in order to get this information into the front end where it
belongs, you essentially need to compile the program, link it, feed the
results of the link back into the compiler, recompile, and relink.

Not impossible, and a "pseudo-link" could be done after the semantic phase of
compilation, but these things would be a lot of work to implement.
Feb 03 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 Not impossible, and a "pseudo-link" could be done after the semantic 
 phase of compilation, but these things would be a lot of work to implement.
OK. People can live with a compiler that doesn't perform every possible optimization in its first 1.0 version, but I think it's important that the language is not designed in a way that makes such improvements / optimizations impossible (or too hard) to implement. (This is also why, some time ago, I suggested the D type system tell apart pinned GC pointers from normal raw unpinned pointers: if D succeeds in the future, someone would then be free to implement a mostly-moving GC, even if right now nobody is able or interested in doing it. You told me that idea makes the language too hairy.)

Partially unrelated note: LLVM plans to add a profile-guided optimization level too, to be used with -O6, as GCC already does. LLVM also plans something GCC currently can't do: the -O7 "lifelong" optimization level. As in Java, the profile-guided optimization then happens at normal runtime too, so you need to keep the LLVM compiler along with your shipped binary. The D2 language can use that technology too in the future; with care and work, this will allow de-virtualizing functions at runtime in D2/D3.
http://llvm.org/releases/1.4/docs/CommandGuide/html/llvmc.html

Bye,
bearophile
Feb 04 2010
parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Partially unrelated note: LLVM plans to add a profile guided
 optimization level too, to be used with -O6, as GCC too does. LLVM
 also plans to do something that GCC is not currently able to do, the
 level -O7 lifelong optimization. This is like in Java, the profile
 guided optimization happens at normal runtime too, so you need to
 keep the llvm compiler along with your shipped binary. D2 language in
 future can use that technology too. With care and work, this will
 allow de-virtualizing functions at runtime in D2/D3. 
 http://llvm.org/releases/1.4/docs/CommandGuide/html/llvmc.html
Believe it or not, you can do profile-guided optimizations with the dmd profiler and optlink. The profiler will emit a file to be passed to the linker, which will set the layout of functions to minimize paging.
Feb 04 2010
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
Believe it or not, you can do profile guided optimizations with the dmd profiler and optlink. The profiler will emit a file to be passed to the linker which will set the layout of functions to minimize paging.
Is that still a relevant optimization? It probably made sense circa 1995 when computers had much less memory, but now I would think that binaries are so much smaller than a typical computer's memory that they almost never get paged out.
Feb 04 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha:
 Is that still a relevant optimization?  It probably made sense circa 1995 when
 computers had much less memory, but now I would think that binaries are so much
 smaller than a typical computer's memory that they almost never get paged out.
It's still useful, but only a little; there are far more important optimizations (LTO, other forms of profiling that allow partial unrolling even when the number of iterations isn't known at compile time, inlining of ex-virtual functions, and so on). Grouping often-used functions is useful because they can be kept in the L2 cache better, and sometimes they can even be kept in L1 :-)

Bye,
bearophile
Feb 04 2010
prev sibling next sibling parent BCS <none anon.com> writes:
Hello dsimcha,

 == Quote from Walter Bright (newshound1 digitalmars.com)'s article
 
 Believe it or not, you can do profile guided optimizations with the
 dmd profiler and optlink. The profiler will emit a file to be passed
 to the linker which will set the layout of functions to minimize
 paging.
 
Is that still a relevant optimization? It probably made sense circa 1995 when computers had much less memory, but now I would think that binaries are so much smaller than a typical computer's memory that they almost never get paged out.
It's still valid (to an extent) for two reasons: careful placement of code can make things load faster by producing a more ideal page-load order (all code starts paged out), and, if you turn your head and squint, the CPU cache looks like paged memory. -- <IXOYE><
Feb 04 2010
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
dsimcha wrote:
 == Quote from Walter Bright (newshound1 digitalmars.com)'s article
 Believe it or not, you can do profile guided optimizations with the dmd
 profiler and optlink. The profiler will emit a file to be passed to the
 linker which will set the layout of functions to minimize paging.
Is that still a relevant optimization? It probably made sense circa 1995 when computers had much less memory, but now I would think that binaries are so much smaller than a typical computer's memory that they almost never get paged out.
I'll add to the other replies, and say that:
1. If your program has rarely used features, those can be grouped together in such a way that they never load into memory, and so are essentially costless.
2. Initialization code can be separated from the working loop, such that the init code's pages are loaded, run, and discarded, and never need loading again.
3. File systems like to read files sequentially. So if your code is organized so that the next code to execute is next on the disk image, it will load faster.
4. Strongly coupled functions (functions that call each other a lot) should be next to each other, so they wind up in the same cache.
Feb 04 2010
prev sibling parent reply Trass3r <un known.com> writes:
 Let's try again. I think there can be two types of opaque binary only  
 modules:
 - Ones compiled from C code, or with a C interface. Such modules can't  
 contain classes, so there is no problem with virtual functions.
 - Ones compiled from D code, or that use the D code interface, that can  
 contain and use classes and virtual calls. The language can require such  
 modules to contain a data structure that tells the compiler what's safe  
 to do.
What about C++ code?
Feb 04 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Trass3r:
 What about C++ code?
I think D classes are not compatible with C++ ones. And even if that becomes possible, the compiler can fall back to the safest approach whenever a C++-compiled opaque module uses classes and lacks the required data structure that all D modules would have to carry.

Anyway, Walter has not commented on that solution, so maybe it's a wrong idea. He suggests solving the problem with a pre-link stage, which he says is possible but requires a lot of work to implement. I don't think he is very interested in this whole topic.

Bye,
bearophile
Feb 04 2010
prev sibling parent reply retard <re tard.com.invalid> writes:
Wed, 03 Feb 2010 12:37:56 -0500, bearophile wrote:

 Later more things can be improved, but I think this is enough to fix
 most of the situation.
Interesting. So in this case a novice programmer can produce much faster code in Java than a semi-pro can in D.
Feb 03 2010
parent Walter Bright <newshound1 digitalmars.com> writes:
retard wrote:
 Wed, 03 Feb 2010 12:37:56 -0500, bearophile wrote:
 
 Later more things can be improved, but I think this is enough to fix
 most of the situation.
Interesting. So a novice programmer can produce much faster code in Java than a semi-pro in D in this case.
Ever since Java implementations have supported JITs, there have been some benchmarks that ran faster in Java than under the best C++ optimizing compilers. This is because Java JITs have access to the whole program, and native compilers generally do not.
Feb 03 2010