
digitalmars.D - Feature Request: nontrivial functions and vtable optimizations

downs <default_357-line yahoo.de> writes:
The overhead of vtable calls is well known.

For this reason, it may be worthwhile to include an optimization that
checks whether a class has only a few subclasses, say <= three, in the
visible program, and if so, replaces vtable calls with a tree of classinfo
pointer comparisons and direct function calls.

When I did this manually in some hobbyist path-tracing code, it seemed to
give significant speed-ups, though I didn't collect any hard numbers.

So, for instance, the equivalent D code would look like this:

class A { ... } class B : A { ... } class C : A { ... }

A a = genSomeObj();
// a.test();
if (a.classinfo == typeid(B)) (cast(B)cast(void*) a).test(); // call as if final
else if (a.classinfo == typeid(C)) (cast(C)cast(void*) a).test(); // ditto
else a.test(); // fallback

Ideas?
Aug 12 2008
davidl <davidl 126.com> writes:
On Tue, 12 Aug 2008 16:45:08 +0800, downs <default_357-line yahoo.de> wrote:

 [original post quoted in full; snipped]
It's better to show your benchmark examples, so people can observe and discuss them.

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Aug 12 2008
"Craig Black" <cblack ara.com> writes:
"downs" <default_357-line yahoo.de> wrote in message 
news:g7ribh$1vii$1 digitalmars.com...
 [original post quoted in full; snipped]
You are right. Function pointer invocation tends to slow down processing because it doesn't cooperate well with pipelining and branch prediction. I don't understand why this is so hard for some programmers to accept.

See Microsoft Visual C++'s profile-guided optimizations (PGO), specifically "Virtual Call Speculation": using PGO, the most commonly called virtual functions are optimized so that they do not require a function pointer invocation.

Rather than relying on PGO, I would prefer the ability to assign priorities to virtual functions in the code. This would indicate to the compiler which virtual functions will be called the most.

-Craig
Aug 12 2008
davidl <davidl 126.com> writes:
On Tue, 12 Aug 2008 22:46:21 +0800, Craig Black <cblack ara.com> wrote:

 "downs" <default_357-line yahoo.de> wrote in message
 news:g7ribh$1vii$1 digitalmars.com...
 The overhead of vtable calls is well known.

 For this reason, it may be worthwhile to include an optimization that
 essentially checks if a certain class is only inherited from a few  
 times,
 like, <= three, in the visible program, and if so, replace vtable calls
 with a tree of classinfo pointer comparisons and direct function calls.

 When I did this manually in some hobbyist path tracing code, I gained
 significant speed-ups, even though I didn't collect any hard numbers.

 So, for instance, the equivalent D code would look like so:

 class A { ... } class B : A { ... } class C : A { ... }

 A a = genSomeObj();
 // a.test();
 if (a.classinfo == typeid(B)) (cast(B)cast(void*) a).test(); // call as  
 if
 final
 else if (a.classinfo == typeid(C)) (cast(C)cast(void*) a).test(); //  
 dito
 else a.test(); // fallback

 Ideas?
You are right. Function pointer invokation tends to slow down processing because it doesn't cooperate well with pipelining and branch prediction. I don't understand why this is so hard for some programmers to accept. See Microsoft Visual C++ profile guided optimizations (PGO) "Virtual Call Speculation". Using PGO, the most commonly called virtual functions are optimized so that they do not require a function pointer invokation. Rather than relying on PGO, I would prefer the ability to assign a priorities to virtual functions in the code. This would indicate to the compiler what virtual functions will be called the most. -Craig
But it's such an anti-cleanness hack.

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Aug 12 2008
parent "Craig Black" <cblack ara.com> writes:
 Rather than relying on PGO, I would prefer the ability to assign
 priorities to virtual functions in the code.  This would indicate to the
 compiler which virtual functions will be called the most.

 -Craig
But it's so anti-cleanness hack.
Are you referring to my suggestion that we should be able to specify virtual function priorities? This is much cleaner than writing your own virtual function dispatch routine whenever you want a virtual function to be optimized. Further, specifying a priority is completely optional; you won't be forced to specify one if you don't want to.

-Craig
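For reference, here is roughly what such a hand-written dispatch routine looks like in D. This is a minimal sketch: the `speculate` helper name and the hard-coded test() method are illustrative assumptions, not anything from this thread.

import std.stdio;

class A { void test() { writeln("A.test"); } }
class B : A { final override void test() { writeln("B.test"); } }

// Try the speculated type first with a direct call; otherwise fall
// back to ordinary vtable dispatch. A compiler-level priority hint
// could generate exactly this guard, plus inline the fast-path call.
void speculate(T, Base)(Base obj)
{
    if (typeid(obj) is typeid(T))
        (cast(T) cast(void*) obj).test(); // T.test is final: direct, inlinable
    else
        obj.test();                       // ordinary vtable dispatch
}

void main()
{
    A a = new B;
    speculate!(B, A)(a); // prints "B.test" via the fast path
}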
Aug 12 2008
Don <nospam nospam.com.au> writes:
Craig Black wrote:
 "downs" <default_357-line yahoo.de> wrote in message 
 news:g7ribh$1vii$1 digitalmars.com...
 The overhead of vtable calls is well known.

 For this reason, it may be worthwhile to include an optimization that 
 essentially checks if a certain class is only inherited from a few times, 
 like, <= three, in the visible program, and if so, replace vtable calls 
 with a tree of classinfo pointer comparisons and direct function calls.

 When I did this manually in some hobbyist path tracing code, I gained 
 significant speed-ups, even though I didn't collect any hard numbers.

 So, for instance, the equivalent D code would look like so:

 class A { ... } class B : A { ... } class C : A { ... }

 A a = genSomeObj();
 // a.test();
 if (a.classinfo == typeid(B)) (cast(B)cast(void*) a).test(); // call as if 
 final
 else if (a.classinfo == typeid(C)) (cast(C)cast(void*) a).test(); // dito
 else a.test(); // fallback

 Ideas?
You are right. Function pointer invocation tends to slow down processing because it doesn't cooperate well with pipelining and branch prediction.
I don't think this is true any more. It was true in processors older than the Pentium M and AMD K10. On recent CPUs, indirect jumps and calls are predicted as well as conditional branches are. If you're seeing this kind of speed difference, it's something else (probably more code is being executed).
Aug 13 2008
parent "Craig Black" <craigblack2 cox.net> writes:
"Don" <nospam nospam.com.au> wrote in message 
news:g7u5mr$261s$1 digitalmars.com...
 [earlier quotes snipped]
You are right. Function pointer invocation tends to slow down processing because it doesn't cooperate well with pipelining and branch prediction.
I don't think this is true any more. It was true in processors older than the Pentium M and AMD K10. On recent CPUs, indirect jumps and calls are predicted as well as conditional branches are. If you're seeing this kind of speed difference, it's something else (probably more code is being executed).
To be fair, modern processors are quite sophisticated and do better than older ones with function pointers and vtables. I don't mean to overemphasize the penalty of function pointer invocations; they are usually relatively fast on modern processors. Even so, they can still be the source of bottlenecks. You should never think of a function pointer invocation as free, and never assume that your compiler and processor are doing magical optimizations. Modern processors can still have problems with branch prediction. I speak from experience, and there are many others in the field who agree.

Obviously, you should always profile first, since your function pointer invocations may not be a bottleneck at all. It depends on how often and how the function pointers are getting called. From experience, I have had numerous performance problems with function pointer invocations. For example, I found a very severe bottleneck in my rendering loop that was caused by a function pointer invocation; it was destroying my performance. In general, when I refactor my code to make fewer function pointer calls, I usually get significant, measurable speedups.

-Craig
Aug 14 2008
"Jb" <jb nowhere.com> writes:
"Craig Black" <cblack ara.com> wrote in message 
news:g7s7o1$dpt$1 digitalmars.com...
 You are right.  Function pointer invocation tends to slow down processing 
 because it doesn't cooperate well with pipelining and branch prediction. 
 I don't understand why this is so hard for some programmers to accept.
That's not true these days. I know because I've profiled it on multiple CPUs: correctly predicted indirect jumps (through a pointer) cost no more than a normal call/ret pair. They get a slot in the BTB just like conditional jumps, and IIRC they are predicted to go to the same place they did last time (on x86, anyway).

Virtual methods are in such high use that Intel and AMD have put a lot of effort into optimizing them. Replacing the function pointer with a conditional branch won't make it faster.
 See Microsoft Visual C++ profile guided optimizations (PGO) "Virtual Call 
 Speculation". Using PGO, the most commonly called virtual functions are 
 optimized so that they do not require a function pointer invocation.
The speedup you get from this is the inlining of the function, *not* the getting rid of the function pointer lookup. It still requires a lookup in the class info and a conditional branch to check that the "inlined function" is the correct match for the speculated object. That will still cost a whole pipeline of cycles if wrongly predicted.
Aug 13 2008
parent reply "Craig Black" <craigblack2 cox.net> writes:
"Jb" <jb nowhere.com> wrote in message news:g7ubhk$2hq5$1 digitalmars.com...
 "Craig Black" <cblack ara.com> wrote in message 
 news:g7s7o1$dpt$1 digitalmars.com...
 (on x86 anyway).

 Virtual methods are in such high use that Intel and AMD have put a lot of 
 effort into optimizing them.

 Replacing the function pointer with a conditional branch won't make it 
 faster.
It may or may not improve performance. It depends on the particular case.
 See Microsoft Visual C++ profile guided optimizations (PGO) "Virtual Call 
 Speculation". Using PGO, the most commonly called virtual functions are 
 optimized so that they do not require a function pointer invocation.
The speedup you get from this is the inlining of the function, *not* the getting rid of the function pointer lookup. It still requires a lookup in the class info and a conditional branch to check that the "inlined function" is the correct match for the speculated object. That will still cost a whole pipeline of cycles if wrongly predicted.
You may be right in some cases. Profile to be sure. But experience has taught me to be wary of function pointer invocations, even on modern CPUs. If you want to believe that there is never a penalty for function pointer invocation, go right ahead. After all, it's only two instructions, right? Must be real fast. ;)

-Craig
Aug 14 2008
parent reply "Jb" <jb nowhere.com> writes:
"Craig Black" <craigblack2 cox.net> wrote in message 
news:g82rj4$1hqn$1 digitalmars.com...
 The speedup you get from this is the inlining of the function, *not* the 
 getting rid of the function pointer lookup. It still requires a lookup in 
 the class info and a conditional branch to check that the "inlined 
 function" is the correct match for the speculated object.

 That will still cost a whole pipeline of cycles if wrongly predicted.
You may be right in some cases. Profile to be sure. But experience has taught me to be wary of function pointer invocations, even on modern CPUs. If you want to believe that there is never a penalty for function pointer invocation, go right ahead. After all, it's only two instructions, right? Must be real fast. ;)
I'm not saying there's no penalty, I'm saying it's a penalty you can't escape if you want virtual functions. Function pointer invocations == conditional branches as far as modern branch predictors are concerned. You don't gain anything by replacing the former with the latter.
Aug 15 2008
parent "Craig Black" <craigblack2 cox.net> writes:
"Jb" <jb nowhere.com> wrote in message news:g843q6$104a$1 digitalmars.com...
 "Craig Black" <craigblack2 cox.net> wrote in message 
 news:g82rj4$1hqn$1 digitalmars.com...
 The speedup you get from this is the inlining of the function, *not* the 
 getting rid of the function pointer lookup. It still requires a lookup 
 in the class info and a conditional branch to check that the "inlined 
 function" is the correct match for the speculated object.

 That will still cost a whole pipline of cycles if wrongly predicted.
You may be right in some cases. Profile to be sure. But experience has taught me to be wary of function pointer invokations, even on modern CPU's. If you want to believe that there is never a penalty for function pointer invokation go right ahead. After all, it's only two instructions right? Must be real fast. ;)
I'm not saying there's no penalty, I'm saying it's a penalty you can't escape if you want virtual functions. Function pointer invocations == conditional branches as far as modern branch predictors are concerned. You don't gain anything by replacing the former with the latter.
I thought you were wrong about this, so I did a micro-benchmark, and it indicates that you are correct. With only one branch, the performance of an if statement was exactly the same as a function pointer invocation. Function pointer invocations ended up being roughly 50% faster than a sequence of if-elses. I guess I stand corrected.

-Craig
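The micro-benchmark itself isn't included in the post. A reconstruction along these lines shows the shape of the comparison; the function names, the LCG selector, and the use of std.datetime.stopwatch are my assumptions, not Craig's code.

import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;

void f0() {}
void f1() {}
void f2() {}
void f3() {}

void main()
{
    // fixed LCG selector, so both loops see the identical call sequence
    uint seed = 123456789;
    uint next() { seed = seed * 1664525 + 1013904223; return seed >> 30; }

    void function()[4] table = [&f0, &f1, &f2, &f3];

    auto results = benchmark!(
        {   // dispatch through a function pointer
            foreach (i; 0 .. 1_000_000) table[next()]();
        },
        {   // the same dispatch as an if-else chain
            foreach (i; 0 .. 1_000_000)
            {
                immutable k = next();
                if (k == 0) f0();
                else if (k == 1) f1();
                else if (k == 2) f2();
                else f3();
            }
        })(10);

    writeln("function pointer: ", results[0]);
    writeln("if-else chain:    ", results[1]);
    // with -O the empty bodies may be folded away; give the functions
    // a visible side effect for a serious measurement
}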
Aug 15 2008
Christopher Wright <dhasenan gmail.com> writes:
Craig Black wrote:
 Rather than relying on PGO, I would prefer the ability to assign 
 priorities to virtual functions in the code.  This would indicate to the 
 compiler which virtual functions will be called the most.
 
 -Craig 
Or you could wait a bit[1] and have LLVM do that for you, assuming you're willing to run LLVM bytecode. And considering it's LLVM, you could probably hack it to periodically dump the optimized code. So you'd optimize your application by fooling around with it for a couple hours.

[1] "A bit" may be as long as a decade. I dunno.
Aug 13 2008
parent "Craig Black" <craigblack2 cox.net> writes:
"Christopher Wright" <dhasenan gmail.com> wrote in message 
news:g7uhpk$308b$2 digitalmars.com...
 [quoted context snipped]
Or you could wait a bit[1] and have LLVM do that for you, assuming you're willing to run LLVM bytecode. And considering it's LLVM, you could probably hack it to periodically dump the optimized code. So you'd optimize your application by fooling around with it for a couple hours.

[1] "A bit" may be as long as a decade. I dunno.
The LLVM-based D compiler seems promising. Hopefully it will mature soon.

-Craig
Aug 14 2008
downs <default_357-line yahoo.de> writes:
To clear this up, I've been running a benchmark.

module test91;

import tools.time, std.stdio, tools.base, tools.mersenne;

class A { void test() { } }
class B : A { final override void test() { } }
class C : A { final override void test() { } }

A a, b, c;
static this() { a = new A; b = new B; c = new C; }

A gen() {
  if (randf() < 1f/3f) return a;
  else if (randf() < 0.5) return b;
  else return c;
}

void main() {
  const count = 1024*1024;
  for (int z = 0; z < 4; ++z) {
    writefln("Naive: ", time({
      for (int i = 0; i < count; ++i) gen().test();
    }()));
    writefln("Speculative for B: ", time({
      for (int i = 0; i < count; ++i) {
        auto t = gen();
        if (t.classinfo is typeid(B)) (cast(B)cast(void*)t).test();
        else t.test();
      }
    }()));
    writefln("Speculative for B/C: ", time({
      for (int i = 0; i < count; ++i) {
        auto t = gen();
        if (t.classinfo is typeid(B)) (cast(B)cast(void*)t).test();
        else if (t.classinfo is typeid(C)) (cast(C)cast(void*)t).test();
        else t.test();
      }
    }()));
  }
}


And as it turns out, virtual method calls were at least fast enough to not make any sort of difference in my calls.

Here's the output of my little proggy in the last iteration:

Naive: 560958
Speculative for B: 574602
Speculative for B/C: 572429

If anything, naive is often a little faster.

This completely confuses my established knowledge on the matter. Looks like recent CPUs' branch predictors really are as good as people claim.

Sorry for the confusion.
Aug 14 2008
superdan <super dan.org> writes:
downs Wrote:

 [benchmark post quoted in full; snipped]
you are looking with binoculars at a coin a mile away and tryin' to figure quarter or nickel. never gonna work. most likely your benchmark is buried in randf timing. make iteration cost next to nothing. put objects in a medium-size vector. then iterate many times over it.
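A minimal D sketch of the methodology suggested here, with the random generation hoisted out of the timed loop (the array size, pass count, and seed are arbitrary assumptions):

import std.stdio;
import std.random : Random, uniform;
import std.datetime.stopwatch : StopWatch, AutoStart;

class A { void test() { } }
class B : A { final override void test() { } }
class C : A { final override void test() { } }

void main()
{
    // pre-generate the receivers once so the timed loop is (almost)
    // pure dispatch
    auto rng = Random(42);
    auto objs = new A[](4096);
    foreach (ref o; objs)
    {
        switch (uniform(0, 3, rng))
        {
            case 0:  o = new A; break;
            case 1:  o = new B; break;
            default: o = new C; break;
        }
    }

    auto sw = StopWatch(AutoStart.yes);
    foreach (pass; 0 .. 1024)
        foreach (o; objs)
            o.test();   // virtual dispatch dominates the loop body
    writeln("virtual dispatch over array: ", sw.peek);
    // as with the original benchmark, empty test() bodies may be
    // optimized away at high optimization levels
}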
Aug 14 2008
downs <default_357-line yahoo.de> writes:
superdan wrote:
 [quoted benchmark and discussion snipped]
you are looking with binoculars at a coin a mile away and tryin' to figure quarter or nickel. never gonna work. most likely your benchmark is buried in randf timing. make iteration cost next to nothing. put objects in a medium-size vector. then iterate many times over it.
I know. But if it doesn't matter in that case, it most likely won't matter in practical situations.

Nonetheless, here are some better timings, including a faster (dirtier) randf() function, and a null pass that only generates the object.

Null: 729908
Naive: 1615314
Speculative for B: 1692860
Speculative for B/C: 1664040
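Subtracting the null pass as a rough baseline sharpens the comparison: Naive spends about 1615314 - 729908 = 885406 on dispatch, Speculative for B about 1692860 - 729908 = 962952, and Speculative for B/C about 1664040 - 729908 = 934132, i.e. the speculative variants cost roughly 5-9% more in the dispatch portion itself (a rough reading, since the null pass also includes the loop and randf overhead).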
Aug 14 2008
superdan <super dan.org> writes:
downs Wrote:

 [earlier exchange snipped]
I know. But if it doesn't matter in that case, it most likely won't matter in practical situations. Nonetheless, here are some better timings, including a faster (dirtier) randf() function, and a null pass that only generates the object.

Null: 729908
Naive: 1615314
Speculative for B: 1692860
Speculative for B/C: 1664040
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
Aug 14 2008
downs <default_357-line yahoo.de> writes:
superdan wrote:
 [earlier exchange snipped]
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
As you wish. Here are the timings for 10% A, 81% B, and 9% C.

Null: 629760
Naive: 1383764
Speculative for B: 1397784
Speculative for B/C: 1418163
Aug 14 2008
prev sibling parent reply "Jb" <jb nowhere.com> writes:
"superdan" <super dan.org> wrote in message
news:g81e1l$18js$1 digitalmars.com...
 downs Wrote:
 Null: 729908
 Naive: 1615314
 Speculative for B: 1692860
 Speculative for B/C: 1664040
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
To speculate locally you still have to look up the class info, compare it to the speculated object type, and then conditionally branch. You can't avoid a conditional branch, and as it will share the same pattern as the indirect jump, and (in most cases) the same prediction mechanism, neither will be better predicted than the other.

So the point is that you can't make the common path any faster unless you actually *inline* the speculated method. In fact if you don't inline, you are most probably making the situation worse, because you are replacing this (ignoring parameter setup, which will be identical)...

MOV EAX,[objptr.vtable+offset]
CALL EAX

With this...

MOV EAX,[objptr.classinfo]
CMP EAX,expectedclasstype
JNE wrongtype
CALL expectedtypemethod
JMP done
wrongtype:
// normal vtable code here
done:

And there's no way that will be faster imo. Unless perhaps you have a 486 or P1. ;-)
Aug 14 2008
superdan <super dan.org> writes:
Jb Wrote:

 
 "superdan" <super dan.org> wrote in message
 news:g81e1l$18js$1 digitalmars.com...
 downs Wrote:
 Null: 729908
 Naive: 1615314
 Speculative for B: 1692860
 Speculative for B/C: 1664040
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
To speculate locally you still have to look up the class info, compare it to the speculated object type, and then conditionally branch. You can't avoid a conditional branch, and as it will share the same pattern as the indirect jump, and (in most cases) the same prediction mechanism, neither will be better predicted than the other.
no that's wrong. conditional branch is better speculated than indirect jump. they share no other hardware than the instruction fetching pipe. the mechanism for conditional branch speculation is that stuff is executed in parallel on both branches inside the pipeline. the branch that makes it is committed. the other is retired without getting to write to the register file or memory. indirect jump is a whole different business altogether.
 So the point is that you can't make the common path any faster unless you
 actually *inline* the speculated method. In fact if you don't inline you are
 most probably making the situation worse because you are replacing this...
you are right here tho for the wrong reasons :) yeah speculation is good if there are a few good chunky instructions to eat while speculating. if it's a function call that in turn is just a ret... the results are nothing to write home about. as it's been shown. processors today are so complicated there's no chance to have relevant synthetic tests. when checking for speed you always must use realistic loads. if method speculation as in here shows nothing on empty functions that's nothing. i've seen much weirder things. tests must be run on 5-6 real benchmarks to be any good.
Aug 14 2008
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"superdan" wrote
 [superdan's post quoted in full; snipped]
Who the **** are you, and what did you do with superdan? ;) -Steve
Aug 14 2008
prev sibling parent reply "Jb" <jb nowhere.com> writes:
"superdan" <super dan.org> wrote in message 
news:g81o61$21bs$1 digitalmars.com...
 [earlier quotes snipped]
no that's wrong. conditional branch is better speculated than indirect jump. they share no other hardware than the instruction fetching pipe.
Sorry, but you don't know what you are talking about.

Indirect jumps get a slot in the branch target buffer (BTB) the same as conditional branches. They both share the same target prediction hardware. It's not only a fact, it's also common sense that they would do so.

http://www.agner.org/optimize/microarchitecture.pdf
Read the chapter on branch prediction.

http://www.intel.com/design/processor/manuals/248966.pdf
Read Chapter 3.4.1.6
 the mechanism for conditional branch speculation is that stuff is executed 
 in parallel
 on both branches inside the pipeline. the branch that makes it is 
 committed. the other
  is retired without getting to write to the register file or memory.
Actually, AFAIK no current x86 processor *executes* both branches. What happens is that the instruction cache will *prefetch* both branches, so that when the BTB decides which way to go it has the instruction data ready. But it will still cost you a whole pipeline flush if the prediction was wrong.
 So the point is that you can't make the common path any faster unless you
 actually *inline* the speculated method. In fact if you don't inline you are
 most probably making the situation worse because you are replacing this...
you are right here tho for the wrong reasons :) yeah speculation is good if there are a few good chunky instructions to eat while speculating. if it's a function call that in turn is just a ret... the results are nothing to write home about. as it's been shown.
No.. erm NO... and NOOooOOOO!!!! ;-)

Speculation is good if it allows you to inline the method and chop out a whole bunch of register shuffling, stack setup, and entry/exit code. It's got naff all to do with "having shit to do" while you're speculating. Just like it's got naff all to do with conditional branches being faster / better predicted than indirect calls.
Aug 14 2008
Yigal Chripun <yigal100 gmail.com> writes:
Jb wrote:
 "superdan" <super dan.org> wrote in message 
 news:g81o61$21bs$1 digitalmars.com...
 Jb Wrote:

 "superdan" <super dan.org> wrote in message
 news:g81e1l$18js$1 digitalmars.com...
 downs Wrote:
 Null: 729908
 Naive: 1615314
 Speculative for B: 1692860
 Speculative for B/C: 1664040
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
To speculate localy you still have to lookup the class info, compare it to the speculated object type, and then conditionaly branch. You cant avoid a conditional branch, and as it will share the same pattern as the indirect jump, and (in most cases) the same prediction mechanism, neither will be better predicted than the other.
no that's wrong. conditional branch is better speculated than indirect jump. they share no other hardware than the instruction fetching pipe.
Sorry but you dont know what you are talking about. Indirect jumps get a slot in the branch target buffer (BTB) the same as conditional branches. They both share the same target prediction hardware. It's not only a fact it's also common sense that they would do so. http://www.agner.org/optimize/microarchitecture.pdf Read the chapter on branch prediction. http://www.intel.com/design/processor/manuals/248966.pdf Read Chapter 3.4.1.6
 the mechanism for conditional branch speculation is that stuff is executed 
 in parallel
 on both branches inside the pipeline. the branch that makes it is 
 committed. the other
  is retired without getting to write to the register file or memory.
Actualy AFAIK no current x86 processor *execute* both branches. What does happen is the instruction cache will *prefetch* both branches, so that when the BTB decides which way to go it has the instruction data ready. But it will still cost you a whole pipeline flush if the prediction was wrong.
 So the point is that you cant make the common path any faster unless you
 actualy *inline* the speculated method. In fact if you dont inline you 
 are
 most probably making the situation worse because you are replacing 
 this...
you are right here tho for the wrong reasons :) yeah speculation is good if there are a few good chunky instructions to eat while speculating. if it's a function call that in turn is just a ret... the results are nothing to write home about. as it's been shown.
No.. erm NO... and NOOooOOOO!!!! ;-) Speculation is good if it allows you to inline the method and chop out a whole bunch of register shuffling, stack setup, and entry / exit code. It's got naff all to do with "having shit to do" while you're speculating. Just like it's got naff all to do with conditional branches being faster / better predicted than indirect calls.
I think Superdan is just too old ;) What he is talking about is probably the way things worked in the P4, and that design was canceled for many good reasons. As Jb says above, no current x86 processor *executes* both branches.

My Digital Systems professor works at Intel here in Israel, where they design the chips ;) and he taught us the same thing Jb said above.
Aug 14 2008
superdan <super dan.org> writes:
Jb Wrote:

 
 "superdan" <super dan.org> wrote in message 
 news:g81o61$21bs$1 digitalmars.com...
 Jb Wrote:

 "superdan" <super dan.org> wrote in message
 news:g81e1l$18js$1 digitalmars.com...
 downs Wrote:
 Null: 729908
 Naive: 1615314
 Speculative for B: 1692860
 Speculative for B/C: 1664040
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
To speculate localy you still have to lookup the class info, compare it to the speculated object type, and then conditionaly branch. You cant avoid a conditional branch, and as it will share the same pattern as the indirect jump, and (in most cases) the same prediction mechanism, neither will be better predicted than the other.
no that's wrong. conditional branch is better speculated than indirect jump. they share no other hardware than the instruction fetching pipe.
Sorry, but you don't know what you are talking about. Indirect jumps get a slot in the branch target buffer (BTB) the same as conditional branches. They both share the same target prediction hardware. It's not only a fact, it's also common sense that they would do so.

http://www.agner.org/optimize/microarchitecture.pdf
Read the chapter on branch prediction.

http://www.intel.com/design/processor/manuals/248966.pdf
Read Chapter 3.4.1.6
relax. i do know what i'm talking about eh. i didn't claim it was timely :) also it got real jumbled in my head. like i described predicated execution instead of straight branch prediction. itanium has the former and x86 has the latter. (bad decision in itanium if you ask me.) so i was talking outta my... i mean i was talking nonsense.
 the mechanism for conditional branch speculation is that stuff is executed 
 in parallel
 on both branches inside the pipeline. the branch that makes it is 
 committed. the other
  is retired without getting to write to the register file or memory.
Actually, AFAIK no current x86 processor *executes* both branches. What happens is that the instruction cache will *prefetch* both branches, so that when the BTB decides which way to go it has the instruction data ready.
that's right eh. i stand corrected. what i said happens on itanium.
 [earlier exchange snipped]
No.. erm NO... and NOOooOOOO!!!! ;-) Speculation is good if it allows you to inline the method and chop out a whole bunch of register shuffling, stack setup, and entry / exit code. It's got naff all to do with "having shit to do" while you're speculating. Just like it's got naff all to do with conditional branches being faster / better predicted than indirect calls.
hey hey... watch your language :) otherwise i agree.
Aug 14 2008
parent "Jb" <jb nowhere.com> writes:
"superdan" <super dan.org> wrote in message 
news:g82fpu$t53$1 digitalmars.com...
 Jb Wrote:

 relax.
No!!!!! ;-) Actually, yeah, sorry if I came over a bit cranky.
Aug 14 2008
prev sibling parent "Jb" <jb nowhere.com> writes:
"downs" <default_357-line yahoo.de> wrote in message 
news:g8193l$t88$1 digitalmars.com...
 [benchmark post quoted in full; snipped]
 This completely confuses my established knowledge on the matter. Looks like
 recent CPUs' branch predictors really are as good as people claim.
The point is that you can't get rid of the branch prediction. You either have the indirect jump, or you have a local conditional branch; either way the CPU has to speculatively go in one direction or the other. And as both are dealt with in (near enough) exactly the same way, by the same prediction mechanism, you don't gain anything from replacing one with the other.

It's been that way since the Pentium 2, IIRC.
Aug 14 2008