
digitalmars.D - Feature Request: nontrivial functions and vtable optimizations

downs <default_357-line yahoo.de> writes:
The overhead of vtable calls is well known.

For this reason, it may be worthwhile to include an optimization that
checks whether a class has only a few subclasses, say <= three, in the
visible program, and if so, replaces vtable calls with a tree of classinfo
pointer comparisons and direct function calls.

When I did this manually in some hobbyist path-tracing code, it seemed to
give significant speed-ups, though I didn't collect any hard numbers.

So, for instance, the equivalent D code would look like this:

class A { ... } class B : A { ... } class C : A { ... }

A a = genSomeObj();
// a.test();
if (a.classinfo == typeid(B)) (cast(B)cast(void*) a).test(); // call as if final
else if (a.classinfo == typeid(C)) (cast(C)cast(void*) a).test(); // ditto
else a.test(); // fallback

Ideas?
Aug 12 2008
davidl <davidl 126.com> writes:
On Tue, 12 Aug 2008 16:45:08 +0800, downs <default_357-line yahoo.de> wrote:

 [original post quoted in full; snipped]
It's better to show your benchmark examples, so people can observe and discuss them.

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Aug 12 2008
"Craig Black" <cblack ara.com> writes:
"downs" <default_357-line yahoo.de> wrote in message 
news:g7ribh$1vii$1 digitalmars.com...
 [original post quoted in full; snipped]
You are right. Function pointer invocation tends to slow down processing because it doesn't cooperate well with pipelining and branch prediction. I don't understand why this is so hard for some programmers to accept.

See Microsoft Visual C++'s profile-guided optimizations (PGO), specifically "Virtual Call Speculation": using PGO, the most commonly called virtual functions are optimized so that they do not require a function pointer invocation.

Rather than relying on PGO, I would prefer the ability to assign priorities to virtual functions in the code. This would indicate to the compiler which virtual functions will be called the most.

-Craig
Aug 12 2008
davidl <davidl 126.com> writes:
On Tue, 12 Aug 2008 22:46:21 +0800, Craig Black <cblack ara.com> wrote:

 "downs" <default_357-line yahoo.de> wrote in message
 news:g7ribh$1vii$1 digitalmars.com...
 The overhead of vtable calls is well known.

 For this reason, it may be worthwhile to include an optimization that
 essentially checks if a certain class is only inherited from a few  
 times,
 like, <= three, in the visible program, and if so, replace vtable calls
 with a tree of classinfo pointer comparisons and direct function calls.

 When I did this manually in some hobbyist path tracing code, I gained
 significant speed-ups, even though I didn't collect any hard numbers.

 So, for instance, the equivalent D code would look like so:

 class A { ... } class B : A { ... } class C : A { ... }

 A a = genSomeObj();
 // a.test();
 if (a.classinfo == typeid(B)) (cast(B)cast(void*) a).test(); // call as  
 if
 final
 else if (a.classinfo == typeid(C)) (cast(C)cast(void*) a).test(); //  
 dito
 else a.test(); // fallback

 Ideas?
You are right. Function pointer invokation tends to slow down processing because it doesn't cooperate well with pipelining and branch prediction. I don't understand why this is so hard for some programmers to accept. See Microsoft Visual C++ profile guided optimizations (PGO) "Virtual Call Speculation". Using PGO, the most commonly called virtual functions are optimized so that they do not require a function pointer invokation. Rather than relying on PGO, I would prefer the ability to assign a priorities to virtual functions in the code. This would indicate to the compiler what virtual functions will be called the most. -Craig
But it's such an anti-cleanness hack.

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Aug 12 2008
parent "Craig Black" <cblack ara.com> writes:
 Rather than relying on PGO, I would prefer the ability to assign
 priorities to virtual functions in the code.  This would indicate to the
 compiler which virtual functions will be called the most.

 -Craig
But it's so anti-cleanness hack.
Are you referring to my suggestion that we should be able to specify virtual function priorities? This is much cleaner than writing your own virtual function dispatch routine whenever you want a virtual function to be optimized. Further, specifying a priority is completely optional; you won't be forced to specify one if you don't want to.

-Craig
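For reference, here is roughly what such a hand-written dispatch routine looks like in D. This is a minimal sketch: the `speculate` helper name and the hard-coded test() method are illustrative assumptions, not anything from this thread.

import std.stdio;

class A { void test() { writeln("A.test"); } }
class B : A { final override void test() { writeln("B.test"); } }

// Try the speculated type first with a direct call; otherwise fall
// back to ordinary vtable dispatch. A compiler-level priority hint
// could generate exactly this guard, plus inline the fast-path call.
void speculate(T, Base)(Base obj)
{
    if (typeid(obj) is typeid(T))
        (cast(T) cast(void*) obj).test(); // T.test is final: direct, inlinable
    else
        obj.test();                       // ordinary vtable dispatch
}

void main()
{
    A a = new B;
    speculate!(B, A)(a); // prints "B.test" via the fast path
}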
Aug 12 2008
Don <nospam nospam.com.au> writes:
Craig Black wrote:
 "downs" <default_357-line yahoo.de> wrote in message 
 news:g7ribh$1vii$1 digitalmars.com...
 The overhead of vtable calls is well known.

 For this reason, it may be worthwhile to include an optimization that 
 essentially checks if a certain class is only inherited from a few times, 
 like, <= three, in the visible program, and if so, replace vtable calls 
 with a tree of classinfo pointer comparisons and direct function calls.

 When I did this manually in some hobbyist path tracing code, I gained 
 significant speed-ups, even though I didn't collect any hard numbers.

 So, for instance, the equivalent D code would look like so:

 class A { ... } class B : A { ... } class C : A { ... }

 A a = genSomeObj();
 // a.test();
 if (a.classinfo == typeid(B)) (cast(B)cast(void*) a).test(); // call as if 
 final
 else if (a.classinfo == typeid(C)) (cast(C)cast(void*) a).test(); // dito
 else a.test(); // fallback

 Ideas?
You are right. Function pointer invocation tends to slow down processing because it doesn't cooperate well with pipelining and branch prediction.
I don't think this is true any more. It was true in processors older than the Pentium M and AMD K10. On recent CPUs, indirect jumps and calls are predicted as well as conditional branches are. If you're seeing this kind of speed difference, it's something else (probably more code is being executed).
Aug 13 2008
parent "Craig Black" <craigblack2 cox.net> writes:
"Don" <nospam nospam.com.au> wrote in message 
news:g7u5mr$261s$1 digitalmars.com...
 [earlier quotes snipped]
You are right. Function pointer invocation tends to slow down processing because it doesn't cooperate well with pipelining and branch prediction.
I don't think this is true any more. It was true in processors older than the Pentium M and AMD K10. On recent CPUs, indirect jumps and calls are predicted as well as conditional branches are. If you're seeing this kind of speed difference, it's something else (probably more code is being executed).
To be fair, modern processors are quite sophisticated and do better than older ones with function pointers and vtables. I don't mean to overemphasize the penalty of function pointer invocations; they are usually relatively fast on modern processors. Even so, they can still be the source of bottlenecks. You should never think of a function pointer invocation as free, and never assume that your compiler and processor are doing magical optimizations. Modern processors can still have problems with branch prediction. I speak from experience, and there are many others in the field who agree.

Obviously, you should always profile first, since your function pointer invocations may not be a bottleneck at all. It depends on how often and how the function pointers are getting called. From experience, I have had numerous performance problems with function pointer invocations. For example, I found a very severe bottleneck in my rendering loop that was caused by a function pointer invocation; it was destroying my performance. In general, when I refactor my code to make fewer function pointer calls, I usually get significant, measurable speedups.

-Craig
Aug 14 2008
"Jb" <jb nowhere.com> writes:
"Craig Black" <cblack ara.com> wrote in message 
news:g7s7o1$dpt$1 digitalmars.com...
 You are right.  Function pointer invocation tends to slow down processing 
 because it doesn't cooperate well with pipelining and branch prediction. 
 I don't understand why this is so hard for some programmers to accept.
That's not true these days. I know because I've profiled it on multiple CPUs: correctly predicted indirect jumps (through a pointer) cost no more than a normal call/ret pair. They get a slot in the BTB just like conditional jumps, and IIRC they are predicted to go to the same place they did last time (on x86, anyway).

Virtual methods are in such high use that Intel and AMD have put a lot of effort into optimizing them. Replacing the function pointer with a conditional branch won't make it faster.
 See Microsoft Visual C++ profile guided optimizations (PGO) "Virtual Call 
 Speculation". Using PGO, the most commonly called virtual functions are 
 optimized so that they do not require a function pointer invocation.
The speedup you get from this is the inlining of the function, *not* the getting rid of the function pointer lookup. It still requires a lookup in the class info and a conditional branch to check that the "inlined function" is the correct match for the speculated object. That will still cost a whole pipeline of cycles if wrongly predicted.
Aug 13 2008
parent reply "Craig Black" <craigblack2 cox.net> writes:
"Jb" <jb nowhere.com> wrote in message news:g7ubhk$2hq5$1 digitalmars.com...
 "Craig Black" <cblack ara.com> wrote in message 
 news:g7s7o1$dpt$1 digitalmars.com...
 (on x86 anyway).

 Virtual methods are in such high use that Intel and AMD have put a lot of 
 effort into optimizing them.

 Replacing the function pointer with a conditional branch won't make it 
 faster.
It may or may not improve performance. It depends on the particular case.
 See Microsoft Visual C++ profile guided optimizations (PGO) "Virtual Call 
 Speculation". Using PGO, the most commonly called virtual functions are 
 optimized so that they do not require a function pointer invocation.
The speedup you get from this is the inlining of the function, *not* the getting rid of the function pointer lookup. It still requires a lookup in the class info and a conditional branch to check that the "inlined function" is the correct match for the speculated object. That will still cost a whole pipeline of cycles if wrongly predicted.
You may be right in some cases. Profile to be sure. But experience has taught me to be wary of function pointer invocations, even on modern CPUs. If you want to believe that there is never a penalty for function pointer invocation, go right ahead. After all, it's only two instructions, right? Must be real fast. ;)

-Craig
Aug 14 2008
parent reply "Jb" <jb nowhere.com> writes:
"Craig Black" <craigblack2 cox.net> wrote in message 
news:g82rj4$1hqn$1 digitalmars.com...
 The speedup you get from this is the inlining of the function, *not* the 
 getting rid of the function pointer lookup. It still requires a lookup in 
 the class info and a conditional branch to check that the "inlined 
 function" is the correct match for the speculated object.

 That will still cost a whole pipeline of cycles if wrongly predicted.
You may be right in some cases. Profile to be sure. But experience has taught me to be wary of function pointer invocations, even on modern CPUs. If you want to believe that there is never a penalty for function pointer invocation, go right ahead. After all, it's only two instructions, right? Must be real fast. ;)
I'm not saying there's no penalty, I'm saying it's a penalty you can't escape if you want virtual functions. Function pointer invocations == conditional branches as far as modern branch predictors are concerned. You don't gain anything by replacing the former with the latter.
Aug 15 2008
parent "Craig Black" <craigblack2 cox.net> writes:
"Jb" <jb nowhere.com> wrote in message news:g843q6$104a$1 digitalmars.com...
 "Craig Black" <craigblack2 cox.net> wrote in message 
 news:g82rj4$1hqn$1 digitalmars.com...
 The speedup you get from this is the inlining of the function, *not* the 
 getting rid of the function pointer lookup. It still requires a lookup 
 in the class info and a conditional branch to check that the "inlined 
 function" is the correct match for the speculated object.

 That will still cost a whole pipline of cycles if wrongly predicted.
You may be right in some cases. Profile to be sure. But experience has taught me to be wary of function pointer invokations, even on modern CPU's. If you want to believe that there is never a penalty for function pointer invokation go right ahead. After all, it's only two instructions right? Must be real fast. ;)
I'm not saying there's no penalty, I'm saying it's a penalty you can't escape if you want virtual functions. Function pointer invocations == conditional branches as far as modern branch predictors are concerned. You don't gain anything by replacing the former with the latter.
I thought you were wrong about this, so I did a micro-benchmark, and it indicates that you are correct. With only one branch, the performance of an if statement was exactly the same as a function pointer invocation. Function pointer invocations ended up being roughly 50% faster than a sequence of if-elses. I guess I stand corrected.

-Craig
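The micro-benchmark itself isn't included in the post. A reconstruction along these lines shows the shape of the comparison; the function names, the LCG selector, and the use of std.datetime.stopwatch are my assumptions, not Craig's code.

import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;

void f0() {}
void f1() {}
void f2() {}
void f3() {}

void main()
{
    // fixed LCG selector, so both loops see the identical call sequence
    uint seed = 123456789;
    uint next() { seed = seed * 1664525 + 1013904223; return seed >> 30; }

    void function()[4] table = [&f0, &f1, &f2, &f3];

    auto results = benchmark!(
        {   // dispatch through a function pointer
            foreach (i; 0 .. 1_000_000) table[next()]();
        },
        {   // the same dispatch as an if-else chain
            foreach (i; 0 .. 1_000_000)
            {
                immutable k = next();
                if (k == 0) f0();
                else if (k == 1) f1();
                else if (k == 2) f2();
                else f3();
            }
        })(10);

    writeln("function pointer: ", results[0]);
    writeln("if-else chain:    ", results[1]);
    // with -O the empty bodies may be folded away; give the functions
    // a visible side effect for a serious measurement
}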
Aug 15 2008
Christopher Wright <dhasenan gmail.com> writes:
Craig Black wrote:
 Rather than relying on PGO, I would prefer the ability to assign 
 priorities to virtual functions in the code.  This would indicate to the 
 compiler which virtual functions will be called the most.
 
 -Craig 
Or you could wait a bit[1] and have LLVM do that for you, assuming you're willing to run LLVM bytecode. And considering it's LLVM, you could probably hack it to periodically dump the optimized code. So you'd optimize your application by fooling around with it for a couple hours.

[1] "A bit" may be as long as a decade. I dunno.
Aug 13 2008
parent "Craig Black" <craigblack2 cox.net> writes:
"Christopher Wright" <dhasenan gmail.com> wrote in message 
news:g7uhpk$308b$2 digitalmars.com...
 [quoted context snipped]
Or you could wait a bit[1] and have LLVM do that for you, assuming you're willing to run LLVM bytecode. And considering it's LLVM, you could probably hack it to periodically dump the optimized code. So you'd optimize your application by fooling around with it for a couple hours.

[1] "A bit" may be as long as a decade. I dunno.
The LLVM-based D compiler seems promising. Hopefully it will mature soon.

-Craig
Aug 14 2008
downs <default_357-line yahoo.de> writes:
To clear this up, I've been running a benchmark.

module test91;

import tools.time, std.stdio, tools.base, tools.mersenne;

class A { void test() { } }
class B : A { final override void test() { } }
class C : A { final override void test() { } }

A a, b, c;
static this() { a = new A; b = new B; c = new C; }

A gen() {
  if (randf() < 1f/3f) return a;
  else if (randf() < 0.5) return b;
  else return c;
}

void main() {
  const count = 1024*1024;
  for (int z = 0; z < 4; ++z) {
    writefln("Naive: ", time({
      for (int i = 0; i < count; ++i) gen().test();
    }()));
    writefln("Speculative for B: ", time({
      for (int i = 0; i < count; ++i) {
        auto t = gen();
        if (t.classinfo is typeid(B)) (cast(B)cast(void*)t).test();
        else t.test();
      }
    }()));
    writefln("Speculative for B/C: ", time({
      for (int i = 0; i < count; ++i) {
        auto t = gen();
        if (t.classinfo is typeid(B)) (cast(B)cast(void*)t).test();
        else if (t.classinfo is typeid(C)) (cast(C)cast(void*)t).test();
        else t.test();
      }
    }()));
  }
}


And as it turns out, virtual method calls were at least fast enough to not make any sort of difference in my calls.

Here's the output of my little proggy in the last iteration:

Naive: 560958
Speculative for B: 574602
Speculative for B/C: 572429

If anything, naive is often a little faster.

This completely confuses my established knowledge on the matter. Looks like recent CPUs' branch predictors really are as good as people claim.

Sorry for the confusion.
Aug 14 2008
superdan <super dan.org> writes:
downs Wrote:

 [benchmark post quoted in full; snipped]
you are looking with binoculars at a coin a mile away and tryin' to figure quarter or nickel. never gonna work. most likely your benchmark is buried in randf timing. make iteration cost next to nothing. put objects in a medium-size vector. then iterate many times over it.
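A minimal D sketch of the methodology suggested here, with the random generation hoisted out of the timed loop (the array size, pass count, and seed are arbitrary assumptions):

import std.stdio;
import std.random : Random, uniform;
import std.datetime.stopwatch : StopWatch, AutoStart;

class A { void test() { } }
class B : A { final override void test() { } }
class C : A { final override void test() { } }

void main()
{
    // pre-generate the receivers once so the timed loop is (almost)
    // pure dispatch
    auto rng = Random(42);
    auto objs = new A[](4096);
    foreach (ref o; objs)
    {
        switch (uniform(0, 3, rng))
        {
            case 0:  o = new A; break;
            case 1:  o = new B; break;
            default: o = new C; break;
        }
    }

    auto sw = StopWatch(AutoStart.yes);
    foreach (pass; 0 .. 1024)
        foreach (o; objs)
            o.test();   // virtual dispatch dominates the loop body
    writeln("virtual dispatch over array: ", sw.peek);
    // as with the original benchmark, empty test() bodies may be
    // optimized away at high optimization levels
}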
Aug 14 2008
downs <default_357-line yahoo.de> writes:
superdan wrote:
 [quoted benchmark and discussion snipped]
you are looking with binoculars at a coin a mile away and tryin' to figure quarter or nickel. never gonna work. most likely your benchmark is buried in randf timing. make iteration cost next to nothing. put objects in a medium-size vector. then iterate many times over it.
I know. But if it doesn't matter in that case, it most likely won't matter in practical situations.

Nonetheless, here are some better timings, including a faster (dirtier) randf() function, and a null pass that only generates the object.

Null: 729908
Naive: 1615314
Speculative for B: 1692860
Speculative for B/C: 1664040
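Subtracting the null pass as a rough baseline sharpens the comparison: Naive spends about 1615314 - 729908 = 885406 on dispatch, Speculative for B about 1692860 - 729908 = 962952, and Speculative for B/C about 1664040 - 729908 = 934132, i.e. the speculative variants cost roughly 5-9% more in the dispatch portion itself (a rough reading, since the null pass also includes the loop and randf overhead).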
Aug 14 2008
superdan <super dan.org> writes:
downs Wrote:

 [earlier exchange snipped]
I know. But if it doesn't matter in that case, it most likely won't matter in practical situations. Nonetheless, here are some better timings, including a faster (dirtier) randf() function, and a null pass that only generates the object.

Null: 729908
Naive: 1615314
Speculative for B: 1692860
Speculative for B/C: 1664040
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
Aug 14 2008
downs <default_357-line yahoo.de> writes:
superdan wrote:
 [earlier exchange snipped]
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
As you wish. Here are the timings for 10% A, 81% B, and 9% C.

Null: 629760
Naive: 1383764
Speculative for B: 1397784
Speculative for B/C: 1418163
Aug 14 2008
prev sibling parent reply "Jb" <jb nowhere.com> writes:
"superdan" <super dan.org> wrote in message
news:g81e1l$18js$1 digitalmars.com...
 downs Wrote:
 Null: 729908
 Naive: 1615314
 Speculative for B: 1692860
 Speculative for B/C: 1664040
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
To speculate locally you still have to look up the class info, compare it to the speculated object type, and then conditionally branch. You can't avoid a conditional branch, and as it will share the same pattern as the indirect jump, and (in most cases) the same prediction mechanism, neither will be better predicted than the other.

So the point is that you can't make the common path any faster unless you actually *inline* the speculated method. In fact if you don't inline, you are most probably making the situation worse, because you are replacing this (ignoring parameter setup, which will be identical)...

MOV EAX,[objptr.vtable+offset]
CALL EAX

With this...

MOV EAX,[objptr.classinfo]
CMP EAX,expectedclasstype
JNE wrongtype
CALL expectedtypemethod
JMP done
wrongtype:
// normal vtable code here
done:

And there's no way that will be faster imo. Unless perhaps you have a 486 or P1. ;-)
Aug 14 2008
superdan <super dan.org> writes:
Jb Wrote:

 
 "superdan" <super dan.org> wrote in message
 news:g81e1l$18js$1 digitalmars.com...
 downs Wrote:
 Null: 729908
 Naive: 1615314
 Speculative for B: 1692860
 Speculative for B/C: 1664040
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
To speculate locally you still have to look up the class info, compare it to the speculated object type, and then conditionally branch. You can't avoid a conditional branch, and as it will share the same pattern as the indirect jump, and (in most cases) the same prediction mechanism, neither will be better predicted than the other.
no that's wrong. conditional branch is better speculated than indirect jump. they share no other hardware than the instruction fetching pipe. the mechanism for conditional branch speculation is that stuff is executed in parallel on both branches inside the pipeline. the branch that makes it is committed. the other is retired without getting to write to the register file or memory. indirect jump is a whole different business altogether.
 So the point is that you can't make the common path any faster unless you
 actually *inline* the speculated method. In fact if you don't inline you are
 most probably making the situation worse because you are replacing this...
you are right here tho for the wrong reasons :) yeah speculation is good if there are a few good chunky instructions to eat while speculating. if it's a function call that in turn is just a ret... the results are nothing to write home about. as it's been shown. processors today are so complicated there's no chance to have relevant synthetic tests. when checking for speed you always must use realistic loads. if method speculation as in here shows nothing on empty functions that's nothing. i've seen much weirder things. tests must be run on 5-6 real benchmarks to be any good.
Aug 14 2008
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"superdan" wrote
 [superdan's post quoted in full; snipped]
Who the **** are you, and what did you do with superdan? ;) -Steve
Aug 14 2008
prev sibling parent reply "Jb" <jb nowhere.com> writes:
"superdan" <super dan.org> wrote in message 
news:g81o61$21bs$1 digitalmars.com...
 [earlier quotes snipped]
no that's wrong. conditional branch is better speculated than indirect jump. they share no other hardware than the instruction fetching pipe.
Sorry, but you don't know what you are talking about.

Indirect jumps get a slot in the branch target buffer (BTB) the same as conditional branches. They both share the same target prediction hardware. It's not only a fact, it's also common sense that they would do so.

http://www.agner.org/optimize/microarchitecture.pdf
Read the chapter on branch prediction.

http://www.intel.com/design/processor/manuals/248966.pdf
Read Chapter 3.4.1.6
 the mechanism for conditional branch speculation is that stuff is executed 
 in parallel
 on both branches inside the pipeline. the branch that makes it is 
 committed. the other
  is retired without getting to write to the register file or memory.
Actually, AFAIK no current x86 processor *executes* both branches. What happens is that the instruction cache will *prefetch* both branches, so that when the BTB decides which way to go it has the instruction data ready. But it will still cost you a whole pipeline flush if the prediction was wrong.
 So the point is that you can't make the common path any faster unless you
 actually *inline* the speculated method. In fact if you don't inline you are
 most probably making the situation worse because you are replacing this...
you are right here tho for the wrong reasons :) yeah speculation is good if there are a few good chunky instructions to eat while speculating. if it's a function call that in turn is just a ret... the results are nothing to write home about. as it's been shown.
No.. erm NO... and NOOooOOOO!!!! ;-)

Speculation is good if it allows you to inline the method and chop out a whole bunch of register shuffling, stack setup, and entry/exit code. It's got naff all to do with "having shit to do" while you're speculating. Just like it's got naff all to do with conditional branches being faster / better predicted than indirect calls.
Aug 14 2008
Yigal Chripun <yigal100 gmail.com> writes:
Jb wrote:
 "superdan" <super dan.org> wrote in message 
 news:g81o61$21bs$1 digitalmars.com...
 Jb Wrote:

 "superdan" <super dan.org> wrote in message
 news:g81e1l$18js$1 digitalmars.com...
 downs Wrote:
 Null: 729908
 Naive: 1615314
 Speculative for B: 1692860
 Speculative for B/C: 1664040
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
To speculate localy you still have to lookup the class info, compare it to the speculated object type, and then conditionaly branch. You cant avoid a conditional branch, and as it will share the same pattern as the indirect jump, and (in most cases) the same prediction mechanism, neither will be better predicted than the other.
no that's wrong. conditional branch is better speculated than indirect jump. they share no other hardware than the instruction fetching pipe.
Sorry but you dont know what you are talking about. Indirect jumps get a slot in the branch target buffer (BTB) the same as conditional branches. They both share the same target prediction hardware. It's not only a fact it's also common sense that they would do so. http://www.agner.org/optimize/microarchitecture.pdf Read the chapter on branch prediction. http://www.intel.com/design/processor/manuals/248966.pdf Read Chapter 3.4.1.6
 the mechanism for conditional branch speculation is that stuff is executed 
 in parallel
 on both branches inside the pipeline. the branch that makes it is 
 committed. the other
  is retired without getting to write to the register file or memory.
Actualy AFAIK no current x86 processor *execute* both branches. What does happen is the instruction cache will *prefetch* both branches, so that when the BTB decides which way to go it has the instruction data ready. But it will still cost you a whole pipeline flush if the prediction was wrong.
 So the point is that you cant make the common path any faster unless you
 actualy *inline* the speculated method. In fact if you dont inline you 
 are
 most probably making the situation worse because you are replacing 
 this...
you are right here tho for the wrong reasons :) yeah speculation is good if there are a few good chunky instructions to eat while speculating. if it's a function call that in turn is just a ret... the results are nothing to write home about. as it's been shown.
No.. erm NO... and NOOooOOOO!!!! ;-) Speculation is good if it allows you to inline the method and chop out a whole bunch of register shuffling, stack setup, and entry / exit code. It's got naff all to do with "having shit to do" while you're speculating. Just like it's got naff all to do with conditional branches being faster / better predicted than indirect calls.
I think Superdan is just too old ;) What he is talking about is probably the way things worked in the P4, and that design was canceled for many good reasons. As Jb says above, no current x86 processor *executes* both branches.

My Digital Systems professor works at Intel here in Israel, where they design the chips ;) and he taught us the same thing Jb said above.
Aug 14 2008
superdan <super dan.org> writes:
Jb Wrote:

 
 "superdan" <super dan.org> wrote in message 
 news:g81o61$21bs$1 digitalmars.com...
 Jb Wrote:

 "superdan" <super dan.org> wrote in message
 news:g81e1l$18js$1 digitalmars.com...
 downs Wrote:
 Null: 729908
 Naive: 1615314
 Speculative for B: 1692860
 Speculative for B/C: 1664040
the whole premise of speculation is it oils the common path. you have uniform probabilities. what you gain on speculating for B you lose in the extra test when misspeculating on others. so to really test speculation make B like 90% of cases and retry.
To speculate localy you still have to lookup the class info, compare it to the speculated object type, and then conditionaly branch. You cant avoid a conditional branch, and as it will share the same pattern as the indirect jump, and (in most cases) the same prediction mechanism, neither will be better predicted than the other.
no that's wrong. conditional branch is better speculated than indirect jump. they share no other hardware than the instruction fetching pipe.
Sorry, but you don't know what you are talking about. Indirect jumps get a slot in the branch target buffer (BTB) the same as conditional branches. They both share the same target prediction hardware. It's not only a fact, it's also common sense that they would do so.

http://www.agner.org/optimize/microarchitecture.pdf
Read the chapter on branch prediction.

http://www.intel.com/design/processor/manuals/248966.pdf
Read Chapter 3.4.1.6
relax. i do know what i'm talking about eh. i didn't claim it was timely :) also it got real jumbled in my head. like i described predicated execution instead of straight branch prediction. itanium has the former and x86 has the latter. (bad decision in itanium if you ask me.) so i was talking outta my... i mean i was talking nonsense.
 the mechanism for conditional branch speculation is that stuff is executed 
 in parallel
 on both branches inside the pipeline. the branch that makes it is 
 committed. the other
  is retired without getting to write to the register file or memory.
Actually, AFAIK no current x86 processor *executes* both branches. What happens is that the instruction cache will *prefetch* both branches, so that when the BTB decides which way to go it has the instruction data ready.
that's right eh. i stand corrected. what i said happens on itanium.
 [earlier exchange snipped]
No.. erm NO... and NOOooOOOO!!!! ;-) Speculation is good if it allows you to inline the method and chop out a whole bunch of register shuffling, stack setup, and entry / exit code. It's got naff all to do with "having shit to do" while you're speculating. Just like it's got naff all to do with conditional branches being faster / better predicted than indirect calls.
hey hey... watch your language :) otherwise i agree.
Aug 14 2008
parent "Jb" <jb nowhere.com> writes:
"superdan" <super dan.org> wrote in message 
news:g82fpu$t53$1 digitalmars.com...
 Jb Wrote:

 relax.
No!!!!! ;-) Actually, yeah, sorry if I came over a bit cranky.
Aug 14 2008
prev sibling parent "Jb" <jb nowhere.com> writes:
"downs" <default_357-line yahoo.de> wrote in message 
news:g8193l$t88$1 digitalmars.com...
 [benchmark post quoted in full; snipped]
 This completely confuses my established knowledge on the matter. Looks like
 recent CPUs' branch predictors really are as good as people claim.
The point is that you can't get rid of the branch prediction. You either have the indirect jump, or you have a local conditional branch; either way the CPU has to speculatively go in one direction or the other. And as both are dealt with in (near enough) exactly the same way, by the same prediction mechanism, you don't gain anything from replacing one with the other.

It's been that way since the Pentium 2, IIRC.
Aug 14 2008