digitalmars.D - I feel outraged -
- Justin Johansson (4/4) Oct 15 2009 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).
- downs (3/9) Oct 15 2009 - with this weird way of writing posts? The subject should tell us about...
- Justin Johansson (6/17) Oct 15 2009 I will be bold and say yes to off-stack allocation,
- bearophile (4/9) Oct 15 2009 This is a nice situation: with just 10-20 minutes of experimental tests ...
- Justin Johansson (6/18) Oct 15 2009 Steve, bearophile, et al.,
- Justin Johansson (105/117) Oct 15 2009 Fresh light of morning ...
- bearophile (57/60) Oct 15 2009 I have changed your benchmark a little, you may want to look at its timi...
- Justin Johansson (32/108) Oct 16 2009 Thanks muchly (also the _ tip)
- bearophile (19/23) Oct 16 2009 In DMD:
- Justin Johansson (7/17) Oct 16 2009 "because they often have just an illusion of understanding things :-)"
- Steven Schveighoffer (18/43) Oct 15 2009 You got a response because I'm actually awake and at a computer :) I
- Jeremie Pelletier (8/27) Oct 15 2009 I don't see why delegates should be allocated on the heap, if so then
- Steven Schveighoffer (5/9) Oct 15 2009 How do you propose to fix it? I think it is the minimal approach. You ...
- downs (52/52) Oct 15 2009 Two discoveries were made from this benchmark.
- downs (1/1) Oct 15 2009 On consideration, this wasn't a test of the two methods at all, but a te...
- Don (6/12) Oct 15 2009 Not so. On 286 and earlier, stack pushes were more expensive. They're
- that the .sizeof a delegate is 8 bytes (on a 32-bit machine).

AFAIK, stack pushes are still more expensive than a pointer dereference
in contemporary CPU architectures.

Justin
Oct 15 2009
Justin Johansson wrote:
> - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).
> AFAIK, stack pushes are still more expensive than a pointer
> dereference in contemporary CPU architectures.
> Justin

- with this weird way of writing posts? The subject should tell us about
the content, not your emotional state! :p

Also I have no idea what you mean. Should delegate _values_ be heap
allocated?! That'd be insanity.

Also, I'm fairly sure you're wrong. The stack is relatively likely to be
in the CPU cache. A random pointer dereference .. isn't.

Also, do you really want to heap even more work on the ailing GC?
Oct 15 2009
downs Wrote:
> Justin Johansson wrote:
>> - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).
>> AFAIK, stack pushes are still more expensive than a pointer
>> dereference in contemporary CPU architectures.
>> Justin
> - with this weird way of writing posts? The subject should tell us
> about the content, not your emotional state! :p

Re subject line: fair call, you are right. Emotions aside, at least this
time I got a response.

> Also I have no idea what you mean. Should delegate _values_ be heap
> allocated?! That'd be insanity.
>
> Also, I'm fairly sure you're wrong. The stack is relatively likely to
> be in the CPU cache. A random pointer dereference .. isn't.
>
> Also, do you really want to heap even more work on the ailing GC?

I will be bold and say yes to off-stack allocation, whether that be in
the general heap or other (and probably other, to avoid an "ailing GC").

When the tough gets going, the going have to get tough.
(Meaning to start thinking outside of the square.)
Oct 15 2009
Justin Johansson:
> downs:
>> Also I have no idea what you mean. Should delegate _values_ be heap
>> allocated?! That'd be insanity.
>>
>> Also, I'm fairly sure you're wrong. The stack is relatively likely to
>> be in the CPU cache. A random pointer dereference .. isn't.
>>
>> Also, do you really want to heap even more work on the ailing GC?
> I will be bold and say yes to off-stack allocation, whether that be in
> the general heap or other (and probably other, to avoid an "ailing
> GC").

This is a nice situation: with just 10-20 minutes of experimental tests
(that later you have to show us) you can show us whether you are right
or wrong.

Bye,
bearophile
Oct 15 2009
bearophile Wrote:
> This is a nice situation: with just 10-20 minutes of experimental
> tests (that later you have to show us) you can show us whether you are
> right or wrong.

Steve, bearophile, et al.,

Yes, the timezone (being in Australia) is a severe disadvantage for
"reality-time" discussion. Best I retire with this as conjecture for the
moment.

Buona Notte (23.30)
Justin
Oct 15 2009
bearophile Wrote:
> Justin Johansson:
>> I will be bold and say yes to off-stack allocation, whether that be
>> in the general heap or other (and probably other, to avoid an
>> "ailing GC").
> This is a nice situation: with just 10-20 minutes of experimental
> tests (that later you have to show us) you can show us whether you are
> right or wrong.

Fresh light of morning ...

I'm really glad to have brought this up, as I would not have bothered to
revisit a performance issue that I had in porting some C++ to D (and
this was not looking good for D at first sight).

As it turns out though, my initial fear about the .sizeof a delegate was
unfounded, as the performance bottleneck was in a loop inside a method
taking what was effectively a callback parameter. The C++ design was
basically implementing a classical visitor pattern over a collection. In
porting to D I had a choice of either doing a one-for-one translation of
the C++ classes or redesigning using D delegates.

This morning I boiled the problem down to the simplest possible case in
the two forms and benchmarked them. From the code I have reproduced
below, clearly the issue has nothing to do with the cost of
instantiating a delegate or visitor object. That turns out to be a
one-time cost and irrelevant given where the iteration actually occurs.

So while this isn't the proof or disproof that bearophile asked for,
this turns out to be a clear demonstration of the performance-enhancing
power of D delegates over an otherwise ingrained C++ thinking approach.
I'm impressed. (Hope that statement isn't too emotional :-)

Compiled with -release:

import std.perf, std.stdio;

class SetOfIntegers {
    private int from, to;

    this(int from, int to) {
        this.from = from;
        this.to = to;
    }

    int forEach(int delegate(int x) apply) {
        for (auto i = from; i <= to; ++i) {
            apply(i);
        }
        return 0;
    }

    int forEach(Visitor v) {
        for (auto i = from; i <= to; ++i) {
            v.visit(i);
        }
        return 0;
    }
}

class Visitor {
    abstract int visit(int x);
}

class MyVisitor : Visitor {
    int visit(int x) { return 0; }
}

void test1(SetOfIntegers s, Visitor v) {
    auto pc = new PerformanceCounter();
    pc.start();
    scope(exit) {
        pc.stop();
        writefln("Using D style delegate callback: %d msec",
                 pc.milliseconds());
    }
    s.forEach(&v.visit);
}

void test2(SetOfIntegers s, Visitor v) {
    auto pc = new PerformanceCounter();
    pc.start();
    scope(exit) {
        pc.stop();
        writefln("Using C++ style virtual callback: %d msec",
                 pc.milliseconds());
    }
    s.forEach(v);
}

void main() {
    writefln("Delegates vs virtual function callbacks");
    writefln();
    SetOfIntegers s = new SetOfIntegers(1, 10000000);
    Visitor v = new MyVisitor();
    for (auto i = 0; i < 10; ++i) {
        test1(s, v);
        test2(s, v);
        writefln();
    }
    writefln();
}

$ ./perf.d
Delegates vs virtual function callbacks

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 146 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback: 120 msec
Using C++ style virtual callback: 145 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 145 msec

Using D style delegate callback: 120 msec
Using C++ style virtual callback: 145 msec

Using D style delegate callback: 120 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 146 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 147 msec

Sweet.

cheers
Justin Johansson
Oct 15 2009
Justin Johansson:
> this turns out to be a clear demonstration of the
> performance-enhancing power of D delegates over an otherwise ingrained
> C++ thinking approach.

I have changed your benchmark a little; you may want to look at its
timings too (I have taken timings with it with DMD and LDC, and the
results differ):

version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.time: CLOCKS_PER_SEC, clock;
} else {
    import std.c.stdio: printf;
    import std.c.time: CLOCKS_PER_SEC, clock;
}

double myclock() {
    return clock() / cast(double)CLOCKS_PER_SEC;
}

abstract class Visitor {
//interface Visitor {  // try this too
    abstract int visit(int x);
}

final class MyVisitor : Visitor {
    int visit(int x) { return 0; }
}

struct IntRange {
    int stop;

    int forEachDeleg(int delegate(int x) apply) {
        for (int i; i < stop; i++)
            apply(i);
        return 0;
    }

    int forEachObj(Visitor v) {
        for (int i; i < stop; i++)
            v.visit(i);
        return 0;
    }
}

void testD(IntRange s, Visitor v) {
    auto start = myclock();
    s.forEachDeleg(&v.visit);
    auto stop = myclock();
    printf("Using D style delegate callback: %d ms\n",
           cast(int)((stop - start) * 1000));
}

void testCpp(IntRange s, Visitor v) {
    auto start = myclock();
    s.forEachObj(v);
    auto stop = myclock();
    printf("Using C++ style virtual callback: %d ms\n",
           cast(int)((stop - start) * 1000));
}

void main() {
    auto s = IntRange(400_000_000);
    Visitor v = new MyVisitor();
    for (int i; i < 5; i++) {
        testD(s, v);
        testCpp(s, v);
        printf("\n");
    }
}

(I suggest you use the _ inside big number literals in D; it avoids a
few bugs.)

(A few days ago I think I found that interfaces aren't implemented
efficiently in LDC. Lindquist has answered that he will improve the
situation.)

Bye,
bearophile
Oct 15 2009
bearophile Wrote:
> I have changed your benchmark a little; you may want to look at its
> timings too (I have taken timings with it with DMD and LDC, and the
> results differ):
> [...]
> (I suggest you use the _ inside big number literals in D; it avoids a
> few bugs.)

Thanks muchly (also the _ tip).

Just ran your code with these results (D1/phobos/linux, -release -O).
Also added the -O switch this time, though I have no idea what level of
optimization that does. (btw. In this test code, the -release switch
doesn't do anything, does it, as that's just for conditional
compilation?)

A. abstract class Visitor version:

Using D style delegate callback: 2720 ms
Using C++ style virtual callback: 2249 ms

Using D style delegate callback: 2560 ms
Using C++ style virtual callback: 2259 ms

Using D style delegate callback: 2170 ms
Using C++ style virtual callback: 2259 ms

Using D style delegate callback: 2099 ms
Using C++ style virtual callback: 2259 ms

Using D style delegate callback: 2640 ms
Using C++ style virtual callback: 2250 ms

B. interface Visitor version:

Using D style delegate callback: 2509 ms
Using C++ style virtual callback: 2500 ms

Using D style delegate callback: 2509 ms
Using C++ style virtual callback: 2500 ms

Using D style delegate callback: 2519 ms
Using C++ style virtual callback: 2510 ms

Using D style delegate callback: 2509 ms
Using C++ style virtual callback: 2500 ms

Using D style delegate callback: 2510 ms
Using C++ style virtual callback: 2500 ms

The results are not clear cut at all this time. So what's going on?

ciao
justin
Oct 16 2009
Justin Johansson:
> Also added the -O switch this time, though I have no idea what level
> of optimization that does. (btw. In this test code, the -release
> switch doesn't do anything, does it, as that's just for conditional
> compilation?)

In DMD:
-O means full optimizations minus inlining (and keeping asserts, bound
tests, contracts and maybe more).
-release means no asserts (but it keeps assert(0)), no bound tests and
no contracts.
-inline means to perform inlining.
So generally when you care for performance you compile in DMD with:
-O -release -inline
(But sometimes inlining makes the performance a little worse, because
there's more pressure on the small code half of the L1 cache.)
In this program -release doesn't change the timings, probably because
there's nothing to remove (bound tests, etc.).

In LDC:
-O equals -O2, which means an average optimization.
-O3 means more optimization and includes two successive inlining passes
(so a foreach over an opApply is often fully simplified, but only a few
delegates/function pointers are inlined).
-O4 and -O5 currently mean -O3; in future (I hope soon!) -O4 will
perform all the optimizations of -O3 plus link-time optimization and
_Dmain interning (that's already doable, but only manually).
If you add -inline I think (but I am not sure) it performs a third
inlining pass.
There is -release too, which works as in DMD, plus flags for finer
control of releasing (for example to disable just asserts but not array
bounds) that are not available in DMD.

> The results are not clear cut at all this time. So what's going on?

I don't know. I have a certain experience of benchmarks now, and I know
they are tricky. I usually like to help people understand that they
don't understand what's going on, because they often have just an
illusion of understanding things :-)

You may use something like obj2asm (or a disassembler) to see the asm
produced in both cases, to understand a little better. If you don't have
a way to do it, I can show you the resulting asm myself.

Bye,
bearophile
Oct 16 2009
bearophile Wrote:
> I don't know. I have a certain experience of benchmarks now, and I
> know they are tricky. I usually like to help people understand that
> they don't understand what's going on, because they often have just an
> illusion of understanding things :-)

"because they often have just an illusion of understanding things :-)"

So true.

> You may use something like obj2asm (or a disassembler) to see the asm
> produced in both cases, to understand a little better. If you don't
> have a way to do it, I can show you the resulting asm myself.

No worries; I'm fine with grokking asm. Thanks very much for your time
and encouragement.

ciao,
justin
Oct 16 2009
On Thu, 15 Oct 2009 07:45:02 -0400, Justin Johansson
<procode adam.com.after-dot-com-add-dot-au> wrote:

> Re subject line: fair call, you are right. Emotions aside, at least
> this time I got a response.

You got a response because I'm actually awake and at a computer :)  I
don't think you should expect much earlier than 7am eastern from the US
participants (regarding your 3am post about a manifesto, followed by an
assumed lack of interest at 5am).

But I have to agree with downs. Although I look at "non-descriptive"
posts, it has nothing to do with my likelihood of reading *or*
responding. Attributing a response to changing such a non-essential
piece of a post is like thinking you made it rain by dancing.

> I will be bold and say yes to off-stack allocation, whether that be in
> the general heap or other (and probably other, to avoid an "ailing
> GC").

When the majority of delegates survive exactly one function call, I
think you might be very much wrong. You only save on allocation vs. the
stack when you pass the delegate through many function calls. In fact,
using such a delegate will probably be more penalized if the memory
location is not local (and the stack usually is close to the cache), not
to mention that putting it off stack means an additional pointer
dereference.

> When the tough gets going, the going have to get tough.
> (Meaning to start thinking outside of the square.)

The going isn't tough yet :) delegates work just fine for me.

-Steve
Oct 15 2009
Justin Johansson wrote:
> I will be bold and say yes to off-stack allocation, whether that be in
> the general heap or other (and probably other, to avoid an "ailing
> GC").

I don't see why delegates should be allocated on the heap; if so, then
dynamic arrays would have to be too, because they're the same size. It
wouldn't be efficient, because even if dereferences 'may' be faster than
stack pushes, having arrays or delegates in the heap would double the
number of dereferences needed, double the chances of memory not being in
the cache, and double the code to create and access them.

> When the tough gets going, the going have to get tough.
> (Meaning to start thinking outside of the square.)

And here I was trying to think outside of the tesseract :o)
Oct 15 2009
On Thu, 15 Oct 2009 07:15:45 -0400, Justin Johansson
<procode adam.com.after-dot-com-add-dot-au> wrote:

> - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).
> AFAIK, stack pushes are still more expensive than a pointer
> dereference in contemporary CPU architectures.

How do you propose to fix it? I think it is the minimal approach. You
need 4 bytes for the function pointer, and 4 bytes for the instance
data.

-Steve
Oct 15 2009
Two discoveries were made from this benchmark.

1) There is no appreciable speed difference between delegates and
functors. I re-ran the benchmark several times; sometimes one was
faster, sometimes the other - no clear advantage was discernible. The
visible differences can be blamed on experimental error. Feel free to
rerun it on a pure benchmarking machine.

2) The GC is slooooow (factor of 40!). No surprise there.

The code:

gentoo-pc ~ $ cat test.d; gdc-build test.d -o test_c -O3 -frelease
-march=nocona && ./test_c

module test;

import std.stdio;

struct Functor {
    void delegate() dg;
    void opCall() { dg(); }
}

void bench(I, C)(string name, I iters, C callable) {
    auto start = sec(); // sorry
    for (I l = 0; l < iters; ++l)
        static if (is(typeof(callable.opCall)))
            callable.opCall();
        else
            callable();
    auto taken = sec() - start;
    writefln(name, ": ", taken, "s, ", ((taken / iters) * 1000_000),
             " µs per call");
}

struct _test3 {
    void test() { }
    void opCall() {
        auto dg = new Functor;
        dg.dg = &test;
        dg.opCall();
    }
}

import tools.time;

void main() {
    auto dg1 = (){ }, dg2 = new Functor;
    dg2.dg = dg1;
    // spin up processor
    writefln("Warm-up");
    for (int k = 0; k < 1024*1024*256; ++k) {
        dg1();
        (*dg2)();
    }
    writefln("Begin benchmark");
    const ITERS = cast(long)(1024*1024*1024) * 4;
    bench("Method 1", ITERS, dg1);
    bench("Method 2", ITERS, dg2);
    _test3 test3; // Done this way to allow inlining
    bench("Method 3", ITERS / 256, test3);
}

gdc -J. test.d tools/time.d tools/log.d tools/compat.d tools/base.d
tools/smart_import.d tools/ctfe.d tools/tests.d tools/functional.d
-o test_c -O3 -frelease -march=nocona
Warm-up
Begin benchmark
Method 1: 20.5247s, 0.00477877 µs per call
Method 2: 19.6544s, 0.00457615 µs per call
Method 3: 2.86392s, 0.170703 µs per call
Oct 15 2009
On consideration, this wasn't a test of the two methods at all, but a test of the compiler's ability to inline. Disregard it.
Oct 15 2009
Justin Johansson wrote:
> - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).
> AFAIK, stack pushes are still more expensive than a pointer
> dereference in contemporary CPU architectures.
> Justin

Not so. On the 286 and earlier, stack pushes were more expensive.
They're the same on the 386 and later (including Core2, K7, K8, K10),
but you have a chance of a cache miss with a pointer deref. In my C++
experience I got a 25% speedup of my entire app by replacing heap
pointers with stack delegates!
Oct 15 2009