digitalmars.D - Performance
- Thomas (59/59) May 30 2014 I made the following performance test, which adds 10^9 Double’s
- Adam D. Ruppe (5/7) May 30 2014 I haven't actually run this but my guess is that the format
- anonymous (5/19) May 30 2014 [...]
- bearophile (43/53) May 30 2014 Your code written in a more idiomatic way (I have commented out
- bearophile (45/63) May 30 2014 And this is the 32 bit X86 asm generated by ldc2 for the plus
- bearophile (59/59) May 30 2014 This C++ code:
- Orvid King (9/75) May 30 2014 Well, I'd argue that in fact neither the C++ nor D code generated the
- Russel Winder via Digitalmars-d (17/39) May 30 2014 A priori I would believe there a problem with these numbers: my
- bearophile (5/8) May 30 2014 The C++ code I've shown above if compiled with -Ofast seems
- Russel Winder via Digitalmars-d (14/22) May 30 2014 I am assuming you are comparing C++/clang with D/ldc2, it is only
- Walter Bright (2/3) May 30 2014 Usually, the problem will be obvious from looking at the generated assem...
- David Nadlinger (4/5) May 30 2014 This effectively compiles the program without optimizations. Try
- Marco Leise (35/35) May 30 2014 Run this with: -O3 -frelease -fno-assert -fno-bounds-check -march=nati...
- Thomas (8/42) May 31 2014 Thank you for the help. Which OS is running on your notebook ?
- Marco Leise (13/22) May 31 2014 Gentoo Linux 64-bit. Aside from the 64-bit maybe, I can't make
- Thomas (6/26) Jun 02 2014 My PC is 5 years old. Of course I used your flags. Besides I am
- Marco Leise (15/21) Jun 03 2014 You posted a comparing benchmark between 3 languages providing
- John Colvin (3/5) Jun 03 2014 There's always the scheduler, swap etc. Not that they should have
- dennis luehring (12/71) May 30 2014 faulty benchmark
- Russel Winder via Digitalmars-d (14/29) May 30 2014 Indeed.
- dennis luehring (10/13) May 31 2014 average means average of benchmarked times
- dennis luehring (14/27) May 31 2014 so the anti-optimizer-overflowing-second-output aka AOOSO should be
- Andrei Alexandrescu (3/8) May 31 2014 No. Elapsed time in a benchmark does not follow a Student or Gaussian
- Russel Winder via Digitalmars-d (14/23) May 31 2014 We almost certainly need to unpack that more. I agree that behind my
- Andrei Alexandrescu (8/25) May 31 2014 Well there's quantization noise which has uniform distribution. Then all...
- Russel Winder via Digitalmars-d (10/17) May 31 2014 On Sat, 2014-05-31 at 10:29 -0700, Andrei Alexandrescu via Digitalmars-d
- Andrei Alexandrescu (3/14) May 31 2014 I don't know the idiom - what does it mean? Something nice I hope :o).
- Andrei Alexandrescu (6/21) May 31 2014 Found it: http://en.wikipedia.org/wiki/Taking_the_piss. Not sure how to
- Russel Winder via Digitalmars-d (16/20) May 31 2014 On Sat, 2014-05-31 at 14:45 -0700, Andrei Alexandrescu via Digitalmars-d
- John Colvin (5/18) May 31 2014 Well... It depends on what you're looking to do with the result.
- Andrei Alexandrescu (2/4) May 31 2014 Use the minimum unless networking is involved. -- Andrei
- Narendra Modi (3/8) May 31 2014 cache??
I made the following performance test, which adds 10^9 doubles, on Linux with the latest dmd compiler in the Eclipse IDE and with the GDC compiler, also on Linux. Then the same test was done with C++ on Linux and with Scala in the Java ecosystem on Linux. All the testing was done on the same PC. The results for one addition are:

D-DMD: 3.1 nanoseconds
D-GDC: 3.8 nanoseconds
C++: 1.0 nanoseconds
Scala: 1.0 nanoseconds

D source:

import std.stdio;
import std.datetime;
import std.string;
import core.time;

void main() {
    run!(plus)( 1000*1000*1000 );
}

class C { }

string plus( int steps ) {
    double sum = 1.346346;
    immutable double p0 = 0.0045;
    immutable double p1 = 1.00045452-p0;
    auto b = true;
    for( int i=0; i<steps; i++){
        switch( b ){
            case true : sum += p0; break;
            default: sum += p1; break;
        }
        b = !b;
    }
    return (format("%s %f","plus\nLast: ", sum) );
    // return ("plus\nLast: ", sum );
}

void run( alias func )( int steps )
if( is(typeof(func(steps)) == string)) {
    auto begin = Clock.currStdTime();
    string output = func( steps );
    auto end = Clock.currStdTime();
    double nanotime = toNanos(end-begin)/steps;
    writeln( output );
    writeln( "Time per op: " , nanotime );
    writeln( );
}

double toNanos( long hns ) {
    return hns*100.0;
}

Compiler settings for D:

dmd -c -of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o -release -inline -noboundscheck -O -w -version=Have_first -Isource source/perf/testperf.d

gdc ./source/perf/testperf.d -frelease -o testperf

So what is the problem? Are the compiler switches wrong? Or is D on the used compilers so slow? Can you help me?

Thomas
May 30 2014
On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:

return (format("%s %f","plus\nLast: ", sum) );

I haven't actually run this but my guess is that the format function is the slowish thing here. Did you create a new string in the C++ version too?

gdc ./source/perf/testperf.d -frelease -o testperf

The -O3 switch, which turns on optimizations, might help too.
May 30 2014
On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:I made the following performance test, which adds 10^9 Double’s on Linux with the latest dmd compiler in the Eclipse IDE and with the Gdc-Compiler also on Linux. Then the same test was done with C++ on Linux and with Scala in the Java ecosystem on Linux. All the testing was done on the same PC. The results for one addition are: D-DMD: 3.1 nanoseconds D-GDC: 3.8 nanoseconds C++: 1.0 nanoseconds Scala: 1.0 nanoseconds D-Source:[...]Compiler settings for D:[...]So what is the problem ? Are the compiler switches wrong ? Or is D on the used compilers so slow ? Can you help me.Sources and command lines for the other languages would be nice for comparison.
May 30 2014
On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:

I made the following performance test, which adds 10^9 doubles, on Linux with the latest dmd compiler in the Eclipse IDE and with the GDC compiler, also on Linux. Then the same test was done with C++ on Linux and with Scala in the Java ecosystem on Linux. All the testing was done on the same PC. The results for one addition are:
D-DMD: 3.1 nanoseconds
D-GDC: 3.8 nanoseconds
C++: 1.0 nanoseconds
Scala: 1.0 nanoseconds

Your code written in a more idiomatic way (I have commented out new language features):

import std.stdio, std.datetime;

double plus(in uint nSteps) pure nothrow @safe /*@nogc*/ {
    enum double p0 = 0.0045;
    enum double p1 = 1.00045452 - p0;
    double tot = 1.346346;
    auto b = true;
    foreach (immutable i; 0 .. nSteps) {
        final switch (b) {
            case true: tot += p0; break;
            case false: tot += p1; break;
        }
        b = !b;
    }
    return tot;
}

void run(alias func, string funcName)(in uint nSteps) {
    StopWatch sw;
    sw.start;
    immutable result = func(nSteps);
    sw.stop;
    writeln(funcName);
    writefln("Last: %f", result);
    //writeln("Time per op: ", sw.peek.nsecs / real(nSteps));
    writeln("Time per op: ", sw.peek.nsecs / cast(real)nSteps);
}

void main() {
    run!(plus, "plus")(1_000_000_000U);
}

(But there is also a benchmark helper around.)

ldmd2 -O -release -inline -noboundscheck test.d

Using the LDC2 compiler, on my system the output is:

plus
Last: 500227252.496398
Time per op: 9.41424

Bye,
bearophile
May 30 2014
double plus(in uint nSteps) pure nothrow safe /* nogc*/ { enum double p0 = 0.0045; enum double p1 = 1.00045452-p0; double tot = 1.346346; auto b = true; foreach (immutable i; 0 .. nSteps) { final switch (b) { case true: tot += p0; break; case false: tot += p1; break; } b = !b; } return tot; }And this is the 32 bit X86 asm generated by ldc2 for the plus function: __D4test4plusFNaNbNfxkZd: pushl %ebp movl %esp, %ebp pushl %esi andl $-8, %esp subl $24, %esp movsd LCPI0_0, %xmm0 testl %eax, %eax je LBB0_8 xorl %ecx, %ecx movb $1, %dl movsd LCPI0_1, %xmm1 movsd LCPI0_2, %xmm2 .align 16, 0x90 LBB0_2: testb $1, %dl jne LBB0_3 addsd %xmm1, %xmm0 jmp LBB0_7 .align 16, 0x90 LBB0_3: movzbl %dl, %esi andl $1, %esi je LBB0_5 addsd %xmm2, %xmm0 LBB0_7: xorb $1, %dl incl %ecx cmpl %eax, %ecx jb LBB0_2 LBB0_8: movsd %xmm0, 8(%esp) fldl 8(%esp) leal -4(%ebp), %esp popl %esi popl %ebp ret LBB0_5: movl $11, 4(%esp) movl $__D4test12__ModuleInfoZ, (%esp) calll __d_switch_error Bye, bearophile
May 30 2014
This C++ code: double plus(const unsigned int nSteps) { const double p0 = 0.0045; const double p1 = 1.00045452-p0; double tot = 1.346346; bool b = true; for (unsigned int i = 0; i < nSteps; i++) { switch (b) { case true: tot += p0; break; case false: tot += p1; break; } b = !b; } return tot; } G++ 4.8.0 gives the asm (using -Ofast, that implies unsafe FP optimizations): __Z4plusj: movl 4(%esp), %ecx testl %ecx, %ecx je L7 fldl LC0 xorl %edx, %edx movl $1, %eax fldl LC2 jmp L6 .p2align 4,,7 L11: fxch %st(1) addl $1, %edx xorl $1, %eax cmpl %ecx, %edx faddl LC1 je L12 fxch %st(1) L6: cmpb $1, %al je L11 addl $1, %edx xorl $1, %eax cmpl %ecx, %edx fadd %st, %st(1) jne L6 fstp %st(0) jmp L10 .p2align 4,,7 L12: fstp %st(1) L10: rep ret L7: fldl LC0 ret Bye, bearophile
May 30 2014
On 5/30/2014 9:30 AM, bearophile wrote:Well, I'd argue that in fact neither the C++ nor D code generated the fastest possible code here, as this code will result in at least 3, likely more, potentially even every, branch being mispredicted. I would argue, after checking the throughput numbers for fadd (only checked haswell), that the fastest code here would actually compute both sides of the branch and use a set of 4 cmov's (due to the fact it's x86 and we're working with doubles) to determine which one is the one we need to use going forward.double plus(in uint nSteps) pure nothrow safe /* nogc*/ { enum double p0 = 0.0045; enum double p1 = 1.00045452-p0; double tot = 1.346346; auto b = true; foreach (immutable i; 0 .. nSteps) { final switch (b) { case true: tot += p0; break; case false: tot += p1; break; } b = !b; } return tot; }And this is the 32 bit X86 asm generated by ldc2 for the plus function: __D4test4plusFNaNbNfxkZd: pushl %ebp movl %esp, %ebp pushl %esi andl $-8, %esp subl $24, %esp movsd LCPI0_0, %xmm0 testl %eax, %eax je LBB0_8 xorl %ecx, %ecx movb $1, %dl movsd LCPI0_1, %xmm1 movsd LCPI0_2, %xmm2 .align 16, 0x90 LBB0_2: testb $1, %dl jne LBB0_3 addsd %xmm1, %xmm0 jmp LBB0_7 .align 16, 0x90 LBB0_3: movzbl %dl, %esi andl $1, %esi je LBB0_5 addsd %xmm2, %xmm0 LBB0_7: xorb $1, %dl incl %ecx cmpl %eax, %ecx jb LBB0_2 LBB0_8: movsd %xmm0, 8(%esp) fldl 8(%esp) leal -4(%ebp), %esp popl %esi popl %ebp ret LBB0_5: movl $11, 4(%esp) movl $__D4test12__ModuleInfoZ, (%esp) calll __d_switch_error Bye, bearophile
May 30 2014
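For illustration, a branch-free variant of that loop (a sketch, not benchmarked here) replaces the switch with an arithmetic select, which a compiler can lower to a cmov or, since b merely alternates, unroll into pairs of iterations that each add exactly p0 + p1:

```cpp
// Branch-free sketch: pick the addend by indexing instead of branching.
// Because b simply alternates, every pair of iterations adds p0 + p1,
// which is what an optimizer can exploit once the branch is gone.
double plus_branchless(unsigned steps) {
    const double p0 = 0.0045;
    const double p1 = 1.00045452 - p0;
    const double addend[2] = { p1, p0 };  // index 1 corresponds to b == true
    double sum = 1.346346;
    unsigned b = 1;                       // starts "true", as in the original
    for (unsigned i = 0; i < steps; ++i) {
        sum += addend[b];                 // no data-dependent jump
        b ^= 1;
    }
    return sum;
}
```

Whether this actually beats the switch depends on the target and compiler; it is offered only as the kind of transformation being discussed.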
On Fri, 2014-05-30 at 13:35 +0000, Thomas via Digitalmars-d wrote:

I made the following performance test, which adds 10^9 doubles, on Linux with the latest dmd compiler in the Eclipse IDE and with the GDC compiler, also on Linux. Then the same test was done with C++ on Linux and with Scala in the Java ecosystem on Linux. All the testing was done on the same PC. The results for one addition are:
D-DMD: 3.1 nanoseconds
D-GDC: 3.8 nanoseconds
C++: 1.0 nanoseconds
Scala: 1.0 nanoseconds

A priori I would believe there is a problem with these numbers: my experience of CPU-bound D code is that it is generally as fast as C++.

[…]

Compiler settings for D:
dmd -c -of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o -release -inline -noboundscheck -O -w -version=Have_first -Isource source/perf/testperf.d
gdc ./source/perf/testperf.d -frelease -o testperf
So what is the problem ? Are the compiler switches wrong ? Or is D on the used compilers so slow ? Can you help me.

What is the C++ code you compare against? What is the Scala code you compare against? Did you try Java and static Groovy as well? What command lines did you use for the generation of all the binaries? Without the data it is hard to compare and help.

One obvious thing though: the gdc command line has no optimization turned on; you probably want -O3 or at least -O2 there.

-- 
Russel.
=============================================================================
Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net
41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk
London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
May 30 2014
Russel Winder:A priori I would believe there a problem with these numbers: my experience of CPU-bound D code is that it is generally as fast as C++.The C++ code I've shown above if compiled with -Ofast seems faster than the D code compiled with ldc2. Bye, bearophile
May 30 2014
On Fri, 2014-05-30 at 19:58 +0000, bearophile via Digitalmars-d wrote:Russel Winder:I am assuming you are comparing C++/clang with D/ldc2, it is only reasonable to compare C++/g++ with D/gdc. I am not sure about other compilers. Of course there is then the question of whether C++/clang is better/worse than C++/g++. Lots of fun experimentation and data analysis to be had here, if only there were microbenchmarking frameworks for C++ as well as D ;-) -- Russel. ============================================================================= Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winderA priori I would believe there a problem with these numbers: my experience of CPU-bound D code is that it is generally as fast as C++.The C++ code I've shown above if compiled with -Ofast seems faster than the D code compiled with ldc2.
May 30 2014
On 5/30/2014 6:35 AM, Thomas wrote:So what is the problem ?Usually, the problem will be obvious from looking at the generated assembler.
May 30 2014
On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:gdc ./source/perf/testperf.d -frelease -o testperfThis effectively compiles the program without optimizations. Try -O3 or -Ofast. David
May 30 2014
Run this with: -O3 -frelease -fno-assert -fno-bounds-check -march=native

This way GCC and LLVM will recognize that you alternately add p0 and p1 to the sum and partially unroll the loop, thereby removing the condition. It takes 1.4xxxx nanoseconds per step on my not so new 2.0 GHz notebook, so I assume your PC will easily reach parity with your original C++ version.

import std.stdio;
import core.time;

alias ℕ = size_t;

void main()
{
    run!plus(1_000_000_000);
}

double plus(ℕ steps)
{
    enum p0 = 0.0045;
    enum p1 = 1.00045452 - p0;
    double sum = 1.346346;
    foreach (i; 0 .. steps)
        sum += i%2 ? p1 : p0;
    return sum;
}

void run(alias func)(ℕ steps)
{
    auto t1 = TickDuration.currSystemTick;
    auto output = func(steps);
    auto t2 = TickDuration.currSystemTick;
    auto nanotime = 1_000_000_000.0 / steps * (t2 - t1).length / TickDuration.ticksPerSec;
    writefln("Last: %s", output);
    writefln("Time per op: %s", nanotime);
    writeln();
}

-- 
Marco
May 30 2014
On Saturday, 31 May 2014 at 05:12:54 UTC, Marco Leise wrote:Run this with: -O3 -frelease -fno-assert -fno-bounds-check -march=native This way GCC and LLVM will recognize that you alternately add p0 and p1 to the sum and partially unroll the loop, thereby removing the condition. It takes 1.4xxxx nanoseconds per step on my not so new 2.0 Ghz notebook, so I assume your PC will easily reach parity with your original C++ version. import std.stdio; import core.time; alias ℕ = size_t; void main() { run!plus(1_000_000_000); } double plus(ℕ steps) { enum p0 = 0.0045; enum p1 = 1.00045452 - p0; double sum = 1.346346; foreach (i; 0 .. steps) sum += i%2 ? p1 : p0; return sum; } void run(alias func)(ℕ steps) { auto t1 = TickDuration.currSystemTick; auto output = func(steps); auto t2 = TickDuration.currSystemTick; auto nanotime = 1_000_000_000.0 / steps * (t2 - t1).length / TickDuration.ticksPerSec; writefln("Last: %s", output); writefln("Time per op: %s", nanotime); writeln(); }Thank you for the help. Which OS is running on your notebook ? For I compiled your source code with your settings with the GCC compiler. The run took 3.1xxxx nanoseconds per step. For the DMD compiler the run took 5.xxxx nanoseconds. So I think the problem could be specific to the linux versions of the GCC and the DMD compilers. Thomas
May 31 2014
Am Sat, 31 May 2014 17:44:23 +0000 schrieb "Thomas" <t.leichner arcor.de>:Thank you for the help. Which OS is running on your notebook ? For I compiled your source code with your settings with the GCC compiler. The run took 3.1xxxx nanoseconds per step. For the DMD compiler the run took 5.xxxx nanoseconds. So I think the problem could be specific to the linux versions of the GCC and the DMD compilers. ThomasGentoo Linux 64-bit. Aside from the 64-bit maybe, I can't make out a good reason why the runtime should depend on the OS so much. Are you sure you don't run on a PC from 2000 and did you use the compiler flags I gave on top of my post? Did you disable CPU power saving and was no other process running at the same time? By the way I get very similar results when using the LDC compiler. -- Marco
May 31 2014
On Sunday, 1 June 2014 at 03:33:36 UTC, Marco Leise wrote:

Am Sat, 31 May 2014 17:44:23 +0000 schrieb "Thomas" <t.leichner arcor.de>:

Thank you for the help. Which OS is running on your notebook ? For I compiled your source code with your settings with the GCC compiler. The run took 3.1xxxx nanoseconds per step. For the DMD compiler the run took 5.xxxx nanoseconds. So I think the problem could be specific to the linux versions of the GCC and the DMD compilers. Thomas

Gentoo Linux 64-bit. Aside from the 64-bit maybe, I can't make out a good reason why the runtime should depend on the OS so much. Are you sure you don't run on a PC from 2000 and did you use the compiler flags I gave on top of my post? Did you disable CPU power saving and was no other process running at the same time? By the way I get very similar results when using the LDC compiler.

My PC is 5 years old. Of course I used your flags. Besides, I am not an idiot; I have been programming for 20 years and have used 6 different programming languages. I didn't post that just for fun, for I am evaluating D as a language for numerical programming.

Thomas
Jun 02 2014
Am Mon, 02 Jun 2014 10:57:24 +0000 schrieb "Thomas" <t.leichner arcor.de>:

My PC is 5 years old. Of course I used your flags. Besides I am not an idiot, I am programming since 20 years and used 6 different programming languages. I did't post that just for fun, for I am evaluating D as language for numerical programming. Thomas

You posted a comparing benchmark between 3 languages providing only the source code for one and didn't even run an optimized compile. That had me thinking. :)

Back on topic: Any chance we can see the C++ code so we can compare more directly? It's hard to compare the numbers only for the D version when everyone has different system specs. Also you say your PC is 5 years old. Is your system 32-bit then? That would certainly affect the efficiency of loading and storing 64-bit floating point values and might be a clue in the right direction. I don't want to believe that the OS has an effect on a loop that doesn't make any calls to the OS.

-- 
Marco
Jun 03 2014
On Tuesday, 3 June 2014 at 11:25:31 UTC, Marco Leise wrote:I don't want to believe that the OS has an effect on a loop that doesn't make any calls to the OS.There's always the scheduler, swap etc. Not that they should have any effect on *this* benchmark of course.
Jun 03 2014
faulty benchmark

-do not benchmark "format"
-use a dummy var: just add (overflow is not a problem) your plus() results to it and return that in your main, preventing dead-code optimization in any way
-introduce some sort of random value into your plus() code, for example use a random generator or the int-casted pointer to the program args as startup value
-do not benchmark anything without millions of loops; use the average as the result

anything else does not make sense

Am 30.05.2014 15:35, schrieb Thomas:

I made the following performance test, which adds 10^9 doubles, on Linux with the latest dmd compiler in the Eclipse IDE and with the GDC compiler, also on Linux. Then the same test was done with C++ on Linux and with Scala in the Java ecosystem on Linux. All the testing was done on the same PC. The results for one addition are:
D-DMD: 3.1 nanoseconds
D-GDC: 3.8 nanoseconds
C++: 1.0 nanoseconds
Scala: 1.0 nanoseconds

D source:

import std.stdio;
import std.datetime;
import std.string;
import core.time;

void main() {
    run!(plus)( 1000*1000*1000 );
}

class C { }

string plus( int steps ) {
    double sum = 1.346346;
    immutable double p0 = 0.0045;
    immutable double p1 = 1.00045452-p0;
    auto b = true;
    for( int i=0; i<steps; i++){
        switch( b ){
            case true : sum += p0; break;
            default: sum += p1; break;
        }
        b = !b;
    }
    return (format("%s %f","plus\nLast: ", sum) );
    // return ("plus\nLast: ", sum );
}

void run( alias func )( int steps )
if( is(typeof(func(steps)) == string)) {
    auto begin = Clock.currStdTime();
    string output = func( steps );
    auto end = Clock.currStdTime();
    double nanotime = toNanos(end-begin)/steps;
    writeln( output );
    writeln( "Time per op: " , nanotime );
    writeln( );
}

double toNanos( long hns ) {
    return hns*100.0;
}

Compiler settings for D:

dmd -c -of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o -release -inline -noboundscheck -O -w -version=Have_first -Isource source/perf/testperf.d

gdc ./source/perf/testperf.d -frelease -o testperf

So what is the problem ? Are the compiler switches wrong ? Or is D on the used compilers so slow ? Can you help me. Thomas
May 30 2014
On Sat, 2014-05-31 at 07:32 +0200, dennis luehring via Digitalmars-d wrote:

faulty benchmark

Indeed.

-do not benchmark "format" -use a dummy-var - just add(overflow is not a problem) your plus() results to it and return that in your main - preventing dead code optimization in any way -introduce some sort of random-value into your plus() code, for example use an random-generator or the int-casted pointer to program args as startup value -do not benchmark anything without millions of loops - use the average as the result anything else does not makes sense

As well as the average (mean), you must provide the standard deviation and degrees of freedom so that a proper error analysis and t-tests are feasible. Or put it another way: even if you quote a mean, without knowing how many are in the sample and what the spread is you cannot judge the error and so cannot make deductions or inferences.

-- 
Russel.
=============================================================================
Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net
41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk
London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
May 30 2014
Am 31.05.2014 08:36, schrieb Russel Winder via Digitalmars-d:

As well as the average (mean), you must provide standard deviation and degrees of freedom so that a proper error analysis and t-tests are feasible.

average means average of benchmarked times, and the dummy values are only for keeping the compiler from removing anything it can reduce at compile time - that makes benchmarks comparable. These values do not change the algorithm or result quality in any way - it's more like an overflowing second output based on the result of the original algorithm (but should be just a simple addition or subtraction - ignoring overflow etc.)

that's the base of all types of non-stupid benchmarking - the next/pro step is to look at the resulting assembler code
May 31 2014
Am 31.05.2014 13:25, schrieb dennis luehring:

Am 31.05.2014 08:36, schrieb Russel Winder via Digitalmars-d:

As well as the average (mean), you must provide standard deviation and degrees of freedom so that a proper error analysis and t-tests are feasible.

average means average of benchmarked times and the dummy values are only for keeping the compiler from removing anything it can reduce at compiletime - that makes benchmarks compareable, these values does not change the algorithm or result quality an any way - its more like an overflowing-second-output bases on the result of the original algorithm (but should be just a simple addition or substraction - ignoring overflow etc.) thats the base of all types of non-stupid benchmarking - next/pro step is to look at the resulting assemblercode

so the anti-optimizer-overflowing-second-output aka AOOSO should be initialized outside of the test function with a random value - I normally use the pointer to the main args as an int. The AOOSO should be incremented by the needed result of the benchmarked algorithm - that could be an int-casted float/double value, the varying size of a string or whatever is floaty and needed enough to be used - and then return the AOOSO as the main return. So the original algorithm isn't changed, but the compiler has absolutely nothing to let it prevent the usage and the end output of this AOOSO dummy value.

yes, it ignores that the code size (cache problems) is changed by the AOOSO incrementation - that's the reason for the simple casting/overflowing integer stuff here, but if the benchmarking goes that deep you should better take a look at the assembler level
May 31 2014
On 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:As well as the average (mean), you must provide standard deviation and degrees of freedom so that a proper error analysis and t-tests are feasible. Or put it another way: even if you quote a mean with knowing how many in the sample and what the spread is you cannot judge the error and so cannot make deductions or inferences.No. Elapsed time in a benchmark does not follow a Student or Gaussian distribution. Use the mode or (better) the minimum. -- Andrei
May 31 2014
On Sat, 2014-05-31 at 07:02 -0700, Andrei Alexandrescu via Digitalmars-d wrote:On 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:We almost certainly need to unpack that more. I agree that behind my comment was an implicit assumption of a normal distribution of results. This is an easy assumption to make even if it is wrong. So is it provably wrong? What is the distribution? If we know that then there is knowledge of the parameters which then allow for statistical inference and deduction. -- Russel. ============================================================================= Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winderAs well as the average (mean), you must provide standard deviation and degrees of freedom so that a proper error analysis and t-tests are feasible. Or put it another way: even if you quote a mean with knowing how many in the sample and what the spread is you cannot judge the error and so cannot make deductions or inferences.No. Elapsed time in a benchmark does not follow a Student or Gaussian distribution. Use the mode or (better) the minimum. -- Andrei
May 31 2014
On 5/31/14, 7:10 AM, Russel Winder via Digitalmars-d wrote:On Sat, 2014-05-31 at 07:02 -0700, Andrei Alexandrescu via Digitalmars-d wrote:Well there's quantization noise which has uniform distribution. Then all other sources of noise are additive (no noise may make code run faster). So I speculate that the pdf is a half Gaussian mixed with a uniform distribution. Taking the mode (which is very close to the minimum in my measurements) would be the most accurate way to go. Taking the average would end up in some weird point on the half-Gaussian slope. AndreiOn 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:We almost certainly need to unpack that more. I agree that behind my comment was an implicit assumption of a normal distribution of results. This is an easy assumption to make even if it is wrong. So is it provably wrong? What is the distribution? If we know that then there is knowledge of the parameters which then allow for statistical inference and deduction.As well as the average (mean), you must provide standard deviation and degrees of freedom so that a proper error analysis and t-tests are feasible. Or put it another way: even if you quote a mean with knowing how many in the sample and what the spread is you cannot judge the error and so cannot make deductions or inferences.No. Elapsed time in a benchmark does not follow a Student or Gaussian distribution. Use the mode or (better) the minimum. -- Andrei
May 31 2014
On Sat, 2014-05-31 at 10:29 -0700, Andrei Alexandrescu via Digitalmars-d wrote: […]Well there's quantization noise which has uniform distribution. Then all other sources of noise are additive (no noise may make code run faster). So I speculate that the pdf is a half Gaussian mixed with a uniform distribution. Taking the mode (which is very close to the minimum in my measurements) would be the most accurate way to go. Taking the average would end up in some weird point on the half-Gaussian slope.I sense you are taking the piss. -- Russel. ============================================================================= Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
May 31 2014
On 5/31/14, 11:49 AM, Russel Winder via Digitalmars-d wrote:On Sat, 2014-05-31 at 10:29 -0700, Andrei Alexandrescu via Digitalmars-d wrote: […]I don't know the idiom - what does it mean? Something nice I hope :o). -- AndreiWell there's quantization noise which has uniform distribution. Then all other sources of noise are additive (no noise may make code run faster). So I speculate that the pdf is a half Gaussian mixed with a uniform distribution. Taking the mode (which is very close to the minimum in my measurements) would be the most accurate way to go. Taking the average would end up in some weird point on the half-Gaussian slope.I sense you are taking the piss.
May 31 2014
On 5/31/14, 2:42 PM, Andrei Alexandrescu wrote:On 5/31/14, 11:49 AM, Russel Winder via Digitalmars-d wrote:Found it: http://en.wikipedia.org/wiki/Taking_the_piss. Not sure how to take it in context; I am being serious, and basing myself on measurements taken while designing and implementing https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md. AndreiOn Sat, 2014-05-31 at 10:29 -0700, Andrei Alexandrescu via Digitalmars-d wrote: […]I don't know the idiom - what does it mean? Something nice I hope :o). -- AndreiWell there's quantization noise which has uniform distribution. Then all other sources of noise are additive (no noise may make code run faster). So I speculate that the pdf is a half Gaussian mixed with a uniform distribution. Taking the mode (which is very close to the minimum in my measurements) would be the most accurate way to go. Taking the average would end up in some weird point on the half-Gaussian slope.I sense you are taking the piss.
May 31 2014
On Sat, 2014-05-31 at 14:45 -0700, Andrei Alexandrescu via Digitalmars-d wrote:

[…]

Found it: http://en.wikipedia.org/wiki/Taking_the_piss. Not sure how to take it in context; I am being serious, and basing myself on measurements taken while designing and implementing https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md.

My apologies for being abrupt and ill-considered and hence potentially rude. Long story. I'll cogitate on the ideas this morning and see what I can chip in constructively to take things along. I will also ask Aleksey Shipilëv what underpinnings he is using for JMH to see if there is some useful cross-fertilization.

-- 
Russel.
=============================================================================
Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net
41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk
London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
May 31 2014
On Saturday, 31 May 2014 at 14:01:52 UTC, Andrei Alexandrescu wrote:On 5/30/14, 11:36 PM, Russel Winder via Digitalmars-d wrote:Well... It depends on what you're looking to do with the result. As you say though, micro-benchmarks of code-quality should always be judged on the minimum of a large sample.As well as the average (mean), you must provide standard deviation and degrees of freedom so that a proper error analysis and t-tests are feasible. Or put it another way: even if you quote a mean with knowing how many in the sample and what the spread is you cannot judge the error and so cannot make deductions or inferences.No. Elapsed time in a benchmark does not follow a Student or Gaussian distribution. Use the mode or (better) the minimum. -- Andrei
May 31 2014
On 5/30/14, 10:32 PM, dennis luehring wrote:-do not benchmark anything without millions of loops - use the average as the resultUse the minimum unless networking is involved. -- Andrei
May 31 2014
On Saturday, 31 May 2014 at 13:59:40 UTC, Andrei Alexandrescu wrote:On 5/30/14, 10:32 PM, dennis luehring wrote:cache??-do not benchmark anything without millions of loops - use the average as the resultUse the minimum unless networking is involved. -- Andrei
May 31 2014