digitalmars.D - DMD floating point performance.
- Dave (31/31) Nov 11 2006 Ok this is trivial, but it probably fairly represents what many use floa...
- Lionello Lunesu (3/3) Nov 12 2006 I'm not sure if it would change anything, but for a full release build,
- Dave (2/8) Nov 12 2006 It wouldn't make a diff. for these tests, so I avoided the clutter.
- Walter Bright (9/12) Nov 12 2006 The issue isn't with D, it's with the back end code generator. You'll
- Dave (87/101) Nov 12 2006 I know this is simplistic, but could something like fxch be used to 'mim...
- Walter Bright (2/5) Nov 12 2006 Yes, I just haven't done the work to figure it out.
- Don Clugston (5/11) Nov 12 2006 The definitive article on this is Agner Fog's optimisation manual at
- Walter Bright (3/7) Nov 13 2006 Thanks, it's a great reference I didn't know existed. But I couldn't
- Dave (2/10) Nov 15 2006 Ran across this: http://www.smlnj.org/compiler-notes/x86-fp.ps
- Knud Sørensen (22/22) Nov 13 2006 I think it is possible to systematic search
- Knud Sørensen (9/37) Nov 15 2006 Does anyone know of tools which might be used to hack up a deoptimizer li...
Ok this is trivial, but it probably fairly represents what many use floating point for day-to-day (simple math on simple data):

import std.stdio, std.date;

void main()
{
    d_time s = getUTCtime;
    double sum = 1.0, val = 1.000001;
    for(size_t i = 0; i < 10_000_000; i++)
    {
        sum += val;
        sum -= val;
        sum *= val;
        sum /= val;
    }
    d_time e = getUTCtime;
    writefln(sum,": ",(e-s)/cast(real)TicksPerSecond," secs");
}

1: 0.282 secs  (GDC)
1: 0.683 secs  (DMD)

A difference of 2.5x?! If you look at the DMD asm, the problem is that each operation is wrapped by a load/store. Why wouldn't val and sum be kept in fp registers inside the loop?

Another couple of examples:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=all
http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=all

I mean I'd like to standardize on DMD; a) it doesn't come with any baggage on either Windows or Linux and b) it has been more consistently maintained as of late, but things like this make the decision a lot tougher.

Thanks,

- Dave
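For concreteness, the inner loop DMD emits looks roughly like this (a hand-written sketch of the load/op/store pattern, not verbatim compiler output; sum and val stand for the locals' stack slots):

    fld  qword ptr sum;    // load sum onto the x87 stack
    fadd qword ptr val;    // sum + val
    fstp qword ptr sum;    // store back and pop
    fld  qword ptr sum;    // immediately reload it for the next operation
    fsub qword ptr val;    // sum - val
    fstp qword ptr sum;
    // ...and likewise for the fmul/fdiv steps

Every arithmetic instruction is bracketed by a load and a store to the same memory slot, which is where the extra time goes.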
Nov 11 2006
I'm not sure if it would change anything, but for a full release build, you'd also want -release and -inline. L.
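For reference, the full optimized build would be something like this (assuming the source file is named fptest.d):

    dmd -O -release -inline fptest.d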
Nov 12 2006
Lionello Lunesu wrote:
> I'm not sure if it would change anything, but for a full release build,
> you'd also want -release and -inline.

It wouldn't make a diff. for these tests, so I avoided the clutter.
Nov 12 2006
Dave wrote:
> If you look at the DMD asm, the problem is that each operation is wrapped
> by a load/store.. Why wouldn't val and sum be kept in fp registers inside
> the loop?

The issue isn't with D, it's with the back end code generator. You'll see the same thing with the C and C++ compiler.

Because the FPU is a 'stack' machine rather than a register machine, it's hard to write a code generator that enregisters floating point variables. It's also problematical because every function call must empty the FPU stack anyway, and so a lot of register spill logic must be in place for it to be very useful.

Not impossible, just a lot of tricky work.

How does GDC do with this?
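To make the trade-off concrete, here is a hand-written sketch (not actual compiler output) of the difference for two doubles a and b that are live across a loop body containing a call; somefunc is just a placeholder name:

    ; what the back end does today: every value lives in its memory slot
    fld  qword ptr a;        // load
    fmul qword ptr b;        // a * b
    fstp qword ptr a;        // store back; the x87 stack is empty again

    ; what enregistering would mean: a stays in ST(0), b in ST(1)
    fmul ST, ST(1);          // a *= b, no memory traffic at all

    ; ...but around any call both values must be spilled, because the
    ; callee is entitled to the whole (empty) x87 stack:
    fstp qword ptr a;        // spill a
    fstp qword ptr b;        // spill b
    call somefunc;           // placeholder call
    fld  qword ptr b;        // reload b into ST(1)
    fld  qword ptr a;        // reload a into ST(0)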
Nov 12 2006
Walter Bright wrote:
> Dave wrote:
>> If you look at the DMD asm, the problem is that each operation is wrapped
>> by a load/store.. Why wouldn't val and sum be kept in fp registers inside
>> the loop?
>
> The issue isn't with D, it's with the back end code generator. You'll see
> the same thing with the C and C++ compiler. Because the FPU is a 'stack'
> machine rather than a register machine, it's hard to write a code generator
> that enregisters floating point variables.

I know this is simplistic, but could something like fxch be used to 'mimic' a register machine as vars are stored into registers? fxch is supposed to be very efficient.

> It's also problematical because every function call must empty the FPU stack
> anyway, and so a lot of register spill logic must be in place for it to be
> very useful.

Could ffree greatly simplify things here?

> Not impossible, just a lot of tricky work. How does GDC do with this?

The original post was comparing DMD vs. GDC. Here's some code showing the relative asm output:

dmd = 1: 0.673 secs
dmd asm = 1: 0.682 secs
opt asm = 1: 0.272 secs

;---

import std.stdio, std.date;

void main()
{
    d_time s = getUTCtime;
    double sum = fp;
    d_time e = getUTCtime;
    writefln("dmd = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");

    s = getUTCtime;
    sum = fp_dmd_asm;
    e = getUTCtime;
    writefln("dmd asm = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");

    s = getUTCtime;
    sum = fp_dmd_opt;
    e = getUTCtime;
    writefln("opt asm = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");
}

double fp()
{
    double sum = 1.0, val = 0.000001;
    for(size_t i = 0; i < 10_000_000; i++)
    {
        sum += val;
        sum -= val;
        sum *= val;
        sum /= val;
    }
    return sum;
}

double _sum = 1.0, _val = 0.000001;

double fp_dmd_asm() // more or less what dmd generates
{
    asm
    {
        fld     qword ptr _sum[0];
        fstp    qword ptr -8[EBP];
        xor     EAX,EAX;
L11:
        fld     qword ptr _val[0];
        fadd    qword ptr -8[EBP];
        fstp    qword ptr -8[EBP];
        fld     qword ptr _val[0];
        fsubr   qword ptr -8[EBP];
        fstp    qword ptr -8[EBP];
        fld     qword ptr _val[0];
        fmul    qword ptr -8[EBP];
        fstp    qword ptr -8[EBP];
        fld     qword ptr _val[0];
        fdivr   qword ptr -8[EBP];
        fstp    qword ptr -8[EBP];
        inc     EAX;
        cmp     EAX,10_000_000;
        jb      L11;
        fld     qword ptr -8[EBP];
        fstp    _sum[0];
    }
    return _sum;
}

double fp_dmd_opt()
{
    double sum = 1.0, val = 0.000001;
    asm
    {
        fld     qword ptr sum[0];
        fld     qword ptr val[0];
        fxch    ST(1);
        xor     EAX,EAX;
L1:
        fadd    ST, ST(1);
        fsubr   ST, ST(1);
        fmul    ST, ST(1);
        fdivr   ST, ST(1);
        inc     EAX;
        cmp     EAX,10_000_000;
        jb      L1;
        fstp    sum[0];
        ffree   ST(1);
    }
    return sum;
}
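A sketch of the fxch idea (the general technique, not anything DMD currently does): treat ST(0)..ST(7) as eight nameable registers, and whenever the value you need isn't on top, swap it up, operate, and swap it back. On PPro and later cores fxch is handled in register renaming and costs essentially nothing.

    ; say the allocator keeps x in "register" ST(3) and y in ST(1)
    fxch ST(3);          // bring x to the top of the stack
    fadd ST, ST(1);      // x += y
    fxch ST(3);          // put x back where the allocator expects it

ffree (or fstp ST(0)) would mostly come into play for discarding a value whose "register" has died, e.g. when cleaning up the stack before a return or a call.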
Nov 12 2006
Dave wrote:
> I know this is simplistic, but could something like fxch be used to 'mimic'
> a register machine as vars are stored into registers? fxch is supposed to
> be very efficient.

Yes, I just haven't done the work to figure it out.
Nov 12 2006
Walter Bright wrote:
> Dave wrote:
>> I know this is simplistic, but could something like fxch be used to 'mimic'
>> a register machine as vars are stored into registers? fxch is supposed to
>> be very efficient.
>
> Yes, I just haven't done the work to figure it out.

The definitive article on this is Agner Fog's optimisation manual at www.agner.org. It's fantastically difficult to do x87 perfectly (and this is probably the main appeal of SSE), but I'm sure there are massive gains to be had relatively cheaply.
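For comparison, the same inner loop in SSE2 scalar-double form needs no stack juggling at all, because XMM0..XMM7 are flat registers (a hand-written sketch, assuming a CPU with SSE2; sum is kept in XMM0 and val in XMM1):

    L1:
        addsd XMM0, XMM1;      // sum += val
        subsd XMM0, XMM1;      // sum -= val
        mulsd XMM0, XMM1;      // sum *= val
        divsd XMM0, XMM1;      // sum /= val
        inc   EAX;
        cmp   EAX, 10_000_000;
        jb    L1;

No fxch, no fld/fstp pairs, and nothing special has to happen around calls beyond the usual caller-save rules.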
Nov 12 2006
Don Clugston wrote:
> The definitive article on this is Agner Fog's optimisation manual at
> www.agner.org. It's fantastically difficult to do x87 perfectly (and this
> is probably the main appeal of SSE), but I'm sure there are massive gains
> to be had relatively cheaply.

Thanks, it's a great reference I didn't know existed. But I couldn't find anything specific about enregistering ST register algorithms.
Nov 13 2006
Walter Bright wrote:
> Don Clugston wrote:
>> The definitive article on this is Agner Fog's optimisation manual at
>> www.agner.org. It's fantastically difficult to do x87 perfectly (and this
>> is probably the main appeal of SSE), but I'm sure there are massive gains
>> to be had relatively cheaply.
>
> Thanks, it's a great reference I didn't know existed. But I couldn't find
> anything specific about enregistering ST register algorithms.

Ran across this: http://www.smlnj.org/compiler-notes/x86-fp.ps
Nov 15 2006
I think it is possible to systematically search for cases where dmd doesn't make optimal code. How this could be done is described at http://www.oonumerics.org/lunar/

The idea is that one starts with a simple program:

int main()
{
    return 0;
}

Then one deoptimizes it through one or more deoptimizing steps, for example:

int main()
{
    int i = 0;
    if (false) { i = 10; }
    return i;
}

Finally one runs the deoptimized program through the compiler and compares the result to the result of the starting program.

Now we just need someone to write a deoptimizer for D.
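A minimal sketch of what a single deoptimizing step could look like if programs are represented as plain source text, written against the D of the time (char[] strings); the function name is made up for illustration. One would feed both versions through "dmd -O -release -inline" and compare the generated code (e.g. with obj2asm) or just the timings.

import std.stdio;

// One deoptimizing step: route a returned expression through a temporary
// and add a dead branch. An optimizing compiler should reduce the result
// back to the original "return expr;".
char[] deoptReturn(char[] expr)
{
    return "int tmp = " ~ expr ~ ";\n"
         ~ "if (false) { tmp = 10; }\n"
         ~ "return tmp;\n";
}

void main()
{
    writefln("original:\nreturn 0;\n");
    writefln("deoptimized:\n%s", deoptReturn("0"));
}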
Nov 13 2006
Does anyone know of tools which might be used to hack up a deoptimizer like that? It doesn't have to be a tool written in D, but it should be possible to generate D code with it.

Maybe a good genetic programming tool can be used, I don't know. Any suggestions?

On Mon, 13 Nov 2006 10:04:00 +0100, Knud Sørensen wrote:
> I think it is possible to systematically search for cases where dmd doesn't
> make optimal code. How this could be done is described at
> http://www.oonumerics.org/lunar/
>
> The idea is that one starts with a simple program:
>
>     int main() { return 0; }
>
> Then one deoptimizes it through one or more deoptimizing steps, for example:
>
>     int main() { int i = 0; if (false) { i = 10; } return i; }
>
> Finally one runs the deoptimized program through the compiler and compares
> the result to the result of the starting program.
>
> Now we just need someone to write a deoptimizer for D.
Nov 15 2006