digitalmars.D.learn - Raytracing speed again, this time with gdc.
- downs (17/17) Nov 07 2007 I've tried to translate the attached C++ program to D (D program also
- dominik (3/15) Nov 07 2007 what kind of output is this? how do you view result?
- downs (5/23) Nov 07 2007 PGM. Gwenview and Konqueror can display it, as well as any decent image
- Jascha Wetzel (3/27) Nov 08 2007 did you try to profile both versions?
- downs (17/31) Nov 08 2007 Thanks for the input, but that's not it.
- downs (3/9) Nov 08 2007 Okay, now that I look at it again, some of those values smack me as dist...
- downs (30/30) Nov 08 2007 Here we go again.
- Jascha Wetzel (11/45) Nov 08 2007 in Sphere.intersect, instead of
- Bill Baxter (17/41) Nov 08 2007 Could the difference be in part due to default initialization in D?
- downs (7/30) Nov 08 2007 If you check the source, you'll see that practically all my structs are ...
- Dave (10/44) Nov 10 2007 Sorry if this has already been discussed, but if not: Have you checked t...
- Saaa (4/4) Nov 09 2007 Any success? If you find out what it was could you post it?
I've tried to translate the attached C++ program to D (D program also attached).
Even after explicitly using the __builtin_sqrt (which correctly generates a fsqrt instruction (shame on you, non-inlined std.math)), the D code is significantly slower (12.9s for D vs 9s for C++).
Does anybody know why this is so?
--downs, confused and saddened

PS: benchmark
gentoo-pc ~/d/RayBen $ gdc ray1.d -O3 -frelease -ffast-math -o ray1_d tools/base.d && time ./ray1_d >result_d.pnm; g++ ray1.cxx -O3 -ffast-math -o ray1_cpp && time ./ray1_cpp > result_cxx.pnm

real 0m13.448s
user 0m12.730s
sys 0m0.090s

real 0m10.128s
user 0m9.810s
sys 0m0.020s
Nov 07 2007
"downs" <default_357-line yahoo.de> wrote in message news:fgskq5$1p5n$1 digitalmars.com...
> writefln("P5\n", n, " ", n, "\n255");
> for (int y=n-1; y>=0; --y)
>   for (int x=0; x<n; ++x) {
>     double g=0;
>     for (int d=0; d<ss*ss; ++d) {
>       auto dir=Vec(x+(d%ss)*1.0/ss-n/2.0, y+(d/ss)*1.0/ss-n/2.0, n).unitise();
>       g += ray_trace(light, Ray(Vec(0, 0, -4), dir), s);
>     }
>     printf("%c", cast(ubyte)(0.5 + 255.0 * g / (ss*ss)));
>   }
> }

What kind of output is this? How do you view the result?
Nov 07 2007
dominik wrote:
> What kind of output is this? How do you view the result?

PGM. Gwenview and Konqueror can display it, as well as any decent image viewer.
http://en.wikipedia.org/wiki/Portable_pixmap
--downs
Nov 07 2007
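For readers who have not met the format: a binary PGM ("P5") file is a short text header (magic number, width, height, maximum grey value) followed by one byte per pixel. Below is a minimal C++ sketch of that layout. It is not taken from the attached programs; the names `make_pgm` and `to_grey` and the choice to build the image in a `std::string` are assumptions made purely for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from ray1.cxx): builds a binary PGM ("P5")
// image in memory from 8-bit grey values stored row by row.
std::string make_pgm(int width, int height, const std::vector<unsigned char>& pixels) {
    // Header: magic number, dimensions and maximum grey value,
    // each terminated by whitespace.
    std::string out = "P5\n" + std::to_string(width) + " " +
                      std::to_string(height) + "\n255\n";
    // Raster: one byte per pixel, top row first.
    out.append(pixels.begin(), pixels.end());
    return out;
}

// Maps an averaged intensity in [0, 1] to a byte, rounding to nearest,
// in the same spirit as the quoted `0.5 + 255.0 * g / (ss*ss)` expression.
unsigned char to_grey(double intensity) {
    return static_cast<unsigned char>(0.5 + 255.0 * intensity);
}
```

Writing the returned string to a file opened in binary mode should give something any PGM-aware viewer can open.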
did you try to profile both versions? also try benchmarking without the IO.
Nov 08 2007
Jascha Wetzel wrote:
> did you try to profile both versions?
> also try benchmarking without the IO.

Thanks for the input, but that's not it. The IO of 262K cannot possibly account for three seconds of difference. Just for the books, same result without the printf call.
Here's the first three lines of each's profile run.

D version:
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 77.27      5.10     5.10  6164146     0.00     0.00  void ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
 15.15      6.10     1.00 38551270     0.00     0.00  void ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
  4.55      6.40     0.30        1     0.30     6.56  _Dmain
All: 5.25s

C++ version:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 75.18      4.12     4.12  6164146     0.00     0.00  Group::intersect(std::pair<double, Vec>&, Ray const&) const
 19.34      5.18     1.06 38551282     0.00     0.00  Sphere::intersect(std::pair<double, Vec>&, Ray const&) const
  4.74      5.44     0.26                             main
All: 6.57s
--downs
Nov 08 2007
Okay, now that I look at it again, some of those values strike me as distinctly .. impossible. I'll redo the profiling.
--downs
Nov 08 2007
Here we go again.

D version. Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 73.09      5.08     5.08  6164146     0.00     0.00  void ray1.Group.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
 19.57      6.44     1.36 38551270     0.00     0.00  void ray1.Sphere.intersect(inout typedef ray1.Hit, inout struct ray1.Ray, void*)
  3.74      6.70     0.26        1     0.26     6.88  _Dmain
  2.59      6.88     0.18  4194304     0.00     0.00  double ray1.ray_trace(inout struct ray1.Vec, inout struct ray1.Ray, class ray1.Scene)
  0.43      6.91     0.03                             struct gcx.Pool* gcx.Gcx.findPool(void*, void*)
  0.14      6.92     0.01                             void* gcx.GC.mallocNoSync(uint, void*)
  0.14      6.93     0.01                             uint gcbits.GCBits.test(uint, void*)
  0.14      6.94     0.01                             _d_callfinalizer
  0.14      6.95     0.01                             _d_newclass
  0.00      6.95     0.00        1     0.00     0.00  void ray1._staticCtor1()
  0.00      6.95     0.00        1     0.00     0.00  class ray1.Scene ray1.create(int, inout struct ray1.Vec, double)
  0.00      6.95     0.00        1     0.00     0.00  void ray1.__modinit()

C++ version. Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 76.88      4.19     4.19  6164146     0.00     0.00  Group::intersect(std::pair<double, Vec>&, Ray const&) const
 18.35      5.19     1.00 38551282     0.00     0.00  Sphere::intersect(std::pair<double, Vec>&, Ray const&) const
  3.30      5.37     0.18                             main
  1.28      5.44     0.07  4194304     0.00     0.00  ray_trace(Vec const&, Ray const&, Scene const&)
  0.18      5.45     0.01        1    10.00    10.00  create(int, Vec const&, double)
  0.00      5.45     0.00    87381     0.00     0.00  Sphere::~Sphere()
  0.00      5.45     0.00        1     0.00     0.00  global constructors keyed to real
  0.00      5.45     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
  0.00      5.45     0.00        1     0.00     0.00  Group::~Group()
Nov 08 2007
in Sphere.intersect, instead of
    hit = Hit(lambda, (ray.orig + lambda*ray.dir - center).unitise);
try
    hit.first = lambda; hit.second = ...
it might not optimize away the creation on stack + copy.

now i'd look at the asm code for ray_sphere in both versions, since that seems to be where the time is burnt.

also curious: why does the C++ version call Sphere::intersect 12 more times than the D version (38551282 vs. 38551270 calls)?
Nov 08 2007
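The two ways of filling in the hit record discussed above can be put side by side. This is a C++ sketch for comparison only (the D code in the thread is analogous); the `Vec` layout and the function names `set_hit_temporary` and `set_hit_fields` are hypothetical and do not come from the attached sources.

```cpp
#include <utility>

// Assumed layout for illustration: a plain 3-component vector.
struct Vec {
    double x, y, z;
};

typedef std::pair<double, Vec> Hit;

// Variant 1: build a temporary pair, then copy-assign it into `hit`.
// Unless the optimizer elides it, this constructs the pair on the stack first.
void set_hit_temporary(Hit& hit, double lambda, const Vec& normal) {
    hit = Hit(lambda, normal);
}

// Variant 2: write each field of the existing pair directly.
void set_hit_fields(Hit& hit, double lambda, const Vec& normal) {
    hit.first = lambda;
    hit.second = normal;
}
```

Both variants leave `hit` in the same state. Whether they differ in cost depends on the compiler and flags, so comparing the generated assembly is probably the only reliable way to tell.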
Jascha Wetzel wrote:
> in Sphere.intersect, instead of
>     hit = Hit(lambda, (ray.orig + lambda*ray.dir - center).unitise);
> try
>     hit.first = lambda; hit.second = ...
> it might not optimize away the creation on stack + copy.

I just tried that. Negligible change.

> now i'd look at the asm code for ray_sphere in both versions, since that seems to be where the time is burnt.
> also curious: why does the C++ version call Sphere::intersect 12 more times than the D version?

I'd like to know that as well. Far as I know, the algorithm's the same. I'll add debugging code to find out.
--downs
Nov 08 2007
downs wrote:
> Jascha Wetzel wrote:
>> also curious: why does the C++ version call Sphere::intersect 12 more times than the D version?
> I'd like to know that as well. Far as I know, the algorithm's the same. I'll add debugging code to find out.

Okay, it's some weird floating point effect around this line:
    disc = b*b - v.dot(v) + radius*radius;
It only happens in very specific conditions: x=430, y=379, d=11. So whatever it is, responsible for three seconds it is not.
--downs
Nov 08 2007
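To show where that discriminant sits, here is a C++ sketch of the usual geometric ray/sphere test. Only the `disc` line is confirmed by the thread; the surrounding control flow, the free functions `sub` and `dot`, and the name `ray_sphere` are assumptions based on the standard form of the test, not a copy of the attached code.

```cpp
#include <cmath>
#include <limits>

struct Vec {
    double x, y, z;
};

Vec sub(const Vec& a, const Vec& b) { return Vec{a.x - b.x, a.y - b.y, a.z - b.z}; }
double dot(const Vec& a, const Vec& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns the distance along the ray to the nearest intersection with the
// sphere, or +infinity on a miss. `dir` is assumed to be unit length.
double ray_sphere(const Vec& orig, const Vec& dir, const Vec& center, double radius) {
    const double inf = std::numeric_limits<double>::infinity();
    Vec v = sub(center, orig);
    double b = dot(v, dir);
    // The expression quoted in the thread: its sign decides hit or miss.
    double disc = b * b - dot(v, v) + radius * radius;
    if (disc < 0.0) return inf;    // ray passes beside the sphere
    double d = std::sqrt(disc);
    double t2 = b + d;             // far intersection
    if (t2 < 0.0) return inf;      // sphere lies entirely behind the origin
    double t1 = b - d;             // near intersection
    return t1 > 0.0 ? t1 : t2;
}
```

For a ray that only grazes a sphere, `disc` is close to zero, so a rounding difference between two builds may flip its sign and turn a hit into a miss. That could plausibly change a handful of call counts without affecting run time in any measurable way.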
Could the difference be in part due to default initialization in D? Maybe all your rays and vecs are getting initialized first to NaN and then overwritten with the value you want, and that is slowing it down.
You could try sticking some =void's in your structs, like so:

    struct Vec {
      double x=void, y=void, z=void;
      Vec opAdd(ref Vec other) { return Vec(x+other.x, y+other.y, z+other.z); }
      Vec opSub(ref Vec other) { return Vec(x-other.x, y-other.y, z-other.z); }
      Vec opMul(double a) { return Vec(x*a, y*a, z*a); }
      double dot(ref Vec other) { return x*other.x+y*other.y+z*other.z; }
      Vec unitise() { return opMul(1.0/dsqrt(dot(*this))); }
    }
    struct Pair(T, U) { T first=void; U second=void; }
    typedef Pair!(double, Vec) Hit;
    struct Ray { Vec orig=void, dir=void; }

--bb
Nov 08 2007
Bill Baxter wrote:
> Could the difference be in part due to default initialization in D? Maybe all your rays and vecs are getting initialized first to NaN and then overwritten with the value you want, and that is slowing it down.
> You could try sticking some =void's in your structs

If you check the source, you'll see that practically all my structs are either manually initialized with proper values or the result of a calculation.
Also, I tried that and it's not it. Sorry. Still, thanks for the idea!
--downs
Nov 08 2007
"downs" <default_357-line yahoo.de> wrote in message news:fh0iem$2dhc$1 digitalmars.com...
> If you check the source, you'll see that practically all my structs are either manually initialized with proper values or the result of a calculation.
> Also, I tried that and it's not it.

Sorry if this has already been discussed, but if not: Have you checked to see if function inlining differences are the culprit?
The D front-end doesn't inline functions with byref params, and for DMD at least the FE is responsible for all inlining. For GDC the front-end inlining might be turned off anyhow (I can't remember), and maybe GCC does all that in the intermediate or backend stages. If so, it would presumably be pretty close to the same for D and C++, but it might be worth a look anyhow.
Nov 10 2007
Any success? If you find out what it was, could you post it?
The only things I can think of are stack vs. heap allocation, variable initialisation, and differences in optimization... Maybe try making some smaller test cases.
Nov 09 2007