digitalmars.D - Optimizing a raytracer

"Róbert László Páli" writes:
Hello!

I am writing an unbiased ray-tracing renderer in D. I am making 
good progress, but I want to make it as fast as possible wherever 
I can do so without compromises.

I use a struct with three doubles for vector and color 
calculations and I have operator overloading for them. Many 
vectors and colors are created during the tracing calculations.

I thought that using classes might require too much memory, 
because they are not destroyed at scope end, and might slow 
things down when the GC kicks in.

Is my assumption correct that structs are the wiser choice in 
this case?

To avoid constructing many vectors and colors, I thought of 
using ref arguments, but I have also heard that functions with 
ref parameters are not inlined. What would generate the fastest 
code for a cross-product for example?

Which compiler and compilation flags should I use to generate the 
fastest code? My main target is 64-bit machines, cross-platform. 
Which optimizations can I assume for the various compilers? Are 
local variables that are used only once inlined? That is, is it 
safe to extract local variables purely to make the code easier 
to understand?

Thanks in advance!
Róbert László Páli
Oct 16 2013
Jacob Carlborg <doob me.com> writes:
On 2013-10-16 14:02, "Róbert László Páli" wrote:
 Hello!

 I am writing an unbiased ray-tracing renderer in D. I am making good
 progress, but I want to make it as fast as possible wherever I can do
 so without compromises.

 I use a struct with three doubles for vector and color calculations and
 I have operator overloading for them. Many vectors and colors are
 created during the tracing calculations.

 I thought that using classes might require too much memory, because
 they are not destroyed at scope end, and might slow things down when
 the GC kicks in.

 Is my assumption correct that structs are the wiser choice in this case?

 To avoid constructing many vectors and colors, I thought of using ref
 arguments, but I have also heard that functions with ref parameters are
 not inlined. What would generate the fastest code for a cross-product
 for example?

 Which compiler and compilation flags should I use to generate the
 fastest code? My main target is 64-bit machines, cross-platform.
 Which optimizations can I assume for the various compilers? Are local
 variables that are used only once inlined? That is, is it safe to
 extract local variables purely to make the code easier to understand?
I would say use structs. For the compiler, I would go with LDC or GDC. Both of these are faster for floating-point calculations than DMD. You can always benchmark.

-- 
/Jacob Carlborg
Oct 16 2013
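
For readers following along, here is a minimal sketch of the kind of value-type vector discussed above. It is not the original poster's code: the name `Vec3`, its fields, and the free functions `dot` and `cross` are assumptions made for illustration. The arguments are passed by value, which keeps the functions easy for an optimizing compiler to inline; whether that beats `ref` in a given renderer is something only a benchmark can tell.

```d
/// Hypothetical 3D vector type; the name and layout are assumptions.
struct Vec3
{
    double x = 0, y = 0, z = 0;

    /// Component-wise addition and subtraction of two vectors.
    Vec3 opBinary(string op)(Vec3 rhs) const
        if (op == "+" || op == "-")
    {
        return mixin("Vec3(x " ~ op ~ " rhs.x, y " ~ op ~ " rhs.y, z " ~ op ~ " rhs.z)");
    }

    /// Scaling by a scalar on the right-hand side: v * s.
    Vec3 opBinary(string op)(double s) const
        if (op == "*")
    {
        return Vec3(x * s, y * s, z * s);
    }
}

/// Dot product, written out by hand with no loop.
double dot(Vec3 a, Vec3 b) pure nothrow @nogc @safe
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/// Cross product, written out by hand with no loop.
Vec3 cross(Vec3 a, Vec3 b) pure nothrow @nogc @safe
{
    return Vec3(a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x);
}
```

Because `Vec3` is a struct, temporaries such as `(a + b) * 2.0` live on the stack and never involve the garbage collector.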
"finalpatch" <fengli gmail.com> writes:
I find it critical to ensure all loops are unrolled in basic 
vector ops (copy/arithmetic/dot etc.)

On Wednesday, 16 October 2013 at 12:02:15 UTC, Róbert László Páli 
wrote:
 Hello!

 I am writing an unbiased ray-tracing renderer in D. I am making 
 good progress, but I want to make it as fast as possible wherever 
 I can do so without compromises.
Oct 16 2013
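
One way to get guaranteed unrolling for array-based vector ops, sketched under the assumption of a reasonably recent compiler (`static foreach` did not exist yet when this thread was written). The function name `dotUnrolled` is made up for this example.

```d
/// Dot product over fixed-size arrays. `static foreach` is expanded at
/// compile time, so the body is emitted N times and no runtime loop remains.
double dotUnrolled(size_t N)(double[N] a, double[N] b)
{
    double sum = 0;
    static foreach (i; 0 .. N)
        sum += a[i] * b[i];
    return sum;
}
```

With three named `double` fields written out by hand, as in the original poster's struct, there is no loop to unroll in the first place.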
"ponce" <contact gam3sfrommars.fr> writes:
On Wednesday, 16 October 2013 at 12:02:15 UTC, Róbert László Páli 
wrote:
 I thought that using classes might require too much memory, 
 because they are not destroyed at scope end, and might slow 
 things down when the GC kicks in.

 Is my assumption correct that structs are the wiser choice in 
 this case?
Yes, by all means use struct.
 What would generate the fastest code for a cross-product for 
 example?
If you are on x86, SSE 4.1 introduced an instruction called DPPS which performs a dot product. Maybe you can force it into doing a cross-product with clever swizzles and masks.
Oct 16 2013
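
To illustrate what "swizzles" means here, the cross product can be written as two lane reorderings, two element-wise multiplies and a subtraction: cross(a, b) = a.yzx * b.zxy - a.zxy * b.yzx. The sketch below shows that formulation in plain D with array operations, not with SSE intrinsics; the helper names `yzx`, `zxy` and `crossSwizzled` are assumptions for this example. Note also that DPPS works on packed single-precision floats, so it would not apply directly to a vector of doubles.

```d
/// Reorders lanes (x, y, z) to (y, z, x).
double[3] yzx(double[3] v) { return [v[1], v[2], v[0]]; }

/// Reorders lanes (x, y, z) to (z, x, y).
double[3] zxy(double[3] v) { return [v[2], v[0], v[1]]; }

/// Cross product as swizzle, multiply, subtract. A SIMD version would
/// do the same three steps with shuffle instructions on whole registers.
double[3] crossSwizzled(double[3] a, double[3] b)
{
    double[3] ay = yzx(a), az = zxy(a);
    double[3] by = yzx(b), bz = zxy(b);
    double[3] r;
    r[] = ay[] * bz[] - az[] * by[];
    return r;
}
```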
"Róbert László Páli" writes:
Jacob Carlborg:
 I would say use structs. For the compiler, I would go with LDC 
 or GDC. Both of these are faster for floating-point 
 calculations than DMD. You can always benchmark.
Thank you for the advice! I installed LDC and used ldmd2. The benchmarks are amazing! :O

DMD:  compile = 2503, run = 26210
LDMD: compile = 3953, run = 8935

These are in milliseconds, measured with the time command. Both were compiled with the same flags: -O -inline -release -noboundscheck

finalpatch:
 I find it critical to ensure all loops are unrolled in basic 
 vector ops (copy/arithmetic/dot etc.)
In these crucial parts I don't use loops; I wrote these operations out by hand. They are simply three named doubles. But thanks for the advice.

ponce:
 If you are on x86, SSE 4.1 introduced an instruction called 
 DPPS which performs a dot product. Maybe you can force it into 
 doing a cross-product with clever swizzles and masks.
Could you give me a hint on how that dot product could be used from D? I am not experienced with such low-level programming. And would you suggest trying SIMD double4 for 3D vectors? It would take some time to change the code.
Oct 17 2013
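
The `time` command measures the whole process, including startup and output. For comparing compilers on just the rendering loop, an in-program timer can help. A minimal sketch, assuming a current Phobos with `std.datetime.stopwatch`; the helper name `timeMsecs` is made up for this example.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;

/// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
long timeMsecs(void delegate() work)
{
    auto sw = StopWatch(AutoStart.yes); // starts counting immediately
    work();
    sw.stop();
    return sw.peek.total!"msecs";
}
```

The same binary built with DMD and with LDC, using the flags quoted above, can then report comparable numbers for the hot loop alone.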
"bearophile" <bearophileHUGS lycos.com> writes:
Róbert László Páli:

 And would you suggest to try to use
 SIMD double4 for 3D vectors? It would
 take some time to change code.
Using a double4 could improve the performance of your code, but it must be used wisely. (One general tip is to avoid mixing SIMD and serial code. If you want to use SIMD code, then it's often better to keep using SIMD registers even if you have one value.)

Bye,
bearophile
Oct 17 2013
"Róbert László Páli" writes:
 Using a double4 could improve the performance of your code, but 
 it must be used wisely. (One general tip is to avoid mixing 
 SIMD and serial code. If you want to use SIMD code, then it's 
 often better to keep using SIMD registers even if you have one 
 value.)
Sadly, I could not get it to work properly, but the performance seems good so far. Theoretically I would only need to adjust the Vector struct and its operations (a small layer of the code; the rest uses only the Vector type and the operations, not its internals).

In case you are interested:
http://palaes.rudanium.org/SubSpace/render.php
Mar 26 2014
"Róbert László Páli" writes:
Oh, thanks for all of your help. Nice
to see that the D guys really do help. :)
Mar 26 2014
"Bienlein" <jeti789 web.de> writes:
You can also achieve significant speed-ups by doing things in 
parallel; for example, see 
https://groups.google.com/forum/?hl=de#!searchin/golang-nuts/ray$20tracer/golang-nuts/mxYzHQSV3rw/dOA78aeVLgEJ
Mar 26 2014
"Róbert László Páli" writes:
Thanks! I already trace the samples in parallel.
Strangely, I have a Core 2 Duo, and it seems that using
3 threads is best (slightly better than 2), although
this might be accidental. Maybe the more complex
samples end up spread more evenly across the threads.
Mar 26 2014
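
For anyone landing on this thread later, a minimal sketch of row-parallel rendering with `std.parallelism`. This is not the poster's renderer: `shade` is a hypothetical stand-in for tracing one sample, and `renderParallel` and the flat `double[]` image are assumptions made for illustration.

```d
import std.parallelism : parallel;
import std.range : iota;

/// Hypothetical per-pixel work; stands in for tracing one sample.
double shade(size_t x, size_t y)
{
    return cast(double)(x + y);
}

/// Fills a width * height image, distributing rows over the default task pool.
double[] renderParallel(size_t width, size_t height)
{
    auto image = new double[](width * height);
    foreach (y; parallel(iota(height)))
    {
        // Each row writes to its own slice of `image`, so no locking is needed.
        foreach (x; 0 .. width)
            image[y * width + x] = shade(x, y);
    }
    return image;
}
```

The number of worker threads in the default pool can be changed through `defaultPoolThreads` before the pool is first used, which makes it easy to compare 2 against 3 threads as described above.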