
digitalmars.D - OOP, faster data layouts, compilers

reply bearophile <bearophileHUGS lycos.com> writes:
Through Reddit I've found a set of wordy slides, "Design for Performance", on
designing efficient games code:
http://www.scribd.com/doc/53483851/Design-for-Performance
http://www.reddit.com/r/programming/comments/guyb2/designing_code_for_performance/

The slides touch on many small topics, like the need for prefetching, design
for cache-aware code, etc. One of the main topics is how to better lay out data
structures in memory for modern CPUs. It shows how object oriented style often
leads to collections of little trees, for example arrays of object references
(or struct pointers) that refer to objects that contain other references to sub
parts. Iterating over such data structures is not so efficient.

The slides also discuss a little the difference between creating an array of
2-item structs, or a struct that contains two arrays of single native values.
If the code needs to scan just one of those two fields, then the struct that
contains the two arrays is faster.

Similar topics were discussed better in "Pitfalls of Object Oriented
Programming" (2009):
http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

In my opinion, if D2 has some success then one of its significant usages will be
writing fast games, so the design/performance concerns expressed in those two
sets of slides should be important for D's design.

D probably already allows laying out data in memory as shown in those slides, but
I'd like some help from the compiler too. I don't think compilers will soon be
able to turn an immutable binary tree into an array to speed up its repeated
scanning, but maybe there are ways to express semantics in the code that will
allow future smarter compilers to perform some of those memory layout
optimizations, like transposing arrays. A possible idea is a
 @no_inbound_pointers annotation that forbids taking the address of the items,
and allows the compiler to modify the data layout a little.

Bye,
bearophile
Apr 21 2011
next sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
Many thanks for the links, they provide very nice discussions.

Especially the link below, which you can follow from your first link:
http://c0de517e.blogspot.com/2011/04/2011-current-and-future-programming.html

But as far as game development is concerned, D2 might already be too late.

I know a bit about it, since I live a bit in that part of the universe.

Due to XNA (Windows and XBox 360), Mono/Unity, and now WP7, many game studios
have been adopting C#, even using it for the server side code.

Java used to have a foot there, especially due to J2ME game development,
with a small push thanks to Android. Which has decreased since Google made
the NDK available.

C# might actually be the next C++, at least as far as game development is
concerned.

And the dependency on a JIT environment is an implementation issue. The 
Bartok compiler in Singularity
compiles to native code, and Mono also provides a similar option.

So who knows?

--
Paulo



"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:ioqdhe$2030$1 digitalmars.com...
 Through Reddit I've found a set of wordy slides, "Design for Performance",
 on designing efficient games code:
 <snip>
Apr 22 2011
parent reply Kai Meyer <kai unixlords.com> writes:
On 04/22/2011 02:55 AM, Paulo Pinto wrote:
 Many thanks for the links, they provide very nice discussions.

 <snip>
[...] C/C++ is. There is a purpose and a place for Interpreted languages like
[...] engine) is written in an interpreted language either, which basically
means the guts are likely written in either C or C++.

The point being made is that Systems Programming Languages like C/C++ and D
are picked for their execution speed, and Interpreted Languages are picked for
their ease of programming (or development speed). Since D is picked for
execution speed, we should seriously consider every opportunity to improve in
that arena. The OP wasn't just for the game developers, but for game framework
developers as well.
Apr 22 2011
parent reply Daniel Gibson <metalcaedes gmail.com> writes:
Am 22.04.2011 18:48, schrieb Kai Meyer:
 

 <snip>
IMHO D won't be successful for games as long as it only supports Windows,
Linux and OSX on PC(-like) hardware. We'd need support for modern game
consoles (XBOX360, PS3, maybe Wii) and for mobile devices (Android, iOS, maybe
Win7 phones and other stuff). This means good PPC support (maybe the PS3's
Cell CPU would need special support even though it understands PPC code? I
don't know.) and ARM support, plus support for the operating systems and SDKs
used on those platforms.

Of course execution speed is very important as well, but D in its current
state is not *that* bad in this regard. Sure, the GC is a bit slow, but in
high performance games you shouldn't use it (or even malloc/free) all the
time, anyway; see
http://www.digitalmars.com/d/2.0/memory.html#realtime

Another point: I find Minecraft pretty impressive. It really changed my view
upon games developed in Java.

Cheers,
- Daniel
Apr 22 2011
parent reply Kai Meyer <kai unixlords.com> writes:
On 04/22/2011 11:05 AM, Daniel Gibson wrote:
 Am 22.04.2011 18:48, schrieb Kai Meyer:

 <snip>

 Another point: I find Minecraft pretty impressive. It really changed my
 view upon Games developed in Java.
Hah, Minecraft. Have you tried loading up a high resolution texture pack yet? There's a reason why it looks like 8-bit graphics. It's not Java that makes Minecraft awesome, imo :)
Apr 22 2011
parent reply Daniel Gibson <metalcaedes gmail.com> writes:
Am 22.04.2011 19:11, schrieb Kai Meyer:
 On 04/22/2011 11:05 AM, Daniel Gibson wrote:
 Am 22.04.2011 18:48, schrieb Kai Meyer:

 <snip>
Hah, Minecraft. Have you tried loading up a high resolution texture pack yet? There's a reason why it looks like 8-bit graphics. It's not Java that makes Minecraft awesome, imo :)
No I haven't. What I find impressive is this (almost infinitely) big world
that is completely changeable, i.e. you can build new stuff everywhere, you
can dig tunnels everywhere (ok, somewhere really deep there's a limit), and
the game still runs smoothly. I haven't seen anything like that in any game
before.
Apr 22 2011
next sibling parent reply Kai Meyer <kai unixlords.com> writes:
On 04/22/2011 11:20 AM, Daniel Gibson wrote:
 Am 22.04.2011 19:11, schrieb Kai Meyer:
 On 04/22/2011 11:05 AM, Daniel Gibson wrote:
 Am 22.04.2011 18:48, schrieb Kai Meyer:

 <snip>
Hah, Minecraft. Have you tried loading up a high resolution texture pack yet? There's a reason why it looks like 8-bit graphics. It's not Java that makes Minecraft awesome, imo :)
No I haven't. What I find impressive is this (almost infinitely) big world that is completely changeable, i.e. you can build new stuff everywhere, you can dig tunnels everywhere (ok, somewhere really deep there's a limit) and the game still runs smoothly. Haven't seen something like that in any game before.
The random world generator is amazing, but it's not speed. The polygon count
of the game is excruciatingly low because the client is smart enough to only
draw the faces of blocks that are visible. From the very bottom (bedrock) to
the very top of the sky (as high as you can build blocks) is 256 blocks. The
game is full of low-level bit-stuffing (like stacks of 64).

The genius of the game is not in any special features of Java, it's in the
data structure and data generator, which could be done much faster in other
languages. But that begs the question, "why does it need to be faster?" It is
"fast enough" in the JVM (unless you load up the high resolution textures, in
which case the game becomes unbearably slow when viewing long distances).

The purpose of the original post was to indicate that some low level research
shows that underlying data structures (as applied to video game development)
can have an impact on the performance of the application, which D (I think)
cares very much about.
Apr 22 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Kai Meyer:

 The purpose of the original post was to indicate that some low level 
 research shows that underlying data structures (as applied to video game 
 development) can have an impact on the performance of the application, 
 which D (I think) cares very much about.
The idea of the original post was a bit more complex: how can we invent
new/better ways to express semantics in D code that will not forbid future D
compilers from performing a bit of change in the layout of data structures to
increase code performance?

Complex transforms of the data layout seem too complex even for a good
compiler, but maybe simpler ones will be possible. And I think to do this the
D code needs some more semantics. I was suggesting an annotation that forbids
inbound pointers, which allows the compiler to move data around a little, but
this is just a start.

Bye,
bearophile
Apr 22 2011
parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 4/22/2011 2:20 PM, bearophile wrote:
 <snip>
In many ways the biggest thing I use regularly in game development that I
would lose by moving to D would be good built-in SIMD support. The PC
compilers from MS and Intel both have intrinsic data types and instructions
that cover all the operations from SSE1 up to AVX. The intrinsics are nice in
that the job of register allocation and scheduling is given to the compiler,
and generally the code it outputs is good enough (though it needs to be
watched at times).

Unlike ASM, intrinsics can be inlined, so your math library can provide a
platform abstraction at that layer before building up to larger operations
(like vectorized forms of sin, cos, etc.) and algorithms (like frustum cull
checks, k-dop polygon collision, etc.), which makes porting and reusing the
algorithms on other platforms much easier, as only the low level layer needs
to be ported, and only outliers at the algorithm level need to be tweaked
after you get it up and running.

On the consoles there is AltiVec (VMX), which is very similar to SSE in many
ways. The common ground is basically SSE1 tier operations: 128 bit values
operating on 4x32 bit integer and 4x32 bit float data. 64 bit AMD/Intel makes
SSE2 the minimum standard, and a systems language on those platforms should
reflect that.

Loading and storing is comparable across platforms, with similar alignment
restrictions or penalties for working with unaligned data.
Packing/swizzle/shuffle/permuting are different, but this is not a huge
problem for most algorithms. The lack of fused multiply-add on the Intel side
can be worked around or abstracted (i.e. always write code as if it existed,
and have the Intel version expand to multiple ops).

And now my wish list:

If you have worked with shader programming through HLSL or CG, the
expressiveness of doing the work in SIMD is very high. If I could write
something that looked exactly like HLSL but was integrated perfectly in a
language like D or C++, it would be pretty huge to me. The amount of math you
can have in a line or two in HLSL is mind boggling at times, yet extremely
intuitive and rather easy to debug.
Apr 22 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Sean Cavanaugh:

 In many ways the biggest thing I use regularly in game development that
 I would lose by moving to D would be good built-in SIMD support.  The PC
 compilers from MS and Intel both have intrinsic data types and
 instructions that cover all the operations from SSE1 up to AVX.  The
 intrinsics are nice in that the job of register allocation and
 scheduling is given to the compiler and generally the code it outputs is
 good enough (though it needs to be watched at times).
This is a topic quite different from the one I was talking about, but it's an
interesting topic :-)

SIMD intrinsics look ugly, they add a lot of noise to the code, and they are
very specific to one CPU or instruction set. You can't design a clean language
with hundreds of those. Once 256 or 512 bit registers come, you need to add
new intrinsics and change your code to use them. This is not so good.

D array operations are probably meant to become smarter, so when you perform:

int[8] a, b, c;
a[] = b[] + c[];

a future good D compiler may use just two inlined instructions, or little
more. This will probably include shuffling and broadcasting properties too.
Maybe this kind of code is not as efficient as handwritten assembly code (or C
code that uses SIMD intrinsics), but it's adaptable to different CPUs, future
ones too, it's much less noisy, and it seems safer.

I think such optimizations are better left to the back-end, so a long time ago
I asked the LLVM devs about it, for future LDC:
http://llvm.org/bugs/show_bug.cgi?id=6956

The presence of such well implemented vector ops will not forbid another D
compiler from adding true SIMD intrinsics too.
 Unlike ASM, intrinsics can be inlined so your math library can provide a
 platform abstraction at that layer
DMD may eventually need this feature of the LDC compiler:
http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

Bye,
bearophile
Apr 22 2011
parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 4/22/2011 4:41 PM, bearophile wrote:
 <snip>
In C++ the intrinsics are easily wrapped by __forceinline global functions, to
provide a platform abstraction against the intrinsics. Then you can write
class wrappers to provide the most common level of functionality, which boils
down to a class to do vectorized math operators for + - * / and vectorized
comparison functions == != >= <= < and >. From HLSL you have to borrow the
'any' and 'all' statements (along with variations for every permutation of the
bitmask of the test result) to do conditional branching for the tests. This
pretty much leaves swizzle/shuffle/permuting and outlying features (8, 16, 64
bit integers) in the realm of 'ugly'. From here you could build up portable
SIMD transcendental functions (sin, cos, pow, log, etc.), and other libraries
(matrix multiplication, inversion, quaternions, etc.).

I would say in D this could be faked provided the language at a minimum
understood what a 128 bit (SSE1 through 4.2) and a 256 bit (AVX) value was and
how to efficiently move it via registers for function calls. Kind of a 'make
it at least work in the ABI, come back to a good implementation later'
solution. There is some room to beat Microsoft here, as the code Visual Studio
2010 currently outputs for 64 bit environments cannot pass 128 bit SIMD values
by register (__forceinline functions are the only workaround), even though
scalar 32 and 64 bit float values are passed by XMM register just fine.

The current hardware landscape dictates organizing your data in SIMD friendly
manners. Naive OOP based code is going to de-reference too many pointers to
get to scattered data. This makes the hardware prefetcher work too hard, and
it wastes cache memory by only using a fraction of the RAM from the cache
line, plus wasting 75-90% of the bandwidth and memory on the machine.
 D array operations are probably meant to become smarter, when you perform a:

 int[8] a, b, c;
 a = b + c;
Now the original topic pertains to data layouts, with which SIMD, the CPU
cache, and efficient code all inter-relate. I would argue the above code is an
idealistic example, as when writing SIMD code you almost always have to
transpose or rotate one of the sets of data to work in parallel across the
other one.

What happens when this code has to branch? In SIMD land you have to test if
any or all 4 lanes of SIMD data need to take it. And a lot of the time the
best course of action is to compute the other code path in addition to the
first one, AND the first result, NAND the second one, and OR the results
together to make valid output. I could maybe see a functional language doing
OK at this.

The only reasonable construct able to convey how common this is in optimized
SIMD code is HLSL's vectorized ternary operator (understanding that 'a' and
'b' can be fairly intricate chunks of code if you are clever):

float4 a = {1,2,3,4};
float4 b = {5,6,7,8};
float4 c = {-1,0,1,2};
float4 d = {0,0,0,0};
float4 foo = (c > d) ? a : b;

results with foo = {5,6,3,4}

For a lot of algorithms the 'a' and 'b' paths have similar cost, so for SIMD
it executes about 2x faster than the scalar case, although better than 2x
gains are possible since using SIMD also naturally reduces or eliminates a ton
of branching, which CPUs don't really like to do due to their long pipelines.

And as much as Intel likes to argue that a structure containing positions for
a particle system should look like the following because it makes their
hardware benchmarks awesome, this vertex layout is a failure:

struct ParticleVertex
{
    float[1000] XPos;
    float[1000] YPos;
    float[1000] ZPos;
}

The GPU (or audio devices) does not consume it this way. The data is also not
cache coherent if you are trying to read or write a single vertex out of the
structure.

A hybrid structure which is aware of the size of a SIMD register is the next
logical choice:

align(16) struct ParticleVertex
{
    float[4] XPos;
    float[4] YPos;
    float[4] ZPos;
}
ParticleVertex[250] ParticleVertices;
// struct is also now 75% of a 64 byte cache line
// Also, 2 of any 4 random accesses for a vertex are in the same
// cache line, and only 2 cache lines are touched in the worst case

But this hybrid structure still has to be shuffled before being given to a GPU
(albeit in much more bite-size increments that could easily
read-shuffle-write at the same speed as a platform-optimized memcpy).

Things get real messy when you have multiple vertex attributes, as decisions
to keep them together or separate are conflicting and both choices make sense
to different systems :)
Apr 22 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Sean Cavanaugh:

 In C++ the intrinsics are easily wrapped by __forceinline global
 functions, to provide a platform abstraction against the intrinsics.
When AVX becomes 512 bits wide, or you need to use a very different set of
vector registers, your global functions need to change, so the code that calls
them has to change too. This is acceptable for library code, but it's not good
for D built-in operations. D built-in vector ops need to be more clean,
general and long-lasting, even if they may not fully replace SSE intrinsics.
 I would say in D this could be faked provided the language at a minimum
 understood what a 128 (SSE1 through 4.2) and 256 bit value (AVX) was and
 how to efficiently move it via registers for function calls.
Also think about what the D ABI will be 15-25 years from now. D design must look a bit more forward too.
 Now the original topic pertains to data layouts,
It was about how to not preclude future D compilers from shuffling data around a bit by themselves :-)
 I would argue the above
 code is an idealistic example, as when writing SIMD code you almost
 always have to transpose or rotate one of the sets of data to work in
 parallel across the other one.
Right.
 float4 a = {1,2,3,4};
 float4 b = {5,6,7,8};
 float4 c = {-1,0,1,2};
 float4 d = {0,0,0,0};
 float4 foo = (c > d) ? a : b;
Recently I have asked for a D vector comparison operation too (the compiler is
supposed to be able to split them into register-sized chunks for the
comparisons); this is good for AVX instructions (a little problem here is that
I think currently DMD allocates memory on the heap to instantiate those four
little arrays):

int[4] a = [1,2,3,4];
int[4] b = [5,6,7,8];
int[4] c = [-1,0,1,2];
int[4] d = [0,0,0,0];
int[4] foo = (c[] > d[]) ? a[] : b[];
 Things get real messy when you have multiple vertex attributes as
 decisions to keep them together or separate are conflicting and both
 choices make sense to different systems :)
It's not easy for future compilers to perform similar auto-vectorizations :-)

Bye and thank you for your answer,
bearophile
Apr 22 2011
prev sibling parent reply Don <nospam nospam.com> writes:
Sean Cavanaugh wrote:
  In many ways the biggest thing I use regularly in game development
  that I would lose by moving to D would be good built-in SIMD support.
  <snip>
Yes. It is primarily for this reason that we made static arrays
return-by-value. It is intended that on x86, a float[4] will be an SSE1
register, so it should be possible to write SIMD code with standard array
operations. (Note that this is *much* easier for the compiler than trying to
vectorize scalar code.) This gives syntax like:

float[4] a, b, c;
a[] += b[] * c[];

(currently works, but doesn't use SSE, so has dismal performance).
 
 Loading and storing is comparable across platforms with similar 
 alignment restrictions or penalties for working with unaligned data. 
 Packing/swizzle/shuffle/permuting are different but this is not a huge 
 problem for most algorithms.  The lack of fused multiply and add on the 
 Intel side can be worked around or abstracted (i.e. always write code as 
 if it existed, have the Intel version expand to multiple ops).
 
 And now my wish list:
 
 If you have worked with shader programming through HLSL or CG the 
 expressiveness of doing the work in SIMD is very high.  If I could write 
 something that looked exactly like HLSL but it was integrated perfectly 
 in a language like D or C++, it would be pretty huge to me.  The amount 
 of math you can have in a line or two in HLSL is mind boggling at times, 
 yet extremely intuitive and rather easy to debug.
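The fused multiply-add workaround mentioned above ("always write code as if it existed") might look like this in C; `vec4_madd` is a hypothetical name, while `_mm_fmadd_ps` is the real FMA3 intrinsic used where the hardware supports it:

```c
#include <immintrin.h>  /* pulls in SSE and, where enabled, FMA intrinsics */

/* Portable multiply-add: algorithms always call vec4_madd as if fused
   multiply-add existed. On plain SSE it expands to mul + add; when
   compiled with FMA support (e.g. -mfma) it maps to the single fused op. */
static inline __m128 vec4_madd(__m128 a, __m128 b, __m128 c) {
#ifdef __FMA__
    return _mm_fmadd_ps(a, b, c);            /* a*b + c, one fused op */
#else
    return _mm_add_ps(_mm_mul_ps(a, b), c);  /* a*b + c, two ops */
#endif
}
```

Algorithm code compiled either way is unchanged; only this shim differs per target, which is the abstraction Sean describes.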
Apr 26 2011
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 26/04/11 9:01 AM, Don wrote:
 Sean Cavanaugh wrote:
 In many ways the biggest thing I use regularly in game development
 that I would lose by moving to D would be good built-in SIMD support.
 <snip>
 Yes. It is primarily for this reason that we made static arrays
 return-by-value. It is intended that on x86, float[4] will be an SSE1
 register. So it should be possible to write SIMD code with standard
 array operations.
 <snip>
What about float[4]s that are part of an object? Will they be automatically align(16) so that they can be quickly moved into the SSE registers, or will the user have to specify that manually?

Also, what if I don't want my float[4] to be stored in an SSE register, e.g. because I will be treating those four floats as individual floats, and never as a vector?

IMO, float[4] should be left as it is and you should introduce a new vector data type that has all these optimisations. Just because a vector is four floats doesn't mean that all groups of four floats are vectors.
Apr 26 2011
parent Don <nospam nospam.com> writes:
Peter Alexander wrote:
 On 26/04/11 9:01 AM, Don wrote:
 Sean Cavanaugh wrote:
 In many ways the biggest thing I use regularly in game development
 that I would lose by moving to D would be good built-in SIMD support.
 <snip>
 Yes. It is primarily for this reason that we made static arrays
 return-by-value. It is intended that on x86, float[4] will be an SSE1
 register. So it should be possible to write SIMD code with standard
 array operations.
 <snip>
 What about float[4]s that are part of an object? Will they be
 automatically align(16) so that they can be quickly moved into the SSE
 registers, or will the user have to specify that manually?
No special treatment, they just use the alignment for arrays of the type. Which I believe is indeed align(16) in that case.
 Also, what if I don't want my float[4] to be stored in a SSE register 
 e.g. because I will be treating those four floats as individual floats, 
 and never as a vector?
That's a decision for the compiler to make. It'll generate whatever code it thinks is appropriate. (My mention of float[4] being in an SSE register applies ONLY to parameter passing; but it isn't decided yet anyway).
 IMO, float[4] should be left as it is and you should introduce a new 
 vector data type that has all these optimisations. Just because a vector 
 is four floats doesn't mean that all groups of four floats are vectors.
It has absolutely nothing to do with vectors. All groups of floats (of ANY length) benefit from SIMD. D's semantics make it easy to take advantage of SIMD, regardless of what size it is. C's ancient machine model doesn't envisage SIMD, so C compilers are left with a massive abstraction inversion. It's really quite ridiculous that in this area, most mainstream programming languages are still operating at a lower level of abstraction than asm.
Apr 28 2011
prev sibling parent reply Andrew Wiley <wiley.andrew.j gmail.com> writes:
On Fri, Apr 22, 2011 at 12:31 PM, Kai Meyer <kai unixlords.com> wrote:

 On 04/22/2011 11:20 AM, Daniel Gibson wrote:

 Am 22.04.2011 19:11, schrieb Kai Meyer:

 On 04/22/2011 11:05 AM, Daniel Gibson wrote:

 Am 22.04.2011 18:48, schrieb Kai Meyer:


 C/C++ is. There is a purpose and a place for Interpreted languages like
 <snip>
 engine) is written in an interpreted language either, which basically
 means the guts are likely written in either C or C++. The point being
 made is that Systems Programming Languages like C/C++ and D are picked
 for their execution speed, and Interpreted Languages are picked for
 their ease of programming (or development speed). Since D is picked for
 execution speed, we should seriously consider every opportunity to
 improve in that arena. The OP wasn't just for the game developers, but
 for game framework developers as well.
IMHO D won't be successful for games as long as it only supports Windows, Linux and OSX on PC(-like) hardware. We'd need support for modern game consoles (XBOX360, PS3, maybe Wii) and for mobile devices (Android, iOS, maybe Win7 phones and other stuff). This means good PPC support (maybe the PS3's Cell CPU would need special support even though it understands PPC code? I don't know.) and ARM support, plus support for the operating systems and SDKs used on those platforms.

Of course execution speed is very important as well, but D in its current state is not *that* bad in this regard. Sure, the GC is a bit slow, but in high performance games you shouldn't use it (or even malloc/free) all the time anyway, see http://www.digitalmars.com/d/2.0/memory.html#realtime

Another point: I find Minecraft pretty impressive. It really changed my view upon games developed in Java.

Cheers,
- Daniel
Hah, Minecraft. Have you tried loading up a high resolution texture pack yet? There's a reason why it looks like 8-bit graphics. It's not Java that makes Minecraft awesome, imo :)
No I haven't. What I find impressive is this (almost infinitely) big world that is completely changeable, i.e. you can build new stuff everywhere, you can dig tunnels everywhere (ok, somewhere really deep there's a limit) and the game still runs smoothly. Haven't seen something like that in any game before.
The random world generator is amazing, but it's not speed. The polygon count of the game is excruciatingly low because the client is smart enough to only draw the faces of blocks that are visible. From the very bottom (bedrock) to the very top of the sky (as high as you can build blocks) is 256 blocks. The game is full of low-level bit-stuffing (like stacks of 64). The genius of the game is not in any special features of Java, it's in the data structure and data generator, which could be done much faster in other languages. But it begs the question, "why does it need to be faster?" It is "fast enough" in the JVM (unless you load up the high resolution textures, in which case the game becomes unbearably slow when viewing long distances).
Actually, the world is 128 blocks tall, and divided into 16x128x16 block "chunks." To elaborate on the bit stuffing, at the end of the day, each block is 2.5 bytes (type, metadata, and some lighting info) with exceptions for things like chests. The reason Minecraft runs so well in Java, from my point of view, is that the authors resisted the Java urge to throw objects at the problem and instead put everything into large byte arrays and wrote methods to manipulate them. From that perspective, using Java would be about the same as using any language, which let them stick to what they knew without incurring a large performance penalty. However, it's also true that as soon as you try to use a 128x128 texture pack, you very quickly become disillusioned with Minecraft's performance.
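A hedged sketch of the "large byte arrays plus bit-stuffing" approach described above (a hypothetical layout of mine, not Minecraft's actual code): block types live in one flat byte array, and 4-bit metadata is packed two values per byte.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical chunk layout in the spirit described above: one byte of
   block type per cell plus a 4-bit metadata nibble packed two-per-byte.
   (The 2.5 bytes/block figure in the post also counts lighting nibbles,
   which are omitted here.) */
#define CHUNK_VOL (16 * 128 * 16)   /* one 16x128x16 chunk */

typedef struct {
    uint8_t type[CHUNK_VOL];        /* one byte per block */
    uint8_t meta[CHUNK_VOL / 2];    /* two 4-bit values per byte */
} Chunk;

/* Assumed x-y-z ordering; the real ordering is an implementation detail. */
static size_t block_index(int x, int y, int z) {
    return ((size_t)x * 128 + y) * 16 + z;
}

static uint8_t get_meta(const Chunk *c, size_t i) {
    uint8_t b = c->meta[i >> 1];
    return (i & 1) ? (uint8_t)(b >> 4) : (uint8_t)(b & 0x0F);
}

static void set_meta(Chunk *c, size_t i, uint8_t v) {
    uint8_t *b = &c->meta[i >> 1];
    if (i & 1) *b = (uint8_t)((*b & 0x0F) | (v << 4));
    else       *b = (uint8_t)((*b & 0xF0) | (v & 0x0F));
}
```

The whole chunk is two flat arrays with no per-block objects, which is the design decision Andrew credits for Minecraft's performance in Java.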
Apr 22 2011
parent Mike Parker <aldacron gmail.com> writes:
On 4/23/2011 4:22 AM, Andrew Wiley wrote:

 The reason Minecraft runs so well in Java, from my point of view, is
 that the authors resisted the Java urge to throw objects at the problem
 and instead put everything into large byte arrays and wrote methods to
 manipulate them. From that perspective, using Java would be about the
 same as using any language, which let them stick to what they knew
 without incurring a large performance penalty.
FYI, Markus, the author, has been a figure in the Java game development community for years. He was the original client programmer for Wurm Online[1] (where the landscape is 'infinite' and tiled) and a frequent participant in the Java4k competition[2] (with Left4kDead[3] perhaps being his most popular). I think it's a safe assumption that the techniques he put to use in Minecraft were learned from his experiments with the Wurm landscape and with cramming Java games into 4kb. [1] http://www.wurmonline.com/ [2] http://www.java4k.com/index.php?action=home [3] http://www.mojang.com/notch/j4k/l4kd/
Apr 22 2011
prev sibling parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 22/04/2011 18:20, Daniel Gibson wrote:
 Am 22.04.2011 19:11, schrieb Kai Meyer:
 On 04/22/2011 11:05 AM, Daniel Gibson wrote:
 Am 22.04.2011 18:48, schrieb Kai Meyer:

 <snip>
Yes, that is why Minecraft is so appealing, but AFAIK that is more of a game design issue than a technical one. It may not be easy to implement such an engine, but I'm sure many game coders out there could have done it, it's not "rocket" science. Rather, it was the gameplay design idea (and fleshing it out) that made Minecraft unique and popular, AFAIK.

-- 
Bruno Medeiros - Software Engineer
Apr 29 2011
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Sean Cavanaugh:

 In many ways the biggest thing I use regularly in game development that
 I would lose by moving to D would be good built-in SIMD support.
Don has given a nice answer about how D2 plans to face this. To bring into focus what Don was saying, I think a small example will help. This is a C implementation of one of the Computer Shootout benchmarks, which generates a binary PPM image of the Mandelbrot set:
http://shootout.alioth.debian.org/u32/program.php?test=mandelbrot&lang=gcc&id=4

This is an important part of that C version:

typedef double v2df __attribute__ ((vector_size(16))); /* vector of two doubles */

const v2df zero = { 0.0, 0.0 };
const v2df four = { 4.0, 4.0 };

// Constant throughout the program, value depends on N
int bytes_per_row;
double inverse_w;
double inverse_h;

// Program argument: height and width of the image
int N;
// Lookup table for initial real-axis value
v2df *Crvs;
// Mandelbrot bitmap
uint8_t *bitmap;

static void calc_row(int y) {
    uint8_t *row_bitmap = bitmap + (bytes_per_row * y);
    int x;
    const v2df Civ_init = { y*inverse_h-1.0, y*inverse_h-1.0 };

    for (x = 0; x < N; x += 2) {
        v2df Crv = Crvs[x >> 1];
        v2df Civ = Civ_init;
        v2df Zrv = zero;
        v2df Ziv = zero;
        v2df Trv = zero;
        v2df Tiv = zero;
        int i = 50;
        int two_pixels;
        v2df is_still_bounded;

        do {
            Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
            Zrv = Trv - Tiv + Crv;
            Trv = Zrv * Zrv;
            Tiv = Ziv * Ziv;

            // All bits will be set to 1 if 'Trv + Tiv' is less than 4
            // and all bits will be set to 0 otherwise. Two elements
            // are calculated in parallel here.
            is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);

            // Move the sign-bit of the low element to bit 0, move the
            // sign-bit of the high element to bit 1. The result is
            // that the pixel will be set if the calculation was
            // bounded.
            two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
        } while (--i > 0 && two_pixels);

        // The pixel bits must be in the most and second most
        // significant position
        two_pixels <<= 6;

        // Add the two pixels to the bitmap, all bits are
        // initially zero since the area was allocated with calloc()
        row_bitmap[x >> 3] |= (uint8_t) (two_pixels >> (x & 7));
    }
}

GCC 4.6 compiles the inner do-while loop of calc_row() to just this very clean assembly, which in my opinion is quite _beautiful_; it shows one of the most important final purposes of a good compiler:

L9:
    subl     $1, %ecx
    addpd    %xmm0, %xmm0
    mulpd    %xmm0, %xmm1
    movapd   %xmm4, %xmm0
    addpd    %xmm6, %xmm1
    addpd    %xmm5, %xmm0
    subpd    %xmm3, %xmm0
    movapd   %xmm1, %xmm3
    movapd   %xmm0, %xmm4
    mulpd    %xmm1, %xmm3
    mulpd    %xmm0, %xmm4
    movapd   %xmm3, %xmm2
    addpd    %xmm4, %xmm2
    cmplepd  %xmm7, %xmm2
    movmskpd %xmm2, %ebx
    je       L18
    testl    %ebx, %ebx
    jne      L9

Those addpd, subpd, mulpd, movapd, etc. instructions work on pairs of doubles (those v2df). And the code uses the cmplepd and movmskpd instructions too, in a very clean way that I think not even GCC 4.6 is normally able to produce by itself.

A good language + compiler have many purposes, but producing asm code like that is one of the most important, especially if you write numerical code. A numerical programmer really wants to write code that somehow produces equally clean and powerful code (or better, using AVX 256-bit registers and 3-way instructions) in numerical processing kernels (often such kernels are small, often just the bodies of inner loops).

D2 allows you to write code almost as clean as this C one (but I think currently no D compiler is able to turn this into clean inlined addpd, subpd, mulpd, movapd instructions; this is a compiler issue, not a language one):

v2df Zrv = zero;
...
Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;

In D it becomes:

double[2] Zrv = zero;
...
Ziv[] = (Zrv[] * Ziv[]) + (Zrv[] * Ziv[]) + Civ[];
Zrv[] = Trv[] - Tiv[] + Crv[];
Trv[] = Zrv[] * Zrv[];
Tiv[] = Ziv[] * Ziv[];

But then how do you write this in a clean way in D2/D3?

do {
    ...
    is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
    two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
} while (--i > 0 && two_pixels);

Using those __builtin_ia32_cmplepd() and __builtin_ia32_movmskpd() is not easy, so there is a tradeoff between allowing easy-to-write code and giving power. It's acceptable for a language to give a bit less power if the code is simpler to write. Yet, in a systems language, if you don't give people a way to produce asm code as clean as the one I've shown in the inner loops of numerical processing code, some D2 programmers will be forced to write inline asm, and that's sometimes worse than using intrinsics like __builtin_ia32_cmplepd(). Writing efficient inner loops is very important for numerical processing code, and I think numerical processing code is important for D2.

Some time ago I suggested extending the D2 vector operations to code like this, but I think this is still not enough:

float[4] a, b, c, d;
c = a[] == b[];
d = a[] >= b[];

Bye,
bearophile
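For reference, the two GCC builtins in the snippet above correspond to standard Intel intrinsic names (`__builtin_ia32_cmplepd` is `_mm_cmple_pd`, `__builtin_ia32_movmskpd` is `_mm_movemask_pd`), which is how the same bound test would usually be written portably across compilers. This is a sketch; `still_bounded` is my own name:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Two doubles are tested against 4.0 in parallel. cmplepd sets a lane to
   all-ones where Trv+Tiv <= 4; movmskpd collects the two lane sign bits
   into an int, so bit i is set iff lane i is still bounded. */
static int still_bounded(__m128d trv, __m128d tiv) {
    const __m128d four = _mm_set1_pd(4.0);
    __m128d mask = _mm_cmple_pd(_mm_add_pd(trv, tiv), four);
    return _mm_movemask_pd(mask);   /* 0..3: one bit per lane */
}
```

A nonzero return means at least one of the two pixels should keep iterating, which is exactly the loop condition in the C benchmark.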
May 03 2011
parent reply "qznc" <qznc web.de> writes:
On Tuesday, 3 May 2011 at 20:51:37 UTC, bearophile wrote:
 Sean Cavanaugh:

 In many ways the biggest thing I use regularly in game 
 development that I would lose by moving to D would be good 
 built-in SIMD support.
 <snip>
Just found this old post, since I'm tuning mandelbrot.d right now [0].

The good news: LDC produces code which is quite close to the C version.

    mulsd   %xmm6,%xmm4
    subsd   %xmm1,%xmm7
    addsd   %xmm4,%xmm4
    addsd   %xmm5,%xmm7
    addsd   %xmm0,%xmm4
    movaps  %xmm7,%xmm6
    mulsd   %xmm6,%xmm6
    movaps  %xmm4,%xmm2
    mulsd   %xmm2,%xmm2
    movaps  %xmm2,%xmm1
    addsd   %xmm6,%xmm1
    ucomisd %xmm1,%xmm3
    jb      4026f0 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0x130>
    jl      402680 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0xc0>

Even better, the code is produced from the following (inlined!) source, which is pretty much the mathematical definition:

    for (auto i = 0; i < iter && norm(Z) <= lim; i++)
        Z = Z*Z + C;

The bad news: cmplepd and movmskpd are not used. Is that possible somehow, four years later? The gcc code is roughly twice as fast at the moment, but I don't know if cmplepd and movmskpd are the only things missing.

[0] https://github.com/qznc/d-shootout
Sep 02 2015
parent "David Nadlinger" <code klickverbot.at> writes:
On Wednesday, 2 September 2015 at 19:04:10 UTC, qznc wrote:
 The bad news: cmplepd and movmskpd are not used. Is that 
 possible somehow four years later?
I just checked, and LLVM does not know how to automatically vectorize that loop. You would need to write it manually using vector types (like in the C version).
 [0] https://github.com/qznc/d-shootout
As a general note, you might want to add "-boundscheck=off -mcpu=native" to the flags for LDC too, for a fair comparison to the other compilers. Also, if you use the DMD-style flags (e.g. -O -inline), you should use the ldmd2 wrapper instead of ldc2. You might also want to use the 2.067 branch of ldc2 (just released as an alpha version) for better comparability with DMD. — David
Sep 02 2015