
digitalmars.D - OOP, faster data layouts, compilers

reply bearophile <bearophileHUGS lycos.com> writes:
Through Reddit I've found a set of wordy slides, "Design for Performance", on
designing efficient games code:
http://www.scribd.com/doc/53483851/Design-for-Performance
http://www.reddit.com/r/programming/comments/guyb2/designing_code_for_performance/

The slides touch on many small topics, like the need for prefetching, design
for cache-aware code, etc. One of the main topics is how to better lay out data
structures in memory for modern CPUs. It shows how object oriented style often
leads to collections of little trees, for example arrays of object references
(or struct pointers) that refer to objects that contain other references to sub
parts. Iterating over such data structures is not so efficient.

The slides also discuss a little the difference between creating an array of
2-item structs, or a struct that contains two arrays of single native values.
If the code needs to scan just one of those two fields, then the struct that
contains the two arrays is faster.

Similar topics were discussed better in "Pitfalls of Object Oriented
Programming" (2009):
http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

In my opinion, if D2 has some success then one of its significant usages will be
writing fast games, so the design/performance concerns expressed in those two
sets of slides should be important for D's design.

D probably already allows laying out data in memory as shown in those slides, but
I'd like some help from the compiler too. I don't think compilers will soon be
able to turn an immutable binary tree into an array to speed up its repeated
scanning, but maybe there are ways to express semantics in the code that will
allow future smarter compilers to perform some of those memory layout
optimizations, like transposing arrays. A possible idea is a
 @no_inbound_pointers annotation that forbids taking the address of the items,
and allows the compiler to modify the data layout a little.

Bye,
bearophile
Apr 21 2011
next sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
Many thanks for the links, they provide very nice discussions.

Especially the link below, which you can follow from your first link:
http://c0de517e.blogspot.com/2011/04/2011-current-and-future-programming.html

But as far as game development is concerned, D2 might already be too late.

I know a bit about it, since I live a bit in that part of the universe.

Due to XNA (Windows and XBox 360), Mono/Unity, and now WP7, many game studios
have been adopting C#, even using it for the server side code.

Java used to have a foot there, especially due to J2ME game development,
with a small push thanks to Android. Which has decreased since Google made
the NDK available.

C# might actually be the next C++, at least as far as game development is
concerned.

And the dependency on a JIT environment is an implementation issue. The 
Bartok compiler in Singularity
compiles to native code, and Mono also provides a similar option.

So who knows?

--
Paulo



"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:ioqdhe$2030$1 digitalmars.com...
 Through Reddit I've found a set of wordy slides, "Design for Performance",
 on designing efficient games code:
 <snip>
Apr 22 2011
parent reply Kai Meyer <kai unixlords.com> writes:
On 04/22/2011 02:55 AM, Paulo Pinto wrote:
 Many thanks for the links, they provide very nice discussions.

 <snip>
[...] C/C++ is. There is a purpose and a place for Interpreted languages like
[...] engine) is written in an interpreted language either, which basically
means the guts are likely written in either C or C++.

The point being made is that Systems Programming Languages like C/C++ and D
are picked for their execution speed, and Interpreted Languages are picked for
their ease of programming (or development speed). Since D is picked for
execution speed, we should seriously consider every opportunity to improve in
that arena. The OP wasn't just for the game developers, but for game framework
developers as well.
Apr 22 2011
parent reply Daniel Gibson <metalcaedes gmail.com> writes:
Am 22.04.2011 18:48, schrieb Kai Meyer:
 

 <snip>
IMHO D won't be successful for games as long as it only supports Windows,
Linux and OSX on PC(-like) hardware. We'd need support for modern game
consoles (XBOX360, PS3, maybe Wii) and for mobile devices (Android, iOS, maybe
Win7 phones and other stuff). This means good PPC support (maybe the PS3's
Cell CPU would need special support even though it understands PPC code? I
don't know.) and ARM support, plus support for the operating systems and SDKs
used on those platforms.

Of course execution speed is very important as well, but D in its current
state is not *that* bad in this regard. Sure, the GC is a bit slow, but in
high performance games you shouldn't use it (or even malloc/free) all the
time, anyway; see
http://www.digitalmars.com/d/2.0/memory.html#realtime

Another point: I find Minecraft pretty impressive. It really changed my view
upon games developed in Java.

Cheers,
- Daniel
Apr 22 2011
parent reply Kai Meyer <kai unixlords.com> writes:
On 04/22/2011 11:05 AM, Daniel Gibson wrote:
 Am 22.04.2011 18:48, schrieb Kai Meyer:

 <snip>

 Another point: I find Minecraft pretty impressive. It really changed my
 view upon Games developed in Java.
Hah, Minecraft. Have you tried loading up a high resolution texture pack yet? There's a reason why it looks like 8-bit graphics. It's not Java that makes Minecraft awesome, imo :)
Apr 22 2011
parent reply Daniel Gibson <metalcaedes gmail.com> writes:
Am 22.04.2011 19:11, schrieb Kai Meyer:
 On 04/22/2011 11:05 AM, Daniel Gibson wrote:
 Am 22.04.2011 18:48, schrieb Kai Meyer:

 <snip>
Hah, Minecraft. Have you tried loading up a high resolution texture pack yet? There's a reason why it looks like 8-bit graphics. It's not Java that makes Minecraft awesome, imo :)
No I haven't. What I find impressive is this (almost infinitely) big world
that is completely changeable, i.e. you can build new stuff everywhere, you
can dig tunnels everywhere (ok, somewhere really deep there's a limit), and
the game still runs smoothly. I haven't seen anything like that in any game
before.
Apr 22 2011
next sibling parent reply Kai Meyer <kai unixlords.com> writes:
On 04/22/2011 11:20 AM, Daniel Gibson wrote:
 Am 22.04.2011 19:11, schrieb Kai Meyer:
 On 04/22/2011 11:05 AM, Daniel Gibson wrote:
 Am 22.04.2011 18:48, schrieb Kai Meyer:

 <snip>
Hah, Minecraft. Have you tried loading up a high resolution texture pack yet? There's a reason why it looks like 8-bit graphics. It's not Java that makes Minecraft awesome, imo :)
No I haven't. What I find impressive is this (almost infinitely) big world that is completely changeable, i.e. you can build new stuff everywhere, you can dig tunnels everywhere (ok, somewhere really deep there's a limit) and the game still runs smoothly. Haven't seen something like that in any game before.
The random world generator is amazing, but it's not speed. The polygon count
of the game is excruciatingly low because the client is smart enough to only
draw the faces of blocks that are visible. From the very bottom (bedrock) to
the very top of the sky (as high as you can build blocks) is 256 blocks. The
game is full of low-level bit-stuffing (like stacks of 64).

The genius of the game is not in any special features of Java, it's in the
data structure and data generator, which could be done much faster in other
languages. But that begs the question, "why does it need to be faster?" It is
"fast enough" in the JVM (unless you load up the high resolution textures, in
which case the game becomes unbearably slow when viewing long distances).

The purpose of the original post was to indicate that some low level research
shows that underlying data structures (as applied to video game development)
can have an impact on the performance of the application, which D (I think)
cares very much about.
Apr 22 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Kai Meyer:

 The purpose of the original post was to indicate that some low level 
 research shows that underlying data structures (as applied to video game 
 development) can have an impact on the performance of the application, 
 which D (I think) cares very much about.
The idea of the original post was a bit more complex: how can we invent
new/better ways to express semantics in D code that will not forbid future D
compilers from performing a bit of change in the layout of data structures to
increase code performance?

Complex transforms of the data layout seem too complex even for a good
compiler, but maybe simpler ones will be possible. And I think to do this the
D code needs some more semantics. I was suggesting an annotation that forbids
inbound pointers, which allows the compiler to move data around a little, but
this is just a start.

Bye,
bearophile
Apr 22 2011
parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 4/22/2011 2:20 PM, bearophile wrote:
 <snip>
In many ways the biggest thing I use regularly in game development that I
would lose by moving to D would be good built-in SIMD support. The PC
compilers from MS and Intel both have intrinsic data types and instructions
that cover all the operations from SSE1 up to AVX. The intrinsics are nice in
that the job of register allocation and scheduling is given to the compiler,
and generally the code it outputs is good enough (though it needs to be
watched at times).

Unlike ASM, intrinsics can be inlined, so your math library can provide a
platform abstraction at that layer before building up to larger operations
(like vectorized forms of sin, cos, etc.) and algorithms (like frustum cull
checks, k-dop polygon collision, etc.), which makes porting and reusing the
algorithms on other platforms much easier, as only the low level layer needs
to be ported, and only outliers at the algorithm level need to be tweaked
after you get it up and running.

On the consoles there is AltiVec (VMX), which is very similar to SSE in many
ways. The common ground is basically SSE1 tier operations: 128 bit values
operating on 4x32 bit integer and 4x32 bit float data. 64 bit AMD/Intel makes
SSE2 the minimum standard, and a systems language on those platforms should
reflect that.

Loading and storing is comparable across platforms, with similar alignment
restrictions or penalties for working with unaligned data.
Packing/swizzle/shuffle/permuting are different, but this is not a huge
problem for most algorithms. The lack of fused multiply-add on the Intel side
can be worked around or abstracted (i.e. always write code as if it existed,
and have the Intel version expand to multiple ops).

And now my wish list:

If you have worked with shader programming through HLSL or CG, the
expressiveness of doing the work in SIMD is very high. If I could write
something that looked exactly like HLSL but was integrated perfectly in a
language like D or C++, it would be pretty huge to me. The amount of math you
can have in a line or two in HLSL is mind boggling at times, yet extremely
intuitive and rather easy to debug.
Apr 22 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Sean Cavanaugh:

 In many ways the biggest thing I use regularly in game development that
 I would lose by moving to D would be good built-in SIMD support.  The PC
 compilers from MS and Intel both have intrinsic data types and
 instructions that cover all the operations from SSE1 up to AVX.  The
 intrinsics are nice in that the job of register allocation and
 scheduling is given to the compiler and generally the code it outputs is
 good enough (though it needs to be watched at times).
This is a topic quite different from the one I was talking about, but it's an
interesting topic :-)

SIMD intrinsics look ugly, they add a lot of noise to the code, and they are
very specific to one CPU or instruction set. You can't design a clean language
with hundreds of those. Once 256 or 512 bit registers come, you need to add
new intrinsics and change your code to use them. This is not so good.

D array operations are probably meant to become smarter, so when you perform:

int[8] a, b, c;
a[] = b[] + c[];

a future good D compiler may use just two inlined instructions, or little
more. This will probably include shuffling and broadcasting properties too.
Maybe this kind of code is not as efficient as handwritten assembly code (or C
code that uses SIMD intrinsics), but it's adaptable to different CPUs, future
ones too, it's much less noisy, and it seems safer.

I think such optimizations are better left to the back-end, so a long time ago
I asked the LLVM devs about it, for future LDC:
http://llvm.org/bugs/show_bug.cgi?id=6956

The presence of such well implemented vector ops will not forbid another D
compiler from adding true SIMD intrinsics too.
 Unlike ASM, intrinsics can be inlined so your math library can provide a
 platform abstraction at that layer
DMD may eventually need this feature of the LDC compiler:
http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

Bye,
bearophile
Apr 22 2011
parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 4/22/2011 4:41 PM, bearophile wrote:
 <snip>
In C++ the intrinsics are easily wrapped by __forceinline global functions, to
provide a platform abstraction against the intrinsics. Then you can write
class wrappers to provide the most common level of functionality, which boils
down to a class to do vectorized math operators for + - * / and vectorized
comparison functions == != >= <= < and >. From HLSL you have to borrow the
'any' and 'all' statements (along with variations for every permutation of the
bitmask of the test result) to do conditional branching for the tests. This
pretty much leaves swizzle/shuffle/permuting and outlying features (8, 16, 64
bit integers) in the realm of 'ugly'. From here you could build up portable
SIMD transcendental functions (sin, cos, pow, log, etc.), and other libraries
(matrix multiplication, inversion, quaternions, etc.).

I would say in D this could be faked provided the language at a minimum
understood what a 128 bit (SSE1 through 4.2) and a 256 bit (AVX) value was and
how to efficiently move it via registers for function calls. Kind of a 'make
it at least work in the ABI, come back to a good implementation later'
solution. There is some room to beat Microsoft here, as the code Visual Studio
2010 currently outputs for 64 bit environments cannot pass 128 bit SIMD values
by register (__forceinline functions are the only workaround), even though
scalar 32 and 64 bit float values are passed by XMM register just fine.

The current hardware landscape dictates organizing your data in SIMD friendly
manners. Naive OOP based code is going to de-reference too many pointers to
get to scattered data. This makes the hardware prefetcher work too hard, and
it wastes cache memory by only using a fraction of the RAM from the cache
line, plus wasting 75-90% of the bandwidth and memory on the machine.
 D array operations are probably meant to become smarter, when you perform a:

 int[8] a, b, c;
 a = b + c;
Now the original topic pertains to data layouts, with which SIMD, the CPU
cache, and efficient code all inter-relate. I would argue the above code is an
idealistic example, as when writing SIMD code you almost always have to
transpose or rotate one of the sets of data to work in parallel across the
other one.

What happens when this code has to branch? In SIMD land you have to test if
any or all 4 lanes of SIMD data need to take it. And a lot of the time the
best course of action is to compute the other code path in addition to the
first one, AND the first result, NAND the second one, and OR the results
together to make valid output. I could maybe see a functional language doing
OK at this.

The only reasonable construct able to convey how common this is in optimized
SIMD code is HLSL's vectorized ternary operator (understanding that 'a' and
'b' can be fairly intricate chunks of code if you are clever):

float4 a = {1,2,3,4};
float4 b = {5,6,7,8};
float4 c = {-1,0,1,2};
float4 d = {0,0,0,0};
float4 foo = (c > d) ? a : b;

results with foo = {5,6,3,4}

For a lot of algorithms the 'a' and 'b' paths have similar cost, so for SIMD
it executes about 2x faster than the scalar case, although better than 2x
gains are possible since using SIMD also naturally reduces or eliminates a ton
of branching, which CPUs don't really like to do due to their long pipelines.

And as much as Intel likes to argue that a structure containing positions for
a particle system should look like the following because it makes their
hardware benchmarks awesome, this vertex layout is a failure:

struct ParticleVertex
{
    float[1000] XPos;
    float[1000] YPos;
    float[1000] ZPos;
}

The GPU (or audio devices) does not consume it this way. The data is also not
cache coherent if you are trying to read or write a single vertex out of the
structure.

A hybrid structure which is aware of the size of a SIMD register is the next
logical choice:

align(16) struct ParticleVertex
{
    float[4] XPos;
    float[4] YPos;
    float[4] ZPos;
}
ParticleVertex[250] ParticleVertices;
// struct is also now 75% of a 64 byte cache line
// Also, 2 of any 4 random accesses for a vertex are in the same
// cache line, and only 2 cache lines are touched in the worst case

But this hybrid structure still has to be shuffled before being given to a GPU
(albeit in much more bite-size increments that could easily
read-shuffle-write at the same speed as a platform-optimized memcpy).

Things get real messy when you have multiple vertex attributes, as decisions
to keep them together or separate are conflicting and both choices make sense
to different systems :)
Apr 22 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Sean Cavanaugh:

 In C++ the intrinsics are easily wrapped by __forceinline global
 functions, to provide a platform abstraction against the intrinsics.
When AVX becomes 512 bits wide, or you need to use a very different set of
vector registers, your global functions need to change, so the code that calls
them has to change too. This is acceptable for library code, but it's not good
for D built-in operations. D built-in vector ops need to be more clean,
general and long-lasting, even if they may not fully replace SSE intrinsics.
 I would say in D this could be faked provided the language at a minimum
 understood what a 128 (SSE1 through 4.2) and 256 bit value (AVX) was and
 how to efficiently move it via registers for function calls.
Also think about what the D ABI will be 15-25 years from now. D design must look a bit more forward too.
 Now the original topic pertains to data layouts,
It was about how to not preclude future D compilers from shuffling data around a bit by themselves :-)
 I would argue the above
 code is an idealistic example, as when writing SIMD code you almost
 always have to transpose or rotate one of the sets of data to work in
 parallel across the other one.
Right.
 float4 a = {1,2,3,4};
 float4 b = {5,6,7,8};
 float4 c = {-1,0,1,2};
 float4 d = {0,0,0,0};
 float4 foo = (c > d) ? a : b;
Recently I have asked for a D vector comparison operation too (the compiler is
supposed to be able to split them into register-sized chunks for the
comparisons); this is good for AVX instructions (a little problem here is that
I think currently DMD allocates memory on the heap to instantiate those four
little arrays):

int[4] a = [1,2,3,4];
int[4] b = [5,6,7,8];
int[4] c = [-1,0,1,2];
int[4] d = [0,0,0,0];
int[4] foo = (c[] > d[]) ? a[] : b[];
 Things get real messy when you have multiple vertex attributes as
 decisions to keep them together or separate are conflicting and both
 choices make sense to different systems :)
It's not easy for future compilers to perform similar auto-vectorizations :-)

Bye and thank you for your answer,
bearophile
Apr 22 2011
prev sibling parent reply Don <nospam nospam.com> writes:
Sean Cavanaugh wrote:
  In many ways the biggest thing I use regularly in game development
  that I would lose by moving to D would be good built-in SIMD support.
  <snip>
Yes. It is primarily for this reason that we made static arrays
return-by-value. It is intended that on x86, a float[4] will be an SSE1
register, so it should be possible to write SIMD code with standard array
operations. (Note that this is *much* easier for the compiler than trying to
vectorize scalar code.) This gives syntax like:

float[4] a, b, c;
a[] += b[] * c[];

(currently works, but doesn't use SSE, so has dismal performance).
 
 Loading and storing is comparable across platforms with similar 
 alignment restrictions or penalties for working with unaligned data. 
 Packing/swizzle/shuffle/permuting are different but this is not a huge 
 problem for most algorithms.  The lack of fused multiply and add on the 
 Intel side can be worked around or abstracted (i.e. always write code as 
 if it existed, have the Intel version expand to multiple ops).
 
 And now my wish list:
 
 If you have worked with shader programming through HLSL or CG the 
 expressiveness of doing the work in SIMD is very high.  If I could write 
 something that looked exactly like HLSL but it was integrated perfectly 
 in a language like D or C++, it would be pretty huge to me.  The amount 
 of math you can have in a line or two in HLSL is mind boggling at times, 
 yet extremely intuitive and rather easy to debug.
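The fused multiply-add workaround mentioned above ("always write code as if it existed") might look like this in C; `vec4_madd` is a hypothetical name, while `_mm_fmadd_ps` is the real FMA3 intrinsic used where the hardware supports it:

```c
#include <immintrin.h>  /* pulls in SSE and, where enabled, FMA intrinsics */

/* Portable multiply-add: algorithms always call vec4_madd as if fused
   multiply-add existed. On plain SSE it expands to mul + add; when
   compiled with FMA support (e.g. -mfma) it maps to the single fused op. */
static inline __m128 vec4_madd(__m128 a, __m128 b, __m128 c) {
#ifdef __FMA__
    return _mm_fmadd_ps(a, b, c);            /* a*b + c, one fused op */
#else
    return _mm_add_ps(_mm_mul_ps(a, b), c);  /* a*b + c, two ops */
#endif
}
```

Algorithm code compiled either way is unchanged; only this shim differs per target, which is the abstraction Sean describes.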
Apr 26 2011
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 26/04/11 9:01 AM, Don wrote:
 Sean Cavanaugh wrote:
 In many ways the biggest thing I use regularly in game development
 that I would lose by moving to D would be good built-in SIMD support.
 <snip>
 Yes. It is primarily for this reason that we made static arrays
 return-by-value. It is intended that on x86, float[4] will be an SSE1
 register. So it should be possible to write SIMD code with standard
 array operations.
 <snip>
What about float[4]s that are part of an object? Will they be automatically align(16) so that they can be quickly moved into the SSE registers, or will the user have to specify that manually?

Also, what if I don't want my float[4] to be stored in an SSE register, e.g. because I will be treating those four floats as individual floats, and never as a vector?

IMO, float[4] should be left as it is and you should introduce a new vector data type that has all these optimisations. Just because a vector is four floats doesn't mean that all groups of four floats are vectors.
Apr 26 2011
parent Don <nospam nospam.com> writes:
Peter Alexander wrote:
 On 26/04/11 9:01 AM, Don wrote:
 Sean Cavanaugh wrote:
 In many ways the biggest thing I use regularly in game development
 that I would lose by moving to D would be good built-in SIMD support.
 <snip>
 Yes. It is primarily for this reason that we made static arrays
 return-by-value. It is intended that on x86, float[4] will be an SSE1
 register. So it should be possible to write SIMD code with standard
 array operations.
 <snip>
 What about float[4]s that are part of an object? Will they be
 automatically align(16) so that they can be quickly moved into the SSE
 registers, or will the user have to specify that manually?
No special treatment, they just use the alignment for arrays of the type. Which I believe is indeed align(16) in that case.
 Also, what if I don't want my float[4] to be stored in a SSE register 
 e.g. because I will be treating those four floats as individual floats, 
 and never as a vector?
That's a decision for the compiler to make. It'll generate whatever code it thinks is appropriate. (My mention of float[4] being in an SSE register applies ONLY to parameter passing; but it isn't decided yet anyway).
 IMO, float[4] should be left as it is and you should introduce a new 
 vector data type that has all these optimisations. Just because a vector 
 is four floats doesn't mean that all groups of four floats are vectors.
It has absolutely nothing to do with vectors. All groups of floats (of ANY length) benefit from SIMD. D's semantics make it easy to take advantage of SIMD, regardless of what size it is. C's ancient machine model doesn't envisage SIMD, so C compilers are left with a massive abstraction inversion. It's really quite ridiculous that in this area, most mainstream programming languages are still operating at a lower level of abstraction than asm.
Apr 28 2011
prev sibling parent reply Andrew Wiley <wiley.andrew.j gmail.com> writes:
On Fri, Apr 22, 2011 at 12:31 PM, Kai Meyer <kai unixlords.com> wrote:

 On 04/22/2011 11:20 AM, Daniel Gibson wrote:

 Am 22.04.2011 19:11, schrieb Kai Meyer:

 On 04/22/2011 11:05 AM, Daniel Gibson wrote:

 Am 22.04.2011 18:48, schrieb Kai Meyer:


 C/C++ is. There is a purpose and a place for Interpreted languages like
 <snip>
 engine) is written in an interpreted language either, which basically
 means the guts are likely written in either C or C++. The point being
 made is that Systems Programming Languages like C/C++ and D are picked
 for their execution speed, and Interpreted Languages are picked for
 their ease of programming (or development speed). Since D is picked for
 execution speed, we should seriously consider every opportunity to
 improve in that arena. The OP wasn't just for the game developers, but
 for game framework developers as well.
IMHO D won't be successful for games as long as it only supports Windows, Linux and OSX on PC(-like) hardware. We'd need support for modern game consoles (XBOX360, PS3, maybe Wii) and for mobile devices (Android, iOS, maybe Win7 phones and other stuff). This means good PPC support (maybe the PS3's Cell CPU would need special support even though it understands PPC code? I don't know.) and ARM support, plus support for the operating systems and SDKs used on those platforms.

Of course execution speed is very important as well, but D in its current state is not *that* bad in this regard. Sure, the GC is a bit slow, but in high performance games you shouldn't use it (or even malloc/free) all the time anyway, see http://www.digitalmars.com/d/2.0/memory.html#realtime

Another point: I find Minecraft pretty impressive. It really changed my view upon games developed in Java.

Cheers,
- Daniel
Hah, Minecraft. Have you tried loading up a high resolution texture pack yet? There's a reason why it looks like 8-bit graphics. It's not Java that makes Minecraft awesome, imo :)
No I haven't. What I find impressive is this (almost infinitely) big world that is completely changeable, i.e. you can build new stuff everywhere, you can dig tunnels everywhere (ok, somewhere really deep there's a limit) and the game still runs smoothly. Haven't seen something like that in any game before.
The random world generator is amazing, but it's not speed. The polygon count of the game is excruciatingly low because the client is smart enough to only draw the faces of blocks that are visible. From the very bottom (bedrock) to the very top of the sky (as high as you can build blocks) is 256 blocks. The game is full of low-level bit-stuffing (like stacks of 64). The genius of the game is not in any special features of Java, it's in the data structure and data generator, which could be done much faster in other languages. But it begs the question, "why does it need to be faster?" It is "fast enough" in the JVM (unless you load up the high resolution textures, in which case the game becomes unbearably slow when viewing long distances).
Actually, the world is 128 blocks tall, and divided into 16x128x16 block "chunks." To elaborate on the bit stuffing, at the end of the day, each block is 2.5 bytes (type, metadata, and some lighting info) with exceptions for things like chests. The reason Minecraft runs so well in Java, from my point of view, is that the authors resisted the Java urge to throw objects at the problem and instead put everything into large byte arrays and wrote methods to manipulate them. From that perspective, using Java would be about the same as using any language, which let them stick to what they knew without incurring a large performance penalty. However, it's also true that as soon as you try to use a 128x128 texture pack, you very quickly become disillusioned with Minecraft's performance.
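A hedged sketch of the "large byte arrays plus bit-stuffing" approach described above (a hypothetical layout of mine, not Minecraft's actual code): block types live in one flat byte array, and 4-bit metadata is packed two values per byte.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical chunk layout in the spirit described above: one byte of
   block type per cell plus a 4-bit metadata nibble packed two-per-byte.
   (The 2.5 bytes/block figure in the post also counts lighting nibbles,
   which are omitted here.) */
#define CHUNK_VOL (16 * 128 * 16)   /* one 16x128x16 chunk */

typedef struct {
    uint8_t type[CHUNK_VOL];        /* one byte per block */
    uint8_t meta[CHUNK_VOL / 2];    /* two 4-bit values per byte */
} Chunk;

/* Assumed x-y-z ordering; the real ordering is an implementation detail. */
static size_t block_index(int x, int y, int z) {
    return ((size_t)x * 128 + y) * 16 + z;
}

static uint8_t get_meta(const Chunk *c, size_t i) {
    uint8_t b = c->meta[i >> 1];
    return (i & 1) ? (uint8_t)(b >> 4) : (uint8_t)(b & 0x0F);
}

static void set_meta(Chunk *c, size_t i, uint8_t v) {
    uint8_t *b = &c->meta[i >> 1];
    if (i & 1) *b = (uint8_t)((*b & 0x0F) | (v << 4));
    else       *b = (uint8_t)((*b & 0xF0) | (v & 0x0F));
}
```

The whole chunk is two flat arrays with no per-block objects, which is the design decision Andrew credits for Minecraft's performance in Java.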
Apr 22 2011
parent Mike Parker <aldacron gmail.com> writes:
On 4/23/2011 4:22 AM, Andrew Wiley wrote:

 The reason Minecraft runs so well in Java, from my point of view, is
 that the authors resisted the Java urge to throw objects at the problem
 and instead put everything into large byte arrays and wrote methods to
 manipulate them. From that perspective, using Java would be about the
 same as using any language, which let them stick to what they knew
 without incurring a large performance penalty.
FYI, Markus, the author, has been a figure in the Java game development community for years. He was the original client programmer for Wurm Online[1] (where the landscape is 'infinite' and tiled) and a frequent participant in the Java4k competition[2] (with Left4kDead[3] perhaps being his most popular). I think it's a safe assumption that the techniques he put to use in Minecraft were learned from his experiments with the Wurm landscape and with cramming Java games into 4kb. [1] http://www.wurmonline.com/ [2] http://www.java4k.com/index.php?action=home [3] http://www.mojang.com/notch/j4k/l4kd/
Apr 22 2011
prev sibling parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 22/04/2011 18:20, Daniel Gibson wrote:
 Am 22.04.2011 19:11, schrieb Kai Meyer:
 On 04/22/2011 11:05 AM, Daniel Gibson wrote:
 Am 22.04.2011 18:48, schrieb Kai Meyer:

 <snip>
Yes, that is why Minecraft is so appealing, but AFAIK that is more of a game design issue than a technical one. It may not be easy to implement such an engine, but I'm sure many game coders out there could have done it, it's not "rocket" science. Rather, it was the gameplay design idea (and fleshing it out) that made Minecraft unique and popular, AFAIK.

-- 
Bruno Medeiros - Software Engineer
Apr 29 2011
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Sean Cavanaugh:

 In many ways the biggest thing I use regularly in game development that
 I would lose by moving to D would be good built-in SIMD support.
Don has given a nice answer about how D2 plans to face this. To bring into focus what Don was saying, I think a small example will help. This is a C implementation of one of the Computer Shootout benchmarks, which generates a binary PPM image of the Mandelbrot set:
http://shootout.alioth.debian.org/u32/program.php?test=mandelbrot&lang=gcc&id=4

This is an important part of that C version:

typedef double v2df __attribute__ ((vector_size(16))); /* vector of two doubles */

const v2df zero = { 0.0, 0.0 };
const v2df four = { 4.0, 4.0 };

// Constant throughout the program, value depends on N
int bytes_per_row;
double inverse_w;
double inverse_h;

// Program argument: height and width of the image
int N;
// Lookup table for initial real-axis value
v2df *Crvs;
// Mandelbrot bitmap
uint8_t *bitmap;

static void calc_row(int y) {
    uint8_t *row_bitmap = bitmap + (bytes_per_row * y);
    int x;
    const v2df Civ_init = { y*inverse_h-1.0, y*inverse_h-1.0 };

    for (x = 0; x < N; x += 2) {
        v2df Crv = Crvs[x >> 1];
        v2df Civ = Civ_init;
        v2df Zrv = zero;
        v2df Ziv = zero;
        v2df Trv = zero;
        v2df Tiv = zero;
        int i = 50;
        int two_pixels;
        v2df is_still_bounded;

        do {
            Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
            Zrv = Trv - Tiv + Crv;
            Trv = Zrv * Zrv;
            Tiv = Ziv * Ziv;

            // All bits will be set to 1 if 'Trv + Tiv' is less than 4
            // and all bits will be set to 0 otherwise. Two elements
            // are calculated in parallel here.
            is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);

            // Move the sign-bit of the low element to bit 0, move the
            // sign-bit of the high element to bit 1. The result is
            // that the pixel will be set if the calculation was
            // bounded.
            two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
        } while (--i > 0 && two_pixels);

        // The pixel bits must be in the most and second most
        // significant position
        two_pixels <<= 6;

        // Add the two pixels to the bitmap, all bits are
        // initially zero since the area was allocated with calloc()
        row_bitmap[x >> 3] |= (uint8_t) (two_pixels >> (x & 7));
    }
}

GCC 4.6 compiles the inner do-while loop of calc_row() to just this very clean assembly, which in my opinion is quite _beautiful_; it shows one of the most important final purposes of a good compiler:

L9:
    subl     $1, %ecx
    addpd    %xmm0, %xmm0
    mulpd    %xmm0, %xmm1
    movapd   %xmm4, %xmm0
    addpd    %xmm6, %xmm1
    addpd    %xmm5, %xmm0
    subpd    %xmm3, %xmm0
    movapd   %xmm1, %xmm3
    movapd   %xmm0, %xmm4
    mulpd    %xmm1, %xmm3
    mulpd    %xmm0, %xmm4
    movapd   %xmm3, %xmm2
    addpd    %xmm4, %xmm2
    cmplepd  %xmm7, %xmm2
    movmskpd %xmm2, %ebx
    je       L18
    testl    %ebx, %ebx
    jne      L9

Those addpd, subpd, mulpd, movapd, etc. instructions work on pairs of doubles (those v2df). And the code uses the cmplepd and movmskpd instructions too, in a very clean way that I think not even GCC 4.6 is normally able to produce by itself.

A good language + compiler have many purposes, but producing asm code like that is one of the most important, especially if you write numerical code. A numerical programmer really wants to write code that somehow produces equally clean and powerful code (or better, using AVX 256-bit registers and 3-way instructions) in numerical processing kernels (often such kernels are small, often just the bodies of inner loops).

D2 allows you to write code almost as clean as this C one (but I think currently no D compiler is able to turn this into clean inlined addpd, subpd, mulpd, movapd instructions; this is a compiler issue, not a language one):

v2df Zrv = zero;
...
Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;

In D it becomes:

double[2] Zrv = zero;
...
Ziv[] = (Zrv[] * Ziv[]) + (Zrv[] * Ziv[]) + Civ[];
Zrv[] = Trv[] - Tiv[] + Crv[];
Trv[] = Zrv[] * Zrv[];
Tiv[] = Ziv[] * Ziv[];

But then how do you write this in a clean way in D2/D3?

do {
    ...
    is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
    two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
} while (--i > 0 && two_pixels);

Using those __builtin_ia32_cmplepd() and __builtin_ia32_movmskpd() is not easy, so there is a tradeoff between allowing easy-to-write code and giving power. It's acceptable for a language to give a bit less power if the code is simpler to write. Yet, in a systems language, if you don't give people a way to produce asm code as clean as the one I've shown in the inner loops of numerical processing code, some D2 programmers will be forced to write inline asm, and that's sometimes worse than using intrinsics like __builtin_ia32_cmplepd(). Writing efficient inner loops is very important for numerical processing code, and I think numerical processing code is important for D2.

Some time ago I suggested extending the D2 vector operations to code like this, but I think this is still not enough:

float[4] a, b, c, d;
c = a[] == b[];
d = a[] >= b[];

Bye,
bearophile
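For reference, the two GCC builtins in the snippet above correspond to standard Intel intrinsic names (`__builtin_ia32_cmplepd` is `_mm_cmple_pd`, `__builtin_ia32_movmskpd` is `_mm_movemask_pd`), which is how the same bound test would usually be written portably across compilers. This is a sketch; `still_bounded` is my own name:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Two doubles are tested against 4.0 in parallel. cmplepd sets a lane to
   all-ones where Trv+Tiv <= 4; movmskpd collects the two lane sign bits
   into an int, so bit i is set iff lane i is still bounded. */
static int still_bounded(__m128d trv, __m128d tiv) {
    const __m128d four = _mm_set1_pd(4.0);
    __m128d mask = _mm_cmple_pd(_mm_add_pd(trv, tiv), four);
    return _mm_movemask_pd(mask);   /* 0..3: one bit per lane */
}
```

A nonzero return means at least one of the two pixels should keep iterating, which is exactly the loop condition in the C benchmark.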
May 03 2011
parent reply "qznc" <qznc web.de> writes:
On Tuesday, 3 May 2011 at 20:51:37 UTC, bearophile wrote:
 Sean Cavanaugh:

 In many ways the biggest thing I use regularly in game 
 development that I would lose by moving to D would be good 
 built-in SIMD support.
 <snip>
Just found this old post, since I'm tuning mandelbrot.d right now [0].

The good news: LDC produces code which is quite close to the C version.

    mulsd   %xmm6,%xmm4
    subsd   %xmm1,%xmm7
    addsd   %xmm4,%xmm4
    addsd   %xmm5,%xmm7
    addsd   %xmm0,%xmm4
    movaps  %xmm7,%xmm6
    mulsd   %xmm6,%xmm6
    movaps  %xmm4,%xmm2
    mulsd   %xmm2,%xmm2
    movaps  %xmm2,%xmm1
    addsd   %xmm6,%xmm1
    ucomisd %xmm1,%xmm3
    jb      4026f0 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0x130>
    jl      402680 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0xc0>

Even better, the code is produced from the following (inlined!) source, which is pretty much the mathematical definition:

    for (auto i = 0; i < iter && norm(Z) <= lim; i++)
        Z = Z*Z + C;

The bad news: cmplepd and movmskpd are not used. Is that possible somehow, four years later? The gcc code is roughly twice as fast at the moment, but I don't know if cmplepd and movmskpd are the only things missing.

[0] https://github.com/qznc/d-shootout
Sep 02 2015
parent "David Nadlinger" <code klickverbot.at> writes:
On Wednesday, 2 September 2015 at 19:04:10 UTC, qznc wrote:
 The bad news: cmplepd and movmskpd are not used. Is that 
 possible somehow four years later?
I just checked, and LLVM does not know how to automatically vectorize that loop. You would need to write it manually using vector types (like in the C version).
 [0] https://github.com/qznc/d-shootout
As a general note, you might want to add "-boundscheck=off -mcpu=native" to the flags for LDC too, for a fair comparison to the other compilers. Also, if you use the DMD-style flags (e.g. -O -inline), you should use the ldmd2 wrapper instead of ldc2. You might also want to use the 2.067 branch of ldc2 (just released as an alpha version) for better comparability with DMD. — David
Sep 02 2015