
digitalmars.D - SSE in D

Emil Madsen <sovende gmail.com> writes:
Is there a D equivalent of "xmmintrin.h", or any other convenient way of
doing SSE in D?
- I've been looking into the Array Operators, but will those work if, for
instance, I'm doing something like:
a[3], b[4]
c[4] = a+b;
And when will the compiler write SSE asm for the array operators? Is there
a target=architecture option for the compiler, or will it simply write SSE if
one defines something like -msse4? I'm having a bit of trouble finding material
about SSE for D; does anyone have sources on the subject?

-- 
// Yours sincerely
// Emil 'Skeen' Madsen
Oct 02 2010
Trass3r <un known.com> writes:
On 02.10.2010 at 15:23, Emil Madsen <sovende gmail.com> wrote:

 Is there a D equivalent of the "xmmintrin.h", or any other convenient way of
 doing SSE in D?
 - I've been looking into the Array Operators, but will those work, for
 instance if I'm doing something alike:
 a[3], b[4]
 c[4] = a+b;
 and when will the compiler write SSE asm for the array operators? - is there
 a target=architecture for the compiler? or will it simply write SSE if one
 defines something alike -msse4? - I'm having a bit of trouble finding stuff
 about SSE for D, sources on the subject anyone?

SSE is supported in inline assembly. dmd's backend doesn't automatically vectorize code. gdc and ldc are theoretically able to do it (because of the backends they use), but I don't know to what extent they really do in practice.

Array operations leverage prewritten optimized SSE code if possible. See the Array operations section of http://www.digitalmars.com/d/2.0/arrays.html
Oct 02 2010
bearophile <bearophileHUGS lycos.com> writes:
Emil Madsen:

You are asking many different things, let's disentangle your questions a little.

 Is there a D equivalent of the "xmmintrin.h", or any other convenient way of
 doing SSE in D?

The D2 language is not designed to be an academic language; it's designed to be a reasonably practical one (even though some of its features are not just buggy or unfinished, but also contain new design ideas that are far from "battle tested", so no one knows whether they will actually turn out to be good in large or very large D2 programs). But its implementation is not fully practical yet.

In a compiler like GCC you may see a ton of dirty or smelly little features that are absent from the C standard but turn out to be practically useful, or even almost necessary, for real-world code. The D2 compiler lacks a large number of such dirty utility corner cases. Even the (D1) compiler LDC has some of these necessary dirty little features, like the allow_inline pragma that allows inlining of functions that contain asm, and so on. I guess that when D2 is more finished, and some people write a more efficient implementation of D2, those little smelly things will be added in abundance.

The little dirty xmmintrin intrinsics are absent from DMD and D, both in practice and by design. GNU C is not designed much: they just add those SIMD operations to the ball of mud named GNU C (plus handy operator overloading if you want to sum or multiply two registers represented as special arrays of doubles, floats or ints). D is designed in a somewhat more idealistic way here and tries to be semantically cleaner, so instead of those intrinsics you are supposed to use vector operations on arrays (both static and dynamic). Many such operations are already implemented and more or less work, but unless your arrays are large they usually slow down your code, because they are chunks of pre-written asm (that use SSE+ registers too) designed for large arrays, and they are not inlined.

In theory, in the future the D front-end will be able to replace a sum of two 4-float static arrays with a single SSE instruction (or little more), if you have compiled the code for SSE-enabled CPUs. In practice DMD is far from this point, and the development efforts are (rightly!) focused on finishing core features and removing the worst implementation (or even design) bugs. Optimization of code generation is a matter for later.
 - I've been looking into the Array Operators, but will those work, for
 instance if I'm doing something alike:
 a[3], b[4]
 c[4] = a+b;
The right D syntax is:

    float[4] a, b, c;
    c[] = a[] + b[];

You must always use [] after the array name. Arrays must have the same length. And currently you can't use this syntax:

    void main() {
        float[4] a, b;
        float[4] c[] = a[] + b[];
    }

That gives the error:

    test.d(3): Error: cannot implicitly convert expression (a[] + b[]) of type float[] to float[4u][]

This is probably because of an unforeseen design bug: a collision between D syntax and the C-style declaration syntax that D still accepts. See this bug report for more info about this design problem, which so far most people (including the main designers) seem to happily ignore:
http://d.puremagic.com/issues/show_bug.cgi?id=3971

Here I have suggested a possible solution, the introduction of a -cstyle compiler flag, which was ignored even more:
http://d.puremagic.com/issues/show_bug.cgi?id=4580

So this code works:

    void main() {
        float[4] a, b, c;
        c[] = a[] + b[];
    }

But it performs a call to the asm routine that computes the vector c=a+b in assembly, and that routine uses SSE registers if your CPU (detected at runtime) supports them.
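For reference, here is a minimal self-contained sketch of the working form with actual values filled in. The helper name add4 and the sample numbers are made up for illustration; only the c[] = a[] + b[] form comes from the post above.

```d
import std.stdio : writeln;

/// Element-wise sum of two 4-float static arrays.
/// `add4` is a hypothetical helper name used only in this sketch.
float[4] add4(const float[4] a, const float[4] b)
{
    float[4] c;
    c[] = a[] + b[];   // array operation: every array operand carries []
    return c;
}

void main()
{
    float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] b = [10.0f, 20.0f, 30.0f, 40.0f];

    float[4] c = add4(a, b);
    writeln(c);

    // Scalars can be mixed into the same expression.
    c[] = a[] * 2.0f - b[];
    writeln(c);
}
```

Whether this ends up as a call into a pre-written routine or as inline vector instructions depends on the compiler and its version, so measure before relying on it for small arrays.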
 and when will the compiler write SSE asm for the array operators?
DMD currently never writes SSE asm, unless you use those instructions yourself in inline asm code. The 64-bit DMD will probably be able to use those registers too, but I have no idea whether the 32-bit DMD will then use them as well. I hope so, but I have little hope. I'd like to know this.

D1 LDC now uses SSE registers for most of its floating point operations, because LLVM is very bad at using the x86 floating point stack. Low-level D code written for D1 LDC is usually about as efficient as C code written for GCC. This is a very good thing. But recently the development of LDC has slowed down a lot: there is no D2 version of it, it's not updated to the latest versions of LLVM, and there's no Windows support, because LLVM devs are paid by Apple and they don't care to make LLVM work fully (== with exceptions too) on Windows; they just need to give people the illusion that LLVM is multi-platform. I used to help LLVM development, but I have stopped until they add good support for exceptions on Windows.

There is a GCC-based D compiler too, named GDC, and I think it works, but I have never appreciated it much on Windows. Other people may give you more/better info on it.
 - is there a target=architecture for the compiler? or will it simply write
 SSE if one defines something alike -msse4? -
LDC D1 allows you to specify the target a little, while I think DMD always targets a Pentium1.
 I'm having a bit of trouble finding stuff
 about SSE for D, sources on the subject anyone?
There is not much to search :-)

Bye,
bearophile
Oct 02 2010
yoda <yoda talk.info> writes:
Is it just me or does anyone else have problems reading / understanding what he tries to say? The words are more or less correct, but the grammar is something really incomprehensible. What is bearophile? Some Asperger child prodigy? The words sound like they come from some ivory tower 500 feet above us and show no signs of emotions or social group thinking. Why is he affecting D's development so much?
Oct 03 2010
Emil Madsen <sovende gmail.com> writes:
On 3 October 2010 03:34, bearophile <bearophileHUGS lycos.com> wrote:

 uses SSE registers too if your CPU (detected at runtime) supports them.

How is this done? Using separate code paths after a call to cpuid?

I can see the idea of cleaning up the syntax by replacing intrinsics with array operators. However, what if I want to, for instance, shuffle? Would it be possible to overload >> for that, or something? And how would it shuffle: 4 elements or the entire thing? Say I want to shuffle the elements once to the right, like this:

a b c d --> d a b c
(_mm_shuffle_ps(array, array, _MM_SHUFFLE(2, 1, 0, 3));)

It's just that I need such functionality to implement matrices and the like using SSE. What would my alternative be? Implementing "xmmintrin.h" using small bits of inline asm? That, however, wouldn't yield any speed if it's not getting inlined?

-- 
// Yours sincerely
// Emil 'Skeen' Madsen
Oct 03 2010
bearophile <bearophileHUGS lycos.com> writes:
Emil Madsen:

 uses SSE registers too if your CPU (detected at runtime) supports them.
 How is this done? - using codepaths after a call to cpuid?

Your dmd distribution includes the compiler/druntime/phobos source code too; take a peek there. This source code shows you how it's done:
http://www.dsource.org/projects/druntime/browser/trunk/src/rt/arraydouble.d
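As a rough illustration of the general idea (not a copy of what arraydouble.d does), run-time dispatch on CPU features can be sketched with druntime's core.cpuid module. The names AddFn, addScalar, addVector and pickAdd are hypothetical, and addVector merely stands in for a hand-written SSE loop.

```d
import core.cpuid : sse;
import std.stdio : writeln;

// Signature shared by both code paths (hypothetical alias for this sketch).
alias AddFn = void function(float[] dst, const(float)[] a, const(float)[] b);

/// Plain scalar loop: works on every CPU.
void addScalar(float[] dst, const(float)[] a, const(float)[] b)
{
    foreach (i; 0 .. dst.length)
        dst[i] = a[i] + b[i];
}

/// Stand-in for an SSE routine; a real one would contain the asm loop.
/// All three slices must have the same length.
void addVector(float[] dst, const(float)[] a, const(float)[] b)
{
    dst[] = a[] + b[];
}

/// Picks a code path from what the CPU reports at run time.
AddFn pickAdd()
{
    return sse() ? &addVector : &addScalar;
}

void main()
{
    float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] b = [4.0f, 3.0f, 2.0f, 1.0f];
    float[4] c;

    AddFn add = pickAdd();   // decided once, then reused
    add(c[], a[], b[]);
    writeln(sse() ? "vector path: " : "scalar path: ", c);
}
```

Choosing the function pointer once keeps the cpuid query out of the hot path; the price is an indirect call, which is one reason this kind of dispatch only pays off for larger arrays.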
 what if I want to for instance shuffle? - would it
 be possible to overload >> for that, or something? and how would it shuffle?
 4 elements or the entire thing? - Say I want to shuffle elements once to the
 right like this:
 a b c d
At the moment I think you have to write a little function that performs the shuffle (and if it contains asm it will not be inlined). A similar solution is to use a little shuffling struct that uses opDispatch to give a nice shuffling syntax. You may also use a string mixin, if your asm code must be inlined, but this is not nice. I think currently there is no very good way to do what you need to do. I think Don or someone else will need to invent something good enough for the efficient shuffling :-)
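A minimal sketch of the "little shuffling struct" idea, written in plain D without any asm, so it only demonstrates the syntax and says nothing about speed. Vec4, componentIndex and the x/y/z/w naming are assumptions made up for this example.

```d
import std.stdio : writeln;

/// Maps a component letter to its index (hypothetical helper).
size_t componentIndex(char c) pure nothrow @safe
{
    switch (c)
    {
        case 'x': return 0;
        case 'y': return 1;
        case 'z': return 2;
        case 'w': return 3;
        default: assert(0, "unknown component");
    }
}

struct Vec4
{
    float[4] data;

    /// v.wxyz, v.yxwz, ... : any four-letter name is treated as a shuffle.
    Vec4 opDispatch(string s)() const
        if (s.length == 4)
    {
        Vec4 r;
        foreach (i; 0 .. 4)
            r.data[i] = data[componentIndex(s[i])];
        return r;
    }
}

void main()
{
    auto v = Vec4([1.0f, 2.0f, 3.0f, 4.0f]);   // a b c d
    writeln(v.wxyz.data);                      // rotated right once: d a b c
}
```

Here v.wxyz corresponds to the a b c d --> d a b c rotation from the question, that is, the same element order as _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)).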
 implementing
 "xmmintrin.h" using bits of small inline asm? - that however wouldn't yield
 any speed, if its not getting inlined?
In DMD, functions that contain asm don't get inlined, so those small snippets become kind of useless if your purpose is maximum performance. The LDC (D1) compiler, being more practical, has two different ways to do what you need: the pragma(allow_inline):
http://www.dsource.org/projects/ldc/wiki/Docs#allow_inline

And Inline Assembly Expressions:
http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

In DMD you probably have to build your code as a string at compile time and then mix it into the normal code. This is neither handy nor clean, but it may work.

Bye,
bearophile
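The string-mixin route mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not tested across compilers: it assumes DMD-style x86 inline asm is available (checked with the predefined D_InlineAsm_X86 and D_InlineAsm_X86_64 versions) and falls back to plain D elsewhere. The names shufpsCode, rotateRight and haveInlineAsm are hypothetical. The immediate 0x93 is the value of _MM_SHUFFLE(2, 1, 0, 3).

```d
import std.conv : to;
import std.stdio : writeln;

// True when the compiler offers DMD-style x86 inline asm.
version (D_InlineAsm_X86_64) enum haveInlineAsm = true;
else version (D_InlineAsm_X86) enum haveInlineAsm = true;
else enum haveInlineAsm = false;

/// Builds, at compile time, asm text that shuffles `src` with itself
/// and stores the result in `dst` (both are names of local float[4]s).
string shufpsCode(string dst, string src, ubyte imm)
{
    return "asm { movups XMM0, " ~ src ~ "; "
         ~ "shufps XMM0, XMM0, " ~ imm.to!string ~ "; "
         ~ "movups " ~ dst ~ ", XMM0; }";
}

/// a b c d --> d a b c
float[4] rotateRight(float[4] a)
{
    float[4] src = a;   // local copies give the asm plain memory operands
    float[4] dst;
    static if (haveInlineAsm)
        mixin(shufpsCode("dst", "src", 0x93));
    else
        dst = [src[3], src[0], src[1], src[2]];   // portable fallback
    return dst;
}

void main()
{
    writeln(rotateRight([1.0f, 2.0f, 3.0f, 4.0f]));
}
```

The wrapper function is only there to keep the example easy to try. To get the asm inlined at the point of use, the same mixin(shufpsCode(...)) line would be written directly inside the hot function.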
Oct 03 2010