digitalmars.D - SIMD/intrinsics questions
- Mike Farnsworth (12/12) Nov 06 2009 Hey all,
- Don (14/35) Nov 06 2009 Hi Mike, Welcome to D!
- Mike Farnsworth (8/39) Nov 06 2009 Awesome, does this also apply to dynamic arrays? And how far does that ...
- Don (7/52) Nov 06 2009 Yes, that works, and it applies to dynamic arrays too. A key idea behind...
- Andrei Alexandrescu (5/68) Nov 06 2009 Mike, for more info on the supported operations you may want to refer to...
- Bill Baxter (13/26) Nov 06 2009 But what about the question of direct support for SSE intrinsics?
- Walter Bright (21/22) Nov 06 2009 The following are directly supported:
- Walter Bright (2/8) Nov 06 2009 Many of the array operations do use the CPU vector operations.
- Lutger (9/21) Nov 08 2009 Have you seen this page?
- Robert Jacques (5/31) Nov 08 2009 SSE intrinsics allow you to specify the operation, but allow the compile...
- Michael Farnsworth (75/108) Nov 08 2009 I finally went and did a little homework, so sorry for the long reply
- Robert Jacques (10/124) Nov 08 2009 By design, D asm blocks are separated from the optimizer: no code motion...
- Michael Farnsworth (39/47) Nov 09 2009 Yeah, I've discovered that having either the constraints-based __asm()
- Walter Bright (5/11) Nov 09 2009 I think there's a lot of potential in this. Most languages lack array
- Mike Farnsworth (5/17) Nov 09 2009 Can you elaborate a bit on what you mean? If I understand what you're g...
- Walter Bright (15/31) Nov 09 2009 Sure. Consider the code:
- Bill Baxter (32/48) Nov 09 2009 I think what he's saying is use array expressions like a[] = b[] + c[]
- Don (13/55) Nov 10 2009 The bad news: The DMD back-end is a state-of-the-art backend from the
- Walter Bright (8/22) Nov 10 2009 Modern compilers don't do much better. The point of diminishing returns
- Don (14/38) Nov 10 2009 Yup. The only integer operation modern compilers still don't do well is
- Walter Bright (6/15) Nov 10 2009 I do have a working Pentium around here somewhere. I even have a 486,
- Adam D. Ruppe (17/18) Nov 10 2009 I actually still use my Pentium 1 computers. I have three of them, one
- Walter Bright (6/9) Nov 10 2009 Interestingly, dmd does a very good job of Pentium instruction
- Lutger (5/22) Nov 10 2009 Until recently my stepdad still had his 8086 setup to interface with an ...
- bearophile (6/8) Nov 10 2009 I routinely see D benchmarks that are 2+ times faster with LDC compared ...
- Walter Bright (6/13) Nov 10 2009 Have to be careful about benchmarks without looking at why. A few months...
- Mike Farnsworth (5/30) Nov 10 2009 For my purposes, runtime detection is probably out the window, unless th...
- Walter Bright (5/16) Nov 10 2009 The way to do it is to not distribute multiple executables, but have the...
- Mike Farnsworth (4/22) Nov 10 2009 Was it actually rewriting the executable code to call the alternate func...
- Walter Bright (37/43) Nov 10 2009 It's much simpler than that. Some C:
- Chad J (15/19) Nov 10 2009 If MMX/SSE/SSE2 optimizations are low-lying fruit, I'd at least like to
- Mike Farnsworth (3/24) Nov 10 2009 Incidentally, if you use LLVM to compile to their bitcode, you can at ru...
Hey all,

The other day someone pointed me to Andrei's article in DDJ, and I dove headlong into researching D and what it is capable of. I had only seen it referred to a few times with respect to template metaprogramming and that crazy compile-time ray tracer, but I have to say I've been very impressed with what I've seen, especially with D2.

A bit of background: I work in the movie VFX industry, and worked in games development previously, and I have my own ray tracer that I experiment with (see http://renderspud.blogspot.com/ for info). Back in college the better version, and now I've slowly been converting it to C++ again with SSE support (getting to the SOA ray packet form soon, I hope) so that it doesn't suck speed-wise. Anyway, long story short, SIMD is really important to me.

In dmd and ldc, is there any support for SSE or other SIMD intrinsics? I realize that I could write some asm blocks, but that means each operation (vector add, sub, mul, dot product, etc.) would need to probably include a prelude and postlude with loads and stores. I worry that this will not get optimized away (unless I don't use 'naked'?).

In the alternative, is it possible to support something along the lines of gcc's vector extensions:

    typedef int v4si __attribute__ ((vector_size (16)));
    typedef float v4sf __attribute__ ((vector_size (16)));

where the compiler will automatically generate opAdd, etc. for those types? I'm not suggesting using gcc's syntax, of course, but you get the idea. It would provide a very easy way for the compiler to prefer to keep 4-float vectors in SSE registers, pass them in registers where appropriate in function calls, nuke lots of loads and stores when inlining, etc. Having good, native SIMD support in D seems like a natural fit (heck, it's got complex numbers built-in).

Of course, there are some operations that the available SSE intrinsics cover that the compiler can't expose via the typical operators, so those still need to be supported somehow. Does anyone know if ldc or dmd has those, or if they'll optimize away SSE loads and stores if I roll my own structs with asm blocks? I saw from the ldc source it had the usual llvm intrinsics, but as far as hardware-specific codegen intrinsics I couldn't spot any.

Thanks,
Mike Farnsworth
Nov 06 2009
Mike Farnsworth wrote:
> In dmd and ldc, is there any support for SSE or other SIMD intrinsics? [...] Having good, native SIMD support in D seems like a natural fit (heck, it's got complex numbers built-in).

Hi Mike, Welcome to D!

In the latest compiler release (ie, this morning!), fixed-length arrays have become value types. This is a big step: it means that (eg) float[4] can be returned from a function for the first time. On 32-bit, we're a bit limited in SSE support (eg, since *no* 32-bit AMD processors have SSE2) -- but this will mean that on 64 bit, we'll be able to define an ABI in which short static arrays are passed in SSE registers.

Also, D has array operations. If x, y, and z are int[4], then

    x[] = y[]*3 + z[];

corresponds directly to SIMD operations. DMD doesn't do much with them yet (there's been so many language design issues that optimisation hasn't received much attention), but the language has definitely been planned with SIMD in mind.
Nov 06 2009
Don Wrote:
> In the latest compiler release (ie, this morning!), fixed-length arrays have become value types. [...]
> Also, D has array operations. If x, y, and z are int[4], then
> x[] = y[]*3 + z[];
> corresponds directly to SIMD operations. DMD doesn't do much with them yet, but the language has definitely been planned with SIMD in mind.

Awesome, does this also apply to dynamic arrays? And how far does that go? E.g. if I were to do something odd like:

    x[] = ((y[] % 5) ^ 2) + z[];

Would that also work? (Sorry, I should test it myself, but I'm at work and haven't had time to get D tools installed yet and so am flying blind.)

On another note, I'm aware that the latest gcc versions have pretty good SIMD auto-vectorization, so I assume that will eventually be in the cards for dmd. As for ldc, that is pretty much dependent on llvm itself, and that doesn't have auto-vectorization of code yet AFAIK. Anyone familiar with ldc have any idea about getting optimized asm and/or SSE intrinsics to do the right thing?

As soon as I have some time, I'll stop being lazy and actually go try some of this stuff out myself and see what the compiled asm looks like, but if anyone has already figured out the answers I can stay lazy. If it comes down to me needing to create some x86 asm in structs to get some initial SSE-based vector types working, I'll do that and share with the class. I'm not amazing with that stuff, but it could serve as a poor-man's stopgap until the compilers mature a bit in this regard.

-Mike
Nov 06 2009
Mike Farnsworth wrote:
> Awesome, does this also apply to dynamic arrays? And how far does that go? E.g. if I were to do something odd like:
> x[] = ((y[] % 5) ^ 2) + z[];
> Would that also work?

Yes, that works, and it applies to dynamic arrays too. A key idea behind this is that since modern machines support SIMD, it's quite ridiculous for a high level language to not be able to express it.

> If it comes down to me needing to create some x86 asm in structs to get some initial SSE-based vector types working, I'll do that and share with the class. I'm not amazing with that stuff, but it could serve as a poor-man's stopgap until the compilers mature a bit in this regard.

Yes, lots of stuff that should work doesn't yet. The emphasis has been on getting the fundamentals solid. There's a lot of activity planned -- in fact I'm improving the compiler support for operator overloading right now.
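(To make the discussion concrete, a small sketch of both cases -- note that ^ here is D's bitwise XOR, as in Mike's expression, and whether every operator in the second expression is implemented yet is exactly the caveat Don gives above:)

    void example()
    {
        int[4] y = [1, 2, 3, 4];
        int[4] z = [4, 3, 2, 1];
        int[4] x;
        x[] = y[] * 3 + z[];            // static arrays: now value types

        auto a = new int[100];          // same syntax on dynamic arrays
        auto b = new int[100];
        auto c = new int[100];
        b[] = 7;
        c[] = 2;
        a[] = ((b[] % 5) ^ 2) + c[];    // Mike's "odd" example
    }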
Nov 06 2009
Don wrote:
> Yes, that works, and it applies to dynamic arrays too. A key idea behind this is that since modern machines support SIMD, it's quite ridiculous for a high level language to not be able to express it.

Mike, for more info on the supported operations you may want to refer to the Thermopylae excerpt: http://erdani.com/d/thermopylae.pdf

Andrei
Nov 06 2009
On Fri, Nov 6, 2009 at 11:29 AM, Don <nospam nospam.com> wrote:
> Hi Mike, Welcome to D!
> In the latest compiler release (ie, this morning!), fixed-length arrays have become value types. This is a big step: it means that (eg) float[4] can be returned from a function for the first time. On 32-bit, we're a bit limited in SSE support (eg, since *no* 32-bit AMD processors have SSE2) -- but this will mean that on 64 bit, we'll be able to define an ABI in which short static arrays are passed in SSE registers.
> Also, D has array operations. If x, y, and z are int[4], then
> x[] = y[]*3 + z[];
> corresponds directly to SIMD operations. DMD doesn't do much with them yet (there's been so many language design issues that optimisation hasn't received much attention), but the language has definitely been planned with SIMD in mind.

But what about the question of direct support for SSE intrinsics? I don't see any in std.* but is there any reason not to, say, beef up std.intrinsics with such things? Is there any major hurdle to overcome? Seems like it would be useful to have.

--bb
Nov 06 2009
Bill Baxter wrote:
> But what about the question of direct support for SSE intrinsics?

The following are directly supported:

    a[] = b[] + c[]
    a[] = b[] - c[]
    a[] = b[] + value
    a[] += value
    a[] += b[]
    a[] = b[] - value
    a[] = value - b[]
    a[] -= value
    a[] -= b[]
    a[] = b[] * value
    a[] = b[] * c[]
    a[] *= value
    a[] *= b[]
    a[] = b[] / value
    a[] /= value
    a[] -= b[] * value

CPU detection is done at runtime and picks which of none, mmx, sse, sse2 or amd3dnow instructions to use. Other operations are done using loop fusion.
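(For illustration, a minimal snippet exercising a few of those forms; the data here is arbitrary:)

    void demo()
    {
        auto a = new float[1000];
        auto b = new float[1000];
        auto c = new float[1000];
        b[] = 1.5f;
        c[] = 2.5f;

        a[] = b[] + c[];      // dispatched to a vectorized helper at runtime
        a[] += 2.0f;
        a[] = b[] * 3.0f;
        a[] -= b[] * 0.5f;    // the fused multiply-subtract form
    }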
Nov 06 2009
Don wrote:
> Also, D has array operations. If x, y, and z are int[4], then
> x[] = y[]*3 + z[];
> corresponds directly to SIMD operations. DMD doesn't do much with them yet (there's been so many language design issues that optimisation hasn't received much attention), but the language has definitely been planned with SIMD in mind.

Many of the array operations do use the CPU vector operations.
Nov 06 2009
Mike Farnsworth wrote:
> Of course, there are some operations that the available SSE intrinsics cover that the compiler can't expose via the typical operators, so those still need to be supported somehow. Does anyone know if ldc or dmd has those, or if they'll optimize away SSE loads and stores if I roll my own structs with asm blocks? I saw from the ldc source it had the usual llvm intrinsics, but as far as hardware-specific codegen intrinsics I couldn't spot any.

Have you seen this page?

http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

This is similar to gcc's (gdc has it too) extended inline asm expressions. I'm not at all in the know about all this, but I think this will allow you to build something yourself that works well with the optimizations done by the compiler. If someone could clarify how these inline expressions work exactly, that would be great.
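(Roughly what an expression on that page looks like -- an untested sketch; the exact __asm template syntax and which LLVM constraint strings ldc accepts depend on the ldc version, as Michael finds out below:)

    import ldc.llvmasm;

    int addOne(int x)
    {
        // "=r,0": result in a general register, input tied to the output.
        // $$ escapes a literal $ in LLVM inline asm.
        return __asm!(int)("add $$1, $0", "=r,0", x);
    }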
Nov 08 2009
On Sun, 08 Nov 2009 17:47:31 -0500, Lutger <lutger.blijdestijn gmail.com> wrote:
> Have you seen this page?
> http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions
> This is similar to gcc's (gdc has it too) extended inline asm expressions. I'm not at all in the know about all this, but I think this will allow you to build something yourself that works well with the optimizations done by the compiler.

SSE intrinsics allow you to specify the operation, but allow the compiler to do the register assignments, inlining, etc. D's inline asm requires the programmer to manage everything.
Nov 08 2009
On 11/08/2009 06:35 PM, Robert Jacques wrote:
> SSE intrinsics allow you to specify the operation, but allow the compiler to do the register assignments, inlining, etc. D's inline asm requires the programmer to manage everything.

I finally went and did a little homework, so sorry for the long reply that follows. I have been experimenting with both the ldc.llvmasm.__asm() function, as well as getting D's asm {} to do what I want. So far, I have been able to get some SSE instructions in there, but I'm running into a few issues. For now, I'm only using ldc, but I'll try out dmd eventually as well.

* Using "-release -O5 -enable-inlining" in ldc, I can't for the life of me get it to inline the functions with the SSE asm statements.

* Overriding opAdd for a struct, I had a hard time getting it to not spit out what appears to me to be a lot of extra loading / stack code. In order to even get it to do what I wanted, I wrote it like this:

    Vector opAdd(Vector v)
    {
        Vector result = void;
        float* c0 = &c[0];
        float* vc0 = &v.c[0];
        float* rc0 = &result.c[0];
        asm
        {
            movaps XMM0,c0 ;
            movaps XMM1,vc0 ;
            addps XMM0,XMM1 ;
            movaps rc0,XMM0 ;
        }
        return result;
    }

And that ended up with the address-of code and stack stuff that isn't optimal.

* When I instead write a function like this:

    static void vecAdd(ref Vector v1, ref Vector v2, ref Vector result)
    {
        asm
        {
            movaps XMM0,v1 ;
            movaps XMM1,v2 ;
            addps XMM0,XMM1 ;
            movaps result,XMM0 ;
        }
    }

where Vector is defined as:

    align(16) struct Vector
    {
    public:
        float[4] c;
    }

(Note that 'result' is passed as 'ref' and not 'out'. With 'out', it inserted init code in there, probably because the compiler thought I hadn't actually touched the result, even though the assembly did its job. 'out' is a better contract description, so it'd be nice to know how to suppress that.)

With this I get fewer instructions in the function; but it still has an extraneous stack push/pop pair surrounding it, and it still won't inline for me where I call it. It's all of 8 instructions including the return, and any inlining scheme that thinks that merits a function call instead ought to be dragged out and shot. =P

* I used __asm(T)(char[], char[], T) from ldc as well, but either I suck at getting LLVM to recognize my constraints, or ldc doesn't support SSE constraints yet, but it just wouldn't take. I ended up going the D asm block route once I figured out how to give it addresses without taking the address of everything (using ref for struct arguments works great!).

So, yeah, once I can figure out how to get any of the compilers to inline my asm-laced functions, and then figure out how to get an optimizer to eliminate all the (what should be) extraneous movaps instructions, then I'll be in good shape. Until then, I won't port my ray tracer over to D. But I will be happy to try to help out with patches/experiments until then to get to the goal of making D suitable for heavy SIMD calculations.

I'm talking with the ldc guys about it, as LLVM should be able to make really good use of this stuff (especially intrinsics) once the frontend can hand it off suitably. I'm excited to work on a project like this, because if I get better at dealing with SIMD issues in the compiler I should be able to capitalize on it to make my math-heavy code even faster. Mmmm...speed...

-Mike
Nov 08 2009
On Mon, 09 Nov 2009 01:53:11 -0500, Michael Farnsworth <mike.farnsworth gmail.com> wrote:
> So, yeah, once I can figure out how to get any of the compilers to inline my asm-laced functions, and then figure out how to get an optimizer to eliminate all the (what should be) extraneous movaps instructions, then I'll be in good shape. Until then, I won't port my ray tracer over to D.

By design, D asm blocks are separated from the optimizer: no code motion, etc. occurs. D2 just changed fixed-sized arrays to value types, which provide most of the functionality of a small vector struct. However, actual SSE optimization of these types is probably going to wait until x64 support, since a bunch of 32-bit chips don't support them.

P.S. For what it's worth, I do research which involves volumetric ray-tracing. I've always found memory to bottleneck computations. Also, why not look into CUDA/OpenCL/DirectCompute?
Nov 08 2009
On 11/08/2009 11:28 PM, Robert Jacques wrote:
> By design, D asm blocks are separated from the optimizer: no code motion, etc. occurs. [...] Also, why not look into CUDA/OpenCL/DirectCompute?

Yeah, I've discovered that having either the constraints-based __asm() from ldc or actual intrinsics probably makes optimization opportunities more frequent. But, if it at least inlined the regular asm blocks for me I'd be most of the way there.

The ldc guys tell me that they didn't include the llvm vector intrinsics already because they were going to need either a custom type in the frontend, or else the D2 fixed-size-arrays-as-value-types functionality. I might take a stab at some of that in ldc in the future to see if I can get it to work, but I'm not an expert in compilers by any stretch of the imagination.

-Mike

PS: As for trying CUDA/OpenCL/DirectCompute, I haven't gotten into it much for a few reasons:

* The standards and APIs are still evolving.

* I refuse to pigeon-hole myself into Windows (I'm typing this from a Fedora 11 box, and at work we're a Linux shop doing movie VFX).

* Larrabee (yes, yes, semi-vaporware until Intel gets their crap together) will allow something much closer to standard CPU code. I really think that's the direction the GPU makers are heading in general, so why hobble myself with cruddy GPU memory/threading models to code around right now?

* GPUs keep changing, and every change brings with it subtle (and sometimes drastic) effects on your code's performance and results from card to card. It's a nightmare to maintain, and every project we've done trying to do production rendering stuff on GPU (even just relighting) has ended in tears and gnashing of teeth. Everyone just eventually throws up their hands and goes back to optimized CPU rendering in the VFX industry (Pixar, ILM, Tippett have all done that, just to name a few).

Good, solid general-purpose CPUs with caches, decently wide SIMD with scatter/gather, and plenty of hardware threads are the wave of the future. (Or was that the past? I can't remember.) GPUs are slowly converging back to that, except that currently they have a programmer-managed cache (texture mem), and they execute multiple threads concurrently over the same instructions in groups (warps, in CUDA-speak?). They'll eventually add the 'feature' of a more automatically-managed cache, and better memory throughput when allowing warps to be smaller and more flexible. And they'll look nearly identical to all the multi-core CPUs again when it happens.
Nov 09 2009
Michael Farnsworth wrote:
> The ldc guys tell me that they didn't include the llvm vector intrinsics already because they were going to need either a custom type in the frontend, or else the D2 fixed-size-arrays-as-value-types functionality. I might take a stab at some of that in ldc in the future to see if I can get it to work, but I'm not an expert in compilers by any stretch of the imagination.

I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.
Nov 09 2009
Walter Bright Wrote:
> I think there's a lot of potential in this. Most languages lack array operations, forcing the compiler into the bizarre task of trying to reconstruct high level operations from low level ones to then convert to array ops.

Can you elaborate a bit on what you mean? If I understand what you're getting at, it's as simple as recognizing array-wise operations (the a[] = b[] * c expressions in D), and decomposing them into SIMD underneath where possible? It would also be cool if the compiler could catch cases where a struct was essentially a wrapper around one of those arrays, and similarly turn the ops into SIMD ops (so as to allow some operator overloads and extra methods wrapping additional intrinsics, for example).

There are a lot of cases to recognize, but the compiler could start with the simple ones and then go from there with no need to change the language or declare custom types (minus some alignment to help it along, perhaps). The nice thing about it is you automatically get a pretty big swath of auto-vectorization by the compiler in the most natural types and operations you'd expect it to show up in.

Of course, SOA-style SIMD takes more intervention by the programmer, but there is probably no easy way around that, since it's based on a data-layout technique.

-Mike
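(A sketch of the kind of wrapper struct meant here, built on array operations -- illustrative only, not code from the thread:)

    align(16) struct Vector
    {
        float[4] c;

        // Built on array operations, so a compiler that recognizes the
        // pattern would be free to turn the whole thing into one addps.
        Vector opAdd(ref Vector v)
        {
            Vector r;
            r.c[] = c[] + v.c[];
            return r;
        }
    }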
Nov 09 2009
Mike Farnsworth wrote:
> Can you elaborate a bit on what you mean?

Sure. Consider the code:

    for (int i = 0; i < 100; i++)
        array[i] = 0;

It takes a fair amount of work for a compiler to deduce "aha! this code is intended to clear the array!" The compiler then replaces the loop with:

    memset(array, 0, 100 * sizeof(array[0]));

In D, you can specify the array operation at a high level:

    array[0..100] = 0;

In other words, a language is supposed to represent high level concepts and the compiler breaks it down into low level ones supported by the machine. With vector operations, etc., the language supports only the low level operations and the compiler must reconstruct the high level operations supported by the machine. This inversion of roles is bizarre.
Nov 09 2009
On Mon, Nov 9, 2009 at 1:56 PM, Mike Farnsworth <mike.farnsworth gmail.com> wrote:
> Can you elaborate a bit on what you mean? If I understand what you're getting at, it's as simple as recognizing array-wise operations (the a[] = b[] * c expressions in D), and decomposing them into SIMD underneath where possible? It would also be cool if the compiler could catch cases where a struct was essentially a wrapper around one of those arrays, and similarly turn the ops into SIMD ops (so as to allow some operator overloads and extra methods wrapping additional intrinsics, for example).

I think what he's saying is use array expressions like

    a[] = b[] + c[]

and let the compiler take care of it, instead of trying to write SSE yourself. I haven't tried, but does this kind of thing turn into SSE and get inlined?

    struct Vec3
    {
        float[3] v;
        void opAddAssign(ref Vec3 o)
        {
            this.v[] += o.v[];
        }
    }

If so then that's very slick. Much nicer than having to delve into compiler intrinsics. But at least on DMD I know it won't actually inline because it doesn't inline functions with ref arguments. (http://d.puremagic.com/issues/show_bug.cgi?id=2008)

--bb
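(A quick usage check for that struct -- illustrative, assuming the array-op += from Walter's list above:)

    void main()
    {
        Vec3 a, b;
        a.v[] = 1.0f;
        b.v[] = 2.0f;
        a += b;                  // calls opAddAssign, i.e. a.v[] += b.v[]
        assert(a.v[0] == 3.0f);
    }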
Nov 09 2009
Bill Baxter wrote:
> I haven't tried, but does this kind of thing turn into SSE and get inlined?
>
>     struct Vec3
>     {
>         float[3] v;
>         void opAddAssign(ref Vec3 o) { this.v[] += o.v[]; }
>     }
>
> If so then that's very slick. Much nicer than having to delve into compiler intrinsics. But at least on DMD I know it won't actually inline because it doesn't inline functions with ref arguments. (http://d.puremagic.com/issues/show_bug.cgi?id=2008)

The bad news: The DMD back-end is a state-of-the-art backend from the late 90's. Despite its age, its treatment of integer operations is, in general, still quite respectable. However, it _never_ generates SSE instructions. Ever. However, array operations _are_ detected, and they become calls to library functions which use SSE if available. That's not bad for moderately large arrays -- 200 elements or so -- but of course it's completely non-optimal for short arrays.

The good news: Now that static arrays are passed by value, introducing inline SSE support for short arrays suddenly makes a lot of sense -- there can be a big performance benefit for a small backend change; it could be done without introducing SSE anywhere else. Most importantly, it doesn't require any auto-vectorisation support.
Nov 10 2009
Don wrote:
> The bad news: The DMD back-end is a state-of-the-art backend from the late 90's. Despite its age, its treatment of integer operations is, in general, still quite respectable.

Modern compilers don't do much better. The point of diminishing returns was clearly reached.

> However, it _never_ generates SSE instructions. Ever. However, array operations _are_ detected, and they become calls to library functions which use SSE if available. [...] The good news: Now that static arrays are passed by value, introducing inline SSE support for short arrays suddenly makes a lot of sense -- there can be a big performance benefit for a small backend change; it could be done without introducing SSE anywhere else. Most importantly, it doesn't require any auto-vectorisation support.

What the library functions also do is have a runtime switch based on the capabilities of the processor, switching to operations tailored to that processor. To generate the code directly, assuming the existence of SSE, is to mean the code will only run on modern chips. Whether or not this is a problem depends on your application.
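(Roughly what such a runtime switch looks like in D -- a sketch with made-up helper names, assuming Phobos's std.cpuid for the detection:)

    import std.cpuid;

    void function(float[], float[], float[]) addArrays;

    void addArraysSSE2(float[] a, float[] b, float[] r)   // hypothetical SSE2 path
    {
        r[] = a[] + b[];
    }

    void addArraysScalar(float[] a, float[] b, float[] r) // hypothetical fallback
    {
        foreach (i, x; a)
            r[i] = x + b[i];
    }

    static this()
    {
        // Decide once at startup which implementation to call,
        // as the runtime's array-op helpers do.
        if (std.cpuid.sse2())
            addArrays = &addArraysSSE2;
        else
            addArrays = &addArraysScalar;
    }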
Nov 10 2009
Walter Bright wrote:
> Modern compilers don't do much better. The point of diminishing returns was clearly reached.

Yup. The only integer operation modern compilers still don't do well is -- array operations!

> What the library functions also do is have a runtime switch based on the capabilities of the processor, switching to operations tailored to that processor. To generate the code directly, assuming the existence of SSE, is to mean the code will only run on modern chips. Whether or not this is a problem depends on your application.

I'd say it's not a problem to use MMX or even SSE1. It's really, really difficult to find a processor that doesn't support them. I've tried. I've really tried. I don't think many are still around: they all have motherboards which require really small hard disks that you can no longer buy. Certainly no-one is putting new software on them.

Earlier this year I had to install Windows 3.1 (!!!) on an ancient PC at work, to support an ancient but expensive bit of lab equipment. Even it was a Pentium II. Getting the spare parts for it was a nightmare*; we had to ship them from 600km away. Hard disks just don't last that long.

SSE2 is a different story, since AMD never made a 32 bit CPU with SSE2.

*Actually it was more of a horror comedy. It was hard to take it seriously.
Nov 10 2009
Don wrote:
> I'd say it's not a problem to use MMX or even SSE1. It's really, really difficult to find a processor that doesn't support them. I've tried. I've really tried. I don't think many are still around: they all have motherboards which require really small hard disks that you can no longer buy. Certainly no-one is putting new software on them.

I do have a working Pentium around here somewhere. I even have a 486, though I haven't turned the machine on in 15 years. I no longer have a 386 (gave it away).

10 years ago, I heard that the 386 was commonly used in embedded systems. I don't know what the base level x86 used today is.
Nov 10 2009
On Tue, Nov 10, 2009 at 12:06:08PM -0800, Walter Bright wrote:
> I do have a working Pentium around here somewhere.

I actually still use my Pentium 1 computers. I have three of them: one works as a thin terminal to my newer computer, one is my secondary main computer (if/when my main computer decides to quit working and I'm waiting on replacement parts, I go back to the old box - it was my main from 1996 through to 2005! I also use it for the occasional multiplayer game), and the last one I still use to host some small, low traffic websites.

I don't buy into the "omg must be bleeding edge or else" philosophy. I'll work these computers until their parts fail entirely! But, I'm surely in the minority. Heck, I still sometimes write 16 bit DOS code for those computers!

If DMD starts outputting fancier code, that's awesome for the 99% of cases where it is fine; I'd just request a compiler switch in there to turn it back to old behaviour for the <1% of cases where we don't want it.

--
Adam D. Ruppe
http://arsdnet.net
Nov 10 2009
Adam D. Ruppe wrote:
> If DMD starts outputting fancier code, that's awesome for the 99% of cases where it is fine; I'd just request a compiler switch in there to turn it back to old behaviour for the <1% of cases where we don't want it.

Interestingly, dmd does a very good job of Pentium instruction scheduling. I thought that was hopelessly obsolete, although it didn't actually hurt anything, so no worries. But it turns out that the Intel Atom benefits a lot from Pentium style scheduling, and no other compiler seems to support that anymore!
Nov 10 2009
Walter Bright wrote:
> I do have a working Pentium around here somewhere. I even have a 486, though I haven't turned the machine on in 15 years. I no longer have a 386 (gave it away). 10 years ago, I heard that the 386 was commonly used in embedded systems. I don't know what the base level x86 used today is.

Until recently my stepdad still had his 8086 set up to interface with an old-school velotype keyboard. I even built a nasty old 5 1/4 inch floppy drive into his shiny dualcore rig, which he used to transfer plain text files between the two machines. It worked fine.
Nov 10 2009
Walter Bright:
> Modern compilers don't do much better. The point of diminishing returns was clearly reached.

I routinely see D benchmarks that are 2+ times faster with LDC compared to DMD. Today CPUs don't get faster and faster as in the past, so a 250% improvement coming just from the compiler is not something you want to ignore. And more optimizations for LLVM are planned (like auto-vectorization, better inlining of function pointers, better de-virtualization and inlining of virtual class methods, partial compilation, super compilation of tiny chunks of code, and quite more).

Another thing to take into account is that today people often don't want to program in C; they want higher-level languages like Python, Fortress, etc. Such languages offer challenges to the optimizers that were not present in the past. For example most optimizations done by the Just-In-Time compiler for Lua were not needed by a C compiler. Today there are many people that want to program in Lua or Python or JavaScript instead of C, so they need a quite more refined optimizer and compiler, like LuaJIT2 or Unladen Swallow or V8.

You also have new smaller challenges created by multi-core CPUs and languages that are functional, immutable-based. That's why modern compilers are quickly improving today too, and today we still need such improvements.

Bye,
bearophile
Nov 10 2009
bearophile wrote:
> I routinely see D benchmarks that are 2+ times faster with LDC compared to DMD.

Have to be careful about benchmarks without looking at why. A few months ago, a benchmark was posted here purportedly showing that dmd was awful at integer math. Turns out, the problem was entirely in the long divide function, not the code generator at all. I rewrote the long divide helper function, and problem solved.
Nov 10 2009
Walter Bright Wrote:
> What the library functions also do is have a runtime switch based on the capabilities of the processor, switching to operations tailored to that processor. To generate the code directly, assuming the existence of SSE, is to mean the code will only run on modern chips. Whether or not this is a problem depends on your application.

For my purposes, runtime detection is probably out the window, unless the tests for it can happen infrequently enough to reduce the overhead. There are too many SSE variations to switch on them all, and they incrementally provide better and better functionality that I could make use of. I'd rather compile different executables for different hardware and distribute them all (e.g. detect the SSE version at compile time).

Really, high performance graphics is an exercise in getting tightly vectorized code to inline appropriately, eliminating as many loads and stores as possible, and then on top of that building algorithms that don't suck in runtime or memory/cache complexity. Often in computer graphics you end up distilling a huge amount of operations down to SIMD instructions that are very highly-threaded and have (hopefully) minimal I/O. If you introduce any extra overhead for getting to those SIMD instructions, you usually take a measurable throughput hit.

I'd like to see D give me a much better mix of high throughput + high coding productivity. As it stands, I've got high throughput + medium coding productivity in C++. I've started looking at some ldc code to lurch towards this goal, and if there is something I can look at in dmd2 itself to help out, I'd love to. Just point me where you think I ought to start.

-Mike
Nov 10 2009
Mike Farnsworth wrote:
> For my purposes, runtime detection is probably out the window, unless the tests for it can happen infrequently enough to reduce the overhead. There are too many SSE variations to switch on them all, and they incrementally provide better and better functionality that I could make use of. I'd rather compile different executables for different hardware and distribute them all (e.g. detect the SSE version at compile time).

The way to do it is to not distribute multiple executables, but have the initialization code detect the chip. Then, you compile the same code for different instructions, and have a high level runtime switch between them. I used to do this for machines with and without x87 support.
Nov 10 2009
Walter Bright Wrote:
> The way to do it is to not distribute multiple executables, but have the initialization code detect the chip. Then, you compile the same code for different instructions, and have a high level runtime switch between them. I used to do this for machines with and without x87 support.

Was it actually rewriting the executable code to call the alternate functions (e.g. an exe load-time decision, patch the code in memory, and then run)? I thought that sort of thing would run into all sorts of runtime linker issues (ro code pages in memory, shared libs that also need the rewriting, etc.), but then again, they do that with JIT compiling all the time. Does dmd already have some of this capability hanging around (but not used yet)?

-Mike
Nov 10 2009
Mike Farnsworth wrote:
> Was it actually rewriting the executable code to call the alternate functions (e.g. an exe load-time decision, patch the code in memory, and then run)? I thought that sort of thing would run into all sorts of runtime linker issues (ro code pages in memory, shared libs that also need the rewriting, etc.), but then again, they do that with JIT compiling all the time.

It's much simpler than that. Some C:

=========================================

void foo_with_FPU();
void foo_without_FPU();

void (*foo)();

void main()
{
    int has_fp = doesCPUhaveFPU();
    if (has_fp)
        foo = &foo_with_FPU;
    else
        foo = &foo_without_FPU;
    ... execute app ...
    (*foo)();
    ... execute more app ...
}

=========================================

#if WITH_FPU
#define FOO foo_with_FPU
#else
#define FOO foo_without_FPU
#endif

void FOO()
{
    ... do some floating point calculations ...
}

==========================================

dmc -DWITH_FPU -c foo.c -f -ofoo_with_fpu.obj
dmc -c foo.c -ofoo_without_fpu.obj
dmc app.obj foo_with_fpu.obj foo_without_fpu.obj

===========================================

Hope that makes it clearer. No runtime linking, no runtime compiling, no self-modifying code, etc.

A better way to do it is to put your FP code behind a class interface, then have derived classes implement them, compiled with different instruction set options. At runtime, decide which derived class to use.
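(A D sketch of that class-interface approach -- illustrative names throughout, with std.cpuid assumed for the detection:)

    import std.cpuid;

    interface VectorMath
    {
        void add(float[] a, float[] b, float[] result);
    }

    class ScalarMath : VectorMath
    {
        void add(float[] a, float[] b, float[] result)
        {
            foreach (i, x; a)
                result[i] = x + b[i];
        }
    }

    // This class would live in a file compiled with SSE codegen enabled
    // (or use hand-written SSE asm in its methods).
    class SSEMath : VectorMath
    {
        void add(float[] a, float[] b, float[] result)
        {
            result[] = a[] + b[];
        }
    }

    VectorMath math;

    static this()
    {
        math = std.cpuid.sse() ? new SSEMath : new ScalarMath;
    }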
Nov 10 2009
Walter Bright wrote:
> ... To generate the code directly, assuming the existence of SSE, is to mean the code will only run on modern chips. Whether or not this is a problem depends on your application.

If MMX/SSE/SSE2 optimizations are low-lying fruit, I'd at least like to have an -sse (and maybe -sse2, -sse3, and -no-sse) switch for the compiler to determine whether the compiler emits those instructions or not.

I'm also wondering if a more ideal approach (and perhaps an additional option to those above) would be to borrow the best of JIT compilation and emit multiple code paths. Maybe the program would have a bootstrap phase when starting up where it would call cpuid, find out what it has available, rewrite the main binary to use the optimal paths, then execute the main binary. That way feature detection doesn't happen while the program itself is running, and thus doesn't slow down the computations as they happen. Then passing -sse* would cause it to not emit the bootstrap, but instead just assume that the instructions will be available.
Nov 10 2009
Chad J Wrote:
> I'm also wondering if a more ideal approach (and perhaps an additional option to those above) would be to borrow the best of JIT compilation and emit multiple code paths. Maybe the program would have a bootstrap phase when starting up where it would call cpuid, find out what it has available, rewrite the main binary to use the optimal paths, then execute the main binary.

Incidentally, if you use LLVM to compile to their bitcode, you can at runtime do exactly this sort of thing based on the host hardware, selecting opt passes and having it run codegen based on your exact hardware. As long as using a given intrinsic falls through to the right glue code where it isn't supported, or else you let the compiler deduce where to use the fancier instructions (not as likely to happen), that works out nicely.

-Mike
Nov 10 2009