digitalmars.D - 64-bit and SSE
- dsimcha (7/7) Mar 02 2010 Given that Walter has indicated that 64-bit support is on the agenda for...
- retard (2/11) Mar 02 2010 SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?
- Nick Sabalausky (3/14) Mar 02 2010 Yes. The ones who enjoy arbitrarily shrinking their potential user base....
- retard (10/28) Mar 02 2010 Why not dynamic code path selection:
- dsimcha (11/39) Mar 02 2010 Two reasons: At the top end it's more trouble than it's worth unless th...
- retard (6/49) Mar 02 2010 At least popular video decoders like mplayer seem to have dynamic cpu
- retard (3/8) Mar 02 2010 Also the code example was badly crafted. It's not even my opinion that w...
- Nick Sabalausky (10/39) Mar 02 2010 I'd think that kind of branching could be done automatically by a reason...
- retard (5/40) Mar 02 2010 You can compile two versions of e.g. mplayer. The one with architecture
- Don (4/34) Mar 02 2010 The method needs to be fairly large for that to be beneficial. For
- Nick Sabalausky (75/109) Mar 02 2010 You can still just increase the grain-size as needed. For instance, take...
- Rainer Deyke (17/33) Mar 02 2010 Why not do it at the largest possible level of granularity?
- Robert Jacques (4/35) Mar 02 2010 That's great until you start linking, or worse, dynamically linking. The...
- Don (4/37) Mar 03 2010 I don't think that ever makes sense. I'd just compile multiple
- Don (13/20) Mar 02 2010 I think the way to do this will be, as a first step, to use SSE for
- bearophile (4/6) Mar 02 2010 Can you explain me this a bit better? Why can't D use SSE registers in 3...
- Don (2/10) Mar 02 2010 It can, but not in the ABI. So support in 64-bit can be better.
- dsimcha (5/7) Mar 02 2010 Is SSE(2) inherently faster then (at least in real-world implementations...
- bearophile (5/7) Mar 02 2010 sqrt for example is fast, and there are other high level instructions (f...
- Don (13/21) Mar 03 2010 No. (Except on Pentium 4, where SSE was basically the only part of the
- #ponce (6/10) Mar 04 2010 There is a couple of interesting scalar instructions in SSE
Given that Walter has indicated that 64-bit support is on the agenda for after D2 is finished and x87 is deprecated in 64-bit mode, will we also see SSE(2) support in DMD in the relatively near future? If so, will it be exposed as a compiler option even when compiling in 32-bit mode? I've realized that this is kind of important for me since Intel deprecated x87 on its Core 2 and Pentium 4 chips, meaning any old school floating point code runs painfully slow compared to, say, an AMD chip that still has a decent x87.
Mar 02 2010
Tue, 02 Mar 2010 15:49:02 +0000, dsimcha wrote:
> Given that Walter has indicated that 64-bit support is on the agenda for after D2 is finished and x87 is deprecated in 64-bit mode, will we also see SSE(2) support in DMD in the relatively near future?

SSE(2)? Don't people already use SSE 4.2 and prepare for AVX?
Mar 02 2010
"retard" <re tard.com.invalid> wrote in message news:hmjmjd$15uj$1 digitalmars.com...
> Tue, 02 Mar 2010 15:49:02 +0000, dsimcha wrote:
>> Given that Walter has indicated that 64-bit support is on the agenda for after D2 is finished and x87 is deprecated in 64-bit mode, will we also see SSE(2) support in DMD in the relatively near future?
>
> SSE(2)? Don't people already use SSE 4.2 and prepare for AVX?

Yes. The ones who enjoy arbitrarily shrinking their potential user base.
Mar 02 2010
Tue, 02 Mar 2010 14:17:12 -0500, Nick Sabalausky wrote:
> "retard" <re tard.com.invalid> wrote in message news:hmjmjd$15uj$1 digitalmars.com...
>> SSE(2)? Don't people already use SSE 4.2 and prepare for AVX?
>
> Yes. The ones who enjoy arbitrarily shrinking their potential user base.

Why not dynamic code path selection:

    if (cpu_capabilities & SSE4_2)
        run_fast_method();
    else if (cpu_capabilities & SSE2)
        run_medium_fast_method();
    else
        run_slow_method();

One could also use higher level design patterns like abstract factories here.
Mar 02 2010
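The run-time dispatch proposed in the post above can be sketched in D roughly as follows. This is a minimal sketch, not code from the thread: it assumes druntime's `core.cpuid` module for detection, and `Level`, `pickLevel`, `detectLevel` and the `sum*` kernels are hypothetical names invented for illustration. The three kernels are placeholders that compute the same result; only the selection logic is the point.

```d
import core.cpuid : sse2, sse42;

// Hypothetical tiers; the names are illustrative, not from the thread.
enum Level { generic, withSSE2, withSSE42 }

// Decision logic kept separate from detection so it can be tested
// independently of the machine it runs on.
Level pickLevel(bool hasSSE2, bool hasSSE42)
{
    if (hasSSE42) return Level.withSSE42;
    if (hasSSE2)  return Level.withSSE2;
    return Level.generic;
}

// Ask druntime's core.cpuid what the running CPU supports.
Level detectLevel()
{
    return pickLevel(sse2, sse42);
}

// Placeholder kernels: all three compute the same sum. Real versions
// would differ only in the instructions they use.
float sumGeneric(const(float)[] a)
{
    float s = 0;
    foreach (v; a) s += v;
    return s;
}
float sumSSE2(const(float)[] a)  { return sumGeneric(a); }
float sumSSE42(const(float)[] a) { return sumGeneric(a); }

// One run-time branch per call selects the kernel.
float sum(const(float)[] a)
{
    final switch (detectLevel())
    {
        case Level.withSSE42: return sumSSE42(a);
        case Level.withSSE2:  return sumSSE2(a);
        case Level.generic:   return sumGeneric(a);
    }
}

unittest
{
    assert(pickLevel(false, false) == Level.generic);
    assert(pickLevel(true, true) == Level.withSSE42);
    assert(sum([1.0f, 2.0f, 3.0f]) == 6.0f);
}
```

Something like `dmd -main -unittest -run dispatch.d` should exercise it (the file name is arbitrary). Note that the branch is paid on every call to `sum`, which is why the size of the dispatched function matters.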
== Quote from retard (re tard.com.invalid)'s article
> Why not dynamic code path selection:
>
>     if (cpu_capabilities & SSE4_2)
>         run_fast_method();
>     else if (cpu_capabilities & SSE2)
>         run_medium_fast_method();
>     else
>         run_slow_method();
>
> One could also use higher level design patterns like abstract factories here.

Two reasons: At the top end it's more trouble than it's worth unless the code is **really** performance critical, and a lot of code with really performance critical floating point is scientific computing code that may only have to run on one arch anyhow.

At the bottom end, who the heck still uses machines that don't support SSE2? I agree with Nick to some degree that developers shouldn't assume their audiences have the latest and greatest, but SSE2 has been supported by AMD for about 7 years and Intel for about 9. I'm pretty sure that's at least a few standard deviations longer than the average lifetime of computer equipment. You have to draw the line somewhere or we'd all be tweaking our programs to fit in 640k of address space.
Mar 02 2010
Tue, 02 Mar 2010 19:41:17 +0000, dsimcha wrote:
> Two reasons: At the top end it's more trouble than it's worth unless the code is **really** performance critical, and a lot of code with really performance critical floating point is scientific computing code that may only have to run on one arch anyhow.
>
> At the bottom end, who the heck still uses machines that don't support SSE2?

At least popular video decoders like mplayer seem to have dynamic CPU detection. No idea how exactly it affects the instructions used in the code.

Well, there are all kinds of x86 clones out there. I don't remember what instructions they all support, but they have more variation than the Pentium/Core line of processors.
Mar 02 2010
Tue, 02 Mar 2010 19:46:20 +0000, retard wrote:
> At least popular video decoders like mplayer seem to have dynamic cpu detection. No idea, how it exactly affects the instructions used in the code.
>
> Well, there are all kinds of x86 clones out there. I don't remember what instructions they all support, but they have more variation than the pentium/core line of processors.

Also the code example was badly crafted. It's not even my opinion that we should support all legacy pre-586 CPUs.
Mar 02 2010
"dsimcha" <dsimcha yahoo.com> wrote in message news:hmjpkt$1i20$1 digitalmars.com...
> == Quote from retard (re tard.com.invalid)'s article
>> Why not dynamic code path selection: [...] One could also use higher level design patterns like abstract factories here.
>
> Two reasons: At the top end it's more trouble than it's worth unless the code is **really** performance critical, and a lot of code with really performance critical floating point is scientific computing code that may only have to run on one arch anyhow.

I'd think that kind of branching could be done automatically by a reasonably intelligent optimizer. Or there's the possibility of compiling-upon-installation that could just detect the CPU being used (although that admittedly comes with a few difficulties and potential issues).

I guess I was only assuming that retard was suggesting requiring > SSE2. I'm not sure if he really did mean it that way.

> At the bottom end, who the heck still uses machines that don't support SSE2?

My Linux box is an AMD without SSE2.

> You have to draw the line somewhere or we'd all be tweaking our programs to fit in 640k of address space.

Certainly true.
Mar 02 2010
Tue, 02 Mar 2010 15:13:10 -0500, Nick Sabalausky wrote:
> I'd think that kind of branching could be done automatically by a reasonably intelligent optimizer. Or there's the possibility of compiling-upon-installation that could just detect the CPU being used (although that admittedly comes with a few difficulties and potential issues).

You can compile two versions of e.g. mplayer: one with the architecture fixed at compile time and one with dynamic CPU detection. The latter is rather useful for free Linux live CDs, where you really can't guess all the target machines beforehand.
Mar 02 2010
retard wrote:
> Why not dynamic code path selection:
>
>     if (cpu_capabilities & SSE4_2)
>         run_fast_method();
>     else if (cpu_capabilities & SSE2)
>         run_medium_fast_method();
>     else
>         run_slow_method();
>
> One could also use higher level design patterns like abstract factories here.

The method needs to be fairly large for that to be beneficial. For fine-grained stuff, like basic operations on 3D vectors, it doesn't work at all. And that's one of the primary use cases for SSE.
Mar 02 2010
"Don" <nospam nospam.com> wrote in message news:hmk01v$1u32$1 digitalmars.com...
> The method needs to be fairly large for that to be beneficial. For fine-grained stuff, like basic operations on 3D vectors, it doesn't work at all. And that's one of the primary use cases for SSE.

You can still just increase the grain-size as needed. For instance, take this example of code that is too fine-grained:

-------------------------------------------
void fineGrainedA(Param p)
{
    if (supports_SSE4) {
        // Use SSE4
    } else if (supports_SSE2) {
        // Use SSE2
    } else {
        // Use Default
    }
}

void fineGrainedB(Param p)
{
    if (supports_SSE4) {
        // Use SSE4
    } else if (supports_SSE2) {
        // Use SSE2
    } else {
        // Use Default
    }
}

void foo()
{
    // The CPU check is repeated inside every call, for every element.
    foreach (thing; bunchOThings)
    {
        fineGrainedA(thing);
        fineGrainedB(thing);
    }
}
-------------------------------------------

That can be turned into this (and a smart optimizer could probably do it automatically, especially if it's the compiler that's internally generating 'fineGrainedA' and 'fineGrainedB' in the first place):

-------------------------------------------
enum CPUVer
{
    SSE4, SSE2, Default
}

void fineGrainedA(CPUVer ver)(Param p)
{
    static if (ver == CPUVer.SSE4) {
        // Use SSE4
    } else static if (ver == CPUVer.SSE2) {
        // Use SSE2
    } else {
        // Use Default
    }
}

void fineGrainedB(CPUVer ver)(Param p)
{
    static if (ver == CPUVer.SSE4) {
        // Use SSE4
    } else static if (ver == CPUVer.SSE2) {
        // Use SSE2
    } else {
        // Use Default
    }
}

void fooImpl(CPUVer ver)()
{
    // No run-time CPU check left inside the loop.
    foreach (thing; bunchOThings)
    {
        fineGrainedA!(ver)(thing);
        fineGrainedB!(ver)(thing);
    }
}

void foo()
{
    // The CPU check now happens once per call to foo.
    if (supports_SSE4)
        fooImpl!(CPUVer.SSE4)();
    else if (supports_SSE2)
        fooImpl!(CPUVer.SSE2)();
    else
        fooImpl!(CPUVer.Default)();
}
-------------------------------------------

And if foo gets called a lot, like in some loop, you can just take things another level out.
Mar 02 2010
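The hoisting idea in the post above leaves the bodies as comments, so here is a compilable variant of the same pattern. It is only a sketch under stated assumptions: `scale`, `scaleAllImpl` and `scaleAll` are hypothetical names, the three code paths are placeholders that compute the same value, and the CPU capabilities are passed in as plain booleans rather than detected.

```d
enum CPUVer { SSE4, SSE2, Default }

// The variant is chosen at compile time, so no branch remains inside.
// All three bodies are placeholders computing the same result.
float scale(CPUVer ver)(float x)
{
    static if (ver == CPUVer.SSE4)
        return x * 2;   // an SSE4 code path would go here
    else static if (ver == CPUVer.SSE2)
        return x * 2;   // an SSE2 code path would go here
    else
        return x * 2;   // portable fallback
}

// The loop is instantiated once per CPU version.
void scaleAllImpl(CPUVer ver)(float[] data)
{
    foreach (ref v; data)
        v = scale!ver(v);
}

// The only run-time branch: taken once per call, not once per element.
void scaleAll(float[] data, bool hasSSE4, bool hasSSE2)
{
    if (hasSSE4)      scaleAllImpl!(CPUVer.SSE4)(data);
    else if (hasSSE2) scaleAllImpl!(CPUVer.SSE2)(data);
    else              scaleAllImpl!(CPUVer.Default)(data);
}

unittest
{
    float[] d = [1.0f, 2.0f, 3.0f];
    scaleAll(d, true, true);
    assert(d == [2.0f, 4.0f, 6.0f]);
}
```

The cost is code size: every function on the hot path is instantiated once per `CPUVer` value.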
On 3/2/2010 14:28, Don wrote:
> retard wrote:
>> Why not dynamic code path selection: [...] One could also use higher level design patterns like abstract factories here.
>
> The method needs to be fairly large for that to be beneficial. For fine-grained stuff, like basic operations on 3D vectors, it doesn't work at all. And that's one of the primary use cases for SSE.

Why not do it at the largest possible level of granularity?

    int main() {
        if (cpu_capabilities & SSE4_2) {
            return run_fast_main();
        } else if (cpu_capabilities & SSE2) {
            return run_medium_fast_main();
        } else {
            return run_slow_main();
        }
    }

The compiler should be able to do this automatically by compiling every single function in the program N times with N different code generation settings. Executable size will skyrocket, but it won't matter because executable size is rarely a significant concern.

--
Rainer Deyke - rainerd eldwood.com
Mar 02 2010
On Tue, 02 Mar 2010 23:01:01 -0500, Rainer Deyke <rainerd eldwood.com> wrote:
> Why not do it at the largest possible level of granularity?
>
> The compiler should be able to do this automatically by compiling every single function in the program N times with N different code generation setting. Executable size will skyrocket, but it won't matter because executable size is rarely a significant concern.

That's great until you start linking, or worse, dynamically linking. Then you run into some major problems.
Mar 02 2010
Rainer Deyke wrote:
> Why not do it at the largest possible level of granularity?
>
> The compiler should be able to do this automatically by compiling every single function in the program N times with N different code generation setting. Executable size will skyrocket, but it won't matter because executable size is rarely a significant concern.

I don't think that ever makes sense. I'd just compile multiple executables with different settings, and select which one to use at install time.
Mar 03 2010
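One way to get several executables from a single source tree, as suggested in the post above, is D's `version` conditional compilation. The sketch below is an assumption about how that could look, not something proposed in the thread: `UseSSE2` is a made-up version identifier set with dmd's `-version=` switch, and `buildFlavour` and `dot3` are hypothetical names. Both branches compute the same placeholder result.

```d
// Build the SSE2 flavour with:     dmd -version=UseSSE2 app.d
// Build the portable flavour with: dmd app.d
version (UseSSE2)
    enum buildFlavour = "sse2";
else
    enum buildFlavour = "generic";

float dot3(const float[3] a, const float[3] b)
{
    version (UseSSE2)
    {
        // An SSE2-specific implementation would go here.
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }
    else
    {
        // Portable fallback.
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }
}

unittest
{
    float[3] a = [1.0f, 2.0f, 3.0f];
    float[3] b = [4.0f, 5.0f, 6.0f];
    assert(dot3(a, b) == 32.0f);
}
```

An installer would then pick the matching binary for the target machine, so no run-time branch remains in the shipped code.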
dsimcha wrote:
> Given that Walter has indicated that 64-bit support is on the agenda for after D2 is finished and x87 is deprecated in 64-bit mode, will we also see SSE(2) support in DMD in the relatively near future? If so, will it be exposed as a compiler option even when compiling in 32-bit mode?

I think the way to do this will be, as a first step, to use SSE for short vector operations. There's some really low-hanging fruit there. To get the full benefit from SSE we need to use SSE registers for parameter passing, but I think that'll only be possible with the 64-bit ABI.

> I've realized that this is kind of important for me since Intel deprecated x87 on its Core 2 and Pentium 4 chips, meaning any old school floating point code runs painfully slow compared to, say, an AMD chip that still has a decent x87.

x87 is only slow on P4. But everything is slow on P4. AFAIK 80-bit loads and stores are the only things which are slower on Core2 and i7 than on Pentium 3 (4 vs 2 cycles). And they're actually faster than AMD. So I don't think this is such a big issue. In fact, some of the x87 transcendental operations are faster on Core2 than on any earlier processor. So they still have a decent x87 :-).

Of course, on the occasions when SSE lets you do 4 operations at once, you get nearly a 4X speedup...
Mar 02 2010
Don:
> To get the full benefit from SSE we need to use SSE registers for parameter passing, but I think that'll only be possible with the 64-bit ABI.

Can you explain this to me a bit better? Why can't D use SSE registers in 32-bit code too?

Bye,
bearophile
Mar 02 2010
bearophile wrote:
> Don:
>> To get the full benefit from SSE we need to use SSE registers for parameter passing, but I think that'll only be possible with the 64-bit ABI.
>
> Can you explain this to me a bit better? Why can't D use SSE registers in 32-bit code too?

It can, but not in the ABI. So support in 64-bit can be better.
Mar 02 2010
== Quote from Don (nospam nospam.com)'s article
> Of course, on the occasions when SSE lets you do 4 operations at once, you get nearly a 4X speedup...

Is SSE(2) inherently faster (at least in real-world implementations) than x87, even when you don't vectorize? Would I be able to expect any speedup from going from x87 to SSE(2) for code that has a decent amount of implicit instruction level parallelism but wasn't explicitly vectorized either by me or the compiler?
Mar 02 2010
dsimcha:
> Is SSE(2) inherently faster (at least in real-world implementations) than x87, even when you don't vectorize?

sqrt for example is fast, and there are other high level instructions (for video decoding, cryptography, etc).

But you have to think about how much time has passed since the C language was designed. The CPUs of that time were profoundly different from the ones available now. If D has some success, future CPUs will surely be different from the current ones. I think SSE registers will be kind of obsolete when AVX comes out, about next year. Will you need to change the ABI again in D3 for AVX?

Bye,
bearophile
Mar 02 2010
dsimcha wrote:
> == Quote from Don (nospam nospam.com)'s article
>> Of course, on the occasions when SSE lets you do 4 operations at once, you get nearly a 4X speedup...
>
> Is SSE(2) inherently faster (at least in real-world implementations) than x87, even when you don't vectorize?

No. (Except on Pentium 4, where SSE was basically the only part of the CPU that wasn't crippled).

> Would I be able to expect any speedup from going from x87 to SSE(2) for code that has a decent amount of implicit instruction level parallelism but wasn't explicitly vectorized either by me or the compiler?

I doubt it. The only time that you get an easy benefit is when you have a mix of serial and parallel calculations.

    float[4] x, y;
    float z = some_calculation;
    x[] += z*y[];

If you're using SSE for all your calculations, z will already be in an SSE register, so it makes setting up the parallel calculation a bit quicker. And the compiler might be better at scheduling SSE code than x87. But that's not really a processor thing.
Mar 03 2010
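The three-line fragment in the post above can be expanded into something that compiles. This is a minimal sketch: `someCalculation` and `accumulate` are hypothetical names, and whether the array operation actually becomes SSE instructions depends on the compiler and target, which this example does not check.

```d
// Scalar ("serial") part: a single value computed on its own.
float someCalculation(float a, float b)
{
    return a * b + 1.0f;
}

// Vector ("parallel") part: a D array operation over four lanes,
// element-wise x[i] += z * y[i].
void accumulate(ref float[4] x, const float[4] y, float z)
{
    x[] += z * y[];
}

unittest
{
    float[4] x = [1.0f, 1.0f, 1.0f, 1.0f];
    float[4] y = [1.0f, 2.0f, 3.0f, 4.0f];
    float z = someCalculation(1.0f, 1.0f);   // 1 * 1 + 1 = 2
    accumulate(x, y, z);
    assert(x == [3.0f, 5.0f, 7.0f, 9.0f]);
}
```

The point made in the post is about where `z` lives: if the scalar result is already in an SSE register, broadcasting it into the four-lane operation needs no transfer from the x87 stack.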
> Is SSE(2) inherently faster (at least in real-world implementations) than x87, even when you don't vectorize? Would I be able to expect any speedup from going from x87 to SSE(2) for code that has a decent amount of implicit instruction level parallelism but wasn't explicitly vectorized either by me or the compiler?

There are a couple of interesting scalar instructions in SSE:
- cvttss2si: truncating float-to-int conversion (the same as floorf for non-negative values) without modifying the rounding mode (SSE2)
- 32-bit float square root and inverse square root
- min, max

SSE doesn't suffer from denormalization, which can be very useful.

I personally don't mind whether the compiler uses them or not, provided one can use inline assembly :)
Mar 04 2010
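Since cvttss2si converts with truncation toward zero, it matches floorf only for non-negative inputs. The sketch below shows the difference using ordinary D casts and `std.math.floor`. It says nothing about which instruction a given compiler emits, and `truncToInt` and `floorToInt` are hypothetical helper names.

```d
import std.math : floor;

// A float-to-int cast in D truncates toward zero, the same rounding
// behaviour as the cvttss2si instruction.
int truncToInt(float f)
{
    return cast(int) f;
}

// floor rounds toward negative infinity instead.
int floorToInt(float f)
{
    return cast(int) floor(f);
}

unittest
{
    // Non-negative input: both agree.
    assert(truncToInt(2.7f) == 2);
    assert(floorToInt(2.7f) == 2);
    // Negative input: truncation and floor differ by one.
    assert(truncToInt(-2.7f) == -2);
    assert(floorToInt(-2.7f) == -3);
}
```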