digitalmars.D - How about implementing SPMD on SIMD for D?
TL;DR: Would you like to run your code 8x-32x faster? SPMD (Single Program Multiple Data) on SIMD (Single Instruction Multiple Data) might be the answer you're looking for. It works by running multiple iterations/instances of your loop at once on SIMD, and the compiler can do that automatically for you on your normal loop & array code.

---

I'm a bit late to the party, but I was recently reading this ( http://pharr.org/matt/blog/2018/04/30/ispc-all.html ), a highly interesting blog post series about how one guy did what the Intel compiler team wouldn't or couldn't do. He wrote a C-like language and compiler on top of LLVM which transforms normal scalar code into "parallel" SIMD code. That compiler is called ISPC ( https://ispc.github.io/perf.html ).

It basically works similarly to GPU shaders, but the code runs on the CPU's SIMD units. You write your code for one thread/lane, and the compiler then runs N instances of that code simultaneously in lockstep. For example, an 8-iteration loop of (c.xyzw = a.xyzw + b.xyzw) would become 2 iterations of (x.cccc = x.aaaa + x.bbbb; y.cccc = y.aaaa + y.bbbb; z.cccc = z.aaaa + z.bbbb; w.cccc = w.aaaa + w.bbbb) (the notation here is a bit weird, but I was trying to keep it short). Branches are done using masking, so the code runs both sides of the branch but masks away the wrong results. All of this is described far better in the paper they wrote about it ( http://pharr.org/matt/papers/ispc_inpar_2012.pdf ). I recommend reading it.

I was also looking at some videos from Unity (game engine/framework) about their new "Performance by default" initiative, where code is restricted to functions, slices and annotations (no classes). That reminded me of D :). One thing they touched on was pointer aliasing, and how slices and custom compiler tech (that knows about the other engine systems) allow them to avoid aliasing and produce more optimal code. The interesting part, however, was that their compiler does similar things to ISPC when the programmer gives it specific annotations. A video about the tech/compiler is here ( https://www.youtube.com/watch?v=NF6kcNS6U80&feature=youtu.be?list=PLX2vGYjWbI0S8jCJKYT-mIZf7YCuF-Ka ).

It occurred to me that SPMD on SIMD would be a really nice addition to D's arsenal. Especially since D doesn't even attempt any auto-vectorization (poor results and difficult to implement) and manual SIMD loops are quite tedious to write (even std.simd failed to materialize), SPMD would be a nice alternative. D also has some existing vector syntax and specialization, so there's a precedent for vector programming; this could be considered an extension of that. SPMD should also be relatively easy to implement (I'm not a compiler expert), since it's only a code transformation and not an optimization. Finally, I don't think any serious systems/performance-oriented language can ignore those kinds of performance figures for too long.

I had something like this in mind:

spmd //or simd
// NOTE: just removing spmd would mean it's a normal loop, great for debugging
foreach( int i; 0 .. 100 )
{
    c[i] = a[i] + b[i];
}

or

void doSum( float4[] a, float4[] b, float4[] c ) spmd //or simd
{
    // NOTE: equivalent to c[i] = a[i] + b[i]; the array index is implicit
    // because of spmd, it's just some index in 0 .. a.length
    c = a + b;
}

What do you think?
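To make the intent concrete, here is roughly what such a transformation would have to produce for one SSE lane width, written with the float4 vector type that already exists in core.simd. This is only a sketch: the function names are made up for the example, and the spmd annotation above is the hypothetical part that would automate the rewrite.

import core.simd;

// Scalar version: one element per iteration, easy to write and debug.
void sumScalar(float[] a, float[] b, float[] c)
{
    foreach (i; 0 .. c.length)
        c[i] = a[i] + b[i];
}

// Hand-vectorized version: four elements per iteration,
// roughly what an SPMD transformation would generate for SSE.
void sumSimd(float4[] a, float4[] b, float4[] c)
{
    foreach (i; 0 .. c.length)
        c[i] = a[i] + b[i]; // one packed add instead of four scalar adds
}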
Jul 06 2018
I think fixing the SIMD support would also be great:

1) Make it compatible with Intel intrinsics.
2) Update it to work with AVX-512.
3) Add some preliminary support for ARM Neon.
4) Optional: make SIMD compile for all 32-bit targets, so I don't have to write long-ass assembly code for stuff. Once I even broke LDC with my code (see CPUblit).
Jul 06 2018
On Saturday, 7 July 2018 at 02:00:14 UTC, solidstate1991 wrote:
> I think fixing the SIMD support would also be great:
> 1) Make it compatible with Intel intrinsics.

Working on this: https://github.com/AuburnSounds/intel-intrinsics

With LDC I'm getting the best performance out of vector intrinsics (wrapped or not with that library). But this is a layer that will change less than LDC intrinsics, and can …

> 2) Update it to work with AVX-512.

The nice thing about LLVM intrinsics is that _the semantics are decorrelated from the codegen_. You can generate AVX instructions (if you really think it helps you) while writing the more common SSE intrinsics.

> 3) Add some preliminary support for ARM Neon.

I think it will already work if you use LDC intrinsics (or intel-intrinsics) and build for ARM.

> 4) Optional: make SIMD compile for all 32-bit targets, so I don't have to write long-ass assembly code for stuff.

Oh yes, so much this. It would be very nice to have D_SIMD on DMD 32-bit.
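As a rough illustration of the "write SSE semantics, let the backend pick the codegen" point, here is a minimal sketch using the intel-intrinsics package, assuming its inteli.xmmintrin module mirrors the Intel _mm_* API; the add4 helper name is made up for the example.

import inteli.xmmintrin; // SSE intrinsics in the Intel style

// Adds two float arrays four lanes at a time, written against SSE semantics.
// With a suitable -mcpu/-mattr, LLVM is free to encode these as VEX/AVX instructions.
void add4(const(float)* a, const(float)* b, float* c, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}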
Jul 07 2018
On Saturday, 7 July 2018 at 13:20:53 UTC, Guillaume Piolat wrote:
> Oh yes, so much this. It would be very nice to have D_SIMD on DMD 32-bit.

Does D_SIMD work on LDC?
Jul 07 2018
On Saturday, 7 July 2018 at 15:34:40 UTC, solidstate1991 wrote:
> On Saturday, 7 July 2018 at 13:20:53 UTC, Guillaume Piolat wrote:
>> Oh yes, so much this. It would be very nice to have D_SIMD on DMD 32-bit.
>
> Does D_SIMD work on LDC?

No, it has different capabilities. (However, some of it is similar.)
Jul 07 2018
On Saturday, 7 July 2018 at 13:20:53 UTC, Guillaume Piolat wrote:
> The nice thing about LLVM intrinsics is that _the semantics are decorrelated from the codegen_. You can generate AVX instructions (if you really think it helps you) while writing the more common SSE intrinsics.

Word to the wise: some platforms out there require you to generate AVX encodings, not SSE. Nudge nudge.
Jul 07 2018
On Saturday, 7 July 2018 at 21:59:09 UTC, Ethan wrote:
> Word to the wise: some platforms out there require you to generate AVX encodings, not SSE. Nudge nudge.

Can you elaborate? By AVX encoding, do you mean the new VEX encoding or AVX intrinsics?
Jul 08 2018
On Sunday, 8 July 2018 at 11:47:20 UTC, Guillaume Piolat wrote:
> Can you elaborate? By AVX encoding, do you mean the new VEX encoding or AVX intrinsics?

VEX encoding. AVX intrinsics in the Intel API are just intrinsics extensions like every SSE revision before them. It's purely up to the compiler how it transforms those intrinsics into instructions.
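To make that distinction concrete: with LDC the encoding is chosen by the target flags, not by the source. A sketch, reusing the intel-intrinsics assumption from the earlier example; the file name is illustrative, and -mcpu/-mattr are the usual ldc2 switches.

import inteli.xmmintrin;

// The same SSE-level source, two encodings depending on how it's built:
//   ldc2 -O vexdemo.d                 -> legacy SSE encoding (addps xmm, xmm)
//   ldc2 -O -mcpu=haswell vexdemo.d   -> VEX encoding (vaddps xmm, xmm, xmm)
// Nothing in the D code changes; only the instruction encoding the backend picks.
__m128 addOnce(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}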
Jul 08 2018
On Friday, 6 July 2018 at 23:08:27 UTC, Random D user wrote:
> [SPMD] works by running multiple iterations/instances of your loop at once on SIMD and the compiler could do that automatically for you and your normal loop & array code.
>
> It occurred to me that SPMD on SIMD would be a really nice addition to D's arsenal. Especially since D doesn't even attempt any auto-vectorization (poor results and difficult to implement) and manual SIMD loops are quite tedious to write (even std.simd failed to materialize), SPMD would be a nice alternative.

I think you are mistaken: D code is often autovectorized when using LDC. Sometimes it's not, and it's hard to know why. A pragma we could have is the one in the Intel C++ Compiler that says "hey, this loop is safe to autovectorize".

> What do you think?

I think that ispc is like OpenCL on the CPU, but can't work on the GPU, FPGA or other OpenCL implementations. OpenCL is so fast because caching is explicit (several levels of memory are exposed).
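For reference, here is the kind of trivially parallel loop that LDC's LLVM backend will usually vectorize on its own. A sketch rather than a guarantee: whether it happens depends on the optimization level and target (something like ldc2 -O3 -release -mcpu=native), and the function name is made up for the example.

// LLVM typically turns this into packed SIMD adds at -O2/-O3,
// inserting runtime checks that a, b and c don't overlap.
// Build with -release (or -boundscheck=off) so array bounds checks
// don't get in the vectorizer's way.
void addAll(float[] a, float[] b, float[] c)
{
    foreach (i; 0 .. c.length)
        c[i] = a[i] + b[i];
}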
Jul 07 2018
On Saturday, 7 July 2018 at 13:26:10 UTC, Guillaume Piolat wrote:
> On Friday, 6 July 2018 at 23:08:27 UTC, Random D user wrote:
>> Especially since D doesn't even attempt any auto-vectorization (poor results and difficult to implement) and manual SIMD loops are quite tedious to write (even std.simd failed to materialize), SPMD would be a nice alternative.
>
> I think you are mistaken: D code is often autovectorized when using LDC.

That is good to know. I haven't looked that much into LDC (or clang). I mostly use dmd for a fast edit-compile cycle, although the plan is to use LDC for "release"/optimized builds eventually.

Anyway, I would just want to code some non-trivial loops in SIMD, but I wouldn't want to fiddle with intrinsics, or write a higher-level wrapper for them. In my experience, you can only get the real benefits out of SIMD if you carefully handcraft your hot loops to fully use it. Sprinkling some SIMD here and there with a SIMD vector type doesn't really seem to yield big benefits.

> Sometimes it's not and it's hard to know why.

Exactly. In my experience compilers (msvc) often don't.

> A pragma we could have is the one in the Intel C++ Compiler that says "hey this loop is safe to autovectorize".

Yeah, it should be similar.

>> What do you think?
>
> I think that ispc is like OpenCL on the CPU, but can't work on the GPU, FPGA or other OpenCL implementations. OpenCL is so fast because caching is explicit (several levels of memory are exposed).

The point is not to run it on the GPU; you can do CUDA, OpenCL, compute shaders etc. for that. CPU code is much easier to debug, and sometimes you're already doing things on the GPU but your CPU side has more room for computation. And you don't have to copy your data between the GPU and CPU or deal with latency. Of course, OpenCL runs on the CPU too, but I think there's quite a bit of code required to set it up and use it.

I guess my point was that I would like to do CPU SIMD code easily, without intrinsics (or manually trying to trick the compiler into vectorizing the code). SPMD stuff seems to solve these issues. It would also be a forward-looking step for D. Ideally, just write your loop normally, debug it, and add an annotation to get it to run fast on SIMD. Done.
Jul 08 2018
On Sunday, 8 July 2018 at 19:07:57 UTC, Random D user wrote:
> In my experience, you can only get the real benefits out of SIMD if you carefully handcraft your hot loops to fully use it. Sprinkling some SIMD here and there with a SIMD vector type doesn't really seem to yield big benefits.

I agree. That's why it's useful to have a (stable) syntax for instructions like PMADDWD that can't be described with just operator overloading.
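For illustration: PMADDWD multiplies pairs of signed 16-bit lanes and sums adjacent products into 32-bit lanes, which plain + or * overloading on vector types can't express as a single instruction. A minimal sketch, assuming intel-intrinsics mirrors the Intel API's _mm_madd_epi16 from emmintrin; the function name is made up for the example.

import inteli.emmintrin; // assumption: SSE2-level intrinsics in the Intel style

// Each 32-bit output lane is a[2i]*b[2i] + a[2i+1]*b[2i+1], i.e. a single PMADDWD.
__m128i maddPairs(__m128i a, __m128i b)
{
    return _mm_madd_epi16(a, b);
}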
Jul 09 2018