digitalmars.D.ldc - ldc nvvm GPU intrinsics good news
- Bruce Carneal (14/14) Mar 04 2021 After updating the first line to
- kinke (9/14) Mar 04 2021 Sorry, I assumed you had already seen it (plus
- Bruce Carneal (4/18) Mar 04 2021 So what is the right/clean way to get @compute access to these
- kinke (7/10) Mar 04 2021 AFAICT, the only challenge is to find a suited signature in D for
- Bruce Carneal (11/21) Mar 04 2021 To be clear, I'm not asking for how a new builtin might be added
- kinke (9/19) Mar 05 2021 The dcompute parts in the compiler could definitely be improved
- Bruce Carneal (20/29) Mar 05 2021 Okay. I'll do the gccbuiltins_nvvm.di copy/edit for use in
- kinke (7/11) Mar 05 2021 For __nvvm_vote_ballot (no idea what it does), this seems to
- kinke (5/6) Mar 05 2021 Small correction (IR generation was fine, compilation wasn't):
- Bruce Carneal (8/14) Mar 05 2021 Yep. The above works for me as well. cuda_620 (sm_62 IIUC) is
- Bruce Carneal (21/28) Mar 04 2021 Good to know. __nvvm_vote_ballot and __nvvm_vote_ballot_sync
- Johan Engelen (15/20) Apr 25 2021 Hi Bruce,
- Bruce Carneal (22/42) Apr 26 2021 Yes, I'll help when the current push is over here, but I think I
- Imperatorn (3/10) Apr 27 2021 Nice work, thanks for wanting to improve dcompute! I think D has
After updating the first line to ' compute(CompileFor.hostAndDevice) module ...' and adding an 'import ldc.dcompute;' line, the runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC build apparently gives access to all manner of GPU intrinsics. I've only tried it out on __syncthreads and __nvvm_shfl_down_i32 but both "invocations" of those LDC_intrinsics resulted in the expected single ptx instruction in the dcompute .ptx output. There are over 600 pragma(LDC_intrinsic, "llvm.nvvm.xxxxxx") builtins in the gcc_builtins_nvvm.di file so, of course, I've not hand tested them all but it looks very promising. If you're working with dcompute on an OpenCL device, I'd love to hear if something similar works for your use cases of if you've found another way forward.
Mar 04 2021
On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:After updating the first line to ' compute(CompileFor.hostAndDevice) module ...' and adding an 'import ldc.dcompute;' line, the runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC build apparently gives access to all manner of GPU intrinsics.Sorry, I assumed you had already seen it (plus ldc/gccbuiltins_amdgcn.di); they've been introduced with LDC v1.22. These files are auto-generated from the LLVM .td file using a tiny program [1]. It's currently limited to intrinsics with a GCCBuiltinName, so if you'd like to make more of these available in D, see processRecord(). [1] https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cpp
Mar 04 2021
On Friday, 5 March 2021 at 00:16:41 UTC, kinke wrote:On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:So what is the right/clean way to get compute access to these builtins? Should practitioners do a manual copy edit or is there a better way?After updating the first line to ' compute(CompileFor.hostAndDevice) module ...' and adding an 'import ldc.dcompute;' line, the runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC build apparently gives access to all manner of GPU intrinsics.Sorry, I assumed you had already seen it (plus ldc/gccbuiltins_amdgcn.di); they've been introduced with LDC v1.22. These files are auto-generated from the LLVM .td file using a tiny program [1]. It's currently limited to intrinsics with a GCCBuiltinName, so if you'd like to make more of these available in D, see processRecord(). [1] https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cpp
Mar 04 2021
On Friday, 5 March 2021 at 00:32:03 UTC, Bruce Carneal wrote:So what is the right/clean way to get compute access to these builtins? Should practitioners do a manual copy edit or is there a better way?AFAICT, the only challenge is to find a suited signature in D for the LLVM intrinsic, incl. some unique name if there's no gcc builtin name - as a GitHub PR for that tool, so that all future .di files contain these extra intrinsics. To get there, I'd suggest adding some temp logging to inspect the records from the parsed .td file.
Mar 04 2021
On Friday, 5 March 2021 at 00:48:26 UTC, kinke wrote:On Friday, 5 March 2021 at 00:32:03 UTC, Bruce Carneal wrote:To be clear, I'm not asking for how a new builtin might be added cleanly, I'm asking if there is a better way to access all the builtins present in the current .di file. As noted above, I had to copy and edit the .di file to avoid this error from ldc: ...: Error: can only call functions from other compute modules in compute code Once I did that, everything I have tried from the edited .di file appears to work as intended. Amazing! I just thought that copying/editing could be avoided.So what is the right/clean way to get compute access to these builtins? Should practitioners do a manual copy edit or is there a better way?AFAICT, the only challenge is to find a suited signature in D for the LLVM intrinsic, incl. some unique name if there's no gcc builtin name - as a GitHub PR for that tool, so that all future .di files contain these extra intrinsics. To get there, I'd suggest adding some temp logging to inspect the records from the parsed .td file.
Mar 04 2021
On Friday, 5 March 2021 at 01:59:35 UTC, Bruce Carneal wrote:To be clear, I'm not asking for how a new builtin might be added cleanly, I'm asking if there is a better way to access all the builtins present in the current .di file. As noted above, I had to copy and edit the .di file to avoid this error from ldc: ...: Error: can only call functions from other compute modules in compute code Once I did that, everything I have tried from the edited .di file appears to work as intended. Amazing! I just thought that copying/editing could be avoided.The dcompute parts in the compiler could definitely be improved (but it's nowhere on my priorities list). What I tried to get at by linking my comment [1] is that you can generally target NVPTX/AMDGCN directly, without any compute stuff, and in that case importing and using intrinsics from that .di works without any issues. [1] https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886
Mar 05 2021
On Friday, 5 March 2021 at 11:28:31 UTC, kinke wrote:On Friday, 5 March 2021 at 01:59:35 UTC, Bruce Carneal wrote:[...]The dcompute parts in the compiler could definitely be improved (but it's nowhere on my priorities list). What I tried to get at by linking my comment [1] is that you can generally target NVPTX/AMDGCN directly, without any compute stuff, and in that case importing and using intrinsics from that .di works without any issues. [1] https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886Okay. I'll do the gccbuiltins_nvvm.di copy/edit for use in dcompute as part of my build script. A copy/edit at build time will give access to many of the nvvm intrinsics and Johan's irEx suggestion can be used for access to others. That leaves the intrinsics in the .td file that have llvm_i1_ty parameters and gcc builtin names (like vote/ballot). IIUC these would all be handled by the gen_gccbuiltins.cpp program if it knew how to deal with llvm_i1_ty parameters. Unfortunately, the "obvious" pragma(LDC_intrinsic, ...) forms that I've tried for llvm_i1_ty gccbuiltin intrinsics have not worked, at least not for the vote/ballot test cases. I'll keep investigating. Clang and nvcc are both based on LLVM and deal with these somehow so I'll start looking there. I expect that you're very busy but if you already know or suspect what the llvm_i1_ty problem/fix is, a brief sketch here would be much appreciated. Regardless, thanks again for your efforts. A working LDC/dcompute is *much* nicer than C++/CUDA.
Mar 05 2021
On Friday, 5 March 2021 at 13:30:52 UTC, Bruce Carneal wrote:Unfortunately, the "obvious" pragma(LDC_intrinsic, ...) forms that I've tried for llvm_i1_ty gccbuiltin intrinsics have not worked, at least not for the vote/ballot test cases. I'll keep investigating.For __nvvm_vote_ballot (no idea what it does), this seems to suffice (bool for i1): pragma(LDC_intrinsic, "llvm.nvvm.vote.ballot") int __nvvm_vote_ballot(bool); void foo(bool a) { __nvvm_vote_ballot(a); } https://run.dlang.io/is/z0k6l9
Mar 05 2021
On Friday, 5 March 2021 at 16:34:07 UTC, kinke wrote:https://run.dlang.io/is/z0k6l9Small correction (IR generation was fine, compilation wasn't): that intrinsic requires a newer target shader model (-mcpu=sm_70) [and -betterC to avoid ModuleInfo etc.]. Then the asm can be generated and inspected too.
Mar 05 2021
On Friday, 5 March 2021 at 16:41:39 UTC, kinke wrote:On Friday, 5 March 2021 at 16:34:07 UTC, kinke wrote:Yep. The above works for me as well. cuda_620 (sm_62 IIUC) is the highest currently allowed via the --mdcompute-targets interface. I'll work on expanding that. I'm not sure if the "i1" type can just plug in to gen_gccbuiltins.cpp as previously discussed but I'll look at that as well. Thanks.https://run.dlang.io/is/z0k6l9Small correction (IR generation was fine, compilation wasn't): that intrinsic requires a newer target shader model (-mcpu=sm_70) [and -betterC to avoid ModuleInfo etc.]. Then the asm can be generated and inspected too.
Mar 05 2021
On Friday, 5 March 2021 at 00:16:41 UTC, kinke wrote:On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote: ... These files are auto-generated from the LLVM .td file using a tiny program [1]. It's currently limited to intrinsics with a GCCBuiltinName, so if you'd like to make more of these available in D, see processRecord(). [1] https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cppGood to know. __nvvm_vote_ballot and __nvvm_vote_ballot_sync both show up in the .td file with GCCBuiltin names but do not show up in the .di file. I've found a workaround for those but I'd like to understand what kept them from being included in the .di file. I'm guessing it is the dtype(...) call from within processRecord(...) that did not know what to do with an i1 type? If it's as simple as that I'd greatly appreciate your augmenting the if/else ladder in dtype() to include the branch for i1. The ballot/vote instruction may be better than my workaround and has other uses. I tried to work around the .di file issue by declaring my own pragma: pragma(LDC_intrinsic, "llvm.nvvm.vote.ballot") int myBallot(bool); with no luck: LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.vote.ballot If you know how I can manually define the ballot pragmas correctly before any dtype(...) upgrade you might release, please let me know. Finally and again, many thanks for your efforts maintaining LDC.
Mar 04 2021
On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:After updating the first line to ' compute(CompileFor.hostAndDevice) module ...' and adding an 'import ldc.dcompute;' line, the runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC build apparently gives access to all manner of GPU intrinsics.Hi Bruce, Why not submit a PR that modifies `gen_gccbuiltins.cpp` such that it adds the ` compute` attribute for the relevant intrinsics files? I think it's OK if `gen_gccbuiltins` contains some hacks like that . Please add a small compile test case, so we verify that it won't bitrot in the future. Wouldn't ` compute(CompileFor.deviceOnly)` make more sense, because the intrinsics will not be available on normal CPUs anyway? I hope all your work will land in either LDC or dcompute's repositories, such that others can easily benefit from it. cheers, Johan
Apr 25 2021
On Sunday, 25 April 2021 at 22:26:06 UTC, Johan Engelen wrote:On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:Yes, I'll help when the current push is over here, but I think I dont understand enough quite yet. I'm still bumping in to limitations/awkwardness in dcompute that should admit simple solutions. At least it feels that way. One idea from my experience to date is that we can and probably should create a simpler (from a programmer perspective) and finer granularity way to handle multiple targets. Intrinsic selection is part of that as is library selection. Also on my mind is how we should handle deployment. For the ultimate in speed we can do AOT per-target specialized compiles and "fat" binaries but using SPIR-V + Vulkan compute could significantly improve penetration and reduce bloat. I read a relatively recent thread in an LLVM forum indicating that the Intel guys are pushing a "real" SPIR-V IR effort now so maybe we can help out there. Also, I dont know how MLIR should fit in to our plans. I'll be in touch when I get my head above water here. Thanks to you and the rest of the LDC crew for the help so far. Looking forward to advancing dlang on GPUs in the future. It really can be much much better than C++ in that arena. BruceAfter updating the first line to ' compute(CompileFor.hostAndDevice) module ...' and adding an 'import ldc.dcompute;' line, the runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC build apparently gives access to all manner of GPU intrinsics.Hi Bruce, Why not submit a PR that modifies `gen_gccbuiltins.cpp` such that it adds the ` compute` attribute for the relevant intrinsics files? I think it's OK if `gen_gccbuiltins` contains some hacks like that . Please add a small compile test case, so we verify that it won't bitrot in the future. Wouldn't ` compute(CompileFor.deviceOnly)` make more sense, because the intrinsics will not be available on normal CPUs anyway? I hope all your work will land in either LDC or dcompute's repositories, such that others can easily benefit from it. cheers, Johan
Apr 26 2021
On Monday, 26 April 2021 at 13:20:11 UTC, Bruce Carneal wrote:On Sunday, 25 April 2021 at 22:26:06 UTC, Johan Engelen wrote:Nice work, thanks for wanting to improve dcompute! I think D has real potential there[...]Yes, I'll help when the current push is over here, but I think I dont understand enough quite yet. I'm still bumping in to limitations/awkwardness in dcompute that should admit simple solutions. At least it feels that way. [...]
Apr 27 2021