digitalmars.D.ldc - ldc/dcompute nvptx intrinsics
- Bruce Carneal (17/17) Feb 19 2021 LLVM appears to "know" about most if not all of the CUDA
- Johan Engelen (11/16) Feb 20 2021 Hi Bruce,
- Bruce Carneal (6/24) Feb 20 2021 This looks like a very good way to enable intrinsics. I'll give
- Bruce Carneal (8/26) Feb 20 2021 Here's what I get so far when testing the above out against
- Johan (52/81) Feb 23 2021 Hi Bruce,
- Bruce Carneal (16/40) Feb 23 2021 Success!
- Bruce Carneal (10/46) Mar 04 2021 Turns out that almost all intrinsics are available after a tiny
- Imperatorn (3/21) Feb 23 2021 Wat, didn't even know about that
LLVM appears to "know" about most if not all of the CUDA intrinsics as can be seen here: https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXIntrinsics.td I've successfully mimicked dcompute's pragma(LDC_intrinsic "llvm.nvvm...") idiom for some things but have not succeeded in implementing "ballot" or other important CUDA intrinsics that are commented out in dcompute.std.warp.d. Perhaps I'm just not uttering the correct incantations... I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve. Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally? I'm considering trying to merge a hand written PTX file in at build time and falling back to C++/CUDA if all else fails. Anything that will help me avoid a trip to the dark side would be much appreciated.
Feb 19 2021
On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve. Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?Hi Bruce, You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics. Very quickly, something like this: ``` alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*); ``` -Johan
Feb 20 2021
On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen wrote:On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:This looks like a very good way to enable intrinsics. I'll give it a try and report back. If it's clean enough I'll also submit a dcompute PR. Thanks Johan!I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve. Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?Hi Bruce, You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics. Very quickly, something like this: ``` alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*); ``` -Johan
Feb 20 2021
On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen wrote:On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:Here's what I get so far when testing the above out against llvm.nvvm.barrier0.popc (something that works as an LDC_intrinsic): "can only call functions from other compute modules in compute code" I'll keep digging.I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve. Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?Hi Bruce, You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics. Very quickly, something like this: ``` alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*); ``` -Johan
Feb 20 2021
On Sunday, 21 February 2021 at 01:18:10 UTC, Bruce Carneal wrote:On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen wrote:Hi Bruce, I played around a bit and have a full working example for you: ``` compute(CompileFor.deviceOnly) module dcompute; import ldc.dcompute; // Copied from ldc.llvmasm, because indeed it needs to be defined in compute code. pragma(LDC_inline_ir) R __irEx(string prefix, string code, string suffix, R, P...)(P) trusted nothrow nogc; alias someFunction = __irEx!("declare i32 llvm.nvvm.barrier0.popc(i32)", "%i = call i32 llvm.nvvm.barrier0.popc(i32 %0)\nret i32 %i", "", int, int); void foo(GlobalPointer!float x_in) { auto i = someFunction(2); } ``` Compile with `bin/ldc2 -c -mdcompute-targets=cuda-350 -mdcompute-file-prefix=dcompute_kernel dcompute.d -O3` gives PTX output: ``` ❯ cat dcompute_kernel_cuda350_64.ptx // // Generated by LLVM NVPTX Back-End // .version 3.2 .target sm_35 .address_size 64 // .globl _D8dcompute3fooFS3ldcQt__T7PointerVEQtQBk9AddrSpacei1TfZQBeZv .visible .func _D8dcompute3fooFS3ldcQt__T7PointerVEQtQBk9AddrSpacei1TfZQBeZv( .param .b64 _D8dcompute3fooFS3ldcQt__T7PointerVEQtQBk9AddrSpacei1TfZQBeZv_param_0 ) { .reg .b32 %r<3>; mov.u32 %r1, 2; { .reg .pred %p1; setp.ne.u32 %p1, %r1, 0; bar.red.popc.u32 %r2, 0, %p1; } ret; } ``` If this indeed will fit your usecase, you have a good argument for including `__irEx` into ldc.dcompute. Please file bugs/features on github! cheers, JohanOn Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:Here's what I get so far when testing the above out against llvm.nvvm.barrier0.popc (something that works as an LDC_intrinsic): "can only call functions from other compute modules in compute code" I'll keep digging.I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve. Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?Hi Bruce, You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics. Very quickly, something like this: ``` alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*); ``` -Johan
Feb 23 2021
On Tuesday, 23 February 2021 at 18:04:52 UTC, Johan wrote:On Sunday, 21 February 2021 at 01:18:10 UTC, Bruce Carneal wrote:On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen wrote:On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve. Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?Hi Bruce, I played around a bit and have a full working example for you: ``` compute(CompileFor.deviceOnly) module dcompute; import ldc.dcompute; [... working example ...] If this indeed will fit your usecase, you have a good argument for including `__irEx` into ldc.dcompute. Please file bugs/features on github! cheers, JohanSuccess! After verifying that your latest example worked I plugged in llvm.nvvm.clz.i from the .td file. That generated the hoped for single instruction function body: clz.b32 %r2, %r1 This clz intrinsic alone saves me a couple dozen instructions in a hot section of code where I "call" clz twice. I will expand the set of intrinsics enabled via the __irEx method over the next few days and then try to contact Nicholas W. and/or John C. via email or beerconf to get their take on the capabilities (they may suggest a more easily supported way to go about things, or have cuda naming suggestions, or want to rationalize these with OCL, or ...). Assuming that goes well I'll file with ldc and dcompute. Thank you Johan.
Feb 23 2021
On Tuesday, 23 February 2021 at 23:36:53 UTC, Bruce Carneal wrote:On Tuesday, 23 February 2021 at 18:04:52 UTC, Johan wrote:[...]On Sunday, 21 February 2021 at 01:18:10 UTC, Bruce Carneal wrote:Turns out that almost all intrinsics are available after a tiny bit of .di file massaging (see the "ldc nvvm GPU intrinsics good news" thread). The irEx capability Johan pointed out seems to cover the rest. The irEx accessible set is defined at the top of the intrinsics .td file and includes the popc intrinsic, the clz initrinsic, and about a dozen others, most very useful. The .di file instrinsics number over 500, some of which appear useful.Hi Bruce, I played around a bit and have a full working example for you: ``` compute(CompileFor.deviceOnly) module dcompute; import ldc.dcompute; [... working example ...] If this indeed will fit your usecase, you have a good argument for including `__irEx` into ldc.dcompute. Please file bugs/features on github! cheers, JohanSuccess! After verifying that your latest example worked I plugged in llvm.nvvm.clz.i from the .td file. That generated the hoped for single instruction function body: clz.b32 %r2, %r1 This clz intrinsic alone saves me a couple dozen instructions in a hot section of code where I "call" clz twice. I will expand the set of intrinsics enabled via the __irEx method over the next few days and then try to contact Nicholas W. and/or John C. via email or beerconf to get their take on the capabilities (they may suggest a more easily supported way to go about things, or have cuda naming suggestions, or want to rationalize these with OCL, or ...). Assuming that goes well I'll file with ldc and dcompute. Thank you Johan.
Mar 04 2021
On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen wrote:On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:Wat, didn't even know about thatI'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve. Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?Hi Bruce, You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics. Very quickly, something like this: ``` alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*); ``` -Johan
Feb 23 2021