digitalmars.D - D and GPGPU
- Russel Winder via Digitalmars-d (32/32) Feb 18 2015 It strikes me that D really ought to be able to work with GPGPU –...
- John Colvin (4/38) Feb 18 2015 It would be great if LDC could do this using
- Laeeth Isharc (6/40) Feb 18 2015 I agree it would be very helpful.
- Laeeth Isharc (5/5) Feb 18 2015 One interesting C++ use of CUDA in finance: Joshi is porting
- ponce (18/63) Feb 18 2015 What it does is provide access to the most useful parts of the
- Jacob Carlborg (13/17) Feb 20 2015 For OS X:
- ponce (22/43) Feb 18 2015 I'd like to add something about the kernel languages (having done
- luminousone (26/60) Feb 18 2015 https://github.com/HSAFoundation
- Paulo Pinto (7/72) Feb 18 2015 Java will support HSA as of Java 9 - 10, depending on project's
- francesco.cattoglio (4/7) Feb 20 2015 Unless I'm mistaken, it will be more like the opposite: HSA will use
- luminousone (18/26) Feb 20 2015 HSAIL does not depend on OpenCL, and it supports more than
It strikes me that D really ought to be able to work with GPGPU – is there already something and I just failed to notice? This is data parallelism, but of a slightly different sort to that in std.parallelism. std.concurrency, std.parallelism and std.gpgpu ought to be harmonious though.

The issue is to create a GPGPU kernel (usually C code with bizarre data structures and calling conventions), set it running, and then pipe data in and collect data out – currently very slow, but the next generation of Intel chips will fix this (*).

And then there is the OpenCL/CUDA debate. Personally I favour OpenCL, for all its deficiencies, as it is vendor neutral. CUDA binds you to NVIDIA. Anyway, there is an NVIDIA back end for OpenCL.

With a system like PyOpenCL, the infrastructure, data and process handling is abstracted, but you still have to write the kernels in C. They really ought to do a Python DSL for that, but… So with D can we write D kernels and have them compiled and loaded using a combination of CTFE, D → C translation, C compiler call, and other magic?

Is this a GSoC 2015 type thing?

(*) It will be interesting to see how NVIDIA responds to the tack Intel are taking on GPGPU and main memory access.

--
Russel.
===========================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
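A minimal sketch of the round trip described above – compile a C kernel at runtime, pipe data in, run, collect data out – as it might look from D through OpenCL bindings such as DerelictCL (the module path and loader name here are assumptions; the cl* calls are the standard OpenCL 1.x entry points, and the kernel itself is illustrative):

import derelict.opencl.cl;  // assumed module path for the DerelictCL binding

// The kernel is still plain OpenCL C, held in a D string.
enum kernelSource = `
    __kernel void scale(__global float* data, const float factor)
    {
        size_t i = get_global_id(0);
        data[i] = data[i] * factor;
    }`;

void main()
{
    DerelictCL.load();  // loader name assumed from the Derelict convention

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, null);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, null);

    cl_int err;
    auto context = clCreateContext(null, 1, &device, null, null, &err);
    auto queue = clCreateCommandQueue(context, device, 0, &err);

    // Runtime compilation of the C kernel source.
    const(char)* src = kernelSource.ptr;
    size_t len = kernelSource.length;
    auto program = clCreateProgramWithSource(context, 1, &src, &len, &err);
    clBuildProgram(program, 1, &device, null, null, null);
    auto kernel = clCreateKernel(program, "scale", &err);

    // Pipe data in, launch, collect data out.
    float[1024] host = 1.0f;
    auto buf = clCreateBuffer(context,
                              CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                              host.sizeof, host.ptr, &err);
    float factor = 2.0f;
    clSetKernelArg(kernel, 0, buf.sizeof, &buf);
    clSetKernelArg(kernel, 1, factor.sizeof, &factor);

    size_t globalSize = host.length;
    clEnqueueNDRangeKernel(queue, kernel, 1, null, &globalSize, null,
                           0, null, null);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, host.sizeof, host.ptr,
                        0, null, null);
}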
Feb 18 2015
On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:
> So with D can we write D kernels and have them compiled and loaded
> using a combination of CTFE, D → C translation, C compiler call, and
> other magic?

It would be great if LDC could do this using
https://www.khronos.org/spir
Feb 18 2015
On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:
> It strikes me that D really ought to be able to work with GPGPU – is
> there already something and I just failed to notice?

I agree it would be very helpful.

I have this on my to-look-at list, and don't yet know exactly what it does and doesn't do:

http://code.dlang.org/packages/derelict-cuda
Feb 18 2015
One interesting C++ use of CUDA in finance: Joshi is porting QuantLib, or at least part of it, to a CUDA environment. Some nice speed-ups for Bermudan pricing.

http://sourceforge.net/projects/kooderive/
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1473563
Feb 18 2015
On Wednesday, 18 February 2015 at 16:03:20 UTC, Laeeth Isharc wrote:
> I have this on my to-look-at list, and don't yet know exactly what it
> does and doesn't do:
>
> http://code.dlang.org/packages/derelict-cuda

What it does is provide access to the most useful parts of the CUDA API, which is two-headed:

- the Driver API provides the most control over the GPU, and I would recommend this one. If you are in CUDA you probably want top efficiency and control.
- the Runtime API abstracts over multi-GPU and is the basis for the high-level libraries NVIDIA churns out in trendy domains.

(Request to Linux/Mac readers: still searching for the correct library names for Linux :) )

When using DerelictCUDA, you still need nvcc to compile your .cu files and then load them (see the sketch below). This is "less easy" than using the NVIDIA SDK, which will eventually allow combining GPU and CPU code in the same source file.

Apart from that, this is 2015 and I see little reason to start new projects in CUDA with the advent of OpenCL 2.0 drivers.
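A minimal sketch of that Driver API flow from D, assuming DerelictCUDA mirrors the C driver API (the module path, loader name and kernel name are assumptions; the cu* calls are the standard CUDA Driver API, and kernel.ptx is presumed to have been produced beforehand with nvcc --ptx kernel.cu):

import derelict.cuda.driverapi;  // assumed module path

void main()
{
    DerelictCUDADriver.load();  // loader name assumed from the Derelict convention

    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Load the nvcc-compiled module and look up a kernel by name
    // ("myKernel" is purely illustrative).
    CUmodule mod;
    cuModuleLoad(&mod, "kernel.ptx");
    CUfunction fun;
    cuModuleGetFunction(&fun, mod, "myKernel");

    // Device buffer, copy in, launch, copy out.
    CUdeviceptr dptr;
    cuMemAlloc(&dptr, 1024 * float.sizeof);
    float[1024] host = 0.0f;
    cuMemcpyHtoD(dptr, host.ptr, host.sizeof);

    void*[1] args = [cast(void*)&dptr];
    cuLaunchKernel(fun, 4, 1, 1,    // grid
                   256, 1, 1,       // block
                   0, null, args.ptr, null);

    cuMemcpyDtoH(host.ptr, dptr, host.sizeof);
    cuCtxDestroy(ctx);
}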
Feb 18 2015
On 2015-02-18 18:56, ponce wrote:
> the Runtime API abstracts over multi-GPU and is the basis for the
> high-level libraries NVIDIA churns out in trendy domains. (Request to
> Linux/Mac readers: still searching for the correct library names for
> Linux :) )

For OS X:

CUDA Driver: this will install /Library/Frameworks/CUDA.framework and the UNIX-compatibility stub /usr/local/cuda/lib/libcuda.dylib that refers to it.

I would recommend the framework. Make sure the correct path is added; take a look at SDL for an example [1]. You need something like "../Frameworks/CUDA.framework/CUDA" to make sure it's possible to bundle the CUDA framework in an application bundle.

[1] https://github.com/DerelictOrg/DerelictSDL2/blob/master/source/derelict/sdl2/sdl.d#L42

--
/Jacob Carlborg
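Concretely, that path could be handed to the loader (a sketch; it assumes DerelictCUDA follows the usual DerelictUtil convention of a load(string) overload taking explicit library names):

import derelict.cuda.driverapi;  // assumed module path

void loadCuda()
{
    version (OSX)
        DerelictCUDADriver.load("../Frameworks/CUDA.framework/CUDA");
    else
        DerelictCUDADriver.load();
}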
Feb 20 2015
On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:
> With a system like PyOpenCL, the infrastructure, data and process
> handling is abstracted, but you still have to write the kernels in C.

I'd like to add something about the kernel languages (having done both OpenCL and CUDA).

A big speed-up factor is the multiple levels of parallelism exposed in OpenCL C and CUDA C:
- context parallelism (e.g. several GPUs)
- command parallelism (based on a future model)
- block parallelism
- warp/sub-block parallelism
- in each sub-block, N threads (typically 32 or 64)

All of that is supported by appropriate barrier semantics. Typical C-like code only has threads as parallelism and a less complex cache. Also, most algorithms don't translate all that well to SIMD threads working in lockstep.

Example: instead of looping on a 2D image and performing a horizontal blur on 15 pixels, perform this operation on 32x16 blocks simultaneously, while caching stuff in block-local memory (see the sketch below). It is much like an auto-vectorization problem, and auto-vectorization is hard.
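A sketch of that blur in OpenCL C (held in a D string for runtime compilation, as above; the 32x16 work-group shape and the kernel name are illustrative, and image-edge handling is only roughly clamped):

enum blurKernel = `
    // Assumes a 32x16 work-group: 32 pixels wide, 16 rows high.
    __kernel void hblur(__global const float* src,
                        __global float* dst,
                        const int width)
    {
        // 32 pixels plus a 7-pixel halo on each side, staged in
        // fast block-local memory.
        __local float tile[16][32 + 14];

        int lx = get_local_id(0), ly = get_local_id(1);
        int gx = get_global_id(0), gy = get_global_id(1);

        // Cooperatively load the tile and its horizontal halo.
        tile[ly][lx + 7] = src[gy * width + gx];
        if (lx < 7)
        {
            tile[ly][lx] = src[gy * width + max(gx - 7, 0)];
            tile[ly][lx + 32 + 7] = src[gy * width + min(gx + 32, width - 1)];
        }
        barrier(CLK_LOCAL_MEM_FENCE);  // sub-block synchronisation

        // 15-tap horizontal box blur read from the cached tile.
        float acc = 0.0f;
        for (int k = -7; k <= 7; ++k)
            acc += tile[ly][lx + 7 + k];
        dst[gy * width + gx] = acc / 15.0f;
    }`;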
Feb 18 2015
On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:
> And then there is the OpenCL/CUDA debate. Personally I favour OpenCL,
> for all its deficiencies, as it is vendor neutral. CUDA binds you to
> NVIDIA.

https://github.com/HSAFoundation

This is really the way to go. Yes, OpenCL and CUDA exist, along with OpenGL/DirectX compute shaders, but pretty much everything out there suffers from giant limitations.

With HSA, HSAIL bytecode is embedded directly into the elf/exe file. HSAIL bytecode can fully support all the features of C++: virtual function lookups in code, access to the stack, cache-coherent memory access, the same virtual memory view as the application it runs in, etc.

HSA is implemented in the LLVM backend compiler, and when it is used in an elf/exe file, there is an LLVM-based finalizer that generates GPU bytecode. More importantly, it should be very easy to implement in any LLVM-supported language once all of the patches are moved upstream to their respective libraries/toolsets.

I believe that Linux kernel 3.19 and above have the IOMMU 2.5 patches, and I think AMD's Radeon KFD driver made it into 3.20. HSA will also be supported by ARM.

HSA is generic enough that, assuming Intel implements similar capabilities into their chips, it ought to be supportable there with or without Intel's direct blessing.

HSA does work with discrete GPUs and not just the embedded stuff, and I believe that HSA can be used to accelerate OpenCL 2.0, via copyless cache-coherent memory access.
Feb 18 2015
On Wednesday, 18 February 2015 at 18:14:19 UTC, luminousone wrote:
> https://github.com/HSAFoundation
>
> This is really the way to go. [...]

Java will support HSA as of Java 9 - 10, depending on the project's progress.

http://openjdk.java.net/projects/sumatra/

https://wiki.openjdk.java.net/display/Sumatra/Main

--
Paulo
Feb 18 2015
On Wednesday, 18 February 2015 at 18:14:19 UTC, luminousone wrote:
> HSA does work with discrete GPUs and not just the embedded stuff, and
> I believe that HSA can be used to accelerate OpenCL 2.0, via copyless
> cache-coherent memory access.

Unless I'm mistaken, it will be more like the opposite: HSA will use OpenCL 2.0 as a backend to do that kind of "copyless" GPGPU acceleration.
Feb 20 2015
On Friday, 20 February 2015 at 10:05:34 UTC, francesco.cattoglio wrote:
> Unless I'm mistaken, it will be more like the opposite: HSA will use
> OpenCL 2.0 as a backend to do that kind of "copyless" GPGPU
> acceleration.

HSAIL does not depend on OpenCL, and it supports more than copyless GPGPU acceleration; as I said, it has access to virtual memory, including the program stack.

HSA defines changes to the MMU, IOMMU, CPU cache coherency protocol, a new bytecode (HSAIL), a software stack built around LLVM, and its own backend in the GPU device driver.

OpenCL 2.0 generally obtains its copyless acceleration from remapping the GPU memory into system memory, not from direct access to virtual memory; Intel supports a form of copyless acceleration via this remapping system (see the sketch below).

The major difference between the two systems is that HSA can access any arbitrary location in memory, whereas OpenCL must still rely on the pointers being mapped for use. HSA has, for example, complete access to runtime type reflection and vtable pointers; you could have a linked list or a tree that is allocated arbitrarily in memory.
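The OpenCL 2.0 side of that contrast is shared virtual memory, which still goes through explicit allocation and mapping calls. A minimal sketch from D, assuming the binding exposes the standard CL 2.0 entry points (clSVMAlloc, clEnqueueSVMMap/Unmap and clSetKernelArgSVMPointer are real CL 2.0 functions; the binding module is assumed as before):

import derelict.opencl.cl;  // assumed module path

void svmExample(cl_context ctx, cl_command_queue q, cl_kernel kernel)
{
    // Shared virtual memory buffer visible to host and device alike.
    float* svmBuf = cast(float*) clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                            1024 * float.sizeof, 0);

    // Coarse-grained SVM: host access must be bracketed by map/unmap.
    clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, svmBuf,
                    1024 * float.sizeof, 0, null, null);
    foreach (i; 0 .. 1024)
        svmBuf[i] = i;
    clEnqueueSVMUnmap(q, svmBuf, 0, null, null);

    // The kernel receives the raw pointer rather than a cl_mem handle,
    // but only because the region came from clSVMAlloc - exactly the
    // "pointers must be mapped for use" restriction described above.
    clSetKernelArgSVMPointer(kernel, 0, svmBuf);
}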
Feb 20 2015