digitalmars.D - Any usable SIMD implementation?
- Martin Nowak (11/11) Mar 31 2016 I'm currently working on a templated arrayop implementation (using RPN
- ZombineDev (16/29) Mar 31 2016 I don't know how far has Ilya's work [1] advanced, but you may
- Martin Nowak (31/32) Apr 01 2016 Well apparently stores w/ dmd's weird core.simd interface don't work, or
- Iain Buclaw via Digitalmars-d (8/40) Apr 01 2016 I would just let the compiler optimize / vectorize the operation, but th...
- Martin Nowak (6/11) Apr 02 2016 It's intended to replace the array ops in druntime, relying on
- Iain Buclaw via Digitalmars-d (7/14) Apr 02 2016 then again that it is probably just me who thinks these things.
- Martin Nowak (6/9) Apr 02 2016 I'm already using vector types for most operations, so it's somewhat
- Johan Engelen (3/6) Apr 03 2016 Please submit a GH issue with LDC, thanks!
- Martin Nowak (2/5) Apr 03 2016 https://github.com/D-Programming-Language/dmd/pull/5625
- John Colvin (2/15) Mar 31 2016 Am I being stupid or is core.simd what you want?
- Johan Engelen (9/12) Mar 31 2016 I think you want to write your code using SIMD primitives.
- Iakh (7/20) Mar 31 2016 Unfortunately my one(https://github.com/Iakh/simd) is far from
- 9il (8/21) Apr 02 2016 Hello Martin,
- Iain Buclaw via Digitalmars-d (9/26) Apr 02 2016 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/...
- 9il (7/22) Apr 04 2016 Target cpu configuration:
- Marco Leise (22/29) Apr 04 2016 - On amd64, whether floating-point math is handled by the FPU
- 9il (5/25) Apr 04 2016 @attribute("target", "+sse4")) would not work well for BLAS. BLAS
- Marco Leise (8/12) Apr 11 2016 It's just for the case where you want a generic executable
- Walter Bright (2/5) Apr 04 2016 http://dlang.org/phobos/core_cpuid.html
- Marco Leise (56/62) Apr 11 2016 That's what I implied in "what we have now":
- Walter Bright (17/78) Apr 11 2016 There's no reason core.cpuid, which has a platform-independent API, cann...
- Marco Leise (54/101) Apr 12 2016 LDC implements InlineAsm_X86_Any (DMD style asm), so
- Walter Bright (11/43) Apr 12 2016 Years? Anyone who needs core.cpuid could translate it to GDC's inline as...
- Marco Leise (94/122) Apr 12 2016 You mean it is ok, if I duplicated most of the asm in there
- Walter Bright (29/54) Apr 12 2016 It's Boost licensed, and Boost licensed code can be shipped with GPL'd c...
- Iain Buclaw via Digitalmars-d (14/27) Apr 13 2016 Infact the "correct" version of "mul eax" is.
- Marco Leise (64/87) Apr 16 2016 Tell me again, what's more elgant !
- Walter Bright (3/4) Apr 16 2016 If I wanted to write in assembler, I wouldn't write in a high level lang...
- Marco Leise (55/60) Apr 17 2016 I hate the many pitfalls of extended asm: Forget to mention a
- Iain Buclaw via Digitalmars-d (33/54) Apr 12 2016 asm { "mul eax"; } - That wasn't so difficult. :-)
- Walter Bright (26/62) Apr 12 2016 My understanding is that is not sufficient if you want gcc to track regi...
- Iain Buclaw via Digitalmars-d (9/44) Apr 13 2016 My only point was that in GDC, the translation of opcodes to machine
- Marco Leise (11/17) Apr 13 2016 Am Wed, 13 Apr 2016 09:51:25 +0200
- Iain Buclaw via Digitalmars-d (7/24) Apr 13 2016 Yes, cpu_supports is a good way to do it as we only need to invoke
- Marco Leise (23/26) Apr 13 2016 Am Wed, 13 Apr 2016 11:21:35 +0200
- Walter Bright (3/12) Apr 13 2016 Please do not invent an alternative interface, use the one in core.cpuid...
- Marco Leise (9/26) Apr 13 2016 Yes, they are all @property and a substitution with direct
- Walter Bright (4/9) Apr 13 2016 It doesn't need to be efficient, because such checks should be done at a...
- Iain Buclaw via Digitalmars-d (3/19) Apr 14 2016 An alternative interface needs to be invented anyway for other CPUs.
- Walter Bright (2/3) Apr 14 2016 That would be fine. But there is no reason to redo core.cpuid for x86 ma...
- Walter Bright (6/12) Apr 04 2016 ??
- 9il (17/30) Apr 04 2016 How many general purpose registers, SIMD Floating Point
- jmh530 (3/5) Apr 04 2016 Are you familiar with this project at all?
- 9il (7/13) Apr 04 2016 Thank for the link. BLIS has the same issue like OpenBLAS - a
- Walter Bright (17/27) Apr 04 2016 Since the compiler never generates AVX or AVX2 instructions, there is no...
- 9il (24/60) Apr 04 2016 It is impossible to deduct from that combination that Xeon Phi
- Walter Bright (9/22) Apr 05 2016 Since dmd doesn't generate specific code for a Xeon Phi, having a compil...
- John Colvin (12/45) Apr 05 2016 The particular design and limitations of the dmd backend
- Walter Bright (15/21) Apr 05 2016 There's a line between trying to standardize everything and letting add-...
- 9il (5/22) Apr 05 2016 Yes, but this is bad idea to have a set of versions for Phobos,
- Walter Bright (4/5) Apr 05 2016 Because it would affect all the code in the module and every template it...
- 9il (6/12) Apr 05 2016 99.99% of them do not need to compile code with different
- Joe Duarte (38/44) Apr 17 2016 There are many organizations in the world that are building
- Temtaime (9/20) Apr 17 2016 In addition it's COMPILER work, not programmer!
- Johan Engelen (9/21) Apr 23 2016 Thanks, I've seen similar comments in LLVM code.
- Marco Leise (7/14) Apr 23 2016 Please do test it. Activating sse3 and disabling sse2
- Joe Duarte (12/34) May 02 2016 If you specify SSE3, you should definitely get SSE2 and plain old
- 9il (9/13) Apr 05 2016 I can do it, however I would like to get this information from
- Walter Bright (2/10) Apr 05 2016 Where does the compiler get the information that it should compile for, ...
- 9il (7/16) Apr 05 2016 No idea about AFX. Do you choose AFX to disallow me to find an
- Walter Bright (4/14) Apr 05 2016 I want to make it clear that dmd does not generate AFX specific code, ha...
- Johan Engelen (6/10) Apr 05 2016 How about adding a "__target(...)" compile-time function, that
- 9il (9/20) Apr 05 2016 Yes, something like that is what I am looking for.
- Manu via Digitalmars-d (7/18) Apr 06 2016 With respect to SIMD, knowing a processor model like 'broadwell' is
- 9il (3/12) Apr 06 2016 Yes, however this can be implemented in a spcial Phobos module.
- Johan Engelen (10/23) Apr 06 2016 After browsing through some LLVM code, I think is actually very
- 9il (2/8) Apr 06 2016 Ahaha)) --Ilya
- Manu via Digitalmars-d (14/29) Apr 06 2016 Sure, but it's an ongoing maintenance task, constantly requiring
- Walter Bright (4/16) Apr 06 2016 You're not making a good case for a standard language defined set of def...
- Marco Leise (40/47) Apr 11 2016 We can either define the language in terms of CPU models or
- Johannes Pfau (35/68) Apr 07 2016 GCC already keeps a cpu <=> feature mapping (after all it needs to know
- 9il (9/29) Apr 05 2016 Please think that D has other compilers, not only DMD. We need a
- jmh530 (3/6) Apr 06 2016 Especially since everyone says to use LDC for the fastest code
- Manu via Digitalmars-d (15/33) Apr 06 2016 I would add that GDC and LDC have such compiler flags and it's
- Walter Bright (19/21) Apr 06 2016 It's a reasonable suggestion; some points:
- Manu via Digitalmars-d (43/68) Apr 06 2016 It's sufficiently blocking that I have not felt like working any
- Walter Bright (36/77) Apr 06 2016 I can understand that it might be demotivating for you, but that is not ...
- 9il (16/39) Apr 07 2016 ldc -mcpu=native
- Walter Bright (8/15) Apr 07 2016 Yes, and nobody cares. With virtual memory and demand loading, unexecute...
- 9il (7/29) Apr 07 2016 This is not true for BLAS based on D. You don't want to see the
- jmh530 (3/4) Apr 07 2016 Perhaps if you provide him a simplified example he might see what
- 9il (4/9) Apr 07 2016 He know what I am talking about. This is about
- Johannes Pfau (11/21) Apr 07 2016 Actually for GDC/GCC you can't even write functions using certain SIMD
- Johannes Pfau (15/36) Apr 07 2016 The problem is that march=x can set more than one
- Walter Bright (4/17) Apr 07 2016 Having a veritable blizzard of these predefined versions, that constantl...
- Kai Nacke (8/18) Apr 07 2016 glibc has a special mechanism for resolving the called function
- Johannes Pfau (7/31) Apr 07 2016 Available in GCC as the 'ifunc' attribute:
- Johan Engelen (13/29) Apr 07 2016 I thought that the ifunc mechanism means an indirect call (i.e. a
- Johannes Pfau (25/61) Apr 07 2016 The simple variant I've posted needs an additional branch on every
- Johan Engelen (10/55) Apr 07 2016 Yep exactly.
- Walter Bright (5/23) Apr 07 2016 We already have core.cupid, which covers most of what that article talks...
- Manu via Digitalmars-d (38/122) Apr 07 2016 Sure. I've done this in my own tests. I just never published that
- Walter Bright (9/19) Apr 07 2016 We recognize C++ interoperability to be a key feature of D. I hope you l...
- Johannes Pfau (9/29) Apr 07 2016 That's my #1 argument why '-version' is dangerous and 'static if' is
- xenon325 (25/42) Apr 12 2016 Have you seen how GCC's function multiversioning [1] ?
- Marco Leise (19/43) Apr 12 2016 Awesome! I just tried it and it ties runtime and compile-time
- Marco Leise (9/9) Apr 12 2016 The system seems to call CPUID at startup and for every
- jmh530 (14/15) Apr 15 2016 I've been thinking about the gcc multiversioning since you
- Marco Leise (18/33) Apr 16 2016 GCC only has one architecture as a target at a time. As long
- Johan Engelen (8/10) Apr 05 2016 Last time I looked into this (related to implementing @target,
- jmh530 (9/11) Apr 04 2016 I'm not a SIMD expert, I've only played around with SIMD a
- Walter Bright (3/9) Apr 04 2016 Use a runtime switch (see core.cpuid).
- Marco Leise (12/27) Apr 11 2016 I wonder if answers like this are meant to be filled into a
- Manu via Digitalmars-d (5/28) Apr 03 2016 My SIMD implementation has been blocked on that for years too.
- Walter Bright (7/10) Apr 03 2016 Here is a list of all the open Bugzilla issues tagged with the keyword S...
- Jack Stouffer (3/13) Apr 03 2016 He's talked about it on his github PR:
- Walter Bright (4/10) Apr 03 2016 Yes, but I never noticed that until you posted a link. The place to file...
- Jack Stouffer (3/7) Apr 04 2016 I made a bug to track this problem:
- jmh530 (3/5) Apr 04 2016 You might add link to this thread and github where he made the
- Walter Bright (2/7) Apr 04 2016 http://www.digitalmars.com/d/archives/digitalmars/D/Any_usable_SIMD_impl...
- Walter Bright (2/7) Apr 04 2016 I believe the issue is fixed (for DMD) with a documentation improvement.
- ZombineDev (5/16) Apr 04 2016 I believe the problem is that you can't rely on D_SIMD that SSE4,
- Walter Bright (7/11) Apr 04 2016 Right, you can't. But the issue here is having the compiler give a prede...
- Johan Engelen (2/15) Apr 15 2016 https://github.com/ldc-developers/ldc/pull/1434
- Marco Leise (7/16) Apr 04 2016 +1000!
- Etienne (4/17) Apr 12 2016 Not sure if it's been mentioned, but I've made a best effort to
- Ilya Yaroshenko (10/23) Aug 22 2016 ndslice.algorithm [1], [2] compiled with recent LDC beta will do
I'm currently working on a templated arrayop implementation (using RPN to encode ASTs). So far things worked out great, but now I got stuck b/c apparently none of the D compilers has a working SIMD implementation (maybe GDC has but it's very difficult to work w/ the 2.066 frontend). https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d I don't want to do anything fancy, just unaligned loads, stores, and integral mul/div. Is this really the current state of SIMD or am I missing sth.? -Martin
Mar 31 2016
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:I'm currently working on a templated arrayop implementation (using RPN to encode ASTs). So far things worked out great, but now I got stuck b/c apparently none of the D compilers has a working SIMD implementation (maybe GDC has but it's very difficult to work w/ the 2.066 frontend). https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d I don't want to do anything fancy, just unaligned loads, stores, and integral mul/div. Is this really the current state of SIMD or am I missing sth.? -MartinI don't know how far Ilya's work [1] has advanced, but you may want to join efforts with him. There are also two std.simd packages [2] [3]. BTW, I looked at your code a couple of days ago and I thought that it is a really interesting approach to encoding operations like that. I'm just wondering if pursuing this approach is a good idea in the long run, i.e. is it expressive enough to cover the use cases of HPC, which would also need something similar, but for custom linear algebra types. Here's an interesting video about approaches to solving this problem in C++: https://www.youtube.com/watch?v=hfn0BVOegac [1]: http://forum.dlang.org/post/nilhvnqbsgqhxdshpqfl forum.dlang.org [2]: https://github.com/D-Programming-Language/phobos/pull/2862 [3]: https://github.com/Iakh/simd
Mar 31 2016
On 03/31/2016 10:55 AM, ZombineDev wrote:[2]: https://github.com/D-Programming-Language/phobos/pull/2862Well apparently stores w/ dmd's weird core.simd interface don't work, or I can't figure out (from the non-existent documentation) how to use it. --- import core.simd; void test(float4* ptr, float4 val) { __simd_sto(XMM.STOUPS, *ptr, val); __simd(XMM.STOUPS, *ptr, val); auto val1 = __simd_sto(XMM.STOUPS, *ptr, val); auto val2 = __simd(XMM.STOUPS, *ptr, val); } --- LDC at least has some intrinsics once you find ldc.gccbuiltins_x86, but for some reason comes with its own broken ldc.simd.loadUnaligned instead of providing intrinsics. --- import core.simd, ldc.simd; float4 test(float* ptr) { return loadUnaligned!float4(ptr); } --- /home/dawg/dlang/ldc-0.17.1/bin/../import/ldc/simd.di(212): Error: can't parse inline LLVM IR: %r = load <4 x float>* %p, align 1 ^ expected comma after load's type So are 3 different untested and unused APIs really the current state of SIMD? -Martin
Apr 01 2016
On 2 Apr 2016 12:40 am, "Martin Nowak via Digitalmars-d" < digitalmars-d puremagic.com> wrote:On 03/31/2016 10:55 AM, ZombineDev wrote:I would just let the compiler optimize / vectorize the operation, but then again that it is probably just me who thinks these things. http://goo.gl/XdiKZX I'm not aware of any intrinsic to load unaligned data. Only to assume alignment. Iain.[2]: https://github.com/D-Programming-Language/phobos/pull/2862Well apparently stores w/ dmd's weird core.simd interface don't work, or I can't figure out (from the non-existent documentation) how to use it. --- import core.simd; void test(float4* ptr, float4 val) { __simd_sto(XMM.STOUPS, *ptr, val); __simd(XMM.STOUPS, *ptr, val); auto val1 = __simd_sto(XMM.STOUPS, *ptr, val); auto val2 = __simd(XMM.STOUPS, *ptr, val); } --- LDC at least has some intrinsics once you find ldc.gccbuiltins_x86, but for some reason comes with it's own broken ldc.simd.loadUnaligned instead of providing intrinsics. --- import core.simd, ldc.simd; float4 test(float* ptr) { return loadUnaligned!float4(ptr); } --- /home/dawg/dlang/ldc-0.17.1/bin/../import/ldc/simd.di(212): Error: can't parse inline LLVM IR: %r = load <4 x float>* %p, align 1 ^ expected comma after load's type So are 3 different untested and unused APIs really the current state of SIMD? -Martin
Apr 01 2016
On Saturday, 2 April 2016 at 06:13:24 UTC, Iain Buclaw wrote:I would just let the compiler optimize / vectorize the operation, but then again that it is probably just me who thinks these things.It's intended to replace the array ops in druntime, relying on vectorizers won't suffice, e.g. your example already stops working when I pass dynamic instead of static arrays.I'm not aware of any intrinsic to load unaligned data. Only to assume alignment.__builtin_ia32_loadups __builtin_ia32_storeups
Apr 02 2016
On 2 Apr 2016 9:45 am, "Martin Nowak via Digitalmars-d" < digitalmars-d puremagic.com> wrote:On Saturday, 2 April 2016 at 06:13:24 UTC, Iain Buclaw wrote:I would just let the compiler optimize / vectorize the operation, but then again that it is probably just me who thinks these things.It's intended to replace the array ops in druntime, relying on vectorizers won't suffice, e.g. your example already stops working when I pass dynamic instead of static arrays.I'm not aware of any intrinsic to load unaligned data. Only to assume alignment.__builtin_ia32_loadups __builtin_ia32_storeupsAny agnostic way to... :-)
Apr 02 2016
On 04/02/2016 10:19 AM, Iain Buclaw via Digitalmars-d wrote:I'm already using vector types for most operations, so it's somewhat portable. But for whatever reason D doesn't allow multiplication/division w/ integral vectors (departing from GCC/clang) and I can't perform unaligned loads, so I have to resort to intrinsics for that.Any agnostic way to... :-)__builtin_ia32_loadups __builtin_ia32_storeups
Apr 02 2016
On Friday, 1 April 2016 at 22:31:00 UTC, Martin Nowak wrote:LDC at least has some intrinsics once you find ldc.gccbuiltins_x86, but for some reason comes with it's own broken ldc.simd.loadUnalignedPlease submit a GH issue with LDC, thanks! -Johan
Apr 03 2016
On Friday, 1 April 2016 at 22:31:00 UTC, Martin Nowak wrote:Well apparently stores w/ dmd's weird core.simd interface don't work, or I can't figure out (from the non-existent documentation) how to use it.https://github.com/D-Programming-Language/dmd/pull/5625
Apr 03 2016
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:I'm currently working on a templated arrayop implementation (using RPN to encode ASTs). So far things worked out great, but now I got stuck b/c apparently none of the D compilers has a working SIMD implementation (maybe GDC has but it's very difficult to work w/ the 2.066 frontend). https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d I don't want to do anything fancy, just unaligned loads, stores, and integral mul/div. Is this really the current state of SIMD or am I missing sth.? -MartinAm I being stupid or is core.simd what you want?
Mar 31 2016
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:I don't want to do anything fancy, just unaligned loads, stores, and integral mul/div. Is this really the current state of SIMD or am I missing sth.?I think you want to write your code using SIMD primitives. But in case you want the compiler to generate SIMD instructions, perhaps ldc.attributes.target may help you. I have not checked what LDC does with SIMD with default command-line parameters. Cheers, Johan
Mar 31 2016
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:I'm currently working on a templated arrayop implementation (using RPN to encode ASTs). So far things worked out great, but now I got stuck b/c apparently none of the D compilers has a working SIMD implementation (maybe GDC has but it's very difficult to work w/ the 2.066 frontend). https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d I don't want to do anything fancy, just unaligned loads, stores, and integral mul/div. Is this really the current state of SIMD or am I missing sth.? -MartinUnfortunately mine (https://github.com/Iakh/simd) is far from production code. For now I'm trying to figure out an interface common to all archs/compilers. And it's more about SIMD comparison operations. You could do loads, stores and mul with default D SIMD support, but not integer div.
Mar 31 2016
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:I'm currently working on a templated arrayop implementation (using RPN to encode ASTs). So far things worked out great, but now I got stuck b/c apparently none of the D compilers has a working SIMD implementation (maybe GDC has but it's very difficult to work w/ the 2.066 frontend). https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d I don't want to do anything fancy, just unaligned loads, stores, and integral mul/div. Is this really the current state of SIMD or am I missing sth.? -MartinHello Martin, Is it possible to introduce compile time information about the target platform? I am working on a from-scratch BLAS implementation. And there is no hope of creating something usable without CT information about the target. Best regards, Ilya
Apr 02 2016
On 3 Apr 2016 8:15 am, "9il via Digitalmars-d" <digitalmars-d puremagic.com> wrote:On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:I'm currently working on a templated arrayop implementation (using RPN to encode ASTs). So far things worked out great, but now I got stuck b/c apparently none of the D compilers has a working SIMD implementation (maybe GDC has but it's very difficult to work w/ the 2.066 frontend). https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d I don't want to do anything fancy, just unaligned loads, stores, and integral mul/div. Is this really the current state of SIMD or am I missing sth.? -MartinHello Martin, Is it possible to introduce compile time information about the target platform? I am working on a from-scratch BLAS implementation. And there is no hope of creating something usable without CT information about the target. Best regards, IlyaWhat kind of information?
Apr 02 2016
On Sunday, 3 April 2016 at 06:33:13 UTC, Iain Buclaw wrote:On 3 Apr 2016 8:15 am, "9il via Digitalmars-d" <digitalmars-d puremagic.com> wrote:Hello Martin, Is it possible to introduce compile time information about the target platform? I am working on a from-scratch BLAS implementation. And there is no hope of creating something usable without CT information about the target.Best regards, IlyaWhat kind of information?Target CPU configuration: - CPU architecture (done) - Count of FP/Integer registers - Allowed sets of instructions: for example, AVX2, FMA4 - Compiler optimization options (for math) Ilya
Apr 04 2016
On Mon, 04 Apr 2016 14:02:03 +0000, 9il <ilyayaroshenko gmail.com> wrote:Target CPU configuration: - CPU architecture (done) - Count of FP/Integer registers - Allowed sets of instructions: for example, AVX2, FMA4 - Compiler optimization options (for math) Ilya- On amd64, whether floating-point math is handled by the FPU or SSE. When emulating floating-point, e.g. for float-to-string and string-to-float code, it is useful to know where to get the active rounding mode from, since they may differ and at least GCC has a switch to choose between both. - For compile time enabling of SSE4 code, a version define is sufficient. Sometimes we want to select a code path at runtime. For this to work, GDC and LDC use a conservative feature set at compile time (e.g. amd64 with SSE2) and tag each SSE4 function with an attribute to temporarily elevate the instruction set (e.g. attribute("target", "+sse4")). If you didn't tag the function like that the compiler would error out, because the SSE4 instructions are not supported by a minimal amd64 CPU. To put this to good use, we need a reliable way - basically a global variable - to check for SSE4 (or POPCNT, etc.). What we have now does not work across all compilers. -- Marco
Apr 04 2016
On Monday, 4 April 2016 at 16:21:15 UTC, Marco Leise wrote:On Mon, 04 Apr 2016 14:02:03 +0000, 9il <ilyayaroshenko gmail.com> wrote: - On amd64, whether floating-point math is handled by the FPU or SSE. When emulating floating-point, e.g. for float-to-string and string-to-float code, it is useful to know where to get the active rounding mode from, since they may differ and at least GCC has a switch to choose between both. - For compile time enabling of SSE4 code, a version define is sufficient. Sometimes we want to select a code path at runtime. For this to work, GDC and LDC use a conservative feature set at compile time (e.g. amd64 with SSE2) and tag each SSE4 function with an attribute to temporarily elevate the instruction set (e.g. attribute("target", "+sse4")). If you didn't tag the function like that the compiler would error out, because the SSE4 instructions are not supported by a minimal amd64 CPU. To put this to good use, we need a reliable way - basically a global variable - to check for SSE4 (or POPCNT, etc.). What we have now does not work across all compilers.attribute("target", "+sse4")) would not work well for BLAS. BLAS needs compile time constants. This is very important because BLAS can be 95% portable, so I just need to write code that the compiler can optimize very well. --Ilya
Apr 04 2016
On Mon, 04 Apr 2016 18:35:26 +0000, 9il <ilyayaroshenko gmail.com> wrote:attribute("target", "+sse4")) would not work well for BLAS. BLAS needs compile time constants. This is very important because BLAS can be 95% portable, so I just need to write code that the compiler can optimize very well. --IlyaIt's just for the case where you want a generic executable with a generic and a specialized code path. I didn't mean for this to be used exclusively, without compile-time information about target features. -- Marco
Apr 11 2016
On 4/4/2016 9:21 AM, Marco Leise wrote:To put this to good use, we need a reliable way - basically a global variable - to check for SSE4 (or POPCNT, etc.). What we have now does not work across all compilers.http://dlang.org/phobos/core_cpuid.html
Apr 04 2016
Am Mon, 4 Apr 2016 11:43:58 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:On 4/4/2016 9:21 AM, Marco Leise wrote:That's what I implied in "what we have now": import core.cpuid; writeln( mmx ); // prints 'false' with GDC version(InlineAsm_X86_Any) writeln("DMD and LDC support the Dlang inline assembler"); else writeln("GDC has the GCC extended inline assembler"); Both LLVM and GCC have moved to "extended inline assemblers" that require you to provide information about input, output and scratch registers as well as memory locations, so the compiler can see through the asm-block for register allocation and inlining purposes. It's more difficult to get right, but also more rewarding, as it enables you to write no-overhead "one-liners" and "intrinsics" while having calling conventions still handled by the compiler. An example for GDC: struct DblWord { ulong lo, hi; } /// Multiplies two machine words and returns a double /// machine word. DblWord bigMul(ulong x, ulong y) { DblWord tmp = void; // '=a' and '=d' are outputs to RAX and RDX // respectively that are bound to the two // fields of 'tmp'. // '"a" x' means that we want 'x' as input in // RAX and '"rm" y' places 'y' wherever it // suits the compiler (any general purpose // register or memory location). // 'mulq %3' multiplies with the ulong // represented by the argument at index 3 (y). asm { "mulq %3" : "=a" tmp.lo, "=d" tmp.hi : "a" x, "rm" y; } return tmp; } In the above example the compiler has enough information to inline the function or directly return the result in RAX:RDX without writing to memory first. The same thing in DMD would likely have turned out slower than emulating this using several uint->ulong multiplies. Although less powerful, the LDC team implemented Dlang inline assembly according to the specs and so core.cpuid works for them. 
GDC on the other hand is out of the picture until either 1) GDC adds Dlang inline assembly, or 2) core.cpuid duplicates most of its assembly code to support the GCC extended inline assembler. I would prefer a common extended inline assembler though, because when you use it for performance reasons you typically cannot go with non-inlinable Dlang asm, so you end up with pure D for DMD, GCC asm for GDC, and LDC asm - three code paths. -- Marco
Apr 11 2016
On 4/11/2016 7:24 AM, Marco Leise wrote:Am Mon, 4 Apr 2016 11:43:58 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:There's no reason core.cpuid, which has a platform-independent API, cannot be made to work with GDC and LDC. Adding more global variables to do the same thing would add no value and would not be easier to implement.On 4/4/2016 9:21 AM, Marco Leise wrote:That's what I implied in "what we have now": import core.cpuid; writeln( mmx ); // prints 'false' with GDC version(InlineAsm_X86_Any) writeln("DMD and LDC support the Dlang inline assembler"); else writeln("GDC has the GCC extended inline assembler");To put this to good use, we need a reliable way - basically a global variable - to check for SSE4 (or POPCNT, etc.). What we have now does not work across all compilers.http://dlang.org/phobos/core_cpuid.htmlBoth LLVM and GCC have moved to "extended inline assemblers" that require you to provide information about input, output and scratch registers as well as memory locations, so the compiler can see through the asm-block for register allocation and inlining purposes. It's more difficult to get right, but also more rewarding, as it enables you to write no-overhead "one-liners" and "intrinsics" while having calling conventions still handled by the compiler.I know, but "more difficult" is a bit of an understatement. For example, core.cpuid has not been implemented using those assemblers. BTW, dmd's inline assembler does know about which instructions read/write which registers, and makes use of that when inserting the code so it will work with the rest of the code generator's register usage tracking. I find needing to tell gcc which registers are read/written by a particular instruction to be a step BACKWARDS in usability. This is what computers are supposed to be good for :-)An example for GDC: struct DblWord { ulong lo, hi; } /// Multiplies two machine words and returns a double /// machine word. 
DblWord bigMul(ulong x, ulong y) { DblWord tmp = void; // '=a' and '=d' are outputs to RAX and RDX // respectively that are bound to the two // fields of 'tmp'. // '"a" x' means that we want 'x' as input in // RAX and '"rm" y' places 'y' wherever it // suits the compiler (any general purpose // register or memory location). // 'mulq %3' multiplies with the ulong // represented by the argument at index 3 (y). asm { "mulq %3" : "=a" tmp.lo, "=d" tmp.hi : "a" x, "rm" y; } return tmp; } In the above example the compiler has enough information to inline the function or directly return the result in RAX:RDX without writing to memory first. The same thing in DMD would likely have turned out slower than emulating this using several uint->ulong multiplies.DMD doesn't inline functions with asm in them, but that is not the fault of the inline assembler. The only real weakness in the DMD inline assembler is it doesn't support "let the compiler select the register". DMD's strong support for compiler builtins, however, mitigate this to an acceptable level.Although less powerful, the LDC team implemented Dlang inline assembly according to the specs and so core.cpuid works for them. GDC on the other hand is out of the picture until either 1) GDC adds Dlang inline assembly 2) core.cpuid duplicates most of its assembly code to support the GCC extended inline assembler I would prefer a common extended inline assembler though, because when you use it for performance reasons you typically cannot go with non-inlinable Dlang asm, so you end up with pure D for DMD, GCC asm for GDC and LDC asm - three code paths.
Apr 11 2016
Am Mon, 11 Apr 2016 14:29:11 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:On 4/11/2016 7:24 AM, Marco Leise wrote:LDC implements InlineAsm_X86_Any (DMD style asm), so core.cpuid works. GDC is the only compiler that does not implement it. We agree that core.cpuid should provide this information, but what we have now - core.cpuid in a mix with GDC's lack of DMD style asm - does not work in practice for the years to come.Am Mon, 4 Apr 2016 11:43:58 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:There's no reason core.cpuid, which has a platform-independent API, cannot be made to work with GDC and LDC. Adding more global variables to do the same thing would add no value and would not be easier to implement.On 4/4/2016 9:21 AM, Marco Leise wrote:That's what I implied in "what we have now": import core.cpuid; writeln( mmx ); // prints 'false' with GDC version(InlineAsm_X86_Any) writeln("DMD and LDC support the Dlang inline assembler"); else writeln("GDC has the GCC extended inline assembler");To put this to good use, we need a reliable way - basically a global variable - to check for SSE4 (or POPCNT, etc.). What we have now does not work across all compilers.http://dlang.org/phobos/core_cpuid.htmlYep, and that makes it unavailable in GDC. All feature tests return false, even MMX or SSE2 on amd64.Both LLVM and GCC have moved to "extended inline assemblers" that require you to provide information about input, output and scratch registers as well as memory locations, so the compiler can see through the asm-block for register allocation and inlining purposes. It's more difficult to get right, but also more rewarding, as it enables you to write no-overhead "one-liners" and "intrinsics" while having calling conventions still handled by the compiler.I know, but "more difficult" is a bit of an understatement. 
For example, core.cpuid has not been implemented using those assemblers. BTW, dmd's inline assembler does know about which instructions read/write which registers, and makes use of that when inserting the code so it will work with the rest of the code generator's register usage tracking. That is a pleasant surprise. :) I find needing to tell gcc which registers are read/written by a particular instruction to be a step BACKWARDS in usability. This is what computers are supposed to be good for :-) Still, DMD does not inline asm and always adds a function prolog and epilog around asm blocks in an otherwise empty function (correct me if I'm wrong). "naked" means you have to duplicate code for the different calling conventions, in particular Win32. Your view of GCC (and LLVM) may be a bit biased. First of all you don't need to tell it exactly which registers to use. A rough classification is enough and gives the compiler a good idea of where calculations should be stored upon arrival at the asm statement. You can be specific down to the register name or let the backend choose freely with "rm" (= any register or memory). An example: We have a variable x that is computed inside a function, followed by an asm block that multiplies it with something else. Typically you would "MOV EAX, [x]" to load x into the register that the MUL instruction expects. With extended assemblers you can be declarative about that and just state that x is needed in EAX as an input. You drop the MOV from the asm block and let the compiler figure out in its codegen how x will end up in EAX. That's a step FORWARD in usability. DMD doesn't inline functions with asm in them, but that is not the fault of the inline assembler. The only real weakness in the DMD inline assembler is that it doesn't support "let the compiler select the register". DMD's strong support for compiler builtins, however, mitigates this to an acceptable level. Yes, I've witnessed that in the multiply-with-overflow check. 
DMD generates very efficient code for 'mulu'. It's just that the compiler cannot have builtins for everything. (I personally was looking for 64-bit multiply with 128-bit result and SSE4 string scanning.) The extended assemblers in GCC and LLVM allow me to write intrinsics, often as a single(!) instruction, that seamlessly inlines into the surrounding code, just as DMD's builtins would do. And it seems to me we could have less backend complexity if we were able to implement intrinsics as library code with the same efficiency. ;) But most of the time when I want to access a specialized CPU instruction for speed with asm in DMD, the generic pure D code is faster. I would advise to only use it if the concept is not expressible in pure D at the moment. You might add that we shouldn't write asm in the first place, because compilers have become smart enough, but it's not like I was writing large chunks of asm. I use it to write "compiler builtins" in D source code. -- Marco
Apr 12 2016
On 4/12/2016 9:53 AM, Marco Leise wrote:LDC implements InlineAsm_X86_Any (DMD style asm), so core.cpuid works. GDC is the only compiler that does not implement it. We agree that core.cpuid should provide this information, but what we have now - core.cpuid in a mix with GDC's lack of DMD style asm - does not work in practice for the years to come.Years? Anyone who needs core.cpuid could translate it to GDC's inline asm style in an hour or so. It could even be simply written separately in GAS and linked in. Since this has not been done, I can only conclude that core.cpuid has not been an actual blocker.https://github.com/D-Programming-Language/dmd/blob/master/src/iasm.c#L1255BTW, dmd's inline assembler does know about which instructions read/write which registers, and makes use of that when inserting the code so it will work with the rest of the code generator's register usage tracking.That is a pleasant surprise. :)Still, DMD does not inline asm and always adds a function prolog and epilog around asm blocks in an otherwise empty function (correct me if I'm wrong).Not if you use "naked"."naked" means you have to duplicate code for the different calling conventions, in particular Win32.Why complain about it adding a prolog/epilog, and complain about it not adding it?Your look on GCC (and LLVM) may be a bit biased. First of all you don't need to tell it exactly which registers to use. A rough classification is enough and gives the compiler a good idea of where calculations should be stored upon arrival at the asm statement. You can be specific down to the register name or let the backend chose freely with "rm" (= any register or memory). An example: We have a variable x that is computed inside a function followed by an asm block that multiplies it with something else. Typically you would "MOV EAX, [x]" to load x into the register that the MUL instruction expects. 
With extended assemblers you can be declarative about that and just state that x is needed in EAX as an input. You drop the MOV from the asm block and let the compiler figure out in its codegen, how x will end up in EAX. That's a step FORWARD in usability.It's a step backwards because I can't just say "MUL EAX". I have to tell GCC what register the result gets put in. This is, to my mind, ridiculous. GCC's inline assembler apparently has no knowledge of what the opcodes actually do.
Apr 12 2016
On Tue, 12 Apr 2016 13:22:12 -0700, Walter Bright <newshound2 digitalmars.com> wrote: On 4/12/2016 9:53 AM, Marco Leise wrote: You mean it is OK if I duplicated most of the asm in there and created a pull request? LDC implements InlineAsm_X86_Any (DMD style asm), so core.cpuid works. GDC is the only compiler that does not implement it. We agree that core.cpuid should provide this information, but what we have now - core.cpuid in a mix with GDC's lack of DMD style asm - does not work in practice for the years to come. Years? Anyone who needs core.cpuid could translate it to GDC's inline asm style in an hour or so. It could even be simply written separately in GAS and linked in. Since this has not been done, I can only conclude that core.cpuid has not been an actual blocker. Yeah, I didn't make this clear. To reduce code repetition I'd like to avoid "naked" and have the compiler handle the calling conventions. Let's compare the earlier example in both GDC and DMD in a coding style that is agnostic wrt. the calling convention. First GDC:

struct DblWord { ulong lo, hi; }

DblWord bigMul(ulong x, ulong y)
{
    DblWord tmp;
    asm { "mulq %[y]" : "=a" tmp.lo, "=d" tmp.hi : "a" x, [y] "rm" y; }
    return tmp;
}

This is turned into the following instruction sequence (AT&T):

    mov %rdi,%rax
    mul %rsi
    retq

Note how elegantly GCC handles the calling convention for us. The prolog reduces to moving 'x' from RDI to RAX, where I asked it to place it for the MUL to use as the implicit operand. After multiplying it by the explicit operand in RSI, the resulting two machine words would be in RAX:RDX as we know. I created a data structure to return those two and told GCC to tie tmp.lo to RAX and tmp.hi to RDX. Since the calling convention happens to return structs of 2 machine words in RAX:RDX, the whole assignment to 'tmp' and the return become no-ops. With inlining enabled only the 'mul' would remain. This is the ideal outcome. 
Now let's look at the DMD implementation - again letting the compiler figure out the calling convention:

DblWord bigMul(ulong x, ulong y)
{
    DblWord tmp;
    asm
    {
        mov RAX, x;
        mul y;
        mov tmp+DblWord.lo.offsetof, RAX;
        mov tmp+DblWord.hi.offsetof, RDX;
    }
    return tmp;
}

This generates the following:

    push %rbp
    mov  %rsp,%rbp
    sub  $0x20,%rsp
    mov  %rdi,-0x10(%rbp)
    mov  %rsi,-0x8(%rbp)
    lea  -0x20(%rbp),%rax
    xor  %ecx,%ecx
    mov  %rcx,(%rax)
    mov  %rcx,0x8(%rax)
    mov  -0x8(%rbp),%rax
    mulq -0x10(%rbp)
    mov  %rax,-0x20(%rbp)
    mov  %rdx,-0x18(%rbp)
    mov  -0x18(%rbp),%rdx
    mov  -0x20(%rbp),%rax
    mov  %rbp,%rsp
    pop  %rbp
    retq

In practice GDC will just replace the invocation with a single 'mul' instruction while DMD will emit a call to this 18-instruction function. Now you keep telling me extended assembly is a step backwards. :) Still, DMD does not inline asm and always adds a function prolog and epilog around asm blocks in an otherwise empty function (correct me if I'm wrong). Not if you use "naked". "naked" means you have to duplicate code for the different calling conventions, in particular Win32. Why complain about it adding a prolog/epilog, and complain about it not adding it? It's a step backwards because I can't just say "MUL EAX". You could write this, you'd only have to tell the assembler that EAX and EDX will be overwritten, something that DMD already knows. I have to tell GCC what register the result gets put in. And by doing this you allow it to figure out the shortest way to return the result in compliance with the calling convention. This is, to my mind, ridiculous. I too find it annoying that I have to inform it about the scratch registers used in the asm, but the rest seems legit to me. At some point you will have to connect variables in the host language with registers in assembly. 
Doing this in a declarative manner instead of explicit assembly code allows the backend to find the optimal code (literally), as demonstrated above. GCC's inline assembler apparently has no knowledge of what the opcodes actually do. Agreed. It seems to treat the assembly text merely as a text template. It is the same with LLVM's extended assembler, which borrows heavily from GCC's. This is probably due to the fact that the assembler is historically a standalone executable, and as such the authority for interpreting the asm code is outside of the scope of the host language compiler. Under these circumstances we might have gone for the same implementation. -- Marco
Apr 12 2016
On 4/12/2016 4:29 PM, Marco Leise wrote: On Tue, 12 Apr 2016 13:22:12 -0700, Walter Bright <newshound2 digitalmars.com> wrote: It's Boost licensed, and Boost-licensed code can be shipped with GPL'd code as far as I know. On 4/12/2016 9:53 AM, Marco Leise wrote: You mean it is OK if I duplicated most of the asm in there and created a pull request? LDC implements InlineAsm_X86_Any (DMD style asm), so core.cpuid works. GDC is the only compiler that does not implement it. We agree that core.cpuid should provide this information, but what we have now - core.cpuid in a mix with GDC's lack of DMD style asm - does not work in practice for the years to come. Years? Anyone who needs core.cpuid could translate it to GDC's inline asm style in an hour or so. It could even be simply written separately in GAS and linked in. Since this has not been done, I can only conclude that core.cpuid has not been an actual blocker. "mulq %[y]" : "=a" tmp.lo, "=d" tmp.hi : "a" x, [y] "rm" y; I don't see anything elegant about those lines; to start with, "mulq" is not in any of the AMD or Intel CPU manuals. The assembler should notice that 'y' is a ulong and select the 64 bit version of the MUL opcode automatically. I can see nothing to recommend the "=a" tmp.lo syntax. How about something comprehensible like "tmp.lo = EAX"? I bet people could even figure that out without consulting stackoverflow! :-) I have no idea what "a" x and [y] "rm" y mean, nor why the ":" appears sometimes and the "," other times. It does look like it was designed by the same guy who invented TECO macros: https://www.reddit.com/r/programming/comments/4e07lo/last_night_in_a_fit_of_boredom_far_away_from_my/d1xlbh7 but that's not much of a compliment. In practice GDC will just replace the invocation with a single 'mul' instruction while DMD will emit a call to this 18-instruction function. Now you keep telling me extended assembly is a step backwards. 
:)

DMD version:

DblWord bigMul(ulong x, ulong y)
{
    naked asm
    {
        mov RAX,RDI;
        mul RSI;
        ret;
    }
}

This is the basis of my assertion that it is a step backwards. Granted, it has some nice capability, as you've demonstrated. But it sure makes you suffer to get it. GCC's inline assembler apparently has no knowledge of what the opcodes actually do. Agreed.
Apr 12 2016
On 13 April 2016 at 08:22, Walter Bright via Digitalmars-d <digitalmars-d puremagic.com> wrote: On 4/12/2016 4:29 PM, Marco Leise wrote: In practice GDC will just replace the invocation with a single 'mul' instruction while DMD will emit a call to this 18-instruction function. Now you keep telling me extended assembly is a step backwards. :) DMD version: DblWord bigMul(ulong x, ulong y) { naked asm { mov RAX,RDI; mul RSI; ret; } } In fact, the "correct" version of "mul eax" is:

asm { "mul{l} {%%}eax" : "=a" var : "a" var; }

- Works with both dialects (Intel and ATT)
- Compiler knows the first register ("a") is read and written to, so doesn't keep temporaries stored there.
- Compiler loads the variable "var" into EAX before the statement is executed.
- Compiler knows that the value of "var" is in EAX after the statement is finished.

http://goo.gl/64SSD5 Just toggle on/off Intel syntax to see the difference. :-) I can agree that the way that instruction (or insn) templates look is pretty ugly. But IMO, for the most part on x86 their ugliness is attributable to having to support two types of assembler syntax at once.
Apr 13 2016
On Tue, 12 Apr 2016 23:22:37 -0700, Walter Bright <newshound2 digitalmars.com> wrote:

Tell me again, what's more elegant!

uint* pnb = cast(uint*)cf.processorNameBuffer.ptr;
version(GNU)
{
    asm { "cpuid" : "=a" pnb[0], "=b" pnb[1], "=c" pnb[ 2], "=d" pnb[ 3] : "a" 0x8000_0002; }
    asm { "cpuid" : "=a" pnb[4], "=b" pnb[5], "=c" pnb[ 6], "=d" pnb[ 7] : "a" 0x8000_0003; }
    asm { "cpuid" : "=a" pnb[8], "=b" pnb[9], "=c" pnb[10], "=d" pnb[11] : "a" 0x8000_0004; }
}
else version(D_InlineAsm_X86)
{
    asm pure nothrow @nogc
    {
        push ESI;
        mov ESI, pnb;
        mov EAX, 0x8000_0002;
        cpuid;
        mov [ESI], EAX;
        mov [ESI+4], EBX;
        mov [ESI+8], ECX;
        mov [ESI+12], EDX;
        mov EAX, 0x8000_0003;
        cpuid;
        mov [ESI+16], EAX;
        mov [ESI+20], EBX;
        mov [ESI+24], ECX;
        mov [ESI+28], EDX;
        mov EAX, 0x8000_0004;
        cpuid;
        mov [ESI+32], EAX;
        mov [ESI+36], EBX;
        mov [ESI+40], ECX;
        mov [ESI+44], EDX;
        pop ESI;
    }
}
else version(D_InlineAsm_X86_64)
{
    asm pure nothrow @nogc
    {
        push RSI;
        mov RSI, pnb;
        mov EAX, 0x8000_0002;
        cpuid;
        mov [RSI], EAX;
        mov [RSI+4], EBX;
        mov [RSI+8], ECX;
        mov [RSI+12], EDX;
        mov EAX, 0x8000_0003;
        cpuid;
        mov [RSI+16], EAX;
        mov [RSI+20], EBX;
        mov [RSI+24], ECX;
        mov [RSI+28], EDX;
        mov EAX, 0x8000_0004;
        cpuid;
        mov [RSI+32], EAX;
        mov [RSI+36], EBX;
        mov [RSI+40], ECX;
        mov [RSI+44], EDX;
        pop RSI;
    }
}

-- Marco

"mulq %[y]" : "=a" tmp.lo, "=d" tmp.hi : "a" x, [y] "rm" y; I don't see anything elegant about those lines; to start with, "mulq" is not in any of the AMD or Intel CPU manuals. The assembler should notice that 'y' is a ulong and select the 64 bit version of the MUL opcode automatically. I can see nothing to recommend the "=a" tmp.lo syntax. How about something comprehensible like "tmp.lo = EAX"? I bet people could even figure that out without consulting stackoverflow! :-) I have no idea what "a" x and [y] "rm" y mean, nor why the ":" appears sometimes and the "," other times.
Apr 16 2016
On 4/16/2016 2:40 PM, Marco Leise wrote: Tell me again, what's more elegant! If I wanted to write in assembler, I wouldn't write in a high-level language, especially a weird one like the GNU version.
Apr 16 2016
On Sat, 16 Apr 2016 21:46:08 -0700, Walter Bright <newshound2 digitalmars.com> wrote: On 4/16/2016 2:40 PM, Marco Leise wrote: I hate the many pitfalls of extended asm: Forget to mention a side effect in the "clobbers" list and the compiler assumes that register or memory location still holds the value from before the asm. Have an _input_ reg clobbered? You must NOT name it in the clobber list but use it as a dummy output with a dummy variable assignment. The learning curve is steep and, as you said, it is usually unintelligible without prior knowledge. But what I really miss from the last generation of inline assemblers are these points:

1. In most cases you can make the asm transparent to the optimizer, leading to:
   1.a Inlining of asm
   1.b Dead-code removal of asm blocks
2. Asm template arguments (e.g. input variables) are bound via constraints:
   2.a Can use an output constraint `"=a" var` to mean any of "AL", "AX", "EAX" or "RAX" depending on the size of 'var'
   2.b `"r" ptr` can bind 32-bit and 64-bit pointers, often eliminating the need for duplicate asm blocks that only differ in one mention of e.g. RSI vs. ESI.
   2.c Compiler seamlessly integrates host code variables with asm. No need to manually pick tmp registers to move parameters and output. `"r" myUint` is all it takes for 'myUint' to end up in any of EAX, EDX, ... (whatever the register allocator deems efficient at that point)
   2.d As a net result, asm templates often reduce to a single mnemonic and work with X86, X32 and AMD64.
3. In DMD I often see "naked" used to get rid of function prolog and epilog in an attempt to get an intrinsic-like, fast function. This requires extra care to get the calling convention right and may require more code duplication for e.g. Win32. Asm templates in GCC and LLVM benefit from this speedup automatically, because the backend will remove unneeded prolog/epilog code and even inline small functions.
GCC's historically grown template syntax, based on multiple _external_ assembler backends, ain't that great, and it is a PITA that it cannot understand the mnemonics and figure out side effects itself like DMD. But I hope I could highlight a few points where classic assemblers, as found in Delphi or DMD, fall behind in modern convenience and native efficiency. When C was invented it matched the CPUs quite well, but today we have dozens of instructions that C and D syntax has no expression for. All modern compilers spend a considerable amount of backend code on pattern matching code constructs like a layman's POPCNT and replacing them with optimal CPU instructions. More and more we turn to browsing the list of readily available compiler built-ins first, and the next step is to acknowledge the need and make inline assemblers powerful enough for programmers to efficiently implement non-existing intrinsics in library code. -- Marco Tell me again, what's more elegant! If I wanted to write in assembler, I wouldn't write in a high-level language, especially a weird one like the GNU version.
Apr 17 2016
On 12 April 2016 at 22:22, Walter Bright via Digitalmars-d <digitalmars-d puremagic.com> wrote: On 4/12/2016 9:53 AM, Marco Leise wrote:

asm { "mul eax"; }

- That wasn't so difficult. :-)

I don't know if accessing D data and calling functions from DMD-IASM is safe (it is in GDC extended IASM). But I have always chosen the path that requires the least amount of maintenance burden/overhead. And I'm sorry to say that supporting the GCC-style extended assembler both comes for free (handling is managed by the middle-end) and requires no platform-specific support on the language implementation side. However, I have always considered comparing the two a bit like apples and oranges. DMD compiles to object code, so it makes sense to me that you have an entire assembler embedded in. However GDC compiles to assembly, and I expect that GNU As will know a lot more about what opcodes actually do on, say, a Motorola 68k, than the poor man's parser I would be able to write. There were a lot of challenges supporting DMD-style IASM, all non-existent in DMD. Drawing a list off the top of my head - I'll let you decide whether IASM is pro or con in this area, but again bear in mind that DMD doesn't have to deal with calling an external assembler.

- What dialect am I writing in? (Do I emit mul or mull? eax or %eax?)
- Some opcodes in IASM have a different name in the assembler (Emitted fdivrp as fdivp, and fdivp as fdivrp. No idea why, but I recall std.math didn't work without the translation).
- Some opcodes are actually directives in disguise (db, ds, dw, ...)
- Frame-relative addressing/displacement of a symbol before the backend has decided where incoming parameters will land is a good way to get hit by a truck.
- GCC backend doesn't support naked functions on x86.
- Or even in the sense that DMD supports naked functions where there is support (only plain text assembler allowed)
- Want to support ARM? MIPS? PPC?
At the time when GDC supported DMD-style IASM for x86, the implementation was over 3000 LOC, adding platform support just looked like an unmanageable nightmare.Your look on GCC (and LLVM) may be a bit biased. First of all you don't need to tell it exactly which registers to use. A rough classification is enough and gives the compiler a good idea of where calculations should be stored upon arrival at the asm statement. You can be specific down to the register name or let the backend chose freely with "rm" (= any register or memory). An example: We have a variable x that is computed inside a function followed by an asm block that multiplies it with something else. Typically you would "MOV EAX, [x]" to load x into the register that the MUL instruction expects. With extended assemblers you can be declarative about that and just state that x is needed in EAX as an input. You drop the MOV from the asm block and let the compiler figure out in its codegen, how x will end up in EAX. That's a step FORWARD in usability.It's a step backwards because I can't just say "MUL EAX". I have to tell GCC what register the result gets put in. This is, to my mind, ridiculous. GCC's inline assembler apparently has no knowledge of what the opcodes actually do.
Apr 12 2016
On 4/12/2016 4:35 PM, Iain Buclaw via Digitalmars-d wrote: My understanding is that this is not sufficient if you want gcc to track register usage, etc. I could be wrong; from the documentation on how the gcc inline assembler works, I found it impossible to figure out what was required and what wasn't. I'd just look at existing examples and modify to suit :-( It's a step backwards because I can't just say "MUL EAX". I have to tell GCC what register the result gets put in. This is, to my mind, ridiculous. GCC's inline assembler apparently has no knowledge of what the opcodes actually do. asm { "mul eax"; } - That wasn't so difficult. :-) I don't know what you mean by 'safe' in this context. If you follow the ABI it should work. (it is in GDC extended IASM). But I have always chosen the path that requires the least amount of maintenance burden/overhead. And I'm sorry to say that supporting GCC-style extended assembler both comes for free (handling is managed by the middle-end), and requires no platform-specific support on the language implementation side. Your decision makes sense. However, I have always considered comparing the two a bit like apples and oranges. DMD compiles to object code, so it makes sense to me that you have an entire assembler embedded in. However GDC compiles to assembly, and I expect that GNU As will know a lot more about what opcodes actually do on, say, a Motorola 68k, than the poor man's parser I would be able to write. There were a lot of challenges supporting DMD-style IASM, all non-existent in DMD. Drawing a list off the top of my head - I'll let you decide whether IASM is pro or con in this area, but again bear in mind that DMD doesn't have to deal with calling an external assembler. - What dialect am I writing in? (Do I emit mul or mull? eax or %eax?) - Some opcodes in IASM have a different name in the assembler (Emitted fdivrp as fdivp, and fdivp as fdivrp. 
No idea why but I recall std.math didn't work without the translation).DMD's iasm uses the opcodes as written in the Intel CPU manuals. There is no MULL opcode in the manual, so no MULL in DMD's iasm. It figures out which opcode by looking at the operands, using the Intel CPU manual as a guide. It's a bit of a pain as there are a lot of special cases, but the end result is pretty straightforward if you're using the Intel CPU manual as a reference guide.- Some opcodes are actually directives in disguise (db, ds, dw, ...) - Frame-relative addressing/displacement of a symbol before the backend has decided where incoming parameters will land is a good way to get hit by a truck. - GCC backend doesn't support naked functions on x86. - Or even in the sense that DMD supports naked functions where there is support (only plain text assembler allowed) - Want to support ARM? MIPS? PPC? At the time when GDC supported DMD-style IASM for x86, the implementation was over 3000 LOC, adding platform support just looked like an unmanageable nightmare.I understand that GDC has special challenges because it writes to an assembler rather than direct to object code. I understand it is not easy to replicate DMD's iasm functionality. Which is why I haven't given you a hard time about it :-) and it is not terribly important. But core.cpuid needs to be made to work in GDC, whatever it takes to do so. ---- Personally, I strongly dislike the fact that the GAS syntax is the reverse of Intel's. It isn't just GAS, it's GDB and everything else. It just sux. It makes my eyeballs hurt looking at it. It's like giving me a car with the brake and gas pedals reversed. Nothing but accidents result :-) And I don't like that they use different opcodes than the Intel manuals. That just sux, too. But I know that the GNU world is stuck with that, and GDC should behave like the rest of GCC.
Apr 12 2016
On 13 April 2016 at 07:59, Walter Bright via Digitalmars-d <digitalmars-d puremagic.com> wrote: On 4/12/2016 4:35 PM, Iain Buclaw via Digitalmars-d wrote: My only point was that in GDC, the translation of opcodes to machine code is done in two steps by two separate processes, rather than one. DMD is proof that having a unified syntax is a big win. - What dialect am I writing in? (Do I emit mul or mull? eax or %eax?) - Some opcodes in IASM have a different name in the assembler (Emitted fdivrp as fdivp, and fdivp as fdivrp. No idea why, but I recall std.math didn't work without the translation). DMD's iasm uses the opcodes as written in the Intel CPU manuals. There is no MULL opcode in the manual, so no MULL in DMD's iasm. It figures out which opcode by looking at the operands, using the Intel CPU manual as a guide. It's a bit of a pain as there are a lot of special cases, but the end result is pretty straightforward if you're using the Intel CPU manual as a reference guide. Indeed, it's been on my TODO list for a long time, among many other things. :-) - Some opcodes are actually directives in disguise (db, ds, dw, ...) - Frame-relative addressing/displacement of a symbol before the backend has decided where incoming parameters will land is a good way to get hit by a truck. - GCC backend doesn't support naked functions on x86. - Or even in the sense that DMD supports naked functions where there is support (only plain text assembler allowed) - Want to support ARM? MIPS? PPC? At the time when GDC supported DMD-style IASM for x86, the implementation was over 3000 LOC, adding platform support just looked like an unmanageable nightmare. I understand that GDC has special challenges because it writes to an assembler rather than direct to object code. I understand it is not easy to replicate DMD's iasm functionality. Which is why I haven't given you a hard time about it :-) 
But core.cpuid needs to be made to work in GDC, whatever it takes to do so.---- Personally, I strongly dislike the fact that the GAS syntax is the reverse of Intel's. It isn't just GAS, it's GDB and everything else. It just sux. It makes my eyeballs hurt looking at it. It's like giving me a car with the brake and gas pedals reversed. Nothing but accidents result :-) And I don't like that they use different opcodes than the Intel manuals. That just sux, too.Like riding a backwards bicycle. :-) https://www.youtube.com/watch?v=MFzDaBzBlL0But I know that the GNU world is stuck with that, and GDC should behave like the rest of GCC.Yeah, and I'm glad that you do.
Apr 13 2016
Am Wed, 13 Apr 2016 09:51:25 +0200 schrieb Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com>:On 13 April 2016 at 07:59, Walter Bright via Digitalmars-d <digitalmars-d puremagic.com> wrote:Would you want to implement this in the compiler like the checkedint functions? I guess that's the only way to guarantee cross-module inlining with GDC. Otherwise I would use __builtin_cpu_supports (const char *feature). (GCC practically has its own internal core.cpuid implementation made of intrinsics.) -- MarcoBut core.cpuid needs to be made to work in GDC, whatever it takes to do so.Indeed, it's been on my TODO list for a long time, among many other things. :-)
Apr 13 2016
On 13 April 2016 at 11:13, Marco Leise via Digitalmars-d <digitalmars-d puremagic.com> wrote: On Wed, 13 Apr 2016 09:51:25 +0200, Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> wrote: On 13 April 2016 at 07:59, Walter Bright via Digitalmars-d <digitalmars-d puremagic.com> wrote: But core.cpuid needs to be made to work in GDC, whatever it takes to do so. Indeed, it's been on my TODO list for a long time, among many other things. :-) Would you want to implement this in the compiler like the checkedint functions? I guess that's the only way to guarantee cross-module inlining with GDC. Otherwise I would use __builtin_cpu_supports (const char *feature). (GCC practically has its own internal core.cpuid implementation made of intrinsics.) -- Marco Yes, cpu_supports is a good way to do it as we only need to invoke __builtin_cpu_init once and cache all values when running 'shared static this()'. I would also like to be able to support other processors too. ARM is a high priority one which should follow suit.
Apr 13 2016
On Wed, 13 Apr 2016 11:21:35 +0200, Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> wrote: Yes, cpu_supports is a good way to do it as we only need to invoke __builtin_cpu_init once and cache all values when running 'shared static this()'. I was under the assumption that GCC already emits an 'early' static ctor with a call to __builtin_cpu_init(). It is also likely that we don't need extra code to copy GCC's cache to core.cpuid's cache (unless the cached data is publicly exposed somehow). What is your stance on the cross-module inlining issue? Stuff like hasPopcnt etc. won't be inlined unless you turn them into compiler-recognised builtins, right? It's not a blocker, but something to keep in mind when not accessing global variables directly. How about this style as an alternative?:

immutable bool mmx;
immutable bool hasPopcnt;

shared static this()
{
    import gcc.builtins;
    mmx       = __builtin_cpu_supports("mmx"   ) > 0;
    hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
}

-- Marco
Apr 13 2016
On 4/13/2016 3:58 AM, Marco Leise wrote:How about this style as an alternative?: immutable bool mmx; immutable bool hasPopcnt; shared static this() { import gcc.builtins; mmx = __builtin_cpu_supports("mmx" ) > 0; hasPopcnt = __builtin_cpu_supports("popcnt") > 0; }Please do not invent an alternative interface, use the one in core.cpuid:
Apr 13 2016
Am Wed, 13 Apr 2016 04:14:48 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:On 4/13/2016 3:58 AM, Marco Leise wrote:Yes, they are all property and a substitution with direct access to the globals will work around GDC's lack of cross-module inlining. Otherwise these feature checks which might be used in hot code, are more costly than they should be. I hate when things get in the way of efficiency. :) -- MarcoHow about this style as an alternative?: immutable bool mmx; immutable bool hasPopcnt; shared static this() { import gcc.builtins; mmx = __builtin_cpu_supports("mmx" ) > 0; hasPopcnt = __builtin_cpu_supports("popcnt") > 0; }Please do not invent an alternative interface, use the one in core.cpuid:
Apr 13 2016
On 4/13/2016 5:47 AM, Marco Leise wrote: Yes, they are all properties, and a substitution with direct access to the globals will work around GDC's lack of cross-module inlining. Otherwise these feature checks, which might be used in hot code, are more costly than they should be. I hate when things get in the way of efficiency. :) It doesn't need to be efficient, because such checks should be done at a higher level in the program's logic, not in low-level code. Even so, the program could cache the result of the call.
Apr 13 2016
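Walter's caching suggestion above could be sketched in D. This is a hedged sketch, assuming the property names `avx` and `hasPopcnt` from druntime's core.cpuid:

```d
// Query core.cpuid once at startup and cache the results in immutable
// globals, so hot code reads a bool instead of calling a property.
import core.cpuid : avx, hasPopcnt;

immutable bool useAvx;
immutable bool usePopcnt;

shared static this()
{
    useAvx    = avx;        // one property call per feature, at startup
    usePopcnt = hasPopcnt;
}
```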
On 13 April 2016 at 13:14, Walter Bright via Digitalmars-d <digitalmars-d puremagic.com> wrote:On 4/13/2016 3:58 AM, Marco Leise wrote:An alternative interface needs to be invented anyway for other CPUs.How about this style as an alternative?: immutable bool mmx; immutable bool hasPopcnt; shared static this() { import gcc.builtins; mmx = __builtin_cpu_supports("mmx" ) > 0; hasPopcnt = __builtin_cpu_supports("popcnt") > 0; }Please do not invent an alternative interface, use the one in core.cpuid:
Apr 14 2016
On 4/14/2016 1:21 AM, Iain Buclaw via Digitalmars-d wrote:An alternative interface needs to be invented anyway for other CPUs.That would be fine. But there is no reason to redo core.cpuid for x86 machines.
Apr 14 2016
On 4/4/2016 7:02 AM, 9il wrote:
>> What kind of information?
> Target cpu configuration:
> - CPU architecture (done)
Done.
> - Count of FP/Integer registers
??
> - Allowed sets of instructions: for example, AVX2, FMA4
Done. D_SIMD
> - Compiler optimization options (for math)
Moot. DMD does not have compiler switches to set FP code generation. (This is deliberate.)
Apr 04 2016
On Monday, 4 April 2016 at 20:29:11 UTC, Walter Bright wrote:
> On 4/4/2016 7:02 AM, 9il wrote:
>>> What kind of information?
>> Target cpu configuration:
>> - CPU architecture (done)
> Done.
>> - Count of FP/Integer registers
> ??
How many general-purpose registers, SIMD floating-point registers, and SIMD integer registers does the CPU have?
>> - Allowed sets of instructions: for example, AVX2, FMA4
> Done. D_SIMD
This is not enough. We need to know whether it is AVX or AVX2 at compile time (these cases may require completely different source code).
>> - Compiler optimization options (for math)
> Moot. DMD does not have compiler switches to set FP code generation. (This is deliberate.)
We have LDC and GDC. And it looks like a little standardization based on DMD would be good, even if it would be useless for DMD. With compile-time information about the CPU it is possible to always have a fast generic BLAS for any target as soon as LLVM is released for that target. D+LLVM = fast generic BLAS. For DMD and GDC there would be target-specific BLAS optimizations. OpenBLAS's kernels are 30 MB of assembler code! So we would be able to replace them once and for a very long time with Phobos.
Best regards,
Ilya
Apr 04 2016
On Monday, 4 April 2016 at 21:05:44 UTC, 9il wrote:OpenBLAS kernels is 30 MB of assembler code! So we would be able to replace it once and for a very long time with Phobos.Are you familiar with this project at all? https://github.com/flame/blis
Apr 04 2016
On Monday, 4 April 2016 at 21:13:30 UTC, jmh530 wrote:On Monday, 4 April 2016 at 21:05:44 UTC, 9il wrote:Thanks for the link. BLIS has the same issue as OpenBLAS - a collection of kernels for each target. I want to write an internal kernel compiler (like CT regex) that will build kernels based on CT information about the target. Best regards, IlyaOpenBLAS's kernels are 30 MB of assembler code! So we would be able to replace them once and for a very long time with Phobos.Are you familiar with this project at all? https://github.com/flame/blis
Apr 04 2016
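Ilya's compile-time kernel selection could start from something like the following sketch. Note that `AVX2` here is a user-supplied `-version=AVX2` (no compiler predefines it), while `D_SIMD` is dmd's predefined identifier; the kernel itself is only illustrative:

```d
// Pick a vector width at compile time from version identifiers.
version (AVX2)
    enum vectorBytes = 32;   // hypothetical user-defined version
else version (D_SIMD)
    enum vectorBytes = 16;   // baseline 128-bit vectors via core.simd
else
    enum vectorBytes = 0;    // scalar fallback

void scale(float a, float[] y)
{
    static if (vectorBytes == 16)
    {
        import core.simd : float4;
        float4 av = a;                        // broadcast the scalar
        immutable n = y.length - (y.length % 4);
        foreach (i; 0 .. n / 4)
        {
            auto p = cast(float4*)&y[i * 4];  // assumes 16-byte alignment
            *p = *p * av;
        }
        y = y[n .. $];                        // remainder for the scalar loop
    }
    foreach (ref e; y)   // scalar path (the 32-byte case is not shown)
        e *= a;
}
```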
On 4/4/2016 2:05 PM, 9il wrote:These are deducible from X86, X86_64, and SIMD version identifiers.How many general purpose registers, SIMD Floating Point registers, SIMD Integer registers have a CPU?- Count of FP/Integer registers??Needs to know is it AVX or AVX2 in compile timeSince the compiler never generates AVX or AVX2 instructions, there is no purpose in setting such a predefined version identifier. You might as well use a: -version=AVX switch. Note that it is a very bad idea for a compiler to detect the CPU it is running on and generate code specific to that CPU by default.(this may be completely different source code for this cases).It's entirely practical to compile code with different source code, link them *both* into the executable, and switch between them based on runtime detection of the CPU.We have LDC and GDC. And looks like a little bit standardization based on DMD would be good, even if this would be useless for DMD.There is no such thing as a standard compiler floating point switch, and I'm doubtful defining one would be practical or make much of any sense.With compile time information about CPU it is possible to always have fast generic BLAS for any target as soon as LLVM is released for this target.The SIMD instruction set is highly resistant to transforming generic code into optimal vector instructions. Yes, I know about auto-vectorization, and in general it is a doomed and unworkable technology. http://www.amazon.com/dp/0974364924 It's gotta be done by hand to get it to fly.
Apr 04 2016
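Walter's scheme of linking several variants and selecting once at startup might look like this; the kernel names are hypothetical and core.cpuid's `avx` property is assumed for the detection:

```d
import core.cpuid : avx;

// Two implementations of the same kernel, both linked into the binary.
float dotGeneric(const(float)[] a, const(float)[] b)
{
    float s = 0;
    foreach (i; 0 .. a.length)
        s += a[i] * b[i];
    return s;
}

float dotAVX(const(float)[] a, const(float)[] b)
{
    // ... hand-written core.simd/AVX version would go here ...
    return dotGeneric(a, b);  // placeholder body
}

alias DotFn = float function(const(float)[], const(float)[]);
immutable DotFn dot;

shared static this()
{
    dot = avx ? &dotAVX : &dotGeneric;  // CPU detection happens once
}
```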
On Monday, 4 April 2016 at 22:34:06 UTC, Walter Bright wrote:On 4/4/2016 2:05 PM, 9il wrote:It is impossible to deduce from that combination that Xeon Phi has 32 FP registers.These are deducible from X86, X86_64, and SIMD version identifiers.How many general purpose registers, SIMD Floating Point registers, SIMD Integer registers have a CPU?- Count of FP/Integer registers??"Since the compiler never generates AVX or AVX2" - this is definitely not true; see, for example, LLVM vectorization and SLP vectorization. This is a normal situation for scientific software, supercomputer software, and high-performance server applications.Needs to know is it AVX or AVX2 in compile timeSince the compiler never generates AVX or AVX2 instructions, there is no purpose in setting such a predefined version identifier. You might as well use a: -version=AVX switch. Note that it is a very bad idea for a compiler to detect the CPU it is running on and default generate code specific to that CPU.This approach is complex, and normal for desktop applications. If you have a big cluster of similar computers or you have a supercomputer cluster, the only thing you want to do is `-mcpu=native`/`-march=native`. And this single compiler flag should be enough to build a high-performance linear algebra application.(this may be completely different source code for this cases).It's entirely practical to compile code with different source code, link them *both* into the executable, and switch between them based on runtime detection of the CPU.I just want a unified instrument to receive CT information about target and optimization switches. It is OK if this information would have different switches on different compilers.We have LDC and GDC. 
And it looks like a little standardization based on DMD would be good, even if it would be useless for DMD.There is no such thing as a standard compiler floating point switch, and I'm doubtful defining one would be practical or make much of any sense.Auto-vectorization is only an example (maybe a bad one). I would use SIMD vectors, but I need CT information about the target CPU, because it is impossible to build optimal BLAS kernels without it! My idea is an internal kernel compiler :-) Something similar to compile-time regex, but more complex. Best regards, IlyaWith compile time information about CPU it is possible to always have fast generic BLAS for any target as soon as LLVM is released for this target.The SIMD instruction set is highly resistant to transforming generic code into optimal vector instructions. Yes, I know about auto-vectorization, and in general it is a doomed and unworkable technology. http://www.amazon.com/dp/0974364924 It's gotta be done by hand to get it to fly.
Apr 04 2016
On 4/4/2016 11:10 PM, 9il wrote:It is impossible to deduct from that combination that Xeon Phi has 32 FP registers.Since dmd doesn't generate specific code for a Xeon Phi, having a compile time switch for it is meaningless."Since the compiler never generates AVX or AVX2" - this is definitely nor true, see, for example, LLVM vectorization and SLP vectorization.dmd is not LLVM.Not at all. Used to do it all the time in the DOS world (FPU vs emulation).It's entirely practical to compile code with different source code, link them *both* into the executable, and switch between them based on runtime detection of the CPU.This approach is complex,I just want an unified instrument to receive CT information about target and optimization switches. It is OK if this information would have different switches on different compilers.Optimizations simply do not transfer from one compiler to another, whether the switch is the same or not. They are highly implementation dependent.Auto vectorization is only example (maybe bad). I would use SIMD vectors, but I need CT information about target CPU, because it is impossible to build optimal BLAS kernels without it!I still don't understand why you cannot just set '-version=xxx' on the command line and then switch off that version in your custom code.
Apr 05 2016
On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:On 4/4/2016 11:10 PM, 9il wrote:The particular design and limitations of the dmd backend shouldn't be used to define D. In the extreme, your argument would imply that there's no point having version(ARM) built in to the language, because dmd doesn't support it.It is impossible to deduct from that combination that Xeon Phi has 32 FP registers.Since dmd doesn't generate specific code for a Xeon Phi, having a compile time switch for it is meaningless."Since the compiler never generates AVX or AVX2" - this is definitely nor true, see, for example, LLVM vectorization and SLP vectorization.dmd is not LLVM.So you're suggesting that libraries invent their own list of versions for specific architectures / CPU features, which the user then has to specify somehow on the command line? I want to be able to write code that uses standardised versions that work across various D compilers, with the user only needing to type e.g. -march=native on GDC and get the fastest possible code.Not at all. Used to do it all the time in the DOS world (FPU vs emulation).It's entirely practical to compile code with different source code, link them *both* into the executable, and switch between them based on runtime detection of the CPU.This approach is complex,I just want an unified instrument to receive CT information about target and optimization switches. It is OK if this information would have different switches on different compilers.Optimizations simply do not transfer from one compiler to another, whether the switch is the same or not. They are highly implementation dependent.Auto vectorization is only example (maybe bad). I would use SIMD vectors, but I need CT information about target CPU, because it is impossible to build optimal BLAS kernels without it!I still don't understand why you cannot just set '-version=xxx' on the command line and then switch off that version in your custom code.
Apr 05 2016
On 4/5/2016 2:03 AM, John Colvin wrote:So you're suggesting that libraries invent their own list of versions for specific architectures / CPU features, which the user then has to specify somehow on the command line? I want to be able to write code that uses standardised versions that work across various D compilers, with the user only needing to type e.g. -march=native on GDC and get the fastest possible code.There's a line between trying to standardize everything and letting add-on libraries be free to innovate. Besides, I think it's a poor design to customize the app for only one SIMD type. A better idea (I've repeated this ad nauseam over the years) is to have n modules, one for each supported SIMD type. Compile and link all of them in, then detect the SIMD type at runtime and call the corresponding module. (This is how the D array ops are currently implemented.) My experience with command line FPU switches is that few users understand what they do and even fewer use them correctly. In fact, I suspect that having a command line FPU switch is too global a hammer. A pragma set in just the functions that need it might be much better. ------- In any case, this is not a blocker for getting the library designed, built and debugged.
Apr 05 2016
On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:On 4/5/2016 2:03 AM, John Colvin wrote: There's a line between trying to standardize everything and letting add-on libraries be free to innovate. Besides, I think it's a poor design to customize the app for only one SIMD type. A better idea (I've repeated this ad nauseam over the years) is to have n modules, one for each supported SIMD type. Compile and link all of them in, then detect the SIMD type at runtime and call the corresponding module. (This is how the D array ops are currently implemented.) My experience with command line FPU switches is few users understand what they do and even fewer use them correctly. In fact, I suspect that having a command line FPU switch is too global a hammer. A pragma set in just the functions that need it might be much better.What is wrong with a scientist writing `-mcpu=native`?------- In any case, this is not a blocker for getting the library designed, built and debugged.Yes, but it is a bad idea to have a set of versions for Phobos, isn't it? Ilya
Apr 05 2016
On 4/5/2016 4:17 AM, 9il wrote:What wrong for scientist to write `-mcpu=native`?Because it would affect all the code in the module and every template it imports, which is a problem if you are using 'static if' and want to compile different pieces with different settings.
Apr 05 2016
On Wednesday, 6 April 2016 at 00:45:54 UTC, Walter Bright wrote:On 4/5/2016 4:17 AM, 9il wrote:99.99% of them do not need to compile code with different settings. Furthermore 90% of them don't know what CPU their supercomputer has. They just want code that is as fast as possible, without googling which CPU instructions are available for their CPU.What wrong for scientist to write `-mcpu=native`?Because it would affect all the code in the module and every template it imports, which is a problem if you are using 'static if' and want to compile different pieces with different settings.
Apr 05 2016
On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:Besides, I think it's a poor design to customize the app for only one SIMD type. A better idea (I've repeated this ad nauseum over the years) is to have n modules, one for each supported SIMD type. Compile and link all of them in, then detect the SIMD type at runtime and call the corresponding module. (This is how the D array ops are currently implemented.)There are many organizations in the world that are building software in-house, where such software is targeted to modern CPU SIMD types, most typically AVX/AVX2 and crypto instructions. In these settings -- many of them scientific compute or big data center operators -- they know what servers they have, what CPU platforms they have. They don't care about portability to the past, older computers and so forth. A runtime check would make no sense for them, not for their baseline, and it would probably be a waste of time for them to design code to run on pre-AVX silicon. (AVX is not new anymore -- it's been around for a few years.) Good examples can be found on Cloudflare's blog, especially Vlad Krasnov's posts. Here's one where he accelerates Golang's crypto libraries: https://blog.cloudflare.com/go-crypto-bridging-the-performance-gap/ Companies like CF probably spend millions of dollars on electricity, and there are some workloads where AVX-optimized code can yield tangible monetary savings. Someone else talked about marking "Broadwell" and other generation names. As others have said, it's better to specify features. I wanted to chime in with a couple of additional examples. Intel's transactional memory accelerating instructions (TSX) are only available on some Broadwell parts because there was a bug in the original implementation (Haswell and early Broadwell) and it's disabled on most. But the new Broadwell server chips have it, and it's a big deal for some DB workloads. 
Similarly, only some Skylake chips have the Secure Guard instructions (SGX), which are very powerful for creating secure enclaves on an untrusted host. On the broader SIMD-as-first-class-citizen issue, I think it would be worth thinking about how to bake SIMD into the language instead of bolting it on. If I were designing a new language in 2016, I would take a fresh look at how SIMD could be baked into a language's core constructs. I'd think about new loop abstractions that could make SIMD easier to exploit, and how to nudge programmers away from serial monotonic mindsets and into more of a SIMD/FMA way of reasoning.
Apr 17 2016
On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:In addition, it's the COMPILER's work, not the programmer's! The compiler SHOULD be able to vectorize the code using SSE/AVX depending on a command-line switch. Why should I write all this merde? Let the compiler do its work. Also the compiler CAN generate multiple versions of one function using different SIMD instructions: the Intel C++ Compiler works this way: it generates a few versions of a function, checks CPU capabilities at run time, and executes the fastest one.Besides, I think it's a poor design to customize the app for only one SIMD type. A better idea (I've repeated this ad nauseum over the years) is to have n modules, one for each supported SIMD type. Compile and link all of them in, then detect the SIMD type at runtime and call the corresponding module. (This is how the D array ops are currently implemented.)There are many organizations in the world that are building software in-house, where such software is targeted to modern CPU SIMD types, most typically AVX/AVX2 and crypto instructions.
Apr 17 2016
On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:Someone else said talked about marking "Broadwell" and other generation names. As others have said, it's better to specify features. I wanted to chime in with a couple of additional examples. Intel's transactional memory accelerating instructions (TSX) are only available on some Broadwell parts because there was a bug in the original implementation (Haswell and early Broadwell) and it's disabled on most. But the new Broadwell server chips have it, and it's a big deal for some DB workloads. Similarly, only some Skylake chips have the Secure Guard instructions (SGX), which are very powerful for creating secure enclaves on an untrusted host.Thanks, I've seen similar comments in LLVM code. I have a question perhaps you can comment on? With LLVM, it is possible to specify something like "+sse3,-sse2" (I did not test whether this actually results in SSE3 instructions being used, but no SSE2 instructions). What should be returned when querying whether "sse3" feature is enabled? Should __traits(targetHasFeature, "sse3") == true mean that implied features (such as sse and sse2) are also available?
Apr 23 2016
On Sat, 23 Apr 2016 10:40:12 +0000, Johan Engelen <j j.nl> wrote:I have a question perhaps you can comment on? With LLVM, it is possible to specify something like "+sse3,-sse2" (I did not test whether this actually results in SSE3 instructions being used, but no SSE2 instructions). What should be returned when querying whether the "sse3" feature is enabled? Should __traits(targetHasFeature, "sse3") == true mean that implied features (such as sse and sse2) are also available?Please do test it. Activating sse3 and disabling sse2 likely causes the compiler to silently re-enable sse2 as a dependency or error out. -- Marco
Apr 23 2016
On Saturday, 23 April 2016 at 10:40:12 UTC, Johan Engelen wrote:On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:If you specify SSE3, you should definitely get SSE2 and plain old SSE with it. SSE3 is a superset of SSE2 and includes all the SSE2 instructions (more than 100 I think.) I'm not sure about your syntax – I thought the hyphen meant to include the option, not remove it, and I haven't seen the addition sign used for those settings. But I haven't done much with those optimization flags. You wouldn't want to exclude SSE2 support because it's becoming the bare minimum baseline for modern systems, the de facto FP unit. Windows 10 requires a CPU with SSE2, as do more and more applications on the archaic Unix-like platforms.Someone else said talked about marking "Broadwell" and other generation names. As others have said, it's better to specify features. I wanted to chime in with a couple of additional examples. Intel's transactional memory accelerating instructions (TSX) are only available on some Broadwell parts because there was a bug in the original implementation (Haswell and early Broadwell) and it's disabled on most. But the new Broadwell server chips have it, and it's a big deal for some DB workloads. Similarly, only some Skylake chips have the Secure Guard instructions (SGX), which are very powerful for creating secure enclaves on an untrusted host.Thanks, I've seen similar comments in LLVM code. I have a question perhaps you can comment on? With LLVM, it is possible to specify something like "+sse3,-sse2" (I did not test whether this actually results in SSE3 instructions being used, but no SSE2 instructions). What should be returned when querying whether "sse3" feature is enabled? Should __traits(targetHasFeature, "sse3") == true mean that implied features (such as sse and sse2) are also available?
May 02 2016
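The implication Joe describes (each x86 SIMD level up to AVX2 includes the previous ones) could be encoded as a simple ordering. This is a hypothetical helper, not an existing API:

```d
// x86's SIMD levels form a chain, so "does feature A imply feature B"
// reduces to an ordering check.
enum Feature { sse, sse2, sse3, ssse3, sse41, sse42, avx, avx2 }

bool implies(Feature a, Feature b)
{
    return a >= b;  // e.g. implies(Feature.sse3, Feature.sse2) is true
}
```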
On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:On 4/4/2016 11:10 PM, 9il wrote: I still don't understand why you cannot just set '-version=xxx' on the command line and then switch off that version in your custom code.I can do it, however I would like to get this information from the compiler. Why? 1. This would help to eliminate configuration bugs. 2. This would reduce work for users and simplify the user experience. 3. This is possible and not very hard to implement, if I am not wrong. Ilya
Apr 05 2016
On 4/5/2016 2:39 AM, 9il wrote:On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:Where does the compiler get the information that it should compile for, say, AFX?On 4/4/2016 11:10 PM, 9il wrote: I still don't understand why you cannot just set '-version=xxx' on the command line and then switch off that version in your custom code.I can do it, however I would like to get this information from compiler. Why? 1. This would help to eliminate configuration bugs. 2. This would reduce work for users and simplified user experience. 3. This is possible and not very hard to implement if I am not wrong.
Apr 05 2016
On Tuesday, 5 April 2016 at 10:30:19 UTC, Walter Bright wrote:On 4/5/2016 2:39 AM, 9il wrote:No idea about AFX. Did you choose AFX so that I could not find an example? You know better than me that GCC and LLVM based compilers have options like march, mcpu, mtarget, mtune and others. And things like `-mcpu=native` or `-march=native` are allowed. IlyaOn Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote: 1. This would help to eliminate configuration bugs. 2. This would reduce work for users and simplified user experience. 3. This is possible and not very hard to implement if I am not wrong.Where does the compiler get the information that it should compile for, say, AFX?
Apr 05 2016
On 4/5/2016 4:07 AM, 9il wrote:On Tuesday, 5 April 2016 at 10:30:19 UTC, Walter Bright wrote:I want to make it clear that dmd does not generate AFX specific code, has no switch to enable AFX code generation and has no basis for setting predefined version identifiers for it.On 4/5/2016 2:39 AM, 9il wrote:No idea about AFX. Do you choose AFX to disallow me to find an example?On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote: 1. This would help to eliminate configuration bugs. 2. This would reduce work for users and simplified user experience. 3. This is possible and not very hard to implement if I am not wrong.Where does the compiler get the information that it should compile for, say, AFX?
Apr 05 2016
On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:I want to make it clear that dmd does not generate AFX specific code, has no switch to enable AFX code generation and has no basis for setting predefined version identifiers for it.How about adding a "__target(...)" compile-time function, that would return false if the compiler doesn't know? __target("broadwell") --> true means: target cpu is broadwell, false means compiler doesn't know or target cpu is not broadwell. Would that work for all?
Apr 05 2016
On Tuesday, 5 April 2016 at 21:41:46 UTC, Johan Engelen wrote:On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:Yes, something like that is what I am looking for. Two nitpicks: 1. __target("broadwell") is not a good API. Something like this would be better: enum target = __target(); // .. use target 2. Is it possible to reflect additional settings about the instruction set? Maybe "broadwell,-avx"?I want to make it clear that dmd does not generate AFX specific code, has no switch to enable AFX code generation and has no basis for setting predefined version identifiers for it.How about adding a "__target(...)" compile-time function, that would return false if the compiler doesn't know? __target("broadwell") --> true means: target cpu is broadwell, false means compiler doesn't know or target cpu is not broadwell. Would that work for all?
Apr 05 2016
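Usage of the feature query being discussed might look like this. The syntax is hypothetical; neither `__target` nor `targetHasFeature` is standardized at this point in the thread:

```d
// Hypothetical: select a kernel at compile time from a
// compiler-provided feature query.
static if (__traits(targetHasFeature, "avx2"))
    enum kernel = "avx2";
else static if (__traits(targetHasFeature, "sse4.2"))
    enum kernel = "sse4";
else
    enum kernel = "generic";
```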
On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d <digitalmars-d puremagic.com> wrote:On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table in our code somewhere of every possible cpu ever known and its associated feature set. Knowing about the feature we're interested in is what we need.I want to make it clear that dmd does not generate AFX specific code, has no switch to enable AFX code generation and has no basis for setting predefined version identifiers for it.How about adding a "__target(...)" compile-time function, that would return false if the compiler doesn't know? __target("broadwell") --> true means: target cpu is broadwell, false means compiler doesn't know or target cpu is not broadwell. Would that work for all?
Apr 06 2016
On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d <digitalmars-d puremagic.com> wrote:Yes, however this can be implemented in a special Phobos module. So compilers would need less work. --Ilya[...]With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know processor model, then we need to keep a compile-time table in our code somewhere if every possible cpu ever known and it's associated feature set. Knowing the feature we're interested is what we need.
Apr 06 2016
On Wednesday, 6 April 2016 at 13:26:51 UTC, 9il wrote:On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:After browsing through some LLVM code, I think is actually very easy for LDC to also tell you about which features (sse2, avx, etc.) a target supports. Probably the most difficult part is defining an API. Ilya made a start here: http://forum.dlang.org/post/eodutgruoofruperrgif forum.dlang.org (but he doesn't like his earlier API "bool a = __target("broadwell")" any more ;-P , I also think enum cpu = __target(); would be nicer)On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d <digitalmars-d puremagic.com> wrote:Yes, however this can be implemented in a spcial Phobos module. So compilers would need less work. --Ilya[...]With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know processor model, then we need to keep a compile-time table in our code somewhere if every possible cpu ever known and it's associated feature set. Knowing the feature we're interested is what we need.
Apr 06 2016
On Wednesday, 6 April 2016 at 14:31:58 UTC, Johan Engelen wrote:Probably the most difficult part is defining an API. Ilya made a start here: http://forum.dlang.org/post/eodutgruoofruperrgif forum.dlang.org (but he doesn't like his earlier API "bool a = __target("broadwell")" any more ;-P , I also think enum cpu = __target(); would be nicer)Ahaha)) --Ilya
Apr 06 2016
On 6 April 2016 at 23:26, 9il via Digitalmars-d <digitalmars-d puremagic.com> wrote:On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:Sure, but it's an ongoing maintenance task, constantly requiring population with metadata for new processors that become available. Remember, most processors are arm processors, and there are like 20 manufacturers of arm chips, and many of those come in a series of minor variations with/without sub-features present, and in a lot of cases, each permutation of features attached to a random manufacturer's arm chip 'X' doesn't actually have a name to describe it. It's also completely impractical to declare a particular arm chip by name when compiling for arm. It's a sloppy relationship even comparing Intel and AMD, let alone the myriad of arm chips available. TL;DR, defining architectures with an intel-centric naming convention is a very bad idea.On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d <digitalmars-d puremagic.com> wrote:Yes, however this can be implemented in a special Phobos module. So compilers would need less work. --Ilya[...]With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know processor model, then we need to keep a compile-time table in our code somewhere if every possible cpu ever known and it's associated feature set. Knowing the feature we're interested is what we need.
Apr 06 2016
On 4/6/2016 7:25 PM, Manu via Digitalmars-d wrote:Sure, but it's an ongoing maintenance task, constantly requiring population with metadata for new processors that become available. Remember, most processors are arm processors, and there are like 20 manufacturers of arm chips, and many of those come in a series of minor variations with/without sub-features present, and in a lot of cases, each permutation of features attached to random manufacturers arm chip 'X' doesn't actually have a name to describe it. It's also completely impractical to declare a particular arm chip by name when compiling for arm. It's a sloppy relationship comparing intel and AMD let alone the myriad of arm chips available. TL;DR, defining architectures with an intel-centric naming convention is a very bad idea.You're not making a good case for a standard language defined set of definitions for all these (they'll always be obsolete, inadequate and probably wrong, as you point out).
Apr 06 2016
On Wed, 6 Apr 2016 20:29:21 -0700, Walter Bright <newshound2 digitalmars.com> wrote:On 4/6/2016 7:25 PM, Manu via Digitalmars-d wrote:We can either define the language in terms of CPU models or features and Manu gave two good reasons to go with features: 1) Typically we end up with "version(SSE4)" and similar in our code, not "version(Haswell)". 2) On ARM chips it turns out difficult to translate models to features to begin with. It wasn't a good or bad case for the feature in general. That said, in the long run Dlang should grow such a feature. Aside from scientific servers there are also a few Linux distributions that compile and install most packages from sources, and telling the compiler to target the host CPU comes naturally there. In practice there is likely some config file that sets an environment variable like CFLAGS to "-march=native" on such systems. I understand that DMD doesn't concern itself with all that, but the D language itself, of which DMD is one implementation, should not artificially be limited compared to popular C/C++ compilers. I died a bit on the inside when I saw Phobos add both popcnt and _popcnt, of which the latter is the version that uses the POPCNT instruction found in newer x86 CPUs. In GCC or LLVM when we use such an intrinsic, the compiler will take a look at the compilation target and pick the optimal code at compile-time. In one micro-benchmark [1], POPCNT was roughly 50 times faster than bit-twiddling. If I wanted an SSE4 version in otherwise generic amd64 code, I would add attribute("target", "+sse4") before the function using popcnt. So in my eyes a system like GCC offers, where you can specify target features on the command line and also override them for specific functions, is a viable solution that simplifies user code (just picking the popcnt, clz, bsr, ... intrinsic will always be optimal) and Phobos code by making _popcnt et al. superfluous. 
In addition, the compiler could later error out on mnemonics in our inline assembly that don't exist on the target. This avoids unexpected "Illegal Instruction" crashes. [1] http://kent-vandervelden.blogspot.de/2009/10/counting-bits-population-count-and.html -- MarcoTL;DR, defining architectures with an intel-centric naming convention is a very bad idea.You're not making a good case for a standard language defined set of definitions for all these (they'll always be obsolete, inadequate and probably wrong, as you point out).
Apr 11 2016
On Thu, 7 Apr 2016 12:25:03 +1000, Manu via Digitalmars-d <digitalmars-d puremagic.com> wrote:On 6 April 2016 at 23:26, 9il via Digitalmars-d <digitalmars-d puremagic.com> wrote:GCC already keeps a CPU <=> feature mapping (after all, it needs to know what features it can use when you specify -mcpu), so for GDC exposing available features isn't more difficult than exposing the CPU type. I'm not sure if you can actually enable/disable CPU features manually without -mcpu? However, available features, and even the type used to describe the CPU, are completely architecture specific in GCC. This means for GDC we have to write custom code for every supported architecture. (We already have to do this for version(Architecture) though.) FYI this is handled in the gcc/config subsystem: https://github.com/gcc-mirror/gcc/tree/master/gcc/config #defines for C/ARM: arm_cpu_builtins in https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm-c.c (__ARM_NEON__ etc.) As you can see, the only common requirement for backend architectures is to call def_or_undef_macro. This means we have to modify the gcc/config files and write replacements for arm_cpu_builtins and similar functions. Known ARM cores and feature sets: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm-cores.def I guess every backend architecture has to provide CPU names for -mcpu, so that's probably the one thing we could expose to D for all architectures. (Names are of course GCC specific, but I guess LLVM should use compatible names.) This is less work to implement in GDC, but you'd have to duplicate the GCC feature table in Phobos. OTOH, standardizing the names and available feature flags means somebody with knowledge in that area has to write down a spec. TLDR: If required we can always expose compiler-specific versions (GNU_NEON/LDC_NEON) even without DMD approval/integration. This should be coordinated with LDC though. 
Somebody has to make a list of needed identifiers, preferably mentioning the matching C macros. Things get much more complicated if you need feature flags not currently used by / present in GCC.On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:Sure, but it's an ongoing maintenance task, constantly requiring population with metadata for new processors that become available. Remember, most processors are arm processors, and there are like 20 manufacturers of arm chips, and many of those come in a series of minor variations with/without sub-features present, and in a lot of cases, each permutation of features attached to random manufacturers arm chip 'X' doesn't actually have a name to describe it. It's also completely impractical to declare a particular arm chip by name when compiling for arm. It's a sloppy relationship comparing intel and AMD let alone the myriad of arm chips available. TL;DR, defining architectures with an intel-centric naming convention is a very bad idea.On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d <digitalmars-d puremagic.com> wrote:Yes, however this can be implemented in a special Phobos module. So compilers would need less work. --Ilya[...]With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table somewhere in our code of every possible CPU ever known and its associated feature set. Knowing the feature we're interested in is what we need.
Apr 07 2016
On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:On 4/5/2016 4:07 AM, 9il wrote:Please consider that D has other compilers, not only DMD. We need a language feature, and I am OK with that feature being useless for DMD. But the fact that DMD cannot optimize code for, say, AVX, AVX2, AVX-512, FMA4, ..., is not a good reason to reject small language changes that would be very helpful for the D community. Yes, only a few of us would use this feature directly; however, many of us would use it under the hood in the BLAS/SIMD-oriented parts of Phobos.On Tuesday, 5 April 2016 at 10:30:19 UTC, Walter Bright wrote:I want to make it clear that dmd does not generate AFX-specific code, has no switch to enable AFX code generation, and has no basis for setting predefined version identifiers for it.On 4/5/2016 2:39 AM, 9il wrote:No idea about AFX. Did you choose AFX so that I could not find an example?On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote: 1. This would help to eliminate configuration bugs. 2. This would reduce work for users and simplify the user experience. 3. This is possible and not very hard to implement, if I am not wrong.Where does the compiler get the information that it should compile for, say, AFX?
Apr 05 2016
On Wednesday, 6 April 2016 at 06:11:15 UTC, 9il wrote:Yes, only few of us would use this feature directly, however, many of us would use this under-the-hood in BLAS/SIMD oriented part of Phobos.Especially since everyone says to use LDC for the fastest code anyway...
Apr 06 2016
On 5 April 2016 at 20:30, Walter Bright via Digitalmars-d <digitalmars-d puremagic.com> wrote:On 4/5/2016 2:39 AM, 9il wrote:I would add that GDC and LDC have such compiler flags and it's possible that they could pass the state of those flags through as versions, but all compilers need to agree on the set of versions that will be defined for this purpose. If DMD users express them as -version=[STANDARD_VERSION_NAME], that's fine, I guess, but a proper flag would help avoid the situation where people get the version names wrong, and it feels a little bit more deliberate. Setting a version this way might lead them to presume that it's just an arbitrary setting by the author of the build script, and not actually an agreed standard name that GDC and LDC also produce from their compiler flags. But at very least, the important detail is that the version ID's are standardised and shared among all compilers.On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:Where does the compiler get the information that it should compile for, say, AFX?On 4/4/2016 11:10 PM, 9il wrote: I still don't understand why you cannot just set '-version=xxx' on the command line and then switch off that version in your custom code.I can do it, however I would like to get this information from compiler. Why? 1. This would help to eliminate configuration bugs. 2. This would reduce work for users and simplified user experience. 3. This is possible and not very hard to implement if I am not wrong.
Apr 06 2016
On 4/6/2016 5:36 AM, Manu via Digitalmars-d wrote:But at very least, the important detail is that the version ID's are standardised and shared among all compilers.It's a reasonable suggestion; some points: 1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time. 2. I'm not sure these global settings are the best approach, especially if one is writing applications that dynamically adjusts based on the CPU the user is running on. The main trouble comes about when different modules are compiled with different settings. What happens with template code generation, when the templates are pulled from different modules? What happens when COMDAT functions are generated? (The linker picks one arbitrarily and discards the others.) Which settings wind up in the executable will be not easily predictable. I suspect that using a pragma would be a much better approach: pragma(SIMD, AFX) { ... code ... } Doing it on the command line is certainly the traditional way, but it strikes me as being bug-prone and as unhygienic and obsolete as the C preprocessor is (for similar reasons).
Apr 06 2016
On 7 April 2016 at 10:42, Walter Bright via Digitalmars-d <digitalmars-d puremagic.com> wrote:On 4/6/2016 5:36 AM, Manu via Digitalmars-d wrote:It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.But at very least, the important detail is that the version ID's are standardised and shared among all compilers.It's a reasonable suggestion; some points: 1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.2. I'm not sure these global settings are the best approach, especially if one is writing applications that dynamically adjusts based on the CPU the user is running on.They are necessary to provide a baseline. It is typical when building code that you specify a min-spec. This is what's used by default throughout the application. Runtime selection is not practical in a broad sense. Emitting small fragments of SIMD here and there will probably take a loss if they are all surrounded by a runtime selector. SIMD is all about pipelining, and runtime branches on SIMD version are antithesis to good SIMD usage; they can't be applied for small-scale deployment. In my experience, runtime selection is desirable for large scale instantiations at an outer level of the work loop. I've tried to design this intent in my library, by making each simd API capable of receiving SIMD version information via template arg, and within the library, the version is always passed through to dependent calls. The Idea is, if you follow this pattern; propagating a SIMD version template arg through to your outer function, then you can instantiate your higher-level work function for any number of SIMD feature combinations you feel is appropriate. 
Naturally, this process requires a default, otherwise this usage baggage will cloud the API everywhere (rather than in the few cases where a developer specifically wants to make use of it), and many developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1 in my applications, xbox developers would choose AVX1, it's very application/target-audience specific, but SSE2 is the only reasonable selection if we are not to accept a hint from the command line.The main trouble comes about when different modules are compiled with different settings. What happens with template code generation, when the templates are pulled from different modules? What happens when COMDAT functions are generated? (The linker picks one arbitrarily and discards the others.) Which settings wind up in the executable will be not easily predictable.In my library design, the baseline simd version (expected from the compiler) is mangled into the symbols, just as in the case a user overrides it when instantiating a code path that may be selected on runtime branch. I had imagined this would solve such link related symbol selection problems. Can you think of cases where this is insufficient?I suspect that using a pragma would be a much better approach: pragma(SIMD, AFX) { ... code ... } Doing it on the command line is certainly the traditional way, but it strikes me as being bug-prone and as unhygienic and obsolete as the C preprocessor is (for similar reasons).I've done it with a template arg because it can be manually propagated, and users can extrapolate the pattern into their outer work functions, which can then easily have multiple versions instantiated for runtime selection. I think it's also important to mangle it into the symbol name for the reasons I mention above.
Apr 06 2016
On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround: gdc -simd=AFX foo.d becomes: gdc -simd=AFX -version=AFX foo.d It's even simpler if you use a makefile variable: FPU=AFX gdc -simd=$(FPU) -version=$(FPU) You also mentioned being blocked (i.e. demotivated) for *years* by this, and I assume that may be because we don't care about SIMD support. That would be wrong, as I care a lot about it. But I had no idea you were having a problem with this, as you did not file any bug reports. Suffering in silence is never going to work :-)1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.It is not necessary to do it that way. Call std.cpuid to determine what is available at runtime, and issue an error message if not. There is no runtime cost to that. In fact, it has to be done ANYWAY, as it isn't user friendly to seg fault trying to execute instructions that do not exist.2. I'm not sure these global settings are the best approach, especially if one is writing applications that dynamically adjusts based on the CPU the user is running on.They are necessary to provide a baseline. It is typical when building code that you specify a min-spec. This is what's used by default throughout the application.Runtime selection is not practical in a broad sense. Emitting small fragments of SIMD here and there will probably take a loss if they are all surrounded by a runtime selector. 
SIMD is all about pipelining, and runtime branches on SIMD version are antithesis to good SIMD usage; they can't be applied for small-scale deployment. In my experience, runtime selection is desirable for large scale instantiations at an outer level of the work loop. I've tried to design this intent in my library, by making each simd API capable of receiving SIMD version information via template arg, and within the library, the version is always passed through to dependent calls. The Idea is, if you follow this pattern; propagating a SIMD version template arg through to your outer function, then you can instantiate your higher-level work function for any number of SIMD feature combinations you feel is appropriate.Doing it at a high level is what I meant, not for each SIMD code fragment.Naturally, this process requires a default, otherwise this usage baggage will cloud the API everywhere (rather than in the few cases where a developer specifically wants to make use of it), and many developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1 in my applications, xbox developers would choose AVX1, it's very application/target-audience specific, but SSE2 is the only reasonable selection if we are not to accept a hint from the command line.I still don't see how it is a problem to do the switch at a high level. Heck, you could put the ENTIRE ENGINE inside a template, have a template parameter be the instruction set, and instantiate the template for each supported instruction set. Then, void app(int simd)() { ... my fabulous app ... } int main() { auto fpu = core.cpuid.getfpu(); switch (fpu) { case SIMD: app!(SIMD)(); break; case SIMD4: app!(SIMD4)(); break; default: error("unsupported FPU"); exit(1); } }I've done it with a template arg because it can be manually propagated, and users can extrapolate the pattern into their outer work functions, which can then easily have multiple versions instantiated for runtime selection. 
I think it's also important to mangle it into the symbol name for the reasons I mention above.Note that version identifiers are not usable directly as template parameters. You'd have to set up a mapping. And yes, if mangled in as part of the symbol, the linker won't pick the wrong one.
Apr 06 2016
On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround: gdc -simd=AFX foo.d becomes: gdc -simd=AFX -version=AFX foo.d It's even simpler if you use a makefile variable: FPU=AFX gdc -simd=$(FPU) -version=$(FPU)ldc -mcpu=native becomes: ????I still don't see how it is a problem to do the switch at a high level. Heck, you could put the ENTIRE ENGINE inside a template, have a template parameter be the instruction set, and instantiate the template for each supported instruction set. Then, void app(int simd)() { ... my fabulous app ... } int main() { auto fpu = core.cpuid.getfpu(); switch (fpu) { case SIMD: app!(SIMD)(); break; case SIMD4: app!(SIMD4)(); break; default: error("unsupported FPU"); exit(1); } }1. Executable size will grow with every instruction set release 2. BLAS already has a big executable size And the main point: 3. This would not solve the problem for a generic BLAS implementation for Phobos at all! How would you force the compiler to USE and NOT USE specific vector permutations, for example, in the same object file? Yes, I know, DMD does not have permutations. No, I don't want to write permutations for each architecture. Why? Because I can write simple D code that generates a single LLVM IR representation that works for ALL targets! Best regards, Ilya
Apr 07 2016
On 4/7/2016 12:59 AM, 9il wrote:1. Executable size will grow with every instruction set releaseYes, and nobody cares. With virtual memory and demand loading, unexecuted code will never be loaded off of disk and will never consume memory space. And with a 64 bit address space, there will never be a shortage of virtual address space. It will consume space on your 1 terabyte drive. Meh. I have several of those drives, and what consumes space is video, not code binaries :-)3. This would not solve the problem for generic BLAS implementation for Phobos at all! How you would force compiler to USE and NOT USE specific vector permutations for example in the same object file? Yes, I know, DMD has not permutations. No, I don't want to write permutation for each architecture. Why? I can write simple D code that generates single LLVM IR code which would work for ALL targets!There's no reason for the compiler to make target CPU information available when writing generic code.
Apr 07 2016
On Thursday, 7 April 2016 at 09:41:06 UTC, Walter Bright wrote:On 4/7/2016 12:59 AM, 9il wrote:what about a 1 GB 2D game for a phone, or maybe a clock?1. Executable size will grow with every instruction set releaseYes, and nobody cares. With virtual memory and demand loading, unexecuted code will never be loaded off of disk and will never consume memory space. And with a 64 bit address space, there will never be a shortage of virtual address space. It will consume space on your 1 terabyte drive. Meh. I have several of those drives, and what consumes space is video, not code binaries :-)This is not true for a BLAS based on D. You don't want to see the opportunities. The final result of your dogmatic decision is that code will be slower with DMD, while LDC and GDC implement the required simple features. I just wanted to write fast code for DMD too.3. This would not solve the problem for a generic BLAS implementation for Phobos at all! How would you force the compiler to USE and NOT USE specific vector permutations, for example, in the same object file? Yes, I know, DMD does not have permutations. No, I don't want to write permutations for each architecture. Why? Because I can write simple D code that generates a single LLVM IR representation that works for ALL targets!There's no reason for the compiler to make target CPU information available when writing generic code.
Apr 07 2016
On Thursday, 7 April 2016 at 10:03:50 UTC, 9il wrote:This is not true for BLAS based on D.Perhaps if you provide him a simplified example he might see what you're talking about?
Apr 07 2016
On Thursday, 7 April 2016 at 12:35:51 UTC, jmh530 wrote:On Thursday, 7 April 2016 at 10:03:50 UTC, 9il wrote:This is not true for BLAS based on D.Perhaps if you provide him a simplified example he might see what you're talking about?He knows what I am talking about. This is about architecture/style/concepts. If Walter disagrees with this, then nobody can change it.
Apr 07 2016
On Thu, 7 Apr 2016 02:41:06 -0700, Walter Bright <newshound2 digitalmars.com> wrote:Actually for GDC/GCC you can't even write functions using certain SIMD stuff as 'generic' code. Unless you use -mavx or -march, the builtins are not exposed to user code. IIRC the compiler even complains about inline ASM if you use unsupported instructions. You also can't always compile with the 'biggest' feature set, as GCC might use these features in codegen. TLDR; for GCC/GDC you will have to use target flags / attribute(target) to mix feature sets.3. This would not solve the problem for a generic BLAS implementation for Phobos at all! How would you force the compiler to USE and NOT USE specific vector permutations, for example, in the same object file? Yes, I know, DMD does not have permutations. No, I don't want to write permutations for each architecture. Why? Because I can write simple D code that generates a single LLVM IR representation that works for ALL targets!There's no reason for the compiler to make target CPU information available when writing generic code.
Apr 07 2016
On Wed, 6 Apr 2016 20:27:31 -0700, Walter Bright <newshound2 digitalmars.com> wrote:On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:The problem is that -march=x can set more than one feature flag. So instead of gdc -march=armv7-a you have to do gdc -march=armv7-a -fversion=ARM_FEATURE_CRC32 -fversion=ARM_FEATURE_UNALIGNED ... So you have to know exactly which features are supported for a CPU. Essentially you have to duplicate the CPU<=>feature database already present in GCC (and likely LLVM too) in your Makefile. And you'll need -march=armv7-a anyway to make sure the GCC codegen can use these features as well. So this issue is not a blocker, but what you propose is a workaround at best, not a solution.I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround: gdc -simd=AFX foo.d becomes: gdc -simd=AFX -version=AFX foo.d1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.
Apr 07 2016
On 4/7/2016 3:15 AM, Johannes Pfau wrote:The problem is that -march=x can set more than one feature flag. So instead of gdc -march=armv7-a you have to do gdc -march=armv7-a -fversion=ARM_FEATURE_CRC32 -fversion=ARM_FEATURE_UNALIGNED ... So you have to know exactly which features are supported for a CPU. Essentially you have to duplicate the CPU<=>feature database already present in GCC (and likely LLVM too) in your Makefile. And you'll need -march=armv7-a anyway to make sure the GCC codegen can use these features as well. So this issue is not a blocker, but what you propose is a workaround at best, not a solution.Having a veritable blizzard of these predefined versions, which are constantly obsoleted as new ones appear, seems like a serious problem when trying to standardize the language.
Apr 07 2016
On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:Then, void app(int simd)() { ... my fabulous app ... } int main() { auto fpu = core.cpuid.getfpu(); switch (fpu) { case SIMD: app!(SIMD)(); break; case SIMD4: app!(SIMD4)(); break; default: error("unsupported FPU"); exit(1); } }glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos. Regards, Kai
Apr 07 2016
On Thu, 07 Apr 2016 10:52:42 +0000, Kai Nacke <kai redstar.de> wrote:On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:Then, void app(int simd)() { ... my fabulous app ... } int main() { auto fpu = core.cpuid.getfpu(); switch (fpu) { case SIMD: app!(SIMD)(); break; case SIMD4: app!(SIMD4)(); break; default: error("unsupported FPU"); exit(1); } }glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos. Regards, KaiAvailable in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? A platform-independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26a
Apr 07 2016
On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:Am Thu, 07 Apr 2016 10:52:42 +0000 schrieb Kai Nacke <kai redstar.de>:I thought that the ifunc mechanism means an indirect call (i.e. a function ptr is set at the start of the program) ? That would be the same as what you are doing without performance difference. https://gcc.gnu.org/wiki/FunctionMultiVersioning "To keep the cost of dispatching low, the IFUNC mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump ** indirect ** instruction." (emphasis mine) I looked into this some time ago and did not see a reason to use the ifunc mechanism (which would not be available on Windows). I thought it should be implementable in a library, exactly as you did in your dpaste! :-) (does `&foo` return `impl`?)glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos. Regards, KaiAvailable in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26a
Apr 07 2016
On Thu, 07 Apr 2016 13:27:05 +0000, Johan Engelen <j j.nl> wrote:On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:The simple variant I've posted needs an additional branch on every function call. If we instead initialize the function pointer in a shared static ctor, there's indeed no performance difference. The main problem here is that, because of cyclic constructor detection, it will be more difficult to implement a generic template solution. http://www.airs.com/blog/archives/403 "An alternative to all this linker stuff would be a variable holding a function pointer. The function could then be written in assembler to do the indirect jump. The variable would be initialized at program startup time. The efficiency would be the same. The address of the function would be the address of the indirect jump, so function pointers would compare consistently."On Thu, 07 Apr 2016 10:52:42 +0000, Kai Nacke <kai redstar.de> wrote:I thought that the ifunc mechanism means an indirect call (i.e. a function ptr is set at the start of the program)? That would be the same as what you are doing, with no performance difference. https://gcc.gnu.org/wiki/FunctionMultiVersioning "To keep the cost of dispatching low, the IFUNC mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump ** indirect ** instruction." (emphasis mine)glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos. Regards, KaiAvailable in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? 
A platform independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26aI looked into this some time ago and did not see a reason to use the ifunc mechanism (which would not be available on Windows). I thought it should be implementable in a library, exactly as you did in your dpaste! :-)(does `&foo` return `impl`?)No, &foo will return the address of the wrapper function. I'm not sure if we can solve this. IIRC we can't overload &. Here's the alternative using a constructor which makes the address accessible. The syntax will still be different though: __gshared void function() foo; shared static this() { foo = &foo1; } auto addr = &foo; // address of the variable addr = cast(void*)foo; // the function address
Apr 07 2016
On Thursday, 7 April 2016 at 14:46:06 UTC, Johannes Pfau wrote:Am Thu, 07 Apr 2016 13:27:05 +0000 schrieb Johan Engelen <j j.nl>:Yep exactly. For target multiversioned functions, I thought one would want to create one static ctor that calls cpuid once and sets all function ptrs of that module.On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:The simple variant I've posted needs an additional branch on every function call. If we instead initialize the function pointer in a shared static ctor there's indeed no performance difference.Am Thu, 07 Apr 2016 10:52:42 +0000 schrieb Kai Nacke <kai redstar.de>:I thought that the ifunc mechanism means an indirect call (i.e. a function ptr is set at the start of the program) ? That would be the same as what you are doing without performance difference. https://gcc.gnu.org/wiki/FunctionMultiVersioning "To keep the cost of dispatching low, the IFUNC mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump ** indirect ** instruction." (emphasis mine)glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos. Regards, KaiAvailable in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26aOK. Well, the target multifunctioning would need compiler support anyway and it is easy to do something slightly different for `&foo` when foo is a multiversioned function. 
This should be fairly easy to implement in LDC, with some smarts needed in ordering and selecting the best function version.(does `&foo` return `impl`?)No, &foo will return the address of the wrapper function. I'm not sure if we can solve this. IIRC we can't overload &.
Apr 07 2016
On 4/7/2016 3:52 AM, Kai Nacke wrote:On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:We already have core.cpuid, which covers most of what that article talks about. The indirect function thing appears to be a way to selectively load from various DLLs. But that can be done anyway with core.cpuid and dynamic DLL loading, so I'm not sure what advantage it brings.Then, void app(int simd)() { ... my fabulous app ... } int main() { auto fpu = core.cpuid.getfpu(); switch (fpu) { case SIMD: app!(SIMD)(); break; case SIMD4: app!(SIMD4)(); break; default: error("unsupported FPU"); exit(1); } }glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos.
Apr 07 2016
On 7 April 2016 at 13:27, Walter Bright via Digitalmars-d <digitalmars-d puremagic.com> wrote:On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:Sure. I've done this in my own tests. I just never published that anyone else should do it.I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround: gdc -simd=AFX foo.d becomes: gdc -simd=AFX -version=AFX foo.d It's even simpler if you use a makefile variable: FPU=AFX gdc -simd=$(FPU) -version=$(FPU)1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.You also mentioned being blocked (i.e. demotivated) for *years* by this, and I assume that may be because we don't care about SIMD support. That would be wrong, as I care a lot about it. But I had no idea you were having a problem with this, as you did not file any bug reports. Suffering in silence is never going to work :-)There's been threads, but sure, I could have done more to push it along. Motivation is a complex and not particularly logical emotion, there's a lot of factors feeding into it. Not least of which, is that I haven't been working in games for a while, which means I haven't depended on it for my work. Don't take that to read I have lost interest in the support, just that the pressure is reduced. You'll have noticed that C++ interaction is my recent focus, since that's directly related to my current day-job, and the path that I need to solve now to get D into my work. That's consuming almost 100% of my D-time-allocation... 
if I could ever manage to just kick that goal, it might free me up >_< .. I keep on trying.The author still needs to be able to control at compile-time what min-spec shall be supported. I agree the check is valuable, but I think it's an unrelated detail.It is not necessary to do it that way. Call std.cpuid to determine what is available at runtime, and issue an error message if not. There is no runtime cost to that. In fact, it has to be done ANYWAY, as it isn't user friendly to seg fault trying to execute instructions that do not exist.2. I'm not sure these global settings are the best approach, especially if one is writing applications that dynamically adjusts based on the CPU the user is running on.They are necessary to provide a baseline. It is typical when building code that you specify a min-spec. This is what's used by default throughout the application.Sure, so you agree we need a mechanism for the author to tune the default selection then? Or are you suggesting SSE2 is 'fine' as a default? (ie, that is what is implied by D_SIMD)Runtime selection is not practical in a broad sense. Emitting small fragments of SIMD here and there will probably take a loss if they are all surrounded by a runtime selector. SIMD is all about pipelining, and runtime branches on SIMD version are antithesis to good SIMD usage; they can't be applied for small-scale deployment. In my experience, runtime selection is desirable for large scale instantiations at an outer level of the work loop. I've tried to design this intent in my library, by making each simd API capable of receiving SIMD version information via template arg, and within the library, the version is always passed through to dependent calls. 
The idea is, if you follow this pattern, propagating a SIMD version template arg through to your outer function, then you can instantiate your higher-level work function for any number of SIMD feature combinations you feel is appropriate.Doing it at a high level is what I meant, not for each SIMD code fragment.It's not a problem, that's exactly my design, but it's not a universal solution.Naturally, this process requires a default, otherwise this usage baggage will cloud the API everywhere (rather than in the few cases where a developer specifically wants to make use of it), and many developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1 in my applications, xbox developers would choose AVX1, it's very application/target-audience specific, but SSE2 is the only reasonable selection if we are not to accept a hint from the command line.I still don't see how it is a problem to do the switch at a high level.Heck, you could put the ENTIRE ENGINE inside a template, have a template parameter be the instruction set, and instantiate the template for each supported instruction set. Then, void app(int simd)() { ... my fabulous app ... } int main() { auto fpu = core.cpuid.getfpu(); switch (fpu) { case SIMD: app!(SIMD)(); break; case SIMD4: app!(SIMD4)(); break; default: error("unsupported FPU"); exit(1); } }Sure, I've designed for this specifically, but it's not practical to wind this all the way to the top of the stack. Some hot code will make use of this pattern, but small fragments that appear throughout the code don't want to have this baggage applied. They should just work with the developer's deliberately selected default. It's not worth runtime selection on small deployments. 
You will likely end up with numerous helper functions, which when involved in the runtime-selected loops, would have different versions generated appropriately, but when these helper functions appear on their own, they would want to use a sensible default.I guess you haven't looked at my code, but yes, it's all mapped to enums used by the templates. The versions would assign a constant used as the template's default arg.I've done it with a template arg because it can be manually propagated, and users can extrapolate the pattern into their outer work functions, which can then easily have multiple versions instantiated for runtime selection. I think it's also important to mangle it into the symbol name for the reasons I mention above.Note that version identifiers are not usable directly as template parameters. You'd have to set up a mapping.
Apr 07 2016
On 4/7/2016 5:27 PM, Manu via Digitalmars-d wrote:You'll have noticed that C++ interaction is my recent focus, since that's directly related to my current day-job, and the path that I need to solve now to get D into my work.We recognize C++ interoperability to be a key feature of D. I hope you like the support you got with the C++ virtual functions! I got bogged down recently with getting the C++ exception handling support working better, hopefully we've turned the corner on that one. I'd hoped to be further along at the moment with C++ interoperability (but it's always going to be a work in progress).That's consuming almost 100% of my D-time-allocation... if I could ever manage to just kick that goal, it might free me up >_< .. I keep on trying.I do appreciate your efforts in this direction.From the command line, probably not. I like the pragma thing better.Doing it at a high level is what I meant, not for each SIMD code fragment.Sure, so you agree we need a mechanism for the author to tune the default selection then?Or are you suggesting SSE2 is 'fine' as a default? (ie, that is what is implied by D_SIMD)It is fine as a default, as it is the baseline minimum machine D is expecting.
Apr 07 2016
Am Wed, 6 Apr 2016 17:42:30 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:On 4/6/2016 5:36 AM, Manu via Digitalmars-d wrote:better ;-) If you've got a version() block in a template and compile two modules using the same template with different -version flags you'll have exactly that problem. Have an enum myFlag = x; in a config module + static if => problem solved. The problem isn't having global settings, the problem is having to manually specify the same global setting for every source file.But at very least, the important detail is that the version ID's are standardised and shared among all compilers.It's a reasonable suggestion; some points: 1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time. 2. I'm not sure these global settings are the best approach, especially if one is writing applications that dynamically adjusts based on the CPU the user is running on. The main trouble comes about when different modules are compiled with different settings. What happens with template code generation, when the templates are pulled from different modules? What happens when COMDAT functions are generated? (The linker picks one arbitrarily and discards the others.) Which settings wind up in the executable will be not easily predictable.
Apr 07 2016
On Thursday, 7 April 2016 at 00:42:30 UTC, Walter Bright wrote:[...] especially if one is writing applications that dynamically adjusts based on the CPU the user is running on. The main trouble comes about when different modules are compiled with different settings. What happens with template code generation, when the templates are pulled from different modules? What happens when COMDAT functions are generated? (The linker picks one arbitrarily and discards the others.) Which settings wind up in the executable will be not easily predictable. I suspect that using a pragma would be a much better approach: pragma(SIMD, AFX) { ... code ... } Doing it on the command line is certainly the traditional way, but it strikes me as being bug-prone and as unhygienic and obsolete as the C preprocessor is (for similar reasons).Have you seen how GCC's function multiversioning works [1]? This whole thread is far too low-level for me and I'm not sure if GCC's dispatcher overhead is OK, but the syntax looks really nice and it seems to address all of your concerns. __attribute__ ((target ("default"))) int foo () { // The default version of foo. return 0; } __attribute__ ((target ("sse4.2"))) int foo () { // foo version for SSE4.2 return 1; } __attribute__ ((target ("arch=atom"))) int foo () { // foo version for the Intel ATOM processor return 2; } [1] https://gcc.gnu.org/wiki/FunctionMultiVersioning -Alexander
Apr 12 2016
Am Tue, 12 Apr 2016 10:55:18 +0000 schrieb xenon325 <anm programmer.net>:Have you seen how GCC's function multiversioning works [1]? This whole thread is far too low-level for me and I'm not sure if GCC's dispatcher overhead is OK, but the syntax looks really nice and it seems to address all of your concerns. __attribute__ ((target ("default"))) int foo () { // The default version of foo. return 0; } __attribute__ ((target ("sse4.2"))) int foo () { // foo version for SSE4.2 return 1; } [1] https://gcc.gnu.org/wiki/FunctionMultiVersioning -AlexanderAwesome! I just tried it and it ties runtime and compile-time selection of code paths together in an unprecedented way! As you said, there is the runtime dispatcher overhead if you just compile normally. But if you specifically compile with "gcc -msse4.2 <…>", GCC calls the correct function directly: 0000000000400512 <main>: 400512: e8 f5 ff ff ff callq 40050c <_Z3foov.sse4.2> 400517: f3 c3 repz retq 400519: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) For demonstration purposes I disabled the inliner here. The best thing about it is that for users of libraries employing this technique, it happens behind the scenes and user code stays clean of instrumentation. No ugly versioning and hand written switch-case blocks! (It currently only works with C++ on x86, but I like the general direction.) -- Marco
Apr 12 2016
The system seems to call CPUID at startup and for every multiversioned function, patch an offset in its dispatcher function. The dispatcher function is then nothing more than a jump relative to RIP, e.g.: jmp QWORD PTR [rip+0x200bf2] This is as efficient as it gets short of using whole-program optimization. -- Marco
Apr 12 2016
On Tuesday, 12 April 2016 at 10:55:18 UTC, xenon325 wrote:Have you seen how GCC's function multiversioning [1] ?I've been thinking about the gcc multiversioning since you mentioned it previously. I keep thinking about how the optimal algorithm for something like matrix multiplication depends on the size of the matrices. For instance, you might do something for very small matrices that just relies on one processor, then you add in SIMD as the size grows, then you add in multiple CPUs, then you add in the GPU (or maybe you add before CPUs), then you add in multiple computers. I don't know how some of those choices would get made at compile time for dynamic arrays. Would need some kind of run-time approach. At least for static arrays, you could do multiple versions of the function and then use template constraints to call whichever function. Some tuning would be necessary.
Apr 15 2016
Am Fri, 15 Apr 2016 18:54:12 +0000 schrieb jmh530 <john.michael.hall gmail.com>:On Tuesday, 12 April 2016 at 10:55:18 UTC, xenon325 wrote:GCC only has one architecture as a target at a time. As long as this is so, there is little point in contemplating how it handles multiple architectures and network traffic. :) CPUs run the bulk of code, from booting over kernel and drivers to applications and there will always be something that can be optimized if it is statically known that a certain instruction set is supported. To pick up your matrices example, imagine OpenGL code that has some 4x4 matrices that are in no direct relation to each other. The GPU is only good at bulk processing, and it doesn't apply here. So you need the general purpose processor and benefit from the knowledge that some SSE level is supported. In general, when you have to make many quick decisions on small amounts of data the GPU or networking are out of question. -- MarcoHave you seen how GCC's function multiversioning [1] ?I've been thinking about the gcc multiversioning since you mentioned it previously. I keep thinking about how the optimal algorithm for something like matrix multiplication depends on the size of the matrices. For instance, you might do something for very small matrices that just relies on one processor, then you add in SIMD as the size grows, then you add in multiple CPUs, then you add in the GPU (or maybe you add before CPUs), then you add in multiple computers.
Apr 16 2016
On Tuesday, 5 April 2016 at 09:39:21 UTC, 9il wrote:3. This is possible and not very hard to implement if I am not wrong.Last time I looked into this (related to implementing target, see [1]), I only found some Clang code dealing with this, but now I found LLVM functions about architectures, cpus, features, etc. So indeed I also think it will be relatively easy indeed to implement at least rudimentary support for what you'd want. [1] http://forum.dlang.org/post/eodutgruoofruperrgif forum.dlang.org
Apr 05 2016
On Monday, 4 April 2016 at 20:29:11 UTC, Walter Bright wrote:I'm not a SIMD expert, I've only played around with SIMD a little, but this confuses me. version(D_SIMD) will tell you when SIMD is implemented, but not what type of SIMD. For instance, if I am on a machine that can use AVX2 instructions, then code in a version(D_SIMD) block will execute, but it should also execute if the processor only supports SSE4. What if the writer of an SIMD library wants to have code execute differently if SSE4 is detected instead of AVX2?- Allowed sets of instructions: for example, AVX2, FMA4Done. D_SIMD
Apr 04 2016
On 4/4/2016 2:11 PM, jmh530 wrote:version(D_SIMD) will tell you when SIMD is implemented, but not what type of SIMD.The first SIMD level.For instance, if I am on a machine that can use AVX2 instructions, then code in a version(D_SIMD) block will execute, but it should also execute if the processor only supports SSE4. What if the writer of an SIMD library wants to have code execute differently if SSE4 is detected instead of AVX2?Use a runtime switch (see core.cpuid).
Apr 04 2016
Am Mon, 4 Apr 2016 13:29:11 -0700 schrieb Walter Bright <newshound2 digitalmars.com>:On 4/4/2016 7:02 AM, 9il wrote:I wonder if answers like this are meant to be filled into a template like this: "We have [$2] in place for that. If that doesn't get the job $1, please report whatever is missing to bugzilla. Thanks!" Since otherwise it should be clear that the distinction between AVX2 and FMA4 asks for something more specialized than D_SIMD, which is basically the same as checking the front-end __VERSION__. -- MarcoDone.What kind of information?Target cpu configuration: - CPU architecture (done)- Count of FP/Integer registers??- Allowed sets of instructions: for example, AVX2, FMA4Done. D_SIMD
Apr 11 2016
On 3 April 2016 at 16:14, 9il via Digitalmars-d <digitalmars-d puremagic.com> wrote:On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:My SIMD implementation has been blocked on that for years too. I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.I'm currently working on a templated arrayop implementation (using RPN to encode ASTs). So far things worked out great, but now I got stuck b/c apparently none of the D compilers has a working SIMD implementation (maybe GDC has but it's very difficult to work w/ the 2.066 frontend). https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d I don't want to do anything fancy, just unaligned loads, stores, and integral mul/div. Is this really the current state of SIMD or am I missing sth.? -MartinHello Martin, Is it possible to introduce compile time information about target platform? I am working on BLAS from scratch implementation. And it is no hope to create something useable without CT information about target. Best regards, Ilya
Apr 03 2016
On 4/3/2016 12:39 AM, Manu via Digitalmars-d wrote:My SIMD implementation has been blocked on that for years too.First I've heard of that.I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.Here is a list of all the open Bugzilla issues tagged with the keyword SIMD: https://issues.dlang.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=SIMD%2C%20&keywords_type=allwords&list_id=207488&query_format=advanced There is no issue I can find about being blocked for years on SIMD flags. I guarantee you that if you never report the problems you're having, you will suffer in silence and they will not get fixed :-)
Apr 03 2016
On Sunday, 3 April 2016 at 22:00:51 UTC, Walter Bright wrote:He's talked about it on his github PR: https://github.com/D-Programming-Language/phobos/pull/2862I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.Here is a list of all the open Bugzilla issues tagged with the keyword SIMD: https://issues.dlang.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=SIMD%2C%20&keywords_type=allwords&list_id=207488&query_format=advanced There is no issue I can find about being blocked for years on SIMD flags. I guarantee you that if you never report the problems you're having, you will suffer in silence and they will not get fixed :-)
Apr 03 2016
On 4/3/2016 7:12 PM, Jack Stouffer wrote:On Sunday, 3 April 2016 at 22:00:51 UTC, Walter Bright wrote:Yes, but I never noticed that until you posted a link. The place to file bug reports and enhancement requests is on Bugzilla. Otherwise nobody will see them. It's why we have Bugzilla.There is no issue I can find about being blocked for years on SIMD flags. I guarantee you that if you never report the problems you're having, you will suffer in silence and they will not get fixed :-)He's talked about it on his github PR: https://github.com/D-Programming-Language/phobos/pull/2862
Apr 03 2016
On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:My SIMD implementation has been blocked on that for years too. I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.I made a bug to track this problem: https://issues.dlang.org/show_bug.cgi?id=15873
Apr 04 2016
On Monday, 4 April 2016 at 17:23:49 UTC, Jack Stouffer wrote:I made a bug to track this problem: https://issues.dlang.org/show_bug.cgi?id=15873You might add link to this thread and github where he made the original comment..
Apr 04 2016
On 4/4/2016 10:27 AM, jmh530 wrote:On Monday, 4 April 2016 at 17:23:49 UTC, Jack Stouffer wrote:http://www.digitalmars.com/d/archives/digitalmars/D/Any_usable_SIMD_implementation_282806.htmlI made a bug to track this problem: https://issues.dlang.org/show_bug.cgi?id=15873You might add link to this thread and github where he made the original comment..
Apr 04 2016
On 4/4/2016 10:23 AM, Jack Stouffer wrote:On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:I believe the issue is fixed (for DMD) with a documentation improvement.My SIMD implementation has been blocked on that for years too. I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.I made a bug to track this problem: https://issues.dlang.org/show_bug.cgi?id=15873
Apr 04 2016
On Monday, 4 April 2016 at 19:43:43 UTC, Walter Bright wrote:On 4/4/2016 10:23 AM, Jack Stouffer wrote:I believe the problem is that you can't rely on D_SIMD that SSE4, FMA, AVX2, AVX-512, etc. are available on the target platform. See also http://forum.dlang.org/post/fnrmgfvqmykttsuuxxib forum.dlang.org.On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:I believe the issue is fixed (for DMD) with a documentation improvement.My SIMD implementation has been blocked on that for years too. I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.I made a bug to track this problem: https://issues.dlang.org/show_bug.cgi?id=15873
Apr 04 2016
On 4/4/2016 12:55 PM, ZombineDev wrote:Right, you can't. But the issue here is having the compiler give a predefined version for what is the MINIMUM that the target machine supports. And the D_SIMD does that. There is no purpose to the compiler predefining a version for an instruction set it does not generate code for. You can also do a runtime test with http://dlang.org/phobos/core_cpuid.htmlI believe the issue is fixed (for DMD) with a documentation improvement.I believe the problem is that you can't rely on D_SIMD that SSE4, FMA, AVX2, AVX-512, etc. are available on the target platform. See also http://forum.dlang.org/post/fnrmgfvqmykttsuuxxib forum.dlang.org.
Apr 04 2016
On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:On 3 April 2016 at 16:14, 9il via Digitalmars-d <digitalmars-d puremagic.com> wrote:https://github.com/ldc-developers/ldc/pull/1434Is it possible to introduce compile time information about target platform? I am working on BLAS from scratch implementation. And it is no hope to create something useable without CT information about target. Best regards, IlyaMy SIMD implementation has been blocked on that for years too. I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.
Apr 15 2016
Am Sun, 03 Apr 2016 06:14:23 +0000 schrieb 9il <ilyayaroshenko gmail.com>:Hello Martin, Is it possible to introduce compile time information about target platform? I am working on BLAS from scratch implementation. And it is no hope to create something useable without CT information about target. Best regards, Ilya+1000! I've hardcoded SSE4 in fast.json, but would much prefer to type version(sse4) and have it compile on older CPUs as well. -- Marco
Apr 04 2016
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:I'm currently working on a templated arrayop implementation (using RPN to encode ASTs). So far things worked out great, but now I got stuck b/c apparently none of the D compilers has a working SIMD implementation (maybe GDC has but it's very difficult to work w/ the 2.066 frontend). https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d I don't want to do anything fancy, just unaligned loads, stores, and integral mul/div. Is this really the current state of SIMD or am I missing sth.? -MartinNot sure if it's been mentioned, but I've made a best effort to implement GCC's in here: https://github.com/etcimon/botan/tree/master/source/botan/utils/simd
Apr 12 2016
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:I'm currently working on a templated arrayop implementation (using RPN to encode ASTs). So far things worked out great, but now I got stuck b/c apparently none of the D compilers has a working SIMD implementation (maybe GDC has but it's very difficult to work w/ the 2.066 frontend). https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d I don't want to do anything fancy, just unaligned loads, stores, and integral mul/div. Is this really the current state of SIMD or am I missing sth.? -Martinndslice.algorithm [1], [2] compiled with a recent LDC beta will do all the work for you. The vectorized flag should be turned on and the last (row) dimension should have stride==1. Generic matrix-matrix multiplication [3] is available in Mir version 0.16.0-beta2. It should be compiled with a recent LDC beta and the -mcpu=native flag. [1] http://docs.mir.dlang.io/latest/mir_ndslice_algorithm.html [2] https://github.com/dlang/phobos/pull/4652 [3] http://docs.mir.dlang.io/latest/mir_glas_gemm.html
Aug 22 2016