digitalmars.D.ldc - Disappointing math performance compared to GDC
- Gabor Mezo (18/18) Oct 08 2014 Hello,
- Trass3r (16/16) Oct 08 2014 Try with '-O3 -release -vectorize-slp-aggressive -g
- Gabor Mezo (5/22) Oct 08 2014 I get:
- Trass3r (2/5) Oct 08 2014 They were added to llvm in April/May.
- Gabor Mezo (4/9) Oct 08 2014 I can confirm that there are no pass-remarks options found. I'm
- Russel Winder via digitalmars-d-ldc (20/31) Oct 08 2014 -----BEGIN PGP SIGNED MESSAGE-----
- Trass3r (11/17) Oct 08 2014 Note that this is less of an issue for x86 code using x87 by
- Gabor Mezo (5/23) Oct 08 2014 Just for the record my benchmark code doesn't use math libraries,
- Trass3r (2/2) Oct 08 2014 Just check it with '-output-ll' or '-output-s
- Gabor Mezo (96/98) Oct 08 2014 I'm not an ASM expert, but as far as I can see it indeed use some
- Gabor Mezo (17/17) Oct 08 2014 On a second thought I can see that the main problem is my
- Gabor Mezo (14/14) Oct 08 2014 Here are the abs/min/max functions:
- David Nadlinger (19/20) Oct 08 2014 They are likely in a different module than the code using them,
- Gabor Mezo (5/5) Oct 08 2014 Hi David,
- Trass3r (4/10) Oct 08 2014 If you see a 'ps' suffix (packed single-precision) it's SIMD ;)
- David Nadlinger (11/13) Oct 08 2014 On x86_64, scalar single and double precision math uses the SSE
- David Nadlinger (13/20) Oct 08 2014 Would it be possible to publish the relevant parts of the code,
- Gabor Mezo (2/12) Oct 08 2014 Of course. The code will be accessible on github on this week.
- Gabor Mezo (30/30) Oct 09 2014 Let me introduce my project for you guys.
- Gabor Mezo (8/8) Oct 09 2014 Hey,
- John Colvin (3/11) Oct 10 2014 The -singleobj flag may give you that same performance boost
- Gabor Mezo (2/15) Oct 10 2014 How do you do this by using dub?
- Gabor Mezo (2/18) Oct 10 2014 Ok, thanks, I've already figured it out.
- Gabor Mezo (7/7) Oct 13 2014 I just wanted to inform you guys, that I optimized my code to
- Fool (9/12) Oct 11 2014 I recently posted a test case [1] that originated from the
Hello,

I have a machine learning library and I'm porting it from C++ to D right now. There is a number crunching benchmark in it that does simple gradient descent learning on a small multilayer perceptron neural network. The core of the benchmark consists of some loops doing basic computations on numbers in float[] arrays (add, mul, exp, abs).

The reference is the C++ version compiled with Clang: 0.044 secs.

D results:

DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
LDC2 0.14 -O3 -release : 0.051 secs
GDC 4.9 -O3 -release : 0.031 secs

I think my benchmark code would hugely benefit from auto-vectorization, so that might be the cause of the above results. I've found some vectorization compiler options for ldc2, but they seem to have no effect on performance whatsoever.

Any suggestions?
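For reference, the hot path boils down to loops of roughly this shape (a simplified sketch based on the description above, not the actual benchmark code; the function and array names are made up):

void forwardPass(float[] outputs, const(float)[] inputs,
                 const(float)[] weights, float alpha) nothrow
{
    foreach (i; 0 .. outputs.length)
    {
        // weighted sum of the inputs for neuron i
        float sum = 0.0f;
        foreach (j; 0 .. inputs.length)
            sum += inputs[j] * weights[i * inputs.length + j];

        // Elliott-style sigmoid approximation: x / (1 + |x|), no std.math needed
        immutable x = sum * alpha;
        outputs[i] = x / (1.0f + (x < 0.0f ? -x : x));
    }
}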
Oct 08 2014
Try with '-O3 -release -vectorize-slp-aggressive -g -pass-remarks-analysis="loop-vectorize|loop-unroll" -pass-remarks=loop-unroll'

Note that the D situation is a mess in general (correct me if I'm wrong):

* Never ever use std.math as you will get the insane 80-bit functions.
* core.math has some hacks to use llvm builtins but also mostly uses type real.
* core.stdc.math supports all types but uses suffixes and maps to C functions.
* core.stdc.tgmath gets rid of the suffixes at least. Best way imo to write code if you disregard auto-vectorization (sketch below).
* You can also use ldc.intrinsics to kill portability. Hello C++.

And there's no fast-math yet: https://github.com/ldc-developers/ldc/issues/722
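To make the difference concrete, a minimal sketch (not from the thread's benchmark; it assumes the 2014-era druntime where core.stdc.tgmath provides overloaded aliases of the C math functions):

import core.stdc.tgmath : exp; // resolves to expf/exp/expl by argument type

float activate(float x)
{
    // With a float argument this calls C's expf, so the math stays in
    // single-precision SSE registers; std.math's exp routes through
    // 80-bit real, which is the problem described above.
    return 1.0f / (1.0f + exp(-x));
}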
Oct 08 2014
On Wednesday, 8 October 2014 at 11:29:30 UTC, Trass3r wrote:
> Try with '-O3 -release -vectorize-slp-aggressive -g
> -pass-remarks-analysis="loop-vectorize|loop-unroll"
> -pass-remarks=loop-unroll'
> [...]

I get:

Unknown command line argument '-pass-remarks-analysis=loop-vectorize|loop-unroll'
Unknown command line argument '-pass-remarks=loop-unroll'
Oct 08 2014
> Unknown command line argument '-pass-remarks-analysis=loop-vectorize|loop-unroll'
> Unknown command line argument '-pass-remarks=loop-unroll'

They were added to llvm in April/May.
-help-hidden lists all available options.
Oct 08 2014
On Wednesday, 8 October 2014 at 15:04:10 UTC, Trass3r wrote:
> > Unknown command line argument '-pass-remarks-analysis=loop-vectorize|loop-unroll'
> > Unknown command line argument '-pass-remarks=loop-unroll'
> They were added to llvm in April/May.
> -help-hidden lists all available options.

I can confirm that there are no pass-remarks options found. I'm using 0.14.0 from here: https://github.com/ldc-developers/ldc/releases/tag/v0.14.0
Oct 08 2014
On 08/10/14 12:29, Trass3r via digitalmars-d-ldc wrote:
[…]
> * Never ever use std.math as you will get the insane 80-bit functions.

What can one use to avoid this, and use the 64-bit numbers?

> * core.math has some hacks to use llvm builtins but also mostly uses type real.
> * core.stdc.math supports all types but uses suffixes and maps to C functions.
> * core.stdc.tgmath gets rid of the suffixes at least. Best way imo to write code if you disregard auto-vectorization.
> * You can also use ldc.intrinsics to kill portability. Hello C++.
> And there's no fast-math yet: https://github.com/ldc-developers/ldc/issues/722

Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?

--
Russel.
Oct 08 2014
> > * Never ever use std.math as you will get the insane 80-bit functions.
> What can one use to avoid this, and use the 64-bit numbers?

Note that this is less of an issue for x86 code using x87 by default. On x64 though this results in really bad code switching between SSE and x87 registers. But vectorization is usually killed in any case. I personally use core.stdc.tgmath atm.

> > * [...]
> Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?

I think there have been threads debating the unreasonable 'real by default' attitude. No clue if there's any result. And core.math is a big question mark to me.
I don't know about gdc. Its runtime doesn't look much different.
Oct 08 2014
On Wednesday, 8 October 2014 at 15:32:46 UTC, Trass3r wrote:
> Note that this is less of an issue for x86 code using x87 by default. On x64 though this results in really bad code switching between SSE and x87 registers. But vectorization is usually killed in any case. I personally use core.stdc.tgmath atm.
> [...]

Just for the record, my benchmark code doesn't use math libraries; I'm using logistic function approximations. That's why I thought the cause of my results has to be the lack of auto-vectorization.
Oct 08 2014
Just check it with '-output-ll' or '-output-s -x86-asm-syntax=intel' ;)
Oct 08 2014
On Wednesday, 8 October 2014 at 16:02:17 UTC, Trass3r wrote:
> Just check it with '-output-ll' or '-output-s -x86-asm-syntax=intel' ;)

I'm not an ASM expert, but as far as I can see it indeed uses some SIMD registers and instructions. For example:

.LBB0_16:
    mov rcx, qword ptr [rax]
    mov rdi, rax
    call qword ptr [rcx + 56]
    test rax, rax
    jne .LBB0_18
    movss xmm1, dword ptr [rsp + 116]
    jmp .LBB0_20
    .align 16, 0x90
.LBB0_18:
    mov rcx, rbx
    imul rcx, rax
    add r12, rcx
    movss xmm1, dword ptr [rsp + 116]
    .align 16, 0x90
.LBB0_19:
    movss xmm0, dword ptr [rdx]
    mulss xmm0, dword ptr [r12]
    addss xmm1, xmm0
    add rdx, 4
    add r12, 4
    dec rax
    jne .LBB0_19
.LBB0_20:
    movss dword ptr [rsp + 116], xmm1
    inc r14
    cmp r14, r15
    jne .LBB0_12
.LBB0_21:
    mov rax, qword ptr [rsp + 80]
    mov rdi, qword ptr [rax]
    mov rax, qword ptr [rdi]
    call qword ptr [rax + 40]
    test eax, eax
    mov rbp, qword ptr [rsp + 104]
    jne .LBB0_24
    movss xmm0, dword ptr [rsp + 92]
    movss xmm1, dword ptr [rsp + 116]
    call _D8nhelpers7sigmoidFNbffZf
    mov rax, qword ptr [rsp + 64]
    movss dword ptr [rax + 4*rbp], xmm0
    xor edx, edx
    xor ecx, ecx
    mov r8d, _D11TypeInfo_Af6__initZ
    mov rdi, qword ptr [rsp + 48]
    mov rsi, qword ptr [rsp + 96]
    call _adEq2
    test eax, eax
    jne .LBB0_27
    movss xmm0, dword ptr [rsp + 92]
    movss xmm1, dword ptr [rsp + 116]
    call _D8nhelpers12sigmoidDerivFNbffZf
    mov rax, qword ptr [rsp + 96]
    jmp .LBB0_26
    .align 16, 0x90
.LBB0_24:
    movss xmm0, dword ptr [rsp + 92]
    movss xmm1, dword ptr [rsp + 116]
    call _D8nhelpers6linearFNbffZf
    mov rax, qword ptr [rsp + 64]
    movss dword ptr [rax + 4*rbp], xmm0
    xor edx, edx
    xor ecx, ecx
    mov r8d, _D11TypeInfo_Af6__initZ
    mov rdi, qword ptr [rsp + 48]
    mov rsi, qword ptr [rsp + 96]
    call _adEq2
    test eax, eax
    jne .LBB0_27
    mov rax, qword ptr [rsp + 96]
    movss xmm0, dword ptr [rsp + 92]
.LBB0_26:
    movss dword ptr [rax + 4*rbp], xmm0
.LBB0_27:
    inc rbp
    add rbx, 4
    cmp rbp, qword ptr [rsp + 72]
    jne .LBB0_9
.LBB0_28:
    mov rax, qword ptr [rsp + 24]
    inc rax
    cmp rax, qword ptr [rsp + 8]
    mov rbp, qword ptr [rsp + 16]
    jne .LBB0_1
.LBB0_29:
    add rsp, 120
    pop rbx
    pop r12
    pop r13
    pop r14
    pop r15
    pop rbp
    ret
Oct 08 2014
On second thought, I can see that the main problem is that my computation functions are not inlined. They look like this:

float sigmoid(float value, float alpha) nothrow
{
    return (value * alpha) / (1.0f + nfAbs(value * alpha)); // Elliot
}

float sigmoidDeriv(float value, float alpha) nothrow
{
    return alpha * 1.0f / ((1.0f + nfAbs(value * alpha)) * (1.0f + nfAbs(value * alpha))); // Elliot
}

float linear(float value, float alpha) nothrow
{
    return nfMin(nfMax(value * alpha, -alpha), alpha);
}

Why are those calls not inlined? Or vectorized?
Oct 08 2014
Here are the abs/min/max functions:

float nfAbs(float num) nothrow
{
    return num < 0.0f ? -num : num;
}

float nfMax(float num1, float num2) nothrow
{
    return num1 < num2 ? num2 : num1;
}

float nfMin(float num1, float num2) nothrow
{
    return num2 < num1 ? num2 : num1;
}

They aren't inlined either. Why?
Oct 08 2014
On Wednesday, 8 October 2014 at 16:26:02 UTC, Gabor Mezo wrote:
> Why are those calls not inlined?

They are likely in a different module than the code using them, right? Modules in D are supposed to be their own, separate compilation units, just like .cpp files in C++. Thus, by default, no inlining across module boundaries will take place, unless you use something like link-time optimization.

Now of course this is rather undesirable and a big problem for trivial helper functions. If you just compile a single executable, you can pass -singleobj to LDC to instruct it to generate only one object file, so that the optimization boundaries disappear (arguably, this should be the default).

Furthermore, both DMD and LDC actually attempt to work around this by also analyzing imported modules so that functions in them can be inlined. Unfortunately, the LDC implementation of this has been defunct since a couple of DMD frontend merges ago, so not even simple cases like the one in your example are covered. I'm working on a reimplementation right now, hopefully to appear in master soon.

Cheers,
David
Oct 08 2014
Hi David,

Thanks for trying to help me out. Indeed, the helper functions reside in separate modules. They are system functions. I'll try to convert my helper function system to mixins then.
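The idea would be something along these lines (just a sketch of the approach, not the final code; the template name is made up):

mixin template NFHelpers()
{
    float nfAbs(float num) nothrow { return num < 0.0f ? -num : num; }
    float nfMax(float num1, float num2) nothrow { return num1 < num2 ? num2 : num1; }
    float nfMin(float num1, float num2) nothrow { return num2 < num1 ? num2 : num1; }
}

// In the module that contains the hot loops; the helpers are now
// compiled as part of this module, so the optimizer can inline them.
mixin NFHelpers;

Turning the helpers into empty function templates should have a similar effect, since template instances are emitted in the compilation unit that instantiates them.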
Oct 08 2014
> I'm not an ASM expert

'-output-ll' gives you llvm IR, a bit higher level.

> but as far as I can see it indeed uses some SIMD registers and instructions. For example:
>     movss xmm0, dword ptr [rdx]
>     mulss xmm0, dword ptr [r12]
>     addss xmm1, xmm0

If you see a 'ps' suffix (packed single-precision) it's SIMD ;)

Your helper functions are probably in a different module. Cross-module inlining is problematic currently.
Oct 08 2014
On Wednesday, 8 October 2014 at 16:23:19 UTC, Gabor Mezo wrote:
> I'm not an ASM expert, but as far as I can see it indeed uses some SIMD registers and instructions.

On x86_64, scalar single and double precision math uses the SSE registers and instructions by default too. The relevant mnemonics (mostly) end with "ss", which stands for "scalar single". On the other hand, vectorized code would use e.g. the instructions ending in "ps", for "packed single" (multiple values in one SSE register). Your snippet has not actually been vectorized.

Assuming that the code you posted was from a hot loop, the many function calls are a much bigger problem, though.

David
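For contrast, explicitly packed code, e.g. written with core.simd vector types, is what produces the "ps" forms. A minimal sketch, assuming an x86_64 target where float4 is available; the function name is made up:

import core.simd;

float4 madd(float4 acc, float4 x, float4 w)
{
    // The multiply compiles to mulps and the add to addps:
    // four single-precision lanes per instruction instead of one.
    return acc + x * w;
}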
Oct 08 2014
Hi,

On Wednesday, 8 October 2014 at 07:37:15 UTC, Gabor Mezo wrote:
> There is a number crunching benchmark in it that does simple gradient descent learning on a small multilayer perceptron neural network. The core of the benchmark consists of some loops doing basic computations on numbers in float[] arrays (add, mul, exp, abs).

Would it be possible to publish the relevant parts of the code, i.e. what is needed to reproduce the performance problem? I'm currently working on a D compiler performance tracking project, so real-world test cases where one compiler does much better than another are interesting to me. If the code is proprietary, would it be possible for me or another compiler dev to have a look at the code, so we can determine the issues more quickly?

> DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
> LDC2 0.14 -O3 -release : 0.051 secs

Note that array bounds checks are still enabled for LDC here if your code was @safe.

David
Oct 08 2014
> Would it be possible to publish the relevant parts of the code, i.e. what is needed to reproduce the performance problem?
> [...]

Of course. The code will be accessible on GitHub this week. This is an LGPL-licensed hobbyist project, not confidential. ;)
Oct 08 2014
Let me introduce my project to you guys.

There is the blog: http://neuroflowblog.wordpress.com/ . The productivity of the language allowed me to implement advanced machine learning algorithms like Realtime Recurrent Learning and Scaled Conjugate Gradient. Sadly the performance was not that good, so I learned OpenCL. I implemented a provider model in my framework, so I became able to use managed and OpenCL implementations in the same system. Because my experiments went really slow, I then decided to move my experimental layer to C++11, and my framework became pure native. Sadly the productivity of C++ is poor, and experiments became slower than they were with the managed version. Then I decided to learn D, and the result is on GitHub (a DUB project):

https://github.com/unbornchikken/neuroflow-D

This is a console application; the mentioned benchmark will start when it runs. Please note, this is my first D code. There are constructs that seem to lead nowhere, but they will gain purpose when I port all of the planned functionality. Because there is a provider model to have OpenCL- and D- (and whatever-) based implementations in parallel, I wasn't able to avoid downcasting in my design. Because downcasting can hugely affect performance, I implemented some ugly but performant void * magic. Sorry for that. :)

Conversion of the OpenCL implementation to D is still TODO. Recurrent learning implementations are not implemented right now.
Oct 09 2014
Hey,

We have made progress. I've merged my computation code into a single module, and now the LDC build is as performant as the Clang one! The benchmark took around 0.044 secs. It's slower than the GDC version, but it is amazing that the D language can be as performant as C++ when using the same compiler backend, so no magic allowed. Results pushed in.
Oct 09 2014
On Thursday, 9 October 2014 at 08:13:21 UTC, Gabor Mezo wrote:
> We have made progress. I've merged my computation code into a single module, and now the LDC build is as performant as the Clang one!
> [...]

The -singleobj flag may give you that same performance boost without having to refactor the code.
Oct 10 2014
On Friday, 10 October 2014 at 12:13:49 UTC, John Colvin wrote:
> The -singleobj flag may give you that same performance boost without having to refactor the code.

How do you do this by using dub?
Oct 10 2014
On Friday, 10 October 2014 at 15:08:21 UTC, Gabor Mezo wrote:
> How do you do this by using dub?

Ok, thanks, I've already figured it out.
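For anyone else wondering: one way to do it is a compiler-specific dflags entry in the dub configuration (a sketch, assuming dub's dflags-ldc key; the package name is made up and this is not the actual neuroflow-D dub.json):

{
    "name": "my-benchmark",
    "dflags-ldc": ["-singleobj"]
}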
Oct 10 2014
I just wanted to inform you guys that I optimized my code to avoid casting entirely in hot paths. To be fair, I did backport my refinements to the C++ version. Now all builds run with roughly the same performance (Clang, LDC, GDC). All changes are pushed to the mentioned repo.

Thanks for your help, I'm satisfied. (And eagerly waiting for 2.066-compatible GDC and LDC releases. :))
Oct 13 2014
On Wednesday, 8 October 2014 at 17:31:21 UTC, David Nadlinger wrote:
> [...] so real-world test cases where one compiler does much better than another are interesting to me.

I recently posted a test case [1] that originated from the discussion [2].

[1] http://forum.dlang.org/thread/fowvgokbjuxplvcskswg forum.dlang.org
[2] http://forum.dlang.org/thread/ls9dbk$jkq$1 digitalmars.com

Kind regards,
Fool
Oct 11 2014