digitalmars.D.ldc - Performance problem in reverse algorithm
- Fool (103/103) Aug 24 2014 Below there is a snippet of D code followed by a close C++
- Kai Nacke (16/120) Aug 25 2014 The speed of binaries produced by ldc is roughly the same as of
- Fool (15/29) Aug 25 2014 Yes, that's what I thought. Probably, there is something strange
- Daniel N (7/14) Aug 30 2014 I tried changing 'immutable' to 'enum' and adding 'pure nothrow'
- Daniel N (3/10) Aug 30 2014 Hmm, sorry the execution time varies too wildly on my system,
- David Nadlinger (17/21) Oct 11 2014 I just tested this, and on the machine I ran this on the code LDC
- David Nadlinger (17/21) Oct 11 2014 I just tested this, and on the machine I ran this on the code LDC
- Fool (13/25) Oct 11 2014 Hi David,
Below there is a snippet of D code followed by a close C++ translation. clang++ seems to have some problems optimizing the C++ version if my_reverse is not attributed "noinline". Unfortunately, I was not able to convince ldc to produce a fast executable. g++ and gdc both produce good results; annotation is not required here. The same effect is visible in the D version if one replaces my_reverse with reverse from std.algorithm. Can someone figure out what the problem is? clang++ / inline: 5.7s ldc / inline: 5.7s clang++ / noinline: 1.4s ldc / noinline: 5.8s g++: 1.2s gdc: 1.2s Test environment: Arch Linux x86_64 clang version 3.4.2 LDC - the LLVM D compiler (0.14.0) g++ (GCC) 4.9.1 gdc (GCC) 4.9.1 clang++ -std=c++11 -march=native -O3 -DNOINLINE ldc2 -O3 -mcpu=native -release -disable-boundscheck g++ -std=c++11 -march=native -O3 gdc -O3 -march=native -frelease -fno-bounds-check Moreover, if I do not specify '-disable-boundscheck' ldc2 produces the following error message: /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4HTTP18_sharedStaticCtor1FZv': (.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): undefined reference to `curl_version_info' /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4Curl18_sharedStaticDtor3FZv': (.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): undefined reference to `curl_global_cleanup' /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl13__shared_ctorZ': (.text._D3std3net4curl13__shared_ctorZ+0x10): undefined reference to `curl_global_init' collect2: error: ld returned 1 exit status Error: /usr/bin/gcc failed with status: 1 //// // reverse.d import std.algorithm, std.range; import ldc.attribute; attribute("noinline") void my_reverse(int* b, int* e) { auto steps = (e - b) / 2; if (steps) { auto l = b; auto r = e - 1; do { swap(*l, *r); ++l; --r; } while (--steps); } } void main(string[] args) { immutable N = 2000; immutable K = 10000; auto a = iota(N).array; for (auto n = 0; n <= N; ++n) { for (auto k = 0; k <= K; ++k) { my_reverse(&a[0], &a[0] + n); } } } //// // reverse.cpp #include <numeric> #include <vector> #ifdef NOINLINE __inline__ __attribute__((noinline)) #endif void my_reverse(int* b, int* e) { auto steps = (e - b) / 2; if (steps) { auto l = b; auto r = e - 1; do { std::swap(*l, *r); ++l; --r; } while (--steps); } } int main() { const auto N = 2000; const auto K = 10000; std::vector<int> a(N); auto b = std::begin(a); auto e = std::end(a); std::iota(b, e, 0); for (auto n = 0; n <= N; ++n) { for (auto k = 0; k <= K; ++k) { my_reverse(&a[0], &a[0] + n); } } }
Aug 24 2014
Hi Fool! On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote:Below there is a snippet of D code followed by a close C++ translation. clang++ seems to have some problems optimizing the C++ version if my_reverse is not attributed "noinline". Unfortunately, I was not able to convince ldc to produce a fast executable. g++ and gdc both produce good results; annotation is not required here.The speed of binaries produced by ldc is roughly the same as of binaries from clang++. That is no wonder as both use the same LLVM machinery. You could try the LDC_never_inline pragma (http://wiki.dlang.org/LDC-specific_language_changes). Unfortunately, general attributes are not yet implemented in LDC https://github.com/ldc-developers/ldc/issues/561).The same effect is visible in the D version if one replaces my_reverse with reverse from std.algorithm. Can someone figure out what the problem is? clang++ / inline: 5.7s ldc / inline: 5.7s clang++ / noinline: 1.4s ldc / noinline: 5.8s g++: 1.2s gdc: 1.2s Test environment: Arch Linux x86_64 clang version 3.4.2 LDC - the LLVM D compiler (0.14.0) g++ (GCC) 4.9.1 gdc (GCC) 4.9.1 clang++ -std=c++11 -march=native -O3 -DNOINLINE ldc2 -O3 -mcpu=native -release -disable-boundscheck g++ -std=c++11 -march=native -O3 gdc -O3 -march=native -frelease -fno-bounds-check Moreover, if I do not specify '-disable-boundscheck' ldc2 produces the following error message:https://github.com/ldc-developers/ldc/issues/683. Sorry for that./usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4HTTP18_sharedStaticCtor1FZv': (.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): undefined reference to `curl_version_info' /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4Curl18_sharedStaticDtor3FZv': (.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): undefined reference to `curl_global_cleanup' /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl13__shared_ctorZ': (.text._D3std3net4curl13__shared_ctorZ+0x10): undefined reference to `curl_global_init' collect2: error: ld returned 1 exit status Error: /usr/bin/gcc failed with status: 1 //// // reverse.d import std.algorithm, std.range; import ldc.attribute; attribute("noinline") void my_reverse(int* b, int* e) { auto steps = (e - b) / 2; if (steps) { auto l = b; auto r = e - 1; do { swap(*l, *r); ++l; --r; } while (--steps); } } void main(string[] args) { immutable N = 2000; immutable K = 10000; auto a = iota(N).array; for (auto n = 0; n <= N; ++n) { for (auto k = 0; k <= K; ++k) { my_reverse(&a[0], &a[0] + n); } } } //// // reverse.cpp #include <numeric> #include <vector> #ifdef NOINLINE __inline__ __attribute__((noinline)) #endif void my_reverse(int* b, int* e) { auto steps = (e - b) / 2; if (steps) { auto l = b; auto r = e - 1; do { std::swap(*l, *r); ++l; --r; } while (--steps); } } int main() { const auto N = 2000; const auto K = 10000; std::vector<int> a(N); auto b = std::begin(a); auto e = std::end(a); std::iota(b, e, 0); for (auto n = 0; n <= N; ++n) { for (auto k = 0; k <= K; ++k) { my_reverse(&a[0], &a[0] + n); } } }I will try to check this but it will take some time as I am busy with some other D related tasks. Regards, Kai
Aug 25 2014
Hi Kai! On Monday, 25 August 2014 at 16:49:34 UTC, Kai Nacke wrote:On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote: The speed of binaries produced by ldc is roughly the same as of binaries from clang++. That is no wonder as both use the same LLVM machinery.Yes, that's what I thought. Probably, there is something strange going on in the depths of LLVM.You could try the LDC_never_inline pragma (http://wiki.dlang.org/LDC-specific_language_changes). Unfortunately, general attributes are not yet implemented in https://github.com/ldc-developers/ldc/issues/561).Introducing pragma(LDC_never_inline) slightly improves execution time of the ldc result to 5.4s. Still, this is more than three times slower than the fast version produced by clang++.https://github.com/ldc-developers/ldc/issues/683. Sorry for that.No problem, thanks for the info!I will try to check this but it will take some time as I am busy with some other D related tasks.Thanks, there is no time pressure. I compared ASM and LLVM-IR of the fast and slow versions using clang++. Unfortunately, my foo in this area is non-existent. The only thing I noticed is that the fast version has significantly longer ASM and LLVM-I representations. Kind regards, Fool
Aug 25 2014
On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:Thanks, there is no time pressure. I compared ASM and LLVM-IR of the fast and slow versions using clang++. Unfortunately, my foo in this area is non-existent. The only thing I noticed is that the fast version has significantly longer ASM and LLVM-I representations. Kind regards, FoolI tried changing 'immutable' to 'enum' and adding 'pure nothrow' annotations on the functions, got a decent speedup for ldc, I currently don't have gdc installed, so I can't check if these changes make gdc even faster or if it's part of the difference. Regards, Daniel N
Aug 30 2014
On Sunday, 31 August 2014 at 03:47:04 UTC, Daniel N wrote:I tried changing 'immutable' to 'enum' and adding 'pure nothrow' annotations on the functions, got a decent speedup for ldc, I currently don't have gdc installed, so I can't check if these changes make gdc even faster or if it's part of the difference. Regards, Daniel NHmm, sorry the execution time varies too wildly on my system, measuring error.
Aug 30 2014
On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:Introducing pragma(LDC_never_inline) slightly improves execution time of the ldc result to 5.4s. Still, this is more than three times slower than the fast version produced by clang++.I just tested this, and on the machine I ran this on the code LDC produces with inlining disabled is just as fast as the one produced by Clang. --- $ ldc2 -version LDC - the LLVM D compiler (c8b9f6): based on DMD v2.066 and LLVM 3.5.0 Default target: x86_64-unknown-linux-gnu Host CPU: corei7-avx […] $ clang++ --version clang version 3.5.0 (tags/RELEASE_350/final) Target: x86_64-unknown-linux-gnu Thread model: posix --- David
Oct 11 2014
On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:Introducing pragma(LDC_never_inline) slightly improves execution time of the ldc result to 5.4s. Still, this is more than three times slower than the fast version produced by clang++.I just tested this, and on the machine I ran this on the code LDC produces with inlining disabled is just as fast as the one produced by Clang. --- $ ldc2 -version LDC - the LLVM D compiler (c8b9f6): based on DMD v2.066 and LLVM 3.5.0 Default target: x86_64-unknown-linux-gnu Host CPU: corei7-avx […] $ clang++ --version clang version 3.5.0 (tags/RELEASE_350/final) Target: x86_64-unknown-linux-gnu Thread model: posix --- David
Oct 11 2014
Hi David, thank you for checking this case. Currently, I'm using LDC - the LLVM D compiler (0.14.0): based on DMD v2.065 and LLVM 3.4.2 Default target: x86_64-unknown-linux-gnu Host CPU: core-avx2 If waiting for the next version of ldc is all I have to do that's good news. :-) Thank you again, kind greetings, Fool On Saturday, 11 October 2014 at 20:04:59 UTC, David Nadlinger wrote:--- $ ldc2 -version LDC - the LLVM D compiler (c8b9f6): based on DMD v2.066 and LLVM 3.5.0 Default target: x86_64-unknown-linux-gnu Host CPU: corei7-avx […] $ clang++ --version clang version 3.5.0 (tags/RELEASE_350/final) Target: x86_64-unknown-linux-gnu Thread model: posix ---
Oct 11 2014