www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.ldc - Performance problem in reverse algorithm

reply "Fool" <fool dlang.org> writes:
Below there is a snippet of D code followed by a close C++ 
translation.

clang++ seems to have some problems optimizing the C++ version if 
my_reverse is not attributed "noinline". Unfortunately, I was not 
able to convince ldc to produce a fast executable. g++ and gdc 
both produce good results; annotation is not required here.

The same effect is visible in the D version if one replaces 
my_reverse with reverse from std.algorithm. Can someone figure 
out what the problem is?

clang++ /   inline: 5.7s
     ldc /   inline: 5.7s
clang++ / noinline: 1.4s
     ldc / noinline: 5.8s
                g++: 1.2s
                gdc: 1.2s

Test environment: Arch Linux x86_64

clang version 3.4.2
LDC - the LLVM D compiler (0.14.0)
g++ (GCC) 4.9.1
gdc (GCC) 4.9.1

clang++ -std=c++11 -march=native -O3 -DNOINLINE
ldc2 -O3 -mcpu=native -release -disable-boundscheck
g++ -std=c++11 -march=native -O3
gdc -O3 -march=native -frelease -fno-bounds-check

Moreover, if I do not specify '-disable-boundscheck' ldc2 
produces the following error message:

/usr/lib/liblphobos2.a(curl.o): In function 
`_D3std3net4curl4HTTP18_sharedStaticCtor1FZv':
(.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): 
undefined reference to `curl_version_info'
/usr/lib/liblphobos2.a(curl.o): In function 
`_D3std3net4curl4Curl18_sharedStaticDtor3FZv':
(.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): 
undefined reference to `curl_global_cleanup'
/usr/lib/liblphobos2.a(curl.o): In function 
`_D3std3net4curl13__shared_ctorZ':
(.text._D3std3net4curl13__shared_ctorZ+0x10): undefined reference 
to `curl_global_init'
collect2: error: ld returned 1 exit status
Error: /usr/bin/gcc failed with status: 1


////
// reverse.d

import std.algorithm, std.range;
import ldc.attribute;

 attribute("noinline")
void my_reverse(int* b, int* e)
{
    auto steps = (e - b) / 2;
    if (steps) {
       auto l = b;
       auto r = e - 1;
       do {
          swap(*l, *r);
          ++l;
          --r;
       } while (--steps);
    }
}

void main(string[] args)
{
    immutable N = 2000;
    immutable K = 10000;
    auto a = iota(N).array;
    for (auto n = 0; n <= N; ++n) {
       for (auto k = 0; k <= K; ++k) {
          my_reverse(&a[0], &a[0] + n);
       }
    }
}


////
// reverse.cpp

#include <numeric>
#include <vector>

#ifdef NOINLINE
__inline__ __attribute__((noinline))
#endif
void my_reverse(int* b, int* e)
{
    auto steps = (e - b) / 2;
    if (steps) {
       auto l = b;
       auto r = e - 1;
       do {
          std::swap(*l, *r);
          ++l;
          --r;
       } while (--steps);
    }
}

int main()
{
    const auto N = 2000;
    const auto K = 10000;
    std::vector<int> a(N);
    auto b = std::begin(a);
    auto e = std::end(a);
    std::iota(b, e, 0);
    for (auto n = 0; n <= N; ++n) {
       for (auto k = 0; k <= K; ++k) {
          my_reverse(&a[0], &a[0] + n);
       }
    }
}
Aug 24 2014
parent reply "Kai Nacke" <kai redstar.de> writes:
Hi Fool!

On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote:
 Below there is a snippet of D code followed by a close C++ 
 translation.

 clang++ seems to have some problems optimizing the C++ version 
 if my_reverse is not attributed "noinline". Unfortunately, I 
 was not able to convince ldc to produce a fast executable. g++ 
 and gdc both produce good results; annotation is not required 
 here.
The speed of binaries produced by ldc is roughly the same as of binaries from clang++. That is no wonder as both use the same LLVM machinery. You could try the LDC_never_inline pragma (http://wiki.dlang.org/LDC-specific_language_changes). Unfortunately, general attributes are not yet implemented in LDC https://github.com/ldc-developers/ldc/issues/561).
 The same effect is visible in the D version if one replaces 
 my_reverse with reverse from std.algorithm. Can someone figure 
 out what the problem is?

 clang++ /   inline: 5.7s
     ldc /   inline: 5.7s
 clang++ / noinline: 1.4s
     ldc / noinline: 5.8s
                g++: 1.2s
                gdc: 1.2s

 Test environment: Arch Linux x86_64

 clang version 3.4.2
 LDC - the LLVM D compiler (0.14.0)
 g++ (GCC) 4.9.1
 gdc (GCC) 4.9.1

 clang++ -std=c++11 -march=native -O3 -DNOINLINE
 ldc2 -O3 -mcpu=native -release -disable-boundscheck
 g++ -std=c++11 -march=native -O3
 gdc -O3 -march=native -frelease -fno-bounds-check

 Moreover, if I do not specify '-disable-boundscheck' ldc2 
 produces the following error message:
https://github.com/ldc-developers/ldc/issues/683. Sorry for that.
 /usr/lib/liblphobos2.a(curl.o): In function 
 `_D3std3net4curl4HTTP18_sharedStaticCtor1FZv':
 (.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): 
 undefined reference to `curl_version_info'
 /usr/lib/liblphobos2.a(curl.o): In function 
 `_D3std3net4curl4Curl18_sharedStaticDtor3FZv':
 (.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): 
 undefined reference to `curl_global_cleanup'
 /usr/lib/liblphobos2.a(curl.o): In function 
 `_D3std3net4curl13__shared_ctorZ':
 (.text._D3std3net4curl13__shared_ctorZ+0x10): undefined 
 reference to `curl_global_init'
 collect2: error: ld returned 1 exit status
 Error: /usr/bin/gcc failed with status: 1


 ////
 // reverse.d

 import std.algorithm, std.range;
 import ldc.attribute;

  attribute("noinline")
 void my_reverse(int* b, int* e)
 {
    auto steps = (e - b) / 2;
    if (steps) {
       auto l = b;
       auto r = e - 1;
       do {
          swap(*l, *r);
          ++l;
          --r;
       } while (--steps);
    }
 }

 void main(string[] args)
 {
    immutable N = 2000;
    immutable K = 10000;
    auto a = iota(N).array;
    for (auto n = 0; n <= N; ++n) {
       for (auto k = 0; k <= K; ++k) {
          my_reverse(&a[0], &a[0] + n);
       }
    }
 }


 ////
 // reverse.cpp

 #include <numeric>
 #include <vector>

 #ifdef NOINLINE
 __inline__ __attribute__((noinline))
 #endif
 void my_reverse(int* b, int* e)
 {
    auto steps = (e - b) / 2;
    if (steps) {
       auto l = b;
       auto r = e - 1;
       do {
          std::swap(*l, *r);
          ++l;
          --r;
       } while (--steps);
    }
 }

 int main()
 {
    const auto N = 2000;
    const auto K = 10000;
    std::vector<int> a(N);
    auto b = std::begin(a);
    auto e = std::end(a);
    std::iota(b, e, 0);
    for (auto n = 0; n <= N; ++n) {
       for (auto k = 0; k <= K; ++k) {
          my_reverse(&a[0], &a[0] + n);
       }
    }
 }
I will try to check this but it will take some time as I am busy with some other D related tasks. Regards, Kai
Aug 25 2014
parent reply "Fool" <fool dlang.org> writes:
Hi Kai!

On Monday, 25 August 2014 at 16:49:34 UTC, Kai Nacke wrote:
 On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote:
 The speed of binaries produced by ldc is roughly the same as of 
 binaries from clang++. That is no wonder as both use the same 
 LLVM machinery.
Yes, that's what I thought. Probably, there is something strange going on in the depths of LLVM.
 You could try the LDC_never_inline pragma 
 (http://wiki.dlang.org/LDC-specific_language_changes).
 Unfortunately, general attributes are not yet implemented in 

 https://github.com/ldc-developers/ldc/issues/561).
Introducing pragma(LDC_never_inline) slightly improves execution time of the ldc result to 5.4s. Still, this is more than three times slower than the fast version produced by clang++.

 https://github.com/ldc-developers/ldc/issues/683. Sorry for 
 that.
No problem, thanks for the info!
 I will try to check this but it will take some time as I am 
 busy with some other D related tasks.
Thanks, there is no time pressure. I compared ASM and LLVM-IR of the fast and slow versions using clang++. Unfortunately, my foo in this area is non-existent. The only thing I noticed is that the fast version has significantly longer ASM and LLVM-I representations. Kind regards, Fool
Aug 25 2014
next sibling parent reply "Daniel N" <ufo orbiting.us> writes:
On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
 Thanks, there is no time pressure. I compared ASM and LLVM-IR 
 of the fast and slow versions using clang++. Unfortunately, my 
 foo in this area is non-existent. The only thing I noticed is 
 that the fast version has significantly longer ASM and LLVM-I 
 representations.

 Kind regards,
 Fool
I tried changing 'immutable' to 'enum' and adding 'pure nothrow' annotations on the functions, got a decent speedup for ldc, I currently don't have gdc installed, so I can't check if these changes make gdc even faster or if it's part of the difference. Regards, Daniel N
Aug 30 2014
parent "Daniel N" <ufo orbiting.us> writes:
On Sunday, 31 August 2014 at 03:47:04 UTC, Daniel N wrote:
 I tried changing 'immutable' to 'enum' and adding 'pure 
 nothrow' annotations on the functions, got a decent speedup for 
 ldc, I currently don't have gdc installed, so I can't check if 
 these changes make gdc even faster or if it's part of the 
 difference.

 Regards,
 Daniel N
Hmm, sorry the execution time varies too wildly on my system, measuring error.
Aug 30 2014
prev sibling next sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
 Introducing pragma(LDC_never_inline) slightly improves 
 execution time of the ldc result to 5.4s. Still, this is more 
 than three times slower than the fast version produced by 
 clang++.
I just tested this, and on the machine I ran this on the code LDC produces with inlining disabled is just as fast as the one produced by Clang. --- $ ldc2 -version LDC - the LLVM D compiler (c8b9f6): based on DMD v2.066 and LLVM 3.5.0 Default target: x86_64-unknown-linux-gnu Host CPU: corei7-avx […] $ clang++ --version clang version 3.5.0 (tags/RELEASE_350/final) Target: x86_64-unknown-linux-gnu Thread model: posix --- David
Oct 11 2014
prev sibling parent reply "David Nadlinger" <code klickverbot.at> writes:
On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
 Introducing pragma(LDC_never_inline) slightly improves 
 execution time of the ldc result to 5.4s. Still, this is more 
 than three times slower than the fast version produced by 
 clang++.
I just tested this, and on the machine I ran this on the code LDC produces with inlining disabled is just as fast as the one produced by Clang. --- $ ldc2 -version LDC - the LLVM D compiler (c8b9f6): based on DMD v2.066 and LLVM 3.5.0 Default target: x86_64-unknown-linux-gnu Host CPU: corei7-avx […] $ clang++ --version clang version 3.5.0 (tags/RELEASE_350/final) Target: x86_64-unknown-linux-gnu Thread model: posix --- David
Oct 11 2014
parent "Fool" <fool dlang.org> writes:
Hi David,

thank you for checking this case.

Currently, I'm using

LDC - the LLVM D compiler (0.14.0):
   based on DMD v2.065 and LLVM 3.4.2
   Default target: x86_64-unknown-linux-gnu
   Host CPU: core-avx2

If waiting for the next version of ldc is all I have to do that's 
good news. :-)

Thank you again, kind greetings,
Fool

On Saturday, 11 October 2014 at 20:04:59 UTC, David Nadlinger 
wrote:
 ---
 $ ldc2 -version
 LDC - the LLVM D compiler (c8b9f6):
   based on DMD v2.066 and LLVM 3.5.0
   Default target: x86_64-unknown-linux-gnu
   Host CPU: corei7-avx
 […]

 $ clang++ --version
 clang version 3.5.0 (tags/RELEASE_350/final)
 Target: x86_64-unknown-linux-gnu
 Thread model: posix
 ---
Oct 11 2014