digitalmars.D.ldc - Performance problem in reverse algorithm

Fool (103/103) Aug 24 2014 Below there is a snippet of D code followed by a close C++

Kai Nacke (16/120) Aug 25 2014 The speed of binaries produced by ldc is roughly the same as of

Fool (15/29) Aug 25 2014 Yes, that's what I thought. Probably, there is something strange

Daniel N (7/14) Aug 30 2014 I tried changing 'immutable' to 'enum' and adding 'pure nothrow'

Daniel N (3/10) Aug 30 2014 Hmm, sorry the execution time varies too wildly on my system,

David Nadlinger (17/21) Oct 11 2014 I just tested this, and on the machine I ran this on the code LDC
David Nadlinger (17/21) Oct 11 2014 I just tested this, and on the machine I ran this on the code LDC

Fool (13/25) Oct 11 2014 Hi David,

"Fool" <fool dlang.org> writes:

Below there is a snippet of D code followed by a close C++ 
translation.

clang++ seems to have some problems optimizing the C++ version if 
my_reverse is not attributed "noinline". Unfortunately, I was not 
able to convince ldc to produce a fast executable. g++ and gdc 
both produce good results; annotation is not required here.

The same effect is visible in the D version if one replaces 
my_reverse with reverse from std.algorithm. Can someone figure 
out what the problem is?

clang++ /   inline: 5.7s
     ldc /   inline: 5.7s
clang++ / noinline: 1.4s
     ldc / noinline: 5.8s
                g++: 1.2s
                gdc: 1.2s

Test environment: Arch Linux x86_64

clang version 3.4.2
LDC - the LLVM D compiler (0.14.0)
g++ (GCC) 4.9.1
gdc (GCC) 4.9.1

clang++ -std=c++11 -march=native -O3 -DNOINLINE
ldc2 -O3 -mcpu=native -release -disable-boundscheck
g++ -std=c++11 -march=native -O3
gdc -O3 -march=native -frelease -fno-bounds-check

Moreover, if I do not specify '-disable-boundscheck' ldc2 
produces the following error message:

/usr/lib/liblphobos2.a(curl.o): In function 
`_D3std3net4curl4HTTP18_sharedStaticCtor1FZv':
(.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): 
undefined reference to `curl_version_info'
/usr/lib/liblphobos2.a(curl.o): In function 
`_D3std3net4curl4Curl18_sharedStaticDtor3FZv':
(.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): 
undefined reference to `curl_global_cleanup'
/usr/lib/liblphobos2.a(curl.o): In function 
`_D3std3net4curl13__shared_ctorZ':
(.text._D3std3net4curl13__shared_ctorZ+0x10): undefined reference 
to `curl_global_init'
collect2: error: ld returned 1 exit status
Error: /usr/bin/gcc failed with status: 1


////
// reverse.d

import std.algorithm, std.range;
import ldc.attribute;

 attribute("noinline")
void my_reverse(int* b, int* e)
{
    auto steps = (e - b) / 2;
    if (steps) {
       auto l = b;
       auto r = e - 1;
       do {
          swap(*l, *r);
          ++l;
          --r;
       } while (--steps);
    }
}

void main(string[] args)
{
    immutable N = 2000;
    immutable K = 10000;
    auto a = iota(N).array;
    for (auto n = 0; n <= N; ++n) {
       for (auto k = 0; k <= K; ++k) {
          my_reverse(&a[0], &a[0] + n);
       }
    }
}


////
// reverse.cpp

#include <numeric>
#include <vector>

#ifdef NOINLINE
__inline__ __attribute__((noinline))
#endif
void my_reverse(int* b, int* e)
{
    auto steps = (e - b) / 2;
    if (steps) {
       auto l = b;
       auto r = e - 1;
       do {
          std::swap(*l, *r);
          ++l;
          --r;
       } while (--steps);
    }
}

int main()
{
    const auto N = 2000;
    const auto K = 10000;
    std::vector<int> a(N);
    auto b = std::begin(a);
    auto e = std::end(a);
    std::iota(b, e, 0);
    for (auto n = 0; n <= N; ++n) {
       for (auto k = 0; k <= K; ++k) {
          my_reverse(&a[0], &a[0] + n);
       }
    }
}

Aug 24 2014

"Kai Nacke" <kai redstar.de> writes:

Hi Fool!

On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote:
 Below there is a snippet of D code followed by a close C++ 
 translation.

 clang++ seems to have some problems optimizing the C++ version 
 if my_reverse is not attributed "noinline". Unfortunately, I 
 was not able to convince ldc to produce a fast executable. g++ 
 and gdc both produce good results; annotation is not required 
 here.

The speed of binaries produced by ldc is roughly the same as of 
binaries from clang++. That is no wonder as both use the same 
LLVM machinery.
You could try the LDC_never_inline pragma 
(http://wiki.dlang.org/LDC-specific_language_changes).
Unfortunately, general attributes are not yet implemented in LDC 

https://github.com/ldc-developers/ldc/issues/561).

 The same effect is visible in the D version if one replaces 
 my_reverse with reverse from std.algorithm. Can someone figure 
 out what the problem is?

 clang++ /   inline: 5.7s
     ldc /   inline: 5.7s
 clang++ / noinline: 1.4s
     ldc / noinline: 5.8s
                g++: 1.2s
                gdc: 1.2s

 Test environment: Arch Linux x86_64

 clang version 3.4.2
 LDC - the LLVM D compiler (0.14.0)
 g++ (GCC) 4.9.1
 gdc (GCC) 4.9.1

 clang++ -std=c++11 -march=native -O3 -DNOINLINE
 ldc2 -O3 -mcpu=native -release -disable-boundscheck
 g++ -std=c++11 -march=native -O3
 gdc -O3 -march=native -frelease -fno-bounds-check

 Moreover, if I do not specify '-disable-boundscheck' ldc2 
 produces the following error message:


https://github.com/ldc-developers/ldc/issues/683. Sorry for that.

 /usr/lib/liblphobos2.a(curl.o): In function 
 `_D3std3net4curl4HTTP18_sharedStaticCtor1FZv':
 (.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): 
 undefined reference to `curl_version_info'
 /usr/lib/liblphobos2.a(curl.o): In function 
 `_D3std3net4curl4Curl18_sharedStaticDtor3FZv':
 (.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): 
 undefined reference to `curl_global_cleanup'
 /usr/lib/liblphobos2.a(curl.o): In function 
 `_D3std3net4curl13__shared_ctorZ':
 (.text._D3std3net4curl13__shared_ctorZ+0x10): undefined 
 reference to `curl_global_init'
 collect2: error: ld returned 1 exit status
 Error: /usr/bin/gcc failed with status: 1


 ////
 // reverse.d

 import std.algorithm, std.range;
 import ldc.attribute;

  attribute("noinline")
 void my_reverse(int* b, int* e)
 {
    auto steps = (e - b) / 2;
    if (steps) {
       auto l = b;
       auto r = e - 1;
       do {
          swap(*l, *r);
          ++l;
          --r;
       } while (--steps);
    }
 }

 void main(string[] args)
 {
    immutable N = 2000;
    immutable K = 10000;
    auto a = iota(N).array;
    for (auto n = 0; n <= N; ++n) {
       for (auto k = 0; k <= K; ++k) {
          my_reverse(&a[0], &a[0] + n);
       }
    }
 }


 ////
 // reverse.cpp

 #include <numeric>
 #include <vector>

 #ifdef NOINLINE
 __inline__ __attribute__((noinline))
 #endif
 void my_reverse(int* b, int* e)
 {
    auto steps = (e - b) / 2;
    if (steps) {
       auto l = b;
       auto r = e - 1;
       do {
          std::swap(*l, *r);
          ++l;
          --r;
       } while (--steps);
    }
 }

 int main()
 {
    const auto N = 2000;
    const auto K = 10000;
    std::vector<int> a(N);
    auto b = std::begin(a);
    auto e = std::end(a);
    std::iota(b, e, 0);
    for (auto n = 0; n <= N; ++n) {
       for (auto k = 0; k <= K; ++k) {
          my_reverse(&a[0], &a[0] + n);
       }
    }
 }

I will try to check this but it will take some time as I am busy 
with some other D related tasks.

Regards,
Kai

Aug 25 2014

"Fool" <fool dlang.org> writes:

Hi Kai!

On Monday, 25 August 2014 at 16:49:34 UTC, Kai Nacke wrote:
 On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote:
 The speed of binaries produced by ldc is roughly the same as of 
 binaries from clang++. That is no wonder as both use the same 
 LLVM machinery.

Yes, that's what I thought. Probably, there is something strange 
going on in the depths of LLVM.


 You could try the LDC_never_inline pragma 
 (http://wiki.dlang.org/LDC-specific_language_changes).
 Unfortunately, general attributes are not yet implemented in 

 https://github.com/ldc-developers/ldc/issues/561).

Introducing pragma(LDC_never_inline) slightly improves execution 
time of the ldc result to 5.4s. Still, this is more than three 
times slower than the fast version produced by clang++.



 https://github.com/ldc-developers/ldc/issues/683. Sorry for 
 that.

No problem, thanks for the info!


 I will try to check this but it will take some time as I am 
 busy with some other D related tasks.

Thanks, there is no time pressure. I compared ASM and LLVM-IR of 
the fast and slow versions using clang++. Unfortunately, my foo 
in this area is non-existent. The only thing I noticed is that 
the fast version has significantly longer ASM and LLVM-I 
representations.

Kind regards,
Fool

Aug 25 2014

"Daniel N" <ufo orbiting.us> writes:

On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
 Thanks, there is no time pressure. I compared ASM and LLVM-IR 
 of the fast and slow versions using clang++. Unfortunately, my 
 foo in this area is non-existent. The only thing I noticed is 
 that the fast version has significantly longer ASM and LLVM-I 
 representations.

 Kind regards,
 Fool

I tried changing 'immutable' to 'enum' and adding 'pure nothrow' 
annotations on the functions, got a decent speedup for ldc, I 
currently don't have gdc installed, so I can't check if these 
changes make gdc even faster or if it's part of the difference.

Regards,
Daniel N

Aug 30 2014

"Daniel N" <ufo orbiting.us> writes:

On Sunday, 31 August 2014 at 03:47:04 UTC, Daniel N wrote:
 I tried changing 'immutable' to 'enum' and adding 'pure 
 nothrow' annotations on the functions, got a decent speedup for 
 ldc, I currently don't have gdc installed, so I can't check if 
 these changes make gdc even faster or if it's part of the 
 difference.

 Regards,
 Daniel N

Hmm, sorry the execution time varies too wildly on my system, 
measuring error.

Aug 30 2014

"David Nadlinger" <code klickverbot.at> writes:

On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
 Introducing pragma(LDC_never_inline) slightly improves 
 execution time of the ldc result to 5.4s. Still, this is more 
 than three times slower than the fast version produced by 
 clang++.

I just tested this, and on the machine I ran this on the code LDC 
produces with inlining disabled is just as fast as the one 
produced by Clang.

---
$ ldc2 -version
LDC - the LLVM D compiler (c8b9f6):
   based on DMD v2.066 and LLVM 3.5.0
   Default target: x86_64-unknown-linux-gnu
   Host CPU: corei7-avx
[…]

$ clang++ --version
clang version 3.5.0 (tags/RELEASE_350/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
---

David

Oct 11 2014

"David Nadlinger" <code klickverbot.at> writes:

On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
 Introducing pragma(LDC_never_inline) slightly improves 
 execution time of the ldc result to 5.4s. Still, this is more 
 than three times slower than the fast version produced by 
 clang++.

I just tested this, and on the machine I ran this on the code LDC 
produces with inlining disabled is just as fast as the one 
produced by Clang.

---
$ ldc2 -version
LDC - the LLVM D compiler (c8b9f6):
   based on DMD v2.066 and LLVM 3.5.0
   Default target: x86_64-unknown-linux-gnu
   Host CPU: corei7-avx
[…]

$ clang++ --version
clang version 3.5.0 (tags/RELEASE_350/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
---

David

Oct 11 2014

"Fool" <fool dlang.org> writes:

Hi David,

thank you for checking this case.

Currently, I'm using

LDC - the LLVM D compiler (0.14.0):
   based on DMD v2.065 and LLVM 3.4.2
   Default target: x86_64-unknown-linux-gnu
   Host CPU: core-avx2

If waiting for the next version of ldc is all I have to do that's 
good news. :-)

Thank you again, kind greetings,
Fool

On Saturday, 11 October 2014 at 20:04:59 UTC, David Nadlinger 
wrote:
 ---
 $ ldc2 -version
 LDC - the LLVM D compiler (c8b9f6):
   based on DMD v2.066 and LLVM 3.5.0
   Default target: x86_64-unknown-linux-gnu
   Host CPU: corei7-avx
 […]

 $ clang++ --version
 clang version 3.5.0 (tags/RELEASE_350/final)
 Target: x86_64-unknown-linux-gnu
 Thread model: posix
 ---

Oct 11 2014

D Programming

C/C++ Programming

Other

digitalmars.D.ldc - Performance problem in reverse algorithm