digitalmars.D.ldc - LDC 1.5-1.6 huge degradation of optimization
- Igor Shirkalin (80/80) Nov 27 2017 Hello!
- kinke (12/17) Nov 27 2017 LDC 1.5 and 1.6 both come with LLVM 5.0.0, so it looks as if this
- Igor Shirkalin (4/23) Nov 30 2017 It's now obvious the reason of this regression is LLVM 5.0.0
- kinke (7/10) Nov 30 2017 5.0.0 *is* the latest released version. 5.0.1 is about to be
- Joakim (3/13) Nov 30 2017 I too read it the way you did, that llvm needs to be updated
- Igor Shirkalin (2/17) Nov 30 2017 Right. Exactly what I meant. Excuse for some irrational English.
- kinke (5/23) Dec 01 2017 No worries. One performance regression is by no means enough to
- Johan Engelen (16/21) Nov 30 2017 This will need a lot more investigation to figure out what is
- Igor Shirkalin (7/23) Nov 30 2017 I'm almost sure that the problem is in new LLVM.
Hello!
I have found that LDC 1.5-1.6 generate unoptimized code in some
cases where LDC 1.3-1.4 did not. I tried to extract an example
and make it as short as possible. The goal is to get compiled
code that uses AVX2 instructions.
Here is the source of tst.d, with comments that demonstrate the
problem.
// tst.d
import ldc.attributes;
// ldc 1.3-1.4 generate highly optimized code with avx2 instructions
// ldc 1.5-1.6 generate code without any vector instructions
// the command line: ldc2 tst.d -m32 -O3 -release -output-s
alias Arr = ubyte[16][20]; // 20 vectors of 16 ubytes each
@target("avx2") @nogc pure
auto distance(ref const Arr t1, ref const Arr t2)
{
    int[20] res = void;
    int sum;
    foreach (t, ref r; res) {
        int sv = 0;
        // the main loop, to be optimized with avx2 instructions
        foreach (i; 0 .. 16)
            sv += (t1[t][i] - t2[t][i]) ^^ 2;
        r = sv;
        // uncommenting the following assignment turns the avx2
        // optimization back on in ldc 1.6
        // sum += sv;
    }
    return sum + res[10]; // return some dummy sum
}
/* ldc1.3 (avx2 instructions are used)
LBB0_1:
vpmovzxbd -8(%ecx), %ymm0
vpmovzxbd -8(%eax), %ymm1
vpmovzxbd (%eax), %ymm2
addl $16, %eax
vpsubd %ymm1, %ymm0, %ymm0
vpmovzxbd (%ecx), %ymm1
addl $16, %ecx
vpmulld %ymm0, %ymm0, %ymm0
vpsubd %ymm2, %ymm1, %ymm1
vpmulld %ymm1, %ymm1, %ymm1
vpaddd %ymm0, %ymm1, %ymm0
vextracti128 $1, %ymm0, %xmm1
vpaddd %ymm1, %ymm0, %ymm0
vpshufd $78, %xmm0, %xmm1
vpaddd %ymm1, %ymm0, %ymm0
vphaddd %ymm0, %ymm0, %ymm0
vmovd %xmm0, (%esp,%edx,4)
incl %edx
cmpl $20, %edx
jb LBB0_1
*/
/* ldc1.6 (avx2 instructions aren't used)
LBB0_1:
movl %edx, (%esp)
movzbl -15(%ecx), %esi
movzbl -15(%eax), %edx
movzbl -14(%ecx), %edi
subl %edx, %esi
movzbl -14(%eax), %edx
imull %esi, %esi
... ; skipped
imull %ebp, %ebp
addl %ebp, %esi
movl 36(%esp), %ebp
imull %ebp, %ebp
addl %ebp, %esi
movl 32(%esp), %ebp
... ; skipped
imull %edx, %edx
addl %esi, %edx
movl (%esp), %esi
movl %edx, 48(%esp,%esi,4)
movl (%esp), %edx
incl %edx
cmpl $20, %edx
jb LBB0_1
*/
Nov 27 2017
On Monday, 27 November 2017 at 10:41:44 UTC, Igor Shirkalin wrote:
> Hello!
> I have found that LDC1.5-1.6 generate unoptimized code in
> contrast to LDC1.3-1.4 in some cases. I tried to extract the
> example and make it as short as possible. The goal is to get the
> compiled code with avx(2) instructions.

LDC 1.5 and 1.6 both come with LLVM 5.0.0, so it looks as if this
is an LLVM regression. This can be shown by compiling to
unoptimized textual LLVM IR (that's the LLVM IR LDC generates,
before LLVM optimizations) and comparing it across LDC versions.
I did that for LDC 1.4 and 1.6, and the relevant IR is identical:

ldc2-1.4.0-win64-msvc\bin\ldc2 -release -output-ll perf.d -of=perf_1.4.ll
ldc2-1.6.0-win64-msvc\bin\ldc2 -release -output-ll perf.d -of=perf_1.6.ll
<compare files perf_1.4.ll and perf_1.6.ll>
Nov 27 2017
On Monday, 27 November 2017 at 13:21:04 UTC, kinke wrote:
> On Monday, 27 November 2017 at 10:41:44 UTC, Igor Shirkalin wrote:
>> Hello!
>> I have found that LDC1.5-1.6 generate unoptimized code in
>> contrast to LDC1.3-1.4 in some cases. I tried to extract the
>> example and make it as short as possible. The goal is to get the
>> compiled code with avx(2) instructions.
> LDC 1.5 and 1.6 both come with LLVM 5.0.0, so it looks as if this
> is an LLVM regression. This can be shown by compiling to
> unoptimized textual LLVM IR (that's the LLVM IR LDC generates,
> before LLVM optimizations) and comparing it across LDC versions.
> I did that for LDC 1.4 and 1.6, and the relevant IR is identical:
> ldc2-1.4.0-win64-msvc\bin\ldc2 -release -output-ll perf.d -of=perf_1.4.ll
> ldc2-1.6.0-win64-msvc\bin\ldc2 -release -output-ll perf.d -of=perf_1.6.ll
> <compare files perf_1.4.ll and perf_1.6.ll>

It's now obvious the reason of this regression is LLVM 5.0.0.
Does it mean it's not time to move to latest LLVM for the latest LDC?
Nov 30 2017
On Thursday, 30 November 2017 at 15:25:31 UTC, Igor Shirkalin wrote:
> It's now obvious the reason of this regression is LLVM 5.0.0.
> Does it mean it's not time to move to latest LLVM for the latest LDC?

5.0.0 *is* the latest released version. 5.0.1 is about to be
released these days, but whether it'll fix this issue is uncertain.
As is whether it's fixed in current LLVM master (6.0.0). LLVM is a
huge piece of software, bugs and regressions are to be expected.
Nov 30 2017
On Thursday, 30 November 2017 at 16:01:19 UTC, kinke wrote:
> On Thursday, 30 November 2017 at 15:25:31 UTC, Igor Shirkalin wrote:
>> It's now obvious the reason of this regression is LLVM 5.0.0.
>> Does it mean it's not time to move to latest LLVM for the latest LDC?
> 5.0.0 *is* the latest released version. 5.0.1 is about to be
> released these days, but whether it'll fix this issue is uncertain.
> As is whether it's fixed in current LLVM master (6.0.0). LLVM is a
> huge piece of software, bugs and regressions are to be expected.

I too read it the way you did, that llvm needs to be updated
forward, but I think he meant ldc should stick with 4.0.1 for now.
Nov 30 2017
On Thursday, 30 November 2017 at 22:58:10 UTC, Joakim wrote:
> On Thursday, 30 November 2017 at 16:01:19 UTC, kinke wrote:
>> On Thursday, 30 November 2017 at 15:25:31 UTC, Igor Shirkalin wrote:
>>> It's now obvious the reason of this regression is LLVM 5.0.0.
>>> Does it mean it's not time to move to latest LLVM for the latest LDC?
>> 5.0.0 *is* the latest released version. 5.0.1 is about to be
>> released these days, but whether it'll fix this issue is uncertain.
>> As is whether it's fixed in current LLVM master (6.0.0). LLVM is a
>> huge piece of software, bugs and regressions are to be expected.
> I too read it the way you did, that llvm needs to be updated
> forward, but I think he meant ldc should stick with 4.0.1 for now.

Right. Exactly what I meant. Excuse for some irrational English.
Nov 30 2017
On Friday, 1 December 2017 at 04:30:52 UTC, Igor Shirkalin wrote:
> On Thursday, 30 November 2017 at 22:58:10 UTC, Joakim wrote:
>> On Thursday, 30 November 2017 at 16:01:19 UTC, kinke wrote:
>>> On Thursday, 30 November 2017 at 15:25:31 UTC, Igor Shirkalin wrote:
>>>> It's now obvious the reason of this regression is LLVM 5.0.0.
>>>> Does it mean it's not time to move to latest LLVM for the latest LDC?
>>> 5.0.0 *is* the latest released version. 5.0.1 is about to be
>>> released these days, but whether it'll fix this issue is uncertain.
>>> As is whether it's fixed in current LLVM master (6.0.0). LLVM is a
>>> huge piece of software, bugs and regressions are to be expected.
>> I too read it the way you did, that llvm needs to be updated
>> forward, but I think he meant ldc should stick with 4.0.1 for now.
> Right. Exactly what I meant. Excuse for some irrational English.

No worries. One performance regression is by no means enough to
convince me to step back, especially since anyone is free to
compile LDC himself and use LLVM versions as old as 3.7 if they like.
Dec 01 2017
On Monday, 27 November 2017 at 10:41:44 UTC, Igor Shirkalin wrote:
> Hello!
> I have found that LDC1.5-1.6 generate unoptimized code in
> contrast to LDC1.3-1.4 in some cases. I tried to extract the
> example and make it as short as possible. The goal is to get the
> compiled code with avx(2) instructions.

This will need a lot more investigation to figure out what is
going wrong. It could be that the optimization pipeline set up by
LDC needs to be adjusted for newer LLVM versions, or that extra
annotations are needed.

Some notes:
- It's strange that adding the calculation of `sum` leads to an
  overall more optimized output with AVX2 instructions (good that
  you found out about that!).
- It would help if you find a C/C++ equivalent to show the problem
  to LLVM devs (gcc.godbolt.org has all relevant LLVM/Clang versions)
- The optimization is fragile also in LDC 1.4: manual unrolling of
  the inner loop somehow removes the AVX2 optimizations.
  https://godbolt.org/g/NF3eHf
  Did I make a mistake?

-Johan
Nov 30 2017
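Editor's note: for readers who want to try the C/C++ route suggested above, here is a minimal sketch of what a C port of the D `distance` function might look like. It is an assumption made for illustration, not the reproducer from this thread or from any bug report. The names `Arr`, `ROWS`, `COLS` and the struct wrapper are hypothetical, chosen only to mirror the D alias `ubyte[16][20]`.

```c
#include <stdint.h>

enum { ROWS = 20, COLS = 16 };

/* Hypothetical stand-in for the D alias `ubyte[16][20]`:
   20 rows of 16 bytes each. */
typedef struct { uint8_t v[ROWS][COLS]; } Arr;

/* Hypothetical C port of the D `distance` function from tst.d. */
int distance(const Arr *t1, const Arr *t2)
{
    int res[ROWS];
    int sum = 0;
    for (int t = 0; t < ROWS; ++t) {
        int sv = 0;
        /* inner loop that the optimizer is expected to vectorize */
        for (int i = 0; i < COLS; ++i) {
            int d = t1->v[t][i] - t2->v[t][i];
            sv += d * d;
        }
        res[t] = sv;
        /* kept commented out, as in the D original */
        /* sum += sv; */
    }
    return sum + res[10]; /* dummy result, as in the D original */
}
```

Compiling this with something like `clang -O3 -mavx2 -m32 -S` under different Clang versions and comparing the assembly is one way to check whether the behaviour carries over to C. Whether this particular sketch shows the regression has not been verified here.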
On Thursday, 30 November 2017 at 08:56:31 UTC, Johan Engelen wrote:
> On Monday, 27 November 2017 at 10:41:44 UTC, Igor Shirkalin wrote:
> This will need a lot more investigation to figure out what is
> going wrong. It could be that the optimization pipeline set up by
> LDC needs to be adjusted for newer LLVM versions, or that extra
> annotations are needed.

I'm almost sure that the problem is in new LLVM.

> Some notes:
> - It's strange that adding the calculation of `sum` leads to an
>   overall more optimized output with AVX2 instructions (good that
>   you found out about that!).
> - It would help if you find a C/C++ equivalent to show the problem
>   to LLVM devs (gcc.godbolt.org has all relevant LLVM/Clang versions)

Yes, I have found it for clang:
https://bugs.llvm.org/show_bug.cgi?id=35448

> - The optimization is fragile also in LDC 1.4: manual unrolling of
>   the inner loop somehow removes the AVX2 optimizations.
>   https://godbolt.org/g/NF3eHf
>   Did I make a mistake?

I've noticed that manual unrolling usually doesn't help to
vectorize the code.

> -Johan
Nov 30 2017
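Editor's note: to make "manual unrolling of the inner loop" concrete, here is a hypothetical C sketch with the 16-element inner loop unrolled by 4. It is not the code behind the godbolt link above, whose contents are not reproduced in this thread. The names `Arr`, `sq_diff` and `distance_unrolled` are invented for illustration, and the unroll factor is an arbitrary choice.

```c
#include <stdint.h>

enum { ROWS = 20, COLS = 16 };

/* Hypothetical stand-in for the D alias `ubyte[16][20]`. */
typedef struct { uint8_t v[ROWS][COLS]; } Arr;

/* Squared difference of two bytes, computed in int. */
static int sq_diff(uint8_t a, uint8_t b)
{
    int d = a - b;
    return d * d;
}

/* Hypothetical variant of `distance` with the inner loop unrolled by 4. */
int distance_unrolled(const Arr *t1, const Arr *t2)
{
    int res[ROWS];
    for (int t = 0; t < ROWS; ++t) {
        const uint8_t *p = t1->v[t];
        const uint8_t *q = t2->v[t];
        int sv = 0;
        /* four elements per iteration instead of one */
        for (int i = 0; i < COLS; i += 4) {
            sv += sq_diff(p[i],     q[i]);
            sv += sq_diff(p[i + 1], q[i + 1]);
            sv += sq_diff(p[i + 2], q[i + 2]);
            sv += sq_diff(p[i + 3], q[i + 3]);
        }
        res[t] = sv;
    }
    return res[10]; /* same dummy result as the rolled version */
}
```

The result is the same as with the plain loop. Only the shape of the code handed to the optimizer changes, and that shape is what the remarks above suggest can make vectorization more or less likely.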








