digitalmars.D.ldc - A small comparison
- bearophile (122/122) Nov 20 2013 I've found a post about a functional library for LuaJIT code:
- jerro (12/29) Nov 20 2013 If I use core.stdc.math.sin and compile the code with:
- bearophile (6/28) Nov 21 2013 I guess the back-end is very similar, and it's able to optimize
- Kai Nacke (13/17) Nov 22 2013 The difference between std.math.sin and core.stdc.math.sin is
- Kai Nacke (6/8) Dec 13 2013 The answer is simple: it is not yet implemented. Just look at the
I've found a post about a functional library for LuaJIT code:
http://rtsisyk.github.io/luafun/intro.html

This dynamically typed Lua code:

require "fun" ()
n = 100
x = sum(map(function(x) return x^2 end, take(n, tabulate(math.sin))))
-- calculate sum(sin(x)^2 for x in 0..n-1)
print(x)
50.011981355266

Gets compiled to:

->LOOP:
0bcaffd0  movsd [rsp+0x8], xmm7
0bcaffd6  addsd xmm4, xmm5
0bcaffda  ucomisd xmm6, xmm1
0bcaffde  jnb 0x0bca0028  ->6
0bcaffe4  addsd xmm6, xmm0
0bcaffe8  addsd xmm7, xmm0
0bcaffec  fld qword [rsp+0x8]
0bcafff0  fsin
0bcafff2  fstp qword [rsp]
0bcafff5  movsd xmm5, [rsp]
0bcafffa  mulsd xmm5, xmm5
0bcafffe  jmp 0x0bcaffd0  ->LOOP

So I've written a similar D version using Phobos:

void main() {
    import std.stdio, std.algorithm, std.range;
    enum n = 100;
    immutable double r = n.iota.map!(x => x.sin ^^ 2).reduce!q{a + b};
    r.writeln;
}

LDC2 compiles it to (32 bit):

LBB0_1:
    fxch
    fstpt 44(%esp)
    fld %st(0)
    fstpt (%esp)
    fld1
    faddp %st(1)
    fstpt 32(%esp)
    calll __D3std4math3sinFNaNbNfeZe
    subl $12, %esp
    fmul %st(0), %st(0)
    fldt 44(%esp)
    faddp %st(1)
    fstpt 44(%esp)
    fldt 32(%esp)
    fldt 44(%esp)
    decl %esi
    fxch
    jne LBB0_1

As you can see, the D version doesn't use the fsin instruction like LuaJIT does; it calls a library function instead. If I import sin from core.stdc.math, ldc2 produces this (notice the use of xmm registers):

LBB0_1:
    movsd %xmm1, 32(%esp)
    movsd %xmm0, 40(%esp)
    movsd %xmm1, (%esp)
    calll _sin
    fstpl 48(%esp)
    movsd 48(%esp), %xmm0
    mulsd %xmm0, %xmm0
    movsd 40(%esp), %xmm1
    addsd %xmm0, %xmm1
    movsd %xmm1, 40(%esp)
    movsd 32(%esp), %xmm1
    movsd 40(%esp), %xmm0
    addsd LCPI0_0, %xmm1
    decl %esi
    jne LBB0_1

I have also written a normal for loop in both C and D. This is the C code:

#include "stdio.h"
#include "math.h"

int main() {
    double r = 0.0;
    int i;
    for (i = 0; i < 100; i++) {
        const double aux = sin(i);
        r += aux * aux;
    }
    printf("%f\n", r);
    return 0;
}

GCC compiles it to:

.L3:
    cvtsi2sd %ebx, %xmm0
    call sin
.L2:
    mulsd %xmm0, %xmm0
    addl $1, %ebx
    cmpl $100, %ebx
    addsd 8(%rsp), %xmm0
    movsd %xmm0, 8(%rsp)
    jne .L3

The Intel compiler compiles it to this (this one is even vectorized: __svml_sin2 and mulpd work on two doubles at a time):

    cvtdq2pd %xmm8, %xmm0
    call __svml_sin2
    mulpd %xmm0, %xmm0
    addb $2, %r12b
    paddd %xmm9, %xmm8
    addpd %xmm0, %xmm10
    cmpb $100, %r12b

In scientific code it's common to have small kernels that take most of the run-time of a program, so it's important to shave every instruction from such loops. Do you think Phobos/LDC2 are doing well enough here?

Bye,
bearophile
Nov 20 2013
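bearophile mentions writing the plain loop in both C and D but only shows the C source. A minimal D sketch of that loop (not taken from the thread; it uses the same C-library sine via core.stdc.math, so the kernel matches the C program) might look like this:

import core.stdc.math : sin;    // the C library sine, as in the C version
import core.stdc.stdio : printf;

void main() {
    double r = 0.0;
    // sum of sin(i)^2 for i in 0 .. 99, the same kernel as the C program
    foreach (i; 0 .. 100) {
        immutable double aux = sin(i);
        r += aux * aux;
    }
    printf("%f\n", r);
}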
> void main() {
>     import std.stdio, std.algorithm, std.range;
>     enum n = 100;
>     immutable double r = n.iota.map!(x => x.sin ^^ 2).reduce!q{a + b};
>     r.writeln;
> }

If I use core.stdc.math.sin and compile the code with:

gdc c.d -o c -O3 -fno-bounds-check -frelease

I get:

  404e80:  cvtsi2sd %ebx,%xmm0
  404e84:  callq  403070 <sin@plt>
  404e89:  mulsd  %xmm0,%xmm0
  404e8d:  add    $0x1,%ebx
  404e90:  cmp    $0x64,%ebx
  404e93:  addsd  0x28(%rsp),%xmm0
  404e99:  movsd  %xmm0,0x28(%rsp)
  404e9f:  jne    404e80 <_Dmain+0x60>

This is almost exactly the same code as:

> .L3:
>     cvtsi2sd %ebx, %xmm0
>     call sin
> .L2:
>     mulsd %xmm0, %xmm0
>     addl $1, %ebx
>     cmpl $100, %ebx
>     addsd 8(%rsp), %xmm0
>     movsd %xmm0, 8(%rsp)
>     jne .L3
Nov 20 2013
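jerro doesn't post the modified source. Assuming he only swapped std.math.sin for the C-library sine in bearophile's range-based version, the program he compiled with gdc presumably looked roughly like this sketch (the two map steps keep a single sin call per element without relying on ^^):

void main() {
    import std.stdio, std.algorithm, std.range;
    import core.stdc.math : sin;    // C library sine instead of std.math.sin

    enum n = 100;
    // sum of sin(x)^2 for x in 0 .. n-1, the same reduction as before
    immutable double r = n.iota
                          .map!(x => sin(x))
                          .map!(s => s * s)
                          .reduce!q{a + b};
    r.writeln;
}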
jerro:

> If I use core.stdc.math.sin and compile the code with:
>
> gdc c.d -o c -O3 -fno-bounds-check -frelease
>
> I get:
>
>   404e80:  cvtsi2sd %ebx,%xmm0
>   404e84:  callq  403070 <sin@plt>
>   404e89:  mulsd  %xmm0,%xmm0
>   404e8d:  add    $0x1,%ebx
>   404e90:  cmp    $0x64,%ebx
>   404e93:  addsd  0x28(%rsp),%xmm0
>   404e99:  movsd  %xmm0,0x28(%rsp)
>   404e9f:  jne    404e80 <_Dmain+0x60>
>
> This is almost exactly the same code as:
>
>> .L3:
>>     cvtsi2sd %ebx, %xmm0
>>     call sin
>> .L2:
>>     mulsd %xmm0, %xmm0
>>     addl $1, %ebx
>>     cmpl $100, %ebx
>>     addsd 8(%rsp), %xmm0
>>     movsd %xmm0, 8(%rsp)
>>     jne .L3

I guess the back-end is very similar, and it's able to optimize the code well :-)

What's the difference between std.math.sin and core.stdc.math.sin?

Bye,
bearophile
Nov 21 2013
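As a concrete illustration of the language-level difference (a sketch, not from the thread; the codegen side is covered in Kai's reply below): std.math.sin is a D function with real/double/float overloads, while core.stdc.math.sin is a binding to the C library's double-precision function.

import std.stdio;
static import std.math;        // D implementation, overloaded for real/double/float
static import core.stdc.math;  // extern(C) binding to the C library's sin(double)

void main() {
    real x = 1.0L;

    // std.math.sin has a real overload and is pure/nothrow/@safe -- that is
    // the _D3std4math3sinFNaNbNfeZe symbol in the x87 listing above -- so
    // the computation can stay at 80-bit precision on x86.
    writeln(std.math.sin(x));

    // core.stdc.math.sin is double sin(double): the argument is narrowed to
    // double and the work is done by the C runtime (the `calll _sin` above).
    writeln(core.stdc.math.sin(x));
}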
Hi bearophile!

On Wednesday, 20 November 2013 at 22:12:29 UTC, bearophile wrote:

> In scientific code it's common to have small kernels that take most of
> the run-time of a program, so it's important to shave every instruction
> from such loops. Do you think Phobos/LDC2 are doing well enough here?

The difference between std.math.sin and core.stdc.math.sin is that
std.math.sin is mapped to the LLVM intrinsic llvm.sin, while
core.stdc.math.sin is the sine function from the C library.

I would expect the LLVM intrinsic to be replaced with fsin (at least with
-m32). I am unsure why this does not happen. My understanding is that the
auto-vectorizer should kick in here (see
http://llvm.org/docs/Vectorizers.html). I really need to dig deeper to
understand what's happening in LLVM.

For sure, Phobos/LDC2 should do better here!!!

Regards,
Kai
Nov 22 2013
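One way to check what LDC actually emits for std.math.sin is to compile a tiny probe and dump the LLVM IR and the assembly (a sketch, not from the thread; -c, -O3, -m32, -output-ll and -output-s are the usual ldc2 options for this, but check ldc2 -help for your version):

// sin_probe.d -- hypothetical file name, minimal probe
import std.math : sin;

double kernel(double x) {
    return sin(x) * sin(x);
}

// Suggested invocations:
//   ldc2 -c -O3 -m32 -output-ll sin_probe.d   # look for calls to llvm.sin.* in the .ll file
//   ldc2 -c -O3 -m32 -output-s  sin_probe.d   # check whether fsin appears in the .s file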
On Friday, 22 November 2013 at 21:44:28 UTC, Kai Nacke wrote:

> I would expect the LLVM intrinsic to be replaced with fsin (at least
> with -m32). I am unsure why this does not happen.

The answer is simple: it is not yet implemented. Just look at the slides
from the last LLVM developer meeting:

http://www.llvm.org/devmtg/2013-11/slides/Rotem-Vectorization.pdf

Regards,
Kai
Dec 13 2013