digitalmars.D - From a C++/JS benchmark
- bearophile (9/9) Aug 03 2011 The benchmark info:
- Denis Shelomovskij (24/33) Aug 03 2011 Compilers:
- Ziad Hatahet (6/44) Aug 03 2011 I believe that "long" in this case is 32 bits in C++, and 64-bits in the
- Denis Shelomovskij (7/51) Aug 03 2011 Good! This is my first blunder (it's so easy to completely forget
- Adam D. Ruppe (2/3) Aug 03 2011 Is this Windows XP 32 bit or 64 bit? That will probably make
- David Nadlinger (3/6) Aug 03 2011 It doesn't, long is 32-bit wide on Windows x86_64 too (LLP64).
- Marco Leise (4/11) Aug 03 2011 I thought he was referring to the processor being able to handle 64-bit ...
- Adam Ruppe (5/8) Aug 04 2011 I was thinking a little of both but this is the main thing. My
- Denis Shelomovskij (3/6) Aug 03 2011 I meant Windows XP 32 bit (5.1 (Build 2600: Service Pack 3)) (according
- bearophile (13/17) Aug 03 2011 Languages aren't slow or fast, their implementations produce assembly th...
- Trass3r (1/4) Aug 03 2011 I'm afraid not. dmd's backend isn't good at floating point calculations.
- bearophile (86/87) Aug 03 2011 Studying a bit the asm it's not hard to find the cause, because this ben...
- Trass3r (17/17) Aug 03 2011 C++:
- Trass3r (3/11) Aug 03 2011 D ldc:
- Trass3r (3/18) Aug 05 2011 D gdc with added -frelease -fno-bounds-check:
- bearophile (5/10) Aug 05 2011 I'd like to know why the GCC back-end is able to produce a more efficien...
- Trass3r (1/4) Aug 05 2011 I attached both asm versions ;)
- bearophile (5/10) Aug 03 2011 Are you able and willing to show me the asm produced by gdc? There's a p...
- Trass3r (1/2) Aug 04 2011
- bearophile (513/517) Aug 04 2011 In the bla.rar attach there's the unstripped Linux binary, so to read th...
- Adam Ruppe (8/10) Aug 04 2011 I find AT&T syntax to be almost impossible to read, but it looks
- Don (8/20) Aug 05 2011 They do that to implement Position Independent Code: you need to know
- Trass3r (0/4) Aug 05 2011
- bearophile (21/25) Aug 05 2011 You are a person of few words :-) Thank you for the asm.
- Trass3r (2/4) Aug 05 2011 Ok, now I get up to 32930000 skinned vertices per second.
- Iain Buclaw (8/18) Aug 06 2011 Notes from me:
- bearophile (4/5) Aug 06 2011 The remaining thing to look at is just the small performance difference ...
- Iain Buclaw (14/19) Aug 06 2011 Three things that helped improve performance in a minor way for me:
- bearophile (9/22) Aug 06 2011 Really? I don't remember discussions about it. What is its cause?
- Iain Buclaw (9/21) Aug 06 2011 I can't remember the exact discussion, but it was something about a benc...
- bearophile (25/28) Aug 06 2011 With DMD I have seen 180k -> 190k vertices/sec replacing this:
- Walter Bright (2/5) Aug 06 2011 A dynamic array is two values being passed, a pointer is one.
- bearophile (32/33) Aug 06 2011 I know, but I think there are many optimization opportunities. An exampl...
- Iain Buclaw (16/41) Aug 06 2011 normally, and then perform the call like this (with DMD it gives the sam...
- bearophile (7/12) Aug 06 2011 In newer versions of GCC -Ofast means -ffast-math too.
- Walter Bright (3/4) Aug 06 2011 No, I am not. Few understand the subtleties of IEEE arithmetic, and brea...
- bearophile (5/10) Aug 06 2011 I have read several papers about FP arithmetic, but I am not an expert y...
- Eric Poggel (JoeCoder) (8/18) Aug 07 2011 Floating point determinism can be very important when it comes to
- bearophile (7/11) Aug 08 2011 It seems a hard thing to obtain, but I agree that it gets useful.
- Eric Poggel (JoeCoder) (9/20) Aug 08 2011 You'd be surprised how much I lurk here. I agree there are some
- Trass3r (23/34) Aug 07 2011 64Bit:
The benchmark info:
http://chadaustin.me/2011/01/digging-into-javascript-performance/
https://github.com/chadaustin/Web-Benchmarks/
The C++/JS/Java code runs on a single core. A D version (struct inheritance!): http://ideone.com/kf1tz

Bye, bearophile
Aug 03 2011
03.08.2011 18:20, bearophile:
> The benchmark info:
> http://chadaustin.me/2011/01/digging-into-javascript-performance/
> https://github.com/chadaustin/Web-Benchmarks/
> The C++/JS/Java code runs on a single core. A D version (struct inheritance!): http://ideone.com/kf1tz

Compilers:
C++: cl /O2 /Oi /Ot /Oy /GT /GL and link /STACK:10240000
Java: Oracle Java 1.6 with hm... Oracle default settings
D2: dmd -O -noboundscheck -inline -release

Type column: working scalar type. Other columns: vertices per second (inaccuracy is about 1%) by language (tests from bearophile's message, C++ test is "skinning_test_no_simd.cpp").
System: Windows XP, Core 2 Duo E6850

-----------------------------------------------------------
Type   | C++        | Java       | JavaScript | D (dmd)
-----------------------------------------------------------
float  | 31_400_000 | 17_000_000 | 14_700_000 |    168_000
double | 32_300_000 | 16_000_000 | 14_100_000 |    166_000
real   | 32_300_000 | no real    | no real    |    203_000
int    | 29_100_000 | 14_600_000 | 14_100_000 | 16_500_000
long   | 29_100_000 |  6_600_000 |  4_400_000 |  5_800_000
-----------------------------------------------------------

JavaScript vs C++ speed is at the first link of bearophile's original post, and there JS is about 10-20 times slower than C++. Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10 times faster than D in floating point calculations!? Please, tell me that I'm mistaken.
Aug 03 2011
I believe that "long" in this case is 32 bits in C++, and 64 bits in the remaining languages, hence the same result for int and long in C++. Try with "long long" maybe? :)

-- Ziad

2011/8/3 Denis Shelomovskij <verylonglogin.reg gmail.com>:
> Type column: working scalar type. Other columns: vertices per second
> [table snipped]
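[A minimal D sketch, not from the original thread, illustrating the size mismatch Ziad describes: core.stdc.config.c_long is D's alias for the platform C compiler's long, while D's own long is always 64 bits.]

import core.stdc.config : c_long;
import std.stdio;

void main()
{
    // D's long is always 64 bits, on every platform
    writeln("D long: ", long.sizeof * 8, " bits");
    // c_long mirrors the C compiler's long: 32 bits on 32-bit targets
    // and on Win64 (LLP64), 64 bits on Linux/x86_64 (LP64)
    writeln("C long: ", c_long.sizeof * 8, " bits");
}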
Aug 03 2011
03.08.2011 22:15, Ziad Hatahet:
> I believe that "long" in this case is 32 bits in C++, and 64 bits in the
> remaining languages, hence the same result for int and long in C++. Try
> with "long long" maybe? :)

Good! This is my first blunder (it's so easy to completely forget illogical (for me) language design). So, corrected last row:

-------------------------------------------------------------
long   |  5_500_000 |  6_600_000 |  4_400_000 |  5_800_000
-------------------------------------------------------------

Java is the fastest "long" language :)
Aug 03 2011
> System: Windows XP, Core 2 Duo E6850

Is this Windows XP 32 bit or 64 bit? That will probably make a difference on the longs, I'd expect.
Aug 03 2011
On 8/3/11 9:48 PM, Adam D. Ruppe wrote:
> Is this Windows XP 32 bit or 64 bit? That will probably make
> a difference on the longs, I'd expect.

It doesn't, long is 32-bit wide on Windows x86_64 too (LLP64).

David
Aug 03 2011
Am 03.08.2011, 21:52 Uhr, schrieb David Nadlinger <see klickverbot.at>:
> It doesn't, long is 32-bit wide on Windows x86_64 too (LLP64).

I thought he was referring to the processor being able to handle 64-bit ints more efficiently in 64-bit operation mode on a 64-bit OS with 64-bit executables.
Aug 03 2011
Marco Leise wrote:
> I thought he was referring to the processor being able to handle 64-bit
> ints more efficiently in 64-bit operation mode on a 64-bit OS with
> 64-bit executables.

I was thinking a little of both, but this is the main thing. My suspicion was that Java might have been using a 64 bit JVM and everything else was compiled in 32 bit, causing it to win in that place. But with a 32 bit OS, that means 32 bit programs all around.
Aug 04 2011
03.08.2011 22:48, Adam D. Ruppe writes:
> Is this Windows XP 32 bit or 64 bit? That will probably make
> a difference on the longs, I'd expect.

I meant Windows XP 32 bit (5.1 (Build 2600: Service Pack 3)) (according to what "Windows XP" means in Wikipedia).
Aug 03 2011
Denis Shelomovskij:
> (tests from bearophile's message, C++ test is "skinning_test_no_simd.cpp").

For a more realistic test I suggest you to time the C++ version that uses the intrinsics too (only for float).

> Looks like a spiteful joke... In other words: WTF?! JavaScript is about
> 10 times faster than D in floating point calculations!? Please, tell me
> that I'm mistaken.

Languages aren't slow or fast, their implementations produce assembly that's more or less efficient. A D1 version fit for LDC V1 with Tango: http://codepad.org/ewDy31UH

Vertices (millions), Linux 32 bit:
C++ no simd: 29.5
D: 27.6

LDC based on DMD v1.057 and llvm 2.6, ldc -O3 -release -inline
G++ V4.3.3, -s -O3 -mfpmath=sse -ffast-math -msse3

It's a bit slower than the C++ version, but for most people that's an acceptable loss (and using a more modern LLVM you reduce that loss a bit).

Bye, bearophile
Aug 03 2011
> Looks like a spiteful joke... In other words: WTF?! JavaScript is about
> 10 times faster than D in floating point calculations!? Please, tell me
> that I'm mistaken.

I'm afraid not. dmd's backend isn't good at floating point calculations.
Aug 03 2011
Trass3r:
> I'm afraid not. dmd's backend isn't good at floating point calculations.

Studying a bit the asm it's not hard to find the cause, because this benchmark is quite pure (synthetic, though I think it comes from real-world code). This is what G++ generates from the C++ code without intrinsics (the version that uses SIMD intrinsics has a similar look, but it's shorter):

movl (%eax), %edx
movss 4(%eax), %xmm0
movl 8(%eax), %ecx
leal (%edx,%edx,2), %edx
sall $4, %edx
addl %ebx, %edx
testl %ecx, %ecx
movss 12(%edx), %xmm1
movss 20(%edx), %xmm7
movss (%edx), %xmm5
mulss %xmm0, %xmm1
mulss %xmm0, %xmm7
movss 4(%edx), %xmm6
movss 8(%edx), %xmm4
movss %xmm1, (%esp)
mulss %xmm0, %xmm5
movss 28(%edx), %xmm1
movss %xmm7, 4(%esp)
mulss %xmm0, %xmm6
movss 32(%edx), %xmm7
mulss %xmm0, %xmm1
movss 16(%edx), %xmm3
mulss %xmm0, %xmm7
movss 24(%edx), %xmm2
movss %xmm1, 16(%esp)
mulss %xmm0, %xmm4
movss 36(%edx), %xmm1
movss %xmm7, 8(%esp)
mulss %xmm0, %xmm3
movss 40(%edx), %xmm7
mulss %xmm0, %xmm2
mulss %xmm0, %xmm1
mulss %xmm0, %xmm7
mulss 44(%edx), %xmm0
leal 12(%eax), %edx
movss %xmm7, 12(%esp)
movss %xmm0, 20(%esp)

This is what DMD generates for the same (or quite similar) piece of code:

movsd
mov EAX,068h[ESP]
imul EDX,EAX,030h
add EDX,018h[ESP]
fld float ptr [EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 038h[ESP]
fld float ptr 4[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 03Ch[ESP]
fld float ptr 8[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 040h[ESP]
fld float ptr 0Ch[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 044h[ESP]
fld float ptr 010h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 048h[ESP]
fld float ptr 014h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 04Ch[ESP]
fld float ptr 018h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 050h[ESP]
fld float ptr 01Ch[EDX]
mov CL,070h[ESP]
xor CL,1
fmul float ptr 06Ch[ESP]
fstp float ptr 054h[ESP]
fld float ptr 020h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 058h[ESP]
fld float ptr 024h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 05Ch[ESP]
fld float ptr 028h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 060h[ESP]
fld float ptr 02Ch[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 064h[ESP]

I think the DMD back-end already contains logic to use xmm registers as true registers (not as a floating point stack or as temporary slots to push and pull FP values through), so I suspect it doesn't take too much work to modify it to emit FP asm with a single optimization: just keep the values inside registers. In my uninformed opinion all other FP optimizations are almost insignificant compared to this one :-)

Bye, bearophile
Aug 03 2011
C++: Skinned vertices per second: 48660000
C++ no SIMD: Skinned vertices per second: 42420000
D dmd: Skinned vertices per second: 159046
D gdc: Skinned vertices per second: 23450000

Compilers:
gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)
g++ -s -O3 -mfpmath=sse -ffast-math -march=native
DMD64 D Compiler v2.054
dmd -O -noboundscheck -inline -release dver.d
gcc version 4.6.1 20110627 (gdc 0.30, using dmd 2.054) (GCC)
gdc -s -O3 -mfpmath=sse -ffast-math -march=native dver.d

Ubuntu 11.04 x64, Core2 Duo E6300
Aug 03 2011
> C++: Skinned vertices per second: 48660000
> C++ no SIMD: Skinned vertices per second: 42420000
> D dmd: Skinned vertices per second: 159046
> D gdc: Skinned vertices per second: 23450000

D ldc: Skinned vertices per second: 37910000

ldc2 -O3 -release -enable-inlining dver.d
Aug 03 2011
Am 04.08.2011, 04:07 Uhr, schrieb Trass3r <un known.com>:
> D ldc: Skinned vertices per second: 37910000
> ldc2 -O3 -release -enable-inlining dver.d

D gdc with added -frelease -fno-bounds-check:
Skinned vertices per second: 37710000
Aug 05 2011
Trass3r:
> C++ no SIMD: Skinned vertices per second: 42420000
> ...
> D gdc with added -frelease -fno-bounds-check:
> Skinned vertices per second: 37710000

I'd like to know why the GCC back-end is able to produce a more efficient binary from the C++ code (compared to the D code), but now the problem is not as large as before. It seems I've found a benchmark coming from real-world code that's a worst case for DMD (GDC here produces code about 237 times faster than DMD).

Bye, bearophile
Aug 05 2011
> I'd like to know why the GCC back-end is able to produce a more efficient
> binary from the C++ code (compared to the D code), but now the problem is
> not as large as before.

I attached both asm versions ;)
Aug 05 2011
Trass3r:
> C++ no SIMD: Skinned vertices per second: 42420000
> ...
> D gdc: Skinned vertices per second: 23450000

Are you able and willing to show me the asm produced by gdc? There's a problem there.

Bye, bearophile
Aug 03 2011
> Are you able and willing to show me the asm produced by gdc? There's a
> problem there.

[attach bla.rar]
Aug 04 2011
Trass3r:In the bla.rar attach there's the unstripped Linux binary, so to read the asm I have used the objdump disassembler. But are you willing and able to show me the asm before it gets assembled? (with gcc you do it with the -S switch). (I also suggest to use only the C standard library, with time() and printf() to produce a smaller asm output: http://codepad.org/12EUo16J ). Using objdump I see it uses 16 xmm registers, this is the main routine. But what's the purpose of those callq? They seem to call the successive asm instruction. The x86 asm of this routine contains jumps only and no "call". The asm of this routine is also very long, I don't know why yet. I see too many instructions like "movss 0x80(%rsp), %xmm7" this looks like a problem. _calculateVerticesAndNormals: push %r15 push %r14 push %r13 push %r12 push %rbp push %rbx sub $0x268, %rsp mov 0x2a0(%rsp), %rax mov %rdi, 0xe8(%rsp) mov %rsi, 0xe0(%rsp) mov %rcx, 0x128(%rsp) mov %r8, 0x138(%rsp) mov %rax, 0xf0(%rsp) mov 0x2a8(%rsp), %rax mov %rdi, 0x180(%rsp) mov %rsi, 0x188(%rsp) mov %rcx, 0x170(%rsp) mov %rax, 0xf8(%rsp) mov 0x2b0(%rsp), %rax mov %r8, 0x178(%rsp) mov %rax, 0x130(%rsp) mov 0x2b8(%rsp), %rax mov %rax, 0x140(%rsp) mov %rcx, %rax add %rax, %rax cmp 0x130(%rsp), %rax je 74d <_calculateVerticesAndNormals+0xcd> mov $0x57, %edx mov $0x6, %edi mov $0x0, %esi movq $0x6, 0x190(%rsp) movq $0x0, 0x198(%rsp) callq 74d <_calculateVerticesAndNormals+0xcd> cmpq $0x0, 0x128(%rsp) je 1317 <_calculateVerticesAndNormals+0xc97> movq $0x1, 0x120(%rsp) xor %r15d, %r15d movq $0x0, 0x100(%rsp) movslq %r15d, %r12 cmp %r12, 0xf0(%rsp) movq $0x0, 0x108(%rsp) jbe f1d <_calculateVerticesAndNormals+0x89d> nopl 0x0(%rax) lea (%r12, %r12, 2), %rax shl $0x2, %rax mov %rax, 0x148(%rsp) mov 0xf8(%rsp), %rax add 0x148(%rsp), %rax movss 0x4(%rax), %xmm9 movzbl 0x8(%rax), %r13d movslq (%rax), %rax cmp 0xe8(%rsp), %rax jae f50 <_calculateVerticesAndNormals+0x8d0> lea (%rax, %rax, 2), %rax shl $0x4, %rax mov %rax, 0x110(%rsp) mov 0xe0(%rsp), %rbx add 0x110(%rsp), %rbx je 12af <_calculateVerticesAndNormals+0xc2f> movss (%rbx), %xmm7 test %r13b, %r13b movss 0x4(%rbx), %xmm8 movss 0x8(%rbx), %xmm6 mulss %xmm9, %xmm7 movss 0xc(%rbx), %xmm11 mulss %xmm9, %xmm8 movss 0x10(%rbx), %xmm4 mulss %xmm9, %xmm6 movss 0x14(%rbx), %xmm5 mulss %xmm9, %xmm11 movss 0x18(%rbx), %xmm3 mulss %xmm9, %xmm4 movss 0x1c(%rbx), %xmm10 mulss %xmm9, %xmm5 movss 0x20(%rbx), %xmm1 mulss %xmm9, %xmm3 movss 0x24(%rbx), %xmm2 mulss %xmm9, %xmm10 movss 0x28(%rbx), %xmm0 mulss %xmm9, %xmm1 mulss %xmm9, %xmm2 mulss %xmm9, %xmm0 mulss 0x2c(%rbx), %xmm9 jne cdb <_calculateVerticesAndNormals+0x65b> add $0x1, %r12 mov %r14, %rax lea (%r12, %r12, 2), %r13 shl $0x2, %r13 jmpq 99e <_calculateVerticesAndNormals+0x31e> nopl (%rax) mov %r13, %rax mov 0xf8(%rsp), %rdx add %rax, %rdx movss 0x4(%rdx), %xmm12 movzbl 0x8(%rdx), %r14d movslq (%rdx), %rdx cmp %rdx, 0xe8(%rsp) jbe aa0 <_calculateVerticesAndNormals+0x420> mov 0xe0(%rsp), %rbx lea (%rdx, %rdx, 2), %rbp shl $0x4, %rbp add %rbp, %rbx je baf <_calculateVerticesAndNormals+0x52f> movss (%rbx), %xmm13 add $0x1, %r12 add $0xc, %r13 test %r14b, %r14b mulss %xmm12, %xmm13 addss %xmm13, %xmm7 movss 0x4(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm8 movss 0x8(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm6 movss 0xc(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm11 movss 0x10(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm4 movss 0x14(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm5 movss 0x18(%rbx), %xmm13 mulss %xmm12, 
%xmm13 addss %xmm13, %xmm3 movss 0x1c(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm10 movss 0x20(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm1 movss 0x24(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm2 movss 0x28(%rbx), %xmm13 mulss %xmm12, %xmm13 mulss 0x2c(%rbx), %xmm12 addss %xmm13, %xmm0 addss %xmm12, %xmm9 jne cd8 <_calculateVerticesAndNormals+0x658> add $0x1, %r15d cmp %r12, 0xf0(%rsp) ja 890 <_calculateVerticesAndNormals+0x210> mov $0x63, %edx mov $0x6, %edi mov $0x0, %esi mov %rax, 0xc8(%rsp) movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movss %xmm9, 0x90(%rsp) movss %xmm10, 0xa0(%rsp) movss %xmm11, 0xb0(%rsp) movq $0x6, 0x1c0(%rsp) movq $0x0, 0x1c8(%rsp) callq a3b <_calculateVerticesAndNormals+0x3bb> mov 0xc8(%rsp), %rax movss (%rsp), %xmm0 movss 0x20(%rsp), %xmm1 movss 0x10(%rsp), %xmm2 movss 0x30(%rsp), %xmm3 movss 0x50(%rsp), %xmm4 movss 0x40(%rsp), %xmm5 movss 0x60(%rsp), %xmm6 movss 0x80(%rsp), %xmm7 movss 0x70(%rsp), %xmm8 movss 0x90(%rsp), %xmm9 movss 0xa0(%rsp), %xmm10 movss 0xb0(%rsp), %xmm11 jmpq 893 <_calculateVerticesAndNormals+0x213> nop mov $0x65, %edx mov $0x6, %edi mov $0x0, %esi mov %rax, 0xc8(%rsp) movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movss %xmm9, 0x90(%rsp) movss %xmm10, 0xa0(%rsp) movss %xmm11, 0xb0(%rsp) movss %xmm12, 0xd0(%rsp) movq $0x6, 0x1d0(%rsp) movq $0x0, 0x1d8(%rsp) callq b35 <_calculateVerticesAndNormals+0x4b5> mov 0xe0(%rsp), %rbx movss 0xd0(%rsp), %xmm12 movss 0xb0(%rsp), %xmm11 movss 0xa0(%rsp), %xmm10 add %rbp, %rbx movss 0x70(%rsp), %xmm8 movss 0x90(%rsp), %xmm9 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 mov 0xc8(%rsp), %rax jne 8d3 <_calculateVerticesAndNormals+0x253> mov $0x23, %r8d mov $0x6, %edx mov $0x0, %ecx mov $0x9, %edi mov $0x0, %esi movss %xmm0, (%rsp) mov %rax, 0xc8(%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movss %xmm9, 0x90(%rsp) movss %xmm10, 0xa0(%rsp) movss %xmm11, 0xb0(%rsp) movss %xmm12, 0xd0(%rsp) movq $0x6, 0x240(%rsp) movq $0x0, 0x248(%rsp) movq $0x9, 0x250(%rsp) movq $0x0, 0x258(%rsp) callq c67 <_calculateVerticesAndNormals+0x5e7> movss 0x70(%rsp), %xmm8 movss 0xd0(%rsp), %xmm12 movss 0xb0(%rsp), %xmm11 movss 0xa0(%rsp), %xmm10 movss 0x90(%rsp), %xmm9 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 mov 0xc8(%rsp), %rax jmpq 8d3 <_calculateVerticesAndNormals+0x253> nopl (%rax) mov %rax, %r14 mov 0x108(%rsp), %rax cmp %rax, 0x128(%rsp) jbe 11d0 <_calculateVerticesAndNormals+0xb50> shl $0x5, %rax mov %rax, 0x150(%rsp) mov 0x100(%rsp), %rax mov 0x138(%rsp), %rbx add %rax, %rax add 0x150(%rsp), %rbx cmp %rax, 0x130(%rsp) jbe 10e8 <_calculateVerticesAndNormals+0xa68> mov 0x100(%rsp), %rax shl $0x5, %rax mov %rax, 0x158(%rsp) movss 0x8(%rbx), %xmm12 movaps %xmm8, %xmm15 movss (%rbx), %xmm14 movss %xmm12, 0x11c(%rsp) movss 0x4(%rbx), %xmm13 
movaps %xmm7, %xmm12 mulss %xmm14, %xmm12 mov 0x140(%rsp), %rax mulss %xmm13, %xmm15 add 0x158(%rsp), %rax addss %xmm15, %xmm12 addss %xmm11, %xmm12 movl $0x0, 0xc(%rax) movss 0x11c(%rsp), %xmm11 mulss %xmm6, %xmm11 addss %xmm11, %xmm12 movaps %xmm4, %xmm11 mulss %xmm14, %xmm11 mulss %xmm1, %xmm14 movss %xmm12, (%rax) movaps %xmm5, %xmm12 mulss %xmm13, %xmm12 mulss %xmm2, %xmm13 addss %xmm12, %xmm11 addss %xmm13, %xmm14 addss %xmm10, %xmm11 movss 0x11c(%rsp), %xmm10 addss %xmm9, %xmm14 movss 0x11c(%rsp), %xmm9 mulss %xmm3, %xmm10 mulss %xmm0, %xmm9 addss %xmm10, %xmm11 addss %xmm9, %xmm14 movss %xmm11, 0x4(%rax) movss %xmm14, 0x8(%rax) mov 0x108(%rsp), %rax cmp %rax, 0x128(%rsp) jbe 1040 <_calculateVerticesAndNormals+0x9c0> shl $0x5, %rax mov %rax, 0x160(%rsp) mov 0x138(%rsp), %rbx mov 0x120(%rsp), %rax add 0x160(%rsp), %rbx cmp %rax, 0x130(%rsp) jbe f98 <_calculateVerticesAndNormals+0x918> shl $0x4, %rax mov %rax, 0x168(%rsp) movss 0x10(%rbx), %xmm10 add $0x1, %r15d movss 0x14(%rbx), %xmm11 mulss %xmm10, %xmm7 mov 0x140(%rsp), %rax mulss %xmm11, %xmm8 movss 0x18(%rbx), %xmm9 mulss %xmm11, %xmm5 mulss %xmm10, %xmm4 mulss %xmm11, %xmm2 add 0x168(%rsp), %rax addq $0x1, 0x100(%rsp) addss %xmm7, %xmm8 addq $0x2, 0x120(%rsp) addss %xmm4, %xmm5 mulss %xmm10, %xmm1 mulss %xmm9, %xmm6 movl $0x0, 0xc(%rax) mulss %xmm9, %xmm3 mulss %xmm9, %xmm0 addss %xmm1, %xmm2 addss %xmm6, %xmm8 addss %xmm3, %xmm5 addss %xmm0, %xmm2 movss %xmm8, (%rax) movss %xmm5, 0x4(%rax) movss %xmm2, 0x8(%rax) mov 0x100(%rsp), %rax cmp %rax, 0x128(%rsp) je 1317 <_calculateVerticesAndNormals+0xc97> movslq %r15d, %r12 mov %rax, 0x108(%rsp) cmp %r12, 0xf0(%rsp) ja 798 <_calculateVerticesAndNormals+0x118> mov $0x5d, %edx mov $0x6, %edi mov $0x0, %esi movq $0x6, 0x1a0(%rsp) movq $0x0, 0x1a8(%rsp) callq f49 <_calculateVerticesAndNormals+0x8c9> jmpq 7a8 <_calculateVerticesAndNormals+0x128> xchg %ax, %ax mov $0x5f, %edx mov $0x6, %edi mov $0x0, %esi movss %xmm9, 0x90(%rsp) movq $0x6, 0x1b0(%rsp) movq $0x0, 0x1b8(%rsp) callq f86 <_calculateVerticesAndNormals+0x906> movss 0x90(%rsp), %xmm9 jmpq 7e4 <_calculateVerticesAndNormals+0x164> nopl (%rax) mov $0x69, %edx mov $0x6, %edi mov $0x0, %esi movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movq $0x6, 0x210(%rsp) movq $0x0, 0x218(%rsp) callq ffd <_calculateVerticesAndNormals+0x97d> movss 0x70(%rsp), %xmm8 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 jmpq e59 <_calculateVerticesAndNormals+0x7d9> nopl 0x0(%rax, %rax, 1) mov $0x69, %edx mov $0x6, %edi mov $0x0, %esi movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movq $0x6, 0x200(%rsp) movq $0x0, 0x208(%rsp) callq 10a5 <_calculateVerticesAndNormals+0xa25> movss 0x70(%rsp), %xmm8 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 jmpq e27 <_calculateVerticesAndNormals+0x7a7> nopl 0x0(%rax, %rax, 1) mov $0x68, %edx mov $0x6, %edi mov $0x0, %esi movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss 
%xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movss %xmm9, 0x90(%rsp) movss %xmm10, 0xa0(%rsp) movss %xmm11, 0xb0(%rsp) movq $0x6, 0x1f0(%rsp) movq $0x0, 0x1f8(%rsp) callq 116b <_calculateVerticesAndNormals+0xaeb> movss 0x70(%rsp), %xmm8 movss 0xb0(%rsp), %xmm11 movss 0xa0(%rsp), %xmm10 movss 0x90(%rsp), %xmm9 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 jmpq d3a <_calculateVerticesAndNormals+0x6ba> nopw 0x0(%rax, %rax, 1) mov $0x68, %edx mov $0x6, %edi mov $0x0, %esi movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movss %xmm9, 0x90(%rsp) movss %xmm10, 0xa0(%rsp) movss %xmm11, 0xb0(%rsp) movq $0x6, 0x1e0(%rsp) movq $0x0, 0x1e8(%rsp) callq 1253 <_calculateVerticesAndNormals+0xbd3> movss 0x70(%rsp), %xmm8 movss 0xb0(%rsp), %xmm11 movss 0xa0(%rsp), %xmm10 movss 0x90(%rsp), %xmm9 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 jmpq cfd <_calculateVerticesAndNormals+0x67d> mov $0x12, %r8d mov $0x6, %edx mov $0x0, %ecx mov $0x9, %edi mov $0x0, %esi movss %xmm9, 0x90(%rsp) movq $0x6, 0x220(%rsp) movq $0x0, 0x228(%rsp) movq $0x9, 0x230(%rsp) movq $0x0, 0x238(%rsp) callq 1308 <_calculateVerticesAndNormals+0xc88> movss 0x90(%rsp), %xmm9 jmpq 7fa <_calculateVerticesAndNormals+0x17a> add $0x268, %rsp pop %rbx pop %rbp pop %r12 pop %r13 pop %r14 pop %r15 retq nopl 0x0(%rax) Bye, bearophileare you able and willing to show me the asm produced by gdc? There's a problem there.[attach bla.rar]
Aug 04 2011
> But what's the purpose of those callq? They seem to call the successive
> asm instruction.

I find AT&T syntax to be almost impossible to read, but it looks like they are comparing the instruction pointer for some reason. call works by pushing the instruction pointer on the stack, then jumping to the new address. By calling the next thing, you can then pop the instruction pointer off the stack and continue on where you left off. I don't know why they want this though. That AT&T syntax really messes with my brain...
Aug 04 2011
Adam Ruppe wrote:
> I find AT&T syntax to be almost impossible to read, but it looks
> like they are comparing the instruction pointer for some reason.
> call works by pushing the instruction pointer on the stack, then
> jumping to the new address. By calling the next thing, you can then
> pop the instruction pointer off the stack and continue on where you
> left off. I don't know why they want this though.

They do that to implement Position Independent Code: you need to know the instruction pointer to be able to access your data. Actually it has a terrible effect on performance, because it destroys the processor's return prediction mechanism (it guarantees multiple mispredictions). But it seems to be unavoidable -- I don't think it's possible to generate decent code for PIC on x86-32. But there should never be more than one call in a function.
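[The "get PC" idiom Don describes, written down as a hypothetical, untested sketch using DMD's x86-32 inline assembler; nothing here is code from the thread.]

// Returns the address of the Lhere label, i.e. the instruction pointer.
uint currentAddress()
{
    asm
    {
        naked;
        call Lhere;   // pushes the address of Lhere, then "jumps" to it
    Lhere:
        pop EAX;      // EAX = address of Lhere; uint result is returned in EAX
        ret;
    }
}
// The call above is never paired with a matching ret, which is exactly
// what confuses the CPU's return-address predictor.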
Aug 05 2011
> Are you willing and able to show me the asm before it gets assembled?
> (with gcc you do it with the -S switch). (I also suggest to use only the
> C standard library, with time() and printf() to produce a smaller asm
> output: http://codepad.org/12EUo16J ).
Aug 05 2011
Trass3r:
> Are you willing and able to show me the asm before it gets assembled?
> (with gcc you do it with the -S switch). (I also suggest to use only the
> C standard library, with time() and printf() to produce a smaller asm
> output: http://codepad.org/12EUo16J ).

You are a person of few words :-) Thank you for the asm.

Apparently the program was not compiled in release mode (or with -noboundscheck. With DMD it's the same thing, maybe with gdc it's not the same thing). It contains the calls, but they aren't to the next line, they were for the array bounds:

call _d_assert
call _d_array_bounds
call _d_array_bounds
call _d_assert_msg
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_assert_msg

But I think this doesn't fully explain the low performance, I have seen too many instructions like:

movss DWORD PTR [rsp+32], xmm1
movss DWORD PTR [rsp+16], xmm2
movss DWORD PTR [rsp+48], xmm3

If you want to go on with this exploration, then I suggest you to find a way to disable bound tests.

Bye, bearophile
Aug 05 2011
> If you want to go on with this exploration, then I suggest you to find a
> way to disable bound tests.

Ok, now I get up to 32930000 skinned vertices per second. Still a bit worse than LDC.
Aug 05 2011
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> Are you able and willing to show me the asm produced by gdc? There's a
> problem there.

Notes from me:
- Options -fno-bounds-check and -frelease can be just as important in GDC as they are in DMD under certain instances.
- You can output asm in intel dialect using -masm=intel if at&t is that difficult for you to read. 8-)

I will look into this later from my workstation.
Aug 06 2011
Iain Buclaw:
> I will look into this later from my workstation.

The remaining thing to look at is just the small performance difference between the D-GDC version and the C++-G++ version.

Bye, bearophile
Aug 06 2011
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> The remaining thing to look at is just the small performance difference
> between the D-GDC version and the C++-G++ version.

Three things that helped improve performance in a minor way for me:
1) using pointers over dynamic arrays. (5% speedup)
2) removing the calls to CalVector4's constructor (5.7% speedup)
3) using core.stdc.time over std.datetime. (1.6% speedup)

Point one is a pretty well known issue in D as far as I'm aware. Point two is not an issue with inlining (all methods are marked 'inline'), but it did help remove quite a few movss instructions being emitted. Point three is interesting, it seems that "sw.peek().msecs" slows down the number of iterations in the while loop.

With those changes, the D implementation is still 21% slower than the C++ implementation without SIMD.

http://ideone.com/4PP2D
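[Point three is easy to reproduce. A minimal, hypothetical D sketch (not Iain's actual code) of the cheaper timing loop using core.stdc.time instead of std.datetime's StopWatch; the kernel call is a placeholder for the benchmark's real work.]

import core.stdc.stdio : printf;
import core.stdc.time : clock, clock_t, CLOCKS_PER_SEC;

void benchmarkKernel()
{
    // placeholder for calculateVerticesAndNormals(...)
}

void main()
{
    long iterations = 0;
    const clock_t start = clock();
    // clock() is a cheap C call, so polling it each iteration
    // distorts the measured loop far less than sw.peek().msecs
    while (clock() - start < CLOCKS_PER_SEC) // run for ~1 second
    {
        benchmarkKernel();
        ++iterations;
    }
    printf("iterations/sec: %lld\n", iterations);
}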
Aug 06 2011
Iain Buclaw:
> Three things that helped improve performance in a minor way for me:
> 1) using pointers over dynamic arrays. (5% speedup)
> 2) removing the calls to CalVector4's constructor (5.7% speedup)
> 3) using core.stdc.time over std.datetime. (1.6% speedup)

Are you using GDC2-64 bit on Linux?

> Point one is a pretty well known issue in D as far as I'm aware.

Really? I don't remember discussions about it. What is its cause?

> Point two is not an issue with inlining (all methods are marked
> 'inline'), but it did help remove quite a few movss instructions being
> emitted.

This too is something worth fixing. Is this issue in Bugzilla already?

> Point three is interesting, it seems that "sw.peek().msecs" slows down
> the number of iterations in the while loop.

This needs to be fixed.

> With those changes, the D implementation is still 21% slower than the C++
> implementation without SIMD. http://ideone.com/4PP2D

This is a lot still. Thank you for your work. I think all three issues are worth fixing, eventually.

Bye, bearophile
Aug 06 2011
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> Are you using GDC2-64 bit on Linux?

GDC2-32 bit on Linux.

> Really? I don't remember discussions about it. What is its cause?

I can't remember the exact discussion, but it was something about a benchmark of passing by value vs passing by ref vs passing by pointer.

> This too is something worth fixing. Is this issue in Bugzilla already?

I don't think it's an issue really. But of course, there is a difference between what you say and what you mean with regards to the code here (that being, with the first version, lots of temp vars get created and moved around the place).

Regards
Iain
Aug 06 2011
Iain Buclaw:
> 1) using pointers over dynamic arrays. (5% speedup)
> 2) removing the calls to CalVector4's constructor (5.7% speedup)

With DMD I have seen 180k -> 190k vertices/sec replacing this:

struct CalVector4 {
    float X, Y, Z, W;

    this(float x, float y, float z, float w = 0.0f) {
        X = x;
        Y = y;
        Z = z;
        W = w;
    }
}

With:

struct CalVector4 {
    float X, Y, Z, W = 0.0f;
}

I'd like the D compiler to optimize better there.

> http://ideone.com/4PP2D

This line of code is not good:

auto vertices = cast(Vertex*) new Vertex[N];

This is much better, it's less bug-prone, simpler and shorter:

auto vertices = (new Vertex[N]).ptr;

But in practice in this program it is enough to allocate dynamic arrays normally, and then perform the call like this (with DMD it gives the same performance):

calculateVerticesAndNormals(boneTransforms.ptr, N, vertices.ptr, influences.ptr, output.ptr);

I don't know why passing pointers gives some more performance here, compared to passing dynamic arrays (but I have seen the same behaviour in other D programs of mine).

Bye, bearophile
Aug 06 2011
On 8/6/2011 3:19 PM, bearophile wrote:
> I don't know why passing pointers gives some more performance here,
> compared to passing dynamic arrays (but I have seen the same behaviour
> in other D programs of mine).

A dynamic array is two values being passed, a pointer is one.
Aug 06 2011
Walter:
> A dynamic array is two values being passed, a pointer is one.

I know, but I think there are many optimization opportunities. An example:

private void foo(int[] a2) {}
void main() {
    int[100] a1;
    foo(a1);
}

In code like that I think a D compiler is free to compile it like this, because foo is private, so it's free to perform optimizations based on just the code inside the module:

private void foo(ref int[100] a2) {}
void main() {
    int[100] a1;
    foo(a1);
}

I think there are several cases where a D compiler is free to replace the two values with just a pointer. Another example, to optimize code like this:

private void foo(int[] a1, int[] a2) {}
void main() {
    int n = 100; // run-time value
    auto a3 = new int[n];
    auto a4 = new int[n];
    foo(a3, a4);
}

Into something like this:

private void foo(int* a1, int* a2, size_t a1a2len) {}
void main() {
    int n = 100;
    auto a3 = new int[n];
    auto a4 = new int[n];
    foo(a3.ptr, a4.ptr, n);
}

Bye, bearophile
Aug 06 2011
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> This line of code is not good:
> auto vertices = cast(Vertex*) new Vertex[N];

I was playing about with heap vs stack. Must've forgot to remove that, sorry. :)

Anyways, I've tweaked the GDC codegen, and program speed meets that of C++ now (on my system).

Implementation: http://ideone.com/0j0L1

Command-line:
gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

Best times:
G++-32bit: 11400000 vps
GDC-32bit: 11350000 vps

Regards
Iain
Iain Buclaw:
> Anyways, I've tweaked the GDC codegen, and program speed meets that of
> C++ now (on my system).

Are you willing to explain your changes (and maybe give a link to the changes)? Maybe Walter is interested for DMD too.

> Command-line:
> gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
> g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

In newer versions of GCC -Ofast means -ffast-math too.

Walter is not a lover of that -ffast-math switch. But I now think that the combination of D strongly pure functions with unsafe FP optimizations offers optimization opportunities that maybe not even GCC is able to use now when it compiles C/C++ code (do you see why?). Not using this opportunity is a waste, in my opinion.

Bye, bearophile
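[A small hypothetical D sketch, not from the thread, to make bearophile's hint concrete: a strongly pure function's result depends only on its arguments, so the compiler may reuse a prior result for a repeated call without inspecting the body -- something a C++ compiler has to prove from the code. Unsafe FP flags then widen what may be reordered around such calls.]

// Strongly pure: the result depends only on the arguments,
// and no mutable global state can be read or written.
pure nothrow float poly(float x)
{
    return x * x * x + 2.0f * x;
}

void main()
{
    float a = 1.5f;
    // Purity alone lets a D compiler fold the two calls into one
    // (an exact, IEEE-safe transformation); with -ffast-math-style
    // flags it may additionally reorder the surrounding FP math.
    float r = poly(a) + poly(a);
}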
Aug 06 2011
On 8/6/2011 4:46 PM, bearophile wrote:
> Walter is not a lover of that -ffast-math switch.

No, I am not. Few understand the subtleties of IEEE arithmetic, and breaking IEEE conformance is something very, very few should even consider.
Aug 06 2011
Walter:
> No, I am not. Few understand the subtleties of IEEE arithmetic, and
> breaking IEEE conformance is something very, very few should even
> consider.

I have read several papers about FP arithmetic, but I am not an expert yet on them. Both GDC and LDC have compilation switches to perform those unsafe FP optimizations, so even if you don't like them, most D compilers today have them optional, and I don't think those switches will be removed.

If you want to simulate a flock of boids (http://en.wikipedia.org/wiki/Boids ) on the screen using D, and you use floating point values to represent their speed vector, introducing unsafe FP optimizations will not harm so much. Video games are a significant purpose for the D language, and in them FP errors are often benign (maybe some parts of the game are able to tolerate them and some other part of the game needs to be compiled with strict FP semantics).

Bye, bearophile
Aug 06 2011
On 8/6/2011 8:34 PM, bearophile wrote:
> Video games are a significant purpose for the D language, and in them FP
> errors are often benign (maybe some parts of the game are able to
> tolerate them and some other part of the game needs to be compiled with
> strict FP semantics).

Floating point determinism can be very important when it comes to reducing network traffic. If you can achieve it, then you can make sure all players have the same game state and then only send user input commands over the network.

Glenn Fiedler has an interesting writeup on it, but I haven't had a chance to read all of it yet: http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/
Aug 07 2011
Eric Poggel (JoeCoder):
> Floating point determinism can be very important when it comes to
> reducing network traffic. If you can achieve it, then you can make sure
> all players have the same game state and then only send user input
> commands over the network.

It seems a hard thing to obtain, but I agree that it gets useful. For me having some FP determinism is useful for debugging: to avoid results from changing randomly if I perform a tiny change in the source code that triggers a change in what optimizations the compiler does. But there are several situations (if I am writing a ray tracer?) where FP determinism is not required in my release build.

I was not arguing about removing FP rules from the D compiler, just that there are situations where relaxing those FP rules, on request, doesn't seem to harm. I am not an expert about the risks Walter was talking about, so maybe I'm just walking on thin ice (but no one will get hurt if my little raytracer produces some errors in its images).

You don't come often in this newsgroup, thank you for the link :-)

Bye, bearophile
Aug 08 2011
On 8/8/2011 3:02 PM, bearophile wrote:
> It seems a hard thing to obtain, but I agree that it gets useful.
> [...]
> You don't come often in this newsgroup, thank you for the link :-)

You'd be surprised how much I lurk here. I agree there are some interesting areas where fast floating point may indeed be worth it, but I also don't know enough.

I've also wondered about creating a Fixed!(long, 8) struct that would let me work with longs and 8 bits of precision after the decimal point, as a way of having equal precision anywhere in a large universe and achieving determinism at the same time. But I don't know how performance would compare vs floats or doubles.
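[A minimal sketch of what such a type could look like -- hypothetical code, not something Eric posted; the names and the truncating arithmetic are assumptions. All operations are plain integer math, so results are bit-identical across machines:]

struct Fixed(T, int fracBits)
{
    T raw; // stored value, scaled by 2^fracBits

    static Fixed fromDouble(double v)
    {
        Fixed f;
        f.raw = cast(T)(v * (cast(T)1 << fracBits));
        return f;
    }

    Fixed opBinary(string op)(Fixed rhs) const if (op == "+" || op == "-")
    {
        Fixed f;
        f.raw = cast(T)(mixin("raw " ~ op ~ " rhs.raw"));
        return f;
    }

    Fixed opBinary(string op : "*")(Fixed rhs) const
    {
        Fixed f;
        f.raw = cast(T)((raw * rhs.raw) >> fracBits); // truncating multiply
        return f;
    }

    double toDouble() const
    {
        return cast(double)raw / (cast(T)1 << fracBits);
    }
}

void main()
{
    alias Fixed!(long, 8) FP;
    auto a = FP.fromDouble(1.5);
    auto b = FP.fromDouble(2.25);
    assert((a * b).toDouble() == 3.375); // exact, deterministic everywhere
}

The multiply is one integer multiply plus a shift, so the raw cost should be comparable to a float multiply; what is lost is the FPU's automatic rounding and dynamic range.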
Aug 08 2011
> Anyways, I've tweaked the GDC codegen, and program speed meets that of
> C++ now (on my system).
>
> Best times:
> G++-32bit: 11400000 vps
> GDC-32bit: 11350000 vps

64Bit:

C++:
45010000 44270000 42740000 43900000 44680000 43490000 42390000

GDC:
42900000 44010000 44000000 44010000 44010000 44000000

GDC with -fno-bounds-check:
43280000 44440000 44420000 44340000 44440000 44450000
Aug 07 2011