digitalmars.D - From a C++/JS benchmark
- bearophile (9/9) Aug 03 2011 The benchmark info:
- Denis Shelomovskij (24/33) Aug 03 2011 Compilers:
- Ziad Hatahet (6/44) Aug 03 2011 I believe that "long" in this case is 32 bits in C++, and 64-bits in the
- Denis Shelomovskij (7/51) Aug 03 2011 Good! This is my first blunder (it's so easy to completely forget
- Adam D. Ruppe (2/3) Aug 03 2011 Is this Windows XP 32 bit or 64 bit? That will probably make
- David Nadlinger (3/6) Aug 03 2011 It doesn't, long is 32-bit wide on Windows x86_64 too (LLP64).
- Marco Leise (4/11) Aug 03 2011 I thought he was referring to the processor being able to handle 64-bit ...
- Adam Ruppe (5/8) Aug 04 2011 I was thinking a little of both but this is the main thing. My
- Denis Shelomovskij (3/6) Aug 03 2011 I meant Windows XP 32 bit (5.1 (Build 2600: Service Pack 3)) (according
- bearophile (13/17) Aug 03 2011 Languages aren't slow or fast, their implementations produce assembly th...
- Trass3r (1/4) Aug 03 2011 I'm afraid not. dmd's backend isn't good at floating point calculations.
- bearophile (86/87) Aug 03 2011 Studying a bit the asm it's not hard to find the cause, because this ben...
- Trass3r (17/17) Aug 03 2011 C++:
- Trass3r (3/11) Aug 03 2011 D ldc:
- Trass3r (3/18) Aug 05 2011 D gdc with added -frelease -fno-bounds-check:
- bearophile (5/10) Aug 05 2011 I'd like to know why the GCC back-end is able to produce a more efficien...
- Trass3r (1/4) Aug 05 2011 I attached both asm versions ;)
- bearophile (5/10) Aug 03 2011 Are you able and willing to show me the asm produced by gdc? There's a p...
- Trass3r (1/2) Aug 04 2011
- bearophile (513/517) Aug 04 2011 In the bla.rar attach there's the unstripped Linux binary, so to read th...
- Adam Ruppe (8/10) Aug 04 2011 I find AT&T syntax to be almost impossible to read, but it looks
- Don (8/20) Aug 05 2011 They do that to implement Position Independent Code: you need to know
- Trass3r (0/4) Aug 05 2011
- bearophile (21/25) Aug 05 2011 You are a person of few words :-) Thank you for the asm.
- Trass3r (2/4) Aug 05 2011 Ok, now I get up to 32930000 skinned vertices per second.
- Iain Buclaw (8/18) Aug 06 2011 Notes from me:
- bearophile (4/5) Aug 06 2011 The remaining thing to look at is just the small performance difference ...
- Iain Buclaw (14/19) Aug 06 2011 Three things that helped improve performance in a minor way for me:
- bearophile (9/22) Aug 06 2011 Really? I don't remember discussions about it. What is its cause?
- Iain Buclaw (9/21) Aug 06 2011 I can't remember the exact discussion, but it was something about a benc...
- bearophile (25/28) Aug 06 2011 With DMD I have seen 180k -> 190k vertices/sec replacing this:
- Walter Bright (2/5) Aug 06 2011 A dynamic array is two values being passed, a pointer is one.
- bearophile (32/33) Aug 06 2011 I know, but I think there are many optimization opportunities. An exampl...
- Iain Buclaw (16/41) Aug 06 2011 normally, and then perform the call like this (with DMD it gives the sam...
- bearophile (7/12) Aug 06 2011 In newer versions of GCC -Ofast means -ffast-math too.
- Walter Bright (3/4) Aug 06 2011 No, I am not. Few understand the subtleties of IEEE arithmetic, and brea...
- bearophile (5/10) Aug 06 2011 I have read several papers about FP arithmetic, but I am not an expert y...
- Eric Poggel (JoeCoder) (8/18) Aug 07 2011 Floating point determinism can be very important when it comes to
- bearophile (7/11) Aug 08 2011 It seems a hard thing to obtain, but I agree that it gets useful.
- Eric Poggel (JoeCoder) (9/20) Aug 08 2011 You'd be surprised how much I lurk here. I agree there are some
- Trass3r (23/34) Aug 07 2011 64Bit:
The benchmark info:
http://chadaustin.me/2011/01/digging-into-javascript-performance/
https://github.com/chadaustin/Web-Benchmarks/
The C++/JS/Java code runs on a single core. A D version (struct inheritance!): http://ideone.com/kf1tz

Bye, bearophile
Aug 03 2011
03.08.2011 18:20, bearophile:
> The benchmark info:
> http://chadaustin.me/2011/01/digging-into-javascript-performance/
> https://github.com/chadaustin/Web-Benchmarks/
> The C++/JS/Java code runs on a single core. A D version (struct inheritance!): http://ideone.com/kf1tz

Compilers:
C++: cl /O2 /Oi /Ot /Oy /GT /GL and link /STACK:10240000
Java: Oracle Java 1.6 with hm... Oracle default settings
D2: dmd -O -noboundscheck -inline -release

Type column: working scalar type. Other columns: vertices per second (inaccuracy is about 1%) by language (tests from bearophile's message, C++ test is "skinning_test_no_simd.cpp").
System: Windows XP, Core 2 Duo E6850

-----------------------------------------------------------
Type   | C++        | Java       | JavaScript | D (dmd)
-----------------------------------------------------------
float  | 31_400_000 | 17_000_000 | 14_700_000 |    168_000
double | 32_300_000 | 16_000_000 | 14_100_000 |    166_000
real   | 32_300_000 | no real    | no real    |    203_000
int    | 29_100_000 | 14_600_000 | 14_100_000 | 16_500_000
long   | 29_100_000 |  6_600_000 |  4_400_000 |  5_800_000
-----------------------------------------------------------

JavaScript vs C++ speed is at the first link of bearophile's original post, and there JS is about 10-20 times slower than C++. Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10 times faster than D in floating point calculations!? Please, tell me that I'm mistaken.
Aug 03 2011
I believe that "long" in this case is 32 bits in C++, and 64 bits in the remaining languages, hence the same result for int and long in C++. Try with "long long" maybe? :)

-- Ziad

2011/8/3 Denis Shelomovskij <verylonglogin.reg gmail.com>:
> Type column: working scalar type. Other columns: vertices per second
> [table snipped]
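[A minimal D sketch, not from the original thread, illustrating the size mismatch Ziad describes: core.stdc.config.c_long is D's alias for the platform C compiler's long, while D's own long is always 64 bits.]

import core.stdc.config : c_long;
import std.stdio;

void main()
{
    // D's long is always 64 bits, on every platform
    writeln("D long: ", long.sizeof * 8, " bits");
    // c_long mirrors the C compiler's long: 32 bits on 32-bit targets
    // and on Win64 (LLP64), 64 bits on Linux/x86_64 (LP64)
    writeln("C long: ", c_long.sizeof * 8, " bits");
}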
Aug 03 2011
03.08.2011 22:15, Ziad Hatahet:
> I believe that "long" in this case is 32 bits in C++, and 64 bits in the
> remaining languages, hence the same result for int and long in C++. Try
> with "long long" maybe? :)

Good! This is my first blunder (it's so easy to completely forget illogical (for me) language design). So, corrected last row:

-------------------------------------------------------------
long   |  5_500_000 |  6_600_000 |  4_400_000 |  5_800_000
-------------------------------------------------------------

Java is the fastest "long" language :)
Aug 03 2011
> System: Windows XP, Core 2 Duo E6850

Is this Windows XP 32 bit or 64 bit? That will probably make a difference on the longs, I'd expect.
Aug 03 2011
On 8/3/11 9:48 PM, Adam D. Ruppe wrote:
> Is this Windows XP 32 bit or 64 bit? That will probably make
> a difference on the longs, I'd expect.

It doesn't, long is 32-bit wide on Windows x86_64 too (LLP64).

David
Aug 03 2011
Am 03.08.2011, 21:52 Uhr, schrieb David Nadlinger <see klickverbot.at>:
> It doesn't, long is 32-bit wide on Windows x86_64 too (LLP64).

I thought he was referring to the processor being able to handle 64-bit ints more efficiently in 64-bit operation mode on a 64-bit OS with 64-bit executables.
Aug 03 2011
Marco Leise wrote:
> I thought he was referring to the processor being able to handle 64-bit
> ints more efficiently in 64-bit operation mode on a 64-bit OS with
> 64-bit executables.

I was thinking a little of both, but this is the main thing. My suspicion was that Java might have been using a 64 bit JVM and everything else was compiled in 32 bit, causing it to win in that place. But with a 32 bit OS, that means 32 bit programs all around.
Aug 04 2011
03.08.2011 22:48, Adam D. Ruppe writes:
> Is this Windows XP 32 bit or 64 bit? That will probably make
> a difference on the longs, I'd expect.

I meant Windows XP 32 bit (5.1 (Build 2600: Service Pack 3)) (according to what "Windows XP" means in Wikipedia).
Aug 03 2011
Denis Shelomovskij:
> (tests from bearophile's message, C++ test is "skinning_test_no_simd.cpp").

For a more realistic test I suggest you to time the C++ version that uses the intrinsics too (only for float).

> Looks like a spiteful joke... In other words: WTF?! JavaScript is about
> 10 times faster than D in floating point calculations!? Please, tell me
> that I'm mistaken.

Languages aren't slow or fast, their implementations produce assembly that's more or less efficient. A D1 version fit for LDC V1 with Tango: http://codepad.org/ewDy31UH

Vertices (millions), Linux 32 bit:
C++ no simd: 29.5
D: 27.6

LDC based on DMD v1.057 and llvm 2.6, ldc -O3 -release -inline
G++ V4.3.3, -s -O3 -mfpmath=sse -ffast-math -msse3

It's a bit slower than the C++ version, but for most people that's an acceptable loss (and using a more modern LLVM you reduce that loss a bit).

Bye, bearophile
Aug 03 2011
> Looks like a spiteful joke... In other words: WTF?! JavaScript is about
> 10 times faster than D in floating point calculations!? Please, tell me
> that I'm mistaken.

I'm afraid not. dmd's backend isn't good at floating point calculations.
Aug 03 2011
Trass3r:
> I'm afraid not. dmd's backend isn't good at floating point calculations.

Studying a bit the asm it's not hard to find the cause, because this benchmark is quite pure (synthetic, though I think it comes from real-world code). This is what G++ generates from the C++ code without intrinsics (the version that uses SIMD intrinsics has a similar look, but it's shorter):

movl (%eax), %edx
movss 4(%eax), %xmm0
movl 8(%eax), %ecx
leal (%edx,%edx,2), %edx
sall $4, %edx
addl %ebx, %edx
testl %ecx, %ecx
movss 12(%edx), %xmm1
movss 20(%edx), %xmm7
movss (%edx), %xmm5
mulss %xmm0, %xmm1
mulss %xmm0, %xmm7
movss 4(%edx), %xmm6
movss 8(%edx), %xmm4
movss %xmm1, (%esp)
mulss %xmm0, %xmm5
movss 28(%edx), %xmm1
movss %xmm7, 4(%esp)
mulss %xmm0, %xmm6
movss 32(%edx), %xmm7
mulss %xmm0, %xmm1
movss 16(%edx), %xmm3
mulss %xmm0, %xmm7
movss 24(%edx), %xmm2
movss %xmm1, 16(%esp)
mulss %xmm0, %xmm4
movss 36(%edx), %xmm1
movss %xmm7, 8(%esp)
mulss %xmm0, %xmm3
movss 40(%edx), %xmm7
mulss %xmm0, %xmm2
mulss %xmm0, %xmm1
mulss %xmm0, %xmm7
mulss 44(%edx), %xmm0
leal 12(%eax), %edx
movss %xmm7, 12(%esp)
movss %xmm0, 20(%esp)

This is what DMD generates for the same (or quite similar) piece of code:

movsd
mov EAX,068h[ESP]
imul EDX,EAX,030h
add EDX,018h[ESP]
fld float ptr [EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 038h[ESP]
fld float ptr 4[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 03Ch[ESP]
fld float ptr 8[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 040h[ESP]
fld float ptr 0Ch[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 044h[ESP]
fld float ptr 010h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 048h[ESP]
fld float ptr 014h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 04Ch[ESP]
fld float ptr 018h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 050h[ESP]
fld float ptr 01Ch[EDX]
mov CL,070h[ESP]
xor CL,1
fmul float ptr 06Ch[ESP]
fstp float ptr 054h[ESP]
fld float ptr 020h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 058h[ESP]
fld float ptr 024h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 05Ch[ESP]
fld float ptr 028h[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 060h[ESP]
fld float ptr 02Ch[EDX]
fmul float ptr 06Ch[ESP]
fstp float ptr 064h[ESP]

I think the DMD back-end already contains logic to use xmm registers as true registers (not as a floating point stack or as temporary slots to push and pull FP values through), so I suspect it doesn't take too much work to modify it to emit FP asm with a single optimization: just keep the values inside registers. In my uninformed opinion all other FP optimizations are almost insignificant compared to this one :-)

Bye, bearophile
Aug 03 2011
C++: Skinned vertices per second: 48660000
C++ no SIMD: Skinned vertices per second: 42420000
D dmd: Skinned vertices per second: 159046
D gdc: Skinned vertices per second: 23450000

Compilers:
gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)
g++ -s -O3 -mfpmath=sse -ffast-math -march=native
DMD64 D Compiler v2.054
dmd -O -noboundscheck -inline -release dver.d
gcc version 4.6.1 20110627 (gdc 0.30, using dmd 2.054) (GCC)
gdc -s -O3 -mfpmath=sse -ffast-math -march=native dver.d

Ubuntu 11.04 x64, Core2 Duo E6300
Aug 03 2011
> C++: Skinned vertices per second: 48660000
> C++ no SIMD: Skinned vertices per second: 42420000
> D dmd: Skinned vertices per second: 159046
> D gdc: Skinned vertices per second: 23450000

D ldc: Skinned vertices per second: 37910000

ldc2 -O3 -release -enable-inlining dver.d
Aug 03 2011
Am 04.08.2011, 04:07 Uhr, schrieb Trass3r <un known.com>:
> D ldc: Skinned vertices per second: 37910000
> ldc2 -O3 -release -enable-inlining dver.d

D gdc with added -frelease -fno-bounds-check:
Skinned vertices per second: 37710000
Aug 05 2011
Trass3r:
> C++ no SIMD: Skinned vertices per second: 42420000
> ...
> D gdc with added -frelease -fno-bounds-check:
> Skinned vertices per second: 37710000

I'd like to know why the GCC back-end is able to produce a more efficient binary from the C++ code (compared to the D code), but now the problem is not as large as before. It seems I've found a benchmark coming from real-world code that's a worst case for DMD (GDC here produces code about 237 times faster than DMD).

Bye, bearophile
Aug 05 2011
> I'd like to know why the GCC back-end is able to produce a more efficient
> binary from the C++ code (compared to the D code), but now the problem is
> not as large as before.

I attached both asm versions ;)
Aug 05 2011
Trass3r:
> C++ no SIMD: Skinned vertices per second: 42420000
> ...
> D gdc: Skinned vertices per second: 23450000

Are you able and willing to show me the asm produced by gdc? There's a problem there.

Bye, bearophile
Aug 03 2011
> Are you able and willing to show me the asm produced by gdc? There's a
> problem there.

[attach bla.rar]
Aug 04 2011
Trass3r:In the bla.rar attach there's the unstripped Linux binary, so to read the asm I have used the objdump disassembler. But are you willing and able to show me the asm before it gets assembled? (with gcc you do it with the -S switch). (I also suggest to use only the C standard library, with time() and printf() to produce a smaller asm output: http://codepad.org/12EUo16J ). Using objdump I see it uses 16 xmm registers, this is the main routine. But what's the purpose of those callq? They seem to call the successive asm instruction. The x86 asm of this routine contains jumps only and no "call". The asm of this routine is also very long, I don't know why yet. I see too many instructions like "movss 0x80(%rsp), %xmm7" this looks like a problem. _calculateVerticesAndNormals: push %r15 push %r14 push %r13 push %r12 push %rbp push %rbx sub $0x268, %rsp mov 0x2a0(%rsp), %rax mov %rdi, 0xe8(%rsp) mov %rsi, 0xe0(%rsp) mov %rcx, 0x128(%rsp) mov %r8, 0x138(%rsp) mov %rax, 0xf0(%rsp) mov 0x2a8(%rsp), %rax mov %rdi, 0x180(%rsp) mov %rsi, 0x188(%rsp) mov %rcx, 0x170(%rsp) mov %rax, 0xf8(%rsp) mov 0x2b0(%rsp), %rax mov %r8, 0x178(%rsp) mov %rax, 0x130(%rsp) mov 0x2b8(%rsp), %rax mov %rax, 0x140(%rsp) mov %rcx, %rax add %rax, %rax cmp 0x130(%rsp), %rax je 74d <_calculateVerticesAndNormals+0xcd> mov $0x57, %edx mov $0x6, %edi mov $0x0, %esi movq $0x6, 0x190(%rsp) movq $0x0, 0x198(%rsp) callq 74d <_calculateVerticesAndNormals+0xcd> cmpq $0x0, 0x128(%rsp) je 1317 <_calculateVerticesAndNormals+0xc97> movq $0x1, 0x120(%rsp) xor %r15d, %r15d movq $0x0, 0x100(%rsp) movslq %r15d, %r12 cmp %r12, 0xf0(%rsp) movq $0x0, 0x108(%rsp) jbe f1d <_calculateVerticesAndNormals+0x89d> nopl 0x0(%rax) lea (%r12, %r12, 2), %rax shl $0x2, %rax mov %rax, 0x148(%rsp) mov 0xf8(%rsp), %rax add 0x148(%rsp), %rax movss 0x4(%rax), %xmm9 movzbl 0x8(%rax), %r13d movslq (%rax), %rax cmp 0xe8(%rsp), %rax jae f50 <_calculateVerticesAndNormals+0x8d0> lea (%rax, %rax, 2), %rax shl $0x4, %rax mov %rax, 0x110(%rsp) mov 0xe0(%rsp), %rbx add 0x110(%rsp), %rbx je 12af <_calculateVerticesAndNormals+0xc2f> movss (%rbx), %xmm7 test %r13b, %r13b movss 0x4(%rbx), %xmm8 movss 0x8(%rbx), %xmm6 mulss %xmm9, %xmm7 movss 0xc(%rbx), %xmm11 mulss %xmm9, %xmm8 movss 0x10(%rbx), %xmm4 mulss %xmm9, %xmm6 movss 0x14(%rbx), %xmm5 mulss %xmm9, %xmm11 movss 0x18(%rbx), %xmm3 mulss %xmm9, %xmm4 movss 0x1c(%rbx), %xmm10 mulss %xmm9, %xmm5 movss 0x20(%rbx), %xmm1 mulss %xmm9, %xmm3 movss 0x24(%rbx), %xmm2 mulss %xmm9, %xmm10 movss 0x28(%rbx), %xmm0 mulss %xmm9, %xmm1 mulss %xmm9, %xmm2 mulss %xmm9, %xmm0 mulss 0x2c(%rbx), %xmm9 jne cdb <_calculateVerticesAndNormals+0x65b> add $0x1, %r12 mov %r14, %rax lea (%r12, %r12, 2), %r13 shl $0x2, %r13 jmpq 99e <_calculateVerticesAndNormals+0x31e> nopl (%rax) mov %r13, %rax mov 0xf8(%rsp), %rdx add %rax, %rdx movss 0x4(%rdx), %xmm12 movzbl 0x8(%rdx), %r14d movslq (%rdx), %rdx cmp %rdx, 0xe8(%rsp) jbe aa0 <_calculateVerticesAndNormals+0x420> mov 0xe0(%rsp), %rbx lea (%rdx, %rdx, 2), %rbp shl $0x4, %rbp add %rbp, %rbx je baf <_calculateVerticesAndNormals+0x52f> movss (%rbx), %xmm13 add $0x1, %r12 add $0xc, %r13 test %r14b, %r14b mulss %xmm12, %xmm13 addss %xmm13, %xmm7 movss 0x4(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm8 movss 0x8(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm6 movss 0xc(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm11 movss 0x10(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm4 movss 0x14(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm5 movss 0x18(%rbx), %xmm13 mulss %xmm12, 
%xmm13 addss %xmm13, %xmm3 movss 0x1c(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm10 movss 0x20(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm1 movss 0x24(%rbx), %xmm13 mulss %xmm12, %xmm13 addss %xmm13, %xmm2 movss 0x28(%rbx), %xmm13 mulss %xmm12, %xmm13 mulss 0x2c(%rbx), %xmm12 addss %xmm13, %xmm0 addss %xmm12, %xmm9 jne cd8 <_calculateVerticesAndNormals+0x658> add $0x1, %r15d cmp %r12, 0xf0(%rsp) ja 890 <_calculateVerticesAndNormals+0x210> mov $0x63, %edx mov $0x6, %edi mov $0x0, %esi mov %rax, 0xc8(%rsp) movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movss %xmm9, 0x90(%rsp) movss %xmm10, 0xa0(%rsp) movss %xmm11, 0xb0(%rsp) movq $0x6, 0x1c0(%rsp) movq $0x0, 0x1c8(%rsp) callq a3b <_calculateVerticesAndNormals+0x3bb> mov 0xc8(%rsp), %rax movss (%rsp), %xmm0 movss 0x20(%rsp), %xmm1 movss 0x10(%rsp), %xmm2 movss 0x30(%rsp), %xmm3 movss 0x50(%rsp), %xmm4 movss 0x40(%rsp), %xmm5 movss 0x60(%rsp), %xmm6 movss 0x80(%rsp), %xmm7 movss 0x70(%rsp), %xmm8 movss 0x90(%rsp), %xmm9 movss 0xa0(%rsp), %xmm10 movss 0xb0(%rsp), %xmm11 jmpq 893 <_calculateVerticesAndNormals+0x213> nop mov $0x65, %edx mov $0x6, %edi mov $0x0, %esi mov %rax, 0xc8(%rsp) movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movss %xmm9, 0x90(%rsp) movss %xmm10, 0xa0(%rsp) movss %xmm11, 0xb0(%rsp) movss %xmm12, 0xd0(%rsp) movq $0x6, 0x1d0(%rsp) movq $0x0, 0x1d8(%rsp) callq b35 <_calculateVerticesAndNormals+0x4b5> mov 0xe0(%rsp), %rbx movss 0xd0(%rsp), %xmm12 movss 0xb0(%rsp), %xmm11 movss 0xa0(%rsp), %xmm10 add %rbp, %rbx movss 0x70(%rsp), %xmm8 movss 0x90(%rsp), %xmm9 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 mov 0xc8(%rsp), %rax jne 8d3 <_calculateVerticesAndNormals+0x253> mov $0x23, %r8d mov $0x6, %edx mov $0x0, %ecx mov $0x9, %edi mov $0x0, %esi movss %xmm0, (%rsp) mov %rax, 0xc8(%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movss %xmm9, 0x90(%rsp) movss %xmm10, 0xa0(%rsp) movss %xmm11, 0xb0(%rsp) movss %xmm12, 0xd0(%rsp) movq $0x6, 0x240(%rsp) movq $0x0, 0x248(%rsp) movq $0x9, 0x250(%rsp) movq $0x0, 0x258(%rsp) callq c67 <_calculateVerticesAndNormals+0x5e7> movss 0x70(%rsp), %xmm8 movss 0xd0(%rsp), %xmm12 movss 0xb0(%rsp), %xmm11 movss 0xa0(%rsp), %xmm10 movss 0x90(%rsp), %xmm9 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 mov 0xc8(%rsp), %rax jmpq 8d3 <_calculateVerticesAndNormals+0x253> nopl (%rax) mov %rax, %r14 mov 0x108(%rsp), %rax cmp %rax, 0x128(%rsp) jbe 11d0 <_calculateVerticesAndNormals+0xb50> shl $0x5, %rax mov %rax, 0x150(%rsp) mov 0x100(%rsp), %rax mov 0x138(%rsp), %rbx add %rax, %rax add 0x150(%rsp), %rbx cmp %rax, 0x130(%rsp) jbe 10e8 <_calculateVerticesAndNormals+0xa68> mov 0x100(%rsp), %rax shl $0x5, %rax mov %rax, 0x158(%rsp) movss 0x8(%rbx), %xmm12 movaps %xmm8, %xmm15 movss (%rbx), %xmm14 movss %xmm12, 0x11c(%rsp) movss 0x4(%rbx), %xmm13 
movaps %xmm7, %xmm12 mulss %xmm14, %xmm12 mov 0x140(%rsp), %rax mulss %xmm13, %xmm15 add 0x158(%rsp), %rax addss %xmm15, %xmm12 addss %xmm11, %xmm12 movl $0x0, 0xc(%rax) movss 0x11c(%rsp), %xmm11 mulss %xmm6, %xmm11 addss %xmm11, %xmm12 movaps %xmm4, %xmm11 mulss %xmm14, %xmm11 mulss %xmm1, %xmm14 movss %xmm12, (%rax) movaps %xmm5, %xmm12 mulss %xmm13, %xmm12 mulss %xmm2, %xmm13 addss %xmm12, %xmm11 addss %xmm13, %xmm14 addss %xmm10, %xmm11 movss 0x11c(%rsp), %xmm10 addss %xmm9, %xmm14 movss 0x11c(%rsp), %xmm9 mulss %xmm3, %xmm10 mulss %xmm0, %xmm9 addss %xmm10, %xmm11 addss %xmm9, %xmm14 movss %xmm11, 0x4(%rax) movss %xmm14, 0x8(%rax) mov 0x108(%rsp), %rax cmp %rax, 0x128(%rsp) jbe 1040 <_calculateVerticesAndNormals+0x9c0> shl $0x5, %rax mov %rax, 0x160(%rsp) mov 0x138(%rsp), %rbx mov 0x120(%rsp), %rax add 0x160(%rsp), %rbx cmp %rax, 0x130(%rsp) jbe f98 <_calculateVerticesAndNormals+0x918> shl $0x4, %rax mov %rax, 0x168(%rsp) movss 0x10(%rbx), %xmm10 add $0x1, %r15d movss 0x14(%rbx), %xmm11 mulss %xmm10, %xmm7 mov 0x140(%rsp), %rax mulss %xmm11, %xmm8 movss 0x18(%rbx), %xmm9 mulss %xmm11, %xmm5 mulss %xmm10, %xmm4 mulss %xmm11, %xmm2 add 0x168(%rsp), %rax addq $0x1, 0x100(%rsp) addss %xmm7, %xmm8 addq $0x2, 0x120(%rsp) addss %xmm4, %xmm5 mulss %xmm10, %xmm1 mulss %xmm9, %xmm6 movl $0x0, 0xc(%rax) mulss %xmm9, %xmm3 mulss %xmm9, %xmm0 addss %xmm1, %xmm2 addss %xmm6, %xmm8 addss %xmm3, %xmm5 addss %xmm0, %xmm2 movss %xmm8, (%rax) movss %xmm5, 0x4(%rax) movss %xmm2, 0x8(%rax) mov 0x100(%rsp), %rax cmp %rax, 0x128(%rsp) je 1317 <_calculateVerticesAndNormals+0xc97> movslq %r15d, %r12 mov %rax, 0x108(%rsp) cmp %r12, 0xf0(%rsp) ja 798 <_calculateVerticesAndNormals+0x118> mov $0x5d, %edx mov $0x6, %edi mov $0x0, %esi movq $0x6, 0x1a0(%rsp) movq $0x0, 0x1a8(%rsp) callq f49 <_calculateVerticesAndNormals+0x8c9> jmpq 7a8 <_calculateVerticesAndNormals+0x128> xchg %ax, %ax mov $0x5f, %edx mov $0x6, %edi mov $0x0, %esi movss %xmm9, 0x90(%rsp) movq $0x6, 0x1b0(%rsp) movq $0x0, 0x1b8(%rsp) callq f86 <_calculateVerticesAndNormals+0x906> movss 0x90(%rsp), %xmm9 jmpq 7e4 <_calculateVerticesAndNormals+0x164> nopl (%rax) mov $0x69, %edx mov $0x6, %edi mov $0x0, %esi movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movq $0x6, 0x210(%rsp) movq $0x0, 0x218(%rsp) callq ffd <_calculateVerticesAndNormals+0x97d> movss 0x70(%rsp), %xmm8 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 jmpq e59 <_calculateVerticesAndNormals+0x7d9> nopl 0x0(%rax, %rax, 1) mov $0x69, %edx mov $0x6, %edi mov $0x0, %esi movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movq $0x6, 0x200(%rsp) movq $0x0, 0x208(%rsp) callq 10a5 <_calculateVerticesAndNormals+0xa25> movss 0x70(%rsp), %xmm8 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 jmpq e27 <_calculateVerticesAndNormals+0x7a7> nopl 0x0(%rax, %rax, 1) mov $0x68, %edx mov $0x6, %edi mov $0x0, %esi movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss 
%xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movss %xmm9, 0x90(%rsp) movss %xmm10, 0xa0(%rsp) movss %xmm11, 0xb0(%rsp) movq $0x6, 0x1f0(%rsp) movq $0x0, 0x1f8(%rsp) callq 116b <_calculateVerticesAndNormals+0xaeb> movss 0x70(%rsp), %xmm8 movss 0xb0(%rsp), %xmm11 movss 0xa0(%rsp), %xmm10 movss 0x90(%rsp), %xmm9 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 jmpq d3a <_calculateVerticesAndNormals+0x6ba> nopw 0x0(%rax, %rax, 1) mov $0x68, %edx mov $0x6, %edi mov $0x0, %esi movss %xmm0, (%rsp) movss %xmm1, 0x20(%rsp) movss %xmm2, 0x10(%rsp) movss %xmm3, 0x30(%rsp) movss %xmm4, 0x50(%rsp) movss %xmm5, 0x40(%rsp) movss %xmm6, 0x60(%rsp) movss %xmm7, 0x80(%rsp) movss %xmm8, 0x70(%rsp) movss %xmm9, 0x90(%rsp) movss %xmm10, 0xa0(%rsp) movss %xmm11, 0xb0(%rsp) movq $0x6, 0x1e0(%rsp) movq $0x0, 0x1e8(%rsp) callq 1253 <_calculateVerticesAndNormals+0xbd3> movss 0x70(%rsp), %xmm8 movss 0xb0(%rsp), %xmm11 movss 0xa0(%rsp), %xmm10 movss 0x90(%rsp), %xmm9 movss 0x80(%rsp), %xmm7 movss 0x60(%rsp), %xmm6 movss 0x40(%rsp), %xmm5 movss 0x50(%rsp), %xmm4 movss 0x30(%rsp), %xmm3 movss 0x10(%rsp), %xmm2 movss 0x20(%rsp), %xmm1 movss (%rsp), %xmm0 jmpq cfd <_calculateVerticesAndNormals+0x67d> mov $0x12, %r8d mov $0x6, %edx mov $0x0, %ecx mov $0x9, %edi mov $0x0, %esi movss %xmm9, 0x90(%rsp) movq $0x6, 0x220(%rsp) movq $0x0, 0x228(%rsp) movq $0x9, 0x230(%rsp) movq $0x0, 0x238(%rsp) callq 1308 <_calculateVerticesAndNormals+0xc88> movss 0x90(%rsp), %xmm9 jmpq 7fa <_calculateVerticesAndNormals+0x17a> add $0x268, %rsp pop %rbx pop %rbp pop %r12 pop %r13 pop %r14 pop %r15 retq nopl 0x0(%rax) Bye, bearophileare you able and willing to show me the asm produced by gdc? There's a problem there.[attach bla.rar]
Aug 04 2011
> But what's the purpose of those callq? They seem to call the successive
> asm instruction.

I find AT&T syntax to be almost impossible to read, but it looks like they are comparing the instruction pointer for some reason. call works by pushing the instruction pointer on the stack, then jumping to the new address. By calling the next thing, you can then pop the instruction pointer off the stack and continue on where you left off. I don't know why they want this though. That AT&T syntax really messes with my brain...
Aug 04 2011
Adam Ruppe wrote:
> I find AT&T syntax to be almost impossible to read, but it looks
> like they are comparing the instruction pointer for some reason.
> call works by pushing the instruction pointer on the stack, then
> jumping to the new address. By calling the next thing, you can then
> pop the instruction pointer off the stack and continue on where you
> left off. I don't know why they want this though.

They do that to implement Position Independent Code: you need to know the instruction pointer to be able to access your data. Actually it has a terrible effect on performance, because it destroys the processor's return prediction mechanism (it guarantees multiple mispredictions). But it seems to be unavoidable -- I don't think it's possible to generate decent code for PIC on x86-32. But there should never be more than one call in a function.
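[The "get PC" idiom Don describes, written down as a hypothetical, untested sketch using DMD's x86-32 inline assembler; nothing here is code from the thread.]

// Returns the address of the Lhere label, i.e. the instruction pointer.
uint currentAddress()
{
    asm
    {
        naked;
        call Lhere;   // pushes the address of Lhere, then "jumps" to it
    Lhere:
        pop EAX;      // EAX = address of Lhere; uint result is returned in EAX
        ret;
    }
}
// The call above is never paired with a matching ret, which is exactly
// what confuses the CPU's return-address predictor.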
Aug 05 2011
> Are you willing and able to show me the asm before it gets assembled?
> (with gcc you do it with the -S switch). (I also suggest to use only the
> C standard library, with time() and printf() to produce a smaller asm
> output: http://codepad.org/12EUo16J ).
Aug 05 2011
Trass3r:
> Are you willing and able to show me the asm before it gets assembled?
> (with gcc you do it with the -S switch). (I also suggest to use only the
> C standard library, with time() and printf() to produce a smaller asm
> output: http://codepad.org/12EUo16J ).

You are a person of few words :-) Thank you for the asm.

Apparently the program was not compiled in release mode (or with -noboundscheck. With DMD it's the same thing, maybe with gdc it's not the same thing). It contains the calls, but they aren't to the next line, they were for the array bounds:

call _d_assert
call _d_array_bounds
call _d_array_bounds
call _d_assert_msg
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_array_bounds
call _d_assert_msg

But I think this doesn't fully explain the low performance, I have seen too many instructions like:

movss DWORD PTR [rsp+32], xmm1
movss DWORD PTR [rsp+16], xmm2
movss DWORD PTR [rsp+48], xmm3

If you want to go on with this exploration, then I suggest you to find a way to disable bound tests.

Bye, bearophile
Aug 05 2011
> If you want to go on with this exploration, then I suggest you to find a
> way to disable bound tests.

Ok, now I get up to 32930000 skinned vertices per second. Still a bit worse than LDC.
Aug 05 2011
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> Are you able and willing to show me the asm produced by gdc? There's a
> problem there.

Notes from me:
- Options -fno-bounds-check and -frelease can be just as important in GDC as they are in DMD under certain instances.
- You can output asm in intel dialect using -masm=intel if at&t is that difficult for you to read. 8-)

I will look into this later from my workstation.
Aug 06 2011
Iain Buclaw:
> I will look into this later from my workstation.

The remaining thing to look at is just the small performance difference between the D-GDC version and the C++-G++ version.

Bye, bearophile
Aug 06 2011
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> The remaining thing to look at is just the small performance difference
> between the D-GDC version and the C++-G++ version.

Three things that helped improve performance in a minor way for me:
1) using pointers over dynamic arrays. (5% speedup)
2) removing the calls to CalVector4's constructor (5.7% speedup)
3) using core.stdc.time over std.datetime. (1.6% speedup)

Point one is a pretty well known issue in D as far as I'm aware. Point two is not an issue with inlining (all methods are marked 'inline'), but it did help remove quite a few movss instructions being emitted. Point three is interesting, it seems that "sw.peek().msecs" slows down the number of iterations in the while loop.

With those changes, the D implementation is still 21% slower than the C++ implementation without SIMD.

http://ideone.com/4PP2D
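[Point three is easy to reproduce. A minimal, hypothetical D sketch (not Iain's actual code) of the cheaper timing loop using core.stdc.time instead of std.datetime's StopWatch; the kernel call is a placeholder for the benchmark's real work.]

import core.stdc.stdio : printf;
import core.stdc.time : clock, clock_t, CLOCKS_PER_SEC;

void benchmarkKernel()
{
    // placeholder for calculateVerticesAndNormals(...)
}

void main()
{
    long iterations = 0;
    const clock_t start = clock();
    // clock() is a cheap C call, so polling it each iteration
    // distorts the measured loop far less than sw.peek().msecs
    while (clock() - start < CLOCKS_PER_SEC) // run for ~1 second
    {
        benchmarkKernel();
        ++iterations;
    }
    printf("iterations/sec: %lld\n", iterations);
}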
Aug 06 2011
Iain Buclaw:
> Three things that helped improve performance in a minor way for me:
> 1) using pointers over dynamic arrays. (5% speedup)
> 2) removing the calls to CalVector4's constructor (5.7% speedup)
> 3) using core.stdc.time over std.datetime. (1.6% speedup)

Are you using GDC2-64 bit on Linux?

> Point one is a pretty well known issue in D as far as I'm aware.

Really? I don't remember discussions about it. What is its cause?

> Point two is not an issue with inlining (all methods are marked
> 'inline'), but it did help remove quite a few movss instructions being
> emitted.

This too is something worth fixing. Is this issue in Bugzilla already?

> Point three is interesting, it seems that "sw.peek().msecs" slows down
> the number of iterations in the while loop.

This needs to be fixed.

> With those changes, the D implementation is still 21% slower than the C++
> implementation without SIMD. http://ideone.com/4PP2D

This is a lot still. Thank you for your work. I think all three issues are worth fixing, eventually.

Bye, bearophile
Aug 06 2011
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> Are you using GDC2-64 bit on Linux?

GDC2-32 bit on Linux.

> Really? I don't remember discussions about it. What is its cause?

I can't remember the exact discussion, but it was something about a benchmark of passing by value vs passing by ref vs passing by pointer.

> This too is something worth fixing. Is this issue in Bugzilla already?

I don't think it's an issue really. But of course, there is a difference between what you say and what you mean with regards to the code here (that being, with the first version, lots of temp vars get created and moved around the place).

Regards
Iain
Aug 06 2011
Iain Buclaw:
> 1) using pointers over dynamic arrays. (5% speedup)
> 2) removing the calls to CalVector4's constructor (5.7% speedup)

With DMD I have seen 180k -> 190k vertices/sec replacing this:

struct CalVector4 {
    float X, Y, Z, W;

    this(float x, float y, float z, float w = 0.0f) {
        X = x;
        Y = y;
        Z = z;
        W = w;
    }
}

With:

struct CalVector4 {
    float X, Y, Z, W = 0.0f;
}

I'd like the D compiler to optimize better there.

> http://ideone.com/4PP2D

This line of code is not good:

auto vertices = cast(Vertex*) new Vertex[N];

This is much better, it's less bug-prone, simpler and shorter:

auto vertices = (new Vertex[N]).ptr;

But in practice in this program it is enough to allocate dynamic arrays normally, and then perform the call like this (with DMD it gives the same performance):

calculateVerticesAndNormals(boneTransforms.ptr, N, vertices.ptr, influences.ptr, output.ptr);

I don't know why passing pointers gives some more performance here, compared to passing dynamic arrays (but I have seen the same behaviour in other D programs of mine).

Bye, bearophile
Aug 06 2011
On 8/6/2011 3:19 PM, bearophile wrote:
> I don't know why passing pointers gives some more performance here,
> compared to passing dynamic arrays (but I have seen the same behaviour
> in other D programs of mine).

A dynamic array is two values being passed, a pointer is one.
Aug 06 2011
Walter:
> A dynamic array is two values being passed, a pointer is one.

I know, but I think there are many optimization opportunities. An example:

private void foo(int[] a2) {}
void main() {
    int[100] a1;
    foo(a1);
}

In code like that I think a D compiler is free to compile it like this, because foo is private, so it's free to perform optimizations based on just the code inside the module:

private void foo(ref int[100] a2) {}
void main() {
    int[100] a1;
    foo(a1);
}

I think there are several cases where a D compiler is free to replace the two values with just a pointer. Another example, to optimize code like this:

private void foo(int[] a1, int[] a2) {}
void main() {
    int n = 100; // run-time value
    auto a3 = new int[n];
    auto a4 = new int[n];
    foo(a3, a4);
}

Into something like this:

private void foo(int* a1, int* a2, size_t a1a2len) {}
void main() {
    int n = 100;
    auto a3 = new int[n];
    auto a4 = new int[n];
    foo(a3.ptr, a4.ptr, n);
}

Bye, bearophile
Aug 06 2011
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> This line of code is not good:
> auto vertices = cast(Vertex*) new Vertex[N];

I was playing about with heap vs stack. Must've forgot to remove that, sorry. :)

Anyways, I've tweaked the GDC codegen, and program speed meets that of C++ now (on my system).

Implementation: http://ideone.com/0j0L1

Command-line:
gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

Best times:
G++-32bit: 11400000 vps
GDC-32bit: 11350000 vps

Regards
Iain
Iain Buclaw:
> Anyways, I've tweaked the GDC codegen, and program speed meets that of
> C++ now (on my system).

Are you willing to explain your changes (and maybe give a link to the changes)? Maybe Walter is interested for DMD too.

> Command-line:
> gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
> g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

In newer versions of GCC -Ofast means -ffast-math too.

Walter is not a lover of that -ffast-math switch. But I now think that the combination of D strongly pure functions with unsafe FP optimizations offers optimization opportunities that maybe not even GCC is able to use now when it compiles C/C++ code (do you see why?). Not using this opportunity is a waste, in my opinion.

Bye, bearophile
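[A small hypothetical D sketch, not from the thread, to make bearophile's hint concrete: a strongly pure function's result depends only on its arguments, so the compiler may reuse a prior result for a repeated call without inspecting the body -- something a C++ compiler has to prove from the code. Unsafe FP flags then widen what may be reordered around such calls.]

// Strongly pure: the result depends only on the arguments,
// and no mutable global state can be read or written.
pure nothrow float poly(float x)
{
    return x * x * x + 2.0f * x;
}

void main()
{
    float a = 1.5f;
    // Purity alone lets a D compiler fold the two calls into one
    // (an exact, IEEE-safe transformation); with -ffast-math-style
    // flags it may additionally reorder the surrounding FP math.
    float r = poly(a) + poly(a);
}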
Aug 06 2011
On 8/6/2011 4:46 PM, bearophile wrote:
> Walter is not a lover of that -ffast-math switch.

No, I am not. Few understand the subtleties of IEEE arithmetic, and breaking IEEE conformance is something very, very few should even consider.
Aug 06 2011
Walter:
> No, I am not. Few understand the subtleties of IEEE arithmetic, and
> breaking IEEE conformance is something very, very few should even
> consider.

I have read several papers about FP arithmetic, but I am not an expert yet on them. Both GDC and LDC have compilation switches to perform those unsafe FP optimizations, so even if you don't like them, most D compilers today have them optional, and I don't think those switches will be removed.

If you want to simulate a flock of boids (http://en.wikipedia.org/wiki/Boids ) on the screen using D, and you use floating point values to represent their speed vector, introducing unsafe FP optimizations will not harm so much. Video games are a significant purpose for the D language, and in them FP errors are often benign (maybe some parts of the game are able to tolerate them and some other part of the game needs to be compiled with strict FP semantics).

Bye, bearophile
Aug 06 2011
On 8/6/2011 8:34 PM, bearophile wrote:
> Video games are a significant purpose for the D language, and in them FP
> errors are often benign (maybe some parts of the game are able to
> tolerate them and some other part of the game needs to be compiled with
> strict FP semantics).

Floating point determinism can be very important when it comes to reducing network traffic. If you can achieve it, then you can make sure all players have the same game state and then only send user input commands over the network.

Glenn Fiedler has an interesting writeup on it, but I haven't had a chance to read all of it yet: http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/
Aug 07 2011
Eric Poggel (JoeCoder):
> Floating point determinism can be very important when it comes to
> reducing network traffic. If you can achieve it, then you can make sure
> all players have the same game state and then only send user input
> commands over the network.

It seems a hard thing to obtain, but I agree that it gets useful. For me having some FP determinism is useful for debugging: to avoid results from changing randomly if I perform a tiny change in the source code that triggers a change in what optimizations the compiler does. But there are several situations (if I am writing a ray tracer?) where FP determinism is not required in my release build.

I was not arguing about removing FP rules from the D compiler, just that there are situations where relaxing those FP rules, on request, doesn't seem to harm. I am not an expert about the risks Walter was talking about, so maybe I'm just walking on thin ice (but no one will get hurt if my little raytracer produces some errors in its images).

You don't come often in this newsgroup, thank you for the link :-)

Bye, bearophile
Aug 08 2011
On 8/8/2011 3:02 PM, bearophile wrote:
> It seems a hard thing to obtain, but I agree that it gets useful.
> [...]
> You don't come often in this newsgroup, thank you for the link :-)

You'd be surprised how much I lurk here. I agree there are some interesting areas where fast floating point may indeed be worth it, but I also don't know enough.

I've also wondered about creating a Fixed!(long, 8) struct that would let me work with longs and 8 bits of precision after the decimal point, as a way of having equal precision anywhere in a large universe and achieving determinism at the same time. But I don't know how performance would compare vs floats or doubles.
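[A minimal sketch of what such a type could look like -- hypothetical code, not something Eric posted; the names and the truncating arithmetic are assumptions. All operations are plain integer math, so results are bit-identical across machines:]

struct Fixed(T, int fracBits)
{
    T raw; // stored value, scaled by 2^fracBits

    static Fixed fromDouble(double v)
    {
        Fixed f;
        f.raw = cast(T)(v * (cast(T)1 << fracBits));
        return f;
    }

    Fixed opBinary(string op)(Fixed rhs) const if (op == "+" || op == "-")
    {
        Fixed f;
        f.raw = cast(T)(mixin("raw " ~ op ~ " rhs.raw"));
        return f;
    }

    Fixed opBinary(string op : "*")(Fixed rhs) const
    {
        Fixed f;
        f.raw = cast(T)((raw * rhs.raw) >> fracBits); // truncating multiply
        return f;
    }

    double toDouble() const
    {
        return cast(double)raw / (cast(T)1 << fracBits);
    }
}

void main()
{
    alias Fixed!(long, 8) FP;
    auto a = FP.fromDouble(1.5);
    auto b = FP.fromDouble(2.25);
    assert((a * b).toDouble() == 3.375); // exact, deterministic everywhere
}

The multiply is one integer multiply plus a shift, so the raw cost should be comparable to a float multiply; what is lost is the FPU's automatic rounding and dynamic range.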
Aug 08 2011
> Anyways, I've tweaked the GDC codegen, and program speed meets that of
> C++ now (on my system).
>
> Best times:
> G++-32bit: 11400000 vps
> GDC-32bit: 11350000 vps

64Bit:

C++:
45010000 44270000 42740000 43900000 44680000 43490000 42390000

GDC:
42900000 44010000 44000000 44010000 44010000 44000000

GDC with -fno-bounds-check:
43280000 44440000 44420000 44340000 44440000 44450000
Aug 07 2011