digitalmars.D - SIMD on Windows
- Jonathan Dunlap (4/4) Jun 21 2013 In D 2.063.2 on Windows 7:
- Jonathan Dunlap (3/3) Jun 21 2013 Btw, is it possible to check for SIMD support as a compilation
- Walter Bright (4/8) Jun 21 2013 It's not a bug, and there are currently no plans to support SIMD on Win3...
- Manu (4/14) Jun 21 2013 It would certainly be nice in Win32, but I tend to think Win32 COFF shou...
- Rainer Schuetze (24/26) Jun 23 2013 I have removed the dust from these patches and pushed them successfully
- Manu (2/29) Jun 23 2013
- Jacob Carlborg (5/17) Jun 23 2013 So, you have implemented support for COFF 32bit? How long have you been
- Rainer Schuetze (3/7) Jun 23 2013 I experimented with it a few times, but it didn't work good enough until...
- Michael (2/2) Jun 23 2013 Cool)))
- Rainer Schuetze (7/9) Jun 24 2013 Let's see if Walter approves.
- Jonathan Dunlap (18/18) Jun 29 2013 Alright, I'm now officially building for Windows x64 (amd64).
- jerro (21/40) Jun 29 2013 First of all, calcSIMD and calcScalar are virtual functions so
- Jonathan Dunlap (9/15) Jun 29 2013 I've updated the project with your suggestions at
- Iain Buclaw (5/19) Jun 29 2013 s/class/struct/
- jerro (33/50) Jun 29 2013 The multiples 2 and 1 were the reason why the scalar code
- jerro (5/7) Jun 29 2013 The call to calcScalar compiles to this:
- Jonathan Dunlap (5/8) Jun 29 2013 It seems like auto-vectorization to SIMD code may be an ideal
- jerro (13/18) Jun 29 2013 The things is that using SIMD efficiently often requires you to
- Jonathan Dunlap (7/7) Jun 29 2013 I did watch Manu's a few days ago which inspired me to start this
- Manu (28/34) Jun 29 2013 You should probably watch my talk again ;)
- Jonathan Dunlap (19/19) Jul 01 2013 Thanks Manu, I think I understand. Quick questions, so I've
- jerro (4/7) Jul 01 2013 The loop body in testSimd doesn't do anything. This line:
- Jonathan Dunlap (8/8) Jul 01 2013 Thanks Jerro, I went ahead and used a pointer reference to ensure
- Manu (5/12) Jul 01 2013 Maybe make the arrays public? it's conceivable the optimiser could
- bearophile (4/12) Jul 01 2013 Have you taken a look at the asm?
- Kiith-Sa (4/4) Jun 29 2013 See Manu's talk and google how to use it. If you don't know what
- Michael (1/2) Jun 29 2013 Why? or What exactly? Details please)
- Rainer Schuetze (3/5) Jun 23 2013 https://github.com/D-Programming-Language/dmd/pull/2253
- bearophile (4/6) Jun 21 2013 LDC2 supports SIMD on Win32.
- Jonathan Dunlap (5/16) Jun 21 2013 How do you compile for Win64? The only package for Windows I see
- Brad Anderson (7/25) Jun 21 2013 The i386 DMD can produce Win64 COFF object files which can then
- Geancarlo Rocha (3/21) Jun 21 2013 If just installing VC++ doesn't work...
- Jonathan Dunlap (16/16) Jun 21 2013 Alright, I installed VC2010 (with x64 libs) and added the -m64
- Jonathan Dunlap (2/2) Jun 21 2013 Also tried VC2012 Express (with x64 libs)... received the same
- Walter Bright (4/17) Jun 21 2013 Anytime you see a message like "Internal error" it's a compiler bug and ...
- Manu (2/16) Jun 22 2013 You can't use SIMD and symbolic debuginfo. The compiler will crash.
- Walter Bright (2/3) Jun 22 2013 I didn't know that. Bugzilla?
- Manu (3/7) Jun 22 2013 Pretty sure it's in there...
- Michael (2/2) Jun 21 2013 Check twice where is yours 64 bit tools installed. Paths
- Benjamin Thaut (11/15) Jun 22 2013 In its current state you don't want to be using SIMD with dmd because
- jerro (6/9) Jun 22 2013 That may be true for some kinds of code, but it isn't true int
- Benjamin Thaut (6/14) Jun 22 2013 Well, but judging from the assembly it generates, it could be even
- jerro (5/9) Jun 22 2013 It's a FFT implementation. It does most of the work using + - and
- Benjamin Thaut (5/13) Jun 22 2013 Ok I saw that you did write quite a few cirtical functions in inline
In D 2.063.2 on Windows 7:

Error: SIMD vector types not supported on this platform

Should I file a bug for this, or is this currently on a roadmap? I'm SUPER excited to get into SIMD development with D. :D
Jun 21 2013
Btw, is it possible to check for SIMD support as a compilation condition? Ideally I'm looking to 'polyfill' SIMD if it's not supported on the platform.
Jun 21 2013
On 6/21/2013 3:43 PM, Jonathan Dunlap wrote:
> In D 2.063.2 on Windows 7: Error: SIMD vector types not supported on this platform Should I file a bug for this or is this currently on a roadmap? I'm SUPER excited to get into SIMD development with D. :D

It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.
Jun 21 2013
On 22 June 2013 09:04, Walter Bright <newshound2 digitalmars.com> wrote:
> On 6/21/2013 3:43 PM, Jonathan Dunlap wrote:
>> In D 2.063.2 on Windows 7: Error: SIMD vector types not supported on this platform Should I file a bug for this or is this currently on a roadmap? I'm SUPER excited to get into SIMD development with D. :D
>
> It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.

It would certainly be nice in Win32, but I tend to think Win32 COFF should be much higher priority.

And to the OP: there's a version(SIMD) you can test.
Jun 21 2013
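As an aside on the version() check Manu mentions above, here is a minimal sketch of compile-time SIMD detection with a scalar fallback. It uses the D_SIMD identifier that DMD defines for SIMD support; treat the exact identifier name as an assumption for your compiler and version.

import std.stdio;

version (D_SIMD)
{
    import core.simd;
    alias Vec = float4;     // hardware vector type
}
else
{
    alias Vec = float[4];   // scalar "polyfill" fallback
}

void main()
{
    version (D_SIMD)
        writeln("SIMD vector types are available");
    else
        writeln("no SIMD support on this target; using the scalar fallback");
}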
On 22.06.2013 02:07, Manu wrote:
> It would certainly be nice in Win32, but I tend to think Win32 COFF should be much higher priority.

I have removed the dust from these patches and pushed them successfully through the test suite and unittests:

https://github.com/rainers/dmd/tree/coff32
https://github.com/rainers/druntime/tree/coff32
https://github.com/rainers/phobos/tree/coff32

Compile dmd as usual, but druntime and phobos with something like

druntime: make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>"
phobos:   make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>" "AR=<path-to-32bit-lib>"

COFF32 files are generated when -m32ms is used on the command line. If you put the resulting libraries into the lib folder, using a standard installation of VS2010 might work, but I recommend adding a new section to sc.ini and adjusting the paths there. Mine looks like this:

[Environment32ms]
PATH=c:\l\vs9\Common7\IDE;%PATH%
LIB="%@P%\..\..\lib32";c:\l\vs9\vc\lib;"c:\Program Files (x86)\Microsoft SDKs\Windows\v7.1A\Lib"
DFLAGS=%DFLAGS% -L/nologo -L/INCREMENTAL:NO
LINKCMD=c:\l\vs9\vc\bin\link.exe

BTW: I also found some bugs in the Win64 support along the way; I'll create pull requests for these.
Jun 23 2013
I've said it before, but this man is a genius! :)

On 23 June 2013 23:33, Rainer Schuetze <r.sagitario gmx.de> wrote:
> I have removed the dust from these patches and pushed them successfully through the test suite and unittests:
>
> https://github.com/rainers/dmd/tree/coff32
> https://github.com/rainers/druntime/tree/coff32
> https://github.com/rainers/phobos/tree/coff32
>
> [...]
Jun 23 2013
On 2013-06-23 15:33, Rainer Schuetze wrote:
> I have removed the dust from these patches and pushed them successfully through the test suite and unittests:
>
> https://github.com/rainers/dmd/tree/coff32
> https://github.com/rainers/druntime/tree/coff32
> https://github.com/rainers/phobos/tree/coff32
>
> [...]
>
> COFF32 files are generated when -m32ms is used on the command line.

So, you have implemented support for COFF 32bit? How long have you been hiding this? :) Although I'm not a Windows user, I consider it great news.

--
/Jacob Carlborg
Jun 23 2013
On 23.06.2013 20:24, Jacob Carlborg wrote:
> On 2013-06-23 15:33, Rainer Schuetze wrote:
>> COFF32 files are generated when -m32ms is used on the command line.
>
> So, you have implemented support for COFF 32bit? How long have you been hiding this? :) Although I'm not a Windows user, I consider it great news.

I experimented with it a few times, but it didn't work well enough until this weekend.
Jun 23 2013
Cool))) Any chance of seeing it [coff32] in an official build?
Jun 23 2013
On 23.06.2013 21:55, Michael wrote:
> Cool))) Any chance of seeing it [coff32] in an official build?

Let's see if Walter approves.

There is one potentially disruptive change: with two different C runtimes available for Win32, versioning on Win32/Win64 no longer works. I added the versions CRuntime_DigitalMars and CRuntime_Microsoft (and CRuntime_GNU for anything else), and adapting to this accounts for most of the changes in druntime and phobos.
Jun 24 2013
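To illustrate the versioning change Rainer describes, here is a rough sketch of what user code keyed on the C runtime (rather than on Win32/Win64) could look like. The CRuntime_* names are the ones proposed in the patch above; whether they land unchanged is an assumption.

version (CRuntime_DigitalMars)
{
    // Classic Win32 build linking against the Digital Mars C runtime (snn.lib).
    enum crt = "DigitalMars";
}
else version (CRuntime_Microsoft)
{
    // Win64, and the proposed -m32ms COFF32 target, linking against the MSVC runtime.
    enum crt = "Microsoft";
}
else
{
    enum crt = "other";
}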
Alright, I'm now officially building for Windows x64 (amd64). I've created this early benchmark http://dpaste.dzfl.pl/eae0233e to explore SIMD performance. As you can see below, on my machine there is almost zero difference. Am I missing something?

//===SIMD===
0 1.#INF 5 1.#INF   <-- vector result
hnsecs: 100006      <-- duration time
0 1.#INF 5 1.#INF
hnsecs: 90005
0 1.#INF 5 1.#INF
hnsecs: 90006

//===SCALAR===
0 1.#INF 5 1.#INF
hnsecs: 90005
0 1.#INF 5 1.#INF
hnsecs: 100005
0 1.#INF 5 1.#INF
hnsecs: 100006
Jun 29 2013
On Saturday, 29 June 2013 at 14:39:44 UTC, Jonathan Dunlap wrote:
> Alright, I'm now officially building for Windows x64 (amd64). I've created this early benchmark http://dpaste.dzfl.pl/eae0233e to explore SIMD performance. As you can see below, on my machine there is almost zero difference. Am I missing something?
>
> [...]

First of all, calcSIMD and calcScalar are virtual functions, so they can't be inlined, which prevents any further optimization. It also seems that the fact that g, s, i and d are class fields and that g is a static array makes DMD load them from memory and store them back on every iteration, even when calcSIMD and calcScalar are inlined.

But even if I make the class final and build it with gdc -O3 -finline-functions -frelease -march=native (in which case GDC generates assembly that looks optimal to me), the scalar version is still a bit faster than the vector version. The main reason for that is that even with scalar code, the compiler can do multiple operations in parallel. On Sandy Bridge CPUs, for example, floating point multiplication takes 5 cycles to complete, but the processor can do one multiplication per cycle. So my guess is that the first four multiplications and the second four multiplications in calcScalar are done in parallel. That would explain the scalar code being equally fast as, but not faster than, the vector code. The reason it's actually faster is that GDC replaces multiplication by 2 with addition and omits multiplication by 1.
Jun 29 2013
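A minimal sketch of the devirtualization point above; the class and field names are illustrative, not taken from the dpaste. Making the class final removes the virtual call so the loop body can be inlined, and copying the fields into locals avoids the per-iteration loads and stores of class fields.

import core.simd;

// final: no virtual dispatch, so the compiler is free to inline the hot path.
final class SimdBench
{
    float4 s;
    float4 i;

    void run(size_t iterations)
    {
        float4 ls = s;              // locals: stay in registers
        float4 li = i;
        foreach (_; 0 .. iterations)
            ls = ls * li;
        s = ls;                     // write back once at the end
    }
}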
I've updated the project with your suggestions at http://dpaste.dzfl.pl/fce2d93b but still get the same performance. Vectors defined in the benchmark function body, no function calling overhead, etc. See some of my comments below, btw:

> First of all, calcSIMD and calcScalar are virtual functions, so they can't be inlined, which prevents any further optimization.

From the dlang docs: "Member functions which are private or package are never virtual, and hence cannot be overridden."

> So my guess is that the first four multiplications and the second four multiplications in calcScalar are done in parallel. ... The reason it's faster is that gdc replaces multiplication by 2 with addition and omits multiplication by 1.

I've changed the multiplies of 2 and 1 to 2.1 and 1.01 respectively. Still no performance difference between the two for me.
Jun 29 2013
On 29 June 2013 18:57, Jonathan Dunlap <jadit2 gmail.com> wrote:
> I've updated the project with your suggestions at http://dpaste.dzfl.pl/fce2d93b but still get the same performance. Vectors defined in the benchmark function body, no function calling overhead, etc. See some of my comments below, btw:
>
> [...]

s/class/struct/

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jun 29 2013
On Saturday, 29 June 2013 at 17:57:20 UTC, Jonathan Dunlap wrote:
> I've updated the project with your suggestions at http://dpaste.dzfl.pl/fce2d93b but still get the same performance. Vectors defined in the benchmark function body, no function calling overhead, etc. See some of my comments below, btw:
>
> [...]
>
> I've changed the multiplies of 2 and 1 to 2.1 and 1.01 respectively. Still no performance difference between the two for me.

The multiples 2 and 1 were the reason why the scalar code performed a little better than the SIMD code when compiled with GDC.

The main reason why scalar code isn't much slower than SIMD code is instruction-level parallelism. Because the first four operations in calcScalar are independent (none of them depends on the result of any of the other three), modern x86-64 processors can execute them in parallel. Because of that, the speed of your program is limited by instruction latency and not throughput. That's why it doesn't really make a difference that the scalar version does four times as many operations.

You can also take advantage of instruction-level parallelism when using SIMD. For example, I get about the same number of iterations per second for the following two functions (when using GDC):

import gcc.attribute;

@attribute("forceinline") void calcSIMD1()
{
    s0 = s0 * i0;
    s0 = s0 * d0;
    s1 = s1 * i1;
    s1 = s1 * d1;
    s2 = s2 * i2;
    s2 = s2 * d2;
    s3 = s3 * i3;
    s3 = s3 * d3;
}

@attribute("forceinline") void calcSIMD2()
{
    s0 = s0 * i0;
    s0 = s0 * d0;
}

By the way, if performance is very important to you, you should try GDC (or LDC, but I don't think LDC is currently fully usable on Windows).
Jun 29 2013
> From the dlang docs: "Member functions which are private or package are never virtual, and hence cannot be overridden."

The call to calcScalar compiles to this:

mov    rax, QWORD PTR [r12]
rex.W call  QWORD PTR [rax+0x40]

so I think the implementation doesn't conform to the spec in this case.
Jun 29 2013
> modern x86-64 processors can execute them in parallel. Because of that, the speed of your program is limited by instruction latency and not throughput.

It seems like auto-vectorization to SIMD code may be an ideal strategy (e.g. Java), since it seems that the conditions for getting any performance improvement have to be very particular and situational... which is something the compiler may be best suited to handle. Thoughts?
Jun 29 2013
> It seems like auto-vectorization to SIMD code may be an ideal strategy (e.g. Java), since it seems that the conditions for getting any performance improvement have to be very particular and situational... which is something the compiler may be best suited to handle. Thoughts?

The thing is that using SIMD efficiently often requires you to organize your data and your algorithm differently, which is something the compiler can't do for you. Another problem is that the compiler doesn't know how often different code paths will be executed, so it can't know how to use SIMD in the best way (that could be solved with profile-guided optimization, though). Alignment restrictions are another thing that can cause problems. For those reasons auto-vectorization only works in the simplest of cases. But if you want auto-vectorization, GDC and LDC already do it.

I recommend watching Manu's talk (as Kiith-Sa has already suggested): http://youtube.com/watch?v=q_39RnxtkgM
Jun 29 2013
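To illustrate the "organize your data differently" point, here is a small sketch contrasting array-of-structures with the structure-of-arrays layout that SIMD code usually wants. The types and the alignment assumption are illustrative, not taken from the benchmark.

import core.simd;

// Array of structures: the x/y/z of one point are adjacent in memory, so
// filling a SIMD register with four x values needs shuffles or gathers.
struct PointAoS { float x, y, z; }

// Structure of arrays: four consecutive x values are contiguous and map
// directly onto one float4.
struct PointsSoA { float[] x, y, z; }

// Scale every x by `factor`, four at a time. Assumes pts.x.ptr is 16-byte
// aligned (true for a fresh GC allocation); otherwise unaligned accesses
// would be needed.
void scaleX(ref PointsSoA pts, float factor)
{
    float4 f;
    foreach (lane; 0 .. 4)
        (cast(float*)&f)[lane] = factor;    // manual broadcast

    size_t n = pts.x.length & ~cast(size_t)3;
    for (size_t i = 0; i < n; i += 4)
    {
        float4 v = *cast(float4*)(pts.x.ptr + i);
        v = v * f;
        *cast(float4*)(pts.x.ptr + i) = v;
    }
    foreach (i; n .. pts.x.length)          // scalar tail
        pts.x[i] *= factor;
}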
I did watch Manu's talk a few days ago, which inspired me to start this project. With the updates in http://dpaste.dzfl.pl/fce2d93b, I'm still a bit clueless as to why there is almost zero performance difference... considering that it seems like an ideal setup to benefit from SIMD. I feel that if I can't see gains here, I shouldn't bother using SIMD in practice, where sometimes non-ideal operations must be done.
Jun 29 2013
You should probably watch my talk again ;)

Most of the points I make are towards the end, where I make the claim "almost everyone who tries to use SIMD will see the same or slower performance, and the reason is they have simply revealed other bottlenecks". And I also made the point "only by strictly applying ALL of the points I demonstrated will you see significant performance improvement".

The problem with your code is that it doesn't do any real work. Your operations are all dependent on the result of the previous operation. The scalar operations have a shorter latency than the SIMD operations, and they all execute in parallel. This is exactly the pathological worst-case comparison that basically everyone new to SIMD tries to write and then wonders why it's slow. I guess I should have demonstrated this point more clearly in my talk. It was very rushed (actually, the script was basically on the spot), sorry about that!

There's not enough code in those loops. You're basically profiling loop iteration performance and the latency of a float opcode vs a SIMD opcode... not any significant work. You should see a big difference if you unroll the loop 4-8 times (or more for such a short loop, depending on the CPU).

I also made the point that you should always avoid doing SIMD profiling on an x86, and certainly not an x64, since it is both the most forgiving (results in the least wins of any arch) and also the hardest to predict; the performance difference you see will almost certainly not be the same on someone else's chip.

Look again at my points about latency, reducing the overall pipeline length (demonstrated with the addition sequence), and unrolling the loops.

On 30 June 2013 06:34, Jonathan Dunlap <jadit2 gmail.com> wrote:
> I did watch Manu's talk a few days ago, which inspired me to start this project. With the updates in http://dpaste.dzfl.pl/fce2d93b, I'm still a bit clueless as to why there is almost zero performance difference... considering that it seems like an ideal setup to benefit from SIMD. I feel that if I can't see gains here, I shouldn't bother using SIMD in practice, where sometimes non-ideal operations must be done.
Jun 29 2013
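A rough sketch of the unrolling advice, with made-up names: several independent accumulators break the dependency chain, so the loop is limited by multiply throughput instead of multiply latency.

import core.simd;

// Dependent chain: each multiply waits for the previous result, so the loop
// runs at the latency of the multiply, not its throughput.
float4 dependent(float4 v, float4 m, size_t iters)
{
    foreach (_; 0 .. iters)
        v = v * m;
    return v;
}

// Four independent accumulators: the multiplies in one iteration do not
// depend on each other, so several can be in flight at once.
float4 unrolled(float4 v0, float4 v1, float4 v2, float4 v3, float4 m, size_t iters)
{
    foreach (_; 0 .. iters)
    {
        v0 = v0 * m;
        v1 = v1 * m;
        v2 = v2 * m;
        v3 = v3 * m;
    }
    return (v0 + v1) + (v2 + v3);
}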
Thanks Manu, I think I understand. Quick question: I've updated my test to allow for loop unrolling (http://dpaste.dzfl.pl/12933bc8), as the calculation is done over an array of elements and does not depend on the last operation. My problem is that the program reports using 0 time. However, as soon as I start printing out elements, the time jumps to looking more realistic. However, even if I print the elements of the list after I print the calculation time, I still get zero seconds. Like:

1: calc time
2: do operations
3: print time delta (result: 0 time)
4: print all values from operation

1: calc time
2: do operations
3: print all values from operation
4: print time delta (result: large time delta actually shown)

Is D performing operations lazily by default, or am I missing something?
Jul 01 2013
On Monday, 1 July 2013 at 17:19:02 UTC, Jonathan Dunlap wrote:
> Thanks Manu, I think I understand. Quick question: I've updated my test to allow for loop unrolling (http://dpaste.dzfl.pl/12933bc8)

The loop body in testSimd doesn't do anything. This line:

auto di = d[i];

copies the vector; it does not reference it.
Jul 01 2013
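A minimal sketch of the difference, with illustrative names: the first loop copies each vector and throws the result away, the second writes it back through a ref.

import core.simd;

void scaleAllCopy(float4[] d, float4 m)
{
    foreach (i; 0 .. d.length)
    {
        auto di = d[i];    // copies the vector: d[i] is never updated,
        di = di * m;       // so the whole loop can be optimized away
    }
}

void scaleAllRef(float4[] d, float4 m)
{
    foreach (ref di; d)    // ref: the result is written back into the array
        di = di * m;
}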
Thanks Jerro, I went ahead and used a pointer reference to ensure it's being saved back into the array (http://dpaste.dzfl.pl/52710926). Two things:

1) still showing a zero time delta
2) on Windows 7 x64, using a SAMPLE_AT size of 30000 or higher causes the program to immediately quit with no output at all. Even the first writeln statement in the constructor doesn't execute.
Jul 01 2013
Maybe make the arrays public? It's conceivable the optimiser could eliminate all that code, since it can prove the results are never referenced... I doubt that's the problem though, just a first guess.

On 2 July 2013 09:14, Jonathan Dunlap <jadit2 gmail.com> wrote:
> Thanks Jerro, I went ahead and used a pointer reference to ensure it's being saved back into the array (http://dpaste.dzfl.pl/52710926). Two things:
>
> 1) still showing a zero time delta
> 2) on Windows 7 x64, using a SAMPLE_AT size of 30000 or higher causes the program to immediately quit with no output at all. Even the first writeln statement in the constructor doesn't execute.
Jul 01 2013
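One generic way to make sure the benchmarked work cannot be proven dead is to consume the results after timing, e.g. by printing a checksum. A sketch (the names are made up, not taken from the dpaste):

import std.stdio;
import core.simd;

float4[] results;          // module-level and public: harder to prove unused

void benchBody(float4 m)
{
    foreach (ref r; results)
        r = r * m;
}

void reportChecksum()
{
    // Consuming the results after timing forces the timed work to happen.
    float sum = 0;
    foreach (r; results)
        foreach (lane; 0 .. 4)
            sum += (cast(const(float)*)&r)[lane];
    writeln("checksum: ", sum);
}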
Jonathan Dunlap:
> Thanks Jerro, I went ahead and used a pointer reference to ensure it's being saved back into the array (http://dpaste.dzfl.pl/52710926). Two things:
>
> 1) still showing a zero time delta
> 2) on Windows 7 x64, using a SAMPLE_AT size of 30000 or higher causes the program to immediately quit with no output at all. Even the first writeln statement in the constructor doesn't execute.

Have you taken a look at the asm?

Bye,
bearophile
Jul 01 2013
See Manu's talk and google how to use it. If you don't know what you're doing, you are unlikely to see performance improvements. I'm not even sure whether you're benchmarking SIMD performance or function call overhead there.
Jun 29 2013
> versioning on Win32/Win64 no longer works.

Why? Or what exactly? Details please)
Jun 29 2013
On 23.06.2013 15:33, Rainer Schuetze wrote:
> BTW: I also found some bugs in the Win64 support along the way; I'll create pull requests for these.

https://github.com/D-Programming-Language/dmd/pull/2253
https://github.com/D-Programming-Language/dmd/pull/2254
Jun 23 2013
Walter Bright:
> It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations.

LDC2 supports SIMD on Win32.

Bye,
bearophile
Jun 21 2013
On Friday, 21 June 2013 at 23:04:10 UTC, Walter Bright wrote:
> On 6/21/2013 3:43 PM, Jonathan Dunlap wrote:
>> In D 2.063.2 on Windows 7: Error: SIMD vector types not supported on this platform Should I file a bug for this or is this currently on a roadmap? I'm SUPER excited to get into SIMD development with D. :D
>
> It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.

How do you compile for Win64? The only package for Windows I see is i386, which doesn't seem to support Win64 offhand... does the compiler require a flag? (Yes, I'm on a Win64 OS/system.)
Jun 21 2013
On Saturday, 22 June 2013 at 04:28:57 UTC, Jonathan Dunlap wrote:
> On Friday, 21 June 2013 at 23:04:10 UTC, Walter Bright wrote:
>> It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.
>
> How do you compile for Win64? The only package for Windows I see is i386, which doesn't seem to support Win64 offhand... does the compiler require a flag? (Yes, I'm on a Win64 OS/system.)

The i386 DMD can produce Win64 COFF object files which can then be linked by the free MSVC toolchain. Basically you just need to install that toolchain, and DMD -m64 should just work in my experience.

Someone probably has a link handy for the MSVC toolchain. I'm not sure, because I have Visual Studio installed for work, so I've always already got it installed.
Jun 21 2013
If just installing VC++ doesn't work...

http://forum.dlang.org/post/mailman.2800.1355837582.5162.digitalmars-d puremagic.com

On Saturday, 22 June 2013 at 04:28:57 UTC, Jonathan Dunlap wrote:
> How do you compile for Win64? The only package for Windows I see is i386, which doesn't seem to support Win64 offhand... does the compiler require a flag? (Yes, I'm on a Win64 OS/system.)
Jun 21 2013
Alright, I installed VC2010 (with x64 libs) and added the -m64 option to the compiler. Sadly the compiler dies with the message below. Should I file a bug or did I miss something?

-----
Building: Easy (Debug)
Performing main compilation...
Current dictionary: C:\Users\dunlap\Documents\GitHub\CodeEval\Dlang\
C:\D\dmd2\windows\bin\dmd.exe -debug -gc "main.d" "SIMDTests.d" "-IC:\D\dmd2\src\druntime\src" "-IC:\D\dmd2\src\phobos" "-odobj\Debug" "-ofC:\Users\dunlap\Documents\GitHub\CodeEval\Dlang\bin\Debug\SIMDTests.exe" -m64
Internal error: ..\ztc\cgcv.c 2162
Exit code 1
Build complete -- 1 error, 0 warnings
Jun 21 2013
Also tried VC2012 Express (with x64 libs)... received the same compiler error.
Jun 21 2013
On 6/21/2013 10:54 PM, Jonathan Dunlap wrote:
> Alright, I installed VC2010 (with x64 libs) and added the -m64 option to the compiler. Sadly the compiler dies with the message below. Should I file a bug or did I miss something?
>
> [...]
>
> Internal error: ..\ztc\cgcv.c 2162

Anytime you see a message like "Internal error" it's a compiler bug and should be reported to bugzilla. To work around it, try replacing -gc with -g.
Jun 21 2013
On 22 June 2013 15:54, Jonathan Dunlap <jadit2 gmail.com> wrote:
> Alright, I installed VC2010 (with x64 libs) and added the -m64 option to the compiler. Sadly the compiler dies with the message below. Should I file a bug or did I miss something?
>
> [...]
>
> Internal error: ..\ztc\cgcv.c 2162

You can't use SIMD and symbolic debuginfo together. The compiler will crash.
Jun 22 2013
On 6/22/2013 1:10 AM, Manu wrote:
> You can't use SIMD and symbolic debuginfo together. The compiler will crash.

I didn't know that. Bugzilla?
Jun 22 2013
Pretty sure it's in there... Here it is: http://d.puremagic.com/issues/show_bug.cgi?id=10224

On 22 June 2013 18:36, Walter Bright <newshound2 digitalmars.com> wrote:
> On 6/22/2013 1:10 AM, Manu wrote:
>> You can't use SIMD and symbolic debuginfo together. The compiler will crash.
>
> I didn't know that. Bugzilla?
Jun 22 2013
Check twice where your 64-bit tools are installed. The paths differ somewhat between Win8, VS2010, and VS2012 Express installations.
Jun 21 2013
On 22.06.2013 00:43, Jonathan Dunlap wrote:
> In D 2.063.2 on Windows 7: Error: SIMD vector types not supported on this platform Should I file a bug for this or is this currently on a roadmap? I'm SUPER excited to get into SIMD development with D. :D

In its current state you don't want to be using SIMD with dmd, because the generated assembly will be significantly slower than if you just use the default FPU math. If you need SIMD you will need to write inline assembler. That will then also work on 32-bit Windows, but you have to use unaligned loads/stores, because the compiler will not guarantee alignment (on 32 bit).

More details on the underperforming generated assembly can be found here: http://d.puremagic.com/issues/show_bug.cgi?id=10226

Kind Regards
Benjamin Thaut
Jun 22 2013
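A minimal sketch of what such hand-written SIMD with unaligned loads/stores might look like in DMD's 32-bit inline assembler; the function and its parameters are made up for illustration, and a scalar fallback covers other targets.

// Multiply four floats from src by four floats from scale and store into dst,
// using movups so no 16-byte alignment is required.
void mulUnaligned(float* dst, const(float)* src, const(float)* scale)
{
    version (D_InlineAsm_X86)
    {
        // Copy the parameters to locals so the asm block only references
        // stack variables by name.
        auto pDst = dst;
        auto pSrc = src;
        auto pScale = scale;
        asm
        {
            mov EAX, pSrc;
            mov ECX, pScale;
            mov EDX, pDst;
            movups XMM0, [EAX];    // unaligned load
            movups XMM1, [ECX];
            mulps  XMM0, XMM1;
            movups [EDX], XMM0;    // unaligned store
        }
    }
    else
    {
        // Scalar fallback for targets other than 32-bit x86.
        foreach (i; 0 .. 4)
            dst[i] = src[i] * scale[i];
    }
}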
> In its current state you don't want to be using SIMD with dmd, because the generated assembly will be significantly slower than if you just use the default FPU math.

That may be true for some kinds of code, but it isn't true in general. For example, see the comparison of pfft's performance when built with 64-bit DMD with SIMD and without SIMD: http://i.imgur.com/kYYI9R9.png

This benchmark was run on a Core i5 2500K on 64-bit Debian Wheezy.
Jun 22 2013
On 22.06.2013 15:53, jerro wrote:
> That may be true for some kinds of code, but it isn't true in general. For example, see the comparison of pfft's performance when built with 64-bit DMD with SIMD and without SIMD: http://i.imgur.com/kYYI9R9.png
>
> This benchmark was run on a Core i5 2500K on 64-bit Debian Wheezy.

Well, but judging from the assembly it generates, it could be even faster. What exactly is pfft? Does it use dmd's __simd intrinsics, or does it only do primitive operations (* / - +) on SIMD types?

Kind Regards
Benjamin Thaut
Jun 22 2013
> Well, but judging from the assembly it generates, it could be even faster. What exactly is pfft? Does it use dmd's __simd intrinsics, or does it only do primitive operations (* / - +) on SIMD types?

It's an FFT implementation. It does most of the work using +, - and *. There's one part of the algorithm that uses mostly shufps, and that part takes about 10% of the time (for sizes around 2 ^^ 10 when using SSE).
Jun 22 2013
On 22.06.2013 15:53, jerro wrote:
> That may be true for some kinds of code, but it isn't true in general. For example, see the comparison of pfft's performance when built with 64-bit DMD with SIMD and without SIMD: http://i.imgur.com/kYYI9R9.png
>
> This benchmark was run on a Core i5 2500K on 64-bit Debian Wheezy.

Ok, I saw that you did write quite a few critical functions in inline assembly. Not really a good argument for dmd's codegen with SIMD intrinsics.

Kind Regards
Benjamin Thaut
Jun 22 2013
On Saturday, 22 June 2013 at 15:41:43 UTC, Benjamin Thaut wrote:
> Ok, I saw that you did write quite a few critical functions in inline assembly. Not really a good argument for dmd's codegen with SIMD intrinsics.

I have actually run that benchmark with the code from this branch: https://github.com/jerro/pfft/tree/experimental

The only function in sse_float.d on that branch that uses inline assembly is scalar_to_vector. The reason why I used more inline assembly in the master branch is that DMD didn't have intrinsics for some instructions, such as shufps, at the time.

I'm not really arguing for DMD's codegen with SIMD intrinsics. It's more that, from what I've seen, it doesn't produce very good scalar floating point code either (at least when compared to LDC or GDC). Whether I use scalar floating point or SIMD, pfft is about two times slower if I compile it with DMD than if I compile it with GDC.
Jun 22 2013
On Saturday, 22 June 2013 at 16:04:26 UTC, jerro wrote:
> I have actually run that benchmark with the code from this branch: https://github.com/jerro/pfft/tree/experimental

Hello, did you propose your pfft library as a replacement in std.numeric?
Jul 03 2013
> Hello, did you propose your pfft library as a replacement in std.numeric?

I have thought about it, but haven't gotten around to doing it yet. I'd like to finish support for multidimensional transforms first.
Jul 04 2013