digitalmars.D - SIMD on Windows
- Jonathan Dunlap (4/4) Jun 21 2013 In D 2.063.2 on Windows 7:
- Jonathan Dunlap (3/3) Jun 21 2013 Btw, is it possible to check for SIMD support as a compilation
- Walter Bright (4/8) Jun 21 2013 It's not a bug, and there are currently no plans to support SIMD on Win3...
- Manu (4/14) Jun 21 2013 It would certainly be nice in Win32, but I tend to think Win32 COFF shou...
- Rainer Schuetze (24/26) Jun 23 2013 I have removed the dust from these patches and pushed them successfully
- Manu (2/29) Jun 23 2013
- Jacob Carlborg (5/17) Jun 23 2013 So, you have implemented support for COFF 32bit? How long have you been
- Rainer Schuetze (3/7) Jun 23 2013 I experimented with it a few times, but it didn't work good enough until...
- Michael (2/2) Jun 23 2013 Cool)))
- Rainer Schuetze (7/9) Jun 24 2013 Let's see if Walter approves.
- Jonathan Dunlap (18/18) Jun 29 2013 Alright, I'm now officially building for Windows x64 (amd64).
- jerro (21/40) Jun 29 2013 First of all, calcSIMD and calcScalar are virtual functions so
- Jonathan Dunlap (9/15) Jun 29 2013 I've updated the project with your suggestions at
- Iain Buclaw (5/19) Jun 29 2013 s/class/struct/
- jerro (33/50) Jun 29 2013 The multiples 2 and 1 were the reason why the scalar code
- jerro (5/7) Jun 29 2013 The call to calcScalar compiles to this:
- Jonathan Dunlap (5/8) Jun 29 2013 It seems like auto-vectorization to SIMD code may be an ideal
- jerro (13/18) Jun 29 2013 The things is that using SIMD efficiently often requires you to
- Jonathan Dunlap (7/7) Jun 29 2013 I did watch Manu's a few days ago which inspired me to start this
- Manu (28/34) Jun 29 2013 You should probably watch my talk again ;)
- Jonathan Dunlap (19/19) Jul 01 2013 Thanks Manu, I think I understand. Quick questions, so I've
- jerro (4/7) Jul 01 2013 The loop body in testSimd doesn't do anything. This line:
- Jonathan Dunlap (8/8) Jul 01 2013 Thanks Jerro, I went ahead and used a pointer reference to ensure
- Manu (5/12) Jul 01 2013 Maybe make the arrays public? it's conceivable the optimiser could
- bearophile (4/12) Jul 01 2013 Have you taken a look at the asm?
- Kiith-Sa (4/4) Jun 29 2013 See Manu's talk and google how to use it. If you don't know what
- Michael (1/2) Jun 29 2013 Why? or What exactly? Details please)
- Rainer Schuetze (3/5) Jun 23 2013 https://github.com/D-Programming-Language/dmd/pull/2253
- bearophile (4/6) Jun 21 2013 LDC2 supports SIMD on Win32.
- Jonathan Dunlap (5/16) Jun 21 2013 How do you compile for Win64? The only package for Windows I see
- Brad Anderson (7/25) Jun 21 2013 The i386 DMD can produce Win64 COFF object files which can then
- Geancarlo Rocha (3/21) Jun 21 2013 If just installing VC++ doesn't work...
- Jonathan Dunlap (16/16) Jun 21 2013 Alright, I installed VC2010 (with x64 libs) and added the -m64
- Jonathan Dunlap (2/2) Jun 21 2013 Also tried VC2012 Express (with x64 libs)... received the same
- Walter Bright (4/17) Jun 21 2013 Anytime you see a message like "Internal error" it's a compiler bug and ...
- Manu (2/16) Jun 22 2013 You can't use SIMD and symbolic debuginfo. The compiler will crash.
- Walter Bright (2/3) Jun 22 2013 I didn't know that. Bugzilla?
- Manu (3/7) Jun 22 2013 Pretty sure it's in there...
- Michael (2/2) Jun 21 2013 Check twice where is yours 64 bit tools installed. Paths
- Benjamin Thaut (11/15) Jun 22 2013 In its current state you don't want to be using SIMD with dmd because
- jerro (6/9) Jun 22 2013 That may be true for some kinds of code, but it isn't true int
- Benjamin Thaut (6/14) Jun 22 2013 Well, but judging from the assembly it generates, it could be even
- jerro (5/9) Jun 22 2013 It's a FFT implementation. It does most of the work using + - and
- Benjamin Thaut (5/13) Jun 22 2013 Ok I saw that you did write quite a few cirtical functions in inline
In D 2.063.2 on Windows 7:

Error: SIMD vector types not supported on this platform

Should I file a bug for this, or is this currently on a roadmap? I'm SUPER excited to get into SIMD development with D. :D
Jun 21 2013
Btw, is it possible to check for SIMD support as a compilation condition? Ideally I'm looking to 'polyfill' SIMD if it's not supported on the platform.
Jun 21 2013
On 6/21/2013 3:43 PM, Jonathan Dunlap wrote:
> In D 2.063.2 on Windows 7: Error: SIMD vector types not supported on this platform Should I file a bug for this or is this currently on a roadmap? I'm SUPER excited to get into SIMD development with D. :D

It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.
Jun 21 2013
On 22 June 2013 09:04, Walter Bright <newshound2 digitalmars.com> wrote:
> On 6/21/2013 3:43 PM, Jonathan Dunlap wrote:
>> In D 2.063.2 on Windows 7: Error: SIMD vector types not supported on this platform Should I file a bug for this or is this currently on a roadmap? I'm SUPER excited to get into SIMD development with D. :D
>
> It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.

It would certainly be nice in Win32, but I tend to think Win32 COFF should be much higher priority.

And to the OP: there's a version(SIMD) you can test.
Jun 21 2013
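As an aside on the version() check Manu mentions above, here is a minimal sketch of compile-time SIMD detection with a scalar fallback. It uses the D_SIMD identifier that DMD defines for SIMD support; treat the exact identifier name as an assumption for your compiler and version.

import std.stdio;

version (D_SIMD)
{
    import core.simd;
    alias Vec = float4;     // hardware vector type
}
else
{
    alias Vec = float[4];   // scalar "polyfill" fallback
}

void main()
{
    version (D_SIMD)
        writeln("SIMD vector types are available");
    else
        writeln("no SIMD support on this target; using the scalar fallback");
}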
On 22.06.2013 02:07, Manu wrote:
> It would certainly be nice in Win32, but I tend to think Win32 COFF should be much higher priority.

I have removed the dust from these patches and pushed them successfully through the test suite and unittests:

https://github.com/rainers/dmd/tree/coff32
https://github.com/rainers/druntime/tree/coff32
https://github.com/rainers/phobos/tree/coff32

Compile dmd as usual, but druntime and phobos with something like

druntime: make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>"
phobos:   make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>" "AR=<path-to-32bit-lib>"

COFF32 files are generated when -m32ms is used on the command line. If you put the resulting libraries into the lib folder, using a standard installation of VS2010 might work, but I recommend adding a new section to sc.ini and adjusting the paths there. Mine looks like this:

[Environment32ms]
PATH=c:\l\vs9\Common7\IDE;%PATH%
LIB="%@P%\..\..\lib32";c:\l\vs9\vc\lib;"c:\Program Files (x86)\Microsoft SDKs\Windows\v7.1A\Lib"
DFLAGS=%DFLAGS% -L/nologo -L/INCREMENTAL:NO
LINKCMD=c:\l\vs9\vc\bin\link.exe

BTW: I also found some bugs in the Win64 support along the way; I'll create pull requests for these.
Jun 23 2013
I've said it before, but this man is a genius! :)

On 23 June 2013 23:33, Rainer Schuetze <r.sagitario gmx.de> wrote:
> I have removed the dust from these patches and pushed them successfully through the test suite and unittests:
>
> https://github.com/rainers/dmd/tree/coff32
> https://github.com/rainers/druntime/tree/coff32
> https://github.com/rainers/phobos/tree/coff32
>
> [...]
Jun 23 2013
On 2013-06-23 15:33, Rainer Schuetze wrote:
> I have removed the dust from these patches and pushed them successfully through the test suite and unittests:
>
> https://github.com/rainers/dmd/tree/coff32
> https://github.com/rainers/druntime/tree/coff32
> https://github.com/rainers/phobos/tree/coff32
>
> [...]
>
> COFF32 files are generated when -m32ms is used on the command line.

So, you have implemented support for COFF 32bit? How long have you been hiding this? :) Although I'm not a Windows user, I consider it great news.

--
/Jacob Carlborg
Jun 23 2013
On 23.06.2013 20:24, Jacob Carlborg wrote:
> On 2013-06-23 15:33, Rainer Schuetze wrote:
>> COFF32 files are generated when -m32ms is used on the command line.
>
> So, you have implemented support for COFF 32bit? How long have you been hiding this? :) Although I'm not a Windows user, I consider it great news.

I experimented with it a few times, but it didn't work well enough until this weekend.
Jun 23 2013
Cool))) Any chance of seeing it [coff32] in an official build?
Jun 23 2013
On 23.06.2013 21:55, Michael wrote:
> Cool))) Any chance of seeing it [coff32] in an official build?

Let's see if Walter approves.

There is one potentially disruptive change: with two different C runtimes available for Win32, versioning on Win32/Win64 no longer works. I added the versions CRuntime_DigitalMars and CRuntime_Microsoft (and CRuntime_GNU for anything else), and adapting to this accounts for most of the changes in druntime and phobos.
Jun 24 2013
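To illustrate the versioning change Rainer describes, here is a rough sketch of what user code keyed on the C runtime (rather than on Win32/Win64) could look like. The CRuntime_* names are the ones proposed in the patch above; whether they land unchanged is an assumption.

version (CRuntime_DigitalMars)
{
    // Classic Win32 build linking against the Digital Mars C runtime (snn.lib).
    enum crt = "DigitalMars";
}
else version (CRuntime_Microsoft)
{
    // Win64, and the proposed -m32ms COFF32 target, linking against the MSVC runtime.
    enum crt = "Microsoft";
}
else
{
    enum crt = "other";
}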
Alright, I'm now officially building for Windows x64 (amd64). I've created this early benchmark http://dpaste.dzfl.pl/eae0233e to explore SIMD performance. As you can see below, on my machine there is almost zero difference. Am I missing something?

//===SIMD===
0 1.#INF 5 1.#INF   <-- vector result
hnsecs: 100006      <-- duration time
0 1.#INF 5 1.#INF
hnsecs: 90005
0 1.#INF 5 1.#INF
hnsecs: 90006

//===SCALAR===
0 1.#INF 5 1.#INF
hnsecs: 90005
0 1.#INF 5 1.#INF
hnsecs: 100005
0 1.#INF 5 1.#INF
hnsecs: 100006
Jun 29 2013
On Saturday, 29 June 2013 at 14:39:44 UTC, Jonathan Dunlap wrote:
> Alright, I'm now officially building for Windows x64 (amd64). I've created this early benchmark http://dpaste.dzfl.pl/eae0233e to explore SIMD performance. As you can see below, on my machine there is almost zero difference. Am I missing something?
>
> [...]

First of all, calcSIMD and calcScalar are virtual functions, so they can't be inlined, which prevents any further optimization. It also seems that the fact that g, s, i and d are class fields and that g is a static array makes DMD load them from memory and store them back on every iteration, even when calcSIMD and calcScalar are inlined.

But even if I make the class final and build it with gdc -O3 -finline-functions -frelease -march=native (in which case GDC generates assembly that looks optimal to me), the scalar version is still a bit faster than the vector version. The main reason for that is that even with scalar code, the compiler can do multiple operations in parallel. On Sandy Bridge CPUs, for example, floating point multiplication takes 5 cycles to complete, but the processor can do one multiplication per cycle. So my guess is that the first four multiplications and the second four multiplications in calcScalar are done in parallel. That would explain the scalar code being equally fast as, but not faster than, the vector code. The reason it's actually faster is that GDC replaces multiplication by 2 with addition and omits multiplication by 1.
Jun 29 2013
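A minimal sketch of the devirtualization point above; the class and field names are illustrative, not taken from the dpaste. Making the class final removes the virtual call so the loop body can be inlined, and copying the fields into locals avoids the per-iteration loads and stores of class fields.

import core.simd;

// final: no virtual dispatch, so the compiler is free to inline the hot path.
final class SimdBench
{
    float4 s;
    float4 i;

    void run(size_t iterations)
    {
        float4 ls = s;              // locals: stay in registers
        float4 li = i;
        foreach (_; 0 .. iterations)
            ls = ls * li;
        s = ls;                     // write back once at the end
    }
}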
I've updated the project with your suggestions at http://dpaste.dzfl.pl/fce2d93b but still get the same performance. Vectors defined in the benchmark function body, no function calling overhead, etc. See some of my comments below, btw:

> First of all, calcSIMD and calcScalar are virtual functions, so they can't be inlined, which prevents any further optimization.

From the dlang docs: "Member functions which are private or package are never virtual, and hence cannot be overridden."

> So my guess is that the first four multiplications and the second four multiplications in calcScalar are done in parallel. ... The reason it's faster is that gdc replaces multiplication by 2 with addition and omits multiplication by 1.

I've changed the multiplies of 2 and 1 to 2.1 and 1.01 respectively. Still no performance difference between the two for me.
Jun 29 2013
On 29 June 2013 18:57, Jonathan Dunlap <jadit2 gmail.com> wrote:
> I've updated the project with your suggestions at http://dpaste.dzfl.pl/fce2d93b but still get the same performance. Vectors defined in the benchmark function body, no function calling overhead, etc. See some of my comments below, btw:
>
> [...]

s/class/struct/

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jun 29 2013
On Saturday, 29 June 2013 at 17:57:20 UTC, Jonathan Dunlap wrote:
> I've updated the project with your suggestions at http://dpaste.dzfl.pl/fce2d93b but still get the same performance. Vectors defined in the benchmark function body, no function calling overhead, etc. See some of my comments below, btw:
>
> [...]
>
> I've changed the multiplies of 2 and 1 to 2.1 and 1.01 respectively. Still no performance difference between the two for me.

The multiples 2 and 1 were the reason why the scalar code performed a little better than the SIMD code when compiled with GDC.

The main reason why scalar code isn't much slower than SIMD code is instruction-level parallelism. Because the first four operations in calcScalar are independent (none of them depends on the result of any of the other three), modern x86-64 processors can execute them in parallel. Because of that, the speed of your program is limited by instruction latency and not throughput. That's why it doesn't really make a difference that the scalar version does four times as many operations.

You can also take advantage of instruction-level parallelism when using SIMD. For example, I get about the same number of iterations per second for the following two functions (when using GDC):

import gcc.attribute;

@attribute("forceinline") void calcSIMD1()
{
    s0 = s0 * i0;
    s0 = s0 * d0;
    s1 = s1 * i1;
    s1 = s1 * d1;
    s2 = s2 * i2;
    s2 = s2 * d2;
    s3 = s3 * i3;
    s3 = s3 * d3;
}

@attribute("forceinline") void calcSIMD2()
{
    s0 = s0 * i0;
    s0 = s0 * d0;
}

By the way, if performance is very important to you, you should try GDC (or LDC, but I don't think LDC is currently fully usable on Windows).
Jun 29 2013
> From the dlang docs: "Member functions which are private or package are never virtual, and hence cannot be overridden."

The call to calcScalar compiles to this:

mov    rax, QWORD PTR [r12]
rex.W call  QWORD PTR [rax+0x40]

so I think the implementation doesn't conform to the spec in this case.
Jun 29 2013
> modern x86-64 processors can execute them in parallel. Because of that, the speed of your program is limited by instruction latency and not throughput.

It seems like auto-vectorization to SIMD code may be an ideal strategy (e.g. Java), since it seems that the conditions for getting any performance improvement have to be very particular and situational... which is something the compiler may be best suited to handle. Thoughts?
Jun 29 2013
> It seems like auto-vectorization to SIMD code may be an ideal strategy (e.g. Java), since it seems that the conditions for getting any performance improvement have to be very particular and situational... which is something the compiler may be best suited to handle. Thoughts?

The thing is that using SIMD efficiently often requires you to organize your data and your algorithm differently, which is something the compiler can't do for you. Another problem is that the compiler doesn't know how often different code paths will be executed, so it can't know how to use SIMD in the best way (that could be solved with profile-guided optimization, though). Alignment restrictions are another thing that can cause problems. For those reasons auto-vectorization only works in the simplest of cases. But if you want auto-vectorization, GDC and LDC already do it.

I recommend watching Manu's talk (as Kiith-Sa has already suggested): http://youtube.com/watch?v=q_39RnxtkgM
Jun 29 2013
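To illustrate the "organize your data differently" point, here is a small sketch contrasting array-of-structures with the structure-of-arrays layout that SIMD code usually wants. The types and the alignment assumption are illustrative, not taken from the benchmark.

import core.simd;

// Array of structures: the x/y/z of one point are adjacent in memory, so
// filling a SIMD register with four x values needs shuffles or gathers.
struct PointAoS { float x, y, z; }

// Structure of arrays: four consecutive x values are contiguous and map
// directly onto one float4.
struct PointsSoA { float[] x, y, z; }

// Scale every x by `factor`, four at a time. Assumes pts.x.ptr is 16-byte
// aligned (true for a fresh GC allocation); otherwise unaligned accesses
// would be needed.
void scaleX(ref PointsSoA pts, float factor)
{
    float4 f;
    foreach (lane; 0 .. 4)
        (cast(float*)&f)[lane] = factor;    // manual broadcast

    size_t n = pts.x.length & ~cast(size_t)3;
    for (size_t i = 0; i < n; i += 4)
    {
        float4 v = *cast(float4*)(pts.x.ptr + i);
        v = v * f;
        *cast(float4*)(pts.x.ptr + i) = v;
    }
    foreach (i; n .. pts.x.length)          // scalar tail
        pts.x[i] *= factor;
}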
I did watch Manu's talk a few days ago, which inspired me to start this project. With the updates in http://dpaste.dzfl.pl/fce2d93b, I'm still a bit clueless as to why there is almost zero performance difference... considering that it seems like an ideal setup to benefit from SIMD. I feel that if I can't see gains here, I shouldn't bother using SIMD in practice, where sometimes non-ideal operations must be done.
Jun 29 2013
You should probably watch my talk again ;)

Most of the points I make are towards the end, where I make the claim "almost everyone who tries to use SIMD will see the same or slower performance, and the reason is they have simply revealed other bottlenecks". And I also made the point "only by strictly applying ALL of the points I demonstrated will you see significant performance improvement".

The problem with your code is that it doesn't do any real work. Your operations are all dependent on the result of the previous operation. The scalar operations have a shorter latency than the SIMD operations, and they all execute in parallel. This is exactly the pathological worst-case comparison that basically everyone new to SIMD tries to write and then wonders why it's slow. I guess I should have demonstrated this point more clearly in my talk. It was very rushed (actually, the script was basically on the spot), sorry about that!

There's not enough code in those loops. You're basically profiling loop iteration performance and the latency of a float opcode vs a SIMD opcode... not any significant work. You should see a big difference if you unroll the loop 4-8 times (or more for such a short loop, depending on the CPU).

I also made the point that you should always avoid doing SIMD profiling on an x86, and certainly not an x64, since it is both the most forgiving (results in the least wins of any arch) and also the hardest to predict; the performance difference you see will almost certainly not be the same on someone else's chip.

Look again at my points about latency, reducing the overall pipeline length (demonstrated with the addition sequence), and unrolling the loops.

On 30 June 2013 06:34, Jonathan Dunlap <jadit2 gmail.com> wrote:
> I did watch Manu's talk a few days ago, which inspired me to start this project. With the updates in http://dpaste.dzfl.pl/fce2d93b, I'm still a bit clueless as to why there is almost zero performance difference... considering that it seems like an ideal setup to benefit from SIMD. I feel that if I can't see gains here, I shouldn't bother using SIMD in practice, where sometimes non-ideal operations must be done.
Jun 29 2013
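A rough sketch of the unrolling advice, with made-up names: several independent accumulators break the dependency chain, so the loop is limited by multiply throughput instead of multiply latency.

import core.simd;

// Dependent chain: each multiply waits for the previous result, so the loop
// runs at the latency of the multiply, not its throughput.
float4 dependent(float4 v, float4 m, size_t iters)
{
    foreach (_; 0 .. iters)
        v = v * m;
    return v;
}

// Four independent accumulators: the multiplies in one iteration do not
// depend on each other, so several can be in flight at once.
float4 unrolled(float4 v0, float4 v1, float4 v2, float4 v3, float4 m, size_t iters)
{
    foreach (_; 0 .. iters)
    {
        v0 = v0 * m;
        v1 = v1 * m;
        v2 = v2 * m;
        v3 = v3 * m;
    }
    return (v0 + v1) + (v2 + v3);
}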
Thanks Manu, I think I understand. Quick question: I've updated my test to allow for loop unrolling (http://dpaste.dzfl.pl/12933bc8), as the calculation is done over an array of elements and does not depend on the last operation. My problem is that the program reports using 0 time. However, as soon as I start printing out elements, the time jumps to looking more realistic. However, even if I print the elements of the list after I print the calculation time, I still get zero seconds. Like:

1: calc time
2: do operations
3: print time delta (result: 0 time)
4: print all values from operation

1: calc time
2: do operations
3: print all values from operation
4: print time delta (result: large time delta actually shown)

Is D performing operations lazily by default, or am I missing something?
Jul 01 2013
On Monday, 1 July 2013 at 17:19:02 UTC, Jonathan Dunlap wrote:
> Thanks Manu, I think I understand. Quick question: I've updated my test to allow for loop unrolling (http://dpaste.dzfl.pl/12933bc8)

The loop body in testSimd doesn't do anything. This line:

auto di = d[i];

copies the vector; it does not reference it.
Jul 01 2013
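A minimal sketch of the difference, with illustrative names: the first loop copies each vector and throws the result away, the second writes it back through a ref.

import core.simd;

void scaleAllCopy(float4[] d, float4 m)
{
    foreach (i; 0 .. d.length)
    {
        auto di = d[i];    // copies the vector: d[i] is never updated,
        di = di * m;       // so the whole loop can be optimized away
    }
}

void scaleAllRef(float4[] d, float4 m)
{
    foreach (ref di; d)    // ref: the result is written back into the array
        di = di * m;
}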
Thanks Jerro, I went ahead and used a pointer reference to ensure it's being saved back into the array (http://dpaste.dzfl.pl/52710926). Two things:

1) still showing a zero time delta
2) on Windows 7 x64, using a SAMPLE_AT size of 30000 or higher causes the program to immediately quit with no output at all. Even the first writeln statement in the constructor doesn't execute.
Jul 01 2013
Maybe make the arrays public? It's conceivable the optimiser could eliminate all that code, since it can prove the results are never referenced... I doubt that's the problem though, just a first guess.

On 2 July 2013 09:14, Jonathan Dunlap <jadit2 gmail.com> wrote:
> Thanks Jerro, I went ahead and used a pointer reference to ensure it's being saved back into the array (http://dpaste.dzfl.pl/52710926). Two things:
>
> 1) still showing a zero time delta
> 2) on Windows 7 x64, using a SAMPLE_AT size of 30000 or higher causes the program to immediately quit with no output at all. Even the first writeln statement in the constructor doesn't execute.
Jul 01 2013
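One generic way to make sure the benchmarked work cannot be proven dead is to consume the results after timing, e.g. by printing a checksum. A sketch (the names are made up, not taken from the dpaste):

import std.stdio;
import core.simd;

float4[] results;          // module-level and public: harder to prove unused

void benchBody(float4 m)
{
    foreach (ref r; results)
        r = r * m;
}

void reportChecksum()
{
    // Consuming the results after timing forces the timed work to happen.
    float sum = 0;
    foreach (r; results)
        foreach (lane; 0 .. 4)
            sum += (cast(const(float)*)&r)[lane];
    writeln("checksum: ", sum);
}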
Jonathan Dunlap:
> Thanks Jerro, I went ahead and used a pointer reference to ensure it's being saved back into the array (http://dpaste.dzfl.pl/52710926). Two things:
>
> 1) still showing a zero time delta
> 2) on Windows 7 x64, using a SAMPLE_AT size of 30000 or higher causes the program to immediately quit with no output at all. Even the first writeln statement in the constructor doesn't execute.

Have you taken a look at the asm?

Bye,
bearophile
Jul 01 2013
See Manu's talk and google how to use it. If you don't know what you're doing, you are unlikely to see performance improvements. I'm not even sure whether you're benchmarking SIMD performance or function call overhead there.
Jun 29 2013
> versioning on Win32/Win64 no longer works.

Why? Or what exactly? Details please)
Jun 29 2013
On 23.06.2013 15:33, Rainer Schuetze wrote:
> BTW: I also found some bugs in the Win64 support along the way; I'll create pull requests for these.

https://github.com/D-Programming-Language/dmd/pull/2253
https://github.com/D-Programming-Language/dmd/pull/2254
Jun 23 2013
Walter Bright:
> It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations.

LDC2 supports SIMD on Win32.

Bye,
bearophile
Jun 21 2013
On Friday, 21 June 2013 at 23:04:10 UTC, Walter Bright wrote:
> On 6/21/2013 3:43 PM, Jonathan Dunlap wrote:
>> In D 2.063.2 on Windows 7: Error: SIMD vector types not supported on this platform Should I file a bug for this or is this currently on a roadmap? I'm SUPER excited to get into SIMD development with D. :D
>
> It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.

How do you compile for Win64? The only package for Windows I see is i386, which doesn't seem to support Win64 offhand... does the compiler require a flag? (Yes, I'm on a Win64 OS/system.)
Jun 21 2013
On Saturday, 22 June 2013 at 04:28:57 UTC, Jonathan Dunlap wrote:
> On Friday, 21 June 2013 at 23:04:10 UTC, Walter Bright wrote:
>> It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.
>
> How do you compile for Win64? The only package for Windows I see is i386, which doesn't seem to support Win64 offhand... does the compiler require a flag? (Yes, I'm on a Win64 OS/system.)

The i386 DMD can produce Win64 COFF object files which can then be linked by the free MSVC toolchain. Basically you just need to install that toolchain, and DMD -m64 should just work in my experience.

Someone probably has a link handy for the MSVC toolchain. I'm not sure, because I have Visual Studio installed for work, so I've always already got it installed.
Jun 21 2013
If just installing VC++ doesn't work...

http://forum.dlang.org/post/mailman.2800.1355837582.5162.digitalmars-d puremagic.com

On Saturday, 22 June 2013 at 04:28:57 UTC, Jonathan Dunlap wrote:
> How do you compile for Win64? The only package for Windows I see is i386, which doesn't seem to support Win64 offhand... does the compiler require a flag? (Yes, I'm on a Win64 OS/system.)
Jun 21 2013
Alright, I installed VC2010 (with x64 libs) and added the -m64 option to the compiler. Sadly the compiler dies with the message below. Should I file a bug or did I miss something?

-----
Building: Easy (Debug)
Performing main compilation...
Current dictionary: C:\Users\dunlap\Documents\GitHub\CodeEval\Dlang\
C:\D\dmd2\windows\bin\dmd.exe -debug -gc "main.d" "SIMDTests.d" "-IC:\D\dmd2\src\druntime\src" "-IC:\D\dmd2\src\phobos" "-odobj\Debug" "-ofC:\Users\dunlap\Documents\GitHub\CodeEval\Dlang\bin\Debug\SIMDTests.exe" -m64
Internal error: ..\ztc\cgcv.c 2162
Exit code 1
Build complete -- 1 error, 0 warnings
Jun 21 2013
Also tried VC2012 Express (with x64 libs)... received the same compiler error.
Jun 21 2013
On 6/21/2013 10:54 PM, Jonathan Dunlap wrote:
> Alright, I installed VC2010 (with x64 libs) and added the -m64 option to the compiler. Sadly the compiler dies with the message below. Should I file a bug or did I miss something?
>
> [...]
>
> Internal error: ..\ztc\cgcv.c 2162

Anytime you see a message like "Internal error" it's a compiler bug and should be reported to bugzilla. To work around it, try replacing -gc with -g.
Jun 21 2013
On 22 June 2013 15:54, Jonathan Dunlap <jadit2 gmail.com> wrote:
> Alright, I installed VC2010 (with x64 libs) and added the -m64 option to the compiler. Sadly the compiler dies with the message below. Should I file a bug or did I miss something?
>
> [...]
>
> Internal error: ..\ztc\cgcv.c 2162

You can't use SIMD and symbolic debuginfo together. The compiler will crash.
Jun 22 2013
On 6/22/2013 1:10 AM, Manu wrote:
> You can't use SIMD and symbolic debuginfo together. The compiler will crash.

I didn't know that. Bugzilla?
Jun 22 2013
Pretty sure it's in there... Here it is: http://d.puremagic.com/issues/show_bug.cgi?id=10224

On 22 June 2013 18:36, Walter Bright <newshound2 digitalmars.com> wrote:
> On 6/22/2013 1:10 AM, Manu wrote:
>> You can't use SIMD and symbolic debuginfo together. The compiler will crash.
>
> I didn't know that. Bugzilla?
Jun 22 2013
Check twice where your 64-bit tools are installed. The paths differ somewhat between Win8, VS2010, and VS2012 Express installations.
Jun 21 2013
On 22.06.2013 00:43, Jonathan Dunlap wrote:
> In D 2.063.2 on Windows 7: Error: SIMD vector types not supported on this platform Should I file a bug for this or is this currently on a roadmap? I'm SUPER excited to get into SIMD development with D. :D

In its current state you don't want to be using SIMD with dmd, because the generated assembly will be significantly slower than if you just use the default FPU math. If you need SIMD you will need to write inline assembler. That will then also work on 32-bit Windows, but you have to use unaligned loads/stores, because the compiler will not guarantee alignment (on 32 bit).

More details on the underperforming generated assembly can be found here: http://d.puremagic.com/issues/show_bug.cgi?id=10226

Kind Regards
Benjamin Thaut
Jun 22 2013
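A minimal sketch of what such hand-written SIMD with unaligned loads/stores might look like in DMD's 32-bit inline assembler; the function and its parameters are made up for illustration, and a scalar fallback covers other targets.

// Multiply four floats from src by four floats from scale and store into dst,
// using movups so no 16-byte alignment is required.
void mulUnaligned(float* dst, const(float)* src, const(float)* scale)
{
    version (D_InlineAsm_X86)
    {
        // Copy the parameters to locals so the asm block only references
        // stack variables by name.
        auto pDst = dst;
        auto pSrc = src;
        auto pScale = scale;
        asm
        {
            mov EAX, pSrc;
            mov ECX, pScale;
            mov EDX, pDst;
            movups XMM0, [EAX];    // unaligned load
            movups XMM1, [ECX];
            mulps  XMM0, XMM1;
            movups [EDX], XMM0;    // unaligned store
        }
    }
    else
    {
        // Scalar fallback for targets other than 32-bit x86.
        foreach (i; 0 .. 4)
            dst[i] = src[i] * scale[i];
    }
}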
> In its current state you don't want to be using SIMD with dmd, because the generated assembly will be significantly slower than if you just use the default FPU math.

That may be true for some kinds of code, but it isn't true in general. For example, see the comparison of pfft's performance when built with 64-bit DMD with SIMD and without SIMD: http://i.imgur.com/kYYI9R9.png

This benchmark was run on a Core i5 2500K on 64-bit Debian Wheezy.
Jun 22 2013
On 22.06.2013 15:53, jerro wrote:
> That may be true for some kinds of code, but it isn't true in general. For example, see the comparison of pfft's performance when built with 64-bit DMD with SIMD and without SIMD: http://i.imgur.com/kYYI9R9.png
>
> This benchmark was run on a Core i5 2500K on 64-bit Debian Wheezy.

Well, but judging from the assembly it generates, it could be even faster. What exactly is pfft? Does it use dmd's __simd intrinsics, or does it only do primitive operations (* / - +) on SIMD types?

Kind Regards
Benjamin Thaut
Jun 22 2013
> Well, but judging from the assembly it generates, it could be even faster. What exactly is pfft? Does it use dmd's __simd intrinsics, or does it only do primitive operations (* / - +) on SIMD types?

It's an FFT implementation. It does most of the work using +, - and *. There's one part of the algorithm that uses mostly shufps, and that part takes about 10% of the time (for sizes around 2 ^^ 10 when using SSE).
Jun 22 2013
On 22.06.2013 15:53, jerro wrote:
> That may be true for some kinds of code, but it isn't true in general. For example, see the comparison of pfft's performance when built with 64-bit DMD with SIMD and without SIMD: http://i.imgur.com/kYYI9R9.png
>
> This benchmark was run on a Core i5 2500K on 64-bit Debian Wheezy.

Ok, I saw that you did write quite a few critical functions in inline assembly. Not really a good argument for dmd's codegen with SIMD intrinsics.

Kind Regards
Benjamin Thaut
Jun 22 2013
On Saturday, 22 June 2013 at 15:41:43 UTC, Benjamin Thaut wrote:
> Ok, I saw that you did write quite a few critical functions in inline assembly. Not really a good argument for dmd's codegen with SIMD intrinsics.

I have actually run that benchmark with the code from this branch: https://github.com/jerro/pfft/tree/experimental

The only function in sse_float.d on that branch that uses inline assembly is scalar_to_vector. The reason why I used more inline assembly in the master branch is that DMD didn't have intrinsics for some instructions, such as shufps, at the time.

I'm not really arguing for DMD's codegen with SIMD intrinsics. It's more that, from what I've seen, it doesn't produce very good scalar floating point code either (at least when compared to LDC or GDC). Whether I use scalar floating point or SIMD, pfft is about two times slower if I compile it with DMD than if I compile it with GDC.
Jun 22 2013
On Saturday, 22 June 2013 at 16:04:26 UTC, jerro wrote:
> I have actually run that benchmark with the code from this branch: https://github.com/jerro/pfft/tree/experimental

Hello, did you propose your pfft library as a replacement in std.numeric?
Jul 03 2013
> Hello, did you propose your pfft library as a replacement in std.numeric?

I have thought about it, but haven't gotten around to doing it yet. I'd like to finish support for multidimensional transforms first.
Jul 04 2013