digitalmars.D - SIMD benchmark
- Walter Bright (30/30) Jan 14 2012 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
- Walter Bright (3/5) Jan 14 2012 Here's what there is at the moment. Needs much more.
- Peter Alexander (17/21) Jan 15 2012 You sure you want proper bug reports for this? There still seems to be a...
- Walter Bright (2/3) Jan 15 2012 Yeah, it's just OSX. I had the test for that platform inadvertently disa...
- Iain Buclaw (5/9) Jan 15 2012 I get 20+ speedup without optimisations with GDC on that small test. :)
- Iain Buclaw (6/12) Jan 15 2012 Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up
- Walter Bright (2/4) Jan 15 2012 Woo-hoo!
- bearophile (4/6) Jan 15 2012 Please, show me the assembly code produced, with its relative D source :...
- Iain Buclaw (44/50) Jan 15 2012 D code:
- Iain Buclaw (28/34) Jan 15 2012 For those who can't read AT&T:
- Manu (5/19) Jan 15 2012 Oh my indeed.
- Andre Tampubolon (7/43) Jan 16 2012 I just built 32 & 64 bit DMD (latest commit on git tree is
- Walter Bright (2/8) Jan 16 2012 Which machine?
- Andre Tampubolon (5/17) Jan 16 2012 Well I only have 1 machine, a laptop running 64 bit Arch Linux.
- Walter Bright (3/20) Jan 16 2012 32 bit SIMD for Linux is not implemented.
- Andrei Alexandrescu (11/31) Jan 16 2012 These two functions should have the same speed. The function that ought
- Manu (3/31) Jan 16 2012 A function using float arrays and a function using hardware vectors shou...
- Andrei Alexandrescu (4/6) Jan 16 2012 My point was that the version using float arrays should
- Manu (14/21) Jan 16 2012 I think this is a mistake, because such a piece of code never exists
- Timon Gehr (3/25) Jan 16 2012 I think DMD now uses XMM registers for scalar floating point arithmetic
- Manu (6/40) Jan 16 2012 x64 can do the swapping too with no penalty, but that is the only
- Walter Bright (2/4) Jan 16 2012 Ah, that is a crucial bit of information.
- Michel Fortin (16/41) Jan 16 2012 Andrei's idea could be valid as an optimization when the compiler can
- Andrei Alexandrescu (5/9) Jan 16 2012 In this case it's the exact contrary: the float[4] and the operation are...
- Michel Fortin (17/26) Jan 16 2012 That's exactly what I meant, if everything is local to the function you
- Manu (6/9) Jan 16 2012 Yes, my first thought when I saw this test was "why is it generating any
- Walter Bright (3/8) Jan 16 2012 Compile with inlining off, and the compiler 'forgets' what the called fu...
- Walter Bright (15/17) Jan 16 2012 Currently, it is 4 byte aligned. But the compiler could align freestandi...
- Walter Bright (13/15) Jan 16 2012 Yes, you're right. The compiler can opportunistically convert a number o...
- Iain Buclaw (8/25) Jan 16 2012 Leave that sort of optimisation for the backend to handle please. ;-)
- Walter Bright (3/5) Jan 16 2012 Of course.
- Iain Buclaw (13/20) Jan 16 2012 There's auto-vectorisation for for(), foreach(), and foreach_reverse()
- Peter Alexander (7/28) Jan 16 2012 Unfortunately, if the function was this:
- Manu (2/40) Jan 16 2012 This is why D needs a __restrict attribute! ;)
- Walter Bright (4/11) Jan 16 2012 That's why D has:
- Manu (2/19) Jan 16 2012 Surely it would be possible for them to be overlapping slices?
- Simen Kjærås (3/28) Jan 16 2012 If they are, that's your fault and your problem.
- Simen Kjærås (6/36) Jan 16 2012 Sorry, forgot the link: http://www.d-programming-language.org/arrays.htm...
- Walter Bright (2/3) Jan 16 2012 Not allowed, treated like array bounds checking.
- Iain Buclaw (10/48) Jan 16 2012 Compile with -fstrict-aliasing then?
- Peter Alexander (10/54) Jan 17 2012 This has nothing to do with strict aliasing.
- Walter Bright (3/5) Jan 17 2012 No, you don't. It can be done with a runtime check, like array bounds ch...
- Peter Alexander (17/23) Jan 17 2012 So you'd change it to this, even in release builds?
- Walter Bright (3/12) Jan 17 2012 No. Like array bounds, if they overlap, an exception is thrown.
- Peter Alexander (12/25) Jan 17 2012 The D spec says that overlapping arrays are illegal for vector ops. The
- Walter Bright (4/31) Jan 17 2012 No, not illegal.
- Peter Alexander (11/48) Jan 17 2012 So, my original point still stands, you can't vectorise this function:
- Walter Bright (5/15) Jan 17 2012 No, you can rewrite it as:
- Timon Gehr (3/23) Jan 17 2012 Are they really a general solution? How do you use vector ops to
- F i L (22/24) Jan 17 2012 struct Matrix4
- Timon Gehr (5/29) Jan 17 2012 The parameter is just squared and returned?
- Iain Buclaw (8/58) Jan 16 2012 -)
- Manu (6/59) Jan 17 2012 What protects these ranges from being overlapping? What if they were
- Walter Bright (2/3) Jan 17 2012 A runtime check, like array bounds checking.
- Manu (3/7) Jan 17 2012 Awesome.
- Walter Bright (2/9) Jan 17 2012 It can't. Use dynamic arrays - that's what they're for.
- Iain Buclaw (14/76) Jan 17 2012 t
- Martin Nowak (8/13) Jan 16 2012 Thought of that too, but it's rather tough to manage slots in vector
- bearophile (17/18) Jan 16 2012 Until better optimizations are implemented, I see a "simple" optimizatio...
- Manu (6/24) Jan 17 2012 If this doesn't already exist, I think it's quite important. I was think...
- Martin Nowak (3/37) Jan 17 2012 If the compiler knows it's a compile time constant
- Manu (2/41) Jan 17 2012 Great idea! :)
- Martin Nowak (7/42) Jan 16 2012 Unfortunately druntime's array ops are a mess and fail
- Don Clugston (6/52) Jan 17 2012 Yes. The structural problem in the compiler is that array ops get turned...
- Martin Nowak (13/68) Jan 17 2012 Oh, I was literally speaking of the runtime implementation.
- Walter Bright (2/4) Jan 17 2012 I think you've got an innovative and clever solution. I'd like to see yo...
- Martin Nowak (5/10) Jan 17 2012 Mmh, there was something keeping me from specializing templates,
- Walter Bright (2/5) Jan 17 2012 I agree.
I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with. ----------------------- import core.simd; void test1a(float[4] a) { } void test1() { float[4] a = 1.2; a[] = a[] * 3 + 7; test1a(a); } void test2a(float4 a) { } void test2() { float4 a = 1.2; a = a * 3 + 7; test2a(a); } import std.stdio; import std.datetime; int main() { test1(); test2(); auto b = comparingBenchmark!(test1, test2, 100); writeln(b.point); return 0; }
Jan 14 2012
On 1/14/2012 10:56 PM, Walter Bright wrote:as there's a serious lack of source code to test it with.Here's what there is at the moment. Needs much more. https://github.com/D-Programming-Language/dmd/blob/master/test/runnable/testxmm.d
Jan 14 2012
On 15/01/12 6:56 AM, Walter Bright wrote:I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.You sure you want proper bug reports for this? There still seems to be a lot of issues. For example, none of these work for me (OSX 64-bit). ---- int4 a = 2; // backend/cod2.c 2630 ---- int4 a = void; int4 b = void; a = b; // segfault ---- int4 a = void; a = simd(XMM.PXOR, a, a); // segfault ---- I could go on and on really. Very little seems to work at my end. Actually, looking at the auto-tester, I'm not alone. Just seems to be OSX though. http://d.puremagic.com/test-results/index.ghtml
Jan 15 2012
On 1/15/2012 3:49 AM, Peter Alexander wrote:Actually, looking at the auto-tester, I'm not alone. Just seems to be OSX though.Yeah, it's just OSX. I had the test for that platform inadvertently disabled, gak.
Jan 15 2012
On 15 January 2012 06:56, Walter Bright <newshound2 digitalmars.com> wrote:I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.I get 20+ speedup without optimisations with GDC on that small test. :) -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 15 2012
On 15 January 2012 16:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:On 15 January 2012 06:56, Walter Bright <newshound2 digitalmars.com> wrote:Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my... -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.I get 20+ speedup without optimisations with GDC on that small test. :)
Jan 15 2012
On 1/15/2012 10:10 AM, Iain Buclaw wrote:Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my...Woo-hoo!
Jan 15 2012
Iain Buclaw:Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my...Please, show me the assembly code produced, with its relative D source :-) Bye, bearophile
Jan 15 2012
On 15 January 2012 19:01, bearophile <bearophileHUGS lycos.com> wrote:Iain Buclaw:Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my...Please, show me the assembly code produced, with its relative D source :-)Bye, bearophileD code: ---- import core.simd; void test2a(float4 a) { } float4 test2() { float4 a = 1.2; a = a * 3 + 7; test2a(a); return a; } ---- Relevant assembly: ---- .LC5: .long 1067030938 .long 1067030938 .long 1067030938 .long 1067030938 .section .rodata.cst4,"aM", progbits,4 .align 4 _D4test5test2FZNhG4f: .cfi_startproc movl $3, %eax cvtsi2ss %eax, %xmm0 movb $7, %al cvtsi2ss %eax, %xmm1 unpcklps %xmm0, %xmm0 unpcklps %xmm1, %xmm1 movlhps %xmm0, %xmm0 movlhps %xmm1, %xmm1 mulps .LC5(%rip), %xmm0 addps %xmm1, %xmm0 ret .cfi_endproc ---- As someone pointed out to me, the only optimisation missing was constant propagation, but that doesn't matter too much for now. Regards -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 15 2012
On 15 January 2012 19:01, bearophile <bearophileHUGS lycos.com> wrote:Iain Buclaw:Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my...Please, show me the assembly code produced, with its relative D source :-)Bye, bearophileFor those who can't read AT&T: ---- .LC5: .long 1067030938 .long 1067030938 .long 1067030938 .long 1067030938 .align 16 _D4test5test2FZNhG4f: .cfi_startproc mov eax, 3 cvtsi2ss xmm0, eax mov al, 7 cvtsi2ss xmm1, eax unpcklps xmm0, xmm0 unpcklps xmm1, xmm1 movlhps xmm0, xmm0 movlhps xmm1, xmm1 mulps xmm0, XMMWORD PTR .LC5[rip] addps xmm0, xmm1 ret .cfi_endproc ---- -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 15 2012
On 15 January 2012 20:10, Iain Buclaw <ibuclaw ubuntu.com> wrote:On 15 January 2012 16:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:On 15 January 2012 06:56, Walter Bright <newshound2 digitalmars.com> wrote:I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.I get 20+ speedup without optimisations with GDC on that small test. :)Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my...Oh my indeed. Haha, well I'm sure that's a fairly artificial result, but yes, this is why I've been harping on for months that it's a bare necessity to provide language support :P
Jan 15 2012
I just built 32 & 64 bit DMD (latest commit on git tree is f800f6e342e2d9ab1ec9a6275b8239463aa1cee8) Using the 32-bit version, I got this error: Internal error: backend/cg87.c 1702 The 64-bit version went fine. Previously, both 32 and 64 bit version had no problem. On 01/15/2012 01:56 PM, Walter Bright wrote:I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with. ----------------------- import core.simd; void test1a(float[4] a) { } void test1() { float[4] a = 1.2; a[] = a[] * 3 + 7; test1a(a); } void test2a(float4 a) { } void test2() { float4 a = 1.2; a = a * 3 + 7; test2a(a); } import std.stdio; import std.datetime; int main() { test1(); test2(); auto b = comparingBenchmark!(test1, test2, 100); writeln(b.point); return 0; }
Jan 16 2012
On 1/16/2012 12:59 AM, Andre Tampubolon wrote:I just built 32& 64 bit DMD (latest commit on git tree is f800f6e342e2d9ab1ec9a6275b8239463aa1cee8) Using the 32-bit version, I got this error: Internal error: backend/cg87.c 1702 The 64-bit version went fine. Previously, both 32 and 64 bit version had no problem.Which machine?
Jan 16 2012
Well I only have 1 machine, a laptop running 64 bit Arch Linux. Yesterday I did a git pull, built both 32 & 64 bit DMD, and this code compiled fine using those. But now, the 32 bit version fails. Walter Bright <newshound2 digitalmars.com> wrote:On 1/16/2012 12:59 AM, Andre Tampubolon wrote:I just built 32& 64 bit DMD (latest commit on git tree is f800f6e342e2d9ab1ec9a6275b8239463aa1cee8) Using the 32-bit version, I got this error: Internal error: backend/cg87.c 1702 The 64-bit version went fine. Previously, both 32 and 64 bit version had no problem.Which machine?
Jan 16 2012
32 bit SIMD for Linux is not implemented. It's all 64 bit platforms, and 32 bit OS X. On 1/16/2012 2:35 AM, Andre Tampubolon wrote:Well I only have 1 machine, a laptop running 64 bit Arch Linux. Yesterday I did a git pull, built both 32& 64 bit DMD, and this code compiled fine using those. But now, the 32 bit version fails. Walter Bright<newshound2 digitalmars.com> wrote:On 1/16/2012 12:59 AM, Andre Tampubolon wrote:I just built 32& 64 bit DMD (latest commit on git tree is f800f6e342e2d9ab1ec9a6275b8239463aa1cee8) Using the 32-bit version, I got this error: Internal error: backend/cg87.c 1702 The 64-bit version went fine. Previously, both 32 and 64 bit version had no problem.Which machine?
Jan 16 2012
On 1/15/12 12:56 AM, Walter Bright wrote:I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with. ----------------------- import core.simd; void test1a(float[4] a) { } void test1() { float[4] a = 1.2; a[] = a[] * 3 + 7; test1a(a); } void test2a(float4 a) { } void test2() { float4 a = 1.2; a = a * 3 + 7; test2a(a); }These two functions should have the same speed. The function that ought to be slower is: void test1() { float[5] a = 1.2; float[] b = a[1 .. $]; b[] = b[] * 3 + 7; test1a(a); } Andrei
Jan 16 2012
On 16 January 2012 18:17, Andrei Alexandrescu <SeeWebsiteForEmail erdani.orgwrote:On 1/15/12 12:56 AM, Walter Bright wrote:A function using float arrays and a function using hardware vectors should certainly not be the same speed.I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with. ----------------------- import core.simd; void test1a(float[4] a) { } void test1() { float[4] a = 1.2; a[] = a[] * 3 + 7; test1a(a); } void test2a(float4 a) { } void test2() { float4 a = 1.2; a = a * 3 + 7; test2a(a); }These two functions should have the same speed.
Jan 16 2012
On 1/16/12 10:46 AM, Manu wrote:A function using float arrays and a function using hardware vectors should certainly not be the same speed.My point was that the version using float arrays should opportunistically use hardware ops whenever possible. Andrei
Jan 16 2012
On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 1/16/12 10:46 AM, Manu wrote:I think this is a mistake, because such a piece of code never exists outside of some context. If the context it exists within is all FPU code (and it is, it's a float array), then swapping between FPU and SIMD execution units will probably result in the function being slower than the original (also the float array is unaligned). The SIMD version however must exist within a SIMD context, since the API can't implicitly interact with floats, this guarantees that the context of each function matches that within which it lives. This is fundamental to fast vector performance. Using SIMD is an all or nothing decision, you can't just mix it in here and there. You don't go casting back and forth between floats and ints on every other line... obviously it's imprecise, but it's also a major performance hazard. There is no difference here, except the performance hazard is much worse.A function using float arrays and a function using hardware vectors should certainly not be the same speed.My point was that the version using float arrays should opportunistically use hardware ops whenever possible.
Jan 16 2012
On 01/16/2012 05:59 PM, Manu wrote:On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>> wrote: On 1/16/12 10:46 AM, Manu wrote: A function using float arrays and a function using hardware vectors should certainly not be the same speed. My point was that the version using float arrays should opportunistically use hardware ops whenever possible. I think this is a mistake, because such a piece of code never exists outside of some context. If the context it exists within is all FPU code (and it is, it's a float array), then swapping between FPU and SIMD execution units will probably result in the function being slower than the original (also the float array is unaligned). The SIMD version however must exist within a SIMD context, since the API can't implicitly interact with floats, this guarantees that the context of each function matches that within which it lives. This is fundamental to fast vector performance. Using SIMD is an all or nothing decision, you can't just mix it in here and there. You don't go casting back and fourth between floats and ints on every other line... obviously it's imprecise, but it's also a major performance hazard. There is no difference here, except the performance hazard is much worse.I think DMD now uses XMM registers for scalar floating point arithmetic on x86_64.
Jan 16 2012
On 16 January 2012 19:01, Timon Gehr <timon.gehr gmx.ch> wrote:On 01/16/2012 05:59 PM, Manu wrote:On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote: On 1/16/12 10:46 AM, Manu wrote: A function using float arrays and a function using hardware vectors should certainly not be the same speed. My point was that the version using float arrays should opportunistically use hardware ops whenever possible. I think this is a mistake, because such a piece of code never exists outside of some context. If the context it exists within is all FPU code (and it is, it's a float array), then swapping between FPU and SIMD execution units will probably result in the function being slower than the original (also the float array is unaligned). The SIMD version however must exist within a SIMD context, since the API can't implicitly interact with floats, this guarantees that the context of each function matches that within which it lives. This is fundamental to fast vector performance. Using SIMD is an all or nothing decision, you can't just mix it in here and there. You don't go casting back and fourth between floats and ints on every other line... obviously it's imprecise, but it's also a major performance hazard. There is no difference here, except the performance hazard is much worse.I think DMD now uses XMM registers for scalar floating point arithmetic on x86_64.x64 can do the swapping too with no penalty, but that is the only architecture that can. So it might be a viable x64 optimisation, but only for x64 codegen, which means any tech to detect and apply the optimisation should live in the back end, not in the front end as a higher level semantic.
Jan 16 2012
On 1/16/2012 9:21 AM, Manu wrote:x64 can do the swapping too with no penalty, but that is the only architecture that can.Ah, that is a crucial bit of information.
Jan 16 2012
On 2012-01-16 16:59:44 +0000, Manu <turkeyman gmail.com> said:On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail erdani.orgAndrei's idea could be valid as an optimization when the compiler can see that all the operations can be performed with SIMD ops. In this particular case: if test1a(a) is inlined. But it can't work if the float[4] value crosses a function's boundary. Or instead the optimization could be performed at the semantic level, like this: try to change the type of a variable float[4] to a float4, and if it can compile, use it instead. So if you have the same function working with a float[4] and a float4, and if all the functions you call on a given variable supports float4, it'll go for float4. But doing that at the semantic level would be rather messy, not counting the combinatorial explosion when multiple variables are at play. -- Michel Fortin michel.fortin michelf.com http://michelf.com/wrote:On 1/16/12 10:46 AM, Manu wrote:I think this is a mistake, because such a piece of code never exists outside of some context. If the context it exists within is all FPU code (and it is, it's a float array), then swapping between FPU and SIMD execution units will probably result in the function being slower than the original (also the float array is unaligned). The SIMD version however must exist within a SIMD context, since the API can't implicitly interact with floats, this guarantees that the context of each function matches that within which it lives. This is fundamental to fast vector performance. Using SIMD is an all or nothing decision, you can't just mix it in here and there. You don't go casting back and fourth between floats and ints on every other line... obviously it's imprecise, but it's also a major performance hazard. 
There is no difference here, except the performance hazard is much worse.A function using float arrays and a function using hardware vectors should certainly not be the same speed.My point was that the version using float arrays should opportunistically use hardware ops whenever possible.
Jan 16 2012
On 1/16/12 11:32 AM, Michel Fortin wrote:Andrei's idea could be valid as an optimization when the compiler can see that all the operations can be performed with SIMD ops. In this particular case: if test1a(a) is inlined. But it can't work if the float[4] value crosses a function's boundary.In this case it's the exact contrary: the float[4] and the operation are both local to the function. So it all depends on the inlining of the dummy functions that follows. No? Andrei
Jan 16 2012
On 2012-01-16 17:57:14 +0000, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:On 1/16/12 11:32 AM, Michel Fortin wrote:That's exactly what I meant, if everything is local to the function you might be able to optimize. In this particular case, if test1a(a) is inlined, everything is local. But the current example has too much isolation for it to be meaningful. If you returned the result as a float[4] the the optimization doesn't work. If you took an argument as a float[4] it probably wouldn't work either (depending on what you do with the argument). So I don't think its an optimization you should count on very much. In fact, the optimization I'd expect the compiler to do in this case is just wipe out all the code, as it does nothing other than putting a value in a local variable which is never reused later. -- Michel Fortin michel.fortin michelf.com http://michelf.com/Andrei's idea could be valid as an optimization when the compiler can see that all the operations can be performed with SIMD ops. In this particular case: if test1a(a) is inlined. But it can't work if the float[4] value crosses a function's boundary.In this case it's the exact contrary: the float[4] and the operation are both local to the function. So it all depends on the inlining of the dummy functions that follows. No?
Jan 16 2012
On 16 January 2012 21:27, Michel Fortin <michel.fortin michelf.com> wrote:In fact, the optimization I'd expect the compiler to do in this case is just wipe out all the code, as it does nothing other than putting a value in a local variable which is never reused later.Yes, my first thought when I saw this test was "why is it generating any code at all?".. But I tried to forget about that :) I am curious though, what is causing that code (on both sides) to not be eliminated? If I write that in C, I'm sure it would generate nothing. Is this a language implementation bug somehow?
Jan 16 2012
On 1/16/2012 12:22 PM, Manu wrote:Yes, my first thought when I saw this test was "why is it generating any code at all?".. But I tried to forget about that :) I am curious though, what is causing that code (on both sides) to not be eliminated? If I write that in C, I'm sure it would generate nothing. Is this a language implementation bug somehow?Compile with inlining off, and the compiler 'forgets' what the called function does, so it must call it.
Jan 16 2012
On 1/16/2012 8:59 AM, Manu wrote:(also the float array is unaligned).Currently, it is 4 byte aligned. But the compiler could align freestanding static arrays on 16 bytes without breaking anything. It just cannot align: struct S { int a; float[4] b; } b on a 16 byte boundary, as that would break the ABI. Even worse, struct S { int a; char[16] s; } can't be aligned on 16 bytes as that is a common "small string optimization".
Jan 16 2012
On 1/16/2012 8:48 AM, Andrei Alexandrescu wrote:My point was that the version using float arrays should opportunistically use hardware ops whenever possible.Yes, you're right. The compiler can opportunistically convert a number of vector operations on static arrays to the SIMD instructions. Now that the basics are there, there are many, many opportunities to improve the code generation. Even for things like: int i,j; i *= 3; foo(); j *= 3; the two multiplies can be combined. Also, if operations on a particular integer variable are a subset that is supported by SIMD, that variable could be enregistered in an XMM register, instead of a GP register. But don't worry, I'm not planning on working on that at the moment :-)
Jan 16 2012
On 16 January 2012 18:59, Walter Bright <newshound2 digitalmars.com> wrote:On 1/16/2012 8:48 AM, Andrei Alexandrescu wrote:My point was that the version using float arrays should opportunistically use hardware ops whenever possible.Yes, you're right. The compiler can opportunistically convert a number of vector operations on static arrays to the SIMD instructions. Now that the basics are there, there are many, many opportunities to improve the code generation. Even for things like: int i,j; i *= 3; foo(); j *= 3; the two multiplies can be combined. Also, if operations on a particular integer variable are a subset that is supported by SIMD, that variable could be enregistered in an XMM register, instead of a GP register. But don't worry, I'm not planning on working on that at the moment :-)Leave that sort of optimisation for the backend to handle please. ;-) -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 16 2012
On 1/16/2012 11:16 AM, Iain Buclaw wrote:Of course. I suspect Intel's compiler does that one, does gcc?But don't worry, I'm not planning on working on that at the moment :-)Leave that sort of optimisation for the backend to handle please. ;-)
Jan 16 2012
On 16 January 2012 19:25, Walter Bright <newshound2 digitalmars.com> wrote:On 1/16/2012 11:16 AM, Iain Buclaw wrote:There's auto-vectorisation for for(), foreach(), and foreach_reverse() loops that I have written support for. I am not aware of GCC vectorising anything else. example: int a[256], b[256], c[256]; void foo () { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';Of course. I suspect Intel's compiler does that one, does gcc?But don't worry, I'm not planning on working on that at the moment :-)Leave that sort of optimisation for the backend to handle please. ;-)
Jan 16 2012
On 16/01/12 8:56 PM, Iain Buclaw wrote:On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com> wrote:Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing.On 1/16/2012 11:16 AM, Iain Buclaw wrote:There's auto-vectorisation for for(), foreach(), and foreach_reverse() loops that I have written support for. I am not aware of GCC vectorising anything else. example: int a[256], b[256], c[256]; void foo () { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; }Of course. I suspect Intel's compiler does that one, does gcc?But don't worry, I'm not planning on working on that at the moment :-)Leave that sort of optimisation for the backend to handle please. ;-)
Jan 16 2012
On 16 January 2012 23:57, Peter Alexander <peter.alexander.au gmail.com>wrote:On 16/01/12 8:56 PM, Iain Buclaw wrote:This is why D needs a __restrict attribute! ;)On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.**com<newshound2 digitalmars.com>> wrote:Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing.On 1/16/2012 11:16 AM, Iain Buclaw wrote:There's auto-vectorisation for for(), foreach(), and foreach_reverse() loops that I have written support for. I am not aware of GCC vectorising anything else. example: int a[256], b[256], c[256]; void foo () { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; }Of course. I suspect Intel's compiler does that one, does gcc?But don't worry, I'm not planning on working on that at the moment :-)Leave that sort of optimisation for the backend to handle please. ;-)
Jan 16 2012
On 1/16/2012 1:54 PM, Manu wrote:Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing. This is why D needs a __restrict attribute! ;)That's why D has: a[] = b[] + c[]; because the language requires the arrays to be distinct.
Jan 16 2012
On 17 January 2012 00:03, Walter Bright <newshound2 digitalmars.com> wrote:On 1/16/2012 1:54 PM, Manu wrote:Surely it would be possible for them to be overlapping slices?Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing. This is why D needs a __restrict attribute! ;)That's why D has: a[] = b[] + c[]; because the language requires the arrays to be distinct.
Jan 16 2012
On Mon, 16 Jan 2012 23:06:12 +0100, Manu <turkeyman gmail.com> wrote:On 17 January 2012 00:03, Walter Bright <newshound2 digitalmars.com> wrote:If they are, that's your fault and your problem. "The lvalue slice and any rvalue slices must not overlap."On 1/16/2012 1:54 PM, Manu wrote:Surely it would be possible for them to be overlapping slices?Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing. This is why D needs a __restrict attribute! ;)That's why D has: a[] = b[] + c[]; because the language requires the arrays to be distinct.
Jan 16 2012
On Mon, 16 Jan 2012 23:22:21 +0100, Simen Kjærås <simen.kjaras gmail.com> wrote:On Mon, 16 Jan 2012 23:06:12 +0100, Manu <turkeyman gmail.com> wrote:On 17 January 2012 00:03, Walter Bright <newshound2 digitalmars.com> wrote:On 1/16/2012 1:54 PM, Manu wrote:Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing. This is why D needs a __restrict attribute! ;)That's why D has: a[] = b[] + c[]; because the language requires the arrays to be distinct.Surely it would be possible for them to be overlapping slices?If they are, that's your fault and your problem. "The lvalue slice and any rvalue slices must not overlap."Sorry, forgot the link: http://www.d-programming-language.org/arrays.html#array-operations
Jan 16 2012
On 1/16/2012 2:06 PM, Manu wrote:Surely it would be possible for them to be overlapping slices?Not allowed, treated like array bounds checking.
Jan 16 2012
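A minimal sketch of the rule under discussion (slice names are made up for illustration; per the array-operations spec, the lvalue slice and every rvalue slice of a vector op must be distinct):

```d
void main()
{
    int[12] buf;
    int[] a = buf[0 .. 4];
    int[] b = buf[4 .. 8];
    int[] c = buf[8 .. 12];

    b[] = 1;
    c[] = 2;
    a[] = b[] + c[];     // fine: all three slices are distinct
    assert(a[0] == 3);

    // int[] d = buf[2 .. 6];
    // d[] = a[] + b[];  // illegal per the spec: d overlaps a and b,
    //                   // so the result is undefined (checked builds
    //                   // are free to throw, like a bounds check)
}
```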
On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com> wrote:On 16/01/12 8:56 PM, Iain Buclaw wrote:On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com> wrote:On 1/16/2012 11:16 AM, Iain Buclaw wrote:But don't worry, I'm not planning on working on that at the moment :-)Compile with -fstrict-aliasing then? I could certainly play about with having this enabled by default, but I foresee there may be issues (maybe have it on for safe code?) Regards -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing.There's auto-vectorisation for for(), foreach(), and foreach_reverse() loops that I have written support for. I am not aware of GCC vectorising anything else. example: int a[256], b[256], c[256]; void foo () { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; }Leave that sort of optimisation for the backend to handle please. ;-)Of course. I suspect Intel's compiler does that one, does gcc?
Jan 16 2012
On 16/01/12 10:36 PM, Iain Buclaw wrote:On 16 January 2012 21:57, Peter Alexander<peter.alexander.au gmail.com> wrote:This has nothing to do with strict aliasing. a[257]; foo(a[1..257], a[0..256], a[0..256]); This doesn't break any strict aliasing rule, but the loop still cannot be (trivially) vectorized. As Manu said, you need something like __restrict (or a linear type system) to solve this problem. http://en.wikipedia.org/wiki/Linear_type_system http://en.wikipedia.org/wiki/Uniqueness_typingOn 16/01/12 8:56 PM, Iain Buclaw wrote:Compile with -fstrict-aliasing then?On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com> wrote:Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing.On 1/16/2012 11:16 AM, Iain Buclaw wrote:There's auto-vectorisation for for(), foreach(), and foreach_reverse() loops that I have written support for. I am not aware of GCC vectorising anything else. example: int a[256], b[256], c[256]; void foo () { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; }Of course. I suspect Intel's compiler does that one, does gcc?But don't worry, I'm not planning on working on that at the moment :-)Leave that sort of optimisation for the backend to handle please. ;-)
Jan 17 2012
On 1/17/2012 1:20 PM, Peter Alexander wrote:As Manu said, you need something like __restrict (or a linear type system) to solve this problem.No, you don't. It can be done with a runtime check, like array bounds checking is done.
Jan 17 2012
On 17/01/12 9:24 PM, Walter Bright wrote:On 1/17/2012 1:20 PM, Peter Alexander wrote:So you'd change it to this, even in release builds? void foo(int[] a, int[] b, int[] c) { if ( /* arrays overlap */ ) { foreach(i; 0..256) a[i] = b[i] + c[i]; } else { /* vectorized code */ } } i.e. duplicate all loops that can be potentially vectorized depending on aliasing? Please bear in mind that this is a simple example. Seems a bit inefficient (code size).As Manu said, you need something like __restrict (or a linear type system) to solve this problem.No, you don't. It can be done with a runtime check, like array bounds checking is done.
Jan 17 2012
On 1/17/2012 1:47 PM, Peter Alexander wrote:On 17/01/12 9:24 PM, Walter Bright wrote:No. Like array bounds, if they overlap, an exception is thrown. Remember, the D spec says that overlapping arrays are illegal.On 1/17/2012 1:20 PM, Peter Alexander wrote:So you'd change it to this, even in release builds?As Manu said, you need something like __restrict (or a linear type system) to solve this problem.No, you don't. It can be done with a runtime check, like array bounds checking is done.
Jan 17 2012
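A rough sketch of the kind of runtime check Walter describes (a hypothetical helper, not what druntime actually does):

```d
// True when two slices share any memory; a vector-op entry point
// could run this before the loop and throw on overlap, much like
// an array bounds check.
bool overlaps(T)(const(T)[] a, const(T)[] b)
{
    const(T)* aEnd = a.ptr + a.length;
    const(T)* bEnd = b.ptr + b.length;
    return a.ptr < bEnd && b.ptr < aEnd;
}

unittest
{
    int[10] buf;
    assert(!overlaps(buf[0 .. 5], buf[5 .. 10])); // adjacent, no overlap
    assert(overlaps(buf[0 .. 6], buf[5 .. 10]));  // both contain buf[5]
}
```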
On 17/01/12 10:55 PM, Walter Bright wrote:On 1/17/2012 1:47 PM, Peter Alexander wrote:The D spec says that overlapping arrays are illegal for vector ops. The foo(int[], int[], int[]) function does not use vector ops. Or am I missing something really major? For example, is this legal code? int[100] a; int[] b = a[0..100]; int[] c = a[10..90]; // Illegal? b and c overlap... foreach (i; 0..80) c[i] = b[i]; // Illegal? I know that b[] = c[] would be illegal, but that has nothing to do with the prior discussion.On 17/01/12 9:24 PM, Walter Bright wrote:No. Like array bounds, if they overlap, an exception is thrown. Remember, the D spec says that overlapping arrays are illegal.On 1/17/2012 1:20 PM, Peter Alexander wrote:So you'd change it to this, even in release builds?As Manu said, you need something like __restrict (or a linear type system) to solve this problem.No, you don't. It can be done with a runtime check, like array bounds checking is done.
Jan 17 2012
On 1/17/2012 3:23 PM, Peter Alexander wrote:On 17/01/12 10:55 PM, Walter Bright wrote:No, not illegal.On 1/17/2012 1:47 PM, Peter Alexander wrote:The D spec says that overlapping arrays are illegal for vector ops. The foo(int[], int[], int[]) function does not use vector ops. Or am I missing something really major? For example, is this legal code? int[100] a; int[] b = a[0..100]; int[] c = a[10..90]; // Illegal? b and c overlap...On 17/01/12 9:24 PM, Walter Bright wrote:No. Like array bounds, if they overlap, an exception is thrown. Remember, the D spec says that overlapping arrays are illegal.On 1/17/2012 1:20 PM, Peter Alexander wrote:So you'd change it to this, even in release builds?As Manu said, you need something like __restrict (or a linear type system) to solve this problem.No, you don't. It can be done with a runtime check, like array bounds checking is done.foreach (i; 0..80) c[i] = b[i]; // Illegal?No, not illegal.I know that b[] = c[] would be illegal, but that has nothing to do with the prior discussion.Yes, b[]=c[] is illegal.
Jan 17 2012
On 17/01/12 11:34 PM, Walter Bright wrote:On 1/17/2012 3:23 PM, Peter Alexander wrote:So, my original point still stands, you can't vectorise this function: void foo(int[] a, int[] b, int[] c) { foreach (i; 0..256) a[i] = b[i] + c[i]; } Those slices are allowed to overlap, so this cannot be automatically vectorised (without inlining to get better context about those arrays). Without inlining, you need something along the lines of __restrict or uniqueness typing.On 17/01/12 10:55 PM, Walter Bright wrote:No, not illegal.On 1/17/2012 1:47 PM, Peter Alexander wrote:The D spec says that overlapping arrays are illegal for vector ops. The foo(int[], int[], int[]) function does not use vector ops. Or am I missing something really major? For example, is this legal code? int[100] a; int[] b = a[0..100]; int[] c = a[10..90]; // Illegal? b and c overlap...On 17/01/12 9:24 PM, Walter Bright wrote:No. Like array bounds, if they overlap, an exception is thrown. Remember, the D spec says that overlapping arrays are illegal.On 1/17/2012 1:20 PM, Peter Alexander wrote:So you'd change it to this, even in release builds?As Manu said, you need something like __restrict (or a linear type system) to solve this problem.No, you don't. It can be done with a runtime check, like array bounds checking is done.foreach (i; 0..80) c[i] = b[i]; // Illegal?No, not illegal.I know that b[] = c[] would be illegal, but that has nothing to do with the prior discussion.Yes, b[]=c[] is illegal.
Jan 17 2012
On 1/17/2012 4:19 PM, Peter Alexander wrote:So, my original point still stands, you can't vectorise this function: void foo(int[] a, int[] b, int[] c) { foreach (i; 0..256) a[i] = b[i] + c[i]; } Those slices are allowed to overlap, so this cannot be automatically vectorised (without inlining to get better context about those arrays). Without inlining, you need something along the lines of __restrict or uniqueness typing.No, you can rewrite it as: a[] = b[] + c[]; and you don't need __restrict or uniqueness. That's what the vector operations are for.
Jan 17 2012
On 01/18/2012 02:04 AM, Walter Bright wrote:On 1/17/2012 4:19 PM, Peter Alexander wrote:Are they really a general solution? How do you use vector ops to implement an efficient matrix multiply, for instance?So, my original point still stands, you can't vectorise this function: void foo(int[] a, int[] b, int[] c) { foreach (i; 0..256) a[i] = b[i] + c[i]; } Those slices are allowed to overlap, so this cannot be automatically vectorised (without inlining to get better context about those arrays). Without inlining, you need something along the lines of __restrict or uniqueness typing.No, you can rewrite it as: a[] = b[] + c[]; and you don't need __restrict or uniqueness. That's what the vector operations are for.
Jan 17 2012
Timon Gehr wrote:Are they really a general solution? How do you use vector ops to implement an efficient matrix multiply, for instance?struct Matrix4 { float4 x, y, z, w; auto transform(Matrix4 mat) { Matrix4 rmat; float4 cx = {mat.x.x, mat.y.x, mat.z.x, mat.w.x}; float4 cy = {mat.x.y, mat.y.y, mat.z.y, mat.w.y}; float4 cz = {mat.x.z, mat.y.z, mat.z.z, mat.w.z}; float4 cw = {mat.x.w, mat.y.w, mat.z.w, mat.w.w}; float4 rx = {mat.x.x, mat.x.y, mat.x.z, mat.x.w}; float4 ry = {mat.y.x, mat.y.y, mat.y.z, mat.y.w}; float4 rz = {mat.z.x, mat.z.y, mat.z.z, mat.z.w}; float4 rw = {mat.w.x, mat.w.y, mat.w.z, mat.w.w}; rmat.x = cx * rx; // simd rmat.y = cy * ry; // simd rmat.z = cz * rz; // simd rmat.w = cw * rw; // simd return rmat; } }
Jan 17 2012
On 01/18/2012 02:32 AM, F i L wrote:Timon Gehr wrote:The parameter is just squared and returned? Anyway, I was after a general matrix*matrix multiplication, where the operands can get arbitrarily large and where any potential use of __restrict is rendered unnecessary by array vector ops.Are they really a general solution? How do you use vector ops to implement an efficient matrix multiply, for instance?struct Matrix4 { float4 x, y, z, w; auto transform(Matrix4 mat) { Matrix4 rmat; float4 cx = {mat.x.x, mat.y.x, mat.z.x, mat.w.x}; float4 cy = {mat.x.y, mat.y.y, mat.z.y, mat.w.y}; float4 cz = {mat.x.z, mat.y.z, mat.z.z, mat.w.z}; float4 cw = {mat.x.w, mat.y.w, mat.z.w, mat.w.w}; float4 rx = {mat.x.x, mat.x.y, mat.x.z, mat.x.w}; float4 ry = {mat.y.x, mat.y.y, mat.y.z, mat.y.w}; float4 rz = {mat.z.x, mat.z.y, mat.z.z, mat.z.w}; float4 rw = {mat.w.x, mat.w.y, mat.w.z, mat.w.w}; rmat.x = cx * rx; // simd rmat.y = cy * ry; // simd rmat.z = cz * rz; // simd rmat.w = cw * rw; // simd return rmat; } }
Jan 17 2012
On Wednesday, 18 January 2012 at 01:50:00 UTC, Timon Gehr wrote:Anyway, I was after a general matrix*matrix multiplication, where the operands can get arbitrarily large and where any potential use of __restrict is rendered unnecessary by array vector ops.Here you go. But I agree there are use cases for restrict where vector operations don't help void matmul(A,B,C)(A a, B b, C c, size_t n, size_t m, size_t l) { for(size_t i = 0; i < n; i++) { c[i*l..i*l + l] = 0; for(size_t j = 0; j < m; j++) c[i*l..i*l + l] += a[i*m + j] * b[j*l..j*l + l]; } }
Jan 17 2012
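For what it's worth, a small self-contained check of the matmul sketch quoted above (row-major storage assumed; the test values are made up):

```d
// General n*m by m*l multiply using array vector ops on row slices,
// as in the message above: each output row is a sum of scaled rows of b.
void matmul(A, B, C)(A a, B b, C c, size_t n, size_t m, size_t l)
{
    for (size_t i = 0; i < n; i++)
    {
        c[i*l .. i*l + l] = 0;
        for (size_t j = 0; j < m; j++)
            c[i*l .. i*l + l] += a[i*m + j] * b[j*l .. j*l + l];
    }
}

unittest
{
    // 2x2 * 2x2, row-major: [1 2; 3 4] * [5 6; 7 8] = [19 22; 43 50]
    double[4] a = [1, 2, 3, 4];
    double[4] b = [5, 6, 7, 8];
    double[4] c;
    matmul(a[], b[], c[], 2, 2, 2);
    assert(c == [19.0, 22.0, 43.0, 50.0]);
}
```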
Timon Gehr wrote:The parameter is just squared and returned?No, sorry that code is all screwed up and missing a step. My Matrix multiply code looks like this: auto transform(U)(Matrix4!U m) if (isImplicitlyConvertible!(U, T)) { return Matrix4 ( Vector4 ( (m.x.x*x.x) + (m.x.y*y.x) + (m.x.z*z.x) + (m.x.w*w.x), (m.x.x*x.y) + (m.x.y*y.y) + (m.x.z*z.y) + (m.x.w*w.y), (m.x.x*x.z) + (m.x.y*y.z) + (m.x.z*z.z) + (m.x.w*w.z), (m.x.x*x.w) + (m.x.y*y.w) + (m.x.z*z.w) + (m.x.w*w.w) ), Vector4 ( (m.y.x*x.x) + (m.y.y*y.x) + (m.y.z*z.x) + (m.y.w*w.x), (m.y.x*x.y) + (m.y.y*y.y) + (m.y.z*z.y) + (m.y.w*w.y), (m.y.x*x.z) + (m.y.y*y.z) + (m.y.z*z.z) + (m.y.w*w.z), (m.y.x*x.w) + (m.y.y*y.w) + (m.y.z*z.w) + (m.y.w*w.w) ), Vector4 ( (m.z.x*x.x) + (m.z.y*y.x) + (m.z.z*z.x) + (m.z.w*w.x), (m.z.x*x.y) + (m.z.y*y.y) + (m.z.z*z.y) + (m.z.w*w.y), (m.z.x*x.z) + (m.z.y*y.z) + (m.z.z*z.z) + (m.z.w*w.z), (m.z.x*x.w) + (m.z.y*y.w) + (m.z.z*z.w) + (m.z.w*w.w) ), Vector4 ( (m.w.x*x.x) + (m.w.y*y.x) + (m.w.z*z.x) + (m.w.w*w.x), (m.w.x*x.y) + (m.w.y*y.y) + (m.w.z*z.y) + (m.w.w*w.y), (m.w.x*x.z) + (m.w.y*y.z) + (m.w.z*z.z) + (m.w.w*w.z), (m.w.x*x.w) + (m.w.y*y.w) + (m.w.z*z.w) + (m.w.w*w.w) ) ); } It would need to be converted to something more like my previous example in order for SIMD to kick in. IDK if D's compiler is good enough to optimize the above code into SIMD ops, but I doubt it.Anyway, I was after a general matrix*matrix multiplication, where the operands can get arbitrarily large and where any potential use of __restrict is rendered unnecessary by array vector ops.I don't know enough about simd to confidently discuss this, but I'd imagine there'd have to be quite a lot of compiler magic happening before arbitrarily sized matrix constructs could make use of simd.
Jan 17 2012
On 16 January 2012 22:36, Iain Buclaw <ibuclaw ubuntu.com> wrote:On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com> wrote:On 16/01/12 8:56 PM, Iain Buclaw wrote:On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com> wrote:On 1/16/2012 11:16 AM, Iain Buclaw wrote:But don't worry, I'm not planning on working on that at the moment :-)OK, have turned on strict aliasing by default for D2. You should now be able to vectorise loops that use locals and parameters. :-) -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';Compile with -fstrict-aliasing then? I could certainly play about with having this enabled by default, but I foresee there may be issues (maybe have it on for safe code?) Regards -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing.There's auto-vectorisation for for(), foreach(), and foreach_reverse() loops that I have written support for. I am not aware of GCC vectorising anything else. example: int a[256], b[256], c[256]; void foo () { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; }Leave that sort of optimisation for the backend to handle please. ;-)Of course. I suspect Intel's compiler does that one, does gcc?
Jan 16 2012
On 17 January 2012 03:56, Iain Buclaw <ibuclaw ubuntu.com> wrote:On 16 January 2012 22:36, Iain Buclaw <ibuclaw ubuntu.com> wrote:What protects these ranges from being overlapping? What if they were sourced from pointers? Are we just to say in D that aliasing is not allowed, and 'you shouldn't do it'? People almost never alias intentionally, it's usually the most insidious of bugs. :/On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com> wrote:On 16/01/12 8:56 PM, Iain Buclaw wrote:On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com> wrote:On 1/16/2012 11:16 AM, Iain Buclaw wrote:But don't worry, I'm not planning on working on that at the moment :-)OK, have turned on strict aliasing by default for D2. You should now be able to vectorise loops that use locals and parameters. :-)Compile with -fstrict-aliasing then? I could certainly play about with having this enabled by default, but I foresee there may be issues (maybe have it on for safe code?)Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing.There's auto-vectorisation for for(), foreach(), and foreach_reverse() loops that I have written support for. I am not aware of GCC vectorising anything else. example: int a[256], b[256], c[256]; void foo () { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; }Leave that sort of optimisation for the backend to handle please. ;-)Of course. I suspect Intel's compiler does that one, does gcc?
Jan 17 2012
On 1/17/2012 12:17 AM, Manu wrote:What protects these ranges from being overlapping?A runtime check, like array bounds checking.
Jan 17 2012
On 17 January 2012 12:33, Walter Bright <newshound2 digitalmars.com> wrote:On 1/17/2012 12:17 AM, Manu wrote:Awesome. How does this map to pointer dereferencing?What protects these ranges from being overlapping?A runtime check, like array bounds checking.
Jan 17 2012
On 1/17/2012 2:43 AM, Manu wrote:On 17 January 2012 12:33, Walter Bright <newshound2 digitalmars.com <mailto:newshound2 digitalmars.com>> wrote: On 1/17/2012 12:17 AM, Manu wrote: What protects these ranges from being overlapping? A runtime check, like array bounds checking. Awesome. How does this map to pointer dereferencing?It can't. Use dynamic arrays - that's what they're for.
Jan 17 2012
On 17 January 2012 08:17, Manu <turkeyman gmail.com> wrote:On 17 January 2012 03:56, Iain Buclaw <ibuclaw ubuntu.com> wrote:On 16 January 2012 22:36, Iain Buclaw <ibuclaw ubuntu.com> wrote:On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com> wrote:On 16/01/12 8:56 PM, Iain Buclaw wrote:On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com> wrote:On 1/16/2012 11:16 AM, Iain Buclaw wrote:But don't worry, I'm not planning on working on that at the moment :-)There's auto-vectorisation for for(), foreach(), and foreach_reverse() loops that I have written support for. I am not aware of GCC vectorising anything else. example: int a[256], b[256], c[256]; void foo () { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; }Of course. I suspect Intel's compiler does that one, does gcc?Leave that sort of optimisation for the backend to handle please. ;-)Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing.Compile with -fstrict-aliasing then? I could certainly play about with having this enabled by default, but I foresee there may be issues (maybe have it on for safe code?)OK, have turned on strict aliasing by default for D2. You should now be able to vectorise loops that use locals and parameters. :-)What protects these ranges from being overlapping? What if they were sourced from pointers? Are we just to say in D that aliasing is not allowed, and 'you shouldn't do it'? People almost never alias intentionally, it's usually the most insidious of bugs. :/D arrays have a .length property that keeps track of the length of the array. When array bounds checking is turned on (default when not compiling with -release) an assert is produced when you step outside the bounds of the array. Is this what you mean? -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 17 2012
On Mon, 16 Jan 2012 20:25:28 +0100, Walter Bright <newshound2 digitalmars.com> wrote:On 1/16/2012 11:16 AM, Iain Buclaw wrote:Thought of that too, but it's rather tough to manage slots in vector registers. Could probably dust off Don's BLADE library. It seems that gcc and icc are limited to loop optimization. http://gcc.gnu.org/projects/tree-ssa/vectorization.html http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/Of course. I suspect Intel's compiler does that one, does gcc?But don't worry, I'm not planning on working on that at the moment :-)Leave that sort of optimisation for the backend to handle please. ;-)
Jan 16 2012
Walter:But don't worry, I'm not planning on working on that at the moment :-)Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows the arrays are very small it unrolls the operation in-place: int n = 5; auto a = new int[n]; auto b = new int[n]; a[] += b[]; ==> int n = 5; auto a = new int[n]; // a and b are dynamic arrays, auto b = new int[n]; // but their length is easy to find at compile-time a[0] += b[0]; a[1] += b[1]; a[2] += b[2]; a[3] += b[3]; a[4] += b[4]; Bye, bearophile
Jan 16 2012
On 17 January 2012 05:55, bearophile <bearophileHUGS lycos.com> wrote:Walter:If this doesn't already exist, I think it's quite important. I was thinking about needing to repeatedly specialise a template last night for a bunch of short lengths of arrays, for this exact reason. Unrolling short loops must be one of the most trivial and worthwhile optimisations...But don't worry, I'm not planning on working on that at the moment :-)Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows the arrays are very small it unrolls the operation in-place: int n = 5; auto a = new int[n]; auto b = new int[n]; a[] += b[]; ==> int n = 5; auto a = new int[n]; // a and b are dynamic arrays, auto b = new int[n]; // but their length is easy to find at compile-time a[0] += b[0]; a[1] += b[1]; a[2] += b[2]; a[3] += b[3]; a[4] += b[4];
Jan 17 2012
On Tue, 17 Jan 2012 09:20:43 +0100, Manu <turkeyman gmail.com> wrote:On 17 January 2012 05:55, bearophile <bearophileHUGS lycos.com> wrote:If the compiler knows it's a compile-time constant, you could use a static foreach.Walter:If this doesn't already exist, I think it's quite important. I was thinking about needing to repeatedly specialise a template last night for a bunch of short lengths of arrays, for this exact reason. Unrolling short loops must be one of the most trivial and worthwhile optimisations...But don't worry, I'm not planning on working on that at the moment :-)Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows the arrays are very small it unrolls the operation in-place: int n = 5; auto a = new int[n]; auto b = new int[n]; a[] += b[]; ==> int n = 5; auto a = new int[n]; // a and b are dynamic arrays, auto b = new int[n]; // but their length is easy to find at compile-time a[0] += b[0]; a[1] += b[1]; a[2] += b[2]; a[3] += b[3]; a[4] += b[4];
Jan 17 2012
On 17 January 2012 11:48, Martin Nowak <dawg dawgfoto.de> wrote:On Tue, 17 Jan 2012 09:20:43 +0100, Manu <turkeyman gmail.com> wrote: On 17 January 2012 05:55, bearophile <bearophileHUGS lycos.com> wrote:Great idea! :)Walter:If the compiler knows it's a compile-time constant, you could use a static foreach.If this doesn't already exist, I think it's quite important. I was thinking about needing to repeatedly specialise a template last night for a bunch of short lengths of arrays, for this exact reason. Unrolling short loops must be one of the most trivial and worthwhile optimisations...But don't worry, I'm not planning on working on that at the moment :-)Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows the arrays are very small it unrolls the operation in-place: int n = 5; auto a = new int[n]; auto b = new int[n]; a[] += b[]; ==> int n = 5; auto a = new int[n]; // a and b are dynamic arrays, auto b = new int[n]; // but their length is easy to find at compile-time a[0] += b[0]; a[1] += b[1]; a[2] += b[2]; a[3] += b[3]; a[4] += b[4];
Jan 17 2012
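A sketch of the suggestion using present-day D, where static foreach unrolls the loop at compile time (it needs the length as a compile-time constant, hence the static arrays; the function name is made up):

```d
// Unroll a small element-wise addition at compile time: the loop
// body is emitted N times, with no runtime loop counter.
void addUnrolled(size_t N)(ref int[N] a, const ref int[N] b)
{
    static foreach (i; 0 .. N)
        a[i] += b[i];
}

unittest
{
    int[5] a = [1, 2, 3, 4, 5];
    int[5] b = [10, 20, 30, 40, 50];
    addUnrolled(a, b);
    assert(a == [11, 22, 33, 44, 55]);
}
```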
On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 1/15/12 12:56 AM, Walter Bright wrote:Unfortunately druntime's array ops are a mess and fail to speed up anything below 16 floats. Additionally there is overhead for a function call and they have to check alignment at runtime. martinI get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with. ----------------------- import core.simd; void test1a(float[4] a) { } void test1() { float[4] a = 1.2; a[] = a[] * 3 + 7; test1a(a); } void test2a(float4 a) { } void test2() { float4 a = 1.2; a = a * 3 + 7; test2a(a); }These two functions should have the same speed. The function that ought to be slower is: void test1() { float[5] a = 1.2; float[] b = a[1 .. $]; b[] = b[] * 3 + 7; test1a(a); } Andrei
Jan 16 2012
On 16/01/12 17:51, Martin Nowak wrote:On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Yes. The structural problem in the compiler is that array ops get turned into function calls far too early. It happens in the semantic pass, but it shouldn't happen in the front-end at all -- it should be done in the glue layer, at the beginning of code generation. Incidentally, this is the reason that CTFE doesn't work with array ops.On 1/15/12 12:56 AM, Walter Bright wrote:Unfortunately druntime's array ops are a mess and fail to speed up anything below 16 floats. Additionally there is overhead for a function call and they have to check alignment at runtime. martinI get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with. ----------------------- import core.simd; void test1a(float[4] a) { } void test1() { float[4] a = 1.2; a[] = a[] * 3 + 7; test1a(a); } void test2a(float4 a) { } void test2() { float4 a = 1.2; a = a * 3 + 7; test2a(a); }These two functions should have the same speed. The function that ought to be slower is: void test1() { float[5] a = 1.2; float[] b = a[1 .. $]; b[] = b[] * 3 + 7; test1a(a); } Andrei
Jan 17 2012
On Tue, 17 Jan 2012 09:42:12 +0100, Don Clugston <dac nospam.com> wrote:On 16/01/12 17:51, Martin Nowak wrote:Oh, I was literally speaking of the runtime implementation. It should loop with 4 XMM regs then continue with 1 XMM reg and finish up scalar. Right now it quantizes on 16 floats and does the remaining ones scalar, which is really bad for very small arrays. I was about to rewrite it at some point. https://gist.github.com/1235470 I think having a runtime template is better than doing this massive extern(C) interface that has to be kept in sync. That would also open up room for a better CTFE integration. martinOn Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Yes. The structural problem in the compiler is that array ops get turned into function calls far too early. It happens in the semantic pass, but it shouldn't happen in the front-end at all -- it should be done in the glue layer, at the beginning of code generation. Incidentally, this is the reason that CTFE doesn't work with array ops.On 1/15/12 12:56 AM, Walter Bright wrote:Unfortunately druntime's array ops are a mess and fail to speed up anything below 16 floats. Additionally there is overhead for a function call and they have to check alignment at runtime. martinI get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with. ----------------------- import core.simd; void test1a(float[4] a) { } void test1() { float[4] a = 1.2; a[] = a[] * 3 + 7; test1a(a); } void test2a(float4 a) { } void test2() { float4 a = 1.2; a = a * 3 + 7; test2a(a); }These two functions should have the same speed. The function that ought to be slower is: void test1() { float[5] a = 1.2; float[] b = a[1 .. $]; b[] = b[] * 3 + 7; test1a(a); } Andrei
Jan 17 2012
On 1/17/2012 2:04 AM, Martin Nowak wrote:I was about to rewrite it at some point. https://gist.github.com/1235470I think you've got an innovative and clever solution. I'd like to see you finish it!
Jan 17 2012
On Tue, 17 Jan 2012 11:53:35 +0100, Walter Bright <newshound2 digitalmars.com> wrote:On 1/17/2012 2:04 AM, Martin Nowak wrote:Mmh, there was something keeping me from specializing templates, https://github.com/D-Programming-Language/dmd/pull/396 :). But right now I'd rather like to finish the shared library merging.I was about to rewrite it at some point. https://gist.github.com/1235470I think you've got an innovative and clever solution. I'd like to see you finish it!
Jan 17 2012
On 1/17/2012 5:20 AM, Martin Nowak wrote:Mmh, there was something keeping me from specializing templates, https://github.com/D-Programming-Language/dmd/pull/396 :). But right now I'd rather like to finish the shared library merging.I agree.
Jan 17 2012