digitalmars.D - Optimization problem: bulk Boolean operations on vectors
- Andrei Alexandrescu (3/3) Dec 23 2016 An interesting problem to look at:
- hardreset (31/34) Dec 23 2016 Ok some hand written assembler to and-assign ints...
- Seb (4/13) Dec 23 2016 138 ms, 603 μs, and 6 hnsecs
- Stefan Koch (2/17) Dec 23 2016 This is indeed already done.
- Martin Nowak (10/14) Dec 25 2016 We considered using gdc (or ldc) to compensate for the slowdown
- Walter Bright (33/60) Dec 23 2016 For this D code:
- Andrei Alexandrescu (4/10) Dec 23 2016 That's a neat trick, thanks hardreset (and Walter for making it
- Johan Engelen (12/20) Dec 23 2016 This code is not equivalent with the plain foreach loop.
- Walter Bright (5/26) Dec 23 2016 The assumption is that 'a' points to the start of an array [0..SIZE], so...
- Johan Engelen (35/53) Dec 24 2016 Of course, but it's not codified in the source and thus the
- Walter Bright (6/11) Dec 24 2016 You are pedantically correct. It's possible someone would write a bizarr...
- hardreset (6/40) Dec 23 2016 I patched up the prolog code and timed it and it came out
- Andrei Alexandrescu (2/3) Dec 23 2016 This passes the three flags to test.d. Place test.d at the end.
- Walter Bright (3/8) Dec 23 2016 dmd test -O
- Jacob Carlborg (5/6) Dec 24 2016 You cannot get the assembly output from DMD since it directly outputs
- safety0ff (3/11) Dec 23 2016 Is subtraction of pointers which do not belong to the same array
- Stefan Koch (3/18) Dec 23 2016 Yes, I think so.
- safety0ff (4/5) Dec 23 2016 The foreach macro (src/tk/vec.h#L62) looks like low hanging fruit
- Observer (14/16) Dec 27 2016 With respect to bulk operations on vectors, of course I recognize
An interesting problem to look at:

https://github.com/dlang/dmd/pull/6352

Andrei
Dec 23 2016
On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu wrote:
> An interesting problem to look at:
> https://github.com/dlang/dmd/pull/6352
>
> Andrei

Ok some hand written assembler to and-assign ints...

enum SIZE = 100000000;

void foo3(int* a, int* b)
{
    asm
    {
        mov EAX,a;
        mov EDX,b;
        sub EDX,EAX;
        mov ECX,EAX;
        add ECX,SIZE*4;
    l1:;
        cmp EAX,ECX;
        jae done;
        mov EBX,[EAX];
        and EBX,[EAX+EDX];
        mov [EAX],EBX;
        add EAX,4;
        jmp l1;
    done:;
    }
}

I get..

456ms for unconditional
505ms for conditional
268ms for assembler

So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code, not the source, that's the problem.
Dec 23 2016
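The plain D loop these timings are compared against is not quoted in the thread; a minimal sketch of what the "unconditional" baseline presumably looks like (an assumption, not taken from the post; foo1 is a placeholder name):

```d
// Presumed baseline (assumption): straightforward element-wise and-assign.
enum SIZE = 100000000;

void foo1(int* a, int* b)
{
    foreach (i; 0 .. SIZE)
        a[i] &= b[i];   // both arrays indexed by the same loop variable
}
```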
On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote:
> I get..
>
> 456ms for unconditional
> 505ms for conditional
> 268ms for assembler
>
> So the asm is about 40% faster than the best of the alternatives. The
> point being that it's the generated code not the source that's the problem.

Has anyone considered using ldc for DMD release compilations?

dmd -O -release -inline -run foo.d
138 ms, 603 μs, and 6 hnsecs

ldmd -O -release -inline -run foo.d
84 ms and 719 μs
Dec 23 2016
On Friday, 23 December 2016 at 18:33:52 UTC, Seb wrote:
> On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote:
>> I get..
>>
>> 456ms for unconditional
>> 505ms for conditional
>> 268ms for assembler
>>
>> So the asm is about 40% faster than the best of the alternatives. The
>> point being that it's the generated code not the source that's the problem.
>
> Has anyone considered using ldc for DMD release compilations?
>
> dmd -O -release -inline -run foo.d
> 138 ms, 603 μs, and 6 hnsecs
>
> ldmd -O -release -inline -run foo.d
> 84 ms and 719 μs

This is indeed already done.
Dec 23 2016
On Friday, 23 December 2016 at 18:33:52 UTC, Seb wrote:
>> So the asm is about 40% faster than the best of the alternatives. The
>> point being that it's the generated code not the source that's the problem.
>
> Has anyone considered using ldc for DMD release compilations?

We considered using gdc (or ldc) to compensate for the slowdown after the ddmd transition, but nobody complained enough for that to happen ;). Also Windows users were stuck with dmd at that point.

Using a proper optimizer, profiles, and LTO it was fairly simple to speed up dmd by some 20+% (IIRC). Unfortunately it wasn't considered before 2.069.0, but that's one of the main reasons why we set up Travis-CI to test gdc/ldc builds.

https://trello.com/c/OT6jlFNa/85-rebench-ddmd-vs-dmd-compiler-speed
Dec 25 2016
On 12/23/2016 10:03 AM, hardreset wrote:
> enum SIZE = 100000000;
>
> void foo3(int* a, int* b)
> {
>     asm
>     {
>         mov EAX,a;
>         mov EDX,b;
>         sub EDX,EAX;
>         mov ECX,EAX;
>         add ECX,SIZE*4;
>     l1:;
>         cmp EAX,ECX;
>         jae done;
>         mov EBX,[EAX];
>         and EBX,[EAX+EDX];
>         mov [EAX],EBX;
>         add EAX,4;
>         jmp l1;
>     done:;
>     }
> }
>
> I get..
>
> 456ms for unconditional
> 505ms for conditional
> 268ms for assembler
>
> So the asm is about 40% faster than the best of the alternatives. The
> point being that it's the generated code not the source that's the problem.

For this D code:

enum SIZE = 100000000;

void foo(int* a, int* b)
{
    int* atop = a + 1000;
    ptrdiff_t offset = b - a;
    for (; a < atop; ++a)
        *a &= *(a + offset);
}

The following asm is generated by DMD:

        push    EBX
        mov     EBX,8[ESP]
        sub     EAX,EBX
        push    ESI
        cdq
        and     EDX,3
        add     EAX,EDX
        sar     EAX,2
        lea     ECX,0FA0h[EBX]
        mov     ESI,EAX
        cmp     EBX,ECX
        jae     L2C
L20:    mov     EDX,[ESI*4][EBX]
        and     [EBX],EDX
        add     EBX,4
        cmp     EBX,ECX
        jb      L20
L2C:    pop     ESI
        pop     EBX
        ret     4

The inner loop is 5 instructions, whereas the one you wrote is 7 instructions (I didn't benchmark it). With some more source code manipulation the divide can be eliminated, but that is irrelevant to the inner loop.
Dec 23 2016
On 12/23/2016 05:11 PM, Walter Bright wrote:
> void foo(int* a, int* b)
> {
>     int* atop = a + 1000;
>     ptrdiff_t offset = b - a;
>     for (; a < atop; ++a)
>         *a &= *(a + offset);
> }

That's a neat trick, thanks hardreset (and Walter for making it high-level). There are a bunch of places in druntime and phobos that could use it. -- Andrei
Dec 23 2016
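For illustration, a hypothetical sketch (not from the thread; andAssign is a made-up name) of how the single-offset trick could be wrapped in a slice-based helper of the kind Andrei mentions:

```d
// Hypothetical helper: one pointer walks the destination, the source is
// reached through a constant offset, so the loop has a single induction
// variable just like Walter's foo().
void andAssign(int[] a, int[] b)
{
    assert(a.length == b.length);   // lengths assumed equal
    int* p = a.ptr;
    int* ptop = p + a.length;
    ptrdiff_t offset = b.ptr - a.ptr;
    for (; p < ptop; ++p)
        *p &= *(p + offset);
}
```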
On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
> enum SIZE = 100000000;
>
> void foo(int* a, int* b)
> {
>     int* atop = a + 1000;    // should be `a + SIZE`, right?
>     ptrdiff_t offset = b - a;
>     for (; a < atop; ++a)
>         *a &= *(a + offset);
> }

This code is not equivalent with the plain foreach loop. Execution is different when a > (size_t.max - SIZE). Thus the "ptrdiff" loop cannot be vectorized (or can only be vectorized when guarded with a check for potential overflow first).

LDC: https://godbolt.org/g/YcCJdZ
(note the funny jmp .LBB1_6 to a ret instruction... what's that?!)

GDC does something more complex: https://godbolt.org/g/3XeI9p

Just for info. Don't know which is faster, but I'm guessing the vectorized foreach loop.
Dec 23 2016
On 12/23/2016 3:35 PM, Johan Engelen wrote:
> On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
>> enum SIZE = 100000000;
>>
>> void foo(int* a, int* b)
>> {
>>     int* atop = a + 1000;    // should be `a + SIZE`, right?
>>     ptrdiff_t offset = b - a;
>>     for (; a < atop; ++a)
>>         *a &= *(a + offset);
>> }
>
> This code is not equivalent with the plain foreach loop. Execution is
> different when a > (size_t.max-SIZE).

The assumption is that 'a' points to the start of an array [0..SIZE], so there's no overflow.

> Thus the "ptrdiff" loop cannot be vectorized (or can only be vectorized
> when guarded with a check for potential overflow first).
>
> LDC: https://godbolt.org/g/YcCJdZ
> (note the funny jmp .LBB1_6 to a ret instruction... what's that?!)
>
> GDC does something more complex: https://godbolt.org/g/3XeI9p
>
> Just for info. Don't know which is faster, but I'm guessing the
> vectorized foreach loop.

The vectorized one probably is. But it sure is a lot of code. (The loop speed could possibly be doubled just by unrolling it a few times.)
Dec 23 2016
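A rough sketch (an assumption, not code from the thread; fooUnrolled is a made-up name) of what unrolling the offset loop a few times could look like, with a scalar tail for the leftover elements:

```d
// 4x-unrolled variant of the offset loop; n is the element count.
void fooUnrolled(int* a, int* b, size_t n)
{
    ptrdiff_t offset = b - a;
    int* atop = a + (n & ~cast(size_t)3);   // round n down to a multiple of 4
    for (; a < atop; a += 4)
    {
        a[0] &= *(a + offset);
        a[1] &= *(a + offset + 1);
        a[2] &= *(a + offset + 2);
        a[3] &= *(a + offset + 3);
    }
    foreach (i; 0 .. (n & 3))               // remaining 0..3 elements
        a[i] &= *(a + offset + i);
}
```

Whether this actually pays off depends on the backend; a vectorizing compiler may already do better on its own.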
On Saturday, 24 December 2016 at 00:04:48 UTC, Walter Bright wrote:
> On 12/23/2016 3:35 PM, Johan Engelen wrote:
>> On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
>>> enum SIZE = 100000000;
>>>
>>> void foo(int* a, int* b)
>>> {
>>>     int* atop = a + 1000;    // should be `a + SIZE`, right?
>>>     ptrdiff_t offset = b - a;
>>>     for (; a < atop; ++a)
>>>         *a &= *(a + offset);
>>> }
>>
>> This code is not equivalent with the plain foreach loop. Execution is
>> different when a > (size_t.max-SIZE).
>
> The assumption is that 'a' points to the start of an array [0..SIZE], so
> there's no overflow.

Of course, but it's not codified in the source and thus the function is different from the foreach version.

More importantly, I think it is broken. I find it very hard to reason about the ptrdiff_t version, because of all the potential overflows and the mixing of unsigned and signed math. So I wrote a little helper program to print out pointer values.

```
// file a.d
import std.stdio;

void main()
{
    int* a = cast(int*) (size_t.max - 4000000LU + 2000000LU);
    int* b = cast(int*) (1000000LU);

    // ptrdiff_t cannot represent the large difference between
    // arrays at the start and end of memory.
    ptrdiff_t offset = b - a;

    writeln("a = ", a);
    writeln("b = ", b);
    writeln("a + offset = ", a + offset);
}
```

Then:

```
❯ dmd -run a.d
a = FFFFFFFFFFC2F6FF
b = F4240
a + offset = F423F
```

a + offset is not equal to b for the given pointer values. Whoops.

I find it amazing that the GCC backend manages to come up with a bunch of checks for different overflow cases and creates a vectorized inner loop.

-Johan
Dec 24 2016
On 12/24/2016 2:28 AM, Johan Engelen wrote:
> Of course, but it's not codified in the source and thus the function is
> different from the foreach version. More importantly, I think it is
> broken. I find it very hard to reason about the ptrdiff_t version,
> because of all the potential overflows and the mixing of unsigned and
> signed math.

You are pedantically correct. It's possible someone would write a bizarre loop, and compiler optimizers need to take that into account when doing loop transformations. This is one of the reasons why D's array syntax and semantics are superior - they guarantee no overflow.
Dec 24 2016
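For comparison, a minimal sketch (not from the thread) of the array-syntax form Walter's remark points at, where bounds are carried with the data and no pointer-offset overflow has to be reasoned about:

```d
// Slice versions of the same operation; lengths are assumed equal.
void fooSlices(int[] a, const(int)[] b)
{
    a[] &= b[];                 // built-in array op: element-wise and-assign
}

void fooForeach(int[] a, const(int)[] b)
{
    foreach (i; 0 .. a.length)  // the "plain foreach loop" form
        a[i] &= b[i];
}
```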
On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
> On 12/23/2016 10:03 AM, hardreset wrote:
>
> For this D code:
>
> enum SIZE = 100000000;
>
> void foo(int* a, int* b)
> {
>     int* atop = a + 1000;
>     ptrdiff_t offset = b - a;
>     for (; a < atop; ++a)
>         *a &= *(a + offset);
> }
>
> The following asm is generated by DMD:
>
>         push    EBX
>         mov     EBX,8[ESP]
>         sub     EAX,EBX
>         push    ESI
>         cdq
>         and     EDX,3
>         add     EAX,EDX
>         sar     EAX,2
>         lea     ECX,0FA0h[EBX]
>         mov     ESI,EAX
>         cmp     EBX,ECX
>         jae     L2C
> L20:    mov     EDX,[ESI*4][EBX]
>         and     [EBX],EDX
>         add     EBX,4
>         cmp     EBX,ECX
>         jb      L20
> L2C:    pop     ESI
>         pop     EBX
>         ret     4
>
> The inner loop is 5 instructions, whereas the one you wrote is 7
> instructions (I didn't benchmark it). With some more source code
> manipulation the divide can be eliminated, but that is irrelevant to the
> inner loop.

I patched up the prolog code and timed it and it came out identical to my asm. I tried the ptrdiff C-like code and that still comes out 20% slower here.

I'm compiling with...

rdmd test.d -O -release -inline

Am I missing something? How do I get the asm output?
Dec 23 2016
On 12/23/2016 08:06 PM, hardreset wrote:
> rdmd test.d -O -release -inline

This passes the three flags to test.d. Place test.d at the end.
Dec 23 2016
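In other words, with rdmd the compiler flags have to come before the source file; anything after it is passed to the compiled program rather than to the compiler:

rdmd -O -release -inline test.d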
On 12/23/2016 5:06 PM, hardreset wrote:
> I patched up the prolog code and timed it and it came out identical to my
> asm. I tried the ptrdiff C-like code and that still comes out 20% slower
> here.
>
> I'm compiling with...
>
> rdmd test.d -O -release -inline
>
> Am I missing something? How do I get the asm output?

dmd test -O
obj2asm test.obj
Dec 23 2016
On 2016-12-24 02:06, hardreset wrote:
> How do I get the asm output?

You cannot get the assembly output from DMD since it directly outputs object code. To get the assembly you need to use a disassembler.

--
/Jacob Carlborg
Dec 24 2016
On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
> For this D code:
>
> enum SIZE = 100000000;
>
> void foo(int* a, int* b)
> {
>     int* atop = a + 1000;
>     ptrdiff_t offset = b - a;
>     for (; a < atop; ++a)
>         *a &= *(a + offset);
> }

Is subtraction of pointers which do not belong to the same array defined behavior in D?
Dec 23 2016
On Saturday, 24 December 2016 at 01:38:24 UTC, safety0ff wrote:
> On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
>> For this D code:
>>
>> enum SIZE = 100000000;
>>
>> void foo(int* a, int* b)
>> {
>>     int* atop = a + 1000;
>>     ptrdiff_t offset = b - a;
>>     for (; a < atop; ++a)
>>         *a &= *(a + offset);
>> }
>
> Is subtraction of pointers which do not belong to the same array defined
> behavior in D?

Yes, I think so. There are no crazy rules to make pointer math UB.
Dec 23 2016
On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu wrote:
> An interesting problem to look at:

The foreach macro (src/tk/vec.h#L62) looks like low hanging fruit for optimization as well.
Dec 23 2016
On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu wrote:
> An interesting problem to look at:
> https://github.com/dlang/dmd/pull/6352

With respect to bulk operations on vectors, of course I recognize the desire to use high-level code which is portable across platforms. But I also wonder whether this is the sort of thing that is really best handled these days by using the multi-media instruction set instead of the general-purpose instruction set. There's a lot of special-purpose arithmetic that is provided by the CPU in such a context, and it seems a bit silly to ignore that possibility.

Also more generally regarding low-level stuff like this, perhaps this reference might be of interest:

https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0321842685
Dec 27 2016
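As a rough illustration of the direction Observer suggests, a hedged sketch using D's core.simd vector types; the function name is made up, and it assumes a D_SIMD-capable x86 target and 16-byte-aligned inputs, with a scalar tail handling the remainder. None of this is from the thread:

```d
import core.simd;

// Sketch only: element-wise and-assign using 128-bit integer vectors where
// available, falling back to the plain scalar loop otherwise.
void andAssignVec(int* a, const(int)* b, size_t n)
{
    size_t i = 0;
    version (D_SIMD)
    {
        // Assumes a and b are 16-byte aligned (aligned vector loads/stores).
        for (; i + 4 <= n; i += 4)
        {
            int4 va = *cast(int4*)(a + i);
            int4 vb = *cast(const(int4)*)(b + i);
            *cast(int4*)(a + i) = va & vb;
        }
    }
    for (; i < n; ++i)      // scalar tail (or the whole loop without D_SIMD)
        a[i] &= b[i];
}
```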