digitalmars.D - Taking pipeline processing to the next level

Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
I mostly code like this now:
  data.map!(x => transform(x)).copy(output);

It's convenient and reads nicely, but it's generally inefficient.
This sort of one-by-one software design is the core performance
problem with OOP. It seems a shame to be suffering OOP's failures even
when there is no OOP in sight.

A central premise of performance-oriented programming which I've
employed my entire career, is "where there is one, there is probably
many", and if you do something to one, you should do it to many.
With this in mind, the code I have always written doesn't tend to look
like this:
  R manipulate(Thing thing);

Instead:
  void manipulateThings(Thing *things, size_t numThings, R *output,
size_t outputLen);

Written this way for clarity. Obviously, the D equiv uses slices.
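Roughly, the slice version would be along these lines (a sketch only; the scalar manipulate() is the single-thing function above):

  void manipulateThings(const(Thing)[] things, R[] output)
  {
      assert(output.length >= things.length);
      foreach (i, ref thing; things)
          output[i] = manipulate(thing);  // or, in practice, a hand-written SIMD loop over the whole slice
  }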

All functions are implemented with the presumption they will operate
on many things, rather than being called many times for each one.
This is the single most successful design pattern I have ever
encountered wrt high-performance code; ie, implement the array version
first.

The problem with this API design, is that it doesn't plug into
algorithms or generic code well.
  data.map!(x => transformThings(&x, 1)).copy(output);

I often wonder how we can integrate this design principle conveniently
(ie, seamlessly) into the design of algorithms, such that they can
make use of batching functions internally, and transparently?

Has anyone done any work in this area?

Ideas?
Sep 04 2016
rikki cattermole <rikki cattermole.co.nz> writes:
On 05/09/2016 5:08 PM, Manu via Digitalmars-d wrote:
 I mostly code like this now:
   data.map!(x => transform(x)).copy(output);

 It's convenient and reads nicely, but it's generally inefficient.
 This sort of one-by-one software design is the core performance
 problem with OOP. It seems a shame to be suffering OOP's failures even
 when there is no OOP in sight.

 A central premise of performance-oriented programming which I've
 employed my entire career, is "where there is one, there is probably
 many", and if you do something to one, you should do it to many.
 With this in mind, the code I have always written doesn't tend to look
 like this:
   R manipulate(Thing thing);

 Instead:
   void manipulateThings(Thing *things, size_t numThings, R *output,
 size_t outputLen);

 Written this way for clarity. Obviously, the D equiv uses slices.

 All functions are implemented with the presumption they will operate
 on many things, rather than being called many times for each one.
 This is the single most successful design pattern I have ever
 encountered wrt high-performance code; ie, implement the array version
 first.

 The problem with this API design, is that it doesn't plug into
 algorithms or generic code well.
   data.map!(x => transformThings(&x, 1)).copy(output);

 I often wonder how we can integrate this design principle conveniently
 (ie, seamlessly) into the design of algorithms, such that they can
 make use of batching functions internally, and transparently?

 Has anyone done any work in this area?

 Ideas?
Just a random idea:

  import std.array : front, popFront, empty;
  import std.range : ElementType, isInputRange;

  int[] transformer(I)(I from, int[] buffer)
      if (is(ElementType!I == int) && isInputRange!I)
  {
      size_t used;
      // ... transformation algo
      return buffer[0 .. used];
  }

  auto got = input.transformer(buffer).usage;

An input range instead of a straight array is passed in, so it works on pretty much any input array or input range.
Sep 04 2016
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
 I mostly code like this now:
   data.map!(x => transform(x)).copy(output);

 It's convenient and reads nicely, but it's generally inefficient.
What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
Sep 05 2016
Johannes Pfau <nospam example.com> writes:
Am Mon, 5 Sep 2016 10:21:53 +0200
schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:

 On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
 I mostly code like this now:
   data.map!(x => transform(x)).copy(output);

 It's convenient and reads nicely, but it's generally inefficient.  
What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
As Manu posted this question (and he's working on a color/image library), it's not hard to guess that one problem is SIMD/vectorization. E.g. if transform(x) => x + 2, it is faster to perform one SIMD operation on 4 values than 4 individual adds. As autovectorization is not very powerful in current compilers, I can easily imagine that complex range-based examples can't compete with hand-written SIMD loops.

Manu: Have you had a look at std.parallelism? I think it has some kind of parallel map which could provide some inspiration?
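To make that concrete, a minimal core.simd sketch of the "+ 2" transform could look roughly like this (illustrative only; it assumes an x86 target and a 16-byte-aligned slice, with a scalar tail for the leftovers):

  import core.simd;

  void addTwo(float[] data)
  {
      float4 two = 2.0f;                           // broadcast the scalar into all four lanes
      size_t i = 0;
      for (; i + 4 <= data.length; i += 4)
      {
          float4 v = *cast(float4*)(data.ptr + i); // one aligned 128-bit load
          v = v + two;                             // one add for four values
          *cast(float4*)(data.ptr + i) = v;
      }
      for (; i < data.length; ++i)                 // scalar tail for the remainder
          data[i] += 2;
  }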
Sep 05 2016
Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 5 September 2016 at 20:32, Johannes Pfau via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 Am Mon, 5 Sep 2016 10:21:53 +0200
 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:

 On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
 I mostly code like this now:
   data.map!(x => transform(x)).copy(output);

 It's convenient and reads nicely, but it's generally inefficient.
What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
As Manu posted this question (and he's working on a color/image library) it's not hard to guess one problem is SIMD/vectorization. E.g if transform(x) => x + 2; It is faster to perfom 1 SIMD operation on 4 values instead of 4 individual adds. As autovectorization is not very powerful in current compilers I can easily imagine that complex range based examples can't compete with hand-written SIMD loops. Manu: Have you had a look at std.parallelism? I think it has some kind of parallel map which could provide some inspiration?
I have. Even just chunks() and joiner() can do the trick to a reasonable extent, but it's still not great; it's definitely not where I'd like it to be.

End-users won't manually deploy these strategies correctly (or at all). I'd like to see a design that enables more automatic deployment of batch processing. I treat the end-user like a javascript user; they shouldn't need to do hard work to make proper use of a lib. That's a poor API offering on the part of the lib author.
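For reference, the kind of manual deployment I mean looks roughly like this (a sketch only; it assumes a transformThings variant that returns the transformed chunk, and the chunk size is just a guess):

  import std.range : chunks;
  import std.algorithm : map, joiner, copy;

  data.chunks(64)                     // the user has to guess a chunk size
      .map!(c => transformThings(c))  // one batch call per chunk
      .joiner                         // flatten back into a stream of elements
      .copy(output);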
Sep 05 2016
Ethan Watson <gooberman gmail.com> writes:
On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu 
wrote:
 What are the benchmarks and the numbers? What loss are you 
 looking at? -- Andrei
Just looking at the example, and referencing the map code in std.algorithm.iteration, I can see multiple function calls instead of one, thanks to every indexing of the new map doing a transformation instead of caching it.

I'm not sure if the lambda declaration there will result in the argument being taken by ref or by value, but let's assume by value for the sake of argument. Depending on whether it's taking a reference or a value type by value, that could either be a cheap function call or an expensive one. But even if it took it by reference, it's still a function call. Function calls are generally The Devil(TM) in a gaming environment. The fewer you can make, the better.

Random aside: There are streaming store instructions available to me on x86 platforms so that I don't have to wait for the destination to hit L1 cache before writing. The pattern Manu talks about with a batching function can better exploit this. But I imagine copy could also take advantage of this when dealing with value types.
Sep 05 2016
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/5/16 1:43 PM, Ethan Watson wrote:
 On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote:
 What are the benchmarks and the numbers? What loss are you looking at?
 -- Andrei
Just looking at the example, and referencing the map code in std.algorithm.iteration, I can see multiple function calls instead of one thanks to every indexing of the new map doing a transformation instead of caching it. I'm not sure if the lambda declaration there will result in the argument being taken by ref or by value, but let's assume by value for the sake of argument. Depending on if it's taking by value a reference or a value type, that could either be a cheap function call or an expensive one. But even if it took it by reference, it's still a function call. Function calls are generally The Devil(TM) in a gaming environment. The less you can make, the better. Random aside: There are streaming store instructions available to me on x86 platforms so that I don't have to wait for the destination to hit L1 cache before writing. The pattern Manu talks about with a batching function can better exploit this. But I imagine copy could also take advantage of this when dealing with value types.
Understood. Would explicitly asking for vectorized operations be acceptable? One school of thought has it that explicit invocation of parallel operations is preferable to autovectorization and its ilk. -- Andrei
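For the thread-level case, Phobos already has an explicit spelling in std.parallelism; a SIMD-level equivalent could conceivably be invoked the same way (sketch, with transform as in the original example):

  import std.parallelism : taskPool;

  // explicit, eager, parallel map: the caller opts in, nothing is inferred
  auto results = taskPool.amap!transform(data);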
Sep 05 2016
Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 5 September 2016 at 23:38, Andrei Alexandrescu via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 9/5/16 1:43 PM, Ethan Watson wrote:
 On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote:
 What are the benchmarks and the numbers? What loss are you looking at?
 -- Andrei
Just looking at the example, and referencing the map code in std.algorithm.iteration, I can see multiple function calls instead of one thanks to every indexing of the new map doing a transformation instead of caching it. I'm not sure if the lambda declaration there will result in the argument being taken by ref or by value, but let's assume by value for the sake of argument. Depending on if it's taking by value a reference or a value type, that could either be a cheap function call or an expensive one. But even if it took it by reference, it's still a function call. Function calls are generally The Devil(TM) in a gaming environment. The less you can make, the better. Random aside: There are streaming store instructions available to me on x86 platforms so that I don't have to wait for the destination to hit L1 cache before writing. The pattern Manu talks about with a batching function can better exploit this. But I imagine copy could also take advantage of this when dealing with value types.
Understood. Would explicitly asking for vectorized operations be acceptable? One school of thought has it that explicit invocation of parallel operations are preferable to autovectorization and its ilk. -- Andrei
I still stand by this, and I listed some reasons above. Auto-vectorisation is a nice opportunistic optimisation, but it can't be relied on. The key reason is that scalar arithmetic semantics are different from vector semantics, and auto-vectorisation tends to produce a whole bunch of extra junk code to carefully (usually pointlessly) preserve the scalar semantics that it's trying to vectorise. This will never end well.

But the vectorisation isn't the interesting problem here. I'm really just interested in how to work these batch-processing functions into our nice modern pipeline statements without placing an unreasonable burden on the end-user, who shouldn't be expected to go out of their way. If they even have to start manually chunking, I think we've already lost; they won't know optimal chunk sizes, or anything about alignment boundaries, cache, etc.
Sep 05 2016
finalpatch <fengli gmail.com> writes:
On Tuesday, 6 September 2016 at 03:08:43 UTC, Manu wrote:
 I still stand by this, and I listed some reasons above.
 Auto-vectorisation is a nice opportunistic optimisation, but it 
 can't
 be relied on. The key reason is that scalar arithmetic 
 semantics are
 different than vector semantics, and auto-vectorisation tends to
 produce a whole bunch of extra junk code to carefully (usually
 pointlessly) preserve the scalar semantics that it's trying to
 vectorise. This will never end well.
 But the vectorisation isn't the interesting problem here, I'm 
 really
 just interested in how to work these batch-processing functions 
 into
 our nice modern pipeline statements without placing an 
 unreasonable
 burden on the end-user, who shouldn't be expected to go out of 
 their
 way. If they even have to start manually chunking, I think we've
 already lost; they won't know optimal chunk-sizes, or anything 
 about
 alignment boundaries, cache, etc.
In a previous job I had successfully created a small C++ library to perform pipelined SIMD image processing. Not sure how relevant it is, but I thought I'd share the design here; perhaps it'll give you guys some ideas.

Basically the users of this library only need to write simple kernel classes, something like this:

  // A kernel that processes 4 pixels at a time
  struct MySimpleKernel : Kernel<4>
  {
      // Tell the library the input and output type
      using InputVector = Vector<__m128, 1>;
      using OutputVector = Vector<__m128, 2>;

      template<typename T>
      OutputVector apply(const T& src)
      {
          // T will be deduced to Vector<__m128, 1>
          // which is an array of one __m128 element

          // Awesome SIMD code goes here...

          // And return the output vector
          return OutputVector(...);
      }
  };

Of course the InputVector and OutputVector do not have to be __m128, they can totally be other types like int or float.

The cool thing is kernels can be chained together with >> operators. So assume we have another kernel:

  struct AnotherKernel : Kernel<3>
  {
      ...
  };

Then we can create a processing pipeline with these 2 kernels:

  InputBuffer(...) >> MySimpleKernel() >> AnotherKernel() >> OutputBuffer(...);

Then some template magic will figure out the LCM of the 2 kernels' pixel widths is 3*4=12, and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invocation, with a main loop that reads the source buffer in 4 pixel steps, calls MySimpleKernel 3 times, then calls AnotherKernel 4 times.

Any number of kernels can be chained together in this way, as long as your compiler doesn't explode.

At that time, my benchmarks showed pipelines generated in this way often rival the speed of hand-tuned loops.
Sep 06 2016
finalpatch <fengli gmail.com> writes:
On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:
 Then some template magic will figure out the LCM of the 2 
 kernels' pixel width is 3*4=12 and therefore they are fused 
 together into a composite kernel of pixel width 12.  The above 
 line compiles down into a single function invokation, with a 
 main loop that reads the source buffer in 4 pixels step, call 
 MySimpleKernel 3 times, then call AnotherKernel 4 times.
Correction: with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
Sep 06 2016
finalpatch <fengli gmail.com> writes:
On Tuesday, 6 September 2016 at 14:26:22 UTC, finalpatch wrote:
 On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:
 Then some template magic will figure out the LCM of the 2 
 kernels' pixel width is 3*4=12 and therefore they are fused 
 together into a composite kernel of pixel width 12.  The above 
 line compiles down into a single function invokation, with a 
 main loop that reads the source buffer in 4 pixels step, call 
 MySimpleKernel 3 times, then call AnotherKernel 4 times.
Correction: with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
And of course the key to the speed is that all the function calls get inlined by the compiler.
Sep 06 2016
Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 7 September 2016 at 00:26, finalpatch via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:
 Then some template magic will figure out the LCM of the 2 kernels' pixel
 width is 3*4=12 and therefore they are fused together into a composite
 kernel of pixel width 12.  The above line compiles down into a single
 function invokation, with a main loop that reads the source buffer in 4
 pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
Correction: with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
Interesting thoughts. What did you do when buffers weren't a multiple of the kernel width?
Sep 06 2016
finalpatch <fengli gmail.com> writes:
On Tuesday, 6 September 2016 at 14:47:21 UTC, Manu wrote:

 with a main loop that reads the source buffer in *12* pixels 
 step, call
 MySimpleKernel 3 times, then call AnotherKernel 4 times.
It's interesting thoughts. What did you do when buffers weren't multiple of the kernels?
The end of a scan line is special-cased. If I need 12 pixels for the last iteration but there are only 8 left, an instance of Kernel::InputVector is allocated on the stack, the 8 remaining pixels are memcpy'd into it, and then it is sent to the kernel. Output from the kernel is also assigned to a stack variable first, then 8 pixels are memcpy'd to the output buffer.
Sep 06 2016
Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 7 September 2016 at 07:11, finalpatch via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Tuesday, 6 September 2016 at 14:47:21 UTC, Manu wrote:

 with a main loop that reads the source buffer in *12* pixels step, call
 MySimpleKernel 3 times, then call AnotherKernel 4 times.
It's interesting thoughts. What did you do when buffers weren't multiple of the kernels?
The end of a scan line is special cased . If I need 12 pixels for the last iteration but there are only 8 left, an instance of Kernel::InputVector is allocated on stack, 8 remaining pixels are memcpy into it then send to the kernel. Output from kernel are also assigned to a stack variable first, then memcpy 8 pixels to the output buffer.
Right, and this is a classic problem with this sort of function; it is only more efficient if numElements is suitably long.

See, I often wonder if it would be worth being able to provide both functions, a scalar and an array version, and have the algorithms select between them intelligently.
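As a rough sketch of the kind of selection I mean (the threshold and the SIMD helper name are purely illustrative):

  void transformThings(const(Thing)[] things, R[] output)
  {
      if (things.length < 16)
      {
          // too short to amortise the SIMD setup; plain scalar loop
          foreach (i, ref t; things)
              output[i] = transform(t);
      }
      else
      {
          // hand-written batch path (hypothetical)
          transformThingsSIMD(things, output);
      }
  }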
Sep 06 2016
finalpatch <fengli gmail.com> writes:
On Wednesday, 7 September 2016 at 00:21:23 UTC, Manu wrote:
 The end of a scan line is special cased . If I need 12 pixels 
 for the last iteration but there are only 8 left, an instance 
 of Kernel::InputVector is allocated on stack, 8 remaining 
 pixels are memcpy into it then send to the kernel. Output from 
 kernel are also assigned to a stack variable first, then 
 memcpy 8 pixels to the output buffer.
Right, and this is a classic problem with this sort of function; it is only more efficient if numElements is suitable long. See, I often wonder if it would be worth being able to provide both functions, a scalar and array version, and have the algorithms select between them intelligently.
We normally process full-HD or higher resolution images, so the overhead of having to copy for the last iteration was negligible.

It was fairly easy to put together a scalar version, as they are much easier to write than the SIMD ones. In fact I had a scalar version for every SIMD kernel, and used them for unit testing. It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when the number of elements is small; it wasn't done that way simply because we didn't need it.
Sep 06 2016
Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 7 September 2016 at 11:04, finalpatch via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 It shouldn't be hard to have the framework look at the buffer size and
 choose the scalar version when number of elements are small, it wasn't done
 that way simply because we didn't need it.
No, what's hard is working this into D's pipeline patterns seamlessly.
Sep 06 2016
finalpatch <fengli gmail.com> writes:
On Wednesday, 7 September 2016 at 01:38:47 UTC, Manu wrote:
 On 7 September 2016 at 11:04, finalpatch via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
 It shouldn't be hard to have the framework look at the buffer 
 size and choose the scalar version when number of elements are 
 small, it wasn't done that way simply because we didn't need 
 it.
No, what's hard is working this into D's pipeline patterns seamlessly.
The lesson I learned from this is that you need the user code to provide a lot of extra information about the algorithm at compile time for the templates to work out a way to fuse pipeline stages together efficiently.

I believe it is possible to get something similar in D, because D has more powerful templates than C++, and D also has some type introspection which C++ lacks. Unfortunately I'm not as good with D, so I can only provide some ideas rather than actual working code.

Once this problem is solved, the benefit is huge. It allowed me to perform high-level optimizations (streaming load/save, prefetching, dynamic dispatching depending on data alignment, etc.) in the main loop, which automatically benefits all kernels and pipelines.
Sep 06 2016
Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 7 September 2016 at 12:00, finalpatch via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Wednesday, 7 September 2016 at 01:38:47 UTC, Manu wrote:
 On 7 September 2016 at 11:04, finalpatch via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 It shouldn't be hard to have the framework look at the buffer size and
 choose the scalar version when number of elements are small, it wasn't done
 that way simply because we didn't need it.
No, what's hard is working this into D's pipeline patterns seamlessly.
The lesson I learned from this is that you need the user code to provide a lot of extra information about the algorithm at compile time for the templates to work out a way to fuse pipeline stages together efficiently. I believe it is possible to get something similar in D because D has more powerful templates than C++ and D also has some type introspection which C++ lacks. Unfortunately I'm not as good on D so I can only provide some ideas rather than actual working code. Once this problem is solved, the benefit is huge. It allowed me to perform high level optimizations (streaming load/save, prefetching, dynamic dispatching depending on data alignment etc.) in the main loop which automatically benefits all kernels and pipelines.
Exactly!
Sep 06 2016
finalpatch <fengli gmail.com> writes:
On Wednesday, 7 September 2016 at 02:09:17 UTC, Manu wrote:
 The lesson I learned from this is that you need the user code 
 to provide a lot of extra information about the algorithm at 
 compile time for the templates to work out a way to fuse 
 pipeline stages together efficiently.

 I believe it is possible to get something similar in D because 
 D has more powerful templates than C++ and D also has some 
 type introspection which C++ lacks.  Unfortunately I'm not as 
 good on D so I can only provide some ideas rather than actual 
 working code.

 Once this problem is solved, the benefit is huge.  It allowed 
 me to perform high level optimizations (streaming load/save, 
 prefetching, dynamic dispatching depending on data alignment 
 etc.) in the main loop which automatically benefits all 
 kernels and pipelines.
Exactly!
I think the problem here is twofold.

First question: how do we combine pipeline stages with minimal overhead?

I think the key to this problem is reliable *forceinline*. For example, a pipeline like this:

  input.map!(x=>x.f1().f2().f3().store(output));

If we could make sure f1(), f2(), f3(), store(), and map() itself are all inlined, then we end up with a single loop with no function calls, and the compiler is free to perform cross-function optimizations. This is about as good as you can get. Unfortunately, at the moment I hear it's difficult to make sure D functions get inlined.

Second question: how do we combine SIMD pipeline stages with minimal overhead?

Besides reliable inlining, we also need some template code to repeat stages until their strides match. This requires details about each stage's logical unit size, input/output type and size at compile time. I can't think of what the interface of this would look like, but the current map!() is likely insufficient to support this.

I still don't believe auto-selecting between scalar and vector paths would be a very useful feature. Normally I would only consider a SIMD solution when I know in advance that this is a performance hotspot. When the amount of data is small I simply don't care about performance and would just choose whatever is the simplest way to do it, like map!(), because the performance impact is not noticeable and definitely not worth the increased complexity.
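To illustrate the stride-matching part, a tiny compile-time sketch (the Stage structs and their width members are invented for illustration):

  import std.numeric : gcd;

  // each stage advertises how many elements it consumes per call
  struct StageA { enum width = 4; /* SIMD kernel ... */ }
  struct StageB { enum width = 3; /* SIMD kernel ... */ }

  // least common multiple of the two widths: the fused loop's block size
  enum blockSize = StageA.width * StageB.width / gcd(StageA.width, StageB.width); // 12

  // the fused main loop then advances blockSize elements per iteration,
  // calling StageA blockSize/4 = 3 times and StageB blockSize/3 = 4 times.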
Sep 07 2016
parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Wednesday, 7 September 2016 at 10:31:13 UTC, finalpatch wrote:
 I think the problem here is two fold.

 First question, how do we combine pipeline stages with minimal 
 overhead

 I think the key to this problem is reliable *forceinline*

 for example, a pipeline like this

 input.map!(x=>x.f1().f2().f3().store(output));

 if we could make sure f1(), f2(), f3(), store(), and map() 
 itself are all inlined, then we end up with a single loop with 
 no function calls and the compiler is free to perform cross 
 function optimizations. This is about as good as you can get.  
 Unfortunately at the moment I hear it's difficult to make sure 
 D functions get inlined.
If the compiler is unable to inline (or wrongly decides it is too costly), I'd consider this a compiler bug. Of course, sometimes workarounds like `pragma(inline, true)` or ` forceinline` might be needed from time to time in practice, but they shouldn't influence the design of the pipeline interface.
 Second question, how do we combine SIMD pipeline stages with 
 minimal overhead

 Besides reliable inlining, we also need some template code to 
 repeat stages until their strides match. This requires details 
 about each stage's logical unit size, input/output type and 
 size at compile time. I can't think of what the interface of 
 this would look like but the current map!() is likely 
 insufficient to support this.
Would a `vectorize` range adapter be feasible that prepares the input to make it SIMD compatible? That is, force alignment, process left-over elements at the end, etc.? As far as I understand, the problems with auto vectorization stem from a difficulty of compilers to recognize vectorizing opportunities, and (as Manu described) from incompatible semantics of scalar and vector types that the compiler needs to preserve. But if that hypothetical `vectorize` helper forces the input data into one of a handful of well-known formats and types, wouldn't it be possible to make the compilers recognize those (especially if they are accompanied by suitable pragma or other compiler hints)?
 I still don't believe auto-select between scalar or vector 
 paths would be a very useful feature. Normally I would only 
 consider SIMD solution when I know in advance that this is a 
 performance hotspot. When the amount of data is small I simply 
 don't care about performance and would just choose whatever 
 simplest way to do it, like map!(), because the performance 
 impact is not noticeable and definitely not worth the increased 
 complexity.
In the above scenario, you can add `.vectorize` to the pipeline to enable vectorizing wherever you need it.
Sep 07 2016
finalpatch <fengli gmail.com> writes:
On Wednesday, 7 September 2016 at 11:53:00 UTC, Marc Schütz wrote:
 Would a `vectorize` range adapter be feasible that prepares the 
 input to make it SIMD compatible? That is, force alignment, 
 process left-over elements at the end, etc.? As far as I 
 understand, the problems with auto vectorization stem from a 
 difficulty of compilers to recognize vectorizing opportunities, 
 and (as Manu described) from incompatible semantics of scalar 
 and vector types that the compiler needs to preserve. But if 
 that hypothetical `vectorize` helper forces the input data into 
 one of a handful of well-known formats and types, wouldn't it 
 be possible to make the compilers recognize those (especially 
 if they are accompanied by suitable pragma or other compiler 
 hints)?
Contrary to popular belief, alignment is not a showstopper for SIMD code. Both Intel and ARM processors have instructions to access data from unaligned addresses, and on Intel processors there is not even any speed penalty for using them on aligned addresses. Which means you can either forget about it (on Intel) or just check the data alignment before you start and choose an optimal specialization of the main loop.

However, regarding auto-vectorization, I'm with Manu. I won't put my bet on auto-vectorization, because I have never seen any non-trivial auto-vectorized code that comes even close to hand-tuned SIMD code. The compiler always has to play conservatively. The compiler has no idea that you are only using 10 bits of each 16-bit component in a vector. It can't even help you shuffle RGBARGBARGBARGBA into RRRRGGGGBBBBAAAA.

The best we can do is to create something that makes writing SIMD kernels easy/reusable/composable.
Sep 07 2016
Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 5 September 2016 at 18:21, Andrei Alexandrescu via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
 I mostly code like this now:
   data.map!(x => transform(x)).copy(output);

 It's convenient and reads nicely, but it's generally inefficient.
What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
Well, it totally depends. Right now in my case, 'transform' is some image processing code (in the past, when I've had these same thoughts, it has been audio filters). You can't touch pixels (or samples) one at a time. They need manual SIMD deployment (I've never seen an auto-vectoriser handle saturation arithmetic, or type promotion), alpha components (every 4th byte) are treated differently, and memory access patterns need to be tuned to be cache friendly.

I haven't done benchmarks right now, but I've done them professionally in the past, and it's not unusual for a hand-written image processing loop to see 1 or even 2 orders of magnitude improvement compared to calling a function for each pixel in a loop. The sorts of low-level optimisations you deploy in image and audio processing loops are not things I've ever seen any optimiser even attempt.

Some core problems that tend to require manual intervention in hot loops are:

- ubyte[16] <-> ushort[8][2] expansion/contraction
- ubyte[16] <-> float[4][4] expansion/contraction
- saturation
- scalar operator results promote to int, but wide-simd operations don't, which means some scalar expressions can't be losslessly collapsed to simd operations, and the compiler will always be conservative on this matter. If the auto-vectoriser tries at all, you will see a mountain of extra code to preserve those bits that the scalar operator semantics would have guaranteed
- wide-vector multiplication is semantically different from scalar multiplication, so the optimiser has a lot of trouble vectorising mul's
- assumptions about data alignment
- interleaved data; audio samples are usually [L,R] interleaved, images often [RGB,A], and different processes are applied across the separation. You want to unroll and shuffle the data so you have vectors [LLLL],[RRRR], or [RGBRGBRGBRGB],[AAAA], and I haven't seen an optimiser go near that
- vector dot-product is always a nuisance

I could go on and on. The point is, as an end-user, pipeline APIs are great. As a library author, I want to present the best performing library I can, which I think means we need to find a way to conveniently connect these 2 currently disconnected worlds. I've explored to some extent, but I've never come up with anything that I like.
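To make the promotion point concrete, a small plain-D illustration (values chosen arbitrarily):

  ubyte a = 200, b = 100;
  int   wide   = a + b;              // scalar semantics: operands promote to int, wide == 300
  ubyte narrow = cast(ubyte)(a + b); // truncating back gives 44

  // A 16-lane ubyte SIMD add wraps (or saturates) within 8 bits per lane, with no
  // promotion, so an auto-vectoriser must emit extra widening code to preserve
  // the scalar result above.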
Sep 05 2016
Guillaume Piolat <first.last gmail.com> writes:
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:
 A central premise of performance-oriented programming which I've
 employed my entire career, is "where there is one, there is 
 probably
 many", and if you do something to one, you should do it to many.
 With this in mind, the code I have always written doesn't tend 
 to look
 like this:
   R manipulate(Thing thing);

 Instead:
   void manipulateThings(Thing *things, size_t numThings, R 
 *output,
 size_t outputLen);

 Written this way for clarity. Obviously, the D equiv uses 
 slices.
To chime in, processing in chunks is indeed absolutely key for audio performance. Processing by element is only tolerable for prototyping.

I don't even use slices, but usually a pointer and a length, since most slice lengths would be the same:

  void processSamples(float* input, float* output, int samples);

Some kind of buffered input and output range concept would be needed there if a lazy library was to be made. Unfortunately for this case inputs and outputs are rather diverse. Can't wait for a more formalized DbI introduction.

As for graphics, ae.utils.graphics usually gives you a whole line of pixels.
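For illustration, a driver that pushes a whole stream through that chunk-based entry point might look roughly like this (block size and function name are arbitrary):

  enum blockSize = 256;

  void processStream(float[] input, float[] output,
                     void function(float*, float*, int) processSamples)
  {
      assert(output.length >= input.length);
      size_t i = 0;
      for (; i + blockSize <= input.length; i += blockSize)
          processSamples(input.ptr + i, output.ptr + i, blockSize);  // full blocks
      if (i < input.length)
          processSamples(input.ptr + i, output.ptr + i,
                         cast(int)(input.length - i));               // remainder
  }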
Sep 05 2016
Jerry <Kickupx gmail.com> writes:
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:
 I mostly code like this now:
   data.map!(x => transform(x)).copy(output);
So you basically want to make the lazy computation eager and store the result?

  data.map!(x => transform(x)).array

will allocate a new array and fill it with the result of map. And if you want to recycle the buffer, I guess writing a buffer function would be trivial.
Sep 06 2016
Wyatt <wyatt.epp gmail.com> writes:
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:
 A central premise of performance-oriented programming which I've
 employed my entire career, is "where there is one, there is 
 probably
 many", and if you do something to one, you should do it to many.
From a conceptual standpoint, this sounds like the sort of thing array languages like APL and J thrive on, so there's solid precedent for the concept. I might suggest looking into optimising compilers in that space for inspiration and such; APEX, for example:

http://www.snakeisland.com/apexup.htm

Of course, this comes with the caveat that this is (still!) some relatively heavily-academic stuff. And I'm not sure to what extent that can help mitigate the problem of relaxing type requirements such that you can e.g. efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors.

-Wyatt
Sep 06 2016
Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 7 September 2016 at 01:54, Wyatt via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:
 A central premise of performance-oriented programming which I've
 employed my entire career, is "where there is one, there is probably
 many", and if you do something to one, you should do it to many.
From a conceptual standpoint, this sounds like the sort of thing array languages like APL and J thrive on, so there's solid precedent for the concept. I might suggest looking into optimising compilers in that space for inspiration and such; APEX, for example: http://www.snakeisland.com/apexup.htm
Thanks, that's really interesting, I'll check it out.
 Of course, this comes with the caveat that this is (still!) some relatively
 heavily-academic stuff.  And I'm not sure to what extent that can help
 mitigate the problem of relaxing type requirements such that you can e.g.
 efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors.
That's not what I want though. I intend to hand-write that function (I was just giving examples of how auto-vectorisation almost always fails); the question here is how to work that new array function into our pipelines transparently...
Sep 06 2016
Wyatt <wyatt.epp gmail.com> writes:
On Wednesday, 7 September 2016 at 00:18:59 UTC, Manu wrote:
 On 7 September 2016 at 01:54, Wyatt via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:

 Thanks, that's really interesting, I'll check it out.
Here's some work on static rank polymorphism that might also be applicable:

http://www.ccs.neu.edu/home/pete/pub/esop-2014.pdf

And in the Related Work, I just noticed Halide, which sounds like it's right up your alley:

http://halide-lang.org/
 Of course, this comes with the caveat that this is (still!) 
 some relatively heavily-academic stuff.  And I'm not sure to 
 what extent that can help mitigate the problem of relaxing 
 type requirements such that you can e.g. efficiently ,/⍉ your 
 4 2⍴"LR" vector for SIMD on modern processors.
That's not what I want though. I intend to hand-write that function (I was just giving examples of how auto-vectorisation almost always fails), the question here is, how to work that new array function into our pipelines transparently...
Ah, I misunderstood. Sorry. I had the impression that you wanted to be able to simply write:

  data.map!(x => transform(x)).copy(output);

...for any data[] and have it lift the transformation to the whole vector.

If you're doing the work, I'm curious what you're hoping the end result will look like in terms of the code you want to be able to write. Just a doodle is fine, it doesn't have to stand up to scrutiny.

-Wyatt
Sep 07 2016