digitalmars.D - Taking pipeline processing to the next level
- Manu via Digitalmars-d (29/29) Sep 04 2016 I mostly code like this now:
- rikki cattermole (13/42) Sep 04 2016 Just a random idea:
- Andrei Alexandrescu (3/6) Sep 05 2016 What are the benchmarks and the numbers? What loss are you looking at?
- Johannes Pfau (11/19) Sep 05 2016 As Manu posted this question (and he's working on a color/image library)
- Manu via Digitalmars-d (9/28) Sep 05 2016 I have, but even just chunks() and joiner() can do the trick to a
- Ethan Watson (19/21) Sep 05 2016 Just looking at the example, and referencing the map code in
- Andrei Alexandrescu (5/24) Sep 05 2016 Understood. Would explicitly asking for vectorized operations be
- Manu via Digitalmars-d (16/45) Sep 05 2016 I still stand by this, and I listed some reasons above.
- finalpatch (45/66) Sep 06 2016 In a previous job I had successfully created a small c++ library
- finalpatch (5/11) Sep 06 2016 Correction:
- finalpatch (3/14) Sep 06 2016 And of course the key to the speed is all function calls get
- Manu via Digitalmars-d (4/14) Sep 06 2016 It's interesting thoughts. What did you do when buffers weren't
- finalpatch (7/12) Sep 06 2016 The end of a scan line is special cased . If I need 12 pixels for
- Manu via Digitalmars-d (7/19) Sep 06 2016 Right, and this is a classic problem with this sort of function; it is
- finalpatch (9/23) Sep 06 2016 We normally process full HD or higher resolution images so the
- Manu via Digitalmars-d (3/6) Sep 06 2016 No, what's hard is working this into D's pipeline patterns seamlessly.
- finalpatch (15/24) Sep 06 2016 The lesson I learned from this is that you need the user code to
- Manu via Digitalmars-d (3/26) Sep 06 2016 Exactly!
- finalpatch (28/45) Sep 07 2016 I think the problem here is two fold.
- Marc Schütz (19/47) Sep 07 2016 If the compiler is unable to inline (or wrongly decides it is too
- finalpatch (17/29) Sep 07 2016 Contrary to popular belief, alignment is not a showstopper of
- Manu via Digitalmars-d (42/50) Sep 05 2016 Well, it totally depends. Like right now in my case, 'transform' is
- Guillaume Piolat (12/26) Sep 05 2016 To chime in, processing in chunks is indeed absolutely key for
- Jerry (7/9) Sep 06 2016 So you basicly want to make the lazy computation eager and store
- Wyatt (12/16) Sep 06 2016 From a conceptual standpoint, this sounds like the sort of thing
- Manu via Digitalmars-d (7/22) Sep 06 2016 Thanks, that's really interesting, I'll check it out.
- Wyatt (16/31) Sep 07 2016 Here's some work on static rank polymorphism that might also be
I mostly code like this now:

  data.map!(x => transform(x)).copy(output);

It's convenient and reads nicely, but it's generally inefficient. This sort of one-by-one software design is the core performance problem with OOP. It seems a shame to be suffering OOP's failures even when there is no OOP in sight.

A central premise of performance-oriented programming which I've employed my entire career is "where there is one, there is probably many", and if you do something to one, you should do it to many. With this in mind, the code I have always written doesn't tend to look like this:

  R manipulate(Thing thing);

Instead:

  void manipulateThings(Thing *things, size_t numThings, R *output, size_t outputLen);

Written this way for clarity. Obviously, the D equiv uses slices.

All functions are implemented with the presumption they will operate on many things, rather than being called many times for each one. This is the single most successful design pattern I have ever encountered wrt high-performance code; ie, implement the array version first.

The problem with this API design is that it doesn't plug into algorithms or generic code well:

  data.map!(x => transformThings(&x, 1)).copy(output);

I often wonder how we can integrate this design principle conveniently (ie, seamlessly) into the design of algorithms, such that they can make use of batching functions internally and transparently. Has anyone done any work in this area? Ideas?
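A minimal D rendering of the slice-based, batch-first signature described above; this is only a sketch, and Thing, R and the arithmetic are placeholders rather than names from any real library:

  // Placeholder types standing in for Thing and R:
  struct Thing { float value; }
  alias R = float;

  // Batch-first design: the array version is the primitive.
  void transformThings(const(Thing)[] things, R[] output)
  {
      assert(output.length >= things.length);
      foreach (i, ref t; things)
          output[i] = t.value * 2;  // stand-in for the real per-element work
  }

  // The one-at-a-time version, if needed at all, is a trivial wrapper:
  R transformOne(Thing t)
  {
      R[1] tmp;
      transformThings((&t)[0 .. 1], tmp[]);
      return tmp[0];
  }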
Sep 04 2016
On 05/09/2016 5:08 PM, Manu via Digitalmars-d wrote:

I mostly code like this now: data.map!(x => transform(x)).copy(output); It's convenient and reads nicely, but it's generally inefficient. This sort of one-by-one software design is the core performance problem with OOP. It seems a shame to be suffering OOP's failures even when there is no OOP in sight. A central premise of performance-oriented programming which I've employed my entire career, is "where there is one, there is probably many", and if you do something to one, you should do it to many. With this in mind, the code I have always written doesn't tend to look like this: R manipulate(Thing thing); Instead: void manipulateThings(Thing *things, size_t numThings, R *output, size_t outputLen); Written this way for clarity. Obviously, the D equiv uses slices. All functions are implemented with the presumption they will operate on many things, rather than being called many times for each one. This is the single most successful design pattern I have ever encountered wrt high-performance code; ie, implement the array version first. The problem with this API design, is that it doesn't plug into algorithms or generic code well. data.map!(x => transformThings(&x, 1)).copy(output); I often wonder how we can integrate this design principle conveniently (ie, seamlessly) into the design of algorithms, such that they can make use of batching functions internally, and transparently? Has anyone done any work in this area? Ideas?

Just a random idea:

  import std.array : front, popFront, empty;
  import std.range : ElementType, isInputRange;

  int[] transformer(I)(I from, int[] buffer)
      if (is(ElementType!I == int) && isInputRange!I)
  {
      size_t used;
      // ... transformation algo
      return buffer[0 .. used];
  }

  auto got = input.transformer(buffer).usage;

Input range instead of straight array being passed in so it works on pretty much any input arrays or input ranges.
Sep 04 2016
On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:I mostly code like this now: data.map!(x => transform(x)).copy(output); It's convenient and reads nicely, but it's generally inefficient.What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
Sep 05 2016
On Mon, 5 Sep 2016 10:21:53 +0200, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:As Manu posted this question (and he's working on a color/image library) it's not hard to guess one problem is SIMD/vectorization. E.g. if transform(x) => x + 2; It is faster to perform 1 SIMD operation on 4 values instead of 4 individual adds. As autovectorization is not very powerful in current compilers I can easily imagine that complex range based examples can't compete with hand-written SIMD loops. Manu: Have you had a look at std.parallelism? I think it has some kind of parallel map which could provide some inspiration?I mostly code like this now: data.map!(x => transform(x)).copy(output); It's convenient and reads nicely, but it's generally inefficient.What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
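To make that concrete, a tiny core.simd sketch of the same add, assuming an x86_64 target where 128-bit integer vectors are available; the open question in this thread is the pipeline plumbing around such kernels, not the kernel itself:

  import core.simd;

  // Scalar: one add per element.
  int addTwo(int x) { return x + 2; }

  // SIMD: four adds per call.
  int4 addTwo4(int4 x)
  {
      int4 two = 2;   // broadcast 2 into all four lanes
      return x + two;
  }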
Sep 05 2016
On 5 September 2016 at 20:32, Johannes Pfau via Digitalmars-d <digitalmars-d puremagic.com> wrote:Am Mon, 5 Sep 2016 10:21:53 +0200 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:I have, but even just chunks() and joiner() can do the trick to a reasonable extent, but it's still not great. It's definitely not where I'd like it to be. End-users won't manually deploy these strategies correctly (or at all), I'd like to see design that enables more automatic deployment of batch processing. I treat the end-user like a javascript user; they shouldn't need to do hard work to make proper use of a lib, that's poor API offering on part of the lib author.On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:As Manu posted this question (and he's working on a color/image library) it's not hard to guess one problem is SIMD/vectorization. E.g if transform(x) => x + 2; It is faster to perfom 1 SIMD operation on 4 values instead of 4 individual adds. As autovectorization is not very powerful in current compilers I can easily imagine that complex range based examples can't compete with hand-written SIMD loops. Manu: Have you had a look at std.parallelism? I think it has some kind of parallel map which could provide some inspiration?I mostly code like this now: data.map!(x => transform(x)).copy(output); It's convenient and reads nicely, but it's generally inefficient.What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
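For reference, a rough sketch of the chunks()/joiner() approach mentioned above; transformBatch is a hypothetical slice-at-a-time function and the chunk size is arbitrary:

  import std.algorithm : copy, joiner, map;
  import std.range : chunks;

  // Hypothetical batch function: operates on a whole slice per call.
  int[] transformBatch(int[] things)
  {
      foreach (ref t; things)
          t = t * 2 + 1;  // stand-in for the real batch work
      return things;
  }

  void main()
  {
      auto data = new int[1000];
      auto output = new int[1000];

      // Hand the batch function fixed-size slices, then flatten the
      // per-chunk results back into one element stream for the pipeline.
      data.chunks(64)
          .map!(c => transformBatch(c))
          .joiner
          .copy(output);
  }

The downside is exactly the one described here: the user has to pick the chunk size and do the plumbing themselves.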
Sep 05 2016
On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote:What are the benchmarks and the numbers? What loss are you looking at? -- AndreiJust looking at the example, and referencing the map code in std.algorithm.iteration, I can see multiple function calls instead of one thanks to every indexing of the new map doing a transformation instead of caching it. I'm not sure if the lambda declaration there will result in the argument being taken by ref or by value, but let's assume by value for the sake of argument. Depending on if it's taking by value a reference or a value type, that could either be a cheap function call or an expensive one. But even if it took it by reference, it's still a function call. Function calls are generally The Devil(TM) in a gaming environment. The less you can make, the better. Random aside: There are streaming store instructions available to me on x86 platforms so that I don't have to wait for the destination to hit L1 cache before writing. The pattern Manu talks about with a batching function can better exploit this. But I imagine copy could also take advantage of this when dealing with value types.
Sep 05 2016
On 9/5/16 1:43 PM, Ethan Watson wrote:On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote:Understood. Would explicitly asking for vectorized operations be acceptable? One school of thought has it that explicit invocation of parallel operations are preferable to autovectorization and its ilk. -- AndreiWhat are the benchmarks and the numbers? What loss are you looking at? -- AndreiJust looking at the example, and referencing the map code in std.algorithm.iteration, I can see multiple function calls instead of one thanks to every indexing of the new map doing a transformation instead of caching it. I'm not sure if the lambda declaration there will result in the argument being taken by ref or by value, but let's assume by value for the sake of argument. Depending on if it's taking by value a reference or a value type, that could either be a cheap function call or an expensive one. But even if it took it by reference, it's still a function call. Function calls are generally The Devil(TM) in a gaming environment. The less you can make, the better. Random aside: There are streaming store instructions available to me on x86 platforms so that I don't have to wait for the destination to hit L1 cache before writing. The pattern Manu talks about with a batching function can better exploit this. But I imagine copy could also take advantage of this when dealing with value types.
Sep 05 2016
On 5 September 2016 at 23:38, Andrei Alexandrescu via Digitalmars-d <digitalmars-d puremagic.com> wrote:On 9/5/16 1:43 PM, Ethan Watson wrote:I still stand by this, and I listed some reasons above. Auto-vectorisation is a nice opportunistic optimisation, but it can't be relied on. The key reason is that scalar arithmetic semantics are different than vector semantics, and auto-vectorisation tends to produce a whole bunch of extra junk code to carefully (usually pointlessly) preserve the scalar semantics that it's trying to vectorise. This will never end well. But the vectorisation isn't the interesting problem here, I'm really just interested in how to work these batch-processing functions into our nice modern pipeline statements without placing an unreasonable burden on the end-user, who shouldn't be expected to go out of their way. If they even have to start manually chunking, I think we've already lost; they won't know optimal chunk-sizes, or anything about alignment boundaries, cache, etc.On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote:Understood. Would explicitly asking for vectorized operations be acceptable? One school of thought has it that explicit invocation of parallel operations are preferable to autovectorization and its ilk. -- AndreiWhat are the benchmarks and the numbers? What loss are you looking at? -- AndreiJust looking at the example, and referencing the map code in std.algorithm.iteration, I can see multiple function calls instead of one thanks to every indexing of the new map doing a transformation instead of caching it. I'm not sure if the lambda declaration there will result in the argument being taken by ref or by value, but let's assume by value for the sake of argument. Depending on if it's taking by value a reference or a value type, that could either be a cheap function call or an expensive one. But even if it took it by reference, it's still a function call. Function calls are generally The Devil(TM) in a gaming environment. The less you can make, the better. Random aside: There are streaming store instructions available to me on x86 platforms so that I don't have to wait for the destination to hit L1 cache before writing. The pattern Manu talks about with a batching function can better exploit this. But I imagine copy could also take advantage of this when dealing with value types.
Sep 05 2016
On Tuesday, 6 September 2016 at 03:08:43 UTC, Manu wrote:

I still stand by this, and I listed some reasons above. Auto-vectorisation is a nice opportunistic optimisation, but it can't be relied on. The key reason is that scalar arithmetic semantics are different than vector semantics, and auto-vectorisation tends to produce a whole bunch of extra junk code to carefully (usually pointlessly) preserve the scalar semantics that it's trying to vectorise. This will never end well. But the vectorisation isn't the interesting problem here, I'm really just interested in how to work these batch-processing functions into our nice modern pipeline statements without placing an unreasonable burden on the end-user, who shouldn't be expected to go out of their way. If they even have to start manually chunking, I think we've already lost; they won't know optimal chunk-sizes, or anything about alignment boundaries, cache, etc.

In a previous job I had successfully created a small c++ library to perform pipelined SIMD image processing. Not sure how relevant it is but thought I'd share the design here, perhaps it'll give you guys some ideas. Basically the users of this library only need to write simple kernel classes, something like this:

  // A kernel that processes 4 pixels at a time
  struct MySimpleKernel : Kernel<4>
  {
      // Tell the library the input and output type
      using InputVector = Vector<__m128, 1>;
      using OutputVector = Vector<__m128, 2>;

      template<typename T>
      OutputVector apply(const T& src)
      {
          // T will be deduced to Vector<__m128, 1>
          // which is an array of one __m128 element
          // Awesome SIMD code goes here...
          // And return the output vector
          return OutputVector(...);
      }
  };

Of course the InputVector and OutputVector do not have to be __m128, they can totally be other types like int or float. The cool thing is kernels can be chained together with >> operators. So assume we have another kernel:

  struct AnotherKernel : Kernel<3> { ... }

Then we can create a processing pipeline with these 2 kernels:

  InputBuffer(...) >> MySimpleKernel() >> AnotherKernel() >> OutputBuffer(...);

Then some template magic will figure out the LCM of the 2 kernels' pixel width is 3*4=12 and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invocation, with a main loop that reads the source buffer in 4 pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times. Any number of kernels can be chained together in this way, as long as your compiler doesn't explode. At that time, my benchmarks showed pipelines generated in this way often rival the speed of hand tuned loops.
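A rough D sketch of the same stride-matching idea, simplified to in-place kernels on a shared float buffer just to show the compile-time LCM step; the kernel names, widths and arithmetic are made up, and the real design would also need distinct input/output types and chaining:

  import std.numeric : gcd;

  // Each kernel declares how many elements it consumes per call.
  struct Mul2Kernel
  {
      enum width = 4;
      static void apply(float[] px) { foreach (ref p; px) p *= 2; }
  }

  struct Add1Kernel
  {
      enum width = 3;
      static void apply(float[] px) { foreach (ref p; px) p += 1; }
  }

  void runFused(K1, K2)(float[] buffer)
  {
      // Fused step = LCM of the two widths (here 12), computed at compile time.
      enum step = K1.width / gcd(K1.width, K2.width) * K2.width;
      size_t i = 0;
      for (; i + step <= buffer.length; i += step)
      {
          auto block = buffer[i .. i + step];
          foreach (j; 0 .. step / K1.width)           // K1 runs 3 times per block
              K1.apply(block[j * K1.width .. (j + 1) * K1.width]);
          foreach (j; 0 .. step / K2.width)           // K2 runs 4 times per block
              K2.apply(block[j * K2.width .. (j + 1) * K2.width]);
      }
      // Leftover elements (buffer.length % step) would need a tail path.
  }

With everything inlined, runFused!(Mul2Kernel, Add1Kernel)(buf) collapses into one loop over the buffer, which is the property the C++ version relies on.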
Sep 06 2016
On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:Then some template magic will figure out the LCM of the 2 kernels' pixel width is 3*4=12 and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invokation, with a main loop that reads the source buffer in 4 pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.Correction: with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
Sep 06 2016
On Tuesday, 6 September 2016 at 14:26:22 UTC, finalpatch wrote:On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:And of course the key to the speed is all function calls get inlined by the compiler.Then some template magic will figure out the LCM of the 2 kernels' pixel width is 3*4=12 and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invokation, with a main loop that reads the source buffer in 4 pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.Correction: with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
Sep 06 2016
On 7 September 2016 at 00:26, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:It's interesting thoughts. What did you do when buffers weren't multiple of the kernels?Then some template magic will figure out the LCM of the 2 kernels' pixel width is 3*4=12 and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invokation, with a main loop that reads the source buffer in 4 pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.Correction: with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
Sep 06 2016
On Tuesday, 6 September 2016 at 14:47:21 UTC, Manu wrote:The end of a scan line is special-cased. If I need 12 pixels for the last iteration but there are only 8 left, an instance of Kernel::InputVector is allocated on stack, 8 remaining pixels are memcpy'd into it then sent to the kernel. Output from the kernel is also assigned to a stack variable first, then 8 pixels are memcpy'd to the output buffer.with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.It's interesting thoughts. What did you do when buffers weren't multiple of the kernels?
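A small D sketch of that tail-handling trick; the step size, element type and kernel signature are assumptions for illustration only:

  // When fewer elements remain than one full kernel step, pad them into a
  // stack buffer, run the full-width kernel as usual, and copy back only
  // the valid part.
  void processTail(alias kernel, size_t step)(const(float)[] remaining, float[] output)
  {
      assert(remaining.length < step && output.length >= remaining.length);
      float[step] inBuf = 0;                       // zero-padded stack copy
      float[step] outBuf = void;
      inBuf[0 .. remaining.length] = remaining[];
      kernel(inBuf[], outBuf[]);                   // e.g. a (float[], float[]) kernel
      output[0 .. remaining.length] = outBuf[0 .. remaining.length];
  }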
Sep 06 2016
On 7 September 2016 at 07:11, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:On Tuesday, 6 September 2016 at 14:47:21 UTC, Manu wrote:The end of a scan line is special-cased. If I need 12 pixels for the last iteration but there are only 8 left, an instance of Kernel::InputVector is allocated on stack, 8 remaining pixels are memcpy'd into it then sent to the kernel. Output from the kernel is also assigned to a stack variable first, then 8 pixels are memcpy'd to the output buffer.Right, and this is a classic problem with this sort of function; it is only more efficient if numElements is suitably long. See, I often wonder if it would be worth being able to provide both functions, a scalar and array version, and have the algorithms select between them intelligently.with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.It's interesting thoughts. What did you do when buffers weren't multiple of the kernels?
Sep 06 2016
On Wednesday, 7 September 2016 at 00:21:23 UTC, Manu wrote:We normally process full HD or higher resolution images so the overhead of having to copy the last iteration was negligible. It was fairly easy to put together a scalar version as they are much easier to write than the SIMD ones. In fact I had scalar version for every SIMD kernel, and use them for unit testing. It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when number of elements are small, it wasn't done that way simply because we didn't need it.The end of a scan line is special cased . If I need 12 pixels for the last iteration but there are only 8 left, an instance of Kernel::InputVector is allocated on stack, 8 remaining pixels are memcpy into it then send to the kernel. Output from kernel are also assigned to a stack variable first, then memcpy 8 pixels to the output buffer.Right, and this is a classic problem with this sort of function; it is only more efficient if numElements is suitable long. See, I often wonder if it would be worth being able to provide both functions, a scalar and array version, and have the algorithms select between them intelligently.
Sep 06 2016
On 7 September 2016 at 11:04, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when number of elements are small, it wasn't done that way simply because we didn't need it.No, what's hard is working this into D's pipeline patterns seamlessly.
Sep 06 2016
On Wednesday, 7 September 2016 at 01:38:47 UTC, Manu wrote:On 7 September 2016 at 11:04, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:The lesson I learned from this is that you need the user code to provide a lot of extra information about the algorithm at compile time for the templates to work out a way to fuse pipeline stages together efficiently. I believe it is possible to get something similar in D because D has more powerful templates than C++ and D also has some type introspection which C++ lacks. Unfortunately I'm not as good on D so I can only provide some ideas rather than actual working code. Once this problem is solved, the benefit is huge. It allowed me to perform high level optimizations (streaming load/save, prefetching, dynamic dispatching depending on data alignment etc.) in the main loop which automatically benefits all kernels and pipelines.It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when number of elements are small, it wasn't done that way simply because we didn't need it.No, what's hard is working this into D's pipeline patterns seamlessly.
Sep 06 2016
On 7 September 2016 at 12:00, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:On Wednesday, 7 September 2016 at 01:38:47 UTC, Manu wrote:Exactly!On 7 September 2016 at 11:04, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:The lesson I learned from this is that you need the user code to provide a lot of extra information about the algorithm at compile time for the templates to work out a way to fuse pipeline stages together efficiently. I believe it is possible to get something similar in D because D has more powerful templates than C++ and D also has some type introspection which C++ lacks. Unfortunately I'm not as good on D so I can only provide some ideas rather than actual working code. Once this problem is solved, the benefit is huge. It allowed me to perform high level optimizations (streaming load/save, prefetching, dynamic dispatching depending on data alignment etc.) in the main loop which automatically benefits all kernels and pipelines.It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when number of elements are small, it wasn't done that way simply because we didn't need it.No, what's hard is working this into D's pipeline patterns seamlessly.
Sep 06 2016
On Wednesday, 7 September 2016 at 02:09:17 UTC, Manu wrote:I think the problem here is twofold.

First question: how do we combine pipeline stages with minimal overhead? I think the key to this problem is reliable *forceinline*. For example, take a pipeline like this: input.map!(x=>x.f1().f2().f3().store(output)); if we could make sure f1(), f2(), f3(), store(), and map() itself are all inlined, then we end up with a single loop with no function calls and the compiler is free to perform cross function optimizations. This is about as good as you can get. Unfortunately at the moment I hear it's difficult to make sure D functions get inlined.

Second question: how do we combine SIMD pipeline stages with minimal overhead? Besides reliable inlining, we also need some template code to repeat stages until their strides match. This requires details about each stage's logical unit size, input/output type and size at compile time. I can't think of what the interface of this would look like but the current map!() is likely insufficient to support this.

I still don't believe auto-selecting between scalar and vector paths would be a very useful feature. Normally I would only consider a SIMD solution when I know in advance that this is a performance hotspot. When the amount of data is small I simply don't care about performance and would just choose whatever is the simplest way to do it, like map!(), because the performance impact is not noticeable and definitely not worth the increased complexity.

The lesson I learned from this is that you need the user code to provide a lot of extra information about the algorithm at compile time for the templates to work out a way to fuse pipeline stages together efficiently. I believe it is possible to get something similar in D because D has more powerful templates than C++ and D also has some type introspection which C++ lacks. Unfortunately I'm not as good on D so I can only provide some ideas rather than actual working code. Once this problem is solved, the benefit is huge. It allowed me to perform high level optimizations (streaming load/save, prefetching, dynamic dispatching depending on data alignment etc.) in the main loop which automatically benefits all kernels and pipelines.Exactly!
Sep 07 2016
On Wednesday, 7 September 2016 at 10:31:13 UTC, finalpatch wrote:I think the problem here is two fold. First question, how do we combine pipeline stages with minimal overhead I think the key to this problem is reliable *forceinline* for example, a pipeline like this input.map!(x=>x.f1().f2().f3().store(output)); if we could make sure f1(), f2(), f3(), store(), and map() itself are all inlined, then we end up with a single loop with no function calls and the compiler is free to perform cross function optimizations. This is about as good as you can get. Unfortunately at the moment I hear it's difficult to make sure D functions get inlined.If the compiler is unable to inline (or wrongly decides it is too costly), I'd consider this a compiler bug. Of course, sometimes workarounds like `pragma(inline, true)` or ` forceinline` might be needed from time to time in practice, but they shouldn't influence the design of the pipeline interface.Second question, how do we combine SIMD pipeline stages with minimal overhead Besides reliable inlining, we also need some template code to repeat stages until their strides match. This requires details about each stage's logical unit size, input/output type and size at compile time. I can't think of what the interface of this would look like but the current map!() is likely insufficient to support this.Would a `vectorize` range adapter be feasible that prepares the input to make it SIMD compatible? That is, force alignment, process left-over elements at the end, etc.? As far as I understand, the problems with auto vectorization stem from a difficulty of compilers to recognize vectorizing opportunities, and (as Manu described) from incompatible semantics of scalar and vector types that the compiler needs to preserve. But if that hypothetical `vectorize` helper forces the input data into one of a handful of well-known formats and types, wouldn't it be possible to make the compilers recognize those (especially if they are accompanied by suitable pragma or other compiler hints)?I still don't believe auto-select between scalar or vector paths would be a very useful feature. Normally I would only consider SIMD solution when I know in advance that this is a performance hotspot. When the amount of data is small I simply don't care about performance and would just choose whatever simplest way to do it, like map!(), because the performance impact is not noticeable and definitely not worth the increased complexity.In the above scenario, you can add `.vectorize` to the pipeline to enable vectorizing wherever you need it.
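For what it's worth, here is a rough D sketch of what such a `vectorize` helper could look like; the name, struct layout and float-only restriction are all illustrative, not an existing Phobos API:

  import core.simd;

  // Split a float[] into an unaligned scalar head, a 16-byte-aligned SIMD
  // body, and a scalar tail, so a kernel can run float4 code on the middle
  // and plain scalar code on the edges.
  struct VectorizedView
  {
      float[]  head;   // elements before the first aligned vector
      float4[] body_;  // aligned, full 4-wide vectors
      float[]  tail;   // leftover elements after the last full vector
  }

  VectorizedView vectorize(float[] data)
  {
      enum vecLen   = 4;
      enum vecBytes = float4.sizeof;  // 16
      auto addr     = cast(size_t) data.ptr;
      size_t headLen = ((vecBytes - addr % vecBytes) % vecBytes) / float.sizeof;
      if (headLen > data.length) headLen = data.length;
      auto rest     = data[headLen .. $];
      auto vecCount = rest.length / vecLen;

      VectorizedView v;
      v.head  = data[0 .. headLen];
      v.body_ = (cast(float4*) rest.ptr)[0 .. vecCount];
      v.tail  = rest[vecCount * vecLen .. $];
      return v;
  }

The caller loops over head and tail with the scalar kernel and over body_ with the float4 kernel, which is essentially the "check alignment up front and pick a specialization of the main loop" approach described in the reply.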
Sep 07 2016
On Wednesday, 7 September 2016 at 11:53:00 UTC, Marc Schütz wrote:Would a `vectorize` range adapter be feasible that prepares the input to make it SIMD compatible? That is, force alignment, process left-over elements at the end, etc.? As far as I understand, the problems with auto vectorization stem from a difficulty of compilers to recognize vectorizing opportunities, and (as Manu described) from incompatible semantics of scalar and vector types that the compiler needs to preserve. But if that hypothetical `vectorize` helper forces the input data into one of a handful of well-known formats and types, wouldn't it be possible to make the compilers recognize those (especially if they are accompanied by suitable pragma or other compiler hints)?Contrary to popular belief, alignment is not a showstopper for SIMD code. Both Intel and ARM processors have instructions to access data from unaligned addresses. And on Intel processors, there is not even any speed penalty for using them on aligned addresses. Which means you can either forget it (on Intel) or just check the data alignment before you start and choose an optimal specialization of the main loop. However regarding auto vectorization, I'm with Manu. I won't put my bet on auto vectorization because I have never seen any non-trivial auto vectorized code that comes even close to hand tuned SIMD code. The compiler always has to play conservatively. The compiler has no idea that you are only using 10 bits of each 16-bit component in a vector. It can't even help you shuffle RGBARGBARGBARGBA into RRRRGGGGBBBBAAAA. The best we can do is to create something that makes writing SIMD kernels easy/reusable/composable.
Sep 07 2016
On 5 September 2016 at 18:21, Andrei Alexandrescu via Digitalmars-d <digitalmars-d puremagic.com> wrote:On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:Well, it totally depends. Like right now in my case, 'transform' is some image processing code (in the past when I've had these same thoughts, it has been audio filters). You can't touch pixels (or samples) one at a time. They need manual SIMD deployment (I've never seen an auto-vectoriser handle saturation arithmetic, or type promotion), alpha components (every 4th byte) are treated differently, memory access patterns need to be tuned to be cache friendly. I haven't done benchmarks right now, but I've done them professionally in the past, and it's not unusual to expect a hand-written image processing loop to see 1 or even 2 orders of magnitude improvement compared to calling a function for each pixel in a loop. The sorts of low-level optimisations you deploy in image and audio processing loops are not things I've ever seen any optimiser even attempt.

Some core problems that tend to require manual intervention in hot-loops are:
- ubyte[16] <-> ushort[8][2] expansion/contraction
- ubyte[16] <-> float[4][4] expansion/contraction
- saturation
- scalar operator results promote to int, but wide-simd operations don't, which means some scalar expressions can't be losslessly collapsed to simd operations, and the compiler will always be conservative on this matter. If the auto-vectoriser tries at all, you will see a mountain of extra code to preserve those bits that the scalar operator semantics would have guaranteed
- wide-vector multiplication is semantically different than scalar multiplication, so the optimiser has a lot of trouble vectorising mul's
- assumptions about data alignment
- interleaved data; audio samples are usually [L,R] interleaved, images often [RGB,A], and different processes are applied across the separation. You want to unroll and shuffle the data so you have vectors [LLLL],[RRRR], or [RGBRGBRGBRGB],[AAAA], and I haven't seen an optimiser go near that
- vector dot-product is always a nuisance

I could go on and on. The point is, as an end-user, pipeline API's are great. As a library author, I want to present the best performing library I can, which I think means we need to find a way to conveniently connect these 2 currently disconnected worlds. I've explored to some extent, but I've never come up with anything that I like.

I mostly code like this now: data.map!(x => transform(x)).copy(output); It's convenient and reads nicely, but it's generally inefficient.What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
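The integer-promotion point in miniature (a tiny self-contained example, not taken from the thread):

  void main()
  {
      ubyte a = 200, b = 100;
      int   promoted = a + b;               // 300: scalar ubyte math is done in int
      ubyte wrapped  = cast(ubyte)(a + b);  // 44: what a 16-lane ubyte add would give per lane
      assert(promoted == 300 && wrapped == 44);
  }

An auto-vectoriser that wants to rewrite the scalar expression as a ubyte16 add has to prove the wrap-around can never be observed, or emit widening code to preserve the scalar result, which is where the "mountain of extra code" comes from.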
Sep 05 2016
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:A central premise of performance-oriented programming which I've employed my entire career, is "where there is one, there is probably many", and if you do something to one, you should do it to many. With this in mind, the code I have always written doesn't tend to look like this: R manipulate(Thing thing); Instead: void manipulateThings(Thing *things, size_t numThings, R *output, size_t outputLen); Written this way for clarity. Obviously, the D equiv uses slices.To chime in, processing in chunks is indeed absolutely key for audio performance. Processing by element is only tolerable for prototyping. I don't even use slices but usually pointers and length, since most slice lengths would be the same.

  void processSamples(float* input, float* output, int samples);

Some kind of buffered input and output range concept would be needed there if a lazy library was to be made. Unfortunately for this case inputs and outputs are rather diverse. Can't wait for a more formalized DbI introduction. As for graphics, ae.utils.graphics usually gives you a whole line of pixels.
Sep 05 2016
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:I mostly code like this now: data.map!(x => transform(x)).copy(output);So you basically want to make the lazy computation eager and store the result? data.map!(x => transform(x)).array will allocate a new array and fill it with the result of map. And if you want to recycle the buffer I guess writing a buffer function would be trivial.
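A sketch of such a buffer-recycling helper; mapInto is a hypothetical name, not a Phobos function:

  import std.algorithm : copy, map;

  // Map into a caller-supplied buffer instead of allocating with .array.
  T[] mapInto(alias fun, T)(const(T)[] input, T[] buffer)
  {
      assert(buffer.length >= input.length);
      auto leftover = input.map!fun.copy(buffer);  // copy returns the unfilled part
      return buffer[0 .. buffer.length - leftover.length];
  }

  unittest
  {
      int[4] buf;
      assert(mapInto!(x => x * 2)([1, 2, 3], buf[]) == [2, 4, 6]);
  }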
Sep 06 2016
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:A central premise of performance-oriented programming which I've employed my entire career, is "where there is one, there is probably many", and if you do something to one, you should do it to many.From a conceptual standpoint, this sounds like the sort of thing array languages like APL and J thrive on, so there's solid precedent for the concept. I might suggest looking into optimising compilers in that space for inspiration and such; APEX, for example: http://www.snakeisland.com/apexup.htm Of course, this comes with the caveat that this is (still!) some relatively heavily-academic stuff. And I'm not sure to what extent that can help mitigate the problem of relaxing type requirements such that you can e.g. efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors. -Wyatt
Sep 06 2016
On 7 September 2016 at 01:54, Wyatt via Digitalmars-d <digitalmars-d puremagic.com> wrote:On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:Thanks, that's really interesting, I'll check it out.A central premise of performance-oriented programming which I've employed my entire career, is "where there is one, there is probably many", and if you do something to one, you should do it to many.From a conceptual standpoint, this sounds like the sort of thing array languages like APL and J thrive on, so there's solid precedent for the concept. I might suggest looking into optimising compilers in that space for inspiration and such; APEX, for example: http://www.snakeisland.com/apexup.htmOf course, this comes with the caveat that this is (still!) some relatively heavily-academic stuff. And I'm not sure to what extent that can help mitigate the problem of relaxing type requirements such that you can e.g. efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors.That's not what I want though. I intend to hand-write that function (I was just giving examples of how auto-vectorisation almost always fails), the question here is, how to work that new array function into our pipelines transparently...
Sep 06 2016
On Wednesday, 7 September 2016 at 00:18:59 UTC, Manu wrote:On 7 September 2016 at 01:54, Wyatt via Digitalmars-d <digitalmars-d puremagic.com> wrote: Thanks, that's really interesting, I'll check it out.Here's some work on static rank polymorphism that might also be applicable?: http://www.ccs.neu.edu/home/pete/pub/esop-2014.pdf And in the Related Work, I just noticed Halide, which sounds like it's right up your alley: http://halide-lang.org/Ah, I misunderstood. Sorry. I had the impression that you wanted to be able to simply write: data.map!(x => transform(x)).copy(output); ...for any data[] and have it lift the transformation to the whole vector. If you're doing the work, I'm curious what you're hoping the end result to look like in terms of the code you want to be able to write. Just a doodle is fine, it doesn't have to stand up to scrutiny. -WyattOf course, this comes with the caveat that this is (still!) some relatively heavily-academic stuff. And I'm not sure to what extent that can help mitigate the problem of relaxing type requirements such that you can e.g. efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors.That's not what I want though. I intend to hand-write that function (I was just giving examples of how auto-vectorisation almost always fails), the question here is, how to work that new array function into our pipelines transparently...
Sep 07 2016