digitalmars.D - Taking pipeline processing to the next level
- Manu via Digitalmars-d (29/29) Sep 04 2016 I mostly code like this now:
- rikki cattermole (13/42) Sep 04 2016 Just a random idea:
- Andrei Alexandrescu (3/6) Sep 05 2016 What are the benchmarks and the numbers? What loss are you looking at?
- Johannes Pfau (11/19) Sep 05 2016 As Manu posted this question (and he's working on a color/image library)
- Manu via Digitalmars-d (9/28) Sep 05 2016 I have, but even just chunks() and joiner() can do the trick to a
- Ethan Watson (19/21) Sep 05 2016 Just looking at the example, and referencing the map code in
- Andrei Alexandrescu (5/24) Sep 05 2016 Understood. Would explicitly asking for vectorized operations be
- Manu via Digitalmars-d (16/45) Sep 05 2016 I still stand by this, and I listed some reasons above.
- finalpatch (45/66) Sep 06 2016 In a previous job I had successfully created a small c++ library
- finalpatch (5/11) Sep 06 2016 Correction:
- finalpatch (3/14) Sep 06 2016 And of course the key to the speed is all function calls get
- Manu via Digitalmars-d (4/14) Sep 06 2016 It's interesting thoughts. What did you do when buffers weren't
- finalpatch (7/12) Sep 06 2016 The end of a scan line is special cased . If I need 12 pixels for
- Manu via Digitalmars-d (7/19) Sep 06 2016 Right, and this is a classic problem with this sort of function; it is
- finalpatch (9/23) Sep 06 2016 We normally process full HD or higher resolution images so the
- Manu via Digitalmars-d (3/6) Sep 06 2016 No, what's hard is working this into D's pipeline patterns seamlessly.
- finalpatch (15/24) Sep 06 2016 The lesson I learned from this is that you need the user code to
- Manu via Digitalmars-d (3/26) Sep 06 2016 Exactly!
- finalpatch (28/45) Sep 07 2016 I think the problem here is two fold.
- Marc Schütz (19/47) Sep 07 2016 If the compiler is unable to inline (or wrongly decides it is too
- finalpatch (17/29) Sep 07 2016 Contrary to popular belief, alignment is not a showstopper of
- Manu via Digitalmars-d (42/50) Sep 05 2016 Well, it totally depends. Like right now in my case, 'transform' is
- Guillaume Piolat (12/26) Sep 05 2016 To chime in, processing in chunks is indeed absolutely key for
- Jerry (7/9) Sep 06 2016 So you basicly want to make the lazy computation eager and store
- Wyatt (12/16) Sep 06 2016 From a conceptual standpoint, this sounds like the sort of thing
- Manu via Digitalmars-d (7/22) Sep 06 2016 Thanks, that's really interesting, I'll check it out.
- Wyatt (16/31) Sep 07 2016 Here's some work on static rank polymorphism that might also be
I mostly code like this now:

  data.map!(x => transform(x)).copy(output);

It's convenient and reads nicely, but it's generally inefficient. This sort of one-by-one software design is the core performance problem with OOP. It seems a shame to be suffering OOP's failures even when there is no OOP in sight.

A central premise of performance-oriented programming which I've employed my entire career is "where there is one, there is probably many", and if you do something to one, you should do it to many. With this in mind, the code I have always written doesn't tend to look like this:

  R manipulate(Thing thing);

Instead:

  void manipulateThings(Thing *things, size_t numThings, R *output, size_t outputLen);

Written this way for clarity. Obviously, the D equiv uses slices.

All functions are implemented with the presumption they will operate on many things, rather than being called many times for each one. This is the single most successful design pattern I have ever encountered wrt high-performance code; ie, implement the array version first.

The problem with this API design is that it doesn't plug into algorithms or generic code well:

  data.map!(x => transformThings(&x, 1)).copy(output);

I often wonder how we can integrate this design principle conveniently (ie, seamlessly) into the design of algorithms, such that they can make use of batching functions internally and transparently. Has anyone done any work in this area? Ideas?
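A minimal D rendering of the slice-based, batch-first signature described above; this is only a sketch, and Thing, R and the arithmetic are placeholders rather than names from any real library:

  // Placeholder types standing in for Thing and R:
  struct Thing { float value; }
  alias R = float;

  // Batch-first design: the array version is the primitive.
  void transformThings(const(Thing)[] things, R[] output)
  {
      assert(output.length >= things.length);
      foreach (i, ref t; things)
          output[i] = t.value * 2;  // stand-in for the real per-element work
  }

  // The one-at-a-time version, if needed at all, is a trivial wrapper:
  R transformOne(Thing t)
  {
      R[1] tmp;
      transformThings((&t)[0 .. 1], tmp[]);
      return tmp[0];
  }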
Sep 04 2016
On 05/09/2016 5:08 PM, Manu via Digitalmars-d wrote:

I mostly code like this now: data.map!(x => transform(x)).copy(output); It's convenient and reads nicely, but it's generally inefficient. This sort of one-by-one software design is the core performance problem with OOP. It seems a shame to be suffering OOP's failures even when there is no OOP in sight. A central premise of performance-oriented programming which I've employed my entire career, is "where there is one, there is probably many", and if you do something to one, you should do it to many. With this in mind, the code I have always written doesn't tend to look like this: R manipulate(Thing thing); Instead: void manipulateThings(Thing *things, size_t numThings, R *output, size_t outputLen); Written this way for clarity. Obviously, the D equiv uses slices. All functions are implemented with the presumption they will operate on many things, rather than being called many times for each one. This is the single most successful design pattern I have ever encountered wrt high-performance code; ie, implement the array version first. The problem with this API design, is that it doesn't plug into algorithms or generic code well. data.map!(x => transformThings(&x, 1)).copy(output); I often wonder how we can integrate this design principle conveniently (ie, seamlessly) into the design of algorithms, such that they can make use of batching functions internally, and transparently? Has anyone done any work in this area? Ideas?

Just a random idea:

  import std.array : front, popFront, empty;
  import std.range : ElementType, isInputRange;

  int[] transformer(I)(I from, int[] buffer)
      if (is(ElementType!I == int) && isInputRange!I)
  {
      size_t used;
      // ... transformation algo
      return buffer[0 .. used];
  }

  auto got = input.transformer(buffer).usage;

Input range instead of straight array being passed in so it works on pretty much any input arrays or input ranges.
Sep 04 2016
On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:I mostly code like this now: data.map!(x => transform(x)).copy(output); It's convenient and reads nicely, but it's generally inefficient.What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
Sep 05 2016
On Mon, 5 Sep 2016 10:21:53 +0200, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:As Manu posted this question (and he's working on a color/image library) it's not hard to guess one problem is SIMD/vectorization. E.g. if transform(x) => x + 2; It is faster to perform 1 SIMD operation on 4 values instead of 4 individual adds. As autovectorization is not very powerful in current compilers I can easily imagine that complex range based examples can't compete with hand-written SIMD loops. Manu: Have you had a look at std.parallelism? I think it has some kind of parallel map which could provide some inspiration?I mostly code like this now: data.map!(x => transform(x)).copy(output); It's convenient and reads nicely, but it's generally inefficient.What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
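To make that concrete, a tiny core.simd sketch of the same add, assuming an x86_64 target where 128-bit integer vectors are available; the open question in this thread is the pipeline plumbing around such kernels, not the kernel itself:

  import core.simd;

  // Scalar: one add per element.
  int addTwo(int x) { return x + 2; }

  // SIMD: four adds per call.
  int4 addTwo4(int4 x)
  {
      int4 two = 2;   // broadcast 2 into all four lanes
      return x + two;
  }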
Sep 05 2016
On 5 September 2016 at 20:32, Johannes Pfau via Digitalmars-d <digitalmars-d puremagic.com> wrote:Am Mon, 5 Sep 2016 10:21:53 +0200 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:I have, but even just chunks() and joiner() can do the trick to a reasonable extent, but it's still not great. It's definitely not where I'd like it to be. End-users won't manually deploy these strategies correctly (or at all), I'd like to see design that enables more automatic deployment of batch processing. I treat the end-user like a javascript user; they shouldn't need to do hard work to make proper use of a lib, that's poor API offering on part of the lib author.On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:As Manu posted this question (and he's working on a color/image library) it's not hard to guess one problem is SIMD/vectorization. E.g if transform(x) => x + 2; It is faster to perfom 1 SIMD operation on 4 values instead of 4 individual adds. As autovectorization is not very powerful in current compilers I can easily imagine that complex range based examples can't compete with hand-written SIMD loops. Manu: Have you had a look at std.parallelism? I think it has some kind of parallel map which could provide some inspiration?I mostly code like this now: data.map!(x => transform(x)).copy(output); It's convenient and reads nicely, but it's generally inefficient.What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
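For reference, a rough sketch of the chunks()/joiner() approach mentioned above; transformBatch is a hypothetical slice-at-a-time function and the chunk size is arbitrary:

  import std.algorithm : copy, joiner, map;
  import std.range : chunks;

  // Hypothetical batch function: operates on a whole slice per call.
  int[] transformBatch(int[] things)
  {
      foreach (ref t; things)
          t = t * 2 + 1;  // stand-in for the real batch work
      return things;
  }

  void main()
  {
      auto data = new int[1000];
      auto output = new int[1000];

      // Hand the batch function fixed-size slices, then flatten the
      // per-chunk results back into one element stream for the pipeline.
      data.chunks(64)
          .map!(c => transformBatch(c))
          .joiner
          .copy(output);
  }

The downside is exactly the one described here: the user has to pick the chunk size and do the plumbing themselves.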
Sep 05 2016
On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote:What are the benchmarks and the numbers? What loss are you looking at? -- AndreiJust looking at the example, and referencing the map code in std.algorithm.iteration, I can see multiple function calls instead of one thanks to every indexing of the new map doing a transformation instead of caching it. I'm not sure if the lambda declaration there will result in the argument being taken by ref or by value, but let's assume by value for the sake of argument. Depending on if it's taking by value a reference or a value type, that could either be a cheap function call or an expensive one. But even if it took it by reference, it's still a function call. Function calls are generally The Devil(TM) in a gaming environment. The less you can make, the better. Random aside: There are streaming store instructions available to me on x86 platforms so that I don't have to wait for the destination to hit L1 cache before writing. The pattern Manu talks about with a batching function can better exploit this. But I imagine copy could also take advantage of this when dealing with value types.
Sep 05 2016
On 9/5/16 1:43 PM, Ethan Watson wrote:On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote:Understood. Would explicitly asking for vectorized operations be acceptable? One school of thought has it that explicit invocation of parallel operations are preferable to autovectorization and its ilk. -- AndreiWhat are the benchmarks and the numbers? What loss are you looking at? -- AndreiJust looking at the example, and referencing the map code in std.algorithm.iteration, I can see multiple function calls instead of one thanks to every indexing of the new map doing a transformation instead of caching it. I'm not sure if the lambda declaration there will result in the argument being taken by ref or by value, but let's assume by value for the sake of argument. Depending on if it's taking by value a reference or a value type, that could either be a cheap function call or an expensive one. But even if it took it by reference, it's still a function call. Function calls are generally The Devil(TM) in a gaming environment. The less you can make, the better. Random aside: There are streaming store instructions available to me on x86 platforms so that I don't have to wait for the destination to hit L1 cache before writing. The pattern Manu talks about with a batching function can better exploit this. But I imagine copy could also take advantage of this when dealing with value types.
Sep 05 2016
On 5 September 2016 at 23:38, Andrei Alexandrescu via Digitalmars-d <digitalmars-d puremagic.com> wrote:On 9/5/16 1:43 PM, Ethan Watson wrote:I still stand by this, and I listed some reasons above. Auto-vectorisation is a nice opportunistic optimisation, but it can't be relied on. The key reason is that scalar arithmetic semantics are different than vector semantics, and auto-vectorisation tends to produce a whole bunch of extra junk code to carefully (usually pointlessly) preserve the scalar semantics that it's trying to vectorise. This will never end well. But the vectorisation isn't the interesting problem here, I'm really just interested in how to work these batch-processing functions into our nice modern pipeline statements without placing an unreasonable burden on the end-user, who shouldn't be expected to go out of their way. If they even have to start manually chunking, I think we've already lost; they won't know optimal chunk-sizes, or anything about alignment boundaries, cache, etc.On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote:Understood. Would explicitly asking for vectorized operations be acceptable? One school of thought has it that explicit invocation of parallel operations are preferable to autovectorization and its ilk. -- AndreiWhat are the benchmarks and the numbers? What loss are you looking at? -- AndreiJust looking at the example, and referencing the map code in std.algorithm.iteration, I can see multiple function calls instead of one thanks to every indexing of the new map doing a transformation instead of caching it. I'm not sure if the lambda declaration there will result in the argument being taken by ref or by value, but let's assume by value for the sake of argument. Depending on if it's taking by value a reference or a value type, that could either be a cheap function call or an expensive one. But even if it took it by reference, it's still a function call. Function calls are generally The Devil(TM) in a gaming environment. The less you can make, the better. Random aside: There are streaming store instructions available to me on x86 platforms so that I don't have to wait for the destination to hit L1 cache before writing. The pattern Manu talks about with a batching function can better exploit this. But I imagine copy could also take advantage of this when dealing with value types.
Sep 05 2016
On Tuesday, 6 September 2016 at 03:08:43 UTC, Manu wrote:

I still stand by this, and I listed some reasons above. Auto-vectorisation is a nice opportunistic optimisation, but it can't be relied on. The key reason is that scalar arithmetic semantics are different than vector semantics, and auto-vectorisation tends to produce a whole bunch of extra junk code to carefully (usually pointlessly) preserve the scalar semantics that it's trying to vectorise. This will never end well. But the vectorisation isn't the interesting problem here, I'm really just interested in how to work these batch-processing functions into our nice modern pipeline statements without placing an unreasonable burden on the end-user, who shouldn't be expected to go out of their way. If they even have to start manually chunking, I think we've already lost; they won't know optimal chunk-sizes, or anything about alignment boundaries, cache, etc.

In a previous job I had successfully created a small c++ library to perform pipelined SIMD image processing. Not sure how relevant it is but thought I'd share the design here, perhaps it'll give you guys some ideas. Basically the users of this library only need to write simple kernel classes, something like this:

  // A kernel that processes 4 pixels at a time
  struct MySimpleKernel : Kernel<4>
  {
      // Tell the library the input and output type
      using InputVector = Vector<__m128, 1>;
      using OutputVector = Vector<__m128, 2>;

      template<typename T>
      OutputVector apply(const T& src)
      {
          // T will be deduced to Vector<__m128, 1>
          // which is an array of one __m128 element
          // Awesome SIMD code goes here...
          // And return the output vector
          return OutputVector(...);
      }
  };

Of course the InputVector and OutputVector do not have to be __m128, they can totally be other types like int or float. The cool thing is kernels can be chained together with >> operators. So assume we have another kernel:

  struct AnotherKernel : Kernel<3> { ... }

Then we can create a processing pipeline with these 2 kernels:

  InputBuffer(...) >> MySimpleKernel() >> AnotherKernel() >> OutputBuffer(...);

Then some template magic will figure out the LCM of the 2 kernels' pixel width is 3*4=12 and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invocation, with a main loop that reads the source buffer in 4 pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times. Any number of kernels can be chained together in this way, as long as your compiler doesn't explode. At that time, my benchmarks showed pipelines generated in this way often rival the speed of hand tuned loops.
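A rough D sketch of the same stride-matching idea, simplified to in-place kernels on a shared float buffer just to show the compile-time LCM step; the kernel names, widths and arithmetic are made up, and the real design would also need distinct input/output types and chaining:

  import std.numeric : gcd;

  // Each kernel declares how many elements it consumes per call.
  struct Mul2Kernel
  {
      enum width = 4;
      static void apply(float[] px) { foreach (ref p; px) p *= 2; }
  }

  struct Add1Kernel
  {
      enum width = 3;
      static void apply(float[] px) { foreach (ref p; px) p += 1; }
  }

  void runFused(K1, K2)(float[] buffer)
  {
      // Fused step = LCM of the two widths (here 12), computed at compile time.
      enum step = K1.width / gcd(K1.width, K2.width) * K2.width;
      size_t i = 0;
      for (; i + step <= buffer.length; i += step)
      {
          auto block = buffer[i .. i + step];
          foreach (j; 0 .. step / K1.width)           // K1 runs 3 times per block
              K1.apply(block[j * K1.width .. (j + 1) * K1.width]);
          foreach (j; 0 .. step / K2.width)           // K2 runs 4 times per block
              K2.apply(block[j * K2.width .. (j + 1) * K2.width]);
      }
      // Leftover elements (buffer.length % step) would need a tail path.
  }

With everything inlined, runFused!(Mul2Kernel, Add1Kernel)(buf) collapses into one loop over the buffer, which is the property the C++ version relies on.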
Sep 06 2016
On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:Then some template magic will figure out the LCM of the 2 kernels' pixel width is 3*4=12 and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invokation, with a main loop that reads the source buffer in 4 pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.Correction: with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
Sep 06 2016
On Tuesday, 6 September 2016 at 14:26:22 UTC, finalpatch wrote:On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:And of course the key to the speed is all function calls get inlined by the compiler.Then some template magic will figure out the LCM of the 2 kernels' pixel width is 3*4=12 and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invokation, with a main loop that reads the source buffer in 4 pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.Correction: with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
Sep 06 2016
On 7 September 2016 at 00:26, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:It's interesting thoughts. What did you do when buffers weren't multiple of the kernels?Then some template magic will figure out the LCM of the 2 kernels' pixel width is 3*4=12 and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invokation, with a main loop that reads the source buffer in 4 pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.Correction: with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
Sep 06 2016
On Tuesday, 6 September 2016 at 14:47:21 UTC, Manu wrote:The end of a scan line is special-cased. If I need 12 pixels for the last iteration but there are only 8 left, an instance of Kernel::InputVector is allocated on stack, 8 remaining pixels are memcpy'd into it then sent to the kernel. Output from the kernel is also assigned to a stack variable first, then 8 pixels are memcpy'd to the output buffer.with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.It's interesting thoughts. What did you do when buffers weren't multiple of the kernels?
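A small D sketch of that tail-handling trick; the step size, element type and kernel signature are assumptions for illustration only:

  // When fewer elements remain than one full kernel step, pad them into a
  // stack buffer, run the full-width kernel as usual, and copy back only
  // the valid part.
  void processTail(alias kernel, size_t step)(const(float)[] remaining, float[] output)
  {
      assert(remaining.length < step && output.length >= remaining.length);
      float[step] inBuf = 0;                       // zero-padded stack copy
      float[step] outBuf = void;
      inBuf[0 .. remaining.length] = remaining[];
      kernel(inBuf[], outBuf[]);                   // e.g. a (float[], float[]) kernel
      output[0 .. remaining.length] = outBuf[0 .. remaining.length];
  }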
Sep 06 2016
On 7 September 2016 at 07:11, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:On Tuesday, 6 September 2016 at 14:47:21 UTC, Manu wrote:The end of a scan line is special-cased. If I need 12 pixels for the last iteration but there are only 8 left, an instance of Kernel::InputVector is allocated on stack, 8 remaining pixels are memcpy'd into it then sent to the kernel. Output from the kernel is also assigned to a stack variable first, then 8 pixels are memcpy'd to the output buffer.Right, and this is a classic problem with this sort of function; it is only more efficient if numElements is suitably long. See, I often wonder if it would be worth being able to provide both functions, a scalar and array version, and have the algorithms select between them intelligently.with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.It's interesting thoughts. What did you do when buffers weren't multiple of the kernels?
Sep 06 2016
On Wednesday, 7 September 2016 at 00:21:23 UTC, Manu wrote:We normally process full HD or higher resolution images so the overhead of having to copy the last iteration was negligible. It was fairly easy to put together a scalar version as they are much easier to write than the SIMD ones. In fact I had scalar version for every SIMD kernel, and use them for unit testing. It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when number of elements are small, it wasn't done that way simply because we didn't need it.The end of a scan line is special cased . If I need 12 pixels for the last iteration but there are only 8 left, an instance of Kernel::InputVector is allocated on stack, 8 remaining pixels are memcpy into it then send to the kernel. Output from kernel are also assigned to a stack variable first, then memcpy 8 pixels to the output buffer.Right, and this is a classic problem with this sort of function; it is only more efficient if numElements is suitable long. See, I often wonder if it would be worth being able to provide both functions, a scalar and array version, and have the algorithms select between them intelligently.
Sep 06 2016
On 7 September 2016 at 11:04, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when number of elements are small, it wasn't done that way simply because we didn't need it.No, what's hard is working this into D's pipeline patterns seamlessly.
Sep 06 2016
On Wednesday, 7 September 2016 at 01:38:47 UTC, Manu wrote:On 7 September 2016 at 11:04, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:The lesson I learned from this is that you need the user code to provide a lot of extra information about the algorithm at compile time for the templates to work out a way to fuse pipeline stages together efficiently. I believe it is possible to get something similar in D because D has more powerful templates than C++ and D also has some type introspection which C++ lacks. Unfortunately I'm not as good on D so I can only provide some ideas rather than actual working code. Once this problem is solved, the benefit is huge. It allowed me to perform high level optimizations (streaming load/save, prefetching, dynamic dispatching depending on data alignment etc.) in the main loop which automatically benefits all kernels and pipelines.It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when number of elements are small, it wasn't done that way simply because we didn't need it.No, what's hard is working this into D's pipeline patterns seamlessly.
Sep 06 2016
On 7 September 2016 at 12:00, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:On Wednesday, 7 September 2016 at 01:38:47 UTC, Manu wrote:Exactly!On 7 September 2016 at 11:04, finalpatch via Digitalmars-d <digitalmars-d puremagic.com> wrote:The lesson I learned from this is that you need the user code to provide a lot of extra information about the algorithm at compile time for the templates to work out a way to fuse pipeline stages together efficiently. I believe it is possible to get something similar in D because D has more powerful templates than C++ and D also has some type introspection which C++ lacks. Unfortunately I'm not as good on D so I can only provide some ideas rather than actual working code. Once this problem is solved, the benefit is huge. It allowed me to perform high level optimizations (streaming load/save, prefetching, dynamic dispatching depending on data alignment etc.) in the main loop which automatically benefits all kernels and pipelines.It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when number of elements are small, it wasn't done that way simply because we didn't need it.No, what's hard is working this into D's pipeline patterns seamlessly.
Sep 06 2016
On Wednesday, 7 September 2016 at 02:09:17 UTC, Manu wrote:I think the problem here is twofold.

First question: how do we combine pipeline stages with minimal overhead? I think the key to this problem is reliable *forceinline*. For example, take a pipeline like this: input.map!(x=>x.f1().f2().f3().store(output)); if we could make sure f1(), f2(), f3(), store(), and map() itself are all inlined, then we end up with a single loop with no function calls and the compiler is free to perform cross function optimizations. This is about as good as you can get. Unfortunately at the moment I hear it's difficult to make sure D functions get inlined.

Second question: how do we combine SIMD pipeline stages with minimal overhead? Besides reliable inlining, we also need some template code to repeat stages until their strides match. This requires details about each stage's logical unit size, input/output type and size at compile time. I can't think of what the interface of this would look like but the current map!() is likely insufficient to support this.

I still don't believe auto-selecting between scalar and vector paths would be a very useful feature. Normally I would only consider a SIMD solution when I know in advance that this is a performance hotspot. When the amount of data is small I simply don't care about performance and would just choose whatever is the simplest way to do it, like map!(), because the performance impact is not noticeable and definitely not worth the increased complexity.

The lesson I learned from this is that you need the user code to provide a lot of extra information about the algorithm at compile time for the templates to work out a way to fuse pipeline stages together efficiently. I believe it is possible to get something similar in D because D has more powerful templates than C++ and D also has some type introspection which C++ lacks. Unfortunately I'm not as good on D so I can only provide some ideas rather than actual working code. Once this problem is solved, the benefit is huge. It allowed me to perform high level optimizations (streaming load/save, prefetching, dynamic dispatching depending on data alignment etc.) in the main loop which automatically benefits all kernels and pipelines.Exactly!
Sep 07 2016
On Wednesday, 7 September 2016 at 10:31:13 UTC, finalpatch wrote:I think the problem here is two fold. First question, how do we combine pipeline stages with minimal overhead I think the key to this problem is reliable *forceinline* for example, a pipeline like this input.map!(x=>x.f1().f2().f3().store(output)); if we could make sure f1(), f2(), f3(), store(), and map() itself are all inlined, then we end up with a single loop with no function calls and the compiler is free to perform cross function optimizations. This is about as good as you can get. Unfortunately at the moment I hear it's difficult to make sure D functions get inlined.If the compiler is unable to inline (or wrongly decides it is too costly), I'd consider this a compiler bug. Of course, sometimes workarounds like `pragma(inline, true)` or ` forceinline` might be needed from time to time in practice, but they shouldn't influence the design of the pipeline interface.Second question, how do we combine SIMD pipeline stages with minimal overhead Besides reliable inlining, we also need some template code to repeat stages until their strides match. This requires details about each stage's logical unit size, input/output type and size at compile time. I can't think of what the interface of this would look like but the current map!() is likely insufficient to support this.Would a `vectorize` range adapter be feasible that prepares the input to make it SIMD compatible? That is, force alignment, process left-over elements at the end, etc.? As far as I understand, the problems with auto vectorization stem from a difficulty of compilers to recognize vectorizing opportunities, and (as Manu described) from incompatible semantics of scalar and vector types that the compiler needs to preserve. But if that hypothetical `vectorize` helper forces the input data into one of a handful of well-known formats and types, wouldn't it be possible to make the compilers recognize those (especially if they are accompanied by suitable pragma or other compiler hints)?I still don't believe auto-select between scalar or vector paths would be a very useful feature. Normally I would only consider SIMD solution when I know in advance that this is a performance hotspot. When the amount of data is small I simply don't care about performance and would just choose whatever simplest way to do it, like map!(), because the performance impact is not noticeable and definitely not worth the increased complexity.In the above scenario, you can add `.vectorize` to the pipeline to enable vectorizing wherever you need it.
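For what it's worth, here is a rough D sketch of what such a `vectorize` helper could look like; the name, struct layout and float-only restriction are all illustrative, not an existing Phobos API:

  import core.simd;

  // Split a float[] into an unaligned scalar head, a 16-byte-aligned SIMD
  // body, and a scalar tail, so a kernel can run float4 code on the middle
  // and plain scalar code on the edges.
  struct VectorizedView
  {
      float[]  head;   // elements before the first aligned vector
      float4[] body_;  // aligned, full 4-wide vectors
      float[]  tail;   // leftover elements after the last full vector
  }

  VectorizedView vectorize(float[] data)
  {
      enum vecLen   = 4;
      enum vecBytes = float4.sizeof;  // 16
      auto addr     = cast(size_t) data.ptr;
      size_t headLen = ((vecBytes - addr % vecBytes) % vecBytes) / float.sizeof;
      if (headLen > data.length) headLen = data.length;
      auto rest     = data[headLen .. $];
      auto vecCount = rest.length / vecLen;

      VectorizedView v;
      v.head  = data[0 .. headLen];
      v.body_ = (cast(float4*) rest.ptr)[0 .. vecCount];
      v.tail  = rest[vecCount * vecLen .. $];
      return v;
  }

The caller loops over head and tail with the scalar kernel and over body_ with the float4 kernel, which is essentially the "check alignment up front and pick a specialization of the main loop" approach described in the reply.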
Sep 07 2016
On Wednesday, 7 September 2016 at 11:53:00 UTC, Marc Schütz wrote:Would a `vectorize` range adapter be feasible that prepares the input to make it SIMD compatible? That is, force alignment, process left-over elements at the end, etc.? As far as I understand, the problems with auto vectorization stem from a difficulty of compilers to recognize vectorizing opportunities, and (as Manu described) from incompatible semantics of scalar and vector types that the compiler needs to preserve. But if that hypothetical `vectorize` helper forces the input data into one of a handful of well-known formats and types, wouldn't it be possible to make the compilers recognize those (especially if they are accompanied by suitable pragma or other compiler hints)?Contrary to popular belief, alignment is not a showstopper for SIMD code. Both Intel and ARM processors have instructions to access data from unaligned addresses. And on Intel processors, there is not even any speed penalty for using them on aligned addresses. Which means you can either forget it (on Intel) or just check the data alignment before you start and choose an optimal specialization of the main loop. However regarding auto vectorization, I'm with Manu. I won't put my bet on auto vectorization because I have never seen any non-trivial auto vectorized code that comes even close to hand tuned SIMD code. The compiler always has to play conservatively. The compiler has no idea that you are only using 10 bits of each 16-bit component in a vector. It can't even help you shuffle RGBARGBARGBARGBA into RRRRGGGGBBBBAAAA. The best we can do is to create something that makes writing SIMD kernels easy/reusable/composable.
Sep 07 2016
On 5 September 2016 at 18:21, Andrei Alexandrescu via Digitalmars-d <digitalmars-d puremagic.com> wrote:On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:Well, it totally depends. Like right now in my case, 'transform' is some image processing code (in the past when I've had these same thoughts, it has been audio filters). You can't touch pixels (or samples) one at a time. They need manual SIMD deployment (I've never seen an auto-vectoriser handle saturation arithmetic, or type promotion), alpha components (every 4th byte) are treated differently, memory access patterns need to be tuned to be cache friendly. I haven't done benchmarks right now, but I've done them professionally in the past, and it's not unusual to expect a hand-written image processing loop to see 1 or even 2 orders of magnitude improvement compared to calling a function for each pixel in a loop. The sorts of low-level optimisations you deploy in image and audio processing loops are not things I've ever seen any optimiser even attempt.

Some core problems that tend to require manual intervention in hot-loops are:
- ubyte[16] <-> ushort[8][2] expansion/contraction
- ubyte[16] <-> float[4][4] expansion/contraction
- saturation
- scalar operator results promote to int, but wide-simd operations don't, which means some scalar expressions can't be losslessly collapsed to simd operations, and the compiler will always be conservative on this matter. If the auto-vectoriser tries at all, you will see a mountain of extra code to preserve those bits that the scalar operator semantics would have guaranteed
- wide-vector multiplication is semantically different than scalar multiplication, so the optimiser has a lot of trouble vectorising mul's
- assumptions about data alignment
- interleaved data; audio samples are usually [L,R] interleaved, images often [RGB,A], and different processes are applied across the separation. You want to unroll and shuffle the data so you have vectors [LLLL],[RRRR], or [RGBRGBRGBRGB],[AAAA], and I haven't seen an optimiser go near that
- vector dot-product is always a nuisance

I could go on and on. The point is, as an end-user, pipeline API's are great. As a library author, I want to present the best performing library I can, which I think means we need to find a way to conveniently connect these 2 currently disconnected worlds. I've explored to some extent, but I've never come up with anything that I like.

I mostly code like this now: data.map!(x => transform(x)).copy(output); It's convenient and reads nicely, but it's generally inefficient.What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
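The integer-promotion point in miniature (a tiny self-contained example, not taken from the thread):

  void main()
  {
      ubyte a = 200, b = 100;
      int   promoted = a + b;               // 300: scalar ubyte math is done in int
      ubyte wrapped  = cast(ubyte)(a + b);  // 44: what a 16-lane ubyte add would give per lane
      assert(promoted == 300 && wrapped == 44);
  }

An auto-vectoriser that wants to rewrite the scalar expression as a ubyte16 add has to prove the wrap-around can never be observed, or emit widening code to preserve the scalar result, which is where the "mountain of extra code" comes from.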
Sep 05 2016
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:A central premise of performance-oriented programming which I've employed my entire career, is "where there is one, there is probably many", and if you do something to one, you should do it to many. With this in mind, the code I have always written doesn't tend to look like this: R manipulate(Thing thing); Instead: void manipulateThings(Thing *things, size_t numThings, R *output, size_t outputLen); Written this way for clarity. Obviously, the D equiv uses slices.To chime in, processing in chunks is indeed absolutely key for audio performance. Processing by element is only tolerable for prototyping. I don't even use slices but usually pointers and length, since most slice lengths would be the same.

  void processSamples(float* input, float* output, int samples);

Some kind of buffered input and output range concept would be needed there if a lazy library was to be made. Unfortunately for this case inputs and outputs are rather diverse. Can't wait for a more formalized DbI introduction. As for graphics, ae.utils.graphics usually gives you a whole line of pixels.
Sep 05 2016
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:I mostly code like this now: data.map!(x => transform(x)).copy(output);So you basically want to make the lazy computation eager and store the result? data.map!(x => transform(x)).array will allocate a new array and fill it with the result of map. And if you want to recycle the buffer I guess writing a buffer function would be trivial.
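A sketch of such a buffer-recycling helper; mapInto is a hypothetical name, not a Phobos function:

  import std.algorithm : copy, map;

  // Map into a caller-supplied buffer instead of allocating with .array.
  T[] mapInto(alias fun, T)(const(T)[] input, T[] buffer)
  {
      assert(buffer.length >= input.length);
      auto leftover = input.map!fun.copy(buffer);  // copy returns the unfilled part
      return buffer[0 .. buffer.length - leftover.length];
  }

  unittest
  {
      int[4] buf;
      assert(mapInto!(x => x * 2)([1, 2, 3], buf[]) == [2, 4, 6]);
  }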
Sep 06 2016
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:A central premise of performance-oriented programming which I've employed my entire career, is "where there is one, there is probably many", and if you do something to one, you should do it to many.From a conceptual standpoint, this sounds like the sort of thing array languages like APL and J thrive on, so there's solid precedent for the concept. I might suggest looking into optimising compilers in that space for inspiration and such; APEX, for example: http://www.snakeisland.com/apexup.htm Of course, this comes with the caveat that this is (still!) some relatively heavily-academic stuff. And I'm not sure to what extent that can help mitigate the problem of relaxing type requirements such that you can e.g. efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors. -Wyatt
Sep 06 2016
On 7 September 2016 at 01:54, Wyatt via Digitalmars-d <digitalmars-d puremagic.com> wrote:On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:Thanks, that's really interesting, I'll check it out.A central premise of performance-oriented programming which I've employed my entire career, is "where there is one, there is probably many", and if you do something to one, you should do it to many.From a conceptual standpoint, this sounds like the sort of thing array languages like APL and J thrive on, so there's solid precedent for the concept. I might suggest looking into optimising compilers in that space for inspiration and such; APEX, for example: http://www.snakeisland.com/apexup.htmOf course, this comes with the caveat that this is (still!) some relatively heavily-academic stuff. And I'm not sure to what extent that can help mitigate the problem of relaxing type requirements such that you can e.g. efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors.That's not what I want though. I intend to hand-write that function (I was just giving examples of how auto-vectorisation almost always fails), the question here is, how to work that new array function into our pipelines transparently...
Sep 06 2016
On Wednesday, 7 September 2016 at 00:18:59 UTC, Manu wrote:On 7 September 2016 at 01:54, Wyatt via Digitalmars-d <digitalmars-d puremagic.com> wrote: Thanks, that's really interesting, I'll check it out.Here's some work on static rank polymorphism that might also be applicable?: http://www.ccs.neu.edu/home/pete/pub/esop-2014.pdf And in the Related Work, I just noticed Halide, which sounds like it's right up your alley: http://halide-lang.org/Ah, I misunderstood. Sorry. I had the impression that you wanted to be able to simply write: data.map!(x => transform(x)).copy(output); ...for any data[] and have it lift the transformation to the whole vector. If you're doing the work, I'm curious what you're hoping the end result to look like in terms of the code you want to be able to write. Just a doodle is fine, it doesn't have to stand up to scrutiny. -WyattOf course, this comes with the caveat that this is (still!) some relatively heavily-academic stuff. And I'm not sure to what extent that can help mitigate the problem of relaxing type requirements such that you can e.g. efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors.That's not what I want though. I intend to hand-write that function (I was just giving examples of how auto-vectorisation almost always fails), the question here is, how to work that new array function into our pipelines transparently...
Sep 07 2016