
digitalmars.D.learn - Efficient idiom for fastest code

IntegratedDimensions <IntegratedDimensions gmail.com> writes:
Many times in expensive loops one must make decisions. Each 
decision must be evaluated, and the evaluation costs cycles.

for(int i = 0; i < N; i++)
{
     if(decision(i)) A; else B;
}

The if statement pays the cost of evaluating the test N times.

In some cases the decision holds over contiguous ranges: for some 
0 <= n <= N the decision is constant, but n is arbitrary 
(determined by factors unknown at compile time).

One can speed up the routine by using something akin to a 
simplified strategy pattern where one uses 
functions/delegates/lambdas to code a faster version without the 
test:


for(int i = 0; i < N; i++)
{
     d = (){ if(decision(i)) A; else d = (){ B; }; };
     d();
}

this code basically reduces to

for(int i = 0; i < N; i++)
{
     B;
}

We assume that in this particular case, once the decision fails 
it always fails.

Therefore, once the decision fails, the "faster" code kicks in. 
Suppose decision is very slow.

The decision cost is then paid n times rather than N times.
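
For example, a corrected, runnable sketch of the idea (decision, 
A and B are stand-in names; the delegate is declared before the 
literal so that the literal can reassign it):

import std.stdio : writeln;

void main()
{
    enum N = 20;
    bool decision(int i) { return i < 7; }  // true only for an initial range
    void A(int i) { writeln("A ", i); }
    void B(int i) { writeln("B ", i); }

    void delegate(int) d;
    d = (int i) {
        if (decision(i))
            A(i);
        else
        {
            d = (int i) { B(i); };  // swap in the test-free version
            B(i);
        }
    };
    for (int i = 0; i < N; i++)
        d(i);  // after the first failure only B runs, with no test
}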

In general, such techniques can be used to speed up code, 
sometimes significantly, but they are difficult to implement 
using a standard pattern, especially for more complex decision 
functions.


Does anyone see any way to use some of D's power to help 
simplify such things without using something like a strategy 
pattern or complicating the overall design? (Using advanced 
techniques such as templates and mixins does not count as making 
the code more complex.)
May 22 2018
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Wednesday, 23 May 2018 at 02:24:08 UTC, IntegratedDimensions 
wrote:
 [...]
 for(int i = 0; i < N; i++)
 {
      d = (){ if(decision(i)) A; else d = (){ B; }; };
     d();
 }
assuming you meant
 auto d = (int i){ if(decision(i)) A; else d = (int i) { B; };};
 for(int i = 0; i < N; i++)
 {
     d(i);
 }
 [...]
If decision is pure, consider using static foreach 
(iota(N).map!decision) { ... } to unroll it at compile time (see 
the sketch below). Even if it isn't pure, the compiler may still 
be able to factor out parts of repeated calls to decision; look 
at the assembly/IR to confirm this.

Otherwise PROFILE!!!!! (and use profile-guided optimisation! 
This implies using a compiler other than DMD, which, if you care 
about performance, you should be doing anyway.) Blindly trying 
to optimise is just as likely to hurt performance. In 
particular, don't underestimate the branch predictor: even if 
decision is complex, if there is a pattern in evaluating 
iota(N).map!decision, the branch predictor will pick up on it.

In terms of exploiting a priori knowledge of decision that the 
compiler somehow lacks, you really don't have much option but to 
do it yourself.
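
A rough sketch of that unrolling (this requires decision to be 
evaluatable in CTFE and N to be a compile-time constant; the 
names here are stand-ins):

import std.stdio : writeln;

enum N = 8;

bool decision(int i) pure { return i < 3; }  // stand-in, CTFE-able

void main()
{
    // Unrolled at compile time: each iteration's branch is resolved
    // during compilation, so no run-time test remains.
    static foreach (i; 0 .. N)
    {
        static if (decision(i))
            writeln("A ", i);
        else
            writeln("B ", i);
    }
}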
May 22 2018
IntegratedDimensions <IntegratedDimensions gmail.com> writes:
On Wednesday, 23 May 2018 at 03:00:17 UTC, Nicholas Wilson wrote:
 [...]
I knew someone was going to say that, and I forgot to say DON'T!

Saying to profile when I clearly said these ARE cases where they 
are slow is just moronic. Please don't use default answers to 
arguments.

This was a general question about how to attack a problem WHEN 
profiling says I need to optimize. Your SO 101 answer sucks! 
Sorry!

To prove to you that your answer is invalid: I profile my code, 
it says that it is very slow, and it shows that it is due to the 
decision checking. I then have to come here and write up a post 
trying to explain how to solve the problem. I then get a post 
telling me I should profile. I then respond that I did profile 
and that this is my problem. A lot of wasted energy, when it is 
better to know a general attack strategy. Yes, some of us can 
judge whether code needs to be optimized before profiling. It is 
not difficult. Giving a generic answer that does not always 
apply, and that is obvious to anyone trying to do optimization, 
is not helpful. Pretty much no one today even optimizes code 
anymore... this isn't 1979. It's not ok to keep repeating the 
same mantra. I guess we should turn this into a meme?

The reason I'm getting on to you is that "profile before 
optimization" sounds a bit grade school, especially since I 
wasn't talking about profiling at all, but about a general 
programming pattern for speeding up code, which is always valid 
but not always useful (and that is where profiling comes in).
May 22 2018
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Wednesday, 23 May 2018 at 03:12:52 UTC, IntegratedDimensions 
wrote:
 [...]
I'm going to ignore the tone of your response, but I am going to 
say that responding like that isn't going to get you very far. 
Don't expect others to do likewise.

Assuming that your decision function is indeed the bottleneck, 
you'll see I did actually provide some hints as to how to 
optimise the case where decision is pure. Even if you can't 
convince the compiler to inline and expression-combine, as in 
the case for the other answer, you can memoize it (look in 
std.functional).

One of the great things about D is that you can force lots of 
computation to happen at compile time. So in the case where 
decision is impure, factoring it into pure and impure parts and 
`enum x = pureFunc(args);`ing the part that can be evaluated at 
compile time can make a large difference if you can't convince 
the optimiser to do it for you (see the sketch below).

If you still can't do that, consider JITting it 
(https://github.com/ldc-developers/druntime/blob/e3bfc5fb780967f1b6807039ff00b2ccaf4b03d9/src/ldc/attributes.d#L78) 
with `-enable-dynamic-compile`, or running the loop in parallel 
with std.parallelism.
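
For example, a minimal sketch of that pure/impure split, forcing 
the pure part to compile time (all names here are illustrative, 
and the pi loop is just a stand-in for some expensive pure 
computation):

// Expensive pure part, hoisted into its own function so it can run in CTFE.
double expensiveConstant() pure
{
    enum n = 1_000_000;
    immutable delta = 1.0 / n;
    double sum = 0.0;
    foreach (i; 0 .. n)
    {
        immutable x = (i - 0.5) * delta;
        sum += delta / (1.0 + x * x);
    }
    return 4.0 * sum;  // approximately pi
}

bool decision(int a)
{
    // enum forces compile-time evaluation; at run time only the compare remains.
    enum threshold = 3 * expensiveConstant();
    return a < threshold;
}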
May 23 2018
biocyberman <biocyberman gmail.com> writes:
On Wednesday, 23 May 2018 at 03:12:52 UTC, IntegratedDimensions 
wrote:
 [...]
Very challenging. Wish I could help you out with the tough work. 
People don't share the same context, especially online, so it is 
necessary to clarify the problem so others can understand and 
help. I've been beaten on stackoverflow many times for not 
providing sufficient information in my questions. It seems one 
can do the reverse here at forum.dlang.org. With that said, I 
think you know what you are doing, and you can do it. Just relax 
and give it more time and experimentation.
May 24 2018
Malte <no valid.mail> writes:
On Wednesday, 23 May 2018 at 02:24:08 UTC, IntegratedDimensions 
wrote:
 [...]
I would just do
int i=0;
for(; i < N && decision(i); i++)
{
    A;
}
for(;i < N; i++)
{
    B;
}
This could be turned into a mixin template, something like this:
mixin template forSplit(alias condition, alias A, alias B)
{
    void execute()
    {
        int i = 0;
        for (; i < N && condition(i); i++)
        {
            A();
        }
        for (; i < N; i++)
        {
            B();
        }
    }
}
and to use it in code (assuming N is defined in the scope)
mixin forSplit!((int i)=>(decision(i)), {A;}, {B;}) loop;
loop.execute();
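Putting it together, a minimal self-contained sketch (decision, 
A and B are stand-ins):

bool decision(int i) { return i < 7; }  // stand-in

void main()
{
    import std.stdio : writeln;
    enum N = 20;  // visible to the mixed-in execute()
    void A() { writeln("A"); }
    void B() { writeln("B"); }
    mixin forSplit!((int i) => decision(i), { A(); }, { B(); }) loop;
    loop.execute();
}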
I haven't measured anything, but I would assume that delegates come with an overhead that you just don't need here. In fact, when trying to use
 auto d = (int i) {};
 d = (int i){ if(decision(i)) A; else d = (int i) { B; };};
 for(int i = 0; i < N; i++)
 {
     d(i);
 }
All I got was "cannot access frame of function D main", which sums up my experiences with lambdas in D so far.

While PGO and the branch predictor are good, they don't help much here: not executing an expression is out of scope for them; all they do is prevent pipeline flushes. Also, I think you overestimate what the compiler can do. The decision function I used for testing was this:
bool decision(int a) pure
out (result) {
    assert(result == (a < 10));
} do {
    import std.algorithm, std.range;

    // stolen from https://dlang.org/library/std/parallelism.html
    enum n = 1_000_000;
    enum delta = 1.0 / n;
    alias getTerm = (int i)
    {
        immutable x = (i - 0.5) * delta;
        return delta / (1.0 + x * x);
    };
    immutable pi = 4.0 * reduce!"a + b"(n.iota.map!getTerm);

    return a < 3 * pi;
}
With N=100 I got a speedup of ~10 (ldc -O3 -release), even though this function is pure and could be optimized a lot: it recalculated pi on every single call. And optimizing the decision function isn't even the point of the question.
May 23 2018
IntegratedDimensions <IntegratedDimensions gmail.com> writes:
On Wednesday, 23 May 2018 at 10:55:02 UTC, Malte wrote:
 On Wednesday, 23 May 2018 at 02:24:08 UTC, IntegratedDimensions 
 wrote:
 [...]
I would just do
[...]
[...]
Thanks, I didn't think about using a for loop like that. While 
it is not the most general solution, it does solve the specific 
case of a simple step/toggle decision.
May 23 2018