www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Divide & Conquer divides, but doesn't conquer

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Consider staticMap's current version (there are a few upcoming changes 
that don't change the point):

template staticMap(alias F, T...)
{
     static if (T.length == 0)
     {
         alias staticMap = AliasSeq!();
     }
     else static if (T.length == 1)
     {
         alias staticMap = AliasSeq!(F!(T[0]));
     }
     else
     {
         alias staticMap =
             AliasSeq!(
                 staticMap!(F, T[ 0  .. $/2]),
                 staticMap!(F, T[$/2 ..  $ ]));
     }
}

In the instantiation staticMap!(pred, A, B, C, D, E, F, G, H), recursion 
will instantiate:

staticMap!(pred, A, B, C, D)
staticMap!(pred, A, B)
staticMap!(pred, A)
staticMap!(pred, B)
staticMap!(pred, C, D)
staticMap!(pred, C)
staticMap!(pred, D)
staticMap!(pred, E. F, G, H)
staticMap!(pred, E, F)
staticMap!(pred, E)
staticMap!(pred, F)
staticMap!(pred, G, H)
staticMap!(pred, G)
staticMap!(pred, H)

Total: 15 instantiations including original. This is because a full tree 
with 8 leaves is being created. Generally, a tree with N leaves has 
about 2N - 1 nodes total (exactly so when N is a power of 2).

The more naive version would simply use linear decomposition:

template staticMap(alias F, T...)
{
     static if (T.length == 0)
     {
         alias staticMap = AliasSeq!();
     }
     else
     {
         alias staticMap =
             AliasSeq!(
                 F!(T[0]),
                 staticMap!(F, T[1 .. $]));
     }
}

There' no more need to handle the 1-element case, which is already a 
plus. But that's not the topic here. The instantiations created are:

staticMap!(pred, B, C, D, E, F, G, H)
staticMap!(pred, C, D, E, F, G, H)
staticMap!(pred, D, E, F, G, H)
staticMap!(pred, E, F, G, H)
staticMap!(pred, F, G, H)
staticMap!(pred, G, H)
staticMap!(pred, H)

Total: 8 instantiations including original. That's about half the 
current number of instantiations.

So it seems divide et impera doesn't quite help with certain recursive 
templates.
May 24 2020
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 25.05.20 01:40, Andrei Alexandrescu wrote:
 ...
 
 So it seems divide et impera doesn't quite help with certain recursive 
 templates.
Assuming the cost of a template instantiation is linear in the number of template arguments, the first version has running time Θ(n log n) while the second version has running time Θ(n²).
May 24 2020
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/24/20 8:44 PM, Timon Gehr wrote:
 On 25.05.20 01:40, Andrei Alexandrescu wrote:
 ...

 So it seems divide et impera doesn't quite help with certain recursive 
 templates.
Assuming the cost of a template instantiation is linear in the number of template arguments, the first version has running time Θ(n log n) while the second version has running time Θ(n²).
Of course. However, I'm assuming most of a template instantiation's cost is fixed. There are other algorithms looking at in std.meta. Consider Reverse: template Reverse(TList...) { static if (TList.length <= 1) { alias Reverse = TList; } else { alias Reverse = AliasSeq!( Reverse!(TList[$/2 .. $ ]), Reverse!(TList[ 0 .. $/2])); } } The recurrence formula for complexity is: T(0) = k; T(1) = k; T(n) = 2 * T(n/2) + k Again, I assume instantiating AliasSeq with two lists of any lengths has the same cost k. That makes complexity linear. So is the complexity of the simpler implementation: template Reverse(TList...) { static if (TList.length <= 1) { alias Reverse = TList; } else { alias Reverse = AliasSeq!(Reverse!(TList[1 .. $]), TList[0]); } } But this insantiates half the templates, so it goes easier on resources. This makes the use of the halving strategy not only an unnecessary distraction, but a less efficient choice. There are other unnecessary (and/or unnecessarily baroque) artifacts in std.meta, such as EraseAllN. Far as I can tell it's unnecessary because NoDuplicates can be implemented without it (and if I'm not wrong, at the same complexity). std.meta is due for a serious review.
May 24 2020
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 25.05.20 03:25, Andrei Alexandrescu wrote:
 
 Again, I assume instantiating AliasSeq with two lists of any lengths has 
 the same cost k.
I don't think this is a realistic assumption with the current compiler implementation. Expression tuples are stored as arrays.
May 24 2020
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Another one:

template Filter(alias pred, TList...)
{
     static if (TList.length == 0)
     {
         alias Filter = AliasSeq!();
     }
     else static if (TList.length == 1)
     {
         static if (pred!(TList[0]))
             alias Filter = AliasSeq!(TList[0]);
         else
             alias Filter = AliasSeq!();
     }
     else
     {
         alias Filter =
             AliasSeq!(
                 Filter!(pred, TList[ 0  .. $/2]),
                 Filter!(pred, TList[$/2 ..  $ ]));
     }
}

Why cut the problem size in half if it doesn't improve complexity?
May 24 2020
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 5/24/20 9:48 PM, Andrei Alexandrescu wrote:
 Another one:
 
 template Filter(alias pred, TList...)
 {
      static if (TList.length == 0)
      {
          alias Filter = AliasSeq!();
      }
      else static if (TList.length == 1)
      {
          static if (pred!(TList[0]))
              alias Filter = AliasSeq!(TList[0]);
          else
              alias Filter = AliasSeq!();
      }
      else
      {
          alias Filter =
              AliasSeq!(
                  Filter!(pred, TList[ 0  .. $/2]),
                  Filter!(pred, TList[$/2 ..  $ ]));
      }
 }
 
 Why cut the problem size in half if it doesn't improve complexity?
Because it doesn't work for large sets unless you do this. If you do a "linear" recursive template, it fails after about 1000 levels. A divide-and-conquer will not get that deep. So in essence, it's a solution for a compiler restriction, not for performance. -Steve
May 24 2020
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/24/20 10:32 PM, Steven Schveighoffer wrote:
 On 5/24/20 9:48 PM, Andrei Alexandrescu wrote:
 Another one:

 template Filter(alias pred, TList...)
 {
      static if (TList.length == 0)
      {
          alias Filter = AliasSeq!();
      }
      else static if (TList.length == 1)
      {
          static if (pred!(TList[0]))
              alias Filter = AliasSeq!(TList[0]);
          else
              alias Filter = AliasSeq!();
      }
      else
      {
          alias Filter =
              AliasSeq!(
                  Filter!(pred, TList[ 0  .. $/2]),
                  Filter!(pred, TList[$/2 ..  $ ]));
      }
 }

 Why cut the problem size in half if it doesn't improve complexity?
Because it doesn't work for large sets unless you do this. If you do a "linear" recursive template, it fails after about 1000 levels. A divide-and-conquer will not get that deep. So in essence, it's a solution for a compiler restriction, not for performance. -Steve
I see, thanks.
May 24 2020
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2020 8:10 PM, Andrei Alexandrescu wrote:
 I see, thanks.
So we don't waste time on this point again: https://github.com/dlang/phobos/pull/7497
May 25 2020
prev sibling next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
I ran a test to compare the two implementations. First was 
comparing 10,000 references to the same instance.

0.47user 0.11system 0:00.58elapsed 100%CPU (0avgtext+0avgdata 
399688maxresident)k

No significant difference between implementations. This is 
expected since it caches based on arguments.

Second test compared Phobos' impl vs the simple impl but this 
time with 10,000 different random instantiations: 
http://arsdnet.net/dcode/ala.d

I'd comment one impl out, uncomment the other, rerun 
/usr/bin/test dmd ala.d as well as dmd -c ala.d.

                simple impl     phobos impl
compile time:         3.1s            3.7s
dmd peak RAM:         1.5G            1.9G
  binary size:         6.2M            6.2M

The generated files (bin and object) are the same between them, 
but the simple implementation beats the Phobos impl in both 
compile time and dmd memory.

Note that all the user-visible static maps had 9 arguments, but 
the two separate implementations do different things.

It is possible my test is flawed, I encourage you to create your 
own and see what happens.


I also tried a string mixin version inside the static map 
template and it took 4.0s and used 1.2G of memory. Win on RAM, 
but lost on speed. (It was not the CTFE that hit it - that got 
cached - it was the mixin being run on each unique argument list).

And a hand-written static map because the number of arguments is 
known ahead of time: 1.4s compile, 0.7G memory. Absolutely 
crushed it. It might be worth seeing what lengths of static map 
are most common in the wild and just having hand expansions. 
Adding static if has a cost though (upped my test to 1.5s/720MB). 
But not as high as template constraints (1.5s/740MB). But this is 
all getting increasingly artificial.

So I am willing to say number of template instantiations is the 
probable key metric.


All this was done on a stock DMD64 D Compiler v2.091.1 on linux. 
Several enhancements were merged to dmd master recently that may 
change the results btw.

quick test: simple impl: 2s, 1.57G. phobos impl: 2.3s, 1.97G

hand impl: 1.1s, 725M. w/ static if: 1.2s, 775M

Decent time improvement across the board... but the binary got 
bigger at 6.4M :( (prolly druntime's fault more than anything 
else but idk, im not focusing on that rn)
May 24 2020
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2020 8:19 PM, Adam D. Ruppe wrote:
 I ran a test to compare the two implementations. First was comparing 10,000 
 references to the same instance.
Thank you. This is good information. I'm curious how the new staticMap compares: https://github.com/dlang/phobos/pull/7490
May 24 2020
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 25 May 2020 at 05:55:07 UTC, Walter Bright wrote:
 Thank you. This is good information. I'm curious how the new 
 staticMap compares:

 https://github.com/dlang/phobos/pull/7490
Using dmd master: 1.8s 1.2G Using released dmd: 2.1s, 1.1G (I think something in dmd master has a higher fixed memory cost, so remember to compare apples and apples.) So yes, the static if version is an improvement over what it has now, at least for these cases. We should probably make another test rig to see more variety...
May 25 2020
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2020 6:27 AM, Adam D. Ruppe wrote:
 Using dmd master: 1.8s 1.2G
 Using released dmd: 2.1s, 1.1G
A 15% speed win is a big deal! Especially for such a simple change.
May 25 2020
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 25 May 2020 at 21:25:33 UTC, Walter Bright wrote:
 On 5/25/2020 6:27 AM, Adam D. Ruppe wrote:
 Using dmd master: 1.8s 1.2G
 Using released dmd: 2.1s, 1.1G
A 15% speed win is a big deal! Especially for such a simple change.
How much of a deal is 100% ? Enough to talk about including ... ?
May 25 2020
parent reply Nick Treleaven <nick geany.org> writes:
On Monday, 25 May 2020 at 21:29:00 UTC, Stefan Koch wrote:
 How much of a deal is 100% ? Enough to talk about including ... 
 ?
How would a type function compare to S... ? Ideally we'd just use a type function.
May 27 2020
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 27 May 2020 at 19:05:51 UTC, Nick Treleaven wrote:
 On Monday, 25 May 2020 at 21:29:00 UTC, Stefan Koch wrote:
 How much of a deal is 100% ? Enough to talk about including 
 ... ?
How would a type function compare to S... ? Ideally we'd just use a type function.
It's gonna be a while until they can be used. For that usecase you couldn't use a typefunction at all. type functions don't work on values.
May 27 2020
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 27 May 2020 at 20:20:37 UTC, Stefan Koch wrote:
 On Wednesday, 27 May 2020 at 19:05:51 UTC, Nick Treleaven wrote:
 On Monday, 25 May 2020 at 21:29:00 UTC, Stefan Koch wrote:
 How much of a deal is 100% ? Enough to talk about including 
 ... ?
How would a type function compare to S... ? Ideally we'd just use a type function.
It's gonna be a while until they can be used. For that usecase you couldn't use a typefunction at all. type functions don't work on values.
Assuming you mean the benchmark I posted.
May 27 2020
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/25/20 1:55 AM, Walter Bright wrote:
 On 5/24/2020 8:19 PM, Adam D. Ruppe wrote:
 I ran a test to compare the two implementations. First was comparing 
 10,000 references to the same instance.
Thank you. This is good information. I'm curious how the new staticMap compares: https://github.com/dlang/phobos/pull/7490
This is reasonable. I've seen a *lot* worse in optimizing libraries. A *LOT*. Another possible improvement: binary search through the special cases.
May 25 2020
prev sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 25 May 2020 at 05:55:07 UTC, Walter Bright wrote:
 On 5/24/2020 8:19 PM, Adam D. Ruppe wrote:
 I ran a test to compare the two implementations. First was 
 comparing 10,000 references to the same instance.
Thank you. This is good information. I'm curious how the new staticMap compares: https://github.com/dlang/phobos/pull/7490
This has brought down the compile time of my test indeed. Almost but not quite down to `...` Let me post my test. I measured that for my test your staticMap is 1.2x faster than the regular the version before the pull. -- './dmd.sh sm.d -c -version=Walter' ran 1.21 ± 0.10 times faster than './dmd.sh sm.d -c' -- I measured that our `...` is about 1.5x faster than than your improved staticMap -- './dmd.sh sm.d -c -version=DotDotDot' ran 1.49 ± 0.18 times faster than './dmd.sh sm.d -c -version=Walter' -- Which should make our version about 2 times faster than the baseline and indeed -- './dmd.sh sm.d -c -version=DotDotDot' ran 2.05 ± 0.23 times faster than './dmd.sh sm.d -c' -- The source used file used for the benchmark is here: https://gist.github.com/UplinkCoder/9772f27cf088f08901b44fb57935aaaf The dmd used is at commit 2f23e3348e9a927c025637a4852d4d77beb997d3 which is on manu's fork of the compiler.
May 25 2020
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 25 May 2020 at 15:59:43 UTC, Stefan Koch wrote:
 The dmd used is at commit 
 2f23e3348e9a927c025637a4852d4d77beb997d3
 which is on manu's fork of the compiler.
Does that include the other shortcut prs? I wonder if they would further open or close the gap
May 25 2020
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 25 May 2020 at 20:57:07 UTC, Adam D. Ruppe wrote:
 On Monday, 25 May 2020 at 15:59:43 UTC, Stefan Koch wrote:
 The dmd used is at commit 
 2f23e3348e9a927c025637a4852d4d77beb997d3
 which is on manu's fork of the compiler.
Does that include the other shortcut prs? I wonder if they would further open or close the gap
Open it I suspect. The `...` version would be around 20x faster if it could create the new sequence as one thing.
May 25 2020
prev sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 25 May 2020 at 03:19:51 UTC, Adam D. Ruppe wrote:
 All this was done on a stock DMD64 D Compiler v2.091.1 on 
 linux. Several enhancements were merged to dmd master recently 
 that may change the results btw.

 quick test: simple impl: 2s, 1.57G. phobos impl: 2.3s, 1.97G

 hand impl: 1.1s, 725M. w/ static if: 1.2s, 775M

 Decent time improvement across the board... but the binary got 
 bigger at 6.4M :( (prolly druntime's fault more than anything 
 else but idk, im not focusing on that rn)
Could you also try with this version, please? template staticMap(alias F, T...) { mixin(staticMapHelper!(T.length)); } private enum staticMapHelper(size_t length) = staticMapHelperGen!(length); private string staticMapHelperGen(size_t length)() { string res = "alias staticMap = AliasSeq!("; static foreach (i; 0 .. length) { if (i) res ~= ", "; res ~= "F!(T[" ~ i.stringof ~ "])"; } res ~= ");"; return res; } static foreach and stringof were used so I didn't have to pull in `format` or `to!string`, which had issues in std.meta. I'm seeing some improvement over 2.086 at least; it should be about equivalent to the hand-unrolled version in master.
May 24 2020
next sibling parent reply Max Samukha <maxsamukha gmail.com> writes:
On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:

 static foreach and stringof were used so I didn't have to pull 
 in `format` or `to!string`, which had issues in std.meta.

 I'm seeing some improvement over 2.086 at least; it should be 
 about equivalent to the hand-unrolled version in master.
How could you miss the state-of-the-art?) Preallocate the result string and use the string counter hack! http://dpldocs.info/this-week-in-d/Blog.Posted_2020_05_11.html#dmd-and-static-foreach
May 25 2020
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/25/20 3:15 AM, Max Samukha wrote:
 On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:
 
 static foreach and stringof were used so I didn't have to pull in 
 `format` or `to!string`, which had issues in std.meta.

 I'm seeing some improvement over 2.086 at least; it should be about 
 equivalent to the hand-unrolled version in master.
How could you miss the state-of-the-art?) Preallocate the result string and use the string counter hack! http://dpldocs.info/this-week-in-d/Blog.Posted_2020_05_11.html#dm -and-static-foreach
static foreach is quadratic? Is this a matter of principle, or QoI?
May 25 2020
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 25 May 2020 at 16:04:24 UTC, Andrei Alexandrescu wrote:
 On 5/25/20 3:15 AM, Max Samukha wrote:
 On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:
 
 static foreach and stringof were used so I didn't have to 
 pull in `format` or `to!string`, which had issues in std.meta.

 I'm seeing some improvement over 2.086 at least; it should be 
 about equivalent to the hand-unrolled version in master.
How could you miss the state-of-the-art?) Preallocate the result string and use the string counter hack! http://dpldocs.info/this-week-in-d/Blog.Posted_2020_05_11.html#dmd-and-static-foreach
static foreach is quadratic? Is this a matter of principle, or QoI?
Yes it is. I did present this fact at the pre-event to Dconf 2018. As for your other question it's both. The QoI is not stellar. But it is fundamentally limited by having to create independent scopes. -- Stefan
May 25 2020
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 25 May 2020 at 16:07:56 UTC, Stefan Koch wrote:

 I did present this fact at the pre-event to Dconf 2018.
Could have been 19 ?
May 25 2020
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/25/20 12:07 PM, Stefan Koch wrote:
 On Monday, 25 May 2020 at 16:04:24 UTC, Andrei Alexandrescu wrote:
 On 5/25/20 3:15 AM, Max Samukha wrote:
 On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:

 static foreach and stringof were used so I didn't have to pull in 
 `format` or `to!string`, which had issues in std.meta.

 I'm seeing some improvement over 2.086 at least; it should be about 
 equivalent to the hand-unrolled version in master.
How could you miss the state-of-the-art?) Preallocate the result string and use the string counter hack! http://dpldocs.info/this-week-in-d/Blog.Posted_2020_05_11.html#dm -and-static-foreach
static foreach is quadratic? Is this a matter of principle, or QoI?
Yes it is. I did present this fact at the pre-event to Dconf 2018. As for your other question it's both. The QoI is not stellar. But it is fundamentally limited by having to create independent scopes.
How do you mean that? static foreach does not create a scope.
May 25 2020
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 25 May 2020 at 18:33:13 UTC, Andrei Alexandrescu wrote:
 On 5/25/20 12:07 PM, Stefan Koch wrote:
 On Monday, 25 May 2020 at 16:04:24 UTC, Andrei Alexandrescu 
 wrote:
 On 5/25/20 3:15 AM, Max Samukha wrote:
 On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature 
 wrote:

 static foreach and stringof were used so I didn't have to 
 pull in `format` or `to!string`, which had issues in 
 std.meta.

 I'm seeing some improvement over 2.086 at least; it should 
 be about equivalent to the hand-unrolled version in master.
How could you miss the state-of-the-art?) Preallocate the result string and use the string counter hack! http://dpldocs.info/this-week-in-d/Blog.Posted_2020_05_11.html#dmd-and-static-foreach
static foreach is quadratic? Is this a matter of principle, or QoI?
Yes it is. I did present this fact at the pre-event to Dconf 2018. As for your other question it's both. The QoI is not stellar. But it is fundamentally limited by having to create independent scopes.
How do you mean that? static foreach does not create a scope.
The scope I am talking about is what dmd calls Scope. This has to do with compiler internals. static foreach has a manifest variable as IV. That means it introduces a static name. That name for example I has to resolve to a different value on every iteration. This violates the rules of the language outside of static foreach. Therefore the a construct has to be introduces which allows special behavior inside a foreach body.
May 25 2020
prev sibling next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 25 May 2020 at 16:04:24 UTC, Andrei Alexandrescu wrote:
 static foreach is quadratic? Is this a matter of principle, or 
 QoI?
I have a hard time seeing why it HAS to be this way. Regular tuple foreach is better in most cases, string mixins are better in most cases (if written with appropriate care). There might be some cases where it is fundamental - I just honestly don't know - and it is possible automatically differentiating those cases is impossible and/or more trouble than it is worth, but I continue to believe our primary problem is the implementation right now. The problem is idk what the fix is. I spent a couple hours looking at the code a couple weeks ago and didn't find a solution. But that doesn't mean there isn't one.
May 25 2020
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 25.05.20 18:04, Andrei Alexandrescu wrote:
 On 5/25/20 3:15 AM, Max Samukha wrote:
 On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:

 static foreach and stringof were used so I didn't have to pull in 
 `format` or `to!string`, which had issues in std.meta.

 I'm seeing some improvement over 2.086 at least; it should be about 
 equivalent to the hand-unrolled version in master.
How could you miss the state-of-the-art?) Preallocate the result string and use the string counter hack! http://dpldocs.info/this-week-in-d/Blog.Posted_2020_05_11.html#dm -and-static-foreach
static foreach is quadratic? Is this a matter of principle, or QoI?
QoI. The original `static foreach` implementation in my compiler frontend runs in linear time, even though it is implemented in a very similar way. Note that even `iota(n).array` has superlinear running time with DMD's CTFE. The implementation of `static foreach` in DMD is based on lowering, and the existing "tuple" foreach implementation. I suspect the culprit for bad performance for very large `static foreach` is a slow expression "tuple" implementation, but I am not sure. Therefore, maybe performance could be improved by modifying the foreach expansion code such that it can accept a CTFE array instead of an expression tuple. One could also add a special-case implementation for range `static foreach` to work around inefficient CTFE.
May 25 2020
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 25 May 2020 at 16:46:16 UTC, Timon Gehr wrote:
 On 25.05.20 18:04, Andrei Alexandrescu wrote:
 [...]
QoI. The original `static foreach` implementation in my compiler frontend runs in linear time, even though it is implemented in a very similar way. Note that even `iota(n).array` has superlinear running time with DMD's CTFE. [...]
What do you do about the scopes?
May 25 2020
next sibling parent Tove <tove fransson.se> writes:
On Monday, 25 May 2020 at 16:57:48 UTC, Stefan Koch wrote:
 On Monday, 25 May 2020 at 16:46:16 UTC, Timon Gehr wrote:
 On 25.05.20 18:04, Andrei Alexandrescu wrote:
 [...]
QoI. The original `static foreach` implementation in my compiler frontend runs in linear time, even though it is implemented in a very similar way. Note that even `iota(n).array` has superlinear running time with DMD's CTFE. [...]
What do you do about the scopes?
Can't they be elided if the body contains no declarations?
May 25 2020
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 25.05.20 18:57, Stefan Koch wrote:
 On Monday, 25 May 2020 at 16:46:16 UTC, Timon Gehr wrote:
 On 25.05.20 18:04, Andrei Alexandrescu wrote:
 [...]
QoI. The original `static foreach` implementation in my compiler frontend runs in linear time, even though it is implemented in a very similar way. Note that even `iota(n).array` has superlinear running time with DMD's CTFE. [...]
What do you do about the scopes?
Nothing special. Why would I have to?
May 25 2020
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 25 May 2020 at 17:16:12 UTC, Timon Gehr wrote:
 On 25.05.20 18:57, Stefan Koch wrote:
 On Monday, 25 May 2020 at 16:46:16 UTC, Timon Gehr wrote:
 On 25.05.20 18:04, Andrei Alexandrescu wrote:
 [...]
QoI. The original `static foreach` implementation in my compiler frontend runs in linear time, even though it is implemented in a very similar way. Note that even `iota(n).array` has superlinear running time with DMD's CTFE. [...]
What do you do about the scopes?
Nothing special. Why would I have to?
You need a manifest constant for you IV (induction variable), no? Which would clash with the one previously introduced.
May 25 2020
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 25.05.20 19:21, Stefan Koch wrote:
 On Monday, 25 May 2020 at 17:16:12 UTC, Timon Gehr wrote:
 On 25.05.20 18:57, Stefan Koch wrote:
 On Monday, 25 May 2020 at 16:46:16 UTC, Timon Gehr wrote:
 On 25.05.20 18:04, Andrei Alexandrescu wrote:
 [...]
QoI. The original `static foreach` implementation in my compiler frontend runs in linear time, even though it is implemented in a very similar way. Note that even `iota(n).array` has superlinear running time with DMD's CTFE. [...]
What do you do about the scopes?
Nothing special. Why would I have to?
You need a manifest constant for you IV (induction variable), no? Which would clash with the one previously introduced.
Sure, so I create one forwarding scope per iteration. This is not particularly expensive, it's constant overhead per iteration.
May 25 2020
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 25 May 2020 at 17:23:50 UTC, Timon Gehr wrote:
 On 25.05.20 19:21, Stefan Koch wrote:
 On Monday, 25 May 2020 at 17:16:12 UTC, Timon Gehr wrote:
 On 25.05.20 18:57, Stefan Koch wrote:
 On Monday, 25 May 2020 at 16:46:16 UTC, Timon Gehr wrote:
 [...]
What do you do about the scopes?
Nothing special. Why would I have to?
You need a manifest constant for you IV (induction variable), no? Which would clash with the one previously introduced.
Sure, so I create one forwarding scope per iteration. This is not particularly expensive, it's constant overhead per iteration.
Since Forwarding scopes where not in the language before you introduced static if. Could you give a quick explanation how they work?
May 25 2020
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 25.05.20 19:25, Stefan Koch wrote:
 On Monday, 25 May 2020 at 17:23:50 UTC, Timon Gehr wrote:
 On 25.05.20 19:21, Stefan Koch wrote:
 On Monday, 25 May 2020 at 17:16:12 UTC, Timon Gehr wrote:
 On 25.05.20 18:57, Stefan Koch wrote:
 On Monday, 25 May 2020 at 16:46:16 UTC, Timon Gehr wrote:
 [...]
What do you do about the scopes?
Nothing special. Why would I have to?
You need a manifest constant for you IV (induction variable), no? Which would clash with the one previously introduced.
Sure, so I create one forwarding scope per iteration. This is not particularly expensive, it's constant overhead per iteration.
Since Forwarding scopes where not in the language before you introduced static if. Could you give a quick explanation how they work?
It's a scope that forwards new symbols to its parent scope on insertion. This way, each iteration of the `static foreach` will have its own local scope with distinct loop variables, but any declarations that are generated in the `static foreach` body will skip this scope and instead populate the parent scope. (As an experiment, the original implementation also had `__local` declarations to insert symbols into the local scope without forwarding.)
May 25 2020
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 25 May 2020 at 17:34:36 UTC, Timon Gehr wrote:
 On 25.05.20 19:25, Stefan Koch wrote:
 On Monday, 25 May 2020 at 17:23:50 UTC, Timon Gehr wrote:
 [...]
Since Forwarding scopes where not in the language before you introduced static if. Could you give a quick explanation how they work?
It's a scope that forwards new symbols to its parent scope on insertion. This way, each iteration of the `static foreach` will have its own local scope with distinct loop variables, but any declarations that are generated in the `static foreach` body will skip this scope and instead populate the parent scope. (As an experiment, the original implementation also had `__local` declarations to insert symbols into the local scope without forwarding.)
Thank a lot for the explanation! I was wondering about this for a while. That's a clever solution.
May 25 2020
prev sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 25 May 2020 at 17:34:36 UTC, Timon Gehr wrote:
 (As an experiment, the original implementation also had 
 `__local` declarations to insert symbols into the local scope 
 without forwarding.)
That would actually be really useful in some places. Like static foreach(memberName; __traits(allMembers, f)) { __local alias member = __traits(getMember, f, memberName); } Though of course it is also possible to do such with staticMap in places.
May 25 2020
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 25 May 2020 at 20:47:42 UTC, Adam D. Ruppe wrote:
 On Monday, 25 May 2020 at 17:34:36 UTC, Timon Gehr wrote:
 (As an experiment, the original implementation also had 
 `__local` declarations to insert symbols into the local scope 
 without forwarding.)
That would actually be really useful in some places. Like static foreach(memberName; __traits(allMembers, f)) { __local alias member = __traits(getMember, f, memberName); } Though of course it is also possible to do such with staticMap in places.
you want type functions :) What you are doing there works by default since declarations are temporary.
May 25 2020
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/25/20 12:46 PM, Timon Gehr wrote:
 On 25.05.20 18:04, Andrei Alexandrescu wrote:
 On 5/25/20 3:15 AM, Max Samukha wrote:
 On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:

 static foreach and stringof were used so I didn't have to pull in 
 `format` or `to!string`, which had issues in std.meta.

 I'm seeing some improvement over 2.086 at least; it should be about 
 equivalent to the hand-unrolled version in master.
How could you miss the state-of-the-art?) Preallocate the result string and use the string counter hack! http://dpldocs.info/this-week-in-d/Blog.Posted_2020_05_11.html#dm -and-static-foreach
static foreach is quadratic? Is this a matter of principle, or QoI?
QoI. The original `static foreach` implementation in my compiler frontend runs in linear time, even though it is implemented in a very similar way. Note that even `iota(n).array` has superlinear running time with DMD's CTFE. The implementation of `static foreach` in DMD is based on lowering, and the existing "tuple" foreach implementation. I suspect the culprit for bad performance for very large `static foreach` is a slow expression "tuple" implementation, but I am not sure. Therefore, maybe performance could be improved by modifying the foreach expansion code such that it can accept a CTFE array instead of an expression tuple. One could also add a special-case implementation for range `static foreach` to work around inefficient CTFE.
Thank you! It would be great if you could contribute or shepherd a contribution to dmd. Anything quadratic fails at scale.
May 25 2020
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2020 9:46 AM, Timon Gehr wrote:
 QoI. The original `static foreach` implementation in my compiler frontend runs 
 in linear time, even though it is implemented in a very similar way. Note that 
 even `iota(n).array` has superlinear running time with DMD's CTFE.
 
 The implementation of `static foreach` in DMD is based on lowering, and the 
 existing "tuple" foreach implementation. I suspect the culprit for bad 
 performance for very large `static foreach` is a slow expression "tuple" 
 implementation, but I am not sure.
 Therefore, maybe performance could be improved by modifying the foreach 
 expansion code such that it can accept a CTFE array instead of an expression 
 tuple. One could also add a special-case implementation for range `static 
 foreach` to work around inefficient CTFE.
This is worth looking in to. I've had customers tell me they don't use static foreach due to performance problems.
May 25 2020
prev sibling next sibling parent reply Max Samukha <maxsamukha gmail.com> writes:
On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:

 static foreach and stringof were used so I didn't have to pull 
 in `format` or `to!string`, which had issues in std.meta.

 I'm seeing some improvement over 2.086 at least; it should be 
 about equivalent to the hand-unrolled version in master.
State-of-the-art is to preallocate the result string and use the string counter hack)) http://dpldocs.info/this-week-in-d/Blog.Posted_2020_05_11.html#dmd-and-static-foreach Stefan's work could make all this unnecessary but alas.
May 25 2020
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 25 May 2020 at 07:26:52 UTC, Max Samukha wrote:
 State-of-the-art is to preallocate the result string and use 
 the string counter hack))
In this case, it doesn't matter much because the string is generated only once per unique length, then it is cached thanks to the template arg in the helper enum. So the CTFE only actually happens once, however you do it this way. The preallocation becomes important when the generated string becomes large because it avoids the compiler leaking intermediates and wasting a LOT of time realloc and copying again - when it is just a few hundred bytes like this it doesn't really matter. When the CTFE expression finishes btw the compiler does free the memory pool. So even if you did this little thing over and over, it can be OK. But again when the generated string becomes very large, the peak memory can be prohibitive (it reaches tens of gigabytes of pure waste surprisingly quickly). BTW on ctfe cache vs recalculate it depends on how much reuse you have. In my jni.d, the first version had an inline string replace in a loop. It took 30 GB of memory [!!!] and took several minutes to compile. Simply moving that out and pregenerating a `static immutable` so CTFE was only invoked once, took it down to 3 GB and 25 seconds. Still slow, but huge, huge improvement. But that's because there were only actually two variants of that string and it was fairly large. So caching made sense (static immutable btw is a bit better than a helper template holding the result - both cache but it is like a single variable vs a hash table; in fact that's how it is implemented). But if there were hundreds of variants instead of two, the cache itself would get very expensive.
May 25 2020
parent Max Samukha <maxsamukha gmail.com> writes:
On Monday, 25 May 2020 at 13:13:11 UTC, Adam D. Ruppe wrote:

 In this case, it doesn't matter much because the string is 
 generated only once per unique length, then it is cached thanks 
 to the template arg in the helper enum.
Yes, I overlooked the enum. My version of static map just calls the generating function directly.
 So the CTFE only actually happens once, however you do it this 
 way.
+ When the CTFE expression finishes btw the compiler does free
 the memory pool. So even if you did this little thing over and 
 over, it can be OK. But again when the generated string becomes 
 very large, the peak memory can be prohibitive (it reaches tens 
 of gigabytes of pure waste surprisingly quickly).

 BTW on ctfe cache vs recalculate it depends on how much reuse 
 you have. In my jni.d, the first version had an inline string 
 replace in a loop. It took 30 GB of memory [!!!] and took 
 several minutes to compile.

 Simply moving that out and pregenerating a `static immutable` 
 so CTFE was only invoked once, took it down to 3 GB and 25 
 seconds. Still slow, but huge, huge improvement.

 But that's because there were only actually two variants of 
 that string and it was fairly large. So caching made sense 
 (static immutable btw is a bit better than a helper template 
 holding the result - both cache but it is like a single 
 variable vs a hash table; in fact that's how it is implemented).
 But if there were hundreds of variants instead of two, the 
 cache itself would get very expensive.
Right. This is applicable to memoization in general.
May 25 2020
prev sibling next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:
 private enum staticMapHelper(size_t length) = 
 staticMapHelperGen!(length);
lol, also called it staticMapHelper when doing this myself! Solid result though: 2s, 810MB. I'm kinda surprised it beat the crap out of my mixin version. Mine was 3s, 1.2G using a mixin expression (alias staticMap = mixin(helper()). Changing it to a mixin declaration (the alias is in the mixin, like yours) brought it to 2.2s, 924MB. Interesting. And changing the helper from a template to a function gives another small win: 2.2s but 906MB.
 static foreach and stringof were used so I didn't have to pull 
 in `format` or `to!string`, which had issues in std.meta.
Yeah, format and to!string are both surprisingly heavy. Just using to!string(int) even in runtime stuff has a large hit.
May 25 2020
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 25 May 2020 at 13:00:59 UTC, Adam D. Ruppe wrote:
 Solid result though: 2s, 810MB.
That was using the new dmd master, more specifically it was 1.6s that I rounded up. With the released dmd, 1.9s, 750MB. either way, the mixin helps here - but that may not be true if there was a variety of lengths. This test did fixed length because I wanted to specifically prove that arg length isn't as important as number of instantiations here... we should prolly test a variety of lengths too, it could change a lot.
May 25 2020
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/25/20 9:00 AM, Adam D. Ruppe wrote:
 On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:
 private enum staticMapHelper(size_t length) = 
 staticMapHelperGen!(length);
lol, also called it staticMapHelper when doing this myself! Solid result though: 2s, 810MB. I'm kinda surprised it beat the crap out of my mixin version. Mine was 3s, 1.2G using a mixin expression (alias staticMap = mixin(helper()). Changing it to a mixin declaration (the alias is in the mixin, like yours) brought it to 2.2s, 924MB. Interesting. And changing the helper from a template to a function gives another small win: 2.2s but 906MB.
So it seems a mixin-based version + Max's precalculated length = win!
May 25 2020
parent reply Max Samukha <maxsamukha gmail.com> writes:
On Monday, 25 May 2020 at 16:07:06 UTC, Andrei Alexandrescu wrote:

 So it seems a mixin-based version + Max's precalculated length 
 = win!
Precalculated length is Adam's, and the link is to his blog. I should have stated that clearly in my post. Sorry.
May 25 2020
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 25 May 2020 at 17:08:32 UTC, Max Samukha wrote:
 Precalculated length is Adam's
I stole it from Stefan too IIRC. The way I see it, it is all just D community knowledge. A lot of the things I know just come from helping other users through their problems too - they point out something that is cool or is broken and by working on it together we both learn.
May 25 2020
parent Max Samukha <maxsamukha gmail.com> writes:
On Monday, 25 May 2020 at 20:50:19 UTC, Adam D. Ruppe wrote:
 On Monday, 25 May 2020 at 17:08:32 UTC, Max Samukha wrote:
 Precalculated length is Adam's
I stole it from Stefan too IIRC.
Preallocating buffers for CTFE is not a new idea, but the string counter hack was new to me.
 The way I see it, it is all just D community knowledge. A lot 
 of the things I know just come from helping other users through 
 their problems too - they point out something that is cool or 
 is broken and by working on it together we both learn.
Agree.
May 26 2020
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/25/20 2:18 AM, FeepingCreature wrote:
 On Monday, 25 May 2020 at 03:19:51 UTC, Adam D. Ruppe wrote:
 All this was done on a stock DMD64 D Compiler v2.091.1 on linux. 
 Several enhancements were merged to dmd master recently that may 
 change the results btw.

 quick test: simple impl: 2s, 1.57G. phobos impl: 2.3s, 1.97G

 hand impl: 1.1s, 725M. w/ static if: 1.2s, 775M

 Decent time improvement across the board... but the binary got bigger 
 at 6.4M :( (prolly druntime's fault more than anything else but idk, 
 im not focusing on that rn)
Could you also try with this version, please? template staticMap(alias F, T...) {     mixin(staticMapHelper!(T.length)); } private enum staticMapHelper(size_t length) = staticMapHelperGen!(length); private string staticMapHelperGen(size_t length)() {     string res = "alias staticMap = AliasSeq!(";     static foreach (i; 0 .. length)     {         if (i) res ~= ", ";         res ~= "F!(T[" ~ i.stringof ~ "])";     }     res ~= ");";     return res; }
Simpler: private string staticMapHelperGen(size_t length)() { string res = "alias staticMap = AliasSeq!("; static foreach (i; 0 .. length) res ~= "F!(T[" ~ i.stringof ~ "]),"; res ~= ");"; return res; }
May 25 2020
prev sibling next sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Sunday, 24 May 2020 at 23:40:48 UTC, Andrei Alexandrescu wrote:
 Consider staticMap's current version (there are a few upcoming 
 changes that don't change the point):
 
...

 In the instantiation staticMap!(pred, A, B, C, D, E, F, G, H), 
 recursion will instantiate:

 staticMap!(pred, A, B, C, D)
 staticMap!(pred, A, B)
 staticMap!(pred, A)
 staticMap!(pred, B)
 staticMap!(pred, C, D)
 staticMap!(pred, C)
 staticMap!(pred, D)
 staticMap!(pred, E. F, G, H)
 staticMap!(pred, E, F)
 staticMap!(pred, E)
 staticMap!(pred, F)
 staticMap!(pred, G, H)
 staticMap!(pred, G)
 staticMap!(pred, H)

 Total: 15 instantiations including original. This is because a 
 full tree with 8 leaves is being created. Generally, a tree 
 with N leaves has about 2N - 1 nodes total (exactly so when N 
 is a power of 2).

 The more naive version would simply use linear decomposition:
 ...
 There' no more need to handle the 1-element case, which is 
 already a plus. But that's not the topic here. The 
 instantiations created are:

 staticMap!(pred, B, C, D, E, F, G, H)
 staticMap!(pred, C, D, E, F, G, H)
 staticMap!(pred, D, E, F, G, H)
 staticMap!(pred, E, F, G, H)
 staticMap!(pred, F, G, H)
 staticMap!(pred, G, H)
 staticMap!(pred, H)

 Total: 8 instantiations including original. That's about half 
 the current number of instantiations.

 So it seems divide et impera doesn't quite help with certain 
 recursive templates.
To add to the good previous points: the divide-and-conquer strategy can take advantage of caching for the leaf instantiations with one parameter. That is, 8 template instantiations can trivially be reused, and the others can sometimes be reused if they hit a common subtree. The linear instantiation can only be reused once or twice at the very end.
May 24 2020
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/25/20 1:59 AM, FeepingCreature wrote:
 On Sunday, 24 May 2020 at 23:40:48 UTC, Andrei Alexandrescu wrote:
 Consider staticMap's current version (there are a few upcoming changes 
 that don't change the point):

 ...

 In the instantiation staticMap!(pred, A, B, C, D, E, F, G, H), 
 recursion will instantiate:

 staticMap!(pred, A, B, C, D)
 staticMap!(pred, A, B)
 staticMap!(pred, A)
 staticMap!(pred, B)
 staticMap!(pred, C, D)
 staticMap!(pred, C)
 staticMap!(pred, D)
 staticMap!(pred, E. F, G, H)
 staticMap!(pred, E, F)
 staticMap!(pred, E)
 staticMap!(pred, F)
 staticMap!(pred, G, H)
 staticMap!(pred, G)
 staticMap!(pred, H)

 Total: 15 instantiations including original. This is because a full 
 tree with 8 leaves is being created. Generally, a tree with N leaves 
 has about 2N - 1 nodes total (exactly so when N is a power of 2).

 The more naive version would simply use linear decomposition:
 ...
 There' no more need to handle the 1-element case, which is already a 
 plus. But that's not the topic here. The instantiations created are:

 staticMap!(pred, B, C, D, E, F, G, H)
 staticMap!(pred, C, D, E, F, G, H)
 staticMap!(pred, D, E, F, G, H)
 staticMap!(pred, E, F, G, H)
 staticMap!(pred, F, G, H)
 staticMap!(pred, G, H)
 staticMap!(pred, H)

 Total: 8 instantiations including original. That's about half the 
 current number of instantiations.

 So it seems divide et impera doesn't quite help with certain recursive 
 templates.
To add to the good previous points: the divide-and-conquer strategy can take advantage of caching for the leaf instantiations with one parameter. That is, 8 template instantiations can trivially be reused, and the others can sometimes be reused if they hit a common subtree. The linear instantiation can only be reused once or twice at the very end.
Not sure I understand. A,...,H are assumed to be distinct. The number of templates generated is the same regardless of memoization.
May 25 2020
parent Paul Backus <snarwin gmail.com> writes:
On Monday, 25 May 2020 at 15:59:57 UTC, Andrei Alexandrescu wrote:
 On 5/25/20 1:59 AM, FeepingCreature wrote:
 To add to the good previous points: the divide-and-conquer 
 strategy can take advantage of caching for the leaf 
 instantiations with one parameter. That is, 8 template 
 instantiations can trivially be reused, and the others can 
 sometimes be reused if they hit a common subtree. The linear 
 instantiation can only be reused once or twice at the very end.
Not sure I understand. A,...,H are assumed to be distinct. The number of templates generated is the same regardless of memoization.
I think the idea is that you can potentially re-use template instances across separate calls to staticMap. So for example, if you have `staticMap!(A, B, C, D)` and `staticMap!(A, B, E, F)`, you get: staticMap!(pred, A, B, C, D) staticMap!(pred, A, B) staticMap!(pred, A) staticMap!(pred, B) staticMap!(pred, C, D) staticMap!(pred, C) staticMap!(pred, D) staticMap!(pred, A, B, E, F) staticMap!(pred, E, F) staticMap!(pred, E) staticMap!(pred, F)
May 25 2020
prev sibling parent johnsmith101 <johnsmith908412 gmail.com> writes:
On Monday, 25 May 2020 at 05:59:41 UTC, FeepingCreature wrote:
 On Sunday, 24 May 2020 at 23:40:48 UTC, Andrei Alexandrescu 
 wrote:
 Consider staticMap's current version (there are a few upcoming 
 changes that don't change the point):
 
...

 In the instantiation staticMap!(pred, A, B, C, D, E, F, G, H), 
 recursion will instantiate:

 staticMap!(pred, A, B, C, D)
 staticMap!(pred, A, B)
 staticMap!(pred, A)
 staticMap!(pred, B)
 staticMap!(pred, C, D)
 staticMap!(pred, C)
 staticMap!(pred, D)
 staticMap!(pred, E. F, G, H)
 staticMap!(pred, E, F)
 staticMap!(pred, E)
 staticMap!(pred, F)
 staticMap!(pred, G, H)
 staticMap!(pred, G)
 staticMap!(pred, H)

 Total: 15 instantiations including original. This is because a 
 full tree with 8 leaves is being created. Generally, a tree 
 with N leaves has about 2N - 1 nodes total (exactly so when N 
 is a power of 2).

 The more naive version would simply use linear decomposition:
 ...
 There' no more need to handle the 1-element case, which is 
 already a plus. But that's not the topic here. The 
 instantiations created are:

 staticMap!(pred, B, C, D, E, F, G, H)
 staticMap!(pred, C, D, E, F, G, H)
 staticMap!(pred, D, E, F, G, H)
 staticMap!(pred, E, F, G, H)
 staticMap!(pred, F, G, H)
 staticMap!(pred, G, H)
 staticMap!(pred, H)

 Total: 8 instantiations including original. That's about half 
 the current number of instantiations.

 So it seems divide et impera doesn't quite help with certain 
 recursive templates.
To add to the good previous points: the divide-and-conquer strategy can take advantage of caching for the leaf instantiations with one parameter. That is, 8 template instantiations can trivially be reused, and the others can sometimes be reused if they hit a common subtree. The linear instantiation can only be reused once or twice at the very end.
Thank u so much for taking time to help.
May 26 2020
prev sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 24 May 2020 at 23:40:48 UTC, Andrei Alexandrescu wrote:
 Consider staticMap's current version (there are a few upcoming 
 changes that don't change the point):

 [...]
Manu's DIP does solve this problem. In O(n) guaranteed.
May 25 2020