digitalmars.D - Re: Errors in TDPL

bearophile <bearophileHUGS lycos.com> Jun 21 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jun 21 2010

Brad Roberts <braddr puremagic.com> Jun 21 2010

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

That's not too difficult; for integers, retro(iota(a, b)) could actually be a
rewrite to iota(b - 1, a, -1).<


This is good. In my dlibs the xchain converts xchain(xchain(x, y), z) and
similar into xchain(x, y, z).


Figuring out all corner cases, steps greater than 1, and what to do for
floating point numbers is doable but not trivial either, and works against
modularity.<


I suggest to focus on the most important case, integer/uint indexes (and +1 or
-1 increment).

My post has shown some different problems:
- A longer compilation time (and binary size)
- A 1/10 performance when the code is compiled in standard way, this is bad
- a smaller performance when the code is compiled in optimized mode.

The asm of the opt version shows calls to two or more functions inside the
loop, and one of those functions is not so small, this probably reduces
performance more than an extra inlined product. So in my opinion getting rid of
those calls (inlining them) is more important if you want a faster retro(iota).

------------------------

I have added your last version:

// test4
import std.c.stdio: printf;
import std.range: iota;
void main() {
    enum int N = 100_000_000;
    int count;
    foreach (i; iota(N - 1, 0, -1))
        count++;
    printf("%d\n", count);
}



Running time, best of 3, seconds:
  test1:     0.31
  test1 opt: 0.07
  test2:     0.31
  test2 opt: 0.12
  test3:     6.38
  test3 opt: 0.52
  test4:     4.70
  test4 opt: 1.25

not opt version = dmd (no other option)
opt version = dmd -O -release -inline

Compile times opt version, seconds:
  test1: 0.05
  test2: 0.05
  test3: 0.28
  test4: 0.29

The compilation time is the same, the not-opt test4 is faster than not-opt
test3, but opt test4 is quite slower than opt test3.

Bye,
bearophile

Jun 21 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 06/21/2010 08:28 PM, bearophile wrote:
 Andrei Alexandrescu:

 That's not too difficult; for integers, retro(iota(a, b)) could actually be a
rewrite to iota(b - 1, a, -1).<


 This is good. In my dlibs the xchain converts xchain(xchain(x, y), z) and
similar into xchain(x, y, z).


 Figuring out all corner cases, steps greater than 1, and what to do for
floating point numbers is doable but not trivial either, and works against
modularity.<


 I suggest to focus on the most important case, integer/uint indexes (and +1 or
-1 increment).

 My post has shown some different problems:
 - A longer compilation time (and binary size)
 - A 1/10 performance when the code is compiled in standard way, this is bad
 - a smaller performance when the code is compiled in optimized mode.

 The asm of the opt version shows calls to two or more functions inside the
loop, and one of those functions is not so small, this probably reduces
performance more than an extra inlined product. So in my opinion getting rid of
those calls (inlining them) is more important if you want a faster retro(iota).

 ------------------------

 I have added your last version:

 // test4
 import std.c.stdio: printf;
 import std.range: iota;
 void main() {
      enum int N = 100_000_000;
      int count;
      foreach (i; iota(N - 1, 0, -1))
          count++;
      printf("%d\n", count);
 }



 Running time, best of 3, seconds:
    test1:     0.31
    test1 opt: 0.07
    test2:     0.31
    test2 opt: 0.12
    test3:     6.38
    test3 opt: 0.52
    test4:     4.70
    test4 opt: 1.25

 not opt version = dmd (no other option)
 opt version = dmd -O -release -inline

 Compile times opt version, seconds:
    test1: 0.05
    test2: 0.05
    test3: 0.28
    test4: 0.29

 The compilation time is the same, the not-opt test4 is faster than not-opt
test3, but opt test4 is quite slower than opt test3.

 Bye,
 bearophile


Thanks. Hmm. I redefined iota as follows:

struct Iota(N, S) if ((isIntegral!N || isPointer!N) && isIntegral!S) {
     private N current, pastLast;
     private S step;
     this(N current, N pastLast, S step)
     {
         enforce(step != 0 && current != pastLast);
         this.current = current;
         this.step = step;
         if (step > 0)
         {
             this.pastLast = pastLast - 1;
             this.pastLast -= (this.pastLast - current) % step;
         }
         else
         {
             this.pastLast = pastLast + 1;
             this.pastLast += (this.pastLast - current) % step;
         }
         this.pastLast += step;
     }
     bool empty() const { return current == pastLast; }
     N front() { return current; }
     alias front moveFront;
     void popFront()
     {
         current += step;
     }
     N back() { return pastLast - step; }
     alias back moveBack;
     void popBack()
     {
         pastLast -= step;
     }
     Iota save() { return this; }
     N opIndex(size_t n)
     {
         return current + step * n;
     }
     size_t length()
     {
         return (pastLast - current) / step;
     }
}

Iota!(CommonType!(B, E), S)
iota(B, E, S)(B begin, E end, S step)
if (is(typeof((E.init - B.init) + 1 * S.init)))
{
     return Iota!(CommonType!(B, E), S)(begin, end, step);
}

I optimized things such that the commonly used path (many calls to 
empty, front, and popFront) is as fast as possible. The initial work 
will be amortized for most loops.

On my machine, test4 is still 2x slower than foreach_reverse in an 
optimized build. The disassembly reveals that the compiler refuses to 
inline some calls, which is disappointing as their bodies are very simple.


Andrei

Jun 21 2010

Brad Roberts <braddr puremagic.com> writes:

On 6/21/2010 8:12 PM, Andrei Alexandrescu wrote:
 
 I optimized things such that the commonly used path (many calls to
 empty, front, and popFront) is as fast as possible. The initial work
 will be amortized for most loops.
 
 On my machine, test4 is still 2x slower than foreach_reverse in an
 optimized build. The disassembly reveals that the compiler refuses to
 inline some calls, which is disappointing as their bodies are very simple.
 
 
 Andrei


If you would, file bugs with reduced tests and I'll work on the inliner.

Jun 21 2010

D Programming

C/C++ Programming

Other

digitalmars.D - Re: Errors in TDPL