digitalmars.D - Reduce has dreadful performance?
- Russel Winder via Digitalmars-d (33/33) Jun 18 2015 On a given machine, the code:
- Laeeth Isharc (2/30) Jun 18 2015 which compiler ?
- Laeeth Isharc (10/46) Jun 18 2015 fwiw n=1bn, delta=0.1
- weaselcat (3/31) Jun 18 2015 first two run in the same time on LDC, third one runs about
- Ilya Yaroshenko (43/71) Jun 18 2015 import std.stdio, std.datetime, std.algorithm, std.range,
- Andrea Fontana (7/18) Jun 18 2015 What about this:
- Ilya Yaroshenko (3/22) Jun 18 2015 This is not equivalent, because sum uses another (more precise)
- Russel Winder via Digitalmars-d (23/26) Jun 18 2015 [=E2=80=A6]
- Laeeth Isharc (2/18) Jun 18 2015 python setup.py pydexe --compiler=ldc
- Laeeth Isharc (5/31) Jun 18 2015 GDC doesn't work anyway, I think, because of the shared library
- John Colvin (3/29) Jun 19 2015 Just a note: if you're using PydMagic you can pass --compiler=ldc
- Russel Winder via Digitalmars-d (14/17) Jun 19 2015 I now have this working and reduce is no longer an embarrassment. Also w...
- Walter Bright (8/12) Jun 18 2015 I expect that at the moment, range+algorithms code will likely be somewh...
- weaselcat (3/17) Jun 18 2015 ldc and gdc have no issue optimizing the range code(range code is
- Andrei Alexandrescu (2/22) Jun 18 2015 Russel, do you have numbers for ldc by any chance? Thx! -- Andrei
- Russel Winder via Digitalmars-d (20/23) Jun 19 2015 Single data point only, and I need to check this, but:
- weaselcat (3/17) Jun 20 2015 Interesting, I saw a bigger speedup from the ranges.
- Russel Winder via Digitalmars-d (17/22) Jun 19 2015 reduce already has achieved parity using LDC =E2=80=93 once you install ...
- Martin Nowak (3/7) Jun 20 2015 Don't compare performance numbers on dmd, particularly not when
- John Colvin (6/34) Jun 18 2015 What you've forgotten is to turn on your optimisation flags. Even
- weaselcat (2/9) Jun 18 2015 third one was faster for me on ldc, fwiw
On a given machine, the code:

double sequential_loop(const int n, const double delta) {
  auto sum = 0.0;
  foreach (immutable i; 1 .. n + 1) {
    immutable x = (i - 0.5) * delta;
    sum += 1.0 / (1.0 + x * x);
  }
  return 4.0 * delta * sum;
}

runs in about 6.70s. However the code:

double sequential_reduce(const int n, const double delta) {
  return 4.0 * delta * reduce!((double t, int i){
    immutable x = (i - 0.5) * delta;
    return t + 1.0 / (1.0 + x * x);
  })(0.0, iota(1, n + 1));
}

runs in about 17.03s, whilst:

double sequential_reduce_alt(const int n, const double delta) {
  return 4.0 * delta * reduce!"a + b"(
    map!((int i){
      immutable x = (i - 0.5) * delta;
      return 1.0 / (1.0 + x * x);
    })(iota(1, n + 1)));
}

takes about 28.02s. Unless I am missing something (very possible), this is not going to create a good advert for D as an imperative language with declarative (internal iteration) expression.

--
Russel.
===========================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Jun 18 2015
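[Editorial note: with delta = 1/n these functions compute the midpoint-rule approximation of the integral of 4/(1+x^2) over [0, 1], which is pi. A quick Python sketch of the same computation, illustrative only, mirroring the D code above:]

```python
from functools import reduce


def sequential_loop(n, delta):
    # Midpoint-rule quadrature of 4/(1 + x*x); approximates pi when delta == 1/n.
    total = 0.0
    for i in range(1, n + 1):
        x = (i - 0.5) * delta
        total += 1.0 / (1.0 + x * x)
    return 4.0 * delta * total


def sequential_reduce(n, delta):
    # The same fold written with functools.reduce, mirroring the
    # std.algorithm.reduce version in the post.
    def step(t, i):
        x = (i - 0.5) * delta
        return t + 1.0 / (1.0 + x * x)

    return 4.0 * delta * reduce(step, range(1, n + 1), 0.0)
```

Both versions fold in the same order with the same operations, so they agree bit-for-bit; only the surface syntax differs.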
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
> [...]

which compiler ?
Jun 18 2015
On Thursday, 18 June 2015 at 10:34:46 UTC, Laeeth Isharc wrote:
> On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
>> [...]
> which compiler ?

fwiw n=1bn, delta=0.1

ldc no params: 8.749s/17.466s
ldc -O2:       4.541s/4.562s
gdc no params: 4.690/22.163 (!)
gdc -O2:       4.552/4.551
Jun 18 2015
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
> [...]

first two run in the same time on LDC, third one runs about 30-40% faster.
Jun 18 2015
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
> [...]

import std.stdio, std.datetime, std.algorithm, std.range, std.conv;

double sequential_loop(const int n, const double delta) {
  auto sum = 0.0;
  foreach (immutable i; 1 .. n + 1) {
    immutable x = (i - 0.5) * delta;
    sum += 1.0 / (1.0 + x * x);
  }
  return 4.0 * delta * sum;
}
// runs in about 6.70s. However the code:

double sequential_reduce(const int n, const double delta) {
  return 4.0 * delta * reduce!((double t, int i){
    immutable x = (i - 0.5) * delta;
    return t + 1.0 / (1.0 + x * x);
  })(0.0, iota(1, n + 1));
}
// runs in about 17.03s, whilst:

double sequential_reduce_alt(const int n, const double delta) {
  return 4.0 * delta * reduce!"a + b"(
    map!((int i){
      immutable x = (i - 0.5) * delta;
      return 1.0 / (1.0 + x * x);
    })(iota(1, n + 1)));
}

void main() {
  auto res = benchmark!(
    {sequential_loop(1000, 10);},
    {sequential_reduce(1000, 10);},
    {sequential_reduce_alt(1000, 10);},
  )(1000);
  writeln(res[].map!(to!Duration));
}

$ dmd -run test.d
[9 ms, 305 μs, and 9 hnsecs, 16 ms, 625 μs, and 3 hnsecs, 27 ms, 417 μs, and 3 hnsecs]

$ dmd -O -inline -release -run test.d
[4 ms, 567 μs, and 6 hnsecs, 4 ms, 853 μs, and 8 hnsecs, 5 ms, 52 μs, and 5 hnsecs]

OS X, DMD64 D Compiler v2.067.1

Looks like you forgot the optimisation troika "-O -inline -release".

Regards,
Ilya
Jun 18 2015
On Thursday, 18 June 2015 at 10:46:18 UTC, Ilya Yaroshenko wrote:
> [...]

What about this:

double sequential_alternative(const int n, const double delta) {
  return 4.0 * delta * iota(1, n + 1)
    .map!(x => (x - 0.5) * delta)
    .map!(x => 1.0 / (1.0 + x * x))
    .sum;
}

?
Jun 18 2015
On Thursday, 18 June 2015 at 10:50:09 UTC, Andrea Fontana wrote:
> [...]
> What about this:
>
> double sequential_alternative(const int n, const double delta) {
>   return 4.0 * delta * iota(1, n+1).map!(x => (x-0.5)*delta)
>     .map!(x => 1.0/(1.0 + x*x)).sum;
> }

This is not equivalent, because sum uses a different (more precise)
summation algorithm. It should work a little bit slower.
Jun 18 2015
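[Editorial note: the precision point is easy to see in miniature. A hedged Python sketch follows; Phobos's sum actually uses pairwise summation for random-access ranges, so math.fsum below merely stands in for "a more careful summation algorithm" and is not the Phobos implementation:]

```python
import math


def naive_sum(xs):
    # Left-to-right fold, like reduce!"a + b": one rounding error per addition.
    total = 0.0
    for x in xs:
        total += x
    return total


values = [0.1] * 10
print(naive_sum(values))   # 0.9999999999999999 -- accumulated rounding error
print(math.fsum(values))   # 1.0 -- error-compensated summation
```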
On Thu, 2015-06-18 at 10:46 +0000, Ilya Yaroshenko via Digitalmars-d wrote:
> […]
> Looks like you forgot optimisation troika "-O -inline -release"

Don't I feel stupid :-) Thanks to all for quick answers, most helpful.

Now I get

Loop:     3.14s
Reduce 1: 4.76s
Reduce 2: 5.12s

This is DMD 2.067

Anyone know how to get PyD to use LDC rather than DMD in the setuptools
build spec? (I can't use GDC just now as it is not packaged for Fedora.)

--
Russel.
Jun 18 2015
On Thursday, 18 June 2015 at 18:55:25 UTC, Russel Winder wrote:
> [...]
> Anyone know how to get PyD to use LDC rather than DMD in the
> setuptools build spec?

python setup.py pydexe --compiler=ldc
Jun 18 2015
On Thursday, 18 June 2015 at 20:18:31 UTC, Laeeth Isharc wrote:
> [...]
> python setup.py pydexe --compiler=ldc

GDC doesn't work anyway, I think, because of the shared library
problem. It would be great to have a dub setup so one can build in a
standard way without using Python.
Jun 18 2015
On Thursday, 18 June 2015 at 20:18:31 UTC, Laeeth Isharc wrote:
> [...]
> python setup.py pydexe --compiler=ldc

Just a note: if you're using PydMagic you can pass --compiler=ldc to %%pyd
Jun 19 2015
On Thu, 2015-06-18 at 20:18 +0000, Laeeth Isharc via Digitalmars-d wrote:
> […]
> python setup.py pydexe --compiler=ldc

I now have this working and reduce is no longer an embarrassment. Also
we have reaffirmed that LDC is the tool of choice and DMD unusable.
But this surprises no-one involved with CPU-bound activity I suspect.

--
Russel.
Jun 19 2015
On 6/18/2015 7:04 AM, Russel Winder via Digitalmars-d wrote:
> Now I get
> Loop:     3.14s
> Reduce 1: 4.76s
> Reduce 2: 5.12s

I expect that at the moment, range+algorithms code will likely be
somewhat slower than old-fashioned loops. This is because code
generators have been tuned for decades to do a great job with loops.

There's no intrinsic reason why ranges must do worse, so I expect
they'll achieve parity. Ranges can move ahead because they can reduce
the algorithmic complexity, whereas user-written loops tend to be
suboptimal.
Jun 18 2015
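[Editorial note: the internal-iteration style under discussion can be sketched with Python generator pipelines, which, like D's lazy ranges, fuse into a single pass with no intermediate arrays. This is an illustrative analogy, not D semantics:]

```python
def pi_pipeline(n, delta):
    # map -> map -> sum pipeline; each generator is consumed lazily,
    # so the whole chain runs as one pass over the index range,
    # analogous to a fused iota/map/reduce range pipeline in D.
    xs = ((i - 0.5) * delta for i in range(1, n + 1))
    terms = (1.0 / (1.0 + x * x) for x in xs)
    return 4.0 * delta * sum(terms)
```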
On Thursday, 18 June 2015 at 20:53:20 UTC, Walter Bright wrote:
> [...]
> There's no intrinsic reason why ranges must do worse, so I expect
> they'll achieve parity.

ldc and gdc have no issue optimizing the range code (range code is
actually faster with ldc)
Jun 18 2015
On 6/18/15 2:04 PM, weaselcat wrote:
> ldc and gdc have no issue optimizing the range code (range code is
> actually faster with ldc)

Russel, do you have numbers for ldc by any chance? Thx! -- Andrei
Jun 18 2015
On Thu, 2015-06-18 at 15:54 -0700, Andrei Alexandrescu via Digitalmars-d wrote:
> […]
> Russel, do you have numbers for ldc by any chance? Thx! -- Andrei

Single data point only, and I need to check this, but:

Loop:     1.44s
Reduce 1: 1.43s
Reduce 2: 1.42s

LDC seems to beat DMD into the ground. But then I think we all know
this ;-)

The PyD setuptools support appears to not properly set RPATH, which is
a bit annoying. Actually it is a lot annoying.

--
Russel.
Jun 19 2015
On Saturday, 20 June 2015 at 05:55:07 UTC, Russel Winder wrote:
> [...]
> Loop:     1.44s
> Reduce 1: 1.43s
> Reduce 2: 1.42s

Interesting, I saw a bigger speedup from the ranges.
Do you use windows? What LLVM/LDC version? flags?
Jun 20 2015
On Thu, 2015-06-18 at 13:53 -0700, Walter Bright via Digitalmars-d wrote:
> […]
> There's no intrinsic reason why ranges must do worse, so I expect
> they'll achieve parity.

reduce already has achieved parity using LDC – once you install it
properly, unlike what I had been doing.

Explicit loops are just so last millennium. :-)

PyData London 2015 here we come.

--
Russel.
Jun 19 2015
On Thursday, 18 June 2015 at 18:55:25 UTC, Russel Winder wrote:
> Loop:     3.14s
> Reduce 1: 4.76s
> Reduce 2: 5.12s
>
> This is DMD 2.067

Don't compare performance numbers on dmd, particularly not when
assessing abstraction overhead.
Jun 20 2015
On Thursday, 18 June 2015 at 10:27:58 UTC, Russel Winder wrote:
> [...]

What you've forgotten is to turn on your optimisation flags. Even DMD
gets the first 2 vaguely right with -O -inline -release

gdc is able to vectorize all 3 examples if you give it -Ofast

The 3rd one does appear to be a bit slower in all cases, not sure why
right now.
Jun 18 2015
On Thursday, 18 June 2015 at 13:20:04 UTC, John Colvin wrote:
> [...]
> The 3rd one does appear to be a bit slower in all cases, not sure
> why right now.

third one was faster for me on ldc, fwiw
Jun 18 2015