digitalmars.D.learn - Simple performance question from a newcomer
- dextorious (94/94) Feb 21 2016 I've been vaguely aware of D for many years, but the recent
- jmh530 (6/11) Feb 21 2016 I didn't look at your code that thoroughly, but it is generally
- Daniel Kozak via Digitalmars-d-learn (11/102) Feb 21 2016 You can use -profile to see what is causing it.
- Daniel Kozak via Digitalmars-d-learn (2/113) Feb 21 2016
- Jack Stouffer (15/20) Feb 21 2016 First off, you should really be using GDC or LDC if you want
- bachmeier (13/16) Feb 21 2016 First, a minor point, the D community is usually pretty careful
- dextorious (15/27) Feb 22 2016 While I certainly do not doubt the open mindedness of the D
- sigod (4/20) Feb 22 2016 I can't agree with that. Between `for` and `foreach` you should
- ZombineDev (30/124) Feb 21 2016 The problem is not with ranges, but with the particualr algorithm
- ZombineDev (34/40) Feb 21 2016 I did some more testing and clearly the larger times for N=1000
- ZombineDev (39/56) Feb 21 2016 Just for the record, with `DMD -release -O -inline`, Kahan,
- dextorious (35/46) Feb 22 2016 First of all, I am pleasantly surprised by the rapid influx of
- ixid (17/23) Feb 23 2016 Your experience is exactly what the D community needs to get
- Marc Schütz (8/10) Feb 23 2016 While I agree with most of what you're saying, I don't think we
- ixid (7/18) Feb 23 2016 Wouldn't it be better to have technically perfect implementations
- dextorious (12/23) Feb 23 2016 Being new to the language, I certainly make no claims about what
- Ali Çehreli (7/14) Feb 24 2016 According to Wikipedia, pairwise summation is the default algorithm in
- bachmeier (6/17) Feb 23 2016 His concern is with the default settings of Dub. I've tried Dub
- dextorious (42/65) Feb 23 2016 Personally, I think a few aspects of documentation for the
- jmh530 (9/25) Feb 23 2016 I think that's fair. I think part of the reason for the focus on
- Mike Parker (10/14) Feb 23 2016 If you're referring to this:
- dextorious (14/28) Feb 24 2016 There's part of what I'm referring to, yes. There doesn't seem to
- jmh530 (9/11) Feb 24 2016 There are examples like in the package format page
- Kapps (8/8) Feb 21 2016 If you do want to test the differences between the range approach
- Kapps (24/32) Feb 21 2016 Using LDC with the mir version of ndslice so it compiles, and the
I've been vaguely aware of D for many years, but the recent addition of std.experimental.ndslice finally inspired me to give it a try, since my main expertise lies in the domain of scientific computing and I primarily use Python/Julia/C++, where multidimensional arrays can be handled with a great deal of expressiveness and flexibility. Before writing anything serious, I wanted to get a sense for the kind of code I would have to write to get the best performance for numerical calculations, so I wrote a trivial summation benchmark. The following code gave me slightly surprising results:

import std.stdio;
import std.array : array;
import std.algorithm;
import std.datetime;
import std.range;
import std.experimental.ndslice;

void main() {
    int N = 1000;
    int Q = 20;
    int times = 1_000;
    double[] res1 = uninitializedArray!(double[])(N);
    double[] res2 = uninitializedArray!(double[])(N);
    double[] res3 = uninitializedArray!(double[])(N);
    auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
    StopWatch sw;
    double t0, t1, t2;

    sw.start();
    foreach (unused; 0..times) {
        for (int i=0; i<N; ++i) {
            res1[i] = sumtest1(f[i]);
        }
    }
    sw.stop();
    t0 = sw.peek().msecs;
    sw.reset();

    sw.start();
    foreach (unused; 0..times) {
        for (int i=0; i<N; ++i) {
            res2[i] = sumtest2(f[i]);
        }
    }
    sw.stop();
    t1 = sw.peek().msecs;
    sw.reset();

    sw.start();
    foreach (unused; 0..times) {
        sumtest3(f, res3, N, Q);
    }
    t2 = sw.peek().msecs;

    writeln(t0, " ms");
    writeln(t1, " ms");
    writeln(t2, " ms");

    assert( res1 == res2 );
    assert( res2 == res3 );
}

auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
    return range.sum;
}

auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
    double retval = 0.0;
    foreach (double f_ ; f) {
        retval += f_;
    }
    return retval;
}

auto sumtest3(Range)(Range f, double[] retval, double N, double Q) @safe pure nothrow @nogc {
    for (int i=0; i<N; ++i) {
        for (int j=1; j<Q; ++j) {
            retval[i] += f[i,j];
        }
    }
}

When I compiled it using dmd -release -inline -O -noboundscheck ../src/main.d, I got the following timings:

1268 ms
312 ms
271 ms

I had heard while reading up on the language that in D explicit loops are generally frowned upon and not necessary for the usual performance reasons. Nevertheless, the two explicit loop functions gave me an improvement by a factor of 4+. Furthermore, the difference between sumtest2 and sumtest3 seems to indicate that function calls have a significant overhead. I also tried using f.reduce!((a, b) => a + b) instead of f.sum in sumtest1, but that yielded even worse performance. I did not try the GDC/LDC compilers yet, since they don't seem to be up to date on the standard library and don't include the ndslice package last I checked.

Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?
Feb 21 2016
On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
> Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?

I didn't look at your code that thoroughly, but it is generally recommended that if you're concerned about performance you compile with gdc or ldc. dmd has fast compile times, but does not produce code that is as fast. You might want to check whether the performance differential is still large with one of those.
Feb 21 2016
You can use -profile to see what is causing it.

  Num          Tree           Func           Per
  Calls        Time           Time           Call
  23000000     550799875      550243765      23    pure nothrow @nogc @safe double std.algorithm.iteration.sumPairwise!(double, std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice).sumPairwise(std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice)

Dne 21.2.2016 v 15:32 dextorious via Digitalmars-d-learn napsal(a):
> [full original post with the benchmark code snipped]
Feb 21 2016
So I guess pairwise summation is the one to blame here.

Dne 21.2.2016 v 16:56 Daniel Kozak napsal(a):
> You can use -profile to see what is causing it.
> [profile output and the quoted original post snipped]
Feb 21 2016
On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
> Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?

First off, you should really be using GDC or LDC if you want speed. On how to do that, see my blog post here: http://jackstouffer.com/blog/nd_slice.html, specifically the section titled "Getting Hands On".

Secondly, both of your other sum examples use naive element-by-element summation rather than the more accurate pairwise summation which sum uses with random-access floating point ranges. So you're not really comparing apples to apples here. Since Phobos' pairwise summation is recursive, it's very likely that DMD isn't doing all the optimizations that LDC or GDC can, such as inlining or tail call optimizations. I haven't compiled your code so I can't check myself.

Also, templates have their attributes inferred automatically, so there's no reason to include @safe, nothrow, etc. on templated functions.
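For illustration (this sketch is not from the thread; the helper name naiveSum is made up), the two summation strategies being compared can be shown side by side on an ordinary slice. `sum` uses pairwise summation for random-access floating-point ranges, while `reduce` or a plain loop accumulates left to right:

import std.algorithm.iteration : sum, reduce;
import std.stdio : writefln;

// Hypothetical helper: plain left-to-right accumulation,
// equivalent to what the explicit loops in sumtest2/sumtest3 do.
double naiveSum(const double[] xs)
{
    double acc = 0.0;
    foreach (x; xs)
        acc += x;
    return acc;
}

void main()
{
    auto xs = new double[1000];
    xs[] = 0.1;

    // Phobos' sum picks pairwise summation here, trading some speed for accuracy.
    writefln("pairwise: %.17g", xs.sum);

    // reduce (or a hand-written loop) performs naive accumulation.
    writefln("naive:    %.17g", xs.reduce!((a, b) => a + b));
    writefln("loop:     %.17g", naiveSum(xs));
}

The last two results should match each other exactly; the pairwise result may differ from them in the low-order digits.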
Feb 21 2016
On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
> I had heard while reading up on the language that in D explicit loops are generally frowned upon and not necessary for the usual performance reasons.

First, a minor point, the D community is usually pretty careful not to frown on a particular coding style (unlike some communities), so if you are comfortable writing loops and it gives you the fastest code, you should do so.

On the performance issue, you can see this related post about performance with reduce:
http://forum.dlang.org/post/mailman.4829.1434623275.7663.digitalmars-d@puremagic.com

This was Walter's response:
http://forum.dlang.org/post/mlvb40$1tdf$1@digitalmars.com

And this shows that LDC flat out does a better job of optimization in this case:
http://forum.dlang.org/post/mailman.4899.1434779705.7663.digitalmars-d@puremagic.com
Feb 21 2016
On Sunday, 21 February 2016 at 16:20:30 UTC, bachmeier wrote:
> First, a minor point, the D community is usually pretty careful not to frown on a particular coding style (unlike some communities), so if you are comfortable writing loops and it gives you the fastest code, you should do so.
> [performance links snipped]

While I certainly do not doubt the open-mindedness of the D community, it was in part Walter Bright's statement during a keynote speech that "loops are bugs" that motivated me to look at D for a fresh approach to writing numerical code. For decades, explicit loops have been the only way to attain good performance for certain kinds of code in virtually all languages (discounting a few quirky high-level languages like MATLAB), and the notion that this need not be the case is quite attractive to many people, myself included. While the point Walter makes, that there is no mathematical reason ranges should be slower than loops and that loops are generally easier to get wrong, is certainly true, D is the first general-purpose language I've ever seen that makes this sentiment come close to reality.
Feb 22 2016
On Sunday, 21 February 2016 at 16:20:30 UTC, bachmeier wrote:
> On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
>> I had heard while reading up on the language that in D explicit loops are generally frowned upon and not necessary for the usual performance reasons.
> First, a minor point, the D community is usually pretty careful not to frown on a particular coding style (unlike some communities), so if you are comfortable writing loops and it gives you the fastest code, you should do so.
> [performance links snipped]

I can't agree with that. Between `for` and `foreach` you should choose the one that is more readable/understandable for the particular situation. It's the compiler's task to optimize such small things.
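To illustrate the point about readability (an added sketch, not part of the original post), the index-based `for` loops in the benchmark can be written as `foreach` over an index range without changing what they compute; a modern optimizer is expected to produce essentially the same code for both forms:

import std.stdio : writeln;

void main()
{
    auto data = [1.0, 2.0, 3.0, 4.0];
    double a = 0.0, b = 0.0;

    // Classic C-style loop over indices.
    for (size_t i = 0; i < data.length; ++i)
        a += data[i];

    // Equivalent foreach over the index range 0 .. data.length.
    foreach (i; 0 .. data.length)
        b += data[i];

    writeln(a, " ", b); // both print 10
}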
Feb 22 2016
On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
> [full original post with the benchmark code snipped]

The problem is not with ranges, but with the particular algorithm used for summing. If you look at the docs, you'll see that if the range has random access, `sum` will use the pairwise algorithm. About the second and third tests, the problem is with DMD, which should not be used when measuring performance (but only for development, because it has fast compile times).
These are the results that I get with LDC:

Pair-wise (sumtest1):
415 ms
21 ms
20 ms

And if I use the Kahan algorithm:
106 ms
36 ms
31 ms

The second two results are probably larger due to noise. And if I increase N to 100_000:

Pair-wise (sumtest1):
29557 ms
2061 ms
1990 ms

Kahan:
4566 ms
2067 ms
1990 ms

According to `dub --verbose`, my command-line was roughly this:
ldc2 -ofapp -release -O5 -singleobj -w source/app.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/internal.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/iteration.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/package.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/selection.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/slice.d
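For reference, a minimal sketch of Kahan (compensated) summation in D is shown below; this is the textbook algorithm, not necessarily the exact implementation benchmarked above:

// Kahan (compensated) summation: carries a running correction term
// so that small addends are not lost when added to a large partial sum.
double kahanSum(Range)(Range r)
{
    double sum = 0.0;
    double c = 0.0;          // running compensation for lost low-order bits
    foreach (double x; r)
    {
        immutable y = x - c; // apply the correction to the next element
        immutable t = sum + y;
        c = (t - sum) - y;   // (t - sum) recovers the high-order part of y
        sum = t;
    }
    return sum;
}

unittest
{
    assert(kahanSum([1.0, 2.0, 3.0]) == 6.0);
}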
Feb 21 2016
On Sunday, 21 February 2016 at 16:29:26 UTC, ZombineDev wrote:
> And if I use the Kahan algorithm:
> 106 ms
> 36 ms
> 31 ms
> The second two results are probably larger due to noise.

I did some more testing and clearly the larger times for N=1000 were just noise:

[LDC Kahan N=1000]
106 ms   36 ms   31 ms
 59 ms   24 ms   22 ms
 46 ms   21 ms   20 ms
 45 ms   21 ms   20 ms
 45 ms   21 ms   20 ms
 59 ms   24 ms   21 ms
102 ms   35 ms   30 ms
104 ms   37 ms   29 ms
107 ms   36 ms   31 ms
 46 ms   21 ms   20 ms
Feb 21 2016
On Sunday, 21 February 2016 at 16:36:22 UTC, ZombineDev wrote:
> I did some more testing and clearly the larger times for N=1000 were just noise:
> [LDC Kahan N=1000]
> 106 ms   36 ms   31 ms
>  59 ms   24 ms   22 ms
> ...

Just for the record, with `DMD -release -O -inline`, Kahan, N=1000 I get:

325 ms   217 ms   165 ms
231 ms   117 ms    58 ms
131 ms   109 ms    58 ms
131 ms   109 ms    57 ms
131 ms   112 ms    57 ms
125 ms   106 ms    55 ms
125 ms   104 ms    55 ms
125 ms   105 ms    55 ms
125 ms   104 ms    55 ms
230 ms   115 ms    58 ms
131 ms   112 ms    58 ms
131 ms   109 ms    57 ms
Feb 21 2016
First of all, I am pleasantly surprised by the rapid influx of helpful responses. The community here seems quite wonderful. In the interests of not cluttering the thread too much, since the advice given here has many commonalities, I will only try to respond once to each type of suggestion.

On Sunday, 21 February 2016 at 16:29:26 UTC, ZombineDev wrote:
> The problem is not with ranges, but with the particular algorithm used for summing. If you look at the docs, you'll see that if the range has random access, `sum` will use the pairwise algorithm. About the second and third tests, the problem is with DMD, which should not be used when measuring performance (but only for development, because it has fast compile times).
> ...
> According to `dub --verbose`, my command-line was roughly this:
> ldc2 -ofapp -release -O5 -singleobj -w source/app.d [mir ndslice sources snipped]

It appears that I cannot use the GDC compiler for this particular problem due to it using a comparatively older version of the DMD frontend (I understand Mir requires >=2.068), but I did manage to get LDC working on my system after a bit of work. Since I've been using dub to manage my project, I used the default "release" build type. I also tried compiling manually with LDC, using the -O5 switch you mentioned. These are the results (I increased the iteration count to lessen the noise, the array is now 10000x20, each function is run a thousand times):

            DMD        LDC (dub)   LDC (-release -enable-inlining -O5 -w -singleobj)
sumtest1:   12067 ms   6899 ms     1940 ms
sumtest2:    3076 ms   1349 ms      452 ms
sumtest3:    2526 ms    847 ms      434 ms
sumtest4:    5614 ms   1481 ms      452 ms

The sumtest1, 2 and 3 functions are as given in the first post; sumtest4 uses the range.reduce!((a, b) => a + b) approach to enforce naive summation. Much to my satisfaction, the range.reduce version is now exactly as quick as the traditional loop, and while function inlining isn't quite perfect, the 4% performance penalty incurred by the 10_000 function calls (or whatever inlined form the function finally takes) is quite acceptable.

I do have to wonder, however, about the default settings of dub in this case. Having gone through its documentation, I might still not have guessed to try the compiler options you provided, thereby losing out on a 2-3x performance improvement. What build options did you use in your dub.json that it managed to translate to the correct compiler switches?
Feb 22 2016
On Monday, 22 February 2016 at 15:43:23 UTC, dextorious wrote:
> I do have to wonder, however, about the default settings of dub in this case. Having gone through its documentation, I might still not have guessed to try the compiler options you provided, thereby losing out on a 2-3x performance improvement. What build options did you use in your dub.json that it managed to translate to the correct compiler switches?

Your experience is exactly what the D community needs to get right. You've come in as an interested user with patience, and initially D has offered slightly disappointing performance, both for technical reasons and because of the different compilers. You've gotten to the right place in the end, but we need point A to point B to be a lot smoother and more obvious so more people get a good initial impression of D.

Every D user thread seems to go like this: someone starts with DMD, they then struggle a little and hopefully get LDC working with a list of slightly obscure compiler switches offered. A standard algorithm performs disappointingly for somewhat valid technical reasons and more clunky alternatives are then deployed. We really need the standard algorithms to be fast, and perhaps have separate ones for perfect technical accuracy.

What are your thoughts on D now? What would have helped you get to the right place much faster?
Feb 23 2016
On Tuesday, 23 February 2016 at 11:10:40 UTC, ixid wrote:
> We really need the standard algorithms to be fast, and perhaps have separate ones for perfect technical accuracy.

While I agree with most of what you're saying, I don't think we should prioritize performance over accuracy or correctness. Especially for numerics people, precision is very important, and it can make a just as bad first impression if we don't get this right. We can however make the note in the documentation (which already talks about performance) a bit more prominent:
http://dlang.org/phobos/std_algorithm_iteration.html#sum
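As a rough illustration of why summation order and algorithm matter for floating point (an added sketch, not part of the original discussion), naive left-to-right accumulation can silently drop small addends once the running total is large, while simply processing the same data in a different order preserves them:

import std.stdio : writefln;

void main()
{
    // One huge value followed by many small ones: in left-to-right
    // accumulation, each 1.0 is rounded away when added to the ~1e16 total.
    auto xs = new double[1001];
    xs[0] = 1e16;
    xs[1 .. $] = 1.0;

    double forward = 0.0;
    foreach (x; xs)
        forward += x;

    double reversed = 0.0;
    foreach_reverse (x; xs)   // small values first, then the big one
        reversed += x;

    writefln("forward:  %.1f", forward);  // 1e16: the 1000 ones were all lost
    writefln("reversed: %.1f", reversed); // 1e16 + 1000: the small values survive
}

More careful algorithms (pairwise, Kahan) exist precisely to keep this kind of error bounded without requiring the caller to reorder the data.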
Feb 23 2016
On Tuesday, 23 February 2016 at 14:07:22 UTC, Marc Schütz wrote:
> While I agree with most of what you're saying, I don't think we should prioritize performance over accuracy or correctness. Especially for numerics people, precision is very important, and it can make a just as bad first impression if we don't get this right.

Wouldn't it be better to have technically perfect implementations for those numerics people? Sum is a basic function that almost everyone may want to use; this is a factor of four slowdown for the sake of one user group who could be perfectly well served by a sub-library that contains high-accuracy versions. It might make sense if the speed difference were only a few percent.
Feb 23 2016
On Tuesday, 23 February 2016 at 14:07:22 UTC, Marc Schütz wrote:
> While I agree with most of what you're saying, I don't think we should prioritize performance over accuracy or correctness. Especially for numerics people, precision is very important, and it can make a just as bad first impression if we don't get this right.

Being new to the language, I certainly make no claims about what the Phobos library should do, but coming from a heavy numerics background in many languages, I can say that this is the first time I've seen a common summation function do anything beyond naive summation. Some languages feature more accurate options separately, but never as the default, so it did not occur to me to specifically check the documentation for something like sum() (which is my fault, of course, no issues there). Having the more accurate pairwise summation algorithm in the standard library is certainly worthwhile for some applications, but I was a bit surprised to see it as the default.
Feb 23 2016
On 02/23/2016 12:12 PM, dextorious wrote:
> Some languages feature more accurate options separately, but never as the default, so it did not occur to me to specifically check the documentation for something like sum() (which is my fault, of course, no issues there). Having the more accurate pairwise summation algorithm in the standard library is certainly worthwhile for some applications, but I was a bit surprised to see it as the default.

According to Wikipedia, pairwise summation is the default algorithm in NumPy and Julia as well: "Pairwise summation is the default summation algorithm in NumPy and the Julia technical-computing language".

https://en.wikipedia.org/wiki/Pairwise_summation

Ali
Feb 24 2016
On Tuesday, 23 February 2016 at 11:10:40 UTC, ixid wrote:
> Your experience is exactly what the D community needs to get right. You've come in as an interested user with patience, and initially D has offered slightly disappointing performance, both for technical reasons and because of the different compilers.

His concern is with the default settings of Dub. I've tried Dub and given up several times, and I've been using D since 2013. The community needs to provide real documentation. It's embarrassing that it's pushed as the official package manager and will soon be included with DMD.
Feb 23 2016
On Tuesday, 23 February 2016 at 11:10:40 UTC, ixid wrote:
> [...] What are your thoughts on D now? What would have helped you get to the right place much faster?

Personally, I think a few aspects of documentation for the various compilers, dub and possibly the dlang.org website itself could be improved, if accessibility is considered important. For instance, just to take my journey with trying out D as an example, I can immediately list a few points where I misunderstood or failed to find relevant information:

1. While the dlang.org website does a good job presenting the three compilers side by side with a short pro/con list for each and does mention that DMD produces slower code, I did not at first expect the difference to be half an order of magnitude or more. In retrospect, after reading the forums and learning about how each compiler works, this is quite obvious, but the initial impression was misleading.

2. The LDC compiler gave me a few issues during setup, particularly on Windows. The binaries supplied are dynamically linked against the MSVS2015 runtime (and will fail on any other system) and seem to require a full Visual Studio installation. I assume there are good reasons for this (though I hope in the future a more widely usable version could be made available), but the fact itself could be made clearer on the download page (it can be found after some searching on the D wiki and the forums).

3. The documentation for the dub package is useful, but somewhat difficult to read due to how it is structured and does not seem complete. For instance, I am still not sure how to make it pass the -O5 switch to the LDC2 compiler, and the impression I got from the documentation is that explicit manual switches can only be supplied for the DMD compiler. It says that when using other compilers, the relevant switches are automatically translated to appropriate options for GDC/LDC, but no further details are supplied, and no matter what options I set for the DMD compiler, using --compiler=ldc2 only yields -O and not -O5. For the moment, I'm compiling my code and managing dependencies manually like I would in C++, which is just fine for me personally, but does leave a slightly disappointing impression about what is apparently considered a semi-official package manager for the D language.

Of course, this is just my anecdotal experience and should not be taken as major criticism. It may be that I missed something or did not do enough research. Certainly, some amount of adjustment is to be expected when learning a new language, but there does seem to be some room for improvement.
Feb 23 2016
On Tuesday, 23 February 2016 at 20:03:30 UTC, dextorious wrote:
> Personally, I think a few aspects of documentation for the various compilers, dub and possibly the dlang.org website itself could be improved, if accessibility is considered important.

Couldn't agree more.

> Being new to the language, I certainly make no claims about what the Phobos library should do, but coming from a heavy numerics background in many languages, I can say that this is the first time I've seen a common summation function do anything beyond naive summation. [...] Having the more accurate pairwise summation algorithm in the standard library is certainly worthwhile for some applications, but I was a bit surprised to see it as the default.

I think that's fair. I think part of the reason for the focus on accuracy over speed is that floats can have really weird behavior sometimes. For most people, it's better to be a little slower all the time in order to get the right answer all the time (or as often as possible with floats). And people who want more speed can look at the docs and figure out what they need to do to get more.
Feb 23 2016
On Tuesday, 23 February 2016 at 20:03:30 UTC, dextorious wrote:
> For instance, I am still not sure how to make it pass the -O5 switch to the LDC2 compiler, and the impression I got from the documentation is that explicit manual switches can only be supplied for the DMD compiler.

If you're referring to this:

"Additional flags passed to the D compiler - note that these flags are usually specific to the compiler in use, but a set of flags is automatically translated from DMD to the selected compiler"

My take is that a specific set of flags are automatically translated (so you don't need to make a separate dflags entry for each compiler you support if you only use those flags), but you can pass any compiler-specific flags you need.
Feb 23 2016
On Wednesday, 24 February 2016 at 03:33:14 UTC, Mike Parker wrote:
> My take is that a specific set of flags are automatically translated (so you don't need to make a separate dflags entry for each compiler you support if you only use those flags), but you can pass any compiler-specific flags you need.

There's part of what I'm referring to, yes. There doesn't seem to be any documentation on what gets translated and what doesn't. For the moment, the only way I've found to manually pass specific compiler options ("-O5 -singleobj" in my case) is by setting the dflags attribute when defining a buildType. However, there doesn't seem to be any way to specify different dflags for different compilers, so I am forced to introduce separately named buildTypes for each compiler. Since I still need to manually specify the compiler using the --compiler option when running dub, this feels like I'm using a hacky workaround rather than a consistently designed CLI. Furthermore, from the documentation, I have no idea if what I'm doing is the intended way or just an ugly hack around whatever piece of information I've missed.
Feb 24 2016
On Wednesday, 24 February 2016 at 19:15:23 UTC, dextorious wrote:
> However, there doesn't seem to be any way to specify different dflags for different compilers

There are examples like in the package format page:

"dflags-dmd": ["-vtls"],
"sourceFiles-windows-x86_64-dmd": ["lib/win32/mylib.lib"],

that would give some idea of how to do it. Combining them together, you are able to do things like:

"dflags-windows-x86_64-dmd": ["-vtls"],

So yes, it is do-able, but as you mention above, the docs page could use some work in making this functionality clear.
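Putting the pieces of this sub-thread together, a dub.json along these lines should pass the LDC-specific switches discussed earlier through a custom build type. This is a hedged sketch: the project name is made up, the flag set simply mirrors what was used earlier in the thread, and the "-ldc" suffix and buildTypes section are taken from dub's package-format documentation of the time, so verify the exact spelling against your dub version:

{
    "name": "sumbench",
    "dependencies": {
        "mir": "~>0.10.1-alpha"
    },
    "buildTypes": {
        "release-fast": {
            "buildOptions": ["releaseMode", "optimize", "inline"],
            "dflags-ldc": ["-O5", "-singleobj"]
        }
    }
}

It would then be invoked with something like: dub build --build=release-fast --compiler=ldc2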
Feb 24 2016
If you do want to test the differences between the range approach and the loop approach, something like:

auto sumtest4(Range)(Range range) @safe pure {
    return range.reduce!((a, b) => a + b);
}

is a more fair comparison. I get results within 15% of sumtest2 with this using dmd. I think with ldc this would be identical, but the version in homebrew is too old to compile this.
Feb 21 2016
On Monday, 22 February 2016 at 07:10:23 UTC, Kapps wrote:
> If you do want to test the differences between the range approach and the loop approach, something like:
>     auto sumtest4(Range)(Range range) @safe pure {
>         return range.reduce!((a, b) => a + b);
>     }
> is a more fair comparison. I get results within 15% of sumtest2 with this using dmd. I think with ldc this would be identical, but the version in homebrew is too old to compile this.

Using LDC with the mir version of ndslice so it compiles, and the following code:

sw.reset();
sw.start();
foreach (unused; 0..times) {
    for (int i=0; i<N; ++i) {
        res4[i] = sumtest4(f[i]);
    }
}
t3 = sw.peek().msecs;

and

auto sumtest4(Range)(Range range) {
    return range.reduce!((a, b) => a + b);
}

I get:

145 ms
19 ms
19 ms
19 ms

So, with LDC, there is no performance hit doing this. The only performance hit is when .sum uses a different algorithm for a more accurate result. Also, the LDC version appears to be roughly 5x faster than the DMD version.
Feb 21 2016