digitalmars.D.learn - D outperformed by C++, what am I doing wrong?
- amfvcg (115/115) Aug 12 2017 Hi all,
- rikki cattermole (3/138) Aug 12 2017 Dmd compiles quickly, but doesn't create all that optimized code.
- Roman Hargrave (8/21) Aug 12 2017 If I had to guess, this could be due to backend and optimizer.
- Neia Neutuladh (14/16) Aug 13 2017 Well, for one thing, you are preallocating in C++ code but not in
- Daniel Kozak via Digitalmars-d-learn (44/60) Aug 13 2017 this works ok for me with ldc compiler, gdc does not work on my arch
- Daniel Kozak via Digitalmars-d-learn (38/104) Aug 13 2017 Here is more D idiomatic way:
- amfvcg (18/111) Aug 13 2017 Thank you all for the replies. Good to know the community is
- Daniel Kozak via Digitalmars-d-learn (4/123) Aug 13 2017 my second version on ldc takes 380ms and c++ version on same compiler
- amfvcg (6/8) Aug 13 2017 Ok, on ideone (ldc 1.1.0) it timeouts, on dpaste (ldc 0.12.0) it
- ikod (59/68) Aug 13 2017 import std.stdio : writeln;
- amfvcg (3/8) Aug 13 2017 And you've compiled it with?
- ikod (2/11) Aug 13 2017 DMD64 D Compiler v2.074.1
- Petar Kirov [ZombineDev] (22/31) Aug 13 2017 Here are my results:
- Petar Kirov [ZombineDev] (8/41) Aug 13 2017 With Daniel's latest version (
- amfvcg (9/15) Aug 13 2017 Great!!! And that's what I was hoping for.
- Petar Kirov [ZombineDev] (57/74) Aug 13 2017 There's one especially interesting result:
- amfvcg (4/61) Aug 13 2017 Change the parameter for this array size to be taken from stdin
- Johan Engelen (25/27) Aug 13 2017 This is paramount for all of the testing, examining, and
- Petar Kirov [ZombineDev] (27/55) Aug 13 2017 I agree that in general this is not the right way to benchmark. I
- Johan Engelen (7/23) Aug 13 2017 Execution of sum_subranges is already O(1), because the
- Petar Kirov [ZombineDev] (4/13) Aug 13 2017 Heh, yeah you're absolutely right. I was just about to correct
- Daniel Kozak via Digitalmars-d-learn (3/133) Aug 13 2017 this one is even faster than c++:
- Daniel Kozak via Digitalmars-d-learn (3/142) Aug 13 2017 And this one is awesome :P
- data pulverizer (5/23) Aug 17 2017 From time to time the forum gets questions like these. It would
Hi all, I'm solving the following task: given a container T and a value R, return the sums of the R-sized sub-ranges of T. Examples:

input : T=[1,1,1] R=2   output : [2, 1]
input : T=[1,2,3] R=1   output : [1,2,3]

(see the D unittests for more examples)

The C++ code below, compiled with g++-5.4.0 -O2 -std=c++14, runs on my machine in 656 836 us. The D code below, compiled with dmd v2.067.1 -O, runs on my machine in ~14.5 sec. Each language has its own "way of programming", and as I'm a beginner in D, I'm probably running through the bushes instead of taking the highway. Therefore I'd like to ask you, experienced D devs, to shed some light on "how to do it the D way".

C++ code:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <iterator>
#include <list>
#include <map>
#include <string>
#include <utility>
#include <numeric>
#include <vector>

template<typename T, typename K>
std::vector<K> sum_elements(const T& beg, const T& end, std::size_t k, K def)
{
    if (k == 0) {
        return std::vector<K>{};
    }
    return sum_elements(beg, end, k, def, [](auto &l, auto &r){ return r+l; });
}

template<typename T, typename K, class BinaryOp>
std::vector<K> sum_elements(const T& beg, const T& end, std::size_t k, K def, BinaryOp op)
{
    std::vector<K> out;
    out.reserve((std::distance(beg, end) - 1)/k + 1);
    for (auto it = beg; it != end;
         std::advance(it, std::min(static_cast<std::size_t>(std::distance(it, end)), k)))
    {
        out.push_back(std::accumulate(
            it,
            std::next(it, std::min(static_cast<std::size_t>(std::distance(it, end)), k)),
            def, op));
    }
    return out;
}

int main()
{
    std::vector<int> vec;
    auto size = 1000000;
    vec.reserve(size);
    for (int i=0; i < size; ++i)
        vec.push_back(i);

    auto beg = std::chrono::system_clock::now();
    auto sum = 0;
    for (int i=0; i < 100; i++)
        sum += sum_elements(vec.begin(), vec.end(), 2, 0).size();
    auto end = std::chrono::system_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end-beg).count() << std::endl;
    std::cout << sum << std::endl;
    return sum;
}

D code:

import std.stdio : writeln;
import std.algorithm.comparison : min;
import std.algorithm.iteration : sum;
import core.time : MonoTime, Duration;

T[] sum_subranges(T)(T[] input, uint range)
{
    T[] result;
    if (range == 0)
    {
        return result;
    }
    for (uint i; i < input.length; i = min(i+range, input.length))
    {
        result ~= sum(input[i..min(i+range, input.length)]);
    }
    return result;
}

unittest
{
    assert(sum_subranges([1,1,1], 2) == [2, 1]);
    assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
    assert(sum_subranges([], 2) == []);
    assert(sum_subranges([1], 2) == [1]);
    assert(sum_subranges([1], 0) == []);
}

int main()
{
    int[1000000] v;
    for (int i=0; i < 1000000; ++i)
        v[i] = i;

    int sum;
    MonoTime beg = MonoTime.currTime;
    for (int i=0; i < 100; i++)
        sum += cast(int)sum_subranges(v, 2).length;
    MonoTime end = MonoTime.currTime;
    writeln(end-beg);
    writeln(sum);
    return sum;
}
Aug 12 2017
On 13/08/2017 7:09 AM, amfvcg wrote:
> Hi all, I'm solving below task: [...]

Dmd compiles quickly, but doesn't create all that optimized code. Try ldc or gdc and get back to us about it ;)
Aug 12 2017
On Sunday, 13 August 2017 at 06:09:39 UTC, amfvcg wrote:
> Below c++ code compiled with g++-5.4.0 -O2 -std=c++14 runs on my machine in 656 836 us. Below D code compiled with dmd v2.067.1 -O runs on my machine in ~ 14.5 sec. [...]

If I had to guess, this could be due to the backend and optimizer. I don't want to go into detail on my thoughts because I am not an expert on codegen optimization, but I'd suggest running your test compiled with GDC (using optimization settings identical to g++'s) and with ldc2 using similar settings.
Aug 12 2017
On Sunday, 13 August 2017 at 06:09:39 UTC, amfvcg wrote:
> Hi all, I'm solving below task: [...]

Well, for one thing, you are preallocating in the C++ code but not in D. On my machine, your version of the code completes in 3.175 seconds. Changing it a little reduces that to 0.420s:

    T[] result = new T[input.length];
    size_t o = 0;
    for (uint i; i < input.length; i = min(i+range, input.length))
    {
        result[o] = sum(input[i..min(i+range, input.length)]);
        o++;
    }
    return result[0..o];

You can also use Appender from std.array.
Aug 13 2017
This works OK for me with the ldc compiler; gdc does not work on my Arch machine, so I cannot do a comparison with your C++ version (clang does not work with your C++ code):

import std.stdio : writeln;
import std.algorithm.comparison : min;
import std.algorithm.iteration : sum;
import core.time : MonoTime, Duration;

T[] sum_subranges(T)(T[] input, uint range)
{
    import std.array : appender;
    auto app = appender!(T[])();
    if (range == 0)
    {
        return app.data;
    }
    for (uint i; i < input.length; i = min(i+range, input.length))
    {
        app.put(sum(input[i..min(i+range, input.length)]));
    }
    return app.data;
}

unittest
{
    assert(sum_subranges([1,1,1], 2) == [2, 1]);
    assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
    assert(sum_subranges([], 2) == []);
    assert(sum_subranges([1], 2) == [1]);
    assert(sum_subranges([1], 0) == []);
}

int main()
{
    import std.range : iota, array;
    auto v = iota(0, 1000000).array;
    int sum;
    MonoTime beg = MonoTime.currTime;
    for (int i=0; i < 100; i++)
        sum += cast(int)sum_subranges(v, 2).length;
    MonoTime end = MonoTime.currTime;
    writeln(end-beg);
    writeln(sum);
    return sum;
}

On Sun, Aug 13, 2017 at 9:03 AM, Neia Neutuladh via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:
> Well, for one thing, you are preallocating in C++ code but not in D. [...]
Aug 13 2017
Here is a more idiomatic D way:

import std.stdio : writeln;
import std.algorithm.comparison : min;
import std.algorithm.iteration : sum;
import core.time : MonoTime, Duration;

auto sum_subranges(T)(T input, uint range)
{
    import std.array : array;
    import std.range : chunks, ElementType;
    import std.algorithm : map;
    if (range == 0)
    {
        return ElementType!(T)[].init;
    }
    return input.chunks(range).map!(sum).array;
}

unittest
{
    assert(sum_subranges([1,1,1], 2) == [2, 1]);
    assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
    assert(sum_subranges([], 2) == []);
    assert(sum_subranges([1], 2) == [1]);
    assert(sum_subranges([1], 0) == []);
}

int main()
{
    import std.range : iota, array;
    auto v = iota(0, 1000000);
    int sum;
    MonoTime beg = MonoTime.currTime;
    for (int i=0; i < 100; i++)
        sum += cast(int)sum_subranges(v, 2).length;
    MonoTime end = MonoTime.currTime;
    writeln(end-beg);
    writeln(sum);
    return sum;
}

On Sun, Aug 13, 2017 at 9:13 AM, Daniel Kozak <kozzi11 gmail.com> wrote:
> this works ok for me with ldc compiler [...]
Aug 13 2017
On Sunday, 13 August 2017 at 07:30:32 UTC, Daniel Kozak wrote:
> Here is more D idiomatic way: [...]

Thank you all for the replies. Good to know the community is alive in D :)

Let's settle the playground:
D  : http://ideone.com/h4fnsD
C++: http://ideone.com/X1pyXG

Both using GCC under the hood. C++ in 112 ms; D in:
- 2.5 sec with the original source;
- 2.5 sec with Daniel's 1st version;
- 5 sec timeout exceeded with Daniel's 2nd version;
- 1.8 sec with Neia-like preallocation.

So it's still not that neat. (What's interesting: the C++ code generates 2 KLOC of assembly, and the D code under ldc 12 KLOC - checked at godbolt.)

P.S. For the C++ version to work under clang, the function which takes (BinaryOp) must go before the other one (my bad).
Aug 13 2017
My second version on ldc takes 380 ms, and the C++ version on the same compiler (clang) takes 350 ms, so they seem to be almost the same.

On Sun, Aug 13, 2017 at 9:51 AM, amfvcg via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:
> Let's settle the playground:
> D  : http://ideone.com/h4fnsD
> C++: http://ideone.com/X1pyXG
> [...]
Aug 13 2017
On Sunday, 13 August 2017 at 08:00:53 UTC, Daniel Kozak wrote:
> my second version on ldc takes 380ms and c++ version on same compiler (clang), takes 350ms, so it seems to be almost same

Ok, on ideone (ldc 1.1.0) it times out; on dpaste (ldc 0.12.0) it gets killed. What version are you using? Either way, if that's the case - that's slick (and ldc would be the compiler of choice for real use cases).
Aug 13 2017
On Sunday, 13 August 2017 at 08:13:56 UTC, amfvcg wrote:
> Ok, on ideone (ldc 1.1.0) it timeouts, on dpaste (ldc 0.12.0) it gets killed. What version are you using? [...]

import std.stdio : writeln;
import std.algorithm.comparison : min;
import std.algorithm.iteration : sum;
import core.time : MonoTime, Duration;
import std.range;
import std.algorithm;

auto s1(T)(T input, uint r)
{
    return input.chunks(r).map!sum;
}

T[] sum_subranges(T)(T[] input, uint range)
{
    T[] result;
    if (range == 0)
    {
        return result;
    }
    for (uint i; i < input.length; i = min(i+range, input.length))
    {
        result ~= sum(input[i..min(i+range, input.length)]);
    }
    return result;
}

unittest
{
    assert(sum_subranges([1,1,1], 2) == [2, 1]);
    assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
    assert(sum_subranges([], 2) == []);
    assert(sum_subranges([1], 2) == [1]);
    assert(sum_subranges([1], 0) == []);
    assert(s1([1,1,1], 2).array == [2, 1]);
    assert(s1([1,1,1,2,3,3], 2).array == [2, 3, 6]);
}

int main()
{
    int sum;
    MonoTime beg0 = MonoTime.currTime;
    for (int i=0; i < 100; i++)
        sum += s1(iota(1000000), 2).length;
    MonoTime end0 = MonoTime.currTime;
    writeln(end0-beg0);
    writeln(sum);

    sum = 0;
    int[1000000] v;
    for (int i=0; i < 1000000; ++i)
        v[i] = i;
    MonoTime beg = MonoTime.currTime;
    for (int i=0; i < 100; i++)
        sum += cast(int)sum_subranges(v, 2).length;
    MonoTime end = MonoTime.currTime;
    writeln(end-beg);
    writeln(sum);
    return sum;
}

Gives me:

5 μs and 2 hnsecs
50000000
3 secs, 228 ms, 837 μs, and 4 hnsecs
50000000
Aug 13 2017
> Gives me:
> 5 μs and 2 hnsecs
> 50000000
> 3 secs, 228 ms, 837 μs, and 4 hnsecs
> 50000000

And you've compiled it with? Btw. clang for the C++ version works worse than gcc (for this case: 112 ms vs 180 ms).
Aug 13 2017
On Sunday, 13 August 2017 at 08:32:50 UTC, amfvcg wrote:
> And you've compiled it with?

DMD64 D Compiler v2.074.1
Aug 13 2017
On Sunday, 13 August 2017 at 08:13:56 UTC, amfvcg wrote:
> Ok, on ideone (ldc 1.1.0) it timeouts, on dpaste (ldc 0.12.0) it gets killed. What version are you using? [...]

Here are my results:

$ uname -sri
Linux 4.10.0-28-generic x86_64

$ lscpu | grep 'Model name'
Model name: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz

$ ldc2 --version | head -n5
LDC - the LLVM D compiler (1.3.0):
  based on DMD v2.073.2 and LLVM 4.0.0
  built with LDC - the LLVM D compiler (1.3.0)
  Default target: x86_64-unknown-linux-gnu
  Host CPU: ivybridge

$ g++ --version | head -n1
g++ (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406

$ ldc2 -O3 --release sum_subranges.d
$ ./sum_subranges
378 ms, 556 μs, and 9 hnsecs
50000000

$ g++ -O5 sum_subranges.cpp -o sum_subranges
$ ./sum_subranges
237135
50000000
Aug 13 2017
On Sunday, 13 August 2017 at 08:29:30 UTC, Petar Kirov [ZombineDev] wrote:
> Here are my results: [...]

With Daniel's latest version (http://forum.dlang.org/post/mailman.5963.1502612885.31550.digitalmars-d learn puremagic.com):

$ ldc2 -O3 --release sum_subranges2.d
$ ./sum_subranges2
210 ms, 838 μs, and 8 hnsecs
50000000
Aug 13 2017
On Sunday, 13 August 2017 at 08:33:53 UTC, Petar Kirov [ZombineDev] wrote:
> With Daniel's latest version [...]
> $ ldc2 -O3 --release sum_subranges2.d
> $ ./sum_subranges2
> 210 ms, 838 μs, and 8 hnsecs
> 50000000

Great!!! And that's what I was hoping for. So the conclusion is: use the latest ldc that's out there. Thank you Petar, thank you Daniel. (I cannot change the subject to SOLVED, can I?)

Btw. the idiomatic version of this D sample looks just how I imagined it should!
Aug 13 2017
On Sunday, 13 August 2017 at 08:43:29 UTC, amfvcg wrote:
> Great!!! And that's what I was hoping for. [...]

There's one especially interesting result. This instantiation:

sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint)

of the following function:

auto sum_subranges(T)(T input, uint range)
{
    import std.range : chunks, ElementType, array;
    import std.algorithm : map;
    return input.chunks(range).map!(sum);
}

gets optimized with LDC to:

        push    rax
        test    edi, edi
        je      .LBB2_2
        mov     edx, edi
        mov     rax, rsi
        pop     rcx
        ret
.LBB2_2:
        lea     rsi, [rip + .L.str.3]
        lea     rcx, [rip + .L.str]
        mov     edi, 45
        mov     edx, 89
        mov     r8d, 6779
        call    _d_assert_msg@PLT

I.e. the compiler turned an O(n) algorithm into O(1), which is quite neat. It is also quite surprising to me that it looks like even dmd managed to do a similar optimization:

sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint):
        push   rbp
        mov    rbp,rsp
        sub    rsp,0x30
        mov    DWORD PTR [rbp-0x8],edi
        mov    r9d,DWORD PTR [rbp-0x8]
        test   r9,r9
        jne    41
        mov    r8d,0x1b67
        mov    ecx,0x0
        mov    eax,0x61
        mov    rdx,rax
        mov    QWORD PTR [rbp-0x28],rdx
        mov    edx,0x0
        mov    edi,0x2d
        mov    rsi,rdx
        mov    rdx,QWORD PTR [rbp-0x28]
        call   41
41:
        mov    QWORD PTR [rbp-0x20],rsi
        mov    QWORD PTR [rbp-0x18],r9
        mov    rdx,QWORD PTR [rbp-0x18]
        mov    rax,QWORD PTR [rbp-0x20]
        mov    rsp,rbp
        pop    rbp
        ret

Moral of the story: templates + ranges are an awesome combination.
Aug 13 2017
On Sunday, 13 August 2017 at 09:08:14 UTC, Petar Kirov [ZombineDev] wrote:
> There's one especially interesting result: [...]
> Moral of the story: templates + ranges are an awesome combination.

Change the parameter for this array size to be taken from stdin and I assume that these optimizations will go away.
Aug 13 2017
On Sunday, 13 August 2017 at 09:15:48 UTC, amfvcg wrote:
> Change the parameter for this array size to be taken from stdin and I assume that these optimizations will go away.

This is paramount for all of the testing, examining, and comparisons that are discussed in this thread. Full information is given to the compiler, and you are basically testing the constant folding power of the compilers (not unimportant). No runtime calculation is needed for the sum. Your program could be optimized to the following code:

```
void main() {
    MonoTime beg = MonoTime.currTime;
    MonoTime end = MonoTime.currTime;
    writeln(end-beg);
    writeln(50000000);
}
```

So actually you should be more surprised that the reported time is not near-zero (just the time between two `MonoTime.currTime` calls)!

Instead of `iota(1,1000000)`, you should initialize the array with random numbers, with a randomization seed given by the user (e.g. command-line argument or stdin). Then the program will actually have to do the runtime calculations that I assume you are expecting it to perform.

- Johan
Aug 13 2017
On Sunday, 13 August 2017 at 09:56:44 UTC, Johan Engelen wrote:
> This is paramount for all of the testing, examining, and comparisons that are discussed in this thread. Full information is given to the compiler, and you are basically testing the constant folding power of the compilers (not unimportant).

I agree that in general this is not the right way to benchmark. However, I am interested specifically in the pattern matching / constant folding abilities of the compiler. I would have expected `sum(iota(1, N + 1))` to be replaced with `(N*(N+1))/2`. LDC already does this optimization in some cases. I have opened an issue for some of the rest: https://github.com/ldc-developers/ldc/issues/2271

> So actually you should be more surprised that the reported time is not equal to near-zero (just the time between two `MonoTime.currTime` calls)!

On Posix, `MonoTime.currTime`'s implementation uses clock_gettime(CLOCK_MONOTONIC, ...), which is quite a bit more involved than simply using the rdtsc instruction on x86. See: http://linuxmogeb.blogspot.bg/2013/10/how-does-clockgettime-work.html

On Windows, `MonoTime.currTime` uses QueryPerformanceCounter, which on Win 7 and later uses the rdtsc instruction, which makes it quite streamlined. In some testing I did several months ago, QueryPerformanceCounter had really good latency and precision (though I forgot the exact numbers I got).

> Instead of `iota(1,1000000)`, you should initialize the array with random numbers with a randomization seed given by the user (e.g. commandline argument or stdin). Then, the program will actually have to do the runtime calculations that I assume you are expecting it to perform.

Agreed, though I think Phobos's unpredictableSeed does an OK job w.r.t. seeding, so unless you want to repeat the benchmark on the exact same dataset, something like this does a good job:

T[] generate(T)(size_t size)
{
    import std.algorithm.iteration : map;
    import std.range : array, iota;
    import std.random : uniform;
    return size.iota.map!(_ => uniform!T()).array;
}
Aug 13 2017
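For the reproducible-dataset case Johan mentions (re-running the benchmark on the exact same data), a seeded variant of the `generate` helper above could look like this. This is a sketch; the extra `seed` parameter is my addition, not code from the thread:

```d
import std.algorithm.iteration : map;
import std.array : array;
import std.random : Random, uniform;
import std.range : iota;

// Like the generate() helper above, but with an explicit seed so the
// same benchmark dataset can be regenerated across runs and compilers.
T[] generate(T)(size_t size, uint seed)
{
    auto rng = Random(seed);
    return size.iota.map!(_ => uniform!T(rng)).array;
}

void main()
{
    // Same seed, same data: runs are directly comparable.
    assert(generate!int(1_000, 42) == generate!int(1_000, 42));
    // A different seed gives (almost surely) different data.
    assert(generate!int(1_000, 42) != generate!int(1_000, 43));
}
```

Passing the seed in from the command line or stdin, as Johan suggests, keeps the compiler from constant-folding the whole computation away while still letting you reproduce a run.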
On Sunday, 13 August 2017 at 09:08:14 UTC, Petar Kirov [ZombineDev] wrote:
> This instantiation:
>
> sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint)
>
> of the following function:
>
> ```
> auto sum_subranges(T)(T input, uint range)
> {
>     import std.range : chunks, ElementType, array;
>     import std.algorithm : map;
>
>     return input.chunks(range).map!(sum);
> }
> ```
>
> gets optimized with LDC to: [snip]
>
> I.e. the compiler turned an O(n) algorithm into O(1), which is quite neat. It is also quite surprising to me that it looks like even dmd managed to do a similar optimization: [snip]

Execution of sum_subranges is already O(1), because the calculation of the sum is delayed: the return type of the function is not `uint`, it is `MapResult!(sum, <range>)`, which does a lazy evaluation of the sum.

- Johan
Aug 13 2017
On Sunday, 13 August 2017 at 09:41:39 UTC, Johan Engelen wrote:
> On Sunday, 13 August 2017 at 09:08:14 UTC, Petar Kirov [ZombineDev] wrote:
>> [...]
> Execution of sum_subranges is already O(1), because the calculation of the sum is delayed: the return type of the function is not `uint`, it is `MapResult!(sum, <range>)`, which does a lazy evaluation of the sum.
>
> - Johan

Heh, yeah, you're absolutely right. I was just about to correct myself when I saw your reply. Don't know how I missed such an obvious thing :D
Aug 13 2017
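Johan's point - that `chunks(range).map!(sum)` only builds a lazy wrapper and computes no sums until the range is actually iterated - can be sketched like this (my example, not code from the thread):

```d
import std.algorithm.iteration : map, sum;
import std.array : array;
import std.range : chunks, iota;

void main()
{
    auto input = iota(1, 7); // 1, 2, 3, 4, 5, 6

    // No sums are computed here: this only constructs a MapResult wrapper.
    auto view = input.chunks(2).map!(sum);

    // .length is forwarded from the underlying range in O(1); still no sums.
    assert(view.length == 3);

    // Only forcing the range (e.g. with .array, or foreach) evaluates the sums.
    assert(view.array == [3, 7, 11]);
}
```

This is why the benchmark loop that only reads `.length` measures almost nothing: the summation work is never forced.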
this one is even faster than c++: http://ideone.com/TRDsOo

On Sun, Aug 13, 2017 at 10:00 AM, Daniel Kozak <kozzi11 gmail.com> wrote:
> my second version on ldc takes 380 ms and the c++ version on the same compiler (clang) takes 350 ms, so it seems to be almost the same
>
> On Sun, Aug 13, 2017 at 9:51 AM, amfvcg via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:
>> Thank you all for the replies. Good to know the community is alive in d :)
>>
>> Let's settle the playground:
>> D : http://ideone.com/h4fnsD
>> C++: http://ideone.com/X1pyXG
>> Both using GCC under the hood.
>>
>> C++ in 112 ms; D in:
>> - 2.5 sec with original source;
>> - 2.5 sec with Daniel's 1st version;
>> - 5 sec timeout exceeded with Daniel's 2nd version;
>> - 1.8 sec with Neia-like preallocation;
>>
>> So still it's not that neat. (What's interesting: the C++ code generates 2 KLOC of assembly, and the D code under ldc 12 KLOC - checked at godbolt.)
>>
>> P.S. For the C++ version to work under clang, the function which takes (BinaryOp) must go before the other one (my bad).
>>
>> On Sunday, 13 August 2017 at 07:30:32 UTC, Daniel Kozak wrote:
>>> Here is a more idiomatic D way:
>>>
>>> ```
>>> import std.stdio : writeln;
>>> import std.algorithm.comparison : min;
>>> import std.algorithm.iteration : sum;
>>> import core.time : MonoTime, Duration;
>>>
>>> auto sum_subranges(T)(T input, uint range)
>>> {
>>>     import std.array : array;
>>>     import std.range : chunks, ElementType;
>>>     import std.algorithm : map;
>>>
>>>     if (range == 0)
>>>     {
>>>         return ElementType!(T)[].init;
>>>     }
>>>     return input.chunks(range).map!(sum).array;
>>> }
>>>
>>> unittest
>>> {
>>>     assert(sum_subranges([1,1,1], 2) == [2, 1]);
>>>     assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
>>>     assert(sum_subranges([], 2) == []);
>>>     assert(sum_subranges([1], 2) == [1]);
>>>     assert(sum_subranges([1], 0) == []);
>>> }
>>>
>>> int main()
>>> {
>>>     import std.range : iota, array;
>>>
>>>     auto v = iota(0, 1000000);
>>>     int sum;
>>>     MonoTime beg = MonoTime.currTime;
>>>     for (int i = 0; i < 100; i++)
>>>         sum += cast(int)sum_subranges(v, 2).length;
>>>     MonoTime end = MonoTime.currTime;
>>>     writeln(end - beg);
>>>     writeln(sum);
>>>     return sum;
>>> }
>>> ```
>>>
>>> On Sun, Aug 13, 2017 at 9:13 AM, Daniel Kozak <kozzi11 gmail.com> wrote:
>>>> this works ok for me with the ldc compiler; gdc does not work on my arch machine, so I can not do a comparison to your c++ version (clang does not work with your c++ code)
>>>>
>>>> ```
>>>> import std.stdio : writeln;
>>>> import std.algorithm.comparison : min;
>>>> import std.algorithm.iteration : sum;
>>>> import core.time : MonoTime, Duration;
>>>>
>>>> T[] sum_subranges(T)(T[] input, uint range)
>>>> {
>>>>     import std.array : appender;
>>>>
>>>>     auto app = appender!(T[])();
>>>>     if (range == 0)
>>>>     {
>>>>         return app.data;
>>>>     }
>>>>     for (uint i; i < input.length; i = min(i + range, input.length))
>>>>     {
>>>>         app.put(sum(input[i .. min(i + range, input.length)]));
>>>>     }
>>>>     return app.data;
>>>> }
>>>>
>>>> unittest
>>>> {
>>>>     assert(sum_subranges([1,1,1], 2) == [2, 1]);
>>>>     assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
>>>>     assert(sum_subranges([], 2) == []);
>>>>     assert(sum_subranges([1], 2) == [1]);
>>>>     assert(sum_subranges([1], 0) == []);
>>>> }
>>>>
>>>> int main()
>>>> {
>>>>     import std.range : iota, array;
>>>>
>>>>     auto v = iota(0, 1000000).array;
>>>>     int sum;
>>>>     MonoTime beg = MonoTime.currTime;
>>>>     for (int i = 0; i < 100; i++)
>>>>         sum += cast(int)sum_subranges(v, 2).length;
>>>>     MonoTime end = MonoTime.currTime;
>>>>     writeln(end - beg);
>>>>     writeln(sum);
>>>>     return sum;
>>>> }
>>>> ```
>>>>
>>>> On Sun, Aug 13, 2017 at 9:03 AM, Neia Neutuladh via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote: [...]
Aug 13 2017
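On the preallocation point raised above ("Neia-like preallocation"): the appender-based version quoted in this thread can reserve its output capacity up front, much like a `std::vector::reserve` call on the C++ side. A sketch, where the `reserve` line and the simplified loop increment are my additions to Daniel's function:

```d
import std.algorithm.comparison : min;
import std.algorithm.iteration : sum;
import std.array : appender;

T[] sum_subranges(T)(T[] input, uint range)
{
    auto app = appender!(T[])();
    if (range == 0)
        return app.data;
    // Preallocate the exact number of output elements (ceiling division),
    // avoiding repeated reallocations while appending.
    app.reserve((input.length + range - 1) / range);
    for (size_t i; i < input.length; i += range)
        app.put(sum(input[i .. min(i + range, input.length)]));
    return app.data;
}

void main()
{
    assert(sum_subranges([1, 1, 1], 2) == [2, 1]);
    assert(sum_subranges([1, 1, 1, 2, 3, 3], 2) == [2, 3, 6]);
    assert(sum_subranges(cast(int[])[], 2) == []);
}
```

Reserving capacity does not change the algorithmic complexity, but it removes the reallocation-and-copy churn that shows up when the output array grows element by element.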
And this one is awesome :P http://ideone.com/muehUw

On Sun, Aug 13, 2017 at 10:27 AM, Daniel Kozak <kozzi11 gmail.com> wrote:
> this one is even faster than c++: http://ideone.com/TRDsOo
> [...]
Aug 13 2017
On Sunday, 13 August 2017 at 06:09:39 UTC, amfvcg wrote:
> Hi all,
> I'm solving below task: given container T and value R, return the sum of R-ranges over T. An example:
> input : T=[1,1,1] R=2
> output : [2, 1]
> input : T=[1,2,3] R=1
> output : [1,2,3]
> (see dlang unittests for more examples)
>
> Below c++ code compiled with g++-5.4.0 -O2 -std=c++14 runs on my machine in 656 836 us. Below D code compiled with dmd v2.067.1 -O runs on my machine in ~ 14.5 sec. Each language has its own "way of programming", and as I'm a beginner in D - probably I'm running through bushes instead of highway. Therefore I'd like to ask you, experienced dlang devs, to shed some light on "how to do it dlang-way".
> ...

From time to time the forum gets questions like these. It would be nice if we had blog articles concerning high-performance programming in D - a kind of best-practices approach covering everything from programming idioms to compilers and compiler flags.
Aug 17 2017