
digitalmars.D - Language performance benchmark updated 2019/11/09

reply zoujiaqing <zoujiaqing gmail.com> writes:
| Language        | Time, s | Memory, MiB |
| --------------- | ------- | ----------- |
| Kotlin          | 2.01    | 37.6        |
| Nim Gcc         | 2.17    | 0.7         |
| C++ Gcc         | 2.41    | 1.7         |
| OCaml           | 2.50    | 4.4         |
| Go              | 2.94    | 1.5         |
| Java            | 3.05    | 37.2        |
| Crystal         | 3.06    | 2.7         |
| ML MLton        | 3.22    | 0.7         |
| Go Gcc          | 3.30    | 19.2        |
| Rust            | 3.43    | 0.8         |
| Nim Clang       | 3.43    | 1.0         |
| D Ldc           | 3.57    | 1.4         |
| D Gdc           | 3.72    | 5.8         |
| Scala           | 4.30    | 136.3       |
| D Dmd           | 4.74    | 3.3         |
| Haskell (MArray)| 6.88    | 3.5         |
| Javascript Node | 6.97    | 31.5        |
| V Gcc           | 7.30    | 0.8         |
| V Clang         | 9.06    | 1.0         |
| Racket          | 10.49   | 77.4        |
| LuaJIT          | 10.99   | 2.1         |
| Python PyPy     | 21.51   | 95.4        |
| Chez Scheme     | 24.72   | 29.2        |
| Haskell         | 29.14   | 3.4         |
| Ruby truffle    | 32.52   | 613.3       |
| Ruby JRuby      | 180.65  | 241.7       |
| Ruby            | 191.36  | 13.1        |
| Lua 5.3         | 201.26  | 1.4         |
| Elixir          | 279.03  | 48.9        |
| Python3         | 388.22  | 7.8         |
| Python          | 399.75  | 6.2         |
| Tcl (FP)        | 494.78  | 4.3         |
| Perl            | 769.17  | 5.2         |
| Tcl (OO)        | 1000.55 | 4.3         |

https://github.com/kostya/benchmarks
Nov 14 2019
next sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 15 November 2019 at 03:31:24 UTC, zoujiaqing wrote:
 https://github.com/kostya/benchmarks
Sadly, the benchmark entries appear to use different algorithms... despite the site claiming otherwise. As far as I can tell...
Nov 15 2019
parent reply aliak <something something.com> writes:
On Friday, 15 November 2019 at 08:25:26 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 15 November 2019 at 03:31:24 UTC, zoujiaqing wrote:
 https://github.com/kostya/benchmarks
Sadly, the benchmark entries appear to use different algorithms... despite the site claiming otherwise. As far as I can tell...
You mean for sorting one uses quick sort while another uses bubble sort, or something to that effect? Did you check a fair amount and find them all different? (I haven't looked yet obviously, and am trying to avoid it depending on how much you peeked :) )
Nov 15 2019
next sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 15 November 2019 at 18:05:35 UTC, aliak wrote:
 You mean for sorting one uses quick sort while another uses
 bubble sort, or something to that effect? Did you check a fair
 amount and find them all different?
The test set is very limited. Matrix multiplication, for one... If a benchmark for some language calls into a C implementation library... and gets twice as good a result as the C benchmark itself... then you know something is not right! :-D
Nov 15 2019
prev sibling parent Gregor Mückl <gregormueckl gmx.de> writes:
On Friday, 15 November 2019 at 18:05:35 UTC, aliak wrote:
 On Friday, 15 November 2019 at 08:25:26 UTC, Ola Fosheim 
 Grøstad wrote:
 On Friday, 15 November 2019 at 03:31:24 UTC, zoujiaqing wrote:
 https://github.com/kostya/benchmarks
Sadly, the benchmark entries appear to use different algorithms... despite the site claiming otherwise. As far as I can tell...
You mean for sorting one uses quick sort while another uses bubble sort, or something to that effect? Did you check a fair amount and find them all different? (I haven't looked yet obviously, and am trying to avoid it depending on how much you peeked :) )
The JSON test uses very different parser implementations. There are even multiple implementations for D: one of them uses the `fast` library while the other one uses std.json, I think. I haven't checked the others, but I expect them to have a similar spread.
Nov 15 2019
prev sibling parent reply Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Friday, 15 November 2019 at 03:31:24 UTC, zoujiaqing wrote:
 https://github.com/kostya/benchmarks
Sorry, but have you tried it yourself? I'm running this benchmark outside of Docker, and the numbers I see are very interesting.

First of all, the time measurement script is quite questionable. It's a Ruby script, and I found that the running time depends on the shell and environment. For example, when I run it under VSCode's console window, I get 20% worse times for all binaries than when I run it under a regular terminal emulator window.

Second, the numbers are, hmmm, how to say... bullshit? Pardon my French!

Here is what I get for the Brainfuck2 mandelbrot benchmark (a simple Brainfuck interpreter implemented in different languages):

C++ gcc version 7.4.0 (g++ -flto -O3 -o bin_cpp bf.cpp):
```
$ ../xtime.rb ./bin_cpp mandel.b
18.05s, 3.6Mb
```

D LDC2 1.18.0 (ldc2 -ofbin_d_ldc -O5 -release -boundscheck=off bf.d):
```
$ ../xtime.rb ./bin_d_ldc mandel.b
19.53s, 3.6Mb
```

Nim 1.0.2 (nim c -o:bin_nim_gcc -d:danger --cc:gcc --verbosity:0 bf.nim):
```
$ ../xtime.rb ./bin_nim_gcc mandel.b
25.07s, 2.2Mb
```

Kotlin kotlinc-jvm 1.3.50 (JRE 1.8.0_201-b09) (kotlinc bf2.kt -include-runtime -d bf2-kt.jar):
```
$ ../xtime.rb java -jar bf2-kt.jar mandel.b
JIT warming up time: 1.25s run
26.81s, 36.6Mb
```

Golang go1.13.4 linux/amd64 (go build -o bin_go bf.go):
```
$ ../xtime.rb ./bin_go mandel.b
38.73s, 2.9Mb
```

============================================

So the results for Brainfuck2 mandel.b are:
```
C++ gcc: 18.05s, 3.6Mb
D LDC2:  19.53s, 3.6Mb
Nim:     25.07s, 2.2Mb
Kotlin:  26.81s, 36.6Mb
```

Please note that I added `boundscheck=off` to the LDC2 command line there. Also, Kotlin always prints `JIT warming up` and takes about 1 to 2 seconds to warm up, so the results for the Brainfuck2 `bench.b` test are VERY different; Kotlin is obviously not the first one there.

I'm running everything on my laptop with an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz. As I mentioned, no Docker containers, just the included scripts to build the binaries.

I haven't tried the other tests, but I feel like I'll get very interesting results for them too. It would be great if someone else could confirm these results, because these benchmarks look very manipulative.
Nov 16 2019
parent reply Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Saturday, 16 November 2019 at 16:07:29 UTC, Jacob Shtokolov 
wrote:
 Haven't tried other tests
Just tried to compile and run Base64. The results are:
```
C:       1.46s, 1.9Mb
Rust:    1.49s, 2.4Mb
D LDC2:  1.98s, 4.2Mb
Golang:  2.89s, 10.9Mb
C++:     3.18s, 4.8Mb
```
This test is closer to the author's numbers, but the Rust implementation isn't faster than the C implementation on my machine, and Golang here is faster than C++. The D version was built with bounds checks this time.
Nov 16 2019
next sibling parent reply Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Saturday, 16 November 2019 at 16:34:58 UTC, Jacob Shtokolov 
wrote:
 Just tried to compile and run Base64
The Havlak test is closer to reality:
```
Nim:     12.24s, 477.8Mb
C++:     17.33s, 179.3Mb
Golang:  21.58s, 358.0Mb
D LDC2:  23.55s, 460.4Mb
D DMD:   29.04s, 461.9Mb
```
Nim is the winner. But here I would look into the code: what makes LDC produce such a poorly optimized binary?
Nov 16 2019
next sibling parent reply Daniel Kozak <kozzi11 gmail.com> writes:
On Sat, Nov 16, 2019 at 5:50 PM Jacob Shtokolov via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Saturday, 16 November 2019 at 16:34:58 UTC, Jacob Shtokolov
 wrote:
 Just tried to compile and run Base64
 The Havlak test is closer to reality:
 Nim:     12.24s, 477.8Mb
 C++:     17.33s, 179.3Mb
 Golang:  21.58s, 358.0Mb
 D LDC2:  23.55s, 460.4Mb
 D DMD:   29.04s, 461.9Mb
 Nim is the winner. But here I would look into the code: what makes
 LDC produce such a poorly optimized binary?
The LDC binary is OK; this is about the GC. I was able to make it almost twice as fast for LDC with some improvements.
Nov 17 2019
parent reply Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Sunday, 17 November 2019 at 10:36:41 UTC, Daniel Kozak wrote:
 The LDC binary is OK; this is about the GC. I was able to make it
 almost twice as fast for LDC with some improvements.
Just checked the code and found that they're using allocations with `new` in loops. But it would be very interesting to see what changes you made to make it run so much faster! Could you please share them somewhere?
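As a generic illustration of why `new` inside a hot loop hurts, here is a minimal sketch with made-up sizes (not the benchmark's actual code), together with the usual fix of hoisting the allocation out:

```d
// Hypothetical sketch: allocating inside a loop vs. reusing one buffer.
// Sizes and iteration counts are invented for illustration.
void main()
{
    // A fresh GC allocation on every iteration: lots of short-lived
    // garbage for the collector to scan and sweep.
    foreach (i; 0 .. 10_000)
    {
        auto buf = new int[](4_096);
        buf[0] = i; // ... work with buf ...
    }

    // One allocation, reused across iterations.
    auto buf = new int[](4_096);
    foreach (i; 0 .. 10_000)
    {
        buf[] = 0;  // reset contents instead of reallocating
        buf[0] = i; // ... work with buf ...
    }
}
```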
Nov 17 2019
parent reply Daniel Kozak <kozzi11 gmail.com> writes:
On Sun, Nov 17, 2019 at 2:50 PM Jacob Shtokolov via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Sunday, 17 November 2019 at 10:36:41 UTC, Daniel Kozak wrote:
 The LDC binary is OK; this is about the GC. I was able to make it
 almost twice as fast for LDC with some improvements.
Just checked the code and found that they're using allocations with `new` in loops. But it would be very interesting to see what changes you made to make it run so much faster! Could you please share them somewhere?
Sorry, I forgot to insert the link. It's on my GitHub: https://github.com/Kozzi11/benchmarks/tree/improve_d
Nov 17 2019
parent reply Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Sunday, 17 November 2019 at 14:15:00 UTC, Daniel Kozak wrote:
 Sorry, I forgot to insert the link. It's on my GitHub:
 https://github.com/Kozzi11/benchmarks/tree/improve_d
Now it's faster than the C++ version on my machine:
```
Nim:     12.01s, 478.1Mb
D LDC2:  13.48s, 428.1Mb
C++:     19.97s, 179.3Mb
Golang:  21.90s, 364.7Mb
```
So basically the only critical change was to replace the built-in associative arrays with Appender types? That's really amazing!
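For readers wondering what that change looks like, here is a minimal sketch of my own contrasting the two approaches (invented sizes, not the actual Havlak benchmark code):

```d
import std.array : Appender;

void main()
{
    enum nodes = 1_024; // hypothetical node count

    // Straightforward version: slices stored in a built-in AA.
    // Each append goes through a runtime capacity lookup, and the
    // slices periodically reallocate as they grow.
    int[][int] childrenAA;
    foreach (i; 0 .. 1_000_000)
        childrenAA[i % nodes] ~= i;

    // Appender version: each Appender tracks its own capacity, so
    // growth is amortized with far fewer GC allocations.
    auto children = new Appender!(int[])[](nodes);
    foreach (i; 0 .. 1_000_000)
        children[i % nodes] ~= i;
}
```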
Nov 17 2019
parent reply Daniel Kozak <kozzi11 gmail.com> writes:
 So basically the only critical change was to replace the built-in
 associative arrays with Appender types?

 That's really amazing!
Not only that. Another change is not filling the `number` AA with UNVISITED, and another is disabling the parallel GC, because it causes a performance decrease here.
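For anyone reproducing the parallel-GC change: druntime reads GC options from an `rt_options` array embedded in the program (or from a `--DRT-gcopt` command-line switch). A minimal sketch, assuming a druntime recent enough (2.087+) to have the parallel marking option:

```d
// Disable parallel GC marking for this binary.
// Equivalent to running the program with: ./app "--DRT-gcopt=parallel:0"
extern(C) __gshared string[] rt_options = ["gcopt=parallel:0"];

void main()
{
    // ... benchmark code as before; the GC now marks single-threaded.
}
```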
Nov 17 2019
parent reply Jon Degenhardt <jond noreply.com> writes:
On Sunday, 17 November 2019 at 16:25:52 UTC, Daniel Kozak wrote:
 So basically the only critical change was to replace the 
 built-in associative arrays with Appender types?

 That's really amazing!
Not only that. Another change is not filling the `number` AA with UNVISITED, and another is disabling the parallel GC, because it causes a performance decrease here.
Regarding the benefits seen from switching from AAs to Appenders: this is a nice performance improvement, and also a nice example of the kind of performance improvement that is often available in D programs.

At a high level, I feel I've seen this pattern a number of times. When people starting with D run benchmarks as part of their initial experiments, they naturally start with the simplest and most straightforward programming approaches. Nothing wrong with this. It's a strength of D that quality code can be written quickly. However, in many cases these simple approaches allocate a fair bit of GC memory, memory that becomes unused quickly and needs to be collected. Again, nothing wrong with this. But I have the impression that many times there is an expectation that such code will perform similarly to code using manually managed memory in other natively compiled languages. And often this expectation is not met, as memory allocation and use patterns are a major performance driver.

What often gets missed in these assessments is that D has quite a few mechanisms for better memory management, without needing to drop GC paradigms entirely and move to fully manual memory management. Modifying performance-sensitive programs to use these mechanisms is often not hard; the switch here from AAs to Appenders is an example. Being able to improve program performance in this way is a strength of D.

One consideration is that until one has some experience with the language, it may not be obvious that these options exist, or which specific changes and approaches can be used. This can lead to perception issues if nothing else.

--Jon
Nov 17 2019
parent reply JN <666total wp.pl> writes:
On Sunday, 17 November 2019 at 21:42:37 UTC, Jon Degenhardt wrote:
 At a high level, I feel I've seen this pattern a number of 
 times. When people starting with D run benchmarks as part of 
 their initial experiments, they naturally start with the 
 simplest and most straightforward programming approaches. 
 Nothing wrong with this. It's a strength of D that quality code 
 can be written quickly.
I think it signifies a deeper problem with this kind of benchmark. Most people would expect these benchmarks to measure idiomatic code, "every day" kind of code. Most people would write their code with associative arrays in this case. Sure, you can optimize it later, but just as well you could drop into an asm {} block and write hand-optimized code. Same with Java: you can write a lot of the code in a very C-like way for a large speedup, but the code will be completely foreign to most Java programmers and not very representative of the language.
Nov 18 2019
parent reply bachmeier <no spam.net> writes:
On Monday, 18 November 2019 at 21:35:08 UTC, JN wrote:

 I think it signifies a deeper problem with this kind of
 benchmark. Most people would expect these benchmarks to
 measure idiomatic code, "every day" kind of code. Most people
 would write their code with associative arrays in this case.
 Sure, you can optimize it later, but just as well you could
 drop into an asm {} block and write hand-optimized code.
If you're in a position where you care about "fast as possible" code, how fast your "every day" code runs isn't really helpful. Now, I do understand that you might want to measure the performance of a piece of code written when you aren't optimizing for execution speed. Someone in that position is going to care about speed of execution and speed of development, among other things. The problem is that you can't learn anything useful in that case from a benchmark that reports execution time and nothing else.
Nov 18 2019
parent Jon Degenhardt <jond noreply.com> writes:
On Monday, 18 November 2019 at 21:50:04 UTC, bachmeier wrote:
 On Monday, 18 November 2019 at 21:35:08 UTC, JN wrote:

 I think it signifies a deeper problem with this kind of
 benchmark. Most people would expect these benchmarks to
 measure idiomatic code, "every day" kind of code. Most people
 would write their code with associative arrays in this case.
 Sure, you can optimize it later, but just as well you could
 drop into an asm {} block and write hand-optimized code.
If you're in a position where you care about "fast as possible" code, how fast your "every day" code runs isn't really helpful. Now, I do understand that you might want to measure the performance of a piece of code written when you aren't optimizing for execution speed. Someone in that position is going to care about speed of execution and speed of development, among other things. The problem is that you can't learn anything useful in that case from a benchmark that reports execution time and nothing else.
Yes, there are often multiple goals behind a benchmark like this, goals that may not be explicitly identified.

There is also the question of what "idiomatic" means. This can be quite subjective, especially in multi-paradigm languages, and what "idiomatic" means to an individual may change as familiarity with the language grows. For D performance studies, an example is that it can take time to learn how to use the lazy, range-based programming facilities. This is certainly one idiomatic D coding style, and it often results in much better memory management and performance.

Code can of course move further from the most common paradigms, all the way to inline assembly blocks. This makes it difficult to say when versions of a program in different languages are similarly idiomatic.
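As a small illustration of that lazy, range-based style (a sketch of my own, not code from the benchmark suite): the pipeline below computes its result element by element, without materializing intermediate arrays.

```d
import std.algorithm : filter, map, sum;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // An eager version would allocate an array at each step, e.g.:
    //   auto squares = iota(1, 1_000).map!(x => x * x).array;
    // The lazy pipeline below allocates nothing; each element flows
    // through filter and map on demand.
    auto total = iota(1, 1_000)
        .filter!(x => x % 3 == 0)
        .map!(x => x * x)
        .sum;
    writeln(total);
}
```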
Nov 18 2019
prev sibling next sibling parent reply Daniel Kozak <kozzi11 gmail.com> writes:
On Sun, Nov 17, 2019 at 11:36 AM Daniel Kozak <kozzi11 gmail.com> wrote:
 Nim is the winner.

 But here I would look into the code: what makes LDC produce
 such a poorly optimized binary?
The LDC binary is OK; this is about the GC. I was able to make it almost twice as fast for LDC with some improvements.
Original code:
```
Golang:  22.74s, 364.1Mb
D LDC2:  29.55s, 463.9Mb
D DMD:   29.42s, 462.5Mb
D GDC:   25.28s, 415.3Mb
Nim:     14.26s, 468.9Mb
```

With small changes:
```
Golang:  22.74s, 364.1Mb
D LDC2:  15.90s, 389.8Mb
D DMD:   16.86s, 387.3Mb
D GDC:   19.48s, 403.8Mb
Nim:     14.26s, 468.9Mb
```
Nov 17 2019
next sibling parent James Blachly <james.blachly gmail.com> writes:
On 11/17/19 6:04 AM, Daniel Kozak wrote:
 On Sun, Nov 17, 2019 at 11:36 AM Daniel Kozak <kozzi11 gmail.com> wrote:
 The LDC binary is OK; this is about the GC. I was able to make it
 almost twice as fast for LDC with some improvements.
Can you summarize or share the changes for learning purposes?
Nov 17 2019
prev sibling parent kinke <noone nowhere.com> writes:
On Sunday, 17 November 2019 at 11:04:55 UTC, Daniel Kozak wrote:
 On Sun, Nov 17, 2019 at 11:36 AM Daniel Kozak 
 <kozzi11 gmail.com> wrote:
 Nim is the winner.

 But here I would look into the code: what makes LDC produce 
 such poorly optimized binary.
 The LDC binary is OK; this is about the GC. I was able to make it
 almost twice as fast for LDC with some improvements.

 Original code:
 Golang:  22.74s, 364.1Mb
 D LDC2:  29.55s, 463.9Mb
 D DMD:   29.42s, 462.5Mb
 D GDC:   25.28s, 415.3Mb
 Nim:     14.26s, 468.9Mb

 With small changes:
 Golang:  22.74s, 364.1Mb
 D LDC2:  15.90s, 389.8Mb
 D DMD:   16.86s, 387.3Mb
 D GDC:   19.48s, 403.8Mb
 Nim:     14.26s, 468.9Mb
With full LTO, I'm seeing an additional 5% boost on Windows (-flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto). As they are using gcc LTO for the brainfuck2 benchmark too (https://github.com/kostya/benchmarks/blob/2777925c4e64987e83e9a53478910de080408057/brainfuck2/build.sh#L5), I wouldn't consider it to be cheating.
Nov 17 2019
prev sibling parent IGotD- <nise nise.com> writes:
On Saturday, 16 November 2019 at 16:45:02 UTC, Jacob Shtokolov 
wrote:
 On Saturday, 16 November 2019 at 16:34:58 UTC, Jacob Shtokolov 
 wrote:
 Just tried to compile and run Base64
 The Havlak test is closer to reality:
 Nim:     12.24s, 477.8Mb
 C++:     17.33s, 179.3Mb
 Golang:  21.58s, 358.0Mb
 D LDC2:  23.55s, 460.4Mb
 D DMD:   29.04s, 461.9Mb
 Nim is the winner. But here I would look into the code: what makes
 LDC produce such a poorly optimized binary?
C++ memory consumption is way lower than the rest. Is this the penalty of a tracing GC? It would have been interesting to see Rust here, as it doesn't use a GC, to check whether it would get close to the C++ memory consumption.
Nov 17 2019
prev sibling parent reply IGotD- <nise nise.com> writes:
On Saturday, 16 November 2019 at 16:34:58 UTC, Jacob Shtokolov 
wrote:
 On Saturday, 16 November 2019 at 16:07:29 UTC, Jacob Shtokolov 
 wrote:
 Haven't tried other tests
 Just tried to compile and run Base64. The results are:
 C:       1.46s, 1.9Mb
 Rust:    1.49s, 2.4Mb
 D LDC2:  1.98s, 4.2Mb
 Golang:  2.89s, 10.9Mb
 C++:     3.18s, 4.8Mb
 This test is closer to the author's numbers, but the Rust
 implementation isn't faster than the C implementation on my machine,
 and Golang here is faster than C++. The D version was built with
 bounds checks this time.
Why is C++ doing so badly? Is it because of inefficient buffer usage, and because it doesn't natively support slices?
Nov 16 2019
parent Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Saturday, 16 November 2019 at 16:47:40 UTC, IGotD- wrote:
 Why is C++ doing so badly?
Looks like that's because they're using some libcrypto APIs (like BIO). Also, my C++ compiler is not the latest one (7.4.0); in the benchmark it's claimed as GCC 9.2.1. But the rest of the compilers are up to date and actually the same versions as in the benchmark README file.
Nov 16 2019