
digitalmars.D - Language performance benchmark updated 2019/11/09

reply zoujiaqing <zoujiaqing gmail.com> writes:
| Language        | Time, s | Memory, MiB |
| --------------- | ------- | ----------- |
| Kotlin          | 2.01    | 37.6        |
| Nim Gcc         | 2.17    | 0.7         |
| C++ Gcc         | 2.41    | 1.7         |
| OCaml           | 2.50    | 4.4         |
| Go              | 2.94    | 1.5         |
| Java            | 3.05    | 37.2        |
| Crystal         | 3.06    | 2.7         |
| ML MLton        | 3.22    | 0.7         |
| Go Gcc          | 3.30    | 19.2        |
| Rust            | 3.43    | 0.8         |
| Nim Clang       | 3.43    | 1.0         |
| D Ldc           | 3.57    | 1.4         |
| D Gdc           | 3.72    | 5.8         |
| Scala           | 4.30    | 136.3       |
| D Dmd           | 4.74    | 3.3         |
| Haskell (MArray)| 6.88    | 3.5         |
| Javascript Node | 6.97    | 31.5        |
| V Gcc           | 7.30    | 0.8         |
| V Clang         | 9.06    | 1.0         |
| Racket          | 10.49   | 77.4        |
| LuaJIT          | 10.99   | 2.1         |
| Python PyPy     | 21.51   | 95.4        |
| Chez Scheme     | 24.72   | 29.2        |
| Haskell         | 29.14   | 3.4         |
| Ruby truffle    | 32.52   | 613.3       |
| Ruby JRuby      | 180.65  | 241.7       |
| Ruby            | 191.36  | 13.1        |
| Lua 5.3         | 201.26  | 1.4         |
| Elixir          | 279.03  | 48.9        |
| Python3         | 388.22  | 7.8         |
| Python          | 399.75  | 6.2         |
| Tcl (FP)        | 494.78  | 4.3         |
| Perl            | 769.17  | 5.2         |
| Tcl (OO)        | 1000.55 | 4.3         |

https://github.com/kostya/benchmarks
Nov 14 2019
next sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 15 November 2019 at 03:31:24 UTC, zoujiaqing wrote:
 https://github.com/kostya/benchmarks
Sadly, the benchmark entries appear to use different algorithms... despite the site claiming otherwise. As far as I can tell...
Nov 15 2019
parent reply aliak <something something.com> writes:
On Friday, 15 November 2019 at 08:25:26 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 15 November 2019 at 03:31:24 UTC, zoujiaqing wrote:
 https://github.com/kostya/benchmarks
Sadly, the benchmark entries appear to use different algorithms... despite the site claiming otherwise. As far as I can tell...
You mean for sorting one uses quick sort while another uses bubble sort, or something to that effect? Did you check a fair amount and find them all different? (I haven't looked yet obviously, and am trying to avoid it depending on how much you peeked :) )
Nov 15 2019
next sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 15 November 2019 at 18:05:35 UTC, aliak wrote:
 You mean for sorting one uses quick sort while another uses
 bubble sort, or something to that effect? Did you check a fair
 amount and find them all different?
The test set is very limited. Matrix multiplication, for one... If a benchmark for some language calls into a C implementation library... and gets twice as good a result as the C benchmark itself... then you know something is not right! :-D
Nov 15 2019
prev sibling parent Gregor Mückl <gregormueckl gmx.de> writes:
On Friday, 15 November 2019 at 18:05:35 UTC, aliak wrote:
 On Friday, 15 November 2019 at 08:25:26 UTC, Ola Fosheim 
 Grøstad wrote:
 On Friday, 15 November 2019 at 03:31:24 UTC, zoujiaqing wrote:
 https://github.com/kostya/benchmarks
Sadly, the benchmark entries appear to use different algorithms... despite the site claiming otherwise. As far as I can tell...
You mean for sorting one uses quick sort while another uses bubble sort, or something to that effect? Did you check a fair amount and find them all different? (I haven't looked yet obviously, and am trying to avoid it depending on how much you peeked :) )
The JSON test uses very different parser implementations. There are even multiple implementations for D: one of them uses the `fast` library while the other one uses std.json, I think. I haven't checked the others, but I expect them to have a similar spread.
Nov 15 2019
prev sibling parent reply Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Friday, 15 November 2019 at 03:31:24 UTC, zoujiaqing wrote:
 https://github.com/kostya/benchmarks
Sorry, but have you tried it yourself? I'm running this benchmark outside of Docker, and the numbers I see are very interesting.

First of all, the time measurement script is quite questionable. It's a Ruby script, and I found that the running time depends on the shell and environment. For example, when I run it under VSCode's console window, I get 20% worse times for all binaries than when I run it under a regular terminal emulator window.

Second, the numbers are, hmmm, how to say... bullshit? Pardon my French!

Here is what I get for the Brainfuck2 mandelbrot benchmark (a simple Brainfuck interpreter implemented in different languages):

C++ gcc version 7.4.0 (g++ -flto -O3 -o bin_cpp bf.cpp):
```
$ ../xtime.rb ./bin_cpp mandel.b
18.05s, 3.6Mb
```

D LDC2 1.18.0 (ldc2 -ofbin_d_ldc -O5 -release -boundscheck=off bf.d):
```
$ ../xtime.rb ./bin_d_ldc mandel.b
19.53s, 3.6Mb
```

Nim 1.0.2 (nim c -o:bin_nim_gcc -d:danger --cc:gcc --verbosity:0 bf.nim):
```
$ ../xtime.rb ./bin_nim_gcc mandel.b
25.07s, 2.2Mb
```

Kotlin kotlinc-jvm 1.3.50 (JRE 1.8.0_201-b09) (kotlinc bf2.kt -include-runtime -d bf2-kt.jar):
```
$ ../xtime.rb java -jar bf2-kt.jar mandel.b
JIT warming up time: 1.25s run
26.81s, 36.6Mb
```

Golang go1.13.4 linux/amd64 (go build -o bin_go bf.go):
```
$ ../xtime.rb ./bin_go mandel.b
38.73s, 2.9Mb
```

============================================

So the results for Brainfuck2 mandel.b are:
```
C++ gcc: 18.05s, 3.6Mb
D LDC2:  19.53s, 3.6Mb
Nim:     25.07s, 2.2Mb
Kotlin:  26.81s, 36.6Mb
```

Please note that I added `boundscheck=off` to the LDC2 command line there. Also, Kotlin always prints `JIT warming up` and takes about 1 to 2 seconds to warm up, so the results for the Brainfuck2 `bench.b` test are VERY different; Kotlin is obviously not the first one there.

I'm running everything on my laptop with an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz. As I mentioned, no Docker containers, just the included scripts to build the binaries.

I haven't tried the other tests, but I feel like I'll get very interesting results for them too. It would be great if someone else could confirm these results, because these benchmarks look very manipulative.
Nov 16 2019
parent reply Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Saturday, 16 November 2019 at 16:07:29 UTC, Jacob Shtokolov 
wrote:
 Haven't tried other tests
Just tried to compile and run Base64. The results are:
```
C:       1.46s, 1.9Mb
Rust:    1.49s, 2.4Mb
D LDC2:  1.98s, 4.2Mb
Golang:  2.89s, 10.9Mb
C++:     3.18s, 4.8Mb
```
This test is closer to the author's numbers, but the Rust implementation isn't faster than the C implementation on my machine, and Golang here is faster than C++. The D version was built with bounds checks this time.
Nov 16 2019
next sibling parent reply Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Saturday, 16 November 2019 at 16:34:58 UTC, Jacob Shtokolov 
wrote:
 Just tried to compile and run Base64
The Havlak test is closer to reality:
```
Nim:     12.24s, 477.8Mb
C++:     17.33s, 179.3Mb
Golang:  21.58s, 358.0Mb
D LDC2:  23.55s, 460.4Mb
D DMD:   29.04s, 461.9Mb
```
Nim is the winner. But here I would look into the code: what makes LDC produce such a poorly optimized binary?
Nov 16 2019
next sibling parent reply Daniel Kozak <kozzi11 gmail.com> writes:
On Sat, Nov 16, 2019 at 5:50 PM Jacob Shtokolov via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Saturday, 16 November 2019 at 16:34:58 UTC, Jacob Shtokolov
 wrote:
 Just tried to compile and run Base64
 The Havlak test is closer to reality:
 Nim:     12.24s, 477.8Mb
 C++:     17.33s, 179.3Mb
 Golang:  21.58s, 358.0Mb
 D LDC2:  23.55s, 460.4Mb
 D DMD:   29.04s, 461.9Mb
 Nim is the winner. But here I would look into the code: what makes
 LDC produce such a poorly optimized binary?
The LDC binary is OK; this is about the GC. I was able to make it almost twice as fast for LDC with some improvements.
Nov 17 2019
parent reply Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Sunday, 17 November 2019 at 10:36:41 UTC, Daniel Kozak wrote:
 The LDC binary is OK; this is about the GC. I was able to make it
 almost twice as fast for LDC with some improvements.
Just checked the code and found that they're using allocations with `new` in loops. But it would be very interesting to see what changes you made to make it run so much faster! Could you please share them somewhere?
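As a generic illustration of why `new` inside a hot loop hurts, here is a minimal sketch with made-up sizes (not the benchmark's actual code), together with the usual fix of hoisting the allocation out:

```d
// Hypothetical sketch: allocating inside a loop vs. reusing one buffer.
// Sizes and iteration counts are invented for illustration.
void main()
{
    // A fresh GC allocation on every iteration: lots of short-lived
    // garbage for the collector to scan and sweep.
    foreach (i; 0 .. 10_000)
    {
        auto buf = new int[](4_096);
        buf[0] = i; // ... work with buf ...
    }

    // One allocation, reused across iterations.
    auto buf = new int[](4_096);
    foreach (i; 0 .. 10_000)
    {
        buf[] = 0;  // reset contents instead of reallocating
        buf[0] = i; // ... work with buf ...
    }
}
```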
Nov 17 2019
parent reply Daniel Kozak <kozzi11 gmail.com> writes:
On Sun, Nov 17, 2019 at 2:50 PM Jacob Shtokolov via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Sunday, 17 November 2019 at 10:36:41 UTC, Daniel Kozak wrote:
 The LDC binary is OK; this is about the GC. I was able to make it
 almost twice as fast for LDC with some improvements.
Just checked the code and found that they're using allocations with `new` in loops. But it would be very interesting to see what changes you made to make it run so much faster! Could you please share them somewhere?
Sorry, I forgot to insert the link. It's on my GitHub: https://github.com/Kozzi11/benchmarks/tree/improve_d
Nov 17 2019
parent reply Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Sunday, 17 November 2019 at 14:15:00 UTC, Daniel Kozak wrote:
 Sorry, I forgot to insert the link. It's on my GitHub:
 https://github.com/Kozzi11/benchmarks/tree/improve_d
Now it's faster than the C++ version on my machine:
```
Nim:     12.01s, 478.1Mb
D LDC2:  13.48s, 428.1Mb
C++:     19.97s, 179.3Mb
Golang:  21.90s, 364.7Mb
```
So basically the only critical change was to replace the built-in associative arrays with Appender types? That's really amazing!
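For readers wondering what that change looks like, here is a minimal sketch of my own contrasting the two approaches (invented sizes, not the actual Havlak benchmark code):

```d
import std.array : Appender;

void main()
{
    enum nodes = 1_024; // hypothetical node count

    // Straightforward version: slices stored in a built-in AA.
    // Each append goes through a runtime capacity lookup, and the
    // slices periodically reallocate as they grow.
    int[][int] childrenAA;
    foreach (i; 0 .. 1_000_000)
        childrenAA[i % nodes] ~= i;

    // Appender version: each Appender tracks its own capacity, so
    // growth is amortized with far fewer GC allocations.
    auto children = new Appender!(int[])[](nodes);
    foreach (i; 0 .. 1_000_000)
        children[i % nodes] ~= i;
}
```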
Nov 17 2019
parent reply Daniel Kozak <kozzi11 gmail.com> writes:
 So basically the only critical change was to replace the built-in
 associative arrays with Appender types?

 That's really amazing!
Not only that. Another change is not filling the `number` AA with UNVISITED, and another is disabling the parallel GC, because it causes a performance decrease here.
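For anyone reproducing the parallel-GC change: druntime reads GC options from an `rt_options` array embedded in the program (or from a `--DRT-gcopt` command-line switch). A minimal sketch, assuming a druntime recent enough (2.087+) to have the parallel marking option:

```d
// Disable parallel GC marking for this binary.
// Equivalent to running the program with: ./app "--DRT-gcopt=parallel:0"
extern(C) __gshared string[] rt_options = ["gcopt=parallel:0"];

void main()
{
    // ... benchmark code as before; the GC now marks single-threaded.
}
```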
Nov 17 2019
parent reply Jon Degenhardt <jond noreply.com> writes:
On Sunday, 17 November 2019 at 16:25:52 UTC, Daniel Kozak wrote:
 So basically the only critical change was to replace the 
 built-in associative arrays with Appender types?

 That's really amazing!
Not only that. Another change is not filling the `number` AA with UNVISITED, and another is disabling the parallel GC, because it causes a performance decrease here.
Regarding the benefits seen from switching from AAs to Appenders: this is a nice performance improvement, and also a nice example of the kind of performance improvement that is often available in D programs.

At a high level, I feel I've seen this pattern a number of times. When people starting with D run benchmarks as part of their initial experiments, they naturally start with the simplest and most straightforward programming approaches. Nothing wrong with this. It's a strength of D that quality code can be written quickly. However, in many cases these simple approaches allocate a fair bit of GC memory, memory that becomes unused quickly and needs to be collected. Again, nothing wrong with this. But I have the impression that many times there is an expectation that such code will perform similarly to code using manually managed memory in other natively compiled languages. And often this expectation is not met, as memory allocation and use patterns are a major performance driver.

What often gets missed in these assessments is that D has quite a few mechanisms for better memory management, without needing to drop GC paradigms entirely and move to fully manual memory management. Modifying performance-sensitive programs to use these mechanisms is often not hard; the switch here from AAs to Appenders is an example. Being able to improve program performance in this way is a strength of D.

One consideration is that until one has some experience with the language, it may not be obvious that these options exist, or which specific changes and approaches can be used. This can lead to perception issues if nothing else.

--Jon
Nov 17 2019
parent reply JN <666total wp.pl> writes:
On Sunday, 17 November 2019 at 21:42:37 UTC, Jon Degenhardt wrote:
 At a high level, I feel I've seen this pattern a number of 
 times. When people starting with D run benchmarks as part of 
 their initial experiments, they naturally start with the 
 simplest and most straightforward programming approaches. 
 Nothing wrong with this. It's a strength of D that quality code 
 can be written quickly.
I think it signifies a deeper problem with this kind of benchmark. Most people would expect these benchmarks to measure idiomatic code, "every day" kind of code. Most people would write their code with associative arrays in this case. Sure, you can optimize it later, but just as well you could drop into an asm {} block and write hand-optimized code. Same with Java: you can write a lot of the code in a very C-like way for a large speedup, but the code will be completely foreign to most Java programmers and not very representative of the language.
Nov 18 2019
parent reply bachmeier <no spam.net> writes:
On Monday, 18 November 2019 at 21:35:08 UTC, JN wrote:

 I think it signifies a deeper problem with this kind of
 benchmark. Most people would expect these benchmarks to
 measure idiomatic code, "every day" kind of code. Most people
 would write their code with associative arrays in this case.
 Sure, you can optimize it later, but just as well you could
 drop into an asm {} block and write hand-optimized code.
If you're in a position where you care about "fast as possible" code, how fast your "every day" code runs isn't really helpful. Now, I do understand that you might want to measure the performance of a piece of code written when you aren't optimizing for execution speed. Someone in that position is going to care about speed of execution and speed of development, among other things. The problem is that you can't learn anything useful in that case from a benchmark that reports execution time and nothing else.
Nov 18 2019
parent Jon Degenhardt <jond noreply.com> writes:
On Monday, 18 November 2019 at 21:50:04 UTC, bachmeier wrote:
 On Monday, 18 November 2019 at 21:35:08 UTC, JN wrote:

 I think it signifies a deeper problem with this kind of
 benchmark. Most people would expect these benchmarks to
 measure idiomatic code, "every day" kind of code. Most people
 would write their code with associative arrays in this case.
 Sure, you can optimize it later, but just as well you could
 drop into an asm {} block and write hand-optimized code.
If you're in a position where you care about "fast as possible" code, how fast your "every day" code runs isn't really helpful. Now, I do understand that you might want to measure the performance of a piece of code written when you aren't optimizing for execution speed. Someone in that position is going to care about speed of execution and speed of development, among other things. The problem is that you can't learn anything useful in that case from a benchmark that reports execution time and nothing else.
Yes, there are often multiple goals behind a benchmark like this, goals that may not be explicitly identified.

There is also the question of what "idiomatic" means. This can be quite subjective, especially in multi-paradigm languages, and what "idiomatic" means to an individual may change as familiarity with the language grows. For D performance studies, an example is that it can take time to learn how to use the lazy, range-based programming facilities. This is certainly one idiomatic D coding style, and it often results in much better memory management and performance.

Code can of course move further from the most common paradigms, all the way to inline assembly blocks. This makes it difficult to say when versions of a program in different languages are similarly idiomatic.
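As a small illustration of that lazy, range-based style (a sketch of my own, not code from the benchmark suite): the pipeline below computes its result element by element, without materializing intermediate arrays.

```d
import std.algorithm : filter, map, sum;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // An eager version would allocate an array at each step, e.g.:
    //   auto squares = iota(1, 1_000).map!(x => x * x).array;
    // The lazy pipeline below allocates nothing; each element flows
    // through filter and map on demand.
    auto total = iota(1, 1_000)
        .filter!(x => x % 3 == 0)
        .map!(x => x * x)
        .sum;
    writeln(total);
}
```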
Nov 18 2019
prev sibling next sibling parent reply Daniel Kozak <kozzi11 gmail.com> writes:
On Sun, Nov 17, 2019 at 11:36 AM Daniel Kozak <kozzi11 gmail.com> wrote:
 Nim is the winner.

 But here I would look into the code: what makes LDC produce
 such a poorly optimized binary?
The LDC binary is OK; this is about the GC. I was able to make it almost twice as fast for LDC with some improvements.
Original code:
```
Golang:  22.74s, 364.1Mb
D LDC2:  29.55s, 463.9Mb
D DMD:   29.42s, 462.5Mb
D GDC:   25.28s, 415.3Mb
Nim:     14.26s, 468.9Mb
```

With small changes:
```
Golang:  22.74s, 364.1Mb
D LDC2:  15.90s, 389.8Mb
D DMD:   16.86s, 387.3Mb
D GDC:   19.48s, 403.8Mb
Nim:     14.26s, 468.9Mb
```
Nov 17 2019
next sibling parent James Blachly <james.blachly gmail.com> writes:
On 11/17/19 6:04 AM, Daniel Kozak wrote:
 On Sun, Nov 17, 2019 at 11:36 AM Daniel Kozak <kozzi11 gmail.com> wrote:
 The LDC binary is OK; this is about the GC. I was able to make it
 almost twice as fast for LDC with some improvements.
Can you summarize or share the changes for learning purposes?
Nov 17 2019
prev sibling parent kinke <noone nowhere.com> writes:
On Sunday, 17 November 2019 at 11:04:55 UTC, Daniel Kozak wrote:
 On Sun, Nov 17, 2019 at 11:36 AM Daniel Kozak 
 <kozzi11 gmail.com> wrote:
 Nim is the winner.

 But here I would look into the code: what makes LDC produce 
 such poorly optimized binary.
 The LDC binary is OK; this is about the GC. I was able to make it
 almost twice as fast for LDC with some improvements.

 Original code:
 Golang:  22.74s, 364.1Mb
 D LDC2:  29.55s, 463.9Mb
 D DMD:   29.42s, 462.5Mb
 D GDC:   25.28s, 415.3Mb
 Nim:     14.26s, 468.9Mb

 With small changes:
 Golang:  22.74s, 364.1Mb
 D LDC2:  15.90s, 389.8Mb
 D DMD:   16.86s, 387.3Mb
 D GDC:   19.48s, 403.8Mb
 Nim:     14.26s, 468.9Mb
With full LTO, I'm seeing an additional 5% boost on Windows (-flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto). As they are using gcc LTO for the brainfuck2 benchmark too (https://github.com/kostya/benchmarks/blob/2777925c4e64987e83e9a53478910de080408057/brainfuck2/build.sh#L5), I wouldn't consider it to be cheating.
Nov 17 2019
prev sibling parent IGotD- <nise nise.com> writes:
On Saturday, 16 November 2019 at 16:45:02 UTC, Jacob Shtokolov 
wrote:
 On Saturday, 16 November 2019 at 16:34:58 UTC, Jacob Shtokolov 
 wrote:
 Just tried to compile and run Base64
 The Havlak test is closer to reality:
 Nim:     12.24s, 477.8Mb
 C++:     17.33s, 179.3Mb
 Golang:  21.58s, 358.0Mb
 D LDC2:  23.55s, 460.4Mb
 D DMD:   29.04s, 461.9Mb
 Nim is the winner. But here I would look into the code: what makes
 LDC produce such a poorly optimized binary?
C++ memory consumption is way lower than the rest. Is this the penalty of a tracing GC? It would have been interesting to see Rust here, as it doesn't use a GC, to check whether it would get close to the C++ memory consumption.
Nov 17 2019
prev sibling parent reply IGotD- <nise nise.com> writes:
On Saturday, 16 November 2019 at 16:34:58 UTC, Jacob Shtokolov 
wrote:
 On Saturday, 16 November 2019 at 16:07:29 UTC, Jacob Shtokolov 
 wrote:
 Haven't tried other tests
 Just tried to compile and run Base64. The results are:
 C:       1.46s, 1.9Mb
 Rust:    1.49s, 2.4Mb
 D LDC2:  1.98s, 4.2Mb
 Golang:  2.89s, 10.9Mb
 C++:     3.18s, 4.8Mb
 This test is closer to the author's numbers, but the Rust
 implementation isn't faster than the C implementation on my machine,
 and Golang here is faster than C++. The D version was built with
 bounds checks this time.
Why is C++ doing so badly? Is it because of inefficient buffer usage, and because it doesn't natively support slices?
Nov 16 2019
parent Jacob Shtokolov <jacob.100205 gmail.com> writes:
On Saturday, 16 November 2019 at 16:47:40 UTC, IGotD- wrote:
 Why is C++ doing so badly?
Looks like that's because they're using some libcrypto APIs (like BIO). Also, my C++ compiler is not the latest one (7.4.0); in the benchmark it's claimed as GCC 9.2.1. But the rest of the compilers are up to date and actually the same versions as in the benchmark README file.
Nov 16 2019