digitalmars.D - Standard D, Mir D benchmarks against Numpy (BLAS)
- Pavel Shkadzko (34/34) Mar 12 2020 I have done several benchmarks against Numpy for various 2D
- rikki cattermole (3/3) Mar 12 2020 You forgot to disable the GC for both benches.
- Pavel Shkadzko (7/10) Mar 12 2020 Thank you.
- rikki cattermole (7/18) Mar 12 2020 Okay that means no GC collection was triggered during your benchmarks.
- 9il (2/36) Mar 12 2020 Haha
- 9il (8/13) Mar 12 2020 Generally speaking, the D/Mir code of the benchmark is slow by
- 9il (3/18) Mar 12 2020 Ah, never mind, the forum table didn't show the Mir numbers aligned.
- jmh530 (6/14) Mar 12 2020 I saw your subsequent post about not seeing the numbers, but I
- Pavel Shkadzko (8/23) Mar 12 2020 Didn't understand. You argue against D/Mir usage when talking to
- 9il (9/33) Mar 12 2020 Agreed. I just misunderstood the table at the forum, it was
- Pavel Shkadzko (3/15) Mar 12 2020 Thank you for the comments!
- 9il (5/23) Mar 12 2020 Phobos sort bench bug report:
- p.shkadzko (6/36) Mar 12 2020 I am actually intrigued with the timings of huge matrices. Why
- bachmeier (7/12) Mar 12 2020 Been quite a while since I worked with numpy, but I think that's
- Patrick Schluter (3/10) Mar 13 2020 The interpreter getting in the way of the hardware prefetcher,
- jmh530 (13/14) Mar 12 2020 Looked into some of those that aren't faster than numpy:
- Pavel Shkadzko (2/17) Mar 12 2020 Numpy uses BLAS "gemm" and D uses OpenBlas "gemm".
- 9il (4/7) Mar 12 2020 Depending on the system they can use the same or configure
- drug (12/15) Mar 12 2020 How long does the benchmark run? It has already taken 20 min and continues to...
- Pavel Shkadzko (4/19) Mar 12 2020 For Numpy Python it's ~1m 30s, but for all D benchmarks it takes
- drug (2/5) Mar 12 2020 Hmm, I see. In my case the benchmark hangs up in dgemm_ infinitely(
- Jacob Carlborg (6/15) Mar 14 2020 Have you tried to compile with LTO (Link Time Optimization) and PGO
- 9il (4/19) Mar 14 2020 The problem is that Numpy uses its own version of OpenBLAS, that
- Pavel Shkadzko (4/16) Mar 15 2020 My version of NumPy is installed with anaconda and it looks like
- Pavel Shkadzko (6/21) Mar 15 2020 If for LTO the dub.json dflags-ldc: ["-flto=full"] is enough then
- Jon Degenhardt (29/52) Mar 15 2020 Try:
- 9il (4/10) Mar 15 2020 LTO and PGO are useless for this kind of stuff. Nothing to
I have done several benchmarks against Numpy for various 2D matrix operations. The purpose was mere curiosity and to spread the word about the Mir D library among the office data engineers. Since I am not a D expert, I would be happy if someone could take a second look and double-check.

https://github.com/tastyminerals/mir_benchmarks

Compile and run the project via: dub run --compiler=ldc --build=release

*Table descriptions reduced to fit into post width.

+---------------------------------+---------------------+--------------------+---------------------+
| Description                     | Numpy (BLAS) (sec.) | Standard D (sec.)  | Mir D (sec.)        |
+---------------------------------+---------------------+--------------------+---------------------+
| sum of two 250x200 (50 loops)   | 0.00115             | 0.00400213(x3.5)   | 0.00014372(x1/8)    |
| mult of two 250x200 (50 loops)  | 0.0011578           | 0.0132323(x11.4)   | 0.00013852(x1/8.3)  |
| sum of two 500x600 (50 loops)   | 0.0101275           | 0.016496(x1.6)     | 0.00021556(x1/47)   |
| mult of two 500x600 (50 loops)  | 0.010182            | 0.06857(x6.7)      | 0.00021717(x1/47)   |
| sum of two 1k x 1k (50 loops)   | 0.0493201           | 0.0614544(x1.3)    | 0.000422135(x1/117) |
| mult of two 1k x 1k (50 loops)  | 0.0493693           | 0.233827(x4.7)     | 0.000453535(x1/109) |
| Scalar product of two 30k       | 0.0152186           | 0.0227465(x1.5)    | 0.0198812(x1.3)     |
| Dot product of 5k x 6k, 6k x 5k | 1.6084685           | --------------     | 2.03398(x1.2)       |
| L2 norm of 5k x 6k              | 0.0072423           | 0.0160546(x2.2)    | 0.0110136(x1.6)     |
| Quicksort of 5k x 6k            | 2.6516816           | 0.178071(x1/14.8)  | 1.52406(x1/0.6)     |
+---------------------------------+---------------------+--------------------+---------------------+
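To give a feel for the shape of the code being measured, here is a minimal sketch of the element-wise sum in Mir (not the repository's exact code; the shape and fill values are arbitrary):

```
import mir.ndslice;

void main()
{
    // two 250x200 double matrices, filled from a lazy index range
    auto a = [250, 200].iota.as!double.slice;
    auto b = [250, 200].iota.as!double.slice;

    // element-wise sum: copy a, then add b in place
    auto c = a.slice;
    c[] += b;
}
```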
Mar 12 2020
You forgot to disable the GC for both benches. Also, fastmath for standard_ops_bench. FYI: standard_ops_bench does a LOT of memory allocations.
Mar 12 2020
On Thursday, 12 March 2020 at 13:18:41 UTC, rikki cattermole wrote:
> You forgot to disable the GC for both benches. Also, fastmath for
> standard_ops_bench. FYI: standard_ops_bench does a LOT of memory
> allocations.

Thank you. Add GC.disable; inside the main function, right? It didn't really change anything for any of the benchmarks; maybe I did it wrong. Does fastmath work only on functions with plain loops, or on everything with math ops? It is not clear from the LDC docs.
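A minimal sketch of both suggestions, assuming an LDC build (@fastmath comes from the LDC-only ldc.attributes module and applies per function; the function here is hypothetical):

```
import core.memory : GC;
import ldc.attributes : fastmath;

// fastmath relaxes IEEE floating-point semantics for this function only,
// so every hot loop needs its own annotation.
@fastmath double sumAll(double[] data)
{
    double total = 0;
    foreach (x; data)
        total += x;
    return total;
}

void main()
{
    GC.disable(); // no collection cycles during the timed runs
    auto data = new double[1_000_000];
    data[] = 1.5;
    auto total = sumAll(data);
}
```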
Mar 12 2020
On 13/03/2020 3:27 AM, Pavel Shkadzko wrote:
> On Thursday, 12 March 2020 at 13:18:41 UTC, rikki cattermole wrote:
>> [...]
>
> Thank you. Add GC.disable; inside the main function, right? It didn't
> really change anything for any of the benchmarks; maybe I did it wrong.

Okay, that means no GC collection was triggered during your benchmarks. This is good to know: it means the performance problems are indeed on your end and not runtime related.

> Does fastmath work only on functions with plain loops, or on everything
> with math ops? It is not clear from the LDC docs.

Try it :) I have no idea how much it'll help. You have used it on one but not the other, so it seems odd not to do it on both.
Mar 12 2020
On Thursday, 12 March 2020 at 12:59:41 UTC, Pavel Shkadzko wrote:
> I have done several benchmarks against Numpy for various 2D matrix
> operations. [...]

Haha
Mar 12 2020
On Thursday, 12 March 2020 at 12:59:41 UTC, Pavel Shkadzko wrote:
> I have done several benchmarks against Numpy for various 2D matrix
> operations. The purpose was mere curiosity and to spread the word about
> the Mir D library among the office data engineers. Since I am not a D
> expert, I would be happy if someone could take a second look and
> double-check.

Generally speaking, the D/Mir code of the benchmark is slow because of how it has been written. I am not urging you to use D/Mir. Furthermore, sometimes I advise my clients not to use it if they can avoid it. On commercial request, I can write the benchmark or an applied algorithm so that D/Mir will beat numpy in all the tests, including gemm.

--Ilya
Mar 12 2020
On Thursday, 12 March 2020 at 14:00:48 UTC, 9il wrote:
> Generally speaking, the D/Mir code of the benchmark is slow because of
> how it has been written. [...]

Ah, never mind, the forum table didn't show the Mir numbers aligned. Thank you for the work. I will open an MR with a few addons.
Mar 12 2020
On Thursday, 12 March 2020 at 14:00:48 UTC, 9il wrote:
> [snip]
>
> Generally speaking, the D/Mir code of the benchmark is slow because of
> how it has been written. I am not urging you to use D/Mir. Furthermore,
> sometimes I advise my clients not to use it if they can avoid it. On
> commercial request, I can write the benchmark or an applied algorithm
> so that D/Mir will beat numpy in all the tests, including gemm. --Ilya

I saw your subsequent post about not seeing the numbers, but I think my broader response is that most people don't need to get every single drop of performance. Typical performance for numpy versus typical performance for mir is still valuable information for people to know.
Mar 12 2020
On Thursday, 12 March 2020 at 14:00:48 UTC, 9il wrote:
> Generally speaking, the D/Mir code of the benchmark is slow because of
> how it has been written. I am not urging you to use D/Mir. Furthermore,
> sometimes I advise my clients not to use it if they can avoid it. [...]

I didn't understand. You argue against D/Mir usage when talking to your clients?

Actually, I feel like it is also useful to have unoptimized D code benchmarked, because this is how most people will write their code when they first write it. Although, I can hardly call these benchmarks unoptimized, because I use LDC optimization flags as well as some tips from you.
Mar 12 2020
On Thursday, 12 March 2020 at 14:37:13 UTC, Pavel Shkadzko wrote:
> I didn't understand. You argue against D/Mir usage when talking to your
> clients?

It depends on the problem they wanted me to solve.

> Actually, I feel like it is also useful to have unoptimized D code
> benchmarked, because this is how most people will write their code when
> they first write it. [...]

Agreed. I just misunderstood the table at the forum; it was misaligned for me. The numbers look cool, thank you for the benchmark. Mir sorting looks slower than Phobos, which is interesting and needs a fix. You can use Phobos sorting with ndslice the same way, with `each` (a sketch follows below).

Minor updates: https://github.com/tastyminerals/mir_benchmarks/pull/1
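A minimal sketch of that suggestion, assuming mir-algorithm and a mutable contiguous matrix (the shape and fill are arbitrary):

```
import mir.ndslice;
import mir.algorithm.iteration : each;
import std.algorithm.sorting : sort;

void main()
{
    // 4x5 matrix counting down from 20, so there is something to sort
    auto m = [4, 5].iota(20, -1).slice;

    // apply Phobos sort to every row of the ndslice
    m.byDim!0.each!sort;
}
```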
Mar 12 2020
On Thursday, 12 March 2020 at 15:34:58 UTC, 9il wrote:
> [...]
>
> Minor updates: https://github.com/tastyminerals/mir_benchmarks/pull/1

Thank you for the comments! Looks like I will be updating the benchmark tables today :)
Mar 12 2020
On Thursday, 12 March 2020 at 15:46:47 UTC, Pavel Shkadzko wrote:
> Thank you for the comments! Looks like I will be updating the benchmark
> tables today :)

Phobos sort bench bug report: https://github.com/tastyminerals/mir_benchmarks/issues/2

Another small update that changes the ratio a lot: https://github.com/tastyminerals/mir_benchmarks/pull/3
Mar 12 2020
On Thursday, 12 March 2020 at 15:34:58 UTC, 9il wrote:
> [...]

I am actually intrigued by the timings for huge matrices. Why are Mir D and Standard D so much better than NumPy? Once we get to the 500x600 and 1000x1000 sizes, there is a huge drop in performance for NumPy and not so much for D. You mentioned the L3 cache, but the CPU architecture is the same for all the benchmarks, so what's going on?
Mar 12 2020
On Thursday, 12 March 2020 at 20:39:59 UTC, p.shkadzko wrote:
> I am actually intrigued by the timings for huge matrices. Why are Mir D
> and Standard D so much better than NumPy? [...]

It's been quite a while since I worked with numpy, but I think that's where you're hitting memory limits (easier to do with Python than with D), and that causes performance to deteriorate quickly. I had those problems with R, and I believe it's relatively easy to hit that constraint with numpy as well, but you definitely want to find a numpy expert to confirm -- something I definitely am not.
Mar 12 2020
On Thursday, 12 March 2020 at 20:39:59 UTC, p.shkadzko wrote:
> [...]

The interpreter getting in the way of the hardware prefetcher, maybe.
Mar 13 2020
On Thursday, 12 March 2020 at 12:59:41 UTC, Pavel Shkadzko wrote:
> [snip]

I looked into some of those that aren't faster than numpy:

For the dot product (what I would just call matrix multiplication), both functions are using gemm. There might be some quirks that have caused a difference in performance, but otherwise I would expect them to be pretty close, and they are. It looks like you are allocating the output matrix with the GC, which could be a driver of the difference.

For the L2 norm, you are calculating it entry-wise as a Frobenius norm. That should be the same as the default for numpy. For numpy, the only difference I can tell between yours and theirs is that numpy re-uses its dot product function. Otherwise it looks the same.
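For reference, a minimal sketch of the entry-wise Frobenius computation using mir-algorithm's reduce (this is not the benchmark's actual code; the matrix here is arbitrary):

```
import mir.ndslice : iota, slice;
import mir.algorithm.iteration : reduce;
import mir.math.common : sqrt;

// Frobenius norm: square root of the sum of squared entries
double frobenius(S)(S matrix)
{
    return reduce!((acc, x) => acc + double(x) * x)(0.0, matrix).sqrt;
}

void main()
{
    auto m = [3, 4].iota(1).slice; // 3x4 matrix holding 1..12
    auto norm = frobenius(m);      // sqrt(1^2 + 2^2 + ... + 12^2)
}
```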
Mar 12 2020
On Thursday, 12 March 2020 at 14:12:14 UTC, jmh530 wrote:
> [snip]

Numpy uses BLAS "gemm" and D uses OpenBLAS "gemm".
Mar 12 2020
On Thursday, 12 March 2020 at 15:18:43 UTC, Pavel Shkadzko wrote:
> Numpy uses BLAS "gemm" and D uses OpenBLAS "gemm".

Depending on the system, they can use the same backend or be configured with a specific one, like OpenBLAS or Intel MKL (certain for Mir; NumPy likely allows this as well).
Mar 12 2020
On 3/12/20 3:59 PM, Pavel Shkadzko wrote:
> [snip]

How long does the benchmark run? It has already taken 20 min and continues to run in the "Mir D" stage.

P.S. Probably the reason is that I use
```
"subConfigurations": {"mir-blas": "blas"},
```
instead of
```
"subConfigurations": {"mir-blas": "twolib"},
```
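For context, a minimal dub.json sketch showing where that key lives (the package name and wildcard versions are placeholders):

```
{
    "name": "mir_benchmarks",
    "dependencies": {
        "mir-algorithm": "*",
        "mir-blas": "*"
    },
    "subConfigurations": {
        "mir-blas": "twolib"
    }
}
```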
Mar 12 2020
On Thursday, 12 March 2020 at 14:26:14 UTC, drug wrote:
> How long does the benchmark run? It has already taken 20 min and
> continues to run in the "Mir D" stage. [...]

For Numpy Python it's ~1m 30s, but all the D benchmarks together take around ~2m on my machine, which I think is the real benchmark here :)
Mar 12 2020
On 3/12/20 5:30 PM, Pavel Shkadzko wrote:
> For Numpy Python it's ~1m 30s, but all the D benchmarks together take
> around ~2m on my machine, which I think is the real benchmark here :)

Hmm, I see. In my case the benchmark hangs in dgemm_ indefinitely :(
Mar 12 2020
On 2020-03-12 13:59, Pavel Shkadzko wrote:
> I have done several benchmarks against Numpy for various 2D matrix
> operations. [...]
>
> https://github.com/tastyminerals/mir_benchmarks
>
> Compile and run the project via: dub run --compiler=ldc --build=release

Have you tried to compile with LTO (Link Time Optimization) and PGO (Profile Guided Optimization) enabled? You should also link with the versions of Phobos and druntime that have been compiled with LTO.

--
/Jacob Carlborg
Mar 14 2020
On Saturday, 14 March 2020 at 08:01:33 UTC, Jacob Carlborg wrote:
> Have you tried to compile with LTO (Link Time Optimization) and PGO
> (Profile Guided Optimization) enabled? [...]

The problem is that Numpy uses its own version of OpenBLAS, which is multithreaded, including for Level 1 BLAS operations like the L2 norm and dot product, while the D code is single-threaded.
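For a fair single-threaded comparison, the BLAS side can be pinned to one thread. A sketch, assuming the D binary links against OpenBLAS (openblas_set_num_threads is part of OpenBLAS's C API):

```
// declare the OpenBLAS C entry point; no extra binding package needed
extern (C) void openblas_set_num_threads(int num_threads);

void main()
{
    // pin OpenBLAS to one thread so its gemm/nrm2 timings are
    // comparable with the single-threaded D loops
    openblas_set_num_threads(1);
    // ... run the benchmarks ...
}
```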
Mar 14 2020
On Saturday, 14 March 2020 at 09:34:55 UTC, 9il wrote:
> The problem is that Numpy uses its own version of OpenBLAS, which is
> multithreaded, including for Level 1 BLAS operations like the L2 norm
> and dot product, while the D code is single-threaded.

My version of NumPy is installed with anaconda, and it looks like the anaconda numpy package comes with the MKL libraries. I have updated the benchmarks with respect to single/multi thread.
Mar 15 2020
On Saturday, 14 March 2020 at 08:01:33 UTC, Jacob Carlborg wrote:
> Have you tried to compile with LTO (Link Time Optimization) and PGO
> (Profile Guided Optimization) enabled? You should also link with the
> versions of Phobos and druntime that have been compiled with LTO.

If the dub.json dflags-ldc: ["-flto=full"] is enough for LTO, then it doesn't improve anything. For PGO, I am a bit confused about how to use it with dub -- dflags-ldc: ["-O3"]? It compiles, but I see no difference. By default, ldc2 should be using O2 -- good optimizations.
Mar 15 2020
On Sunday, 15 March 2020 at 12:13:39 UTC, Pavel Shkadzko wrote:
> If the dub.json dflags-ldc: ["-flto=full"] is enough for LTO, then it
> doesn't improve anything.

Try:

"dflags-ldc" : ["-flto=thin", "-defaultlib=phobos2-ldc-lto,druntime-ldc-lto", "-singleobj"]

The "-defaultlib=..." parameter engages LTO for Phobos and druntime. You can also use "-flto=full" rather than "thin"; I've had good results with "thin". Not sure if the "-singleobj" parameter helps.

> For PGO, I am a bit confused about how to use it with dub -- dflags-ldc:
> ["-O3"]? It compiles, but I see no difference. By default, ldc2 should
> be using O2 -- good optimizations.

PGO (profile guided optimization) is a multi-step process. The first step is to create an instrumented build (-fprofile-instr-generate). The second step is to run the instrumented binary on a representative workload. The last step is to use the resulting profile data in the final build (-fprofile-instr-use). For information on PGO, see Johan Engelen's blog page: https://johanengelen.github.io/ldc/2016/07/15/Profile-Guided-Optimization-with-LDC.html

I have done studies on LTO and PGO and found both beneficial, often significantly. The largest gains came in code running in tight loops that included code pulled from libraries (e.g. Phobos, druntime). It was hard to predict which code was going to benefit from LTO/PGO.

I've found it tricky to use dub for the full PGO process (creating the instrumented build, generating the profile data, and using it in the final build). Mostly I've used make for this. I did get it to work in a simple performance test app: https://github.com/jondegenhardt/dcat-perf. It doesn't document how the PGO steps work, but its dub.json file is relatively short, and the repository README.md contains the build instructions for both LTO and LTO plus PGO.

--Jon
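Pulling those flags together, a minimal dub.json sketch with LTO in a dedicated build type (the build type name is arbitrary; the -lto libraries ship with LDC):

```
{
    "name": "mir_benchmarks",
    "buildTypes": {
        "release-lto": {
            "buildOptions": ["releaseMode", "optimize", "inline"],
            "dflags-ldc": [
                "-flto=thin",
                "-defaultlib=phobos2-ldc-lto,druntime-ldc-lto",
                "-singleobj"
            ]
        }
    }
}
```

Built this way, the run command becomes: dub run --compiler=ldc2 --build=release-lto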
Mar 15 2020
On Sunday, 15 March 2020 at 20:15:07 UTC, Jon Degenhardt wrote:
> Try:
>
> "dflags-ldc" : ["-flto=thin", "-defaultlib=phobos2-ldc-lto,druntime-ldc-lto", "-singleobj"]
>
> [...]

LTO and PGO are useless for this kind of stuff. There is nothing to inline; the code is too simple and generic. There is nothing to apply this technology to.
Mar 15 2020