
digitalmars.D - Standard D, Mir D benchmarks against Numpy (BLAS)

reply Pavel Shkadzko <p.shkadzko gmail.com> writes:
I have done several benchmarks against Numpy for various 2D 
matrix operations. The purpose was mere curiosity and to spread 
the word about the Mir D library among the office data engineers.
Since I am not a D expert, I would be happy if someone could take 
a second look and double-check.

https://github.com/tastyminerals/mir_benchmarks

Compile and run the project via: dub run --compiler=ldc 
--build=release

*Table descriptions reduced to fit into post width.

+---------------------------------+---------------------+--------------------+---------------------+
| Description                     | Numpy (BLAS) (sec.) | Standard D (sec.)  | Mir D (sec.)        |
+---------------------------------+---------------------+--------------------+---------------------+
| sum of two 250x200 (50 loops)   | 0.00115             | 0.00400213(x3.5)   | 0.00014372(x1/8)    |
| mult of two 250x200 (50 loops)  | 0.0011578           | 0.0132323(x11.4)   | 0.00013852(x1/8.3)  |
| sum of two 500x600 (50 loops)   | 0.0101275           | 0.016496(x1.6)     | 0.00021556(x1/47)   |
| mult of two 500x600 (50 loops)  | 0.010182            | 0.06857(x6.7)      | 0.00021717(x1/47)   |
| sum of two 1k x 1k (50 loops)   | 0.0493201           | 0.0614544(x1.3)    | 0.000422135(x1/117) |
| mult of two 1k x 1k (50 loops)  | 0.0493693           | 0.233827(x4.7)     | 0.000453535(x1/109) |
| Scalar product of two 30k       | 0.0152186           | 0.0227465(x1.5)    | 0.0198812(x1.3)     |
| Dot product of 5k x 6k, 6k x 5k | 1.6084685           | --------------     | 2.03398(x1.2)       |
| L2 norm of 5k x 6k              | 0.0072423           | 0.0160546(x2.2)    | 0.0110136(x1.6)     |
| Quicksort of 5k x 6k            | 2.6516816           | 0.178071(x1/14.8)  | 1.52406(x1/0.6)     |
+---------------------------------+---------------------+--------------------+---------------------+
Mar 12 2020
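To make the numbers above concrete, here is a minimal sketch of the shape such a measurement takes in plain D. This is hypothetical, not the repo's actual code: the matrices are assumed to be flat double[] arrays and the operation is the element-wise sum.
```
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Mean wall-clock seconds per run of `op` over `runs` repetitions.
double bench(scope void delegate() op, size_t runs = 50)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. runs)
        op();
    sw.stop();
    return sw.peek.total!"nsecs" / 1e9 / runs;
}

void main()
{
    // Two 250x200 matrices stored as flat arrays, plus an output buffer.
    auto a = new double[](250 * 200);
    auto b = new double[](250 * 200);
    auto c = new double[](250 * 200);
    a[] = 1.5;
    b[] = 2.5;
    writefln("sum of two 250x200: %.8f sec.", bench({ c[] = a[] + b[]; }));
}
```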
next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
You forgot to disable the GC for both benchmarks.
Also @fastmath for standard_ops_bench.

FYI: standard_ops_bench does a LOT of memory allocations.
Mar 12 2020
parent reply Pavel Shkadzko <p.shkadzko gmail.com> writes:
On Thursday, 12 March 2020 at 13:18:41 UTC, rikki cattermole 
wrote:
 You forgot to disable the GC for both benchmarks.
 Also @fastmath for standard_ops_bench.

 FYI: standard_ops_bench does a LOT of memory allocations.
Thank you. Add GC.disable; inside the main function, right? It didn't really change anything for any of the benchmarks, maybe I did it wrong.

Does @fastmath work only on functions with plain loops, or on everything with math ops? It is not clear from the LDC docs.
Mar 12 2020
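For reference, a minimal sketch of both suggestions. GC.disable comes from core.memory; @fastmath comes from ldc.attributes and is LDC-specific. The function below is illustrative, not taken from the repo.
```
import core.memory : GC;
import ldc.attributes : fastmath;

// @fastmath relaxes IEEE strictness for this function, so LDC may
// vectorize and reassociate the floating-point loop.
@fastmath double sumAll(const double[] xs)
{
    double s = 0;
    foreach (x; xs)
        s += x;
    return s;
}

void main()
{
    GC.disable();             // no collections during the timed runs
    scope (exit) GC.enable();
    // ... run the benchmarks here ...
}
```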
parent rikki cattermole <rikki cattermole.co.nz> writes:
On 13/03/2020 3:27 AM, Pavel Shkadzko wrote:
 On Thursday, 12 March 2020 at 13:18:41 UTC, rikki cattermole wrote:
 You forgot to disable the GC for both benchmarks.
 Also @fastmath for standard_ops_bench.

 FYI: standard_ops_bench does a LOT of memory allocations.
Thank you. Add GC.disable; inside the main function, right? It didn't really change anything for any of the benchmarks, maybe I did it wrong.
Okay, that means no GC collection was triggered during your benchmarks. This is good to know: it means the performance problems are indeed on your end and not runtime related.
 Does @fastmath work only on functions with plain loops, or on everything 
 with math ops? It is not clear from the LDC docs.
Try it :) I have no idea how much it'll help. You have used it on one but not the other, so it seems odd not to do it on both.
Mar 12 2020
prev sibling next sibling parent 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 12 March 2020 at 12:59:41 UTC, Pavel Shkadzko wrote:
 I have done several benchmarks against Numpy for various 2D 
 matrix operations. The purpose was mere curiosity and to spread 
 the word about the Mir D library among the office data engineers.
 Since I am not a D expert, I would be happy if someone could 
 take a second look and double-check.

 https://github.com/tastyminerals/mir_benchmarks

 Compile and run the project via: dub run --compiler=ldc 
 --build=release

 *Table descriptions reduced to fit into post width.

 [...]
Haha
Mar 12 2020
prev sibling next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 12 March 2020 at 12:59:41 UTC, Pavel Shkadzko wrote:
 I have done several benchmarks against Numpy for various 2D 
 matrix operations. The purpose was mere curiosity and to spread 
 the word about the Mir D library among the office data engineers.
 Since I am not a D expert, I would be happy if someone could 
 take a second look and double-check.
Generally speaking, the D/Mir code in the benchmark is slow because of how it has been written. I am not urging you to use D/Mir. In fact, I sometimes advise my clients not to use it if they can avoid it. On commercial request, I can write the benchmark or an applied algorithm so that D/Mir beats numpy in all the tests, including gemm. --Ilya
Mar 12 2020
next sibling parent 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 12 March 2020 at 14:00:48 UTC, 9il wrote:
 [...]
Ah, nevermind, the forum table didn't show the mir numbers aligned. Thank you for the work. I will open an MR with a few add-ons.
Mar 12 2020
prev sibling next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 12 March 2020 at 14:00:48 UTC, 9il wrote:
 [snip]

 Generally speaking, the D/Mir code in the benchmark is slow 
 because of how it has been written. I am not urging you to use 
 D/Mir. In fact, I sometimes advise my clients not to use it if 
 they can avoid it. On commercial request, I can write the 
 benchmark or an applied algorithm so that D/Mir beats numpy in 
 all the tests, including gemm. --Ilya
I saw your subsequent post about not seeing the numbers, but my broader response is that most people don't need to squeeze out every last drop of performance. Typical performance for numpy versus typical performance for mir is still valuable information for people to know.
Mar 12 2020
prev sibling parent reply Pavel Shkadzko <p.shkadzko gmail.com> writes:
On Thursday, 12 March 2020 at 14:00:48 UTC, 9il wrote:
 On Thursday, 12 March 2020 at 12:59:41 UTC, Pavel Shkadzko 
 wrote:
 [...]
Generally speaking, the D/Mir code in the benchmark is slow because of how it has been written. I am not urging you to use D/Mir. In fact, I sometimes advise my clients not to use it if they can avoid it. On commercial request, I can write the benchmark or an applied algorithm so that D/Mir beats numpy in all the tests, including gemm. --Ilya
Didn't understand. You argue against D/Mir usage when talking to your clients?

Actually, I feel it is also useful to have unoptimized D code benchmarked, because this is how most people will write their code at first. Although, I can hardly call these benchmarks unoptimized, because I use LDC optimization flags as well as some tips from you.
Mar 12 2020
parent reply 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 12 March 2020 at 14:37:13 UTC, Pavel Shkadzko wrote:
 [...]
Didn't understand. You argue against D/Mir usage when talking to your clients?
It depends on the problem they wanted me to solve.
 Actually, I feel it is also useful to have unoptimized D 
 code benchmarked, because this is how most people will write 
 their code at first. Although, I can hardly call these 
 benchmarks unoptimized, because I use LDC optimization flags 
 as well as some tips from you.
Agreed. I just misunderstood the table on the forum; it was misaligned for me. The numbers look cool, thank you for the benchmark. Mir sorting looks slower than Phobos, which is interesting and needs a fix. You can use Phobos sorting with ndslice the same way, with `each`.

Minor updates: https://github.com/tastyminerals/mir_benchmarks/pull/1
Mar 12 2020
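A rough, untested sketch of that `each` suggestion, assuming byDim is exposed through mir.ndslice and each through mir.algorithm.iteration, which is where current Mir versions keep them:
```
import mir.ndslice : byDim, sliced;
import mir.algorithm.iteration : each;
import std.algorithm.sorting : sort;

void main()
{
    auto m = [3.0, 1, 2,
              9.0, 4, 7].sliced(2, 3);
    // Apply Phobos sort to each row of the ndslice via Mir's each.
    m.byDim!0.each!sort;
}
```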
next sibling parent reply Pavel Shkadzko <p.shkadzko gmail.com> writes:
On Thursday, 12 March 2020 at 15:34:58 UTC, 9il wrote:
 On Thursday, 12 March 2020 at 14:37:13 UTC, Pavel Shkadzko 
 wrote:
 [...]
It depends on the problem they wanted me to solve.
 [...]
Agreed. I just misunderstood the table on the forum; it was misaligned for me. The numbers look cool, thank you for the benchmark. Mir sorting looks slower than Phobos, which is interesting and needs a fix. You can use Phobos sorting with ndslice the same way, with `each`. Minor updates: https://github.com/tastyminerals/mir_benchmarks/pull/1
Thank you for the comments! Looks like I will be updating the benchmark tables today :)
Mar 12 2020
parent 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 12 March 2020 at 15:46:47 UTC, Pavel Shkadzko wrote:
 On Thursday, 12 March 2020 at 15:34:58 UTC, 9il wrote:
 On Thursday, 12 March 2020 at 14:37:13 UTC, Pavel Shkadzko 
 wrote:
 [...]
It depends on the problem they wanted me to solve.
 [...]
Agreed. I just misunderstood the table on the forum; it was misaligned for me. The numbers look cool, thank you for the benchmark. Mir sorting looks slower than Phobos, which is interesting and needs a fix. You can use Phobos sorting with ndslice the same way, with `each`.
Phobos sort bench bug report: https://github.com/tastyminerals/mir_benchmarks/issues/2
 Minor updates
 https://github.com/tastyminerals/mir_benchmarks/pull/1
Thank you for the comments! Looks like I will be updating the benchmark tables today :)
Another small update that changes the ratio a lot: https://github.com/tastyminerals/mir_benchmarks/pull/3
Mar 12 2020
prev sibling parent reply p.shkadzko <p.shkadzko gmail.com> writes:
On Thursday, 12 March 2020 at 15:34:58 UTC, 9il wrote:
 On Thursday, 12 March 2020 at 14:37:13 UTC, Pavel Shkadzko 
 wrote:
 On Thursday, 12 March 2020 at 14:00:48 UTC, 9il wrote:
 On Thursday, 12 March 2020 at 12:59:41 UTC, Pavel Shkadzko 
 wrote:
[...]
Didn't understand. You argue against D/Mir usage when talking to your clients?
It depends on the problem they wanted me to solve.
 Actually, I feel it is also useful to have unoptimized D 
 code benchmarked, because this is how most people will write 
 their code at first. Although, I can hardly call these 
 benchmarks unoptimized, because I use LDC optimization flags 
 as well as some tips from you.
Agreed. I just misunderstood the table on the forum; it was misaligned for me. The numbers look cool, thank you for the benchmark. Mir sorting looks slower than Phobos, which is interesting and needs a fix. You can use Phobos sorting with ndslice the same way, with `each`. Minor updates: https://github.com/tastyminerals/mir_benchmarks/pull/1
I am actually intrigued by the timings for the huge matrices. Why are Mir D and Standard D so much better than NumPy? Once we get to the 500x600 and 1000x1000 sizes, there is a huge drop in performance for NumPy and not so much for D. You mentioned the L3 cache, but the CPU architecture is the same for all the benchmarks, so what's going on?
Mar 12 2020
next sibling parent bachmeier <no spam.net> writes:
On Thursday, 12 March 2020 at 20:39:59 UTC, p.shkadzko wrote:

 I am actually intrigued by the timings for the huge matrices. 
 Why are Mir D and Standard D so much better than NumPy? Once we 
 get to the 500x600 and 1000x1000 sizes, there is a huge drop in 
 performance for NumPy and not so much for D. You mentioned the 
 L3 cache, but the CPU architecture is the same for all the 
 benchmarks, so what's going on?
It's been quite a while since I worked with numpy, but I think that's where you're hitting memory limits (easier to do with Python than with D), which causes performance to deteriorate quickly. I had those problems with R, and I believe it's relatively easy to hit that constraint with numpy as well, but you definitely want to find a numpy expert to confirm -- something I definitely am not.
Mar 12 2020
prev sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Thursday, 12 March 2020 at 20:39:59 UTC, p.shkadzko wrote:
 On Thursday, 12 March 2020 at 15:34:58 UTC, 9il wrote:
 [...]
I am actually intrigued by the timings for the huge matrices. Why are Mir D and Standard D so much better than NumPy? Once we get to the 500x600 and 1000x1000 sizes, there is a huge drop in performance for NumPy and not so much for D. You mentioned the L3 cache, but the CPU architecture is the same for all the benchmarks, so what's going on?
The interpreter getting in the way of the hardware prefetcher, maybe.
Mar 13 2020
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 12 March 2020 at 12:59:41 UTC, Pavel Shkadzko wrote:
 [snip]
Looked into some of those that aren't faster than numpy:

For the dot product (what I would just call matrix multiplication), both functions are using gemm. There might be some quirks that have caused a difference in performance, but otherwise I would expect them to be pretty close, and they are. It looks like you are allocating the output matrix with the GC, which could be a driver of the difference.

For the L2 norm, you are calculating it entry-wise as a Frobenius norm. That should be the same as the default for numpy. For numpy, the only difference I can tell between yours and theirs is that it re-uses its dot product function. Otherwise it looks the same.
Mar 12 2020
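For reference, the entry-wise definition under discussion: the Frobenius norm is the square root of the sum of squared entries. A plain-D sketch over a flat array, not the benchmark's code:
```
import std.math : sqrt;

// ||M||_F = sqrt(sum of m[i]^2), with the matrix flattened to double[].
double frobenius(const double[] m)
{
    double s = 0;
    foreach (x; m)
        s += x * x;
    return sqrt(s);
}
```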
parent reply Pavel Shkadzko <p.shkadzko gmail.com> writes:
On Thursday, 12 March 2020 at 14:12:14 UTC, jmh530 wrote:
 On Thursday, 12 March 2020 at 12:59:41 UTC, Pavel Shkadzko 
 wrote:
 [snip]
Looked into some of those that aren't faster than numpy: For the dot product (what I would just call matrix multiplication), both functions are using gemm. There might be some quirks that have caused a difference in performance, but otherwise I would expect them to be pretty close, and they are. It looks like you are allocating the output matrix with the GC, which could be a driver of the difference. For the L2 norm, you are calculating it entry-wise as a Frobenius norm. That should be the same as the default for numpy. For numpy, the only difference I can tell between yours and theirs is that it re-uses its dot product function. Otherwise it looks the same.
Numpy uses BLAS "gemm" and D uses OpenBLAS "gemm".
Mar 12 2020
parent 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 12 March 2020 at 15:18:43 UTC, Pavel Shkadzko wrote:
 On Thursday, 12 March 2020 at 14:12:14 UTC, jmh530 wrote:
 [...]
Numpy uses BLAS "gemm" and D uses OpenBLAS "gemm".
Depending on the system, they can use the same backend, or a specific one like OpenBLAS or Intel MKL can be configured (I am sure about Mir; NumPy likely allows it as well).
Mar 12 2020
prev sibling next sibling parent reply drug <drug2004 bk.ru> writes:
On 3/12/20 3:59 PM, Pavel Shkadzko wrote:
 
 [snip]
 
How long does the benchmark run? It has already taken 20 min and is still running in the "Mir D" stage.

P.S. Probably the reason is that I use
```
"subConfigurations": {"mir-blas": "blas"},
```
instead of
```
"subConfigurations": {"mir-blas": "twolib"},
```
Mar 12 2020
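For context, a minimal dub.json shape where such a subconfiguration sits. The package name and version constraint here are placeholders, not taken from the repo:
```
{
    "name": "bench",
    "dependencies": {
        "mir-blas": "~>1.1"
    },
    "subConfigurations": {
        "mir-blas": "twolib"
    }
}
```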
parent reply Pavel Shkadzko <p.shkadzko gmail.com> writes:
On Thursday, 12 March 2020 at 14:26:14 UTC, drug wrote:
 On 3/12/20 3:59 PM, Pavel Shkadzko wrote:
 
 [snip]
 
How long does the benchmark run? It has already taken 20 min and is still running in the "Mir D" stage.

P.S. Probably the reason is that I use
```
"subConfigurations": {"mir-blas": "blas"},
```
instead of
```
"subConfigurations": {"mir-blas": "twolib"},
```
For Numpy Python it's ~1m 30s, but all the D benchmarks take around ~2m on my machine, which I think is the real benchmark here :)
Mar 12 2020
parent drug <drug2004 bk.ru> writes:
On 3/12/20 5:30 PM, Pavel Shkadzko wrote:
 
 For Numpy Python it's ~1m 30s, but all the D benchmarks take around 
 ~2m on my machine, which I think is the real benchmark here :)
Hmm, I see. In my case the benchmark hangs in dgemm_ indefinitely :(
Mar 12 2020
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2020-03-12 13:59, Pavel Shkadzko wrote:
 I have done several benchmarks against Numpy for various 2D matrix 
 operations. The purpose was mere curiosity and to spread the word about 
 the Mir D library among the office data engineers.
 Since I am not a D expert, I would be happy if someone could take a 
 second look and double-check.
 
 https://github.com/tastyminerals/mir_benchmarks
 
 Compile and run the project via: dub run --compiler=ldc --build=release
Have you tried compiling with LTO (Link Time Optimization) and PGO (Profile Guided Optimization) enabled? You should also link with the versions of Phobos and druntime that have been compiled with LTO.

--
/Jacob Carlborg
Mar 14 2020
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Saturday, 14 March 2020 at 08:01:33 UTC, Jacob Carlborg wrote:
 On 2020-03-12 13:59, Pavel Shkadzko wrote:
 I have done several benchmarks against Numpy for various 2D 
 matrix operations. The purpose was mere curiosity and to spread 
 the word about the Mir D library among the office data engineers.
 Since I am not a D expert, I would be happy if someone could 
 take a second look and double-check.
 
 https://github.com/tastyminerals/mir_benchmarks
 
 Compile and run the project via: dub run --compiler=ldc 
 --build=release
Have you tried compiling with LTO (Link Time Optimization) and PGO (Profile Guided Optimization) enabled? You should also link with the versions of Phobos and druntime that have been compiled with LTO.
The problem is that Numpy uses its own version of OpenBLAS, which is multithreaded, including for Level 1 BLAS operations like the L2 norm and the dot product, while the D code is single-threaded.
Mar 14 2020
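As an aside: for an apples-to-apples single-threaded comparison, OpenBLAS's thread count can usually be pinned via the environment before running the Python side. The script name below is illustrative, not from the repo:
```
OPENBLAS_NUM_THREADS=1 python bench.py
```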
parent Pavel Shkadzko <p.shkadzko gmail.com> writes:
On Saturday, 14 March 2020 at 09:34:55 UTC, 9il wrote:
 On Saturday, 14 March 2020 at 08:01:33 UTC, Jacob Carlborg 
 wrote:
 On 2020-03-12 13:59, Pavel Shkadzko wrote:
 [...]
Have you tried compiling with LTO (Link Time Optimization) and PGO (Profile Guided Optimization) enabled? You should also link with the versions of Phobos and druntime that have been compiled with LTO.
The problem is that Numpy uses its own version of OpenBLAS, which is multithreaded, including for Level 1 BLAS operations like the L2 norm and the dot product, while the D code is single-threaded.
My version of NumPy was installed with anaconda, and it looks like the anaconda numpy package comes with MKL libraries. I have updated the benchmarks with respect to single/multi-threading.
Mar 15 2020
prev sibling parent reply Pavel Shkadzko <p.shkadzko gmail.com> writes:
On Saturday, 14 March 2020 at 08:01:33 UTC, Jacob Carlborg wrote:
 On 2020-03-12 13:59, Pavel Shkadzko wrote:
 I have done several benchmarks against Numpy for various 2D 
 matrix operations. The purpose was mere curiosity and to spread 
 the word about the Mir D library among the office data engineers.
 Since I am not a D expert, I would be happy if someone could 
 take a second look and double-check.
 
 https://github.com/tastyminerals/mir_benchmarks
 
 Compile and run the project via: dub run --compiler=ldc 
 --build=release
Have you tried compiling with LTO (Link Time Optimization) and PGO (Profile Guided Optimization) enabled? You should also link with the versions of Phobos and druntime that have been compiled with LTO.
If dub.json's dflags-ldc: ["-flto=full"] is enough for LTO, then it doesn't improve anything. For PGO, I am a bit confused about how to use it with dub -- dflags-ldc: ["-O3"]? It compiles, but I see no difference. By default, ldc2 should be using -O2 -- good optimizations.
Mar 15 2020
parent reply Jon Degenhardt <jond noreply.com> writes:
On Sunday, 15 March 2020 at 12:13:39 UTC, Pavel Shkadzko wrote:
 On Saturday, 14 March 2020 at 08:01:33 UTC, Jacob Carlborg 
 wrote:
 On 2020-03-12 13:59, Pavel Shkadzko wrote:
 [...]
Have you tried compiling with LTO (Link Time Optimization) and PGO (Profile Guided Optimization) enabled? You should also link with the versions of Phobos and druntime that have been compiled with LTO.
If dub.json's dflags-ldc: ["-flto=full"] is enough for LTO, then it doesn't improve anything.
Try:
```
"dflags-ldc" : ["-flto=thin", "-defaultlib=phobos2-ldc-lto,druntime-ldc-lto", "-singleobj"]
```
The "-defaultlib=..." parameter engages LTO for phobos and druntime. You can also use "-flto=full" rather than "thin". I've had good results with "thin". Not sure if the "-singleobj" parameter helps.
 For PGO, I am a bit confused about how to use it with dub -- 
 dflags-ldc: ["-O3"]? It compiles, but I see no difference. By 
 default, ldc2 should be using -O2 -- good optimizations.
PGO (profile guided optimization) is a multi-step process. The first step is to create an instrumented build (-fprofile-instr-generate). The second step is to run the instrumented binary on a representative workload. The last step is to use the resulting profile data in the final build (-fprofile-instr-use). For information on PGO see Johan Engelen's blog page: https://johanengelen.github.io/ldc/2016/07/15/Profile-Guided-Optimization-with-LDC.html

I have done studies on LTO and PGO and found both beneficial, often significantly. The largest gains came in tight loops that included code pulled from libraries (e.g. phobos, druntime). It was hard to predict which code was going to benefit from LTO/PGO.

I've found it tricky to use dub for the full PGO process (creating the instrumented build, generating the profile data, and using it in the final build). Mostly I've used make for this. I did get it to work in a simple performance test app: https://github.com/jondegenhardt/dcat-perf. It doesn't document how the PGO steps work, but its dub.json file is relatively short and the repository README.md contains the build instructions for both LTO and LTO plus PGO.

--Jon
Mar 15 2020
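Condensed into commands, following the flag names from Johan Engelen's post linked above; app.d stands in for the real build:
```
ldc2 -O2 -fprofile-instr-generate=profile.raw app.d -of=app_instr
./app_instr                                  # run a representative workload
ldc-profdata merge -output=profile.data profile.raw
ldc2 -O2 -fprofile-instr-use=profile.data app.d -of=app
```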
parent 9il <ilyayaroshenko gmail.com> writes:
On Sunday, 15 March 2020 at 20:15:07 UTC, Jon Degenhardt wrote:
 On Sunday, 15 March 2020 at 12:13:39 UTC, Pavel Shkadzko wrote:
 [...]
 Try: "dflags-ldc" : ["-flto=thin", "-defaultlib=phobos2-ldc-lto,druntime-ldc-lto", "-singleobj"] [...]
LTO and PGO are useless for this kind of stuff. There is nothing to inline; the code is too simple and generic. There is nothing to apply this technology to.
Mar 15 2020