digitalmars.D.ldc - Profile-guided optimization (PGO)
- Johan Engelen (22/22) Dec 08 2015 Hi all,
- David Nadlinger via digitalmars-d-ldc (5/7) Dec 08 2015 Did you also try using it with sample profiles acquired by an external
- Johan Engelen (6/9) Dec 08 2015 Hi David,
- David Nadlinger via digitalmars-d-ldc (4/6) Dec 08 2015 You're welcome – I hope it's enough information to reproduce it, but I...
- Johan Engelen (14/24) Dec 10 2015 After two more bug fixes: the regexp microbench now works.
- David Nadlinger via digitalmars-d-ldc (6/9) Dec 10 2015 Don't forget that this was just a random program I pulled from the
- Johan Engelen (2/2) Dec 10 2015 What do you think about "llvm-profdata"? Should we ship that with
- David Nadlinger via digitalmars-d-ldc (21/22) Dec 13 2015 Yes, we should probably ship it with the binary packages. For distro
- Johan Engelen (31/47) Dec 13 2015 Yep :-) That's what the FIXME comments are there for: so I don't
- David Nadlinger via digitalmars-d-ldc (11/24) Dec 13 2015 Yeah, me neither.
- Johan Engelen (5/10) Dec 13 2015 Lol, I should try to read up on these simple things first...
- Johan Engelen (15/25) Jan 10 2016 Hi David,
- Kagamin (3/5) Dec 11 2015 PGO can also reduce physical memory consumption due to less code
- Johan Engelen (4/10) Dec 11 2015 Yes, indeed. I'd like to find a good example of code where this
- David Nadlinger (6/8) Dec 10 2015 Speaking of test cases: This might be an obvious and/or stupid
- Johan Engelen (22/27) Dec 10 2015 Nope didn't do that yet :S :S Looks like it is needed to iron
- David Nadlinger (7/9) Dec 10 2015 Let me add that this would probably be something nice to have for
- Liran Zvibel via digitalmars-d-ldc (2/10) Dec 10 2015
- Kagamin (3/8) Dec 23 2015 As I understand, if the profiling runs long enough, the
- Johan Engelen (4/13) Dec 23 2015 The profiling is pretty simple: it is just a bunch of counters
- Johan Engelen (8/10) Dec 08 2015 I fixed a nasty [*] bug in compiler-rt's profile writing code, and
- Johan Engelen (3/5) Dec 10 2015 Clearly I was too optimistic about the quality of my work so far,
- David Nadlinger via digitalmars-d-ldc (5/7) Dec 10 2015 Quite the contrary – you chose to start with the hard part
- Johan Engelen (4/7) Dec 23 2015 It now works with LLVM 3.7 and LLVM 3.8 (trunk), on Mac OS X,
Hi all,

I have been working on getting rudimentary PGO going in LDC. It's pretty much ready! [1] (It does not work on Windows yet... I have to fix LLVM's compiler-rt code.)

I've implemented something very similar to Clang: LDC uses profile information (generated by an instrumented executable built by LDC) to tag each branch in the code with branch weights. The actual optimizations are done by LLVM; at the moment LDC only adds metadata to the IR.

At this point, I want your input on:
 - command-line option naming
 - ease of use (llvm-profdata is needed...)
 - whether you get substantial performance boosts
 - runtime library inclusion, or a separate lib for writing the profile data file
 - bugs, uninstrumented branches/switches, etc.

All comments are welcome (please be kind ;-). Before I announce it in the "Announce" forum, I want to hear your thoughts first.

Thanks!
Johan

[1] http://wiki.dlang.org/LDC_LLVM_profiling_instrumentation#Profile-Guided_Optimization_.28PGO.29_status_in_LDC
Dec 08 2015
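As a rough illustration of the idea in the post above (counters in an instrumented build, later turned into per-branch weights), here is a minimal sketch in Python. It is not LDC's or Clang's implementation: the function names `classify` and `branch_weights` and the counter keys are hypothetical, and a real compiler inserts the counter increments automatically and may rescale the counts before attaching them as metadata.

```python
from collections import Counter


def classify(values, counters):
    """Count negative values, bumping one counter per branch arm.

    The explicit increments stand in for what an instrumented build
    would insert automatically (hypothetical, hand-written here).
    """
    negatives = 0
    for v in values:
        if v < 0:
            counters["then"] += 1   # branch taken
            negatives += 1
        else:
            counters["else"] += 1   # branch not taken
    return negatives


def branch_weights(counters):
    """Turn raw execution counts into a (taken, not_taken) weight pair."""
    return (counters["then"], counters["else"])


# "Profiling run": execute the instrumented code on representative input.
counters = Counter()
classify([3, -1, 4, 1, -5, 9, 2, 6], counters)

# "Optimizing build": the recorded counts become branch weights.
weights = branch_weights(counters)
print(weights)
```

With this input the branch is taken for two of the eight values, so an optimizer given these weights could lay out the `else` arm as the likely path.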
Hi Johan,

On 8 Dec 2015, at 20:13, Johan Engelen via digitalmars-d-ldc wrote:
> I've implemented something very similar to Clang: LDC uses profile information (generated by an instrumented executable built by LDC)

Did you also try using it with sample profiles acquired by an external profiler yet, as described in the Clang page on PGO?

- David
Dec 08 2015
On Tuesday, 8 December 2015 at 20:08:15 UTC, David Nadlinger wrote:
> Did you also try using it with sample profiles acquired by an external profiler yet, as described in the Clang page on PGO?

Hi David,

No, I have not looked at that yet. Thanks a lot for the testcase you posted on GitHub. I will sink my teeth into fixing that first.
Dec 08 2015
On 8 Dec 2015, at 23:35, Johan Engelen via digitalmars-d-ldc wrote:
> Thanks a lot for the testcase you posted on Github. Will sink my teeth in fixing that first.

You're welcome – I hope it's enough information to reproduce it, but I don't have a debug build of LLVM on this machine right now.

- David
Dec 08 2015
On Tuesday, 8 December 2015 at 22:41:22 UTC, David Nadlinger wrote:
> On 8 Dec 2015, at 23:35, Johan Engelen via digitalmars-d-ldc wrote:
>> Thanks a lot for the testcase you posted on Github. Will sink my teeth in fixing that first.
>
> You're welcome – I hope it's enough information to reproduce it, but I don't have a debug build of LLVM on this machine right now.

After two more bug fixes: the regexp microbench now works.

Results with the regexp bench (bench.d):

    time ldc2 bench.d -O3 -of=bench_normal    --> 52s
    time ./bench_normal    --> 2.55s 98%cpu
    time ldc2 bench.d -fprofile-instr-generate -of=bench_instr    --> 11s
    time ./bench_instr    --> 6.72s 99%cpu
    llvm-profdata merge default.profraw -o bench.profdata
    time ldc2 bench.d -O3 -fprofile-instr-use=bench.profdata -of=bench_pgo    --> 48.35s
    time ./bench_pgo    --> 2.48s 98%cpu

(timing numbers for bench_normal and bench_pgo are +- 0.01)

So PGO brings it from 2.55 to 2.48 sec, ~3% improvement. Disappointing, but well... it works!
Dec 10 2015
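For readers who want to check the arithmetic behind the "~3%" figure, the following small Python sketch recomputes it from the run times quoted in the post. The helper names are hypothetical; only the three timings come from the thread.

```python
# Run times (seconds) quoted in the post above.
BASELINE_S = 2.55      # ./bench_normal, built with -O3
INSTRUMENTED_S = 6.72  # ./bench_instr, built with -fprofile-instr-generate
PGO_S = 2.48           # ./bench_pgo, built with -O3 -fprofile-instr-use=...


def improvement_percent(before, after):
    """Relative run-time reduction, in percent of the original time."""
    return (before - after) / before * 100.0


def slowdown_factor(baseline, instrumented):
    """How many times slower the instrumented binary runs."""
    return instrumented / baseline


pgo_gain = improvement_percent(BASELINE_S, PGO_S)
instr_cost = slowdown_factor(BASELINE_S, INSTRUMENTED_S)

print(f"PGO improvement: {pgo_gain:.1f}%")
print(f"Instrumentation slowdown: {instr_cost:.2f}x")
```

Note that the instrumented binary in the post was built without -O3, so the slowdown factor mixes instrumentation overhead with the lack of optimization.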
On 11 Dec 2015, at 1:26, Johan Engelen via digitalmars-d-ldc wrote:
> After two more bug fixes: the regexp microbench now works. […] Disappointing, but well... it works!

Don't forget that this was just a random program I pulled from the Rosettacode compilation, though – I didn't have a benchmark ready where I know that branch prediction or inlining improvements would make a difference.

- David
Dec 10 2015
What do you think about "llvm-profdata"? Should we ship that with LDC?
Dec 10 2015
On 11 Dec 2015, at 1:38, Johan Engelen via digitalmars-d-ldc wrote:
> What do you think about "llvm-profdata"? Should we ship that with LDC?

Yes, we should probably ship it with the binary packages. For distro packages, we are of course dependent on the LLVM packages to include the tools, but at least the Homebrew package actually does.

What is left to do before we can merge a first version into the main repository? A partial list:
 - Deal with the remaining FIXME comments (at least open separate GitHub issues for them), as well as with commented-out fragments from the Clang implementation.
 - Find some way to avoid ICE-type regressions on real-world D code, for example by building the druntime/Phobos unit tests with instrumentation on.
 - Decide on a name for the command line switches. The GCC-style "-f" prefix isn't currently used for most of the options, but that's not necessarily much of an argument.

On a rather unrelated note, did you try whether the profile data also gives sensible results with llvm-cov? If yes, that might be something nice to mention in that upcoming announcement, even though we also have DMD-style -cov support, of course.

- David
Dec 13 2015
On Sunday, 13 December 2015 at 13:21:33 UTC, David Nadlinger wrote:
> What is left to do before we can merge a first version into the main repository? A partial list:
> - Deal with the remaining FIXME comments (at least open separate GitHub issues for them), as well as with commented-out fragments from the Clang implementation.

Yep :-) That's what the FIXME comments are there for: so I don't forget :-) The plan is to deal with all FIXMEs, and remove the unused commented-out Clang fragments.

> - Find some way to avoid ICE-type regressions on real-world D code, for example by building the druntime/Phobos unit tests with instrumentation on.

I just fixed two more ICEs, and now the dmd-testsuite succeeds with -fprofile-instr-generate. Running the druntime/Phobos unittests now with -fprofile-instr-generate.

Also, I added a pragma(LDC_profile_instr, true|false) to enable/disable instrumentation codegen for specific functions. My main reason for this is to help people speed up instrumented binaries, and it also helps to circumvent ICEs. See tests/ir/profile/pragma.d. (Perhaps you can think of a better name for the pragma.)

> - Decide on a name for the command line switches. The GCC-style "-f" prefix isn't currently used for most of the options, but that's not necessarily much of an argument.

I have absolutely no preference here. I think we should do what the world is already familiar with. IIRC, -fprofile-instr-generate is a Clang option, and Clang is moving towards / will support GCC's option naming (-fprofile-generate, -fprofile-use).

DMD has a -profile option, but I have not read up on what that will do. I guess we will not add any option for PGO to ldmd2?

> On a rather unrelated note, did you try whether the profile data also gives sensible results with llvm-cov? If yes, that might be something nice to mention in that upcoming announcement, even though we also have DMD-style -cov support, of course.

Did not look into this at all yet. Clang's PGOGen code has some extra functions for gcov support and more. It's all commented out for now, but it looks like we can support more tools relatively easily with the current implementation. I also see hints of sampling-based PGO in the code, for example.

Another important TODO item: remove the profiling runtime from druntime, and instead add a separate runtime-profiling lib (suggestions for a name? ldc-profile.lib?).
Dec 13 2015
On 13 Dec 2015, at 14:59, Johan Engelen via digitalmars-d-ldc wrote:
> I just fixed two more ICEs, and now the dmd-testsuite succeeds with -fprofile-instr-generate. Running the druntime/Phobos unittests now with -fprofile-instr-generate.

Nice!

> I have absolutely no preference here. I think we should do what the world is already familiar with. Iirc, -fprofile-instr-generate is a Clang option, and that Clang is moving towards / will support GCC's option naming (-fprofile-generate, -fprofile-use).

Yeah, me neither.

> DMD has a -profile option, but I have not read up on what that will do.

It makes DMD's druntime emit some profiling info as text files (trace.def/trace.log) at program exit. This is for manual analysis only, no PGO-type functionality in sight.

> I guess we will not add any option for PGO to ldmd2?

Yeah, as DMD does not have any PGO functionality. Of course it will still pass the ldc2 options through.

> Another important TODO item: remove the profiling runtime from druntime, and instead add a separate runtime-profiling lib (suggestions for a name? ldc-profile.lib?).

Maybe add a "rt" suffix to make clear that this is the actual program runtime part? Ultimately does not really matter, though.

- David
Dec 13 2015
On Sunday, 13 December 2015 at 14:16:53 UTC, David Nadlinger wrote:
> On 13 Dec 2015, at 14:59, Johan Engelen via digitalmars-d-ldc wrote:
>> I guess we will not add any option for PGO to ldmd2?
>
> Yeah, as DMD does not have any PGO functionality. Of course it will still pass the ldc2 options through.

Lol, I should try to read up on these simple things first... To do the tests, I had modified ldmd to recognize -fprofile-instr-generate...
Dec 13 2015
Hi David,

Could you have a look at the PR again? I think it is almost ready for merging. (Two FIXMEs are left that I hope to address soon.)

On Sunday, 13 December 2015 at 13:21:33 UTC, David Nadlinger wrote:
> On 11 Dec 2015, at 1:38, Johan Engelen via digitalmars-d-ldc wrote:
>> What do you think about "llvm-profdata"? Should we ship that with LDC?
>
> Yes, we should probably ship it with the binary packages.

I have added llvm-profdata to the repo, and renamed the built executable to ldc-profdata (it has to be in sync with the LLVM version that LDC was built with, so the renaming prevents potential name clashes with a system-installed profdata version).

> What is left to do before we can merge a first version into the main repository? A partial list:
> - Deal with the remaining FIXME comments (at least open separate GitHub issues for them), as well as with commented-out fragments from the Clang implementation.

I think I will just remove all unused Clang implementation code (it is related to coverage stuff, which I think we should implement in a different branch after PGO is merged).

Thanks!
Johan
Jan 10 2016
On Friday, 11 December 2015 at 00:26:22 UTC, Johan Engelen wrote:
> So PGO brings it from 2.55 to 2.48 sec, ~3% improvement. Disappointing, but well... it works!

PGO can also reduce physical memory consumption due to less code loaded into memory.
Dec 11 2015
On Friday, 11 December 2015 at 14:07:56 UTC, Kagamin wrote:
> On Friday, 11 December 2015 at 00:26:22 UTC, Johan Engelen wrote:
>> So PGO brings it from 2.55 to 2.48 sec, ~3% improvement. Disappointing, but well... it works!
>
> PGO can also reduce physical memory consumption due to less code loaded into memory.

Yes, indeed. I'd like to find a good example of code where this shows up in practice, so that PGO really does improve performance significantly (say, >10%).
Dec 11 2015
On Tuesday, 8 December 2015 at 22:35:11 UTC, Johan Engelen wrote:
> Thanks a lot for the testcase you posted on Github. Will sink my teeth in fixing that first.

Speaking of test cases: This might be an obvious and/or stupid suggestion, but did you try building the Phobos unit tests (and maybe also dmd-testsuite/runnable) with PGO? I'd suspect it would give you quite a broad coverage of basic language constructs.

- David
Dec 10 2015
On Thursday, 10 December 2015 at 14:27:59 UTC, David Nadlinger wrote:
> Speaking of test cases: This might be an obvious and/or stupid suggestion, but did you try building the Phobos unit tests (and maybe also dmd-testsuite/runnable) with PGO? I'd suspect it would give you quite a broad coverage of basic language constructs.

Nope, didn't do that yet :S :S Looks like it is needed to iron out some remaining bugs.

I underestimated the complexity of D's AST (some objects are placed in multiple locations in the AST?), which gave rise to an assertion failure in your testcase; plus I forgot to add throw statements to the AST tree walker, leading to another assertion failure. Those issues have been fixed now, and now it breaks with the same error you found. It is confusing because I did not (mean to) change any of the codegen, other than adding counter increment instructions and branch instruction metadata (both trivial additions). But I did have to add extra basic blocks for switch statements... perhaps I can search there first. Hope to have a resolution for your test case quickly.

I also have not tested at all how this works with multiple object files linked together, or other possibly more complicated things.

I thought a fun testcase would be to compile DDMD with PGO enabled, compile itself as a profiling run, rebuild with PGO and test if compiling, say, Phobos is quicker/slower. I am very curious to see what constructs will see a significant performance boost, if any at all.
Dec 10 2015
On Tuesday, 8 December 2015 at 20:08:15 UTC, David Nadlinger wrote:
> Did you also try using it with sample profiles acquired by an external profiler yet, as described in the Clang page on PGO?

Let me add that this would probably be something nice to have for the initial release, as users could fall back to using perf, etc. if the instrumentation part is still buggy or incomplete for their code.

- David
Dec 10 2015
Also, for use cases like ours, where the system runs for extended periods of time and optimizing the init time (which may be minutes) is not interesting at all, just being able to run perf while the system is doing something interesting to improve is a big plus.

Liran

On Dec 10, 2015, at 16:30, David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> wrote:
> On Tuesday, 8 December 2015 at 20:08:15 UTC, David Nadlinger wrote:
>> Did you also try using it with sample profiles acquired by an external profiler yet, as described in the Clang page on PGO?
>
> Let me add that this would probably be something nice to have for the initial release, as users could fall back to using perf, etc. if the instrumentation part is still buggy or incomplete for their code.
>
> - David
Dec 10 2015
On Thursday, 10 December 2015 at 14:38:19 UTC, Liran Zvibel wrote:
> Also, for use cases like ours, where the system runs for extended periods of time, and optimizing the init time, which may be minutes is not interesting at all, just being able to run perf while the system is doing something interesting to improve is a big plus.

As I understand, if the profiling runs long enough, the long-running statistics will dominate startup statistics?
Dec 23 2015
On Wednesday, 23 December 2015 at 13:10:33 UTC, Kagamin wrote:
> On Thursday, 10 December 2015 at 14:38:19 UTC, Liran Zvibel wrote:
>> Also, for use cases like ours, where the system runs for extended periods of time, and optimizing the init time, which may be minutes is not interesting at all, just being able to run perf while the system is doing something interesting to improve is a big plus.
>
> As I understand, if the profiling runs long enough, the long-running statistics will dominate startup statistics?

The profiling is pretty simple: it is just a bunch of counters wherever the code branches. So indeed, in long runs the long-running statistics will dominate the startup statistics.
Dec 23 2015
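The point about long runs can be made concrete with a small Python sketch. It assumes a single branch whose behaviour differs between a short startup phase and a long steady-state phase; the function names, iteration counts and branch frequencies are invented for illustration and do not come from any real profile.

```python
def profile(iterations, period):
    """Run a loop whose branch is taken once every `period` iterations.

    Returns the two counters (taken, not_taken) that an instrumented
    build would accumulate for this branch.
    """
    taken = not_taken = 0
    for i in range(iterations):
        if i % period == 0:
            taken += 1
        else:
            not_taken += 1
    return taken, not_taken


def taken_fraction(counts):
    """Fraction of executions in which the branch was taken."""
    taken, not_taken = counts
    return taken / (taken + not_taken)


# Hypothetical workload: short startup where the branch is taken half
# the time, then a long steady state where it is taken 1% of the time.
startup = profile(1_000, 2)
steady = profile(1_000_000, 100)

# The counters simply add up over the whole run.
merged = (startup[0] + steady[0], startup[1] + steady[1])

print(f"startup only: {taken_fraction(startup):.3f}")
print(f"steady only:  {taken_fraction(steady):.3f}")
print(f"whole run:    {taken_fraction(merged):.3f}")
```

Because the counters are plain sums, the whole-run fraction ends up very close to the steady-state one, which matches the expectation stated in the post.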
On Tuesday, 8 December 2015 at 19:13:41 UTC, Johan Engelen wrote:
> (does not work on Windows yet... I have to fix LLVM's compile-rt code)

I fixed a nasty [*] bug in compiler-rt's profile writing code, and now it also works on Windows. (The IR tests fail on Windows because running a compiled executable from LIT fails for some reason on Windows.)

[*] https://stackoverflow.com/questions/5537066/strange-0x0d-being-added-to-my-binary-file
Now I know what to look for first if I see 0x0D's in my files...
Dec 08 2015
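The linked Stack Overflow question concerns binary data written through a text-mode stream on Windows, where every 0x0A byte is expanded to 0x0D 0x0A. The Python sketch below reproduces that effect on any platform by forcing the newline translation explicitly. It only illustrates this class of bug and is not the actual compiler-rt code or fix; the helper names and the payload are hypothetical.

```python
import io


def write_text_mode(payload: bytes) -> bytes:
    """Write bytes through a text stream that translates LF to CRLF.

    newline="\r\n" emulates Windows text-mode behaviour on any platform;
    latin-1 maps each byte to one character, so nothing else changes.
    """
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding="latin-1", newline="\r\n")
    text.write(payload.decode("latin-1"))
    text.flush()
    result = raw.getvalue()
    text.detach()  # keep the wrapper from closing `raw` behind our back
    return result


def write_binary_mode(payload: bytes) -> bytes:
    """Write bytes through a binary stream: no translation at all."""
    raw = io.BytesIO()
    raw.write(payload)
    return raw.getvalue()


# A made-up "profile record" that happens to contain the byte 0x0A.
payload = bytes([0x01, 0x0A, 0x02])

print(write_text_mode(payload).hex())    # an extra 0d appears before 0a
print(write_binary_mode(payload).hex())  # bytes are preserved
```

In C the same distinction is usually the difference between opening a file with mode "w" and mode "wb".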
On Tuesday, 8 December 2015 at 19:13:41 UTC, Johan Engelen wrote:
> Before I announce it in the "Announce" forum, I want to hear your thoughts first.

Clearly I was too optimistic about the quality of my work so far, hehe.
Dec 10 2015
On 10 Dec 2015, at 19:43, Johan Engelen via digitalmars-d-ldc wrote:
> Clearly I was too optimistic about the quality of my work so far, hehe.

Quite the contrary – you chose to start with the hard part (instrumentation-based instead of sampling-based), and DMD's AST is notoriously, uh, fluid in meaning and under-documented.

- David
Dec 10 2015
On Tuesday, 8 December 2015 at 19:13:41 UTC, Johan Engelen wrote:
> Hi all, I have been working on getting rudimentary PGO going in LDC. It's pretty much ready!

It now works with LLVM 3.7 and LLVM 3.8 (trunk), on Mac OS X, Linux (tested on Ubuntu), and Windows! I have not tested other platforms.
Dec 23 2015