digitalmars.D - std.benchmark is in reviewable state
- Andrei Alexandrescu (16/16) Sep 25 2011 I've had a good time with std.benchmark today and made it ready for
- bearophile (4/5) Sep 25 2011 Is the micron symbol visible on Windows too?
- Adam D. Ruppe (6/7) Sep 25 2011 Yes, of course.
- Vladimir Panteleev (13/20) Sep 25 2011 Have you tried it? It's not a matter of installed fonts.
- Adam D. Ruppe (4/5) Sep 25 2011 I was thinking of the website documentation, not the program itself.
- Andrej Mitrovic (4/4) Sep 25 2011 I'm thinking we could shorten the function names. benchmarkResume is a
- so (6/10) Sep 25 2011 All the workarounds to this (that i know of) instead of solving the issu...
- Regan Heath (5/9) Sep 26 2011 What's wrong with "benchResume"?
- Andrej Mitrovic (2/3) Sep 26 2011 I'm ok with that.
- Andrei Alexandrescu (4/7) Sep 26 2011 I see negative progress from "benchmarkResume" to "benchResume". I think...
- Andrei Alexandrescu (3/6) Sep 25 2011 Ionno, haven't tried.
- Jonathan M Davis (6/13) Sep 25 2011 core.time.Duration has alwayse used the micron symbol in its toString. W...
- Robert Jacques (8/30) Sep 25 2011 Wikipedia also lists some of the challenges to modern benchmarking: http...
- Andrei Alexandrescu (12/70) Sep 25 2011 std.benchmark repeats the measurement in a loop, discounts the times
- Robert Jacques (6/80) Sep 25 2011 First, a core jump is guaranteed to, essentially, poison the cache. Whic...
- Andrei Alexandrescu (18/62) Sep 26 2011 Well it will be one of many runs. The questions are if (a) core jumps
- Jens Mueller (4/19) Sep 26 2011 500?
- Robert Jacques (2/21) Sep 26 2011 The minimum, automatic time window is 10 ms. Given the scaling factor, t...
- Robert Jacques (14/79) Sep 26 2011 Great, but MS still recommends benchmarking be done on a single core. An...
- Jonathan M Davis (8/11) Sep 26 2011 And how would you ensure that? Beyond having your application be single-
- Timon Gehr (2/13) Sep 26 2011 There are API functions to do that on any reasonable OS.
- Robert Jacques (2/17) Sep 27 2011 Such as SetThreadAffinityMask, which I mentioned in my original post.
- Marco Leise (3/28) Sep 28 2011 sched_setaffinity (per process / Linux)
- Andrei Alexandrescu (7/18) Sep 26 2011 I guess this may have relevance only on multithreaded apps, but it's
- Robert Jacques (4/21) Sep 27 2011 No problem. 10 is a good default and is generally what I use.
- travert phare.normalesup.org (Christophe) (11/27) Sep 28 2011 Sorry to question you, but what makes you say that the minimum is more
- Robert Jacques (5/30) Sep 28 2011 The reason that the mean + std is so pervasive in science is that it's p...
- travert phare.normalesup.org (Christophe) (4/4) Sep 28 2011 I am reassured now.
- Don (27/45) Sep 29 2011 I think that's quite misleading. Microsoft has made some assumptions in
- so (17/33) Sep 25 2011 I know it does same thing under the hood. Yet, wouldn't
- Walter Bright (4/7) Sep 25 2011 We already have std.perf:
- Jerry Quinn (3/8) Sep 26 2011 std.perf doesn't show up in the nav bar to the left. I didn't know it e...
- Jonathan M Davis (8/16) Sep 26 2011 I don't think that it ever showed up in the online docs. I believe that ...
- Jose Armando Garcia (20/21) Sep 26 2011 Very nice. Really like the functionality. Really like how you
- Andrei Alexandrescu (12/35) Sep 26 2011 Not sure whether the compiler allows it, but it shouldn't work if one
- Jonathan M Davis (5/24) Sep 26 2011 You might want to check the compilability of your code. I merged your
- Jonathan M Davis (5/29) Sep 26 2011 That is, the unittest target fails to compile (due to a template error)....
- travert phare.normalesup.org (Christophe) (10/30) Sep 28 2011 Nice piece of work.
- Andrei Alexandrescu (14/16) Sep 28 2011 [snip]
- Jens Mueller (97/122) Oct 03 2011 Overall I like this module very much.
- Andrei Alexandrescu (49/104) Oct 03 2011 On 10/3/11 8:59 AM, Jens Mueller wrote:
- Jens Mueller (16/103) Oct 03 2011 I just mean that the relative speed ups printed in the table does not
- Andrej Mitrovic (3/3) Oct 03 2011 With this new module we might as well implement __MODULE__ or
- Jacob Carlborg (16/27) Oct 03 2011 Not necessarily. The user could provide a delegate as a callback,
- Andrei Alexandrescu (6/30) Oct 03 2011 That's what I meant when I said "we'd need to expose the benchmark
- Jacob Carlborg (5/38) Oct 03 2011 Hehe, Ok. I don't think it gets THAT much more complicated, just call
I've had a good time with std.benchmark today and made it ready for submission to Phobos. As a perk I've added the %h format specifier, which formats a number in human-readable form using SI prefixes.

For those interested in discussing this prior to the formal review process, the code is at:

https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
https://github.com/andralex/phobos/blob/benchmark/std/format.d

and the dox at

http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
http://erdani.com/d/web/phobos-prerelease/std_format.html

Comments and suggestions are welcome. If possible, it would be great if this submission were given a bump in the priority queue. This is because with robust benchmarking in Phobos we can ask new, performance-sensitive submissions to add benchmarking.

Thanks,

Andrei
Sep 25 2011
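For readers who want a concrete picture of the two features announced above, here is a minimal sketch pieced together from what this thread states (the benchmark_ naming convention, benchmarkModule taking a module name, and the new %h specifier). It is illustrative only: the module name "mybench" and the benchmark_append function are made up here, and the real API may differ in detail.

    module mybench;

    import std.benchmark, std.stdio;

    // std.benchmark discovers functions whose names start with "benchmark_";
    // the uint parameter is the iteration count requested by the framework.
    void benchmark_append(uint n)
    {
        uint[] a;
        foreach (i; 0 .. n)
            a ~= i;
    }

    void main()
    {
        // Run and report all benchmark_* functions in this module.
        benchmarkModule!"mybench"();
        // %h prints a number in human-readable form with an SI prefix,
        // e.g. something like "11.96M calls/s".
        writefln("%h calls/s", 11_960_000.0);
    }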
Andrei Alexandrescu:
> http://erdani.com/d/web/phobos-prerelease/std_benchmark.html

Is the micron symbol visible on Windows too?

Bye,
bearophile
Sep 25 2011
bearophile wrote:
> Is the micron symbol visible on Windows too?

Yes, of course. It's a matter of fonts though - if you don't have the character in your font (regardless of operating system) it might not show. The micro sign is very common, though, so it's unlikely a font would be missing it.
Sep 25 2011
On Mon, 26 Sep 2011 05:03:32 +0300, Adam D. Ruppe <destructionator gmail.com> wrote:
> bearophile wrote:
>> Is the micron symbol visible on Windows too?
> Yes, of course. It's a matter of fonts though - if you don't have the character in your font (regardless of operating system) it might not show. The micro sign is very common though, so unlikely to have any font without it.

Have you tried it? It's not a matter of installed fonts.

There are two problems with Unicode characters in console D programs:

1. The default code page on Windows is practically never UTF-8. You could change the code page to UTF-8 using SetConsoleOutputCP, but weird things happen if you don't change it back before your program exits: if your program was called from a batch file, all batch files are instantly aborted.

2. The default font for the Windows console is not a Unicode font.

--
Best regards,
Vladimir                          mailto:vladimir thecybershadow.net
Sep 25 2011
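As an illustration of the workaround Vladimir describes (this is not code from the thread): switch the console to UTF-8 for the duration of the program and restore the previous code page on exit, so a calling batch file is not disturbed. The Win32 declarations are written out here rather than taken from a particular bindings module.

    version (Windows)
    {
        extern (Windows) uint GetConsoleOutputCP();
        extern (Windows) int SetConsoleOutputCP(uint codePageID);
        enum CP_UTF8 = 65001;

        void withUtf8Console(void delegate() run)
        {
            auto previous = GetConsoleOutputCP();
            SetConsoleOutputCP(CP_UTF8);
            scope (exit) SetConsoleOutputCP(previous); // restore before exit
            run();
        }
    }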
Vladimir Panteleev wrote:
> There are two problems with Unicode characters in console D programs:

I was thinking of the website documentation, not the program itself. The symbol works for the website, but not really for the program. (Even most of my Linux terminals aren't UTF-8.)
Sep 25 2011
I'm thinking we could shorten the function names. benchmarkResume is a mouthful, and we already came to the conclusion that we shouldn't repeat module names in function names. But I'm all out of ideas for naming things, sorry. :)
Sep 25 2011
On Mon, 26 Sep 2011 06:14:59 +0300, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
> I'm thinking we could shorten the function names. benchmarkResume is a mouthful, and we already came to the conclusion that we shouldn't repeat module names in function names. But I'm all out of ideas for naming things, sorry. :)

All the workarounds to this (that I know of), instead of solving the issue, make the situation worse. I guess the C way was the best: benchmark::Resume, benchmark.Resume, benchmarkResume. This is palatable to me; what I hate to see is things like array.Array.
Sep 25 2011
On Mon, 26 Sep 2011 04:14:59 +0100, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
> I'm thinking we could shorten the function names. benchmarkResume is a mouthful, and we already came to the conclusion that we shouldn't repeat module names in function names. But I'm all out of ideas for naming things, sorry. :)

What's wrong with "benchResume"?

--
Using Opera's revolutionary email client: http://www.opera.com/mail/
Sep 26 2011
On 9/26/11, Regan Heath <regan netmail.co.nz> wrote:
> What's wrong with "benchResume"?

I'm ok with that.
Sep 26 2011
On 9/26/11 8:07 AM, Andrej Mitrovic wrote:
> On 9/26/11, Regan Heath<regan netmail.co.nz> wrote:
>> What's wrong with "benchResume"?
> I'm ok with that.

I see negative progress from "benchmarkResume" to "benchResume". I think I'll simply name them "pause" and "resume".

Andrei
Sep 26 2011
On 9/25/11 8:41 PM, bearophile wrote:
> Andrei Alexandrescu:
>> http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
> Is the micron symbol visible on Windows too?

Ionno, haven't tried.

Andrei
Sep 25 2011
On Sunday, September 25, 2011 21:51:49 Andrei Alexandrescu wrote:
> On 9/25/11 8:41 PM, bearophile wrote:
>> Andrei Alexandrescu:
>>> http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
>> Is the micron symbol visible on Windows too?
> Ionno, haven't tried.

core.time.Duration has always used the micron symbol in its toString. Worst case, I would think that you'd get a box or other symbol which indicates that the symbol is unknown, and since the micron symbol is the only one used which isn't ASCII, it'll be fairly obvious what it was meant to be.

- Jonathan M Davis
Sep 25 2011
On Sun, 25 Sep 2011 21:08:45 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> I've had a good time with std.benchmark today and made it ready for submission to Phobos. [snip]
> Comments and suggestions are welcome. If possible, it would be great if this submission were given a bump in the priority queue. This is because with robust benchmarking in Phobos we can ask new, performance-sensitive submissions to add benchmarking.

Andrei, good job on the framework design, but you've left all of the design flaws of the current benchmark routine. To quote myself from last Tuesday:

> On Tue, 20 Sep 2011 14:01:05 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:
>> [snip]
>> Thank you for making this more meaningful! I assumed the standard library benchmark function would take care of those things. Should it?
>
> Yes and no. Benchmark provides a good way to make a single measurement of a function, as for really short functions you do have to loop many times to be able to get a reliable reading. However, actual benchmarking requires a) tuning the benchmark() call time to about 10-20 ms and b) running benchmark() many times, taking the minimum. The idea is that on any given run you could hit a context switch, etc. so if you make multiple runs, one will get lucky and not be interrupted. Worse, if a core switch happens between StopWatch start and end, the number of ticks returned is random. Hence, the comment to manually limit execution to a single core. So, it might be nice if benchmark() also took a second parameter, denoting the number of times to benchmark the function, and had some better documentation on how to do good benchmarking.

Wikipedia also lists some of the challenges to modern benchmarking: http://en.wikipedia.org/wiki/Time_Stamp_Counter

So, std.benchmark should

[ ] Set the affinity to a single thread at start (i.e. use SetProcessAffinityMask, etc)
[ ] Repeat the measurement N times, taking the min.
[X] Adaptively increase the number of loop iterations to ensure a valid reading.
[ ] Adaptively decrease the number of loop iterations to ensure minimal context switching.
Sep 25 2011
On 9/25/11 9:02 PM, Robert Jacques wrote:
> [snip]
> Andrei, good job on the framework design, but you've left all of the design flaws of the current benchmark routine. To quote myself from last Tuesday: [snip]
>
> Wikipedia also lists some of the challenges to modern benchmarking: http://en.wikipedia.org/wiki/Time_Stamp_Counter
>
> So, std.benchmark should
>
> [ ] Set the affinity to a single thread at start (i.e. use SetProcessAffinityMask, etc)

I think that won't influence most measurements measurably.

> [ ] Repeat the measurement N times, taking the min.

std.benchmark repeats the measurement in a loop, discounts the times spent in the iteration proper, and divides total time by the number of iterations to figure the time per iteration. This has the advantage that it works with even very fast functions without letting the counter itself affect the timing.

> [X] Adaptively increase the number of loop iterations to ensure a valid reading.

The current system tries 10, 100, 1000 iterations until it gets to a total run time of at least 0.5 seconds. That's quite a generous margin, I plan to tighten it.

> [ ] Adaptively decrease the number of loop iterations to ensure minimal context switching.

OK.

Andrei
Sep 25 2011
On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 9/25/11 9:02 PM, Robert Jacques wrote:
>> [snip]
>> [ ] Set the affinity to a single thread at start (i.e. use SetProcessAffinityMask, etc)
> I think that won't influence most measurements measurably.

First, a core jump is guaranteed to, essentially, poison the cache. Which, at least in my eyes, invalidates that particular timing run. Second, timing generally relies on the CPU's Time Stamp Counter, which is not multi-thread safe; a core switch invalidates all previous TSC values, and hence, the time measurement itself. Furthermore, the TSC is not even guaranteed to have a fixed frequency on some CPUs. Now there are ways around the problems of the TSC, but even so: (From the Wikipedia) "Under Windows platforms, Microsoft strongly discourages using the TSC for high-resolution timing for exactly these reasons, providing instead the Windows APIs QueryPerformanceCounter and QueryPerformanceFrequency.[2] Even when using these functions, Microsoft recommends the code to be locked to a single CPU."

>> [ ] Repeat the measurement N times, taking the min.
> std.benchmark repeats the measurement in a loop, discounts the times spent in the iteration proper, and divides total time by the number of iterations to figure the time per iteration. This has the advantage that it works with even very fast functions without letting the counter itself affect the timing.

Which is necessary but not sufficient for proper benchmarking. The timers themselves are noisy, to say nothing of the effects of context switches, warm-up, etc. on a run. During the ptr+length vs ptr+ptr discussion on Tuesday, naive use of benchmark led to some very wrong conclusions simply because any single run had up to ~5% of error. The only way to get rid of this error is to make multiple runs.

>> [X] Adaptively increase the number of loop iterations to ensure a valid reading.
> The current system tries 10, 100, 1000 iterations until it gets to a total run time of at least 0.5 seconds. That's quite a generous margin, I plan to tighten it.
>> [ ] Adaptively decrease the number of loop iterations to ensure minimal context switching.
> OK.
Sep 25 2011
On 9/25/11 11:56 PM, Robert Jacques wrote:
> On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
>> [snip]
>> I think that won't influence most measurements measurably.
> First, a core jump is guaranteed to, essentially, poison the cache. Which, at least in my eyes, invalidates that particular timing run.

Well it will be one of many runs. The questions are if (a) core jumps are likely to happen during a benchmark run, (b) many enough will happen to influence a 500ms time window.

> Second, timing generally relies on the CPU's Time Stamp Counter, which is not multi-thread safe; a core switch invalidates all previous TSC values, and hence, the time measurement itself. [snip] "Even when using these functions, Microsoft recommends the code to be locked to a single CPU."

std.benchmark uses QueryPerformanceCounter on Windows and clock_gettime/gettimeofday on Unix.

> Which is necessary but not sufficient for proper benchmarking. The timers themselves are noisy, to say nothing of the effects of context switches, warm-up, etc. on a run. During the ptr+length vs ptr+ptr discussion on Tuesday, naive use of benchmark led to some very wrong conclusions simply because any single run had up to ~5% of error. The only way to get rid of this error is to make multiple runs.

You may want to take a look at the code. This is a long argument ending with "...make multiple runs" as if std.benchmark is not doing them and you suggest it should. Again, std.benchmark runs the tested function in a loop in powers of 10 until the entire run lasts at least 500ms.

>> [X] Adaptively increase the number of loop iterations to ensure a valid reading.
> The current system tries 10, 100, 1000 iterations until it gets to a total run time of at least 0.5 seconds. That's quite a generous margin, I plan to tighten it.
>> [ ] Adaptively decrease the number of loop iterations to ensure minimal context switching.
> OK.

In fact upon further thinking the iterations should not be too few. Context switches and other activity is inevitable, but the effects are negligible over long periods (as half a second would be).

To conclude, I'm unsure what steps you suggest to take to improve std.benchmark, and I'd be grateful if you could clarify them after perusing its code and documentation.

Thanks,

Andrei
Sep 26 2011
Andrei Alexandrescu wrote:
> On 9/25/11 11:56 PM, Robert Jacques wrote:
>> First, a core jump is guaranteed to, essentially, poison the cache. Which, at least in my eyes, invalidates that particular timing run.
> Well it will be one of many runs. The questions are if (a) core jumps are likely to happen during a benchmark run, (b) many enough will happen to influence a 500ms time window.

500? To me it looks as if the time window is 10ms long (line 447)?

Jens
Sep 26 2011
On Mon, 26 Sep 2011 09:56:27 -0400, Jens Mueller <jens.k.mueller gmx.de> wrote:
> Andrei Alexandrescu wrote:
>> Well it will be one of many runs. The questions are if (a) core jumps are likely to happen during a benchmark run, (b) many enough will happen to influence a 500ms time window.
> 500? To me it looks as if the time window is 10ms long (line 447)?

The minimum, automatic time window is 10 ms. Given the scaling factor, the actual time window will be between 10 ms and 100 ms. IIRC a context switch occurs every 15 ms or less. And given that you'd be starting the time window at a random point inside the active time slice, at the minimum (i.e. 10 ms time window) you have a 66% chance of a switch happening.
Sep 26 2011
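(A note on the arithmetic, which the post above implies rather than spells out: if a 10 ms measurement window starts at a uniformly random point inside a 15 ms scheduling quantum, it straddles a quantum boundary whenever it starts in the last 10 ms of the slice, i.e. with probability 10/15 ≈ 0.67 — the "66% chance" quoted.)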
On Mon, 26 Sep 2011 09:39:28 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 9/25/11 11:56 PM, Robert Jacques wrote:
>> First, a core jump is guaranteed to, essentially, poison the cache. Which, at least in my eyes, invalidates that particular timing run.
> Well it will be one of many runs. The questions are if (a) core jumps are likely to happen during a benchmark run, (b) many enough will happen to influence a 500ms time window.

I don't know exactly how likely core jumps are, but context switches definitely happen. And yes, in my experience, they influence a 50 ms time window. Remember, context switches have two outcomes: your thread continues running or your thread is paused. But the timer doesn't pause when your thread pauses. So all you need is one bad switch to see an effect.

>> [snip TSC / MSDN discussion]
> std.benchmark uses QueryPerformanceCounter on Windows and clock_gettime/gettimeofday on Unix.

Great, but MS still recommends benchmarking be done on a single core. And if MS thinks that is how benchmarking should be done, I think that's how we should do it. From MSDN documentation on QueryPerformanceCounter: "On a multiprocessor computer, it should not matter which processor is called. However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL). To specify processor affinity for a thread, use the SetThreadAffinityMask function."

> You may want to take a look at the code. This is a long argument ending with "...make multiple runs" as if std.benchmark is not doing them and you suggest it should. Again, std.benchmark runs the tested function in a loop in powers of 10 until the entire run lasts at least 500ms.

*sigh* By multiple runs, I mean multiple runs of the benchmark function, _including_ the loops. From my post last Tuesday:

    double min_f1 = int.max;
    double min_f2 = int.max;
    foreach(i; 0 .. 10_000) {
        auto b = benchmark!(f1, f2)(3);
        min_f1 = min(min_f1, b[0].to!("seconds", double));
        min_f2 = min(min_f2, b[1].to!("seconds", double));
    }

All the looping that benchmark currently does is to lengthen the total time a function takes. This makes it possible to avoid sampling errors with regard to the performance counter itself. But you are still only making a single, noisy measurement of a (meta)function. In order to get a robust result, you need to make multiple measurements and take the minimum.

> [snip]
Sep 26 2011
On Monday, September 26, 2011 11:43:52 Robert Jacques wrote:
> Great, but MS still recommends benchmarking be done on a single core. And if MS thinks that is how benchmarking should be done, I think that's how we should do it.

And how would you ensure that? Beyond having your application be single-threaded, I'm not aware of any way to have any control over whether a particular process is run on a single core or not. Generally, that stuff is completely managed by the OS and the hardware and is outside the control of the programmer. I'm not saying that there definitively isn't a way to ensure that a process only runs on one core, but I'm not aware of one.

- Jonathan M Davis
Sep 26 2011
On 09/26/2011 05:59 PM, Jonathan M Davis wrote:
> On Monday, September 26, 2011 11:43:52 Robert Jacques wrote:
>> Great, but MS still recommends benchmarking be done on a single core. And if MS thinks that is how benchmarking should be done, I think that's how we should do it.
> And how would you ensure that? [snip] I'm not saying that there definitively isn't a way to ensure that a process only runs on one core, but I'm not aware of one.

There are API functions to do that on any reasonable OS.
Sep 26 2011
On Mon, 26 Sep 2011 13:20:01 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:
> On 09/26/2011 05:59 PM, Jonathan M Davis wrote:
>> [snip] I'm not saying that there definitively isn't a way to ensure that a process only runs on one core, but I'm not aware of one.
> There are API functions to do that on any reasonable OS.

Such as SetThreadAffinityMask, which I mentioned in my original post.
Sep 27 2011
On 27.09.2011 09:08, Robert Jacques <sandford jhu.edu> wrote:
> [snip]
> Such as SetThreadAffinityMask, which I mentioned in my original post.

sched_setaffinity (per process / Linux)
pthread_setaffinity_np (per thread / POSIX)
Sep 28 2011
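A cross-platform sketch of pinning the current thread to a single CPU with the calls named above; this is illustrative, not part of std.benchmark. The Windows declarations are spelled out here, and the Linux variant is simplified to a raw 64-bit mask rather than a full cpu_set_t, which is enough for machines with at most 64 CPUs.

    version (Windows)
    {
        extern (Windows) void* GetCurrentThread();
        extern (Windows) size_t SetThreadAffinityMask(void* hThread, size_t mask);

        void pinToCpu0()
        {
            SetThreadAffinityMask(GetCurrentThread(), 1); // bit 0 = CPU 0
        }
    }
    else version (linux)
    {
        extern (C) int sched_setaffinity(int pid, size_t cpusetsize, ulong* mask);

        void pinToCpu0()
        {
            ulong mask = 1;                           // bit 0 = CPU 0
            sched_setaffinity(0, mask.sizeof, &mask); // pid 0 = calling thread
        }
    }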
On 9/26/11 8:43 AM, Robert Jacques wrote:
> Great, but MS still recommends benchmarking be done on a single core. And if MS thinks that is how benchmarking should be done, I think that's how we should do it. From MSDN documentation on QueryPerformanceCounter: "On a multiprocessor computer, it should not matter which processor is called. However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL). To specify processor affinity for a thread, use the SetThreadAffinityMask function."

I guess this may have relevance only on multithreaded apps, but it's best to do things by the book. I'll look into it.

> *sigh* By multiple runs, I mean multiple runs of the benchmark function, _including_ the loops. From my post last Tuesday: [snip]

I now got it, sorry for being slow. How about taking the minimum of 10 consecutive bigloop runs while committing to the same overall timing budget?

Thanks,

Andrei
Sep 26 2011
On Mon, 26 Sep 2011 17:56:26 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 9/26/11 8:43 AM, Robert Jacques wrote:
>> Great, but MS still recommends benchmarking be done on a single core. [snip] To specify processor affinity for a thread, use the SetThreadAffinityMask function."
> I guess this may have relevance only on multithreaded apps, but it's best to do things by the book. I'll look into it.

Well, this applies to any multi-core PC; whether the app is single-threaded or multi-threaded makes no difference.

>> *sigh* By multiple runs, I mean multiple runs of the benchmark function, _including_ the loops. From my post last Tuesday: [snip]
> I now got it, sorry for being slow. How about taking the minimum of 10 consecutive bigloop runs while committing to the same overall timing budget?

No problem. 10 is a good default and is generally what I use. Although, there should be at least one function which allows manual control over the number of bigloops, i.e. for the ptr+length discussion I ultimately used a lot more than 10 iterations, as this was a very precise comparison. And people wanting to test network or cold-start disk performance may only want to use a single bigloop.
Sep 27 2011
"Robert Jacques" , dans le message (digitalmars.D:145443), a écrit :*sigh* By multiple runs, I mean multiple runs of the benchmark function, _including_ the loops. From my post last Tuesday: double min_f1 = int.max; double min_f2 = int.max; foreach(i;0..10_000) { auto b=benchmark!(f1,f2)(3); min_f1 = min(min_f1, b[0].to!("seconds",double)); min_f2 = min(min_f2, b[1].to!("seconds",double)); } All the looping that benchmark currently does is to lengthen the total time a function takes. This makes it possible to avoid sampling errors with regard to the performance counter itself. But you are still only making a single, noisy measurement of a (meta)function. In order to get a robust result, you need to make multiple measurements and take the minimum.Sorry to question you, but what makes you say that the minimum is more interesting than, let's say, the mean + the standard deviation. Is there any paper supporting this ? I am not a benchmark specialist, but an average statistician who knows that the minimum is not always very informative (a very low value can be a piece of luck). I am sure you may have good reasons to choose the minimum, but I just want to make sure we make the right choice by choosing to use the minimun of consecutive benchmarks. -- Christophe
Sep 28 2011
On Wed, 28 Sep 2011 08:19:16 -0400, Christophe <travert phare.normalesup.org> wrote:
> "Robert Jacques", in message (digitalmars.D:145443), wrote:
>> [snip] In order to get a robust result, you need to make multiple measurements and take the minimum.
> Sorry to question you, but what makes you say that the minimum is more interesting than, let's say, the mean + the standard deviation? Is there any paper supporting this? I am not a benchmark specialist, but an average statistician who knows that the minimum is not always very informative (a very low value can be a piece of luck). I am sure you may have good reasons to choose the minimum, but I just want to make sure we make the right choice by choosing to use the minimum of consecutive benchmarks.

The reason that the mean + std is so pervasive in science is that it's perfect for a Gaussian noise model. In most physical measurements, you expect some readings to be both above and below the actual value, and for most processes we assume the error of these measurements is normally distributed. Importantly, when we know the process not to produce normally distributed error, we look to other statistics for meaning.

Now, for computer timing measurements, you have access to the physical number of clock cycles taken. There is some noise in this, but it's on the order of 1-2 cycles (i.e. ns). As a benchmark is trying to measure the number of cycles something takes to complete, and it is physically impossible to complete that task in less than a fixed number of cycles, 'noise' in the measurement above 1-2 ns is by definition purely additive due to OS or hardware effects (i.e. a context switch occurred, a page fault happened, you put your computer to sleep, etc.). So the best way to get to the true number of cycles a task takes to complete is to take the minimum recorded value, on the premise that you did get lucky and those other issues which you are not trying to measure had little to no impact.

So that's the rationale. As for academic support, I've published peer-reviewed papers using this methodology, but I can't remember where I originally got the advice. Also, there are certain types of benchmarks for which the mean + std is better (networking comes to mind), etc.

P.S. A little Google-fu turned up The Science of Computer Benchmarking by Roger W. Hockney, which does recommend using the minimum, particularly to get around the problem of time-slicing (i.e. context switching) by the OS.
Sep 28 2011
I am reassured now. Thanks a lot. -- Christophe
Sep 28 2011
On 26.09.2011 17:43, Robert Jacques wrote:
>>> Second, timing generally relies on the CPU's Time Stamp Counter, which is not multi-thread safe; a core switch invalidates all previous TSC values, and hence, the time measurement itself. [snip] Even when using these functions, Microsoft recommends the code to be locked to a single CPU."
>> std.benchmark uses QueryPerformanceCounter on Windows and clock_gettime/gettimeofday on Unix.
> Great, but MS still recommends benchmarking be done on a single core. And if MS thinks that is how benchmarking should be done, I think that's how we should do it.

I think that's quite misleading. Microsoft has made some assumptions in there about what you are measuring. QueryPerformanceCounter gives you the wall clock time. But, with modern CPUs (esp. Core i7 Sandy Bridge, but also as far back as Pentium M) the CPU frequency isn't constant. So, the wall clock time depends both on the number of clock cycles required to execute the code, AND on the temperature of the CPU! If you run the same benchmark code enough times, it'll eventually get slower as the CPU heats up!

Personally I always use the TSC, and then discard any results where the TSC is inconsistent. Specifically I use the hardware performance 1 cache misses, etc. I've managed to get rock-solid data that way, but it takes a lot of work (eg, you have to make sure that your stack is aligned to 16 bytes). BUT this sort of thing is only relevant to really small sections of code (< 1 time slice).

If you're timing something which involves hardware other than the CPU (eg, database access, network access) then you really want the wall clock time. And if the section of code is long enough, then maybe you do care how much it heats up the CPU! But OTOH once you're in that regime, I don't think you care about processor affinity. In real life, you WILL get core transitions.

So I don't think it's anywhere near as simple as what Microsoft makes out. It's very important to be clear about what kind of stuff you intend to measure.
Sep 29 2011
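For reference, reading the TSC directly, as Don describes, looks roughly like this with DMD-style inline asm on x86 (a sketch, not Don's actual harness; the work of discarding runs with inconsistent readings and checking the hardware performance counters is left out).

    ulong rdtsc()
    {
        uint lo, hi;
        asm
        {
            rdtsc;          // cycle counter into EDX:EAX
            mov lo, EAX;
            mov hi, EDX;
        }
        return (cast(ulong) hi << 32) | lo;
    }

    // Usage: take the difference around the code being measured.
    //     auto start = rdtsc();
    //     codeUnderTest();
    //     auto cycles = rdtsc() - start;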
On Mon, 26 Sep 2011 04:08:45 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> I've had a good time with std.benchmark today and made it ready for submission to Phobos. [snip]
> Comments and suggestions are welcome.

I know it does the same thing under the hood. Yet, wouldn't

    this(AutoStart autostart) { if(autostart) start(); }

be better off as something explicit like:

    this(AutoStart autostart = AutoStart.no) { if(AutoStart.yes == autostart) start(); }

It overrides the default constructor, but then again it is explicit. Maybe I am completely off the hook and there is a Phobos/D consensus like "leave the def. ctor alone!"
Sep 25 2011
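For context, this is what the two call sites look like with std.datetime's StopWatch (a sketch of typical usage, not taken from the post above):

    import std.datetime;

    void example()
    {
        auto running = StopWatch(AutoStart.yes); // starts measuring immediately
        auto idle = StopWatch(AutoStart.no);     // constructed stopped...
        idle.start();                            // ...and started explicitly later
    }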
On 9/25/2011 6:08 PM, Andrei Alexandrescu wrote:
> I've had a good time with std.benchmark today and made it ready for submission to Phobos. As a perk I've added the %h format specifier, which formats a number in human-readable form using SI prefixes.

We already have std.perf:

http://www.d-programming-language.org/phobos/std_perf.html

A review should compare the two.
Sep 25 2011
Walter Bright Wrote:
> We already have std.perf:
> http://www.d-programming-language.org/phobos/std_perf.html
> A review should compare the two.

std.perf doesn't show up in the nav bar to the left. I didn't know it existed until your post :-)

Jerry
Sep 26 2011
On Monday, September 26, 2011 11:30:58 Jerry Quinn wrote:
> Walter Bright Wrote:
>> We already have std.perf:
>> http://www.d-programming-language.org/phobos/std_perf.html
>> A review should compare the two.
> std.perf doesn't show up in the nav bar to the left. I didn't know it existed until your post :-)

I don't think that it ever showed up in the online docs. I believe that the current StopWatch and benchmarking functions in std.datetime are supposed to have replaced it as well, so it's been scheduled for deprecation for a while (and probably should have been deprecated by now). Glancing at it though, it does appear to have more functionality than the StopWatch and benchmarking functions in std.datetime currently provide.

- Jonathan M Davis
Sep 26 2011
On Sun, Sep 25, 2011 at 6:08 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> Comments and suggestions are welcome.

Very nice. Really like the functionality. Really like how you implemented "void benchmarkModule(string mod)(File target = stdout)". I have a few comments.

My main concern on how this function works is that it looks at the module for methods starting with benchmark_. I like the idea in general but I am worried that it would pollute the symbol space. Does this work with private symbols? Can the user make all the members private and it will still work?

What about versioning? Should the user version all the benchmark members with version(Benchmark) or something like that? Should the user just create a benchmark module that would just contain all these benchmark functions? What do you recommend to the user? Should the doc contain these recommendations?

One last comment for now. Maybe we can make the prefix, 'benchmark_', configurable by passing it to benchmarkModule(...).

I think that benchmarkModule would greatly benefit from annotation; it would be nice if the user could write:

    benchmark("list") private void listInsert() {...}
Sep 26 2011
On 9/26/11 9:04 AM, Jose Armando Garcia wrote:
> On Sun, Sep 25, 2011 at 6:08 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
>> Comments and suggestions are welcome.
> Very nice. Really like the functionality. Really like how you implemented "void benchmarkModule(string mod)(File target = stdout)". I have a few comments. My main concern on how this function works is that it looks at the module for methods starting with benchmark_. I like the idea in general but I am worried that it would pollute the symbol space. Does this work with private symbols? Can the user make all the members private and it will still work?

Not sure whether the compiler allows it, but it shouldn't work if one module benchmarks another. That doesn't worry me - benchmark functions are meant to be exposed for anyone to play with and it doesn't seem a function starting with "benchmark_" is casually confusable.

> What about versioning? Should the user version all the benchmark members with version(Benchmark) or something like that?

As they wish. We will indeed define a version(StdBenchmark) for Phobos itself.

> Should the user just create a benchmark module that would just contain all these benchmark functions? What do you recommend to the user? Should the doc contain these recommendations?

Good idea. I will add some typical physical designs.

> One last comment for now. Maybe we can make the prefix, 'benchmark_', configurable by passing it to benchmarkModule(...).

Good idea, will do.

> I think that benchmarkModule would greatly benefit from annotation; it would be nice if the user could write:
>     benchmark("list") private void listInsert() {...}

I think the benefit would be only marginal.

Thanks,

Andrei
Sep 26 2011
On Sunday, September 25, 2011 20:08:45 Andrei Alexandrescu wrote:
> I've had a good time with std.benchmark today and made it ready for submission to Phobos. [snip]
> Comments and suggestions are welcome. If possible, it would be great if this submission were given a bump in the priority queue. This is because with robust benchmarking in Phobos we can ask new, performance-sensitive submissions to add benchmarking.

You might want to check the compilability of your code. I merged your benchmark branch into a branch that I just branched from master, and it fails to compile with the latest compiler.

- Jonathan M Davis
Sep 26 2011
On Monday, September 26, 2011 20:09:37 Jonathan M Davis wrote:
> On Sunday, September 25, 2011 20:08:45 Andrei Alexandrescu wrote:
>> [snip]
> You might want to check the compilability of your code. I merged your benchmark branch into a branch that I just branched from master, and it fails to compile with the latest compiler.

That is, the unittest target fails to compile (due to a template error). The normal build succeeds (probably because it's not compiling the benchmark function - which is the templated function which is failing).

- Jonathan M Davis
Sep 26 2011
Andrei Alexandrescu, in message (digitalmars.D:145370), wrote:
> I've had a good time with std.benchmark today and made it ready for submission to Phobos. [snip]
> Comments and suggestions are welcome.

Nice piece of work.

I don't quite like the use of a global (well, thread-local) name for the benchmark clockWatch. What about giving the clockWatch instance as a parameter to the benchmarked functions if they need to pause/resume? That would also avoid the issue of naming benchmarkPause and benchmarkResume (btw, I would have named them pauseBenchmark and resumeBenchmark).

--
Christophe
Sep 28 2011
On 9/25/11 6:08 PM, Andrei Alexandrescu wrote:
> I've had a good time with std.benchmark today and made it ready for submission to Phobos.
[snip]

You destroyed, and I listened. I updated my benchmark branch to accept a configurable time budget for measurements, and a number of trials that take the minimum time. Still need to do the thread affinity stuff.

Code:

https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
https://github.com/andralex/phobos/blob/benchmark/std/format.d

Dox:

http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
http://erdani.com/d/web/phobos-prerelease/std_format.html

Destroy once again.

Andrei
Sep 28 2011
Andrei Alexandrescu wrote:
> On 9/25/11 6:08 PM, Andrei Alexandrescu wrote:
>> I've had a good time with std.benchmark today and made it ready for submission to Phobos.
> [snip]
> You destroyed, and I listened. I updated my benchmark branch to accept a configurable time budget for measurements, and a number of trials that take the minimum time. Still need to do the thread affinity stuff. [snip]
> Destroy once again.

Overall I like this module very much.

1. I'm not that convinced that prepending benchmark_ to a function is that nice, even though it's simple. But following this idea I could remove all my unittest blocks, add functions starting with test_, and write a module std.test in the same fashion. If this approach is considered the way to go we can remove unittest altogether and solve testing entirely in a library. After all it looks less elegant to me, but I admit that I see no other way and it is the easiest solution after all. I just want to stress that when seeing the code I questioned myself whether having unittest in the language is nice to have.

2. When benchmarking code you often investigate the algorithm's properties regarding its scalability. That's why I think it's useful to give an example in the documentation. Something like

    void benchmark_insert_array(size_t length)(uint n)
    {
        benchmarkPause();
        Array!bool array;
        array.length = length;
        benchmarkResume();

        foreach(i; 0 .. n)
            array.insertAfter(array[0 .. uniform(0, length)], true);
    }

    void benchmark_insert_slist(size_t length)(uint n)
    {
        benchmarkPause();
        SList!bool list;
        foreach (i; 0 .. length)
            list.insertFront(false);
        benchmarkResume();

        foreach(i; 0 .. n)
        {
            auto r = take(list[], uniform(0, length));
            list.insertAfter(r, true);
        }
    }

    alias benchmark_insert_slist!(10) benchmark_insert_slist_10;
    alias benchmark_insert_array!(10) benchmark_insert_array_10;
    alias benchmark_insert_slist!(100) benchmark_insert_slist_100;
    alias benchmark_insert_array!(100) benchmark_insert_array_100;

This example needs more polishing. Input size is the most important dimension that CPU time depends on. The example should clarify how to benchmark depending on the input size.

3. (r[i] / 10_000_000.).to!("nsecs", int) should be written as (r[i] / 10_000_000).nsecs. Same for msecs and similar.

4. Test results should be exposed to the caller. Usually it is enough to write the results to the console, but you may want to post-process the results, be it graphically or to output XML for whatever reason. I think it would be beneficial to add this flexibility and separate the benchmarking logic from generating the output. It might be useful to add the used CPU using std.cpuid to the output.

5. In the first example I completely missed that it is not "relative calls". I thought this was one term even though I didn't know what it should mean. Later I figured that there is an empty column called relative.

6. Wrong numbers in documentation. The examples in the documentation have wrong numbers. In the end one should pay special attention that the documentation is consistent. I think one cannot automate this. There are also rounding errors. That means when I try to compute these numbers I get a difference in the first digit after the decimal point. I wonder why this is. Even though the measurement may be noisy the computed numbers should be more precise, I think.

7. Changing units in the output. One benchmark report may use different units in its output. The one in the example uses microseconds and nanoseconds. This makes comparison more difficult and should be avoided (at least for groups). It is error-prone when analyzing the numbers.

8. I prefer writing std.benchmark.pause() over benchmarkPause() and std.benchmark.resume() over benchmarkResume(), or

    import std.benchmark : benchmark;

and then benchmark.pause(). Supporting benchmark.start() would improve the example, since you just need to specify one line that indicates where the benchmarking should start. But this seems difficult to get to work with the current design and I don't consider it important. It's a very minor thing.

10. The file reading/writing benchmark should be improved, since the file should be deleted. The file writing benchmark should be more like:

    void benchmark_fileWrite()
    {
        benchmarkPause();
        auto f = File.tmpfile();
        benchmarkResume();

        f.writeln("hello, world!");
    }

But this also measures the deletion of the file. I have no idea how to fix the benchmark for reading a file, but I believe it involves using mkstemp on POSIX and similar on Windows. mkstemp should be added to std.file in the long run.

Jens
Oct 03 2011
On 10/3/11 8:59 AM, Jens Mueller wrote: [snip]

Thanks for this rich feedback. I'll act on most or all points. A few notes follow.

1. I'm not that convinced that prepending benchmark_ to a function is that nice, even though it's simple. But following this idea I could remove all my unittest blocks, add functions starting with test_, and write a module std.test in the same fashion. If this approach is considered the way to go, we can remove unittest altogether and solve testing entirely in a library. It looks less elegant to me, but I admit that I see no other way and it is the easiest solution after all. I just want to stress that when I saw the code I questioned whether having unittest in the language is nice to have.

std.benchmark uses relatively new reflection features that weren't available until recently. Adding unittest now would be difficult to justify, but at the time unittest was a good feature.

2. When benchmarking code you often investigate the algorithm's properties regarding its scalability.

Yah, that definitely needs to be looked at. I think I need to add some support beyond mere documentation.

3. (r[i] / 10_000_000.).to!("nsecs", int) should be written as (r[i] / 10_000_000).nsecs; the same goes for msecs and similar.

Noted.

4. Test results should be exposed to the caller. Usually it is enough to write the results to the console, but you may want to post-process the results, be it graphically or as XML output, for whatever reason. I think it would be beneficial to add this flexibility and separate the benchmarking logic from the output generation. It might be useful to add the CPU used (via std.cpuid) to the output.

We should decouple collection from formatting. However, this complicates matters because we'd need to expose the benchmark results structure. I'll think of it.

5. In the first example I completely missed that it is not "relative calls". I thought this was one term, even though I didn't know what it should mean. Later I figured out that there is an empty column called "relative".

There are two spaces there, I should add three.

6. Wrong numbers in the documentation. The examples in the documentation have wrong numbers. In the end one should pay special attention that the documentation is consistent; I think one cannot automate this.

Not sure how this can be addressed at all (or maybe I don't understand). The examples show real numbers collected from real runs on my laptop. I don't think one may expect to reproduce them with reasonable accuracy because they depend on a lot of imponderables.

There are also rounding errors. That means when I try to compute these numbers I get different values in the first digit after the decimal point. I wonder why this is. Even though the measurement may be noisy, the computed numbers should be more precise, I think.

Where are the calculations mistaken? Indeed I see e.g. that append built-in has 83 ns/call and 11.96M calls/s. Inverting the first with a "calculator" program yields 12.04 Mcalls/s, and inverting the second yields 83.61 ns/call. I'll see what can be done.

7. Changing units in the output. One benchmark report may use different units in its output. The one in the example uses microseconds and nanoseconds. This makes comparison more difficult and should be avoided (at least within groups). It is error-prone when analyzing the numbers.

Problem is space and wildly varying numbers. I initially had exponential notation for everything, but it was super ugly and actually more difficult to navigate. That's why I added the compact notation. I think we'll need to put up with it.

8. I prefer writing std.benchmark.pause() over benchmarkPause() and std.benchmark.resume() over benchmarkResume().

OK.

10. The file reading/writing benchmark should be improved, since the file should be deleted. The file writing benchmark should be more like:

    void benchmark_fileWrite()
    {
        benchmarkPause();
        auto f = File.tmpfile();
        benchmarkResume();

        f.writeln("hello, world!");
    }

But this also measures the deletion of the file.

If you don't want to measure std.file.write but instead only the time spent in writeln, this should work and measure only the time spent in writing to the file:

    void benchmark_fileWrite()
    {
        benchmarkPause();
        auto f = File.tmpfile();
        benchmarkResume();

        f.writeln("hello, world!");

        benchmarkPause();
    }

I haven't yet tested the case when a function leaves the benchmark paused.

I have no idea how to fix the benchmark for reading a file, but I believe it involves using mkstemp on POSIX and something similar on Windows. mkstemp should be added to std.file in the long run.

Why is there a need to fix it? The benchmark measures the performance of std.file.read soup to nuts.

In related news, I got word from a colleague that I should also use times() (http://linux.die.net/man/2/times) to distinguish time spent in user space vs. system calls. This is because some functions must issue certain system calls in any case, and the user is interested in how much they can improve the situation by optimizing their end of the deal. This would apply e.g. to file I/O, where the overhead of formatting etc. would become easier to distinguish.

Andrei
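As a minimal sketch of the times() idea (POSIX only, not part of std.benchmark; the binding is declared by hand here and reportUserVsSystem is a made-up name), one could separate user and system CPU ticks like this:

    version (Posix)
    {
        import core.sys.posix.sys.types : clock_t;
        import std.stdio;

        // Hand-written declaration of times(2); field layout per POSIX.
        struct tms { clock_t tms_utime, tms_stime, tms_cutime, tms_cstime; }
        extern (C) clock_t times(tms*);

        // Report user vs. system CPU ticks spent in `run`; dividing by
        // sysconf(_SC_CLK_TCK) would convert the ticks to seconds.
        void reportUserVsSystem(void delegate() run)
        {
            tms before, after;
            times(&before);
            run();
            times(&after);
            writefln("user ticks: %s  system ticks: %s",
                     after.tms_utime - before.tms_utime,
                     after.tms_stime - before.tms_stime);
        }
    }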
Oct 03 2011
Andrei Alexandrescu wrote:
On 10/3/11 8:59 AM, Jens Mueller wrote: [snip]
Thanks for this rich feedback. I'll act on most or all points. A few notes follow.

I see. This is also my feeling.

1. I'm not that convinced that prepending benchmark_ to a function is that nice, even though it's simple. But following this idea I could remove all my unittest blocks, add functions starting with test_, and write a module std.test in the same fashion. If this approach is considered the way to go, we can remove unittest altogether and solve testing entirely in a library. It looks less elegant to me, but I admit that I see no other way and it is the easiest solution after all. I just want to stress that when I saw the code I questioned whether having unittest in the language is nice to have.

std.benchmark uses relatively new reflection features that weren't available until recently. Adding unittest now would be difficult to justify, but at the time unittest was a good feature.

I just mean that the relative speed-ups printed in the table do not match the ones given in the surrounding text.

6. Wrong numbers in the documentation. The examples in the documentation have wrong numbers. In the end one should pay special attention that the documentation is consistent; I think one cannot automate this.

Not sure how this can be addressed at all (or maybe I don't understand). The examples show real numbers collected from real runs on my laptop. I don't think one may expect to reproduce them with reasonable accuracy because they depend on a lot of imponderables.

Yes. This is what I meant here.

There are also rounding errors. That means when I try to compute these numbers I get different values in the first digit after the decimal point. I wonder why this is. Even though the measurement may be noisy, the computed numbers should be more precise, I think.

Where are the calculations mistaken? Indeed I see e.g. that append built-in has 83 ns/call and 11.96M calls/s. Inverting the first with a "calculator" program yields 12.04 Mcalls/s, and inverting the second yields 83.61 ns/call. I'll see what can be done.

If formatting and collecting are separated, it is fine I guess. It's okay for the default output.

7. Changing units in the output. One benchmark report may use different units in its output. The one in the example uses microseconds and nanoseconds. This makes comparison more difficult and should be avoided (at least within groups). It is error-prone when analyzing the numbers.

Problem is space and wildly varying numbers. I initially had exponential notation for everything, but it was super ugly and actually more difficult to navigate. That's why I added the compact notation. I think we'll need to put up with it.

This looks good.

10. The file reading/writing benchmark should be improved, since the file should be deleted. The file writing benchmark should be more like:

    void benchmark_fileWrite()
    {
        benchmarkPause();
        auto f = File.tmpfile();
        benchmarkResume();

        f.writeln("hello, world!");
    }

But this also measures the deletion of the file.

If you don't want to measure std.file.write but instead only the time spent in writeln, this should work and measure only the time spent in writing to the file:

    void benchmark_fileWrite()
    {
        benchmarkPause();
        auto f = File.tmpfile();
        benchmarkResume();

        f.writeln("hello, world!");

        benchmarkPause();
    }

I haven't yet tested the case when a function leaves the benchmark paused.

Yes. But the benchmark should clean up after itself if possible. So it should remove the file.

I have no idea how to fix the benchmark for reading a file, but I believe it involves using mkstemp on POSIX and something similar on Windows. mkstemp should be added to std.file in the long run.

Why is there a need to fix it? The benchmark measures the performance of std.file.read soup to nuts.

In related news, I got word from a colleague that I should also use times() (http://linux.die.net/man/2/times) to distinguish time spent in user space vs. system calls. This is because some functions must issue certain system calls in any case, and the user is interested in how much they can improve the situation by optimizing their end of the deal. This would apply e.g. to file I/O, where the overhead of formatting etc. would become easier to distinguish.

Good hint from your colleague. The more data one can get when measuring code the better. You could even be interested in cache misses when benchmarking. But CPU time is usually what it boils down to. I fear you may lose simplicity.

I'm looking forward to your changes.

Jens
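Regarding the clean-up point, a hypothetical variant of the file benchmark that removes its scratch file could look as follows; the file name is a placeholder, and benchmarkPause/benchmarkResume are the calls proposed by the module under review (mkstemp would of course be nicer once available):

    import std.file;

    enum scratchFile = "std_benchmark_scratch.tmp"; // placeholder name

    void benchmark_fileRead()
    {
        benchmarkPause();
        std.file.write(scratchFile, "hello, world!");
        scope (exit) std.file.remove(scratchFile); // runs last, after the pause below
        benchmarkResume();

        auto data = std.file.read(scratchFile);

        benchmarkPause(); // keep the deletion out of the measurement
    }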
Oct 03 2011
With this new module we might as well implement __MODULE__ or something similar to avoid hardcoding the module name or using the .stringof[7..$] trick.
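For reference, a hypothetical illustration of the .stringof[7..$] trick (all names below are made up): a module symbol's .stringof reads "module <name>", so slicing off the 7-character "module " prefix recovers the name without hardcoding it; nested packages would need extra handling.

    module benchme;

    private struct _Probe {}

    template nameOf(alias sym)
    {
        enum nameOf = sym.stringof[7 .. $];
    }

    // __traits(parent, _Probe) is the enclosing module symbol.
    enum thisModule = nameOf!(__traits(parent, _Probe));
    static assert(thisModule == "benchme");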
Oct 03 2011
On 2011-10-03 17:44, Andrei Alexandrescu wrote:
On 10/3/11 8:59 AM, Jens Mueller wrote:

Not necessarily. The user could provide a delegate as a callback, something like this:

    struct Result
    {
        string func;
        double relative;
        double calls;
        double tCall;
        double callsPerSecond;
    }

    alias void delegate (Result result) Formatter;

Doesn't seem complicated to me. Then the module could provide a default formatter that does what the module currently does.

-- /Jacob Carlborg

4. Test results should be exposed to the caller. Usually it is enough to write the results to the console, but you may want to post-process the results, be it graphically or as XML output, for whatever reason. I think it would be beneficial to add this flexibility and separate the benchmarking logic from the output generation. It might be useful to add the CPU used (via std.cpuid) to the output.

We should decouple collection from formatting. However, this complicates matters because we'd need to expose the benchmark results structure. I'll think of it.
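A hypothetical usage of that callback, assuming the Result struct and Formatter alias above are in scope (the XML attribute names and the numbers are purely illustrative):

    import std.stdio;

    void main()
    {
        Formatter xml = delegate(Result r)
        {
            writefln(`<benchmark func="%s" relative="%s" tCall="%s" callsPerSecond="%s"/>`,
                     r.func, r.relative, r.tCall, r.callsPerSecond);
        };
        xml(Result("append built-in", 1.0, 10_000_000, 83e-9, 11_960_000));
    }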
Oct 03 2011
On 10/3/11 12:49 PM, Jacob Carlborg wrote:
On 2011-10-03 17:44, Andrei Alexandrescu wrote:

That's what I meant when I said "we'd need to expose the benchmark results structure" :o). There's probably no way out of this anyway, particularly if we provide richer information via times(), scalability, etc.

Andrei

On 10/3/11 8:59 AM, Jens Mueller wrote:

Not necessarily. The user could provide a delegate as a callback, something like this:

    struct Result
    {
        string func;
        double relative;
        double calls;
        double tCall;
        double callsPerSecond;
    }

    alias void delegate (Result result) Formatter;

4. Test results should be exposed to the caller. Usually it is enough to write the results to the console, but you may want to post-process the results, be it graphically or as XML output, for whatever reason. I think it would be beneficial to add this flexibility and separate the benchmarking logic from the output generation. It might be useful to add the CPU used (via std.cpuid) to the output.

We should decouple collection from formatting. However, this complicates matters because we'd need to expose the benchmark results structure. I'll think of it.
Oct 03 2011
On 2011-10-03 20:04, Andrei Alexandrescu wrote:
On 10/3/11 12:49 PM, Jacob Carlborg wrote:

Hehe, OK. I don't think it gets THAT much more complicated, just call the delegate with the results instead of writef(ln).

On 2011-10-03 17:44, Andrei Alexandrescu wrote:

That's what I meant when I said "we'd need to expose the benchmark results structure" :o).

On 10/3/11 8:59 AM, Jens Mueller wrote:

Not necessarily. The user could provide a delegate as a callback, something like this:

    struct Result
    {
        string func;
        double relative;
        double calls;
        double tCall;
        double callsPerSecond;
    }

    alias void delegate (Result result) Formatter;

4. Test results should be exposed to the caller. Usually it is enough to write the results to the console, but you may want to post-process the results, be it graphically or as XML output, for whatever reason. I think it would be beneficial to add this flexibility and separate the benchmarking logic from the output generation. It might be useful to add the CPU used (via std.cpuid) to the output.

We should decouple collection from formatting. However, this complicates matters because we'd need to expose the benchmark results structure. I'll think of it.

There's probably no way out of this anyway, particularly if we provide richer information via times(), scalability, etc.

Andrei

-- /Jacob Carlborg
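A sketch of what that might look like, again assuming the Result and Formatter types from the earlier post; printBenchmarks and the format string are assumptions, not the actual std.benchmark API:

    import std.stdio;

    void printBenchmarks(Result[] results, Formatter formatter = null)
    {
        // Fall back to console output, as the module does today, when the
        // caller supplies no formatter.
        if (formatter is null)
            formatter = delegate(Result r)
            {
                writefln("%-24s %10.2f %14.3g %14.3g",
                         r.func, r.relative, r.tCall, r.callsPerSecond);
            };
        foreach (r; results)
            formatter(r);
    }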
Oct 03 2011