digitalmars.D - Review of Andrei's std.benchmark
- Jens Mueller (33/33) Sep 17 2012 Hi,
- Andrei Alexandrescu (6/9) Sep 17 2012 One note - I moved the benchmark-related stuff from std.datetime
- Øivind (19/19) Sep 18 2012 I think the std.benchmark is definitely a useful library
- Andrei Alexandrescu (11/28) Sep 18 2012 After extensive tests with a variety of aggregate functions, I can say
- Thiez (5/8) Sep 19 2012 What if one tries to benchmark a nondeterministic function? In
- Andrei Alexandrescu (4/11) Sep 21 2012 I agree. Currently std.benchmark is not geared for measuring
- Manu (12/21) Sep 19 2012 The fastest execution time is rarely useful to me, I'm almost always muc...
- Peter Alexander (4/13) Sep 19 2012 The problem with slowest is that you end up with the occasional
- Jacob Carlborg (4/8) Sep 19 2012 That's way the average is good to have as well.
- Andrei Alexandrescu (5/11) Sep 21 2012 The occasional hiccup is often orders of magnitude slower than the rest,...
- Manu (3/14) Sep 20 2012 So then we need to start getting tricky, and choose the slowest one that...
- Andrei Alexandrescu (11/28) Sep 20 2012 The "best way" according to some of the people who've advised my
- Manu (11/45) Sep 20 2012 If you do more than a single iteration, the minimum will virtually alway...
- Andrei Alexandrescu (18/34) Sep 20 2012 To measure performance against cold cache, you could always clear the
- Manu (7/24) Sep 21 2012 Okay, I can buy this distinction in terminology.
- Andrei Alexandrescu (10/14) Sep 21 2012 That's a good angle. Profiling is currently done by the -profile switch,...
- Jacob Carlborg (7/14) Sep 21 2012 If you have an API that is fairly open and provides more of the raw
- Jonathan M Davis (3/8) Sep 21 2012 dmd has the -profile flag.
- Jacob Carlborg (4/7) Sep 20 2012 Why not min, max and average?
- Andrei Alexandrescu (3/8) Sep 20 2012 Because max and average are misleading and uninformative, as I explained...
- Andrei Alexandrescu (15/20) Sep 20 2012 For a very simple reason: unless the algorithm under benchmark is very
- Jacob Carlborg (9/22) Sep 21 2012 I may have completely misunderstood this but aren't we talking about
- Andrei Alexandrescu (9/30) Sep 21 2012 I disagree. I won't include something in my design just so people don't
- Jens Mueller (6/45) Sep 21 2012 You mean like extending StopWatch and allowing the user to provide the
- Simen Kjaeraas (7/14) Sep 22 2012 I certainly think the two use cases are similar enough to warrant their
- Manu (5/7) Sep 21 2012 This is only true for systems with a comprehensive pre-emptive OS runnin...
- Andrei Alexandrescu (4/14) Sep 21 2012 I think embedded systems that run e.g. Linux will be affected by task
- foobar (9/53) Sep 20 2012 From the responses on the thread clearly there isn't a "best way".
- Andrei Alexandrescu (16/29) Sep 20 2012 I don't quite agree. This is a domain in which intuition is having a
- Manu (11/16) Sep 21 2012 I don't think this is a 'fair' claim, the situation is that different
- Andrei Alexandrescu (5/22) Sep 21 2012 That's exactly where it all starts getting unprincipled. Just use the
- Graham Fawcett (11/38) Sep 19 2012 For comparison's sake, the Criterion benchmarking package for
- Andrei Alexandrescu (3/11) Sep 21 2012 Will look into it, thanks.
- Nick Sabalausky (33/43) Sep 20 2012 *Ahem*: http://zedshaw.com/essays/programmer_stats.html
- Andrei Alexandrescu (25/69) Sep 20 2012 What would be the orthodoxy? If orthodoxy is what google finds, it's
- Tove (7/12) Sep 21 2012 I reached the same conclusion and use the same method at work.
- Andrei Alexandrescu (3/6) Sep 21 2012 I think that's a great idea!
- Johannes Pfau (5/8) Sep 21 2012 A perfect use case for user defined attributes ;-)
- Jacob Carlborg (4/7) Sep 21 2012 Yes, we need user defined attributes and AST macros ASAP :)
- Manu (10/24) Sep 21 2012 Facebook isn't exactly 'realtime' software. Obviously, faster is always
- Andrei Alexandrescu (5/30) Sep 21 2012 You'd be surprised at how much we care about e.g. 90 percentile time to
- Nick Sabalausky (7/20) Sep 21 2012 (Emphasis added for proper context.)
- David Piepgrass (74/77) Sep 21 2012 Like others, I must also disagree in princple. The minimum sounds
- jerro (3/4) Sep 21 2012 It is possible to use a sampling profiler on D executables
- Dmitry Olshansky (24/59) Sep 21 2012 Real good profilers are the ones served by CPU vendor. See AMD's
- Andrei Alexandrescu (32/105) Sep 21 2012 I understand. What we currently do at Facebook is support benchmark
- David Piepgrass (6/14) Sep 21 2012 Huh? It's not nearly as confusing as reading the documentation
- Andrei Alexandrescu (9/35) Sep 21 2012 I misunderstood. I agree that it's a good thing to specify how
- David Piepgrass (11/38) Sep 22 2012 I am puzzled about what you think my suggestion meant. I am
- Nick Sabalausky (16/16) Sep 21 2012 Stepping back for a moment, I think we're facing two key issues here:
- Joseph Rushton Wakeling (15/26) Sep 23 2012 My feeling is that even with a good explanation in the docs, you're _sti...
- Øivind (8/8) Sep 19 2012 New question for you :)
- Andrei Alexandrescu (3/11) Sep 21 2012 I think you have discovered a major issue. Ideas on how to attack this?
- Jonathan M Davis (18/32) Sep 21 2012 Some of us have been asking for ages for the ability to mark a static
- Øivind (7/58) Sep 22 2012 Is there a way to solve the dependency issue without forbidding
- Andrei Alexandrescu (12/14) Sep 22 2012 I think an idea just occurred to me. The rules for static ctors and
- Øivind (6/21) Sep 22 2012 It gets a bit ugly maybe, but we could do a mix of the proposals
- Øivind (4/32) Sep 22 2012 We would probably not call it '@nycycliccheck' since you propose
- Simen Kjaeraas (7/14) Sep 22 2012
- Simen Kjaeraas (9/22) Sep 22 2012
- Jonathan M Davis (20/37) Sep 22 2012
- Dmitry Olshansky (12/25) Sep 21 2012 Not ideal but...
- Jacob Carlborg (8/9) Sep 22 2012 The standard way to solve this would be to move the initialization code
- Jacob Carlborg (16/19) Sep 19 2012 * Why is "scheduleForBenchmarking" a string? Can't it be a template mixi...
- Jonathan M Davis (11/14) Sep 19 2012 util is one of the worst package names ever, because it means basically
- Jacob Carlborg (5/14) Sep 19 2012 I just think we have too many top level modules.
- Andrei Alexandrescu (10/28) Sep 22 2012 A default value of n would depend on the speed of the function and the
- Peter Alexander (14/14) Sep 19 2012 I don't see why `benchmark` takes (almost) all of its parameters
- Andrei Alexandrescu (4/7) Sep 21 2012 That is intentional - indirect calls would add undue overhead to the
- Peter Alexander (5/13) Sep 21 2012 I accept that it adds undue overhead. I just think that the
- Dmitry Olshansky (21/21) Sep 21 2012 I'd throw in a request to address the following.
Hi,

it's my pleasure to announce the beginning of the formal review of Andrei's std.benchmark. The review will start today and end in two weeks, on the 1st of October. The review is followed by a week of voting, which ends on the 8th of October.

Quoting Andrei from his request for formal review:

"I reworked the benchmarking framework for backward compatibility, flexibility, and convenience. There are a few enhancement possibilities (such as tracking system/user time separately etc), but there is value in keeping things simple and convenient. Right now it really takes only one line of code and observing a simple naming convention to hook a module into the benchmarking framework."

Code: https://github.com/D-Programming-Language/phobos/pull/794
Docs: http://dlang.org/phobos-prerelease/std_benchmark.html

If std.benchmark is accepted it will likely lead to a deprecation of std.datetime's benchmark facilities.

The code is provided as a pull request and is being (as usual) integrated by the auto tester for Mac OS X, FreeBSD, Linux and Windows (see http://d.puremagic.com/test-results/pull-history.ghtml?repoid=3&pullid=794).

In your comments you can/should address the
* design
* implementation
* documentation
* usefulness
of the library.

Provide information regarding the depth (ranging from very brief to in-depth) of your review and conclude explicitly whether std.benchmark should or shouldn't be included in Phobos.

Post all feedback to this thread. Constructive feedback is very much appreciated.

To conclude in more Andrei-like words: Happy destruction!

Jens
Sep 17 2012
On 9/17/12 5:13 PM, Jens Mueller wrote:
> If std.benchmark is accepted it will likely lead to a deprecation of std.datetime's benchmark facilities.

One note - I moved the benchmark-related stuff from std.datetime unmodified into std.benchmark and left public aliases in place, so no code breakage is imminent. We may deprecate the aliases themselves later.

> To conclude in more Andrei like words: Happy destruction!

Sounds about right! :o)

Andrei
Sep 17 2012
I think the std.benchmark is definitely a useful library addition, but in my mind it is currently a bit too limited.

* All tests are run 1000 times. Depending on the length of the test to benchmark, this can be too much. In some cases it would be good to be able to trade the number of runs against accuracy.

* For all tests, the best run is selected, but would it not be reasonable in some cases to get the average value? Maybe excluding the runs that are more than a couple std. deviations away from the mean value..

* Is there a way of specifying a test name other than the function-name when using the 'mixin(scheduleForBenchmarking)' approach to register benchmarks?

* I would also like to be able (if possible) to register the two mentioned things (number of runs and result strategy) with the mixin approach (or similar).

* It seems like the baseline for subtraction from subsequent test runs is taken from a call to the test function, passing 1 to it. Shouldn't 0 be passed for this value?

If these can be addressed, I would like it added to the library!
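For readers unfamiliar with the API being reviewed, here is a minimal usage sketch pieced together from names mentioned in this thread (the benchmark_ and benchmark_relative_ prefixes, mixin(scheduleForBenchmarking), printBenchmarks, and the iterations-count parameter); the exact signatures and entry point in the pull request may differ.

    // Hypothetical sketch of the proposed std.benchmark registration convention.
    import std.benchmark;
    import std.file;

    // Picked up automatically thanks to the naming convention. The uint
    // parameter is the iterations count chosen by the framework (it passes 1
    // when measuring the baseline overhead, per the question above).
    void benchmark_fileWrite(uint n)
    {
        foreach (i; 0 .. n)
            std.file.write("/tmp/deleteme", "hello, world!");
    }

    // The benchmark_relative_ prefix reports this result relative to the
    // preceding benchmark.
    void benchmark_relative_fileRead(uint n)
    {
        foreach (i; 0 .. n)
            std.file.read("/tmp/deleteme");
    }

    // The "one line of code" that hooks the module into the framework.
    mixin(scheduleForBenchmarking);

    void main()
    {
        // Guess at the entry point; the module may run scheduled benchmarks
        // in some other way.
        printBenchmarks();
    }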
Sep 18 2012
On 9/18/12 5:07 PM, "Øivind" wrote:
> I think the std.benchmark is definitely a useful library addition, but in my mind it is currently a bit too limited.
>
> * All tests are run 1000 times. Depending on the length of the test to benchmark, this can be too much. In some cases it would be good to be able to trade the number of runs against accuracy.

It would be a good idea to make that a configurable parameter.

> * For all tests, the best run is selected, but would it not be reasonable in some cases to get the average value? Maybe excluding the runs that are more than a couple std. deviations away from the mean value..

After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.

> * Is there a way of specifying a test name other than the function-name when using the 'mixin(scheduleForBenchmarking)' approach to register benchmarks?

Not currently. Probably a manual registration of an individual benchmark would make sense.

> * I would also like to be able (if possible) to register the two mentioned things (number of runs and result strategy) with the mixin approach (or similar).

Makes sense.

> * It seems like the baseline for subtraction from subsequent test runs is taken from a call to the test function, passing 1 to it. Shouldn't 0 be passed for this value?

I'll look into that.

Thanks,

Andrei
Sep 18 2012
On Tuesday, 18 September 2012 at 22:01:30 UTC, Andrei Alexandrescu wrote:
> After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.

What if one tries to benchmark a nondeterministic function? In such a case one might well be interested in the best run, worst run, and the average.
Sep 19 2012
On 9/19/12 4:12 AM, Thiez wrote:
> On Tuesday, 18 September 2012 at 22:01:30 UTC, Andrei Alexandrescu wrote:
>> After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.
>
> What if one tries to benchmark a nondeterministic function? In such a case one might well be interested in the best run, worst run, and the average.

I agree. Currently std.benchmark is not geared for measuring non-deterministic functions.

Andrei
Sep 21 2012
On 19 September 2012 01:02, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 9/18/12 5:07 PM, "Øivind" wrote:
>> * For all tests, the best run is selected, but would it not be reasonable in some cases to get the average value? Maybe excluding the runs that are more than a couple std. deviations away from the mean value..
>
> After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.

The fastest execution time is rarely useful to me, I'm almost always much more interested in the slowest execution time. In realtime software, the slowest time is often the only important factor, everything must be designed to tolerate this possibility. I can also imagine other situations where multiple workloads are competing for time, the average time may be more useful in that case.

Side question: Running a test over and over pre-populates the cache with all associated data after the first cycle... The cache needs to be randomised between each cycle to get realistic results.
Sep 19 2012
> The fastest execution time is rarely useful to me, I'm almost always much more interested in the slowest execution time. In realtime software, the slowest time is often the only important factor, everything must be designed to tolerate this possibility. I can also imagine other situations where multiple workloads are competing for time, the average time may be more useful in that case.

The problem with slowest is that you end up with the occasional OS hiccup or GC collection which throws the entire benchmark off. I see your point, but unless you can prevent the OS from interrupting, the time would be meaningless.
Sep 19 2012
On 2012-09-19 11:38, Peter Alexander wrote:
> The problem with slowest is that you end up with the occasional OS hiccup or GC collection which throws the entire benchmark off. I see your point, but unless you can prevent the OS from interrupting, the time would be meaningless.

That's why the average is good to have as well.

--
/Jacob Carlborg
Sep 19 2012
On 9/19/12 3:54 PM, Jacob Carlborg wrote:
> On 2012-09-19 11:38, Peter Alexander wrote:
>> The problem with slowest is that you end up with the occasional OS hiccup or GC collection which throws the entire benchmark off. I see your point, but unless you can prevent the OS from interrupting, the time would be meaningless.
>
> That's why the average is good to have as well.

The occasional hiccup is often orders of magnitude slower than the rest, which means it will ruin the average. You may have meant "median", which has more merit, but then I'd say why bother - just use the minimum.

Andrei
Sep 21 2012
On 19 September 2012 12:38, Peter Alexander <peter.alexander.au gmail.com> wrote:
>> The fastest execution time is rarely useful to me, I'm almost always much more interested in the slowest execution time. In realtime software, the slowest time is often the only important factor, everything must be designed to tolerate this possibility. I can also imagine other situations where multiple workloads are competing for time, the average time may be more useful in that case.
>
> The problem with slowest is that you end up with the occasional OS hiccup or GC collection which throws the entire benchmark off. I see your point, but unless you can prevent the OS from interrupting, the time would be meaningless.

So then we need to start getting tricky, and choose the slowest one that is not beyond an order of magnitude or so outside the average?
Sep 20 2012
On 9/20/12 2:42 AM, Manu wrote:
> On 19 September 2012 12:38, Peter Alexander <peter.alexander.au gmail.com> wrote:
>>> The fastest execution time is rarely useful to me, I'm almost always much more interested in the slowest execution time. In realtime software, the slowest time is often the only important factor, everything must be designed to tolerate this possibility. I can also imagine other situations where multiple workloads are competing for time, the average time may be more useful in that case.
>>
>> The problem with slowest is that you end up with the occasional OS hiccup or GC collection which throws the entire benchmark off. I see your point, but unless you can prevent the OS from interrupting, the time would be meaningless.
>
> So then we need to start getting tricky, and choose the slowest one that is not beyond an order of magnitude or so outside the average?

The "best way" according to some of the people who've advised my implementation of the framework at Facebook is to take the mode of the measurements distribution, i.e. the time at the maximum density. I implemented that (and it's not easy). It yielded numbers close to the minimum, but less stable and needing more iterations to become stable (when they do get indeed close to the minimum).

Let's use the minimum. It is understood it's not what you'll see in production, but it is an excellent proxy for indicative and reproducible performance numbers.

Andrei
Sep 20 2012
On 20 September 2012 15:36, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 9/20/12 2:42 AM, Manu wrote:
>> On 19 September 2012 12:38, Peter Alexander <peter.alexander.au gmail.com> wrote:
>>>> The fastest execution time is rarely useful to me, I'm almost always much more interested in the slowest execution time. In realtime software, the slowest time is often the only important factor, everything must be designed to tolerate this possibility. I can also imagine other situations where multiple workloads are competing for time, the average time may be more useful in that case.
>>>
>>> The problem with slowest is that you end up with the occasional OS hiccup or GC collection which throws the entire benchmark off. I see your point, but unless you can prevent the OS from interrupting, the time would be meaningless.
>>
>> So then we need to start getting tricky, and choose the slowest one that is not beyond an order of magnitude or so outside the average?
>
> The "best way" according to some of the people who've advised my implementation of the framework at Facebook is to take the mode of the measurements distribution, i.e. the time at the maximum density. I implemented that (and it's not easy). It yielded numbers close to the minimum, but less stable and needing more iterations to become stable (when they do get indeed close to the minimum).
>
> Let's use the minimum. It is understood it's not what you'll see in production, but it is an excellent proxy for indicative and reproducible performance numbers.

If you do more than a single iteration, the minimum will virtually always be influenced by ideal cache pre-population, which is unrealistic. Memory locality is often the biggest contributing performance hazard in many algorithms, and usually the most unpredictable. I want to know about that in my measurements. Reproducibility is not important to me as accuracy. And I'd rather be conservative(/pessimistic) with the error.

What guideline would you apply to estimate 'real-world' time spent when always working with hyper-optimistic measurements?
Sep 20 2012
On 9/20/12 10:05 AM, Manu wrote:
> On 20 September 2012 15:36, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
>> Let's use the minimum. It is understood it's not what you'll see in production, but it is an excellent proxy for indicative and reproducible performance numbers.
>
> If you do more than a single iteration, the minimum will virtually always be influenced by ideal cache pre-population, which is unrealistic.

To measure performance against cold cache, you could always clear the cache using one of the available methods, see http://stackoverflow.com/questions/1756825/cpu-cache-flush. Probably std.benchmark could include a routine that does that. But performance on cold would actually be most unrealistic and uninformative, as loading the memory into cache will dominate the work that the algorithm is doing, so essentially the benchmark would evaluate the memory bandwidth against the working set of the algorithm. That may be occasionally useful, but I'd argue that most often the interest in benchmarking is to measure repeated application of a function, not occasional use of it.

> Memory locality is often the biggest contributing performance hazard in many algorithms, and usually the most unpredictable. I want to know about that in my measurements. Reproducibility is not important to me as accuracy. And I'd rather be conservative(/pessimistic) with the error.
>
> What guideline would you apply to estimate 'real-world' time spent when always working with hyper-optimistic measurements?

The purpose of std.benchmark is not to estimate real-world time. (That is the purpose of profiling.) Instead, benchmarking measures and provides a good proxy of that time for purposes of optimizing the algorithm. If work is done on improving the minimum time given by the benchmark framework, it is reasonable to expect that performance in-situ will also improve.

Andrei
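For illustration, one common way to approximate a cold-cache measurement (in the spirit of the Stack Overflow question linked above) is simply to walk a buffer larger than the last-level cache between iterations. This is a sketch only; the 64 MB size is an assumption and would need to match the target CPU, and nothing here is part of std.benchmark.

    // Sketch: evict (most of) the CPU data cache between iterations by
    // touching a buffer larger than the last-level cache.
    enum size_t evictionBufferSize = 64 * 1024 * 1024; // assumed LLC upper bound
    ubyte[] evictionBuffer;

    static this()
    {
        evictionBuffer = new ubyte[evictionBufferSize];
    }

    void evictCaches()
    {
        foreach (ref b; evictionBuffer)
            ++b; // writes force the cache lines to be brought in and dirtied
    }

    void main()
    {
        import std.datetime : StopWatch, AutoStart;

        void work() { /* code under test */ }

        // Hypothetical cold-cache timing of `work`.
        evictCaches();
        auto sw = StopWatch(AutoStart.yes);
        work();
        sw.stop();
    }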
Sep 20 2012
On 21 September 2012 07:17, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 9/20/12 10:05 AM, Manu wrote:
>> Memory locality is often the biggest contributing performance hazard in many algorithms, and usually the most unpredictable. I want to know about that in my measurements. Reproducibility is not important to me as accuracy. And I'd rather be conservative(/pessimistic) with the error.
>>
>> What guideline would you apply to estimate 'real-world' time spent when always working with hyper-optimistic measurements?
>
> The purpose of std.benchmark is not to estimate real-world time. (That is the purpose of profiling.) Instead, benchmarking measures and provides a good proxy of that time for purposes of optimizing the algorithm. If work is done on improving the minimum time given by the benchmark framework, it is reasonable to expect that performance in-situ will also improve.

Okay, I can buy this distinction in terminology.

What I'm typically more interested in is profiling. I do occasionally need to do some benchmarking by your definition, so I'll find this useful, but should there then be another module to provide a 'profiling' API? Also worked into this API?
Sep 21 2012
On 9/21/12 10:58 AM, Manu wrote:
> What I'm typically more interested in is profiling. I do occasionally need to do some benchmarking by your definition, so I'll find this useful, but should there then be another module to provide a 'profiling' API? Also worked into this API?

That's a good angle. Profiling is currently done by the -profile switch, and there are a couple of library functions associated with it. To my surprise, that documentation page has not been ported to the dlang.org style:

http://digitalmars.com/ctg/trace.html

I haven't yet thought whether std.benchmark should add more profiling-related primitives. I'd opine for releasing it without such for the time being.

Thanks,

Andrei
Sep 21 2012
On 2012-09-21 18:21, Andrei Alexandrescu wrote:
> That's a good angle. Profiling is currently done by the -profile switch, and there are a couple of library functions associated with it. To my surprise, that documentation page has not been ported to the dlang.org style:
>
> http://digitalmars.com/ctg/trace.html
>
> I haven't yet thought whether std.benchmark should add more profiling-related primitives. I'd opine for releasing it without such for the time being.

If you have an API that is fairly open and provides more of the raw results, then one can build a more profiling-like solution on top of that. This can later be used to create a specific profiling module if we choose to do so.

--
/Jacob Carlborg
Sep 21 2012
On Friday, September 21, 2012 17:58:05 Manu wrote:
> Okay, I can buy this distinction in terminology.
>
> What I'm typically more interested in is profiling. I do occasionally need to do some benchmarking by your definition, so I'll find this useful, but should there then be another module to provide a 'profiling' API? Also worked into this API?

dmd has the -profile flag.

- Jonathan M Davis
Sep 21 2012
On 2012-09-20 14:36, Andrei Alexandrescu wrote:
> Let's use the minimum. It is understood it's not what you'll see in production, but it is an excellent proxy for indicative and reproducible performance numbers.

Why not min, max and average?

--
/Jacob Carlborg
Sep 20 2012
On 9/20/12 1:37 PM, Jacob Carlborg wrote:
> On 2012-09-20 14:36, Andrei Alexandrescu wrote:
>> Let's use the minimum. It is understood it's not what you'll see in production, but it is an excellent proxy for indicative and reproducible performance numbers.
>
> Why not min, max and average?

Because max and average are misleading and uninformative, as I explained.

Andrei
Sep 20 2012
On 9/20/12 2:37 PM, Jacob Carlborg wrote:
> On 2012-09-20 14:36, Andrei Alexandrescu wrote:
>> Let's use the minimum. It is understood it's not what you'll see in production, but it is an excellent proxy for indicative and reproducible performance numbers.
>
> Why not min, max and average?

For a very simple reason: unless the algorithm under benchmark is very long-running, max is completely useless, and it ruins average as well.

For virtually all benchmarks I've run, the distribution of timings is a half-Gaussian very concentrated around the minimum. Say you have a minimum of e.g. 73 us. Then there would be a lot of results close to that; the mode of the distribution would be very close, e.g. 75 us, and the more measurements you take, the closer the mode is to the minimum. Then you have a few timings up to e.g. 90 us. And finally you will inevitably have a few outliers at some milliseconds. Those are orders of magnitude larger than anything of interest and are caused by system interrupts that happened to fall in the middle of the measurement. Taking those into consideration and computing the average with those outliers simply brings useless noise into the measurement process.

Andrei
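A tiny D illustration of this argument; the timing numbers below are invented to mimic the half-Gaussian-plus-outlier shape described above, and show how one interrupt-sized outlier wrecks the average while leaving the minimum (and median) intact.

    import std.algorithm : min, reduce, sort;
    import std.stdio : writefln;

    void main()
    {
        // Made-up sample of per-run timings in microseconds; one outlier.
        double[] timingsUs = [75, 74, 77, 73, 76, 90, 74, 75, 3200, 73];

        auto minimum = reduce!min(timingsUs);
        auto average = reduce!"a + b"(0.0, timingsUs) / timingsUs.length;

        auto sorted = timingsUs.dup;
        sort(sorted);
        auto median = sorted[$ / 2];

        writefln("min = %s us, median = %s us, average = %s us",
                 minimum, median, average);
        // Prints: min = 73, median = 75, average = 388.7 --
        // the average says little about the algorithm itself.
    }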
Sep 20 2012
On 2012-09-21 06:23, Andrei Alexandrescu wrote:
> For a very simple reason: unless the algorithm under benchmark is very long-running, max is completely useless, and it ruins average as well.

I may have completely misunderstood this but aren't we talking about what to include in the output of the benchmark? In that case, if you don't like max and average just don't look at it.

> For virtually all benchmarks I've run, the distribution of timings is a half-Gaussian very concentrated around the minimum. Say you have a minimum of e.g. 73 us. Then there would be a lot of results close to that; the mode of the distribution would be very close, e.g. 75 us, and the more measurements you take, the closer the mode is to the minimum. Then you have a few timings up to e.g. 90 us. And finally you will inevitably have a few outliers at some milliseconds. Those are orders of magnitude larger than anything of interest and are caused by system interrupts that happened to fall in the middle of the measurement. Taking those into consideration and computing the average with those outliers simply brings useless noise into the measurement process.

After your reply to one of Manu's posts, I think I misunderstood the std.benchmark module. I was thinking more of profiling. But are these quite similar tasks, couldn't std.benchmark work for both?

--
/Jacob Carlborg
Sep 21 2012
On 9/21/12 5:39 AM, Jacob Carlborg wrote:On 2012-09-21 06:23, Andrei Alexandrescu wrote:I disagree. I won't include something in my design just so people don't look at it most of the time. Min and average are most of the time an awful thing to include, and will throw off people with bizarre results. If it's there, it's worth looking at. Note how all columns are directly comparable (I might add, unlike other approaches to benchmarking).For a very simple reason: unless the algorithm under benchmark is very long-running, max is completely useless, and it ruins average as well.I may have completely misunderstood this but aren't we talking about what do include in the output of the benchmark? In that case, if you don't like max and average just don't look at it.This is an interesting idea. It would delay release quite a bit because I'd need to design and implement things like performance counters and such. AndreiFor virtually all benchmarks I've run, the distribution of timings is a half-Gaussian very concentrated around the minimum. Say you have a minimum of e.g. 73 us. Then there would be a lot of results close to that; the mode of the distribution would be very close, e.g. 75 us, and the more measurements you take, the closer the mode is to the minimum. Then you have a few timings up to e.g. 90 us. And finally you will inevitably have a few outliers at some milliseconds. Those are orders of magnitude larger than anything of interest and are caused by system interrupts that happened to fall in the middle of the measurement. Taking those into consideration and computing the average with those outliers simply brings useless noise into the measurement process.After your replay to one of Manu's post, I think I misunderstood the std.benchmark module. I was thinking more of profiling. But are these quite similar tasks, couldn't std.benchmark work for both?
Sep 21 2012
Andrei Alexandrescu wrote:On 9/21/12 5:39 AM, Jacob Carlborg wrote:You mean like extending StopWatch and allowing the user to provide the measuring code, i.e. counting the number of instructions. This would be very useful. Is it possible to make sure that these changes can be introduced later without breaking the API? JensOn 2012-09-21 06:23, Andrei Alexandrescu wrote:I disagree. I won't include something in my design just so people don't look at it most of the time. Min and average are most of the time an awful thing to include, and will throw off people with bizarre results. If it's there, it's worth looking at. Note how all columns are directly comparable (I might add, unlike other approaches to benchmarking).For a very simple reason: unless the algorithm under benchmark is very long-running, max is completely useless, and it ruins average as well.I may have completely misunderstood this but aren't we talking about what do include in the output of the benchmark? In that case, if you don't like max and average just don't look at it.This is an interesting idea. It would delay release quite a bit because I'd need to design and implement things like performance counters and such.For virtually all benchmarks I've run, the distribution of timings is a half-Gaussian very concentrated around the minimum. Say you have a minimum of e.g. 73 us. Then there would be a lot of results close to that; the mode of the distribution would be very close, e.g. 75 us, and the more measurements you take, the closer the mode is to the minimum. Then you have a few timings up to e.g. 90 us. And finally you will inevitably have a few outliers at some milliseconds. Those are orders of magnitude larger than anything of interest and are caused by system interrupts that happened to fall in the middle of the measurement. Taking those into consideration and computing the average with those outliers simply brings useless noise into the measurement process.After your replay to one of Manu's post, I think I misunderstood the std.benchmark module. I was thinking more of profiling. But are these quite similar tasks, couldn't std.benchmark work for both?
Sep 21 2012
On 2012-21-09 22:58:36, Andrei Alexandrescu wrote:
> On 9/21/12 5:39 AM, Jacob Carlborg wrote:
>> After your reply to one of Manu's posts, I think I misunderstood the std.benchmark module. I was thinking more of profiling. But are these quite similar tasks, couldn't std.benchmark work for both?
>
> This is an interesting idea. It would delay release quite a bit because I'd need to design and implement things like performance counters and such.

I certainly think the two use cases are similar enough to warrant their inclusion in a common module. That does not preclude std.benchmark being included as is now, and extended with profiling features at some later point.

--
Simen
Sep 22 2012
On 21 September 2012 07:23, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> For a very simple reason: unless the algorithm under benchmark is very long-running, max is completely useless, and it ruins average as well.

This is only true for systems with a comprehensive pre-emptive OS running on the same core. Most embedded systems will only be affected by cache misses and bus contention; in that situation, max is perfectly acceptable.
Sep 21 2012
On 9/21/12 11:14 AM, Manu wrote:
> On 21 September 2012 07:23, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
>> For a very simple reason: unless the algorithm under benchmark is very long-running, max is completely useless, and it ruins average as well.
>
> This is only true for systems with a comprehensive pre-emptive OS running on the same core. Most embedded systems will only be affected by cache misses and bus contention; in that situation, max is perfectly acceptable.

I think embedded systems that run e.g. Linux will be affected by task switching.

Andrei
Sep 21 2012
On Thursday, 20 September 2012 at 12:35:15 UTC, Andrei Alexandrescu wrote:On 9/20/12 2:42 AM, Manu wrote:From the responses on the thread clearly there isn't a "best way". There are different use-cases with different tradeoffs so why not allow the user to choose the policy best suited for their use-case? I'd suggest to provide a few reasonable common choices to choose from, as well as a way to provide a user defined calculation (function pointer/delegate?)On 19 September 2012 12:38, Peter Alexander <peter.alexander.au gmail.com <mailto:peter.alexander.au gmail.com>> wrote: The fastest execution time is rarely useful to me, I'm almost always much more interested in the slowest execution time. In realtime software, the slowest time is often the only important factor, everything must be designed to tolerate this possibility. I can also imagine other situations where multiple workloads are competing for time, the average time may be more useful in that case. The problem with slowest is that you end up with the occasional OS hiccup or GC collection which throws the entire benchmark off. I see your point, but unless you can prevent the OS from interrupting, the time would be meaningless. So then we need to start getting tricky, and choose the slowest one that is not beyond an order of magnitude or so outside the average?The "best way" according to some of the people who've advised my implementation of the framework at Facebook is to take the mode of the measurements distribution, i.e. the time at the maximum density. I implemented that (and it's not easy). It yielded numbers close to the minimum, but less stable and needing more iterations to become stable (when they do get indeed close to the minimum). Let's use the minimum. It is understood it's not what you'll see in production, but it is an excellent proxy for indicative and reproducible performance numbers. Andrei
Sep 20 2012
On 9/20/12 3:01 PM, foobar wrote:On Thursday, 20 September 2012 at 12:35:15 UTC, Andrei Alexandrescu wrote:I don't quite agree. This is a domain in which intuition is having a hard time, and at least some of the responses come from an intuitive standpoint, as opposed from hard data. For example, there's this opinion that taking the min, max, and average is the "fair" thing to do and the most informative. However, all noise in measuring timing is additive. Unless you talk about performance of entire large systems with networking, I/O, and the such, algorithms running in memory are inevitably spending time doing work, to which various sources of noise (system interrupts, clock quantization, benchmarking framework) just _add_ some time. Clearly these components do affect the visible duration of the algorithm, but if you want to improve it you need to remove the noise.Let's use the minimum. It is understood it's not what you'll see in production, but it is an excellent proxy for indicative and reproducible performance numbers. AndreiFrom the responses on the thread clearly there isn't a "best way".There are different use-cases with different tradeoffs so why not allow the user to choose the policy best suited for their use-case? I'd suggest to provide a few reasonable common choices to choose from, as well as a way to provide a user defined calculation (function pointer/delegate?)Reasonable choices are great, but in this case it's a bit difficult to figure what's reasonable. Andrei
Sep 20 2012
On 21 September 2012 07:30, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> I don't quite agree. This is a domain in which intuition is having a hard time, and at least some of the responses come from an intuitive standpoint, as opposed from hard data. For example, there's this opinion that taking the min, max, and average is the "fair" thing to do and the most informative.

I don't think this is a 'fair' claim, the situation is that different people are looking for different statistical information, and you can distinguish it with whatever terminology you prefer. You are only addressing a single use case; 'benchmarking', by your definition. I'm more frequently interested in profiling than 'benchmark'ing, and I think both are useful to have.

The thing is, the distinction between 'benchmarking' and 'profiling' is effectively implemented via nothing more than the sampling algorithm; min vs avg, so is it sensible to expose the distinction in the API in this way?
Sep 21 2012
On 9/20/12 3:42 AM, Manu wrote:
> On 19 September 2012 12:38, Peter Alexander <peter.alexander.au gmail.com> wrote:
>>> The fastest execution time is rarely useful to me, I'm almost always much more interested in the slowest execution time. In realtime software, the slowest time is often the only important factor, everything must be designed to tolerate this possibility. I can also imagine other situations where multiple workloads are competing for time, the average time may be more useful in that case.
>>
>> The problem with slowest is that you end up with the occasional OS hiccup or GC collection which throws the entire benchmark off. I see your point, but unless you can prevent the OS from interrupting, the time would be meaningless.
>
> So then we need to start getting tricky, and choose the slowest one that is not beyond an order of magnitude or so outside the average?

That's exactly where it all starts getting unprincipled. Just use the minimum. Just. Use. The. Minimum.

Andrei
Sep 21 2012
On Wednesday, 19 September 2012 at 08:28:36 UTC, Manu wrote:On 19 September 2012 01:02, Andrei Alexandrescu < SeeWebsiteForEmail erdani.org> wrote:For comparison's sake, the Criterion benchmarking package for Haskell is worth a look: http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-library-for-haskell/ Criterion accounts for clock-call costs, displays various central tendencies, reports outliers (and their significance --- whether the variance is significantly affected by the outliers), etc., etc. It's a very well conceived benchmarking system, and might well be worth stealing from. Best, GrahamOn 9/18/12 5:07 PM, "Øivind" wrote:The fastest execution time is rarely useful to me, I'm almost always much more interested in the slowest execution time. In realtime software, the slowest time is often the only important factor, everything must be designed to tolerate this possibility. I can also imagine other situations where multiple workloads are competing for time, the average time may be more useful in that case.* For all tests, the best run is selected, but would it not bereasonable in some cases to get the average value? Maybe excluding theruns that are more than a couple std. deviations away from the mean value..After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.
Sep 19 2012
On 9/19/12 3:59 PM, Graham Fawcett wrote:
> For comparison's sake, the Criterion benchmarking package for Haskell is worth a look:
>
> http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-library-for-haskell/
>
> Criterion accounts for clock-call costs, displays various central tendencies, reports outliers (and their significance --- whether the variance is significantly affected by the outliers), etc., etc. It's a very well conceived benchmarking system, and might well be worth stealing from.

Will look into it, thanks.

Andrei
Sep 21 2012
On Tue, 18 Sep 2012 18:02:10 -0400
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

> On 9/18/12 5:07 PM, "Øivind" wrote:
>> * For all tests, the best run is selected, but would it not be reasonable in some cases to get the average value? Maybe excluding the runs that are more than a couple std. deviations away from the mean value..
>
> After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.

*Ahem*: http://zedshaw.com/essays/programmer_stats.html

Your claim that the minimum time is sufficient is...ummm...extremely unorthodox, to say the least. As such, you're going to need a far more convincing argument than "It worked well for me." I assume I don't need to preach that "Extraordinary claims require extraordinary evidence". But, condensing benchmarks and statistics down to "take the minimum" and saying that's sufficient is one heck of an extraordinary claim. If you feel that you can provide sufficiently extraordinary justification, then please do.

Otherwise, I think we'll need richer results. At the very least there should be an easy way to get at the raw results programmatically so we can run whatever stats/plots/visualizations/output-formats we want. I didn't see anything like that browsing through the docs, but it's possible I may have missed it.

That brings up another question too: I like the idea of a one-stop-benchmarking-shop, much like we have for unittests, but maybe reporting shouldn't be so tightly integrated and left more open for integration with a proper statistics lib and more generalized output abilities? But of course, that doesn't preclude having a nice built-in, but optional, default report. (Again though, maybe I'm overlooking something already in the module?)

One other nitpick: My initial impression is that the "benchmark_relative_file read" stuff seems a bit kludgey (and confusing to visually parse). Is there maybe a better way to handle that? For example, inspired by getopt:

    printBenchmarks!(
        "file write", { std.file.write("/tmp/deleteme", "hello, world!"); },
        BenchmarkOption.relative,
        "file read", { std.file.read("/tmp/deleteme"); },
        "array creation", { new char[32]; })
        ();
Sep 20 2012
On 9/20/12 11:03 PM, Nick Sabalausky wrote:On Tue, 18 Sep 2012 18:02:10 -0400 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org> wrote:I'm not sure I figure how this applies to the discussion at hand.On 9/18/12 5:07 PM, "Øivind" wrote:*Ahem*: http://zedshaw.com/essays/programmer_stats.html* For all tests, the best run is selected, but would it not be reasonable in some cases to get the average value? Maybe excluding the runs that are more than a couple std. deviations away from the mean value..After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.Your claim that the minimum time is sufficient is...ummm...extremely unorthodox, to say the least.What would be the orthodoxy? If orthodoxy is what google finds, it's good we're not orthodox.As such, you're going to need a far more convincing argument than "It worked well for me."Sure. I have just detailed the choices made by std.benchmark in a couple of posts. At Facebook we measure using the minimum, and it's working for us. We've tried other approaches (such as taking the mode of the distribution). Turns out the minimum is better every time. Take a look at the early return in estimateTime(): https://github.com/facebook/folly/blob/master/folly/Benchmark.cpp#L136I assume I don't need to preach that "Extraordinary claims require extraordinary evidence". But, condensing benchmarks and statistics down to "take the minimum" and saying that's sufficient is one heck of an extraordinary claim. If you feel that you can provide sufficiently extraordinary justification, then please do.My claim is unremarkable. All I'm saying is the minimum running time of an algorithm on a given input is a stable and indicative proxy for the behavior of the algorithm in general. So it's a good target for optimization. There might be some confusion that std.benchmark does profiling. That's not part of its charter.Otherwise, I think we'll need richer results. At the very least there should be an easy way to get at the raw results programmatically so we can run whatever stats/plots/visualizations/output-formats we want. I didn't see anything like that browsing through the docs, but it's possible I may have missed it.Currently std.benchmark does not expose raw results for the sake of simplicity. It's easy to expose such, but I'd need a bit more convincing about their utility.That brings up another question too: I like the idea of a one-stop-benchmarking-shop, much like we have for unittests, but maybe reporting shouldn't be so tightly integrated and left more open for integration with a proper statistics lib and more generalized output abilities? But of course, that doesn't preclude having a nice built-in, but optional, default report. (Again though, maybe I'm overlooking something already in the module?)That's pretty much what's happening. There's an API for collecting timings, and then there's an API for printing those with a default format.One other nitpick: My initial impression is that the "benchmark_relative_file read" stuff seems a bit kludgey (and confusing to visually parse). Is there maybe a better way to handle that? 
For example, inspired by getopt:

    printBenchmarks!(
        "file write", { std.file.write("/tmp/deleteme", "hello, world!"); },
        BenchmarkOption.relative,
        "file read", { std.file.read("/tmp/deleteme"); },
        "array creation", { new char[32]; })
        ();

The issue here is automating the benchmark of a module, which would require some naming convention anyway.

Andrei
Sep 20 2012
On Friday, 21 September 2012 at 04:44:58 UTC, Andrei Alexandrescu wrote:
> My claim is unremarkable. All I'm saying is the minimum running time of an algorithm on a given input is a stable and indicative proxy for the behavior of the algorithm in general. So it's a good target for optimization.

I reached the same conclusion and use the same method at work.

Considering min will converge towards a stable value quite quickly... would it not be a reasonable default to auto detect when the min is stable with some degree of statistical certainty...?
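One possible shape for what Tove suggests, assuming the framework can call the measured code repeatedly: keep sampling and stop once the observed minimum has not improved for some number of consecutive trials. The stopping rule, the constants, and the function name below are illustrative assumptions, not anything from std.benchmark.

    import std.datetime : StopWatch, AutoStart;

    // Illustrative only: sample run times until the minimum stops improving
    // for `patience` consecutive trials (or a hard cap is reached).
    ulong stableMinimumHnsecs(void delegate() run,
                              size_t patience = 50,
                              size_t maxTrials = 10_000)
    {
        ulong best = ulong.max;
        size_t sinceImprovement = 0;

        foreach (trial; 0 .. maxTrials)
        {
            auto sw = StopWatch(AutoStart.yes);
            run();
            sw.stop();
            immutable t = cast(ulong) sw.peek().hnsecs;

            if (t < best)
            {
                best = t;
                sinceImprovement = 0;
            }
            else if (++sinceImprovement >= patience)
                break; // minimum looks stable enough
        }
        return best;
    }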
Sep 21 2012
On 9/21/12 5:46 AM, Tove wrote:
> Considering min will converge towards a stable value quite quickly... would it not be a reasonable default to auto detect when the min is stable with some degree of statistical certainty...?

I think that's a great idea!

Andrei
Sep 21 2012
On Fri, 21 Sep 2012 00:45:44 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> The issue here is automating the benchmark of a module, which would require some naming convention anyway.

A perfect use case for user defined attributes ;-)

@benchmark void foo(){}
@benchmark("File read test") void foo(){}
Sep 21 2012
On 2012-09-21 19:45, Johannes Pfau wrote:
> A perfect use case for user defined attributes ;-)
>
> @benchmark void foo(){}
> @benchmark("File read test") void foo(){}

Yes, we need user defined attributes and AST macros ASAP :)

--
/Jacob Carlborg
Sep 21 2012
On 21 September 2012 07:45, Andrei Alexandrescu < SeeWebsiteForEmail erdani.org> wrote:As such, you're going to need a far moreFacebook isn't exactly 'realtime' software. Obviously, faster is always better, but it's not in a situation where if you slip a sync point by 1ms in an off case, it's all over. You can lose 1ms here, and make it up at a later time, and the result is the same. But again, this feeds back to your distinction between benchmarking and profiling. Otherwise, I think we'll need richer results. At the very least thereconvincing argument than "It worked well for me."Sure. I have just detailed the choices made by std.benchmark in a couple of posts. At Facebook we measure using the minimum, and it's working for us.Custom visualisation, realtime charting/plotting, user supplied reduce function?should be an easy way to get at the raw results programmatically so we can run whatever stats/plots/visualizations/**output-formats we want. I didn't see anything like that browsing through the docs, but it's possible I may have missed it.Currently std.benchmark does not expose raw results for the sake of simplicity. It's easy to expose such, but I'd need a bit more convincing about their utility.
Sep 21 2012
On 9/21/12 11:12 AM, Manu wrote:On 21 September 2012 07:45, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>> wrote: As such, you're going to need a far more convincing argument than "It worked well for me." Sure. I have just detailed the choices made by std.benchmark in a couple of posts. At Facebook we measure using the minimum, and it's working for us. Facebook isn't exactly 'realtime' software. Obviously, faster is always better, but it's not in a situation where if you slip a sync point by 1ms in an off case, it's all over. You can lose 1ms here, and make it up at a later time, and the result is the same. But again, this feeds back to your distinction between benchmarking and profiling.You'd be surprised at how much we care about e.g. 90 percentile time to interaction.Otherwise, I think we'll need richer results. At the very least there should be an easy way to get at the raw results programmatically so we can run whatever stats/plots/visualizations/__output-formats we want. I didn't see anything like that browsing through the docs, but it's possible I may have missed it. Currently std.benchmark does not expose raw results for the sake of simplicity. It's easy to expose such, but I'd need a bit more convincing about their utility. Custom visualisation, realtime charting/plotting, user supplied reduce function?Hrm, that sounds like an entire new project. Andrei
Sep 21 2012
On Fri, 21 Sep 2012 17:00:29 -0400 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 9/21/12 11:12 AM, Manu wrote:(Emphasis added for proper context.)On 21 September 2012 07:45, Andrei Alexandrescu Currently std.benchmark does not expose raw results for the sake of simplicity. It's easy to expose such, but I'd need a bit ***more convincing about their utility***.That doesn't diminish their utility. Keep in mind, nobody's suggesting putting all of that into std.benchmark (certainly not initially anyway), but the idea is to at least have the door open for them.Custom visualisation, realtime charting/plotting, user supplied reduce function?Hrm, that sounds like an entire new project.
Sep 21 2012
After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.Like others, I must also disagree in princple. The minimum sounds like a useful metric for functions that (1) do the same amount of work in every test and (2) are microbenchmarks, i.e. they measure a small and simple task. If the benchmark being measured either (1) varies the amount of work each time (e.g. according to some approximation of real-world input, which obviously may vary)* or (2) measures a large system, then the average and standard deviation and even a histogram may be useful (or perhaps some indicator whether the runtimes are consistent with a normal distribution or not). If the running-time is long then the max might be useful (because things like task-switching overhead probably do not contribute that much to the total). * I anticipate that you might respond "so, only test a single input per benchmark", but if I've got 1000 inputs that I want to try, I really don't want to write 1000 functions nor do I want 1000 lines of output from the benchmark. An average, standard deviation, min and max may be all I need, and if I need more detail, then I might break it up into 10 groups of 100 inputs. In any case, the minimum runtime is not the desired output when the input varies. It's a little surprising to hear "The purpose of std.benchmark is not to estimate real-world time. (That is the purpose of profiling)"... Firstly, of COURSE I would want to estimate real-world time with some of my benchmarks. For some benchmarks I just want to know which of two or three approaches is faster, or to get a coarse ball-park sense of performance, but for others I really want to know the wall-clock time used for realistic inputs. Secondly, what D profiler actually helps you answer the question "where does the time go in the real-world?"? The D -profile switch creates an instrumented executable, which in my experience (admittedly not experience with DMD) severely distorts running times. I usually prefer sampling-based profiling, where the executable is left unchanged and a sampling program interrupts the program at random and grabs the call stack, to avoid the distortion effect of instrumentation. Of course, instrumentation is useful to find out what functions are called the most and whether call frequencies are in line with expectations, but I wouldn't trust the time measurements that much. As far as I know, D doesn't offer a sampling profiler, so one might indeed use a benchmarking library as a (poor) substitute. So I'd want to be able to set up some benchmarks that operate on realistic data, with perhaps different data in different runs in order to learn about how the speed varies with different inputs (if it varies a lot then I might create more benchmarks to investigate which inputs are processed quickly, and which slowly.) Some random comments about std.benchmark based on its documentation: - It is very strange that the documentation of printBenchmarks uses neither of the words "average" or "minimum", and doesn't say how many trials are done.... I suppose the obvious interpretation is that it only does one trial, but then we wouldn't be having this discussion about averages and minimums right? Øivind says tests are run 1000 times... 
but it needs to be configurable per-test (my idea: support a _x1000 suffix in function names, or _for1000ms to run the test for at least 1000 milliseconds; and allow a multiplier when when running a group of benchmarks, e.g. a multiplier argument of 0.5 means to only run half as many trials as usual.) Also, it is not clear from the documentation what the single parameter to each benchmark is (define "iterations count".) - The "benchmark_relative_" feature looks quite useful. I'm also happy to see benchmarkSuspend() and benchmarkResume(), though benchmarkSuspend() seems redundant in most cases: I'd like to just call one function, say, benchmarkStart() to indicate "setup complete, please start measuring time now." - I'm glad that StopWatch can auto-start; but the documentation should be clearer: does reset() stop the timer or just reset the time to zero? does stop() followed by start() start from zero or does it keep the time on the clock? I also think there should be a method that returns the value of peek() and restarts the timer at the same time (perhaps stop() and reset() should just return peek()?) - After reading the documentation of comparingBenchmark and measureTime, I have almost no idea what they do.
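For readers trying the existing std.datetime StopWatch while these questions are open, a small usage sketch consistent with Dmitry's description further down the thread (stop pauses, start resumes, reset zeroes); the exact semantics should still be verified against the documentation.

    import std.datetime : StopWatch, AutoStart;
    import std.stdio : writeln;

    void main()
    {
        // Auto-start measuring immediately.
        auto sw = StopWatch(AutoStart.yes);

        // ... work being timed ...

        sw.stop();                 // pause; accumulated time is kept
        writeln(sw.peek().msecs, " ms so far");

        sw.start();                // resume; keeps the time already on the clock
        // ... more work ...
        sw.stop();

        sw.reset();                // zero the accumulated time
    }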
Sep 21 2012
> As far as I know, D doesn't offer a sampling profiler,

It is possible to use a sampling profiler on D executables though. I usually use perf on Linux and AMD CodeAnalyst on Windows.
Sep 21 2012
On 21-Sep-12 22:49, David Piepgrass wrote:
>> After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.
>
> As far as I know, D doesn't offer a sampling profiler, so one might indeed use a benchmarking library as a (poor) substitute. So I'd want to be able to set up some benchmarks that operate on realistic data, with perhaps different data in different runs in order to learn about how the speed varies with different inputs (if it varies a lot then I might create more benchmarks to investigate which inputs are processed quickly, and which slowly.)

Real good profilers are the ones served by CPU vendor. See AMD's CodeAnalyst or Intel's VTune. They could even count number of branch predictions, cache misses etc. It is certainly out of the charter of this module or for that matter any standard library code.

> Some random comments about std.benchmark based on its documentation:
>
> - It is very strange that the documentation of printBenchmarks uses neither of the words "average" or "minimum", and doesn't say how many trials are done.... I suppose the obvious interpretation is that it only does one trial, but then we wouldn't be having this discussion about averages and minimums right?

See the algorithm in action here:
https://github.com/D-Programming-Language/phobos/pull/794/files#L2R381

In other words, a function is run 10^n times, with n picked so that the total time is big enough to be a trustworthy measurement. Then run-time is time/10^n.

> Øivind says tests are run 1000 times...

The above 1000 times, picking the minimum as the best. Obviously it'd be good to be configurable.

> but it needs to be configurable per-test (my idea: support a _x1000 suffix in function names, or _for1000ms to run the test for at least 1000 milliseconds; and allow a multiplier when running a group of benchmarks, e.g. a multiplier argument of 0.5 means to only run half as many trials as usual.) Also, it is not clear from the documentation what the single parameter to each benchmark is (define "iterations count".)
>
> - The "benchmark_relative_" feature looks quite useful. I'm also happy to see benchmarkSuspend() and benchmarkResume(), though benchmarkSuspend() seems redundant in most cases: I'd like to just call one function, say, benchmarkStart() to indicate "setup complete, please start measuring time now."
>
> - I'm glad that StopWatch can auto-start; but the documentation should be clearer: does reset() stop the timer or just reset the time to zero? does stop() followed by start() start from zero or does it keep the time on the clock? I also think there should be a method that returns the value of peek() and restarts the timer at the same time (perhaps stop() and reset() should just return peek()?)

It's the same as the usual stopwatch (as in the real hardware thingy). Thus:
- reset just resets numbers to zeros
- stop just stops counting
- start just starts counting
- peek imitates taking a look at numbers on a device ;)

> - After reading the documentation of comparingBenchmark and measureTime, I have almost no idea what they do.

I think that comparingBenchmark was present in std.datetime and is carried over as is.

--
Dmitry Olshansky
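A rough D sketch of the measurement loop Dmitry describes (scale the repetition count until the total run time is long enough to trust, then divide); the 10 ms threshold and the outer repeat-and-take-the-minimum step are illustrative assumptions, not the exact code in the pull request.

    import std.datetime : StopWatch, AutoStart;

    // Sketch of "run the function 10^n times, with n picked so the total
    // time is big enough to be trustworthy; per-call time is total / 10^n".
    double perCallHnsecs(void delegate() fun)
    {
        enum trustworthyHnsecs = 100_000; // ~10 ms, an assumed threshold

        for (ulong reps = 1; ; reps *= 10)
        {
            auto sw = StopWatch(AutoStart.yes);
            foreach (i; 0 .. reps)
                fun();
            sw.stop();

            immutable total = sw.peek().hnsecs;
            if (total >= trustworthyHnsecs)
                return cast(double) total / reps;
        }
    }

    // The framework then repeats this measurement (e.g. up to 1000 times,
    // as mentioned above) and reports the minimum of the per-call estimates.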
Sep 21 2012
On 9/21/12 2:49 PM, David Piepgrass wrote:That is correct.After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.Like others, I must also disagree in principle. The minimum sounds like a useful metric for functions that (1) do the same amount of work in every test and (2) are microbenchmarks, i.e. they measure a small and simple task.If the benchmark being measured either (1) varies the amount of work each time (e.g. according to some approximation of real-world input, which obviously may vary)* or (2) measures a large system, then the average and standard deviation and even a histogram may be useful (or perhaps some indicator of whether the runtimes are consistent with a normal distribution or not). If the running time is long then the max might be useful (because things like task-switching overhead probably do not contribute that much to the total). * I anticipate that you might respond "so, only test a single input per benchmark", but if I've got 1000 inputs that I want to try, I really don't want to write 1000 functions nor do I want 1000 lines of output from the benchmark. An average, standard deviation, min and max may be all I need, and if I need more detail, then I might break it up into 10 groups of 100 inputs. In any case, the minimum runtime is not the desired output when the input varies.I understand. What we currently do at Facebook is support benchmark functions with two parameters (see https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md). One is the number of iterations, the second is "problem size", akin to what you're discussing. I chose to not support that in this version of std.benchmark because it can be tackled later easily, but I probably need to add it now, sigh.It's a little surprising to hear "The purpose of std.benchmark is not to estimate real-world time. (That is the purpose of profiling)"... Firstly, of COURSE I would want to estimate real-world time with some of my benchmarks. For some benchmarks I just want to know which of two or three approaches is faster, or to get a coarse ball-park sense of performance, but for others I really want to know the wall-clock time used for realistic inputs.I would contend that a benchmark without a baseline is very often misguided. I've seen tons and tons and TONS of nonsensical benchmarks lacking a baseline. "I created one million smart pointers, it took me only one millisecond!" Well, how long did it take you to create one million dumb pointers? Choosing good baselines and committing to good comparisons instead of un-based absolutes is what makes the difference between a professional and a well-intended dilettante.Secondly, what D profiler actually helps you answer the question "where does the time go in the real world?"? The D -profile switch creates an instrumented executable, which in my experience (admittedly not experience with DMD) severely distorts running times. I usually prefer sampling-based profiling, where the executable is left unchanged and a sampling program interrupts the program at random and grabs the call stack, to avoid the distortion effect of instrumentation. Of course, instrumentation is useful to find out what functions are called the most and whether call frequencies are in line with expectations, but I wouldn't trust the time measurements that much.
As far as I know, D doesn't offer a sampling profiler, so one might indeed use a benchmarking library as a (poor) substitute. So I'd want to be able to set up some benchmarks that operate on realistic data, with perhaps different data in different runs in order to learn about how the speed varies with different inputs (if it varies a lot then I might create more benchmarks to investigate which inputs are processed quickly, and which slowly.)I understand there's a good case to be made for profiling. If this turns out to be an acceptance condition for std.benchmark (which I think it shouldn't), I'll define one.Some random comments about std.benchmark based on its documentation: - It is very strange that the documentation of printBenchmarks uses neither of the words "average" or "minimum", and doesn't say how many trials are done....Because all of those are irrelevant and confusing. We had an older framework at Facebook that reported those numbers, and they were utterly and completely meaningless. Besides the trials column contained numbers that were not even comparable. Everybody was happy when I removed them with today's simple and elegant numbers.I suppose the obvious interpretation is that it only does one trial, but then we wouldn't be having this discussion about averages and minimums right? Øivind says tests are run 1000 times... but it needs to be configurable per-test (my idea: support a _x1000 suffix in function names, or _for1000ms to run the test for at least 1000 milliseconds; and allow a multiplier when when running a group of benchmarks, e.g. a multiplier argument of 0.5 means to only run half as many trials as usual.)I don't think that's a good idea.Also, it is not clear from the documentation what the single parameter to each benchmark is (define "iterations count".)The documentation could include that, but I don't want to overspecify.- The "benchmark_relative_" feature looks quite useful. I'm also happy to see benchmarkSuspend() and benchmarkResume(), though benchmarkSuspend() seems redundant in most cases: I'd like to just call one function, say, benchmarkStart() to indicate "setup complete, please start measuring time now."Good point. I think this is a minor encumbrance, so it's good to keep generality.- I'm glad that StopWatch can auto-start; but the documentation should be clearer: does reset() stop the timer or just reset the time to zero? does stop() followed by start() start from zero or does it keep the time on the clock? I also think there should be a method that returns the value of peek() and restarts the timer at the same time (perhaps stop() and reset() should just return peek()?) - After reading the documentation of comparingBenchmark and measureTime, I have almost no idea what they do.Yah, these are moved over from std.datetime. I'll need to make a couple more passes through the dox. Andrei
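For readers unfamiliar with the folly scheme Andrei refers to above, here is a hedged sketch of how benchmark functions in that style might look on the D side. The bodies, names, and the two-parameter signature are assumptions for illustration based on this thread and the folly docs, not the reviewed std.benchmark API:

// Single-parameter form discussed in this thread: the framework picks the
// iteration count, the function just loops that many times.
void benchmark_append(uint n)                     // baseline
{
    uint[] a;
    foreach (i; 0 .. n)
        a ~= i;
}

// The "benchmark_relative_" prefix would report this one relative to the
// baseline above - the kind of comparison Andrei argues for.
void benchmark_relative_appendReserved(uint n)
{
    uint[] a;
    a.reserve(n);
    foreach (i; 0 .. n)
        a ~= i;
}

// Folly-style two-parameter form: iteration count plus a "problem size".
// This is the extension Andrei says may need to be added.
void benchmark_sort(uint n, size_t problemSize)
{
    import std.algorithm : sort;
    import std.random : uniform;
    foreach (i; 0 .. n)
    {
        auto data = new int[problemSize];
        foreach (ref x; data)
            x = uniform(0, 1_000_000);   // note: this setup is measured too -
                                         // which is where benchmarkSuspend/
                                         // benchmarkResume would come in
        sort(data);
    }
}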
Sep 21 2012
Huh? It's not nearly as confusing as reading the documentation and not having the faintest idea what it will do. The way the benchmarker works is somehow 'irrelevant'? The documentation doesn't even indicate that the functions are to be run more than once!!Some random comments about std.benchmark based on its documentation: - It is very strange that the documentation of printBenchmarks uses neither of the words "average" or "minimum", and doesn't say how many trials are done....Because all of those are irrelevant and confusing.I don't think that's a good idea.I have never seen you make such vague arguments, Andrei.
Sep 21 2012
On 9/21/12 5:36 PM, David Piepgrass wrote:I misunderstood. I agree that it's a good thing to specify how benchmarking proceeds.Huh? It's not nearly as confusing as reading the documentation and not having the faintest idea what it will do. The way the benchmarker works is somehow 'irrelevant'? The documentation doesn't even indicate that the functions are to be run more than once!!Some random comments about std.benchmark based on its documentation: - It is very strange that the documentation of printBenchmarks uses neither of the words "average" or "minimum", and doesn't say how many trials are done....Because all of those are irrelevant and confusing.I had expanded my point elsewhere. Your suggestion was:I don't think that's a good idea.I have never seen you make such vague arguments, Andrei.- It is very strange that the documentation of printBenchmarks uses neither of the words "average" or "minimum", and doesn't say how many trials are done.... I suppose the obvious interpretation is that it only does one trial, but then we wouldn't be having this discussion about averages and minimums right? Øivind says tests are run 1000 times... but it needs to be configurable per-test (my idea: support a _x1000 suffix in function names, or _for1000ms to run the test for at least 1000 milliseconds; and allow a multiplier when when running a group of benchmarks, e.g. a multiplier argument of 0.5 means to only run half as many trials as usual.) Also, it is not clear from the documentation what the single parameter to each benchmark is (define "iterations count".)I don't think it's a good idea because the "for 1000 ms" doesn't say anything except how good the clock resolution was on the system. I'm as strongly convinced we shouldn't print useless information as I am we should print useful information. Andrei
Sep 21 2012
I am puzzled about what you think my suggestion meant. I am suggesting allowing the user to configure how long benchmarking takes. Some users might want to run their benchmark for an hour to get stable and reliable numbers; others don't want to wait and want to see results ASAP. Perhaps the *same* user will want to run benchmarks quickly while developing them and then do a "final run" with more trials once their benchmark suite is complete. Also, some individual benchmark functions will take microseconds to complete; others may take seconds to complete. All I'm suggesting are simple ways to avoid wasting users' time, without making std.benchmark overly complicated.- It is very strange that the documentation of printBenchmarks uses neither of the words "average" or "minimum", and doesn't say how many trials are done.... I suppose the obvious interpretation is that it only does one trial, but then we wouldn't be having this discussion about averages and minimums right? Øivind says tests are run 1000 times... but it needs to be configurable per-test (my idea: support a _x1000 suffix in function names, or _for1000ms to run the test for at least 1000 milliseconds; and allow a multiplier when when running a group of benchmarks, e.g. a multiplier argument of 0.5 means to only run half as many trials as usual.) Also, it is not clear from the documentation what the single parameter to each benchmark is (define "iterations count".)I don't think it's a good idea because the "for 1000 ms" doesn't say anything except how good the clock resolution was on the system. I'm as strongly convinced we shouldn't print useless information as I am we should print useful information.
Sep 22 2012
Stepping back for a moment, I think we're facing two key issues here: The first key issue is that the docs for std.benchmark don't adequately explain Andrei's intended charter/scope for it, its methodology, or the rationale for that methodology. So people see "benchmark" and they think "oh, ok, for timing stuff", but it appears to be intended for very specific use-cases. I think this entire discussion serves as evidence that, at the very least, it needs to communicate that scope/methodology/rationale better than it currently does. If all of us are having trouble "getting it", then others certainly will too. Aside from that, there's the second key issue: whether the current intended scope is sufficient. Should it be more general in scope and not so specialized? Personally, I would tend to think so, and that seems to be the popular notion. But I don't know for sure. If it should be more generalized, then does it need to be so for the first iteration, or can it be done later after being added to Phobos? That, I have no idea.
Sep 21 2012
On 22/09/12 07:10, Nick Sabalausky wrote:I think this entire discussion serves as evidence that, at the very least, it needs to communicate that scope/methodology/rationale better that it currently does. If all of us are having trouble "getting it", then others certainly will too.My feeling is that even with a good explanation in the docs, you're _still_ going to have a regular stream of people showing up on the mailing lists going, "Hey, why can't I get my preferred metric with std.benchmark??!!" So, even if there's good reason to think their preferences are daft, it might be worth supporting what they want to do, just to avoid that continuous stream of requests.Aside from that, there's the second key issue: whether the current intended scope is sufficient. Should it be more general in scope and not so specialized? Personally, I would tend to think do, and I think that seems to the the popular notion. But I don't know for sure. If it should be more generalized, than does it need to be so for the first iteration, or can it be done later after being added to phobos? That, I have no idea.This is what I was wondering, whether it's possible to take the current functionality but leave the door open to extending it with a different choice of metrics. Under the extended version, the default metric would be as it is currently, the docs would explain why this default makes sense and the caveats related to other metrics, but ultimately if the user wanted to use them, they'd be available. But I'd only like to see that happening if there is a "natural" path to the extended version rather than breaking changes or significant rewrites.
Sep 23 2012
New question for you :) To register benchmarks, the 'scheduleForBenchmarking' mixin inserts a shared static initializer into the module. If I have a module A and a module B that both depend on each other, then this will probably not work...? The runtime will detect the init cycle and fail with the following error: "Cycle detected between modules with ctors/dtors" Or am I wrong now?
Sep 19 2012
On 9/19/12 4:11 PM, "Øivind" wrote:New question for you :) To register benchmarks, the 'scheduleForBenchmarking' mixin inserts a shared static initializer into the module. If I have a module A and a module B, that both depend on eachother, than this will probably not work..? The runtime will detect the init cycle and fail with the following error: "Cycle detected between modules with ctors/dtors" Or am I wrong now?I think you have discovered a major issue. Ideas on how to attack this? Andrei
Sep 21 2012
On Friday, September 21, 2012 15:59:31 Andrei Alexandrescu wrote:On 9/19/12 4:11 PM, "Øivind" wrote:Some of us have been asking for ages for the ability to mark a static constructor as not depending on anything so that the runtime _doesn't_ think that there's a circular dependency, but Walter has been against the idea when it's been brought up. That would _really_ help here. Without redesigning std.benchmark so that it doesn't use static constructors, I don't know how you can fix that. Normally, if you really need a static constructor, you go through the pain of creating a separate module which does the initialization for you (like std.stdio does). But that won't work in this case, because you're mixing it in. So, unless you can redesign it so that std.benchmark doesn't require static constructors, it may have to be a limitation of std.benchmark that it can't be used where it would create a circular dependency. Unfortunately, the circular dependency issue makes static constructors almost useless outside of isolated cases, even though they rarely actually have circular dependencies. It's one of the few places in D that I'd say that there's a major design flaw. - Jonathan M DavisNew question for you :) To register benchmarks, the 'scheduleForBenchmarking' mixin inserts a shared static initializer into the module. If I have a module A and a module B, that both depend on eachother, than this will probably not work..? The runtime will detect the init cycle and fail with the following error: "Cycle detected between modules with ctors/dtors" Or am I wrong now?I think you have discovered a major issue. Ideas on how to attack this?
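For concreteness, a minimal sketch of the failure mode under discussion (module names invented): each module mixes in its registration, which expands to a shared static this(), and the modules import each other.

// --- a.d ---
module a;
import b;                       // a depends on b
shared static this() { /* register a's benchmarks */ }

// --- b.d ---
module b;
import a;                       // b depends on a
shared static this() { /* register b's benchmarks */ }

// druntime cannot order the two constructors and aborts at startup with
// "Cycle detected between modules with ctors/dtors".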
Sep 21 2012
On Friday, 21 September 2012 at 20:43:13 UTC, Jonathan M Davis wrote:On Friday, September 21, 2012 15:59:31 Andrei Alexandrescu wrote:Is there a way to solve the dependency issue without forbidding static constructors in modules with cyclic dependencies? I.e. for two modules with static init referencing each other, could an analysis of the static init code be used to detect circular dependencies instead of the import statements?On 9/19/12 4:11 PM, "Øivind" wrote:Some of us have been asking for ages for the ability to mark a static constructor as not depending on anything so that the runtime _doesn't_ think that there's a circular dependency, but Walter has been against the idea when it's been brought up. That would _really_ help here. Without redesigning std.benchmark so that it doesn't use static constructors, I don't know how you can fix that. Normally, if you really need a static constructor, you go through the pain of creating a separate module which does the initialization for you (like std.stdio does). But that won't work in this case, because you're mixing it in. So, unless you can redesign it so that std.benchmark doesn't require static constructors, it may have to be a limitation of std.benchmark that it can't be used where it would create a circular dependency. Unfortunately, the circular dependency issue makes static constructors almost useless outside of isolated cases, even though they rarely actually have circular dependencies. It's one of the few places in D that I'd say that there's a major design flaw. - Jonathan M DavisNew question for you :) To register benchmarks, the 'scheduleForBenchmarking' mixin inserts a shared static initializer into the module. If I have a module A and a module B that both depend on each other, then this will probably not work...? The runtime will detect the init cycle and fail with the following error: "Cycle detected between modules with ctors/dtors" Or am I wrong now?I think you have discovered a major issue. Ideas on how to attack this?
Sep 22 2012
On 9/22/12 8:28 AM, "Øivind" wrote:Is there a way to solve the dependency issue without forbidding static constructors in modules with cyclic dependencies?I think an idea just occurred to me. The rules for static ctors and dtors were invented before "import" was allowed inside a scope. We could have taken advantage of that. Say we restrict symbol visibility inside static cdtors to ONLY symbols within the current module. If some static cdtor needs a symbol from a different module, it must import it explicitly (even if the current module already imports it). In this setup it should be possible to compute, in a fine-grained manner, the dependencies of static cdtors. Unfortunately that would be a breaking change. Andrei
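A sketch of how that rule would look in code; this is a proposed language change, not current D semantics, and makeTable is an invented name:

module a;

import b;                  // module-level import: under the proposal this would
                           // NOT count as a dependency of the static ctor

int[string] table;

shared static this()
{
    import b : makeTable;  // explicit local import: only this creates an
                           // ordering dependency on b's static ctors
    table = makeTable();
}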
Sep 22 2012
On Saturday, 22 September 2012 at 13:03:06 UTC, Andrei Alexandrescu wrote:On 9/22/12 8:28 AM, "Øivind" wrote:It gets a bit ugly maybe, but we could do a mix of the proposals that have come before and this one, e.g. add a nocycliccheck (or similar) to the static constructor, and in that case only allow access to the current module and those imported inside the ctor scope...Is there a way to solve the dependency issue without forbidding static constructors in modules with cyclic dependencies?I think an idea just occurred to me. The rules for static ctors and dtors were invented before "import" was allowed inside a scope. We could have taken advantage of that. Say we restrict symbol visibility inside static cdtors to ONLY symbols within the current module. If some static cdtor needs a symbol from a different module, it must import it explicitly (even if the current module already imports it). In this setup it should be possible to compute, in a fine-grained manner, the dependencies of static cdtors. Unfortunately that would be a breaking change. Andrei
Sep 22 2012
On Saturday, 22 September 2012 at 13:25:47 UTC, Øivind wrote:On Saturday, 22 September 2012 at 13:03:06 UTC, Andrei Alexandrescu wrote:We would probably not call it 'nocycliccheck' since you propose to still do these checks, but only local imports :) Would need another name for it.On 9/22/12 8:28 AM, "Øivind" wrote:It gets a bit ugly maybe, but we could do a mix of the proposals that have come before and this one, e.g. add a nocycliccheck (or similar) to the static constructor, and in that case only allow access to the current module and those imported inside the ctor scope...Is there a way to solve the dependency issue without forbidding static constructors in modules with cyclic dependencies?I think an idea just occurred to me. The rules for static ctors and dtors were invented before "import" was allowed inside a scope. We could have taken advantage of that. Say we restrict symbol visibility inside static cdtors to ONLY symbols within the current module. If some static cdtor needs a symbol from a different module, it must import it explicitly (even if the current module already imports it). In this setup it should be possible to compute, in a fine-grained manner, the dependencies of static cdtors. Unfortunately that would be a breaking change. Andrei
Sep 22 2012
On 2012-09-22, 15:28, Øivind wrote:It gets a bit ugly maybe, but we could do a mix of the proposals that have come before and this one, e.g. add a nocycliccheck (or similar) to the static constructor, and in that case only allow access to the current module and those imported inside the ctor scope...We would probably not call it 'nocycliccheck' since you propose to still do these checks, but only local imports :) Would need another name for it.noglobalimports. :p -- Simen
Sep 22 2012
On 2012-09-22, 15:04, Andrei Alexandrescu wrote:On 9/22/12 8:28 AM, "Øivind" wrote:Is there a way to solve the dependency issue without forbidding static constructors in modules with cyclic dependencies?I think an idea just occurred to me. The rules for static ctors and dtors were invented before "import" was allowed inside a scope. We could have taken advantage of that. Say we restrict symbol visibility inside static cdtors to ONLY symbols within the current module. If some static cdtor needs a symbol from a different module, it must import it explicitly (even if the current module already imports it). In this setup it should be possible to compute, in a fine-grained manner, the dependencies of static cdtors. Unfortunately that would be a breaking change.That *is* neat. I guess putting it on the deprecation path could work. This is a change we'd really, *really* like to see. -- Simen
Sep 22 2012
On Saturday, September 22, 2012 09:04:09 Andrei Alexandrescu wrote:On 9/22/12 8:28 AM, "Øivind" wrote:Is there a way to solve the dependency issue without forbidding static constructors in modules with cyclic dependencies?I think an idea just occurred to me. The rules for static ctors and dtors were invented before "import" was allowed inside a scope. We could have taken advantage of that. Say we restrict symbol visibility inside static cdtors to ONLY symbols within the current module. If some static cdtor needs a symbol from a different module, it must import it explicitly (even if the current module already imports it). In this setup it should be possible to compute, in a fine-grained manner, the dependencies of static cdtors. Unfortunately that would be a breaking change.It's a nice thought, but it wouldn't work. If nothing else .di files completely ruin it. 1. I don't think that it's actually required that static constructors be in a .di file. So, the compiler couldn't know for sure whether the modules being imported had static constructors. 2. Even if all static constructors had to be in a .di file, local imports ruin it, because the function bodies (which can definitely be elided from .di files) could contain local imports to modules which have static constructors and cause circular dependencies. - Jonathan M Davis
Sep 22 2012
On 21-Sep-12 23:59, Andrei Alexandrescu wrote:On 9/19/12 4:11 PM, "Øivind" wrote:Not ideal, but... Make scheduleForBenchmarking mix in something other than code - say, a global templated struct with a certain name. Then it should be possible to do: benchmarkModules!(module1, module2, ...); That would search for this specific anchor at the top scope of the given modules and collect all the info. I'm not sure we can pass module names as alias parameters, but I think our meta-programming tricksters certainly did something along these lines. -- Dmitry OlshanskyNew question for you :) To register benchmarks, the 'scheduleForBenchmarking' mixin inserts a shared static initializer into the module. If I have a module A and a module B that both depend on each other, then this will probably not work...? The runtime will detect the init cycle and fail with the following error: "Cycle detected between modules with ctors/dtors" Or am I wrong now?I think you have discovered a major issue. Ideas on how to attack this?
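A rough sketch of the anchor approach Dmitry describes above. Every name here is invented; whether a module can always be passed as an alias argument is exactly the open question he raises, though __traits(allMembers, ...) does accept a module symbol.

// The mixin declares only an anchor - no static constructor - so druntime
// never sees a module ctor and no cycle can be reported.
enum scheduleForBenchmarking = q{ enum _benchmarkAnchor = true; };

// An explicit call then walks the listed modules, checks for the anchor,
// and collects the benchmark_* functions declared next to it.
string[] collectBenchmarks(Modules...)()
{
    string[] names;
    foreach (mod; Modules)                        // compile-time foreach
    {
        bool anchored = false;
        foreach (member; __traits(allMembers, mod))
            static if (member == "_benchmarkAnchor")
                anchored = true;
        if (anchored)
            foreach (member; __traits(allMembers, mod))
                static if (member.length > 10 && member[0 .. 10] == "benchmark_")
                    names ~= member;
    }
    return names;
}

// usage, with the modules imported at the call site:
//     auto found = collectBenchmarks!(module1, module2)();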
Sep 21 2012
On 2012-09-21 21:59, Andrei Alexandrescu wrote:I think you have discovered a major issue. Ideas on how to attack this?The standard way to solve this would be to move the initialization code from a static constructor to a function that is called later, doing the initialization lazily. But since this goes through several layers of string mixins, that will make it more complicated. Can you give a short example of what the mixed-in code will look like? -- /Jacob Carlborg
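A minimal sketch of the lazy alternative Jacob suggests (names invented; the trade-off is that registration is no longer automatic - something has to call it before the benchmarks run):

// Instead of expanding to a shared static this(), the mixin only defines a
// plain function, so druntime's ctor-cycle detection never gets involved.
mixin template scheduleForBenchmarkingLazily()
{
    void registerBenchmarksOfThisModule()
    {
        // enumerate this module's benchmark_* functions and add them to the
        // runner's registry here
    }
}

// e.g. somewhere in main(), before printing results:
//     moduleA.registerBenchmarksOfThisModule();
//     moduleB.registerBenchmarksOfThisModule();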
Sep 22 2012
On 2012-09-17 23:13, Jens Mueller wrote:Post all feedback to this thread. Constructive feedback is very much appreciated. To conclude in more Andrei like words: Happy destruction!* Why is "scheduleForBenchmarking" a string? Can't it be a template mixin? * What's the most appropriate way of just timing a block of code? Something like this: auto time = benchmark!({ /* some code */ })(1); If that's the case then I suggest setting a default value of "1" for the "n" parameter. * If I want to format the printed result differently, say in HTML, how would I do that? Should I use the "benchmark" function and iterate the BenchmarkResult array? * BTW why doesn't benchmark return the BenchmarkResult array? * Is this module so important to keep it as a top level module? I'm thinking something like a utility package or a time/date package. How about std.util.benchmark? -- /Jacob Carlborg
Sep 19 2012
On Wednesday, September 19, 2012 09:13:40 Jacob Carlborg wrote:* Is this module so important to keep it as a top level module? I'm thinking something like a utility package or a time/date package. How about std.util.benchmark?util is one of the worst package names ever, because it means basically nothing. Any function could go in there. As for a time/date package, we already have std.datetime (which will hopefully be split into the package std.datetime at some point, but we need something like DIP 15 or 16 before we can do that), and we're moving the benchmarking _out_ of there. If std.datetime were already a package, then maybe putting it in there would make some sense, but benchmarking is arguably fundamentally different from what the rest of std.datetime does. I really see no problem with benchmarking being its own thing, and std.benchmark works just fine for that. - Jonathan M Davis
Sep 19 2012
On 2012-09-19 09:58, Jonathan M Davis wrote:util is one of the worst package names ever, because it means basically nothing. Any function could go in there.Well, the "util" package in Phobos is called "std".As for a time/date package, we already have std.datetime (which will hopefully be split into the package std.datetime at some point, but we need something like DIP 15 or 16 before we can do that), and we're moving the benchmarking _out_ of there. If std.datetime were already a package, then maybe putting it in there would make some sense, but benchmarking is arguably fundamentally different from what the rest of std.datetime does. I really so no problem with benchmarking being its own thing, and std.benchmark works just fine for that.I just think we have too many top level modules. -- /Jacob Carlborg
Sep 19 2012
On 9/19/12 3:13 AM, Jacob Carlborg wrote:On 2012-09-17 23:13, Jens Mueller wrote:Good point, I'll look into it.Post all feedback to this thread. Constructive feedback is very much appreciated. To conclude in more Andrei like words: Happy destruction!* Why is "scheduleForBenchmarking" a string? Can't it be a template mixin?* What's the most appropriate way of just timing a block of code? Something like this: auto time = benchmark!({ /* some code */ })(1); If that's the case then I suggest setting a default value of "1" for the "n" parameter.A default value of n would depend on the speed of the function and the granularity of the system's timer. That overload of benchmark is imperfect (rather rigid) but we must keep it for backwards compatibility.* If I want to format the printed result differently, say in HTML, how would I do that? Should I use the "benchmark" function and iterate the BenchmarkResult array?That is correct. printBenchmarks() does not use any magic - it just looks at that same array.* BTW why doesn't benchmark return the BenchmarkResult array?Will look into it.* Is this module so important to keep it as a top level module? I'm thinking something like a utility package or a time/date package. How about std.util.benchmark?Not sure. Andrei
Sep 22 2012
I don't see why `benchmark` takes (almost) all of its parameters as template parameters. It looks quite odd, seems unnecessary, and (if I'm not mistaken) makes certain use cases quite difficult. For example, suppose I want to benchmark a function several times with different parameters and names, how would I do that? foreach (i; 0..10) { printBenchmark!( format("Test %d", i), { someFunc(i); } )(); } This won't work because i isn't known at compile time, and for some use cases it can't be known at compile time. I wouldn't mind if there was some real benefit to taking these as template arguments, but there doesn't seem to be any value at all -- it just limits usage.
Sep 19 2012
On 9/19/12 4:06 AM, Peter Alexander wrote:I don't see why `benchmark` takes (almost) all of its parameters as template parameters. It looks quite odd, seems unnecessary, and (if I'm not mistaken) makes certain use cases quite difficult.That is intentional - indirect calls would add undue overhead to the measurements. Andrei
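A small illustration of the overhead point (not std.benchmark's code): with an alias template parameter the call in the loop is direct and can be inlined, while the delegate version dereferences a pointer on every iteration - inside the very loop being measured.

void benchDirect(alias fun)(uint n)
{
    foreach (i; 0 .. n)
        fun();                    // direct, inlinable call
}

void benchIndirect(void delegate() fun, uint n)
{
    foreach (i; 0 .. n)
        fun();                    // indirect call through the delegate
}

// usage (someWork is any void function you want to time):
//     benchDirect!someWork(1_000_000);
//     import std.functional : toDelegate;
//     benchIndirect(toDelegate(&someWork), 1_000_000);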
Sep 21 2012
On Friday, 21 September 2012 at 19:54:12 UTC, Andrei Alexandrescu wrote:On 9/19/12 4:06 AM, Peter Alexander wrote:I accept that it adds undue overhead. I just think that the function would be more usable with non-template parameters (as per my example). I also think the overhead would be negligible.I don't see why `benchmark` takes (almost) all of its parameters as template parameters. It looks quite odd, seems unnecessary, and (if I'm not mistaken) makes certain use cases quite difficult.That is intentional - indirect calls would add undue overhead to the measurements.
Sep 21 2012
I'd throw in a request to address the following. Suppose we have a function F and a set of inputs S that represent different scenarios we optimize for. What is interesting is to benchmark all of F(S[i]) as |S| separate functions, greatly saving on boilerplate (and helping readability). One way would be to allow passing an input range of argument tuples to F, say via a benchmark_data_ prefix: void benchmark_f(int a, double b, string s){ ... } enum benchmark_data_f = [ tuple(1, 2.0, "hi"), tuple(2, 3.0, "bye") ]; Then in the results it'd look like: f(1, 2.0, "hi") <ns/iter> <iter/s> f(2, 3.0, "bye") <ns/iter> <iter/s> Using any input range is interestingly flexible, e.g.: enum benchmark_data_x = cartesianProduct(iota(1, 3), iota(1, 3)); //we should probably have it in std.range somewhere void benchmark_x(int a, int b){ ... } That being said, I don't really get the benefit of passing the iteration count to the function being benched. Is it to allow it to do an initialization step once, then call resumeBenchmark() and run some inner loop n times? -- Dmitry Olshansky
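To make the proposal concrete, a sketch of how a driver could expand such data tuples; this mirrors Dmitry's snippet above, but the driver and its reporting are assumptions, not an existing std.benchmark feature:

import std.typecons : tuple;

void benchmark_f(int a, double b, string s) { /* code under test */ }

enum benchmark_data_f = [ tuple(1, 2.0, "hi"), tuple(2, 3.0, "bye") ];

void runDataDriven()
{
    foreach (args; benchmark_data_f)
        benchmark_f(args.expand);   // each tuple becomes one reported row,
                                    // e.g. f(1, 2.0, "hi")  <ns/iter>  <iter/s>
}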
Sep 21 2012