
digitalmars.D - Documented unittests & code coverage

reply Johannes Pfau <nospam example.com> writes:
Some time ago we moved some example code from documentation comments
into documented unittests. Some of these more complicated examples are
incomplete and are therefore not expected to actually run. They are
also not very useful as real unittests, as they do not contain any
asserts or verification of results. Making them documented unittests
was mainly done to ensure these examples keep compiling when there
are library changes.

I just had a quick look at https://github.com/dlang/phobos/pull/4587
and some example output:
https://codecov.io/gh/wilzbach/phobos/src/5fc9eb90076101c0266fb3491ac68527d3520fba/std/digest/digest.d#L106

And it seems that this 'idiom' messes up code coverage results: The
code in the unittest is never executed (we just want syntactically
valid code) and therefore shows up as untested code. The code coverage
shows 46 missed lines for std.digest.digest, but only 8 of these
actually need testing.
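
For illustration, here is one common shape of the idiom (a minimal,
hypothetical sketch, not actual Phobos code): the example is wrapped in a
nested function that is never called, so it only has to compile, and -cov
then reports every line inside it as unexecuted.

/// Computes a digest over the given data.
ubyte[20] digestOf(const(ubyte)[] data) { ubyte[20] result; return result; }

///
unittest
{
    // Never called: the body only needs to stay syntactically valid and
    // compilable, so dmd -cov counts all of its lines as uncovered.
    void example()
    {
        ubyte[] data = [1, 2, 3];
        auto digest = digestOf(data);
    }
}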

So how do we solve this?
* Ignore lines in documented unittests for code coverage?
* Make these examples completely executable, at the expense of
  documentation, which will then contain useless boilerplate code?
* Move these examples back to the documentation block?


And as a philosophical question: Is code coverage in unittests even a
meaningful measurement? We write unittests to test the library code. But
if there's a line in a unittest which is never executed, this does not
directly mean there's a problem in library code, as long as all library
code is still tested. It may be an oversight in the unittest in the
worst case, but how important are oversights / bugs in the unittests if
they don't affect library code in any way?
Jul 28 2016
next sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, July 28, 2016 12:15:27 Johannes Pfau via Digitalmars-d wrote:
 And as a philosophical question: Is code coverage in unittests even a
 meaningful measurement? We write unittests to test the library code. But
 if there's a line in a unittest which is never executed, this does not
 directly mean there's a problem in library code, as long as all library
 code is still tested. It may be an oversight in the unittest in the
 worst case, but how important are oversights / bugs in the unittests if
 they don't affect library code in any way?
https://issues.dlang.org/show_bug.cgi?id=14856 - Jonathan M Davis
Jul 28 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 7/28/2016 3:15 AM, Johannes Pfau wrote:
 And as a philosophical question: Is code coverage in unittests even a
 meaningful measurement?
Yes. I've read all the arguments against code coverage testing. But in my usage of it for 30 years, it has been a dramatic and unqualified success in improving the reliability of shipping code.
Jul 28 2016
next sibling parent Seb <seb wilzba.ch> writes:
On Thursday, 28 July 2016 at 23:14:42 UTC, Walter Bright wrote:
 On 7/28/2016 3:15 AM, Johannes Pfau wrote:
 And as a philosophical question: Is code coverage in unittests 
 even a
 meaningful measurement?
Yes. I've read all the arguments against code coverage testing. But in my usage of it for 30 years, it has been a dramatic and unqualified success in improving the reliability of shipping code.
Walter: the discussion is not about code coverage in general, but about whether code coverage within unittests should be reported, because we are only interested in the coverage of the library itself, and as Johannes and Jonathan pointed out, there are some valid patterns (e.g. scope(failure)) that are used within unittests and never executed. However, as Jonathan mentioned in the Bugzilla issue, the downside of not counting within unittest blocks is that potential bugs in the unittests can't be caught as easily anymore.
Jul 28 2016
prev sibling next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, July 28, 2016 16:14:42 Walter Bright via Digitalmars-d wrote:
 On 7/28/2016 3:15 AM, Johannes Pfau wrote:
 And as a philosophical question: Is code coverage in unittests even a
 meaningful measurement?
Yes. I've read all the arguments against code coverage testing. But in my usage of it for 30 years, it has been a dramatic and unqualified success in improving the reliability of shipping code.
The issue isn't whether we should have code coverage testing. We agree that that's a great thing. The issue is whether the lines in the unit tests themselves should count towards the coverage results. https://issues.dlang.org/show_bug.cgi?id=14856 gives some good examples of why having the unittest blocks themselves counted in the total percentage is problematic and can lead to dmd's code coverage tool listing less than 100% coverage in a module that is fully tested. What's critical is that the code itself gets coverage testing, not that the lines in the tests doing that testing be counted as part of the code that is or isn't covered. I know that it will frequently be the case that I will not get 100% code coverage per -cov for the code that I write simply because I frequently do stuff like use scope(failure) writefln(...) to print useful information on failure in unittest blocks so that I can debug what happened when things go wrong (including when someone reports failures on their machine that don't happen on mine). D's code coverage tools are fantastic to have, but they do need a few tweaks if we want to actually be reporting 100% code coverage for fully tested modules. A couple of other reports that I opened a while back are https://issues.dlang.org/show_bug.cgi?id=14855 https://issues.dlang.org/show_bug.cgi?id=14857 - Jonathan M Davis
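
For illustration, a minimal hypothetical sketch of that scope(failure) pattern (not code taken from Phobos): the writefln line only runs when a later assert fails, so in a passing test run dmd -cov reports it as unexecuted.

import std.stdio : writefln;
import std.conv : to;

int parsePort(string s)
{
    return s.to!int;
}

unittest
{
    auto input = "8080";
    // Only executed if an assert below fails; in a passing run this line
    // shows up as uncovered.
    scope(failure) writefln("parsePort failed for input: %s", input);
    assert(parsePort(input) == 8080);
}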
Jul 28 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 7/28/2016 9:48 PM, Jonathan M Davis via Digitalmars-d wrote:
 gives some good examples of why having the unittest blocks themselves
 counted in the total percentage is problematic and can lead to dmd's code
 coverage tool listing less than 100% coverage in a module that is fully tested.
 What's critical is that the code itself gets coverage testing, not that
 the lines in the tests which are doing that testing be counted as part of
 the code that is or isn't covered.

 I know that it will frequently be the case that I will not get 100% code
 coverage per -cov for the code that I write simply because I frequently do
 stuff like use scope(failure) writefln(...) to print useful information on
 failure in unittest blocks so that I can debug what happened when things go
 wrong (including when someone reports failures on their machine that don't
 happen on mine).

 D's code coverage tools are fantastic to have, but they do need a few tweaks
 if we want to actually be reporting 100% code coverage for fully tested
 modules. A couple of other reports that I opened a while back are
As soon as we start taking the % coverage too seriously, we are in trouble. It's never going to be cut and dried what should be tested and what is unreasonable to test, and I see no point in arguing about it. The % is a useful indicator, that is all. It is not a substitute for thought. As always, use good judgement.
Jul 28 2016
next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, July 28, 2016 22:12:58 Walter Bright via Digitalmars-d wrote:
 As soon as we start taking the % coverage too seriously, we are in trouble.
 It's never going to be cut and dried what should be tested and what is
 unreasonable to test, and I see no point in arguing about it.

 The % is a useful indicator, that is all. It is not a substitute for
 thought.

 As always, use good judgement.
True, but particularly when you start doing stuff like trying to require that modules have 100% coverage - or that the coverage not be reduced by a change - it starts mattering - especially if it's done with build tools. The current situation is far from the end of the world, but I definitely think that we'd be better off if we fixed some of these issues so that the percentage reflected the amount of the actual code that's covered rather than having unit tests, assert(0) statements, invariants, etc. start affecting code coverage when they aren't what you're trying to cover at all. - Jonathan M Davis
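
To illustrate the assert(0) case, a minimal hypothetical sketch (not from Phobos): every reachable line is exercised by the test, yet the deliberately unreachable assert(0) line is still counted, dragging the reported percentage below 100%.

int sign(int x)
{
    if (x > 0) return 1;
    if (x < 0) return -1;
    if (x == 0) return 0;
    assert(0, "unreachable");  // never executed, but still counted by -cov
}

unittest
{
    assert(sign(7) == 1);
    assert(sign(-7) == -1);
    assert(sign(0) == 0);
}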
Jul 28 2016
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 7/28/2016 10:49 PM, Jonathan M Davis via Digitalmars-d wrote:
 True, but particularly when you start doing stuff like trying to require
 that modules have 100% coverage - or that the coverage not be reduced by a
 change - it starts mattering - especially if it's done with build tools. The
 current situation is far from the end of the world, but I definitely think
 that we'd be better off if we fixed some of these issues so that the
 percentage reflected the amount of the actual code that's covered rather
 than having unit tests, assert(0) statements, invariants, etc. start
 affecting code coverage when they aren't what you're trying to cover at all.
Worrying about this just serves no purpose. Code coverage percentages are a guide, an indicator, not a requirement in and of itself. Changing the code in order to manipulate the number to meet some metric means the reviewer or the programmer or both have failed.
Jul 28 2016
prev sibling parent Seb <seb wilzba.ch> writes:
On Friday, 29 July 2016 at 05:49:01 UTC, Jonathan M Davis wrote:
 On Thursday, July 28, 2016 22:12:58 Walter Bright via 
 Digitalmars-d wrote:
 As soon as we start taking the % coverage too seriously, we 
 are in trouble. It's never going to be cut and dried what 
 should be tested and what is unreasonable to test, and I see 
 no point in arguing about it.

 The % is a useful indicator, that is all. It is not a 
 substitute for thought.

 As always, use good judgement.
True, but particularly when you start doing stuff like trying to require that modules have 100% coverage - or that the coverage not be reduced by a change - it starts mattering - especially if it's done with build tools. The current situation is far from the end of the world, but I definitely think that we'd be better off if we fixed some of these issues so that the percentage reflected the amount of the actual code that's covered rather than having unit tests, assert(0) statements, invariants, etc. start affecting code coverage when they aren't what you're trying to cover at all. - Jonathan M Davis
Yep, especially because I think we agree that "coverage [should] not be reduced by a change" unless there is a pretty good reason to do so. It could have the negative effect that people won't use such techniques anymore (e.g. debugging in unittests, invariants, ...) as they might develop an evil smell.
Jul 30 2016
prev sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Friday, 29 July 2016 at 05:12:58 UTC, Walter Bright wrote:
 As soon as we start taking the % coverage too seriously, we are 
 in trouble. It's never going to be cut and dried what should be 
 tested and what is unreasonable to test, and I see no point in 
 arguing about it.

 The % is a useful indicator, that is all. It is not a 
 substitute for thought.

 As always, use good judgement.
In the context of the bug, we are not the ones interpreting the statistic, we're the ones measuring and reporting it to users, and it's being measured incorrectly. By deciding not to fix a bug that causes an inaccurate statistic to be reported, you're making a decision on the user's behalf that coverage % is unimportant without knowing their circumstances. If you're going to include coverage % in the report, then a job worth doing is worth doing well.
Jul 28 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 7/28/2016 11:07 PM, Jack Stouffer wrote:
 you're making a decision on the user's behalf that coverage % is
 unimportant without knowing their circumstances.
Think of it like the airspeed indicator on an airplane. There is no right or wrong airspeed. The pilot reads the indicated value, interprets it in the context of what the other instruments say, APPLIES GOOD JUDGMENT, and flies the airplane. You won't find many pilots willing to fly without one.
Jul 29 2016
next sibling parent Chris <wendlec tcd.ie> writes:
On Friday, 29 July 2016 at 07:01:35 UTC, Walter Bright wrote:
 On 7/28/2016 11:07 PM, Jack Stouffer wrote:
 you're making a decision on the user's behalf that coverage % 
 is
 unimportant without knowing their circumstances.
Think of it like the airspeed indicator on an airplane. There is no right or wrong airspeed. The pilot reads the indicated value, interprets it in the context of what the other instruments say, APPLIES GOOD JUDGMENT, and flies the airplane. You won't find many pilots willing to fly without one.
Maybe it would help to give more than one value, e.g. the actual code coverage, i.e. functions and branches executed in the actual program, and the commands executed in the unit tests. So you would have: 100% code coverage, 95% of total commands executed (but don't worry!)
Jul 29 2016
prev sibling next sibling parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 7/29/16 3:01 AM, Walter Bright wrote:
 On 7/28/2016 11:07 PM, Jack Stouffer wrote:
 you're making a decision on the user's behalf that coverage % is
 unimportant without knowing their circumstances.
Think of it like the airspeed indicator on an airplane. There is no right or wrong airspeed. The pilot reads the indicated value, interprets it in the context of what the other instruments say, APPLIES GOOD JUDGMENT, and flies the airplane. You won't find many pilots willing to fly without one.
What if the gauge was air-speed added to fuel level, and you didn't get a gauge for each individually? -Steve
Jul 29 2016
prev sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Friday, 29 July 2016 at 07:01:35 UTC, Walter Bright wrote:
 The pilot reads the indicated value, interprets it in the 
 context of what the other instruments say, APPLIES GOOD 
 JUDGMENT, and flies the airplane.
Continuing with this metaphor, in this situation you're not the pilot making the judgement, you're the aerospace engineer deciding that the speedometer in the plane can be off by several hundred m/s and it's no big deal. Yes, every measurement in the real world has a margin of error. But, since we're dealing with computers this is one of the rare situations where a perfect number can actually be obtained and presented to the user.
 There is no right or wrong airspeed.
The right one is the actual speed of the plane and the wrong one is every other number.
Jul 29 2016
prev sibling parent reply Atila Neves <atila.neves gmail.com> writes:
On Thursday, 28 July 2016 at 23:14:42 UTC, Walter Bright wrote:
 On 7/28/2016 3:15 AM, Johannes Pfau wrote:
 And as a philosophical question: Is code coverage in unittests 
 even a
 meaningful measurement?
Yes. I've read all the arguments against code coverage testing. But in my usage of it for 30 years, it has been a dramatic and unqualified success in improving the reliability of shipping code.
Have you read this? http://www.linozemtseva.com/research/2014/icse/coverage/ Atila
Aug 04 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/4/2016 1:13 AM, Atila Neves wrote:
 On Thursday, 28 July 2016 at 23:14:42 UTC, Walter Bright wrote:
 On 7/28/2016 3:15 AM, Johannes Pfau wrote:
 And as a philosophical question: Is code coverage in unittests even a
 meaningful measurement?
Yes. I've read all the arguments against code coverage testing. But in my usage of it for 30 years, it has been a dramatic and unqualified success in improving the reliability of shipping code.
Have you read this? http://www.linozemtseva.com/research/2014/icse/coverage/
I've seen the reddit discussion of it. I don't really understand from reading the paper how they arrived at their test suites, but I suspect that may have a lot to do with the poor correlations they produced. Unittests have uncovered lots of bugs for me, and code that was unittested had far, far fewer bugs showing up after release. The bugs that did turn up tended to be based on misunderstandings of the requirements. For example, the Warp project was fully unittested from the ground up. I attribute that as the reason for the remarkably short development time for it and the near complete absence of bugs in the shipped product. Unittests also enabled fearless rejiggering of the data structures trying to make Warp run faster. Not-unittested code tends to stick with the first design out of fear.
Aug 04 2016
parent reply Atila Neves <atila.neves gmail.com> writes:
On Thursday, 4 August 2016 at 10:24:39 UTC, Walter Bright wrote:
 On 8/4/2016 1:13 AM, Atila Neves wrote:
 On Thursday, 28 July 2016 at 23:14:42 UTC, Walter Bright wrote:
 On 7/28/2016 3:15 AM, Johannes Pfau wrote:
 And as a philosophical question: Is code coverage in 
 unittests even a
 meaningful measurement?
Yes. I've read all the arguments against code coverage testing. But in my usage of it for 30 years, it has been a dramatic and unqualified success in improving the reliability of shipping code.
Have you read this? http://www.linozemtseva.com/research/2014/icse/coverage/
I've seen the reddit discussion of it. I don't really understand from reading the paper how they arrived at their test suites, but I suspect that may have a lot to do with the poor correlations they produced.
I think I read the paper around a year ago, my memory is fuzzy. From what I remember they analysed existing test suites. What I do remember is having the impression that it was done well.
 Unittests have uncovered lots of bugs for me, and code that was 
 unittested had far, far fewer bugs showing up after release. 
 <snip>
No argument there; as far as I'm concerned, unit tests = good thing (TM). I think measuring unit test code coverage is a good idea, but only so it can be looked at to find lines that really should have been covered but weren't.

What I take issue with is two things:

1. Code coverage metric targets (especially if the target is 100%). This leads to inane behaviours such as "testing" a print function (which itself was only used in testing) to meet the target. It's busywork that accomplishes nothing.

2. Using the code coverage numbers as a measure of unit test quality. This was always obviously wrong to me. I was glad that the research I linked to confirmed my opinion, and as far as I know (I'd be glad to be proven wrong), nobody else has published anything to convince me otherwise.

Code coverage, as a measure of test quality, is fundamentally broken. It measures coupling between the production code and the tests, which is never a good idea. Consider:

int div(int i, int j) { return i + j; }
unittest { div(3, 2); }

100% coverage, utterly wrong. Fine, no asserts is "cheating":

int div(int i, int j) { return i / j; }
unittest { assert(div(4, 2) == 2); }

100% coverage. No check for division by 0. Oops.

This is obviously a silly example, but the main idea is: coverage doesn't measure the quality of the sentinel values. Bad tests serve only as sanity tests, and the only way I've seen so far to make sure the tests themselves are good is mutant testing.

Atila
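
As a hedged follow-up sketch (hypothetical code, just one possible way to pin down the zero-divisor contract that the asserting test above still misses):

import std.exception : assertThrown, enforce;

int div(int i, int j)
{
    enforce(j != 0, "division by zero");  // make the contract explicit
    return i / j;
}

unittest
{
    assert(div(4, 2) == 2);
    assert(div(7, 2) == 3);   // integer division truncates
    assertThrown(div(1, 0));  // the case coverage never forced us to think about
}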
Aug 04 2016
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/4/2016 12:04 PM, Atila Neves wrote:
 What I take issue with is two things:

 1. Code coverage metric targets (especially if the target is 100%).  This leads
 to inane behaviours such as "testing" a print function (which itself was only
 used in testing) to meet the target. It's busywork that accomplishes nothing.
Any metric that is blindly followed results in counterproductive edge cases. It doesn't mean the metric is pointless, however, it just means that "good judgment" is necessary. I don't think anyone can quote me on a claim that 100% coverage is required. I have said things like uncovered code requires some sort of credible justification. Being part of the test harness is a credible justification, as are assert(0)'s not being executed. Leaving the multi-codepoint Unicode pathway untested probably has no credible justification.
 2. Using the code coverage numbers as a measure of unit test quality. This was
 always obviously wrong to me, I was glad that the research I linked to
confirmed
 my opinion, and as far as I know (I'd be glad to be proven wrong), nobody else
 has published anything to convince me otherwise.

 Code coverage, as a measure of test quality, is fundamentally broken. It
 measures coupling between the production code and the tests, which is never a
 good idea. Consider:
All that means is code coverage is necessary, but not sufficient. Even just executing code and not testing the results has *some* value, in that it verifies that the code doesn't crash, and that it is not dead code.

----

One of the interesting differences between D and C++ is that D requires template bodies to have valid syntax, while C++ requires template bodies to be both syntactically correct and partially semantically correct. The justification for the latter is so that the user won't see semantic errors when instantiating templates, but I interpret that as "so I can ship templates that were never instantiated", a justification that is unsupportable in my view :-)
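
A minimal hypothetical sketch of that difference: the body below is syntactically valid, so D accepts the declaration as-is, and the semantic error only surfaces once the template is actually instantiated (and thus exercised by a test).

auto callFrob(T)(T x)
{
    // Semantically wrong unless T has a frob() member, but D only requires
    // the body to parse until the template is instantiated.
    return x.frob();
}

unittest
{
    // Compiles as long as the next line stays commented out; uncommenting
    // it triggers the error, because int has no frob():
    // callFrob(42);
}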
Aug 04 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
In adding some overflow detection to Phobos, I discovered that some allocations 
were never called by the unittests. Adding a unittest for those paths, I 
discovered those paths didn't work at all for any cases.

I'm not giving up coverage testing anytime soon, regardless of what some study 
claims :-)
Aug 04 2016
next sibling parent Atila Neves <atila.neves gmail.com> writes:
On Friday, 5 August 2016 at 02:37:35 UTC, Walter Bright wrote:
 In adding some overflow detection to Phobos, I discovered that 
 some allocations were never called by the unittests. Adding a 
 unittest for those paths, I discovered those paths didn't work 
 at all for any cases.

 I'm not giving up coverage testing anytime soon, regardless of 
 what some study claims :-)
:) Like I said, measuring coverage is important, what isn't is using it as a measure of the quality of the tests themselves. The other important thing is to decide whether or not certain lines are worth covering, which of course you can only do if you have the coverage data! Mutant testing could have found those code paths you just mentioned, BTW: you'd always get surviving mutants for those paths. Atila
Aug 05 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, August 04, 2016 19:37:35 Walter Bright via Digitalmars-d wrote:
 In adding some overflow detection to Phobos, I discovered that some
 allocations were never called by the unittests. Adding a unittest for those
 paths, I discovered those paths didn't work at all for any cases.

 I'm not giving up coverage testing anytime soon, regardless of what some
 study claims :-)
Well, like you said in the previous post, code coverage is important but it's not sufficient. It's always a bad sign when the code coverage isn't 100% (which is part of why we'd like the metrics to actually be accurate), but while testing everything is a huge step in the right direction, you have to have thorough tests to actually verify that the code behaves correctly. And since you can never test everything, it does become a bit of an art to figure out how to sufficiently test a function without going overboard. Too many folks don't test sufficiently IMHO, but then again, I probably tend to go overboard. Still, I've found that being very thorough with unit tests seriously reduces the number of bugs (e.g. very few bugs have ever been reported for std.datetime). Regardless, the fact that D has unit testing built in is a huge win, and even if too many folks don't test their code thoroughly, the fact that it's so easy to add tests to your program makes it embarrassing when they don't do at least _some_ testing, and that improves the quality of D code overall even if there's still plenty of code that isn't necessarily at the level of quality that we'd like. - Jonathan M Davis
Aug 05 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/5/2016 12:10 PM, Jonathan M Davis via Digitalmars-d wrote:
 [...]
Yes, I pretty much agree with your entire post.
Aug 05 2016
parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, August 05, 2016 14:10:02 Walter Bright via Digitalmars-d wrote:
 On 8/5/2016 12:10 PM, Jonathan M Davis via Digitalmars-d wrote:
 [...]
Yes, I pretty much agree with your entire post.
I just wish that you'd agree that issues like these https://issues.dlang.org/show_bug.cgi?id=14855 https://issues.dlang.org/show_bug.cgi?id=14856 https://issues.dlang.org/show_bug.cgi?id=14857 should be fixed so that the code coverage metrics would be more accurate. :) Without those fixes, most of the code that I write will never have 100% code coverage even if the tests do test everything and test it very thoroughly. - Jonathan M Davis
Aug 05 2016