"[T]he concept of a coverage report is itself fatally flawed" wrote John Casey in his blog post Testing: Coverage Reports Considered Dangerous. Deborah Hartmann was a bit less over the top in her follow up post Code Coverage Stats Misleading. These blog posts decrying code coverage have it wrong, but they do contain a grain of truth. Let me touch on some of the points I think deserve another look.
The Problems
John was working on a large refactoring of Maven, and was relying on code coverage reports to let him know how complete his tests were. He wrote:
I had nearly perfect coverage numbers on the parts I felt were critical - not on constructing exceptions where the constructor consists of super( message, error ), mind you - and felt fairly confident that this plugin would almost drop in as a replacement of the old incarnation. As a first test, I attempted to create the assembly for Maven itself, at which point the whole plugin fell apart. After battling several NullPointerExceptions, I finally managed to build this fairly straightforward assembly, but by this time I was badly shaken. I had the coverage numbers, but I was starting to see that they weren't telling me the whole story.
John's real problem wasn't that the code coverage reports were wrong, just that he was looking at them for the wrong information. It's like checking your tachometer to see how fast you're going. Coverage reports can't tell you what is tested, just what's not tested.
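Here's a contrived (and entirely hypothetical) test/unit example of what I mean: every line of the method gets executed, so a coverage tool will cheerfully report 100%, yet nothing about the behavior is actually verified.

    require "test/unit"

    class Discount
      # Orders of 100 or more items ship at half price.
      def price(unit_price, quantity)
        return (unit_price * quantity) / 2 if quantity >= 100
        unit_price * quantity
      end
    end

    class DiscountTest < Test::Unit::TestCase
      def test_price
        d = Discount.new
        d.price(10, 5)    # executes the normal branch...
        d.price(10, 100)  # ...and the bulk branch, so coverage reads 100%
        # ...but there are no assertions, so nothing is really tested
      end
    end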
Having hit a significant failure with code coverage, John went on to say:
"Bumping the progress bar of your coverage report up to account for a line of code which has been touched by one test is a fatally flawed concept, and it can lead to misperceptions on the parts of managers and developers alike"
"the worst thing you can do is to look at a test coverage report"
One of Deborah's commenters wrote: "We consider those test results internal to our development team and actually never show them to management or clients."
Really though, the problem here isn't that the coverage results are wrong, just that they're easily misread, especially in isolation (Deborah herself pointed this out). Coverage results are important, but they only tell part of what you (and your management and clients) need to hear.
"The sky is falling!" Well, okay, no one actually said it. Just for the record, it's not true either.
A Proposed Solution
There is some hope though. If we think about our tests in terms of expected behavior, as per BDD, we'll be a lot better off. Using this approach, we can write our tests around the specification, so that we know we're covering our desired functionality and the documented edge cases. With good reports, used in conjunction with code coverage reports, we can show ourselves (and others) that we're testing the code's expected functionality and ensuring that we're not leaving code paths uncovered. (By the way, code coverage reports can also show us dead code hiding in our projects; see my article at Linux Journal for an example of this.)
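To make that concrete, here's a rough sketch of a spec written around expected behavior. (Stack and StackUnderflowError are made-up names for the example, and the exact RSpec syntax will vary with the version you're running.)

    describe "an empty stack" do
      before(:each) { @stack = Stack.new }

      it "should be empty" do
        @stack.should be_empty
      end

      it "should complain when popped (a documented edge case)" do
        lambda { @stack.pop }.should raise_error(StackUnderflowError)
      end
    end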
Speaking of nice reports, I showed off the RSpec HTML test report at RubyConf*MI. It got great reviews. If you're interested in trying out BDD, RSpec has a lot of upside. I'd love to see this kind of report built from test/unit output as well.
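If you want to try the HTML report yourself, here's roughly how I'd wire it up with RSpec's rake task; the option names may differ a bit between RSpec versions.

    require 'spec/rake/spectask'

    Spec::Rake::SpecTask.new(:spec) do |t|
      t.spec_files = FileList['spec/**/*_spec.rb']
      # Send the results to an HTML report instead of the console.
      t.spec_opts = ['--format', 'html:spec_report.html']
    end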
On a related note, I don't think any set of unit tests (TDD or BDD) is going to catch all the bugs that lurk in our code. Gregory Brown had some good advice about this that didn't make it into our recent interview: "[W]rite tests to reproduce any bug reports [you] have. This seems to be a solid way to improve test coverage, and avoid the problem of recurrent bugs."
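As an example (the class name and ticket number here are invented for illustration), the bug report becomes a test before it becomes a fix:

    require "test/unit"

    class TagParserRegressionTest < Test::Unit::TestCase
      # Hypothetical ticket #123: parse("") raised NoMethodError instead of
      # returning an empty list. Write the failing test first, then fix the
      # code, and the bug can't quietly come back.
      def test_empty_input_returns_empty_list
        assert_equal [], TagParser.new.parse("")
      end
    end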
I agree that code coverage stats should be used to determine where tests are missing, not to tell when testing is complete. And absolutely, every bug should be documented with a test. Great advice.
From what I can tell, what all the tools mentioned in any of these articles measure is statement coverage. That is, they just record if a particular statement has been executed or not.
Statement coverage is about the weakest coverage measure there is. There are numerous others that subsume it, notably path coverage, which, alas, suffers from a combinatorial explosion.
But no matter how strong your coverage measure, it does not prove the absence of faults in the tested code.
Sean:
Thanks for the comments (and the support).
Michael:
I'm not sure what tools the original articles were using, but I only know of rcov for Ruby. rcov only does statement coverage (sometimes called C0 coverage).
I'm not aware of any coverage analysis tools (for Ruby) that do anything else.
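For what it's worth, here's roughly how I hook rcov into a Rakefile (the attribute names may vary a little between rcov versions):

    require 'rcov/rcovtask'

    Rcov::RcovTask.new do |t|
      t.test_files = FileList['test/test_*.rb']
      t.output_dir = 'coverage'  # HTML report highlighting the lines that never ran
    end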
For a great paper on code coverage, take a look at: www.bullseye.com/coverage.html.
You're absolutely right about writing tests - they should be based on expected behaviour, not lines of code.