Why good metrics values do not equal good quality - codecentric AG Blog


Quite regularly, codecentric’s experts perform reviews and quality evaluations of software products. For example, clients may want to get an independent assessment of a program they had a contractor develop. In other cases, they request an assessment of software developed in-house to get an understanding of its current level of quality.

There is often an implicit assumption that automatic analysis tools alone can give a reliable impression of quality and maintainability, saving the cost and effort of a manual review. Using a simplified example, we are going to explain why this is a fallacy and why an automatically derived set of metrics cannot be a viable replacement for the manual process.

Metrics and Tools

In fact, at the beginning of most analyses there is a step of collecting some base metrics automatically, to get a first superficial impression of the software under inspection. Usually at this early stage one uses simple counts – e. g. to get an idea of the product’s size (number of packages, classes, methods, lines of code) – as well as common quality metrics, for example the cyclomatic complexity.

These values can be quickly calculated using several free or commercial tools and are based on the source code and compiled Java classes.

Once these metrics have been measured, they can be compared to well-known references, e. g. those of Carnegie Mellon University for cyclomatic complexity.

Cyclomatic Complexity

The purpose of this metric is to get an assessment of the complexity – and therefore indirectly the maintainability – of a piece of software.

The aforementioned reference values from Carnegie Mellon define four rough ranges for cyclomatic complexity values:

  • methods between 1 and 10 are considered simple and easy to understand and test
  • values between 10 and 20 indicate more complex code, which may still be comprehensible; however, testing becomes more difficult due to the greater number of possible branches the code can take
  • values of 20 and above are typical of code with a very large number of potential execution paths and can only be fully grasped and tested with great difficulty and effort
  • methods going even higher, e. g. >50, are certainly unmaintainable
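As a minimal sketch, the four ranges can be written down as a small lookup. The class and enum names are hypothetical, and because the ranges overlap at 10 and 20, the boundary handling (1 to 10, 11 to 19, 20 to 50, above 50) is an assumption made for this illustration, not part of the reference values.

```java
/** Hypothetical helper; class and enum names are illustrative only. */
final class ComplexityRanges {

    enum Range { SIMPLE, MORE_COMPLEX, HARD_TO_TEST, UNMAINTAINABLE }

    /**
     * Maps a cyclomatic complexity value to one of the four rough ranges.
     * Assumption: the overlapping boundaries are resolved as
     * 1-10, 11-19, 20-50 and above 50.
     */
    static Range classify(int cyclomaticComplexity) {
        if (cyclomaticComplexity < 1) {
            throw new IllegalArgumentException("cyclomatic complexity starts at 1");
        }
        if (cyclomaticComplexity <= 10) {
            return Range.SIMPLE;
        }
        if (cyclomaticComplexity < 20) {
            return Range.MORE_COMPLEX;
        }
        if (cyclomaticComplexity <= 50) {
            return Range.HARD_TO_TEST;
        }
        return Range.UNMAINTAINABLE;
    }

    public static void main(String[] args) {
        // Print the range for a few sample values.
        for (int cc : new int[] {3, 10, 15, 20, 51}) {
            System.out.println(cc + " -> " + classify(cc));
        }
    }
}
```

A lookup like this only labels individual methods; it says nothing yet about how much those methods matter to the application.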

Complexity often increases gradually over the lifetime of a code base as new features are added and existing code is modified. Over time new code is introduced into the system, but the individual “small” changes rarely seem complex enough to warrant refactoring the affected sections of the code.

In effect the risk of introducing new bugs increases proportionally with the code’s complexity as undesirable side-effects cannot be foreseen. Theoretically this could be alleviated with a sufficient level of test coverage, but unfortunately coming up with useful test code also becomes more difficult and time-consuming for complex code. This regularly leads to test coverage becoming worse, making future changes even more error prone. This is a vicious circle that is hard to break out from.

All this leads to a simple and unsurprising conclusion: Lower complexity eases maintenance and the writing of meaningful tests, and consequently reduces the chances of introducing new bugs. It can therefore be used as an indicator for good quality.

Let’s assume the following result of a complexity analysis of a code base with 10,000 methods:

  • 96% – 9600 methods: CC < 17 : acceptable
  • 3% – 300 methods: 17 <= CC < 20 : borderline
  • 1% – 100 methods:  20 <= CC : too high

Does this mean that complexity is not a critical issue in this code base?

The answer has to be: No.

The statement of “only” 1% of all methods being reported as too complex does not carry much meaning in and of itself. There is no way to tell if those 100 methods contain central and mission critical business logic and are disproportionately important for the overall application’s quality.
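To make this concrete, here is a small sketch with made-up data: 100 methods, one of which exceeds the threshold of 20. The package names, the class `ComplexityHotspots` and all numbers are invented for illustration. The overall share is a harmless-looking 1%, while a breakdown by package shows that the single complex method sits in what we assume to be the billing core.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Toy illustration; all names and numbers are made up for this sketch. */
final class ComplexityHotspots {

    /** One measured method: owning package and cyclomatic complexity. */
    static final class MethodMetric {
        final String packageName;
        final int complexity;

        MethodMetric(String packageName, int complexity) {
            this.packageName = packageName;
            this.complexity = complexity;
        }
    }

    static final int TOO_HIGH = 20; // threshold taken from the example above

    /** Fraction of all methods whose complexity is at or above the threshold. */
    static double shareTooComplex(List<MethodMetric> metrics) {
        long tooComplex = metrics.stream().filter(m -> m.complexity >= TOO_HIGH).count();
        return (double) tooComplex / metrics.size();
    }

    /** Counts the too-complex methods per package, which the overall share hides. */
    static Map<String, Long> tooComplexByPackage(List<MethodMetric> metrics) {
        return metrics.stream()
                .filter(m -> m.complexity >= TOO_HIGH)
                .collect(Collectors.groupingBy(m -> m.packageName, TreeMap::new, Collectors.counting()));
    }

    /** Builds 99 simple utility methods plus one complex billing method. */
    static List<MethodMetric> sampleMetrics() {
        List<MethodMetric> metrics = new ArrayList<>();
        for (int i = 0; i < 99; i++) {
            metrics.add(new MethodMetric("shop.util", 1 + i % 10)); // CC between 1 and 10
        }
        metrics.add(new MethodMetric("shop.billing", 35));
        return metrics;
    }

    public static void main(String[] args) {
        List<MethodMetric> metrics = sampleMetrics();
        System.out.println("share too complex: " + shareTooComplex(metrics));
        System.out.println("too complex by package: " + tooComplexByPackage(metrics));
    }
}
```

Even this breakdown only says where the complex code lives. Whether that package is mission critical is knowledge a reviewer has to bring in; the tool cannot supply it.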

However, the complexity metric alone does not say anything about the possibly great test coverage of this critical portion of code. Thorough testing could have been deliberately introduced to verify correctness and guard against regressions despite high complexity values. But we can get more information on that topic with more tools…

Test Coverage

Several tools are available to determine test coverage, a few popular ones being Clover, Cobertura or Emma. They monitor the execution of unit tests and report on which parts of the code under test are exercised. This allows a reasonable evaluation of which percentage of a software product is covered by automated tests.

While it is difficult to proclaim a generally valid minimum degree of test coverage, because it partly depends on the application at hand – e. g. completely covering trivial bean setters and getters is not usually very useful – values of 80% or above are advised to be sufficiently confident that refactorings and modifications will not break existing functionality.

Assuming an average test coverage of 85% – esp. including the 100 complex (and allegedly important) methods mentioned above – would that not imply a reasonably good code quality, because the source code is covered by tests for the most part?

Again, the answer must be: No.

Even high levels of test coverage only prove that the execution paths that are exercised by the tests are run at least once and with a particular set of test data. Even though the coverage tools do record the number of times each branch gets executed, for it to be “covered” just requires a single execution.
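The following sketch illustrates the point with a hypothetical method `percentage`, invented for this example. It has a single execution path, so one passing check such as `percentage(1, 4) == 25` would be enough for a coverage tool to report the method as fully covered, yet other inputs still fail.

```java
/** Hypothetical example; the class and method names are invented for this sketch. */
final class CoverageIllustration {

    /**
     * Returns part as a whole-number percentage of total.
     * There are no branches, so a single call executes every line.
     */
    static int percentage(int part, int total) {
        return part * 100 / total;
    }

    public static void main(String[] args) {
        // The one "test" that yields full coverage of percentage().
        System.out.println(percentage(1, 4)); // prints 25

        // Input the covering test never tried: the multiplication overflows int,
        // so the result is a negative number instead of 50.
        System.out.println(percentage(30_000_000, 60_000_000));

        // Another untested input: integer division by zero throws.
        try {
            percentage(1, 0);
        } catch (ArithmeticException e) {
            System.out.println("fails for total == 0: " + e.getMessage());
        }
    }
}
```

Coverage counts executed code, not the inputs it was executed with, so both defects stay invisible in the report.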

Moreover, 85% coverage leaves 15% uncovered, and there is no immediate indication of which parts make up that 15%. Quite often it is code for error conditions or exception handling, where lurking bugs can have especially nasty consequences.

and so on…

Everything that has been said so far can be applied to virtually all calculated metrics: Every automated analysis process can at most produce hints as to which parts of the code should be targeted for a manual review. These hints provide starting points and allow a directed approach to large projects, but looked at in isolation they are never sufficient and can even be misleading.

In a recent case, good or sometimes even very good results of the initial automated metrics analysis runs, including – among others –  cyclomatic complexity and Robert C. Martin’s metrics about levels of coupling and abstraction, conveyed a rather positive first impression of the subject project.

Even further diagnostics using static analysis tools like Checkstyle, FindBugs or Sonar did not report unusually high numbers of problems, relative to the overall size of the software product, and those issues that were reported would mostly have been rather easy to fix.

But despite the seemingly uncritical results of all tool runs, at the end of the review process we had found a number of severe problems in the code base that clearly prohibited the customer from going live with the new product. These included, among others, fundamental issues with concurrency, useless caches, severe flaws in error and exception handling, and obvious performance problems (unnecessary but frequent calls to remote services in tight loops).
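To show how such a problem can hide behind good metric values, here is a hypothetical sketch. It is not the code from the review, and every name in it is invented. Both methods contain a single loop and therefore have a cyclomatic complexity of 2, yet one of them makes a remote round trip per element. The remote service is simulated by a class that merely counts calls.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the reviewed code; every name here is invented. */
final class RemoteCallsInLoop {

    /** Stand-in for a remote service; it only counts round trips. */
    static final class CountingPriceService {
        int roundTrips = 0;

        /** One simulated round trip per article. */
        int priceOf(int articleId) {
            roundTrips++;
            return articleId * 10;
        }

        /** One simulated round trip for the whole batch. */
        List<Integer> pricesOf(List<Integer> articleIds) {
            roundTrips++;
            List<Integer> prices = new ArrayList<>();
            for (int id : articleIds) {
                prices.add(id * 10);
            }
            return prices;
        }
    }

    /** One loop, cyclomatic complexity 2, but one remote call per element. */
    static int totalOneByOne(CountingPriceService service, List<Integer> articleIds) {
        int total = 0;
        for (int id : articleIds) {
            total += service.priceOf(id);
        }
        return total;
    }

    /** One loop, cyclomatic complexity 2, and a single remote call. */
    static int totalBatched(CountingPriceService service, List<Integer> articleIds) {
        int total = 0;
        for (int price : service.pricesOf(articleIds)) {
            total += price;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(1, 2, 3, 4, 5);
        CountingPriceService first = new CountingPriceService();
        CountingPriceService second = new CountingPriceService();
        System.out.println(totalOneByOne(first, ids) + " in " + first.roundTrips + " round trips");
        System.out.println(totalBatched(second, ids) + " in " + second.roundTrips + " round trips");
    }
}
```

A complexity tool would rate both variants identically. Telling them apart requires knowing that `priceOf` crosses the network, which is the kind of context a manual review provides.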

Bottom Line

Judging the quality of a software product – and consequently the risk when using it in production – by tool-based measurements and metrics alone can easily lead to false conclusions.

Too many factors that influence the actual quality of a solution cannot reliably, if at all, be evaluated automatically. Despite lots of great and proven tools being readily available and even free to use, their results still require careful evaluation – they must be seen as the indicators that they are, not comprehensive and final statements about quality. They can only lead the way and hint at where it might be sensible to focus a manual review.

In the case mentioned above, using the software in production would have had far-reaching and potentially critical consequences, because data could have been corrupted silently or the system might have crashed completely.

Though manual reviews and checks cannot guarantee error-free software, even in the IT business experience and intuition – luckily – still cannot be replaced with tools.