
Quality assurance is structured attempts at falsification

I first had the idea for this article in January 2021. Around that time, or perhaps some years prior, I had read Karl Popper, a philosopher of science whose work strongly resonates with me. A core element of his ideas is falsification, and he builds a framework around it for how to conduct scientific work.

Applying these thoughts to my own work as a software engineering professional, I had the revelation that they fit well as a way to think about quality assurance tooling: both what quality assurance tooling and practices can provide, and where their limits lie.

The black swan problem

A helpful analogy for understanding the problem that Popper's ideas address is the black swan problem. The term originates from a 2nd century Roman satirical poem containing the phrase "a bird as rare upon the earth as a black swan", written under the assumption that all swans are white. Extending this to the practice of science, and empirical studies specifically, we can ask: how many swans do we need to observe before concluding that they are all white?

It took Europeans until the year 1697 to first observe black swans in Australia, thereby disproving the long-held belief that they do not exist.

The answer that Popper's work provides to the question is that we cannot regard empirical studies as providing proof in this way at all. The only thing that can be concluded from observing 10,000 white swans, or 100,000 white swans, is that exactly the swans that were observed were white. It gives no indication that all swans are white, and it does not matter how large the number of observations is.

The statement that "all swans are white" is said to be falsifiable, because it can be disproved by empirical study. While it is impossible to prove that all swans are white by making empirical observations, it only takes a single observation of a black swan to prove that the statement is false.

An example of a statement that is not falsifiable is the claim that "all human beings are mortal". While the overwhelming number of observations seems to show clearly that it is true, concluding that this is the case requires an inductive step: inferring, from all observations we have made so far, something about all future observations. This lies outside the limits of what science can provide in terms of knowledge creation; in other words, it is not sound to conclude that this statement is true.

If this seems counterintuitive, let's consider the statement "the sun rises every morning". It is very similar to "all human beings are mortal", and similarly, humans have observed the sun rise every day of their lives for a very long time. Yet if it were possible to prove this statement true by the scientific method, we would clearly have a broken system of science, as we know today that the statement is not true, and that the sun will rise for the last time in about 7 or 8 billion years from today.

It cannot be proven that the program is correct

In software engineering, it is a widely accepted truth that there is no such thing as a program free from deficiencies and bugs. Yet it is also widely accepted that good quality programs are well tested at multiple levels, with both unit tests of small internal parts and integration tests that exercise how the program behaves in an environment closer to the real world. We also make wide use of advanced automated tooling to find problematic constructs and bugs. Finally, it is widely accepted that to produce quality software, peer review is a key component in identifying deficiencies.

The key understanding to gain from applying Popper's falsification paradigm to these practices is that we are not proving programs to be correct. That, in fact, is in the general case impossible¹.

A better way to think about quality assurance practices, both automated and manual, is as attempts to falsify a specific version of a piece of software. When a software engineer is reviewing a colleague's work, they are not, strictly speaking, verifying the correctness of the program. Rather, they are trying to falsify the claim that the software is correct. The difference, as discussed in the previous section, is that the latter can actually succeed: a single counterexample proves the claim false.

In lieu of being able to prove that our work is correct, we apply a standardized and continuously expanded set of falsification attempts, and deem that we have done what we feasibly can to reduce deficiencies, rather than completely eliminating them.

This is a more honest way to think about quality assurance practices, as it also recognizes that the level of assured quality is a function of the imagination of the people involved in creating automated testing and reviewing software, and of the culture that surrounds them.

Large state spaces

The software that most software engineers work on grows in complexity over time, and the state space it operates on is typically also large and growing. Very few programs can be exhaustively tested with all possible inputs in a reasonable amount of time. Feature flagging does not help in this regard: every flag added to a program multiplies the number of possible configurations, leading to an ever-growing state space. In this sense, software engineers are fighting a losing battle against growing complexity.
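To make the combinatorial growth concrete, here is a small Python sketch (the flag names are hypothetical, chosen purely for illustration). With n boolean flags, a program can run in 2^n distinct configurations:

```python
from itertools import product

# Hypothetical boolean feature flags. Each flag doubles the number
# of configurations the program can run in: n flags yield 2**n.
flags = ["new_checkout", "dark_mode", "beta_search"]

configurations = list(product([False, True], repeat=len(flags)))
print(len(configurations))  # 2**3 = 8 configurations to cover
```

Adding a fourth flag would double this to 16, a fifth to 32, and so on; exhaustive testing quickly becomes infeasible.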

An interesting testing approach in this context, and within the topics discussed, is property testing. Contrary to traditional unit testing practices, rather than providing a fixed set of program inputs, the inputs are generated from random data, and the program is tested to always uphold a certain property for any data generated.

As a very basic example, let's consider a function add that takes numbers x and y as parameters and returns their sum. A property we can test of this function is that subtracting y from the returned value always equals x.
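Using the Hypothesis library for Python, this example could be sketched as follows (a minimal sketch; the property is exactly the one described above):

```python
from hypothesis import given, strategies as st

def add(x, y):
    """Return the sum of x and y."""
    return x + y

# Property: for any integers x and y, subtracting y from add(x, y)
# must give back x. Hypothesis generates the inputs randomly.
@given(st.integers(), st.integers())
def test_add_then_subtract_roundtrips(x, y):
    assert add(x, y) - y == x

# Calling the decorated function runs it against many generated inputs.
test_add_then_subtract_roundtrips()
```

Each run exercises a fresh batch of randomly generated integer pairs, rather than the handful of hand-picked values a traditional unit test would use.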

This approach to testing is rarely sufficient to replace traditional unit tests, but rather works in a complementary fashion, and allows automated testing to cover a much greater area of large input spaces.

I have made use of this recently in my day-to-day work: in cases where it was infeasible to test all possible inputs, adding property tests allowed us to identify existing deficiencies that would have occurred in our production environments.

Circling back to the topic at hand, how can we reason about the role and justification of property testing practices? After all, it is clear that a property-based test, too, does not prove the absence of bugs, not even for the property it is testing.

The way I reason about this is twofold. First: property-based tests extend my human ability to come up with and imagine useful test cases for a piece of software. More than a few times when I have turned to property testing, the tests have found true positives that were not what I expected to find.

The other way is perhaps the same as the first, but formulated from a different perspective. We can think of all the possible inputs to a program or function as points in a 2-dimensional space. For a trivial function, it may be possible to exercise every point in such a space, but for any non-trivial function we can think of the space as containing an infinite number of points. Using traditional unit testing practices, we cover some arbitrarily chosen points that we, by heuristics, think make sense. We might even be able to cover all points in certain areas of the space. We can visualize bugs and deficiencies similarly in this 2D space: some bugs will only surface at exactly one point, while others will surface across a certain area.

Using property tests, we exercise a uniformly distributed set of points across this 2D space. We should have very little hope of finding the bugs that only surface at a single point. However, for any bug that surfaces in an area of the space, the possible size of that area for a not yet discovered deficiency shrinks as we increase the number of samples.
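This shrinking effect can be made concrete with a little probability: if a bug surfaces in a fraction p of the input space, the chance that n uniformly random samples all miss it is (1 - p)^n, which falls quickly as n grows. A small sketch (the numbers are illustrative):

```python
# Probability that n uniform random samples all miss a bug region
# covering fraction p of the input space: each sample independently
# misses the region with probability (1 - p).
def miss_probability(p: float, n: int) -> float:
    return (1 - p) ** n

# A bug affecting 1% of inputs survives 100 samples about a third
# of the time, but is almost certainly found after 1000 samples.
print(miss_probability(0.01, 100))   # ~0.366
print(miss_probability(0.01, 1000))  # ~0.000043
```

Note the caveat from the text: for a bug that surfaces at only a single point (p effectively 0), this probability stays near 1 no matter how many samples we draw.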

Hence, we can think of property tests as eliminating (or at least making less likely) the existence of bugs exercisable across an area of the state space of a certain size. While not always the case, for many aspects of software running in production environments, this can mean that only bugs above a certain level of rarity can remain, limited to exactly the property being tested.

Eliminating classes of bugs

It is also worth mentioning that, while it is not possible to prove the correctness of an arbitrary program, it is possible to eliminate certain classes of bugs. It is also possible to turn to formal verification methods to prove the correctness of the design of a program. Static type checking and type-driven development likewise often have the ability to eliminate classes of bugs. In the 2D space analogy, this is equivalent to covering the full set of points in the state space, which is often useful.
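As a small illustration of eliminating a class of bugs with types, consider making the possible absence of a value explicit in a function's signature, so that a static checker such as mypy rejects any code path that forgets to handle the absent case (a minimal sketch; the function names and data are hypothetical):

```python
from typing import Optional

def find_user_email(user_id: int) -> Optional[str]:
    # Hypothetical lookup; returns None when the user is unknown.
    users = {1: "ada@example.com", 2: "alan@example.com"}
    return users.get(user_id)

def email_domain(user_id: int) -> str:
    email = find_user_email(user_id)
    # The Optional[str] return type forces this None check under a
    # static type checker; forgetting it would be flagged before the
    # program ever runs. The whole class of "missing value" bugs is
    # eliminated for every point in the input space, not just the
    # points a test happens to sample.
    if email is None:
        return "unknown"
    return email.split("@")[1]

print(email_domain(1))   # example.com
print(email_domain(99))  # unknown
```

In the 2D space analogy, the type checker covers every point at once for this particular property, rather than sampling some of them.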

Human practices are important

If you have read all this way, I hope that instead of losing trust in science and what it can provide, you take away this: human processes and imagination are the key drivers. That is as true in the scientific process as it is in conducting quality assurance within software engineering.

There are interesting tools, like the ones mentioned, that can help us extend our human ability to imagine falsification attempts, but ultimately, the choice to turn to those tools is made by a human being.

To build high quality software, the imagination and discipline of the humans involved, and the encouragement of the culture they are in, are the two key factors. You cannot prove that the programs you author are correct, but you can build robust processes to exercise as many falsification attempts as possible for any given version that you release.
