Lumps 'n' Bumps

Why you can ignore reviews of scientific code by commercial software developers


tl;dr: Many scientists write code that is crappy stylistically, but which is nevertheless scientifically correct (following rigorous checking/validation of outputs etc). Professional commercial software developers are well-qualified to review code style, but most don’t have a clue about checking scientific validity or what counts as good scientific practice. Criticisms of the Imperial Covid-Sim model from some of the latter are overstated at best.

Update (2020-06-02): The CODECHECK project has independently reproduced the results of one of the key reports (“Report 9”) that was based on the Imperial code, addressing some of the objections raised in the spurious “reviews” that are the subject of this article.

I’ve been watching with increasing horror as the most credible providers of scientific evidence and advice surrounding the Coronavirus outbreak have come under attack by various politically-motivated parties — ranging from the UK’s famously partisan newspapers and their allied politicians, to curious “grassroots” organisations that have sprung up overnight, to armies of numerically-handled Twitter accounts. While there will surely be cause for a thorough review of the UK’s SAGE (Scientific Advisory Group for Emergencies) system at some point very soon, it seems clear that a number of misinformation campaigns are in full flow, trying to undermine and discredit this important source of independent, public-interest scientific advice in order to advance particular causes. Needless to say, this is dangerous — discrediting the very people we most need to listen to could have dire consequences.

The strategies being used to undermine SAGE advisers will be familiar to anyone who has worked in fields related to climate change or vaccination in recent decades. I will focus on one in particular here — the use of “experts” in other fields to cast doubt on the soundness of the actual experts in the field itself. In particular, this is an attempt to explain what’s so problematic about articles like this one, which are being used as ammunition for disingenuous political pieces like this one [paywall]. Both articles are clearly written with a particular political viewpoint in mind, but they have a germ of credibility in that the critique is supposedly coming from an expert in something that seems relevant. In this case, the (anonymous) expert claims to be a professional software developer with 30 years’ experience, including working at a well-regarded software company. They are credentialled, so they must be credible, right? Who better to review the Imperial group’s epidemiology code than a software developer?

Most of what I’m going to say is a less succinct restatement of an old article by John D. Cook, a maths/statistics/computing consultant. Cook explains how scientists use their code as an “exoskeleton” — a constantly-evolving tool to help themselves (and perhaps a small group around them) answer particular questions — rather than as an engineered “product” intended to solve a pre-specified problem for a separate set of users who will likely never see or modify the code themselves. While both scientists and software developers may write code for a living — perhaps even using the same programming language and similar development tools — that doesn’t mean they are trying to achieve similar aims with their code. A software developer will care more about maintainability and end-user experience than a scientific coder, who will likely prize flexibility and control instead. Importantly, this means that programming patterns and norms that work for one may not work for the other — exhortations to keep code simple, to remove cruft, and to limit the number of parameters and settings might actually interfere with the intended applications of a scientific code, for example.

Software development heuristics that don’t apply to scientific code

The key flaw in the “Lockdown Sceptics” article is that it applies software engineering heuristics to assess the quality of the code when they simply don’t apply here. As someone who has spent a good fraction of my scientific career writing and working with modelling codes of different types, let me first try to set out the desirable properties of a high-quality scientific code. For most modelling applications, what matters most is that the code is scientifically correct, flexible enough to explore new questions as they arise, and under the close control of its expert users.

Note that I have in mind the kinds of codes used by specialist groups that are usually seeking to model certain classes of phenomena. There are other types of scientific code intended for more general use with goals that hew closer to what software engineers are generally trying to achieve. The Imperial code does not fall into this second category however.

What things are missing from this list that would be high priority for a professional software developer? Things like comprehensive end-user documentation, rigorous validation of inputs, extensive automated test suites, and clean, maintainable code.

These omissions will horrify most software developers (and I know quite a few — I was active in the FOSS movement for a solid decade; buy my book etc etc). Skipping these things is terrible practice if you’re developing software for end-users. But for scientific software, it’s not so important. If you have users other than yourself, they will figure things out after a while (a favourite project for starting grad students!) or email you to ask. If you put in invalid inputs, your testing and other types of scientific examination of the results will generally uncover the error. And, really, who cares if your code is ugly and messy? As long as you are doing the right things to properly check the scientific results before publishing them, it doesn’t matter if you wrote it in bloody Perl with Russian comments — the quality of the scientific results is what matters, not the quality of the code itself. This is well understood throughout the scientific community.

In summary, most scientific modelling codes are expected to be used by user-developers with extensive internal knowledge of the code, the model, and the assumptions behind it, who routinely perform a wide variety of checks for correctness before doing anything with the results. In the right hands, you can have a lot of confidence that sensible, rigorous results are being obtained; these codes are not intended for non-expert users, however.

Specific misunderstandings in the “Lockdown Sceptics” article

I will caveat this section with the fact that I am an astrophysicist and not an epidemiologist, so I can’t critique the model assumptions, or even really the extent to which the model has been implemented well in the Imperial code. I can explain where I think the Lockdown Sceptics article has missed the point of this kind of code, though.

Non-deterministic outputs: This is the most important one, as it could, in particular circumstances, be a valid criticism. The model implemented by this code is a stochastic model, and so is expected to produce outputs with some level of randomness (it is exploring a particular realisation of some probability distribution; running it many times will allow us to reconstruct that distribution, a method called Monte Carlo). Computers deal in “pseudo-randomness” though; given the same starting “seed” they will produce the same random-looking sequence. A review by a competing group in Edinburgh found a bug that resulted in different results for the same seed, which is generally not what you’d want to happen. As you can see at that link, a developer of the Imperial code acknowledged the bug and gave some explanation of its impact.
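To make the seed/pseudo-randomness point concrete, here is a minimal sketch in Python. The toy model is invented purely for illustration (the real code is vastly more complex, and nothing here reflects its actual interface): a fixed seed reproduces a run exactly, while varying the seed over many runs reconstructs the output distribution, in the Monte Carlo fashion described above.

```python
import random
import statistics

def toy_outbreak(seed, n_days=30, infect_prob=0.11):
    """Toy stochastic model (hypothetical; NOT the Imperial model):
    each day, each current case independently spawns a new case
    with probability infect_prob."""
    rng = random.Random(seed)  # local, explicitly seeded generator
    cases = 10
    for _ in range(n_days):
        new = sum(1 for _ in range(cases) if rng.random() < infect_prob)
        cases += new
    return cases

# Pseudo-randomness: the same seed must reproduce the same run exactly.
assert toy_outbreak(seed=42) == toy_outbreak(seed=42)

# Monte Carlo: many runs with different seeds sample the output
# distribution, which we then summarise.
results = [toy_outbreak(seed=s) for s in range(200)]
print(statistics.mean(results), statistics.stdev(results))
```

Note the use of an explicitly seeded, local generator rather than shared global random state; that is what makes same-seed runs comparable in the first place.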

The key question here is whether the bug could have caused materially incorrect results in published papers or advice. Based on the response of the developer, I would expect not. They are clearly aware of similar types of behaviour happening before, which implies that they have run the code in ways that could pick up this kind of behaviour (i.e. they are running some reproducibility tests — standard scientific practice). The bug is not unknown to them. A workaround here appears to be re-running the model many times with different seeds, which is what you’d do with this code anyway; or using different settings that don’t seem to suffer from this bug. My guess is that the “false stochasticity” caused by this bug is simply inconsequential, or that it doesn’t occur with the way they normally run the code. They aren’t worried about it — not because this is a disaster they are trying to cover up, but because this is a routine bug that doesn’t really affect anything important.
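The kind of reproducibility test alluded to above can be sketched in a few lines. Everything here is hypothetical (run_model is a stand-in for one expensive simulation run, not the Imperial code's actual interface); the point is simply that re-running with an identical seed and comparing outputs is cheap to automate, and is exactly the class of check that surfaces a same-seed bug.

```python
import random

def run_model(seed):
    """Hypothetical stand-in for one expensive simulation run; the real
    code's interface will differ. Returns a single summary statistic."""
    rng = random.Random(seed)  # local generator: no shared global state
    return round(sum(rng.gauss(0, 1) for _ in range(1000)), 6)

def is_deterministic(model, seed, repeats=5):
    """Re-run the model with an identical seed and report whether every
    run matches the first. A failure here is precisely the same-seed
    behaviour found by the Edinburgh group."""
    baseline = model(seed)
    return all(model(seed) == baseline for _ in range(repeats - 1))

print(is_deterministic(run_model, seed=123))  # prints True for this model
```

In a real workflow a check like this would sit alongside the scientific validation of outputs, rather than replacing it.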

Again, this is bread and butter for scientific programming. They have seen the issue before, and so are aware of this limitation of the code. Ideally they would have fixed the bug, yes, but with this sort of code we’re not normally trying to reach a state of near-perfection ready for a point release or some such, as with commercial software. Instead, the code is used in a constantly evolving state. So, being aware of the bug, perhaps it’s just not a very high priority to fix given how they use the code. Indeed, why would they run the code in a way that triggers the bug, knowingly invalidating their results? From their behaviour (and that of the reporter from Edinburgh) alone, it’s pretty clear this is not a major, result-invalidating bug.

Undocumented equations: See above regarding the approach to documentation. It would definitely be much more user-friendly to document the equations, but does it mean that the code is bad? No. For all we know, there is a scruffy old LaTeX note explaining the equations, or they are written up in one of the group’s early papers (both are common). This is totally normal — ugly, and not helpful for the non-expert trying to make sense of the code, but not an indicator of poor code quality.

Continuing development: As per the above, scientific codes of this kind generally evolve as they need to, rather than aiming for a particular release date or set of features. Continuing development is the norm, and bugfixes are applied as and when problems crop up. Serious issues that affect previously published results would normally prompt an erratum (e.g. see this one of mine); some scientists are less good about issuing errata (or corrective follow-up papers) than others, especially for more minor issues, although covering up a really serious issue would be a career-ending ethical violation for most. As I hope is clear from the above, though, the article’s charge of serious “quality problems” isn’t actually borne out; the Imperial group is just (harmlessly!) violating norms that the author is used to from a completely different field.

The article contains a number of other misunderstandings along similar lines.

I could go on, but hopefully this is enough to establish my case — the author of that article is out of their depth, and clearly unaware of many of the basics of numerical modelling or the way this kind of science is done (with great historical success across many fields, might I add). In fact, they are so far out of their depth that they don’t even realise how silly this all sounds to someone with even a cursory knowledge of the field — it is an almost perfect study in the Dunning-Kruger effect. How they reached the conclusion that scientists must be so incompetent that “all academic epidemiology [should] be defunded” and that “This sort of work is best done by the insurance sector” is truly astonishing — there is a remarkable arrogance in overlooking the possibility that, just maybe, the failure is in their own understanding.

The peculiarly implausible nature of the accusations

What I have discussed above is a (by no means complete) explanation of how many, if not most, scientific modellers approach their job. I have no special insight into the workings of the Imperial group; this is more an attempt to explain some of the sociology, attitude, and norms of quantitative scientific modelling. Professional software developers will hate some of these norms because they are bad end-user software engineering — but this doesn’t actually matter, since scientific correctness of a code typically owes little to the state of its engineering, and we have a very different notion of who the end user is compared to a company like Google. Instead, we have our own well-worn methods of checking that our codes — our exoskeletons — are scientifically correct in the most pertinent ways, backed up by decades of experience and rabid, competitive cross-checking of results. This, really, is all that matters in terms of scientific code quality, whether you’re publishing prospective theories of particle physics or informing the public health response to an unprecedented pandemic.

Let’s not lose sight of the bigger picture here though. The point of the Lockdown Sceptics article is to challenge the validity of the Imperial code by painting it as shoddy, and presumably, therefore, to undermine the basis of particular actions that may have taken note of SAGE advice. For this to actually be the case, though, the Imperial group must (a) have evaded detection for over 10 years by a global community of competing experts; (b) have been almost criminally negligent as scientists, ignoring easily-discovered but consequential bugs; (c) have been almost criminally arrogant in supposing that their unchecked/flawed model should be used to inform such big decisions; and (d) the entire scientific advisory establishment must have been taken for a ride, without any thought to question what they were being told. Boiling this all down, the author is calling several hundred eminent scientists — in the UK and elsewhere — complete idiots, while they, through maybe half an hour of cursory inspection, have found the “true, flawed” nature of the code through a series of noddy issues.

This is clearly very silly, as even an ounce of introspection would have revealed to the author of that article, had they started out with innocent motives. Their “code review” is no such thing, however — instead, it is a blatant hatchet job, of the kind we have come to expect from climate change deniers and other anti-science types. The article author’s expertise in software development (which I will recognise, despite their anonymity) is of entirely the wrong type to meaningfully review this codebase, and has clearly been misapplied. You may as well ask a Java UI programmer to review security bugs in the Linux kernel. To then rejoice in the fact that it has been picked up by an obviously lop-sided wing of the press, and used to push a potentially harmful agenda against the best scientific evidence we have, is chilling.