Why you can ignore reviews of scientific code by commercial software developers

tl;dr: Many scientists write code that is crappy stylistically, but which is nevertheless scientifically correct (following rigorous checking/validation of outputs etc). Professional commercial software developers are well-qualified to review code style, but most don’t have a clue about checking scientific validity or what counts as good scientific practice. Criticisms of the Imperial Covid-Sim model from some of the latter are overstated at best.

Update (2020-06-02): The CODECHECK project has independently reproduced the results of one of the key reports (“Report 9”) that was based on the Imperial code, addressing some of the objections raised in the spurious “reviews” that are the subject of this article.

I’ve been watching with increasing horror as the most credible providers of scientific evidence and advice surrounding the Coronavirus outbreak have come under attack by various politically-motivated parties — ranging from the UK’s famously partisan newspapers and their allied politicians, to curious “grassroots” organisations that have sprung up overnight, to armies of Twitter accounts with numeric handles. While there will surely be cause for a sturdy review of the UK’s SAGE (Scientific Advisory Group for Emergencies) system at some point very soon, it seems clear that a number of misinformation campaigns are in full flow that are trying to undermine and discredit this important source of independent, public-interest scientific advice in order to advance particular causes. Needless to say, this could be very dangerous — discrediting the very people we most need to listen to could produce very dire results.

The strategies being used to undermine SAGE advisers will be familiar to anyone who has worked in fields related to climate change or vaccination in recent decades. I will focus on one in particular here — the use of “experts” in other fields to cast doubt on the soundness of the actual experts in the field itself. In particular, this is an attempt to explain what’s so problematic about articles like this one, which are being used as ammunition for disingenuous political pieces like this one [paywall]. Both articles are clearly written with a particular political viewpoint in mind, but they have a germ of credibility in that the critique is supposedly coming from an expert in something that seems relevant. In this case, the (anonymous) expert claims to be a professional software developer with 30 years’ experience, including working at a well-regarded software company. They are credentialled, so they must be credible, right? Who better to review the Imperial group’s epidemiology code than a software developer?

Most of what I’m going to say is a less succinct restatement of an old article by John D. Cook, a maths/statistics/computing consultant. Cook explains how scientists use their code as an “exoskeleton” — a constantly-evolving tool to help themselves (and perhaps a small group around them) answer particular questions — rather than as an engineered “product” intended to solve a pre-specified problem for a separate set of users who will likely never see or modify the code themselves. While both scientists and software developers may write code for a living — perhaps even using the same programming language and similar development tools — that doesn’t mean they are trying to achieve similar aims with their code. A software developer will care more about maintainability and end-user experience than a scientific coder, who will likely prize flexibility and control instead. Importantly, this means that programming patterns and norms that work for one may not work for the other — exhortations to keep code simple, to remove cruft, to limit the number of parameters and settings, might actually interfere with the intended applications of a scientific code for example.

Software development heuristics that don’t apply to scientific code

The key flaw in the “Lockdown Sceptics” article is that it assesses the quality of the code using software engineering heuristics that simply don’t apply here. As someone who has spent a good fraction of my scientific career writing and working with modelling codes of different types, let me first try to set out the desirable properties of a high-quality scientific code. For most modelling applications, these will be:

  • Scientific correctness: A mathematically and logically correct representation of the model(s) that are being studied, as well as a correct handling and interpretation of any input data. This is distinct from “correctly fitting observations” — while finding a best-fit model to some data might be one aim of a code, the ability to explore counterfactuals is also important. A code may implement a model that is a terrible fit to the data, but can still be “high quality” in the sense that it correctly implements a counterfactual model.
  • Flexibility: The ability to add, adjust, turn on/off different effects, try different assumptions etc. These codes are normally exploratory, and will be used for studying a number of different questions over time, including many that will only arise long after the initial development. Large numbers of parameters and copious if statements are the norm.
  • Performance: Sufficient speed and/or precision to allow the scientist to answer questions satisfactorily. Repeatability and numerical precision fall under this category, as well as raw computational performance. (It is also common for scientific codes to have settings that allow the user to trade off speed vs accuracy, depending on the application.)
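
As a rough illustration of the flexibility and speed-vs-accuracy points above, here is a minimal Python sketch of a toy growth model. Everything in it (the function name simulate_growth, its parameters, and the effects being switched on and off) is hypothetical and invented for this example; it is not taken from the Imperial code or from any real epidemiological model.

```python
import math


def simulate_growth(days, rate=0.2, n0=10.0, dt=1.0,
                    use_capacity=False, capacity=1000.0,
                    use_intervention=False, intervention_day=20.0,
                    intervention_factor=0.5):
    """Toy forward-Euler growth model with switchable effects (illustrative only)."""
    n = n0
    t = 0.0
    steps = int(round(days / dt))
    for _ in range(steps):
        r = rate
        if use_intervention and t >= intervention_day:
            # Hypothetical effect: growth rate reduced after a given day.
            r *= intervention_factor
        growth = r * n
        if use_capacity:
            # Optional logistic saturation as the population nears capacity.
            growth *= 1.0 - n / capacity
        n += growth * dt
        t += dt
    return n


# Speed vs accuracy: a smaller time step costs more iterations, but lands
# closer to the exact solution n0 * exp(rate * days) of the plain model.
exact = 10.0 * math.exp(0.2 * 10)
coarse = simulate_growth(10, dt=1.0)   # 10 steps
fine = simulate_growth(10, dt=0.01)    # 1000 steps
print(f"exact={exact:.2f} coarse={coarse:.2f} fine={fine:.2f}")
```

The flags and the time step are exactly the kind of "large numbers of parameters and copious if statements" described above: ugly to a software engineer, but convenient when the next question requires switching one effect off and another on.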

Note that I have in mind the kinds of codes used by specialist groups that are usually seeking to model certain classes of phenomena. There are other types of scientific code intended for more general use with goals that hew closer to what software engineers are generally trying to achieve. The Imperial code does not fall into this second category however.

What things are missing from this list that would be high priority for a professional software developer? Here are a few:

  • Maintainability: Most scientific codes aren’t developed with future maintainers in mind. As per John Cook, they are more likely to be developed as “exoskeletons” by and for a particular scientist, growing organically over time as new questions come up. Maintainability is a nice-to-have, especially if others will use the code, but it has little bearing on the code’s scientific quality. Some scientifically valuable codes are really annoying to modify!
  • Documentation: Providing code and end-user documentation is a good practice, but it’s not essential for scientific codes. Different fields have different norms surrounding whether code is open sourced on publication or simply available on request for example, and plenty of scientists do a bad job of including comments or writing nice-to-read code. This is because the code is rarely the end product in itself — it is normally just a means to run some particular mathematical model that is then presented (and defended and dissected) in a journal article. The methods, assumptions, and consistency and accuracy of the results are what matter — the code itself can be an ugly mess as long as it’s scientifically correct in this sense.
  • User-proofing/error checking: To a software developer, a well-engineered code shouldn’t require any knowledge of internal implementation details on the part of the end user. The code should check user inputs for validity, and, to the greatest extent possible, prevent them from doing things that are wrong, or invalid, or that will produce nonsense results, for the widest possible range of possible inputs. Some level of error-checking is nice to have in scientific codes too, but in many cases the code is presented “as-is” — the user is expected to determine what correct and valid inputs are themselves, through understanding of the internals and the scientific principles behind them. In fact, a code may even, intentionally, produce an output that is known to be “wrong” in some ways and “right” in others — e.g. the amplitude of a curve is wrong, but its shape is right. In essence, the user is assumed to understand all of the (known/intended) limitations of the code and its outputs. This will generally be the case if you run the code yourself and are an expert in your particular field.
  • Formal testing: Software developers know the value of a test suite: Write unit tests for everything; throw lots of invalid inputs at the code to check it doesn’t fall over; use continuous integration or similar to routinely test for regressions. This is good practice that can often catch bugs. Setting up such infrastructure is still not the norm in scientific code development however. So how do scientists deal with regressions and so on? The answer is that most use ad hoc methods to check for issues. When a new result first comes out of the code, we tend to study it to death. Does it make sense? If I change this setting, does it respond as expected? Can it reproduce idealised/previous results? Does it agree with this alternative but equivalent approach? Again, this is a key part of the scientific process. We also output meaningful intermediate results of the code as a matter of course. Remember that we are generally dealing with quantities that correspond to something in the real world — you can make sure you aren’t propagating a negative number of deaths in the middle of your calculation for example. While these checks could also be handled by unit tests, most scientists generally just end up with their own weird set of ad hoc test outputs and print statements. It’s ugly, and not infallible, but it tends to work well given the intensive nature of our result-testing behaviour and community cross-checking.
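
To make the ad hoc checks in the last bullet a little more concrete, here is a minimal Python sketch using a textbook SIR (susceptible/infected/recovered) toy model. The model, the function names run_sir and peak_infected, and all parameter values are assumptions chosen purely for illustration; this is not how the Imperial code is structured or tested.

```python
import math


def run_sir(beta=0.3, gamma=0.1, s0=990.0, i0=10.0, r0=0.0,
            days=100, dt=0.1, verbose=False):
    """Toy forward-Euler SIR model with inline sanity checks (illustrative only)."""
    s, i, r = s0, i0, r0
    n = s + i + r
    history = []
    steps = int(round(days / dt))
    for step in range(steps):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        # Ad hoc physical checks on intermediate results:
        # no negative numbers of people, and nobody appears or vanishes.
        assert min(s, i, r) >= 0.0, f"negative compartment at step {step}"
        assert abs((s + i + r) - n) < 1e-9 * n, "population not conserved"
        if verbose and step % 100 == 0:
            print(f"step={step} S={s:.1f} I={i:.1f} R={r:.1f}")
        history.append((s, i, r))
    return history


def peak_infected(history):
    return max(i for _, i, _ in history)


# "If I change this setting, does it respond as expected?"
# A higher transmission rate should give a higher peak.
assert peak_infected(run_sir(beta=0.5)) > peak_infected(run_sir(beta=0.3))

# "Can it reproduce idealised results?" With no transmission, infections
# should simply decay roughly as i0 * exp(-gamma * t).
final_i = run_sir(beta=0.0)[-1][1]
expected = 10.0 * math.exp(-0.1 * 100)
assert abs(final_i - expected) / expected < 0.1
```

The same checks could be moved into a formal unit test suite; the point of the sketch is only that the questions being asked are physical ones that the scientist already knows the answers to.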

These last four points will horrify most software developers (and I know quite a few — I was active in the FOSS movement for a solid decade; buy my book etc etc). Skipping these things is terrible practice if you’re developing software for end-users. But for scientific software, it’s not so important. If you have users other than yourself, they will figure things out after a while (a favourite project for starting grad students!) or email you to ask. If you put in invalid inputs, your testing and other types of scientific examination of the results will generally uncover the error. And, really, who cares if your code is ugly and messy? As long as you are doing the right things to properly check the scientific results before publishing them, it doesn’t matter if you wrote it in bloody Perl with Russian comments — the quality of the scientific results is what matters, not the quality of the code itself. This is well understood throughout the scientific community.

In summary, most scientific modelling codes are expected to be used by user-developers with extensive internal knowledge of the code, the model, and the assumptions behind it, and who are routinely performing a wide variety of checks for correctness before doing anything with the results. In the right hands, you can have a lot of confidence that sensible, rigorous results are being obtained; however they are not for non-expert users.

Specific misunderstandings in the “Lockdown Sceptics” article

I will caveat this section with the fact that I am an astrophysicist and not an epidemiologist, so can’t critique the model assumptions or even really the extent to which it has been implemented well in the Imperial code. I can explain where I think the Lockdown Sceptics article has missed the point of this kind of code though.

Non-deterministic outputs: This is the most important one, as it could, in particular circumstances, be a valid criticism. The model implemented by this code is a stochastic model, and so is expected to produce outputs with some level of randomness (it is exploring a particular realisation of some probability distribution; running it many times will allow us to reconstruct that distribution, a method called Monte Carlo). Computers deal in “pseudo-randomness” though; given the same starting “seed” they will produce the same random-looking sequence. A review by a competing group in Edinburgh found a bug that resulted in different results for the same seed, which is generally not what you’d want to happen. As you can see at that link, a developer of the Imperial code acknowledged the bug and gave some explanation of its impact.

The key question here is whether the bug could have caused materially incorrect results in published papers or advice. Based on the response of the developer, I would expect not. They are clearly aware of similar types of behaviour happening before, which implies that they have run the code in ways that could pick up this kind of behaviour (i.e. they are running some reproducibility tests — standard scientific practice). The bug is not unknown. A particular workaround here appears to be re-running the model many times with different seeds, which is what you’d do with this code anyway; or using different settings that don’t seem to suffer from this bug. My guess is that the “false stochasticity” caused by this bug is simply inconsequential, or that it doesn’t occur with the way they normally run the code. They aren’t worried about it — not because this is a disaster they are trying to cover up, but because this is a routine bug that doesn’t really affect anything important.

Again, this is bread and butter for scientific programming. They have seen the issue before, and so are aware of this limitation of the code. Ideally they would have fixed the bug, yes, but with this sort of code we’re not normally trying to reach a state of near-perfection ready for a point release or some such, as with commercial software. Instead, the code is being used in a constantly evolving state. So perhaps, being aware of it, it’s just not a very high priority to fix given how they are using the code. Indeed, why would they run the code in such a way that the bug arises and knowingly invalidates their results? It’s pretty clear this is not a major result-invalidating bug from their behaviour (and the behaviour of the reporter from Edinburgh) alone.
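
For readers unfamiliar with seeds and Monte Carlo runs, here is a minimal Python sketch of the intended behaviour. The toy branching process and the name simulate_outbreak are hypothetical and exist only for this example; the sketch does not reproduce the Imperial model or the bug discussed above.

```python
import random
import statistics


def simulate_outbreak(seed, generations=8, initial_cases=5):
    """Toy branching process: each case infects 0 to 3 others at random."""
    # A private generator, so the seed alone determines the whole run.
    rng = random.Random(seed)
    cases = initial_cases
    total = initial_cases
    for _ in range(generations):
        cases = sum(rng.randint(0, 3) for _ in range(cases))
        total += cases
    return total


# Same seed, same pseudo-random sequence, same answer.
assert simulate_outbreak(42) == simulate_outbreak(42)

# Monte Carlo: many seeds give many realisations, from which the
# distribution of outcomes can be summarised.
totals = [simulate_outbreak(seed) for seed in range(200)]
print("mean total cases:", statistics.mean(totals))
print("standard deviation:", statistics.stdev(totals))
```

Any single realisation is of limited interest; the scientific result is the distribution across many runs, which is why re-running with different seeds is the normal way to use a code of this kind.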

Undocumented equations: See above regarding the approach to documentation. It would definitely be much more user-friendly to document the equations, but does it mean that the code is bad? No. For all we know, there is a scruffy old LaTeX note explaining the equations, or they are in one of the early papers (either is common). This is totally normal — ugly, and not helpful for the non-expert trying to make sense of the code, but not an indicator of poor code quality.

Continuing development: As per the above, scientific codes of this kind generally evolve as they need to, rather than aiming for a particular release date or set of features. Continuing development is the norm, and things like bugfixes are applied as and when they crop up. Serious issues that affect previously published results would normally prompt an erratum (e.g. see this one of mine); some scientists are less good about issuing errata (or corrective follow-up papers) than others, especially for more minor issues, although covering up a really serious issue would be a career-ending ethical violation for most. As I hope I’m making clear from the above, the article’s charge of serious “quality problems” isn’t actually borne out though; the Imperial developers are just (harmlessly!) violating norms that the article’s author is used to from a completely different field.

Some other misunderstandings from that article:

  • “the original program was ‘a single 15,000 line file that had been worked on for a decade’ (this is considered extremely poor practice)” — Not to a scientist it’s not! If anything, the fact that the group have been plugging away at this code for a decade, with an increasing number of collaborators, confronting it with more and more peer reviewers, and withstanding more and more comparisons from other groups, gives me more confidence in it. It certainly improves the chances that substantial bugs would have been found and resolved over time, or structural flaws noticed. Young, unproven codes are the dangerous ones! And while the large mono-file structure will surely be annoying to work with (and so is poor in that sense), it has no bearing on the actual scientific correctness of the code.
  • “A request for the original code was made 8 days ago but ignored, and it will probably take some kind of legal compulsion to make them release it. Clearly, Imperial are too embarrassed by the state of it ever to release it of their own free will, which is unacceptable given that it was paid for by the taxpayer and belongs to them.” — This is a tell-tale sign that this person isn’t a scientist. First, the motto of most academics is “Apologies for the late reply”! Waiting 8 days for a reply to a potentially complicated and labour-intensive request is nothing, especially as the group is obviously busy with more urgent matters. Second, there’s no saying that the taxpayer paid for most of the code (it could be funded by a charitable foundation like the Wellcome Trust for example), and the code will likely remain the IP of the author, but with Imperial retaining a perpetual license to it. Instead, the obligation for openness here comes from the publications that use the code. Most journals require that authors make code and data used to produce results in particular journal articles available on request. Note that some scientists are cagey about releasing their code fully publicly because they worry about competitors co-opting it (and not without good reason). I personally have made all of my scientific code available by default however, and it’s good that this group are making theirs fully public too. It’s the right thing to do in this scenario (and we should also recognise previously opened codes, such as the one from the LSHTM group also used by SAGE).
  • “What it’s doing is best described as ‘SimCity without the graphics'” — Fantastic! The original SimCity used a tremendously sophisticated model for what it did, and has even been used in teaching town planners (I remember reading the manual in the 90s).
  • “The people in the Imperial team would quickly do a lot better if placed in the context of a well run software company….the difference between ICL and the software industry is the latter has processes to detect and prevent mistakes” — Now really, this is teaching grandma to suck eggs. Remember that much of modern programming emerged from academic science. This is not to say that your average scientific programmer couldn’t stand to learn some cleaner coding practices, but to accuse scientists of not having processes to detect and prevent mistakes — ludicrous! The bedrock of the scientific process is in validation and self-correction of results, and we have plenty of highly effective tools in our arsenal to handle that thank you very much. Now I have found unit testing, continuous integration etc. to be useful in some of my more infrastructural projects, but they are conveniences rather than necessities. Practically every scientist I know spends most of their time checking their results rather than coding, and I sincerely doubt that the Imperial group is any different. If anything, in most fields there is a culture of “conspicuous correctness” — finding mistakes in the work of others is a highly prized activity (especially if they are direct competitors).
  • “Models that consume their own outputs as inputs is problem well known to the private sector – it can lead to rapid divergence and incorrect predictions” — This is a highly simplistic way of looking at things, and I suspect the author doesn’t know much about this kind of method. I don’t know specifically what the Imperial folks are doing here, but there are important classes of methods that use feedback loops of this kind called iterative methods that are common in solving complicated systems of coupled equations and so on, which are mathematically highly rigorous. I have used them in a statistical modelling context on occasion. Ferguson and co are from highly numerate backgrounds, and so I think it’s safe to assume they’re not missing obvious problems of this kind.
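
As an example of the iterative methods mentioned in the last bullet, here is a minimal Python sketch of Jacobi iteration, a standard textbook method in which each estimate of the solution is fed back in as the input to the next step. The function name jacobi and the small example system are chosen purely for illustration; I am not suggesting this is the method used in the Imperial code.

```python
def jacobi(a, b, tol=1e-10, max_iter=500):
    """Solve the linear system a x = b by Jacobi iteration (illustrative only)."""
    n = len(b)
    x = [0.0] * n  # initial guess
    for iteration in range(max_iter):
        # Each new estimate is computed from the previous one:
        # the output of one step is the input to the next.
        x_new = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new, iteration + 1
        x = x_new
    raise RuntimeError("Jacobi iteration did not converge")


# A small coupled system whose exact solution is x = [1, 2, 3].
# The matrix is strictly diagonally dominant, a standard sufficient
# condition for this iteration to converge.
a = [[4.0, 1.0, 0.0],
     [1.0, 5.0, 2.0],
     [0.0, 2.0, 6.0]]
b = [6.0, 17.0, 22.0]

solution, n_iterations = jacobi(a, b)
print(solution, n_iterations)
```

Feeding outputs back in as inputs only leads to "rapid divergence" when the convergence conditions are not met; for methods like this one, those conditions are well understood and can be checked in advance.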

I could go on, but hopefully this is enough to establish my case — the author of that article is out of their depth, and clearly unaware of many of the basics of numerical modelling or the way this kind of science is done (with great historical success across many fields, might I add). In fact, they are so far out that they don’t even realise how silly this all sounds to someone with even a cursory knowledge of this kind of thing — it is an almost perfect study in the Dunning-Kruger effect. How they reached the conclusion that scientists must be so incompetent that “all academic epidemiology [should] be defunded” and that “This sort of work is best done by the insurance sector” is beyond me — there is a remarkable arrogance in overlooking the possibility that, just maybe, the failure is in their own understanding.

The peculiarly implausible nature of the accusations

What I have discussed above is a (by no means complete) explanation of how many, if not most, scientific modellers approach their job. I have no special insight into the workings of the Imperial group; this is more an attempt to explain some of the sociology, attitude, and norms of quantitative scientific modelling. Professional software developers will hate some of these norms because they are bad end-user software engineering — but this doesn’t actually matter, since scientific correctness of a code typically owes little to the state of its engineering, and we have a very different notion of who the end user is compared to a company like Google. Instead, we have our own well-worn methods of checking that our codes — our exoskeletons — are scientifically correct in the most pertinent ways, backed up by decades of experience and rabid, competitive cross-checking of results. This, really, is all that matters in terms of scientific code quality, whether you’re publishing prospective theories of particle physics or informing the public health response to an unprecedented pandemic.

Let’s not lose sight of the bigger picture here though. The point of the Lockdown Sceptics article is to challenge the validity of the Imperial code by painting it as shoddy, and presumably, therefore, to undermine the basis of particular actions that may have taken note of SAGE advice. For this to actually be the case, though, the Imperial group must (a) have evaded detection for over 10 years from a global community of competing experts; (b) be almost criminally negligent as scientists, having ignored easily-discovered but consequential bugs; (c) be almost criminally arrogant to suppose that their unchecked/flawed model should be used to inform such big decisions; and (d) have taken the entire scientific advisory establishment for a ride, without anyone thinking to question what they were being told. Boiling this all down, the author is calling several hundred eminent scientists — in the UK and elsewhere — complete idiots, while they, through maybe half an hour of cursory inspection, have found the “true, flawed” nature of the code through a series of noddy issues.

This is clearly very silly, as even an ounce of introspection would have revealed to the author of that article had they started out with innocent motives. Their “code review” is no such thing however — instead, it is a blatant hatchet job, of the kind we have come to expect from climate change deniers and other anti-science types. The article author’s expertise in software development (which I will recognise, despite their anonymity) is of entirely the wrong type to actually, meaningfully, review this codebase, and is clearly misapplied. You may as well ask a Java UI programmer to review security bugs in the Linux kernel. To then rejoice in the fact that this has been picked up by an obviously lop-sided wing of the press and used to push a potentially harmful agenda, against the best scientific evidence we have, is chilling.

About Phil Bull

I'm a Lecturer in Cosmology at Queen Mary University of London. My research focuses on the effects of inhomogeneities on the evolution of the Universe and how we measure it. I'm also keen on stochastic processes, scientific computing, the philosophy of science, and open source stuff.

36 responses to “Why you can ignore reviews of scientific code by commercial software developers”

  • Daniel

    It’s also worth mentioning the different levels of programming resources available to academics. Scientific labs do not have large teams of professional developers to tidy up source code, maintain documentation, etc. Scientific robustness takes precedence over product development.

    But as you say, the main issue is that the original author seemingly has no awareness or comprehension of stochastic modelling. Presumably it conflicts with an aphorism he read in a programming text book in the 80s about randomness being bad, and he isn’t aware of any forms of statistics developed after the 1950s.

  • Latest News – Lockdown Sceptics

    […] Like this one by Phil Bull, a Lecturer in Cosmology at Queen Mary University headlined ‘Why you can ignore reviews of scientific code by commercial software developers‘. Includes a caveat towards the end that tells you everything you need to know: “I will […]

    • OldCoder

      Thanks for writing this, and putting your head above the parapet on the lockdown “sceptics” site. There were others who did the same too.

      Some software engineers in the twitter mob raised by the site started opening outrageous issues on the ICL github page. However, there are a couple of good non-ICL contributors, notably Feynstein who are offering constructive comments on the code, and trying to explain like this blog post.

      Toby Young took the comment that you’re not an epidemiologist to “discredit” your whole blog entry, hence the trackback.

      I’ve personally worked in commercial, non-scientific software, for 30 years, and there are pressures that result in some commercial software, and associated processes, being of extremely poor quality in software engineering terms (bloated, uncommented, poorly laid out, lack of test coverage, lack of knowledge on algorithms, programming language features).

  • Jean

    Interesting. I thought scientific coders actually had much harder standards than “commercial” developers as you call us. I thought you would try to prove your algorithms, for instance. Clearly, formally proving any of the algorithms used in Neil Ferguson’s code is impossible.

    You’re missing an important point here: maintainability and flexibility go back to the same thing, which is code sanity. When your functions are several thousand lines long with meaningless variables reused for different purposes, modifying anything in this mess without breaking anything is a near impossible task, even if you work on this code on a daily basis.

    And this has nothing to do with “commercial developers” vs scientists. I have seen such code produced in the private sector (unsurprisingly, even its own author was unable to maintain it, and it was filled with exotic bugs). And scientists also have written brilliant pieces of excellent code which are now in the libc or in the Linux kernel. This has to do with people lacking rigour. This kind of code is not acceptable in the private sector for software being used in production. Nor is it acceptable for scientists determining public policy.

    • Phil Bull

      Thanks for the comment Jean. I’m not sure why you find formal proof of algorithms synonymous with “high standards”, as mathematical well-posedness is only rarely an issue for most scientific modelling problems. This is typically more of an issue for computer scientists and applied mathematicians who are interested in the structure and limits of methods themselves rather than their applications. I suppose it sometimes might be necessary for certain kinds of problems — proving detailed balance for a novel sampling method, or proving the existence of an attractor for example — but for the most part we are using well-understood classes of methods (e.g. of the types found in http://numerical.recipes/) to solve novel systems of equations. Re-proving the validity of the methods would generally be superfluous.

      I really must disagree that I am missing the point, as your note on code sanity is something I addressed. Scientific programmers have their own ways of dealing with checking and testing the correctness of their code, and while I readily agree that these methods go against what most software engineers would consider best practice, that doesn’t mean that they’re ineffective. The reason for the difference is that scientific codes are structured differently and intended to be used in different ways from many other kinds of code, and so ways of working that would be risky for a database code (no input sanitisation, aaagh!?) are fine for many scientific codes. I say this as someone who has worked with both kinds of code, and who is well aware of many good software engineering practices, even if it isn’t my day job.

      It’s also the case that a code that seems a mess to the untrained eye might actually be quite logical once you understand the problem it is solving. I’ve worked with plenty of such codes, and while the learning curve is steep (something that would be frowned upon with a commercial code!), that doesn’t mean it’s badly written. The processes of abstraction and modularisation that are typically encouraged in other types of code may well obscure or complicate the workings of some scientific codes. You can get a feeling for this by fiddling around with Fortran 90 for a while; many modern software engineers hate it, as it makes certain sensible programming patterns quite hard to implement, but to some scientists it can actually feel like a very intuitive way of coding up our equations.

      I don’t dispute that there is messy, unmaintainable code in the private sector, and there is certainly some of that in science too. But there are also many complex, messy-looking codes in science that are actually very well maintained and “battle-tested”. What I’m trying to say is that you, as a non-scientific developer, can’t rely on your usual heuristics to tell the difference, simply because you probably don’t have enough experience with that kind of code. By way of analogy, it’s almost like saying that “cryptic crossword clues don’t make sense” when actually they make plenty of sense if you understand the rules. I am in the mildly uncommon position of understanding the rules of both kinds of programming, and so believe me when I say there is no obvious issue with rigour or malpractice here.

  • Jean

    By the way, some scientists also believe code sanity is important: https://chrisvoncsefalvay.com/2020/05/09/imperial-covid-model/

  • Joe Smith

    ‘Their “code review” is no such thing however — instead, it is a blatant hatchet job, of the kind we have come to expect from climate change deniers and other anti-science types.’

    Or maybe it’s a review by someone who thinks a program whose output is used by government to make massive decisions should be developed and tested to high standards. This isn’t the first time that poor standards of software engineering have been uncovered in academia. Read up on the Climate Research Unit’s even worse software – the Harry Read Me file.

    If we wouldn’t accept this level of quality control on software to manage our bank accounts then we sure as heck shouldn’t accept it in software for modelling pandemics or climate change, unless such software is used for purely theoretical purposes only.

    • Phil Bull

      I don’t think so — the tone of the piece (and its location) are hardly indicative of an honest inquiry into the code, and (as I’ve discussed) the criticisms it presents are largely not valid, at least when the code is considered in its proper context. Their article is wrong, but in an intentional and overblown way — it is a hatchet job. And regarding the CRU, you are not exactly helping your case by bringing up a climate change denial talking point. See https://www.factcheck.org/2009/12/climategate/ and other sources for debunking.

      I wouldn’t want scientists writing banking software either; but nor would I want financial software developers coding up epidemiological models. The sensible thing to do is acknowledge the domain-specific expertise that’s at play here. (And if you don’t like the Imperial code, take a look at the LSHTM code and others that also fed into the SAGE advice and let me know what you think of that.)

      • Joe Smith

        I don’t think domain-specific knowledge is the key point. What’s important is that the software has been validated through rigorous testing. Reviewers have found bugs in Ferguson’s code and AFAIK no evidence of rigorous testing has been presented.

        Regarding the CRU, the leaked emails aren’t the issue, it’s the leaked Harry Read Me file which shows a staggering amount of problems in their code. Clearly the programmer involved had trouble, partly because of a lack of maintainability.

      • Phil Bull

        This is a point I make at length in my article. It’s inconceivable that there hasn’t been a ton of testing — that’s what scientists spend most of their time doing, and their having spent >10 years on the code, with multiple collaborators, many peer-reviewed articles etc. inspires confidence that any real howlers would have been found by now.

        It’s still relatively uncommon, however, for those tests to all be encoded in a formal test suite of the kind that a software engineer would be more used to seeing. Scientific software tends to be validated in a more ad hoc way. This is a domain-specific norm, not a matter of competence. Seeing both sides of software development, there are some situations where scientific codes I know would benefit from a more or less extensive test suite (we have a substantial one for a library I work on for example: https://github.com/LSSTDESC/CCL), and others where it would be pure overkill, a waste of precious time and (typically) public funds. I have both kinds of code, and would be happy to go through my ad hoc testing strategies in excruciating detail if you really want to understand the kinds of things it entails…

        I decided to take a look at that readme file, and it actually seems to be a dump of an IDL session, where the author has been debugging some input handling bits of the code. It all looks pretty standard. Whether or not that bug is consequential will depend on how the code is used, and how inputs are normally passed to it. It certainly doesn’t suggest a “staggering number of problems in their code” though — it looks more like a standard file format handling gripe, where different conventions may or may not be used in datafiles from different sources. Commercial data scientists would also recognise this as a standard, common issue, part of the “data wrangling” aspect of working with data.

  • Txcon

    Your comments are not unfair for the vast majority of scientific coding efforts. But if you are going to use the results of your code to persuade the government to lock down 60 million people (or 330 million) it had better be bulletproof.

  • Tony Proctor

    Seems to be a disparity between the clarity, transparency, and documentation of the code, as compared with academic articles from the same people.

  • Joe Smith

    “This is a point I make at length in my article. It’s inconceivable that there hasn’t been a ton of testing — that’s what scientists spend most of their time doing, and their having spent >10 years on the code, with multiple collaborators, many peer-reviewed articles etc. inspires confidence that any real howlers would have been found by now.”

    I hope that’s the case, but we don’t have the evidence that a “ton” of testing has been done and that it was of sufficient rigour, we just have a form of “take my word for it” from another academic. Also, can peer-reviewed articles detect bugs in unreleased code? Even if those papers agree that the mathematics and algorithm are good, the question is whether the code implements them, and handles the input data, correctly.

    “It’s still relatively uncommon for those tests to all be encoded in a formal test suite of the kind that a software engineer would be more used to seeing however. Scientific software tends to be validated in a more ad hoc way. This is a domain-specific norm, not a matter of competence.”

    This may well be the case, but if the results of academic software are being consulted for public policy then I believe it should be held to much higher standards than it is currently. That’s a matter for academia and funding bodies to deal with, but judging by what I’ve read here and elsewhere there’s little appetite for change.

    “I decided to take a look at that readme file, and it actually seems to be a dump of an IDL session, where the author has been debugging some input handling bits of the code. It all looks pretty standard. Whether or not that bug is consequential will depend on how the code is used, and how inputs are normally passed to it. It certainly doesn’t suggest a “staggering number of problems in their code” though — it looks more like a standard file format handling gripe, where different conventions may or may not be used in datafiles from different sources. Commercial data scientists would also recognise this as a standard, common issue, part of the “data wrangling” aspect of working with data.”

    From what I’ve read of the readme file (hundreds of pages long) there were serious and unacceptable problems with the software and previous results couldn’t be replicated. The developer suggests the software should be junked and re-written because it’s such a mess. Even if the readme file only relates to input data handling, it suggests the rest of the code is probably shoddy too.

    I once worked for a large non-profit organisation and on one project found that their systems and data meant they couldn’t accurately report data they wanted. I designed a solution to make the best of it and early on flagged the issue to the internal customer so it was up to them to decide how much of a problem it was and whether they could live with it. That’s a different kettle of fish to creating predictions on pandemics or climate change upon which governments make life changing decisions.

    IMO nobody should be defending anything less than high and transparent standards in the development and testing of software which may be used to make decisions having a massive impact on the country’s economy and people’s lives.

    Btw, it may be that not all of Sue Denim’s criticisms are correct, but I’ve seen other critiques of Ferguson’s code with some areas of agreement. I don’t think they are necessarily hatchet jobs by climate change deniers or anti-science types. I accept the scientific consensus on climate change but that doesn’t mean I believe there are no serious issues with some particular scientific software.

    • Phil Bull

      I think you get into the salient point here: “we don’t have the evidence that a “ton” of testing has been done and that it was of sufficient rigour”. The kind of evidence of rigour/correctness that you are looking for is likely not the same as what a scientist working in this field would be looking for. While a suite of unit tests might make you happy, this wouldn’t necessarily be seen as the hallmark of a good, trustworthy code to a scientist. As per my article, there are lots of other ways scientists assess the validity of a code or any other kind of calculation, and they tend to work well for us. Without a background in a particular field, however, those methods may not be very useful or convincing to you personally. Importantly, though, that doesn’t mean that they are ineffective! Conversely, a code could be written to impeccable software engineering standards, but in the absence of the “usual” scientific testing could be found completely untrustworthy by a scientist.

      I would argue that the important thing here is that other expert epidemiologists have critiqued the code and the model it represents. They are the group with the best shot at identifying serious issues, and the least chance of misclassifying unimportant issues. As happens when practically anyone starts looking into a new field, you don’t yet know what you don’t know, and so tend to make mistakes that experienced practitioners will find naive. I’m simply pointing out that this is very likely to be the case with these code reviews.

      Now you may personally find this unsatisfying, because what I’m implying is that you can’t really get to grips with the scientific quality of the Imperial code without doing a lot of work to build an understanding of the field! Perhaps my own expertise is closer to theirs, but I still recognise that I’m not a domain expert in epidemiology, hence the caveats in the article.

      There are a few examples where people *have* dug into another field because they believe there are serious issues. A high-profile one is the Berkeley Earth project, founded on scepticism of climate change models and the way they were using temperature data. Lo and behold, it turned out that the models and data were reliable and reproducible — but it took a lot of work (and money, frankly) for the external experts to figure that out: https://www.nytimes.com/2012/07/30/opinion/the-conversion-of-a-climate-change-skeptic.html

      That isn’t to say that external experts can’t have a sharp eye for serious problems. The replication crisis in the social sciences is an example of this: https://statmodeling.stat.columbia.edu/2016/09/22/why-is-the-scientific-replication-crisis-centered-on-psychology/

      Back to the bigger picture: the “Ferguson’s horrible code” argument is a red herring anyway, since plenty of other groups have shown alternative calculations and modelling that back up the Imperial group’s general conclusions using quite different approaches. Several of those groups were represented on the SAGE panels at the time, and from the minutes I’ve seen had substantial input. We were never using this code as a sole, flawed oracle. Splashing his private life on front pages of particular newspapers, and mysteriously-timed anonymous articles that “support” strident opinion pieces from politicians… It’s clear that there is a political dimension here. With hindsight, we will see that Toby Young and co are the people whose activities are most deserving of public scrutiny here.


  • Nathan

    I noticed that the Lockdown Sceptics piece was first published on 6 May, one day different from the Telegraph release of its information about Prof Ferguson’s breach of the lockdown guidance (which I believe to have been available to the Telegraph for 4 weeks). Coincidence is a peculiar thing.

  • Ray Myers

    You make a good case that some things are being blown out of proportion and politically weaponized here, but I have to take issue when you generalize to saying commercial developers should simply never be listened to because good engineering isn’t related to correctness.

    I think if you’ve got one group that knows about code quality (which does indeed relate to correctness and ability to peer review) but doesn’t know about the science and another group that’s vice versa, what’s needed is more understanding between them not less.

    So I would urge commercial developers to be more patient and constructive than some of the feedback I see. Someone coming in with “this is bad code – I have no specific suggestions, haven’t found any specific bugs, haven’t read any of the papers, and don’t know how it’s being used, but it’s bad” isn’t going to make any headway. That’s not helping, it’s poking holes.

    I would also urge anyone writing scientific code to learn a bit about refactoring, unit testing, descriptive names, modular design. I realize there are only so many hours in the day but these skills will make you more productive and help you collaborate.

    • Phil Bull

      Yes, I do think adopting several professional software engineering practices would be useful for most scientific codes. As you say, it can certainly reduce the amount of time people spend hunting for bugs etc, and makes multi-developer projects much more pleasant to work on. We use unit testing, modular design, descriptive names etc. for several of the bigger codes I work on, and there are efforts like Software Carpentry (https://software-carpentry.org/scf/history/) that have been running for some time to help scientists improve their software engineering skills. In fact, a lot of scientists *are* adopting these ways of working, thanks in part to better awareness and programming tools that make it easier to do the right thing. The point is just to split “scientific correctness” from “code quality”, as in my experience they’re not tremendously correlated.

      Unfortunately, the code reviews I’ve seen so far have mostly been either politically-motivated hatchet jobs, or start from a position of not understanding the goals of scientific code, as discussed in the article.

      • dguest

        This seems to set up unnecessary polarization between “academics who know science” and “politically-motivated code reviews”.

        John Carmack (famous for games like Doom and Quake) and the developers at GitHub are professional developers, not academics. They offered a fair assessment of the original code (Carmack’s general review was quite positive), helped to clean it up, fixed a lot of bugs, and provided all the source material for the hatchet job “reviews” that this blog post is reacting to.

        I’m fine with calling out propaganda on some rightwing blog, but don’t imply that the code reviews from professional developers are at the root of the problem. Useful as it might be as a dramatic device, this “science vs programmer” framing is a false dichotomy and is ultimately harmful to everyone involved.

        On the commercial software side, this ignores the developers who took their time to refactor and publish the academic code. This was purely altruistic on their part and they deserve to be commended, not lumped together with some fringe blogger with a clear political agenda.

        On the science side, we really do have problems with lack of reproducibility and our stubborn refusal to accept ideas from professionals in outside fields. It’s important to note that a good fraction (arguably most) of academic science is produced by graduate students with less than 5 years’ experience. In general they don’t have the luxury of having worked with any code for 20 years, much less the particular 15k lines of C that were used in this model. At best sloppy code is a waste of time, and at worst it’s going to lead to subtle bugs that any amount of scientific knowledge can miss.

        To be clear: you’re completely correct in defending the Imperial model. The scientific community has made up for our lack of coding expertise in many other ways. And yes, the blog posts you’re linking to are utter garbage. But please don’t kill the messenger and blame “commercial software developers” for a partisan hack job of a blog post.

      • Ray Myers

        Phil and dguest: I’m preparing a video on this topic so came back to check on this post and just wanted to say I couldn’t be happier with how thoughtful and constructive your replies were. I’d like to quote them, feel free to let me know any social media / links you’d want for attribution.

      • Phil Bull

        Sure, please go ahead (my Twitter handle is @philipbull)

  • Andy

    The odd thing about these unit test complaints is that … commercial software is typically not well tested. It should be, but a lot of it is done under time pressure and tests are the thing that often gets lost. There should be tests and tests are good. Not doing them is bad.

    But acting like it is shocking to encounter a codebase that is not well tested is disingenuous. And there is absolutely no reason to buy the assumption that commercial developers test their software well. It tends to be somewhat better than scientific code, but that is all you can say about it.

  • Code: science and production – Adam Shostack & friends

    […] an interesting article by Phil Bull, Why you can ignore reviews of scientific code by commercial software developers. It’s an interesting, generally convincing argument, with a couple of exceptions. (Also worth […]

  • It’s all very well “following the science”, but is the science any good? – Warta Saya

    […] We are not ourselves coders, but it seems like there is a consensus around the fact that the Imperial code does indeed contain a number of bugs. That might sound worrying. But somewhat counterintuitively, there also seems to be a consensus in the academic community that you cannot apply commercial software development principles to scientific modelling. (You can read a good explanation of this argument here.) […]

  • Fitz Fulke

    Science proceeds, as Peirce says, by generating and testing hypotheses, narrowing but never removing uncertainty of method or result. Good commercial software seeks predictable determinacy. NHS experience of software provided by commercial developers shows how difficult that goal can be, and why commercial developers should be circumspect in setting up their work as a standard of judgement.

  • Matt

    Thanks for writing this wonderful article Phil. As a software developer who’s “productionised” lots of code written by scientists I completely agree with you. Often the code smell would be awful, but it was fantastically valuable to the companies that owned it. And while there were very few automated tests, the scientists that wrote it would test it thoroughly in a variety of inventive (and sometimes convoluted and downright mad) ways.

  • ThinkingScientist

    Coming to the party very late here. I am a geophysicist and I have spent the last 20 years in R&D and latterly commercialisation of stochastic seismic inversion technology. I have a lot of experience in stochastic methods, applied to very large scale problems (spatially dependent in 3D and also co-dependent between three output variables). I am also a lecturer and teacher of geostatistics and stochastic methods. I have a very clear understanding of what is involved.

    I am not a programmer, but I employ them and work with them on development. So I understand the distinction between R&D code and commercial code, and also the difference between what you can use for internal runs of software (as domain experts) as opposed to selling software to third parties.

    The author of this article does not have a proper understanding of the practicalities of checking software designed for stochastic modelling.

    “Non-deterministic outputs:

    Computers deal in “pseudo-randomness” though; given the same starting “seed” they will produce the same random-looking sequence. A review by a competing group in Edinburgh found a bug that resulted in different results for the same seed, which is generally not what you’d want to happen.”

    It is correct to state that the same start seed results in the same sequence. However, code which has an undetected bug giving different results from the same seed can never have been tested to check the implementation of, say, a given PDF simulation is correct. Big red flag for me.

    “A particular workaround here appears to be re-running the model many times with different seeds, which is what you’d do with this code anyway; or using different settings that don’t seem to suffer from this bug.”

    There are two problems with this explanation and it also shows a naivety about how random number generators work. A programme which has core functionality based on a random generator should really only have 1 seed set for the running of the program and the calls to the random number generator should be global. Explaining away the problem by saying that it doesn’t matter because you use different seeds in practice is hiding from the point I made above – the results cannot be tested to check the program works correctly.

    Secondly, it is not a “workaround” to re-run the model with different seeds.
    This explanation also hides another common misunderstanding that I have previously encountered. The use of multiple instances of seeds in programs is often thought to make the program “more random”. In fact the reverse is true. Using multiple seeds creates a possibility for re-entering the random sequence through a previously simulated segment in the same run, potentially creating dependencies in the results which are unintentional and uncontrollable. Now, it is fair to say that with a very long period sequence generator this risk is small, but it is nonetheless real.

    Finally, the idea that it’s OK if, given the same seed, the program generates different results each time because the runs are simply going to be averaged anyway is an attempt to explain away a fundamental flaw – it ignores the lack of reproducibility required for at least basic testing of any stochastic code.
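
The reproducibility requirement described in this comment can be sketched as a simple regression check. This is a minimal illustration under stated assumptions, not the Imperial code: `simulate_outbreak`, its parameters, and the toy chain-binomial model are all hypothetical. The only point is that a run seeded once, through a single generator, should return exactly the same trajectory when repeated.

```python
import random


def simulate_outbreak(seed, population=200, n_days=40, beta=0.3, gamma=0.1):
    """Toy chain-binomial epidemic (hypothetical; NOT the Imperial model).

    Returns the number of infectious individuals on each day.
    """
    rng = random.Random(seed)  # one generator, seeded once for the whole run
    susceptible, infectious = population - 1, 1
    history = [infectious]
    for _ in range(n_days):
        # Probability that a given susceptible is infected today.
        p_infect = 1.0 - (1.0 - beta / population) ** infectious
        new_infections = sum(rng.random() < p_infect for _ in range(susceptible))
        new_recoveries = sum(rng.random() < gamma for _ in range(infectious))
        susceptible -= new_infections
        infectious += new_infections - new_recoveries
        history.append(infectious)
    return history


def is_reproducible(seed):
    """Same seed, run twice: the two trajectories must match exactly."""
    return simulate_outbreak(seed) == simulate_outbreak(seed)


if __name__ == "__main__":
    print("seed 42 reproducible:", is_reproducible(42))
```

A check like this says nothing about whether the model is scientifically right; it only confirms that a given seed pins down a given run, which is the precondition for comparing outputs before and after a code change.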

    • Phil Bull

      Thanks for your comment.

      “However, code which has an undetected bug giving different results from the same seed can never have been tested to check the implementation of, say, a given PDF simulation is correct. Big red flag for me.”
      Remember that this appears to be in one relatively unused branch of the code according to the devs. A bug is a bug, sure, but this one doesn’t seem to be particularly material.

      “There are two problems with this explanation and it also shows a naivety about how random number generators work…”
      I understand perfectly well how pseudo-RNGs work, I’m afraid! Setting one seed per realisation of a simulation is a common and reasonable usage pattern, especially for codes intended to generate many realisations per set of input parameters. They appear to be using a similar seed-to-realisation correspondence here. While there is a bug here, I’m finding it hard to see how it might cause any significant problem with the end results. Having reproducible runs would surely be better, but if all they’re doing is analysing an ensemble of realisations at the end, I’m failing to see how this could matter (unless their results don’t converge to the target distribution, in which case they have bigger problems).

      “Secondly, it is not a “workaround” to re-run the model with different seeds.”
      In this case, it seems that it is! While I take your point about not getting more randomness by using multiple seeds, the chance of a collision with a modern RNG is indeed very, very low, and in the way the code is being used here would not matter anyway (any unlikely spurious correlation that cropped up between two realisations would do little to skew the results).

      So, while I agree with some of what you say in principle, considering the way that they say they are using the code, the risk of this bug causing erroneous conclusions seems vanishingly small. I really can’t imagine a practical example where this would be an issue unless they are making some much more fundamental errors (e.g. leading to poor convergence) in addition.
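
The ensemble argument in this reply can be illustrated with a toy sketch. Everything here is an assumption made for illustration (the `one_realisation` model, the parameter values, and the seed ranges are hypothetical and unrelated to the Imperial code): each realisation gets its own seed, and the quantity of interest is the ensemble mean with its standard error. If the ensemble has converged, two disjoint sets of seeds should give means that agree to within a few standard errors.

```python
import math
import random
import statistics


def one_realisation(seed, n_contacts=500, p_transmit=0.2):
    """Toy stochastic outcome: infections among n_contacts exposures.

    Hypothetical stand-in for one run of a stochastic simulation.
    """
    rng = random.Random(seed)  # one seed per realisation
    return sum(rng.random() < p_transmit for _ in range(n_contacts))


def ensemble_summary(seeds):
    """Return (mean, standard error of the mean) over the given seeds."""
    outcomes = [one_realisation(s) for s in seeds]
    mean = statistics.mean(outcomes)
    stderr = statistics.stdev(outcomes) / math.sqrt(len(outcomes))
    return mean, stderr


if __name__ == "__main__":
    # Two ensembles built from entirely different seeds.
    mean_a, err_a = ensemble_summary(range(0, 400))
    mean_b, err_b = ensemble_summary(range(1000, 1400))
    print(f"ensemble A: {mean_a:.2f} +/- {err_a:.2f}")
    print(f"ensemble B: {mean_b:.2f} +/- {err_b:.2f}")
```

This only demonstrates that ensemble statistics are insensitive to which seeds were used; it does not show that any individual run is reproducible, which is a separate property.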

  • ReformedAcademic

    Not commenting here on the covid related aspects of this post, only the discussion of scientific software development methodology.

    The 4 points listed as being important to commercial software developers, but not really necessary for scientific code (Maintainability, Documentation, User Proofing/Error Checking, Formal Testing) are a huge problem whenever code is to be used by more than one person. Anyone who expects or offers their code up for use by another person in a setting where others are required to use that code should themselves be required to address all of these elements before another human has to lay eyes on that code.

    I have worked in the R&D arms of too many commercial organizations where former graduate students bring this exact attitude to software projects that start as research projects but are relied upon for actual product development. Too many former academics (I myself was one in the past) think it is perfectly acceptable to say “the code is the documentation”, “it works for me”, or “they will figure things out after a while”. None of that is acceptable and academia needs to stop perpetuating this attitude, and the organizations that employ them need to enforce this behavior. I’m still searching for such an organization. 🙂
