Excel As A Cancer

This only slightly tongue-in-cheek opinion piece <https://www.theregister.com/2023/10/16/excel_hell_comment/> on the seeming inevitability of data-processing errors due to overuse/misuse/abuse of Microsoft Excel suggests the creation of a whole new industry to mitigate those errors, rather than trying to avoid them by switching to another tool.

One of the user comments linked to this article <https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008984>, which is an analysis of what has happened in genetics research in the years since measures were taken to rename genes whose names were prone to being misinterpreted by Excel, and otherwise to raise awareness of the issue of Excel import/processing errors. The conclusion? Things haven’t improved at all.

I absolutely love the recommendations that they make. The first one is a biggie:

Scripted analyses are preferred over spreadsheets. Gene name to date conversion is a bug specific to spreadsheets and doesn’t occur in scripted computer languages like Python or R. In addition, analyses conducted with Python and R notebooks (eg: Jupyter or Rmarkdown) capture computational methods and results in a stepwise fashion meaning these workflows can be more readily audited. These notebooks can therefore achieve a higher level of computational reproducibility than spreadsheets. Although this requires a big investment in learning a computer language, this investment pays off in the longer term.

Note that bit: “capture computational methods and results in a stepwise fashion meaning these workflows can be more readily audited”. Here I thought reproducibility was an absolutely non-negotiable foundation stone of scientific research, yet it seems people have been publishing results with nothing to back up their analyses other than an Excel spreadsheet.

Also:

If a spreadsheet must be used, then LibreOffice is recommended because it will avoid such errors from occurring. This will not remedy other error types.

Better than sticking with Excel! But still not as good as proper analysis tools.

If you must use Excel, then ...

... then take great care importing the data. If opening a TSV or CSV file, use the data import wizard to ensure that each column of data is formatted appropriately.

Though I suspect most of the users are clueless about this, else they would be doing it already.

A good recommendation on data formats in general:

Instead of spreadsheets, share genomic data as “flat text” files. These typically have the suffixes “csv”, “tsv” or “txt”. These are native formats for computer languages and suitable for long-term data archiving. Excel formats such as “xls” or “xlsx” are proprietary, and future development is decided by Microsoft.

The problems are not just in genetics, of course:

Although changes to gene names and software will help, they won’t solve the overarching problem with spreadsheets: that (i) errors occur silently, (ii) errors can be hidden amongst thousands of rows of data, and (iii) they are difficult to audit. Research shows that errors are surprisingly common in the business setting, which raises the question as to how common such errors are in science. The difficulty in auditing spreadsheets makes them generally incompatible with the principles of computational reproducibility.
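That first recommendation is easy to demonstrate: a scripted CSV read treats every field as plain text unless you explicitly convert it, so the gene symbols Excel used to mangle into dates survive intact. A minimal sketch in Python's standard library (the data is inlined here purely for illustration):

```python
import csv
import io

# Gene symbols that Excel historically auto-converted to dates
# (e.g. SEPT1 -> 1-Sep, MARCH1 -> 1-Mar). Python's csv module
# treats every field as text, so nothing is reinterpreted.
data = "gene,count\nSEPT1,12\nMARCH1,7\nDEC1,3\n"

rows = list(csv.DictReader(io.StringIO(data)))
genes = [row["gene"] for row in rows]
print(genes)  # -> ['SEPT1', 'MARCH1', 'DEC1']
```

Every step of the conversion (here, none) is visible in the script itself, which is exactly the auditability the paper is after.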

Fortunately for data science the go-to tool seems to be Jupyter Notebook. If only the others (engineering, finance, etc.) would follow suit.

Back in the day I used to use MathCad for engineering calculations. This was around 25 years ago, and Jupyter has now mostly caught up with its capabilities, with the likes of Markdown comments and libraries like SymPy. The last piece of the required functionality is provided by a module called handcalcs [1], which has only been around for about 3 years.

[1] https://pypi.org/project/handcalcs/

On 17/10/23 12:04, Lawrence D'Oliveiro wrote:
[...]
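The stepwise, auditable style of engineering calculation being discussed here doesn't even need a third-party library: name every input, state the formula, evaluate. A minimal sketch using the standard textbook formula for the tip deflection of a cantilever beam under a point load (the numeric values are made up for illustration):

```python
# Stepwise calculation as a notebook cell would capture it: every
# intermediate quantity is named and commented, so the whole derivation
# can be re-run and reviewed line by line.
P = 1000.0   # point load at the tip, N (illustrative value)
L = 2.0      # beam length, m
E = 200e9    # Young's modulus (steel), Pa
I = 8e-6     # second moment of area, m^4

# Standard cantilever tip-deflection formula: delta = P * L^3 / (3 * E * I)
delta = P * L**3 / (3 * E * I)   # tip deflection, m
print(f"{delta * 1000:.3f} mm")  # -> 1.667 mm
```

Tools like handcalcs build on exactly this: in a notebook they render such a cell as the symbolic formula, the substitution, and the result, much as Mathcad did.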

This only slightly tongue-in-cheek opinion piece <https://www.theregister.com/2023/10/16/excel_hell_comment/> on the seeming inevitability of data-processing errors due to overuse/misuse/abuse of Microsoft Excel suggests the creation of a whole new industry to mitigate those errors, rather than try to avoid them by switching to another tool.
[...]

Looks like Microsoft got afraid that people would move away from their tools:

'In 2020, scientists decided just to rework the alphanumeric symbols they used to represent genes rather than try to deal with an Excel feature that was interpreting their names as dates and (un)helpfully reformatting them automatically. Last week, a member of the Excel team posted that the company is rolling out an update on Windows and macOS to fix that. Excel's automatic conversions are intended to make it easier and faster to input certain types of commonly entered data -- numbers and dates, for instance. But for scientists using quick shorthand to make things legible, it could ruin published, peer-reviewed data, as a 2016 study found. Microsoft detailed the update in a blog post last week, adding a checkbox labeled "Convert continuous letters and numbers to a date." You can probably guess what that toggles. The update builds on the Automatic Data Conversions settings the company added last year, which included the option for Excel to warn you when it's about to get extra helpful and let you load your file without automatic conversion so you can ensure nothing will be screwed up by it.'

-- source: https://it.slashdot.org/story/23/10/23/1217252/microsoft-fixes-the-excel-fea...

But open-source tools aren't without blame for "automagic" either. E.g. pandas: https://stackoverflow.com/questions/41417214/prevent-pandas-from-reading-na-...

At least pandas had a way of turning it off for a bit longer than Excel. :-)

I always liked that about OpenOffice/LibreOffice: they had an actual import dialog (e.g. for CSV files) with a preview. Excel always imported your CSV files as it thought best - with the usual disastrous outcome (oh, look, now I have to go through the extra step of applying the convert-text-to-columns function)...

Cheers, Peter

--
Peter Reutemann
Dept. of Computer Science, University of Waikato, Hamilton, NZ
Mobile +64 22 190 2375
https://www.cs.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

On Tue, 24 Oct 2023 08:49:37 +1300, Peter Reutemann wrote:
But, open-source tools aren't without blame for "automagic" either. E.g. pandas: https://stackoverflow.com/questions/41417214/prevent-pandas-from-reading-na-...
The issue in that case seems to be trying to interpret a non-numeric string as a number. That would mean converting _any_ such string to “NaN”, not just “NA”. That might be some issue with automatically inferring column types (I’m no Pandas expert), but apparently you can explicitly specify types anyway. And why not use NaN to express “not applicable”? It’s the numerical equivalent of a “null” value.
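For reference, pandas does let you switch that conversion off. A minimal sketch (the tiny CSV here is made up): `keep_default_na=False` disables the built-in list of strings that `read_csv` treats as missing (which includes "NA"), and `na_values` then opts back in to only the sentinels you actually want.

```python
import io
import pandas as pd

# "NA" is a legitimate value here (e.g. the country code for Namibia),
# so only the empty string should be treated as missing.
data = "country,score\nNA,1\nNZ,2\n,3\n"

df = pd.read_csv(io.StringIO(data), keep_default_na=False, na_values=[""])
print(df["country"].tolist())  # -> ['NA', 'NZ', nan]
```

With the defaults, the first row's "NA" would have come back as NaN as well; explicit `na_values` makes the conversion visible in the script rather than silent.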
I always liked that about OpenOffice/LibreOffice that they had an actual import dialog (eg for CSV files) with a preview. Excel always imported your CSV files as it thought best - with the usual disastrous outcome (oh, look, now I have to go through the extract step of applying the convert text to columns function)...
There is apparently some kind of import wizard you can engage in Excel (according to that original paper), but it seems most users are unaware it exists. Also note that incorrect data conversions are just one aspect of the problems with Excel. The whole idea of trying to express mathematical models of any complexity in a spreadsheet is inherently going to be trouble.
participants (3)
- Glenn Ramsey
- Lawrence D'Oliveiro
- Peter Reutemann