
Fortunately, for data science the go-to tool seems to be Jupyter Notebook. If only the other fields (engineering, finance, etc.) would follow suit. Back in the day I used Mathcad for engineering calculations. That was around 25 years ago, and Jupyter has now mostly caught up with its capabilities thanks to Markdown cells and libraries like SymPy. The last piece of the required functionality is provided by a module called handcalcs [1], which has only been around for about 3 years.

[1] https://pypi.org/project/handcalcs/
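For anyone who hasn't seen it, here is a minimal sketch of handcalcs in a notebook (the beam numbers below are made up for illustration):

    import handcalcs.render   # enables the %%render cell magic

then, in a following cell:

    %%render
    # handcalcs typesets each assignment below as a LaTeX equation
    M = 24e3             # bending moment, N*mm
    c = 50               # distance to extreme fibre, mm
    I_x = 1.5e6          # second moment of area, mm**4
    sigma = M * c / I_x  # bending stress, MPa

On 17/10/23 12:04, Lawrence D'Oliveiro wrote: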
This only slightly tongue-in-cheek opinion piece <https://www.theregister.com/2023/10/16/excel_hell_comment/> on the seeming inevitability of data-processing errors due to overuse/misuse/abuse of Microsoft Excel suggests the creation of a whole new industry to mitigate those errors, rather than trying to avoid them by switching to another tool.
One of the user comments linked to this article <https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008984>, which is an analysis of what has happened in genetics research in the years since measures were taken to rename genes with names that were prone to being misinterpreted by Excel, and otherwise raise awareness of the issue of Excel import/processing errors. The conclusion? Things haven’t improved at all.
I absolutely love the recommendations that they make. The first one is a biggie:
Scripted analyses are preferred over spreadsheets. Gene name to date conversion is a bug specific to spreadsheets and doesn’t occur in scripted computer languages like Python or R. In addition, analyses conducted with Python and R notebooks (eg: Jupyter or Rmarkdown) capture computational methods and results in a stepwise fashion meaning these workflows can be more readily audited. These notebooks can therefore achieve a higher level of computational reproducibility than spreadsheets. Although this requires a big investment in learning a computer language, this investment pays off in the longer term.
Note that bit: “capture computational methods and results in a stepwise fashion meaning these workflows can be more readily audited”. Here I thought reproducibility was an absolutely non-negotiable foundation stone of scientific research, yet it seems people have been publishing results with nothing to back up their analyses other than an Excel spreadsheet.
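To make the contrast concrete: a scripted reader only parses dates when you ask it to. A minimal sketch with pandas (the file and column names are hypothetical):

    import pandas as pd

    # read_csv keeps gene symbols such as "SEPT2", "MARCH1" and "DEC1"
    # as plain strings; nothing is silently reinterpreted as a date
    df = pd.read_csv("gene_counts.csv")
    print(df["gene"].head())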
Also:
If a spreadsheet must be used, then LibreOffice is recommended because it will avoid such errors from occurring. This will not remedy other error types.
Better than sticking with Excel! But still not as good as proper analysis tools.
If you must use Excel, then
... then take great care importing the data. If opening a TSV or CSV file, use the data import wizard to ensure that each column of data is formatted appropriately.
Though I suspect most of the users are clueless about this, else they would be doing it already.
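For what it's worth, the scripted equivalent of that wizard step is explicit and stays in the code for any later audit, e.g. in pandas (file and column names hypothetical):

    import pandas as pd

    # declare each column's type up front instead of clicking through a wizard
    df = pd.read_csv("gene_counts.tsv", sep="\t",
                     dtype={"gene": str, "count": int})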
A good recommendation on data formats in general:
Instead of spreadsheets, share genomic data as “flat text” files. These typically have the suffixes “csv”, “tsv” or “txt”. These are native formats for computer languages and suitable for long term data archiving. Excel formats such as “xls” or “xlsx” are proprietary and future development is decided by Microsoft.
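Writing results back out as flat text from a script is just as easy, e.g. (hypothetical data):

    import pandas as pd

    df = pd.DataFrame({"gene": ["SEPT2", "MARCH1"], "count": [12, 7]})
    # tab-separated flat text: readable by any tool, suitable for archiving
    df.to_csv("results.tsv", sep="\t", index=False)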
The problems are not just in genetics, of course:
Although changes to gene names and software will help, they won’t solve the overarching problem with spreadsheets; that (i) errors occur silently, (ii) errors can be hidden amongst thousands of rows of data, and (iii) they are difficult to audit. Research shows that errors are surprisingly common in the business setting, which raises the question as to how common such errors are in science. The difficulty in auditing spreadsheets makes them generally incompatible with the principles of computational reproducibility.
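That auditability point is the clincher: in a script the audit itself can be written down and re-run. A sketch of one such check, assuming a "gene" column and Excel's usual mangled forms like "2-Sep":

    import pandas as pd

    df = pd.read_csv("gene_counts.csv")  # hypothetical file
    # flag values matching Excel's day-month date mangling, e.g. "2-Sep"
    mangled = df["gene"].str.match(
        r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
        na=False)
    assert not mangled.any(), df.loc[mangled, "gene"].tolist()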