Excel As A Cancer

This only slightly tongue-in-cheek opinion piece <https://www.theregister.com/2023/10/16/excel_hell_comment/> on the seeming inevitability of data-processing errors due to overuse/misuse/abuse of Microsoft Excel suggests the creation of a whole new industry to mitigate those errors, rather than trying to avoid them by switching to another tool.

One of the user comments linked to this article <https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008984>, which is an analysis of what has happened in genetics research in the years since measures were taken to rename genes whose names were prone to being misinterpreted by Excel, and otherwise to raise awareness of the issue of Excel import/processing errors. The conclusion? Things haven’t improved at all.

I absolutely love the recommendations that they make. The first one is a biggie:

Scripted analyses are preferred over spreadsheets. Gene name to date conversion is a bug specific to spreadsheets and doesn’t occur in scripted computer languages like Python or R. In addition, analyses conducted with Python and R notebooks (eg: Jupyter or Rmarkdown) capture computational methods and results in a stepwise fashion meaning these workflows can be more readily audited. These notebooks can therefore achieve a higher level of computational reproducibility than spreadsheets. Although this requires a big investment in learning a computer language, this investment pays off in the longer term.

Note that bit: “capture computational methods and results in a stepwise fashion meaning these workflows can be more readily audited”. Here I thought reproducibility was an absolutely non-negotiable foundation stone of scientific research, yet it seems people have been publishing results with nothing to back up their analyses other than an Excel spreadsheet.

Also:

If a spreadsheet must be used, then LibreOffice is recommended because it will avoid such errors from occurring. This will not remedy other error types.

Better than sticking with Excel! But still not as good as proper analysis tools.

If you must use Excel, then ...

... then take great care importing the data. If opening a TSV or CSV file, use the data import wizard to ensure that each column of data is formatted appropriately.

Though I suspect most of the users are clueless about this, else they would be doing it already.

A good recommendation on data formats in general:

Instead of spreadsheets, share genomic data as “flat text” files. These typically have the suffixes “csv”, “tsv” or “txt”. These are native formats for computer languages and suitable for long-term data archiving. Excel formats such as “xls” or “xlsx” are proprietary, and future development is decided by Microsoft.

The problems are not just in genetics, of course:

Although changes to gene names and software will help, they won’t solve the overarching problem with spreadsheets: that (i) errors occur silently, (ii) errors can be hidden amongst thousands of rows of data, and (iii) they are difficult to audit. Research shows that errors are surprisingly common in the business setting, which raises the question as to how common such errors are in science. The difficulty in auditing spreadsheets makes them generally incompatible with the principles of computational reproducibility.
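That first recommendation is easy to demonstrate: a scripted CSV read treats every field as plain text unless you explicitly convert it, so the gene symbols Excel used to mangle into dates survive intact. A minimal sketch in Python's standard library (the data is inlined here purely for illustration):

```python
import csv
import io

# Gene symbols that Excel historically auto-converted to dates
# (e.g. SEPT1 -> 1-Sep, MARCH1 -> 1-Mar). Python's csv module
# treats every field as text, so nothing is reinterpreted.
data = "gene,count\nSEPT1,12\nMARCH1,7\nDEC1,3\n"

rows = list(csv.DictReader(io.StringIO(data)))
genes = [row["gene"] for row in rows]
print(genes)  # -> ['SEPT1', 'MARCH1', 'DEC1']
```

Every step of the conversion (here, none) is visible in the script itself, which is exactly the auditability the paper is after.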

Fortunately for data science the go-to tool seems to be Jupyter Notebook. If only the others (engineering, finance, etc.) would follow suit.

Back in the day I used to use MathCad for engineering calculations. This was around 25 years ago, and Jupyter has now mostly caught up with its capabilities, with the likes of Markdown comments and libraries like SymPy. The last piece of the required functionality is provided by a module called handcalcs [1], which has only been around for about 3 years.

[1] https://pypi.org/project/handcalcs/

On 17/10/23 12:04, Lawrence D'Oliveiro wrote:
[...]
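The stepwise, auditable style of engineering calculation being discussed here doesn't even need a third-party library: name every input, state the formula, evaluate. A minimal sketch using the standard textbook formula for the tip deflection of a cantilever beam under a point load (the numeric values are made up for illustration):

```python
# Stepwise calculation as a notebook cell would capture it: every
# intermediate quantity is named and commented, so the whole derivation
# can be re-run and reviewed line by line.
P = 1000.0   # point load at the tip, N (illustrative value)
L = 2.0      # beam length, m
E = 200e9    # Young's modulus (steel), Pa
I = 8e-6     # second moment of area, m^4

# Standard cantilever tip-deflection formula: delta = P * L^3 / (3 * E * I)
delta = P * L**3 / (3 * E * I)   # tip deflection, m
print(f"{delta * 1000:.3f} mm")  # -> 1.667 mm
```

Tools like handcalcs build on exactly this: in a notebook they render such a cell as the symbolic formula, the substitution, and the result, much as Mathcad did.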

This only slightly tongue-in-cheek opinion piece <https://www.theregister.com/2023/10/16/excel_hell_comment/> on the seeming inevitability of data-processing errors due to overuse/misuse/abuse of Microsoft Excel suggests the creation of a whole new industry to mitigate those errors, rather than try to avoid them by switching to another tool.
[...]

Looks like Microsoft got afraid that people would move away from their tools:

'In 2020, scientists decided just to rework the alphanumeric symbols they used to represent genes rather than try to deal with an Excel feature that was interpreting their names as dates and (un)helpfully reformatting them automatically. Last week, a member of the Excel team posted that the company is rolling out an update on Windows and macOS to fix that. Excel's automatic conversions are intended to make it easier and faster to input certain types of commonly entered data -- numbers and dates, for instance. But for scientists using quick shorthand to make things legible, it could ruin published, peer-reviewed data, as a 2016 study found. Microsoft detailed the update in a blog post last week, adding a checkbox labeled "Convert continuous letters and numbers to a date." You can probably guess what that toggles. The update builds on the Automatic Data Conversions settings the company added last year, which included the option for Excel to warn you when it's about to get extra helpful and let you load your file without automatic conversion so you can ensure nothing will be screwed up by it.'

-- source: https://it.slashdot.org/story/23/10/23/1217252/microsoft-fixes-the-excel-fea...

But open-source tools aren't without blame for "automagic" either. E.g. pandas: https://stackoverflow.com/questions/41417214/prevent-pandas-from-reading-na-...

At least pandas had a way of turning it off for a bit longer than Excel. :-)

I always liked that about OpenOffice/LibreOffice: they had an actual import dialog (e.g. for CSV files) with a preview. Excel always imported your CSV files as it thought best - with the usual disastrous outcome (oh, look, now I have to go through the extra step of applying the convert-text-to-columns function)...

Cheers, Peter

--
Peter Reutemann
Dept. of Computer Science, University of Waikato, Hamilton, NZ
Mobile +64 22 190 2375
https://www.cs.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

On Tue, 24 Oct 2023 08:49:37 +1300, Peter Reutemann wrote:
But, open-source tools aren't without blame for "automagic" either. E.g. pandas: https://stackoverflow.com/questions/41417214/prevent-pandas-from-reading-na-...
The issue in that case seems to be trying to interpret a non-numeric string as a number. That would mean converting _any_ such string to “NaN”, not just “NA”. That might be some issue with automatically inferring column types (I’m no Pandas expert), but apparently you can explicitly specify types anyway. And why not use NaN to express “not applicable”? It’s the numerical equivalent of a “null” value.
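For reference, pandas does let you switch that conversion off. A minimal sketch (the tiny CSV here is made up): `keep_default_na=False` disables the built-in list of strings that `read_csv` treats as missing (which includes "NA"), and `na_values` then opts back in to only the sentinels you actually want.

```python
import io
import pandas as pd

# "NA" is a legitimate value here (e.g. the country code for Namibia),
# so only the empty string should be treated as missing.
data = "country,score\nNA,1\nNZ,2\n,3\n"

df = pd.read_csv(io.StringIO(data), keep_default_na=False, na_values=[""])
print(df["country"].tolist())  # -> ['NA', 'NZ', nan]
```

With the defaults, the first row's "NA" would have come back as NaN as well; explicit `na_values` makes the conversion visible in the script rather than silent.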
I always liked that about OpenOffice/LibreOffice that they had an actual import dialog (eg for CSV files) with a preview. Excel always imported your CSV files as it thought best - with the usual disastrous outcome (oh, look, now I have to go through the extract step of applying the convert text to columns function)...
There is apparently some kind of import wizard you can engage in Excel (according to that original paper), but it seems most users are unaware it exists. Also note that incorrect data conversions are just one aspect of the problems with Excel. The whole idea of trying to express mathematical models of any complexity in a spreadsheet is inherently going to be trouble.
participants (3)
- Glenn Ramsey
- Lawrence D'Oliveiro
- Peter Reutemann