
Fortunately, for data science the go-to tool seems to be Jupyter Notebook. If only the other fields (engineering, finance, etc.) would follow suit. Back in the day I used Mathcad for engineering calculations. That was around 25 years ago, and Jupyter has now mostly caught up with its capabilities thanks to Markdown cells and libraries like SymPy. The last piece of the required functionality is provided by a module called handcalcs [1], which has only been around for about 3 years.

[1] https://pypi.org/project/handcalcs/
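For anyone who hasn't seen it, here is a minimal sketch of handcalcs in a notebook (the beam numbers below are made up for illustration):

    import handcalcs.render   # enables the %%render cell magic

then, in a following cell:

    %%render
    # handcalcs typesets each assignment below as a LaTeX equation
    M = 24e3             # bending moment, N*mm
    c = 50               # distance to extreme fibre, mm
    I_x = 1.5e6          # second moment of area, mm**4
    sigma = M * c / I_x  # bending stress, MPa

On 17/10/23 12:04, Lawrence D'Oliveiro wrote: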
This only slightly tongue-in-cheek opinion piece <https://www.theregister.com/2023/10/16/excel_hell_comment/> on the seeming inevitability of data-processing errors due to overuse/misuse/abuse of Microsoft Excel suggests the creation of a whole new industry to mitigate those errors, rather than trying to avoid them by switching to another tool.
One of the user comments linked to this article <https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008984>, which is an analysis of what has happened in genetics research in the years since measures were taken to rename genes with names that were prone to being misinterpreted by Excel, and otherwise raise awareness of the issue of Excel import/processing errors. The conclusion? Things haven’t improved at all.
I absolutely love the recommendations that they make. The first one is a biggie:
Scripted analyses are preferred over spreadsheets. Gene name to date conversion is a bug specific to spreadsheets and doesn’t occur in scripted computer languages like Python or R. In addition, analyses conducted with Python and R notebooks (eg: Jupyter or Rmarkdown) capture computational methods and results in a stepwise fashion meaning these workflows can be more readily audited. These notebooks can therefore achieve a higher level of computational reproducibility than spreadsheets. Although this requires a big investment in learning a computer language, this investment pays off in the longer term.
Note that bit: “capture computational methods and results in a stepwise fashion meaning these workflows can be more readily audited”. Here I thought reproducibility was an absolutely non-negotiable foundation stone of scientific research, yet it seems people have been publishing results with nothing to back up their analyses other than an Excel spreadsheet.
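To make the contrast concrete: a scripted reader only parses dates when you ask it to. A minimal sketch with pandas (the file and column names are hypothetical):

    import pandas as pd

    # read_csv keeps gene symbols such as "SEPT2", "MARCH1" and "DEC1"
    # as plain strings; nothing is silently reinterpreted as a date
    df = pd.read_csv("gene_counts.csv")
    print(df["gene"].head())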
Also:
If a spreadsheet must be used, then LibreOffice is recommended because it will avoid such errors from occurring. This will not remedy other error types.
Better than sticking with Excel! But still not as good as proper analysis tools.
If you must use Excel, then
... then take great care importing the data. If opening a TSV or CSV file, use the data import wizard to ensure that each column of data is formatted appropriately.
Though I suspect most of the users are clueless about this, else they would be doing it already.
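For what it's worth, the scripted equivalent of that wizard step is explicit and stays in the code for any later audit, e.g. in pandas (file and column names hypothetical):

    import pandas as pd

    # declare each column's type up front instead of clicking through a wizard
    df = pd.read_csv("gene_counts.tsv", sep="\t",
                     dtype={"gene": str, "count": int})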
A good recommendation on data formats in general:
Instead of spreadsheets, share genomic data as “flat text” files. These typically have the suffixes “csv”, “tsv” or “txt”. These are native formats for computer languages and suitable for long term data archiving. Excel formats such as “xls” or “xlsx” are proprietary and future development is decided by Microsoft.
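Writing results back out as flat text from a script is just as easy, e.g. (hypothetical data):

    import pandas as pd

    df = pd.DataFrame({"gene": ["SEPT2", "MARCH1"], "count": [12, 7]})
    # tab-separated flat text: readable by any tool, suitable for archiving
    df.to_csv("results.tsv", sep="\t", index=False)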
The problems are not just in genetics, of course:
Although changes to gene names and software will help, they won’t solve the overarching problem with spreadsheets; that (i) errors occur silently, (ii) errors can be hidden amongst thousands of rows of data, and (iii) they are difficult to audit. Research shows that errors are surprisingly common in the business setting, which raises the question as to how common such errors are in science. The difficulty in auditing spreadsheets makes them generally incompatible with the principles of computational reproducibility.
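That auditability point is the clincher: in a script the audit itself can be written down and re-run. A sketch of one such check, assuming a "gene" column and Excel's usual mangled forms like "2-Sep":

    import pandas as pd

    df = pd.read_csv("gene_counts.csv")  # hypothetical file
    # flag values matching Excel's day-month date mangling, e.g. "2-Sep"
    mangled = df["gene"].str.match(
        r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
        na=False)
    assert not mangled.any(), df.loc[mangled, "gene"].tolist()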