Computation-Intensive Research And The Quest For Scientific Reproducibility

The fruits of scientific research are supposed to be open to all. A key part of this is the need for reproducibility -- the idea that somebody else can repeat the same experiments and analysis and (hopefully) come to the same conclusions. It has long been a common expectation that researchers will make their raw data available to others for this purpose, but nowadays even that is likely no longer enough. The analysis of the data usually requires some particular piece of computer software, even if this was just some in-house scripting done on top of a commonly-available toolkit or package.

Two different reports on this subject have come out recently: this one <https://www.theregister.com/2021/11/25/research_software_inquiry/> from the UK, and this one <https://arstechnica.com/science/2021/11/keeping-science-reproducible-in-a-world-of-custom-code-and-data/> with examples from the US and elsewhere. The latter goes into a lot more detail, including good news (the rise of publicly-available data sets which get heavily used for many different analyses), and bad:

    From 2017 through 2019, Tsuyoshi Miyakawa, the editor-in-chief of the journal Molecular Brain, replied to 41 article submissions by requesting that the authors provide their complete source data for review, as per the stated policy of the journal. Only one author did so.

    ...

    Based on his efforts to replicate papers from other statisticians, Thomas Lumley, a professor of biostatistics at the University of Auckland in New Zealand, says of the phrase "data available upon request": "When people put it in their papers, what they typically mean is 'data not available.'"

As for making code available, that has its own challenges: often the scripts/programs are hastily thrown together, and the creators may be embarrassed to have others see them in that state. Or the code is not likely to work properly anyway outside of the original systems where it was developed.

The good news is that the bodies that fund the research and the journals that publish the results are becoming more aware of such issues, and are increasingly trying to ensure that procedures for dealing with them are built into projects from the beginning.
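On that last point, a small amount of bookkeeping goes a long way. As a purely illustrative sketch (in Python, since neither article prescribes any particular tooling; the output file name is made up), recording the interpreter and package versions next to the analysis output at least lets someone else reconstruct the environment the results depended on:

# record_environment.py -- illustrative sketch: capture the software
# environment alongside an analysis run, so results can be traced back
# to the interpreter and package versions that produced them.
import json
import platform
import sys
from importlib import metadata

def environment_snapshot():
    """Return interpreter details and installed package versions."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {
            dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()
        },
    }

if __name__ == "__main__":
    # "environment.json" is a made-up name; archive it with the results.
    with open("environment.json", "w") as fh:
        json.dump(environment_snapshot(), fh, indent=2, sort_keys=True)
    print("Wrote environment.json")

Archiving a file like that next to the published results removes a lot of the guesswork when the code later refuses to run anywhere else.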

On Fri, Nov 26, 2021 at 11:11:09AM +1300, Lawrence D'Oliveiro wrote:
The fruits of scientific research are supposed to be open to all. A key part of this is the need for reproducibility -- the idea that somebody else can repeat the same experiments and analysis and (hopefully) come to the same conclusions. It has long been a common expectation that researchers will make their raw data available to others for this purpose,
Well, not necessarily. It is expected that the data acquisition and analysis have been described in sufficient detail that someone else could repeat the entire experiment, including the data acquisition. But if the data acquisition is difficult or impossible to repeat or replicate (for example, because it was a one-off event, or because it involves such expense that it is not practical for anyone else), then there is an expectation that the raw data should be provided.
but nowadays even that is likely no longer enough. The analysis of the data usually requires some particular piece of computer software, even if this was just some in-house scripting done on top of a commonly-available toolkit or package.
In other words, they have failed to describe the data analysis in sufficient detail that it is repeatable, which, indeed, is a major problem.

Cheers, Michael.
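To put that in concrete terms, one way of making the description sufficient is to make the script itself the description: a single entry point that records a checksum of the raw data it consumed and fixes any random seed, so that anyone with the same file and the same script gets the same numbers. A purely illustrative Python sketch (file and column names are made up, not taken from anyone's message):

# analyse.py -- illustrative sketch of a repeatable analysis entry point:
# record a checksum of the raw data and pin the random seed, so rerunning
# the script on the same file reproduces the same results.
import csv
import hashlib
import json
import random
import statistics

RAW_DATA = "raw_data.csv"   # hypothetical input file with a "value" column
RESULTS = "results.json"
SEED = 20211126             # fixed so any resampling is repeatable

def sha256_of(path):
    """Checksum of the input, stored with the results for provenance."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def analyse(path):
    """Toy analysis: plain mean plus a resampled (bootstrap-style) mean."""
    with open(path, newline="") as fh:
        values = [float(row["value"]) for row in csv.DictReader(fh)]
    rng = random.Random(SEED)
    resample = [rng.choice(values) for _ in values]
    return {"mean": statistics.mean(values),
            "resampled_mean": statistics.mean(resample)}

if __name__ == "__main__":
    report = {"input_sha256": sha256_of(RAW_DATA),
              "seed": SEED,
              "results": analyse(RAW_DATA)}
    with open(RESULTS, "w") as fh:
        json.dump(report, fh, indent=2)

If the published paper points at a script like this plus the checksummed data file, the "sufficient detail" question largely answers itself.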

IMO, in my area at least, if the code for a paper isn't available, or someone hasn't replicated the paper with code that is available, it's hard to get invested (i.e. "this paper is interesting... oh, the code isn't available"). Speaking of which, I was impressed by this repository: https://github.com/MaximeVandegar/Papers-in-100-Lines-of-Code (yeah, OK, it's code golf, but it's nice to see a minimal representation of different papers). There's also a point to note that open data and data sovereignty seem (?) to have a lot of friction.

Cheers, Matthew
participants (3):
- Lawrence D'Oliveiro
- Matthew Skiffington
- Michael Cree