How machine learning is affecting science reproducibility and how to solve it
Reproducibility is fundamental to scientific progress, but the increasing use of machine learning is undermining it. Why is reproducibility important? Why does the use of machine learning have a problematic side effect? And how can we solve it?
Not everything that shines is a diamond
In 2016 the scientific journal Nature published the results of a survey in which 1,576 researchers answered a brief questionnaire about reproducibility in research. The results showed that more than 70% of the scientists had failed to reproduce another researcher’s experiment, and more than 50% of the respondents declared that there is a reproducibility crisis.
The problem concerns all scientific disciplines: from medicine to biology, from economics to physics, from psychology to chemistry. The scientists identified two main causes: pressure to publish (“publish or perish”) and selective reporting. Others pointed out that low statistical power and technical difficulties can also play a role. Indeed, the p-value and other statistical methods are under scrutiny in the search for better ways to analyze data.
Researchers declared that fewer than 40% of their attempts to replicate published findings succeed. Moreover, many undergraduate students in the laboratory are frustrated by failed replications (sometimes to the point of burnout). In addition, even when a scientist manages to replicate a finding, the results are often far less impressive than in the original paper (the effect size is much smaller than declared). What looked like breakthrough findings frequently turn out to have much smaller impact.
“The definition of insanity is doing the same thing over and over again and expecting different results.” attributed to Einstein
However, the effects of the reproducibility crisis extend beyond academia. Industry often tries to reproduce researchers’ findings. For example, pharmaceutical companies scan promising research in order to develop potential new therapies, and they too encounter difficulties in replicating results. This is a contributing cause of the low success rate of Phase II clinical trials (especially in oncology).
Can you trust the machine?
In the last decade, machine learning has been influential in many different fields (from political science to psychology, from physics to biomedicine). For example, the CERN experiments and the new Webb Telescope produce enormous quantities of data. In medicine, there are thousands of electronic patient records, huge datasets of patient images, and so on. The same holds for biology, where data is accumulating thanks to the OMICS revolution.
Thus, data science and machine learning have found room in many research applications. However, a recently published article casts a shadow on the use of machine learning in science.
“Machine learning is being sold as a tool that researchers can learn in a few hours and use by themselves. But you wouldn’t expect a chemist to be able to learn how to run a lab using an online course” — Sayash Kapoor, speaking to Nature
The authors found various errors in the use of machine learning methods across scientific fields. They conducted a meta-analysis of 20 review articles (covering 17 research fields) and identified 329 research papers whose results could not be replicated, tracing the cause to the erroneous application of machine learning. The figure shows that the most prominent error, affecting most of the papers, was some form of data leakage.
Tape your pipe(line) well
Data leakage, in machine learning, occurs when the model has access during training to information it should not be allowed to see. For instance, no information should leak from the test set into the training set, because the model will be evaluated on the test set: if the model has already seen the answers, it will perform better in evaluation than it would in reality.
While some forms of data leakage are easy to prevent (such as the lack of a test set or duplicated entries), others are more subtle. A common error is the lack of clean separation between training and test data during preprocessing. Steps such as normalization and missing-value imputation should be fit on the training set only and then applied to the test set, and over/under-sampling should be applied to the training set alone. Less obviously, feature selection should also be based on the training set only (otherwise the model is informed about which features perform well on the test set).
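As a minimal sketch of how to keep preprocessing leakage-free, the snippet below (assuming scikit-learn and a synthetic dataset, not any dataset from the paper) splits the data first and then wraps imputation, scaling, feature selection, and the model in a single pipeline, so every step is fit on the training data only:

```python
# Leakage-free preprocessing sketch: split FIRST, then let a Pipeline
# fit every preprocessing step on the training data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Split before any preprocessing, so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each step learns its statistics (means, scales, feature scores)
# from X_train only; at predict time the test set is only transformed.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)  # unbiased held-out evaluation
print(f"held-out accuracy: {score:.2f}")
```

Fitting the scaler or the feature selector on the full dataset before splitting would leak test-set statistics into training, which is exactly the subtle error described above.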
The reality is that as a data scientist, you’re at risk of producing a data leakage situation any time you prepare, clean your data, impute missing values, remove outliers, etc. — Doing Data Science: Straight Talk from the Frontline
A common but harder-to-notice error is temporal leakage, where data points from later times end up in the training data: for example, a model meant to predict stock values that is trained on data points from the future. In 2011, researchers claimed that their model could predict the stock market using Twitter users’ moods; however, the model’s remarkable accuracy (87%) was due to a temporal leakage error. Other errors that are complex to handle are non-independence between training and test samples (for example, when the training and test sets contain images from the same patients) and sampling bias in the test distribution.
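Both pitfalls have standard remedies in scikit-learn’s splitters (assumed here, with toy data): a time-ordered split guarantees the model is trained only on the past, and a group-aware split keeps all samples from the same patient on one side of the split:

```python
# Sketch: avoiding temporal leakage and patient-level non-independence
# with scikit-learn's TimeSeriesSplit and GroupKFold (toy data).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GroupKFold

# --- Temporal leakage: always train on the past, test on the future. ---
X_time = np.arange(100).reshape(-1, 1)  # observations ordered by time
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X_time):
    # every training index precedes every test index
    assert train_idx.max() < test_idx.min()

# --- Non-independence: keep all images of a patient in one split. ---
rng = np.random.default_rng(0)
X_img = rng.random((12, 4))                 # 12 images, 4 features each
y_img = rng.integers(0, 2, 12)
patients = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # 4 patients

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X_img, y_img, groups=patients):
    # no patient appears in both training and test folds
    assert set(patients[train_idx]).isdisjoint(patients[test_idx])

print("splits are leakage-free")
```

A plain random split would fail both assertions: future points would land in training folds, and the same patient’s images would straddle the train/test boundary, inflating the measured performance.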
Interestingly, the authors showed that once the data leakage errors were corrected, the performance of many models dropped substantially.
Moreover, researchers often claim that their model outperforms other models. The authors showed that after correcting data leakage errors, complex models (such as Random Forest, AdaBoost, Gradient Boosted Trees, etc.) actually perform worse than simpler models like logistic regression. The table below shows a few examples:
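This is easy to check in one's own work: before claiming that a complex model wins, compare it against a simple baseline under the same leakage-free evaluation. The sketch below (synthetic data, scikit-learn assumed; it is an illustration, not the paper's benchmark) does exactly that with cross-validation:

```python
# Sketch: always compare a complex model against a simple baseline
# under identical, leakage-free cross-validation (synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=15, random_state=1)

baseline = LogisticRegression(max_iter=1000)
ensemble = RandomForestClassifier(n_estimators=200, random_state=1)

# Same folds, same metric, for a fair comparison.
acc_baseline = cross_val_score(baseline, X, y, cv=5).mean()
acc_ensemble = cross_val_score(ensemble, X, y, cv=5).mean()

print(f"logistic regression: {acc_baseline:.3f}")
print(f"random forest:       {acc_ensemble:.3f}")
```

If the gap between the two is small (or negative), the complex model's claimed advantage deserves scrutiny, which is precisely the pattern the authors found once leakage was removed.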
Reproducibility is fundamental in science. With the increasing use of machine learning and artificial intelligence in scientific fields, reproducibility is facing additional challenges. This lack of reproducibility also hampers the possibility of deriving applications from scientific papers.
Traditionally, before being published, scientific articles have to pass through peer review (a process in which the article is vetted by experts in the field). However, reviewers are increasingly difficult to find (since they are not paid), and they often have little knowledge of machine learning. Thus, methodological errors can easily pass unnoticed during review.
“By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.” — Eliezer Yudkowsky
In the article, the authors propose a model info sheet to ensure that a scientific paper detects and prevents data leakage. However, many scientific articles present incomplete methods sections and often do not release their code. Moreover, the code is often poorly written or inadequately documented, making it hard to reuse. Scientific journal editors should therefore be more careful in selecting reviewers, and should demand that code be released and documented according to standard guidelines.
For example, psychology has benefited greatly from the use of statistics, but careless use of statistics has created reproducibility problems. In the same way, machine learning (and artificial intelligence) has transformative power in scientific research, but it should be handled by experts (or at least in collaboration with people who know how to use it).
- About reproducibility crisis (here, here)
- About data leakage (here, here)
- Reproducibility in machine learning (here, here)
Here is the link to my Github repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.