Science faces a replicability crisis. This is well-known among scientists and even the general public. Various explanations have been proposed as to the cause of this crisis – some of them on this blog – but these proposals have usually been informal in nature. Recently, two scientists (Smaldino & McElreath, 2016) built a computational model to explain the replicability crisis.
Failing to replicate means that a previously significant finding is no longer significant when the original study or analysis is done again. In other words, the original study’s conclusion may well be false.
This is often due to what’s called a false positive, or Type I error. In science, a false positive means that a finding appears significant, but that there is actually no underlying relationship between the variables.
For example, a study might find that MiracleDrug™ significantly reduces the debilitating pain of a hangover compared to a control group (p < .05), so the authors conclude that MiracleDrug™ is a hangover cure. But it turns out that the authors didn’t test all that many subjects, and they didn’t do a great job of controlling for a whole slew of factors, such as age, quantity (and quality) of alcohol consumed, and previous experience with alcohol. A few months later, MiracleDrug™ is on the fast-track to FDA approval and a shiny new patent for the authors, when suddenly another group finds that MiracleDrug™ actually has no effect on a hangover; the original study’s significant finding was a false positive.
There are all sorts of factors that contribute to false positives, many of them well-documented. These include: poor control across conditions, problems with the experimental design, hasty statistical analysis, and more. But scientists are, for the most part, pretty aware of these problems; most introductory statistics classes cover them in detail. This raises the question: why do scientists continue to make the same errors, despite being told over and over again not to?
The tendency to see patterns that aren’t there is called apophenia. Unlike an epiphany, the patterns perceived in an apophany are fundamentally illusory. Humans are actually pretty good at seeing patterns, especially if they have an incentive to do so.
Smaldino & McElreath (2016) argue that the problem behind the replicability crisis is one of incentives. To understand their model, we’ll have to take a brief detour into evolutionary game theory.
Evolutionary Game Theory
Game theory provides a formalism for studying how agents make decisions, when those decisions involve other agents. This makes it particularly useful for studying how people cooperate with each other (or manipulate each other) under different conditions.
One of the most well-known applications of game theory is what’s commonly called the Prisoner’s Dilemma, in which two individuals (let’s call them Alice and Bob) can each choose either to cooperate (e.g. by “staying silent”) or not (by “betraying” the other). In most versions of the game, neither Alice nor Bob knows what the other is choosing, meaning they can’t make a fully informed decision. By manipulating the outcomes – and how much information Alice and Bob have, and whether that information is symmetric – researchers can study what affects a person’s decision to cooperate with or betray somebody else.
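To make the setup concrete, here’s a minimal sketch of a one-shot Prisoner’s Dilemma in Python. The specific payoff numbers are illustrative (any values with the same ordering produce the dilemma), not taken from any particular source:

```python
# Each entry maps (alice_move, bob_move) -> (alice_payoff, bob_payoff).
# Illustrative payoffs: mutual cooperation beats mutual betrayal, but
# betraying a cooperator pays best of all.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),   # both stay silent
    ("cooperate", "betray"):    (0, 5),   # Alice silent, Bob betrays
    ("betray",    "cooperate"): (5, 0),   # Alice betrays, Bob silent
    ("betray",    "betray"):    (1, 1),   # mutual betrayal
}

def play(alice_move, bob_move):
    """Return the (Alice, Bob) payoffs for one round."""
    return PAYOFFS[(alice_move, bob_move)]

# Betraying strictly dominates cooperating for each player, even though
# mutual cooperation gives the better joint outcome.
print(play("betray", "cooperate"))  # (5, 0)
```

With these payoffs, each player does better by betraying no matter what the other chooses, which is precisely what makes it a dilemma.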
An evolutionary perspective
Evolutionary game theory takes an evolutionary perspective to studying strategic decision-making. In biological evolution, traits that increase fitness (e.g. passing on one’s genes) are selected for, and traits that decrease fitness are not. Similarly, in evolutionary game theory, certain strategies will be more successful than others, and thus result in that strategy propagating throughout a population.
This is a slightly different tack than traditional applications of game theory. Instead of asking, “Which strategy has the highest payout under which parameters?”, evolutionary game theory asks, “Which strategy is the most successful at propagating itself in a population over time?”
One example of an evolutionary game theory model is “Hawk vs. Dove”. Imagine there are two kinds of agents in a population: Hawks and Doves. Hawks are aggressive, and if necessary will fight until they win or lose; Doves will run away from a fight, but will share a resource if their opponent is peaceful. Thus, a Dove will share resources with another Dove, but will always lose the resources to a Hawk; and a Hawk will always take resources from a Dove, but risks injury or death when it meets another Hawk.
Ultimately, the success of either strategy depends on the proportion of Hawks and Doves in the population. An evolutionary game theory model allows you to tinker with the population dynamics, then run thousands of iterations of agent encounters to simulate the passing of generations, after which point you can reassess.
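As a rough illustration of how such a model works, here’s a sketch of Hawk vs. Dove under the standard replicator equation. The resource value V and fight cost C are illustrative numbers (chosen so the equilibrium lands at a 20% Hawk share), and this deterministic update is a simplification of an actual agent-based simulation:

```python
# Replicator-dynamics sketch of Hawk vs. Dove.
# V = value of the contested resource, C = cost of losing a fight.
V, C = 2.0, 10.0

def step(p_hawk, rate=0.1):
    """One generation: a strategy grows in proportion to its payoff advantage."""
    # Expected payoff of a Hawk against a random opponent
    f_hawk = p_hawk * (V - C) / 2 + (1 - p_hawk) * V
    # Expected payoff of a Dove against a random opponent
    f_dove = (1 - p_hawk) * V / 2
    f_mean = p_hawk * f_hawk + (1 - p_hawk) * f_dove
    return p_hawk + rate * p_hawk * (f_hawk - f_mean)

p = 0.5                      # start with half Hawks
for _ in range(2000):        # simulate many generations
    p = step(p)
print(round(p, 3))           # converges toward V / C = 0.2
```

Whatever the starting mix, the population drifts back to the same equilibrium Hawk share of V/C, which is exactly the “evolutionarily stable state” behavior described in the footnotes.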
Evolutionary game theory is a nice approach because it can be applied to any number of domains: political science (e.g. diplomatic strategies), romance (e.g. the success of different pick-up lines), or in this case, scientific research.
The Natural Selection of Bad Science
Smaldino & McElreath (2016) describe an evolutionary game theory model for testing the success of different scientific strategies. Here, the agents are different research labs, which all put different degrees of effort into their research.
A lab with a successful strategy will procure more funding, gain more prestige, and ultimately pass on this strategy to subsequent generations of researchers in that lab.
Stage One: Science
In the first stage of the model, the labs are initialized. Each lab has the goal of investigating some hypothesis, hᵢ, which in turn has an underlying probability of being true, p(hᵢ). Each lab also exerts some amount of effort, eᵢ, towards “using high quality experimental and statistical methods” (a value between 1 and 100). Another way to think of “effort” is as “rigor” – i.e. the amount of rigor with which a lab designs its experiments and conducts its analyses.
Finally, each lab is associated with a typical statistical power, Wᵢ, meaning that lab’s probability of correctly confirming a true hypothesis: P(significant | not null).
Using a combination of these terms, a false-positive rate (αᵢ) can be computed for each lab using the following formula:

αᵢ = Wᵢ / (1 + (1 − Wᵢ)eᵢ)
The authors write:
What this functional relationship reflects is the necessary signal-detection trade-off: finding all the true hypotheses necessitates labelling all hypotheses as true. Likewise, in order to never label a false hypothesis as true, one must label all hypotheses as false. Note that the false discovery rate—the proportion of positive results that are in fact false positives—is determined not only by the false-positive rate, but also by the base rate, b.
Put more simply, to both detect truly “not null” effects (the lab’s “power”) and correctly reject truly “null” ones (avoiding false positives), a lab must invest higher effort. Higher effort means better, more reproducible research, but it also means each hypothesis will take longer to investigate. Lower effort means that you can put out more research in a shorter time period, but you also have a higher chance of producing a false positive – i.e. finding that hᵢ is true, even though in actuality, it’s false.
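To see the trade-off numerically: as I read the model, power, effort, and the false-positive rate are related by α = W / (1 + (1 − W)·e). A quick sketch, treating that relationship as given:

```python
# Sketch of the power / false-positive trade-off:
#   alpha = W / (1 + (1 - W) * e)
# where W is statistical power and e is effort (roughly 1 to 100).
def false_positive_rate(power, effort):
    return power / (1 + (1 - power) * effort)

# With a power of 0.8, an effort of 75 yields the conventional alpha of .05;
# slashing effort inflates alpha sharply.
print(round(false_positive_rate(0.8, 75), 3))   # 0.05
print(round(false_positive_rate(0.8, 15), 2))   # 0.2
```

So a lab that cuts its effort from 75 to 15 quadruples its false-positive rate while churning out results much faster.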
At the end of this stage, the labs attempt to publish their data and communicate their research to the community. Different results have different pay-offs:
- Novel results have a pay-off of 1.
- Positive replications have a pay-off of .5.
- Negative replications have a pay-off of .1.
- Having one’s own novel results fail to replicate has a pay-off of -100.
This is meant to represent the ways in which different results are rewarded in the scientific community. Making new discoveries is particularly prized, while replications are not as sought after, and producing research that can’t be replicated (e.g. a false positive) often hurts a lab’s reputation.
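Here’s a hedged sketch of how one lab’s round might look, using the pay-offs listed above. The base rate of true hypotheses (b), the power (W), and the false-positive rate (α) are illustrative numbers, not the paper’s exact defaults, and the publication logic is simplified (negative novel results simply go unpublished, and replications are omitted):

```python
import random

rng = random.Random(42)

# Pay-offs from the list above; only "novel" is used in this simplified round,
# since replication attempts are omitted here.
PAYOFF = {"novel": 1.0, "positive_replication": 0.5,
          "negative_replication": 0.1, "own_result_fails": -100.0}

def one_round(b=0.1, W=0.8, alpha=0.2):
    """One lab tests one hypothesis; returns (payoff, hypothesis_was_true)."""
    h_true = rng.random() < b                       # is the hypothesis really true?
    positive = rng.random() < (W if h_true else alpha)
    if not positive:
        return 0.0, h_true                          # negative result: no publication
    return PAYOFF["novel"], h_true                  # novel positive result published

# Because true hypotheses are rare (b = 0.1), most published positives
# turn out to be false discoveries even at this alpha.
rounds = [one_round() for _ in range(10_000)]
published_true = [was_true for payoff, was_true in rounds if payoff > 0]
print(round(1 - sum(published_true) / len(published_true), 2))  # false discovery rate
```

Note the asymmetry this creates: the lab is paid the same for a true positive and a false one, so nothing in the round itself rewards rigor.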
Stage Two: Evolution
In the second stage, under-performing labs are selected to die. Labs with the best pay-offs are chosen to “reproduce”, producing a new lab with the same strategy (i.e. the same level of effort). After this selection process, the model starts again from the beginning.
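A minimal sketch of one such selection step, with simplified details (which lab dies, how effort mutates) that don’t follow the paper exactly:

```python
import random

rng = random.Random(0)

def evolution_step(labs):
    """One generation: the lowest-payoff lab dies, the highest reproduces.

    labs: list of dicts with 'effort' and accumulated 'payoff'.
    The child inherits its parent's effort with a small random mutation,
    and starts with zero accumulated payoff.
    """
    labs = sorted(labs, key=lambda lab: lab["payoff"])
    labs.pop(0)                               # lowest-payoff lab dies
    parent = labs[-1]                         # highest-payoff lab reproduces
    child_effort = min(100, max(1, parent["effort"] + rng.gauss(0, 1)))
    labs.append({"effort": child_effort, "payoff": 0.0})
    return labs
```

Iterating this step is what lets strategies (effort levels) propagate: whatever earns the most pay-off per generation gradually takes over the population, regardless of whether it produces true findings.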
Results: Bad science replicates itself due to insufficient replication
So what did the authors find, after running this model for over a million iterations?
Basically, low effort begat low effort. Even though investing more effort led to better, more reproducible research, low effort was more likely to result in publishable findings – regardless of whether or not those findings were actually true, or just a case of apophenia. Thus, over time, the mean effort of all labs gradually decreased, meaning low effort was selected for.
This, of course, coincided with a rapid rise in the false discovery rate – the rate at which scientists confirmed hypotheses that were actually false.
Even replication didn’t help the situation much. The authors included a pay-off incentive for replication in their model, but because it’s less than the incentive for novel research, it was “less lucrative than reducing effort and pursuing novel research”.
This creates something of a vicious cycle. Low-effort research leads to more novel findings (whether or not they’re true), which is good for a lab. Ordinarily, those false discoveries would be kept in check by careful replications; but because replications aren’t valued as highly as novel findings, all labs ultimately have an incentive to invest in low-effort novel research rather than replications – which means the false discoveries continue to be treated as true!
Points and counterpoints
Smaldino & McElreath (2016) showed how a system of broken incentives can lead scientists with the best intentions towards low-effort research that can’t be replicated. Even more concerning, because replication wasn’t incentivized enough in this model, there were no checks and balances for this low-effort research, meaning it could proliferate through the community.
As I pointed out before, this is a systemic problem. Systemic problems are hard to solve because there’s no particular individual or group of individuals to blame; everyone is just acting in their own self-interest.
Smaldino & McElreath (2016) discuss several solutions. One kind of solution addresses institutional change, e.g. forcing high-end journals to adopt stricter standards for publication. This should limit “the extent to which individuals can game the system”, but as the authors point out, gaming the system is still possible, and as long as there’s an incentive to do so, it will be done.
A related solution is punishing individuals whose findings can’t be replicated. But this is problematic for a couple of reasons. First of all, it’s unclear exactly what form (and extent) this punishment should take, and even how the detection/enforcement process should work; false positives are common, but so are false negatives. Second, a harsh negative payoff (-100) wasn’t even shown to be particularly effective in tempering the incentives for low-effort research.
A more direct (but difficult) solution would be changing the incentives for success. But again, it’s unclear: a) what exactly this means (how does one measure “good research”?); and b) how one would accomplish it. Doing this would mean changing the incentives at all levels of the scientific process, not just at the level of individual scientists. Currently, many funding and grant agencies measure success in the form of “deliverables” – published papers, usable products, and so on. This means that scientists trying to invest in high-quality, long-term research would frequently come into conflict with their funding agencies (which, of course, already happens), since those agencies want quantifiable results as soon as possible.
This evolutionary model paints a pretty bleak picture of science; indeed, the authors point out, “our model presents a somewhat pessimistic picture of the scientific community”.
Despite the difficulty of implementing their suggested solutions, however, I (somewhat boldly) propose that there is a path forward. This belief rests on several key assumptions:
- Individual scientists are, for the most part, fundamentally interested in doing good science; “bad science” is, as Smaldino & McElreath (2016) point out, largely a consequence of a system that encourages quantity over quality of publications.
- Systemic change can be accomplished through a combination of top-down and bottom-up pressures. In this case, these correspond, roughly, to institutional changes (e.g. journal standards) and social changes (e.g. a shift in social pressures towards more careful science).
- The proliferation of false discoveries is due in large part to less careful science. False positives will always be a problem in science, of course, but: a) they can be significantly reduced by more rigorous analyses, such as using false discovery rate methods (Benjamini & Hochberg, 1995; Benjamini & Yekutieli, 2001) and careful pre-registration of planned analyses; and b) they can be more frequently detected through repeated replications.
All of these assumptions are up for debate, but I believe that they can each be defended. In a future post, I will present a more detailed proposal for how to address the replicability crisis.
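As a concrete example of the false discovery rate methods mentioned above, here’s a minimal sketch of the Benjamini–Hochberg step-up procedure (Benjamini & Hochberg, 1995), which controls the expected proportion of false discoveries among rejected hypotheses:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Compare each p-value (in ascending order) to its step-up threshold.
        if p_values[i] <= q * rank / m:
            k = rank                          # largest rank passing the test
    return sorted(order[:k])                  # reject the k smallest p-values

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.2, 0.6]))  # [0, 1]
```

Note how the raw threshold of .05 would have rejected four of these six hypotheses, while the FDR-controlled procedure rejects only two.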
For now, it’s important to recognize that there are institutions actively working on solving the problem. Scientists obviously care about the state of science, and have a stake in ensuring that science improves. One particularly impressive initiative is the Open Science Framework (OSF). As I mentioned previously, the OSF allows scientists to pre-register their analyses, share data and stimuli, and ultimately collaborate on both new studies and replication projects. The OSF has recently launched a pre-registration challenge, which aims to increase the number of scientists who pre-register their analyses. They provide incentives in the form of social pressures (e.g. a “leaderboard” comparing institutions by number of pre-registered studies) and financial incentives (a chance to win a small grant if you pre-register your study).
It’s not perfect – for example, it’s unclear how one would prove that they pre-registered their analysis before collecting data – but it’s a start.
McElreath, R., & Smaldino, P. E. (2015). Replication, communication, and the population dynamics of scientific discovery. PLoS One, 10(8), e0136088.
Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. Royal Society Open Science, 3(9), 160384.
Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165–1188.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.
 The notion of “significance” in science is a tricky one. Significance generally refers to the p-value, or probability value, reported in a study; roughly, a p-value is the probability that some observed effect (such as the relationship between Variable A and Variable B) would be observed at an equal or greater magnitude under the null hypothesis (such as the hypothesis that A and B are not related). Thus, a p-value of .05 (which, until recently, was taken as the standard of “significance” in many social sciences, though there have been recent calls to decrease this marker to .001) means that there’s a 5% chance of observing an effect at least this large if the null hypothesis were true; since 5% is pretty small, this often results in rejecting the null hypothesis. So if a previously significant finding is no longer significant, that generally means that the p-value is larger (e.g. it went from .04 to .2).
 I’m going to ignore, here, the problem of distinguishing between “real” and “illusory” patterns. It does raise some deep questions regarding the access we have to reality or what is out there, so to speak, but the point of science is to try to overcome this tendency to perceive false patterns by imposing a rigorous objectivity on the whole enterprise.
 You can imagine that this is particularly useful for political purposes, e.g. when trying to get two criminals to betray each other (hence the name).
 In the case of Hawk vs. Dove, the population supposedly settles into an evolutionarily stable state (ESS) of 80% Doves and 20% Hawks, meaning that any disturbance (such as the introduction of new Hawks or Doves) ultimately results in the population settling back into that same 80/20 split after enough generations.
 That said, in my opinion, making it more difficult to game the system, combined with increased social pressures for high-effort research, should act as significant bottlenecks to active deception. I’ll discuss this more in the end, and in a later post.
 Pre-registration is important for avoiding the problem of “p-hacking”, in which a scientist runs many different analyses until one comes out as significant.