P-hacking and false discoveries

The term “p-hacking” has made its way into the public discourse surrounding science, particularly regarding the replicability crisis. But although the term suggests intentional malevolence on the part of the scientist, it’s actually a scenario that many well-trained scientists can fall into. So what is p-hacking, and why is it dangerous?

Significance and the elusive “p”

First, the name. The “p” in “p-hacking” refers to the p-value of a statistical test. The p-value, or probability value, is often used as a metric for whether a given finding is statistically significant.

Specifically, a p-value is the probability of obtaining a statistical result (e.g. a difference between the means of two distributions) at least as extreme as the one observed, assuming the null hypothesis is true. And while “null hypothesis” sounds like intimidating jargon to those not familiar with research methods, all it really amounts to is the assumption that the independent variable(s) you’re testing have no effect on your dependent variable.

For example, you might wonder: is a person’s gender predictive of their height? It seems like men are usually taller than women, but maybe you should run a statistical analysis to find out. To test this, you collect data from 500 men and 500 women, and find that they distribute roughly like so:

Figure 1: Density plot for height distributions by gender. Data produced by randomly generating normal distributions around two means (71, 65) with SD of 1.

Are these distributions significantly different? One way to answer this question is to run an unpaired two-sample t-test, which compares the two distributions using their means and variances.

This t-test yields a test statistic, which is assigned some p-value. This p-value is meant to measure how likely it would be to obtain a test statistic at least this large if gender had no effect on height. In this case, running a t-test in R yields an extremely small p-value (p < 2 × 10^-16) – which suggests that the distributions are significantly different. (Of course, this test was run on fake, randomly generated data, just for the purposes of illustration. Statistical tests run on actual data rarely yield such small p-values.)
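To make the comparison concrete, here is a minimal Python sketch of the same idea. It is not the R code behind the figure: the means and SD come from the Figure 1 caption, the seed and the helper names (`welch_t`, `two_sided_p`) are my own, and the p-value uses a normal approximation to the t distribution, which should be reasonable at this sample size.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def two_sided_p(t):
    """Two-sided p-value, approximating the t distribution with a normal.

    This is an assumption made to stay within the standard library; it is
    only sensible when the samples are large (hundreds of observations).
    """
    return math.erfc(abs(t) / math.sqrt(2))


rng = random.Random(0)  # arbitrary fixed seed so the sketch is repeatable
men = [rng.gauss(71, 1) for _ in range(500)]
women = [rng.gauss(65, 1) for _ in range(500)]

t = welch_t(men, women)
p = two_sided_p(t)
print(f"t = {t:.1f}, p = {p:.3g}")
```

With groups this far apart, the p-value is too small for floating point to represent, so it may simply print as 0.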

The point at which a p-value becomes “significant” is largely arbitrary. In many fields, p<.05 is considered a “significant” finding, though recently – particularly in light of the replicability crisis – there’s been a lot of talk in the scientific research community about making our threshold for significance more stringent (e.g. p<.001).

P-hacking: An Example

P-hacking – also called data dredging, or fishing – is when researchers mine their data for statistically significant relationships. Exploratory research isn’t inherently bad, of course; it’s an important part of any field. The problem is that when you run many, many statistical tests on the same data, your chance of incorrectly rejecting a true null hypothesis (i.e. getting a false positive) increases substantially. A result with p=.05 is one that, if the null hypothesis were true, you would expect to see (or exceed) 5% of the time. 5% might seem low, but the more tests you run, the more likely it is that you’ll end up seeing something that isn’t there – a case of apophenia.
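The arithmetic behind that intuition is short. As a rough sketch, and assuming the tests are independent (which real analyses on a single dataset often are not), the chance of at least one false positive across n tests is 1 - (1 - alpha)^n. The function name below is my own.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests,
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 20, 100, 500):
    print(f"{n:>3} tests: {familywise_error_rate(n):.3f}")
```

Under these assumptions, 20 tests at the .05 threshold already give roughly a 64% chance of at least one spurious “significant” result.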

Consider the example from above. To generate that data, I produced two normal distributions centered around separate means (71, 65), both with a standard deviation of 1. The distributions are pretty clearly different.

Now, if I shuffle the labels on each data-point – e.g. randomly permute all of the “male” and “female” labels – the mappings between gender and height should be truly random.

If I run a t-test on the shuffled data, the result should be a small test statistic, with a non-significant p-value. If I do this 500 times, it’s akin to testing the data against 500 variables, none of which has a “true” relationship with height – meaning that, ideally, none of these tests should come out significant.

I shuffled the data 500 times, and ran a t-test with each iteration. After each t-test, I extracted the corresponding p-value – just to see whether any of these results would be considered “significant” using the canonical threshold. The resulting distribution is below:

Figure 2: P-values generated from running a t-test on randomly shuffled data 500 times. The red line indicates the canonical significance threshold, p=.05.

As you can see in the figure, most of the p-values were far above .05. But 18 out of 500 (3.6%) fell below it, ranging between .01 and .05 – despite being the result of randomly shuffled data. The problem is simply that if we run a battery of statistical tests on a bunch of variables – here, “a bunch of variables” is simulated by generating random labels 500 times – it becomes increasingly likely that some of that random noise will start to look significant.
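The shuffling experiment can be approximated in a few lines of Python. This is a sketch rather than the original analysis: the seed, the helper names and the normal approximation for the p-value are my own assumptions, so the exact count of false positives will differ from the 18 reported above.

```python
import math
import random


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / (n - 1)


def welch_p(a, b):
    """Two-sided Welch t-test p-value, using a normal approximation
    (an assumption that is reasonable for groups of 500)."""
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    t = (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))
    return math.erfc(abs(t) / math.sqrt(2))


rng = random.Random(1)  # arbitrary seed
heights = [rng.gauss(71, 1) for _ in range(500)]
heights += [rng.gauss(65, 1) for _ in range(500)]
labels = ["male"] * 500 + ["female"] * 500

p_values = []
for _ in range(500):
    rng.shuffle(labels)  # break any real link between label and height
    men = [h for h, g in zip(heights, labels) if g == "male"]
    women = [h for h, g in zip(heights, labels) if g == "female"]
    p_values.append(welch_p(men, women))

false_positives = sum(p < 0.05 for p in p_values)
print(f"{false_positives} of {len(p_values)} shuffles gave p < .05")
```

Because the labels carry no information after shuffling, every p-value under .05 here is a false positive by construction, and on average about 5% of the shuffles should land there.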

Let’s pretend that each of these 500 permutations was actually a separate binary categorical variable. The more relationships we test, the more likely we are to see a spurious correlation. If a researcher isn’t careful, they might draw hasty – and possibly untrue – conclusions from those correlations.

What does this mean?

This doesn’t mean exploratory data analysis is misguided. Exploratory research is foundational to generating new hypotheses about the world. The key, then, is to use these hypotheses (e.g. the 18 “significant” findings from above) to motivate confirmatory research.

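There are also statistical remedies for the multiple-testing problem behind p-hacking, most of which involve adjusting your significance threshold based on the number of statistical tests you run. One family of these corrections controls the false discovery rate: the expected proportion of “significant” findings that are actually false positives (Benjamini & Yekutieli, 2001).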
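As an illustration of what such a correction looks like, here is a minimal sketch of the step-up false discovery rate procedure. With `dependent=False` it follows the Benjamini-Hochberg rule; with `dependent=True` it applies the more conservative harmonic-sum adjustment from Benjamini & Yekutieli (2001). The function name, its parameters and the example p-values are hypothetical, chosen only for demonstration.

```python
def fdr_reject(p_values, alpha=0.05, dependent=False):
    """Flag which hypotheses a step-up FDR procedure would reject.

    dependent=False: Benjamini-Hochberg thresholds.
    dependent=True:  Benjamini-Yekutieli thresholds, which divide alpha
                     by the harmonic sum 1 + 1/2 + ... + 1/m.
    """
    m = len(p_values)
    c_m = sum(1 / i for i in range(1, m + 1)) if dependent else 1.0
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at or under its threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Made-up p-values, for demonstration only.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(sum(p < 0.05 for p in example), "significant without correction")
print(sum(fdr_reject(example)), "rejected with Benjamini-Hochberg")
print(sum(fdr_reject(example, dependent=True)), "rejected with Benjamini-Yekutieli")
```

The point of the comparison is that several results that clear an uncorrected .05 threshold no longer count as discoveries once the number of tests is taken into account.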

P-hacking can also happen in the lab. Even the best-intentioned experimentalists can get carried away with running different analyses on their data, “just to see”. This is why, as an experimentalist, it’s very important to pre-register your analyses – pre-registration allows you to identify which analyses were planned, and which were exploratory. The exploratory analyses can still yield valuable insights, but it’s important to recognize that they weren’t generated with the same rigor as the pre-registered analyses – which, ideally, were motivated by the study’s design and control conditions.

References

Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4), 1165-1188.

Footnotes

In most statistical tests, p-values are obtained by comparing your summary statistic to the distribution of summary statistics you’d expect under the null hypothesis. One problem with this methodology is that parametric tests such as the t-test assume your data (or at least the sampling distribution of your statistic) is approximately normal – which isn’t always true. One solution to this is to use randomization and bootstrapping methods, in which you shuffle or resample your data many, many times to build an estimate of the null distribution from the variability in your own data.
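A randomization (permutation) test of this kind can be sketched as follows. The function name, the default number of shuffles, the seeds and the small simulated groups are all my own assumptions, and the choice of the difference in means as the summary statistic is just one option.

```python
import random


def permutation_p_value(a, b, n_shuffles=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign observations to groups at random
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_shuffles + 1)


rng = random.Random(2)  # arbitrary seed
group_a = [rng.gauss(71, 1) for _ in range(50)]
group_b = [rng.gauss(65, 1) for _ in range(50)]
print(permutation_p_value(group_a, group_b))
```

Here the p-value is simply the share of shuffles that produce a difference at least as large as the observed one, so no normality assumption is needed.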