I was really sad to hear the Reasonable Doubts podcast was throwing in the towel; their thoughtful brand of counter-apologetics was something I’d never heard before, and I didn’t realize how much I’d appreciated it until it was gone.

Something bugged me about their final episode, though. Dr. Prof. Luke Galen talked about the reproducibility issues in psychology, something I’ve covered too, but both of us missed out on an important aspect: statistical power.

In Neyman-Pearson hypothesis testing, there are two categories of error: rejecting the null hypothesis when you shouldn’t, labelled Type I, or failing to reject the null hypothesis when you should, sensibly called Type II. Each has its own probability, which is fixed before the trial even starts. Statistical power is simply one minus the Type II error probability.
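To make "one minus the Type II error probability" concrete, here is a minimal sketch using a two-sided, two-sample z-test as a stand-in. The test, the 5% Type I rate, the medium effect size of 0.5, and the 64 subjects per group are all my assumptions for illustration; none of them come from the podcast or the papers cited here.

```python
from statistics import NormalDist


def z_test_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is a standardized mean difference (Cohen's d).
    Equal group sizes and a known unit variance are assumed.
    """
    std_normal = NormalDist()
    # Rejection threshold, fixed in advance by the Type I error rate.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic if the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical study: medium effect, 64 subjects per group.
power = z_test_power(effect_size=0.5, n_per_group=64)
type_ii = 1 - power
print(f"power = {power:.3f}, Type II error probability = {type_ii:.3f}")
```

Under those assumptions the power comes out at roughly 80%, and both numbers are known before a single subject is recruited.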

Type I errors usually get all the attention, partly because they’re confused with p-values, but mostly because we’re much more worried about losses than gains. This fixation on losses is a big problem, because we tend to think of failing to replicate a study as proving the null hypothesis is true.

Not so! Suppose the power of my study is 80%, meaning that if the effect is as big as I think it is, I’ll reach statistical significance 80% of the time. I run my study, and fail to reach statistical significance. There are three possible reasons:

• The effect doesn’t exist.
• The effect exists, but is weaker than I thought.
• The effect is as strong as I thought but I just got unlucky, which happens 20% of the time.

Only one of those reasons involves proving the null hypothesis true, and it’s inseparable from another reason. This also ignores the fact that, much of the time, we already know the null hypothesis is false.[1]

This would seem to rig the system against the null hypothesis, except for that last reason. Suppose four other researchers attempt flawless replications of my work, and unlike me they all reject the null. Now our interpretation has changed dramatically: it’s more likely my “failure” was due to bad luck, and not a non-existent or weaker effect. So statistical power or Type II errors are quite critical, especially when it comes to replication.
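The replication scenario above can be sketched with a little binomial arithmetic: how likely is "four rejections out of five studies" if the effect is real, versus if the null is true? This is only a rough illustration. The 80% power is from my example, but the 5% Type I rate is an assumption I'm adding, and so is the idea that all five studies are independent and identically powered.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


power = 0.80  # chance of rejecting the null when the effect is as big as I think
alpha = 0.05  # ASSUMED chance of rejecting the null when there is no effect
n_studies, n_rejections = 5, 4  # my failure plus four successful replications

p_if_effect_real = binomial_pmf(n_rejections, n_studies, power)
p_if_null_true = binomial_pmf(n_rejections, n_studies, alpha)
likelihood_ratio = p_if_effect_real / p_if_null_true

print(f"P(4 of 5 reject | effect real) = {p_if_effect_real:.4f}")
print(f"P(4 of 5 reject | null true)   = {p_if_null_true:.2e}")
print(f"likelihood ratio               = {likelihood_ratio:,.0f}")
```

Under those assumptions, the data are thousands of times more probable if the effect is real, which is why one lonely failure looks like bad luck once the other four results are in.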

… statistical power is generally low. For first tests, mean power is 13–16% to detect a small size effect and 40–47% to detect a medium effect. This is far lower than the generally recommended 80% (Cohen, 1988: 56) or 95% (Peterman, 1990). Using the 80% criterion for first statistical tests, only 2.9%, 21.2%, and 49.8% of 697 cases had the requisite power to detect a small, medium, or large effect, respectively. Likewise, for second tests, only 1.8%, 13.2%, and 36.5% of the 665 cases had sufficient power. If we only consider those tests that reported nonsignificant relationships, the equivalent figures are 1.4%, 17.8%, and 47.0% for the 219 first tests and 1.3%, 10.8%, and 32.5% for the 314 second tests.[2]

The median power for small, medium, and large effects was .14, .44, and .90, respectively. Twenty-four years earlier, the median power was .17, .46, and .89, respectively. As a general result, therefore, the power has not increased after 24 years. …

Remarks on power were found in only two cases, and nobody estimated the power of his or her tests. In four additional articles, alpha was mentioned, either by saying that it was set at a certain level (.05) before the experiment or by referring to the danger of alpha inflation. No author discussed why a certain alpha or n was chosen or what effect size was looked for. This first result shows that concern about power is almost nonexistent, at least in print.[3]

If the typical power of a study is 45% or less, then 55% or more of all replications will fail to hit statistical significance even if the original study accurately found a legitimate effect. So it shouldn’t be much of a surprise when a major study finds 63% of studies fail to replicate,[4] because that’s what you’d expect from the current scientific record.
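The arithmetic behind that is short enough to write out. This sketch assumes the best case for the original study: the effect is real, it's as large as claimed, and the replication has the same power. The median power figures are the ones Sedlmeier and Gigerenzer report above; the function name is just something I made up.

```python
def expected_failure_rate(power):
    """Share of faithful replications that miss significance
    even though the effect is real (the Type II error probability)."""
    return 1 - power


# Median power by effect size, as quoted from Sedlmeier and Gigerenzer.
median_power = {"small": 0.14, "medium": 0.44, "large": 0.90}

expected_failures = {
    size: expected_failure_rate(p) for size, p in median_power.items()
}

for size, rate in expected_failures.items():
    print(f"{size:>6} effect: {rate:.0%} of replications expected to fail")
```

For a medium effect, that's a bit more than half of all replications failing without anyone having done anything wrong.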

Are there other problems with the scientific process? Most certainly, but it would be nice if our studies were more accurate than a coin flip.

[1] Cohen, Jacob. “Things I have learned (so far).” American Psychologist 45.12 (1990): 1307-8.

[2] Jennions, Michael D., and Anders Pape Møller. “A survey of the statistical power of research in behavioral ecology and animal behavior.” Behavioral Ecology 14.3 (2003): 438-445.

[3] Sedlmeier, Peter, and Gerd Gigerenzer. “Do studies of statistical power have an effect on the power of studies?” Psychological Bulletin 105.2 (1989): 309.

[4] Open Science Collaboration. “Estimating the reproducibility of psychological science.” Science 349.6251 (2015): aac4716.

[HJH 2015-11-30: Fixed minor typo.]