Sitting comfortably? Good, because I have some bad news: traditional hypothesis testing, in the form widely used across the sciences, is horrifically broken. Why? I’ll retry Bem’s test with classical hypothesis testing, but this time plug in random numbers for the number of successes:
Null Hypothesis: A subject's selections come from chance alone.
Alternate Hypothesis: A subject's selections cannot be explained by chance, and therefore must be precognition.
Results: Out of 5400 trials over a binary choice, 2676 guessed correctly.
Odds under Null Hypothesis: Using a one-tailed Binomial distribution, we calculate p = 0.52.
Conclusion: As p > 0.05, we fail to reject the null hypothesis.
“Fail to reject?” But we already know the number is random! Maybe a second run will give a different result:
Null Hypothesis: A subject's selections come from chance alone.
Alternate Hypothesis: A subject's selections cannot be explained by chance, and therefore must be precognition.
Results: Out of 5400 trials over a binary choice, 2759 guessed correctly.
Odds under Null Hypothesis: Using a one-tailed Binomial distribution, we calculate p = 0.11.
Conclusion: As p > 0.05, we fail to reject the null hypothesis.
Nope. More tries!
|Success||p-value (1 tail)||Conclusion|
|2676||0.5224436379||Fail to Reject|
|2759||0.1113384143||Fail to Reject|
|2668||0.3912692183||Fail to Reject|
|2618||0.0265355239||Reject the Null|
Uh oh, we just rejected a hypothesis we know to be true! This isn’t too surprising, as with a p-value of 0.05 there’s a one in four chance we’ll reject the null hypothesis by accident. But if it’s easy to run these experiments, and you’re only rewarded for rejecting your null hypothesis, then you could wind up in a situation where most of the published research is false.
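For anyone who wants to check the table without a spreadsheet, here is a minimal Python sketch of the binomial tail calculation. It assumes every guess is a fair coin flip (success probability 0.5), and the function names are mine, purely for illustration. It prints the lower tail, the upper tail, and twice the smaller of the two, so the different tail conventions can be compared against the table; the doubled value is the one that appears closest to the figures listed there.

```python
from math import comb

N_TRIALS = 5400  # trials per run, as in the write-ups above


def binomial_tails(successes, trials):
    """Exact (P(X <= successes), P(X >= successes)) when each guess is a fair coin flip."""
    total = 2 ** trials
    # Count the outcomes with at most `successes` correct guesses.
    lower = sum(comb(trials, i) for i in range(successes + 1))
    # The upper tail includes the observed count itself, so add that term back.
    upper = total - lower + comb(trials, successes)
    return lower / total, upper / total


def doubled_smaller_tail(successes, trials):
    """Twice the smaller tail probability, capped at 1."""
    lower, upper = binomial_tails(successes, trials)
    return min(1.0, 2.0 * min(lower, upper))


if __name__ == "__main__":
    for successes in (2676, 2759, 2668, 2618):
        lower, upper = binomial_tails(successes, N_TRIALS)
        doubled = doubled_smaller_tail(successes, N_TRIALS)
        print(successes, f"{lower:.4f}", f"{upper:.4f}", f"{doubled:.4f}")
```

Integer arithmetic keeps the tails exact until the final division, which sidesteps floating-point underflow for 5400 trials.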
All isn’t lost, though, because we still have Bayes’ Theorem to help combine multiple results together. By combining a lot of experiments, maybe we can find solid support for the null hypothesis.
|Success||p-value (1 tail)||Bayes Factor||Cum. BF|
[HJH 2015-07-11] And indeed it isn’t right, because I’ve made the common mistake of confusing p-values with Neyman-Pearson Type I errors. p-values are a very different beast, which I explain in detail elsewhere. One good bit of evidence for the weirdness of p-values is how you combine them; if they were the probability of falsely rejecting the null hypothesis, you could chain them like I did above. Instead, the best way to combine p-values is:
- Sum up all the logarithms of the p-values.
- Multiply by -2.
- Look up the result in a chi-square distribution with degrees of freedom equal to twice the number of p-values you’re combining.
- This results in a single p-value, representing all that you’ve combined.
Oh well, that’s easy enough to do in a spreadsheet.
|Success||p-value (1 tail)||-2ln(p)||Cumulative||D.O.F||Cum. X^2|
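The combining recipe listed above also fits in a few lines of Python. This is only a sketch: the function names are made up for illustration, and the chi-square lookup uses a closed form that is valid only when the degrees of freedom are even, which is always the case here since they are twice the number of p-values.

```python
from math import exp, log


def chi2_upper_tail_even_dof(x, dof):
    """P(chi-square with `dof` degrees of freedom > x); closed form for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive, even number of degrees of freedom")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, dof // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return exp(-half) * total


def combine_p_values(p_values):
    """Return (statistic, degrees of freedom, combined p-value)."""
    statistic = -2.0 * sum(log(p) for p in p_values)  # sum the logs, multiply by -2
    dof = 2 * len(p_values)                           # twice the number of p-values
    return statistic, dof, chi2_upper_tail_even_dof(statistic, dof)


if __name__ == "__main__":
    # The four p-values from the first table.
    table = [0.5224436379, 0.1113384143, 0.3912692183, 0.0265355239]
    for count in range(1, len(table) + 1):
        print(count, combine_p_values(table[:count]))
```

A handy sanity check: combining a single p-value hands back that same p-value.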
This is ridiculous. We know the null hypothesis is the correct one, yet we never wind up supporting it! There’s a simple reason: traditional hypothesis testing never considers evidence in support of the null hypothesis, hence why the two options are “reject” and “fail to reject.” For that matter, it only considers the alternate hypothesis by forcing it to be the complement of the null. As I’ve shown, you can hide some shady parameters by tucking them into the alternative.
This is most plain when you chart how classical and Bayesian hypothesis testing behave for different values. As you get closer to the null hypothesis, p-values (expressed as odds ratios) shoot off to infinity. They can’t be tied to evidential support, because the evidence here is finite!
Compare that with how the Bayesian approach behaves: its odds ratios level off as values favor the first hypothesis more strongly, reflecting the fact that there’s finite evidence. It also takes longer to reach neutrality, because it’s considering the evidence weighted against both hypotheses, rather than just one. This is further confirmed by pitting Bayesian hypothesis testing against the random data from before.
|Success||p BF||Bayes HT||Cum. BF|
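To give a feel for how a Bayes factor treats the same random data, here is a rough sketch. It is not the comparison used for the table: I'm assuming, purely for illustration, that the rival to "chance alone" is "the success rate could be anything from 0 to 1, all values equally likely" (a uniform prior), and the function name is hypothetical. Under that assumption the marginal likelihood of any success count is 1/(n + 1), so the Bayes factor reduces to a single binomial term.

```python
from math import comb


def bayes_factor_chance_vs_uniform(successes, trials):
    """Bayes factor favoring 'success rate is exactly 0.5' over a uniform prior on the rate.

    The uniform-prior alternative is an illustrative assumption, not the
    alternative hypothesis used elsewhere in this post.
    """
    likelihood_chance = comb(trials, successes) / 2 ** trials
    # Integrating the binomial likelihood over a uniform prior gives 1 / (trials + 1).
    marginal_uniform = 1.0 / (trials + 1)
    return likelihood_chance / marginal_uniform


if __name__ == "__main__":
    cumulative = 1.0
    for successes in (2676, 2759, 2668, 2618):
        bf = bayes_factor_chance_vs_uniform(successes, 5400)
        cumulative *= bf  # independent runs multiply
        print(successes, bf, cumulative)
```

Values above 1 favor chance, values below 1 favor the alternative, and each individual factor stays finite no matter how close the count lands to 2700.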
But perhaps the best reason to favor the Bayesian approach is that the traditional way is actually Bayesian, but in a half-assed way. What is the “null hypothesis,” after all? It’s either the consensus view, and thus the one with the higher prior probability, or the simplest theory, which thanks to Ockham’s Razor also happens to be the one with the higher prior. That’s Bayesian! How high is that prior, however? It’s never stated. Instead, under the frequentist model it’s assumed to be ridiculously high because prior research had enough data to be conclusive. Hence why the p-value curve shot off to infinity.
The Bayesian model forces all our cards on the table. Priors are either stated or accommodated. As mentioned before, there is no “null hypothesis” to rule over them all; there are just hypotheses with varying support from priors.
Nor is it as black and white; the traditional way demands you make the alternate hypothesis the exact mirror of the null one. The Bayesian way does not, which opens up far more possibilities!
[HJH 2015-07-03: Had listed p-values as two-tail when they were one-tail. Whoops!]
[HJH 2015-07-11: Corrected the parts where I conflated p-values and Neyman-Pearson alphas.]
Hubbard, Raymond, and M. J. Bayarri. “Confusion Over Measures of Evidence (p’s) Versus Errors (α’s) in Classical Statistical Testing.” The American Statistician 57, no. 3 (August 2003): doi:10.1198/0003130031856. pg. 176