
Sitting comfortably? Good, because I have some bad news: traditional hypothesis testing, in the form widely used across the sciences, is horrifically broken. Why? I’ll retry Bem’s test with classical hypothesis testing, but this time plug in random numbers for the number of successes:

```
Null Hypothesis: A subject's selections come from chance alone.
Alternate Hypothesis: A subject's selections cannot be explained by chance, and therefore must be precognition.
Results: Out of 5400 trials over a binary choice, 2676 guessed correctly.
Odds under Null Hypothesis: Using a one-tailed Binomial distribution, we calculate p = 0.52.
Conclusion: As p > 0.05, we fail to reject the null hypothesis.
```
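That p-value is easy to reproduce with nothing beyond Python's standard library. A sketch (the helper name is my own; it computes the exact binomial probability of a deviation from chance at least as large as the one observed, which matches the value quoted above):

```python
from fractions import Fraction
from math import comb

def binom_p(successes, trials):
    """Exact probability, under fair chance, of landing at least as far
    from trials/2 as the observed count did (in either direction)."""
    lo = min(successes, trials - successes)
    tail = sum(comb(trials, k) for k in range(lo + 1))  # P(X <= lo), exactly
    return float(min(1, 2 * Fraction(tail, 2 ** trials)))

print(binom_p(2676, 5400))  # ≈ 0.5224, as quoted above
```

Using `Fraction` keeps the arithmetic exact; the huge binomial coefficients never get rounded until the final conversion to a float.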

“Fail to reject?” But we already know the number is random! Maybe a second run will give a different result:

```
Null Hypothesis: A subject's selections come from chance alone.
Alternate Hypothesis: A subject's selections cannot be explained by chance, and therefore must be precognition.
Results: Out of 5400 trials over a binary choice, 2759 guessed correctly.
Odds under Null Hypothesis: Using a one-tailed Binomial distribution, we calculate p = 0.11.
Conclusion: As p > 0.05, we fail to reject the null hypothesis.
```

Nope. More tries!

| Success | p-value (1 tail) | Conclusion |
| --- | --- | --- |
| 2676 | 0.5224436379 | Fail to Reject |
| 2759 | 0.1113384143 | Fail to Reject |
| 2668 | 0.3912692183 | Fail to Reject |
| 2618 | 0.0265355239 | Reject the Null |

Uh oh, we just rejected a hypothesis we know to be true! This isn’t too surprising, as with a p-value of 0.05 there’s a ~~one in twenty~~ one in four chance[1] we’ll reject the null hypothesis by accident. But if it’s easy to run these experiments, and you’re only rewarded for rejecting your null hypothesis, then you could wind up in a situation where most of the published research is false.

All isn’t lost, though, because we still have Bayes’ Theorem to help combine multiple results together. By combining a lot of experiments, maybe we can find solid support for the null hypothesis.

| Success | p-value (1 tail) | Bayes Factor | Cumul. BF |
| --- | --- | --- | --- |
| 2676 | 0.5224436379 | 1.0939936714 | 1.0939936714 |
| 2759 | 0.1113384143 | 0.1252877542 | 0.1370640102 |
| 2668 | 0.3912692183 | 0.6427623344 | 0.0880995832 |
| 2618 | 0.0265355239 | 0.0272588518 | 0.0024014935 |
| 2729 | 0.4379456408 | 0.7791873396 | 0.0018712133 |
| 2766 | 0.0746279648 | 0.0806464449 | 0.0001509067 |
| 2716 | 0.6731336217 | 2.0593541165 | 0.0003107703 |
| 2695 | 0.9025247249 | 9.2590118247 | 0.0028774262 |
| 2649 | 0.1693005438 | 0.2038048088 | 0.0005864333 |
| 2663 | 0.3205131286 | 0.4716987804 | 0.0002766199 |
| 2739 | 0.2947128555 | 0.4178622251 | 0.000115589 |
| 2723 | 0.5402952832 | 1.1753094181 | 0.0001358528 |
| 2694 | 0.88101016 | 7.4040788664 | 0.0010058651 |
| 2692 | 0.838258846 | 5.1827183467 | 0.0052131155 |
| 2658 | 0.2586904523 | 0.3489641447 | 0.0018191904 |
| 2709 | 0.8170527573 | 4.4660566908 | 0.0081246075 |
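The Bayes Factor column appears to be each p-value recast as odds, p/(1−p), with the cumulative column a running product of those factors. A sketch over the first four rows:

```python
# p-values from the first four rows of the table above
p_values = [0.5224436379, 0.1113384143, 0.3912692183, 0.0265355239]

cumulative = 1.0
for p in p_values:
    bf = p / (1 - p)     # a p-value recast as an odds ratio
    cumulative *= bf
    print(f"{p:.10f}  {bf:.10f}  {cumulative:.10f}")
# final cumulative value ≈ 0.0024, matching the table's fourth row
```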

[HJH 2015-07-11] And indeed it isn’t right, because I’ve made the common mistake of confusing p-values with Neyman-Pearson Type I errors. p-values are a very different beast, which I explain in detail elsewhere. One good bit of evidence for the weirdness of p-values is how you combine them; if they were the probability of falsely rejecting the null hypothesis, you could chain them like I did above. Instead, the best way to combine p-values is:

1. Sum up all the logarithms of the p-values.
2. Multiply by -2.
3. Look up the result in a chi-square distribution with degrees of freedom equal to twice the number of p-values you’re combining.
4. This results in a single p-value, representing all that you’ve combined.

Oh well, that’s easy enough to do in a spreadsheet.
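It is only a few lines of Python, too. A sketch (the function name is mine; it uses the closed-form chi-square tail that exists for even degrees of freedom, so no stats library is needed):

```python
from math import exp, log

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(ln p) follows a chi-square distribution
    with 2 * len(p_values) degrees of freedom under the null."""
    x = -2 * sum(log(p) for p in p_values)
    m = len(p_values)  # chi-square with 2m degrees of freedom
    # Survival function for even d.o.f. has a closed form:
    #   P(X > x) = exp(-x/2) * sum_{i<m} (x/2)^i / i!
    half, term, total = x / 2, 1.0, 0.0
    for i in range(m):
        total += term
        term *= half / (i + 1)
    return exp(-half) * total

print(fisher_combine([0.5224436379, 0.1113384143]))  # ≈ 0.224, as in the table
```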

| Success | p-value (1 tail) | −2·ln(p) | Cumul. χ² | D.O.F | Combined p |
| --- | --- | --- | --- | --- | --- |
| 2676 | 0.522 | 1.298 | 1.298 | 2 | 0.522 |
| 2759 | 0.111 | 4.390 | 5.689 | 4 | 0.224 |
| 2668 | 0.391 | 1.877 | 7.566 | 6 | 0.272 |
| 2618 | 0.027 | 7.259 | 14.824 | 8 | 0.063 |
| 2729 | 0.438 | 1.651 | 16.475 | 10 | 0.087 |
| 2766 | 0.075 | 5.190 | 21.666 | 12 | 0.041 |
| 2716 | 0.673 | 0.792 | 22.458 | 14 | 0.070 |
| 2695 | 0.903 | 0.205 | 22.663 | 16 | 0.123 |
| 2649 | 0.169 | 3.552 | 26.215 | 18 | 0.095 |
| 2663 | 0.321 | 2.276 | 28.490 | 20 | 0.098 |
| 2739 | 0.295 | 2.444 | 30.934 | 22 | 0.097 |
| 2723 | 0.540 | 1.231 | 32.165 | 24 | 0.123 |
| 2694 | 0.881 | 0.253 | 32.419 | 26 | 0.180 |
| 2692 | 0.838 | 0.353 | 32.771 | 28 | 0.244 |
| 2658 | 0.259 | 2.704 | 35.476 | 30 | 0.226 |
| 2709 | 0.817 | 0.404 | 35.880 | 32 | 0.291 |

This is ridiculous. We know the null hypothesis is the correct one, yet we never wind up supporting it! There’s a simple reason: traditional hypothesis testing never considers evidence in support of the null hypothesis, hence the two options being “reject” and “fail to reject.” For that matter, it only considers the alternate hypothesis by forcing it to be the complement of the null. As I’ve shown, you can hide some shady parameters by tucking them into the alternative.

This is most plain when you chart how classical and Bayesian hypothesis testing behave for different values. As you get closer to the null hypothesis, p-values (expressed as odds ratios) shoot off to infinity. They can’t be tied to evidential support, because the evidence here is finite!

Compare that with how the Bayesian approach behaves: its odds ratios level off as values favor the first hypothesis more strongly, reflecting the fact that there’s finite evidence. It also takes longer to reach neutrality, because it’s weighing the evidence against both hypotheses, rather than just one. This is further confirmed by pitting Bayesian hypothesis testing against the random data from before.

| Success | p-value BF | Bayes HT | Cumul. BF |
| --- | --- | --- | --- |
| 2676 | 1.0939936714 | 47.401306 | 47.401306 |
| 2759 | 0.1252877542 | 16.144822 | 765.2856479375 |
| 2668 | 0.6427623344 | 40.155061 | 30730.0918753561 |
| 2618 | 0.0272588518 | 4.862293 | 149418.710614901 |
| 2729 | 0.7791873396 | 42.970612 | 6420613.43937319 |
| 2766 | 0.0806464449 | 11.674787 | 74959294.3140194 |
| 2716 | 2.0593541165 | 53.357449 | 3999636723.43628 |
| 2695 | 9.2590118247 | 58.130329 | 232500198613.833 |
| 2649 | 0.2038048088 | 22.354771 | 5197488697466.76 |
| 2663 | 0.4716987804 | 35.339008 | 183674094659687 |
| 2739 | 0.4178622251 | 33.349302 | 6125402852382500 |
| 2723 | 1.1753094181 | 48.233529 | 295449796117074000 |
| 2694 | 7.4040788664 | 57.863342 | 17095712596552500000 |
| 2692 | 5.1827183467 | 57.294481 | 979489980544639000000 |
| 2658 | 0.3489641447 | 30.529921 | 29903751726319400000000 |
| 2709 | 4.4660566908 | 56.906419 | 1701715425409900000000000 |

But perhaps the best reason to favor the Bayesian approach is that the traditional way is actually Bayesian, but in a half-assed way. What is the “null hypothesis,” after all? It’s either the consensus view, and thus the one with the higher prior probability, or the simplest theory, which thanks to Ockham’s Razor also happens to be the one with the higher prior. That’s Bayesian! How high is that prior, however? It’s never stated. Instead, under the frequentist model it’s assumed to be ridiculously high, on the grounds that prior research had enough data to be conclusive. That’s why the p-value curve shot off to infinity.

The Bayesian model forces all our cards onto the table. Priors are either stated outright or explicitly accounted for. As mentioned before, there is no “null hypothesis” to rule over them all; there are just hypotheses with varying support from priors.

Nor is it as black-and-white: the traditional approach demands the alternate hypothesis be the exact mirror of the null. The Bayesian approach does not, which opens up far more possibilities!

[HJH 2015-07-03: Had listed p-values as two-tail when they were one-tail. Whoops!]

[HJH 2015-07-11: Corrected the parts where I conflated p-values and Neyman-Pearson alphas.]

[1] Hubbard, Raymond, and M. J. Bayarri. “Confusion Over Measures of Evidence (p’s) Versus Errors (α’s) in Classical Statistical Testing.” The American Statistician 57, no. 3 (August 2003). doi:10.1198/0003130031856. pg. 176.