Sitting comfortably? Good, because I have some bad news: traditional hypothesis testing, in the form widely used across the sciences, is horrifically broken. Why? I’ll retry Bem’s test with classical hypothesis testing, but this time plug in random numbers for the number of successes:

Null Hypothesis: A subject's selections come from chance alone.
Alternate Hypothesis: A subject's selections cannot be explained by chance, and therefore must be precognition.
Results: Out of 5400 trials over a binary choice, 2676 guessed correctly.
Odds under Null Hypothesis: Using a one-tailed Binomial distribution, we calculate p = 0.52.
Conclusion: As p > 0.05, we fail to reject the null hypothesis.
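Here's a minimal sketch of one such run in Python, assuming SciPy is available. The exact p-value depends on which tail(s) you test, so it won't match the spreadsheet digit-for-digit, but the logic is the same:

```python
# A sketch of one run: draw a random number of "hits" from pure chance, then
# test it against the chance-only null hypothesis.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng()
n = 5400                                   # binary-choice trials
k = rng.binomial(n, 0.5)                   # successes generated by pure chance
result = binomtest(k, n, p=0.5, alternative='greater')
print(k, result.pvalue)                    # reject the null if p < 0.05
```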

“Fail to reject?” But we already know the number is random! Maybe a second run will give a different result:

Null Hypothesis: A subject's selections come from chance alone.
Alternate Hypothesis: A subject's selections cannot be explained by chance, and therefore must be precognition.
Results: Out of 5400 trials over a binary choice, 2759 guessed correctly.
Odds under Null Hypothesis: Using a one-tailed Binomial distribution, we calculate p = 0.11.
Conclusion: As p > 0.05, we fail to reject the null hypothesis.

Nope. More tries!

Success    p-value (1-tail)    Conclusion
2676       0.5224436379        Fail to Reject
2759       0.1113384143        Fail to Reject
2668       0.3912692183        Fail to Reject
2618       0.0265355239        Reject the Null

Uh oh, we just rejected a hypothesis we know to be true! This isn’t too surprising, as with a p-value of 0.05 there’s roughly a one in four chance[1] we’ll reject the null hypothesis by accident. But if it’s easy to run these experiments, and you’re only rewarded for rejecting your null hypothesis, then you could wind up in a situation where most of the published research is false.

All isn’t lost, though, because we still have Bayes’ Theorem to help combine multiple results together. By combining a lot of experiments, maybe we can find solid support for the null hypothesis.
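The “Bayes Factor” column in the next table appears to re-express each p-value as an odds ratio, p/(1 − p), with the “Cumulative BF” column multiplying those together. Here’s a minimal sketch of that chaining, which, as we’re about to see, is a mistake:

```python
# Chaining p-values as if they were odds ratios for the null: a sketch of the
# (mistaken) calculation behind the table below.
import numpy as np

p_values = [0.5224436379, 0.1113384143, 0.3912692183, 0.0265355239]  # first four runs
bayes_factors = [p / (1 - p) for p in p_values]   # treat each p-value as odds for the null
print(np.cumprod(bayes_factors))                  # the running "Cumulative BF" column
```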

Success    p-value (1-tail)    Bayes Factor    Cumulative BF
2676       0.5224436379        1.0939936714    1.0939936714
2759       0.1113384143        0.1252877542    0.1370640102
2668       0.3912692183        0.6427623344    0.0880995832
2618       0.0265355239        0.0272588518    0.0024014935
2729       0.4379456408        0.7791873396    0.0018712133
2766       0.0746279648        0.0806464449    0.0001509067
2716       0.6731336217        2.0593541165    0.0003107703
2695       0.9025247249        9.2590118247    0.0028774262
2649       0.1693005438        0.2038048088    0.0005864333
2663       0.3205131286        0.4716987804    0.0002766199
2739       0.2947128555        0.4178622251    0.000115589
2723       0.5402952832        1.1753094181    0.0001358528
2694       0.88101016          7.4040788664    0.0010058651
2692       0.838258846         5.1827183467    0.0052131155
2658       0.2586904523        0.3489641447    0.0018191904
2709       0.8170527573        4.4660566908    0.0081246075

Dis don't seem rite...[HJH 2015-07-11] And indeed it isn’t right, because I’ve made the common mistake of confusing p-values with Neyman-Pearson Type I errors. p-values are a very different beast, which I explain in detail elsewhere. One good bit of evidence for the weirdness of p-values is how you combine them; if they were the probability of falsely rejecting the null hypothesis, you could chain them like I did above. Instead, the best way to combine p-values is:

  1. Sum up all the logarithms of the p-values.
  2. Multiply by -2.
  3. Look up the result in a chi-square distribution with the degrees of freedom equal to twice that of the number of p-values you’re combining.
  4. This results in a single p-value, representing all that you’ve combined.

Oh well, that’s easy enough to do in a spreadsheet.
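It’s also a few lines of Python, assuming SciPy; a minimal sketch of those four steps (scipy.stats.combine_pvalues does the same job in one call):

```python
# Fisher's method for combining independent p-values, step by step.
import numpy as np
from scipy.stats import chi2

p_values = [0.522, 0.111, 0.391, 0.027]      # the first four runs from above
statistic = -2 * np.sum(np.log(p_values))    # steps 1 and 2: sum the logs, multiply by -2
dof = 2 * len(p_values)                      # step 3: two degrees of freedom per p-value
combined_p = chi2.sf(statistic, dof)         # step 4: upper tail of the chi-square
print(statistic, dof, combined_p)            # roughly 14.8, 8, 0.06
```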

Success    p-value (1-tail)    -2 ln(p)    Cumulative χ²    D.O.F.    Combined p
2676       0.522               1.298       1.298            2         0.522
2759       0.111               4.390       5.689            4         0.224
2668       0.391               1.877       7.566            6         0.272
2618       0.027               7.259       14.824           8         0.063
2729       0.438               1.651       16.475           10        0.087
2766       0.075               5.190       21.666           12        0.041
2716       0.673               0.792       22.458           14        0.070
2695       0.903               0.205       22.663           16        0.123
2649       0.169               3.552       26.215           18        0.095
2663       0.321               2.276       28.490           20        0.098
2739       0.295               2.444       30.934           22        0.097
2723       0.540               1.231       32.165           24        0.123
2694       0.881               0.253       32.419           26        0.180
2692       0.838               0.353       32.771           28        0.244
2658       0.259               2.704       35.476           30        0.226
2709       0.817               0.404       35.880           32        0.291

This is ridiculous. We know the null hypothesis is the correct one, yet we never wind up supporting it! There’s a simple reason: traditional hypothesis testing never considers evidence in support of the null hypothesis, hence why the two options are “reject” and “fail to reject.” For that matter, it only considers the alternate hypothesis by forcing it to be the complement of the null. As I’ve shown, you can hide some shady parameters by tucking them into the alternative.

[Figure: A comparison of how p-values and Bayesian hypothesis testing weight evidence, according to the number of successes.]

This is most plain when you chart how classical and Bayesian hypothesis testing behave for different values. As you get closer to the null hypothesis, p-values (expressed as odds ratios) shoot off to infinity. They can’t be tied to evidential support, because the evidence here is finite!

Compare that with how the Bayesian approach behaves: its odds ratios level off as values favor the first hypothesis more strongly, reflecting the fact that there’s finite evidence. It also takes longer to reach neutrality, because it’s considering the evidence weighted against both hypotheses, rather than just one. This is further confirmed by pitting Bayesian hypothesis testing against the random data from before.
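The “Bayesian BF” column in the table below is consistent with comparing the chance hypothesis (a success rate of exactly 0.5) against an alternative that spreads its prior uniformly over every possible success rate; the sketch below assumes that setup, though the original analysis may have framed the alternative differently.

```python
# Bayes factors for chance (theta = 0.5) versus an alternative with a uniform
# prior over all success rates theta -- an assumption that appears to match the
# "Bayesian BF" column below.
from scipy.stats import binom

n = 5400
successes = [2676, 2759, 2668, 2618]       # first few runs from above

cumulative = 1.0
for k in successes:
    like_chance = binom.pmf(k, n, 0.5)     # P(k successes | chance)
    like_alt = 1.0 / (n + 1)               # marginal likelihood under a uniform prior on theta
    bf = like_chance / like_alt            # Bayes factor favouring chance
    cumulative *= bf                       # running product, as in the last column
    print(k, round(bf, 2), cumulative)
```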

Success    BF from p-value    Bayesian BF    Cumulative Bayesian BF
2676       1.0939936714       47.401306      47.401306
2759       0.1252877542       16.144822      765.2856479375
2668       0.6427623344       40.155061      30730.0918753561
2618       0.0272588518       4.862293       149418.710614901
2729       0.7791873396       42.970612      6420613.43937319
2766       0.0806464449       11.674787      74959294.3140194
2716       2.0593541165       53.357449      3999636723.43628
2695       9.2590118247       58.130329      232500198613.833
2649       0.2038048088       22.354771      5197488697466.76
2663       0.4716987804       35.339008      183674094659687
2739       0.4178622251       33.349302      6125402852382500
2723       1.1753094181       48.233529      295449796117074000
2694       7.4040788664       57.863342      17095712596552500000
2692       5.1827183467       57.294481      979489980544639000000
2658       0.3489641447       30.529921      29903751726319400000000
2709       4.4660566908       56.906419      1701715425409900000000000

That better.

But perhaps the best reason to favor the Bayesian approach is that the traditional way is actually Bayesian, but in a half-assed way. What is the “null hypothesis,” after all? It’s either the consensus view, and thus the one with the higher prior probability, or the simplest theory, which thanks to Ockham’s Razor also happens to be the one with the higher prior. That’s Bayesian! How high is that prior, however? It’s never stated. Instead, under the frequentist model it’s assumed to be ridiculously high because prior research had enough data to be conclusive. Hence why the p-value curve shot off to infinity.

The Bayesian model forces all our cards on the table. Priors are either stated outright or accounted for. As mentioned before, there is no “null hypothesis” to rule over them all; there are just hypotheses with varying support from priors.
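As a toy illustration (the 3:1 prior odds here are made up purely for the example), a stated prior just multiplies the cumulative Bayes factor from the table above:

```python
# Posterior odds = prior odds * Bayes factor. The 3:1 prior odds for chance over
# precognition are purely illustrative; the Bayes factor is the final cumulative
# value from the table above.
prior_odds = 3.0
cumulative_bf = 1.7e24
posterior_odds = prior_odds * cumulative_bf
print(posterior_odds)        # overwhelming odds in favour of the chance hypothesis
```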

Nor is it as black and white; the traditional way demands you make the alternate hypothesis the exact mirror of the null one. The Bayesian way does not, which opens up far more possibilities!

[HJH 2015-07-03: Had listed p-values as two-tail when they were one-tail. Whoops!]

[HJH 2015-07-11: Corrected the parts where I conflated p-values and Neyman-Pearson alphas.]

[1] Hubbard, Raymond, and M. J. Bayarri. “Confusion Over Measures of Evidence (p’s) Versus Errors (α’s) in Classical Statistical Testing.” The American Statistician 57, no. 3 (August 2003). doi:10.1198/0003130031856. pg. 176.
