In my Bayes series, I spent a little time trying to demonstrate why traditional hypothesis testing is broken. The rot is so bad, though, that I want to spend another post underlining the point.

A quick review: frequentists view probability in terms of long-run frequency, while Bayesians view it as a degree of certainty. It’s the difference between

assuming this coin is fair, if I were to toss it an infinite number of times I’d find that heads came up exactly half the time


my prior beliefs suggest this coin is almost certainly fair, and based on that I’m equally certain that the coin will come up heads as it will come up tails.

One is discussing the odds of the data on the assumption that the hypothesis is true; the other, the odds of the hypothesis assuming the data is true. The difference may seem subtle, but give it time.

P-values are a frequentist measure, and thus they too speak to long-term frequencies. Let’s take one of Bem’s studies as an example: out of 5,400 trials where people had to guess the next image, 2,790 guesses were correct. The null hypothesis, that precognition does not exist, predicts that the success ratio should be 50% over the long term. So if we repeat this experiment, assuming the null hypothesis is true, how often do we get the same result or something more extreme? We can brute-force that easily enough, and sure enough we get back the p-value we calculated by other means (roughly; the former was a Monte Carlo integration, after all). Finally, we apply some logic:

1. Assume the null hypothesis is true.
2. Given 1, we find the long-term odds of getting the same value or a more extreme one fall below a certain threshold of probability.
3. Since 2 contradicts 1, we reject the null hypothesis and conclude it is false.

… This should be setting off a few alarm bells. We’re not just looking at the data and hypothesis we have; our conclusion depends on extrapolating our view forward over an infinite number of tests, and on considering values we’ve never seen and likely never will see (by definition, they’re rarer than any value we have observed). What if this was a fluke result instead? We have no way of telling, short of repeating the test.
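The brute-force check mentioned above is easy to sketch in code (a simulation sketch; the seed and replication count are arbitrary choices of mine, and the one-tailed convention matches Bem’s reporting):

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials, observed = 5400, 2790   # Bem's guessing data
n_reps = 200_000                  # arbitrary number of simulated replications

# Under the null hypothesis (no precognition), every guess is a fair 50/50 call.
successes = rng.binomial(n_trials, 0.5, size=n_reps)

# One-tailed p-value: the fraction of null replications at least as extreme
# as the observed count.
p_value = np.mean(successes >= observed)
print(p_value)  # roughly 0.007
```

Note that the simulation never touches the hypothesis’s probability; it only counts how often imaginary replications of the data beat the observed count.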

And what qualifies as a more extreme result? Suppose I’m testing the fairness of a coin. When I calculate the p-value, do I use one tail of the probability distribution, or both?

What one-tailed and two-tailed tests look like.

Since I didn’t specify whether I was looking for heads or tails more frequently, we’d intuit that both tails are relevant. Yet the results are almost guaranteed to show a clear bias in one direction; shouldn’t we factor that into the conclusion, and thus use only one tail? But that approach lowers the bar for rejecting the null. Aren’t we now tailoring the conclusion to better fit the results?
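To see how much the choice of tails matters, compare both options on a hypothetical coin experiment, 60 heads in 100 tosses (exact binomial arithmetic; the numbers are mine, not from any study):

```python
from math import comb

n, heads = 100, 60   # hypothetical experiment: 60 heads in 100 tosses

def pmf(k):
    # Exact probability of k heads from a fair coin: C(n,k) * 0.5^n.
    return comb(n, k) * 0.5**n

# One-tailed: probability of 60 or more heads under the null.
p_one = sum(pmf(k) for k in range(heads, n + 1))

# Two-tailed (symmetric null): fold in the matching lower tail as well.
p_two = p_one + sum(pmf(k) for k in range(0, n - heads + 1))

print(round(p_one, 3), round(p_two, 3))  # ≈ 0.028 and ≈ 0.057
```

At the conventional 0.05 threshold, the one-tailed test rejects the null while the two-tailed test does not: the choice of tails alone flips the conclusion.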

This gets even worse. Back in Bem’s study, a subset of one experiment had fewer correct guesses than chance predicts, contradicting the positive tug of precognition. He consequently reported a p-value of 0.90, which looks like this:

What a p-value of 0.9 looks like.

That excludes a single tail! I think Bem would justify that as protection against the alternate hypothesis being contradicted (as elsewhere he reports an expected sub-chance value with a p-value of 0.026), but we’re only supposed to be testing the null hypothesis here. The choice of a one- or two-tailed test allows information from the alternate hypothesis to “leak” into the null, biasing the results.

There must have been a good reason for including all these unobserved extreme values, one that justifies all the trouble they cause.

The result with the P value of exactly .05 (or any other value) is the most probable of all the other possible results included in the “tail area” that defines the P value. The probability of any individual result is actually quite small, and [Ronald A.] Fisher said he threw in the rest of the tail area “as an approximation.” … the inclusion of these rarer outcomes poses serious logical and quantitative problems for the P value, and using comparative rather than single probabilities to measure evidence eliminates the need to include outcomes other than what was observed. [1]

Huh. Well, maybe Fisher at least has an objective way to remove all this confusion and subjectivity.

… no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas. […]

On choosing grounds on which a general hypothesis should be rejected, personal judgement may and should properly be exercised.[2]

You read it here, folks; Ronald A. Fisher’s “objective” alternative to the Bayesian approach requires a subjective threshold, some vague number of replications, and unobserved values tossed in to make the results more intuitive. For all his hatred of the explicit use of prior probability found in the Bayesian approach, Fisher was forced to smuggle it in implicitly.

Time and again, we keep finding Bayesian ideas creep into frequentist thought. All we really care about is whether or not the hypothesis is true, given the data we have, but that’s actually a Bayesian question and impossible to answer according to frequentism.

For example, if I toss a coin, the probability of heads coming up is the proportion of times it produces heads. But it cannot be the proportion of times it produces heads in any finite number of tosses. If I toss the coin 10 times and it lands heads 7 times, the probability of a head is not therefore 0.7. A fair coin could easily produce 7 heads in 10 tosses. The relative frequency must refer therefore to a hypothetical infinite number of tosses. The hypothetical infinite set of tosses (or events, more generally) is called the reference class or collective. […]

[In another experiment,] Each event is ‘throwing a fair die 25 times and observing the number of threes’. That is one event. Consider a hypothetical collective of an infinite number of such events. We can then determine the proportion of such events in which the number of threes is 5. That is a meaningful probability we can calculate. However, we cannot talk about P(H | D), for example P(‘I have a fair die’ | ‘I obtained 5 threes in 25 rolls’), [or] the probability that the hypothesis that I have a fair die is true, given I obtained 5 threes in 25 rolls. What is the collective? There is none. The hypothesis is simply true or false. [3]
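Dienes’ “meaningful probability” is easy to verify (a quick check of the die example, assuming the obvious binomial model):

```python
from math import comb

# Dienes' collective: the probability of exactly 5 threes in 25 rolls
# of a fair die, where each roll is a three with probability 1/6.
n, k, p = 25, 5, 1 / 6
prob = comb(n, k) * p**k * (1 - p)**(n - k)
print(round(prob, 3))  # ≈ 0.178
```

What has no collective, as Dienes says, is P(‘the die is fair’ | ‘5 threes in 25 rolls’): inverting the conditional requires a prior over hypotheses, which frequentism forbids.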

Want a second opinion?

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.[4]

So if a p-value isn’t telling you whether a hypothesis is true or false, what is it saying? Many people think it’s the rate of false positives, or the odds of falsely rejecting the null hypothesis, but that isn’t true either. That’s confusing one type of frequentist hypothesis test with another.

Oh wait, you didn’t know there are three forms of frequentist hypothesis testing?

The level of significance shown by a p value in a Fisherian significance test refers to the probability of observing data this extreme (or more so) under a null hypothesis. This data-dependent p value plays an epistemic role by providing a measure of inductive evidence against H0 in single experiments. This is very different from the significance level denoted by α in a Neyman-Pearson hypothesis test. With Neyman-Pearson, the focus is on minimizing Type II, or β, errors (i.e., false acceptance of a null hypothesis) subject to a bound on Type I, or α, errors (i.e., false rejections of a null hypothesis). Moreover, this error minimization applies only to long-run repeated sampling situations, not to individual experiments, and is a prescription for behaviors, not a means of collecting evidence. When seen from this vantage, the two concepts of statistical significance could scarcely be further apart in meaning.[5]

To be fair, though, nearly all scientists and even many statisticians don’t realize this either. Let’s do a quick walk-through of Neyman-Pearson to clarify the differences.

Before getting near any data, you first lock down your Type I error rate, the odds of falsely rejecting a true null hypothesis. Next up, do a power analysis so that you know how much data you need to keep the Type II error rate, the odds of failing to reject a false null hypothesis, acceptably low. The last step before data collection is creating your hypotheses. You’ll need two of them: a null hypothesis to serve as a bedrock or default, and the alternative you’re interested in. These do not need to be symmetric. Some care must be taken in choosing hypotheses, to ensure you can use a uniformly most powerful test. In the case of simple hypotheses, those with a single fixed parameter, that test is the likelihood ratio.
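As a sketch of that planning step, here is a normal-approximation sample-size calculation for a one-sample proportion test (the effect size, α, and power below are hypothetical numbers of mine):

```python
from math import ceil, sqrt
from statistics import NormalDist

# Hypothetical planning numbers: null rate 0.5, alternative rate 0.53,
# one-sided alpha = 0.05, desired power = 0.80.
p0, p1 = 0.5, 0.53
alpha, power = 0.05, 0.80

z_alpha = NormalDist().inv_cdf(1 - alpha)   # Type I cutoff (one-sided)
z_beta = NormalDist().inv_cdf(power)        # Type II cutoff

# Standard normal-approximation sample size for a one-sample proportion test.
n = ((z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1)))
     / (p1 - p0)) ** 2
print(ceil(n))  # roughly 1,700 trials
```

Small effects demand a lot of data: detecting a 3-point shift from chance with decent power takes well over a thousand trials, which is why Bem-style studies need such large samples in the first place.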


If the ratio of likelihoods, Λ(x), falls below the odds of falsely rejecting a true null hypothesis, then you rejec-…. hey, doesn’t that look familiar?


… Waaaitaminute, Neyman-Pearson is Bayesian, at least in some circumstances, but with a flat prior locked in. That’s a bit of a problem.
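Spelling the comparison out (a sketch; textbooks vary on which way the ratio and threshold are written):

```latex
% Neyman–Pearson: reject H_0 when the likelihood ratio is small enough,
% with the cutoff k_\alpha fixed by the Type I error rate \alpha.
\Lambda(x) = \frac{P(x \mid H_0)}{P(x \mid H_1)} \le k_\alpha
  \;\Longrightarrow\; \text{reject } H_0

% Bayes' theorem in odds form:
\frac{P(H_0 \mid x)}{P(H_1 \mid x)}
  = \underbrace{\frac{P(x \mid H_0)}{P(x \mid H_1)}}_{\Lambda(x)}
    \cdot \frac{P(H_0)}{P(H_1)}
```

With a flat prior, P(H0) = P(H1), the prior odds equal one and the posterior odds are just Λ(x): thresholding the likelihood ratio is thresholding the posterior odds.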

The incidence of schizophrenia in adults is about 2%. A proposed screening test is estimated to have at least 95% accuracy in making the positive diagnosis (sensitivity) and about 97% accuracy in declaring normality (specificity). Formally stated, p(normal | H0) = .97, p(schizophrenia | H1) > .95. So, let

H0 = The case is normal,
H1 = The case is schizophrenic, and
D = The test result (the data) is positive for schizophrenia.

With a positive test for schizophrenia at hand,
given the more than .95 assumed accuracy of the test, P(D | H0)—the probability of a positive test given that the case is normal—is less than .05, that is, significant at p < .05. One would reject the hypothesis that the case is normal and conclude that the case has schizophrenia, as it happens mistakenly, but within the .05 alpha error. […]

By a Bayesian maneuver, this inverse probability, the probability that the case is normal, given a positive test for schizophrenia [ p(H0 | D) ], is about .60![6]
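Cohen’s “Bayesian maneuver” takes a few lines to verify (plugging the quoted base rate, sensitivity, and specificity into Bayes’ theorem):

```python
# Cohen's numbers: 2% base rate, 97% specificity, 95% sensitivity.
p_h0, p_h1 = 0.98, 0.02
p_d_h0 = 1 - 0.97            # P(positive test | normal)
p_d_h1 = 0.95                # P(positive test | schizophrenia)

# Bayes' theorem: P(H0 | D) = P(D | H0) P(H0) / P(D).
p_d = p_d_h0 * p_h0 + p_d_h1 * p_h1
p_h0_given_d = p_d_h0 * p_h0 / p_d
print(round(p_h0_given_d, 2))  # ≈ 0.61: most positive tests come from normal cases
```

The base rate dominates: normal cases outnumber schizophrenic ones 49 to 1, so even a 3% false-positive rate generates more false alarms than true detections.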

While Cohen is taking on Fisher’s approach, note that switching to a naive Neyman-Pearson isn’t an improvement. P(D | H0) is less than 0.05, so if we sit right on the boundaries of both likelihoods we find their ratio is just shy of our false positive boundary, 0.05. If either likelihood were actually a touch more extreme, naive Neyman-Pearson would falsely reject H0 just like Fisher’s approach.

You can fix Neyman-Pearson by building a 2×2 table that includes every possibility found in the population, but that amounts to sneaking in the prior probability, a frequentist no-no.

The third type is what I was taught, Fisher’s approach in Neyman-Pearson’s clothing. [7]

1. Confuse p-values with Type I errors and, as per Neyman-Pearson, set your p-value threshold before you start testing. Don’t bother calculating statistical power, though; just trust your prior experience.
2. Define a null and alternative hypothesis, like Neyman-Pearson, but force the alternative to be the mirror image of the null.
3. Abandon the alternative hypothesis and just calculate a p-value, as per Fisher.

So what are p-values? A statement about an infinite number of mythical replications, bolted into a system it was never designed for, and almost always mistaken for something Bayesian or pseudo-Bayesian. That’s pretty bullshit.


[1] Goodman, Steven. “A dirty dozen: twelve p-value misconceptions.” Seminars in hematology. Vol. 45. No. 3. WB Saunders, 2008.

[2] As reported in: Lew, Michael J. “Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don’t know P.” British journal of pharmacology 166.5 (2012): 1559-1567.

[3] Dienes, Zoltan. Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Palgrave Macmillan, 2008. pg. 58-59

[4] Neyman, Jerzy, and Egon S. Pearson. On the Problem of the Most Efficient Tests of Statistical Hypotheses. Springer, 1933. http://link.springer.com/chapter/10.1007/978-1-4612-0919-5_6.

[5] Hubbard, Raymond, and M. J Bayarri. “Confusion Over Measures of Evidence ( p’s ) Versus Errors ( α’s ) in Classical Statistical Testing.” The American Statistician 57, no. 3 (August 2003): 171–78. doi:10.1198/0003130031856.

[6] Cohen, Jacob. “The earth is round (p < .05).” American Psychologist, Vol 49(12), Dec 1994, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.997

[7] Gigerenzer, Gerd. “The superego, the ego, and the id in statistical reasoning.” A handbook for data analysis in the behavioral sciences: Methodological issues (1993): 311-339.