While I’m sold on Bayesian statistics, I still like to read what frequentists have to say in reply. Maybe I missed something? Maybe someone will come up with a new argument? So I was fairly eager to read this recent paper:
García-Pérez, Miguel A. “Thou Shalt Not Bear False Witness Against Null Hypothesis Significance Testing.” Educational and Psychological Measurement, October 5, 2016, 13164416668232. doi:10.1177/0013164416668232.
Unfortunately, it didn’t live up to the hype.
I started noting errors immediately, like this contradiction:
Null hypotheses may thus be undeservedly rejected or not rejected, but whether the decision was adequate on any particular occasion will always remain unknown: The truth about the hypothesis is always beyond reach. In these circumstances, one would only expect that the decision rule lends itself to an examination that permits quantification of the frequency with which its outcomes will be inadequate, that is, how often application of the decision rule will reject a true hypothesis and how often it will not reject a false hypothesis. This is one of the undeniable and unique strengths of NHST.
If we can never know whether a hypothesis is true, how can we talk of “true hypotheses”? How can we ever tally them up? The author contradicts themselves in the span of two sentences, and this error propagates to an extreme degree:
For the purpose of inference, one would need to know how likely it is that the Bayes factor indicates evidence against the null when the null is actually true, or how likely it is that the Bayes factor indicates evidence in favor of the null when the null is false. The probabilities of false positives and false negatives need to be known even though the Bayesian approach intentionally avoids dichotomous decisions.
The Bayesian interpretation doesn’t “avoid” dichotomous decisions, any more than deductive logic “avoids” statements which aren’t completely true or false, or Euclidean geometry “avoids” the idea that there are no parallel lines. A solid yes/no on empirical questions is impossible according to its core axioms, so it has no concept of a “false hypothesis” and cannot derive a probability for that.
If a priori probabilities for each hypothesis are additionally defined, the Bayes factor can be transformed into odds reflecting how much more likely one of the hypotheses is over the other, thus coming close to some scholars’ ideal of attaching probabilities to hypotheses.
“Coming close?” It’s a central axiom! The debate is not whether the Bayesian interpretation attaches probabilities to hypotheses, it’s whether that’s a valid move.
But that just scratches the surface. García-Pérez also makes the common mistake of conflating p-values with Type I error rates.
One way or the other, tagging NHST results as significant only reflects the researcher’s principled and transparent decision regarding the criterion for rejection of the null, be it a critical value on the test statistic or a boundary p value. In principle, one could set α = .3 and declare significant any result whose p value does not exceed .3. This only means that the researcher intentionally adopts a liberal criterion by which rejecting true null hypotheses is permitted to occur with probability .3.
This comes from a fundamental confusion over Null Hypothesis Significance Testing: are we testing the null hypothesis in isolation, as Fisher advocated, or judging how well it performs against an alternative hypothesis, as per Neyman and Pearson? At various times, the author tries to have it both ways:
NHST procedures derive the probability distribution that a test statistic computed from the data would have if the null hypothesis were true, which the researcher then uses as a reference to make a decision to reject or not to reject the null. […]
On their side, CIs are indeed calculated within an inferential framework that happens to be exactly that of NHST. The CIs that are typically calculated describe the range of parameter values that one might have placed in a null hypothesis that the current data would not have rejected in a two-sided test. Unpacking this statement for the case of differences between means, consider the non-nil null hypothesis H0: μ1 – μ2 = μd tested against the alternative H1: μ1 – μ2 ≠ μd (compared to the conventional case in which μd is written out as 0).
These two approaches are only consistent if you insist the alternative hypothesis is the exact mirror of the null. In practice, this is almost never the case: Newtonian gravity is not the mirror of General Relativity, and in all but a handful of corner cases the two theories predict the same outcome. This is on top of other complaints brought up by Cohen (1994), such as the inherent falsehood of a point hypothesis.
The author does not grasp this. How else to explain this hot mess of a diagram:
Right off the top, this chart is comparing apples to concrete; pitting a proportion of belief between two hypotheses against a pseudo-probability that depends only on one makes as much sense as comparing a correlation coefficient to a mean.
Secondly, just because the values between the two appear to fall on a neat line, that doesn’t mean they map “one-to-one” as the paper claims. All those Factors above 0 are actually in favour of the null hypothesis, something no p-value can indicate.
Relatedly, see those dashed lines? The p-value one represents p = 0.05, or significant evidence against the null hypothesis. The Bayes Factor one represents BF = 1, where the evidence is equivocal between the null and alternative hypotheses. Those aren’t similar assertions, so it’s disingenuous to denote them in a similar fashion.
Fourth, no one is arguing Bayes Factors and p-values are uncorrelated; instead, the main complaint against p-values is that they exaggerate the evidence against the null hypothesis. Look carefully at the axes of the plots, and you’ll see that p-values consistently have a lower value than Bayes Factors. This can’t be easily compensated for by a linear transform, because of that “curve” at the upper-right of the dataset. The curve exists because in a Bayesian interpretation a finite amount of evidence can only provide a finite amount of belief, no matter how well the evidence fits.
Speaking of which, that upper-right sets off my alarm bells. Increasing the number of samples by an order of magnitude made almost no difference in the Bayes Factor when the null was false; in contrast, the Bayes Factor for the true case does change significantly with the sample count. How is García-Pérez calculating these Bayes Factors? They use a formula derived by Rouder et al.
Hmmmm, that’s an…. interesting approach. Rouder et al. were trying to come up with a generic Bayesian replacement for t-tests, but in the process they’re reducing an entire dataset to two numbers: t, the t-statistic, and N, the number of observations (v is the degrees of freedom and related to N, while g is integrated away). This necessarily increases the uncertainty of the result, leading to fuzzier conclusions and more assumptions.
It’s also the wrong formula.
Let’s go back to first principles. The likelihood of pulling a specific value x from a Gaussian distribution is

$$p(x \mid H_{\mu,\sigma}) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

where H_{μ,σ} is the hypothesized distribution, μ is the mean, and σ is the standard deviation. The exponent of e reaches its maximum of 0 when (x – μ) = 0, and in all other cases is less than 0. This means that if the x you draw is exactly equal to μ, you’ve drawn the most likely value from that distribution. Draw any other value and it’s guaranteed to be less likely… provided the standard deviation is held constant. When pitting hypotheses with two different standard deviations against each other, the one that has the lowest variance will always come out on top here.
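A quick numerical check of those claims, as a minimal Octave sketch. The helper name `gauss_pdf` and the sample values are mine, chosen only for illustration:

```octave
# Gaussian density of x under a hypothesis with mean mu and standard deviation sigma
gauss_pdf = @(x, mu, sigma) exp(-((x - mu).^2) ./ (2 * sigma.^2)) ./ (sigma * sqrt(2 * pi));

mu = 0;
at_mean  = gauss_pdf(mu,     mu, 10);   # x drawn exactly at the mean
off_mean = gauss_pdf(mu + 3, mu, 10);   # any other x is less likely at the same sigma
narrow   = gauss_pdf(mu,     mu, 5);    # a smaller sigma gives a taller peak at x = mu

printf("sigma = 10, x = mu:     %f\n", at_mean);
printf("sigma = 10, x = mu + 3: %f\n", off_mean);
printf("sigma = 5,  x = mu:     %f\n", narrow);
```

Halving the standard deviation doubles the density at the mean, which is why the narrower hypothesis wins whenever the draw lands on μ.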
What if you draw two values, though? Or three? Because these are independent draws, we multiply their probabilities together.
This is the odds-ratio form of Bayes’s Theorem, with the Bayes Factor itself on the left. One advantage of rewriting the theorem in this fashion is that it’s trivial to add additional evidence.
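In one common notation, independence lets the update be written as a running product. The symbols below are chosen purely for illustration: E_i is the i-th observation, and H_1 and H_2 are the competing hypotheses.

```latex
\frac{p(H_1 \mid E_1, \dots, E_N)}{p(H_2 \mid E_1, \dots, E_N)}
  = \frac{p(H_1)}{p(H_2)} \prod_{i=1}^{N} \frac{p(E_i \mid H_1)}{p(E_i \mid H_2)}
```

Each new observation contributes one more likelihood-ratio factor. On a log scale, that is one more term in a sum.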
There’s an important consequence of these two observations: if x always equals μ, the hypothesis that the population’s mean is μ always has a greater likelihood than any other hypothesis about the mean.
This observation could be thwarted by the standard deviation, but look at Rouder et al.’s formula: it doesn’t depend on the standard deviation at all! From the looks of it, they’re integrating across all possible standard deviations, and weighting those results by a prior that favors lower values.
Since the t statistic is defined as

$$t = \frac{\bar{x} - \mu}{s / \sqrt{N}}$$

where x̄ is the sample mean and s is the sample standard deviation, it is zero whenever (x̄ – μ) is. This gives us a test of Rouder et al.’s formula: if we plug in t = 0 and increase the sample count, we’d expect the Bayes Factor to grow without limit. Exponentially, in fact, since adding one more bit of evidence is another multiplication by the likelihood ratio and that should be a constant value greater than 1.
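To make the t = 0 case concrete, here is a small Octave sketch of the usual one-sample t statistic. The helper name `t_stat` and the five data points are made up for illustration:

```octave
# One-sample t statistic against a hypothesized mean mu0 (helper name is illustrative)
t_stat = @(x, mu0) (mean(x) - mu0) / (std(x) / sqrt(numel(x)));

x = [3 5 7 4 6];             # made-up sample whose mean is exactly 5
t_at_mean  = t_stat(x, 5);   # sample mean equals mu0, so the numerator vanishes
t_off_mean = t_stat(x, 4);   # sample mean sits above mu0, so t is positive

printf("t against mu0 = 5: %f\n", t_at_mean);
printf("t against mu0 = 4: %f\n", t_off_mean);
```

Only the gap between the sample mean and the hypothesized mean can zero out t. The spread of the data only rescales it.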
How would I handle the same problem? We’re aiming for a simple means test: what’s the p-value/Bayes Factor of N samples from a Gaussian distribution having a mean of μ? My null hypothesis will always assume the mean is 0; the Fisherian half will declare both tails to be in play, while the Bayesian half will pick μ = 5 as the second/alternate hypothesis (while grumbling about point hypotheses). Paired with a standard deviation of 10, this gives us an effect size of 0.5.
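Since all we care about is the mean, and the frequentist approach only cares about the variance to calculate the standard error, I’ll fix the variance at 100. This simplifies the math. Focusing just on the likelihood ratio, with the second hypothesis on top and logs taken so that the product over samples becomes a sum,

$$\ln \frac{p(x_1, \dots, x_N \mid H_{5,10})}{p(x_1, \dots, x_N \mid H_{0,10})} = \sum_{i=1}^{N} \frac{x_i^2 - (x_i - 5)^2}{200}$$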
That just leaves the prior. We have a handy cheat, however: when both hypotheses in a Bayes Factor are point hypotheses, both of their priors are constant scalars. The posteriors can only slide along a log plot of Bayes Factors, they cannot scale or twist or change shape. Since all we care about is the shape, we can defer on the prior for now and just calculate likelihood ratios.
This result is much cleaner than Rouder’s formula, several orders of magnitude faster, and trivial to pull off in Octave:
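# Null-true data: 10,000 trials (columns) of 160 draws each from a Gaussian with mean 0, sd 10
NULL0 = randn(160, 10000) .* 10;
START = time();
# Per-trial log-likelihood ratio, ln p(x | mu=5) - ln p(x | mu=0), with variance 100.
# Positive values favour mu = 5; flip the sign for a null-over-alternative ratio.
BF = sum( ((NULL0.^2) - ((NULL0 - 5).^2)) / 200 );
END = time();
disp(sprintf("We took %f seconds to process 10,000 trials of 160 samples.", (END - START)));
# "We took 0.037829 seconds to process 10,000 trials of 160 samples."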
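For the frequentist half, a matching sketch might look like the following. It assumes a two-sided z test with the variance treated as known, which is my reading of “fix the variance at 100” rather than a confirmed detail. The names `ALT5`, `z`, `p` and `logLR` are illustrative, and `erfc` stands in for the normal CDF so that no extra package is needed:

```octave
randn("state", 1);                      # fixed generator state so reruns match
N = 160; trials = 10000; sigma = 10;
ALT5 = randn(N, trials) .* sigma + 5;   # data drawn with the alternative (mu = 5) true

# Two-sided z test of mu = 0, variance assumed known
z = mean(ALT5) ./ (sigma / sqrt(N));
p = erfc(abs(z) ./ sqrt(2));            # same as 2 * (1 - Phi(|z|))

# Log-likelihood ratio, mu = 5 over mu = 0 (positive favours the alternative)
logLR = sum((ALT5.^2 - (ALT5 - 5).^2) / (2 * sigma^2));

printf("share of trials with p < 0.05: %f\n", mean(p < 0.05));
printf("share of trials with logLR > 0: %f\n", mean(logLR > 0));
```

Both quantities depend on the data only through the sample mean, so each trial yields one (p, logLR) pair that can be plotted directly.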
Enough build-up, though. How do my results compare?
These make a lot more sense. If the second/alternate hypothesis is true, you’d expect the results to cluster symmetrically around a negative likelihood ratio (assuming a mild, constant prior). That happens in my diagrams, but not García-Pérez’s. As both our tests are two-sided, it follows that the cluster should look like a “hump” around p = 1.0, and that’s visible in my diagram but not the author’s. The likelihood ratio when the null is true not only rises as more data is added, it rises exponentially as had been predicted.
My charts show the mapping isn’t one-to-one, either. Some of that fuzz comes from the different choice of metric, as the Bayesian approach emphasizes individual observations and defers on consolidation until calculating the test statistic. You might be able to see this near the fringes, where multiple likelihood ratios map onto a single p-value, though this could also be explained by precision issues.
But the curve to the data stands out from the noise, arcing down from the p = 1.0 mark. As the likelihood ratio increases, the p-value increases more. Mapping one to the other requires taking into account the effects of the prior and the number of samples taken, and is much less trivial here than in García-Pérez’s charts.
There’s also some interesting data hidden in the details. Looking at the case where the null hypothesis is true, both frequentist and Bayesian approaches generate false positives at roughly the same rate when N = 20 (5.3% vs. 5.3%, when using Kass and Raftery’s “positive” category and assigning a flat prior), but well before N = 160 we find the Bayesian approach is far less likely to produce a false positive.
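Those rates can be cross-checked without simulation. The sketch below rests on two assumptions of mine: that the “positive” category starts at a ratio of 3 in favour of μ = 5, and that the frequentist test is a two-sided z test at α = 0.05. Variable names are illustrative.

```octave
Phi = @(z) 0.5 * erfc(-z ./ sqrt(2));   # standard normal CDF
sigma = 10; mu1 = 5;
bf_cut = 3;                             # assumed lower edge of the "positive" category
N = [20 160];
se = sigma ./ sqrt(N);                  # standard error of the sample mean

# The ratio (mu = 5 over mu = 0) exceeds bf_cut once the sample mean passes this value
xbar_cut = mu1 / 2 + sigma^2 * log(bf_cut) ./ (N * mu1);

fp_bayes = 1 - Phi(xbar_cut ./ se);     # null true, so the sample mean is N(0, se^2)
fp_freq  = 2 * (1 - Phi(1.959964));     # fixed by construction, whatever N is

printf("N = %3d: Bayesian %.4f, frequentist %.4f\n", [N; fp_bayes; fp_freq * ones(size(N))]);
```

Under these assumptions the N = 20 rate comes out at roughly 5.4%, in line with the simulated 5.3%, while by N = 160 it has dropped below 0.1%.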
We can also plumb the dataset to answer another question: do p-values exaggerate the evidence against the null hypothesis? The curve of the graphs suggests this is true, but we can get more quantitative.
When the null is false and N = 20, 86.8% of the samples have a likelihood ratio that suggests the null is more likely false than not. Let’s flip that on its head: what p-value has the same acceptance rate, for a given sample size? This creates an equivalence between these two measures, and because we’re not invoking priors or posteriors we’re still in frequentist territory. The Neyman-Pearson branch of frequentism relies on likelihood ratios to assess the evidence, in fact, so we’re justified using it for the same task here.
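One way to compute that equivalence for N = 20 is sketched below, again assuming a two-sided z test with known variance. The names `accept_rate`, `reject_rate` and `p_equiv` are mine.

```octave
Phi = @(z) 0.5 * erfc(-z ./ sqrt(2));   # standard normal CDF
sigma = 10; mu1 = 5; N = 20;
se = sigma / sqrt(N);

# With mu = 5 true, the likelihood ratio favours mu = 5 whenever the sample mean exceeds mu1 / 2
accept_rate = 1 - Phi((mu1 / 2 - mu1) / se);

# Rejection rate of a two-sided z test with critical value c, when mu = 5 is true
reject_rate = @(c) (1 - Phi(c - mu1 / se)) + Phi(-c - mu1 / se);

# Critical value whose rejection rate matches, and the p-value threshold it implies
c = fzero(@(c) reject_rate(c) - accept_rate, [0, 10]);
p_equiv = erfc(c / sqrt(2));

printf("likelihood ratio favours mu = 5 in %.1f%% of samples\n", 100 * accept_rate);
printf("equivalent p-value threshold: %.3f\n", p_equiv);
```

The first figure reproduces the 86.8% quoted above. Repeating the calculation over a range of N traces out how the matching threshold moves with sample size.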
This graph should be disturbing: it shows that as the sample size increases, you need to strengthen the p-value threshold to maintain the same level of evidence. When N > 70, in fact, the most common falsification threshold can be compatible with evidence in favour of the null! Even more disturbingly, there’s a curvilinear relationship: as you demand higher levels of evidence to justify your beliefs, you also need to strengthen your p-value threshold for small sample sizes.
Since the Bayes Factor depends heavily on the likelihood ratio, it must have similar behaviour. Let’s check, by multiplying the likelihood ratio by a prior and repeating the math. The most common prior for means is Cauchy’s, though it’s not without critics. Plugging in the appropriate values, we find
which is a fairly boring result, as it barely favors the null hypothesis over the second one and thus the Bayes Factor behaves almost exactly like the likelihood ratio. You’ll have to squint to see the slight improvement.
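As a rough illustration of why a mild prior changes so little, consider a Cauchy prior centred on 0. The scale of 10 below is a placeholder of mine, not the value used for the charts, and the log-likelihood ratios are made up:

```octave
# Cauchy density with location x0 and scale s
cauchy_pdf = @(x, x0, s) 1 ./ (pi * s * (1 + ((x - x0) ./ s).^2));

s = 10;                                  # placeholder scale, an assumption for this sketch
prior_odds = cauchy_pdf(0, 0, s) / cauchy_pdf(5, 0, s);   # mu = 0 over mu = 5

logLR = [-2 0 2];                        # made-up log-likelihood ratios, null over alternative
log_posterior_odds = logLR + log(prior_odds);

printf("prior odds for the null: %.2f (log: %.3f)\n", prior_odds, log(prior_odds));
disp(log_posterior_odds);
```

With point hypotheses, the prior enters as a single additive constant on the log scale, so every value shifts by the same small amount and the shape is untouched.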
Let’s experiment with a stronger prior. Ever heard of Benford’s Law? It states that the first digit in a number has a logarithmic distribution, with “1” significantly more likely than “9”. This is a special case of a more general trend, that how often a number appears approximately follows a Pareto distribution.
P-values stand up better here, but that isn’t saying much. You still need to crank the threshold to stay on equal footing with Bayes Factors, it’s just that the point which mapped p = 0.05 to an equivocal Bayes Factor has moved from a sample size of 70 to 100.
Before I finish, I’m sure a few of you are wondering how I could imply we can never calculate a false positive rate for Bayesian stats, then link Bayes Factors to p-values via false positives. The answer is that I’m dealing with a toy universe where I already know what’s true and false. With real-world problems, we don’t have that luxury and so we’re forced to extrapolate from toy universe problems and cross our fingers. On paper the Bayesian and frequentist interpretations agree on that point, but in practice the misunderstandings of frequentism conspire to provide an illusion of certainty. If you mistake p-values for false positive rates, if you transfer the former’s “objectivity” onto the latter, then you can think frequentism gets you certainty.
By pretty much every metric, Bayes Factors beat out p-values. To believe otherwise, you must misunderstand statistics to a large (but sadly common) degree.
 García-Pérez, Miguel A. “Thou Shalt Not Bear False Witness Against Null Hypothesis Significance Testing.” Educational and Psychological Measurement, October 5, 2016, 13164416668232. doi:10.1177/0013164416668232.
 Hubbard, Raymond, and M. J. Bayarri. “Confusion Over Measures of Evidence (p’s) Versus Errors (α’s) in Classical Statistical Testing.” The American Statistician 57, no. 3 (August 2003): 171–78. doi:10.1198/0003130031856.
 Cohen, Jacob. “The Earth Is Round (P < .05).” American Psychologist 49, no. 12 (1994): 997–1003. doi:10.1037/0003-066X.49.12.997.
 Rouder, Jeffrey N., Paul L. Speckman, Dongchu Sun, Richard D. Morey, and Geoffrey Iverson. “Bayesian T Tests for Accepting and Rejecting the Null Hypothesis.” Psychonomic Bulletin & Review 16, no. 2 (2009): 225–237.
 Kass, Robert E., and Adrian E. Raftery. “Bayes Factors.” Journal of the American Statistical Association 90, no. 430 (1995): 773–795.
 Ghosh, Joyee, Yingbo Li, and Robin Mitra. “On the Use of Cauchy Prior Distributions for Bayesian Logistic Regression.” arXiv Preprint arXiv:1507.07170, 2015. http://arxiv.org/abs/1507.07170.
 Gauvrit, Nicolas, Jean-Paul Delahaye, and Hector Zenil. “Sloane’s Gap. Mathematical and Social Factors Explain the Distribution of Numbers in the OEIS.” arXiv Preprint arXiv:1101.4470, 2011. http://arxiv.org/abs/1101.4470.