This is really just a continuation of Jamie Bernstein’s article on Bayes Theorem over at SkepChick, so read that first (at minimum, it’ll explain the animal pictures).
Anyway, in the last step, Bernstein combined her prior belief in telepathy with the results of a study on the subject, and derived a new probability that telepathy is real.
Going back to our boxes, we want to calculate the values of C and D.
Box C = The probability psychic powers are real (1%) multiplied by the probability Bem’s test is correct (95%) = 95% * 1% = 0.95%
Box D = The probability psychic powers are not real (99%) multiplied by the probability Bem’s test is wrong (5%) = 5% * 99% = 4.95%
Baye’s[sic] Theorem = C / (C+D) = 0.95%/(0.95%+4.95%) = 16.1%
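Bernstein’s arithmetic is easy to reproduce. Here is a minimal Python sketch using her roughed-in numbers; the variable names are mine, chosen for illustration, and are not from her article.

```python
# Bernstein's roughed-in numbers (not taken from the actual study).
prior_real = 0.01        # prior probability that psychic powers are real
p_test_correct = 0.95    # probability that Bem's test is correct

# Box C: powers are real AND the test is correct.
box_c = prior_real * p_test_correct
# Box D: powers are not real AND the test is wrong.
box_d = (1 - prior_real) * (1 - p_test_correct)

# Bayes' Theorem as used in the quote: C / (C + D).
posterior = box_c / (box_c + box_d)

print(f"Box C = {box_c:.2%}")          # 0.95%
print(f"Box D = {box_d:.2%}")          # 4.95%
print(f"Posterior = {posterior:.1%}")  # 16.1%
```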
Bernstein didn’t reference the numbers in the actual study, but roughed them in. I’d rather work with real data, so I tracked the original down and pulled the numbers from one of the nine experiments.
Hrm. The jump from 16.1% to 58.9% is a bit alarming. Still, it’s just one experiment of one study; when you start combining it with null results from other studies, the odds should drop down to something more reasonable.
There’s an interesting value lurking in the equations, though. Let’s return to the common Bayes equation, and play around a bit.
Expressing the probabilities as odds ratios, instead of bare probabilities, makes it a lot easier to chain together multiple experiments:
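One way to write this out is the standard odds form of Bayes’ Theorem. The labels $H$ (the hypothesis, here “psychic powers are real”) and $E$ (the evidence from one experiment) are assumed notation for this sketch:

$$
\frac{P(H \mid E)}{P(\neg H \mid E)} = \frac{P(E \mid H)}{P(E \mid \neg H)} \times \frac{P(H)}{P(\neg H)}
$$

The first ratio on the right is the Bayes factor, and the second is the prior odds. Plugging in Bernstein’s roughed-in numbers as a check:

$$
\frac{0.95}{0.05} \times \frac{0.01}{0.99} = 19 \times \frac{1}{99} = \frac{19}{99} \approx 0.192
$$

Odds of 19 to 99 correspond to a probability of $19 / (19 + 99) \approx 16.1\%$, matching the figure above. Assuming the experiments are independent of one another given each hypothesis, adding another experiment just means multiplying in another Bayes factor.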
| Bayes Factor | Strength of Evidence |
| Under 1/150 | Extremely strong against |
| 1/150 to 1/20 | Strongly against |
| 1/20 to 1/3 | Negative |
| 1/3 to 1 | Barely worth noting |
| 1 to 3 | Barely worth noting |
| 3 to 20 | Positive |
| 20 to 150 | Strongly in favor |
| Over 150 | Extremely strong in favor |
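The table can be turned into a small lookup. This is only a sketch: the function name `strength_of_evidence` is hypothetical, the “against” and “in favor” suffixes on the two extreme labels are spelled out for clarity, and since the table’s ranges share their endpoints, I’ve assumed each boundary value belongs to the band above it.

```python
# (upper bound, label) pairs in ascending order, taken from the table.
# Assumption: a value sitting exactly on a boundary falls in the higher band.
BANDS = [
    (1 / 150, "Extremely strong against"),
    (1 / 20, "Strongly against"),
    (1 / 3, "Negative"),
    (3, "Barely worth noting"),   # covers both 1/3 to 1 and 1 to 3
    (20, "Positive"),
    (150, "Strongly in favor"),
]


def strength_of_evidence(bayes_factor):
    """Return the table's verbal label for a given Bayes factor."""
    if bayes_factor <= 0:
        raise ValueError("a Bayes factor must be positive")
    for upper_bound, label in BANDS:
        if bayes_factor < upper_bound:
            return label
    return "Extremely strong in favor"


# Bernstein's roughed-in numbers imply a Bayes factor of 0.95 / 0.05 = 19.
print(strength_of_evidence(0.95 / 0.05))
```

By this reading, Bernstein’s roughed-in numbers count as “Positive” evidence, just shy of “Strongly in favor.”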
Notice the shift in terminology: now we’re talking in terms of the strength of the evidence, whereas p-values deal in the probability of getting results at least this extreme if the null hypothesis is true. That’s because Bayesian statistics deal in discrete chunks of data, some of which may be fuzzy or point in contradictory ways. There isn’t any real end-game; consolidating chunks might lead one hypothesis to become dominant over time, but even strong evidence can be muted by a stronger prior, so you don’t assume you’ll eventually converge to the One Ring of hypotheses. In essence, it refutes or supports hypotheses based on streams of data.
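To make the “chunks of data” idea concrete, here is a rough Python sketch of updating on a stream of Bayes factors. The helper names are hypothetical, the Bayes factors of 0.5 and 0.25 are made up purely for illustration (only the 19 comes from Bernstein’s roughed-in numbers), and multiplying factors together assumes the experiments are independent given each hypothesis.

```python
from math import prod


def probability_to_odds(p):
    """Convert a probability into odds in favor."""
    return p / (1 - p)


def odds_to_probability(odds):
    """Convert odds in favor back into a probability."""
    return odds / (1 + odds)


def update(prior_probability, bayes_factors):
    """Posterior probability after applying each Bayes factor to the prior odds."""
    posterior_odds = probability_to_odds(prior_probability) * prod(bayes_factors)
    return odds_to_probability(posterior_odds)


# One positive experiment, starting from a 1% prior.
print(update(0.01, [19]))             # roughly 0.161

# The same experiment followed by two (made-up) chunks pointing the other way.
print(update(0.01, [19, 0.5, 0.25]))  # drops back to roughly 0.023

# The same positive experiment, muted by a much stronger prior against.
print(update(0.0001, [19]))           # still under 0.002
```

Updating one chunk at a time gives the same answer as multiplying all the factors in at once, which is why the order the evidence arrives in doesn’t matter here.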
The more common “frequentist” view treats the data as a whole. Since the model doesn’t change over time, you should always get back compatible data if your model perfectly describes it. If you don’t, then either you didn’t look at enough data to properly test your hypothesis, or that model isn’t in line with reality and must be discarded for another. So frequentism actually deals with streams of hypotheses, which evolve to more closely match reality over time. There is a clear end goal here, in contrast to the more relativist Bayesian approach.
How does that play out in practice? Let’s try it…
Adapted from: Kass, Robert E., and Adrian E. Raftery. “Bayes factors.” Journal of the American Statistical Association 90.430 (1995): 773-795. Here’s a PDF of another version.
[HJH 2015-09-18: Caught a minor mistake I’d made about p-values. Also, I missed a link.]