
Oh whoops, we forgot to actually run the numbers for the hypothesis we developed last time, H3. To do that in our little helper, swap out:

prob := random.Float64()                        // H1: not random

for this:

prob := (random.NormFloat64()*random.Float64()*0.05) + random.Float64()*0.1 + 0.45
if prob < 0.45 { prob = 1 - prob }              // H3: plausible precog, but sloppy sampling

And tack the results up next to our prior runs.

Success/Trials         H1/H0              H2/H0              H3/H0
 828/1560              0.588416           1.211814           4.775636
 238/ 480              0.057852           0.049696           0.485005
 246/ 480              0.066529           0.092845           0.552351
 536/1080              0.039115           0.032298           0.345376
2790/5400              0.340399           0.678042           2.97905
1274/2400              2.448912           4.912015          19.251378
1186/2400              0.03033            0.017051           0.268514
1251/2496              0.025082           0.02749            0.226681
1105/2304              0.178838           0.008695           1.499272
overall Bayes Factor   0                  0                  2.312523
1/BF                   99536693.052594    407936437.373478   0.432428

… Dammit.

Wow, that’s quite an improvement! Jettisoning those nonsense values really cranked the odds, if by “cranked” we mean “reached the ‘barely noteworthy’ level.” Even then, the results are tenuous: remove the best study, and the Bayes Factor swings against precognition.

But the improvement comes at a cost. Here’s how H3 claims precognitive ability is spread across all humans. The vertical axis is the raw total of people, on a logarithmic scale, while the horizontal axis is the success rate. Black areas represent certain precognitive ability, white areas indicate no precognition, and gray areas mark plausible levels of precognition under the model.

The distribution of precognition among the general population, according to hypothesis H3 (plausible precognition, but sloppy sampling). Gray indicates uncertainty, and the Y axis is logarithmic.

In real terms, at best some 490,000 Canadians can see what’s coming with greater than 66% accuracy, but the model’s expected number is around 9,000. Look at success rates over 75%, and those numbers decline to 1,120 and 6, respectively. We’ve cranked the odds of precognition, but only by watering down what we mean by “precognition.” Those spectacular extremes don’t exist in this model.

It doesn’t help that H3 does a poor job of representing its assumptions.

A graph of which means and standard deviations lead to a plausible case for precognition.

H3 samples every point within the bounds of this graph, yet only points on that fuzzy boomerang should count towards precognitive ability. With a little math and curve-fitting, we can craft an H4 that does a much better job of grasping the boomerang; on this chart, it only samples the area between those two red lines. See if you can spot the difference between H3’s predictions for the population and H4’s:

The distribution of precognition among the general population, according to hypothesis H4 (plausible precognition, great sampling). Gray indicates uncertainty, and the Y axis is logarithmic.

We should also stop equivocating between the variation in the general population and the variation between studies; while the two are related, they’re not the same. Say we randomly pluck two hundred people from the general public and calculate the group’s average precognitive ability. While this number is a decent estimate of the population average, random variance within the population makes it almost certain the sample’s average will be slightly better or worse than the population’s. As the sample size increases, this difference gradually fades away.

A graph of how the population’s variance affects the variance of the averages calculated from samples.

But if you do multiple small studies instead of one bigger one, you’ll find their averages dance around the population average, and the distribution of that dance is strongly correlated with the distribution in the population. Since these averages are, well, averages, the variance between them is muted relative to the population’s. With multiple two-hundred-sample studies, for instance, the standard deviation between their means will be about 7% of the population’s standard deviation. Compensating for this effect will toss out even more garbage samples, and further improve the numbers for precognition.

Tightening up the sampling, however, means that H4 requires substantially more code than H3.

mean := random.Float64()*.1 + .45               // H4: plausible precog, great sampling
sm := mean - 0.5                                // distance from pure chance
// curve-fit polynomial: the standard deviation that traces the "boomerang"
stdev := sm*sm*(-11.3743730736561 + sm*sm*(-6207.21103956717 + sm*sm*1409626.43860565)) + .045
variation := .38*(.05 - math.Abs(sm))           // allowed wiggle around that curve...
if variation > .0065 { variation = .0065 }      // ...capped near the middle
// divide by sqrt(trials) to shrink the population spread down to study-mean spread
prob := (random.NormFloat64()*(stdev + 2*variation*random.Float64() - variation))/math.Sqrt(trials[slot]) + mean
if prob < 0.45 { prob = 1 - prob }              // fold impossible values back, as before

Whew! Figuring all that out was a pain, but it has a good effect on the numbers.

Success/Trials         H1/H0              H2/H0              H3/H0        H4/H0
 828/1560              0.588416           1.211814           4.775636     5.68318
 238/ 480              0.057852           0.049696           0.485005     0.561392
 246/ 480              0.066529           0.092845           0.552351     0.627585
 536/1080              0.039115           0.032298           0.345376     0.391712
2790/5400              0.340399           0.678042           2.97905      3.431167
1274/2400              2.448912           4.912015          19.251378    23.724679
1186/2400              0.03033            0.017051           0.268514     0.301883
1251/2496              0.025082           0.02749            0.226681     0.25188
1105/2304              0.178838           0.008695           1.499272     1.768971
overall Bayes Factor   0                  0                  2.312523     8.588008
1/BF                   99536693.052594    407936437.373478   0.432428     0.116441

We’ve finally entered the realm of “positive” results, as per Kass and Raftery, and our “hero” study from last time is less important (though still a deal-breaker if removed). But can we get numbers even more favorable to precognition?
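Since the overall Bayes Factor is just the product of the per-study factors, the “deal-breaker” claim is easy to check yourself. This sketch multiplies the H4/H0 column copied from the table above, with and without the hero study:

```go
package main

import "fmt"

// product multiplies the factors, skipping index `skip` (use -1 to keep all).
func product(xs []float64, skip int) float64 {
	p := 1.0
	for i, x := range xs {
		if i != skip {
			p *= x
		}
	}
	return p
}

func main() {
	// per-study H4/H0 Bayes Factors, copied from the table above
	h4 := []float64{5.68318, 0.561392, 0.627585, 0.391712, 3.431167,
		23.724679, 0.301883, 0.25188, 1.768971}
	heroIndex := 5 // the 1274/2400 study

	fmt.Printf("overall H4/H0: %.6f\n", product(h4, -1))                // ~8.588, as in the table
	fmt.Printf("without the hero study: %.6f\n", product(h4, heroIndex)) // below 1: against precognition
}
```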

We could always cheat and shape our hypothesis to fit the data, then marvel when the two are a near-perfect match. Feeding these nine experiments into a spreadsheet, we find they result in a weighted average success rate of 50.8%, with a weighted standard deviation of 1.08 percentage points. This is a twisted result; assuming the population distribution is Gaussian and reflects the deviation of the study means, some numeric simulations suggest the population’s standard deviation is 65.3 percentage points. This would mean about nineteen in twenty people have a noticeable level of precognition, and at the extremes 1.95 million Canadians have a success rate greater than 95%. The priors on this H5 are subterranean.
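If you’d rather skip the spreadsheet, the pooled average is easy to reproduce in code. Be warned that “weighted standard deviation” has several competing definitions, so the spread this convention produces may not exactly match the figure quoted above:

```go
package main

import (
	"fmt"
	"math"
)

// weightedStats pools the studies into a trial-weighted mean success rate,
// plus the trial-weighted standard deviation of the per-study rates around
// that mean (one of several possible conventions).
func weightedStats(successes, trials []int) (mean, stdev float64) {
	var s, n float64
	for i := range trials {
		s += float64(successes[i])
		n += float64(trials[i])
	}
	mean = s / n
	var ss float64
	for i := range trials {
		r := float64(successes[i]) / float64(trials[i])
		ss += float64(trials[i]) * (r - mean) * (r - mean)
	}
	return mean, math.Sqrt(ss / n)
}

func main() {
	// the nine studies from the table above
	successes := []int{828, 238, 246, 536, 2790, 1274, 1186, 1251, 1105}
	trials := []int{1560, 480, 480, 1080, 5400, 2400, 2400, 2496, 2304}
	mean, stdev := weightedStats(successes, trials)
	fmt.Printf("weighted mean: %.4f, weighted stdev: %.4f\n", mean, stdev)
}
```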

my kind of terranean

Still, it makes a handy benchmark; if this doesn’t show a stronger signal in favor of precognition, something’s gone horribly wrong. So let’s swap in this…

// H5: cheating, drawn from experiments with certain values
prob := (random.NormFloat64()*0.0176074877) + 0.5082795699

… and see what we get.

Success/Trials         H1/H0              H2/H0              H3/H0        H4/H0        H5/H0
 828/1560              0.588416           1.211814           4.775636     5.68318      6.554058
 238/ 480              0.057852           0.049696           0.485005     0.561392     0.733906
 246/ 480              0.066529           0.092845           0.552351     0.627585     0.909321
 536/1080              0.039115           0.032298           0.345376     0.391712     0.589348
2790/5400              0.340399           0.678042           2.97905      3.431167     6.562676
1274/2400              2.448912           4.912015          19.251378    23.724679    26.016863
1186/2400              0.03033            0.017051           0.268514     0.301883     0.464107
1251/2496              0.025082           0.02749            0.226681     0.25188      0.467214
1105/2304              0.178838           0.008695           1.499272     1.768971     1.295836
overall Bayes Factor   0                  0                  2.312523     8.588008     123.66863
1/BF                   99536693.052594    407936437.373478   0.432428     0.116441     0.008086

Now we’re up to “strongly in favor,” but it only takes two datasets to flip the Bayes Factor. Still not impressive.

Maybe we need more data. There are eleven more studies and datasets mentioned in Bem’s paper, three of which are by other authors. Some of them are too poorly described to be sure of the numbers; others are better described by ANOVA or other statistical methods. However, all of them have a success rate and p-value attached, which can be used to reverse-engineer the total sample size for a binomial test. So we can throw them on the pile, too.

Success  Trials  Experiment
     59      97  Experiment 3: retro prime, 0.25s < x < 1.5s
     58      99  Experiment 4: retro prime, 0.25s < x < 1.5s
   1242    2400  Experiment 6: negative
   1153    2400  Experiment 6: erotic
    109     174  Experiment 8: word recall, stimulus seeking
     30      66  Experiment 8: word recall, not s.s.
     32      53  Experiment 9: word recall, stimulus seeking
     65     108  Experiment 9: word recall, not s.s.
    257     479  Savva 2004: spiders
    300     557  Parker 2010: habituation
     79     176  Parker 2010: non-habituation
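For the curious, the reverse-engineering leans on the normal approximation to the binomial: under the null, a study’s z-score is (rate − 0.5)/(0.5/√n), so a reported success rate and a one-tailed p-value pin down n. Here’s a sketch, with an invented example rather than one of Bem’s actual studies:

```go
package main

import (
	"fmt"
	"math"
)

// trialsFromStats back-solves a binomial test's sample size from its
// reported success rate and one-tailed p-value, via the normal
// approximation z = (rate - 0.5) / (0.5 / sqrt(n)).
func trialsFromStats(rate, pValue float64) int {
	z := math.Sqrt2 * math.Erfinv(1-2*pValue) // inverse normal CDF at 1-p
	n := z * 0.5 / (rate - 0.5)
	return int(math.Round(n * n))
}

func main() {
	// hypothetical example: a reported 53% success rate at p = 0.01
	fmt.Println(trialsFromStats(0.53, 0.01))
}
```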

And thanks to the magic of computers, updating all our Bayes Factors is quite painless.

Success/Trials   H1/H0              H2/H0              H3/H0        H4/H0       H5/H0
 828/1560        0.606666           1.198943           4.784953     5.705459    6.572219
 238/ 480        0.057603           0.04947            0.484612     0.562555    0.733823
 246/ 480        0.065462           0.094775           0.551628     0.629002    0.908678
 536/1080        0.039285           0.031954           0.345834     0.393145    0.58999
2790/5400        0.333243           0.680725           3.003229     3.448837    6.588456
1274/2400        2.434773           4.900013          19.316269    23.771786   26.150746
1186/2400        0.030896           0.016999           0.269643     0.300326    0.464338
1251/2496        0.025171           0.027394           0.225731     0.253144    0.465799
1105/2304        0.174386           0.008942           1.500294     1.766644    1.297245
  59/  97        1.212351           2.408451           2.59965      1.548541    1.631562
  58/  99        0.532767           1.018189           1.762424     1.269398    1.407962
1242/2400        0.111135           0.216742           0.955962     1.111527    1.959573
1153/2400        0.162183           0.009154           1.372023     1.617248    1.232864
 109/ 174       25.427308          50.621886          23.055238     5.487296    4.325966
  30/  66        0.20185            0.09183            0.816047     0.948961    0.887938
  32/  53        0.522674           0.972936           1.409404     1.101852    1.226789
  65/ 108        1.110218           2.211848           2.657623     1.598246    1.662689
 257/ 479        0.203611           0.392438           1.480858     1.488709    1.762783
 300/ 557        0.275613           0.542966           1.987574     1.970612    2.206632
  79/ 176        0.239404           0.041192           0.917414     1.149373    0.82113
Overall BF       0                  0               2679.759382   954.921232   17359.565519
1/Overall        217082215663.106   938221934045.138   0.000373     0.001047    0.000058

Interestingly, the additional data didn’t benefit H4 as much as it benefited H3, and even H5 can be inferior to other hypotheses in some circumstances. How does that work? Let’s take a look at Experiment 9, non-stimulus seeking, which had an estimated 65 successes in 108 trials. Some poking on the pocket calculator reveals that’s a success ratio of about 60%. Scroll back up to the graphs of H3 and H4; notice how H3 has a bit more height at the 60% mark than H4? Thanks to sloppy sampling, it puts more weight on values in that range than its sharper cousin, so it does better on studies with that success rate. H5 places much greater emphasis on lower success ratios, so it too doesn’t fare as well as H3.

If data comes up that doesn’t square well with a hypothesis, its certainty takes a hit. But if we’re comparing it to another hypothesis that also doesn’t predict the data, the Bayes Factor will remain close to 1 and our certainties won’t shift much at all. Likewise, if both hypotheses strongly predict the data, the Factor again stays close to 1. If we’re looking to really shift our certainty around, we need a big Bayes Factor, which means we need to find scenarios where one hypothesis strongly predicts the data while the other strongly predicts this data shouldn’t happen.
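A toy illustration of that point, with invented per-observation probabilities:

```go
package main

import "fmt"

// bayesFactor multiplies the per-observation likelihood ratio P(d|A)/P(d|B).
func bayesFactor(pA, pB []float64) float64 {
	bf := 1.0
	for i := range pA {
		bf *= pA[i] / pB[i]
	}
	return bf
}

func main() {
	// both hypotheses predict each observation well: certainty doesn't move
	fmt.Println(bayesFactor([]float64{0.9, 0.9, 0.9}, []float64{0.9, 0.9, 0.9})) // 1
	// both predict it poorly: still no movement
	fmt.Println(bayesFactor([]float64{0.01, 0.01}, []float64{0.01, 0.01})) // 1
	// A predicts the data, B practically forbids it: certainty shifts fast
	fmt.Println(bayesFactor([]float64{0.9, 0.9, 0.9}, []float64{0.1, 0.1, 0.1})) // 729
}
```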

Or, in other words, we should look for situations where one theory is… false. That sounds an awful lot like falsification! Wow, is there anything Bayes’ Theorem can’t do?

Bayes is MAGICKS?!

Surprisingly, it also points out why being proven wrong is so valuable to a theory. Not seeing it? Let’s pit Aristotelian gravity (things seek out their natural place) against Newtonian gravity (objects attract one another proportional to their mass and inversely proportional to the square of their distance). Both predict that if I drop an object, it’ll hit Earth and come to rest. I drop an object. It hits Earth and comes to rest. The Bayes Factor between the two hypotheses sits at 1, meaning my relative certainty doesn’t budge. I repeat the experiment many times, and the Bayes Factor remains stuck at 1. Both theories describe this situation equally well, and I’m fully justified in using either.

I need a tie-breaker, some situation where one theory falls flat while the other remains strong. The obvious one is to ask a different question: how does that object fall? There are an infinite number of possibilities to choose from; it might move at any number of constant speeds, or accelerate, or do all sorts of weird tricks. Newtonian gravity gives me a very precise answer: the object will accelerate at about 9.81 metres per second squared until it strikes the Earth. That’s a single possibility out of infinitely many, so a priori it’s highly unlikely and very easy to prove wrong. This makes it similar to H0, which is also just one possibility out of an infinite number, but one that carries a substantial prior probability.

In contrast, Aristotelian gravity doesn’t say how the object falls at all. Every possibility remains, so like H1 we have to integrate over all of them. And just like the battle between H0 and H1, this lack of specificity kills its relative certainty when we run the experiment again. The Bayes Factor comes out decisively in favor of Newtonian gravity, and so our certainty shifts away from Aristotelian and towards Newtonian.

The moral of the story? Theories that make specific predictions are easy to prove wrong, but when the evidence happens to support the brittle theory, it will always trump ones that make vague or no predictions.

But by now I’m sure you’re asking: what does all this mean?

[HJH 2015-06-05: added a link]