Oh whoops, we forgot to actually run the numbers for the hypothesis we developed last time, H3. To do that in our little helper, swap out:
```go
prob := random.Float64() // H1: not random
```

for:

```go
// H3: plausible precog, but sloppy sampling
prob := random.NormFloat64()*random.Float64()*0.05 + random.Float64()*0.1 + 0.45
if prob < 0.45 {
	prob = 1 - prob
}
```
And tack the results up next to our prior runs.
| Success/Trials | H1/H0 | H2/H0 | H3/H0 |
| --- | --- | --- | --- |
| 828/1560 | 0.588416 | 1.211814 | 4.775636 |
| 238/480 | 0.057852 | 0.049696 | 0.485005 |
| 246/480 | 0.066529 | 0.092845 | 0.552351 |
| 536/1080 | 0.039115 | 0.032298 | 0.345376 |
| 2790/5400 | 0.340399 | 0.678042 | 2.97905 |
| 1274/2400 | 2.448912 | 4.912015 | 19.251378 |
| 1186/2400 | 0.03033 | 0.017051 | 0.268514 |
| 1251/2496 | 0.025082 | 0.02749 | 0.226681 |
| 1105/2304 | 0.178838 | 0.008695 | 1.499272 |
| overall Bayes Factor | 0 | 0 | 2.312523 |
| 1/BF | 99536693.052594 | 407936437.373478 | 0.432428 |
Wow, that’s quite an improvement! Jettisoning those nonsense values really cranked the odds, if by “cranked” we mean “reached the ‘barely noteworthy’ level.” Even then, the results are tenuous; remove the best study, and the Bayes Factor swings against precognition.
But the improvement comes with a cost. Here’s how H3 claims precognitive ability is spread across all humans. The vertical axis is the raw population count, on a logarithmic scale; the horizontal axis is the success rate. Black areas represent certain precognitive ability, white areas indicate no precognition, and gray areas indicate plausible levels of precognition under the model.
In real terms, at the model's upper bound some 490,000 Canadians can see what’s coming with greater than 66% accuracy, but the model's expected number is around 9,000. Look at success rates over 75%, and the numbers decline to 1,120 and 6, respectively. We’ve cranked the odds of precognition, but only by watering down what we mean by “precognition.” Those spectacular extremes don’t exist in this model.
It doesn’t help that H3 does a poor job of representing its assumptions.
H3 samples every point within the bounds of this graph, yet only points on that fuzzy boomerang should count towards precognitive ability. With a little math and curve-fitting, we can craft an H4 that does a much better job of grasping the boomerang; on this chart, it only samples the area between those two red lines. See if you can spot the difference between H3’s predictions for the population, and H4’s:
We should also stop equivocating between the variation in the general population and the variation between studies; while the two are related, they’re not the same. Say we randomly pluck two hundred people from the general public and calculate the average precognitive ability of the group. While this number is a decent estimate of the population average, the random variance within the population means it’s almost certain the sample’s average will be slightly better or worse than the population average. As you increase the sample size, this difference will gradually fade away.
But if you do multiple small studies, instead of one bigger one, you’ll find their averages dance around the population average, and the distribution of the dance will be strongly correlated to the distribution in the population. Since these averages are, well, averages, the variance between them will be muted down relative to the population. In the case of multiple two hundred sample studies, for instance, the standard deviation between their means will be about 7% of the population standard deviation. Compensating for this effect will toss out even more garbage samples, and further improve the numbers for precognition.
Tightening up the sampling, however, means that H4 requires substantially more code than H3.
```go
// H4: plausible precog, great sampling
mean := random.Float64()*.1 + .45
sm := mean - 0.5
stdev := sm*sm*(-11.3743730736561+sm*sm*(-6207.21103956717+sm*sm*1409626.43860565)) + .045
variation := .38 * (.05 - math.Abs(sm))
if variation > .0065 {
	variation = .0065
}
prob := (random.NormFloat64()*(stdev+2*variation*random.Float64()-variation))/math.Sqrt(trials[slot]) + mean
if prob < 0.45 {
	prob = 1 - prob
}
```
Whew! Figuring all that out was a pain, but it has a good effect on the numbers.
| Success/Trials | H1/H0 | H2/H0 | H3/H0 | H4/H0 |
| --- | --- | --- | --- | --- |
| 828/1560 | 0.588416 | 1.211814 | 4.775636 | 5.68318 |
| 238/480 | 0.057852 | 0.049696 | 0.485005 | 0.561392 |
| 246/480 | 0.066529 | 0.092845 | 0.552351 | 0.627585 |
| 536/1080 | 0.039115 | 0.032298 | 0.345376 | 0.391712 |
| 2790/5400 | 0.340399 | 0.678042 | 2.97905 | 3.431167 |
| 1274/2400 | 2.448912 | 4.912015 | 19.251378 | 23.724679 |
| 1186/2400 | 0.03033 | 0.017051 | 0.268514 | 0.301883 |
| 1251/2496 | 0.025082 | 0.02749 | 0.226681 | 0.25188 |
| 1105/2304 | 0.178838 | 0.008695 | 1.499272 | 1.768971 |
| overall Bayes Factor | 0 | 0 | 2.312523 | 8.588008 |
| 1/BF | 99536693.052594 | 407936437.373478 | 0.432428 | 0.116441 |
We’ve finally entered the realm of “positive” results, as per Kass and Raftery, and our “hero” study from last time is less important (though still a deal-breaker if removed). But can we get numbers even more favorable to precognition?
We could always cheat and shape our hypothesis to fit the data, then marvel when the two are a near-perfect match. Feeding these nine experiments into a spreadsheet, we find they result in a weighted average success rate of 50.8%, with a weighted standard deviation of 1.08 percentage points. This is a twisted result; assuming the population distribution is Gaussian and reflects the deviation of the study means, then some numeric simulations suggest the population’s standard deviation is 65.3 percentage points. This would mean about nineteen in twenty people would have a noticeable level of precognition, and at the extremes 1.95 million Canadians would have a success rate greater than 95%. The priors on this H5 are subterranean.
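The weighted average is just total successes over total trials; a quick sanity check against the nine studies in the table:

```go
package main

import "fmt"

// weightedRate returns total successes over total trials, i.e. the
// trial-weighted average success rate across studies.
func weightedRate(successes, trials []float64) float64 {
	var hits, total float64
	for i := range successes {
		hits += successes[i]
		total += trials[i]
	}
	return hits / total
}

func main() {
	// Success and trial counts for the nine original studies.
	successes := []float64{828, 238, 246, 536, 2790, 1274, 1186, 1251, 1105}
	trials := []float64{1560, 480, 480, 1080, 5400, 2400, 2400, 2496, 2304}
	fmt.Printf("weighted average success rate: %.1f%%\n",
		100*weightedRate(successes, trials)) // prints 50.8%
}
```

That 9454/18600 ratio is exactly where H5's mean of 0.5082795699 comes from.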
Still, it makes a handy benchmark; if this doesn’t show a stronger signal in favor of precognition, something’s gone horribly wrong. So let’s swap in this…
```go
// H5: cheating, drawn from experiments with certain values
prob := random.NormFloat64()*0.0176074877 + 0.5082795699
```
… and see what we get.
| Success/Trials | H1/H0 | H2/H0 | H3/H0 | H4/H0 | H5/H0 |
| --- | --- | --- | --- | --- | --- |
| 828/1560 | 0.588416 | 1.211814 | 4.775636 | 5.68318 | 6.554058 |
| 238/480 | 0.057852 | 0.049696 | 0.485005 | 0.561392 | 0.733906 |
| 246/480 | 0.066529 | 0.092845 | 0.552351 | 0.627585 | 0.909321 |
| 536/1080 | 0.039115 | 0.032298 | 0.345376 | 0.391712 | 0.589348 |
| 2790/5400 | 0.340399 | 0.678042 | 2.97905 | 3.431167 | 6.562676 |
| 1274/2400 | 2.448912 | 4.912015 | 19.251378 | 23.724679 | 26.016863 |
| 1186/2400 | 0.03033 | 0.017051 | 0.268514 | 0.301883 | 0.464107 |
| 1251/2496 | 0.025082 | 0.02749 | 0.226681 | 0.25188 | 0.467214 |
| 1105/2304 | 0.178838 | 0.008695 | 1.499272 | 1.768971 | 1.295836 |
| overall Bayes Factor | 0 | 0 | 2.312523 | 8.588008 | 123.66863 |
| 1/BF | 99536693.052594 | 407936437.373478 | 0.432428 | 0.116441 | 0.008086 |
Now we’re up to “strongly in favor,” but it only takes two datasets to flip the Bayes Factor. Still not impressive.
Maybe we need more data. There are eleven more studies and datasets mentioned in Bem’s paper, three of which are by other authors. Some of them are too poorly described to be sure of the numbers; others are better described by ANOVA or other statistical methods. However, all of them have a success rate and p-value attached, which can be used to reverse-engineer the total sample size for a binomial test. So we can throw them on the pile, too.
| Success | Trials | Experiment |
| --- | --- | --- |
| 59 | 97 | Experiment 3: retro prime, 0.25s < x < 1.5s |
| 58 | 99 | Experiment 4: retro prime, 0.25s < x < 1.5s |
| 1242 | 2400 | Experiment 6: negative |
| 1153 | 2400 | Experiment 6: erotic |
| 109 | 174 | Experiment 8: word recall, stimulus seeking |
| 30 | 66 | Experiment 8: word recall, not s.s. |
| 32 | 53 | Experiment 9: word recall, stimulus seeking |
| 65 | 108 | Experiment 9: word recall, not s.s. |
| 257 | 479 | Savva 2004: spiders |
| 300 | 557 | Parker 2010: habituation |
| 79 | 176 | Parker 2010: non-habituation |
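The reverse-engineering step can be sketched as a search over candidate sample sizes, keeping the one whose one-sided binomial p-value at the reported success rate comes closest to the published p-value. This is my guess at a workable procedure, not necessarily the exact one used for the numbers above:

```go
package main

import (
	"fmt"
	"math"
)

// binomTail returns P(X >= k) for X ~ Binomial(n, 0.5), summing exact
// terms in log space via math.Lgamma.
func binomTail(k, n int) float64 {
	total := 0.0
	for i := k; i <= n; i++ {
		lgN, _ := math.Lgamma(float64(n + 1))
		lgI, _ := math.Lgamma(float64(i + 1))
		lgNI, _ := math.Lgamma(float64(n - i + 1))
		total += math.Exp(lgN - lgI - lgNI + float64(n)*math.Log(0.5))
	}
	return total
}

// inferTrials scans sample sizes and returns the n whose one-sided
// p-value, at the reported success rate, is closest to pValue.
func inferTrials(rate, pValue float64, maxN int) int {
	bestN, bestDiff := 0, math.Inf(1)
	for n := 10; n <= maxN; n++ {
		k := int(math.Round(rate * float64(n)))
		diff := math.Abs(binomTail(k, n) - pValue)
		if diff < bestDiff {
			bestN, bestDiff = n, diff
		}
	}
	return bestN
}

func main() {
	// Example: a reported 60% success rate with a one-sided p ≈ 0.025
	// implies a sample size in the low hundreds.
	fmt.Println(inferTrials(0.60, 0.025, 500))
}
```

Because the success count has to be a whole number, the recovered n is only approximate, which is one reason to treat these reconstructed studies with some caution.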
And thanks to the magic of computers, updating all our Bayes Factors is quite painless.
| Success/Trials | H1/H0 | H2/H0 | H3/H0 | H4/H0 | H5/H0 |
| --- | --- | --- | --- | --- | --- |
| 828/1560 | 0.606666 | 1.198943 | 4.784953 | 5.705459 | 6.572219 |
| 238/480 | 0.057603 | 0.04947 | 0.484612 | 0.562555 | 0.733823 |
| 246/480 | 0.065462 | 0.094775 | 0.551628 | 0.629002 | 0.908678 |
| 536/1080 | 0.039285 | 0.031954 | 0.345834 | 0.393145 | 0.58999 |
| 2790/5400 | 0.333243 | 0.680725 | 3.003229 | 3.448837 | 6.588456 |
| 1274/2400 | 2.434773 | 4.900013 | 19.316269 | 23.771786 | 26.150746 |
| 1186/2400 | 0.030896 | 0.016999 | 0.269643 | 0.300326 | 0.464338 |
| 1251/2496 | 0.025171 | 0.027394 | 0.225731 | 0.253144 | 0.465799 |
| 1105/2304 | 0.174386 | 0.008942 | 1.500294 | 1.766644 | 1.297245 |
| 59/97 | 1.212351 | 2.408451 | 2.59965 | 1.548541 | 1.631562 |
| 58/99 | 0.532767 | 1.018189 | 1.762424 | 1.269398 | 1.407962 |
| 1242/2400 | 0.111135 | 0.216742 | 0.955962 | 1.111527 | 1.959573 |
| 1153/2400 | 0.162183 | 0.009154 | 1.372023 | 1.617248 | 1.232864 |
| 109/174 | 25.427308 | 50.621886 | 23.055238 | 5.487296 | 4.325966 |
| 30/66 | 0.20185 | 0.09183 | 0.816047 | 0.948961 | 0.887938 |
| 32/53 | 0.522674 | 0.972936 | 1.409404 | 1.101852 | 1.226789 |
| 65/108 | 1.110218 | 2.211848 | 2.657623 | 1.598246 | 1.662689 |
| 257/479 | 0.203611 | 0.392438 | 1.480858 | 1.488709 | 1.762783 |
| 300/557 | 0.275613 | 0.542966 | 1.987574 | 1.970612 | 2.206632 |
| 79/176 | 0.239404 | 0.041192 | 0.917414 | 1.149373 | 0.82113 |
| Overall BF | 0 | 0 | 2679.759382 | 954.921232 | 17359.565519 |
| 1/Overall | 217082215663.106 | 938221934045.138 | 0.000373 | 0.001047 | 0.000058 |
Interestingly, the additional data didn’t benefit H4 as much as it benefited H3, and even H5 is sometimes inferior to the other hypotheses. How does that work? Let’s take a look at Experiment 9, non-stimulus seeking, which had an estimated 65 successes in 108 trials. Some poking on the pocket calculator reveals that’s a success rate of about 60%. Scroll back up to the graphs of H3 and H4; notice how H3 has a bit more height at the 60% mark than H4? Thanks to its sloppy sampling, it puts more weight on values in that range than its sharper cousin, so it does better on studies with that success rate. H5 places a much greater emphasis on lower success rates, so it too fares worse than H3 there.
If data comes up that doesn’t square well with a hypothesis, its certainty takes a hit. But if we’re comparing it to another hypothesis that also doesn’t predict the data, the Bayes Factor will remain close to 1 and our certainties won’t shift much at all. Likewise, if both hypotheses strongly predict the data, the Factor again stays close to 1. If we’re looking to really shift our certainty around, we need a big Bayes Factor, which means we need to find scenarios where one hypothesis strongly predicts the data while the other strongly predicts this data shouldn’t happen.
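In odds form, the update is simply posterior odds = prior odds × Bayes Factor, which makes it obvious why a factor near 1 leaves our certainty untouched. A toy calculation, where the prior is an arbitrary stand-in rather than anyone's considered estimate:

```go
package main

import "fmt"

// posterior converts a prior probability and a Bayes Factor into a
// posterior probability via the odds form: postOdds = priorOdds * BF.
func posterior(prior, bf float64) float64 {
	odds := prior / (1 - prior) * bf
	return odds / (1 + odds)
}

func main() {
	prior := 0.01 // an arbitrary stand-in prior for precognition
	fmt.Printf("BF = 1.05:  posterior = %.4f\n", posterior(prior, 1.05))
	fmt.Printf("BF = 123.7: posterior = %.4f\n", posterior(prior, 123.7))
}
```

A Bayes Factor of 1.05 barely budges a 1% prior, while a factor of 123.7 drags it past the 50% mark.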
Or, in other words, we should look for situations where one theory is… false. That sounds an awful lot like falsification! Wow, is there anything Bayes’ Theorem can’t do?
Surprisingly, it also points out why being proven wrong is so valuable in a theory. Not seeing it? Let’s pit Aristotelian gravity (things seek out their natural place) against Newtonian gravity (objects attract one another proportionate to their mass and inversely proportionate to the square of their distance). Both predict that if I drop an object, it’ll hit Earth and come to rest. I drop an object. It hits Earth and comes to rest. The Bayes Factor between both hypotheses sits at 1, meaning I have equal certainty about both. I repeat the experiment a lot, and the Bayes Factor remains stuck at 1. Both theories describe this situation equally well, and I’m fully justified in using either.
I need a tie breaker, some situation where one theory falls flat while the other remains strong. The obvious one is to ask a different question: how does that object fall? There are an infinite number of possibilities to choose from; it might move at any number of constant speeds, or accelerate, or do all sorts of weird tricks. Newtonian gravity gives me a very precise answer: the object will accelerate at about 9.81 metres per second squared until it strikes the Earth. That’s one possibility out of infinitely many, so it’s highly unlikely a priori and very easy to prove wrong. This makes it similar to H0, which is also just one possibility out of an infinite number, but one that carries a substantial prior probability.
In contrast, Aristotelian gravity doesn’t say how the object falls at all. Every possibility remains, so like H1 we have to integrate over all of them. And just like the battle between H0 and H1, this lack of specificity kills its relative certainty when we run the experiment again. The Bayes Factor comes out decisively in favor of Newtonian gravity, and so our certainty shifts away from Aristotelian and towards Newtonian.
The moral of the story? Theories that make specific predictions are easy to prove wrong, but they’ll always trump ones that make vague or no predictions if the evidence happens to support the more brittle theory.
But by now I’m sure you’re asking: what does all this mean?
[HJH 2015-06-05: added a link]