
I wasn’t expecting a seventh part to this series, but the more I read about science and hypothesis testing, both Bayesian and not, the more I realize this diagram is deeply flawed.

That’s the frequency of values that *H0* predicts we’d find over a 5,400-sample trial if precognition doesn’t exist. It also amounts to a claim that the scientific literature on precognition should exhibit that distribution.

It will never happen, though. Even if the experimental outcome is a perfect random binary, the total number of experiments is still finite. It may converge on that distribution if run indefinitely, but it only ever *converges*; when the first data point arrives it could land anywhere in that distribution, perhaps even in the extremes, and subsequent data will only gradually reveal the underlying distribution. It will never exactly match it.
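To make that concrete, here’s a minimal, self-contained Go sketch, separate from the series’ actual program, that simulates a handful of 5,400-trial studies under a perfectly fair coin; the seed and the number of studies are arbitrary choices of mine. Each run lands *somewhere* on the curve *H0* predicts, but no finite pile of runs ever traces the whole curve out.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	rng := rand.New(rand.NewSource(1)) // arbitrary seed
	const trials = 5400

	// Simulate five independent studies under H0: pure 50/50 chance.
	for study := 1; study <= 5; study++ {
		successes := 0
		for i := 0; i < trials; i++ {
			if rng.Float64() < 0.5 {
				successes++
			}
		}
		fmt.Printf("study %d: %d/%d successes\n", study, successes, trials)
	}
}
```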

That’s no big deal for Bayesian approaches, which don’t assume an infinite number of trials, but it’s the doorway to a similar problem: we’re assuming all of these tests are free of bias. We know that even high-level physics experiments can be completely invalidated by a subtle skew, and there’s no reason to think a messier social-science experiment would be less vulnerable. Critics of Bem have argued that all but two of the data-sets I’ve examined here involve some level of subjective interpretation by the researchers.[1] Worst of all, it doesn’t take much; results strong enough to earn impressive p-values against a perfectly random null hypothesis can lose their significance once we allow for even a slight skew, well below what a human being could detect.
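To put a number on “it doesn’t take much”, here’s a small sketch of my own, not part of the series’ program, showing the one-sided p-value that the *average* outcome of a 5,400-trial study would earn under a few undetectably small skews. The bias levels are illustrative, and the normal approximation is the same one used for the publication filter later in this post.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const trials = 5400.0
	mean := trials * 0.5           // expected successes under a perfect 50/50 null
	stdev := math.Sqrt(mean * 0.5) // binomial standard deviation at p = 0.5

	for _, bias := range []float64{0.51, 0.52, 0.53} {
		successes := bias * trials // average successes if the apparatus has a slight skew
		// one-sided p-value of that outcome against the perfect null
		pval := 0.5 * (1 + math.Erf((mean-successes)/(stdev*math.Sqrt(2))))
		fmt.Printf("bias %.0f%%: p of the average outcome ≈ %.5f\n", bias*100, pval)
	}
}
```

A 52% skew, something no experimenter would notice by eye, already lands comfortably below the 0.05 line.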

So what skews are possible? I can’t find any good studies quantifying them, and the plural matters: more than one skew is possible, so, like the precog-friendly hypotheses, we’re going to be sampling from a range. That matches what the literature suggests, but it still doesn’t tell us which range. We could easily nullify the results by arbitrarily tweaking epsilon, or by fitting to the data we have, so we need a non-arbitrary tweak.

Back when we were forming those precognition hypotheses, though, I pointed to 55% as the smallest probability above 50/50 that a typical person could tell apart from chance. Scientists are people too, so it’s plausible they’d also fail to notice bias below that level. But they run experiments for a living, so noticeable biases should get weeded out entirely; that suggests a distribution clamped to zero outside of 45-55% and clumped around the 50% mark. The simplest distribution that fits is a pyramid: a triangle that peaks at 50% and falls linearly to zero at 45% and 55%.

I know it doesn’t look much like a pyramid here, but I set the y-axis to a logarithmic scale so I could compare it to *H3* and *H4*. There isn’t much resemblance to those two either, at least, so we can’t be accused of sneaking a precog hypothesis into the null. After some fun math shenanigans, we craft a proper *H(-1)* to plug into our program:

```go
// H(-1): incorporating non-null nulls.
// An inverse-CDF draw from that triangular distribution on [0.45, 0.55], peaked at 0.5.
prob := random.Float64()
if prob < 0.5 {
	prob = 0.05*math.Sqrt(2.0*prob) + 0.45
} else {
	prob = 0.55 - 0.05*math.Sqrt(2.0-2.0*prob)
}
```
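As a sanity check, here’s a self-contained sketch, separate from the series’ program, that pushes a million uniform draws through that same transform and prints a coarse histogram; the seed and bin width are my own choices. The counts should stay inside 45-55% and peak at the 50% mark.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

func main() {
	rng := rand.New(rand.NewSource(42)) // arbitrary seed, standing in for the program's "random"
	bins := make([]int, 10)             // ten one-percentage-point bins covering 45% to 55%

	for i := 0; i < 1000000; i++ {
		prob := rng.Float64() // same H(-1) transform as above
		if prob < 0.5 {
			prob = 0.05*math.Sqrt(2.0*prob) + 0.45
		} else {
			prob = 0.55 - 0.05*math.Sqrt(2.0-2.0*prob)
		}
		idx := int((prob - 0.45) / 0.01)
		if idx > 9 { // guard against floating-point edge cases
			idx = 9
		}
		bins[idx]++
	}

	for i, count := range bins {
		fmt.Printf("%.2f-%.2f: %7d\n", 0.45+0.01*float64(i), 0.46+0.01*float64(i), count)
	}
}
```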

… and when we hit “run” …

| Success/Trials | H(-1)/H0 | H3/H(-1) | H4/H(-1) |
| --- | --- | --- | --- |
| 828/1560 | 4.744144 | 1.0086019733 | 1.2026319184 |
| 238/480 | 0.73873 | 0.6560069308 | 0.7615163862 |
| 246/480 | 0.787375 | 0.700591205 | 0.7988595015 |
| 536/1080 | 0.587574 | 0.5885794811 | 0.6690987008 |
| 2790/5400 | 4.560903 | 0.658472456 | 0.7561741611 |
| 1274/2400 | 18.993672 | 1.0169844462 | 1.2515634681 |
| 1186/2400 | 0.490103 | 0.5501761875 | 0.6127813949 |
| 1251/2496 | 0.424827 | 0.5313480546 | 0.5958754976 |
| 1105/2304 | 2.083653 | 0.7200306385 | 0.8478590245 |
| 59/97 | 1.274387 | 2.0399219389 | 1.2151261744 |
| 58/99 | 1.142614 | 1.5424491561 | 1.1109596067 |
| 1242/2400 | 1.428732 | 0.6690981934 | 0.7779814549 |
| 1153/2400 | 1.941262 | 0.7067685866 | 0.8330910511 |
| 109/174 | 2.968007 | 7.767919011 | 1.8488150466 |
| 30/66 | 0.97406 | 0.8377789869 | 0.9742325935 |
| 32/53 | 1.054968 | 1.3359684844 | 1.0444411584 |
| 65/108 | 1.298163 | 2.0472182615 | 1.2311597234 |
| 257/479 | 1.344006 | 1.1018239502 | 1.107665442 |
| 300/557 | 1.656973 | 1.1995210544 | 1.1892843154 |
| 79/176 | 1.088495 | 0.8428279413 | 1.0559285987 |
| **Overall BF** | 2361.947846 | 1.1345548491 | 0.4042939532 |
| **1/Overall** | 0.000423 | 0.881402958 | 2.4734478267 |

Well now, looks like we nearly nullified twenty tests of precognition. But we’re only partway done, because we haven’t done anything to cover publication bias.

Null results are very difficult to publish, even when they’re relevant to the literature. Most science journals have a fetish for p-values, with a small but growing number of exceptions, and taken to the extreme, that preference results in a graph more like this.

Any study which doesn’t make the significance cutoff, falling into that gray area, is shoved into the file drawer and starves *H0* of data that would have vindicated it.

To compensate for this effect, then, we need to make more adjustments to *H(-1)*. Fortunately, we have studies of published p-values to guide us.[2][3]

There’s a suspicious-looking cliff at p = 0.05, due to some combination of filed-away results and statistical massaging. You can see this more clearly with data I parsed from Google Scholar.

Now, this dataset is problematic; Scholar tries to be smart and automatically matches “<” with “=”, adding in extra results, and it’s based on Scholar’s estimate of how many results were returned, which is frequently wrong. Even with a muddy dataset, though, that cliff still stands tall. The Scholar set also includes a bit of a pimple just below that cliff, probably from both scientists and publishers saying “close enough to significance.” My curve fit above suggests a study is about 2.6x as likely to be published if it passes that 0.05 significance level, which is in the range of what other studies have found. If we take significant results as always published, that works out to non-significant ones making it into print about 39% of the time (1/2.6 ≈ 0.39), and we can compensate for it easily enough in the null.

```go
// H(-1): incorporating non-null nulls
prob := random.Float64()
if prob < 0.5 {
	prob = 0.05*math.Sqrt(2.0*prob) + 0.45
} else {
	prob = 0.55 - 0.05*math.Sqrt(2.0-2.0*prob)
}

// H(-2): H(-1) plus publication bias; non-significant tests only get published 39% of the time
mean := trials[slot] * 0.5
stdev := math.Sqrt(mean * 0.5)
pval := 0.5 * (1 + math.Erf((mean-prob*trials[slot])/(stdev*math.Sqrt(2))))
if pval > 0.05 && random.Float64() > 0.39 {
	continue
}
```
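To see how aggressive that filter is, here’s a rough, self-contained check of my own (the 2,400-trial study size and seed are arbitrary, and `trials[slot]` from the snippet above is replaced by a constant): it draws virtual studies from *H(-1)*, applies the same p-value approximation and 39% publication rule, and reports what fraction would make it into print.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

func main() {
	rng := rand.New(rand.NewSource(7)) // arbitrary seed
	const trials = 2400.0              // stand-in for trials[slot]
	published, total := 0, 100000

	for i := 0; i < total; i++ {
		// H(-1): draw a slightly-skewed "null" study
		prob := rng.Float64()
		if prob < 0.5 {
			prob = 0.05*math.Sqrt(2.0*prob) + 0.45
		} else {
			prob = 0.55 - 0.05*math.Sqrt(2.0-2.0*prob)
		}

		// H(-2): same normal-approximation p-value and publication filter as above
		mean := trials * 0.5
		stdev := math.Sqrt(mean * 0.5)
		pval := 0.5 * (1 + math.Erf((mean-prob*trials)/(stdev*math.Sqrt(2))))
		if pval > 0.05 && rng.Float64() > 0.39 {
			continue // filed away, never published
		}
		published++
	}

	fmt.Printf("published %d of %d virtual studies (%.1f%%)\n",
		published, total, 100*float64(published)/float64(total))
}
```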

And here are the results (I had to halve the sample count, as it kept timing out).

| Success/Trials | H(-2)/H0 | H3/H(-2) | H4/H(-2) |
| --- | --- | --- | --- |
| 828/1560 | 7.309466 | 0.6546241545 | 0.7805575674 |
| 238/480 | 0.71228 | 0.6803672713 | 0.7897947436 |
| 246/480 | 0.775151 | 0.7116394096 | 0.8114573806 |
| 536/1080 | 0.507649 | 0.6812462942 | 0.7744425775 |
| 2790/5400 | 0.150697 | 19.9289236017 | 22.8859035017 |
| 1274/2400 | 0.030544 | 632.4079688318 | 778.2800550026 |
| 1186/2400 | 2.725926 | 0.0989179457 | 0.1101739372 |
| 1251/2496 | 2.975454 | 0.0758643891 | 0.0850774369 |
| 1105/2304 | 0.642528 | 2.3349861796 | 2.7495206435 |
| 59/97 | 1.276081 | 2.0372139386 | 1.213513092 |
| 58/99 | 1.143731 | 1.5409427566 | 1.1098746121 |
| 1242/2400 | 0.575279 | 1.6617363053 | 1.9321529206 |
| 1153/2400 | 0.68717 | 1.9966281997 | 2.3534904027 |
| 109/174 | 2.971811 | 7.7579758605 | 1.8464485124 |
| 30/66 | 0.975364 | 0.8366589294 | 0.9729301061 |
| 32/53 | 1.051658 | 1.3401733263 | 1.047728444 |
| 65/108 | 1.290772 | 2.058940696 | 1.2382093817 |
| 257/479 | 1.43638 | 1.0309653434 | 1.0364311672 |
| 300/557 | 1.895535 | 1.0485556848 | 1.0396072877 |
| 79/176 | 1.087194 | 0.8438365186 | 1.0571921847 |
| **Overall BF** | 3672.650221 | 0.7296527632 | 0.2600087606 |
| **1/Overall** | 0.000273 | 1.3705149222 | 3.846024256 |

Now the data argues *against* every precognition hypothesis to varying degrees, save that cheater *H5*. Interestingly, under *H4* we still find that half or more of the studies have Bayes Factors greater than 1, as it only went from 14 to 10 such datasets in the transition from *H0* to *H(-2)*. This reversal came mostly from chopping down magnitudes.

But there’s another interesting pattern here. Published p-values follow Benford’s Law: smaller values are more likely to occur than larger ones, according to a power law. For numbers, that means values with fewer digits outnumber those with more, and smaller values are more common than larger ones with the same number of digits. That’s most obvious in my dataset, where four- and five-digit p-values are published less often than two- or three-digit ones but still follow the same basic power law.

I don’t see a need to compensate for this one, as the p-values under *H(-1)* and *H(-2)* already follow a similar distribution (though not an identical one; *H(-1)*’s, for instance, is symmetric across p = 0.5, and both have a lump around p = 0.5).

Having said that, Benford’s Law could be handy as a check in its own right: do Bem’s published results violate it? I’ll leave that as an exercise for the reader, with a starting sketch below. For now, I defer to the Bayes Bunny.
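For anyone taking up that exercise, here’s a minimal starting point; the p-values in the slice are placeholders rather than Bem’s actual numbers, and the tally only looks at the first significant digit.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Placeholder p-values; swap in the published values you want to test.
	pvals := []float64{0.01, 0.002, 0.03, 0.049, 0.2, 0.007, 0.018, 0.04}

	counts := make([]int, 10)
	total := 0
	for _, p := range pvals {
		if p <= 0 || p >= 1 {
			continue
		}
		d := p
		for d < 1 {
			d *= 10 // shift up to the first significant digit
		}
		counts[int(d)]++
		total++
	}

	fmt.Println("digit  observed  Benford-expected")
	for digit := 1; digit <= 9; digit++ {
		expected := math.Log10(1+1.0/float64(digit)) * float64(total)
		fmt.Printf("%5d  %8d  %16.2f\n", digit, counts[digit], expected)
	}
}
```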

For later, though, is a different story.

[1] Galak, Jeff, et al. “Correcting the past: Failures to replicate psi.” *Journal of Personality and Social Psychology* 103.6 (2012): 933.

[2] Masicampo, E.J., and D. Lalande. “A peculiar prevalence of p values just below .05.” *The Quarterly Journal of Experimental Psychology* (2012). http://www.tandfonline.com/doi/abs/10.1080/17470218.2012.711335.

[3] Dwan, Kerry, et al. “Systematic review of the empirical evidence of study publication bias and outcome reporting bias.” *PLoS ONE* 3.8 (2008): e3081.
