
[Image: Bayes Bunny: Iz hate loose tweads.]

I wasn’t expecting a seventh part to this series, but the more I read about science and hypothesis testing, both Bayesian and not, the more I realize this diagram is deeply flawed.

[Image: A pure Gaussian curve.]

That’s the frequency of values that H0 predicts we’d find over a 5,400-sample trial if precognition doesn’t exist. It also claims the scientific literature on precognition should exhibit that distribution.

It will never happen, though. Even if the experimental outcome is a perfect random binary, the total number of experiments is still finite. It may converge on that distribution if run indefinitely, but it still converges; when the first data point arrives it will land anywhere in that distribution, perhaps even in the extremes, and subsequent data will only gradually reveal the underlying distribution. It will never exactly match it.

That’s no big deal for Bayesian approaches, which don’t assume an infinite number of trials, but it’s the doorway to a similar problem: we’re assuming all of these tests are free of all bias. We know that even high-level physics experiments can be completely invalidated by a subtle skew, and there’s no reason to think a messier social science experiment would be less vulnerable. Critics of Bem have argued that all but two of the data-sets I’ve examined here have a level of subjective interpretation by researchers.[1] Worst of all, it doesn’t take much; a test that results in strong p-values under a perfectly random null hypothesis will fail to reach significance if the results have even a slight skew, well below what a human being could detect.

So what skews are possible? I can’t find any good studies on that, and the question itself implies more than one skew is possible; like the precog-friendly hypotheses, we’re going to be sampling from a range. This follows what the literature suggests, but it still doesn’t answer the question. We could easily nullify the results by arbitrarily tweaking epsilon, or fitting to the data we have, so we need a non-arbitrary tweak.

[Image: How altering the epsilon around H0 affects the resulting Bayes factor.]

Back when we were forming those precognition hypotheses, though, I pointed to 55% as the minimal probability above 50/50 that the typical person could tell was different. Scientists are people too, so it’s plausible they’d also fail to notice that level of bias. But they also do experiments for a living, so noticeable biases should be weeded out completely; this suggests a distribution that clamps to zero outside of 45-55% and clumps around the 50% mark. The simplest distribution that fits is a pyramid.

[Image: A visual comparison of H3, H(-1), and H4. The first and last have long tails.]

I know it doesn’t look much like a pyramid here, but I put the y-axis to logarithmic scale so I could compare it to H3 and H4. At least there also isn’t much resemblance to those two, so we can’t be accused of sneaking a precog hypothesis into the null. After some fun math shenanigans, we craft a proper H(-1) to plug into our program:

prob := random.Float64() // H(-1): incorporating non-null nulls
if prob < 0.5 {
    prob = 0.05*math.Sqrt( 2.0*prob ) + 0.45
} else {
    prob = 0.55 - 0.05*math.Sqrt( 2.0 - 2.0*prob )
}

… and when we hit “run” …

Success/Trials H(-1)/H0 H3/H(-1) H4/H(-1)
828/1560 4.744144 1.0086019733 1.2026319184
238/480 0.73873 0.6560069308 0.7615163862
246/480 0.787375 0.700591205 0.7988595015
536/1080 0.587574 0.5885794811 0.6690987008
2790/5400 4.560903 0.658472456 0.7561741611
1274/2400 18.993672 1.0169844462 1.2515634681
1186/2400 0.490103 0.5501761875 0.6127813949
1251/2496 0.424827 0.5313480546 0.5958754976
1105/2304 2.083653 0.7200306385 0.8478590245
59/97 1.274387 2.0399219389 1.2151261744
58/99 1.142614 1.5424491561 1.1109596067
1242/2400 1.428732 0.6690981934 0.7779814549
1153/2400 1.941262 0.7067685866 0.8330910511
109/174 2.968007 7.767919011 1.8488150466
30/66 0.97406 0.8377789869 0.9742325935
32/53 1.054968 1.3359684844 1.0444411584
65/108 1.298163 2.0472182615 1.2311597234
257/479 1.344006 1.1018239502 1.107665442
300/557 1.656973 1.1995210544 1.1892843154
79/176 1.088495 0.8428279413 1.0559285987
Overall BF 2361.947846 1.1345548491 0.4042939532
1/Overall 0.000423 0.881402958 2.4734478267

[Image: Surprise Reversal Reversal - Time to Strut!]

Well now, looks like we nearly nullified twenty tests of precognition. But we’re only partway done, because we haven’t done anything to cover publication bias.

Null results are very difficult to publish, even when they’re relevant to the literature. Most science journals have a fetish for p-values, with a small but growing number of exceptions, and taken to the extreme that results in a graph more like this.

[Image: A Gaussian distribution, but with 5% of its area highlighted.]

Any study which doesn’t make the significance cutoff, falling into that gray area, is shoved into the file drawer and starves H0 of data that would have vindicated it.

To compensate for this effect, then, we need to make more adjustments to H(-1). Fortunately, we have studies of published p-values to guide us.[2][3]

[Image: Graph generated by Larry Wasserman, based off data from: Masicampo, E.J. and Lalande, D. (2012). A peculiar prevalence of p values just below .05. The Quarterly Journal of Experimental Psychology. http://www.tandfonline.com/doi/abs/10.1080/17470218.2012.711335]

There’s a suspicious-looking cliff at p = 0.05, due to some combination of filed-away results or statistical massaging. You can see this more clearly with data I parsed from Google Scholar.

[Image: Three-digit p-values listed in Google Scholar's database, with a curve fit which suggests only 40% of studies which fail to reach p < 0.05 get published.]

Now, this dataset is problematic; Scholar tries to be smart and automatically matches “<” with “=”, adding in extra results, and it’s based off their estimate of how many results were returned, which is frequently wrong. Even with a muddy dataset, though, that cliff still stands tall. The Scholar set also includes a bit of a pimple just below that cliff, probably from both scientists and publishers saying “close enough to significance.” My curve-fit above suggests a study is about 2.6x as likely to be published if it passes that 0.05 significance level, which is in the range of what other studies have found. We can compensate for it easily enough in the null.

prob := random.Float64() // H(-1): incorporating non-null nulls
if prob < 0.5 {
    prob = 0.05*math.Sqrt( 2.0*prob ) + 0.45
} else {
    prob = 0.55 - 0.05*math.Sqrt( 2.0 - 2.0*prob )
}

// H(-2): H(-1), plus only 39% of non-significant tests get published
mean := trials[slot] * 0.5
stdev := math.Sqrt( mean*0.5 )
pval := 0.5*(1 + math.Erf( (mean - prob*trials[slot])/(stdev * math.Sqrt(2)) ))
if (pval > 0.05) && (random.Float64() > 0.39) { continue }

And here are the results (I had to halve the sample count, as it kept timing out).

Success/Trials H(-2)/H0 H3/H(-2) H4/H(-2)
828/1560 7.309466 0.6546241545 0.7805575674
238/480 0.71228 0.6803672713 0.7897947436
246/480 0.775151 0.7116394096 0.8114573806
536/1080 0.507649 0.6812462942 0.7744425775
2790/5400 0.150697 19.9289236017 22.8859035017
1274/2400 0.030544 632.4079688318 778.2800550026
1186/2400 2.725926 0.0989179457 0.1101739372
1251/2496 2.975454 0.0758643891 0.0850774369
1105/2304 0.642528 2.3349861796 2.7495206435
59/97 1.276081 2.0372139386 1.213513092
58/99 1.143731 1.5409427566 1.1098746121
1242/2400 0.575279 1.6617363053 1.9321529206
1153/2400 0.68717 1.9966281997 2.3534904027
109/174 2.971811 7.7579758605 1.8464485124
30/66 0.975364 0.8366589294 0.9729301061
32/53 1.051658 1.3401733263 1.047728444
65/108 1.290772 2.058940696 1.2382093817
257/479 1.43638 1.0309653434 1.0364311672
300/557 1.895535 1.0485556848 1.0396072877
79/176 1.087194 0.8438365186 1.0571921847
Overall BF 3672.650221 0.7296527632 0.2600087606
1/Overall 0.000273 1.3705149222 3.8460243

Now the data argues against every precognition hypothesis to varying degrees, save that cheater H5. Interestingly, under H4 we still find that half or more of the studies have Bayes Factors greater than 1, as it only went from 14 to 10 such datasets in the transition from H0 to H(-2). This reversal came mostly by chopping down magnitudes.

But there’s another interesting pattern here. Published p-values follow Benford’s Law: smaller values are more likely to occur than larger ones, according to a power law. For p-values, that means ones reported with few digits outnumber ones reported with many, and among values with the same number of digits, smaller ones are more common than larger ones. That’s most obvious in my dataset, where four- and five-digit p-values are published less often than two- or three-digit ones but still show the same basic power law.

[Image: A comparison of various p-value frequencies, taken from Google Scholar]

I don’t see a need to compensate for this one, as the p-values under H(-1) and H(-2) have a similar distribution (though not an identical one; for instance, H(-1) is symmetric across p = 0.5 and both have a lump around p = 0.5).

[Image: The odds of finding a specific p-value under H(-1) and H(-2). For clarity, only 0.0 < p < 0.5 are shown.]

Having said that, Benford’s Law could be handy for checking if Bem’s results violate it. I’ll leave that as an exercise for the reader. For now, I defer to the Bayes Bunny.

[Image: My virtict on pwecontition: hiwwy unwikewy (image via EarthPorn)]

For later, though, is a different story.

[1] Galak, Jeff, et al. “Correcting the past: Failures to replicate psi.” Journal of Personality and Social Psychology 103.6 (2012): 933.

[2] Masicampo, E.J. and Lalande, D. (2012). A peculiar prevalence of p values just below .05. The Quarterly Journal of Experimental Psychology. http://www.tandfonline.com/doi/abs/10.1080/17470218.2012.711335.

[3] Dwan, Kerry, et al. “Systematic review of the empirical evidence of study publication bias and outcome reporting bias.” PLoS ONE 3.8 (2008): e3081.