If you’ve read the entire series, you probably think that invoking Bayesian statistics is a huge pain. In general that is true, since you’re typically dealing with non-parametric distributions which can only be tackled with a lot of numeric integration.

There are specific instances where that’s not the case, however. If your prior and likelihood function both follow the normal distribution, the posterior will also follow the normal distribution. Better still, you can easily calculate its exact parameters! Suppose all you care about is the mean. If the prior has a mean of μ0 and a standard error of SE0, and you make NE observations that have a mean of μE and a standard deviation of σE, then your posterior will be described by:
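
The equation was an image in the original post. Assuming it is the standard normal-normal conjugate update for a mean with known variance, it would read as follows; μ1 and SE1 are my own labels for the posterior mean and standard error.

```latex
\mu_1 = \frac{\dfrac{\mu_0}{SE_0^2} + \dfrac{N_E\,\mu_E}{\sigma_E^2}}
             {\dfrac{1}{SE_0^2} + \dfrac{N_E}{\sigma_E^2}},
\qquad
SE_1^2 = \left( \frac{1}{SE_0^2} + \frac{N_E}{\sigma_E^2} \right)^{-1}
```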

[Figure: The hyperposterior distribution for a Gaussian conjugate hyperprior.]

Distributions that keep the same form during Bayesian updating are known as conjugate distributions, and are analytic gravy. The normal distribution is especially handy as measurement errors generally follow it, so progressively refining your uncertainty is as easy as repeatedly chaining together your means, standard deviations, and pooled measurement counts. Things are even easier if you know the standard error of each observation, as you can swap that in for the standard deviation and set NE to 1.
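
Here is a minimal Python sketch of that chaining, assuming the standard known-variance Gaussian update. The function name `update_mean` and all the numbers are made up for illustration; none of them come from the spreadsheets.

```python
from math import sqrt

def update_mean(mu0, se0, mu_e, sigma_e, n_e=1):
    """Return (posterior mean, posterior standard error) for a Gaussian prior on the mean.

    Assumes the standard normal-normal conjugate update with known variance.
    Pass a standard error as sigma_e and leave n_e at 1 if that is what you have.
    """
    prior_precision = 1.0 / se0**2       # how tightly the prior pins down the mean
    data_precision = n_e / sigma_e**2    # the same, for the new observations
    total_precision = prior_precision + data_precision
    mu1 = (prior_precision * mu0 + data_precision * mu_e) / total_precision
    se1 = sqrt(1.0 / total_precision)
    return mu1, se1

# Illustrative numbers only: a vague prior, then two batches of observations.
mu, se = update_mean(0.0, 10.0, mu_e=5.2, sigma_e=2.0, n_e=16)
# The posterior from the first batch becomes the prior for the second.
mu, se = update_mean(mu, se, mu_e=4.9, sigma_e=2.0, n_e=25)
```

Each call returns a new mean and standard error, which you feed straight back in as the next prior.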

Why the difference? The standard deviation of your pool of observations describes the observations themselves, while the standard error describes the mean. It measures how uncertain we are about where the true population mean lies, based on your observations, hence the word “error.” The standard error isn’t a parameter of the data, it’s a parameter of another parameter, and so it is given the prefix “hyper” to mark it as special. That also means the prior distribution is actually a hyperprior, since it’s a probability distribution of our confidence in various means, and we’re left with a hyperposterior after our calculations.

It’s pretty confusing, I know, especially since the normal distribution has a convenient property relating the standard error of the mean to the standard deviation.
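
Written out, that property also explains why a standard error with NE set to 1 can stand in for a standard deviation with NE observations. SE_E is my own label for the standard error of the new observations, and the implication assumes the observations enter the update only through the ratio N_E/σ_E², as they do in the standard known-variance Gaussian update.

```latex
SE_E = \frac{\sigma_E}{\sqrt{N_E}}
\quad\Longrightarrow\quad
\frac{N_E}{\sigma_E^2} = \frac{1}{SE_E^2}
```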

[Figure: The standard error of a Gaussian distribution's mean is equal to the standard deviation of the observed values divided by the square root of the number of observations.]

One way to cut down on the confusion is to pay close attention to the wording. If you see “error,” “confidence,” or “credible” mentioned, you’re dealing with a hyperparameter. Remember too that the only parameter we care about above, and in the following examples, is the mean. The standard deviation of future measurements isn’t being predicted here, otherwise we’d need a confidence value for that too and we’d be tracking four parameters instead of two.

The examples should clear up the confusion. The gravitational constant from Newtonian mechanics is surprisingly difficult to measure. A number of authors have attempted it, and we can consolidate their work via conjugate distributions.[1] Since G is a constant, it should have no variance and so we’ll only track the mean. Each measurement has a standard error attached, so our work is pretty easy.

[Figure: Updating our confidence over where Newton's G lies via conjugate distributions.]

Look at that! No programming was involved; in fact, the math was trivial enough to do in a spreadsheet. Up for another round? Let’s hit the USA presidential polls![2]

Hold on a second, though, no-one lists a standard error for their numbers. Not to worry, they do have something called the “margin of error” (hey, “error!”) which is almost always a 95% confidence interval (“confidence!!”). Since that interval spans +/- 1.96 standard errors, we have enough info to run the stats. Again, we don’t care about the standard deviation here, as there’d be no variance in the level of support if we were able to question everyone in the USA.
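
The conversion is a one-liner. A small sketch, assuming the margin of error really is a 95% interval; the name `moe_to_se` and the 3-point poll are hypothetical, not taken from the RealClearPolitics data.

```python
Z_95 = 1.96  # half-width of a 95% confidence interval, in standard errors

def moe_to_se(margin_of_error, z=Z_95):
    """Convert a poll's margin of error into a standard error."""
    return margin_of_error / z

# Hypothetical poll reporting a +/- 3.0 point margin of error.
se_points = moe_to_se(3.0)
```

If a pollster quotes a different confidence level, pass the matching z value instead of 1.96.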

[Figure: Using conjugate priors on the poll results for the USA's 2016 election.]

Let’s ratchet up the difficulty level another notch. The inspiration for this addition was an excellent guest post at Sometimes I’m Wrong.

[Figure: A chart of effect sizes from Tuk, Zhang, and Sweldens (2015).[3]]

This chart summarizes all the results from the subject of the post, a study which did multiple experiments around the premise that self-control in one domain will translate into others.[3] It’s all well and good, but the main metric of significance is the p-value. Ugh. What can a Bayesian approach tell us?

The standard error is already in place, but what’s this “Std. diff in means (d)?” That’s a measure of effect size, specifically Cohen’s d. Never heard? It’s all the rage in meta-analysis, and fairly easy to understand. Basically, you subtract the mean of the control group from the mean of the test group, and divide by the pooled standard deviation from both.

[Figure: A visual guide to Cohen's d. In brief, it's the test mean minus the control mean, divided by the pooled standard deviation.]
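
For the curious, here is a rough sketch of that calculation. It assumes the common sample-size-weighted pooled variance (other variants exist), and both the function name and the scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(test, control):
    """Cohen's d: difference in group means over the pooled standard deviation."""
    n_t, n_c = len(test), len(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_t - 1) * variance(test) + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    return (mean(test) - mean(control)) / sqrt(pooled_var)

# Invented scores for a test group and a control group.
test_scores = [5.1, 6.3, 5.8, 6.0, 5.5]
control_scores = [4.9, 5.2, 5.6, 5.0, 5.3]
d = cohens_d(test_scores, control_scores)
```

Because the difference is divided by the pooled standard deviation, d is measured in units of "standard deviations of the data," which is where the fixed standard deviation of one comes from.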

At this point, it would be very tempting to leap on the fixed standard deviation of one, and use that in the analysis. That applies to the data, however, not the mean, hence it shouldn’t be a part of the hyperprior. We should just stick with the listed standard errors.

[Figure: Pooled effect sizes from Tuk, Zhang, and Sweldens (2015).[3]]

With a Gaussian distribution in place, the natural next step is to sample from it. The easy way to do this is by calculating a discrete odds ratio between two hypotheses:

H1 : Self-control is real, with effect size d = 0.219099.
H0 : Self-control is a statistical artifact, with d = 0.

As you’ll remember from last time, though, the discrete approach isn’t very realistic. So let’s also do a simple range:

H1 : Self-control is real, with an effect size between d = 0.2 and d = 0.5.
H0 : Self-control is a statistical artifact, with d between 0.1 and -0.1.

As it turns out, this is trivially easy to do in a spreadsheet.

[Figure: The odds ratios for two hypothesis comparisons, a point-to-point one (OR ~= 1849) and a ranged one (OR ~= 36).]

Overall, that’s decent but not ironclad evidence in favor of self-control across a variety of conditions.
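
Outside of a spreadsheet, the two comparisons might be sketched like this in Python. The posterior standard error is not quoted in the text, so `ASSUMED_SE` is a placeholder I picked so the results land in the neighbourhood of the odds ratios above; treat it as an assumption, not a reported value. The function names are my own.

```python
from statistics import NormalDist

def point_odds(posterior, d1, d0):
    """Odds ratio between two exact effect sizes: a ratio of posterior densities."""
    return posterior.pdf(d1) / posterior.pdf(d0)

def range_odds(posterior, range1, range0):
    """Odds ratio between two effect-size ranges: a ratio of posterior probability masses."""
    def mass(low, high):
        return posterior.cdf(high) - posterior.cdf(low)
    return mass(*range1) / mass(*range0)

ASSUMED_SE = 0.0565  # assumption: not given in the post
posterior = NormalDist(mu=0.219099, sigma=ASSUMED_SE)

point_or = point_odds(posterior, 0.219099, 0.0)
ranged_or = range_odds(posterior, (0.2, 0.5), (-0.1, 0.1))
```

The point comparison only needs the height of the curve at two places, while the ranged one needs the area under it, which is why the second uses the cumulative distribution.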

Did you notice the conjugate distribution’s mean was in perfect agreement with the pooled Cohen’s d in the original paper? There’s a good reason for that: the conjugate Gaussian distribution for the mean is a simple weighted average.
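
You can check that claim numerically. The sketch below chains conjugate updates one study at a time, then compares the result to a single inverse-variance weighted average. The study values are hypothetical, and weighting by 1/SE² is the standard fixed-effect choice that I am assuming matches the paper's pooling.

```python
def sequential_update(estimates):
    """Chain conjugate updates over (mean, standard error) pairs; the first pair is the prior."""
    mu, se = estimates[0]
    for mu_e, se_e in estimates[1:]:
        w_prior, w_new = 1.0 / se**2, 1.0 / se_e**2
        mu = (w_prior * mu + w_new * mu_e) / (w_prior + w_new)
        se = (w_prior + w_new) ** -0.5
    return mu, se

def weighted_average(estimates):
    """Inverse-variance weighted mean of all estimates, computed in one pass."""
    weights = [1.0 / se**2 for _, se in estimates]
    mu = sum(w * m for w, (m, _) in zip(weights, estimates)) / sum(weights)
    return mu, sum(weights) ** -0.5

# Hypothetical (effect size, standard error) pairs.
studies = [(0.30, 0.15), (0.10, 0.20), (0.25, 0.10)]
chained = sequential_update(studies)
pooled = weighted_average(studies)
```

Up to floating-point rounding, `chained` and `pooled` agree, and the order of the studies makes no difference.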

[Figure: Deriving the Gaussian conjugate distribution from the weighted average.]

This also means it’s trivial to whip up a home-brew meta-analysis. All the usual caveats about garbage-in apply, of course. No d value listed? Converting to it is pretty easy. This opens up a lot of doors to applying Bayesian statistics in practice, no doubt helped by the above examples in spreadsheet form and the abundant number of conjugate distributions.

I’ll try to have another practical example up shortly.

[1] Pitkänen, Matti. “Variation of Newton’s Constant and of Length of Day.” Prespacetime Journal 6, no. 5 (2015).

[2] “RealClearPolitics – 2016 Election – General Election: Trump vs. Clinton.” Accessed May 28th, 2016.

[3] Tuk, Mirjam A., Kuangjie Zhang, and Steven Sweldens. “The Propagation of Self-Control: Self-Control in One Domain Simultaneously Improves Self-Control in Other Domains.” Journal of Experimental Psychology: General 144, no. 3 (2015): 639–54. doi:10.1037/xge0000065.