Back in the first instalment, I showed a number of methods to uniformly sample between 0 and 1. Sometimes, though, you want to a non-uniform sample. What then?

Let’s answer that by walking through a practical example. In my series on Bayesian hypothesis testing, at one point I went looking for a distribution that was zero outside a range and maximized at the precise halfway point of that range. There, I settled on a pyramid function…

What I call a pyramid function, but is better known as a triangular, hat, or tent function.… but that isn’t the only function which satisfies the above. This one has a sharp peak at the median, which can be a pain in some circumstances, so let’s look for another one with a smooth peak. My first guess would be a simple polynomial function, with zeroes at 0 and 1.

The derivation of a quadratic function that's zero at 0 and 1, yet smooth at 0.5That’s a quadratic function, with a derivative or slope that smoothly changes at all points and thus no sharp points. Nailed it in one!

It’s usually more convenient to deal with continuous distribution functions, or CDFs, which are simply the sum of all values that are more extreme than the current one. In the magical world of calculus, this is also known as an indefinite integral.

Integrating the proposed distribution function from 0 to 1Hmmm, it would be more convenient if the integral of our distribution function was equal to 1 over the range 0 to 1. That’s easy to fix.

The finalized quadratic distnribution function, df(x) = 6x(1-x), with cdf(x) = 3x^2 - 2x^3A graph of the distribution function 6x(1-x), including CDF.There, now we have a nice distribution to sample from. Let’s start walking through how we’d go about that.

Imagine taking the above diagram to an archery range and pasting it over a target. An arrow shot randomly within that area will either land in that grey part, or it won’t. The odds of it landing in the gray above any point is exactly the same as that point showing up in the distribution. We’re effectively warping our uniform sample into a non-uniform one! So, if we let that arrow fly…

x := random.Float64()
if 4.0*x*(1.0-x) < random.Float64() { continue } // rescale the DF to save an op
// do something with the sample

This is known as rejection sampling, for obvious reasons. The flaws are obvious too: you need to grab two random numbers, not just one, and if you reject your sample you’ve just wasted some computation time. This method can be fairly wasteful, but if you’re clever it can be moulded into something quite zippy, or used as a first pass for a more expensive sampling technique.

Like, say, importance sampling. Rather than reject samples, we “compress” them down by weighting it by how important it is; here, “importance” is a synonym for “frequency.” A point that’s half as likely to appear as another gets half the importance weight, and so on down the line.

x := random.Float64()
w := x*(1.0-x) // importance is relative, so we can save some ops by removing all scalars
// do something to get a result
//  say a rolling weighted average:
sum_result = (sum_result*sum_w + result*w) / (sum_w + w)
sum_w += w

Surprisingly, this has most of the flaws of rejection sampling. A sample with a small weight contributes little to the finished product, so you’re still wasting computation time. Potentially more, in fact, if it’s quicker to reject a sample than carry out the full computation for that sample. Still, if you’re clever and team it up with rejection sampling, you can get something very cool.

But there’s a third way, which manages to make every sample count to the maximal degree yet doesn’t toss any away. Remember that cumulative distribution function I calculated, then ignored?

The CDF, 3x^2 - 2x^3, with the density marked off both by grid width and colour changes.The slope of that graph is determined by how dense the distribution is; lots of values means a steeper slope, and it always increases or stays flat. Flip your head sideways so the graph’s on its side, though. Areas that were common in the distribution now have a shallow slope, and lead to dense regions of results, while the opposite is true for rare areas. By inversing the CDF, we’ve created a function which warps an area of uniform density into a non-uniform one. Fire a uniform sampler through this function, and it becomes a non-uniform sampler! What could be simpler?

Everything else. In practice, finding this inverse is usually difficult if not outright impossible. With this distribution, the inverse function has up to three valid answers for any given input.

A graph of the inverse CDF for 6x(1-x). Note that some points have three possible outcomes.Notice, though, that in the region we care about there’s only one value for any given input. Even if a general-purpose inversion is impossible, that region must be invertible.

If you don’t need exact precision, you can get excellent results from curve fitting. I went snooping around in gnuplot, and fished out these approximations:

A graph of three approximations to the inverse CDF of 6x(1-x).Depending on the precision or number of samples needed, these might be adequate. We can even chart and calculate how rapidly they diverge from the original distribution, making it much easier to pick.

Testing each approximation to see how many points it takes to noticably diverge from the actual distribution. What "noticably" means is left up to the reader.A proper inverse function should have a cube root involved, though, since the original involved a cube. No curve fitting adventures managed to produce such a function, which is strange. My first tactic against tough problems is useless here, so let’s try my second: ask Wolfram Alpha.

Wolfram Alpha's proposed inverse CDF for 6x(1-x). It's a bit complex.Huzzah! That mess under the cube root explains why I couldn’t find a fit. It’s nice to have something soooo…. easy…

No, hang on here. See that square root? It punts out an imaginary number when x is between 0 and 1. We need complex numbers to graph this equation, but there’s also a cube root there; all complex numbers have three numbers which, when cubed, will result in that numb- oh, there’s a second cube root there, so this equation has nine possible outputs. No wait, make that thirty-six: each complex number has two square roots, and there’s two square roots lurking in the equation.

Somewhere in one of those thirty-six possibilities is the equation we want. Actually, I shouldn’t say that; there’s an inflection point between 0 and 1, so we might have to combine two solutions together to get what we want.

Thank goodness I have programming skillz. Black lines are real numbers, orange is imaginary.

All 36 versions of the exact inverse of y = 3x^2 + 2x^3 … Yep, there it is, the variant with both tertiary cube roots and both positive square roots. And it’s all in one piece, too.

z0       := complex( x, 0.0 )                                                    // Go supports complex numbers natively
z1       := cmplx.Pow( 2.0*cmplx.Sqrt( z0*z0 - z0 ) - 2.0*z0 + 1, 1.0/3.0 )      // grab the primary root ...
z2       := complex( math.Sqrt(3)*.5*imag(z1) - 0.5*real(z1), -math.Sqrt(3)*.5*real(z1) - 0.5*imag(z1) )
z3       := cmplx.Conj(z2)                                                       // ^ rotate to 3rd root
x_nonuni :=  real( 0.5*(z2 + ( z3/(z2*z3) ) + 1.0) )                             // combine
// do something with x_nonuni

Ouch. That’s an orgy of math, much tougher to calculate than other two techniques. We pay a heavy price for precision, even for simple distributions.

By this point, though, you’re probably dying to see a bake-off between these techniques. Let’s plug each back into my precognition code as H(-3) and tack on a p-value check to make it H(-4), but now we’ll gather timing data as well as results. Run the code, and we get…

Overall Samples: 1200000
Sampler Bayes Factor Total Sec. ns / sample Relative time
rej. 3142.318416 0 0 NaN
import. 3210.506234 0 0 NaN
approx1 3164.481266 0 0 NaN
approx2 3197.465815 0 0 NaN
approx3 3213.629163 0 0 NaN
inv.CDF 3190.743567 0 0 NaN
rand 806.460688 0 0 NaN
null 1 0 0 NaN

Whoops, I forgot the Go Playground doesn’t provide timing information. I’ll have to run the code locally; while I’m at it, I’ll bump up the sample count and grab the median timing from multiple runs.

sampler bayes factor ns / sample minus overh relative
rej. 3203.767221 621.956158 175.0551335 109.5%
import. 3204.504777 606.704943 159.8039185 100.0%
approx1 3168.401728 1018.7148755 571.813851 357.8%
approx2 3201.219314 1403.349404 956.4483795 598.5%
approx3 3205.578364 1768.0189825 1321.117958 826.7%
inv.CDF 3203.0537 1404.932393 958.0313685 599.5%
rand 794.923891 671.390087 224.4890625 140.5%
null 1 446.9010245

The “null” sampler doesn’t do any calculation at all, so it can be used to subtract out all the overhead unrelated to sampling. The “rand” one is almost identical, save that it calls the random number generator instead so we can calculate that part of the overhead.

Surprisingly, what looked like a sluggish exact solution managed to be faster than one approximation and tie with another; most likely, the use of only one power function and matrix multiplication instead of polar coordinates kept it competitive. There’s still room for improvement, as a custom cube root routine and more work unrolling the complex number math offer a decent chance at a speed-up, but it’s highly unlikely that’ll make it six times faster. Both the rejection and importance samplers seem to perform equally speedy, but

Testing out the various non-uniform samplers; importance does best, rejection varies quite a bit, and the inverse CDFs do much as you'd expect.when you chart their progress over time, it becomes clear that rejection is the least well-behaved of the samplers. In contrast, importance had the same runtime but managed to look more like a horizontal line than any other. It’s tough to tell exactly what’s happening near the end, though, so let’s zoom in.

Zooming in on the non-uniform sampler results. There's a surprising amount of spread between the samplers at the end, especially when compared to the same samplers fed by a PRNG.Hmm, that’s not good. The poor performance of the first approximation isn’t unexpected, and it was obvious the rejection sampler would be weaving around from the larger picture… but the importance and exact inverse CDF samplers aren’t converging to a single value. 10 billion samples should have been enough to create a tight consensus, and I’d know: see those three horizontal lines? They’re where the rejection, importance, and exact inverse CDF wound up after warping 10 billion samples from a uniform random sequence, as opposed to the additive LDS in the code shared above.

It’s obvious a poor choice of uniform sampler can bias the non-uniform sample. In this case, I should have upgraded my low discrepancy sampler to something more sophisticated, like a paired-random or Sobol sequence. Down-shifting the LDS isn’t a good option.

H(-3) for ex. 2 importance exact rejection
random 4.478571 4.475946 4.474478
additive LDS 4.472374 4.472373 4.471968
vdC 4.472374 4.472374 3.005217
linear 4.472373 4.472374 3.425201

The van der Corput and linear uniform samplers don’t work with rejection sampling, at least when the sequence values correlate with the odds of rejection. You can fix this by adding another dimension, sampling within each sample, but that just compounds the misery of being unable to quit whenever you like. Even invoking a random number generator isn’t without issues; while modern PRNGs based on the Mersenne Twister or linear feedback won’t give you trouble, other generators might.

On the flip side, while I was fairly hard on the additive LDS, bear in mind it still landed within 0.1% of the answer, using 1/100th to 1/10000th of the samples a uniform random number generator would have needed. This is more of a call to double-check your work and be honest about the precision you need than reject (ha!) a technique.

But by this point I think you get the gist of non-uniform sampling. I haven’t decided where to go from here, so consider any future entries to be random walks from here.