[*HJH 2015-08-30*: While working on the next instalment I realized I’d confused my classifications back here. That’s been fixed.]

Congratulations! You’ve just been handed a task that’s way too big to take on analytically. Your only hope is numeric sampling. How to do it, though?

Let’s start off with the simple case, where you need to sample values uniformly between *A* and *B*, and simplify it further by just sampling between 0 and 1. Mapping that back to [*A*,*B*] is as simple as *y* = *A* + (*B-A*)*x* .

Anyway. The obvious approach is to march one-by-one over the possibilities.

```c
for (float s = 0.0; s <= 1.0; s += step) { // step is whatever spacing you chose
    // do what you need to with s ...
}
```

This is super-easy to implement, and you know the error bars quite well. You have no way to interrupt the process and still get useful data, though, because until the very end a fraction of the domain will always remain untouched.

Another approach is to spam pseudo-random numbers.

Most stock random number generators are of high enough quality to allow something like

```c
while (continueSampling) {
    float s = rand() / (float)RAND_MAX; // scale rand()'s integer output to [0,1]
    // do what you need to with s ...
}
```

Now you can stop the process whenever you like and always get a usable result. Random numbers do clump together a bit, however, so your error isn’t proportional to the inverse of the number of samples you’ve taken, but to the inverse of its square root. Cranking up the number of samples to compensate just leads to diminishing returns.

But there’s a third way. If we wanted to halve the error of a linear sequence we’d already finished, we could just sample the midpoints between every point we sampled last time.

But what’s stopping us from doing that recursively, starting with the midpoint of the full interval?

```c
float divisor = 0.5;
while (continueSampling) {
    float s = divisor;
    while (s < 1.0) {
        // do what you need to with s ...
        s += 2 * divisor;
    }
    divisor *= 0.5;
}
```

On paper this “van der Corput” sequence has the advantages of both a linear and a random scan: you get predictable errors and excellent convergence, while gaining the ability to stop the process. But those stop times are severely restricted, since, like a linear scan, you can’t use a partial pass. You can dodge around this by keeping each pass in temporary storage, to be discarded on early exit, but then you’ll be tossing a quarter of your calculations on average (as the last pass takes up half your runtime).

[*HJH 2015-08-30*] van der Corput is just one of a family of low-discrepancy sequences, which sometimes go by the name “quasi-random numbers.” As that implies, most of these sequences look random at a glance, but when charted it becomes obvious they aren’t.

You can stop sampling at any time, yet still retain the rapid convergence of linear sampling. One of the easiest such sequences is nothing more than an add, a branch, and a subtract.

```c
float s = sqrt( 1.25 ) - 0.5; // the golden ratio conjugate, ~0.618
while (continueSampling) {
    // do what you need to with s ...
    s += sqrt( 1.25 ) - 0.5;
    if (s > 1.0) s -= 1.0; // both the original and the offset are less than one, so their sum is less than two
}
```

You’ll kick yourself for not thinking of this back in High School. Remember the Pythagoreans, and their discomfort at proving the square root of two is irrational? Every number inside a computer is a fraction, so computers can never exactly represent an irrational number. This means no two fractional parts within the sequence of sums will ever repeat, because if they did the original value *could* be represented as a fraction.

In theory, of course; since computers can’t accurately represent irrational numbers, in practice we’re adding fractions to fractions. You can try to be sneaky and calculate the square root on-the-fly, but you’ll chew up extra CPU cycles and still run into precision issues eventually. It’s much more efficient to pick a good pseudo-irrational number and just sum with it.

This is just the barest tip of the discrepancy iceberg, though. Wikipedia lists several other variants, each with their own pluses and minuses. I’ll visit some of them later on, but for now it’s time for a practical example.

Let’s use each of the above techniques to estimate an integral, one we already know the exact answer for.

Ah, that’ll do. Let’s whip up a program to do the grunt work, and review the results.

(in log_10 |calculated − actual|)

| samples | linear | random | v.d.C. | additive low disc. |
|--------:|-------:|-------:|-------:|-------------------:|
| 1 | 0 | -0.43 | -0.43 | -0.47 |
| 3 | 0 | -1.09 | -0.81 | -0.85 |
| 7 | 0 | -1.27 | -1.15 | -0.9 |
| 15 | 0 | -1.6 | -1.48 | -1.25 |
| 31 | 0 | -1.58 | -1.79 | -1.48 |
| 63 | 0 | -2.22 | -2.1 | -2.12 |
| 127 | 0 | -2.86 | -2.4 | -2.53 |
| 255 | 0 | -1.53 | -2.71 | -2.5 |
| 511 | 0 | -1.96 | -3.01 | -2.98 |
| 1023 | 0 | -1.9 | -3.31 | -3.83 |
| 2047 | 0 | -2.99 | -3.61 | -3.67 |
| 4095 | 0 | -2.48 | -3.91 | -4.41 |
| 8191 | 0 | -3.47 | -4.21 | -4.55 |
| 16383 | 0 | -2.84 | -4.52 | -4.43 |
| 32767 | 0 | -3.09 | -4.82 | -5.28 |
| 65535 | -0.01 | -3.46 | -5.12 | -5.05 |
| 131071 | -0.01 | -2.84 | -5.42 | -5.18 |
| 262143 | -0.03 | -3.09 | -5.72 | -5.95 |
| 524287 | -0.06 | -3.61 | -6.02 | -6.71 |
| 1048575 | -0.12 | -4.19 | -6.32 | -6.18 |
| 2097151 | -0.25 | -4.74 | -6.62 | -6.09 |
| 4194303 | -0.6 | -3.99 | -6.92 | -6.57 |
| 8388607 | -6.75 | -3.69 | -7.22 | -7.77 |

Hmm, some explanation is required here. Since we already know what the final result should be, the numbers in this table show how close each technique comes to that value. Rather than list those differences straight-up, I’ve presented their base-ten logarithms; this way, the whole-number portion matches how many digits the method got right. For instance, the final low-discrepancy result is actually 0.000000017150 from the true answer. The table also favors van der Corput by printing values exactly when it completes a layer; stopping at any other point would give it worse results.

Some of you might be scratching your head over the linear method. Why does it do a fair bit worse than the binary van der Corput sequence, even though one is almost exactly a rearranged version of the other? In a word: precision. Because this linear scan starts off adding incredibly small numbers to other incredibly small numbers, a lot of rounding takes place, and that seemingly minor loss of precision adds up. If you instead reverse the linear scan so it runs from 1 down to 0, the two methods give nearly identical results.

A little knowledge goes a long way. But it also reveals there’s more than just uniform sampling…
