While out and about, you run into an old friend of yours. To your surprise, they have a young boy in tow. This friend quickly tells you all about the child, most notably that they were born on a Tuesday, but then hurries off to pick up their other child. You wave goodbye, then realize: you’d no idea they had one kid, let alone two, yet your friend only talked about the tyke next to them. What can we say about this other child?
An obvious starting point is their gender. Not because it’s important, but because it’s easy: we only assign babies to one of two genders, male or female. That suggests the following diagram.
There are only two possibilities, therefore you’re 50% certain the other child was assigned male, and 50% certain they were assigned female.
We can double-check this with Bayes’ Theorem. Let’s represent “the next child is male” with B, and the odds of B with p(B). The odds of two male children coming up will be p(BB), and the odds of the second child being male, assuming the first was, as p(BB | B1). That translates to
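One way to write this out, as a sketch that assumes births are independent with p(B) = 1/2 and reads B1 as “the first child is male” (the subscripted symbol and the numeric values are assumptions, not taken from the original):

$$
p(BB \mid B_1) = \frac{p(B_1 \mid BB)\, p(BB)}{p(B_1)} = \frac{1 \cdot \tfrac{1}{4}}{\tfrac{1}{2}} = \frac{1}{2}
$$

which agrees with the 50% figure above.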
The Bayesian version is more flexible than the table one; some wags point out that, worldwide, there are 107 boys for every 100 girls. That’s tough to accommodate in tabular form, but trivial via Bayes’ Theorem.
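To illustrate how little work the skewed ratio takes, here is a minimal Python sketch. It assumes births are independent and turns “107 boys for every 100 girls” into p(B) = 107/207; the function name is made up for this example.

```python
from fractions import Fraction

def p_second_boy_given_first_boy(p_boy):
    """Bayes' Theorem: p(BB | B1) = p(B1 | BB) * p(BB) / p(B1)."""
    p_bb = p_boy * p_boy      # both children are boys (assumes independence)
    p_b1_given_bb = 1         # if both are boys, the first certainly is
    p_b1 = p_boy              # the first child is a boy
    return p_b1_given_bb * p_bb / p_b1

parity = p_second_boy_given_first_boy(Fraction(1, 2))
skewed = p_second_boy_given_first_boy(Fraction(107, 207))

print(parity)  # 1/2
print(skewed)  # 107/207
```

Under these assumptions the answer is just p(B) itself, so the skew moves the estimate from 50% to a little under 52%.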
|1st Child|2nd Child|
|---|---|
|Boy|Boy|
|Boy|Girl|
|Girl|Boy|
|Girl|Girl|
There are four states in total: one is impossible because it has two girls, two have a girl as the other child (order doesn’t matter here), and one has a boy as the other child. The odds of the other kid being a girl are actually two-thirds, not half, and thus the odds of them being a boy are a third.
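The counting argument can be checked by brute force. This is a small sketch that assumes perfect gender parity, so that all four orderings are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All equally likely (1st child, 2nd child) pairs, assuming gender parity.
families = list(product("BG", repeat=2))

# Keep only the families consistent with having met a boy.
with_boy = [f for f in families if "B" in f]

# Among those, the "other" child is a girl whenever the family contains a girl.
other_is_girl = [f for f in with_boy if "G" in f]

p_girl = Fraction(len(other_is_girl), len(with_boy))
print(p_girl, 1 - p_girl)  # 2/3 1/3
```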
The Bayesian account shifts slightly. The likelihood of encountering a boy given a two-child pairing is 3/4, according to the above table, but so too is the likelihood of encountering a girl, thus the two hypotheses wind up with equal probability. If we let E stand for “encountering a child of an old friend,” and 2K stand for “that friend having two kids,”
So which answer is right? Let’s pretend your friend had an infinite number of kids, and always brings the first when wandering around. If we assume perfect gender parity, then of all the infinite arrangements half will have a boy as the first kid, while half will have a girl. We can then ignore that first kid and consider the rest of the possibilities. But since an infinite list doesn’t get shorter when you subtract an item from it, we’re left with the exact same collection of infinite arrangements. Every single arrangement after we subtract that kid can be mapped to one before, with no extras or gaps. We can then reapply the same logic of the first kid to the second: if the first kid had a fifty-fifty chance of being assigned female, so too must the second.
But we’re not dealing with an infinite number of kids here. The question states there are only two kids, hence “other” instead of “another.” Selecting one without replacing them does have an impact on the rest of the sample, especially one this small, and to ignore that would be to ignore some information shared by the question. The one-third-and-two-thirds answer seems to be the more sensible of the two; indeed, when Marilyn vos Savant empirically evaluated this question via poll, she got that answer as opposed to a fifty-fifty split, and some very pedantic researchers looked at actual demographic data to arrive at the same conclusion.
No, that’s not quite right. Neither vos Savant nor the researchers considered this exact question, because this version mentions the boy was born on a Tuesday. It seems like a trivial detail, but then again I nitpicked the difference between “other” and “another.” Let’s redo the analysis, this time incorporating the day of the week each child was born on.
Right, so 169 of 196 possibilities drop out because they involve a pair of girls or no boy born on a Tuesday. That leaves 27, of which 14 have a girl as the other child, so that puts the odds of the other child being female at 14/27.
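The same count can be reproduced by enumeration. This sketch assumes each of the 14 sex-and-weekday combinations is equally likely for each child, independently; which index stands for Tuesday is an arbitrary label.

```python
from fractions import Fraction
from itertools import product

DAYS = range(7)   # seven weekdays, indexed 0..6
TUESDAY = 1       # arbitrary label; any fixed index gives the same counts

# 14 equally likely (sex, weekday) combinations per child, 196 per family.
children = list(product("BG", DAYS))
families = list(product(children, repeat=2))

# Keep families with at least one boy born on a Tuesday.
kept = [f for f in families if ("B", TUESDAY) in f]

# Of those, count the families where the other child is a girl.
with_girl = [f for f in kept if any(sex == "G" for sex, _ in f)]

print(len(families), len(kept), len(with_girl))  # 196 27 14
print(Fraction(len(with_girl), len(kept)))       # 14/27
```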
14/27?! What a bizarre number, yet if you count squares above you’ll see it’s true. The Bayesian approach comes to the same conclusion.
This result screams at common sense. How can such a trivial detail have so great an effect? Yet the explanation is exactly the same as for the two-child version: samples of a finite pool without replacement can do strange things. By bumping up the size of the pool, from four to 196, we make the scenario behave more like the infinite case. So rejecting 13/27 as the odds of a boy means rejecting 1/3 as well.
But accepting it leads to all sorts of wild scenarios. What if we changed the qualifier from “born on a Tuesday” to, say, being born in December? Or being born with red hair, or being born on a leap day?
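To see how the qualifier moves the answer, here is a hedged sketch that treats the qualifier as a trait with probability q, independent of sex and of the sibling. The values of q used here (1/12 for December, 1/1461 for a leap day) are rough assumptions for illustration, and the function name is hypothetical.

```python
from fractions import Fraction

def p_other_is_boy(q):
    """P(both boys | at least one boy with a trait of probability q)."""
    q = Fraction(q)
    # Two boys, at least one of whom has the trait.
    p_two_boys_with_trait = Fraction(1, 4) * (1 - (1 - q) ** 2)
    # At least one child is a boy with the trait (probability q/2 per child).
    p_some_boy_with_trait = 1 - (1 - q / 2) ** 2
    return p_two_boys_with_trait / p_some_boy_with_trait

qualifiers = [
    ("no qualifier", 1),
    ("Tuesday", Fraction(1, 7)),
    ("December", Fraction(1, 12)),
    ("leap day", Fraction(1, 1461)),
]
for label, q in qualifiers:
    p_boy = p_other_is_boy(q)
    print(label, p_boy, 1 - p_boy)
```

With these assumptions the odds of a boy run from 1/3 with no qualifier, through 13/27 for Tuesday and 23/47 for December, toward 1/2 as the trait gets rarer.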
It looks like we can rig the results to whatever we want. And what’s worse, it isn’t because we’re deploying sketchy statistical methods or screwing up our math; all we’re doing is changing what we sample in the possibility space, by changing what we consider important to the answer.
Normally this would be the place where I turn around and cut to a simple solution. But this is a “problem” and not a “paradox:” it’s really just a reminder to carefully consider “what do we care about?” If a variable isn’t important to your conclusions, don’t factor it in. If it is, justify why, and do some quick testing to see how much it wiggles your conclusion around. Be very mindful of how sampling can change your conclusion, because even trivial-looking decisions can do so.
I know this example sounds like a rare corner case, but there are variations that recur in the scientific literature.
How different investigators might conceive the planning and execution of a study can also lead to p values with widely varying magnitudes. As an example of this, let us examine Fisher’s (1935, ch. 2) classic experiment of the ‘lady tasting tea,’ as described by Lindley (1993). The lady in question claimed she could distinguish between whether milk or tea had been poured first into a cup (of tea). In the experiment, the lady is presented with six pairs of cups of tea, and she must determine whether milk or tea entered the cup first. The null hypothesis (that she cannot, in fact, discriminate) is that she would guess 50% right (R) and 50% wrong (W). Suppose that she gets the first five results right and the last one wrong, or RRRRRW. The p value for this outcome, Lindley notes, is 7(1⁄2)^6, or .110, which is not statistically significant at the .05 level. This p value, like all of them, consists of two parts. In this case: 6(1⁄2)^6 = .094 (probability of observed outcomes) + (1⁄2)^6 = .016 (probability of more extreme outcomes). …
Suppose instead of the above design, another researcher decides to repeat the experiment until the lady makes her first mistake. In this case, and with the same RRRRRW data, the p value is now statistically significant at the .032 level [(1⁄2)^6 + (1⁄2)^6 = .016 + .016 = .032]. The two parts of this p value are explained as follows: (1⁄2)^6 = .016 (probability of observed outcomes), but without this expression being multiplied by 6 because the mistaken choice, W, must always come at the end (see, e.g., Goodman, 1999), + (1⁄2)^6 = .016 (probability of more extreme outcomes).
Of course, these experimental results make no sense. The exact same data, obtained in the exact same sequence, should yield the exact same p values. But they do not. And all because two different investigators held alternate conceptions as to how the experiment should be run.
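The arithmetic in the quoted passage can be reproduced with exact fractions. This is a minimal sketch of the two designs as described there, under the null hypothesis of a 50% chance of a right answer on each trial; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

half = Fraction(1, 2)

# Design 1: six trials fixed in advance (binomial sampling).
# p value = P(exactly 5 right) + P(all 6 right) under the null.
n = 6
p_fixed_n = sum(comb(n, k) * half**n for k in (5, 6))

# Design 2: keep testing until the first mistake.
# Observed outcome: five rights, then a wrong on trial 6.
p_observed = half**5 * half
# More extreme: the first mistake comes later, i.e. six rights in a row.
p_more_extreme = half**6
p_stop_at_first_mistake = p_observed + p_more_extreme

print(p_fixed_n, float(p_fixed_n))                              # 7/64 0.109375
print(p_stop_at_first_mistake, float(p_stop_at_first_mistake))  # 1/32 0.03125
```

The exact value for the first design is 7/64, or about .109; the quoted .110 is the sum of the two rounded parts.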
There’s a huge debate in the literature about stopping rules, as it’s been known for some time that they can change study outcomes. Switching to the Bayesian interpretation is an easy fix for that one, but it should be obvious from the above that sampling issues will persist no matter which probability framework you choose.
Account for them as best you can.
Carlton, Matthew A., and William D. Stansfield. “Making babies by the flip of a coin?” The American Statistician 59.2 (2005).
Hubbard, Raymond, and R. Murray Lindsay. “Why P values are not a useful measure of evidence in statistical significance testing.” Theory & Psychology 18.1 (2008): 69-88.