Is This Coin Loaded? A Hypothesis Test Walkthrough


Imagine a friend hands you a coin and challenges you to twenty flips. The result: 17 heads, 3 tails. You stare at the coin. You stare at your friend. Is the coin loaded? Are they cheating? Or were you just unlucky? The intuition that "17 out of 20 is too many" is correct — but how do we turn that gut feeling into a defensible answer?

This is exactly the question that a hypothesis test is designed to settle, and it's one of the simplest cases in all of inferential statistics. This article walks through the full reasoning, including the math, so the next time something feels too lopsided to be honest, you'll have the language to say how lopsided.

The intuition first

With a fair coin, you expect about 10 heads in 20 flips. Some random variation around 10 is normal: 11, 9, even 13 or 7 wouldn't surprise anyone. But 17 is far enough from 10 that something feels off. The question a hypothesis test answers is sharper: if the coin really were fair, how often would we see a result this extreme purely by chance? If the answer is "essentially never," we have grounds to suspect the coin.
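One way to build a feel for "how often" is to let a computer flip a fair coin many times. The sketch below is a simulation under stated assumptions, not part of the formal test: the trial count, the seed, and the function name `simulate_extreme_rate` are illustrative choices, and a run counts as "this extreme" if it is at least as lopsided as 17 to 3 in either direction.

```python
import random

def simulate_extreme_rate(trials=100_000, flips=20, threshold=17, seed=12345):
    """Estimate how often a fair coin is at least as lopsided as `threshold` heads."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    low = flips - threshold        # mirror image on the tails side (3 for 17 of 20)
    extreme = 0
    for _ in range(trials):
        # Each of the `flips` random bits is one fair flip; 1-bits count as heads.
        heads = bin(rng.getrandbits(flips)).count("1")
        if heads >= threshold or heads <= low:
            extreme += 1
    return extreme / trials

rate = simulate_extreme_rate()
print(f"Share of fair-coin runs at least as lopsided as 17-3: {rate:.4%}")
```

The estimate wobbles from seed to seed, which is why the rest of the article computes the probability exactly instead of simulating it.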

Setting up two competing hypotheses

Statistical reasoning frames the problem as a contest between two ideas:

  • Null hypothesis (H₀): The coin is fair. P(heads) = 0.5.
  • Alternative hypothesis (H₁): The coin is biased. P(heads) ≠ 0.5.

We assume H₀ is true and ask whether the observed data is plausible under that assumption. If the data is wildly implausible under H₀, we reject H₀ in favour of H₁. This is the central move of every hypothesis test.

The math: what's the probability of 17 heads in 20 flips?

Under a fair coin, the number of heads in 20 flips follows a binomial distribution: P(X = k) = C(20, k) · 0.5²⁰, where C(20, k) is the number of ways to choose k heads from 20 flips. The distribution is symmetric — k heads and 20−k heads are equally likely — so the key values tell the whole story:

Heads           Probability
10 (the peak)   17.62%
9 or 11         16.02% each
8 or 12         12.01% each
6 or 14         3.70% each
4 or 16         0.462% each
3 or 17         0.109% each
0 or 20         0.0001% each

Results pile up around 10 and fall off a cliff toward the edges — by 17 heads you are already deep in the tail.
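The table values can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the helper name `prob_heads` is an illustrative choice, and the formula is the binomial one given above.

```python
from math import comb

FLIPS = 20

def prob_heads(k, n=FLIPS):
    """P(X = k) for a fair coin: C(n, k) * 0.5**n."""
    return comb(n, k) * 0.5 ** n

# Print the upper half of the table; the lower half mirrors it by symmetry.
for k in (10, 11, 12, 14, 16, 17, 20):
    print(f"{k:>2} heads: {prob_heads(k):.4%}")
```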

The chance of getting exactly 17 heads is about 0.109 % — roughly 1 in 920. That alone is striking. But the proper hypothesis test asks a slightly different question.

Why we sum the tail probabilities (the two-tailed p-value)

Suppose the coin were loaded. The bias could go either way — too many heads or too many tails. If we are willing to say "17 heads is suspicious," we should equally accept "3 heads is suspicious." A fair test must therefore count all outcomes at least as extreme as the one we observed, in both directions.

For 20 flips, "at least as extreme as 17 heads" means 17, 18, 19, or 20 heads — and their symmetric counterparts 0, 1, 2, or 3 heads. The two-tailed p-value is the sum of those probabilities:

P = P(X ≤ 3) + P(X ≥ 17) ≈ 0.129% + 0.129% ≈ 0.26%

So if the coin were fair, you would see something this extreme (in either direction) only about once in 390 runs of 20 flips. That is the number you reason with.
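The sum can be checked directly by adding up the two tails. The sketch below mirrors the formula term by term; the constant and variable names are illustrative.

```python
from math import comb

FLIPS = 20
OBSERVED = 17

def pmf(k):
    """P(X = k) for a fair coin flipped FLIPS times."""
    return comb(FLIPS, k) / 2 ** FLIPS

upper_tail = sum(pmf(k) for k in range(OBSERVED, FLIPS + 1))       # P(X >= 17)
lower_tail = sum(pmf(k) for k in range(0, FLIPS - OBSERVED + 1))   # P(X <= 3)
p_value = upper_tail + lower_tail

print(f"upper tail {upper_tail:.3%} + lower tail {lower_tail:.3%} = {p_value:.3%}")
print(f"about 1 in {1 / p_value:.0f}")
```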

Significance levels: how small is "too small"?

Statisticians settled on two conventional thresholds (called significance levels, denoted α):

  • α = 0.05 (5 %): the everyday threshold for "statistically significant."
  • α = 0.01 (1 %): the stricter threshold for "highly significant."

Our p-value of 0.26 % is well below both. Whether you use the 5 % or the 1 % rule, the conclusion is the same: under the fair-coin assumption, the data is too implausible to defend. We reject the null hypothesis and lean toward the alternative — this coin is most likely loaded.

What counts as a "normal" 20-flip result?

It is illuminating to run the test the other direction: how many heads in 20 flips would still be considered consistent with a fair coin at each threshold? The answer is the narrowest range whose combined probability of landing outside it is at or below α — equivalently, every outcome whose two-tailed p-value still exceeds α.

Significance level (α)   "Fair-coin" acceptance range (heads)   Tail probability outside that range
0.05 (5%)                6 to 14 heads                          ≈ 4.14% (reject if outside)
0.01 (1%)                4 to 16 heads                          ≈ 0.26% (reject if outside)

At the 5 % level, anything outside 6–14 heads is "significant." At the 1 % level, you need to land outside 4–16 heads. The stricter the level, the more extreme the result needs to be before you call the coin loaded. 17 heads sits clearly outside both ranges.
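The ranges can be found by starting at 10 heads and widening one step on each side until the probability of landing outside drops to α or below. A minimal sketch of that search, with illustrative function names, assuming a symmetric range around the centre:

```python
from math import comb

def outside_probability(low, high, flips=20):
    """Probability that a fair coin lands outside the range low..high heads."""
    inside = sum(comb(flips, k) for k in range(low, high + 1))
    return 1 - inside / 2 ** flips

def acceptance_range(alpha, flips=20):
    """Narrowest symmetric range whose outside probability is at most alpha."""
    low, high = flips // 2, flips - flips // 2    # start at the centre
    while outside_probability(low, high, flips) > alpha:
        low, high = low - 1, high + 1             # widen by one on each side
    return low, high

for alpha in (0.05, 0.01):
    low, high = acceptance_range(alpha)
    outside = outside_probability(low, high)
    print(f"alpha = {alpha}: {low} to {high} heads, outside probability {outside:.2%}")
```

Because head counts are whole numbers, the outside probability cannot hit α exactly; it jumps from about 1.18% (outside 5 to 15) straight to about 0.26% (outside 4 to 16), which is why the 1% row sits so far below its threshold.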

What this test does — and does not — prove

A few important caveats:

  • Rejecting H₀ does not "prove" the coin is loaded. A p-value of 0.26 % means that if the coin were fair, the observed data would be very unusual. It doesn't tell you how loaded the coin is, or what the true probability of heads actually is.
  • Failing to reject H₀ does not "prove" the coin is fair. A small sample can hide a real bias. Twenty flips is enough to catch a heavy bias; it is nowhere near enough to certify a coin as fair to within 1 %.
  • The significance level is a choice, not a fact. α = 5 % is a convention, not a law of nature. In high-stakes scientific work (drug trials, particle physics) the threshold is often much stricter: particle physicists use the famous "five sigma" rule, which corresponds to a two-sided α ≈ 0.00006 % (about half that if only one tail is counted).
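The second caveat can be made concrete by asking how often the 5 % rule (reject outside 6 to 14 heads) would actually flag a coin with a given bias. The sketch below is illustrative only: the bias values are hypothetical, and `rejection_probability` is a name chosen for this example.

```python
from math import comb

def rejection_probability(true_p, flips=20, low=6, high=14):
    """Chance that a coin with P(heads) = true_p lands outside low..high heads."""
    inside = sum(
        comb(flips, k) * true_p ** k * (1 - true_p) ** (flips - k)
        for k in range(low, high + 1)
    )
    return 1 - inside

# Hypothetical biases, from perfectly fair to heavily loaded.
for true_p in (0.5, 0.55, 0.6, 0.75, 0.85):
    share = rejection_probability(true_p)
    print(f"P(heads) = {true_p}: flagged in {share:.1%} of 20-flip runs")
```

A mildly biased coin slips through most of the time, while a heavily loaded one is caught in the large majority of runs, which is the sense in which twenty flips can catch a heavy bias but cannot certify fairness.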

Scaling up: what happens with 1,000 or 10,000 flips?

With 20 flips, the binomial distribution is small enough to write out by hand. With 1,000 or 10,000 flips, the same logic still applies — but instead of summing dozens of individual probabilities, you switch to the normal approximation: the Z-test (or, equivalently, the chi-squared test for one binomial proportion). The principle is identical: pick a null hypothesis, find the tail probability of seeing a result at least as extreme, and compare to α.
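As a rough sketch of the normal approximation, the Z-test for a fair coin needs only the standard library. The function name and the example of 530 heads in 1,000 flips are hypothetical, and no continuity correction is applied, so results for small samples will differ slightly from the exact binomial test.

```python
from math import erfc, sqrt

def z_test_fair_coin(heads, flips):
    """Two-tailed z-test of H0: P(heads) = 0.5, using the normal approximation."""
    expected = flips / 2
    std_dev = sqrt(flips) / 2              # sqrt(n * 0.5 * 0.5)
    z = (heads - expected) / std_dev
    p_value = erfc(abs(z) / sqrt(2))       # two-tailed area under the standard normal
    return z, p_value

z, p = z_test_fair_coin(heads=530, flips=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```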

This is exactly the test the site applies to its own cumulative flip data on the Statistics page. With tens of millions of flips, the test is so sensitive that a deviation of just a few thousand flips from a perfect 50/50 split would be flagged as significant — yet such a deviation would be irrelevant in practical terms. Statistical significance and practical significance are not the same thing, and very large samples make the difference visible. (For an interactive version of this and other simulators, see the statistics simulator article.)
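To see the gap between statistical and practical significance in numbers, consider a made-up tally. Both figures below are hypothetical stand-ins for "tens of millions" and "a few thousand"; they are not the site's actual data.

```python
from math import erfc, sqrt

flips = 20_000_000        # hypothetical total number of flips
excess_heads = 5_000      # hypothetical surplus of heads over an exact 50/50 split

heads = flips // 2 + excess_heads
z = (heads - flips / 2) / (sqrt(flips) / 2)
p_value = erfc(abs(z) / sqrt(2))      # two-tailed normal approximation
observed_share = heads / flips

print(f"z = {z:.2f}, p = {p_value:.4f}")
print(f"observed share of heads = {observed_share:.4%}")
```

Under these assumed numbers the p-value falls below 5 %, yet the observed share of heads differs from one half by only a few hundredths of a percentage point.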

Try it on a different number

The reasoning works for any observed count. Suppose your friend's coin had landed 15 heads in 20 flips — the same direction, but less extreme. The two-tailed p-value would be:

P = P(X ≤ 5) + P(X ≥ 15) ≈ 2.07% + 2.07% ≈ 4.14%

At the 5 % level, this is borderline significant — you would (just barely) reject the null hypothesis. At the 1 % level, you would not. The exact same observed direction, with two fewer heads, lands on the opposite verdict for the stricter test. This is why intuition is no substitute for the actual calculation: "feels suspicious" depends entirely on what tail probability and what significance level you are willing to commit to in advance.
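The same check can be wrapped in a small helper and run for any count. This is a sketch with illustrative names; it compares the p-value against each α with a strict "less than", which is one common convention.

```python
from math import comb

def two_tailed_p(heads, flips=20):
    """Two-tailed p-value for a fair coin, counting both equally distant tails."""
    distance = abs(2 * heads - flips)     # integer distance from an even split
    extreme = sum(
        comb(flips, k) for k in range(flips + 1) if abs(2 * k - flips) >= distance
    )
    return extreme / 2 ** flips

for heads in (15, 17):
    p = two_tailed_p(heads)
    for alpha in (0.05, 0.01):
        verdict = "reject H0" if p < alpha else "do not reject H0"
        print(f"{heads} heads, p = {p:.2%}, alpha = {alpha}: {verdict}")
```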

Bottom line

A hypothesis test turns a vague suspicion ("that's too many heads") into a precise question ("how often would this happen if the coin were fair?") and a numerical answer (the p-value). The cost is committing to a significance level in advance and accepting that you are not "proving" anything, only rejecting one hypothesis in favour of another with a known false-positive rate. The benefit is that you no longer have to argue from intuition. Twenty flips, 17 heads, p ≈ 0.26 %: a result a fair coin would produce only about once in 390 tries, and strong grounds to reject fairness. That sentence is now defensible, not a guess.

Want to run the experiment yourself? Flip a coin 20 times with a cryptographically fair coin and test your own result against the table above.

Frequently Asked Questions

How many heads in 20 flips suggests a rigged coin?
At the everyday 5% significance level, anything outside 6–14 heads is statistically significant. At the stricter 1% level, you need a result outside 4–16 heads. 17 heads is clearly outside both ranges.

What is a p-value in simple terms?
The probability of seeing a result at least as extreme as yours, assuming the coin is actually fair. A tiny p-value means your result would almost never happen with a fair coin, which is grounds for suspicion.

Can 20 flips prove a coin is fair?
No. Twenty flips can catch a heavily loaded coin, but it is nowhere near enough to certify fairness: a small bias would need thousands of flips to detect reliably.

What does "statistically significant" mean in a coin test?
It means the observed result's p-value falls below a threshold you committed to in advance (conventionally 5% or 1%). It does not prove the coin is loaded; it quantifies how implausible the result would be if the coin were fair.
