
Imagine a friend hands you a coin and challenges you to twenty flips. The result: 17 heads, 3 tails. You stare at the coin. You stare at your friend. Is the coin loaded? Are they cheating? Or were you just unlucky? The intuition that "17 out of 20 is too many" is correct — but how do we turn that gut feeling into a defensible answer?
This is exactly the question that a hypothesis test is designed to settle, and it's one of the simplest cases in all of inferential statistics. This article walks through the full reasoning, including the math, so the next time something feels too lopsided to be honest, you'll have the language to say how lopsided.
With a fair coin, you expect about 10 heads in 20 flips. Some random variation around 10 is normal: 11, 9, even 13 or 7 wouldn't surprise anyone. But 17 is far enough from 10 that something feels off. The question a hypothesis test answers is sharper: if the coin really were fair, how often would we see a result this extreme purely by chance? If the answer is "essentially never," we have grounds to suspect the coin.
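One way to get a feel for "how often by chance" is simply to simulate it. The sketch below is an illustration only: the trial count and seed are arbitrary choices, and it counts just the heads-heavy direction (17 or more heads). If the reasoning in this article holds, the printed rate should land near 0.13%.

```python
import random

N_FLIPS = 20
OBSERVED_HEADS = 17
TRIALS = 200_000  # hypothetical number of simulated 20-flip runs


def simulate_extreme_rate(trials: int = TRIALS, seed: int = 0) -> float:
    """Fraction of simulated fair-coin runs with at least OBSERVED_HEADS heads."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(trials):
        # 20 random bits stand in for 20 fair flips; each 1 bit is a head.
        heads = bin(rng.getrandbits(N_FLIPS)).count("1")
        if heads >= OBSERVED_HEADS:
            hits += 1
    return hits / trials


rate = simulate_extreme_rate()
print(f"Runs with {OBSERVED_HEADS}+ heads: {rate:.3%}")
```

A simulation only approximates the answer; the exact calculation is what a hypothesis test actually uses.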
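Statistical reasoning frames the problem as a contest between two ideas: the null hypothesis H₀, that the coin is fair (P(heads) = 0.5), and the alternative hypothesis H₁, that the coin is biased (P(heads) ≠ 0.5).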
We assume H₀ is true and ask whether the observed data is plausible under that assumption. If the data is wildly implausible under H₀, we reject H₀ in favour of H₁. This is the central move of every hypothesis test.
Under a fair coin, the number of heads in 20 flips follows a binomial distribution: P(X = k) = C(20, k) · 0.5²⁰, where C(20, k) is the number of ways to choose k heads from 20 flips. The distribution is symmetric — k heads and 20−k heads are equally likely — so the key values tell the whole story:
| Heads | Probability |
|---|---|
| 10 (the peak) | 17.62% |
| 9 or 11 | 16.02% each |
| 8 or 12 | 12.01% each |
| 6 or 14 | 3.70% each |
| 4 or 16 | 0.462% each |
| 3 or 17 | 0.109% each |
| 0 or 20 | 0.0001% each |
Results pile up around 10 and fall off a cliff toward the edges — by 17 heads you are already deep in the tail.
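The table values can be reproduced in a few lines. This is a minimal sketch using only the standard library; the function name `fair_coin_pmf` is made up for illustration.

```python
from math import comb

N_FLIPS = 20  # number of flips in the running example


def fair_coin_pmf(k: int, n: int = N_FLIPS) -> float:
    """Probability of exactly k heads in n flips of a fair coin."""
    # C(n, k) equally likely sequences out of 2**n in total.
    return comb(n, k) / 2 ** n


for k in (10, 9, 8, 6, 4, 3, 0):
    print(f"{k:>2} heads: {fair_coin_pmf(k):.4%}")
```

Because the fair-coin distribution is symmetric, `fair_coin_pmf(k)` and `fair_coin_pmf(20 - k)` return the same value, which is why the table lists the outcomes in pairs.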
The chance of getting exactly 17 heads is about 0.109 % — roughly 1 in 920. That alone is striking. But the proper hypothesis test asks a slightly different question.
Suppose the coin were loaded. The bias could go either way — too many heads or too many tails. If we are willing to say "17 heads is suspicious," we should equally accept "3 heads is suspicious." A fair test must therefore count all outcomes at least as extreme as the one we observed, in both directions.
For 20 flips, "at least as extreme as 17 heads" means 17, 18, 19, or 20 heads — and their symmetric counterparts 0, 1, 2, or 3 heads. The two-tailed p-value is the sum of those probabilities:
P = P(X ≤ 3) + P(X ≥ 17) ≈ 0.129% + 0.129% ≈ 0.26%
So if the coin were fair, you would see something this extreme (in either direction) in only about one in 390 runs of 20 flips. That is the number you reason with.
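The two-tailed sum can be sketched as follows. The helper name `two_tailed_p_value` is an assumption for this example, and the "at least as far from the centre" shortcut relies on the fair-coin distribution being symmetric, so it would not carry over unchanged to a null hypothesis other than 50/50.

```python
from math import comb


def two_tailed_p_value(heads: int, n: int = 20) -> float:
    """Exact two-tailed p-value for `heads` out of `n` flips under a fair coin.

    Adds up every outcome at least as far from n/2 as the observed count,
    in both directions.
    """
    distance = abs(heads - n / 2)
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance)
    return extreme / 2 ** n


p = two_tailed_p_value(17)
print(f"p = {p:.4%}, about 1 in {1 / p:.0f}")
```

Passing 3 instead of 17 gives the same p-value, which is the point of a two-tailed test.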
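Statisticians settled on two conventional thresholds (called significance levels, denoted α): α = 0.05 (5%) and α = 0.01 (1%). A result is called significant at a given level when its p-value is at or below α.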
Our p-value of 0.26% is well below both. Whether you use the 5% or the 1% rule, the conclusion is the same: under the fair-coin assumption, the data is too implausible to defend. We reject the null hypothesis in favour of the alternative: the evidence points to a loaded coin.
It is illuminating to run the test the other direction: how many heads in 20 flips would still be considered consistent with a fair coin at each threshold? The answer is the narrowest range whose combined probability of landing outside it is at or below α — equivalently, every outcome whose two-tailed p-value still exceeds α.
| Significance level (α) | "Fair-coin" acceptance range (heads) | Tail probability outside that range |
|---|---|---|
| 0.05 (5%) | 6 to 14 heads | ≈ 4.14% — reject if outside |
| 0.01 (1%) | 4 to 16 heads | ≈ 0.26% — reject if outside |
At the 5 % level, anything outside 6–14 heads is "significant." At the 1 % level, you need to land outside 4–16 heads. The stricter the level, the more extreme the result needs to be before you call the coin loaded. 17 heads sits clearly outside both ranges.
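The acceptance ranges in the table can be recovered by keeping every head count whose two-tailed p-value still exceeds α. The sketch below assumes that definition; `acceptance_range` and `two_tailed_p_value` are illustrative names, not part of any library.

```python
from math import comb


def two_tailed_p_value(heads: int, n: int = 20) -> float:
    """Exact two-tailed p-value under a fair coin (symmetric case only)."""
    distance = abs(heads - n / 2)
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance)
    return extreme / 2 ** n


def acceptance_range(alpha: float, n: int = 20) -> tuple[int, int]:
    """Smallest and largest head counts that are NOT rejected at level alpha."""
    kept = [k for k in range(n + 1) if two_tailed_p_value(k, n) > alpha]
    return min(kept), max(kept)


for alpha in (0.05, 0.01):
    low, high = acceptance_range(alpha)
    print(f"alpha = {alpha}: {low} to {high} heads")
```

Changing `n` shows how the range widens in absolute terms, but narrows relative to the number of flips, as the sample grows.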
A few important caveats:
With 20 flips, the binomial distribution is small enough to write out by hand. With 1,000 or 10,000 flips, the same logic still applies, but instead of summing hundreds or thousands of individual probabilities, you switch to the normal approximation: the Z-test (or, equivalently, the chi-squared test for one binomial proportion). The principle is identical: pick a null hypothesis, find the tail probability of seeing a result at least as extreme, and compare to α.
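A minimal sketch of the Z-test for a fair coin, using only the standard library. It omits the continuity correction for simplicity, so for small samples it will differ somewhat from the exact binomial answer. The function name and the 530-in-1,000 example are hypothetical.

```python
from math import erfc, sqrt


def z_test_fair_coin(heads: int, n: int) -> tuple[float, float]:
    """Return (z, two-tailed p) under H0: P(heads) = 0.5, normal approximation."""
    mean = n / 2
    sd = sqrt(n) / 2  # sqrt(n * 0.5 * 0.5)
    z = (heads - mean) / sd
    # Two-tailed p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p = erfc(abs(z) / sqrt(2))
    return z, p


z, p = z_test_fair_coin(530, 1000)  # hypothetical result: 530 heads in 1,000 flips
print(f"z = {z:.3f}, p = {p:.4f}")
```

In this made-up example the z-score is just under 1.96, so the result falls slightly short of significance at the 5% level.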
This is exactly the test the site applies to its own cumulative flip data on the Statistics page. With tens of millions of flips, the test is so sensitive that a deviation of just a few thousand flips from a perfect 50/50 split would be flagged as significant — yet such a deviation would be irrelevant in practical terms. Statistical significance and practical significance are not the same thing, and very large samples make the difference visible. (For an interactive version of this and other simulators, see the statistics simulator article.)
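To see how sensitive the test becomes, the sketch below computes the smallest deviation from an even split that a two-tailed Z-test would flag at the 5% level. The sample sizes are illustrative assumptions (50 million is not the site's actual total), and for n = 20 the normal approximation is only rough.

```python
from math import sqrt
from statistics import NormalDist


def significant_deviation(n: int, alpha: float = 0.05) -> float:
    """Smallest |heads - n/2| that a two-tailed Z-test flags at level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z_crit * sqrt(n) / 2


for n in (20, 1_000, 50_000_000):  # 50 million is a hypothetical total
    d = significant_deviation(n)
    print(f"n = {n:>11,}: {d:,.0f} heads off centre, i.e. {0.5 + d / n:.4%} heads")
```

The threshold grows with the square root of n while the sample grows with n itself, so the detectable bias, as a share of all flips, shrinks toward zero.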
The reasoning works for any observed count. Suppose your friend's coin had landed 15 heads in 20 flips — the same direction, but less extreme. The two-tailed p-value would be:
P = P(X ≤ 5) + P(X ≥ 15) ≈ 2.07% + 2.07% ≈ 4.14%
At the 5 % level, this is borderline significant — you would (just barely) reject the null hypothesis. At the 1 % level, you would not. The exact same observed direction, with two fewer heads, lands on the opposite verdict for the stricter test. This is why intuition is no substitute for the actual calculation: "feels suspicious" depends entirely on what tail probability and what significance level you are willing to commit to in advance.
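The comparison between 15 and 17 heads can be laid out side by side. This sketch assumes the decision rule "reject when the p-value is at or below α"; the helper names are illustrative.

```python
from math import comb


def two_tailed_p_value(heads: int, n: int = 20) -> float:
    """Exact two-tailed p-value under a fair coin (symmetric case only)."""
    distance = abs(heads - n / 2)
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance)
    return extreme / 2 ** n


def rejects_fair_coin(heads: int, alpha: float, n: int = 20) -> bool:
    """True when the observed count is significant at level alpha."""
    return two_tailed_p_value(heads, n) <= alpha


for heads in (15, 17):
    p = two_tailed_p_value(heads)
    for alpha in (0.05, 0.01):
        verdict = "reject H0" if rejects_fair_coin(heads, alpha) else "fail to reject H0"
        print(f"{heads} heads, p = {p:.2%}, alpha = {alpha}: {verdict}")
```

Only the combination of 15 heads and α = 0.01 fails to reject, which is the split verdict described above.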
A hypothesis test turns a vague suspicion ("that's too many heads") into a precise question ("how often would this happen if the coin were fair?") and a numerical answer (the p-value). The cost is committing to a significance level in advance and accepting that you are not "proving" anything, only rejecting one hypothesis in favour of another, with a known false-positive rate. The benefit is that you no longer have to argue from intuition. Twenty flips, 17 heads, p ≈ 0.26%: a result a fair coin would produce only about once in 390 tries, and strong grounds to reject fairness. That sentence is now defensible, not a guess.
Want to run the experiment yourself? Flip a coin 20 times with a cryptographically fair coin and test your own result against the table above.