What is calibration?
The Brier score explained

Imagine you're asked whether it will rain tomorrow and you say, "I'm pretty sure — I'd put it at about 80%." The interesting question isn't just whether you're right that one time. It's whether, across all the times you say 80%, roughly 80% of those things actually happen.

That pattern — whether your confidence reliably matches your track record — is what forecasters call calibration. And most people have no idea whether they're well-calibrated, because the feedback almost never comes. You make confident calls all week. Some land, some don't. Without a written record, your memory quietly drops the misses and files the hits. You never discover that the things you called "90% sure" only came true about half the time.

Calibration is the practice of closing that loop. Write down what you expect, attach a confidence level, set a deadline, check back. Do it long enough and a pattern emerges — one that tells you a great deal more about your thinking than any single right-or-wrong result.

Being well-calibrated is not the same as being right

Here's the key subtlety: a well-calibrated forecaster who says "60% chance" and it doesn't happen has not made a mistake. Forty percent of the time, that is exactly the outcome that should occur. Calibration is about accuracy across many predictions, not any individual outcome.

A well-calibrated person's 70% predictions come true about 70% of the time. Their 50% predictions come true about half the time. If you say "I'm 90% confident" and those things only happen 60% of the time, you're overconfident — your internal sense of certainty outpaces reality. Conversely, if your 80% calls come true 95% of the time, you're underconfident — you're systematically underselling your own accuracy.
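
One way to check this against a written log is to group predictions by stated confidence and compare each group's hit rate with the number you said. The sketch below is a minimal illustration, not a definitive method: the function name hit_rates_by_confidence and the log are made up, with counts chosen to mirror the 90% and 80% examples above. With only a handful of predictions per level the observed rate is noisy, so the labels are hints rather than verdicts.

```python
from collections import defaultdict


def hit_rates_by_confidence(predictions):
    """Group (confidence, came_true) pairs by stated confidence.

    Returns {confidence: (observed_hit_rate, number_of_predictions)}.
    """
    buckets = defaultdict(list)
    for confidence, came_true in predictions:
        buckets[confidence].append(1 if came_true else 0)
    return {
        confidence: (sum(outcomes) / len(outcomes), len(outcomes))
        for confidence, outcomes in sorted(buckets.items())
    }


# Made-up log: ten calls at 90% (six came true), twenty at 80% (nineteen came true).
log = (
    [(0.9, True)] * 6 + [(0.9, False)] * 4
    + [(0.8, True)] * 19 + [(0.8, False)] * 1
)

for confidence, (hit_rate, n) in hit_rates_by_confidence(log).items():
    if hit_rate < confidence:
        verdict = "overconfident"
    elif hit_rate > confidence:
        verdict = "underconfident"
    else:
        verdict = "calibrated"
    print(f"stated {confidence:.0%}, observed {hit_rate:.0%} over {n} calls: {verdict}")
```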

Neither overconfidence nor underconfidence is a character flaw; they're measurement signals. Knowing which direction you lean makes you a better decision-maker.

Putting a number on it: the Brier score

Intuition is useful; a number is better. The Brier score turns the accuracy of your stated probabilities into a single metric you can track over time. It was introduced by meteorologist Glenn Brier in 1950 to evaluate weather forecasts, and it remains one of the clearest ways to measure probabilistic accuracy.

Here is how it works. For each resolved prediction, express your stated confidence as a decimal between 0 and 1 — so 70% becomes 0.70. Subtract the actual outcome: 1 if the prediction came true, 0 if it did not. Square that difference. Average those squared differences across all your resolved predictions.

Brier score = average of (confidence − outcome)²

where:
confidence ∈ [0, 1] (your stated probability as a decimal)
outcome ∈ {0, 1} (1 = came true, 0 = did not)
average is taken over all resolved predictions
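
The formula translates almost line for line into code. The following is a minimal sketch in Python; the function name brier_score and the (confidence, came_true) pair layout are assumptions chosen for illustration, not any particular tool's API.

```python
def brier_score(predictions):
    """Mean of (confidence - outcome)^2 over resolved predictions.

    `predictions` is a list of (confidence, came_true) pairs, where
    confidence is a decimal in [0, 1] and came_true is a bool.
    """
    if not predictions:
        raise ValueError("need at least one resolved prediction")
    total = 0.0
    for confidence, came_true in predictions:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence must be in [0, 1], got {confidence}")
        outcome = 1.0 if came_true else 0.0  # 1 = came true, 0 = did not
        total += (confidence - outcome) ** 2
    return total / len(predictions)


# A single 70% call that came true scores (0.70 - 1)^2, about 0.09.
print(round(brier_score([(0.70, True)]), 4))
```

Unresolved predictions are simply left out of the list, matching the "resolved predictions only" rule in the formula.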

Lower is better. A score of 0 is theoretically perfect: every prediction you called 100% came true, and every prediction you called 0% did not. A score of 1 is the worst possible: you said 100% on everything that failed, and 0% on everything that succeeded. The important reference point is 0.25, which is exactly the score you get if you say 50% on every single prediction, whatever the outcomes, because (0.5 − 1)² and (0.5 − 0)² both equal 0.25. Scoring below 0.25 means your stated confidences carry real information; scoring above it means you would have done better by answering 50% every time.
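
These three reference points can be checked directly. The sketch below uses a made-up list of outcomes and a compact helper (the name brier_score is an assumption for illustration) to score a perfect forecaster, the worst possible forecaster, and someone who says 50% on everything.

```python
def brier_score(pairs):
    # pairs: (confidence as a decimal in [0, 1], outcome as 1 or 0)
    return sum((c - o) ** 2 for c, o in pairs) / len(pairs)


outcomes = [1, 0, 1, 1, 0]  # made-up results: 1 = came true, 0 = did not

perfect = [(float(o), o) for o in outcomes]  # 100% on hits, 0% on misses
worst = [(1.0 - o, o) for o in outcomes]     # 0% on hits, 100% on misses
always_50 = [(0.5, o) for o in outcomes]     # 50% on everything

print(brier_score(perfect), brier_score(worst), brier_score(always_50))
# prints: 0.0 1.0 0.25
```

Changing the outcomes list does not move any of the three numbers, which is what makes 0.25 a stable baseline to compare against.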

What specific scores mean in practice

Score	What it means
0.00	Theoretical perfection. Every 100% prediction came true; every 0% prediction did not.
< 0.10	Very strong. High confidence on things that happen; low confidence on things that don't.
0.15 – 0.20	Skilled. Clearly better than the uninformative baseline.
~ 0.25	The "say 50% on everything" baseline. A common starting point.
> 0.25	Worse than saying 50% on everything. Overconfidence on failed predictions is the usual driver.

A worked example (illustrative — these are made-up numbers)

Suppose you have resolved four predictions. Here is how the Brier score is calculated step by step:

Prediction	Confidence	Came true?	Outcome	Squared error
"My team will hit the Q3 target"	80% → 0.80	Yes	1	(0.80 − 1)² = 0.04
"I'll finish this book this month"	70% → 0.70	Yes	1	(0.70 − 1)² = 0.09
"The meeting will run over time"	60% → 0.60	No	0	(0.60 − 0)² = 0.36
"It will rain on my day off"	90% → 0.90	No	0	(0.90 − 0)² = 0.81

Mean squared error = (0.04 + 0.09 + 0.36 + 0.81) ÷ 4 = 0.325

In this example, 2 out of 4 predictions came true, so the hit rate is 50%. But the average stated confidence was (80 + 70 + 60 + 90) ÷ 4 = 75%. Because the average confidence (75%) considerably exceeded the actual hit rate (50%), a calibration tracker would flag this as overconfident, the most common finding for people who are new to this practice. The Brier score of 0.325 is above the 0.25 baseline, consistent with that diagnosis: the two misses, called at 60% and 90%, contributed 0.36 and 0.81 of the 1.30 total squared error.
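
The same arithmetic can be reproduced in a few lines. This is a minimal sketch using the four made-up predictions from the table; the variable names and the simple "average confidence above hit rate" check are assumptions for illustration, not a description of how any particular tracker decides.

```python
# (prediction, stated confidence as a decimal, came true?)
resolved = [
    ("My team will hit the Q3 target", 0.80, True),
    ("I'll finish this book this month", 0.70, True),
    ("The meeting will run over time", 0.60, False),
    ("It will rain on my day off", 0.90, False),
]

squared_errors = [
    (conf - (1 if came_true else 0)) ** 2 for _, conf, came_true in resolved
]
brier = sum(squared_errors) / len(resolved)
hit_rate = sum(came_true for _, _, came_true in resolved) / len(resolved)
avg_confidence = sum(conf for _, conf, _ in resolved) / len(resolved)

print(f"Brier score:        {brier:.3f}")           # 0.325
print(f"Hit rate:           {hit_rate:.0%}")        # 50%
print(f"Average confidence: {avg_confidence:.0%}")  # 75%
if avg_confidence > hit_rate:
    print("Average confidence exceeds hit rate: leaning overconfident")
```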

Why bother? And how to start

The value of tracking calibration isn't the score itself — it's what happens to your thinking when you know the score is coming. When you commit to writing down a specific, falsifiable prediction — not "I think this might happen" but "I think X will happen by the end of next month, and I'm 70% confident" — you can no longer move the goalposts after the fact. That small act of commitment changes how carefully you think about the claim before you make it.

Over time, patterns emerge. You might find you are consistently overconfident about timelines but well-calibrated on interpersonal predictions. You might be accurate at 60–70% confidence but erratic at 90%. These asymmetries are genuine feedback from reality about where your mental model is strong and where it is not.

The habit is simple: write the prediction in one sentence, attach a number from 1% to 99%, and set a date to check back. A plain text file works. So does a spreadsheet. The discipline is in the doing.
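
For anyone who prefers a small script to a spreadsheet, one log entry could be modelled as below. This is only a sketch under stated assumptions: the Prediction class, its field names, the tab-separated line layout, and the example date are all hypothetical, chosen to match the three ingredients above (one sentence, a number from 1 to 99, a check-back date).

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Prediction:
    statement: str       # one specific, falsifiable sentence
    confidence_pct: int  # whole number from 1 to 99
    check_on: date       # when to come back and resolve it

    def __post_init__(self):
        if not 1 <= self.confidence_pct <= 99:
            raise ValueError("confidence must be between 1% and 99%")

    def to_line(self):
        # Hypothetical layout for one line of a plain text log.
        return f"{self.check_on.isoformat()}\t{self.confidence_pct}%\t{self.statement}"


# Example entry; the date is made up.
entry = Prediction("X will happen by the end of next month", 70, date(2025, 1, 31))
print(entry.to_line())
```

Rejecting 0 and 100 at entry time keeps the log honest about uncertainty, in line with the 1% to 99% range suggested above.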

For the argument that doing this privately, away from public audiences, changes what you predict and how honestly you predict it, see "Why track predictions privately?"

One tool built for this habit

Hunch is a free prediction journal that runs in your browser — no account, no sign-up, your data never leaves your device. It calculates your Brier score and flags overconfidence and underconfidence as you build your log. For a factual comparison with other tools, see Hunch vs PredictionBook vs Fatebook.
