# Knowledge vs Uncertainty

What probability should you assign to propositions about which, informally, you would prefer to throw your hands up in despair and say “I don’t know!”? This is a very difficult question in general, but there are certain situations in which a sensible answer can be reached without too much trouble.

Coin flips are always a nice way to understand probability, so imagine a coin. If it is just a generic coin, you have probably observed enough coin flips, and have enough intuitive understanding of physics, to justify to yourself that the coin is probably unbiased. That is, you are probably happy to say that the probability of “this coin will come up heads if I give it a good flip” is 0.5, or 50%. Put another way, assuming the chance of landing on the edge is negligible, you could trade the word “heads” for “tails” in that sentence, and your knowledge about the truth of the sentence would remain the same.

But consider now that you know nothing about coins; you are a small child, and you know nothing of physics or the properties of ordinary coins. Or perhaps you are in a dodgy gambling den, rife with con artists, and dodgy coins of all kinds are to be found at every turn. Given one of these arbitrary coins, what probability do you then assign to flipping heads on the next flip?

I will argue here that since your state of knowledge is symmetric under the exchange of the labels “heads” and “tails”, you should assign the two possibilities equal probability. This is the intuition behind the principle of indifference; however, I think the argument about the symmetry of your knowledge is far more powerful.

But, then, what is the difference between the P(H)=0.5 we assigned when we thought the coin was probably fair, and the P(H)=0.5 we assigned when the coin was almost certainly dodgy? The crux of this post is to demonstrate this difference; but to pre-empt the answer, the point is that the belief network supporting this assertion is quite different in each case, and one of them is far more robust than the other in the face of new data.

## State of knowledge A – Not much knowledge at all!

Let us see how this works. To make things as crystal clear as I can, let me establish some notation:

$H$ – “The next flip of this coin will land on heads”
$k$ – Number of heads in a sequence of flips
$N$ – Total number of flips
$p$ – “This coin has bias p”

Let us now set some priors, for demonstration purposes. In this dodgy gambling den, we think the coin we have been given is probably dodgy, but we have no idea whether it is biased towards heads or tails. We must therefore pick a prior for $p$ which is symmetric about $p=0.5$. For simplicity let’s just use the uniform prior density,

$P(p|I)=1$,

where $I$ is our background information about the dodgy-ness of the gambling den, the apparent symmetry of the coin, etc.

Given some bias parameter $p$ (implicitly specifying a binomial model for the behaviour of the coin), the probability of flipping a head is $p$, that is,

$P(H|p,I) = p$.

However, we don’t know $p$, so we have to consider all possible values it might have when making our prediction, weighting according to our prior density for $p$, that is, we marginalise over $p$ to obtain

$P(H|I) = \int_0^1 P(H,p|I) dp = \int_0^1 P(H|p,I) P(p|I) dp = \int_0^1 p dp = [\frac{p^2}{2}]_0^1 = 0.5$

Unsurprisingly we get the answer $0.5$, because our prior was symmetric about this value.
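As a quick numerical check of the marginalisation above, here is a minimal sketch that approximates the integral with a midpoint rule. The function name `prob_heads` and the second, peaked prior (a Beta(2,2) density) are my own illustrative choices, not part of the original argument; the point is only that any prior symmetric about $p=0.5$ gives the same answer.

```python
def prob_heads(prior_density, steps=100_000):
    """Approximate P(H|I) = integral over [0, 1] of p * prior(p) dp (midpoint rule)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * width              # midpoint of the i-th slice
        total += p * prior_density(p) * width
    return total


def uniform(p):
    """The flat prior used in the text: P(p|I) = 1."""
    return 1.0


def peaked(p):
    """Hypothetical alternative prior, Beta(2,2): also symmetric about 0.5."""
    return 6.0 * p * (1.0 - p)


print(prob_heads(uniform))   # close to 0.5
print(prob_heads(peaked))    # also close to 0.5
```

Both priors predict the same thing for the next flip, even though they encode different beliefs about the bias.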

Great! So this defines our initial state of knowledge about the coin. We are only basing this on some symmetry arguments, not any empirical evidence, so our knowledge here is not very certain. My goal now is to define a much more certain state of knowledge, which makes the same predictions for $H$, but which behaves very differently as we perform more flips and learn more about the coin.

## State of knowledge B – Lots of knowledge!

To proceed from state of knowledge A (not knowing much) to state of knowledge B (knowing a whole lot), let us do science! That is, let us experiment on the coin. We will do this by flipping the coin 400 times. To figure out what impact this has on us, we need to calculate the probability of getting our sequence of coin flips under various hypotheses about the coin bias. Note that under the hypotheses about the coin we are considering, each flip is independent, so we only need to know how many heads there were in the sequence to figure out its probability of occurring. If we were concerned that flips might be correlated then this analysis would get somewhat more complicated…

The probability of any specific sequence of length $N$ containing $k$ heads is
$P(k|p,N,I) = p^k(1-p)^{N-k}$

(Interestingly, we can leave the binomial coefficient out here, since we are talking about the probability of getting one specific sequence of heads and tails, not the usual binomial question of getting $k$ “successes” in $N$ “trials”. I am being a bit lazy in my notation by summarising this whole sequence with “$k$”. It turns out not to matter for our problem either way, since the binomial coefficient is constant in $p$ and gets normalised away.)
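To see the binomial coefficient drop out, here is a small sketch that computes the posterior on a grid of $p$ values, once with the coefficient and once without. The helper name `grid_posterior`, the grid resolution, and the example numbers (7 heads in 10 flips) are illustrative assumptions.

```python
from math import comb


def grid_posterior(k, N, include_coefficient, steps=1000):
    """Posterior density for p on a grid, starting from the uniform prior."""
    width = 1.0 / steps
    grid = [(i + 0.5) * width for i in range(steps)]
    coeff = comb(N, k) if include_coefficient else 1
    # Likelihood times prior; the uniform prior density is 1 everywhere.
    weights = [coeff * p**k * (1.0 - p) ** (N - k) for p in grid]
    norm = sum(weights) * width            # numerical stand-in for P(k|N,I)
    return [w / norm for w in weights]


with_coeff = grid_posterior(7, 10, include_coefficient=True)
without_coeff = grid_posterior(7, 10, include_coefficient=False)

# Largest pointwise difference between the two posteriors.
max_diff = max(abs(a - b) for a, b in zip(with_coeff, without_coeff))
print(max_diff)   # zero up to floating-point rounding
```

The coefficient multiplies the numerator and the normalising constant by the same factor, so it cancels in the ratio.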

For simplicity we’ll assume there were exactly $k=200$ heads in our sequence of $N=400$ flips. We use this data to update our beliefs about the coin bias $p$, via Bayes’ theorem:

$P(p|k,N,I) = \frac{P(k|p,N,I) P(p|I)}{P(k|N,I)}$
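With the uniform prior this update has a closed form. This is a standard result that I am adding for reference, not a step taken from the original calculation. The normalising constant is a Beta function,

$P(k|N,I) = \int_0^1 p^k (1-p)^{N-k} dp = \frac{k!(N-k)!}{(N+1)!}$,

so the posterior is the Beta density with parameters $k+1$ and $N-k+1$,

$P(p|k,N,I) = \frac{(N+1)!}{k!(N-k)!} p^k (1-p)^{N-k}$,

and the probability of a head on the next flip is its mean,

$P(H|k,N,I) = \int_0^1 p P(p|k,N,I) dp = \frac{k+1}{N+2}$.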

This causes the following to happen to our beliefs about $p$:

Figure: posterior density for $p$ after the 400 flips. X axis is the bias parameter $p$, Y axis is the probability density of $p$.

This defines state of knowledge B. Note that it remains symmetric about $p=0.5$, so it predicts $P(H|k,N,I)=0.5$, that is, we still have the same belief that the next flip will be heads as we did when we started. So has anything changed in our state of knowledge? Yes! Dramatically! We are now much more sure about what the bias parameter of the coin is! This affects future inferences greatly, as we shall now see.
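To put a number on “much more sure”, here is a minimal sketch comparing the spread of our beliefs about $p$ in the two states. It assumes the standard result that a uniform prior updated on $k$ heads in $N$ flips gives a Beta density with parameters $k+1$ and $N-k+1$; the function names are my own.

```python
from math import sqrt


def posterior_params(k, N):
    """Beta parameters after k heads in N flips, starting from a uniform prior."""
    return k + 1, N - k + 1


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) density."""
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(variance)


mean_a, sd_a = beta_mean_sd(*posterior_params(0, 0))       # state A: no data yet
mean_b, sd_b = beta_mean_sd(*posterior_params(200, 400))   # state B: 200 heads in 400

print(f"A: mean={mean_a:.3f}, sd={sd_a:.3f}")   # A: mean=0.500, sd=0.289
print(f"B: mean={mean_b:.3f}, sd={sd_b:.3f}")   # B: mean=0.500, sd=0.025
```

The means agree, but the standard deviation of our belief about $p$ shrinks by more than a factor of ten.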

## The coin starts to look dodgy…

Suppose now that we encounter a sequence of flips that, a priori, might look really suspicious: 10 heads in a row! After observing this, what are our beliefs about the bias of the coin, and what is our belief that the next flip will be heads, starting from each of the states of knowledge A and B?

Applying Bayes’ theorem in both cases, with $N=10$ and $k=10$, we get the following posterior distributions:

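Figure: posterior densities for $p$ after the ten heads, starting from states of knowledge A and B. Same axes as before :).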

And the following probabilities for getting a head on the next flip (marginalising over $p$ as before, but now weighting by the posterior density rather than the prior):

$P(H|A) = 0.92$
$P(H|B) = 0.51$

As you can see, starting from our state of great uncertainty, after seeing ten heads in a row we are now really suspicious of this coin, and think there is a $92\%$ chance of flipping a head on the eleventh trial. On the other hand, when we have seen the coin behaving perfectly fairly up till now, we think that not much out of the ordinary is happening; we’ll need a lot more heads in a row to convince us something is wrong. We have gotten slightly suspicious, and now think $P(H)=0.51$ rather than $0.5$, but really haven’t changed our mind very much.
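The two numbers above can be reproduced exactly with a few lines. This sketch assumes the uniform prior from state A, for which the probability of a head on the next flip after $k$ heads in $N$ flips is $(k+1)/(N+2)$ (Laplace's rule of succession). The function name and the final "how long a streak" question are my own additions.

```python
from fractions import Fraction


def prob_next_head(heads, flips):
    """Posterior predictive P(H) under a uniform prior: (heads + 1) / (flips + 2)."""
    return Fraction(heads + 1, flips + 2)


# State A: only the ten suspicious heads have been seen.
p_a = prob_next_head(10, 10)
# State B: 200 heads in 400 flips first, then the same ten heads.
p_b = prob_next_head(200 + 10, 400 + 10)

print(p_a, round(float(p_a), 2))   # 11/12 0.92
print(p_b, round(float(p_b), 2))   # 211/412 0.51

# How many heads in a row would state B need before it is as suspicious
# as state A is after ten?
streak = 0
while prob_next_head(200 + streak, 400 + streak) < p_a:
    streak += 1
print(streak)   # 2010
```

Starting from state B, it takes a streak of about two thousand heads to reach the level of suspicion that ten heads produce from state A.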

And now I have to catch a flight! I hope this post has been fun to read :). Till next time!

Update: I was inspired to add a little extra. To demonstrate further the difference in confidence with which one can make predictions from these various states of knowledge, I computed the predictive probability of the fraction of heads one would expect in samples of 20 and 1000 flips of the mystery coin. Note that in the $N=20$ case the variance of the prediction is dominated by the low sample size, while in the $N=1000$ case it is dominated by the variance in the knowledge of the coin bias. In both cases, when the prior for the coin bias is flat, we know pretty much nothing about the fraction of heads to expect in the sample. The mean is 50%, but the prediction is spread evenly over every possible outcome, so we have minimal confidence that we will actually see an outcome near the mean. When we have knowledge about the coin bias, we can predict the fraction of heads in the sample with much greater confidence.

Prediction for the fraction of heads in a sample of 20 coin flips, based on various states of knowledge about the coin. X axis is the fraction of heads in the sample, Y is the predictive probability density for that outcome.

Prediction for the fraction of heads in a sample of 1000 coin flips, based on various states of knowledge about the coin. X axis is the fraction of heads in the sample, Y is the predictive probability density for that outcome.
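For anyone wanting to reproduce predictions like these, here is a minimal sketch. It is not the code used for the plots; it assumes beliefs about $p$ are described by a Beta density (Beta(1,1) for the flat prior of state A, Beta(201,201) for state B), in which case the predictive distribution for the number of heads is the beta-binomial. All function names are my own.

```python
from math import exp, lgamma, sqrt


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def predictive_pmf(n, a, b):
    """P(m heads in n future flips) for m = 0..n, given a Beta(a, b) belief about p."""
    pmf = []
    for m in range(n + 1):
        log_choose = lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)
        pmf.append(exp(log_choose + log_beta(a + m, b + n - m) - log_beta(a, b)))
    return pmf


def fraction_mean_sd(n, a, b):
    """Mean and standard deviation of the predicted fraction of heads."""
    pmf = predictive_pmf(n, a, b)
    mean = sum(m / n * w for m, w in enumerate(pmf))
    variance = sum((m / n - mean) ** 2 * w for m, w in enumerate(pmf))
    return mean, sqrt(variance)


for n in (20, 1000):
    for label, (a, b) in (("A (flat)", (1, 1)), ("B", (201, 201))):
        mean, sd = fraction_mean_sd(n, a, b)
        print(f"n={n:4d}  state {label:8s}  mean={mean:.3f}  sd={sd:.3f}")
```

Under the flat prior every possible number of heads is equally likely, so the spread barely changes with the sample size, while for state B it narrows as the sample grows.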