It has been a while since I talked about probability theory, but I want to get back to it today. I am going to continue on, roughly speaking, from my post on induction, but it will not be at all necessary for you to have read that post for this one to make sense.

I will give a little background. When procrastinating from my more pressing projects, I tend to drift towards thinking about the nature of scientific models, what our beliefs about them are, what it is rational to expect to discover in the future and why, etc. In the course of these thoughts, and while reading an old book by Keynes, I came across an argument that tells you something both obvious and a little discouraging.

Say we are doing that old thing where we want to know if all swans are white. Let us consider two competing hypotheses:

1 – “All swans are white”

2 – “All European swans are white”

Now, it is probably obvious to you that hypothesis 2 is “a-priori” more probably correct than hypothesis 1 (disregard anything you may actually know about swans… we are talking a-priori here, or at least conditional on only some very primitive information). This is because we have restricted the extent of our generalisation in hypothesis 2; finding a black swan in Australia will falsify 1, but not 2. We can see this a little bit more formally by writing the following; let

$W(x)$ = “x is white”

$S(x)$ = “x is a swan”

$E(x)$ = “x is European”

We can then form the compound propositions

$H_1 = g(S, W)$ = “for all x where x is a swan, x is white”

$H_2 = g(SE, W)$ = “for all x where x is a swan and x is European, x is white”

where we may call $g(\phi, f)$ a “generalisation”. In this case it is a propositional function, with the structure

$g(\phi, f)$ = “for all x where $\phi(x)$ is true, $f(x)$ is also true”

You can then hopefully convince yourself that (writing $\bar{E}$ for “not European” and juxtaposition for “and”)

$$g(S, W) = g(SE, W)\, g(S\bar{E}, W)$$

which in words is just

“all swans are white” = “all European swans are white” and “all non-European swans are white”.

If you accept that the two propositions on the right are independent, then the probabilities of these propositions, conditioned on primitive evidence $I$, are related by

$$P(g(S, W) \mid I) = P(g(SE, W) \mid I)\, P(g(S\bar{E}, W) \mid I)$$

from which it is clear that

$$P(g(S, W) \mid I) \le P(g(SE, W) \mid I).$$

So, before you go out and check out any swans, it is less probable that ALL swans are white, than that just the European ones are white. We can’t really get more quantitative than this, but it is nice to reassure ourselves that this much is reasonable.
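To make the inequality concrete, here is a minimal Python sketch of the product rule above. The two prior values are made-up numbers chosen purely for illustration, and the function name is hypothetical; the only thing the sketch relies on is the independence assumption already made in the argument.

```python
# Toy illustration of P(H1) = P(H2) * P(all non-European swans are white).
# Assumption (as in the text): the two regional generalisations are independent.
# The numeric priors below are invented for illustration only.

def prior_all_swans_white(p_european: float, p_non_european: float) -> float:
    """Prior for "all swans are white" as a product of the two regional priors."""
    return p_european * p_non_european

p_h2 = 0.3    # assumed prior for "all European swans are white"
p_rest = 0.4  # assumed prior for "all non-European swans are white"
p_h1 = prior_all_swans_white(p_h2, p_rest)

print(f"P(H1) = {p_h1:.2f}, P(H2) = {p_h2:.2f}")

# Because p_rest can never exceed 1, the product can never exceed p_h2.
grid = [i / 10 for i in range(11)]
never_exceeds = all(
    prior_all_swans_white(a, b) <= a for a in grid for b in grid
)
```

Whatever values you plug in, the broader hypothesis can only match the narrower one when the prior for the non-European generalisation is exactly 1.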

So, what happens when we go out and find some swans? In particular, let us examine the case where the only samples available to us are European swans. We could run our argument through for a sample of N swans, but let’s just do it for 1 swan for simplicity. Our new evidence is thus

$D = S(a)\,E(a)\,W(a)$, or “a is a swan” and “a is European” and “a is white”

Given this new data, the posterior odds of our hypotheses 1 vs 2, written $H_1$ and $H_2$ and conditioned on the primitive evidence $I$, are (from Bayes’ theorem)

$$\frac{P(H_1 \mid D, I)}{P(H_2 \mid D, I)} = \frac{P(D \mid H_1, I)}{P(D \mid H_2, I)} \cdot \frac{P(H_1 \mid I)}{P(H_2 \mid I)} = \frac{P(H_1 \mid I)}{P(H_2 \mid I)}$$

where the final equality follows because both hypotheses predict that the European swan we sampled should be white, with probability 1. So nuts, we learned nothing to discriminate these two hypotheses from each other. That is, even though we may believe a little more in the proposition that “all swans are white” than we did before (we could show this but I won’t do it now…), we still prefer “all European swans are white” by exactly as much as we did before. This holds no matter how many white European swans we are able to find.

The lesson: to believe in more grandly sweeping hypotheses, we need more broadly sampled data! Looking at the same sort of thing forever doesn’t help us learn anything; we must go forth into the unknown! And a-priori, the more broad the generalisation, the less likely that it holds.
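Here is a rough numerical check of both the odds argument and the lesson, as a sketch under stated assumptions rather than anything from Keynes. I split the hypothesis space into four mutually exclusive cases (European swans all white or not, non-European swans all white or not), and I assume that in a region where *not* all swans are white, a sampled swan is white with some fixed probability `Q_WHITE`. That value, the priors, and all the function names are invented for illustration.

```python
from itertools import product

# Assumption: where a region's swans are NOT all white, each sampled swan
# from that region is still white with this (made-up) probability.
Q_WHITE = 0.5

def make_prior(p_eur, p_non):
    """Independent prior over the four cases (eur_all_white, non_all_white)."""
    prior = {}
    for eur, non in product([True, False], repeat=2):
        prior[(eur, non)] = (p_eur if eur else 1 - p_eur) * (p_non if non else 1 - p_non)
    return prior

def update(belief, region):
    """Bayes update after seeing one white swan from region 'eur' or 'non'."""
    idx = 0 if region == "eur" else 1
    unnorm = {h: p * (1.0 if h[idx] else Q_WHITE) for h, p in belief.items()}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

def odds_h1_vs_h2(belief):
    """Odds of "all swans are white" vs "all European swans are white"."""
    p_h1 = belief[(True, True)]
    p_h2 = belief[(True, True)] + belief[(True, False)]
    return p_h1 / p_h2

belief = make_prior(0.3, 0.4)
prior_p_h1 = belief[(True, True)]
prior_odds = odds_h1_vs_h2(belief)

# Ten white European swans: H1 becomes more credible, but the odds do not move.
for _ in range(10):
    belief = update(belief, "eur")
p_h1_after_european = belief[(True, True)]
odds_after_european = odds_h1_vs_h2(belief)

# Ten white non-European swans: now the odds shift towards H1.
for _ in range(10):
    belief = update(belief, "non")
odds_after_non_european = odds_h1_vs_h2(belief)

print(prior_odds, odds_after_european, odds_after_non_european)
```

In this toy model the European samples raise the probability of “all swans are white” while leaving its odds against “all European swans are white” where they started; only the non-European samples push those odds towards 1.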

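(Final aside – I was a little sneaky in the way I used Bayes’ theorem here. If you recall, there is the “global evidence” piece, e.g. the term $P(D \mid I)$ in

$$P(H_i \mid D, I) = \frac{P(D \mid H_i, I)\, P(H_i \mid I)}{P(D \mid I)}$$

which is calculated as

$$P(D \mid I) = \sum_i P(D \mid H_i, I)\, P(H_i \mid I)$$

, or just, leaving the primitive evidence implicit,

$$P(D) = \sum_i P(D \mid H_i)\, P(H_i)$$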

This generally assumes our hypothesis space consists of mutually exclusive hypotheses, which is clearly not the case in my example (since if all swans are white, then certainly all European swans are also white). I silently divided out this term. However, I think this is fine, because P(D) is indeed independent of any hypothesis, and in each application of Bayes’ theorem we could divide the hypothesis space up into mutually exclusive portions with no problem (it would just be a different division in each case). We can thus compute posterior odds just fine; we just have to keep in mind that they mean something a little different to usual, i.e. that both hypotheses may simultaneously be correct.

Final final aside 😉 – There is one important argument to be made against the proposition that “more sweeping generalisations are a-priori less probable”, and that arises when you consider *how else* you might expect your extended sample set to behave. It may very well be the case that “All swans are white” is a-priori MORE probable than “All European swans are white, and all other swans are pink”. Nevertheless, in both cases “All European swans are white” is true, so this is a bit of a different argument to the one I am making.)