Jeffreys’ “solution” to the problem of induction


So I have a thing I discovered recently that I thought I might run past those of you who are interested. It involves the so-called problem of induction, which keeps me up at night from time to time, seeing as how it threatens the very foundations of empiricism and leaves behind a gulf of chaos and uncertainty, babies having babies and so on.

Anyway, I have been reading the great Harold Jeffreys’ book on probability theory, the aptly titled “Theory of Probability”, and in section 1.6 of the first chapter, “Fundamental Notions”, he produces an interesting exposition of the Bayesian answer to the problem of induction. It is not really a “proper” answer I think, since there are of course still various assumptions involving the validity of probability theory, but if you can accept those then it provides some semblance of hope, and is kind of interesting in its own right. So for your interest and comments, I shall reproduce it for you here.

Bayes’ theorem revision

Let us suppose we have some hypothesis q, available information I, and a series of experimental facts p_1...p_n which are inevitable consequences of q (i.e. if q is true then we must observe p_1...p_n, although of course observing any individual p_i does not tell us that q is true, merely that it has not been disproven).

By Bayes’ theorem we have

\displaystyle   P(q|p_1,I)=\frac{P(p_1|q,I)P(q|I)}{P(p_1|I)}

where P(q|I) is the probability (read: degree of belief) that a rational agent should assign to the proposition that q is true given the information I (in some sense; people like to argue that “every model is wrong”, but we can more carefully formulate what we mean by the proposition “q is true” if we want. For now the simple version of the statement will suit us), P(p_1|q,I) is the probability that p_1 is true (where you can read “p_1 is true” as meaning that some piece of experimental data is obtained/observed in some apparatus, or otherwise learned) given that q is true (and we know the background information I). This probability is 1, since we stated that p_1...p_n were inevitable consequences of q, so if q is true there is no chance we will not observe p_1...p_n. This is just to make the rest of the argument simpler. We are not finished with our definitions though: P(p_1|I) is the probability that p_1 is true (would be observed) regardless of whether or not q is true (this one is pretty hard to nail down, since it can only be computed by considering ALL hypotheses that might explain p_1. We are making a theoretical argument, however, so this is not too big a problem right now) and finally P(q|p_1,I) is the probability we assign to q being true, once we have observed p_1 (and considering that we already know I).

So the point is that via Bayes’ theorem we are obtaining a more accurate idea of how probable it is that q is true by taking into account that we have observed something that it predicted, i.e. p_1. Since we said P(p_1|q,I)=1, and since P(p_1|I)\le 1 (with suitable restriction on exactly what p_1 is), then observing p_1 can only make q more probable, although if there remain a lot of other hypotheses which also predict that we should see p_1 then q may only become a tiny tiny bit more probable (since that would mean P(p_1|I)\approx1. Basically q will only become a lot more probable if observing p_1 kills off most of the competing hypotheses).
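To make the single update concrete, here is a minimal sketch with made-up numbers. The rival hypotheses "r" and "s", their priors, and the helper `update` are all my own illustrative assumptions, not anything from Jeffreys. It just shows that q gains a lot when p_1 kills off a well-supported rival, and gains nothing when every rival predicts p_1 too.

```python
from fractions import Fraction

def update(priors, likelihoods):
    """One Bayes update over a finite set of hypotheses.

    Returns (posteriors, evidence), where evidence plays the role of P(p_1|I).
    """
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}
    return posteriors, evidence

# Hypothetical priors P(h|I) over three rival hypotheses (illustrative only).
priors = {"q": Fraction(1, 10), "r": Fraction(3, 10), "s": Fraction(6, 10)}

# Case A: p_1 is entailed by q and r, but contradicted by s.
post_a, ev_a = update(priors, {"q": 1, "r": 1, "s": 0})

# Case B: every rival also entails p_1, so P(p_1|I) = 1 and nothing is learned.
post_b, ev_b = update(priors, {"q": 1, "r": 1, "s": 1})

print("case A:", ev_a, post_a["q"])
print("case B:", ev_b, post_b["q"])
```

In case A the observation removes s, so the probability of q is divided by P(p_1|I) < 1 and goes up; in case B it is divided by 1 and stays put.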

The argument

So, here is what Jeffreys says. Recall that we know P(p_1|q,I)=1, so that

\displaystyle   P(q|p_1,I)=\frac{P(q|I)}{P(p_1|I)}

If we observe the further consequences of q, p_2...p_n, to also be true then this extends successively to

\displaystyle   P(q|p_1,...,p_n,I)=\frac{P(q|I)}{P(p_1,...,p_n|I)}

or expanding:

\displaystyle   P(q|p_1,...,p_n,I)=\frac{P(q|I)}{P(p_1|I)P(p_2|p_1,I)\cdots P(p_n|p_1,...,p_{n-1},I)}

so that each subsequent verification divides the probability that q is true by the probability of that verification being observed, given the previous information. So, he argues, one of three things must happen as more and more verifications are obtained:
(1) The probability of q on the information available eventually exceeds 1.
(2) It is always zero.
(3) P(q|p_1,...,p_n,I) tends to 1.
(1) is impossible due to the axioms of probability theory, and (2) would mean that q can never reach a positive probability. However, (3) means that repeated verifications of a hypothesis will make it practically certain that the next consequences will also be verified, and this, he argues, is the basis for our pragmatic confidence in inductive inferences.
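Here is a small sketch of case (3) in a toy model of my own choosing (it is an assumption for illustration, not Jeffreys' example): q predicts every p_i with certainty, and the only rival says verifications occur with some unknown rate that has a uniform prior, so that n verifications in a row have probability 1/(n+1) under the rival. The names `evidence`, `posterior_q` and `predictive` are hypothetical helpers.

```python
from fractions import Fraction

def evidence(n, prior_q):
    """P(p_1,...,p_n | I) in the toy model.

    q gives the data probability 1; the rival (uniform prior on an unknown
    success rate) gives n successes in a row probability 1/(n+1).
    """
    return prior_q + (1 - prior_q) * Fraction(1, n + 1)

def posterior_q(n, prior_q):
    """P(q | p_1,...,p_n, I) = P(q|I) / P(p_1,...,p_n | I)."""
    return prior_q / evidence(n, prior_q)

def predictive(n, prior_q):
    """P(p_n | p_1,...,p_{n-1}, I), the factor the n-th verification divides by."""
    return evidence(n, prior_q) / evidence(n - 1, prior_q)

prior_q = Fraction(1, 100)  # illustrative prior for q
for n in (1, 10, 100, 1000, 10000):
    print(n, float(posterior_q(n, prior_q)), float(predictive(n, prior_q)))
```

The posterior of q climbs towards 1 while each successive divisor P(p_n|p_1,...,p_{n-1},I) climbs towards 1 as well, which is exactly what keeps the product from pushing the probability past 1.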

Even more interesting…

So the previous proof is sort of obvious, and Bayesian inference would be pretty useless if it were not the case. However, Jeffreys next comes out with another derivation which he says was pointed out by V. S. Huzurbazar in 1955. Jeffreys does not say so explicitly, but it seems to me that it covers an interesting case which the three options above for the limiting behaviour of P(q|p_1,...,p_n,I) seem to ignore, namely that P(q|p_1,...,p_n,I) converges to some value other than 1, let us call it \alpha. Even if \alpha is incredibly tiny, it appears that the mere existence of the convergence would allow us to have extreme confidence in our inductions anyway!

So, we have the following properties:

\displaystyle  P(q|p_1,...,p_m,I) \to \alpha as m\to\infty

(so that P(q|p_1,...,p_m,I) \le \alpha for any finite m, since each verification can only raise the probability of q), and let us choose an n such that

\displaystyle  P(q|p_1,...,p_n,I) > \alpha-\epsilon,

where \epsilon is an arbitrarily small positive number, so that after n verifications the probability of q has almost converged to \alpha (remember that \alpha may, however, be very small). We want to know what can be said about the probability of future verifications, p_{n+1}...p_m with m > n, given that the probability of q has become basically as large as it possibly can, given the type of verification p_i that it is possible for us to make.

So, dividing these two probabilities by each other:

\displaystyle   \frac{P(q|p_1,...,p_n,I)}{P(q|p_1,...,p_m,I)}>\frac{\alpha-\epsilon}{\alpha}

which, applying Bayes’ theorem to the top and bottom, gives:

\displaystyle   \frac{P(q|I)/P(p_1,...,p_n|I)}  {P(q|I)/P(p_1,...,p_m|I)}  >\frac{\alpha-\epsilon}{\alpha}

so, somewhat miraculously, the pieces involving q cancel out and we are left with (after a little simplification)

\displaystyle   P(p_{n+1},...,p_m|p_1,...,p_n,I)  >\frac{\alpha-\epsilon}{\alpha}

which is arbitrarily close to 1. This is amazing! So we can be almost totally sure that the future data p_{n+1}...p_m will be observed, even though our belief that q is true is only \alpha!
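For anyone who wants the “little simplification” spelled out, here is one way to write it. It uses only the product rule on the joint probability of the data, and it assumes P(q|I) > 0 so that the cancellation is allowed:

\displaystyle   \frac{P(q|I)/P(p_1,...,p_n|I)}{P(q|I)/P(p_1,...,p_m|I)} = \frac{P(p_1,...,p_m|I)}{P(p_1,...,p_n|I)} = \frac{P(p_1,...,p_n|I)\,P(p_{n+1},...,p_m|p_1,...,p_n,I)}{P(p_1,...,p_n|I)} = P(p_{n+1},...,p_m|p_1,...,p_n,I)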

So what is going on here? How can we be so sure that we will see this future data, even though we are apparently not so sure about what is causing it (q), or even whether or not there are other explanations for it that we have not yet thought of?

The answer is that we have collected so much of this kind of data (p_1...p_n) that ALL hypotheses that are not so far excluded by it must ALSO predict the same further data p_{n+1}...p_m. We don’t know which of the remaining hypotheses is correct (and perhaps q is only one of very very many, and perhaps is ugly theoretically and so we think it less plausible than some of the alternatives), yet we have collected so much of this kind of data that we know all possible hypotheses must make the same predictions for this data, so as far as prediction goes it doesn’t matter which one is correct.
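A toy illustration of this, again with numbers and hypotheses I have invented purely for the sketch: q and a rival "q_rival" both entail every p_i, while a third hypothesis "r" treats verifications as occurring at an unknown rate with a uniform prior (so n in a row has probability 1/(n+1)). Then the probability of q can never rise above \alpha = P(q|I)/(P(q|I)+P(q_rival|I)), yet the future data become almost certain.

```python
from fractions import Fraction

# Hypothetical priors (illustrative only). q and q_rival both entail every p_i.
PRIORS = {"q": Fraction(1, 100), "q_rival": Fraction(4, 100), "r": Fraction(95, 100)}

def evidence(n):
    """P(p_1,...,p_n | I): certain under q and q_rival, 1/(n+1) under r."""
    sure = PRIORS["q"] + PRIORS["q_rival"]
    return sure + PRIORS["r"] * Fraction(1, n + 1)

def posterior_q(n):
    """P(q | p_1,...,p_n, I)."""
    return PRIORS["q"] / evidence(n)

def future_block(n, m):
    """P(p_{n+1},...,p_m | p_1,...,p_n, I) for m > n."""
    return evidence(m) / evidence(n)

# Limiting value of posterior_q(n) as n grows: here 1/5.
alpha = PRIORS["q"] / (PRIORS["q"] + PRIORS["q_rival"])

n, m = 10_000, 1_000_000
# Tightest form of the (alpha - epsilon)/alpha lower bound at this n.
bound = posterior_q(n) / alpha

print(float(alpha), float(posterior_q(n)), float(future_block(n, m)), float(bound))
```

The probability of q stalls just below 1/5 because the data cannot separate it from q_rival, but the probability of the whole block of future verifications stays above the bound and very close to 1, since r is the only hypothesis that could have failed them and it has already lost nearly all of its probability.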

Everyone’s a winner!

This is how we could use Newton’s laws for so long and have no problem. For hundreds of years the only data that it was possible to collect supported Newton, and even if modern theories like general relativity or quantum mechanics had been known, they would not have made any different predictions for the sorts of experiments that people were doing. So it wouldn’t matter which theory one believed in, we’d all be predicting the same things and getting our predictions correct.

It is only once new types of experiments become possible, due to perhaps technology or a new theory which provides insight into some new class of experiments which nobody thought to do before, that we “break out” of this pseudo-convergent phase and begin learning things again. However, we do not need to worry that some new theory may come along and suddenly we won’t be able to use Newton’s laws to build bridges and such anymore, because we know how that sort of data works so well that any new theory that comes along will almost certainly make the exact same predictions for that class of experiments.


So I took a few liberties in the last paragraph, namely in assuming that Jeffreys’ arguments extend to the case where P(p_i|q,I)\ne 1, so that our data is probabilistic in origin. I cannot see any reason why they would not, perhaps with some extra conditions required, but I have not checked. It would also be interesting to make a more formal argument about p_1...p_m being in a certain “class” of verifications, which lead to some temporary convergence of the probability of q, while being explicit about the possible existence of other verifications which might break this convergence. I alluded to the existence of such situations, but didn’t make it explicit. I have not actually seen anyone demonstrate such a thing, but it seems “obvious” to me that it must be so. Maybe I will try and formalise it sometime.

Also, the faith in induction that this argument brings doesn’t really extend to the extreme cases, such as the idea that we live in the matrix and any moment Morpheus might turn up and give you the red pill, or that the vacuum is only metastable and could decay at any moment, or that the demon controlling your consciousness gets bored of his game and throws your soul into an eternal void, all of which plunge our precious empiricism into ruin. But at least it assures us that so long as the universe continues running in vaguely the same condition as it currently is, there are no new hypotheses that we might suddenly think of which are going to suddenly prevent us from being able to build computers or aeroplanes or something.

Which again is obvious since probability theory is a theory of knowledge and information, and it would make no sense if learning new things somehow changed the nature of the external world.

So I am not really sure what we learned after all that. It seemed like a nice result to me though, and I think it means something. I guess the point is that, indeed, we can make sense of the world with only limited theories and limited experimental data, and make accurate predictions of phenomena even if we don’t have the “right” underlying theory to explain them. And perhaps it makes somewhat precise what we mean when we say that new theories must reproduce all the successes of the old.

edit: I guess something I should also have talked about is how this relates to the old “I have only seen white swans previously; therefore, all swans are white” inductive inference. It seems that in this case, the probability of the hypothesis that “all swans are white” need not converge to 1 as more and more white swans are observed; rather, say it tends just to \alpha. At this point, we can remain extremely sure that if we continue to observe swans under the same conditions under which all previous swans were observed, then we will continue to find that they are all white. The “all swans are white” hypothesis is then some kind of baseline hypothesis that we don’t know to be true, but which we know to make true predictions about those data we are able to collect. If we stumble upon a new theory, that black swans do indeed exist, not in Europe but in Australia, then it is logically required to predict white swans in Europe as our previous hypothesis did. So until it becomes possible to go to Australia and observe black swans, our predicted observations are the same whichever theory we most believe.


4 thoughts on “Jeffreys’ “solution” to the problem of induction”

  1. Thanks Ben, I like it 🙂

    One question though.. what is the ‘argument’ that the fourth equation should go to 0, infinity or 1.. why not some other fraction of 1?

    • I already talked to Jay about it, but for the benefit of others who might happen to read this, read the section after that bit :p.

      Although I will mention that it can’t ‘go’ to zero, it will just stay at zero if it starts there. No amount of verification of its consequences can make you believe a hypothesis you set to zero probability to start with. The ‘infinity’ possibility was just a naive one since extra verifications can only make the probability go up (for this situation where all likelihoods are either 1 or 0 that we are examining), and at first glance perhaps it might not be obvious that it cannot be unbounded. But yes, convergence to some fraction of 1 appears to be totally possible, at very least temporarily.

