**Introduction**

So I have a thing I discovered recently that I thought I might run past those of you who are interested. It involves the so-called problem of induction, which keeps me up at night from time to time, seeing as how it threatens the very foundations of empiricism and leaves behind a gulf of chaos and uncertainty, babies having babies and so on.

Anyway, I have been reading the great Harold Jeffreys’ book on probability theory, the aptly titled “Theory of Probability”, and in section 1.6 of the first chapter, “Fundamental Notions”, he produces an interesting exposition of the Bayesian answer to the problem of induction. It is not really a “proper” answer I think, since there are of course still various assumptions involving the validity of probability theory, but if you can accept those then it provides some semblance of hope, and is kind of interesting in its own right. So for your interest and comments, I shall reproduce it for you here.

**Bayes’ theorem revision**

Let us suppose we have some hypothesis $q$, and available information $H$, and some series of experimental facts $p_1, p_2, \ldots, p_n$, which are inevitable consequences of $q$ (i.e. if $q$ is true then it implies that we must observe $p_1, p_2, \ldots, p_n$, although of course observing any individual $p_i$ does not tell us that $q$ is true, merely that it is not disproven).

By Bayes’ theorem we have

$$P(q \mid p_1, H) = \frac{P(p_1 \mid q, H)\, P(q \mid H)}{P(p_1 \mid H)},$$

where $P(q \mid H)$ is the probability (read: degree of belief) that a rational agent should assign to the proposition that $q$ is true given the information $H$ (in some sense; people like to argue that “every model is wrong”, but we can more carefully formulate what we mean by the proposition “$q$ is true” if we want. For now the simple version of the statement will suit us), and $P(p_1 \mid q, H)$ is the probability that $p_1$ is true (where you can read “$p_1$ is true” as meaning that some piece of experimental data is obtained/observed in some apparatus, or otherwise learned) given that $q$ is true (and we know the background information $H$). This probability is 1, since we stated that the $p_i$ were inevitable consequences of $q$, so if $q$ is true there is no chance we will not observe $p_1$. This is just to make the rest of the argument simpler. We are not finished with our definitions though: $P(p_1 \mid H)$ is the probability that $p_1$ is true (would be observed) regardless of whether or not $q$ is true (this one is pretty hard to nail down, since it can only be computed by considering ALL hypotheses that might explain $p_1$. We are making a theoretical argument, however, so this is not too big a problem right now) and finally $P(q \mid p_1, H)$ is the probability we assign to $q$ being true, once we have observed $p_1$ (and considering that we already know $H$).

So the point is that via Bayes’ theorem we are obtaining a more accurate idea of how probable it is that $q$ is true by taking into account that we have observed something that it predicted, i.e. $p_1$. Since we said $P(p_1 \mid q, H) = 1$, and since $P(p_1 \mid H) \le 1$ (with suitable restriction on exactly what $H$ is), observing $p_1$ can only make $q$ *more* probable, although if there remain a lot of other hypotheses which also predict that we should see $p_1$ then $q$ may only become a tiny tiny bit more probable (since that would mean $P(p_1 \mid H) \approx 1$). Basically $q$ will only become a *lot* more probable if observing $p_1$ kills off most of the competing hypotheses.
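To put some numbers on this, here is a minimal Python sketch of a single update. It assumes a toy setup that is not in Jeffreys: a finite list of rival hypotheses, each of which either predicts the datum with certainty or forbids it outright. The function name, the variable names and the priors are all made up for illustration.

```python
def posterior_after_verification(prior_q, rival_priors, rival_predicts_datum):
    """Posterior of q after observing a datum that q predicts with certainty.

    rival_priors: prior probabilities of the competing hypotheses.
    rival_predicts_datum: for each rival, True if it also predicts the datum
    with certainty, False if it forbids it (likelihood 0). Toy assumption.
    """
    total = prior_q + sum(rival_priors)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1")
    # Probability of the datum on background information alone:
    # the total prior mass of the hypotheses that predict it.
    p_datum = prior_q + sum(
        prior
        for prior, predicts in zip(rival_priors, rival_predicts_datum)
        if predicts
    )
    # Bayes' theorem with likelihood 1 for q.
    return prior_q / p_datum


# Hypothetical numbers: q and nine rivals, each with prior 0.1.
rivals = [0.1] * 9
# Eight of the nine rivals also predict the datum, so little is learned.
post_many_survivors = posterior_after_verification(0.1, rivals, [True] * 8 + [False])
# Only one rival predicts the datum, so seeing it refutes most of the competition.
post_few_survivors = posterior_after_verification(0.1, rivals, [True] + [False] * 8)
print(round(post_many_survivors, 3), round(post_few_survivors, 3))
```

With these made-up priors the hypothesis creeps from 0.1 to about 0.11 when most rivals survive the observation, but jumps to 0.5 when the observation refutes all but one of them.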

**The argument**

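So, here is what Jeffreys says. Recall that we know $P(p_1 \mid q, H) = 1$, so that

$$P(q \mid p_1, H) = \frac{P(q \mid H)}{P(p_1 \mid H)}.$$

If we observe the further consequences of $q$, namely $p_2, \ldots, p_n$, to also be true then this extends successively to

$$P(q \mid p_1, \ldots, p_n, H) = \frac{P(q \mid p_1, \ldots, p_{n-1}, H)}{P(p_n \mid p_1, \ldots, p_{n-1}, H)},$$

or expanding:

$$P(q \mid p_1, \ldots, p_n, H) = \frac{P(q \mid H)}{P(p_1 \mid H)\, P(p_2 \mid p_1, H) \cdots P(p_n \mid p_1, \ldots, p_{n-1}, H)},$$

so that each subsequent verification divides the probability that $q$ is true by the probability that the verification be observed, given previous information. So, he argues, one of three things must happen as more and more verifications are obtained: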

(1) The probability of $q$ on the information available eventually exceeds 1.

(2) It is always zero.

(3) $P(p_n \mid p_1, \ldots, p_{n-1}, H)$ tends to 1.

(1) is impossible due to the axioms of probability theory, and (2) would mean that $q$ can never reach a positive probability. However, (3) means that repeated verifications of a hypothesis will make it practically certain that the next consequences will also be verified, and this, he argues, is the basis for our pragmatic confidence in inductive inferences.
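Option (3) can be watched happening in a toy model. The sketch below is only an illustration under assumptions of my own, not Jeffreys’ construction: rival number k (k = 0, 1, 2, ...) agrees with the hypothesis on the first k verifications and fails the next one, and the rivals share the remaining prior mass geometrically, so the mass of rivals still alive after n verifications is `(1 - prior_q) * rival_decay ** n`. The names `verification_history`, `prior_q` and `rival_decay` are hypothetical.

```python
def verification_history(prior_q, rival_decay, n_steps):
    """Return (n, posterior of q, predictive probability of the n-th verification).

    Toy assumption: the prior mass of rivals that survive n verifications
    is (1 - prior_q) * rival_decay ** n.
    """
    history = []
    posterior = prior_q
    for n in range(1, n_steps + 1):
        # Probability of the first n-1 (resp. n) verifications on background
        # information alone: mass of every hypothesis that predicts them.
        evidence_before = prior_q + (1.0 - prior_q) * rival_decay ** (n - 1)
        evidence_after = prior_q + (1.0 - prior_q) * rival_decay ** n
        # Probability of the n-th verification given the previous ones.
        predictive = evidence_after / evidence_before
        # Each verification divides the posterior by that predictive probability.
        posterior = posterior / predictive
        history.append((n, posterior, predictive))
    return history


history = verification_history(prior_q=0.01, rival_decay=0.9, n_steps=200)
for n, posterior, predictive in history[::40]:
    print(n, round(posterior, 6), round(predictive, 6))
```

In this particular model every rival is eventually refuted, so the posterior itself also climbs towards 1. The point to notice is that the predictive probability of the next verification approaches 1 as well, which is option (3).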

**Even more interesting…**

So the previous proof is sort of obvious, and Bayesian inference would be pretty useless if it was not the case. However, Jeffreys next comes out with another derivation which he says was pointed out by V. S. Huzurbazar in 1955. Jeffreys does not say so explicitly, but it seems to me that it covers an interesting case which the three options above for the limiting behaviour of $P(q \mid p_1, \ldots, p_n, H)$ seem to ignore, namely that $P(q \mid p_1, \ldots, p_n, H)$ converges to some value other than 1, let us call it $\alpha$. Even if $\alpha$ is incredibly tiny, it appears that the mere existence of the convergence would allow us to have extreme confidence in our inductions anyway!

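So, we have the following properties:

$$P(q \mid p_1, \ldots, p_n, H) \to \alpha \quad \text{as} \quad n \to \infty$$

(so that $P(q \mid p_1, \ldots, p_n, H) \le \alpha$ for any finite $n$), and let us choose an $n$ such that

$$P(q \mid p_1, \ldots, p_n, H) = \alpha - \epsilon,$$

where $\epsilon$ is infinitesimal, so that after $n$ verifications the probability of $q$ has almost converged to $\alpha$ (remember that $\alpha$ may, however, be very small). We want to know what can be said about the probability of future verifications, $p_{n+1}, \ldots, p_{n+m}$, given that the probability of $q$ has become basically as large as it possibly can, given the type of verification that it is possible for us to make.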

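So, dividing these two probabilities by each other, and remembering that the probability of $q$ can never exceed $\alpha$ however many more verifications we collect:

$$\frac{P(q \mid p_1, \ldots, p_n, H)}{P(q \mid p_1, \ldots, p_{n+m}, H)} \ge \frac{\alpha - \epsilon}{\alpha},$$

which, applying Bayes’ theorem to the top and bottom, gives:

$$\frac{P(q \mid H) / P(p_1, \ldots, p_n \mid H)}{P(q \mid H) / P(p_1, \ldots, p_{n+m} \mid H)} \ge \frac{\alpha - \epsilon}{\alpha},$$

so, somewhat miraculously, the pieces involving $q$ cancel out and we are left with (after a little simplification)

$$P(p_{n+1}, \ldots, p_{n+m} \mid p_1, \ldots, p_n, H) \ge 1 - \frac{\epsilon}{\alpha},$$

which is arbitrarily close to 1. This is amazing! So we can be almost totally sure that the future data will be observed, even though our belief that $q$ is true is only $\alpha - \epsilon$!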
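Here is a small numerical check of this, again as a hedged sketch rather than anything from Jeffreys or Huzurbazar. It assumes a toy model of my own in which the hypothesis has an irrefutable "twin" rival that makes identical predictions for this whole class of data, plus a crowd of refutable rivals whose surviving prior mass shrinks geometrically. The hypothesis can then never rise above `prior_q / (prior_q + prior_twin)`. All names and numbers are hypothetical.

```python
PRIOR_Q = 0.001      # prior of the hypothesis of interest
PRIOR_TWIN = 0.099   # prior of a rival that predicts exactly the same data
RIVAL_DECAY = 0.9    # surviving mass of refutable rivals shrinks by this factor


def evidence(n):
    """Probability of the first n verifications on background information alone."""
    refutable_mass = 1.0 - PRIOR_Q - PRIOR_TWIN
    return PRIOR_Q + PRIOR_TWIN + refutable_mass * RIVAL_DECAY ** n


def posterior_q(n):
    """Probability of the hypothesis after n verifications."""
    return PRIOR_Q / evidence(n)


def predictive_block(n, m):
    """Probability that the next m verifications all succeed, given the first n."""
    return evidence(n + m) / evidence(n)


alpha = PRIOR_Q / (PRIOR_Q + PRIOR_TWIN)   # limiting value of the posterior
n, m = 150, 50
epsilon = alpha - posterior_q(n)           # how far we still are from the limit
lower_bound = (alpha - epsilon) / alpha    # the bound derived in the text
print(alpha, posterior_q(n), predictive_block(n, m), lower_bound)
```

In this model the posterior stalls at about 0.01, yet the probability that the next fifty verifications all succeed sits just above the lower bound and within a few parts in a million of 1.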

So what is going on here? How can we be so sure that we will see this future data, even though we are apparently not so sure about what is causing it ($q$), or even whether or not there are other explanations for it that we have not yet thought of?

The answer is that we have collected so much of this kind of data ($p_1, \ldots, p_n$) that ALL hypotheses that are not so far excluded by it must ALSO predict the same further data $p_{n+1}, \ldots, p_{n+m}$. We don’t know which of the remaining hypotheses is correct (and perhaps $q$ is only one of very very many, and perhaps $q$ is ugly theoretically and so we think it less plausible than some of the alternatives), yet we have collected so much of this kind of data that we know all possible hypotheses must make the same predictions for this data, so as far as prediction goes it doesn’t matter which one is correct.

**Everyone’s a winner!**

This is how we could use Newton’s laws for so long and have no problem. For hundreds of years the only data that it was possible to collect supported Newton, and even if modern theories like general relativity or quantum mechanics had been known, they would not have made any different predictions for the sorts of experiments that people were doing. So it wouldn’t matter which theory one believed in, we’d all be predicting the same things and getting our predictions correct.

It is only once new types of experiments become possible, due to perhaps technology or a new theory which provides insight into some new class of experiments which nobody thought to do before, that we “break out” of this pseudo-convergent phase and begin learning things again. However, we do not need to worry that some new theory may come along and suddenly we won’t be able to use Newton’s laws to build bridges and such anymore, because we know how that sort of data works so well that any new theory that comes along will almost certainly make the exact same predictions for that class of experiments.

**Conclusions**

So I took a few liberties in the last paragraph, namely in assuming that Jeffreys’ arguments extend to the case where $P(p_i \mid q, H) < 1$, so that our data is probabilistic in origin. I cannot see any reason why they would not, perhaps with some extra conditions required, but I have not checked. It would also be interesting to make a more formal argument about being in a certain “class” of verifications, which lead to some temporary convergence of the probability of $q$, while being explicit about the possible existence of other verifications which might break this convergence. I alluded to such situations existing, but didn’t make it explicit. I have not actually seen anyone demonstrate such a thing, but it seems “obvious” to me that it must be so. Maybe I will try and formalise it sometime.

Also, the faith in induction that this argument brings doesn’t really extend to the extreme cases, such as the idea that we live in the Matrix and any moment Morpheus might turn up and give you the red pill, or that the vacuum is only metastable and could decay at any moment, or that the demon controlling your consciousness gets bored of his game and throws your soul into an eternal void, all of which plunge our precious empiricism into ruin. But at least it assures us that so long as the universe continues running in vaguely the same condition as it currently is, there are no new hypotheses we might suddenly think of which are going to prevent us from being able to build computers or aeroplanes or something.

Which again is obvious since probability theory is a theory of knowledge and information, and it would make no sense if learning new things somehow changed the nature of the external world.

So I am not really sure what we learned after all that. It seemed like a nice result to me though, and I think it means something. I guess the point is that, indeed, we can make sense of the world with only limited theories and limited experimental data, and make accurate predictions of phenomena even if we don’t have the “right” underlying theory to explain them. And perhaps it makes somewhat precise what we mean when we say that new theories must reproduce all the successes of the old.

edit: I guess something I should also have talked about is how this relates to the old “I have only seen white swans previously; therefore, all swans are white” inductive inference. It seems that in this case, the probability of the hypothesis that “all swans are white” need not converge to 1 as more and more white swans are observed; rather, say it tends just to some value $\alpha$. At this point, we can remain extremely sure that if we continue to observe swans under the same conditions that all previous swans were observed, then we will continue to find that they are all white. The “all swans are white” hypothesis is then some kind of baseline hypothesis that we don’t know to be true, but which we know to make true predictions about those data we are able to collect. If we stumble upon a new theory, that black swans do indeed exist, not in Europe but in Australia, then it is logically required to predict white swans in Europe as our previous hypothesis did. So until it becomes possible to go to Australia and observe black swans, our predicted observations are the same whichever theory we most believe.
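The swan story can be played out numerically. The sketch below is a toy with three made-up hypotheses and made-up priors. Note that the third hypothesis gives white swans a probability of 0.5, which steps outside the strictly all-or-nothing likelihoods used in the argument above, so treat it as an illustrative assumption and not as part of Jeffreys’ result.

```python
# Probability that a swan is white under each hypothesis, by location.
# All three hypotheses and their numbers are illustrative assumptions.
WHITE_PROB = {
    "all swans are white": {"Europe": 1.0, "Australia": 1.0},
    "white in Europe, black in Australia": {"Europe": 1.0, "Australia": 0.0},
    "half of all swans are black": {"Europe": 0.5, "Australia": 0.5},
}


def update_on_white_swans(priors, location, count):
    """Bayes update after seeing `count` white swans, all at `location`."""
    unnormalised = {
        name: prior * WHITE_PROB[name][location] ** count
        for name, prior in priors.items()
    }
    total = sum(unnormalised.values())
    return {name: weight / total for name, weight in unnormalised.items()}


def prob_next_white(beliefs, location):
    """Predictive probability that the next swan seen at `location` is white."""
    return sum(belief * WHITE_PROB[name][location] for name, belief in beliefs.items())


priors = {
    "all swans are white": 0.2,
    "white in Europe, black in Australia": 0.3,
    "half of all swans are black": 0.5,
}
beliefs = update_on_white_swans(priors, "Europe", 50)
print(beliefs["all swans are white"])
print(prob_next_white(beliefs, "Europe"), prob_next_white(beliefs, "Australia"))
```

After fifty white European swans, “all swans are white” has stalled at about 0.4 because the European data cannot separate it from the Australian alternative. The prediction for the next European swan is nonetheless essentially certain, while the prediction for an Australian swan is still only about 0.4.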

Thanks Ben, I like it 🙂

One question though.. what is the ‘argument’ that the fourth equation should go to 0, infinity or 1.. why not some other fraction of 1?

I already talked to Jay about it, but for the benefit of others who might happen to read this, read the section after that bit :p.

Although I will mention that it can’t ‘go’ to zero, it will just stay at zero if it starts there. No amount of verification of its consequences can make you believe a hypothesis you set to zero probability to start with. The ‘infinity’ possibility was just a naive one since extra verifications can only make the probability go up (for this situation where all likelihoods are either 1 or 0 that we are examining), and at first glance perhaps it might not be obvious that it cannot be unbounded. But yes, convergence to some fraction of 1 appears to be totally possible, at very least temporarily.
