You probably feel like you already know the answer to the title question, but how rigorously can you justify it? This question has annoyed me for some time now and I have never quite been able to elucidate the Bayesian answer, however I think I now have it; and of course it turns out to not be so complicated after all.
(edit: that said, this post has become very long since I take some time to get to my point, so you have my apologies for that)
(edit 2: I take it all back; I have not progressed much at all in answering this question. Oh well. Nevertheless I leave my explorations of it below for you to ponder.)
To give ourselves a concrete scenario in which to work, consider the following. We are in the undergraduate physics laboratories, and are doing some resistivity measurements on a doped Germanium sample. We heat the thing up in an oven and measure its resistivity as we go. There are all kinds of sources of uncertainty to consider, but we’ll forget about all that and assume some Gaussian uncertainties for our temperature and resistivity measurements. The following data is obtained (with arbitrary y units: pretend they are something meaningful):
What is your first instinct when you see this data? (nb. if you know things about semiconductors then pretend that you are an undergraduate who does not, for now ;))
Straight line? Ok, well let’s do the naïve thing and see how well that fits, using a test. Let’s assume the x error is small enough to ignore. Our test statistic is then
where is the measured resistivity at data point , is the standard deviation of resistivity measurement (which hopefully we have estimated well), and is the resistivity predicted at data point by our best fit straight line model (with parameters ). The statistical test is simply to calculate using the actually observed data and estimated uncertainties, and compute the probability of observing a value of this large or larger if our data indeed was generated from this idealised best-fit statistical model, i.e. assuming independent measurements, with Gaussian errors of the size we estimated, ignoring the temperature measurement uncertainty – in these circumstances is sampled from a distribution with degrees of freedom (k) equal to the number of data points we have, minus two for the two fitted parameters (although see here for some hairy issues to avoid in computing degrees of freedom). The probability in question is thus a p-value, computed by
(which is just what I described above, in equation form). For the above data this analysis gives us:
Best fit chi2 = 15.9860008544
Best fit chi2/dof = 1.14185720389
Best fit p-value = 0.314229940653
Number of "sigmas"= 1.00638596851
p=0.45 counts as zero evidence against the null hypothesis (straight line) in anyones book. I threw in the equivalent number of “sigmas” for the particle physicists out there (computed from , or if you prefer – where and are the inverse cumulative distribution functions of the standard normal and distributions respectively), which we see is a most reasonable .
Great! But how happy are we that this straight line is really a good model for the data? How much do we trust it, where do we expect it to break down, and why?
So first, interpolation. What are we really doing when we do this mentally? It would seem that we assess a wide variety of models that might explain the data, and make a prediction on the basis of some kind of consensus among the models. (This is of course related to my post about induction). In the Bayesian language we are computing
where is just the ordinary likelihood, if the data are statistically independent, and is the posterior probability of model given the already known data . Interestingly, we could also write this “global predictive distribution” as
where , similarly to before, except now is the prior for (dependent on prior information). Old-hat Bayesians will recognise as the “evidence” or “marginalised likelihood”, which we often toss away in hypothesis tests (very much the same as the more restricted evidence we might compute in a restricted subset of the hypothesis space, i.e. the parameter space of some model).
The likelihoods can be translated directly into goodness of fit measures such as , so it is clear that models must fit the old data before they contribute to our prediction for the new data. However, whacked out “wiggly” models such as I alluded to previously can do this just fine. For example, here are a selection of curves of the form
(i.e. straight lines with Gaussian bumps) which fit the data at the level:
From this image it is evident that, based on goodness of fit alone, we can justify almost any prediction (since these curves cover almost the whole plane, and they are just one of infinite alternate model classes), and there is vast room for horrible errors to occur in interpolation.
However, notice something interesting about the figure. As evidences by the increased overlap (darker blue) in the “bump free” region and in the “extrapolated” region, there are in some sense more curves that predict future data to be measured in these regions. If true this would mesh nicely with the above picture of prediction through model averaging, and seems a good candidate for a legitimate reason to favour the straight line models, and only in the “interpolation” region.
However, we now get to the tricky part, and it is the same tricky part we always encounter in the Bayesian game. Any measure we put on the global model space in order to define whether or not there are “more” curves predicting one thing than another equates directly to a prior, and the philosophical foundations for choosing priors remain shaky, particularly in the global hypothesis space (there are some convincing arguments for what to do given certain classes of models, however).
Nevertheless, if we forget about such troubles for the moment and restrict our allowed models to just these funny straight lines with bumps — and you allow me a prior that is Gaussian in and (centered on the best fit values for the line alone – say we had some good prior information which told us to search near these values), and also in (restricted to bumps which don’t go too far off the top of the plot); a flat prior in (forcing bumps to be vaguely in the temperature range of the plot); and logarithmically flat in , restricted to stop the bumps being too extremely thin or fat — then we may blindly charge ahead and see what inferences can be drawn.
So what do we get for the predictive distributions for future data? This:
The colours reflect the predictive probability , as a distribution over (the y axis, essentially), with there being a separate such distribution for each temperature we might choose to measure at. Clearly this does not predict that we will measure any of the narrow bumps which, in principle, could fit the data, since these predictions occupy too small a portion of the hypothesis space in question. One notices that the predictions get shakier in the extrapolated region, but we would get essentially the same thing if the “bump” part of the model was cut out entirely: it merely reflects the uncertainty in the gradient of the linear part of the model. I actually expected to see a bit more influence of the wider bumps that can fit in this extrapolated region, but I guess they too are overpowered by the straight line predictions that come from every other model whose bump is either small or just at a different region.
Now, there are of course infinitely many other models which might fit this data, and it is difficult to say how they should enter the model average and change this picture. We could cook up any crazy piecewise functions that change discontinuously in the regions where we have no data, and again all such models can fit just as well as a straight line. If we cook things up correctly then it seems plausible that the parts of ensuing parameter space where the fit is good may involve large volumes that are very different from the straight line, and so throw the predictive distribution right out. Yet, intuitively, it does seem like “cooking” is required; it seems like there is something “unnatural” about these models, something contrived; that we have to especially design them so as to make their predictive distributions different from straight lines.
Here is a second example: a 6th order polynomial. This has 7 parameters, and has a very complicated likelihood surface. In fact it is so complicated that I had a lot of trouble getting my search algorithms to converge. In the end I had to move away from the “standard form” parameterisation, etc, and move to a somewhat strange parameterisation:
where and were not parameters, but were constants chosen to be near the best fit parameters for the straight line model. This parameterisation thus looks at 6th order polynomials in terms of their deviation from this linear best fit, in some sense. It doesn’t cover the full space of 6th order polynomials because I only allowed real parameters, and some of the roots of the first term can easily be imaginary for polynomials in general. The mixing of the various parameters when you expand this expression into standard form might also be killing the independence of the parameters somewhat – I haven’t thought about it too carefully. These issues might wreck the exercise a little since it probably biases predictions towards the linear fit, but anyway, here is what we get (with linear priors on the roots, and a very wide Gaussian prior centred on zero for ):
We get a very nice “funnelling” of the good-fitting polynomials through the data, but again the predictions seem not to take us far from the linear model. There is somewhat more outwards “bleeding” of the probability than we saw in the Gaussian bump case, however, and keeping in mind that the colormap is linear in posterior probability (separately for each T value) this is not small enough to totally neglect.
So, is this the answer to our question then? There are just “more” possible ways that data can be extrapolated than interpolated? Certainly this seems to be true if we force a certain amount of smoothness onto our models, but is that “cheating”? Perhaps. However, it may be the case that the “pointy” models are also more “rare” in some sense, and so won’t contribute much to our posterior inferences even if we do allow them.
When I began this post I thought I had a better answer than this up my sleeve. I had planned to look at the influence of uncertainty in the x direction on our inferences, having had the idea that there might even be a “frequentist” sense in which the pointy models are less probable (since we would have had to be “unlucky” in some sense to have chosen our data samples in just such a way as to “miss” the patterns those pointy models would usually display). But I am not sure about that anymore, and I have been writing this post for about a week, when I intended to knock it off in a couple of hours. So I think I am just going to call it quits, and perhaps return to this topic at another time. I hope it has been at least somewhat interesting for those of you who for one reason or another read it all :).
Before I finish though, I feel that I cannot close this post without mentioning Solomonoff induction. I don’t have time to go into details, so I will point you to wikipedia wikipedia(which has a wishy-washy no-math article, but gives the gist of things), and this one from “less wrong”, which still has no math but describes the concept very well (although it is very lengthy). Perhaps I will come back and update this post to include better references later.
Anyway I mention it because I am not sure that we can fully answer the title question without heading down this road, of attacking the problem of induction itself and trying to understand how we can make any valid inferences at all in the absence of solid prior information. Sigh.
Still, perhaps the frequentist angle I mentioned can help somewhat. I will investigate it and get back to you if I think of anything.