LOST CAUSES IN STATISTICS II: Noninformative Priors

Just a link today, to a nice post I came across (from mid last year) on the blog of Larry Wasserman, who, if you are not familiar with him, is a professor in the Department of Statistics and also the Machine Learning Department at Carnegie Mellon University. He is rather well known in the statistics community.

Here he criticises somewhat harshly the whole project of looking for non-informative priors. As much as I, like many, feel the compulsion to wish for such things, I am inclined to agree with Larry that they simply don’t exist. Yet, as he also discusses, some of the formal constructions in this direction can still be useful in the right situation.

Knowledge vs Uncertainty

What probability should you assign to propositions about which, informally, you would prefer to throw your hands up in despair and say “I don’t know!”? This is a very difficult question in general, but there are certain situations in which a sensible answer can be reached without too much trouble.

Coin flips are always a nice way to understand probability, so imagine a coin. If it is just a generic coin, you probably have observed enough coin flips, and have enough intuitive understanding of physics, to justify to yourself that the coin is probably unbiased, that is, you are probably happy to say that the probability of “this coin will come up heads if I give it a good flip” is 0.5, or 50%. That is, assuming landing on the edge is negligible, you could trade the word “heads” for “tails” in that sentence, and your knowledge about the truth of the sentence would remain the same.

But consider now that you know nothing about coins; you are a small child, and you know nothing of physics or the properties of ordinary coins. Or perhaps you are in a dodgy gambling den, rife with con artists, and dodgy coins of all kinds are to be found at every turn. Given one of the arbitrary coins, what, then, do you assign to the probability of flipping heads on the next flip?

I will argue here that since your state of knowledge is symmetric under the exchange of the labels “heads” and “tails”, you should assign the two possibilities equal probability. This is the intuition behind the principle of indifference; however, I think the argument about the symmetry of your knowledge is far more powerful.

But, then, what is the difference between the P(H)=0.5 we assigned when we thought the coin was probably fair, and the P(H)=0.5 we assigned when the coin was almost certainly dodgy? The crux of this post is to demonstrate this difference; but to pre-empt the answer, the point is that the belief network supporting this assertion is quite different in each case, and one of them is far more robust than the other in the face of new data.

State of knowledge A – Not much knowledge at all!

Let us see how this works. To make things as crystal clear as I can, let me establish some notation:

$H$ – “The next flip of this coin will land on heads”
$k$ – Number of heads in a sequence of flips
$N$ – Total number of flips
$p$ – “This coin has bias p”

Let us now set some priors, for demonstration purposes. In this dodgy gambling den, we think the coin we have been given is probably dodgy, but we have no idea whether it is biased towards heads or tails. We must therefore pick a prior for $p$ which is symmetric about $p=0.5$. For simplicity let’s just use the uniform prior density,

$P(p|I)=1$,

where $I$ is our background information about the dodgy-ness of the gambling den, the apparent symmetry of the coin, etc.

Given some bias parameter $p$ (implicitly specifying a binomial model for the behaviour of the coin), the probability of flipping a head is $p$, that is,

$P(H|p,I) = p$.

However, we don’t know $p$, so we have to consider all possible values it might have when making our prediction, weighting according to our prior density for $p$, that is, we marginalise over $p$ to obtain

$P(H|I) = \int_0^1 P(H,p|I) dp = \int_0^1 P(H|p,I) P(p|I) dp = \int_0^1 p dp = [\frac{p^2}{2}]_0^1 = 0.5$

Unsurprisingly we get the answer $0.5$, because our prior was symmetric about this value.

Great! So this defines our initial state of knowledge about the coin. We are only basing this on some symmetry arguments, not any empirical evidence, so our knowledge here is not very certain. My goal now is to define a much more certain state of knowledge, which makes the same predictions for $H$, but which behaves very differently as we perform more flips and learn more about the coin.

State of knowledge B – Lots of knowledge!

To proceed from state of knowledge A (not knowing much) to state of knowledge B (knowing a whole lot), let us do science! That is, let us experiment on the coin. We will do this by flipping the coin 400 times. To figure out what impact this has on us, we need to calculate the probability of getting our sequence of coin flips under various hypotheses about the coin bias. This goes as follows. (Note that under the hypotheses about the coin we are considering, each flip is independent. Therefore we only need to know how many heads there were in the sequence to figure out its probability of occurring. If we were concerned that flips might be correlated then this analysis would get somewhat more complicated…)

Probability of any sequence of length $N$ containing $k$ heads,
$P(k|p,N,I) = p^k(1-p)^{N-k}$

(Interestingly we can leave the binomial coefficient out here I believe, since we are talking about the probability of getting specific sequences of heads and tails, not the usual binomial question of getting k “successes” in N “trials”. I am being a bit lazy in my notation by summarising this whole sequence with “k”. Turns out it doesn’t matter for our problem either way since the binomial coefficient is constant in p and gets normalised away.)

For simplicity we’ll assume there were exactly $k=200$ heads in our sequence of $N=400$. We use this data to update our beliefs about the coin bias $p$, via Bayes’ theorem:

$P(p|k,N,I) = \frac{P(k|p,N,I) P(p|I)}{P(k|N,I)}$

This causes the following to happen to our beliefs about $p$:

Whoops no axes labels. X is bias parameter p, Y is probability density of p.

This defines state of knowledge B. Note that it remains symmetric about $p=0.5$, so it predicts $P(H|k,N,I)=0.5$, that is, we still have the same belief that the next flip will be heads as we did when we started. So has anything changed in our state of knowledge? Yes! Dramatically! We are now much more sure about what the bias parameter of the coin is! This affects future inferences greatly, as we shall now see.
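For anyone who wants to reproduce this update numerically, here is a minimal sketch using a brute-force grid over $p$. The function names and the grid size are my own choices for illustration; it assumes the flat prior and the $k=200$, $N=400$ data described above, and nothing more clever than a Riemann sum.

```python
import math

def log_likelihood(p, k, n):
    """log of p**k * (1 - p)**(n - k), treating 0 * log(0) as 0."""
    total = 0.0
    for count, prob in ((k, p), (n - k, 1.0 - p)):
        if count > 0:
            if prob == 0.0:
                return -math.inf
            total += count * math.log(prob)
    return total

def grid_posterior(k, n, num_points=2001):
    """Posterior density for the bias p on a uniform grid, from a flat prior."""
    ps = [i / (num_points - 1) for i in range(num_points)]
    dp = ps[1] - ps[0]
    log_like = [log_likelihood(p, k, n) for p in ps]
    peak = max(log_like)
    # Subtract the peak before exponentiating to avoid underflow.
    weights = [math.exp(ll - peak) for ll in log_like]
    norm = sum(weights) * dp
    return ps, [w / norm for w in weights], dp

def mean_and_sd(ps, density, dp):
    """Mean and standard deviation of p under a gridded density."""
    mean = sum(p * d for p, d in zip(ps, density)) * dp
    var = sum((p - mean) ** 2 * d for p, d in zip(ps, density)) * dp
    return mean, math.sqrt(var)

# State of knowledge A: no flips observed, so the posterior is the flat prior.
mean_a, sd_a = mean_and_sd(*grid_posterior(0, 0))
# State of knowledge B: 200 heads in 400 flips.
mean_b, sd_b = mean_and_sd(*grid_posterior(200, 400))

print(f"A: P(H) = {mean_a:.3f}, spread in p = {sd_a:.3f}")
print(f"B: P(H) = {mean_b:.3f}, spread in p = {sd_b:.3f}")
```

Both states give a mean of 0.5 for $p$ (which is exactly the predictive probability of heads), but the standard deviation of $p$ drops from roughly 0.29 to roughly 0.025. That shrinkage is the whole difference between the two states of knowledge.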

The coin starts to look dodgy…

Suppose now that we encounter a sequence of flips that, a-priori, might look really suspicious. 10 heads in a row! After observing this, what are our beliefs about the bias of the coin, and our belief that the next flip will be heads, starting from each of the states of knowledge A and B?

Applying Bayes’ theorem in both cases, with $N=10$ and $k=10$, we get the following posterior distributions:

Same axes as before :).

And the following probabilities for getting a head next flip (marginalising over $p$ as before, but now weighting by the updated density rather than the original prior)

$P(H|A) = 0.92$
$P(H|B) = 0.51$

As you can see, starting from our state of great uncertainty, after seeing ten heads in a row we are now really suspicious of this coin, and think there is a $92\%$ chance of flipping a head on the eleventh trial. On the other hand, when we have seen the coin behaving perfectly fairly up till now, we think that not much out of the ordinary is happening; we’ll need a lot more heads in a row to convince us something is wrong. We have gotten slightly suspicious, and now think $P(H)=0.51$ rather than $0.5$, but really haven’t changed our mind very much.
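These two numbers can be checked without any numerical integration. With a flat prior, the density for $p$ after seeing some heads and tails is a Beta distribution, and the probability of heads on the next flip is just its mean, (heads + 1) / (flips + 2). The sketch below leans on that standard conjugacy result; the function name and argument layout are mine, purely for illustration.

```python
from fractions import Fraction

def predictive_heads(old_heads, old_tails, new_heads, new_tails):
    """P(next flip is heads), starting from a flat prior on the bias p.

    A flat prior is Beta(1, 1); after observing h heads and t tails the
    density for p is Beta(1 + h, 1 + t), whose mean is (1 + h) / (2 + h + t).
    """
    a = 1 + old_heads + new_heads
    b = 1 + old_tails + new_tails
    return Fraction(a, a + b)

# State A: nothing seen before, then 10 heads in a row.
p_a = predictive_heads(0, 0, 10, 0)
# State B: 200 heads and 200 tails seen before, then 10 heads in a row.
p_b = predictive_heads(200, 200, 10, 0)

print(p_a, round(float(p_a), 2))  # 11/12 0.92
print(p_b, round(float(p_b), 2))  # 211/412 0.51
```

So the 400 earlier flips act like 400 votes that the ten new heads have to outweigh, which is why state B barely moves.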

And now I have to catch a flight! I hope this post has been fun to read :). Till next time!

Update: I was inspired to add a little extra. To demonstrate further the difference in confidence with which one can make predictions from these various states of knowledge, I computed the predictive probability of the fraction of heads one would expect in samples of 20 and 1000 flips of the mystery coin. Note that in the N=20 case the variance of the prediction is dominated by the low sample size, while in the N=1000 case it is dominated by the variance in the knowledge of the coin bias. In both cases, when the prior for the coin bias is flat, we know pretty much nothing about the fraction of heads to expect to see in the sample. The mean is 50%, but the variance is maxed out, so we have minimal confidence that we will actually see an outcome near the mean in this case. When we have knowledge about the coin bias, we can predict the fraction of heads in the sample with increasingly greater confidence.

Prediction for the fraction of heads in a sample of 20 coin flips, based on various states of knowledge about the coin. X axis is the fraction of heads in the sample, Y is the predictive probability density for that outcome.

Prediction for the fraction of heads in a sample of 1000 coin flips, based on various states of knowledge about the coin. X axis is the fraction of heads in the sample, Y is the predictive probability density for that outcome.
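For the curious, here is a rough sketch of how predictions like these can be computed. It assumes the states of knowledge are Beta densities for the bias (Beta(1, 1) for the flat prior, Beta(201, 201) for state B above), in which case the number of heads in a future sample follows a beta-binomial distribution. The helper names are hypothetical, and this is not necessarily how the plots were made.

```python
import math

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(j, n, a, b):
    """P(j heads in n flips) when the bias p has a Beta(a, b) density."""
    log_choose = math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
    return math.exp(log_choose + log_beta(j + a, n - j + b) - log_beta(a, b))

def predicted_fraction(n, a, b):
    """Mean and standard deviation of the predicted fraction of heads."""
    pmf = [beta_binomial_pmf(j, n, a, b) for j in range(n + 1)]
    mean = sum(j / n * q for j, q in enumerate(pmf))
    var = sum((j / n - mean) ** 2 * q for j, q in enumerate(pmf))
    return mean, math.sqrt(var)

for n in (20, 1000):
    for label, (a, b) in (("flat prior", (1, 1)), ("state B", (201, 201))):
        mean, sd = predicted_fraction(n, a, b)
        print(f"N={n:4d}, {label:10s}: mean={mean:.3f}, sd={sd:.3f}")
```

In every case the mean is 0.5. With the flat prior the spread stays near 0.3 whatever the sample size, whereas for state B it shrinks from about 0.11 at N=20 to about 0.03 at N=1000, where the remaining uncertainty about the bias takes over.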

On trying to figure out whether all swans are white

It has been a while since I talked about probability theory, but I want to get back to it today. I am going to continue on, roughly speaking, from my post on induction, but it will not be at all necessary for you to have read that post for this one to make sense.

I will give a little background. When procrastinating from my more pressing projects, I tend to drift towards thinking about the nature of scientific models, what our beliefs about them are, what is rational to expect to discover in the future and why, etc. In the course of these thoughts, and while reading an old book by Keynes, I came across an argument that tells you something both obvious, and a little discouraging.

Say we are doing that old thing where we want to know if all swans are white. Let us consider two competing hypotheses:

1 – “All swans are white”
2 – “All European swans are white”

Now, it is probably obvious to you that hypothesis 2 is “a-priori” more probably correct than hypothesis 1 (disregard anything you may actually know about swans… we are talking a-priori here, or at least conditional on only some very primitive information). This is because we have restricted the extent of our generalisation in hypothesis 2; finding a black swan in Australia will falsify 1, but not 2. We can see this a little bit more formally by writing the following; let

$f_1(x) \equiv$ “x is white”
$\phi_1(x) \equiv$ “x is a swan”
$\phi_2(x) \equiv$ “x is European”

We can then form the compound propositions

$g(\phi_1,f_1) =$ “for all x where x is a swan, x is white”
$g(\phi_1.\phi_2,f_1) =$ “for all x where x is a swan and x is European, x is white”

where we may call $g$ a “generalisation”. In this case it is a propositional function, with the structure

$g(\phi,f) \equiv$ “for all x where $\phi(x)$ is true, $f(x)$ is also true”

You can then hopefully convince yourself that

$g(\phi_1,f_1) = g(\phi_1.\phi_2,f_1).g(\phi_1.\bar{\phi_2},f_1)$

which in words is just

“all swans are white” = “all European swans are white” and “all non-European swans are white”.

If you accept that the two propositions on the right are independent, then the probabilities of these propositions, conditioned on primitive evidence $h$, are related by

$P(g(\phi_1,f_1)|h) = P(g(\phi_1.\phi_2,f_1)|h).P(g(\phi_1.\bar{\phi_2},f_1)|h)$

from which it is clear that

$P(g(\phi_1,f_1)|h) \le P(g(\phi_1.\phi_2,f_1)|h)$.
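As an aside, here is a minimal sketch of why, as far as I can tell, the inequality does not actually hinge on the independence assumption (the shorthand $A$ and $B$ is mine, introduced just for this note). Write

$A \equiv g(\phi_1.\phi_2,f_1)$, $B \equiv g(\phi_1.\bar{\phi_2},f_1)$

so that $g(\phi_1,f_1) = A.B$. The product rule then gives

$P(A.B|h) = P(A|h).P(B|A,h) \le P(A|h)$

since $P(B|A,h) \le 1$. Independence only enters if we want to replace $P(B|A,h)$ with $P(B|h)$, as in the factorised form above.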

So, before you go out and check out any swans, it is less probable that ALL swans are white, than that just the European ones are white. We can’t really get more quantitative than this, but it is nice to reassure ourselves that this much is reasonable.

So, what happens when we go out and find some swans? In particular, let us examine the case where the only samples available to us are European swans. We could run our argument through for a sample of N swans, but let’s just do it for 1 swan for simplicity. Our new evidence is thus

$D \equiv \phi_1(a)\phi_2(a)f_1(a)$ or “a is a swan” and “a is European” and “a is white”

Given this new data, the posterior odds of our hypotheses 1 vs 2 are (from Bayes’ theorem)

$\displaystyle \frac{P(g(\phi_1,f_1)|D,h)}{P(g(\phi_1.\phi_2,f_1)|D,h)} = \frac{P(D|g(\phi_1,f_1),h)}{P(D|g(\phi_1.\phi_2,f_1),h)} \frac{P(g(\phi_1,f_1)|h)}{P(g(\phi_1.\phi_2,f_1)|h)} = \frac{P(g(\phi_1,f_1)|h)}{P(g(\phi_1.\phi_2,f_1)|h)}$

where the final equality follows because both hypotheses predict that the European swan we sampled should be white, with probability 1. So nuts, we learned nothing to discriminate these two hypotheses from each other. That is, even though we may believe a little more in the proposition that “all swans are white” than we did before (we could show this but I won’t do it now…), we still prefer the “all European swans are white” by exactly as much as we did before. This holds no matter how many white European swans we are able to find.
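To make this concrete, here is a toy calculation. All the numbers in it are invented for illustration: a prior of 1/2 that all European swans are white, a prior of 1/5 that all non-European swans are white (taken as independent, as above), and a chance of 9/10 that a sampled European swan turns out white even if not all of them are.

```python
from fractions import Fraction

def posteriors(prior_e, prior_n, q, num_white):
    """Posterior probabilities of hypotheses 1 and 2 after seeing white European swans.

    prior_e   : prior that all European swans are white (hypothesis 2)
    prior_n   : prior that all non-European swans are white
    q         : chance a sampled European swan is white if NOT all of them are
    num_white : number of white European swans observed
    Hypothesis 1 ("all swans are white") has prior prior_e * prior_n.
    """
    like_if_false = q ** num_white  # both hypotheses themselves predict the data with probability 1
    evidence = prior_e + (1 - prior_e) * like_if_false
    post_h2 = prior_e / evidence
    post_h1 = prior_e * prior_n / evidence
    return post_h1, post_h2

prior_e, prior_n, q = Fraction(1, 2), Fraction(1, 5), Fraction(9, 10)
for m in (0, 1, 10, 100):
    h1, h2 = posteriors(prior_e, prior_n, q, m)
    print(f"{m:3d} swans: P(1)={float(h1):.4f}  P(2)={float(h2):.4f}  odds={h1 / h2}")
```

Both posteriors creep upwards as the white European swans pile up, but their ratio stays pinned at the prior value (1/5 with these made-up numbers).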

The lesson: to believe in more grandly sweeping hypotheses, we need more broadly sampled data! Looking at the same sort of thing forever doesn’t help us learn anything; we must go forth into the unknown! And a-priori, the more broad the generalisation, the less likely that it holds.

(Final aside – I was a little sneaky in the way I used Bayes’ theorem here. If you recall, there is the “global evidence” piece, e.g. the $P(D)$ term in

$\displaystyle P(H|D)=\frac{P(D|H)}{P(D)}P(H)$

which is calculated as

$P(D) = \sum_i P(D|H_i) P(H_i)$, or just $P(D) = \sum_i P(D,H_i)$

This generally assumes our hypothesis space consists of mutually exclusive hypotheses, which is clearly not the case in my example (since if all swans are white, then certainly all European swans are also white). I silently divided out this term. However, I think this is fine, because P(D) is indeed independent of any hypothesis, and in each application of Bayes’ theorem we could divide the hypothesis space up into mutually exclusive portions with no problem (it would just be a different division in each case). We can thus compute posterior odds just fine, we just have to keep in mind that they mean something a little different to usual, i.e. that both hypotheses may simultaneously be correct.

Final final aside 😉 – There is one important argument to be made against the proposition that “more sweeping generalisations are a-priori less probable”, and that arises when you consider *how else* you might expect your extended sample set to behave. It may very well be the case that “All swans are white” is a-priori MORE probable than “All European swans are white, and all other swans are pink”. Nevertheless, in both cases “All European swans are white” is true, so this is a bit of a different argument to the one I am making.)

Deeply moving mysteries

I was feeling extremely unmotivated this past weekend, so I burned a lot of time watching anime. My favourite was a beautiful single-season show called “Denpa Onna to Seishun Otoko” (電波女と青春男) which translates into the slightly odd “Radio wave girl and young (youthful) man”. Wikipedia tells me that the word “denpa” (radio wave) is also used to describe alien conspiracy type people, i.e. with similar associations to “tin foil hat” in English.

In some ways it is a relatively standard sort of teen romance comedy/drama (although it turns out I am a sucker for those anyway), but since the main female protagonist is a spaced-out blue haired girl who vanished for six months before she was found floating in the ocean –and then started claiming she was an alien and running about the city wrapped in her futon– it has some non-standard qualities. It asks a bunch of my favourite philosophical questions along the way too: What is it rational to believe? Is it morally right to break other people’s beliefs in things you think are crazy? What would make you change your beliefs about something you previously thought impossible? This sort of thing. If you like anime I can certainly recommend this one.

Erio wrapped in her futon

Anyway, this got me thinking about mysteries of various sorts. Science is sometimes unfairly accused of taking the mystery and wonder out of life, and while I emphasise the “unfairly” I do admit that there is a certain amount of heartwarming and uplifting childish wonder that knowledge inevitably destroys. When we are children our sense of wonder is of a deeply emotional origin, and in a world illuminated by the light of knowledge some of this is lost, and is generally replaced by a more academic kind of wonder. If you can retain that sense of child-like wonder about the things in the universe that really are mysterious then you are most fortunate, and wise, since these things deserve your wonder no less than the wild imaginings that your child-mind created. I admit this can be hard in the face of the many harsh and soul-crushing aspects of reality, and I struggle with it myself.

I think part of the struggle is with what we do and do not think is probable, not so much what is possible. For example, you would not be crazy to believe in the existence of other universes; there are various quite convincing arguments that such things may exist –and if we one day discover that they do, that will surely be as heart-rendingly severe and wondrous a discovery as science has ever made– but deep down I don’t really think it is probable. I have no real reason for this, and I would probably intellectually calculate that it is more probable than I emotionally feel, but, since wonder comes from the heart, my wonder at this possibility is not very strong. Likewise, while we could be living in the matrix and it would be truly incredible if we were, deep down I rate the probability of this too low for the possibility to increase my heart rate.

There is one mystery that does still resonate with my soul. And of course it is part of why I liked Denpa Onna to Seishun Otoko: extraterrestrial life. This is much closer to home than other universes, and much more probable in my opinion. The universe is so staggeringly large, and the Sun and Earth so un-special seeming, that there must be other life out there somewhere; and though I don’t know what kind of number I would put on the probability, the fact that I would be heartbroken if we somehow learned there really was nothing out there tells me that deep in my belief structure that number is not tiny.

The way of the command line

コマンドライン道

So you see how a simple thing like wanting to bastardise some Japanese for the sake of a grandiose title can lead to quite a lot of wasted time.

But let us move on with business. As some of you may be aware, Google is shutting down their self-titled “reader” on July 1, or sometime thereabouts, and since this date rapidly approaches I have found myself in need of an alternative method of accessing my now-beloved RSS feeds. I read about quite a few, though I can’t be bothered going back to think about them again now for the sake of this post. Suffice it to say I remained unimpressed by any of them.

And so I decided that I should just go back to the fundamentals. I use Ubuntu at the office, so a Linux-based open-source RSS reader of some kind was appealing on philosophical grounds. I thus came across newsbeuter. It is terminal based, but hey, this is Linux! The terminal is your friend.

Oh yeah…

I have to say, I am pretty happy with it so far! It is totally straightforward, has no junk attached, and if I need to see images or other rich content I just press “o” and it opens in a browser (which is similar to what I did a lot from Google reader anyway). The downside is of course the lack of syncing, which is what Google reader was good for. I feel like I can probably hook some database file or other up to Dropbox and achieve the same thing with newsbeuter though. I have not tried yet.

So after my success with this, I started to think, what else can I abandon in favour of a command line version? Turns out I can Tweet from the command line perfectly well with Twidge:

Down with GUIs!

And I can even read my email with Mutt! (Though I may not stick with this one, it might be taking things too far :p. We’ll see if I can get used to it.)

Beautiful beautiful plain text.

The next thing on my hit-list was Facebook. Now certainly I cannot abandon it entirely –Facebook have made sure we can’t interact much with their content without going through their web page– but it turns out we can do a bit, and perhaps more than they would like. They are currently kind enough to supply an RSS feed for your notifications in a url something like this:
http://www.facebook.com/feeds/notifications.php?id=YOURID&viewer=710466528&key=YOURKEY&format=rss20
which you can easily obtain by clicking “See all” in your notifications list, at the top of the ensuing page at “Get notifications via: Text message . RSS”.

So they seem happy for you to grab that, presumably because it may cause you to come back and look at some Facebook ads when you get a notification. But what about the news feed? That would be handy to have in RSS form!

But no, Facebook seem to not want you to do that. Yet, we can achieve something at least a bit similar in spirit…

The plan will be this: we will grab the RSS feeds from a bunch of our friends with the most interesting status updates (turns out we can get feeds of these), combine them all together into one feed with Yahoo Pipes (and also tag them with their origins so we know whose status is whose) and grab this single feed into our RSS reader so that we only have to look at the one feed for everyone.

First, one needs to install the Facebook app FB RSS, which will give you url feeds of every one of your friends’ status updates, individually. Also it gives you feeds from pages you are a “fan” of, and possibly something else I am forgetting.

Anyway these are pretty terrible to deal with all individually. So we want to combine them together! To do this we can use Yahoo Pipes, which I had never heard of before yesterday but turns out is a very powerful tool that lets you harvest all kinds of stuff from all over the internet and transform it into an RSS feed, all using a visual “piping” programming language that reminds me a lot of SpaceChem.

The method I cooked up for my purposes was to grab the urls I wanted from FB RSS and put them into a csv file, in a series of “feed name”:”url” pairs, dumping this file into my public dropbox folder, and then telling Yahoo Pipes to grab the data out of this file. It took me a while to figure out the piping (but if you want to see or use the pipes I invented they are here: Main pipe, Sub-pipe).
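If you would rather do the merge-and-tag step locally, the same idea can be sketched in a few lines of Python with only the standard library. To be clear, this is not what my pipes do internally: the function name `merge_feeds` is made up for this example, it assumes each feed is plain RSS 2.0 text that you have already downloaded, and it skips niceties like sorting by date.

```python
import xml.etree.ElementTree as ET

def merge_feeds(named_feeds, title="Combined feed"):
    """Merge several RSS 2.0 documents into one, tagging each item's title.

    named_feeds maps a feed name (e.g. a friend's name) to the RSS XML text
    of that feed. Returns the combined feed as an XML string.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for name, xml_text in named_feeds.items():
        root = ET.fromstring(xml_text)
        for item in root.findall("./channel/item"):
            title_el = item.find("title")
            if title_el is None:
                title_el = ET.SubElement(item, "title")
            # Prefix the origin so we know whose status is whose.
            title_el.text = f"[{name}] {title_el.text or ''}"
            channel.append(item)
    return ET.tostring(rss, encoding="unicode")

# Tiny hand-written feeds standing in for the real downloaded ones.
feed_a = ("<rss version='2.0'><channel><title>A</title>"
          "<item><title>hello</title></item>"
          "<item><title>again</title></item></channel></rss>")
feed_b = ("<rss version='2.0'><channel><title>B</title>"
          "<item><title>hi</title></item></channel></rss>")

merged = merge_feeds({"alice": feed_a, "bob": feed_b})
for t in ET.fromstring(merged).findall("./channel/item/title"):
    print(t.text)
```

The output of `merge_feeds` could then be written to a file and served from wherever your reader can fetch it.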

So, now we just give the combined feed to newsbeuter and huzzah, status updates without having to look at Facebook!

Suck it Zuckerberg!

I’m feeling pretty good about this new setup, though I will admit that it took an obscenely long time to figure out that piping business. But now that I did it once I am sure there must be some other junk on the internet that would be handy to pipe together…

Time to go home! Peace out.

Update: I have now added my YouTube “subscriptions” RSS feed to newsbeuter, with some more help from Yahoo Pipes to modify the item titles to contain the authors and video durations (pipe here). Gotta say I like looking at my subscriptions feed this way much better than the usual way!

A few interesting kanji

(note, throughout this post when I mention “my kanji dictionary”, I mean the website http://www.saiga-jp.com/kanji_dictionary.html)

I have come to wordpress with the purpose of bragging about some terminal-based RSS wizardry I have achieved today, but on the way to doing so I became side-tracked due to the title I thought up for that post, which involved some Japanese which I felt obliged to research a little before butchering.

So anyway, I ended up reading about a few Japanese words of interest to me and the kanji with which they are written (well, at least some of the time, I don’t know the language well enough to know what is ‘standard’). There were a couple of connections between words which I didn’t realise before, and a few that I knew (well, suspected strongly…) which were nice to see supported by the kanji.

Those of you who actually speak Japanese feel free to jump right in and correct any wildly inaccurate statements I might make :).

So first, here are some words for us to consider:

These are all martial-arts related words of course, and most of them practically English words by now. However, it appears that they are all spelled somewhat differently in Japanese to what I expected, which was the first surprising thing for me. I have been studying karate for quite a while now; you would think I would have known the correct way to say it in Japanese.

Well, at least the “karate” part is indeed how I expect it to be – only the “dou” part I had wrong. However the actual name of the organisation I study with, goujyuukai, is rather different to what I thought!

But anyway, to the associations. Firstly, if you examine the kanji, you will notice the character 道 (pronounced dou in these words) appearing several times. My understanding of this character was that it means “way” or “path” or “method”, this sort of thing (where there are both practical and spiritual connotations in there), so that karate-do meant roughly “empty-hand method”, or, a little more poetically, “the way of the empty hand”. Likewise for judo we have “soft/gentle method” or “the way of softness”, something like that, and for aikido, well I don’t really know, something about the way of ki (気, kind of tough to translate but my kanji dictionary lists all of {air, atmosphere, spirit, mind, heart, will, intention, feelings, a mood, nature, a disposition, attention, care, a sign, an indication}), though I still don’t know what the “ai” part really means (kanji dictionary says 合 – {match, fit, suit, join, combine, unite, coincide, agree}), but if I was forced to guess I suppose it must be something like “the way of harmonious spirit”, since I know aikido to be very non-violence based.

(I brushed over the meanings of a bunch of other kanji there, so let me quickly list them:

So “dou” in this sense more or less made sense to me. However, to get to my point, I was quite surprised to see the “dou” in “dojo” was this same character 道. The kanji dictionary just lists “dojo” as “an exercise (training) hall”, which seemed straightforward. I suppose it makes sense as being the place in which you practice your “dou”, even superficially.

Still, looking up dou (道) on Google gives me the following Chinese Wikipedia page (http://zh.wikipedia.org/wiki/%E9%81%93), which, although I can’t read it, contains a great big yin-yang on one side. This seems to strongly imply that this “dou” is a Taoist concept, and indeed the characters above the yin-yang are “道教”, which Google Translate tells me indeed means “Taoism”. It also tells me that the Chinese pronunciation is “Dàojiào”, so it seems that “dou” in Japanese does come quite directly from the Chinese. So one sees that “doujou” has deeply spiritual connotations! It is not simply a training hall; at least, not simply a place for training physical skills.

So I guess I will have to think deeper about these things some more. At least it is getting slightly easier to investigate some of these language issues as I learn a bit more Japanese. My skills are still pitiful, but slow and steady, or some such phrase.

Anyway that is certainly a large enough deviation from my original purpose, which was itself a serious deviation from other things I had planned to do tonight… so back to it!

Battle for the soul – the God question and epistemology, or “why do you believe what you believe”?

Let me begin this article with a disclaimer: this post is somewhat tangential to the main theme of this blog (although I do think there are lots of interesting connections worth talking about between religious belief and Bayesianism). Disclaimer 2: I stayed up later than I planned writing this, and I didn’t really finish moulding it into a smoothly flowing essay. I am now too tired to continue and don’t care anymore, but I hope it is at least a little interesting for those of you who might read it. Please forgive its sloppiness and half-finished character :p.

No, instead I am more interested in the particular kind of inner conflict that is waged in the soul of anyone who has struggled with the question of whether or not God exists, or with more general questions about the nature of reality, or even with more mundane things like whether Santa Claus exists. The conflict I refer to is the battle to achieve a self-consistent epistemology, that is, a self-consistent theory of knowledge. You may not have ever called it by such names, but if you have ever encountered a proposition which fundamentally challenged your belief system, if you have undergone a mental battle to either reject that proposition or else modify your belief system, then it is quite probable your brain has also struggled to decide whether it is indeed judging the validity of propositions by the right criteria; such criteria effectively form your own internal theory of knowledge.

There is a large subconscious element to this struggle, I believe. For example, I like to think I know quite a lot about fundamental physics, and thus about what sorts of things are and are not possible in this universe, yet I still on occasion feel a chill down my spine working late at night in my big empty open-plan office, as if my subconscious mind has not yet entirely dismissed the possibility of paranormal phenomena, no matter how vehemently my conscious mind argues that such notions are ridiculous.

But I digress. I have an intuition that, for most of us, our internal theory of knowledge solidifies when we are quite young; this is why, for instance, some of us are happy to accept the possibility of transcendental sources of knowledge (such as being spoken to by a God or gods, or perhaps ‘feeling’ his/their presence in our lives), and why others of us deny the validity of this. I perhaps superficially fall into the ‘no transcendental knowledge’ camp; however, I consider the question of why I believe this to be of the utmost seriousness, the entire foundations of one’s views on reality being too important to leave to the random events of one’s particular childhood.

If one is serious about this examination, it seems to me that one should rapidly approach the following question: why believe anything? What in fact justifies a person to any degree of belief in one proposition or another? Even our (my?) most sacred scientific values are not immune to such epistemological nihilism. If one cannot convince oneself of at minimum a partial answer to this question, then all knowledge must be denied; even the words I type right now should be denied meaning.

This blog being somewhat about Bayesianism and statistics, you might think I am about to argue that those should form the foundations on which all human knowledge is built. And in some sense I do think this, but the question of the preceding paragraph undercuts the foundations even of these. So one must first have some reason to accept these as a foundation for knowledge.

In fact there is no really satisfactory solution to this kind of extreme deconstruction of knowledge, as far as I am aware. It is a “catch 22” – no argument can argue for its own logical foundations in such a completely self-contained way. If I was a logician I expect something deep could be said about Gödel’s theorems at this point.

So, we must work from axioms and build a theory of logic, and, subsequently, of knowledge. But what should our axioms be? And could we choose other ones? By what process should we accept some set of axioms? We appear to be again screwed. We seem to have nothing but our intuition with which to produce said axioms, and no reason to believe that our intuition will produce for us the “correct” axioms, if such things exist. If we do work from some axioms (or the intuitions to which they correspond) it is easy to imagine we might convince ourselves of the validity of our axioms, but such reasoning would be circular, and my own axioms (at least) lead to the rejection of such circular logic. Some philosophers seem to argue that while perceptions are fallible we have some primary core intuitions which are trustworthy, and that these give us enough foundation to perform logical deduction and so build ourselves a valid system of logic, but I really don’t see how this can be the case, or rather I don’t see how we can know if we have achieved what we think we have achieved. If a demon (or machine superintelligence) could potentially control our perceptions, what prevents them controlling our thoughts and intuitions? Similarly if our perceptions of the external world are warped by psychosis, why not too our beliefs about logical deduction? When we dream we find it very easy to accept the fantastic as plausible, and I don’t consider it a stretch that the mind could similarly be tricked into thinking it is performing logical deduction when it is in fact doing nothing of the sort.

But let us suppose that we can indeed build some logical system. Can transcendental knowledge be allowed in a self-consistent system? I have no argument for why it cannot; my only argument against it is that it seems extremely dangerous and vulnerable to self-deception. How does one tell the difference between information from a transcendental source, and illusion?

Perhaps a counter-argument would be that all intuition is subject to such self-deception. I would be inclined to agree, but rather than using this argument to accept more intuitions it instead drives me to reject as much of my own intuition as is possible. Not in a pragmatic sense: of course our intuitions are well trained for many pedestrian matters. But when it comes to questions of fundamental significance, I cannot see any valid way to reason about them from intuition or feelings. I admit that actually adhering to such a policy is much more difficult, given that we need to begin from axioms…

Before I finish this somewhat rambling post, let me ask any readers I might have a question: what do you know about Descartes’ “evil Demon”? (I alluded to this earlier: http://en.wikipedia.org/wiki/Evil_demon – This is essentially the same as what we might call “the Matrix problem”: how do we know that all our perceptions of the supposedly external world are not controlled by an “evil Demon”, or machine superintelligence, etc?) I never read Descartes so I don’t actually know what he had to say on the matter. It seems that he didn’t think that such a hypothesis was plausible, but I don’t know why, especially since, as I understand it, he goes on to postulate the existence of an omnipotent benevolent creator. What logic leads him to one and not the other I don’t know.

The Monty Hall Problem

I thought I might do another “fun” post today :). So here we go; I shall present to you the Bayesian solution to the “Monty Hall” problem! For fun, I will also take things a bit further than the usual solutions do…

What is the Monty Hall problem you say? Well, it is a very famous probability problem and there is a lot written about it all over the internet (Wikipedia of course has some nice history and a lot of other info about other ways of attacking it than what I will show you). But its fame is not because it is paradoxical or fundamentally tricky or any such thing that might make it particularly interesting to mathematicians. Rather the solution is simply unintuitive, and many people refuse to believe the answer at first.

The scenario is as follows: You are on a game show (the host is Monty Hall, a real person who hosted a real U.S. game show, though one a little different to what I will describe, or so I am led to believe). Before you are three doors. Monty tells you that behind one of the doors is A NEW CAR! Behind the other two doors are goats (let us assume you prefer to win the car…). You get to pick a door and win whatever is behind it.

So you pick a door. But to increase the dramatic tension of the game, Monty does the following: he opens one of the two doors you didn’t pick, revealing a goat. He then offers you the chance to change your chosen door.

So, the problem: do you switch?

Are you more, or less, likely to win the car by switching, or does it make no difference?

If you were there what would you do? Think about it for a minute.

Alright, so now I will take you through the answer. There will be math, but it is high school math, so I hope it is accessible. We are going to be computing “odds”, like bookies. These are just ratios of probabilities for different outcomes, e.g.

$\text{Odds(horse A wins VS horse B wins)}$
$= \text{Pr(horse A wins)}:\text{Pr(horse B wins)}$.

So first, consider the odds for the car being behind door A, B or C, before anything happens. We may write this as

$\displaystyle \text{Odds(Car)} = 1:1:1$

To be really clear but encumber you with extra notation, by this I mean

$\displaystyle \text{Odds(Car)} = \text{Pr(Car behind A)}:\text{Pr(Car behind B)}:\text{Pr(Car behind C)}$

where $\text{Pr(statement)}$ is the probability that “statement” is true. We can multiply these probabilities by a common factor to make them nicer numbers though, which is why I wrote $1:1:1$ rather than $\frac{1}{3}:\frac{1}{3}:\frac{1}{3}$. Here I also claim that there are initially equal odds of the car being behind each of the doors, or probability 1/3 that it is behind any particular door. This is a statement about what we believe about the current situation. We have no information to point us in the direction of any particular door at this stage, so this initial assignment of equal odds is really the only reasonable one.

Next you pick a door. I could be a bit more formal about this step, but for brevity let me assume that it is intuitive to you that nothing should happen to the odds of the car being behind various doors just because you picked one. We are assuming here that the game show people are not jerks who switch the prizes around based on your chosen door. The situation is symmetric, based on our initial odds, so say you choose door A for simplicity. We now have:

$\displaystyle \text{Odds}(\text{Car}|\text{You A}) = \text{Pr}(\text{Car behind A}|\text{You chose A}):\text{etc.}:\text{etc.}$
$= 1:1:1$

where the vertical line “|” means “given that”; i.e. the statements following it are things on which that probability is conditional, i.e. those statements are assumed to be true for the purposes of that particular probability.

We want to know what happens to these odds (of where the car is) after Monty opens one of the doors we didn’t pick (say B) i.e. we want to calculate this:

$\displaystyle \text{Odds}(\text{Car}|\text{Monty B, You A}) = ?:?:?$

Again in more cumbersome notation, this is supposed to mean:

$\displaystyle \text{Odds}(\text{Car}|\text{Monty B, You A}) = \text{Pr}(\text{Car behind A}|$
$\text{You picked door A and Monty opened door B}):\text{etc.}:\text{etc.}$

So how do we calculate these odds? Well, of course the answer is Bayes’ theorem, which in this case we can write like this:

$\displaystyle \text{Odds}(\text{Car}|\text{Monty B, You A})$
$= \text{B}(\text{Monty B}|\text{You A})\times \text{Odds}(\text{Car}|\text{You A})$

This thing, $\text{B}(\text{Monty B}|\text{You A})$, is called the “Bayes factor”. It is not so scary; it is just the ratio of probabilities of seeing various data in different circumstances. In this case, it involves the probability that “Monty opens door B”, given that “You picked door A” and “The car is behind X” (where X is whatever corresponds to that bit of the ratio, i.e. we have X=A:B:C in each bit). In our notation,

$\text{B}(\text{Monty B}|\text{You A})$
$= \text{Pr}(\text{Monty opens B}|\text{You chose A and Car behind A}):\text{etc.}:\text{etc.}$

This here is the crucial point of this whole problem, so let me be extra clear. Bayes’ theorem says that our beliefs about where the prize is after Monty opens a door, depend on the initial odds (here equal), and the ratio of the following three probabilities:

$\text{Pr}(\text{Monty opens B}|\text{You chose A and Car behind A})$
$\text{Pr}(\text{Monty opens B}|\text{You chose A and Car behind B})$
and
$\text{Pr}(\text{Monty opens B}|\text{You chose A and Car behind C})$

Consider these carefully.

Now, let me focus on an important point that cruelly I left ambiguous. Given only the information I have supplied in the above scenario, these probabilities can be quite different depending on what you think my wording implies about the behaviour of Monty. The way I wrote it was this:

“Monty does the following: he opens one of the two doors you didn’t pick, revealing a goat.”

This could mean any of the following:

1. Monty knows what is behind all the doors, and for showmanship he opened a door he knew hid a goat.

2. Monty does not know where the car is, and he just opened one of the remaining doors “at random”, and it just happened to have a goat behind it in this instance.

3. Monty knows you picked the car first up, and is offering you the choice to switch in order to screw you (i.e. he wouldn’t have offered you the choice otherwise).

Depending on which of these you think is the case, the answer is different (but after we go through the normal solution, I will show what to do when you don’t know which of these three options is the case!)

For now, let us initially assume option 1 is the case. In this case our three crucial probabilities are

$\text{Pr}(\text{Monty opens B}|\text{You chose A and Car behind A}) = 1/2$
$\text{Pr}(\text{Monty opens B}|\text{You chose A and Car behind B}) = 0$
and
$\text{Pr}(\text{Monty opens B}|\text{You chose A and Car behind C}) = 1$

If you chose A, and the car was behind A, Monty could open either B or C to dramatically reveal a goat. It is reasonable to assume it makes no difference to him which goat he reveals, so setting the probability he opens B to 1/2 is the reasonable thing to do.

If you chose A, and the car was behind B, then Monty cannot open B if he wants dramatic goatage. So the probability of this outcome is zero.

Similarly, if you chose A, and the car was behind C, then Monty is definitely going to open door B for dramatic effect, so then opening door B has probability 1.

Putting this all together, our Bayes factor is

$\text{B}(\text{Monty B}|\text{You A}) = 1:0:2$

(where I scaled the numbers up to integers). Multiplying these factors into our initial odds, we get the final odds:

$\displaystyle \text{Odds}(\text{Car}|\text{Monty B, You A}) = (1:0:2)\times(1:1:1) = 1:0:2$

So, it is twice as probable that the car is behind door C as behind door A! You picked A, so you should definitely switch if you want to win.
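The odds update above can be checked with a few lines of Python. This is only a minimal sketch under the option 1 assumptions; the function name `update_odds` is mine, made up for illustration, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction


def update_odds(prior_odds, likelihoods):
    """Multiply prior odds by the Bayes factor term by term, then
    normalise the result so that it sums to one (i.e. probabilities)."""
    unnormalised = [Fraction(o) * Fraction(l) for o, l in zip(prior_odds, likelihoods)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# You chose A and Monty opened B, under option 1 (Monty knowingly reveals a goat).
prior_odds = [1, 1, 1]                  # car behind A : B : C
likelihoods = [Fraction(1, 2), 0, 1]    # Pr(Monty opens B | car behind A, B, C)

posterior = update_odds(prior_odds, likelihoods)
print(posterior)  # probabilities for A, B, C in the ratio 1:0:2
```

Swapping in a different list of `likelihoods` is all that is needed to explore the other interpretations of Monty’s behaviour.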

But what about if you had interpreted my information about the scenario differently? Well, this post is already too long, so I leave it as an exercise for the reader to figure out what happens. I will, however, come back to my promise of how to reason in the face of the ambiguity I presented, without making any one of the wild assumptions, next time! I’ll put a link here when I do it.

But for now, adios! Or since I am trying to learn Japanese, じゃまたね!

Testing the distribution of the digits of pi

Just a fun one for today. No outlandish Bayesian-based philosophy, just straight up classical significance tests. I will even take it relatively gently for those of you who are not “into” statistics, because today’s fun is for everybody!

So, the question of the day is “if I pick a random digit of pi (that I don’t know in advance of course), is it equally probable to be any of 0 to 9?”. Actually one could go a lot deeper into this question than I am going to, and look for correlations and all sorts of things, but we will just look at the statistics of some big samples of the first digits of pi.

Disclaimer: I am not going to calculate these digit frequencies myself, due to “effort minimisation” ;). The following data comes from this website, and I have made no effort to verify its accuracy. So with that in mind, here is a histogram of the frequencies with which the digits 0 to 9 appear in the first three billion digits of pi:

Pretty damn uniform looking, right? Let’s zoom in a bit:

Note the scale on the y axis. We are looking at fractional variations of around $10^{-4}$, or $(1/100) \%$ if you prefer. So on the face of it the hypothesis that the digits of pi are evenly distributed over 0 to 9 seems to be extremely good. However, this vague intuition is not good enough! We should do some statistics and see how good the evidence really is!

The simplest statistical test we can do is to calculate the p-value for this data, under the “null hypothesis” that the digits are randomly sampled from the options of 0 to 9, with equal probability for each option. This is the equivalent of having a bag with equal numbers of tiles with the digits 0 to 9 written on them, say 10 of each, and reaching into this bag to pull out the next digit of pi (and then putting the tile back after each draw to avoid messing up the probabilities on the next draw). This process will give you sets of numbers whose properties follow the famous binomial distribution:

$\displaystyle \text{Pr}(K = k) = {n\choose k}p^k(1-p)^{n-k}$

for $k=0,1,2,...,n$, where

$\displaystyle {n\choose k}=\frac{n!}{k!(n-k)!}$

For our case, $K$ is the number of occurrences of digit $i$ we get in our sample (i.e. one of 0-9; we are hypothesising that we get every digit equally, therefore each digit has the same distribution). So $\text{Pr}(K = k)$ is just telling us the probability that we get $k$ occurrences of digit $i$ in our sample, which is of size $n$. Finally, $p$ is the probability of drawing the digit in question from the bag on each individual draw; since we are claiming this is equal for each of the 10 possible digits, we have $p=1/10$.

The relevant properties of the binomial distribution that we need to consider are the expected value and the variance; i.e. we want to know how many occurrences of $i$ we expect to see in our sample, and the typical spread around this value a sample will have. These are

$\displaystyle \text{E}(K) = n p \text{,~~~~and~~~~} \text{Var}(K) = n p(1-p).$

Plugging in our numbers (for n = 3 billion) we get E = 0.3 billion (i.e. on average every number should appear one tenth of the time – as one knows intuitively), and Var ≈ 0.27 billion. However, I find the variance is not very intuitive; generally the standard deviation (the square root of the variance) is better, since we have an intuitive understanding of it thanks to the normal distribution. So for us the standard deviation is $\sigma = \sqrt{\text{Var}} \approx 16000$. So if our pi digits are really following this distribution we should expect the digit counts to fluctuate by about this much around E, which is a fractional variation of $\sigma/E \approx 0.5 \times 10^{-4}$, or about $(1/200)\%$, which is pretty close to the guess we eyeballed from the histograms above.
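For anyone who wants to check the arithmetic, here is a small sketch using only the standard library. The variable names are mine; the inputs are just the sample size and the $p=1/10$ from the null hypothesis.

```python
import math

n = 3_000_000_000   # number of digits in the sample
p = 1 / 10          # null-hypothesis probability of any given digit

expected = n * p                 # E(K) = n p
variance = n * p * (1 - p)       # Var(K) = n p (1 - p)
sigma = math.sqrt(variance)      # standard deviation of a digit count
fractional = sigma / expected    # typical fractional fluctuation of a count

print(f"E = {expected:.3g}, Var = {variance:.3g}")
print(f"sigma = {sigma:.0f}, sigma/E = {fractional:.2e}")
```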

There are two more pieces of information I must mention before we can actually do our statistical test. The first I alluded to a moment ago — that is, the binomial distribution is asymptotically normal, which just means that for large sample sizes we can use the normal distribution (with mean and variance given by the formulae above) instead of the binomial distribution, and we will get the same answers. Our sample size of 3 billion is vast so this approximation is essentially exact, and life will be much easier if we work with normal distributions, so we will. The second involves a property of samples from normal distributions. This is that if we take the sum of the squares of normally distributed random variables (shifted by their means and scaled by their variance), then the resulting sum is a random variable that follows the famous chi-squared distribution. In equation form, if we have $N$ normal random variables $X_i$, with mean $\mu_i$ and variance $\sigma_i^2$, then the sum I am talking about is

$\displaystyle Q = \frac{(X_1-\mu_1)^2}{\sigma_1^2} + ... + \frac{(X_N-\mu_N)^2}{\sigma_N^2}$

where $Q$ can then be considered a sample from the $\chi^2_{k=N}$ distribution, i.e. the chi-squared distribution with N “degrees of freedom”. I will not go into details about degrees of freedom here; for now it will be enough to say that in this case it is just the number of terms in the sum, the number of normal variables we are adding together. We also have an additional constraint which forces us to reduce this “degrees of freedom” by one, specifically the constraint that the occurrences of each digit have to add up to the sample size.
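A quick simulation makes the degrees-of-freedom business a little more concrete. The following is a sketch only, with sample sizes and seed chosen arbitrarily by me. It uses the fact that a $\chi^2_k$ variable has mean $k$. Note one assumption: for the constrained case I use Pearson’s form of the statistic, which divides each squared deviation by the expected count rather than by the binomial variance (the two differ by a factor of $1-p$), since that is the form whose distribution is $\chi^2$ with one fewer degree of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: N independent standard normal variables, squared and summed.
N = 10
z = rng.standard_normal(size=(200_000, N))
q_independent = (z**2).sum(axis=1)

# Case 2: digit counts drawn from the "bag", which must add up to n_draws.
n_draws = 100_000
counts = rng.multinomial(n_draws, np.full(10, 0.1), size=20_000)
expected = n_draws * 0.1
q_constrained = ((counts - expected) ** 2 / expected).sum(axis=1)

# The means should sit near the respective degrees of freedom (10 and 9).
print(q_independent.mean(), q_constrained.mean())
```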

We are now ready to do our test! So, what is it that we want to know? There are many questions we may ask of our data and each requires a different test. However, the most obvious one to ask is whether the observed digit frequencies sit around the expected values of $1/10$ with the expected spread. The common translation of this question is to ask whether the sum of the (scaled) squared deviations of the counts from their expected values is as small as we expect based on our hypothesis. So, (remembering that $k_i$ is the number of occurrences of each digit $i$ in our data, $\text{E}(K_i)$ is the expected number of occurrences, and $\text{Var}(K_i)$ is the variance) we compute the sum

$\displaystyle Q = \frac{(k_0-\text{E}(K_0))^2}{\text{Var}(K_0)} + ... + \frac{(k_9-\text{E}(K_9))^2}{\text{Var}(K_9)}$

The statistical question now solidifies into asking “assuming the null hypothesis is true, what is the probability of observing data which gives a value of $Q$ larger than the one our actual data gives?”. This quantity is known as the “p-value”. There are many issues associated with p-values, but today I will not talk about them. For now, noting that Q is supposed to be distributed according to the $\chi^2_{k=9}$ distribution if our hypothesis is true (degrees of freedom 10, minus 1 for our constraint), we can compute the p value as

$\displaystyle p(Q) = \int_Q^\infty \! \chi^2_{k=9}(q) \, \mathrm{d}q.$

For brevity, I must leave it to you to look up the chi-squared distribution on Wikipedia to see what this distribution looks like and get an idea of what this integral means. Suffice it to say it requires numerical evaluation, but any statistics package or library will have routines to do this, so just pick your favourite. I am a Python guy so I use the scipy.stats package.
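To give a flavour of what that looks like, here is a minimal sketch with `scipy.stats`. The digit counts are simulated stand-ins (I do not reproduce the real counts here), so the resulting p-value is not the one for pi. As an assumption on my part, the statistic is Pearson’s form, which divides by the expected count; it equals $(1-p)$ times the variance-scaled $Q$ written above, and is what `scipy.stats.chisquare` computes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in data: simulated counts of the digits 0-9 (NOT the real pi counts).
n = 1_000_000
counts = rng.multinomial(n, np.full(10, 0.1))

expected = n / 10
pearson = ((counts - expected) ** 2 / expected).sum()

# Upper-tail integral of the chi-squared density with 9 degrees of freedom.
p_value = stats.chi2.sf(pearson, df=9)

# The same test in one call; the default expectation is equal frequencies.
check = stats.chisquare(counts)

print(pearson, p_value)
print(check.statistic, check.pvalue)
```

The survival function `chi2.sf` is just one minus the cumulative distribution function, i.e. the integral from $Q$ to infinity.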

So now I will skip to the punchline, which is that according to this test, with the null hypothesis as described and using the data I stole from the linked website, I compute

$\displaystyle p = 0.33$

This means that there is a $33\%$ chance of seeing digit occurrences with larger average (roughly speaking) deviations from the predicted mean in the idealised “bag-drawn” random number, than we observed in the first 3 billion digits of pi (and $67\%$ chance of seeing smaller such deviations). This is not very surprising at all if the bag-drawn model is valid, so the classical interpretation is that this data contains no evidence for rejecting the null hypothesis. It does not let us say that indeed the digits of pi are randomly sampled in this way, only that this test provides no evidence that they are not.

The final warning is that this test considers only the value of $Q$. There are many ways that digits could appear in pi that would be extremely un-random looking, yet would give identical values of $Q$ to a set of digits that really was random, i.e. $Q$ is a global property of the data, and will tell you nothing about local patterns. To look into these more intricate details we would have to devise yet more tests!

Why is extrapolation “riskier” than interpolation?

You probably feel like you already know the answer to the title question, but how rigorously can you justify it? This question has annoyed me for some time now and I have never quite been able to elucidate the Bayesian answer, however I think I now have it; and of course it turns out to not be so complicated after all.

(edit: that said, this post has become very long since I take some time to get to my point, so you have my apologies for that)

(edit 2: I take it all back; I have not progressed much at all in answering this question. Oh well. Nevertheless I leave my explorations of it below for you to ponder.)

Background

To give ourselves a concrete scenario in which to work, consider the following. We are in the undergraduate physics laboratories, and are doing some resistivity measurements on a doped Germanium sample. We heat the thing up in an oven and measure its resistivity as we go. There are all kinds of sources of uncertainty to consider, but we’ll forget about all that and assume some Gaussian uncertainties for our temperature and resistivity measurements. The following data is obtained (with arbitrary y units: pretend they are something meaningful):

Fake resistivity of Germanium measurements (room – oven temperature)

What is your first instinct when you see this data? (nb. if you know things about semiconductors then pretend that you are an undergraduate who does not, for now ;))

Straight line? Ok, well let’s do the naïve thing and see how well that fits, using a $\chi^2$ test. Let’s assume the x error is small enough to ignore. Our test statistic is then

$\displaystyle X^2 = \sum\limits_{i=1}^n \frac{(\rho_{\mu_i}(\hat{\theta}) - \rho_i)^2}{\sigma_{\rho_i}^2}$

where $\rho_i$ is the measured resistivity at data point $i$, $\sigma_{\rho_i}$ is the standard deviation of resistivity measurement $\rho_i$ (which hopefully we have estimated well), and $\rho_{\mu_i}(\hat{\theta})$ is the resistivity predicted at data point $i$ by our best fit straight line model (with parameters $\hat{\theta}$). The statistical test is simply to calculate $X^2$ using the actually observed data and estimated uncertainties, and compute the probability of observing a value of $X^2$ this large or larger if our data indeed was generated from this idealised best-fit statistical model, i.e. assuming independent measurements, with Gaussian errors of the size we estimated, ignoring the temperature measurement uncertainty. In these circumstances $X^2$ is sampled from a $\chi^2$ distribution with degrees of freedom (k) equal to the number of data points we have, minus two for the two fitted parameters (although see here for some hairy issues to avoid in computing degrees of freedom). The probability in question is thus a p-value, computed by

$\displaystyle p = \int_{X^2_{obs}}^{\infty} \! \chi^2_{k=14}(x) \, \mathrm{d}x$

(which is just what I described above, in equation form). For the above data this analysis gives us:

Linear best fit (minimum chi^2) to fake room-oven temperature resistivity data

Best fit chi2 = 15.9860008544
Best fit chi2/dof = 1.14185720389
Best fit p-value = 0.314229940653
Number of "sigmas" = 1.00638596851

p=0.31 counts as zero evidence against the null hypothesis (straight line) in anyone’s book. I threw in the equivalent number of “sigmas” for the particle physicists out there (computed from $n=\left|\Phi^{-1}(p/2)\right|$, or $n=\sqrt{\Phi_{\chi^2}^{-1}(1-p)}$ if you prefer – where $\Phi^{-1}$ and $\Phi_{\chi^2}^{-1}$ are the inverse cumulative distribution functions of the standard normal and $\chi^2_{k=1}$ distributions respectively), which we see is a most reasonable $1\sigma$.
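Here is a rough sketch of how numbers like these can be produced with `numpy` and `scipy.stats`. The first part only takes the printed $\chi^2$ and 14 degrees of freedom and converts them to a p-value and a number of sigmas, which should land close to the printed values. The second part fits a straight line to synthetic data; the temperatures, slope, intercept and error bars there are all invented for illustration and are not the data in the plot.

```python
import numpy as np
from scipy import stats


def p_to_sigmas(p):
    """Two-sided Gaussian equivalent of a p-value: n = |Phi^{-1}(p/2)|."""
    return stats.norm.isf(p / 2)


# Part 1: convert the printed chi2 and dof into a p-value and "sigmas".
chi2_obs, dof = 15.9860008544, 14
p = stats.chi2.sf(chi2_obs, dof)
n_sigma = p_to_sigmas(p)
n_sigma_alt = np.sqrt(stats.chi2.isf(p, 1))  # the chi-squared (k=1) version
print(p, n_sigma, n_sigma_alt)

# Part 2: weighted straight-line fit to synthetic (made-up) data.
rng = np.random.default_rng(2)
temperature = np.linspace(300.0, 450.0, 16)       # hypothetical x values
sigma_rho = np.full(temperature.size, 0.5)        # hypothetical error bars
rho = 20.0 - 0.03 * temperature + rng.normal(0.0, sigma_rho)

# polyfit weights multiply the residuals, so 1/sigma gives a chi2 minimisation.
slope, intercept = np.polyfit(temperature, rho, 1, w=1 / sigma_rho)
chi2_fit = np.sum(((slope * temperature + intercept - rho) / sigma_rho) ** 2)
dof_fit = temperature.size - 2                    # two fitted parameters
p_fit = stats.chi2.sf(chi2_fit, dof_fit)
print(chi2_fit, chi2_fit / dof_fit, p_fit, p_to_sigmas(p_fit))
```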

Great! But how happy are we that this straight line is really a good model for the data? How much do we trust it, where do we expect it to break down, and why?

Bayesian prediction

So first, interpolation. What are we really doing when we do this mentally? It would seem that we assess a wide variety of models that might explain the data, and make a prediction on the basis of some kind of consensus among the models. (This is of course related to my post about induction). In the Bayesian language we are computing

$\displaystyle P(x_{new}|x_{old}) = \sum\limits_{i} P(x_{new}|M_i, x_{old}) P(M_i|x_{old})$

where $P(x_{new}|M_i, x_{old}) = P(x_{new}|M_i)$ is just the ordinary likelihood, if the data $x$ are statistically independent, and $P(M_i|x_{old})$ is the posterior probability of model $M_i$ given the already known data $x_{old}$. Interestingly, we could also write this “global predictive distribution” as

$\displaystyle P(x_{new}|x_{old}) = \frac{P(x_{new},x_{old})}{P(x_{old})}$

where $P(x) = \sum\limits_{i} P(x|M_i) P(M_i)$, similarly to before, except now $P(M_i)$ is the prior for $M_i$ (dependent on prior information). Old-hat Bayesians will recognise $P(x)$ as the “evidence” or “marginalised likelihood”, which we often toss away in hypothesis tests (very much the same as the more restricted evidence we might compute in a restricted subset of the hypothesis space, i.e. the parameter space of some model).
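To see that the two ways of writing the predictive distribution agree, here is a toy sketch. Everything in it is an assumption made for illustration: three “models” that each simply assert a different mean for unit-variance Gaussian data, equal priors, and a handful of invented data points.

```python
import numpy as np
from scipy import stats

# Toy setup (all hypothetical): model M_i says the data are N(mean_i, 1).
model_means = np.array([0.0, 1.0, 2.0])
prior = np.array([1 / 3, 1 / 3, 1 / 3])
x_old = np.array([0.8, 1.3, 0.9])   # already observed data
x_new = 1.1                         # a possible future measurement


def likelihood(x, mean):
    """P(x | M) for independent unit-variance Gaussian data."""
    return np.prod(stats.norm.pdf(x, loc=mean, scale=1.0))


like_old = np.array([likelihood(x_old, m) for m in model_means])
like_new = np.array([likelihood(x_new, m) for m in model_means])

# First form: average the likelihood of x_new over the model posterior.
posterior = like_old * prior / np.sum(like_old * prior)
predictive = np.sum(like_new * posterior)

# Second form: ratio of the joint evidence to the evidence for x_old alone.
evidence_old = np.sum(like_old * prior)
evidence_joint = np.sum(like_old * like_new * prior)
predictive_ratio = evidence_joint / evidence_old

print(posterior, predictive, predictive_ratio)
```

The two numbers agree because, for independent data, $P(x_{new},x_{old}|M_i)$ factorises into the product of the two likelihoods.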

The likelihoods $P(x|M_i)$ can be translated directly into goodness of fit measures such as $X^2$, so it is clear that models must fit the old data before they contribute to our prediction for the new data. However, whacked out “wiggly” models such as I alluded to previously can do this just fine. For example, here is a selection of curves of the form

$\displaystyle y = A + B x + C \exp\left(-\frac{(x-\mu)^2}{\sigma^2}\right)$

(i.e. straight lines with Gaussian bumps) which fit the data at the $1\sigma$ level:

1 sigma Gaussian bump fits to the data

From this image it is evident that, based on goodness of fit alone, we can justify almost any prediction (since these curves cover almost the whole plane, and they are just one of infinitely many alternative model classes), and there is vast room for horrible errors to occur in interpolation.

However, notice something interesting about the figure. As evidenced by the increased overlap (darker blue) in the “bump free” region and in the “extrapolated” region, there are in some sense more curves that predict future data to be measured in these regions. If true this would mesh nicely with the above picture of prediction through model averaging, and seems a good candidate for a legitimate reason to favour the straight line models, and only in the “interpolation” region.

However, we now get to the tricky part, and it is the same tricky part we always encounter in the Bayesian game. Any measure we put on the global model space in order to define whether or not there are “more” curves predicting one thing than another equates directly to a prior, and the philosophical foundations for choosing priors remain shaky, particularly in the global hypothesis space (there are some convincing arguments for what to do given certain classes of models, however).

Nevertheless, if we forget about such troubles for the moment and restrict our allowed models to just these funny straight lines with bumps — and you allow me a prior that is Gaussian in $A$ and $B$ (centered on the best fit values for the line alone – say we had some good prior information which told us to search near these values), and also in $C$ (restricted to bumps which don’t go too far off the top of the plot); a flat prior in $\mu$ (forcing bumps to be vaguely in the temperature range of the plot); and logarithmically flat in $\sigma$, restricted to stop the bumps being too extremely thin or fat — then we may blindly charge ahead and see what inferences can be drawn.
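In code, the model and a prior of this general shape might look like the sketch below. All of the numbers (prior centres, widths and ranges) are placeholders I have invented so that the example runs; they are not the values used for the plots in this post.

```python
import numpy as np


def line_plus_bump(x, A, B, C, mu, sigma):
    """Straight line plus a Gaussian bump, as in the formula above."""
    return A + B * x + C * np.exp(-((x - mu) ** 2) / sigma**2)


def sample_prior(rng, size):
    """Draw parameter sets from a prior of the kind described in the text.
    Every numerical value here is a hypothetical placeholder."""
    A = rng.normal(20.0, 1.0, size)        # Gaussian, centred near the line fit
    B = rng.normal(-0.03, 0.005, size)     # Gaussian, centred near the line fit
    C = rng.normal(0.0, 2.0, size)         # Gaussian, keeps bumps on the plot
    mu = rng.uniform(300.0, 450.0, size)   # flat over the temperature range
    # Logarithmically flat in sigma: uniform in log(sigma) between two bounds.
    sigma = np.exp(rng.uniform(np.log(1.0), np.log(100.0), size))
    return A, B, C, mu, sigma


rng = np.random.default_rng(3)
A, B, C, mu, sigma = sample_prior(rng, 1000)
x = np.linspace(300.0, 450.0, 50)
# One curve per prior sample, evaluated on the grid (shape: 1000 x 50).
curves = line_plus_bump(x[None, :], A[:, None], B[:, None], C[:, None],
                        mu[:, None], sigma[:, None])
print(curves.shape)
```

Weighting such prior draws by their likelihood is one simple (if inefficient) way to approximate a posterior predictive distribution.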

So what do we get for the predictive distributions for future data? This:

Bayesian posterior predictive distributions for the “linear plus bump” model (priors described in the text).

The colours reflect the predictive probability $P(\rho_{new}|\rho_{old})$, as a distribution over $\rho_{new}$ (the y axis, essentially), with there being a separate such distribution for each temperature we might choose to measure at. Clearly this does not predict that we will measure any of the narrow bumps which, in principle, could fit the data, since these predictions occupy too small a portion of the hypothesis space in question. One notices that the predictions get shakier in the extrapolated region, but we would get essentially the same thing if the “bump” part of the model was cut out entirely: it merely reflects the uncertainty in the gradient of the linear part of the model. I actually expected to see a bit more influence of the wider bumps that can fit in this extrapolated region, but I guess they too are overpowered by the straight line predictions that come from every other model whose bump is either small or just at a different $T$ region.

Now, there are of course infinitely many other models which might fit this data, and it is difficult to say how they should enter the model average and change this picture. We could cook up any crazy piecewise functions that change discontinuously in the regions where we have no data, and again all such models can fit just as well as a straight line. If we cook things up correctly then it seems plausible that the parts of the ensuing parameter space where the fit is good may involve large volumes that are very different from the straight line, and so throw the predictive distribution right out. Yet, intuitively, it does seem like “cooking” is required; it seems like there is something “unnatural” about these models, something contrived; that we have to especially design them so as to make their predictive distributions different from straight lines.

Here is a second example: a 6th order polynomial. This has 7 parameters, and has a very complicated likelihood surface. In fact it is so complicated that I had a lot of trouble getting my search algorithms to converge. In the end I had to move away from the “standard form” parameterisation, $y=a+bx+cx^2+dx^3+...$ etc, and move to a somewhat strange parameterisation:

$\displaystyle y = C(x-b_1)(x-b_2)(x-b_3)...(x-b_6) + A x + B$

where $A$ and $B$ were not parameters, but were constants chosen to be near the best fit parameters for the straight line model. This parameterisation thus looks at 6th order polynomials in terms of their deviation from this linear best fit, in some sense. It doesn’t cover the full space of 6th order polynomials because I only allowed real parameters, and some of the roots of the first term can easily be complex for polynomials in general. The mixing of the various parameters when you expand this expression into standard form might also be killing the independence of the parameters somewhat – I haven’t thought about it too carefully. These issues might wreck the exercise a little since it probably biases predictions towards the linear fit, but anyway, here is what we get (with linear priors on the roots, and a very wide Gaussian prior centred on zero for $C$):

1 sigma 6th order polynomial fits to the data

Bayesian posterior predictive distributions for the (restricted) 6th order polynomial model (priors described in the text).

We get a very nice “funnelling” of the good-fitting polynomials through the data, but again the predictions seem not to take us far from the linear model. There is somewhat more outwards “bleeding” of the probability than we saw in the Gaussian bump case, however, and keeping in mind that the colormap is linear in posterior probability (separately for each T value) this is not small enough to totally neglect.
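For reference, the relationship between the root-based parameterisation and the standard form can be checked numerically. This is a sketch with made-up values for $C$, $A$, $B$ and the roots; the helper names are mine.

```python
import numpy as np


def poly_from_roots(x, C, roots, A, B):
    """Evaluate y = C (x - b_1)...(x - b_6) + A x + B directly."""
    return C * np.prod([x - b for b in roots], axis=0) + A * x + B


def standard_form_coefficients(C, roots, A, B):
    """Expand into standard-form coefficients, highest power first."""
    coeffs = C * np.poly(roots)   # C times the monic polynomial with these roots
    coeffs[-2] += A               # add A to the linear coefficient
    coeffs[-1] += B               # add B to the constant term
    return coeffs


# Hypothetical values, chosen only so the example runs.
C, A, B = 0.5, -0.03, 20.0
roots = np.array([-1.5, -1.0, -0.2, 0.4, 1.1, 1.8])

x = np.linspace(-2.0, 2.0, 9)
coeffs = standard_form_coefficients(C, roots, A, B)
print(coeffs)
print(np.allclose(poly_from_roots(x, C, roots, A, B), np.polyval(coeffs, x)))
```

Each standard-form coefficient is a combination of several roots, which is the “mixing” of parameters mentioned above.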

So, is this the answer to our question then? There are just “more” possible ways that data can be extrapolated than interpolated? Certainly this seems to be true if we force a certain amount of smoothness onto our models, but is that “cheating”? Perhaps. However, it may be the case that the “pointy” models are also more “rare” in some sense, and so won’t contribute much to our posterior inferences even if we do allow them.

When I began this post I thought I had a better answer than this up my sleeve. I had planned to look at the influence of uncertainty in the x direction on our inferences, having had the idea that there might even be a “frequentist” sense in which the pointy models are less probable (since we would have had to be “unlucky” in some sense to have chosen our data samples in just such a way as to “miss” the patterns those pointy models would usually display). But I am not sure about that anymore, and I have been writing this post for about a week, when I intended to knock it off in a couple of hours. So I think I am just going to call it quits, and perhaps return to this topic at another time. I hope it has been at least somewhat interesting for those of you who for one reason or another read it all :).

Before I finish though, I feel that I cannot close this post without mentioning Solomonoff induction. I don’t have time to go into details, so I will point you to Wikipedia (which has a wishy-washy no-math article, but gives the gist of things), and this one from “less wrong”, which still has no math but describes the concept very well (although it is very lengthy). Perhaps I will come back and update this post to include better references later.

Anyway I mention it because I am not sure that we can fully answer the title question without heading down this road, of attacking the problem of induction itself and trying to understand how we can make any valid inferences at all in the absence of solid prior information. Sigh.

Still, perhaps the frequentist angle I mentioned can help somewhat. I will investigate it and get back to you if I think of anything.