Thursday, August 11, 2011

Testing Intuitions about Markov Chain Monte Carlo: Do I have a bug?

Posted by Danny Tarlow
For one project I've been working on recently, I'm using a Markov Chain Monte Carlo (MCMC) method known as slice sampling. There are some good tutorials, examples, and implementations out there (e.g., by Iain Murray or Radford Neal), but for various reasons, I wanted to implement my own version.

Now, debugging MCMC algorithms is somewhat troublesome, due to their random nature and the fact that chains just sometimes mix slowly, but there are some good ways to be pretty sure that you get things right. For example, the Geweke method is highly regarded as _the_ method to make sure you're getting it right. So this exercise is not actually really about debugging. It's more about testing intuitions about the behavior of a sampler.

With that out of the way, on to the question:
I implemented my sampler, initialized it with small random numbers for the parameters, and set it running on a simple test model (which I'm intentionally not describing in detail). One high level statistic that is relevant to look at is the (log) probability of samples versus iteration of the sampler, so I made that plot. It looks like this:
This plot looks a bit surprising. Upon initialization, the sampler moves directly to regions of space that have very low probability (remember, this is a _log_ probability*), and it appears to just keep going to exponentially less and less probable regions. The point of a sampler is that it should spend an amount of time in a state in proportion to the state's probability. And this sampler is making a beeline to a state that is e^-600 times less probable than where it started.

So here's the question: do I have a bug? In other words, if you were my supervisor and I came to you with this plot, would you dismiss this plot and send me back to debugging. If not, explain how this possibly could make sense.

I'll post my answer sometime in the next couple days.

* I'm leaving out constants, so the graph would be shifted down (but wouldn't change shape or scale) if I was including all the constants.

Update: This is long overdue, but Iain Murray nailed it in the comments:
No there isn’t (necessarily) a bug. This type of plot is very easily produced with valid code: e.g. by slice sampling a unit spherical Gaussian distribution in D=5000 dimensions and initializing at an atypical point of high-probability (much closer to the origin than sqrt(D) away). Simple Metropolis and slice samplers can’t change the log-prob by more than ≈1 per iteration, so large log-prob changes are slow and painful. Intelligent proposals, reparameterizations, or auxiliary variable methods can improve matters. This is a nice illustration that initializing with an optimizer (without some form of early stopping) can be a bad idea!
In fact, I was using Matlab's built-in slice sampler, along with a zero-mean, spherical, many thousand-dimensional Gaussian distribution for the likelihood, and initializing near the mode (0).


Alexandre Passos said...

I've had many bugs that look exactly like this. The standard way to have them in Metropolis-Hastings samplers seems to be having a proposal distribution that almost always proposes a worse state and tuning the step size to make the acceptance raste very high. For example, gaussian proposal distributions in bayesian neural networks with very small variance. I've seen this happen with Gibbs samplers, but it's rarer, and I've never seen it happen to slice samplers.

Technically it's not a bug as you can have this kind of behavior and theoretically your sampler should still converge to the true distribution, but I think in pratice this means you're not mixing and traversing the likelihood via a very bad random walk, which in high-dimensional spaces gives you a horrible bound on the number of steps to traverse the posterior.

steppe2205 said...

If it isn't a bug, it may be due to local modes of the distribution. Your posterior have a local mode that has a very high peak but the probability of the zone around it is small, and your starting point falls intentionally on this peak for example. Another explication is that your chain is going through a low-probability zone and it isn't long enough. In all cases I think that the posterior is very bad; the prior and likelihood funtion are very far. I've been in this situation and i tried with others priors; and it works:D

Danny Tarlow said...

I can give some additional information, which is that the posterior has only one mode.

Iain Murray said...

I'll ROT13 my answer for those that want to think about the puzzle:

Ab gurer vfa’g (arprffnevyl) n oht. Guvf glcr bs cybg vf irel rnfvyl cebqhprq jvgu inyvq pbqr: r.t. ol fyvpr fnzcyvat n havg fcurevpny Tnhffvna qvfgevohgvba va Q=5000 qvzrafvbaf naq vavgvnyvmvat ng na nglcvpny cbvag bs uvtu-cebonovyvgl (zhpu pybfre gb gur bevtva guna fdeg(Q) njnl). Fvzcyr Zrgebcbyvf naq fyvpr fnzcyref pna’g punatr gur ybt-cebo ol zber guna ≈1 cre vgrengvba, fb ynetr ybt-cebo punatrf ner fybj naq cnvashy. Vagryyvtrag cebcbfnyf, ercnenzrgrevmngvbaf, be nhkvyvnel inevnoyr zrgubqf pna vzcebir znggref. Guvf vf n avpr vyyhfgengvba gung vavgvnyvmvat jvgu na bcgvzvmre (jvgubhg fbzr sbez bs rneyl fgbccvat) pna or n onq vqrn!