Posted by Danny TarlowIn this day and age, we're used to tackling problems where one processor just isn't enough.
The standard solution, of course, is to use more processors. Frameworks like MapReduce (and Hadoop) give us an abstraction where a master process spawns off a bunch of worker processes, each processor does some work on its own, then they all report their piece of results, which are combined together by the master process at the end to give the answer.
This works great when the pieces of the problem are independent-- that is, the workers don't need to communicate with each other in order to do their job. If your goal is to compute the average of a huge list of numbers, this works great: each processor reports the sum of numbers in its part of the list and the number of elements. If your goal is to compute the median, then it doesn't work so easily. (Think about it.)
So what do we do when work that the processors are doing isn't independent? Imagine that there are a bunch of interacting pieces to your model. For example, in computer vision, a common problem is to label each pixel in an image with the class of object at that pixel. We shouldn't expect to be able to label each pixel independently, because we know that objects in the world are "blobby", so knowing the value of one pixel will tell you something about the likely values of nearby pixels.
As another example, consider the problem of trying to label a webpage with a category representing what it's about. You could try to do it independently--just looking at the words and images on the page. However, a better approach would probably be to take into account dependencies between pages: if page A links to page B, that probably tells us something about the relationship between the category of A and the category of B. This could be especially useful for the cases where a page itself has few words, but it's linked to by many other pages.
In machine learning, perhaps our most powerful formalism to represent these types of interaction is the probabilistic graphical model. The power comes in the ability to precisely specify the structure of interactions in our problem via a graph (or hypergraph). We can then design message passing algorithms that take advantage of this structure: each region does some local computation, and then parts of the model that directly interact with each other exchange messages. In this way, information is propagated amongst the workers, but computation is still mostly done locally. If you've read about Google's Pregel infrastructure, it probably sounds pretty similar.
Unfortunately, there are certain fundamental limitations to doing things completely locally. The interesting thing is that the same fundamental issue comes up in a variety of settings.
In message passing algorithms (e.g., http://en.wikipedia.org/wiki/Belief_propagation), neighboring regions send messages to each other to try to enforce local consistency--that is, when two regions have beliefs about overlapping variables, they pass messages to each other to try to ensure that they agree on the value of the overlapping variables. You would hope that if you can find a solution that is locally consistent everywhere, that this would imply that there is a globally consistent solution.
Interestingly, this isn't the case. In MAP inference, this crops up in the looseness of the standard linear programming relaxation, which has constraints that enforce local but not global consistency. In constraint satisfaction problems, this comes up in the incompleteness of arc-consistency algorithms: there are certain unsatisfiable constraint satisfaction problems that arc-consistency, a local consistency checking algorithm, cannot recognize as inconsistent.
I was discussing these connections at my tea talk the other day, and Geoff Hinton pointed out that this same phenomenon also comes up in the impossible triangle optical illusion. I think that's kind of cool -- people also get confused when things look sufficiently consistent locally but actually are inconsistent globally. Here's a diagram showing my interpretation of his comment along with the constraint satisfaction example.
Note that when you look at each corner independently (top row), everything looks consistent in isolation. However, when you look at the figure as a whole (bottom left), there is no globally consistent interpretation. In the constraint satisfaction example (bottom right), bold black edges represent legal combinations of variables, while missing edges represent illegal combinations. In this case, the arc-consistency algorithm terminates without finding an inconsistency, but if you work through the possibilities, you'll see that there is no legal assignment.
Now this isn't to say things are hopeless. Even though our locally consistent solutions can be very bad in the worst case, they often are good enough to be useful. If we want globally consistent exact solutions, we have lift and project methods from theoretical computer science, which show how to gradually move from local to global consistency. The big three frameworks are covered here:
A Comparison of the Sherali-Adams, Lovász-Schrijver and Lasserre Relaxations for 0-1 Programming
In the worst case, these methods become exponentially expensive as you move to become more global (which is not surprising, since we can represent many NP-hard problems with graphical models). However, it turns out that in practice, the real-world problems that we are interested are not the hardest computational problems. In some recent work, David Sontag and company have shown that just making things a little more global can be enough to exactly solve hard protein folding and protein design problems from computational biology: D. Sontag, T. Meltzer, A. Globerson, Y. Weiss, T. Jaakkola. Tightening LP Relaxations for MAP using Message Passing. Uncertainty in Artificial Intelligence (UAI) 24, July 2008.
I should also (shamelessly) point out that global interactions don't always have to amount to intractable models. Some of my own recent work focuses on how we can get the best of both worlds when we have parts of the model that interact locally along with some additional global interactions. It turns out that you can develop very efficient message passing algorithms in some of these cases as well: D. Tarlow, I. Givoni, and R. Zemel. HOP-MAP: Efficient Message Passing with High Order Potentials. The 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010)