*Posted by Danny Tarlow*

The standard solution, of course, is to use more processors. Frameworks like MapReduce (and Hadoop) give us an abstraction where a master process spawns a bunch of worker processes, each worker does some work on its own, and then they all report their pieces of the result, which the master combines at the end to give the answer.

This works great when the pieces of the problem are independent--that is, when the workers don't need to communicate with each other in order to do their jobs. Computing the average of a huge list of numbers fits this mold: each worker reports the sum of the numbers in its part of the list along with the number of elements, and the master combines them. Computing the median, on the other hand, doesn't decompose so easily. (Think about it.)
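To make the averaging example concrete, here's a minimal sketch of the map/reduce pattern in plain Python. The sharding scheme and the `worker` function are my own illustration, not any particular framework's API; a real system would run the map phase in parallel across machines.

```python
def worker(chunk):
    # Map phase: each worker summarizes its shard independently;
    # no communication with other workers is needed.
    return sum(chunk), len(chunk)

def distributed_mean(data, n_workers=4):
    shard_size = -(-len(data) // n_workers)  # ceiling division
    shards = [data[i:i + shard_size] for i in range(0, len(data), shard_size)]
    partials = [worker(s) for s in shards]   # stand-in for the parallel map
    # Reduce phase: the master combines the (sum, count) pairs.
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count

print(distributed_mean(list(range(10))))  # 4.5
```

The key property is that each shard compresses to a constant-size summary `(sum, count)` that the master can merge. The median admits no such small per-shard summary, which is exactly what makes it the harder case.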

So what do we do when the work the processors are doing isn't independent? Imagine that your model has a bunch of interacting pieces. For example, in computer vision, a common problem is to label each pixel in an image with the class of the object at that pixel. We shouldn't expect to be able to label each pixel independently: objects in the world are "blobby", so knowing the value of one pixel tells you something about the likely values of nearby pixels.

As another example, consider the problem of trying to label a webpage with a category representing what it's about. You could try to do it independently--just looking at the words and images on the page. However, a better approach would probably be to take into account dependencies between pages: if page A links to page B, that probably tells us something about the relationship between the category of A and the category of B. This could be especially useful for the cases where a page itself has few words, but it's linked to by many other pages.

In machine learning, perhaps our most powerful formalism to represent these types of interaction is the probabilistic graphical model. The power comes in the ability to precisely specify the structure of interactions in our problem via a graph (or hypergraph). We can then design message passing algorithms that take advantage of this structure: each region does some local computation, and then parts of the model that directly interact with each other exchange messages. In this way, information is propagated amongst the workers, but computation is still mostly done locally. If you've read about Google's Pregel infrastructure, it probably sounds pretty similar.

Unfortunately, there are certain fundamental limitations to doing things completely locally. The interesting thing is that the same fundamental issue comes up in a variety of settings.

In message passing algorithms (e.g., http://en.wikipedia.org/wiki/Belief_propagation), neighboring regions send messages to each other to try to enforce local consistency--that is, when two regions have beliefs about overlapping variables, they pass messages to try to ensure that they agree on the values of the overlapping variables. You would hope that finding a solution that is locally consistent everywhere would imply that there is a globally consistent solution.

Interestingly, this isn't the case. In MAP inference, this crops up in the looseness of the standard linear programming relaxation, which has constraints that enforce local but not global consistency. In constraint satisfaction problems, this comes up in the incompleteness of arc-consistency algorithms: there are certain unsatisfiable constraint satisfaction problems that arc-consistency, a local consistency checking algorithm, cannot recognize as inconsistent.

http://en.wikipedia.org/wiki/Local_consistency#Arc_consistency
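To see the LP looseness concretely, here's a toy example of my own (not taken from any of the papers mentioned here): a triangle of binary variables where each edge earns reward 1 if its endpoints disagree. The best integral assignment satisfies only two of the three edges, but the local-consistency LP achieves 3 by setting every pseudo-marginal to 1/2. This is a sketch using `scipy.optimize.linprog`; the index helpers are hypothetical names of mine.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

edges = [(0, 1), (1, 2), (0, 2)]            # a frustrated triangle
nvars = 3
N = 2 * nvars + 4 * len(edges)              # node + edge pseudo-marginals

def node_idx(i, a):
    return 2 * i + a

def edge_idx(e, a, b):
    return 2 * nvars + 4 * e + 2 * a + b

A_eq, b_eq = [], []
for i in range(nvars):                      # each node marginal sums to 1
    row = np.zeros(N)
    row[[node_idx(i, 0), node_idx(i, 1)]] = 1
    A_eq.append(row); b_eq.append(1.0)
for e, (i, j) in enumerate(edges):          # LOCAL consistency only: each edge
    for a in range(2):                      # marginal must match its endpoints
        row = np.zeros(N)
        row[[edge_idx(e, a, 0), edge_idx(e, a, 1)]] = 1
        row[node_idx(i, a)] = -1
        A_eq.append(row); b_eq.append(0.0)
        row = np.zeros(N)
        row[[edge_idx(e, 0, a), edge_idx(e, 1, a)]] = 1
        row[node_idx(j, a)] = -1
        A_eq.append(row); b_eq.append(0.0)

c = np.zeros(N)                             # reward 1 per disagreeing edge
for e in range(len(edges)):                 # (linprog minimizes, so negate)
    c[edge_idx(e, 0, 1)] = c[edge_idx(e, 1, 0)] = -1.0

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=[(0, 1)] * N)
lp_value = -res.fun
best_int = max(sum(x[i] != x[j] for i, j in edges)
               for x in product((0, 1), repeat=nvars))
print(lp_value, best_int)  # the relaxed optimum (3) beats the integral one (2)
```

Every pair of pseudo-marginals here is locally consistent, but the fractional solution corresponds to no actual distribution over joint assignments--the LP analog of the impossible triangle.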

I was discussing these connections at my tea talk the other day, and Geoff Hinton pointed out that this same phenomenon also comes up in the impossible triangle optical illusion. I think that's kind of cool -- people also get confused when things look sufficiently consistent locally but actually are inconsistent globally. Here's a diagram showing my interpretation of his comment along with the constraint satisfaction example.

Note that when you look at each corner independently (top row), everything looks consistent in isolation. However, when you look at the figure as a whole (bottom left), there is no globally consistent interpretation. In the constraint satisfaction example (bottom right), bold black edges represent legal combinations of variables, while missing edges represent illegal combinations. In this case, the arc-consistency algorithm terminates without finding an inconsistency, but if you work through the possibilities, you'll see that there is no legal assignment.
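For readers who want to poke at the constraint satisfaction side of this, here's a small sketch of arc consistency (an AC-3-style revise loop; the function names are mine) applied to a triangle of binary variables where every pair must differ--the same structure as the diagram. Arc consistency finds nothing to prune, yet brute force shows there is no legal assignment:

```python
from itertools import product

def revise(domains, constraints, x, y):
    """Drop values of x that have no supporting value in y's domain."""
    allowed = constraints[(x, y)]
    pruned = False
    for vx in list(domains[x]):
        if not any((vx, vy) in allowed for vy in domains[y]):
            domains[x].discard(vx)
            pruned = True
    return pruned

def ac3(domains, constraints):
    queue = list(constraints)
    while queue:
        x, y = queue.pop()
        if revise(domains, constraints, x, y):
            if not domains[x]:
                return False  # empty domain: provably unsatisfiable
            queue.extend((z, w) for (z, w) in constraints if w == x and z != y)
    return True  # arc-consistent, but not necessarily satisfiable!

# Triangle of binary variables, every pair constrained to differ.
neq = {(0, 1), (1, 0)}
domains = {v: {0, 1} for v in "ABC"}
constraints = {(x, y): neq for x in "ABC" for y in "ABC" if x != y}

print(ac3(domains, constraints))   # True: no local inconsistency found
print([s for s in product((0, 1), repeat=3)
       if s[0] != s[1] and s[1] != s[2] and s[0] != s[2]])  # []: no solution
```

Locally, every value of every variable has a supporting value in each neighbor, so the algorithm terminates happily; it's only when you chase the implications around the whole cycle that the contradiction appears.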

Now, this isn't to say things are hopeless. Even though locally consistent solutions can be very bad in the worst case, they are often good enough to be useful. And if we want globally consistent exact solutions, we have lift-and-project methods from theoretical computer science, which show how to gradually move from local to global consistency. The big three frameworks are compared here:

A Comparison of the Sherali-Adams, Lovász-Schrijver and Lasserre Relaxations for 0-1 Programming

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.893

In the worst case, these methods become exponentially expensive as you move toward global consistency (which is not surprising, since we can represent many NP-hard problems with graphical models). However, it turns out that in practice, the real-world problems we are interested in are not the hardest computational problems. In some recent work, David Sontag and company have shown that just making things a little more global can be enough to exactly solve hard protein folding and protein design problems from computational biology: D. Sontag, T. Meltzer, A. Globerson, Y. Weiss, T. Jaakkola. Tightening LP Relaxations for MAP using Message Passing. Uncertainty in Artificial Intelligence (UAI) 24, July 2008.


I should also (shamelessly) point out that global interactions don't always have to amount to intractable models. Some of my own recent work focuses on how we can get the best of both worlds when we have parts of the model that interact locally along with some additional global interactions. It turns out that you can develop very efficient message passing algorithms in some of these cases as well: D. Tarlow, I. Givoni, and R. Zemel. HOP-MAP: Efficient Message Passing with High Order Potentials. The 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010)


## 8 comments:

Sontag has some new work in EMNLP 2010 that you might be interested in: "On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing".

Have you guys thought about whether your HOP potential approach carries over to sum-product?

Hi Yaroslav,

Good question. The answer is that whether it is tractable depends on the type of factor. For cardinality potentials, this paper gives a formulation that works for sum-product:

Efficient Belief Propagation for Vision Using Linear Constraint Nodes

Brian Potetz

CVPR (IEEE Conference on Computer Vision and Pattern Recognition), June 2007

http://www.cs.cmu.edu/~bpotetz/Publications/bpotetz_ConstraintNodes_cvpr07.pdf

For the sum-product analog of composite factors (as we call them), take a look at Gates by Tom Minka and John Winn.

http://research.microsoft.com/apps/pubs/default.aspx?id=78856

As I guess is obvious just from the names of the algorithms, it's a question of maximization versus summing, so in general, if you can come up with an algorithm that sums (counts) rather than maximizes, you can come up with sum-product updates.

There are two potential problems, as I see it.

(1) In general, you'll run into #P-hard subproblems. For example, in our NIPS 2006 paper, we essentially defined a high order factor where max-product message updates require solving a bunch of maximum weight bipartite matching problems. You can use dynamic algorithms to do this efficiently, but if you tried to compute the sum-product updates for this factor, I believe it would be #P-hard (by reduction from counting the number of matchings; though, as an aside, there appears to have been some interesting recent work on approximate counting of bipartite matchings: http://dukespace.lib.duke.edu/dspace/handle/10161/1054).

(2) In our AISTATS paper, we get an extra O(N) efficiency out of our max-product updates by using clever dynamic algorithms rather than solving the O(N) related-but-distinct problems required to compute messages. To get the same win from sum-product, you'd need to think about dynamic counting algorithms. I'm not sure this is a well-developed area, though. Do you know of any interesting such algorithms?

And I should say that I haven't implemented any of the sum-product variants, so I can't say how they work empirically.

1) I haven't seen the max-product->matching reduction before--is your 2006 paper the best read for an introduction?

Counting does seem harder in terms of existing algorithms, but I wonder if there are any formal results showing max-product to be easier than sum-product in some cases. As with max-product, there are sometimes surprising reductions; for instance, computation of the Ising partition function can be made tractable in some cases by reducing it to a Pfaffian computation.

2) By dynamic counting, do you mean computing sum-product efficiently? Two categories of exact counting algorithms I know of are dynamic programming algorithms like the Junction Tree algorithm, which are tractable for bounded treewidth, and holographic algorithms like Kasteleyn's Ising partition function algorithm, which are tractable for bounded genus. Also, there are a few approximate counting algorithms appearing in SODA recently that can be adapted to computing marginals and will run in polynomial time if the edge strength is low compared to the graph branching factor. Additionally, I've seen a randomized approximation algorithm which runs in polynomial time for minor-excluded graphs.

One nice thing I found about sum-product is that it reduces to matrix multiplications in regular algebra. Matrix multiplications in this setting are "ergodic" in the sense that "older" matrices become irrelevant, so you can limit yourself to a subset of the matrices you need to multiply and still get a guaranteed approximation; there is a well-developed theory of such bounds. For max-product, on the other hand, you have matrix multiplication in the "tropical" algebra, and I couldn't find any results about ergodicity of matrix multiplication in this setting. If you found such bounds, you could use the "Correlation Decay Tree" construction to get a polynomial time approximation algorithm for max-product.

Hi Yaroslav, sorry for the slow response.

1. I don't mean that you can actually reduce max-product to bipartite matching. I just mean that if you have a substructure in your model that contributes the same energy and constraints as a bipartite matching problem, then you can use special purpose algorithms inside a factor representing that whole subproblem. What you care about is which messages that factor sends to the rest of the network, and you can compute exact max-product messages using dynamic combinatorial algorithms. That's what's in the NIPS paper. If you tried to do the sum-product analog there, it would be #P-complete.

2. If you look at the message updates for standard sum-product, you need to compute M different message vectors, where M is the number of variables in the scope of the factor. Each entry in each vector is the result of a slightly different counting problem. By dynamic counting algorithm, I mean a way of solving all of these related problems in a combined way--typically by solving one problem, then modifying the problem slightly and solving the next problem, etc.
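To make the structure of those M counting problems explicit, here's a brute-force sketch of my own (the function name and formulation are hypothetical, not from any of the papers above) for the outgoing sum-product messages of a cardinality factor over M binary variables. Each entry of each outgoing message is its own weighted count over the other M-1 variables--exactly the family of related problems a dynamic counting algorithm would share work across instead of solving from scratch:

```python
import numpy as np
from itertools import product

def cardinality_factor_messages(incoming, theta):
    """Brute-force sum-product messages out of a cardinality factor
    f(x_1, ..., x_M) = theta[x_1 + ... + x_M] over binary variables.
    incoming[m] is the length-2 message from variable m into the factor."""
    M = len(incoming)
    out = np.zeros((M, 2))
    for m in range(M):
        for xm in (0, 1):
            # One counting problem per message entry: a weighted sum
            # over all 2^(M-1) settings of the other variables.
            total = 0.0
            for rest in product((0, 1), repeat=M - 1):
                x = rest[:m] + (xm,) + rest[m:]
                w = theta[sum(x)]
                for j in range(M):
                    if j != m:
                        w *= incoming[j][x[j]]
                total += w
            out[m, xm] = total
    return out

# "Exactly one variable on" factor with uniform incoming messages:
msgs = cardinality_factor_messages([np.ones(2)] * 3, [0.0, 1.0, 0.0, 0.0])
print(msgs)  # every row is [2., 1.]
```

The naive cost is O(M * 2^M); the point of a dynamic scheme would be to exploit how similar the M problems are to each other rather than enumerating each one independently.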

As for the correlation decay, I don't know much about that, but I'd definitely be interested to know why it is or why it isn't applicable in the max-product case.

Regarding correlation decay, here's an example for chains--suppose you want to find the marginal probability of state 1 in an HMM of length n given all the observations. Standard sum-product reduces to matrix multiplications, and results from matrix theory tell you that if your transition matrices are primitive, then after k matrix multiplications you'll be fairly close to the true answer, so you can discard the remaining n-k observations. In the analogous max-product case, we want a max-marginal instead of a marginal. Can we say anything about how many observations we need before the rest become (mostly) irrelevant?

My quick guess is that it depends on the minimum strength of the pairwise potentials along the path from state 1 to state n. It should be pretty straightforward to construct examples that achieve that error: make state n have a strong preference for x_n=1, make state 1 have a strong preference for x_1=0, then have pairwise potentials encouraging neighboring variables to take on the same value. So unless you make some assumption like the pairwise strengths being drawn randomly from, say, a zero-mean Gaussian, I don't see how you could make such an argument. But you say that it works in the sum-product case? So long as transitions aren't deterministic?

Yes, it works for sum-product: if the matrices have no zero components, then the product of two matrices is closer to a rank-1 matrix, in terms of Birkhoff's contraction coefficient, than either of the original matrices. You can collapse the last n steps of belief propagation on a chain into one step by multiplying the n transition matrices together. For large enough n, this matrix will be essentially a rank-1 matrix, which means that evidence beyond n+1 steps in the past is irrelevant. I imagine something similar should happen for max-marginals too. As far as I can see, Birkhoff's contraction coefficient is the best possible measure of contraction for matrix multiplication, and it comes up under different names in work studying convergence of belief propagation, like Joris Mooij's thesis. I imagine one could come up with similar "contraction" properties when sum-product is replaced with max-sum, but I haven't seen any work on it.
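Here's a quick numerical sketch of that forgetting effect on a chain (the transition matrix, chain lengths, and observation likelihoods are arbitrary choices of mine): with a strictly positive transition matrix, the normalized backward message becomes essentially independent of evidence more than a few dozen steps away.

```python
import numpy as np

rng = np.random.default_rng(0)
T = np.array([[0.7, 0.3],
              [0.2, 0.8]])  # strictly positive, hence primitive

def backward_message(obs_likelihoods):
    # Sum-product backward pass on a 2-state chain, normalized each step.
    beta = np.ones(2)
    for lik in reversed(obs_likelihoods):
        beta = T @ (lik * beta)
        beta /= beta.sum()
    return beta

near = [rng.uniform(0.1, 1.0, size=2) for _ in range(20)]
far_a = [rng.uniform(0.1, 1.0, size=2) for _ in range(20)]
far_b = [rng.uniform(0.1, 1.0, size=2) for _ in range(20)]

# Same 20 nearby observations, completely different distant ones:
m_a = backward_message(near + far_a)
m_b = backward_message(near + far_b)
print(np.abs(m_a - m_b).max())  # tiny: distant evidence has been forgotten
```

Each multiplication by T contracts the message in the Hilbert projective metric (the diagonal likelihood multiplications are isometries there), which is Birkhoff's result in action. Whether the tropical (max-product) analogue of this contraction holds is exactly the open question in this thread.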
