Wednesday, June 17, 2009

Large Scale Graph Computing

Posted by Danny Tarlow
The official Google research blog has a post on large scale graph computing:
we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges' states, and mutate the graph's topology ... Developers of dozens of Pregel applications within Google have found that "thinking like a vertex," which is the essence of programming in Pregel, is intuitive.
This sounds pretty powerful. It's interesting to think about what new problems you could think about tackling if you had access to this type of computing power. One early thought I had would be to run belief propagation algorithms over graphical models with nodes representing something like n-grams, search queries, or names. You could imagine a network where every mention of a name on the internet gets its own node (or set of nodes) in a factor graph. The goal would be to cluster mentions of names, so that mentions referring to the same underlying person would be put in the same group, while mentions referring to different underlying people would be put in different groups.

One choice would be to tackle this problem with exemplar-based clustering. (Shameless self-promotion ahead). A natural way to tackle this problem might be with flexible priors and exemplar-based clustering.

Two sources of information would come into play in the model.
  • First, you'd need to define a similarity measure between mentions of names: how different is the actual text (e.g., Danny and Daniel should be close, while Danny and Andy should be far); do the mentions occur on the same page?; do the pages they are on link to each other?; how similar is the other text on the page surrounding the mention?; etc.
  • Second, you'd need some way to decide how many names there are (model selection). One way might be to just choose that you want k different groups. This is how, for example, the k-means, k-medians, and the normalized cut algorithms works. This is probably less than ideal, though, since many of these methods have implicit priors that encourage all of the groups to be roughly the same size (see my paper for details).

    Instead, it might be natural to express a prior distribution over the size of groups that we expect to see in the name-mention groups. You can imagine that some names will occur extremely frequently (E.g., Paris Hilton, Kobe Bryant, Barack Obama), while others will occur infrequently (E.g., me), and yet others will occur maybe once at best (E.g., my Gramma Lorraine). Using flexible priors, we can express this prior belief.

If nothing else, running this algorithm would surely put their suggestion of infinite computing power to test:
so far we haven't come across a type of graph or a practical graph computing problem which is not solvable with Pregel
What would you do if you had this kind of power on your desktop?

1 comment: