Posted by Danny Tarlow
The Netflix Prize
My little region of the internet is abuzz with news that the 10% improvement needed to win the $1M Netflix Prize has been achieved. The prize is now in the "last call" stage:
As of the submission by team "BellKor's Pragmatic Chaos" on June 26, 2009 18:42:37 UTC, the Netflix Prize competition entered the "last call" period for the Grand Prize. In accord with the Rules, teams have thirty (30) days, until July 26, 2009 18:42:37 UTC, to make submissions that will be considered for this Prize. Good luck and thank you for participating!
The winning method is a combination of several different teams' approaches, blended together using some form of ensemble learning. It's an interesting lesson -- to really eke out the last bit of performance, it appears to work better to throw the kitchen sink of methods at the problem and combine them in an intelligent way. This is rather contrary to the "ideal" research approach, where some clever, (relatively) simple, elegant breakthrough would lead to the best performance.
In honor of the prize barrier being broken, I put together a little implementation of an early leader's approach to the problem. They experimented with several different approaches, but the one I implement here is the original probabilistic matrix factorization (PMF) approach. To see all the details, including how it performs on the full Netflix problem, see Russ and Andrei's paper.
The Problem
The rating prediction problem is easy to explain: there is a set of users, a set of items (movies, in the Netflix case), and a set of ratings. For some combinations of user and movie, you have a rating, which says how much the user liked the movie. This is called the training set. From the ratings in the training set, we want to build a model that will predict the rating for some previously unseen combination of user and movie.
In order to evaluate the model, some known ratings are taken out of the training set and put in a test set. The model then takes the training set as input, along with the user-movie combinations that are in the test set, but it isn't shown the actual ratings in the test set. Instead, it is asked to make predictions based on what it has learned from the training set. In the case of the Netflix challenge, Netflix never reveals the actual values of these ratings -- they will only take in a set of predictions and report back your score.
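To make the setup concrete, here is a minimal sketch of the evaluation loop; the predict argument is a stand-in for whatever model you train, and the function names here are mine, not Netflix's:

import numpy

def rmse(predict, test_set):
    # Root mean squared error over held-out (user, movie, rating) triples --
    # the metric Netflix used to score submissions.
    errors = [predict(u, m) - r for (u, m, r) in test_set]
    return numpy.sqrt(numpy.mean(numpy.array(errors)**2))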
For example, a small data set might look as follows:
Danny - Good Will Hunting: 10
Danny - Prefontaine: 10
Danny - Casino Royale: 5
Tyler - Toy Story: 2
Tyler - Top Gun: 10
Tyler - Casino Royale: 10
Tyler - The Little Mermaid: 2
Tyler - Prefontaine: 6
Andrew - Up: 10
Andrew - Toy Story: 10
Andrew - Top Gun: 3
Probabilistic Matrix Factorization
The big idea of PMF is to describe each user and each movie by a small set of attributes. In this case, we could say something along the lines of: Danny likes dramas and doesn't really like action or Disney movies; Tyler loves action movies but hates Disney movies; and Andrew loves Disney movies but nothing else. Toy Story and Up are Disney movies, Top Gun and Casino Royale are action movies, and Good Will Hunting and Prefontaine are dramas.
This could be represented in the PMF model by using three-dimensional vectors to describe each user and each movie. If the first dimension is how much of a Disney movie it is, the second how much of an action movie, and the third how much of a drama, then the latent vectors might look something like the following:
Danny: [0, 3, 10]
Andrew: [10, 3, 5]
Tyler: [0, 8, 4]
Good Will Hunting: [0, 0, 1]
Prefontaine: [0, .5, 1]
Casino Royale: [0, 1, .2]
Toy Story: [1, .1, .1]
Top Gun: [0, 1, .2]
...
To predict how a user rates a movie, we take the dot product of the user vector with the item vector. So to predict how Tyler would rate Top Gun:
0 * 0 + 8 * 1 + 4 * .2 = 8.8
Not bad, given that we know his actual rating was a 10.
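In numpy, that prediction is a one-line dot product (using the made-up vectors from above):

import numpy

tyler = numpy.array([0, 8, 4])      # [disney, action, drama]
top_gun = numpy.array([0, 1, .2])
print numpy.dot(tyler, top_gun)     # 8.8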
Learning
However, in the PMF model, the descriptions of the users and movies are not known ahead of time. Instead, the learning algorithm automatically discovers these latent characteristics, finding latent representations of users and movies such that the model's predicted ratings match the ratings in the training set as closely as possible. Going back to the example, if we wanted the predicted rating to match the actual rating even more closely, we could bump up Tyler's love for action movies (or make several other changes to the latent vectors). When you have many users rating many movies, it gets trickier to know how to change the latent vectors, because moving Tyler's predicted rating for Top Gun closer to 10 might make several of his other predictions worse. Thankfully, calculus and the PMF algorithm help us out here.
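To see where the calculus comes in: for the squared-error objective, the gradient of a single rating's term with respect to the user vector is (up to a constant) the prediction residual times the movie vector, and symmetrically for the movie vector. Here is a minimal sketch of one gradient-ascent step on a single rating, using the made-up vectors from above; the learning rate value is arbitrary:

import numpy

learning_rate = 0.01
user = numpy.array([0., 8., 4.])     # Tyler
movie = numpy.array([0., 1., .2])    # Top Gun
rating = 10.

residual = rating - numpy.dot(user, movie)   # 10 - 8.8 = 1.2
grad_user = residual * movie                 # ascent direction for the user vector
grad_movie = residual * user                 # ascent direction for the movie vector
user += learning_rate * grad_user
movie += learning_rate * grad_movie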
For my implementation, I don't worry too much about making it efficient enough to scale to something like the Netflix problem, but I use the same model presented in the paper, and I use a similar gradient-ascent optimization scheme (though I use batch updates with momentum instead of stochastic updates).
I test my code using synthetic data: I first make up latent vectors for the users and items, then generate training set ratings by taking the dot product of a latent user vector and a latent item vector and adding some noise. I then discard the latent vectors and give the model only the synthetic ratings.
The code proceeds iteratively, increasing the likelihood of the data given the latent vectors at each iteration. I use 5-dimensional latent vectors for these experiments, but you could use any small number. Here's what the likelihood (which you can think of as roughly the negative sum of squared errors between predictions on the training set and the actual training set values) looks like over time:
[Figure: training log likelihood increasing over iterations]
Results
Upon convergence of the training, the latent vectors look like this. There is one row for each individual user and item, and five columns per row -- these are the learned descriptions of the users and movies:
[Figure: heat maps of the learned 5-dimensional user and item latent vectors]
Finally, using the learned latent vectors, we can then make rating predictions for any combination of user and movie. I do that here:
[Figure: heat map of predicted ratings for every user-movie combination]
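In code, a prediction for any user-movie pair is again just a dot product of the learned latent vectors. A minimal sketch (the predict helper is mine, wrapping the pmf object built in the listing below):

def predict(pmf, user_id, item_id):
    # Predicted rating = dot product of the learned latent vectors.
    return numpy.dot(pmf.users[user_id], pmf.items[item_id])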
The Code
And here is the code that generates the synthetic data, trains the model, and produces the figures:
Update (10/25/09): Fixed a bug in try_updates that was causing overcounting of the regularization component of the update. Thanks, Conrad.
import pylab
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy
import pickle

class ProbabilisticMatrixFactorization():

    def __init__(self, rating_tuples, latent_d=1):
        self.latent_d = latent_d
        self.learning_rate = .0001
        self.regularization_strength = 0.1

        self.ratings = numpy.array(rating_tuples).astype(float)
        self.converged = False

        self.num_users = int(numpy.max(self.ratings[:, 0]) + 1)
        self.num_items = int(numpy.max(self.ratings[:, 1]) + 1)

        print (self.num_users, self.num_items, self.latent_d)
        print self.ratings

        self.users = numpy.random.random((self.num_users, self.latent_d))
        self.items = numpy.random.random((self.num_items, self.latent_d))

        self.new_users = numpy.random.random((self.num_users, self.latent_d))
        self.new_items = numpy.random.random((self.num_items, self.latent_d))

    def likelihood(self, users=None, items=None):
        if users is None:
            users = self.users
        if items is None:
            items = self.items

        sq_error = 0
        for rating_tuple in self.ratings:
            if len(rating_tuple) == 3:
                (i, j, rating) = rating_tuple
                weight = 1
            elif len(rating_tuple) == 4:
                (i, j, rating, weight) = rating_tuple

            r_hat = numpy.sum(users[i] * items[j])
            sq_error += weight * (rating - r_hat)**2

        L2_norm = 0
        for i in range(self.num_users):
            for d in range(self.latent_d):
                L2_norm += users[i, d]**2

        for i in range(self.num_items):
            for d in range(self.latent_d):
                L2_norm += items[i, d]**2

        return -sq_error - self.regularization_strength * L2_norm

    def update(self):
        updates_o = numpy.zeros((self.num_users, self.latent_d))
        updates_d = numpy.zeros((self.num_items, self.latent_d))

        for rating_tuple in self.ratings:
            if len(rating_tuple) == 3:
                (i, j, rating) = rating_tuple
                weight = 1
            elif len(rating_tuple) == 4:
                (i, j, rating, weight) = rating_tuple

            r_hat = numpy.sum(self.users[i] * self.items[j])

            for d in range(self.latent_d):
                updates_o[i, d] += self.items[j, d] * (rating - r_hat) * weight
                updates_d[j, d] += self.users[i, d] * (rating - r_hat) * weight

        while (not self.converged):
            initial_lik = self.likelihood()

            print " setting learning rate =", self.learning_rate
            self.try_updates(updates_o, updates_d)

            final_lik = self.likelihood(self.new_users, self.new_items)

            if final_lik > initial_lik:
                self.apply_updates(updates_o, updates_d)
                self.learning_rate *= 1.25

                if final_lik - initial_lik < .1:
                    self.converged = True

                break
            else:
                self.learning_rate *= .5
                self.undo_updates()

            if self.learning_rate < 1e-10:
                self.converged = True

        return not self.converged

    def apply_updates(self, updates_o, updates_d):
        for i in range(self.num_users):
            for d in range(self.latent_d):
                self.users[i, d] = self.new_users[i, d]

        for i in range(self.num_items):
            for d in range(self.latent_d):
                self.items[i, d] = self.new_items[i, d]

    def try_updates(self, updates_o, updates_d):
        alpha = self.learning_rate
        beta = -self.regularization_strength

        for i in range(self.num_users):
            for d in range(self.latent_d):
                self.new_users[i, d] = self.users[i, d] + \
                    alpha * (beta * self.users[i, d] + updates_o[i, d])

        for i in range(self.num_items):
            for d in range(self.latent_d):
                self.new_items[i, d] = self.items[i, d] + \
                    alpha * (beta * self.items[i, d] + updates_d[i, d])

    def undo_updates(self):
        # Don't need to do anything here
        pass

    def print_latent_vectors(self):
        print "Users"
        for i in range(self.num_users):
            print i,
            for d in range(self.latent_d):
                print self.users[i, d],
            print

        print "Items"
        for i in range(self.num_items):
            print i,
            for d in range(self.latent_d):
                print self.items[i, d],
            print

    def save_latent_vectors(self, prefix):
        self.users.dump(prefix + "%sd_users.pickle" % self.latent_d)
        self.items.dump(prefix + "%sd_items.pickle" % self.latent_d)


def fake_ratings(noise=.25):
    u = []
    v = []
    ratings = []

    num_users = 100
    num_items = 100
    num_ratings = 30
    latent_dimension = 10

    # Generate the latent user and item vectors
    for i in range(num_users):
        u.append(2 * numpy.random.randn(latent_dimension))
    for i in range(num_items):
        v.append(2 * numpy.random.randn(latent_dimension))

    # Get num_ratings ratings per user.
    for i in range(num_users):
        items_rated = numpy.random.permutation(num_items)[:num_ratings]

        for jj in range(num_ratings):
            j = items_rated[jj]
            rating = numpy.sum(u[i] * v[j]) + noise * numpy.random.randn()

            ratings.append((i, j, rating))  # thanks sunquiang

    return (ratings, u, v)


def plot_ratings(ratings):
    xs = []
    ys = []

    for i in range(len(ratings)):
        xs.append(ratings[i][1])
        ys.append(ratings[i][2])

    pylab.plot(xs, ys, 'bx')
    pylab.show()


def plot_latent_vectors(U, V):
    fig = plt.figure()
    ax = fig.add_subplot(121)
    cmap = cm.jet
    ax.imshow(U, cmap=cmap, interpolation='nearest')
    plt.title("Users")
    plt.axis("off")

    ax = fig.add_subplot(122)
    ax.imshow(V, cmap=cmap, interpolation='nearest')
    plt.title("Items")
    plt.axis("off")


def plot_predicted_ratings(U, V):
    r_hats = -5 * numpy.ones((U.shape[0] + U.shape[1] + 1,
                              V.shape[0] + V.shape[1] + 1))

    for i in range(U.shape[0]):
        for j in range(U.shape[1]):
            r_hats[i + V.shape[1] + 1, j] = U[i, j]

    for i in range(V.shape[0]):
        for j in range(V.shape[1]):
            r_hats[j, i + U.shape[1] + 1] = V[i, j]

    for i in range(U.shape[0]):
        for j in range(V.shape[0]):
            r_hats[i + U.shape[1] + 1, j + V.shape[1] + 1] = numpy.dot(U[i], V[j]) / 10

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.imshow(r_hats, cmap=cm.gray, interpolation='nearest')
    plt.title("Predicted Ratings")
    plt.axis("off")


if __name__ == "__main__":
    DATASET = 'fake'

    if DATASET == 'fake':
        (ratings, true_o, true_d) = fake_ratings()

    #plot_ratings(ratings)

    pmf = ProbabilisticMatrixFactorization(ratings, latent_d=5)

    liks = []
    while (pmf.update()):
        lik = pmf.likelihood()
        liks.append(lik)
        print "L=", lik

    plt.figure()
    plt.plot(liks)
    plt.xlabel("Iteration")
    plt.ylabel("Log Likelihood")

    plot_latent_vectors(pmf.users, pmf.items)
    plot_predicted_ratings(pmf.users, pmf.items)
    plt.show()

    pmf.print_latent_vectors()
    pmf.save_latent_vectors("models/")
22 comments:
What would be the problem in applying your code to the Netflix data?
There's nothing in theory that should keep it from working on the Netflix problem. In practice, stochastic gradient descent or mini-batch gradient descent seems to work a little better (as opposed to my batch update), and you could be more efficient about computing the gradients. In C this would be fine, but I think in Python you'd be better off replacing the for loops with map calls or matrix computations.
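For instance, a rough sketch of a vectorized batch gradient, assuming the ratings live in a dense matrix R with a 0/1 mask of observed entries (these names are illustrative, not from the code above):

import numpy

def batch_gradients(U, V, R, mask, reg):
    # Residuals on the observed entries only.
    E = mask * (R - numpy.dot(U, V.T))
    # Gradients of the regularized objective, up to a constant factor --
    # no Python loops over individual ratings.
    grad_U = numpy.dot(E, V) - reg * U
    grad_V = numpy.dot(E.T, U) - reg * V
    return grad_U, grad_V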
Good catch, sunquiang. Thanks.
Line 180 should be indented 4 spaces more -- I've corrected it above.
Everything still works the same, but making the change will run it on a more realistic data set.
sorry for my accidental deletion.
and thanks for the confirmation.
Maybe you'd be interested in MMMF?
http://people.csail.mit.edu/jrennie/matlab/
(Using the appropriate loss) MMMF can be viewed as a generalization of regularized logistic regression to a matrix of discrete, ordered labels.
You might also be interested in a writeup I did comparing ordinal regression loss functions (which are important when predicting discrete, ordered ratings):
http://people.csail.mit.edu/jrennie/writing/proportionalOdds.pdf
Here is a suggestion for another post that would fit in your series of simple machine learning algorithms in python.
How about sum-product? Or max-product? Or something? And then do a bit of learning with a small graphical model.
I've been deliberately stalling on this because I intend to release quite a bit of max-product code in the near future, but yes, this would be very doable. Any specific applications areas you'd like to request?
Whatever application you think is must suited to a demo. Something with a reasonably manageable and understandable dataset. Maybe non-vision data? Surprise me. :)
Yes, sir!
This is interesting, but really, without knowing how it scores on the Netflix dataset it's hard to assess its usefulness.
I strongly recommend you run this on Netflix and determine its score. I'm not saying you need to aim to beat the Netflix prize, but anything below 0.91 would indicate that your approach has some degree of usefulness.
For example, the "Simon Funk" algorithm, which is similar to the one you describe, scores around 0.90 IIRC.
@sanity: It has been tried on Netflix, just not by Danny.
Thanks George. @sanity: what he said.
http://www.cs.toronto.edu/~amnih/papers/pmf.pdf
I think there is a discrepancy between your code and what you used to show the data: a small mistake in line 122 causes the estimated user matrix to be a repeated stack of the same column vector after one gradient iteration.
cournape -- Ah, yes, you are correct. Thanks for the careful eye.
(You're also right about the discrepancy. I made some changes first in my local copy of the code that produces all the outputs, then since they were small, I figured I'd just edit the blog code by hand. Never a good idea.)
Hey Danny, that is a very interesting post. I am thinking about using the algorithm for a similar optimization, but in this case I don't have absolute ratings for the movies. I just have relative ratings: the user can pick which of two movies he prefers. Could you think of a way to use this algorithm in this case?
It's an interesting question, dotdotdot. The first thing you'd need to do is change the objective function, since it's currently measuring the squared error between predicted and observed ratings--instead, you'd probably want it to somehow depend on the number of pairwise orderings that you violate.
The problem will be that this objective function won't have an informative derivative: if you make a small change to latent vectors that doesn't flip any ordering, the error won't change.
So to stay in the continuous space, you'd probably then need to think of a smoothed approximation to the modified objective function--one that's close to what you care about but which has an informative derivative. Alternatively, you could think about trying to do optimization in the discrete space.
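To make that concrete, one possible smoothing (a sketch of my own, not something from the PMF paper) is to pass the difference of predicted scores through a logistic sigmoid, so each observed preference "user u prefers movie a to movie b" contributes a differentiable term:

import numpy

def pairwise_log_likelihood(users, items, prefs):
    # prefs is a list of (user, preferred_item, other_item) triples.
    ll = 0.0
    for (u, a, b) in prefs:
        gap = numpy.dot(users[u], items[a]) - numpy.dot(users[u], items[b])
        ll += numpy.log(1.0 / (1.0 + numpy.exp(-gap)))  # log sigmoid of the score gap
    return ll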
So in short, I'm not sure there's a clear best way to do it (though maybe somebody else knows and can point out a good reference?).
Regardless, it's an interesting question--if you have interesting data and/or want to talk about it further, let me know.
Re: relative ratings
Consider representing each movie as a feature vector and each user as a weight vector. Then, treat the relative ratings as classification examples such that the preferred movie should have the larger dot-product and solve all the classification problems simultaneously.
Easy to do using the framework of Fast MMMF:
Fast Maximum Margin Matrix Factorization for Collaborative Prediction
Thanks, Jason. I thought you might have some thoughts on the subject.
One question: Are you suggesting that the movie vectors stay fixed? I don't think you mean so, but calling the movie representation a feature vector for classification would make me think so.
But basically you are just saying to penalize each pairwise violation like you'd penalize a classification problem (smoothly), then optimize the sum of errors over the user (and movie?) weights?
Sounds good to me.
You could have fixed feature vectors. That's what Nati and I did in Loss Functions for Preference Levels: Regression with Discrete Ordered Labels. There we used fixed ratings, but it's easy to adapt to relative ratings. Alternatively, you could simultaneously learn weight vectors and "feature vectors" (I agree that nomenclature could be better :-). This is Fast MMMF in a nutshell.
You've got the intuition correct: all the techniques I worked on minimize a sum of classification losses (plus a regularization term). The sum is over examples (ratings).
Hello everyone, I am new to this topic. I know how to use SVD for collaborative filtering, but I am facing problems with PMF. I tried to understand your code but couldn't, and I have some doubts regarding PMF.
1) In SVD matrix factorization, we decompose the rating matrix into feature matrices using SVD and then use gradient descent to update the feature matrices with a learning rate and regularization coefficients. But in PMF, how do we obtain these feature matrices, and how do we update them? Do we use gradient ascent here? I have read a paper which says to maximize the log posterior over the user and movie features. Can you please enlighten me on this?
Hi Danny,
Good to see your post, but what about the data used in this post?
Can you send me the data to test the program?
Thanks very much!