The Netflix Prize
My little region of the internet is abuzz with news that the 10% improvement needed to win the $1M
Netflix prize has been achieved. The prize is now in the "last call" stage:
As of the submission by team "BellKor's Pragmatic Chaos" on June 26, 2009 18:42:37 UTC, the Netflix Prize competition entered the "last call" period for the Grand Prize. In accord with the Rules, teams have thirty (30) days, until July 26, 2009 18:42:37 UTC, to make submissions that will be considered for this Prize. Good luck and thank you for participating!
The winning method is a combination of several different teams' approaches, blended together using some form of
ensemble learning. It's an interesting lesson -- to really eek out the last bit of performance, it appears to work better to throw the kitchen sink of methods at the problem and combine them in an intelligent way. This is rather contrary to the "ideal" research approach, where some clever, (relatively) simple, elegant breakthrough would lead to the best performance.
In honor of the prize barrier being broken, I put together a little implementation of an
early leader's approach to the problem. They experimented with several different approaches, but the one I use the most is the original
probabilistic matrix factorization (PMF) approach. To see all the details, including how it performs on the full Netflix problem, see
Russ and Andrei's paper.
The Problem
The rating prediction problem is easy to explain: there is a set of users, a set of items (movies, in the Netflix case) and a set of ratings. For some combinations of user and movie, you have a rating, which says how much the user liked the movie. This is called the
training set. From the ratings in the training set, we want to build a model that will predict the rating given by some previously unseen combination of user and movie.
In order to evaluate the model, some known ratings are taken out of the training set and put in a
test set. The model then takes the training set as input, along with the user-movie combinations that are in the test set, but it isn't shown the actual ratings in the test set. Instead, it is asked to make predictions based on what it has learned from the training set. In the case of the Netflix challenge, Netflix never reveals the actual values of these ratings -- they will only take in a set of predictions and report back your score.
For example, a small data set might look as follows:
Danny - Good Will Hunting: 10
Danny - Prefontaine: 10
Danny - Casino Royale: 5
Tyler - Toy Story: 2
Tyler - Top Gun: 10
Tyler - Casino Royale: 10
Tyler - The Little Mermaid: 2
Tyler - Prefontaine: 6
Andrew - Up: 10
Andrew - Toy Story: 10
Andrew - Top Gun: 3
Probabilistic Matrix Factorization
The big idea of PMF is to describe each user and each movie by a small set of attributes. In this case, we could say something along the lines that Danny likes dramas and doesn't really like action or Disney movies. Tyler loves action movies but hates Disney movies, and Andrew loves Disney movies but nothing else. Toy Story and Up are Disney movies, Top Gun and Casino Royale are action movies, and Good Will Hunting and Prefontaine are dramas.
This could be represented by the PMF model by using three dimensional vectors to describe each user and each movie. If the first dimension was how much of a Disney movie it is, the second is how much of an action movie it was, and the third is how much of a drama it is, then the latent vectors might look something like as follows:
Danny: [0, 3, 10]
Andrew: [10, 3, 5]
Tyler: [0, 8, 4]
Good Will Hunting: [0, 0, 1]
Prefontaine: [0, .5, 1]
Casino Royale: [0, 1, .2]
Toy Story: [1, .1, .1]
Top Gun: [0, 1, .2]
...
To predict how a user rates a movie, we take the dot product of the user vector with the item vector. So to predict how Tyler would rate Top Gun:
0 * 0 + 8 * 1 + .2 * 4 = 8.8
Not bad, given that we know his actual rating was a 10.
Learning
However, in the PMF model, the descriptions of the users and movies are
not known ahead of time. Instead, the learning algorithm automatically discovers these latent characteristics, finding latent representations of users and movies such that the model's predicted ratings match the ratings in the training set as best as possible. Going back to the example, if we wanted to make the predicted rating match the actual rating even more, we could bump up Tyler's love for action movies (or make several other changes to the latent vectors). When you have many users rating many movies, it gets trickier to know how to change the latent vectors, because making Tyler's rating for Top Gun closer might make several other of his ratings get worse. Thankfully, calculus and the PMF algorithm help us out here.
For my implementation, I don't worry too much about making it efficient enough to scale to something like the Netflix problem, but I use the same model presented in the paper, and I use a similar gradient-ascent optimization scheme (though I use batch updates with momentum instead of stochastic updates).
I test my code using synthetic data, where I first make up latent vectors for users and items, then I generate some training set ratings by multiplying some latent user vectors by latent item vectors then adding some noise. I then discard the latent vectors and just give the model the synthetic ratings.
The code progresses iteratively, increasing the likelihood of the data given the latent vectors at each iteration. I use 5 dimensional latent vectors for these experiments, but you could use any small number. Here's what the likelihood (which you can think of as roughly the (negative) sum of squared error between predictions on the training set and actual training set values) looks like over time:
Results
Upon convergence of the training, the latent vectors look like this. There is one row for each individual user and item, then there are five columns per row -- these are the learned descriptions of users and movies:
Finally, using the learned latent vectors, we can then make rating predictions for any combination of user and movie. I do that here:

The left side of the image has the vertical stacking of user latent vectors, the top side has the horizontal stacking of movie latent vectors, then the large bottom right region has the predicted ratings for every combination of user and item. Note that if a user or item has relatively small entries in all of its latent vectors, that represents something along the lines of being non-polar (apathetic, almost), while users or items with larger positive or negative latent values will be more polarizing -- the love-it-or-hate-it idea. You can see these sorts of patterns appear in the rows and columns.
The Code
And here is the code that generates the synthetic data, trains the model, and produces the figures:
Update (10/25/09): Fixed a bug in try_updates that was causing overcounting of the regularization component of the update. Thanks, Conrad.
import pylab
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy
import pickle
class ProbabilisticMatrixFactorization():
def __init__(self, rating_tuples, latent_d=1):
self.latent_d = latent_d
self.learning_rate = .0001
self.regularization_strength = 0.1
self.ratings = numpy.array(rating_tuples).astype(float)
self.converged = False
self.num_users = int(numpy.max(self.ratings[:, 0]) + 1)
self.num_items = int(numpy.max(self.ratings[:, 1]) + 1)
print (self.num_users, self.num_items, self.latent_d)
print self.ratings
self.users = numpy.random.random((self.num_users, self.latent_d))
self.items = numpy.random.random((self.num_items, self.latent_d))
self.new_users = numpy.random.random((self.num_users, self.latent_d))
self.new_items = numpy.random.random((self.num_items, self.latent_d))
def likelihood(self, users=None, items=None):
if users is None:
users = self.users
if items is None:
items = self.items
sq_error = 0
for rating_tuple in self.ratings:
if len(rating_tuple) == 3:
(i, j, rating) = rating_tuple
weight = 1
elif len(rating_tuple) == 4:
(i, j, rating, weight) = rating_tuple
r_hat = numpy.sum(users[i] * items[j])
sq_error += weight * (rating - r_hat)**2
L2_norm = 0
for i in range(self.num_users):
for d in range(self.latent_d):
L2_norm += users[i, d]**2
for i in range(self.num_items):
for d in range(self.latent_d):
L2_norm += items[i, d]**2
return -sq_error - self.regularization_strength * L2_norm
def update(self):
updates_o = numpy.zeros((self.num_users, self.latent_d))
updates_d = numpy.zeros((self.num_items, self.latent_d))
for rating_tuple in self.ratings:
if len(rating_tuple) == 3:
(i, j, rating) = rating_tuple
weight = 1
elif len(rating_tuple) == 4:
(i, j, rating, weight) = rating_tuple
r_hat = numpy.sum(self.users[i] * self.items[j])
for d in range(self.latent_d):
updates_o[i, d] += self.items[j, d] * (rating - r_hat) * weight
updates_d[j, d] += self.users[i, d] * (rating - r_hat) * weight
while (not self.converged):
initial_lik = self.likelihood()
print " setting learning rate =", self.learning_rate
self.try_updates(updates_o, updates_d)
final_lik = self.likelihood(self.new_users, self.new_items)
if final_lik > initial_lik:
self.apply_updates(updates_o, updates_d)
self.learning_rate *= 1.25
if final_lik - initial_lik < .1:
self.converged = True
break
else:
self.learning_rate *= .5
self.undo_updates()
if self.learning_rate < 1e-10:
self.converged = True
return not self.converged
def apply_updates(self, updates_o, updates_d):
for i in range(self.num_users):
for d in range(self.latent_d):
self.users[i, d] = self.new_users[i, d]
for i in range(self.num_items):
for d in range(self.latent_d):
self.items[i, d] = self.new_items[i, d]
def try_updates(self, updates_o, updates_d):
alpha = self.learning_rate
beta = -self.regularization_strength
for i in range(self.num_users):
for d in range(self.latent_d):
self.new_users[i,d] = self.users[i, d] + \
alpha * (beta * self.users[i, d] + updates_o[i, d])
for i in range(self.num_items):
for d in range(self.latent_d):
self.new_items[i, d] = self.items[i, d] + \
alpha * (beta * self.items[i, d] + updates_d[i, d])
def undo_updates(self):
# Don't need to do anything here
pass
def print_latent_vectors(self):
print "Users"
for i in range(self.num_users):
print i,
for d in range(self.latent_d):
print self.users[i, d],
print
print "Items"
for i in range(self.num_items):
print i,
for d in range(self.latent_d):
print self.items[i, d],
print
def save_latent_vectors(self, prefix):
self.users.dump(prefix + "%sd_users.pickle" % self.latent_d)
self.items.dump(prefix + "%sd_items.pickle" % self.latent_d)
def fake_ratings(noise=.25):
u = []
v = []
ratings = []
num_users = 100
num_items = 100
num_ratings = 30
latent_dimension = 10
# Generate the latent user and item vectors
for i in range(num_users):
u.append(2 * numpy.random.randn(latent_dimension))
for i in range(num_items):
v.append(2 * numpy.random.randn(latent_dimension))
# Get num_ratings ratings per user.
for i in range(num_users):
items_rated = numpy.random.permutation(num_items)[:num_ratings]
for jj in range(num_ratings):
j = items_rated[jj]
rating = numpy.sum(u[i] * v[j]) + noise * numpy.random.randn()
ratings.append((i, j, rating)) # thanks sunquiang
return (ratings, u, v)
def plot_ratings(ratings):
xs = []
ys = []
for i in range(len(ratings)):
xs.append(ratings[i][1])
ys.append(ratings[i][2])
pylab.plot(xs, ys, 'bx')
pylab.show()
def plot_latent_vectors(U, V):
fig = plt.figure()
ax = fig.add_subplot(121)
cmap = cm.jet
ax.imshow(U, cmap=cmap, interpolation='nearest')
plt.title("Users")
plt.axis("off")
ax = fig.add_subplot(122)
ax.imshow(V, cmap=cmap, interpolation='nearest')
plt.title("Items")
plt.axis("off")
def plot_predicted_ratings(U, V):
r_hats = -5 * numpy.ones((U.shape[0] + U.shape[1] + 1,
V.shape[0] + V.shape[1] + 1))
for i in range(U.shape[0]):
for j in range(U.shape[1]):
r_hats[i + V.shape[1] + 1, j] = U[i, j]
for i in range(V.shape[0]):
for j in range(V.shape[1]):
r_hats[j, i + U.shape[1] + 1] = V[i, j]
for i in range(U.shape[0]):
for j in range(V.shape[0]):
r_hats[i + U.shape[1] + 1, j + V.shape[1] + 1] = numpy.dot(U[i], V[j]) / 10
fig = plt.figure()
ax = fig.add_subplot(111)
ax.imshow(r_hats, cmap=cm.gray, interpolation='nearest')
plt.title("Predicted Ratings")
plt.axis("off")
if __name__ == "__main__":
DATASET = 'fake'
if DATASET == 'fake':
(ratings, true_o, true_d) = fake_ratings()
#plot_ratings(ratings)
pmf = ProbabilisticMatrixFactorization(ratings, latent_d=5)
liks = []
while (pmf.update()):
lik = pmf.likelihood()
liks.append(lik)
print "L=", lik
pass
plt.figure()
plt.plot(liks)
plt.xlabel("Iteration")
plt.ylabel("Log Likelihood")
plot_latent_vectors(pmf.users, pmf.items)
plot_predicted_ratings(pmf.users, pmf.items)
plt.show()
pmf.print_latent_vectors()
pmf.save_latent_vectors("models/")
Related Posts: