## Monday, June 29, 2009

### Solar energy -- what's the hold up?

Posted by Danny Tarlow
Guest post by Andrew Peterman from an op-ed piece in the Idaho Statesman, dated 05/28/09.

Andrew and I will be giving a joint presentation in Scotland next month on a data-driven approach to predicting energy consumption in buildings -- I'm the data guy, and he's the energy guy. He grew up in Boise, has B.S. and M.S. degrees in Civil and Environmental Engineering from Stanford University, and currently works for the Walt Disney Company on worldwide sustainability, energy and environmental issues.

I've written a little bit about energy issues in the past, but most of the posts have been inspired by conversations with Andrew. This is meant as a test -- if it's received well, we might ask Andrew to write guest posts here more often about a variety of energy and energy policy issues. Let me know what you think.

Idaho should be a leader when it comes to solar energy
By Andrew Peterman
(Previously published in the Idaho Statesman)

I often find myself wondering why, as a state, we are not yet a leader in centralized or rooftop solar energy generation. According to the Energy Information Administration (EIA) and Interstate Renewable Energy Council, Idaho is not even ranked in the top 10, yet our close neighbor, Nevada, is ranked third in large-scale generation and second in per capita small and large solar PV generation.

The sun has incredible potential within Idaho to provide a key source of non-carbon-emitting electricity and heating. Solar is on the cutting edge of providing a financially smart way to reduce our dependence on foreign oil and reduce the deadly buildup of greenhouse gas emissions. We can bring Idaho to a state of foreign oil independence with an array of options, but not without harnessing the power we naturally absorb from the sun.

Idaho gets a lot of sun. Across the state, engineers and energy analysts have estimated that we have the potential to produce 60 billion kilowatt-hours of electricity annually from solar energy (Department of Energy) - over 40 percent of Idaho's total energy demand (EIA).

Solar technology can scale from the single-family homeowner to the large utility companies building massive collector systems and investing in our state.

Given Idaho's premium location for solar collection combined with nationally rising energy costs driven by high oil prices, solar technology has the potential to bring foreign oil independence to the state, reduce Idahoans' monthly utility bill, and protect Idaho's lakes, streams, mountains and valleys from the potentially harmful effects of global warming.

Solar is one of the cleanest technologies that makes financial sense. Incentives are currently in place to make the initial investment in solar panels more affordable and often financially beneficial for the individual or family. Through incentive programs such as the Northwest Solar Cooperative's Green Tag Purchase, every Idahoan has an opportunity to receive money for every kilowatt-hour of solar electricity produced.

Solar-produced electricity is estimated in some cases to cost as little as 2 cents/kWh. Solar projects have the potential to create numerous local jobs installing, maintaining, and monitoring solar systems. Installation and maintenance jobs - nearly 50 percent of the total cost - cannot be shipped overseas.

Solar benefits all. Solar has the potential to provide the following meaningful benefits to Idahoans, without singling out a particular group:
• Lowers peak electricity demand across the state.
• Reduces the need to build new transmission and distribution.
• Protects Idaho streams and rivers.
• Lowers emissions of global warming pollution.
• Challenges citizens of Idaho to take action and own the solution to the climate change problem.

The benefits of solar sound obvious.

Why is our state lagging? In these tough economic times, the last thing on Idahoans' minds is solar electricity production. This is why we need our local leaders to help bring about the necessary incentives and training to implement progressive technologies. As individuals, we have to take on much of the responsibility of reducing energy consumption and switching to cleaner sources. Idahoans have to own the solution to global warming, but our leaders must first provide resources to help create these new market opportunities.

The global imperative to move toward cleaner energy sources has presented unprecedented opportunity to create new jobs and bolster the struggling economy of Idaho. Through hard work, we have the potential to be on the cutting edge in this newly forming green economy - but not without solar.

## Saturday, June 27, 2009

### Netflix algorithm: Prize Tribute Recommendation Algorithm in Python

Posted by Danny Tarlow
The Netflix Prize
My little region of the internet is abuzz with news that the 10% improvement needed to win the $1M Netflix Prize has been achieved. The prize is now in the "last call" stage:
As of the submission by team "BellKor's Pragmatic Chaos" on June 26, 2009 18:42:37 UTC, the Netflix Prize competition entered the "last call" period for the Grand Prize. In accord with the Rules, teams have thirty (30) days, until July 26, 2009 18:42:37 UTC, to make submissions that will be considered for this Prize. Good luck and thank you for participating!

The winning method is a combination of several different teams' approaches, blended together using some form of ensemble learning. It's an interesting lesson -- to really eke out the last bit of performance, it appears to work better to throw the kitchen sink of methods at the problem and combine them in an intelligent way. This is rather contrary to the "ideal" research approach, where some clever, (relatively) simple, elegant breakthrough would lead to the best performance.
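As a toy illustration of what blending can mean (the numbers here are entirely made up, and the actual winning blend was far more sophisticated): fit least-squares weights to the base models' predictions on a held-out set, then predict with the weighted combination.

```python
import numpy as np

# Held-out true ratings, and three hypothetical base models' predictions
# for those same ratings (one column per model). All values are made up.
y = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
preds = np.array([
    [4.2, 3.9, 4.1],
    [2.8, 3.3, 3.1],
    [4.9, 4.6, 4.8],
    [2.2, 1.8, 2.5],
    [3.8, 4.3, 4.0],
])

# Solve for blending weights w that minimize ||preds @ w - y||^2.
w, _, _, _ = np.linalg.lstsq(preds, y, rcond=None)
blended = preds @ w

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

print("weights:", w)
print("best single-model RMSE:", min(rmse(preds[:, k], y) for k in range(3)))
print("blended RMSE:", rmse(blended, y))
```

Because each single model (weight 1 on one column) lies in the span the least-squares fit searches over, the blend can never do worse than the best single model on the held-out set.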

In honor of the prize barrier being broken, I put together a little implementation of an early leader's approach to the problem. They experimented with several different approaches, but the one I implement here is the original probabilistic matrix factorization (PMF) approach. To see all the details, including how it performs on the full Netflix problem, see Russ and Andrei's paper.

The Problem
The rating prediction problem is easy to explain: there is a set of users, a set of items (movies, in the Netflix case) and a set of ratings. For some combinations of user and movie, you have a rating, which says how much the user liked the movie. This is called the training set. From the ratings in the training set, we want to build a model that will predict the rating given by some previously unseen combination of user and movie.

In order to evaluate the model, some known ratings are taken out of the training set and put in a test set. The model then takes the training set as input, along with the user-movie combinations that are in the test set, but it isn't shown the actual ratings in the test set. Instead, it is asked to make predictions based on what it has learned from the training set. In the case of the Netflix challenge, Netflix never reveals the actual values of these ratings -- they will only take in a set of predictions and report back your score.

For example, a small data set might look as follows:
Danny - Good Will Hunting: 10
Danny - Prefontaine: 10
Danny - Casino Royale: 5
Tyler - Toy Story: 2
Tyler - Top Gun: 10
Tyler - Casino Royale: 10
Tyler - The Little Mermaid: 2
Tyler - Prefontaine: 6
Andrew - Up: 10
Andrew - Toy Story: 10
Andrew - Top Gun: 3

Probabilistic Matrix Factorization
The big idea of PMF is to describe each user and each movie by a small set of attributes. In this case, we could say something along the lines of: Danny likes dramas and doesn't really like action or Disney movies; Tyler loves action movies but hates Disney movies; and Andrew loves Disney movies but nothing else. Toy Story and Up are Disney movies, Top Gun and Casino Royale are action movies, and Good Will Hunting and Prefontaine are dramas.

This could be represented in the PMF model by using three-dimensional vectors to describe each user and each movie. If the first dimension is how much of a Disney movie it is, the second how much of an action movie, and the third how much of a drama, then the latent vectors might look something like this:
Danny: [0, 3, 10]
Andrew: [10, 3, 5]
Tyler: [0, 8, 4]

Good Will Hunting: [0, 0, 1]
Prefontaine: [0, .5, 1]
Casino Royale: [0, 1, .2]
Toy Story: [1, .1, .1]
Top Gun: [0, 1, .2]
...

To predict how a user rates a movie, we take the dot product of the user vector with the item vector. So to predict how Tyler would rate Top Gun:
0 * 0 + 8 * 1 + 4 * .2 = 8.8

Not bad, given that we know his actual rating was a 10.
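The worked example above is a couple of lines of numpy: predicted ratings are just dot products of user and movie latent vectors. (These are the illustrative vectors from the text, not learned values.)

```python
import numpy as np

# Latent vectors from the example: [disney, action, drama]
tyler = np.array([0.0, 8.0, 4.0])
top_gun = np.array([0.0, 1.0, 0.2])

print(np.dot(tyler, top_gun))  # close to 8.8, Tyler's predicted rating

# Stacking users and movies into matrices predicts every pair at once.
users = np.array([[0, 3, 10], [10, 3, 5], [0, 8, 4]], dtype=float)   # Danny, Andrew, Tyler
movies = np.array([[0, 0, 1], [0, .5, 1], [0, 1, .2]], dtype=float)  # Good Will Hunting, Prefontaine, Casino Royale
print(users @ movies.T)  # 3x3 matrix of predicted ratings
```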

Learning
However, in the PMF model, the descriptions of the users and movies are not known ahead of time. Instead, the learning algorithm automatically discovers these latent characteristics, finding latent representations of users and movies such that the model's predicted ratings match the ratings in the training set as well as possible. Going back to the example, if we wanted the predicted rating to match the actual rating even more closely, we could bump up Tyler's love for action movies (or make several other changes to the latent vectors). When you have many users rating many movies, it gets trickier to know how to change the latent vectors, because moving Tyler's rating for Top Gun closer to the truth might make several of his other ratings worse. Thankfully, calculus and the PMF algorithm help us out here.
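Sketching the calculus (my paraphrase, writing u_i for user vectors, v_j for item vectors, and lambda for the regularization strength; the factors of 2 get absorbed into the learning rate), the objective and its gradients are:

```latex
L(U, V) = -\sum_{(i,j) \in \text{obs}} (r_{ij} - u_i \cdot v_j)^2
          - \lambda \Big( \sum_i \lVert u_i \rVert^2 + \sum_j \lVert v_j \rVert^2 \Big)

\frac{\partial L}{\partial u_i} = 2 \sum_{j : (i,j) \in \text{obs}} (r_{ij} - u_i \cdot v_j)\, v_j - 2 \lambda u_i,
\qquad
\frac{\partial L}{\partial v_j} = 2 \sum_{i : (i,j) \in \text{obs}} (r_{ij} - u_i \cdot v_j)\, u_i - 2 \lambda v_j
```

Gradient ascent just moves each latent vector a small step in these directions, which is exactly what the update code below accumulates.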

For my implementation, I don't worry too much about making it efficient enough to scale to something like the Netflix problem, but I use the same model presented in the paper, and I use a similar gradient-ascent optimization scheme (though I use batch updates with momentum instead of stochastic updates).

I test my code using synthetic data, where I first make up latent vectors for users and items, then generate training set ratings by taking dot products of latent user and item vectors and adding some noise. I then discard the latent vectors and just give the model the synthetic ratings.

The code progresses iteratively, increasing the likelihood of the data given the latent vectors at each iteration. I use 5 dimensional latent vectors for these experiments, but you could use any small number. Here's what the likelihood (which you can think of as roughly the (negative) sum of squared error between predictions on the training set and actual training set values) looks like over time:

Results
Upon convergence of the training, the latent vectors look like this. There is one row for each individual user and item, then there are five columns per row -- these are the learned descriptions of users and movies:

Finally, using the learned latent vectors, we can then make rating predictions for any combination of user and movie. I do that here: The left side of the image has the vertical stacking of user latent vectors, the top side has the horizontal stacking of movie latent vectors, then the large bottom right region has the predicted ratings for every combination of user and item. Note that if a user or item has relatively small entries in all of its latent vectors, that represents something along the lines of being non-polar (apathetic, almost), while users or items with larger positive or negative latent values will be more polarizing -- the love-it-or-hate-it idea. You can see these sorts of patterns appear in the rows and columns.

The Code
And here is the code that generates the synthetic data, trains the model, and produces the figures:

Update (10/25/09): Fixed a bug in try_updates that was causing overcounting of the regularization component of the update. Thanks, Conrad.
import pylab
import matplotlib.pyplot as plt
import matplotlib.cm as cm

import numpy
import pickle


class ProbabilisticMatrixFactorization():

    def __init__(self, rating_tuples, latent_d=1):
        self.latent_d = latent_d
        self.learning_rate = .0001
        self.regularization_strength = 0.1

        self.ratings = numpy.array(rating_tuples).astype(float)
        self.converged = False

        self.num_users = int(numpy.max(self.ratings[:, 0]) + 1)
        self.num_items = int(numpy.max(self.ratings[:, 1]) + 1)

        print (self.num_users, self.num_items, self.latent_d)
        print self.ratings

        self.users = numpy.random.random((self.num_users, self.latent_d))
        self.items = numpy.random.random((self.num_items, self.latent_d))

        self.new_users = numpy.random.random((self.num_users, self.latent_d))
        self.new_items = numpy.random.random((self.num_items, self.latent_d))

    def likelihood(self, users=None, items=None):
        if users is None:
            users = self.users
        if items is None:
            items = self.items

        sq_error = 0

        for rating_tuple in self.ratings:
            if len(rating_tuple) == 3:
                (i, j, rating) = rating_tuple
                weight = 1
            elif len(rating_tuple) == 4:
                (i, j, rating, weight) = rating_tuple

            r_hat = numpy.sum(users[i] * items[j])

            sq_error += weight * (rating - r_hat)**2

        L2_norm = 0
        for i in range(self.num_users):
            for d in range(self.latent_d):
                L2_norm += users[i, d]**2

        for i in range(self.num_items):
            for d in range(self.latent_d):
                L2_norm += items[i, d]**2

        return -sq_error - self.regularization_strength * L2_norm

    def update(self):
        # Accumulate the (unregularized) gradient over all ratings.
        updates_o = numpy.zeros((self.num_users, self.latent_d))
        updates_d = numpy.zeros((self.num_items, self.latent_d))

        for rating_tuple in self.ratings:
            if len(rating_tuple) == 3:
                (i, j, rating) = rating_tuple
                weight = 1
            elif len(rating_tuple) == 4:
                (i, j, rating, weight) = rating_tuple

            r_hat = numpy.sum(self.users[i] * self.items[j])

            for d in range(self.latent_d):
                updates_o[i, d] += self.items[j, d] * (rating - r_hat) * weight
                updates_d[j, d] += self.users[i, d] * (rating - r_hat) * weight

        # Line search on the learning rate: shrink it until a step improves
        # the likelihood, and grow it again after a successful step.
        while (not self.converged):
            initial_lik = self.likelihood()

            print "  setting learning rate =", self.learning_rate
            self.try_updates(updates_o, updates_d)

            final_lik = self.likelihood(self.new_users, self.new_items)

            if final_lik > initial_lik:
                self.apply_updates(updates_o, updates_d)
                self.learning_rate *= 1.25

                if final_lik - initial_lik < .1:
                    self.converged = True

                break
            else:
                self.learning_rate *= .5
                self.undo_updates()

            if self.learning_rate < 1e-10:
                self.converged = True

        return not self.converged

    def apply_updates(self, updates_o, updates_d):
        for i in range(self.num_users):
            for d in range(self.latent_d):
                self.users[i, d] = self.new_users[i, d]

        for i in range(self.num_items):
            for d in range(self.latent_d):
                self.items[i, d] = self.new_items[i, d]

    def try_updates(self, updates_o, updates_d):
        # Take a tentative regularized gradient step into new_users and
        # new_items, leaving users and items untouched.
        alpha = self.learning_rate
        beta = -self.regularization_strength

        for i in range(self.num_users):
            for d in range(self.latent_d):
                self.new_users[i, d] = self.users[i, d] + \
                    alpha * (beta * self.users[i, d] + updates_o[i, d])
        for i in range(self.num_items):
            for d in range(self.latent_d):
                self.new_items[i, d] = self.items[i, d] + \
                    alpha * (beta * self.items[i, d] + updates_d[i, d])

    def undo_updates(self):
        # Don't need to do anything here
        pass

    def print_latent_vectors(self):
        print "Users"
        for i in range(self.num_users):
            print i,
            for d in range(self.latent_d):
                print self.users[i, d],
            print

        print "Items"
        for i in range(self.num_items):
            print i,
            for d in range(self.latent_d):
                print self.items[i, d],
            print

    def save_latent_vectors(self, prefix):
        self.users.dump(prefix + "%sd_users.pickle" % self.latent_d)
        self.items.dump(prefix + "%sd_items.pickle" % self.latent_d)


def fake_ratings(noise=.25):
    u = []
    v = []
    ratings = []

    num_users = 100
    num_items = 100
    num_ratings = 30
    latent_dimension = 10

    # Generate the latent user and item vectors
    for i in range(num_users):
        u.append(2 * numpy.random.randn(latent_dimension))
    for i in range(num_items):
        v.append(2 * numpy.random.randn(latent_dimension))

    # Get num_ratings ratings per user.
    for i in range(num_users):
        items_rated = numpy.random.permutation(num_items)[:num_ratings]

        for jj in range(num_ratings):
            j = items_rated[jj]
            rating = numpy.sum(u[i] * v[j]) + noise * numpy.random.randn()

            ratings.append((i, j, rating))  # thanks sunquiang

    return (ratings, u, v)


def plot_ratings(ratings):
    xs = []
    ys = []

    for i in range(len(ratings)):
        xs.append(ratings[i][1])
        ys.append(ratings[i][2])

    pylab.plot(xs, ys, 'bx')
    pylab.show()


def plot_latent_vectors(U, V):
    fig = plt.figure()
    ax = fig.add_subplot(121)
    cmap = cm.jet
    ax.imshow(U, cmap=cmap, interpolation='nearest')
    plt.title("Users")
    plt.axis("off")

    ax = fig.add_subplot(122)
    ax.imshow(V, cmap=cmap, interpolation='nearest')
    plt.title("Items")
    plt.axis("off")


def plot_predicted_ratings(U, V):
    r_hats = -5 * numpy.ones((U.shape[0] + U.shape[1] + 1,
                              V.shape[0] + V.shape[1] + 1))

    for i in range(U.shape[0]):
        for j in range(U.shape[1]):
            r_hats[i + V.shape[1] + 1, j] = U[i, j]

    for i in range(V.shape[0]):
        for j in range(V.shape[1]):
            r_hats[j, i + U.shape[1] + 1] = V[i, j]

    for i in range(U.shape[0]):
        for j in range(V.shape[0]):
            r_hats[i + U.shape[1] + 1, j + V.shape[1] + 1] = numpy.dot(U[i], V[j]) / 10

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.imshow(r_hats, cmap=cm.gray, interpolation='nearest')
    plt.title("Predicted Ratings")
    plt.axis("off")


if __name__ == "__main__":

    DATASET = 'fake'

    if DATASET == 'fake':
        (ratings, true_o, true_d) = fake_ratings()

        #plot_ratings(ratings)

    pmf = ProbabilisticMatrixFactorization(ratings, latent_d=5)

    liks = []
    while (pmf.update()):
        lik = pmf.likelihood()
        liks.append(lik)
        print "L=", lik

    plt.figure()
    plt.plot(liks)
    plt.xlabel("Iteration")
    plt.ylabel("Log Likelihood")

    plot_latent_vectors(pmf.users, pmf.items)
    plot_predicted_ratings(pmf.users, pmf.items)
    plt.show()

    pmf.print_latent_vectors()
    pmf.save_latent_vectors("models/")


## Wednesday, June 24, 2009

### Really Bayesian logistic regression in Python

Posted by Danny Tarlow
The typical derivation for logistic regression uses a maximum likelihood objective: vary model parameters so that the probability of the data given the model is greatest.

In practice, we need to be careful using maximum likelihood because of its tendency towards overfitting. One of the easiest ways to address the problem is to add some regularization to the objective -- telling the model that you prefer weights that are small, in turn encouraging the model to be simple. By simple, I mean less likely to pick up on spurious correlations in the data.

I don't think this was the original motivation for regularization (maybe a reader can help me if I'm wrong), but somebody eventually noticed that you could frame standard L1 or L2 regularization as equivalent to placing independent, zero-mean Laplace or Gaussian priors, respectively, on the model parameters. Then, optimizing an objective that includes both the prior and the maximum likelihood term means we're finding the MAP (maximum a posteriori) setting of parameters. Not only does our model work a little better, but we can also now get away with saying that we're being Bayesian. I'll let you decide which benefit is more important =P
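To spell the equivalence out for the Gaussian case (writing w for the parameter vector and sigma^2 for the prior variance; this is the standard two-line derivation, nothing specific to the code below):

```latex
\log P(w \mid D) = \log P(D \mid w) + \log P(w) - \log P(D),
\qquad
P(w) = \prod_k \mathcal{N}(w_k \mid 0, \sigma^2)

\log P(w) = -\frac{1}{2\sigma^2} \sum_k w_k^2 + \text{const}
\quad\Longrightarrow\quad
w_{\text{MAP}} = \arg\max_w \Big[ \log P(D \mid w) - \lambda \lVert w \rVert_2^2 \Big],
\quad \lambda = \frac{1}{2\sigma^2}
```

In the code later in this post, alpha is the prior precision (1/sigma^2), so the prior contributes exactly the (alpha/2) * beta_k^2 penalty terms.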

Still, there's something a little unsatisfying to the whole thing. We still don't have a very good idea about how sensitive our model is to changes in the model parameters, and we only have one set of weights to make predictions with -- which will probably give us overconfident predictions.

Instead, I wanted a model that was really Bayesian. By really Bayesian, I mean that I want a distribution over model parameters as my answer. Somebody could easily complain that (in this implementation) I'm not maintaining a distribution over my hyperparameters, so in that sense I'm not being really Bayesian, and that would be a nitpicky but valid complaint.

Anyhow, I took my logistic regression code as a starting point, and I modified it to be really Bayesian.

I started by deciding I wanted to go the Markov Chain Monte Carlo route, and I am fine alternating through each parameter and resampling from its conditional distribution while holding all other parameters fixed. However, I didn't want a Gibbs sampler because the normalization constants in the conditional distributions looked intimidating. By my math, the Gaussian case looks something like the following:
P(b_k | b_{-k}) \propto exp[ log N(b_k | 0, \alpha) + \sum_i log(g(y_i * (b_{-k} \cdot x_{i,-k} + b_k x_{ik})))]

where I use the notation b_{-k} to mean all b except for b_k, and alpha is the precision of the prior (and equivalently the strength of the regularization).

I got about this far, didn't see any easy way to move the likelihood terms into a big Gaussian, then I thought about computing the integral over b_k of that distribution. Soon after, I decided I didn't want to deal with computing the normalization constant.

The next step for me was to turn to slice sampling. Wikipedia has a short post on it, but I think Radford's paper is better motivated and easier to understand. Regardless, the big idea is as follows:
The method is based on the observation that to sample a random variable one can sample uniformly from the region under the graph of its density function.
The nice part about slice sampling is that we never need to evaluate the normalization constant. All we need is the ability to evaluate the likelihood function under any given setting of weights, and to find the interval(s) of the domain where the density lies above some threshold. If you haven't seen slice sampling before, I recommend reading the paper for the details. There's nothing I can say that will explain it better than Radford. Skip straight to Section 3 if you just want to get right into it.
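To make the idea concrete, here's a minimal 1-D slice sampler using Neal's stepping-out-and-shrink procedure on a toy unnormalized density (a standard normal without its constant). This is just an illustrative sketch, separate from the logistic regression code in this post.

```python
import random

def log_density(x):
    # Log of an unnormalized standard normal: exp(-x^2/2) with no constant.
    return -0.5 * x * x

def slice_sample(x0, n_samples, w=1.0):
    samples = []
    x = x0
    for _ in range(n_samples):
        # 1. Sample the auxiliary slice height, working in the log domain.
        log_y = log_density(x) - random.expovariate(1.0)
        # 2. Step out to find an interval containing the slice.
        left = x - w * random.random()
        right = left + w
        while log_density(left) > log_y:
            left -= w
        while log_density(right) > log_y:
            right += w
        # 3. Sample uniformly from the interval, shrinking it on rejections.
        while True:
            x_new = left + (right - left) * random.random()
            if log_density(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
    return samples

random.seed(0)
xs = slice_sample(0.0, 20000)
mean = sum(xs) / len(xs)
var = sum((v - mean) ** 2 for v in xs) / len(xs)
print(mean, var)  # should land near 0 and 1 for a standard normal
```

Note that nothing here ever needs the normalizer: only pointwise evaluations of the (log) density.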

Anyhow, keeping with my "poor man's" implementation where I never do any hard math, I kept the likelihood the same as in the original logistic regression, just adding one likelihood function that keeps all but one parameter fixed. To solve for intervals, I use scipy's root-finding optimizers. The objective is unimodal, so there aren't any tricky cases where I have to find discontinuous regions. This lets me call the root finder once, decide if I want to look left or right for the other root, then call the root finder again.

That's about it. Since I already have code to find the MAP solution, I initialize the weights by solving the regularized logistic regression problem, then I leave it up to the slice sampler to explore the surrounding region. Every 10th pass through the resampling, I pull out the current set of weights as a sample.

To see what this looks like, I generated a one dimensional data set then built a logistic regression model where the first dimension of the X's is always 1, and the second dimension is the data. So the weights that the model learns are essentially like intercept and slope terms.

I set the sampler running for 1000 * 10 iterations, and added every 10th set of weights to a histogram. Here's what it comes up with. The intercept weight is on the x axis, and the slope weight on the y axis:
In addition, I plotted the MAP setting of weights with a red X. You have to look fairly hard to see it, but it's where it should be, in the highest density region of samples.

I haven't debugged this code as carefully, but I'll still post it here. Leave a note if you use it so I can let you know if any bugs come up that I fix. I may also put up some more detailed experiments and visualizations at some point.
from scipy.optimize.optimize import fmin_cg, fmin_bfgs, fmin, check_grad
from scipy.optimize import fsolve, bisect
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class SyntheticClassifierData():

    def __init__(self, N, d):
        """ Create N instances of d dimensional input vectors and a 1D
        class label (-1 or 1). """

        means = .25 * np.array([[-1] * (d-1), [1] * (d-1)])

        self.X_train = np.zeros((N, d))
        self.Y_train = np.zeros(N)
        for i in range(N):
            if np.random.random() > .5:
                y = 1
            else:
                y = 0
            self.X_train[i, 0] = 1
            self.X_train[i, 1:] = np.random.random(d-1) + means[y, :]
            self.Y_train[i] = 2.0 * y - 1

        self.X_test = np.zeros((N, d))
        self.Y_test = np.zeros(N)
        for i in range(N):
            if np.random.random() > .5:
                y = 1
            else:
                y = 0
            self.X_test[i, 0] = 1
            self.X_test[i, 1:] = np.random.random(d-1) + means[y, :]
            self.Y_test[i] = 2.0 * y - 1


class LogisticRegression():
    """ A simple logistic regression model with L2 regularization (zero-mean
    Gaussian priors on parameters). """

    def __init__(self, x_train=None, y_train=None, x_test=None, y_test=None,
                 alpha=.1, synthetic=False):

        # Set L2 regularization strength
        self.alpha = alpha

        # Set the data.
        self.set_data(x_train, y_train, x_test, y_test)

        # Initialization only matters if you don't call train().
        self.all_betas = []
        self.betas = np.random.randn(self.x_train.shape[1])

    def negative_lik(self, betas):
        return -1 * self.lik(betas)

    def lik(self, betas):
        """ Likelihood of the data under the given settings of parameters. """

        # Data likelihood
        l = 0
        for i in range(self.n):
            l += np.log(sigmoid(self.y_train[i] * \
                                np.dot(betas, self.x_train[i,:])))

        # Prior likelihood
        for k in range(0, self.x_train.shape[1]):
            l -= (self.alpha / 2.0) * betas[k]**2

        return l

    def lik_k(self, beta_k, k):
        """ The likelihood only in terms of beta_k. """

        new_betas = self.betas.copy()
        new_betas[k] = beta_k

        return self.lik(new_betas)

    def train(self):
        """ Define the gradient and hand it off to a scipy gradient-based
        optimizer. """

        # Define the derivative of the likelihood with respect to beta_k.
        # Need to multiply by -1 because we will be minimizing.
        dB_k = lambda B, k : (k > -1) * self.alpha * B[k] - np.sum([ \
                                 self.y_train[i] * self.x_train[i, k] * \
                                 sigmoid(-self.y_train[i] * \
                                         np.dot(B, self.x_train[i,:])) \
                                 for i in range(self.n)])

        # The full gradient is just an array of componentwise derivatives
        dB = lambda B : np.array([dB_k(B, k) \
                                  for k in range(self.x_train.shape[1])])

        # Optimize
        self.betas = fmin_bfgs(self.negative_lik, self.betas, fprime=dB)

    def resample(self):
        """ Use slice sampling to pull a new draw for logistic regression
        parameters from the posterior distribution on beta. """

        failures = 0
        for i in range(10):
            try:
                new_betas = np.zeros(self.betas.shape[0])
                order = range(self.betas.shape[0])
                order.reverse()
                for k in order:
                    new_betas[k] = self.resample_beta_k(k)
            except:
                failures += 1
                continue

            for k in range(self.betas.shape[0]):
                self.betas[k] = new_betas[k]

        print self.betas,
        print self.lik(self.betas)

        self.all_betas.append(self.betas.copy())

        if failures > 0:
            print "Warning: %s root-finding failures" % (failures)

    def resample_beta_k(self, k):
        """ Resample beta_k conditional upon all other settings of beta.
        This can be used in the inner loop of a Gibbs sampler to get a
        full posterior over betas.

        Uses slice sampling (Neal, 2003). """

        #print "Resampling %s" % k

        # Sample uniformly in (0, f(x0)), but do it in the log domain
        lik = lambda b_k : self.lik_k(b_k, k)
        x0 = self.betas[k]
        g_x0 = lik(x0)
        e = np.random.exponential()
        z = g_x0 - e

        # Find the slice of x where z < g(x0) (or where y < f(x0))
        #print "y=%s" % exp(z)
        lik_minus_z = lambda b_k : (self.lik_k(b_k, k) - z)

        # Find the zeros of lik_minus_z to give the interval defining the slice
        r0 = fsolve(lik_minus_z, x0)

        # Figure out which direction the other root is in
        eps = .001
        look_right = False
        if lik_minus_z(r0 + eps) > 0:
            look_right = True

        if look_right:
            r1 = bisect(lik_minus_z, r0 + eps, 1000)
        else:
            r1 = bisect(lik_minus_z, -1000, r0 - eps)

        L = min(r0, r1)
        R = max(r0, r1)
        x = (R - L) * np.random.random() + L

        #print "S in (%s, %s) -->" % (L, R),
        #print "%s" % x
        return x

    def set_data(self, x_train, y_train, x_test, y_test):
        """ Take data that's already been generated. """

        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test
        self.n = y_train.shape[0]

    def training_reconstruction(self):
        p_y1 = np.zeros(self.n)
        for i in range(self.n):
            p_y1[i] = sigmoid(np.dot(self.betas, self.x_train[i,:]))

        return p_y1

    def test_predictions(self):
        p_y1 = np.zeros(self.n)
        for i in range(self.n):
            p_y1[i] = sigmoid(np.dot(self.betas, self.x_test[i,:]))

        return p_y1

    def plot_training_reconstruction(self):
        plot(np.arange(self.n), .5 + .5 * self.y_train, 'bo')
        plot(np.arange(self.n), self.training_reconstruction(), 'rx')
        ylim([-.1, 1.1])

    def plot_test_predictions(self):
        plot(np.arange(self.n), .5 + .5 * self.y_test, 'yo')
        plot(np.arange(self.n), self.test_predictions(), 'rx')
        ylim([-.1, 1.1])


if __name__ == "__main__":
    from pylab import *

    data = SyntheticClassifierData(50, 2)

    alphas = [.1]
    for j, a in enumerate(alphas):
        figure()

        # Create a new learner, but use the same data for each run
        lr = LogisticRegression(x_train=data.X_train, y_train=data.Y_train,
                                x_test=data.X_test, y_test=data.Y_test,
                                alpha=a)

        print "Initial likelihood:"
        print lr.lik(lr.betas)

        # Train the model
        lr.train()
        map_betas = lr.betas.copy()

        subplot(10, 2, 1)
        lr.plot_training_reconstruction()
        ylabel("Alpha=%s" % a)
        if j == 0:
            title("Training set reconstructions")
        axis('off')

        subplot(10, 2, 2)
        lr.plot_test_predictions()
        if j == 0:
            title("Test set predictions")
        axis('off')

        # Display execution info
        print "Final betas:"
        print lr.betas
        print "Final lik:"
        print lr.lik(lr.betas)

        for l in range(1, 10000):
            lr.resample()

            # Plot the results
            if l % 1000 == 0:
                subplot(10, 2, 2*(l/1000) + 1)
                lr.plot_training_reconstruction()
                axis('off')
                subplot(10, 2, 2*(l/1000) + 2)
                lr.plot_test_predictions()
                axis('off')

        figure()
        all_betas = np.array(lr.all_betas)

        hexbin(all_betas[:,0], all_betas[:,1], bins='log')
        plot([map_betas[0]], [map_betas[1]], 'rx', markersize=10)
        show()


## Monday, June 22, 2009

### How to display pretty code in Blogger

Posted by Danny Tarlow
I've been posting more code directly to the blog lately, so I figured it was time to figure out how to make it look nicer. Thanks to Alex Gorbatchev and techqi, this should be pretty easy.

Here's a test displaying the ultra-famous 24 line Affinity Propagation MATLAB implementation:
N=size(S,1); A=zeros(N,N); R=zeros(N,N); % Initialize messages
S=S+(eps*S+realmin*100).*rand(N,N); % Remove degeneracies
lam=0.5; % Set damping factor
for iter=1:100
    % Compute responsibilities
    Rold=R;
    AS=A+S; [Y,I]=max(AS,[],2);
    for i=1:N AS(i,I(i))=-realmax; end;
    [Y2,I2]=max(AS,[],2);
    R=S-repmat(Y,[1,N]);
    for i=1:N R(i,I(i))=S(i,I(i))-Y2(i); end;
    R=(1-lam)*R+lam*Rold; % Dampen responsibilities

    % Compute availabilities
    Aold=A;
    Rp=max(R,0); for k=1:N Rp(k,k)=R(k,k); end;
    A=repmat(sum(Rp,1),[N,1])-Rp;
    dA=diag(A); A=min(A,0); for k=1:N A(k,k)=dA(k); end;
    A=(1-lam)*A+lam*Aold; % Dampen availabilities
end;
E=R+A; % Pseudomarginals
I=find(diag(E)>0); K=length(I); % Indices of exemplars
[tmp c]=max(S(:,I),[],2); c(I)=1:K; idx=I(c); % Assignments


Edit: not working yet

Edit: working, but I had to change global line break settings to not convert line breaks. Now all my old posts have the paragraphs messed up.

Edit: I made a couple modifications.
• I moved the CSS for syntaxhighlighter directly into my Blogger template so that it loads faster.
• I changed the color of the line to the left of the text from green to something that matches my color scheme a bit better

Posted by Danny Tarlow
TO DO: Play around with Dumbo.
Dumbo is a Python module that allows you to easily write and run Hadoop streaming programs (it's named after Disney's flying circus elephant, since the logo for Hadoop is an elephant and Python was named after the BBC series "Monty Python's Flying Circus").
It apparently makes running Hadoop via Python a breeze: http://www.audioscrobbler.net/development/dumbo/
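For flavor, here's roughly what the canonical Dumbo wordcount looks like: mappers and reducers are plain Python generators. On a real cluster you'd hand these to dumbo.run(); the local driver lines below are just an illustrative stand-in for the map-shuffle-reduce flow.

```python
from itertools import groupby

def mapper(key, value):
    # Emit (word, 1) for every word on an input line.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts that arrive for each word.
    yield key, sum(values)

# Local stand-in for the Hadoop streaming machinery: map, sort, group, reduce.
lines = ["the quick brown fox", "the lazy dog"]
pairs = sorted(kv for k, line in enumerate(lines) for kv in mapper(k, line))
counts = dict(out for word, group in groupby(pairs, key=lambda p: p[0])
              for out in reducer(word, (c for _, c in group)))
print(counts)  # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```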

## Sunday, June 21, 2009

### Stanford Machine Learning and Other Online Courses

Posted by Danny Tarlow
I'm not sure how long this has been around, but I just came across Stanford's repository of classes posted on YouTube:

I count somewhere on the order of 25 courses, which isn't bad. The one that caught my eye is Machine Learning, taught by Andrew Ng. This course was my first introduction to machine learning -- during the winter of my second year of undergrad -- and one of the reasons I got into the field (even though I felt like I was in wayyy over my head during parts).

If you're looking for just an introduction to what machine learning is along with some cool example applications, watch the second half of the first lecture:

I'd love to see Daphne's course on graphical models up too. That was another incredibly influential course for me.

## Wednesday, June 17, 2009

### Large Scale Graph Computing

Posted by Danny Tarlow
The official Google research blog has a post on large scale graph computing:
we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges' states, and mutate the graph's topology ... Developers of dozens of Pregel applications within Google have found that "thinking like a vertex," which is the essence of programming in Pregel, is intuitive.
This sounds pretty powerful. It's interesting to think about what new problems you could think about tackling if you had access to this type of computing power. One early thought I had would be to run belief propagation algorithms over graphical models with nodes representing something like n-grams, search queries, or names. You could imagine a network where every mention of a name on the internet gets its own node (or set of nodes) in a factor graph. The goal would be to cluster mentions of names, so that mentions referring to the same underlying person would be put in the same group, while mentions referring to different underlying people would be put in different groups.
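Pregel itself isn't available outside Google, but the superstep model it describes is easy to mimic on a single machine. Here's a purely illustrative toy simulation of "thinking like a vertex" -- the function and variable names are my own invention, not Pregel's API. Each vertex repeatedly adopts the smallest vertex id it has heard of via messages, which labels the graph's connected components, a classic vertex-centric computation.

```python
# Toy single-machine simulation of the Pregel superstep/message-passing model.
# Each vertex adopts the smallest id it has received, so at convergence every
# vertex in a connected component carries that component's smallest vertex id.

def pregel_components(edges, num_vertices):
    neighbors = {v: [] for v in range(num_vertices)}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # Each vertex's current value is the smallest id it knows about.
    value = {v: v for v in range(num_vertices)}

    # Superstep 0: every vertex announces its own id to its neighbors.
    inbox = {v: list(neighbors[v]) for v in range(num_vertices)}

    active = True
    while active:
        active = False
        outbox = {v: [] for v in range(num_vertices)}
        for v in range(num_vertices):
            # Receive messages from the previous superstep, independently
            # of the other vertices...
            if inbox[v] and min(inbox[v]) < value[v]:
                value[v] = min(inbox[v])
                active = True
                # ...and send messages to be delivered in the next superstep.
                for u in neighbors[v]:
                    outbox[u].append(value[v])
        inbox = outbox
    return value
```

The real system would shard the vertices across machines and deliver the outboxes over the network, but the per-vertex logic is exactly this local.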

(Shameless self-promotion ahead.) A natural way to tackle this problem would be exemplar-based clustering with flexible priors.

Two sources of information would come into play in the model.
• First, you'd need to define a similarity measure between mentions of names: how different is the actual text (e.g., Danny and Daniel should be close, while Danny and Andy should be far); do the mentions occur on the same page?; do the pages they are on link to each other?; how similar is the other text on the page surrounding the mention?; etc.
• Second, you'd need some way to decide how many names there are (model selection). One way might be to just choose that you want k different groups. This is how, for example, the k-means, k-medians, and normalized cut algorithms work. This is probably less than ideal, though, since many of these methods have implicit priors that encourage all of the groups to be roughly the same size (see my paper for details).

Instead, it might be natural to express a prior distribution over the size of groups that we expect to see in the name-mention groups. You can imagine that some names will occur extremely frequently (E.g., Paris Hilton, Kobe Bryant, Barack Obama), while others will occur infrequently (E.g., me), and yet others will occur maybe once at best (E.g., my Gramma Lorraine). Using flexible priors, we can express this prior belief.

If nothing else, running this algorithm would surely put their suggestion of infinite computing power to test:
so far we haven't come across a type of graph or a practical graph computing problem which is not solvable with Pregel
What would you do if you had this kind of power on your desktop?

## Sunday, June 7, 2009

### Python nearest neighbors binary classifier

Posted by Danny Tarlow
Binary Classification 101: Round 2
I wrote last time about how easy it is to build a logistic regression classifier in Python using numpy and scipy.

Maintaining the "Machine Learning 101" theme, here's another dead-simple classifier. Remember, we're still talking about binary classification problems:
binary classification problems -- where you have some examples that are "on" and other examples that are "off." You get as input a training set which has some examples of each class along with a label saying whether each example is "on" or "off". The goal is to learn a model from the training data so that you can predict the label of new examples that you haven't seen before and don't know the label of.

For one example, suppose that you have data describing a bunch of buildings and earthquakes (e.g., year the building was constructed, type of material used, strength of earthquake, etc.), and you know whether each building collapsed ("on") or not ("off") in each past earthquake. Using this data, you'd like to make predictions about whether a given building is going to collapse in a hypothetical future earthquake.

Classification with Nearest Neighbors
This time we're going to build a nearest neighbors classifier. The idea is even simpler than logistic regression: when you want to predict the label of a point, look at the most similar points, and report the average label as your prediction. For example, suppose we have four examples in a 2D training set:
| Point | Label |
| --- | --- |
| [0, 0] | 0 |
| [0, 5] | 1 |
| [10, 0] | 0 |
| [10, 5] | 1 |
Now suppose we want to predict the label of a point at [0, 2], and we're using only the first nearest neighbor. In this case, we'd note that [0, 0] is the closest point, and it has the label 0. So our prediction for [0, 2] with k=1 is a 100% vote for label 0.

Now suppose we choose k=3, which means our predictions are using the three nearest neighbors. In this case, we'd find that [0, 0] (label 0), [0, 5] (label 1), and [10, 0] (label 0) are the closest points. If we want to give a probability as our answer, then a reasonable estimate would be that there's a 2/3 chance of the label being 0, and a 1/3 chance of the label being 1. If we just want to make a prediction of "off" (0) or "on" (1), we'd probably guess "off."
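The toy example above is easy to check with a few lines of brute-force numpy -- no kd-tree, just explicit distances. (The knn_predict helper here is a throwaway for illustration, not part of the classifier below.)

```python
import numpy as np

# The four training points and their labels from the toy example.
X = np.array([[0, 0], [0, 5], [10, 0], [10, 5]])
y = np.array([0, 1, 0, 1])

def knn_predict(query, k):
    # Euclidean distance from the query to every training point
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))
    # Indices of the k nearest training points
    nearest = np.argsort(dists)[:k]
    # Fraction of "on" labels among the neighbors
    return y[nearest].mean()

print(knn_predict(np.array([0, 2]), k=1))  # 1-NN: [0, 0] is closest, label 0
print(knn_predict(np.array([0, 2]), k=3))  # 3-NN: two 0's and one 1, so 1/3
```

For k=1 this votes 100% for label 0, and for k=3 it gives the 1/3 probability of "on" described above.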

Experiments
If you look back at the logistic regression post, you'll notice that there's nothing preventing us from running the same set of experiments as last time using a nearest neighbors classifier instead of the logistic regression. It's just training set in, classifier out. Once you have your classifier, you can ask it to make predictions about new points.

The one change is that instead of having a regularization parameter, we have a k parameter, which tells us how many neighbors to look at when making a prediction. Here's what happens when you run it for a few different settings of k:

It's kind of interesting, huh? Setting k to 1 causes some overfitting that looks quite similar to unregularized logistic regression. Choosing k to be larger has a somewhat similar effect to increasing the regularization strength.

The Code
And here's the code that implements the classifier and produces the figure. For those of you who thought this was all super obvious, take a look at how easy scipy's KDTree class makes this. I almost feel bad taking any credit at all for writing this code.

Also, if you're George, hopefully you'll be happy to see that I took some of your design suggestions into account when deciding how to structure the interface.

from scipy.spatial import KDTree
import numpy as np

from synthetic_classifier_data import *


class NearestNeighborBinaryClassifier():

    def __init__(self, x_train, y_train):
        self.set_data(x_train, y_train)

    def set_data(self, x_train, y_train):
        """ y_train entries should be either -1 or 1."""
        self.x_train = x_train
        self.y_train = y_train
        self.n = y_train.shape[0]

        self.kd_tree = KDTree(self.x_train)

    def predict(self, X, k=3):
        predictions = np.zeros(X.shape[0])

        for i in range(X.shape[0]):
            # Let the kd-tree do all the real work
            d_i, n_i = self.kd_tree.query(X[i, :], k=k)

            # Map labels from {-1, 1} to {0, 1}, then average over the k neighbors
            predictions[i] = np.sum(.5 + .5 * self.y_train[n_i]) / float(k)

        return predictions


if __name__ == "__main__":
    from pylab import *

    # Create 20 dimensional data set with 25 points
    data = SyntheticClassifierData(25, 20)
    nn = NearestNeighborBinaryClassifier(data.X_train, data.Y_train)

    # Run for a variety of settings of k
    ks = [1, 3, 5, 10]
    for j, k in enumerate(ks):
        # Predict the label of each (training and test) y given its
        # k nearest neighbors
        hat_y_train = nn.predict(data.X_train, k=k)
        hat_y_test = nn.predict(data.X_test, k=k)

        # Plot the results
        subplot(len(ks), 2, 2*j + 1)
        plot(np.arange(data.X_train.shape[0]), .5 + .5 * data.Y_train, 'bo')
        plot(np.arange(data.X_train.shape[0]), hat_y_train, 'rx')
        ylim([-.1, 1.1])
        ylabel("K=%s" % k)
        if j == 0:
            title("Training set reconstructions")

        subplot(len(ks), 2, 2*j + 2)
        plot(np.arange(data.X_test.shape[0]), .5 + .5 * data.Y_test, 'yo')
        plot(np.arange(data.X_test.shape[0]), hat_y_test, 'rx')
        ylim([-.1, 1.1])
        if j == 0:
            title("Test set predictions")

    show()


## Saturday, June 6, 2009

### Python logistic regression (with L2 regularization)

Posted by Danny Tarlow
Logistic Regression
Logistic regression is used for binary classification problems -- where you have some examples that are "on" and other examples that are "off." You get as input a training set which has some examples of each class along with a label saying whether each example is "on" or "off". The goal is to learn a model from the training data so that you can predict the label of new examples that you haven't seen before and don't know the label of.

For one example, suppose that you have data describing a bunch of buildings and earthquakes (e.g., year the building was constructed, type of material used, strength of earthquake, etc.), and you know whether each building collapsed ("on") or not ("off") in each past earthquake. Using this data, you'd like to make predictions about whether a given building is going to collapse in a hypothetical future earthquake.

One of the first models that would be worth trying is logistic regression.

Coding it up
I wasn't working on this exact problem, but I was working on something close. Being one to practice what I preach, I started looking for a dead simple Python logistic regression class. The only requirement is that I wanted it to support L2 regularization (more on this later). I'm also sharing this code with a bunch of other people on many platforms, so I wanted as few dependencies on external libraries as possible.

I couldn't find exactly what I wanted, so I decided to take a stroll down memory lane and implement it myself. I've written it in C++ and Matlab before but never in Python.

I won't do the derivation, but there are plenty of good explanations out there to follow if you're not afraid of a little calculus. This one looks good, for example. The big idea is to write down the probability of the data given some setting of internal parameters, then to take the derivative, which will tell you how to change the internal parameters to make the data more likely. Got it? Good.
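If you'd rather not take the calculus on faith, you can check the analytic derivative against finite differences. Here's a small sketch on made-up random data: lik and grad are vectorized versions of what the class below computes for labels in {-1, 1}, with the regularization term left out for clarity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Log-likelihood of labels y in {-1, 1} under logistic regression, and its
# analytic gradient: d/dbeta_k = sum_i y_i * x_ik * sigmoid(-y_i * beta . x_i)
def lik(betas, X, y):
    return np.sum(np.log(sigmoid(y * X.dot(betas))))

def grad(betas, X, y):
    return X.T.dot(y * sigmoid(-y * X.dot(betas)))

# Compare against central finite differences on random data.
rng = np.random.RandomState(0)
X = rng.randn(10, 3)
y = np.where(rng.rand(10) > .5, 1.0, -1.0)
betas = rng.randn(3)

eps = 1e-6
numeric = np.array([(lik(betas + eps * e, X, y) - lik(betas - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.max(np.abs(numeric - grad(betas, X, y))))  # should be tiny
```

If the printed number isn't close to zero, the derivation (or the code) is wrong, which makes this a handy habit whenever you hand a gradient to an optimizer.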

For those of you out there that know logistic regression inside and out, take a look at how short the train() method is. I really like how easy it is to do in Python.

Regularization
I caught a little indirect flak during March madness season for talking about how I regularized the latent vectors in my matrix-factorization model of team offensive and defensive strengths. See here and here.

But seriously, guys -- it's a good idea.

Let me drive home the point. Take a look at this. This is the output of running the code below:

Take a look at the top row.

On the left side, you have the training set. There are 25 examples laid out along the x axis, and the y axis tells you if the example is "on" (1) or "off" (0). For each of these examples, there's a vector describing its attributes that I'm not showing. After training the model, I ask the model to ignore the known training set labels and to estimate the probability that each label is "on" based only on each example's description vector and what the model has learned (hopefully things like stronger earthquakes and older buildings increase the likelihood of collapse). The probabilities are shown by the red X's. In the top left, the red X's are right on top of the blue dots, so it is very sure about the labels of the examples, and it's always correct.

Now on the right side, we have some new examples that the model hasn't seen before. This is called the test set. This is essentially the same as the left side, but the model knows nothing about the test set class labels (yellow dots). What you see is that it still does a decent job of predicting the labels, but there are some troubling cases where it is very confident and very wrong. This is known as overfitting.

This is where regularization comes in. As you go down the rows, there is stronger L2 regularization -- or equivalently, pressure on the internal parameters to be zero. This has the effect of reducing the model's certainty. Just because it can perfectly reconstruct the training set doesn't mean that it has everything figured out. You can imagine that if you were relying on this model to make important decisions, it would be desirable to have at least a bit of regularization in there.

And here's the code. It looks long, but most of it is to generate the data and plot the results. The bulk of the work is done in the train() method, which is only three (dense) lines. It requires numpy, scipy, and pylab.

* For full disclosure, I should admit that I generated my random data in a way such that it is susceptible to overfitting, possibly making logistic-regression-without-regularization look worse than it is.
from scipy.optimize import fmin_bfgs
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class SyntheticClassifierData():

    def __init__(self, N, d):
        """ Create N instances of d dimensional input vectors and a 1D
        class label (-1 or 1). """

        means = .05 * np.random.randn(2, d)

        self.X_train = np.zeros((N, d))
        self.Y_train = np.zeros(N)
        for i in range(N):
            if np.random.random() > .5:
                y = 1
            else:
                y = 0
            self.X_train[i, :] = np.random.random(d) + means[y, :]
            self.Y_train[i] = 2.0 * y - 1

        self.X_test = np.zeros((N, d))
        self.Y_test = np.zeros(N)
        for i in range(N):
            if np.random.random() > .5:
                y = 1
            else:
                y = 0
            self.X_test[i, :] = np.random.random(d) + means[y, :]
            self.Y_test[i] = 2.0 * y - 1


class LogisticRegression():
    """ A simple logistic regression model with L2 regularization (zero-mean
    Gaussian priors on parameters). """

    def __init__(self, x_train=None, y_train=None, x_test=None, y_test=None,
                 alpha=.1):

        # Set L2 regularization strength
        self.alpha = alpha

        # Set the data.
        self.set_data(x_train, y_train, x_test, y_test)

        # Initialize parameters to zero, for lack of a better choice.
        self.betas = np.zeros(self.x_train.shape[1])

    def negative_lik(self, betas):
        return -1 * self.lik(betas)

    def lik(self, betas):
        """ Likelihood of the data under the current settings of parameters. """

        # Data likelihood
        l = 0
        for i in range(self.n):
            l += np.log(sigmoid(self.y_train[i] *
                                np.dot(betas, self.x_train[i, :])))

        # Prior likelihood (don't regularize the bias term, k = 0)
        for k in range(1, self.x_train.shape[1]):
            l -= (self.alpha / 2.0) * betas[k]**2

        return l

    def train(self):
        """ Define the gradient and hand it off to a scipy gradient-based
        optimizer. """

        # Define the derivative of the likelihood with respect to beta_k.
        # Need to multiply by -1 because we will be minimizing.
        dB_k = lambda B, k: (k > 0) * self.alpha * B[k] - np.sum([
            self.y_train[i] * self.x_train[i, k] *
            sigmoid(-self.y_train[i] *
                    np.dot(B, self.x_train[i, :]))
            for i in range(self.n)])

        # The full gradient is just an array of componentwise derivatives
        dB = lambda B: np.array([dB_k(B, k)
                                 for k in range(self.x_train.shape[1])])

        # Optimize
        self.betas = fmin_bfgs(self.negative_lik, self.betas, fprime=dB)

    def set_data(self, x_train, y_train, x_test, y_test):
        """ Take data that's already been generated. """

        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test
        self.n = y_train.shape[0]

    def training_reconstruction(self):
        p_y1 = np.zeros(self.n)
        for i in range(self.n):
            p_y1[i] = sigmoid(np.dot(self.betas, self.x_train[i, :]))

        return p_y1

    def test_predictions(self):
        p_y1 = np.zeros(self.n)
        for i in range(self.n):
            p_y1[i] = sigmoid(np.dot(self.betas, self.x_test[i, :]))

        return p_y1

    def plot_training_reconstruction(self):
        plot(np.arange(self.n), .5 + .5 * self.y_train, 'bo')
        plot(np.arange(self.n), self.training_reconstruction(), 'rx')
        ylim([-.1, 1.1])

    def plot_test_predictions(self):
        plot(np.arange(self.n), .5 + .5 * self.y_test, 'yo')
        plot(np.arange(self.n), self.test_predictions(), 'rx')
        ylim([-.1, 1.1])


if __name__ == "__main__":
    from pylab import *

    # Create 20 dimensional data set with 25 points -- this will be
    # susceptible to overfitting.
    data = SyntheticClassifierData(25, 20)

    # Run for a variety of regularization strengths
    alphas = [0, .001, .01, .1]
    for j, a in enumerate(alphas):

        # Create a new learner, but use the same data for each run
        lr = LogisticRegression(x_train=data.X_train, y_train=data.Y_train,
                                x_test=data.X_test, y_test=data.Y_test,
                                alpha=a)

        print "Initial likelihood:"
        print lr.lik(lr.betas)

        # Train the model
        lr.train()

        # Display execution info
        print "Final betas:"
        print lr.betas
        print "Final lik:"
        print lr.lik(lr.betas)

        # Plot the results
        subplot(len(alphas), 2, 2*j + 1)
        lr.plot_training_reconstruction()
        ylabel("Alpha=%s" % a)
        if j == 0:
            title("Training set reconstructions")

        subplot(len(alphas), 2, 2*j + 2)
        lr.plot_test_predictions()
        if j == 0:
            title("Test set predictions")

    show()


## Thursday, June 4, 2009

### Organic rankings: online marketing update

Posted by Danny Tarlow
I wrote a few days ago about doing some online marketing for a rental home in Lake Oswego, OR (my hometown). Since I only registered the site a couple hours before I posted, there wasn't much to say about organic search engine results, and I actually expected that to be the case for several weeks. Nonetheless, being a good (read: compulsive) data collector, I've been tracking the site's rank in Google for several relevant keywords. Granted it's not the largest market in the world (Lake Oswego has a population of about 35,000), but I was still fairly surprised to see that we've cracked the top 100 Google results only four days after buying the domain. Organic traffic doesn't seem to have kicked in at all yet, but it's still pretty cool! Here are the charts. A rank of 100 means I didn't find it on the first ten pages of results:

## Tuesday, June 2, 2009

### Python multidimensional sparse array of objects

Posted by Danny Tarlow
A research project I'm working on requires that I store a ton of objects. Each object can be identified by a 4-dimensional index, but for any given execution, I only need a small (different) subset of the indexes to be used. I was surprised not to find something that fits my needs in numpy, scipy, or by Googling (maybe I didn't look hard enough).

Anyhow, I grabbed the blist module and built a simple multidimensional version on top of it. It isn't fancy, but it does the trick for me.

Let me know if you know of any more fleshed-out packages that can do this.
from blist import *


class SparseArray():
    def __init__(self, dims):
        self.dims = dims

        self.size = 1
        for d in self.dims:
            self.size *= d

        # blist makes this huge repeated list cheap to create and copy
        self.base_list = self.size * blist([0])

    def base_index(self, index):
        """ Convert N-d index to index in base list. """

        result = 0
        multiplier = 1
        for i in range(len(index)):
            result += index[i] * multiplier
            multiplier *= self.dims[i]

        if result >= self.size:
            raise IndexError("Index out of range %s (size: %s)" % (result,
                                                                   self.size))

        return result

    def get(self, index):
        return self.base_list[self.base_index(index)]

    def set(self, index, val):
        self.base_list[self.base_index(index)] = val


if __name__ == "__main__":
    s = SparseArray((100, 100, 100, 100))

    s.set((0, 0, 0, 0), ['first'])
    s.set((23, 33, 10, 5), 10)
    s.set((99, 99, 99, 99), ['last'])

    print s.get((0, 0, 0, 0))
    print s.get((23, 33, 10, 5))
    print s.get((99, 99, 99, 98))
    print s.get((99, 99, 99, 99))

    try:
        print s.get((99, 99, 99, 100))
    except IndexError:
        print "Properly caught index error"
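As a sanity check on the index arithmetic: base_index flattens the N-d index with the first index varying fastest, i.e. Fortran order, and numpy's ravel_multi_index computes exactly the same thing with order='F'. This is just a cross-check of the math, not a replacement for the blist-backed storage.

```python
import numpy as np

# The same flattening base_index performs, done by numpy:
# 23 + 33*100 + 10*100**2 + 5*100**3
dims = (100, 100, 100, 100)
flat = np.ravel_multi_index((23, 33, 10, 5), dims, order='F')
print(flat)  # 5103323
```

(One difference: ravel_multi_index raises ValueError on out-of-range indices rather than IndexError, and it checks each dimension individually.)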