## Saturday, June 6, 2009

### Python logistic regression (with L2 regularization)

Posted by Danny Tarlow
Logistic Regression
Logistic regression is used for binary classification problems -- where you have some examples that are "on" and other examples that are "off." You get as input a training set which has some examples of each class along with a label saying whether each example is "on" or "off". The goal is to learn a model from the training data so that you can predict the label of new examples that you haven't seen before and don't know the label of.

For one example, suppose that you have data describing a bunch of buildings and earthquakes (E.g., year the building was constructed, type of material used, strength of earthquake,etc), and you know whether each building collapsed ("on") or not ("off") in each past earthquake. Using this data, you'd like to make predictions about whether a given building is going to collapse in a hypothetical future earthquake.

One of the first models that would be worth trying is logistic regression.

Coding it up
I wasn't working on this exact problem, but I was working on something close. Being one to practice what I preach, I started looking for a dead simple Python logistic regression class. The only requirement is that I wanted it to support L2 regularization (more on this later). I'm also sharing this code with a bunch of other people on many platforms, so I wanted as few dependencies on external libraries as possible.

I couldn't find exactly what I wanted, so I decided to take a stroll down memory lane and implement it myself. I've written it in C++ and Matlab before but never in Python.

I won't do the derivation, but there are plenty of good explanations out there to follow if you're not afraid of a little calculus. This one looks good, for example. The big idea is to write down the probability of the data given some setting of internal parameters, then to take the derivative, which will tell you how to change the internal parameters to make the data more likely. Got it? Good.

For those of you out there that know logistic regression inside and out, take a look at how short the train() method is. I really like how easy it is to do in Python.

Regularization
I caught a little indirect flak during March madness season for talking about how I regularized the latent vectors in my matrix-factorization model of team offensive and defensive strengths. See here and here.

But seriously, guys -- it's a good idea.

Let me drive home the point. Take a look at this. This is the output of running the code below:

Take a look at the top row.

On the left side, you have the training set. There are 25 examples laid out along the x axis, and the y axis tells you if the example is "on" (1) or "off" (0). For each of these examples, there's a vector describing its attributes that I'm not showing. After training the model, I ask the model to ignore the known training set labels and to estimate the probability that each label is "on" based only on the examples's description vectors and what the model has learned (hopefully things like stronger earthquakes and older buildings increase the likelihood of collapse). The probabilities are shown by the red X's. In the top left, the red X's are right on top of the blue dots, so it is very sure about the labels of the examples, and it's always correct.

Now on the right side, we have some new examples that the model hasn't seen before. This is called the test set. This is essentially the same as the left side, but the model knows nothing about the test set class labels (yellow dots). What you see is that it still does a decent job of predicting the labels, but there are some troubling cases where it is very confident and very wrong. This is known as overfitting.

This is where regularization comes in. As you go down the rows, there is stronger L2 regularization -- or equivalently, pressure on the internal parameters to be zero. This has the effect of reducing the model's certainty. Just because it can perfectly reconstruct the training set doesn't mean that it has everything figured out. You can imagine that if you were relying on this model to make important decisions, it would be desirable to have at least a bit of regularization in there.

And here's the code. It looks long, but most of it is to generate the data and plot the results. The bulk of the work is done in the train() method, which is only three (dense) lines. It requires numpy, scipy, and pylab.

* For full disclosure, I should admit that I generated my random data in a way such that it is susceptible to overfitting, possibly making logistic-regression-without-regularization look worse than it is.
```from scipy.optimize.optimize import fmin_cg, fmin_bfgs, fmin
import numpy as np

def sigmoid(x):
return 1.0 / (1.0 + np.exp(-x))

class SyntheticClassifierData():

def __init__(self, N, d):
""" Create N instances of d dimensional input vectors and a 1D
class label (-1 or 1). """

means = .05 * np.random.randn(2, d)

self.X_train = np.zeros((N, d))
self.Y_train = np.zeros(N)
for i in range(N):
if np.random.random() > .5:
y = 1
else:
y = 0
self.X_train[i, :] = np.random.random(d) + means[y, :]
self.Y_train[i] = 2.0 * y - 1

self.X_test = np.zeros((N, d))
self.Y_test = np.zeros(N)
for i in range(N):
if np.random.randn() > .5:
y = 1
else:
y = 0
self.X_test[i, :] = np.random.random(d) + means[y, :]
self.Y_test[i] = 2.0 * y - 1

class LogisticRegression():
""" A simple logistic regression model with L2 regularization (zero-mean
Gaussian priors on parameters). """

def __init__(self, x_train=None, y_train=None, x_test=None, y_test=None,
alpha=.1, synthetic=False):

# Set L2 regularization strength
self.alpha = alpha

# Set the data.
self.set_data(x_train, y_train, x_test, y_test)

# Initialize parameters to zero, for lack of a better choice.
self.betas = np.zeros(self.x_train.shape[1])

def negative_lik(self, betas):
return -1 * self.lik(betas)

def lik(self, betas):
""" Likelihood of the data under the current settings of parameters. """

# Data likelihood
l = 0
for i in range(self.n):
l += log(sigmoid(self.y_train[i] * \
np.dot(betas, self.x_train[i,:])))

# Prior likelihood
for k in range(1, self.x_train.shape[1]):
l -= (self.alpha / 2.0) * self.betas[k]**2

return l

def train(self):
""" Define the gradient and hand it off to a scipy gradient-based
optimizer. """

# Define the derivative of the likelihood with respect to beta_k.
# Need to multiply by -1 because we will be minimizing.
dB_k = lambda B, k : (k > 0) * self.alpha * B[k] - np.sum([ \
self.y_train[i] * self.x_train[i, k] * \
sigmoid(-self.y_train[i] *\
np.dot(B, self.x_train[i,:])) \
for i in range(self.n)])

# The full gradient is just an array of componentwise derivatives
dB = lambda B : np.array([dB_k(B, k) \
for k in range(self.x_train.shape[1])])

# Optimize
self.betas = fmin_bfgs(self.negative_lik, self.betas, fprime=dB)

def set_data(self, x_train, y_train, x_test, y_test):
""" Take data that's already been generated. """

self.x_train = x_train
self.y_train = y_train
self.x_test = x_test
self.y_test = y_test
self.n = y_train.shape[0]

def training_reconstruction(self):
p_y1 = np.zeros(self.n)
for i in range(self.n):
p_y1[i] = sigmoid(np.dot(self.betas, self.x_train[i,:]))

return p_y1

def test_predictions(self):
p_y1 = np.zeros(self.n)
for i in range(self.n):
p_y1[i] = sigmoid(np.dot(self.betas, self.x_test[i,:]))

return p_y1

def plot_training_reconstruction(self):
plot(np.arange(self.n), .5 + .5 * self.y_train, 'bo')
plot(np.arange(self.n), self.training_reconstruction(), 'rx')
ylim([-.1, 1.1])

def plot_test_predictions(self):
plot(np.arange(self.n), .5 + .5 * self.y_test, 'yo')
plot(np.arange(self.n), self.test_predictions(), 'rx')
ylim([-.1, 1.1])

if __name__ == "__main__":
from pylab import *

# Create 20 dimensional data set with 25 points -- this will be
# susceptible to overfitting.
data = SyntheticClassifierData(25, 20)

# Run for a variety of regularization strengths
alphas = [0, .001, .01, .1]
for j, a in enumerate(alphas):

# Create a new learner, but use the same data for each run
lr = LogisticRegression(x_train=data.X_train, y_train=data.Y_train,
x_test=data.X_test, y_test=data.Y_test,
alpha=a)

print "Initial likelihood:"
print lr.lik(lr.betas)

# Train the model
lr.train()

# Display execution info
print "Final betas:"
print lr.betas
print "Final lik:"
print lr.lik(lr.betas)

# Plot the results
subplot(len(alphas), 2, 2*j + 1)
lr.plot_training_reconstruction()
ylabel("Alpha=%s" % a)
if j == 0:
title("Training set reconstructions")

subplot(len(alphas), 2, 2*j + 2)
lr.plot_test_predictions()
if j == 0:
title("Test set predictions")

show()
```

George said...

Not bad. I have my own simple python 2-class logistic regression implementation (supports L1 and L2 regularization) that only depends on numpy (I try to avoid dependencies on the rest of scipy and matplotlib which means the calling code has to plot things).

A few differences. I a slightly tweaked version of Roland's python port of Carl Rasmussen's minimize.m instead of the scipy minimizer. This avoids dependency on scipy and uses a conjugate gradient minimizer that I feel is quite competitive with any of the scipy minimizers.

I also include code to check my gradient by numerically differentiating the cost function. This is just for debugging however.

Also, my code doesn't use classes, which makes I believe makes it cleaner in this case. I have a single predict function

def predict(weights, data):
return sigmoid( num.dot(data, weights) )

I also have a simple, but convenient, command line interface implemented with python's getopt module.

So I guess I like my version a lot more. :P But I would, since it is exactly the way I like it.

Danny Tarlow said...

Good point about the single predict function -- I'll probably change mine to work the same way. I also haven't used the getopt module, but that sounds useful. Thanks for the pointer.

As for the classes, though, I'm pretty deep in the make-everything-a-class boat, so we'll have to agree to disagree on that =).

George said...

I guess the use of the class keyword isn't really what bothers me, it is that the class has the training and testing data as instance data. Using class or not is a mostly irrelevant implementation detail. What happens if a user of your class allows the data instance variables to be their defaults of None and then tries to call a function? I expect an ungraceful crash. I am not suggesting add more code to do error handling, but instead that this shows a (slight) design flaw. If you had the data always get passed in as an argument to a function that needed it you could more reasonably assume that it would never be None and the precondition of the function would be obvious. This is what my code does.

Maybe I should post mine somewhere since I am picking at every little thing I disagree with in yours!

Danny Tarlow said...

So first off, I'm far from being a great software architect, and I really haven't thought about object oriented design since second year of undergrad, so I don't have super strong feelings about this.

But I'll still bite. I probably buy the argument with respect to the test data -- it may be cleaner not to store it as an instance variable, since it's not really a "part" of the classifier.

I think I disagree with respect to the training data. The classifier is essentially just a view of the training data. Keeping them bundled together in a class makes that clear, and it also seems more practical to me in cases where (for example) you might want to add to or modify the data then update the parameters to incorporate the change. Sure you could keep track of all the data outside the class, but it seems cleaner to have the synchronization between classifier and data to be automatic (I'm envisioning a modify_data() method that automatically retrains the classifier after you make changes to the data).

And if you'd like me to post your code here, I'm happy to do so, of course giving you all the proper attribution.

alex_land said...

didn't i see this exact example on ESPN just before MM?

Danny Tarlow said...

Alex,

Do you mean my other post on march madness predictions? It was fairly popular on Digg and a few small news outlets, but it would be news to me if it made it to ESPN.

Pussinboots said...

A few tips:

* The basic approach of using classes is a good one---avoids you needing to pass arguments via the optimizer to objective/gradient calculations.

* Test data, the parameter vector and the optimization method should not be part of the class or instance. You should be able to interchange these w/o changing your model.

* Your class should define (1) objective (as negative log-likelihood for probabilistic models like this one), (2) gradient, and (3) prediction methods. Objective and gradient methods should take the parameter vector (betas in your case). Prediction should take the parameter vector and "x" test data. A separate routine should calculate error against "y" test as there may be different ways you want to calculate error (e.g. 0-1 vs. absolute difference vs. least squares).

* For numerical stability: propagage the log into the sigmoid; consider using numpy.log1p

* The linesearch that scipy uses is buggy and limited. You'll see this if you use high-dimensional data and/or a numerically difficult objective (where 1e-8 step sizes aren't sufficient). Like George, I've found Roland's port of CR's minimize to be solid.

Danny Tarlow said...

Thanks, Jason. I appreciate the (very good) tips.

When I get some spare time, maybe I'll walk down the hall to George's office and hammer out a really clean version of this.

CVK said...

A wonderful Post! Thanks a lot!
Also thanks to Jason for his tips.

DF said...

What is the license on this code? Do you allow commercial use?

V1P3R said...

A wonderful Post

V1P3R said...

A wonderful Post

willrc said...

This is useful, thanks.

Currently this works for a single dependent variable/y. Do you think it could be extended to multiple y e.g. for categorization purposes?

Steven said...

This is a silly question, but in your dB_k function, everything is multiplied by (k>0). Since the bias term (I think) corresponds to the k=0 term of the beta vector, it looks like you're keeping the bias term fixed at 0, since (k>0) = 0 when k = 0.