Wednesday, June 24, 2009

Really Bayesian logistic regression in Python

Posted by Danny Tarlow
The typical derivation for logistic regression uses a maximum likelihood objective: vary model parameters so that the probability of the data given the model is greatest.

In practice, we need to be careful using maximum likelihood because of its tendency towards overfitting. One of the easiest ways to address the problem is to add some regularization to the objective -- telling the model that you prefer weights that are small, in turn encouraging the model to be simple. By simple, I mean less likely to pick up on spurious correlations in the data.

I don't think this was the original motivation for regularization (maybe a reader can help me if I'm wrong), but somebody eventually noticed that you could frame standard L1 or L2 regularization as equivalent to placing independent, zero-mean Laplace or Gaussian priors, respectively, on the model parameters. Then, optimizing an objective that includes both the prior and the maximum likelihood term means we're finding the MAP (maximum a posteriori) setting of parameters. Not only does our model work a little better, but we can also now get away with saying that we're being Bayesian. I'll let you decide which benefit is more important =P

Still, there's something a little unsatisfying to the whole thing. We still don't have a very good idea about how sensitive our model is to changes in the model parameters, and we only have one set of weights to make predictions with -- which will probably give us overconfident predictions.

Instead, I wanted a model that was really Bayesian. By really Bayesian, I mean that I want a distribution over model parameters as my answer. Somebody could easily complain that (in this implementation) I'm not maintaining a distribution over my hyperparameters, so in that sense I'm not being really Bayesian, and that would be a nitpicky but valid complaint.

Anyhow, I took my logistic regression code as a starting point, and I modified it to be really Bayesian.

I started by deciding I wanted to go the Markov Chain Monte Carlo route, and I am fine alternating through each parameter and resampling from its conditional distribution while holding all other parameters fixed. However, I didn't want a Gibbs sampler because the normalization constants in the conditional distributions looked intimidating. By my math, the Gaussian case looks something like the following:
P(b_k | b_{-k}) \propto exp[ \sum_k log N(b_k | 0, \alpha) + \sum_i log(g(y_i * (b_{-k} x_{i-k} + b_k x_{ik})))]

where I use the notation b_{-k} to mean all b except for b_k, and alpha is the precision of the prior (and equivalently the strength of the regularization).

I got about this far, didn't see any easy way to move the likelihood terms into a big Gaussian, then I thought about computing the integral over b_k of that distribution. Soon after, I decided I didn't want to deal with computing the normalization constant.

The next step for me was to turn to slice sampling. Wikipedia has a short post on it, but I think Radford's paper is better motivated and easier to understand. Regardless, the big idea is as follows:
The method is based on the observation that to sample a random variable one can sample uniformly from the region under the graph of its density function.
The nice part about slice sampling is that we never need to evaluate the normalization constant. All we need is the ability to evaluate the likelihood function under any given setting of weights, and to find intervals of the range of the density function that lie below some value. If you haven't seen slice sampling before, I recommend reading the paper for the details. There's nothing I can say that will explain it better than Radford. Skip straight to Section 3 if you just want to get right into it.

Anyhow, keeping with my "poor man's" implementation where I never do any hard math, I kept the likelihood the same as in the original logistic regression, just adding one likelihood function that keeps all but one parameter fixed. To solve for intervals, I use scipy's root-finding optimizers. The objective is unimodal, so there aren't any tricky cases where I have to find discontinuous regions. This lets me call the root finder once, decide if I want to look left or right for the other root, then call the root finder again.

That's about it. Since I already have code to find the MAP solution, I initialize weights by solving the regularized logistic regression problem, then I leave it up to the slice sampler to explore the surrounding region. I go in every 10 times through the resampling and pull a sample set of new weights.

To see what this looks like, I generated a one dimensional data set then built a logistic regression model where the first dimension of the X's is always 1, and the second dimension is the data. So the weights that the model learns are essentially like intercept and slope terms.

I set the sampler running for 1000 * 10 iterations, and added every 10th set of weights to a histogram. Here's what it comes up with. The intercept weight is on the x axis, and the slope weight on the y axis:
In addition, I plotted the MAP setting of weights with a red X. You have to look fairly hard to see it, but it's where it should be, in the highest density region of samples.

I haven't debugged this code as carefully, but I'll still post it here. Leave a note if you use it so I can let you know if any bugs come up that I fix. I may also put up some more detailed experiments and visualizations at some point.
from scipy.optimize.optimize import fmin_cg, fmin_bfgs, fmin, check_grad
from scipy.optimize import fsolve, bisect
import numpy as np

def sigmoid(x):
return 1.0 / (1.0 + np.exp(-x))

class SyntheticClassifierData():

def __init__(self, N, d):
""" Create N instances of d dimensional input vectors and a 1D
class label (-1 or 1). """

means = .25 * np.array([[-1] * (d-1), [1] * (d-1)])

self.X_train = np.zeros((N, d))
self.Y_train = np.zeros(N)
for i in range(N):
if np.random.random() > .5:
y = 1
else:
y = 0
self.X_train[i, 0] = 1
self.X_train[i, 1:] = np.random.random(d-1) + means[y, :]
self.Y_train[i] = 2.0 * y - 1

self.X_test = np.zeros((N, d))
self.Y_test = np.zeros(N)
for i in range(N):
if np.random.randn() > .5:
y = 1
else:
y = 0
self.X_test[i, 0] = 1
self.X_test[i, 1:] = np.random.random(d-1) + means[y, :]
self.Y_test[i] = 2.0 * y - 1

class LogisticRegression():
""" A simple logistic regression model with L2 regularization (zero-mean
Gaussian priors on parameters). """

def __init__(self, x_train=None, y_train=None, x_test=None, y_test=None,
alpha=.1, synthetic=False):

# Set L2 regularization strength
self.alpha = alpha

# Set the data.
self.set_data(x_train, y_train, x_test, y_test)

# Initialization only matters if you don't call train().
self.all_betas = []
self.betas = np.random.randn(self.x_train.shape[1])

def negative_lik(self, betas):
return -1 * self.lik(betas)

def lik(self, betas):
""" Likelihood of the data under the current settings of parameters. """

# Data likelihood
l = 0
for i in range(self.n):
l += log(sigmoid(self.y_train[i] * \
np.dot(betas, self.x_train[i,:])))

# Prior likelihood
for k in range(0, self.x_train.shape[1]):
l -= (self.alpha / 2.0) * self.betas[k]**2

return l

def lik_k(self, beta_k, k):
""" The likelihood only in terms of beta_k. """

new_betas = self.betas.copy()
new_betas[k] = beta_k

return self.lik(new_betas)

def train(self):
""" Define the gradient and hand it off to a scipy gradient-based
optimizer. """

# Define the derivative of the likelihood with respect to beta_k.
# Need to multiply by -1 because we will be minimizing.
dB_k = lambda B, k : (k > -1) * self.alpha * B[k] - np.sum([ \
self.y_train[i] * self.x_train[i, k] * \
sigmoid(-self.y_train[i] *\
np.dot(B, self.x_train[i,:])) \
for i in range(self.n)])

# The full gradient is just an array of componentwise derivatives
dB = lambda B : np.array([dB_k(B, k) \
for k in range(self.x_train.shape[1])])

# Optimize
self.betas = fmin_bfgs(self.negative_lik, self.betas, fprime=dB)

def resample(self):
""" Use slice sampling to pull a new draw for logistic regression
parameters from the posterior distribution on beta. """

failures = 0
for i in range(10):
try:
new_betas = np.zeros(self.betas.shape[0])
order = range(self.betas.shape[0])
order.reverse()
for k in order:
new_betas[k] = self.resample_beta_k(k)
except:
failures += 1
continue

for k in range(self.betas.shape[0]):
self.betas[k] = new_betas[k]

print self.betas,
print self.lik(self.betas)

self.all_betas.append(self.betas.copy())

if failures > 0:
"Warning: %s root-finding failures" % (failures)

def resample_beta_k(self, k):
""" Resample beta_k conditional upon all other settings of beta.
This can be used in the inner loop of a Gibbs sampler to get a
full posterior over betas.

Uses slice sampling (Neal, 2001). """

#print "Resampling %s" % k

# Sample uniformly in (0, f(x0)), but do it in the log domain
lik = lambda b_k : self.lik_k(b_k, k)
x0 = self.betas[k]
g_x0 = lik(x0)
e = np.random.exponential()
z = g_x0 - e

# Find the slice of x where z < g(x0) (or where y < f(x0))
#print "y=%s" % exp(z)
lik_minus_z = lambda b_k : (self.lik_k(b_k, k) - z)

# Find the zeros of lik_minus_k to give the interval defining the slice
r0 = fsolve(lik_minus_z, x0)

# Figure out which direction the other root is in
eps = .001
look_right = False
if lik_minus_z(r0 + eps) > 0:
look_right = True

if look_right:
r1 = bisect(lik_minus_z, r0 + eps, 1000)
else:
r1 = bisect(lik_minus_z, -1000, r0 - eps)

L = min(r0, r1)
R = max(r0, r1)
x = (R - L) * np.random.random() + L

#print "S in (%s, %s) -->" % (L, R),
#print "%s" % x
return x

def set_data(self, x_train, y_train, x_test, y_test):
""" Take data that's already been generated. """

self.x_train = x_train
self.y_train = y_train
self.x_test = x_test
self.y_test = y_test
self.n = y_train.shape[0]

def training_reconstruction(self):
p_y1 = np.zeros(self.n)
for i in range(self.n):
p_y1[i] = sigmoid(np.dot(self.betas, self.x_train[i,:]))

return p_y1

def test_predictions(self):
p_y1 = np.zeros(self.n)
for i in range(self.n):
p_y1[i] = sigmoid(np.dot(self.betas, self.x_test[i,:]))

return p_y1

def plot_training_reconstruction(self):
plot(np.arange(self.n), .5 + .5 * self.y_train, 'bo')
plot(np.arange(self.n), self.training_reconstruction(), 'rx')
ylim([-.1, 1.1])

def plot_test_predictions(self):
plot(np.arange(self.n), .5 + .5 * self.y_test, 'yo')
plot(np.arange(self.n), self.test_predictions(), 'rx')
ylim([-.1, 1.1])

if __name__ == "__main__":
from pylab import *

data = SyntheticClassifierData(50, 2)

alphas = [.1]
for j, a in enumerate(alphas):
figure()

# Create a new learner, but use the same data for each run
lr = LogisticRegression(x_train=data.X_train, y_train=data.Y_train,
x_test=data.X_test, y_test=data.Y_test,
alpha=a)

print "Initial likelihood:"
print lr.lik(lr.betas)

# Train the model
lr.train()
map_betas = lr.betas.copy()

subplot(10, 2, 1)
lr.plot_training_reconstruction()
ylabel("Alpha=%s" % a)
if j == 0:
title("Training set reconstructions")
axis('off')

subplot(10, 2, 2)
lr.plot_test_predictions()
if j == 0:
title("Test set predictions")
axis('off')

# Display execution info
print "Final betas:"
print lr.betas
print "Final lik:"
print lr.lik(lr.betas)

for l in range(1, 10000):
lr.resample()

# Plot the results
if l % 1000 == 0:
subplot(10, 2, 2*(l/1000) + 1)
lr.plot_training_reconstruction()
axis('off')
subplot(10, 2, 2*(l/1000) + 2)
lr.plot_test_predictions()
axis('off')

figure()
all_betas = np.array(lr.all_betas)

hexbin(all_betas[:,0], all_betas[:,1], bins='log')
plot([map_betas[0]], [map_betas[1]], 'rx', markersize=10)
show()

fvgsa9he5y