Binary Classification 101: Round 2
Posted by Danny Tarlow

I wrote last time about how easy it is to build a logistic regression classifier in Python using numpy and scipy.
Maintaining the "Machine Learning 101" theme, here's another dead-simple classifier. Remember, we're still talking about binary classification problems: you have some examples that are "on" and other examples that are "off." As input you get a training set containing examples of each class, along with a label saying whether each example is "on" or "off." The goal is to learn a model from the training data so that you can predict the labels of new examples you haven't seen before.
For example, suppose you have data describing a bunch of buildings and earthquakes (e.g., the year each building was constructed, the type of material used, the strength of the earthquake, etc.), and you know whether each building collapsed ("on") or not ("off") in each past earthquake. Using this data, you'd like to predict whether a given building is going to collapse in a hypothetical future earthquake.
Classification with Nearest Neighbors
This time we're going to build a nearest neighbors classifier. The idea is even simpler than logistic regression: when you want to predict the label of a point, look at the most similar points in the training set and report their average label as your prediction. For example, suppose we have four examples in a 2D training set, and suppose we choose k=3, which means our predictions use the three nearest neighbors. If the three points nearest our query turn out to be [0, 0] (label 0), [0, 5] (label 1), and [10, 0] (label 0), then a reasonable probability estimate would be a 2/3 chance of the label being 0 and a 1/3 chance of the label being 1. If we just want to make a hard prediction of "off" (0) or "on" (1), we'd probably guess "off."
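To make that concrete, here's a minimal sketch of the same computation using scipy's KDTree (the same class the full classifier below is built on). Only three of the four training points are pinned down by the example above, so the fourth point and the query point here are made up for illustration:

from scipy.spatial import KDTree
import numpy as np

# Three training points come from the example above; the fourth point
# and the query are hypothetical, chosen so the fourth is not among
# the three nearest neighbors.
x_train = np.array([[0, 0], [0, 5], [10, 0], [20, 20]])
y_train = np.array([-1, 1, -1, 1])  # -1 = "off", 1 = "on"

tree = KDTree(x_train)
query = np.array([1, 1])  # hypothetical new point to classify

# Find the indices of the k=3 nearest neighbors
distances, neighbors = tree.query(query, k=3)

# Map labels from {-1, 1} to {0, 1} and average them to estimate
# the probability that the query is "on"
p_on = np.mean(.5 + .5 * y_train[neighbors])
print(p_on)  # 0.333..., so we'd predict "off"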
If you look back at the logistic regression post, you'll notice that there's nothing preventing us from running the same set of experiments as last time using a nearest neighbors classifier instead of the logistic regression. It's just training set in, classifier out. Once you have your classifier, you can ask it to make predictions about new points.
The one change is that instead of having a regularization parameter, we have a k parameter, which tells us how many neighbors to look at when making a prediction. Here's what happens when you run it for a few different settings of k:
It's kind of interesting, huh? Setting k to 1 causes some overfitting that looks quite similar to unregularized logistic regression. Choosing k to be larger has a somewhat similar effect to increasing the regularization strength.
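By the way, if you'd rather pick k automatically than eyeball plots like these, one simple option is to score a few candidate values on held-out data and keep the winner. This is just a sketch, not part of the original code: choose_k is a hypothetical helper, and it assumes the NearestNeighborBinaryClassifier from the listing below, with -1/1 labels and predict() returning the probability of "on":

import numpy as np

def choose_k(nn, x_valid, y_valid, candidates=(1, 3, 5, 10)):
    """Hypothetical helper: pick k by accuracy on a held-out set."""
    best_k, best_acc = None, -1.0
    for k in candidates:
        # predict() returns P(label = 1) for each validation point
        probs = nn.predict(x_valid, k=k)

        # Threshold at .5 to get hard -1/1 predictions
        hard_predictions = np.where(probs > .5, 1, -1)

        accuracy = np.mean(hard_predictions == y_valid)
        if accuracy > best_acc:
            best_k, best_acc = k, accuracy

    return best_k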
And here's the code that implements the classifier and produces the figure. For those of you who thought this was all super obvious, take a look at how easy scipy's KDTree class makes this. I almost feel bad taking any credit at all for writing this code.
Also, if you're George, hopefully you'll be happy to see that I took some of your design suggestions into account when deciding how to structure the interface.
from scipy.spatial import KDTree
import numpy as np

from synthetic_classifier_data import *


class NearestNeighborBinaryClassifier():

    def __init__(self, x_train, y_train):
        self.set_data(x_train, y_train)

    def set_data(self, x_train, y_train):
        """y_train entries should be either -1 or 1."""
        self.x_train = x_train
        self.y_train = y_train
        self.n = y_train.shape[0]

        self.kd_tree = KDTree(self.x_train)

    def predict(self, X, k=3):
        predictions = np.zeros(X.shape[0])
        for i in range(X.shape[0]):
            # Let the kd-tree do all the real work
            d_i, n_i = self.kd_tree.query(X[i, :], k=k)

            # Map the neighbors' labels from {-1, 1} to {0, 1} and
            # average them to get P(label = 1)
            predictions[i] = np.sum(.5 + .5 * self.y_train[n_i]) / float(k)

        return predictions


if __name__ == "__main__":
    from pylab import *

    # Create 20 dimensional data set with 25 points
    data = SyntheticClassifierData(25, 20)

    nn = NearestNeighborBinaryClassifier(data.X_train, data.Y_train)

    # Run for a variety of settings of k
    ks = [1, 3, 5, 10]
    for j, k in enumerate(ks):
        # Predict the label of each (training and test) y given its
        # k nearest neighbors
        hat_y_train = nn.predict(data.X_train, k=k)
        hat_y_test = nn.predict(data.X_test, k=k)

        # Plot the results
        subplot(len(ks), 2, 2 * j + 1)
        plot(np.arange(data.X_train.shape[0]), .5 + .5 * data.Y_train, 'bo')
        plot(np.arange(data.X_train.shape[0]), hat_y_train, 'rx')
        ylim([-.1, 1.1])
        ylabel("K=%s" % k)
        if j == 0:
            title("Training set reconstructions")

        subplot(len(ks), 2, 2 * j + 2)
        plot(np.arange(data.X_test.shape[0]), .5 + .5 * data.Y_test, 'yo')
        plot(np.arange(data.X_test.shape[0]), hat_y_test, 'rx')
        ylim([-.1, 1.1])
        if j == 0:
            title("Test set predictions")

    show()