This notebook contains an excerpt from the Python Data Science Handbook by Jake VanderPlas; the content is available on GitHub.
The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
In-Depth: Support Vector Machines¶
Support vector machines (SVMs) are a particularly powerful and flexible class of supervised algorithms for both classification and regression. In this section, we will develop the intuition behind support vector machines and their use in classification problems.
We begin with the standard imports:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
import seaborn as sns; sns.set()
Motivating Support Vector Machines¶
As part of our disussion of Bayesian classification (see In Depth: Naive Bayes Classification), we learned a simple model describing the distribution of each underlying class, and used these generative models to probabilistically determine labels for new points. That was an example of generative classification; here we will consider instead discriminative classification: rather than modeling each class, we simply find a line or curve (in two dimensions) or manifold (in multiple dimensions) that divides the classes from each other.
As an example of this, consider the simple case of a classification task, in which the two classes of points are well separated:
from sklearn.datasets.samples_generator import make_blobs
X, y = make_blobs(n_samples=50, centers=2,
random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');
--------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) Cell In[2], line 1 ----> 1 from sklearn.datasets.samples_generator import make_blobs 2 X, y = make_blobs(n_samples=50, centers=2, 3 random_state=0, cluster_std=0.60) 4 plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn'); ModuleNotFoundError: No module named 'sklearn.datasets.samples_generator'
A linear discriminative classifier would attempt to draw a straight line separating the two sets of data, and thereby create a model for classification. For two dimensional data like that shown here, this is a task we could do by hand. But immediately we see a problem: there is more than one possible dividing line that can perfectly discriminate between the two classes!
We can draw them as follows:
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plt.plot([0.6], [2.1], 'x', color='red', markeredgewidth=2, markersize=10)
for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
plt.plot(xfit, m * xfit + b, '-k')
plt.xlim(-1, 3.5);
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[3], line 2 1 xfit = np.linspace(-1, 3.5) ----> 2 plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn') 3 plt.plot([0.6], [2.1], 'x', color='red', markeredgewidth=2, markersize=10) 5 for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]: NameError: name 'X' is not defined
These are three very different separators which, nevertheless, perfectly discriminate between these samples. Depending on which you choose, a new data point (e.g., the one marked by the "X" in this plot) will be assigned a different label! Evidently our simple intuition of "drawing a line between classes" is not enough, and we need to think a bit deeper.
Support Vector Machines: Maximizing the Margin¶
Support vector machines offer one way to improve on this. The intuition is this: rather than simply drawing a zero-width line between the classes, we can draw around each line a margin of some width, up to the nearest point. Here is an example of how this might look:
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
yfit = m * xfit + b
plt.plot(xfit, yfit, '-k')
plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[4], line 2 1 xfit = np.linspace(-1, 3.5) ----> 2 plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn') 4 for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]: 5 yfit = m * xfit + b NameError: name 'X' is not defined
In support vector machines, the line that maximizes this margin is the one we will choose as the optimal model. Support vector machines are an example of such a maximum margin estimator.
Fitting a support vector machine¶
Let's see the result of an actual fit to this data: we will use Scikit-Learn's support vector classifier to train an SVM model on this data. For the time being, we will use a linear kernel and set the C
parameter to a very large number (we'll discuss the meaning of these in more depth momentarily).
from sklearn.svm import SVC # "Support vector classifier"
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[5], line 3 1 from sklearn.svm import SVC # "Support vector classifier" 2 model = SVC(kernel='linear', C=1E10) ----> 3 model.fit(X, y) NameError: name 'X' is not defined
To better visualize what's happening here, let's create a quick convenience function that will plot SVM decision boundaries for us:
def plot_svc_decision_function(model, ax=None, plot_support=True):
"""Plot the decision function for a 2D SVC"""
if ax is None:
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
# create grid to evaluate model
x = np.linspace(xlim[0], xlim[1], 30)
y = np.linspace(ylim[0], ylim[1], 30)
Y, X = np.meshgrid(y, x)
xy = np.vstack([X.ravel(), Y.ravel()]).T
P = model.decision_function(xy).reshape(X.shape)
# plot decision boundary and margins
ax.contour(X, Y, P, colors='k',
levels=[-1, 0, 1], alpha=0.5,
linestyles=['--', '-', '--'])
# plot support vectors
if plot_support:
ax.scatter(model.support_vectors_[:, 0],
model.support_vectors_[:, 1],
s=300, linewidth=1, facecolors='none');
ax.set_xlim(xlim)
ax.set_ylim(ylim)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(model);
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[7], line 1 ----> 1 plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn') 2 plot_svc_decision_function(model); NameError: name 'X' is not defined
This is the dividing line that maximizes the margin between the two sets of points. Notice that a few of the training points just touch the margin: they are indicated by the black circles in this figure. These points are the pivotal elements of this fit, and are known as the support vectors, and give the algorithm its name. In Scikit-Learn, the identity of these points are stored in the support_vectors_
attribute of the classifier:
model.support_vectors_
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Cell In[8], line 1 ----> 1 model.support_vectors_ AttributeError: 'SVC' object has no attribute 'support_vectors_'
A key to this classifier's success is that for the fit, only the position of the support vectors matter; any points further from the margin which are on the correct side do not modify the fit! Technically, this is because these points do not contribute to the loss function used to fit the model, so their position and number do not matter so long as they do not cross the margin.
We can see this, for example, if we plot the model learned from the first 60 points and first 120 points of this dataset:
def plot_svm(N=10, ax=None):
X, y = make_blobs(n_samples=200, centers=2,
random_state=0, cluster_std=0.60)
X = X[:N]
y = y[:N]
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)
ax = ax or plt.gca()
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
ax.set_xlim(-1, 4)
ax.set_ylim(-1, 6)
plot_svc_decision_function(model, ax)
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
for axi, N in zip(ax, [60, 120]):
plot_svm(N, axi)
axi.set_title('N = {0}'.format(N))
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[9], line 18 16 fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1) 17 for axi, N in zip(ax, [60, 120]): ---> 18 plot_svm(N, axi) 19 axi.set_title('N = {0}'.format(N)) Cell In[9], line 2, in plot_svm(N, ax) 1 def plot_svm(N=10, ax=None): ----> 2 X, y = make_blobs(n_samples=200, centers=2, 3 random_state=0, cluster_std=0.60) 4 X = X[:N] 5 y = y[:N] NameError: name 'make_blobs' is not defined
In the left panel, we see the model and the support vectors for 60 training points. In the right panel, we have doubled the number of training points, but the model has not changed: the three support vectors from the left panel are still the support vectors from the right panel. This insensitivity to the exact behavior of distant points is one of the strengths of the SVM model.
If you are running this notebook live, you can use IPython's interactive widgets to view this feature of the SVM model interactively:
from ipywidgets import interact, fixed
interact(plot_svm, N=[10, 200], ax=fixed(None));
Beyond linear boundaries: Kernel SVM¶
Where SVM becomes extremely powerful is when it is combined with kernels. We have seen a version of kernels before, in the basis function regressions of In Depth: Linear Regression. There we projected our data into higher-dimensional space defined by polynomials and Gaussian basis functions, and thereby were able to fit for nonlinear relationships with a linear classifier.
In SVM models, we can use a version of the same idea. To motivate the need for kernels, let's look at some data that is not linearly separable:
from sklearn.datasets.samples_generator import make_circles
X, y = make_circles(100, factor=.1, noise=.1)
clf = SVC(kernel='linear').fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf, plot_support=False);
--------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) Cell In[11], line 1 ----> 1 from sklearn.datasets.samples_generator import make_circles 2 X, y = make_circles(100, factor=.1, noise=.1) 4 clf = SVC(kernel='linear').fit(X, y) ModuleNotFoundError: No module named 'sklearn.datasets.samples_generator'
It is clear that no linear discrimination will ever be able to separate this data. But we can draw a lesson from the basis function regressions in In Depth: Linear Regression, and think about how we might project the data into a higher dimension such that a linear separator would be sufficient. For example, one simple projection we could use would be to compute a radial basis function centered on the middle clump:
r = np.exp(-(X ** 2).sum(1))
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[12], line 1 ----> 1 r = np.exp(-(X ** 2).sum(1)) NameError: name 'X' is not defined
We can visualize this extra data dimension using a three-dimensional plot—if you are running this notebook live, you will be able to use the sliders to rotate the plot:
from mpl_toolkits import mplot3d
def plot_3D(elev=30, azim=30, X=X, y=y):
ax = plt.subplot(projection='3d')
ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap='autumn')
ax.view_init(elev=elev, azim=azim)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('r')
interact(plot_3D, elev=[-90, 90], azip=(-180, 180),
X=fixed(X), y=fixed(y));
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[13], line 3 1 from mpl_toolkits import mplot3d ----> 3 def plot_3D(elev=30, azim=30, X=X, y=y): 4 ax = plt.subplot(projection='3d') 5 ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap='autumn') NameError: name 'X' is not defined
We can see that with this additional dimension, the data becomes trivially linearly separable, by drawing a separating plane at, say, r=0.7.
Here we had to choose and carefully tune our projection: if we had not centered our radial basis function in the right location, we would not have seen such clean, linearly separable results. In general, the need to make such a choice is a problem: we would like to somehow automatically find the best basis functions to use.
One strategy to this end is to compute a basis function centered at every point in the dataset, and let the SVM algorithm sift through the results. This type of basis function transformation is known as a kernel transformation, as it is based on a similarity relationship (or kernel) between each pair of points.
A potential problem with this strategy—projecting $N$ points into $N$ dimensions—is that it might become very computationally intensive as $N$ grows large. However, because of a neat little procedure known as the kernel trick, a fit on kernel-transformed data can be done implicitly—that is, without ever building the full $N$-dimensional representation of the kernel projection! This kernel trick is built into the SVM, and is one of the reasons the method is so powerful.
In Scikit-Learn, we can apply kernelized SVM simply by changing our linear kernel to an RBF (radial basis function) kernel, using the kernel
model hyperparameter:
clf = SVC(kernel='rbf', C=1E6)
clf.fit(X, y)
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[14], line 2 1 clf = SVC(kernel='rbf', C=1E6) ----> 2 clf.fit(X, y) NameError: name 'X' is not defined
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
s=300, lw=1, facecolors='none');
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[15], line 1 ----> 1 plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn') 2 plot_svc_decision_function(clf) 3 plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], 4 s=300, lw=1, facecolors='none'); NameError: name 'X' is not defined
Using this kernelized support vector machine, we learn a suitable nonlinear decision boundary. This kernel transformation strategy is used often in machine learning to turn fast linear methods into fast nonlinear methods, especially for models in which the kernel trick can be used.
Tuning the SVM: Softening Margins¶
Our discussion thus far has centered around very clean datasets, in which a perfect decision boundary exists. But what if your data has some amount of overlap? For example, you may have data like this:
X, y = make_blobs(n_samples=100, centers=2,
random_state=0, cluster_std=1.2)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[16], line 1 ----> 1 X, y = make_blobs(n_samples=100, centers=2, 2 random_state=0, cluster_std=1.2) 3 plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn'); NameError: name 'make_blobs' is not defined
To handle this case, the SVM implementation has a bit of a fudge-factor which "softens" the margin: that is, it allows some of the points to creep into the margin if that allows a better fit. The hardness of the margin is controlled by a tuning parameter, most often known as $C$. For very large $C$, the margin is hard, and points cannot lie in it. For smaller $C$, the margin is softer, and can grow to encompass some points.
The plot shown below gives a visual picture of how a changing $C$ parameter affects the final fit, via the softening of the margin:
X, y = make_blobs(n_samples=100, centers=2,
random_state=0, cluster_std=0.8)
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
for axi, C in zip(ax, [10.0, 0.1]):
model = SVC(kernel='linear', C=C).fit(X, y)
axi.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(model, axi)
axi.scatter(model.support_vectors_[:, 0],
model.support_vectors_[:, 1],
s=300, lw=1, facecolors='none');
axi.set_title('C = {0:.1f}'.format(C), size=14)
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[17], line 1 ----> 1 X, y = make_blobs(n_samples=100, centers=2, 2 random_state=0, cluster_std=0.8) 4 fig, ax = plt.subplots(1, 2, figsize=(16, 6)) 5 fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1) NameError: name 'make_blobs' is not defined
The optimal value of the $C$ parameter will depend on your dataset, and should be tuned using cross-validation or a similar procedure (refer back to Hyperparameters and Model Validation).
Example: Face Recognition¶
As an example of support vector machines in action, let's take a look at the facial recognition problem. We will use the Labeled Faces in the Wild dataset, which consists of several thousand collated photos of various public figures. A fetcher for the dataset is built into Scikit-Learn:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)
--------------------------------------------------------------------------- OSError Traceback (most recent call last) File /opt/conda/lib/python3.10/urllib/request.py:1348, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args) 1347 try: -> 1348 h.request(req.get_method(), req.selector, req.data, headers, 1349 encode_chunked=req.has_header('Transfer-encoding')) 1350 except OSError as err: # timeout error File /opt/conda/lib/python3.10/http/client.py:1282, in HTTPConnection.request(self, method, url, body, headers, encode_chunked) 1281 """Send a complete request to the server.""" -> 1282 self._send_request(method, url, body, headers, encode_chunked) File /opt/conda/lib/python3.10/http/client.py:1328, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked) 1327 body = _encode(body, 'body') -> 1328 self.endheaders(body, encode_chunked=encode_chunked) File /opt/conda/lib/python3.10/http/client.py:1277, in HTTPConnection.endheaders(self, message_body, encode_chunked) 1276 raise CannotSendHeader() -> 1277 self._send_output(message_body, encode_chunked=encode_chunked) File /opt/conda/lib/python3.10/http/client.py:1037, in HTTPConnection._send_output(self, message_body, encode_chunked) 1036 del self._buffer[:] -> 1037 self.send(msg) 1039 if message_body is not None: 1040 1041 # create a consistent interface to message_body File /opt/conda/lib/python3.10/http/client.py:975, in HTTPConnection.send(self, data) 974 if self.auto_open: --> 975 self.connect() 976 else: File /opt/conda/lib/python3.10/http/client.py:1447, in HTTPSConnection.connect(self) 1445 "Connect to a host on a given (SSL) port." -> 1447 super().connect() 1449 if self._tunnel_host: File /opt/conda/lib/python3.10/http/client.py:941, in HTTPConnection.connect(self) 940 sys.audit("http.client.connect", self, self.host, self.port) --> 941 self.sock = self._create_connection( 942 (self.host,self.port), self.timeout, self.source_address) 943 # Might fail in OSs that don't implement TCP_NODELAY File /opt/conda/lib/python3.10/socket.py:845, in create_connection(address, timeout, source_address) 844 try: --> 845 raise err 846 finally: 847 # Break explicitly a reference cycle File /opt/conda/lib/python3.10/socket.py:833, in create_connection(address, timeout, source_address) 832 sock.bind(source_address) --> 833 sock.connect(sa) 834 # Break explicitly a reference cycle OSError: [Errno 99] Cannot assign requested address During handling of the above exception, another exception occurred: URLError Traceback (most recent call last) Cell In[18], line 2 1 from sklearn.datasets import fetch_lfw_people ----> 2 faces = fetch_lfw_people(min_faces_per_person=60) 3 print(faces.target_names) 4 print(faces.images.shape) File /opt/conda/lib/python3.10/site-packages/sklearn/datasets/_lfw.py:328, in fetch_lfw_people(data_home, funneled, resize, min_faces_per_person, color, slice_, download_if_missing, return_X_y) 234 def fetch_lfw_people( 235 *, 236 data_home=None, (...) 243 return_X_y=False, 244 ): 245 """Load the Labeled Faces in the Wild (LFW) people dataset \ 246 (classification). 247 (...) 326 .. versionadded:: 0.20 327 """ --> 328 lfw_home, data_folder_path = _check_fetch_lfw( 329 data_home=data_home, funneled=funneled, download_if_missing=download_if_missing 330 ) 331 logger.debug("Loading LFW people faces from %s", lfw_home) 333 # wrap the loader in a memoizing function that will return memmaped data 334 # arrays for optimal memory usage File /opt/conda/lib/python3.10/site-packages/sklearn/datasets/_lfw.py:88, in _check_fetch_lfw(data_home, funneled, download_if_missing) 86 if download_if_missing: 87 logger.info("Downloading LFW metadata: %s", target.url) ---> 88 _fetch_remote(target, dirname=lfw_home) 89 else: 90 raise IOError("%s is missing" % target_filepath) File /opt/conda/lib/python3.10/site-packages/sklearn/datasets/_base.py:1324, in _fetch_remote(remote, dirname) 1302 """Helper function to download a remote dataset into path 1303 1304 Fetch a dataset pointed by remote's url, save into path using remote's (...) 1320 Full path of the created file. 1321 """ 1323 file_path = remote.filename if dirname is None else join(dirname, remote.filename) -> 1324 urlretrieve(remote.url, file_path) 1325 checksum = _sha256(file_path) 1326 if remote.checksum != checksum: File /opt/conda/lib/python3.10/urllib/request.py:241, in urlretrieve(url, filename, reporthook, data) 224 """ 225 Retrieve a URL into a temporary location on disk. 226 (...) 237 data file as well as the resulting HTTPMessage object. 238 """ 239 url_type, path = _splittype(url) --> 241 with contextlib.closing(urlopen(url, data)) as fp: 242 headers = fp.info() 244 # Just return the local path and the "headers" for file:// 245 # URLs. No sense in performing a copy unless requested. File /opt/conda/lib/python3.10/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context) 214 else: 215 opener = _opener --> 216 return opener.open(url, data, timeout) File /opt/conda/lib/python3.10/urllib/request.py:525, in OpenerDirector.open(self, fullurl, data, timeout) 523 for processor in self.process_response.get(protocol, []): 524 meth = getattr(processor, meth_name) --> 525 response = meth(req, response) 527 return response File /opt/conda/lib/python3.10/urllib/request.py:634, in HTTPErrorProcessor.http_response(self, request, response) 631 # According to RFC 2616, "2xx" code indicates that the client's 632 # request was successfully received, understood, and accepted. 633 if not (200 <= code < 300): --> 634 response = self.parent.error( 635 'http', request, response, code, msg, hdrs) 637 return response File /opt/conda/lib/python3.10/urllib/request.py:557, in OpenerDirector.error(self, proto, *args) 555 http_err = 0 556 args = (dict, proto, meth_name) + args --> 557 result = self._call_chain(*args) 558 if result: 559 return result File /opt/conda/lib/python3.10/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args) 494 for handler in handlers: 495 func = getattr(handler, meth_name) --> 496 result = func(*args) 497 if result is not None: 498 return result File /opt/conda/lib/python3.10/urllib/request.py:749, in HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers) 746 fp.read() 747 fp.close() --> 749 return self.parent.open(new, timeout=req.timeout) File /opt/conda/lib/python3.10/urllib/request.py:519, in OpenerDirector.open(self, fullurl, data, timeout) 516 req = meth(req) 518 sys.audit('urllib.Request', req.full_url, req.data, req.headers, req.get_method()) --> 519 response = self._open(req, data) 521 # post-process response 522 meth_name = protocol+"_response" File /opt/conda/lib/python3.10/urllib/request.py:536, in OpenerDirector._open(self, req, data) 533 return result 535 protocol = req.type --> 536 result = self._call_chain(self.handle_open, protocol, protocol + 537 '_open', req) 538 if result: 539 return result File /opt/conda/lib/python3.10/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args) 494 for handler in handlers: 495 func = getattr(handler, meth_name) --> 496 result = func(*args) 497 if result is not None: 498 return result File /opt/conda/lib/python3.10/urllib/request.py:1391, in HTTPSHandler.https_open(self, req) 1390 def https_open(self, req): -> 1391 return self.do_open(http.client.HTTPSConnection, req, 1392 context=self._context, check_hostname=self._check_hostname) File /opt/conda/lib/python3.10/urllib/request.py:1351, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args) 1348 h.request(req.get_method(), req.selector, req.data, headers, 1349 encode_chunked=req.has_header('Transfer-encoding')) 1350 except OSError as err: # timeout error -> 1351 raise URLError(err) 1352 r = h.getresponse() 1353 except: URLError: <urlopen error [Errno 99] Cannot assign requested address>
Let's plot a few of these faces to see what we're working with:
fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
axi.imshow(faces.images[i], cmap='bone')
axi.set(xticks=[], yticks=[],
xlabel=faces.target_names[faces.target[i]])
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[19], line 3 1 fig, ax = plt.subplots(3, 5) 2 for i, axi in enumerate(ax.flat): ----> 3 axi.imshow(faces.images[i], cmap='bone') 4 axi.set(xticks=[], yticks=[], 5 xlabel=faces.target_names[faces.target[i]]) NameError: name 'faces' is not defined
Each image contains [62×47] or nearly 3,000 pixels. We could proceed by simply using each pixel value as a feature, but often it is more effective to use some sort of preprocessor to extract more meaningful features; here we will use a principal component analysis (see In Depth: Principal Component Analysis) to extract 150 fundamental components to feed into our support vector machine classifier. We can do this most straightforwardly by packaging the preprocessor and the classifier into a single pipeline:
from sklearn.svm import SVC
from sklearn.decomposition import RandomizedPCA
from sklearn.pipeline import make_pipeline
pca = RandomizedPCA(n_components=150, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)
--------------------------------------------------------------------------- ImportError Traceback (most recent call last) Cell In[20], line 2 1 from sklearn.svm import SVC ----> 2 from sklearn.decomposition import RandomizedPCA 3 from sklearn.pipeline import make_pipeline 5 pca = RandomizedPCA(n_components=150, whiten=True, random_state=42) ImportError: cannot import name 'RandomizedPCA' from 'sklearn.decomposition' (/opt/conda/lib/python3.10/site-packages/sklearn/decomposition/__init__.py)
For the sake of testing our classifier output, we will split the data into a training and testing set:
from sklearn.cross_validation import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(faces.data, faces.target,
random_state=42)
--------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) Cell In[21], line 1 ----> 1 from sklearn.cross_validation import train_test_split 2 Xtrain, Xtest, ytrain, ytest = train_test_split(faces.data, faces.target, 3 random_state=42) ModuleNotFoundError: No module named 'sklearn.cross_validation'
Finally, we can use a grid search cross-validation to explore combinations of parameters. Here we will adjust C
(which controls the margin hardness) and gamma
(which controls the size of the radial basis function kernel), and determine the best model:
from sklearn.grid_search import GridSearchCV
param_grid = {'svc__C': [1, 5, 10, 50],
'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid)
%time grid.fit(Xtrain, ytrain)
print(grid.best_params_)
--------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) Cell In[22], line 1 ----> 1 from sklearn.grid_search import GridSearchCV 2 param_grid = {'svc__C': [1, 5, 10, 50], 3 'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]} 4 grid = GridSearchCV(model, param_grid) ModuleNotFoundError: No module named 'sklearn.grid_search'
The optimal values fall toward the middle of our grid; if they fell at the edges, we would want to expand the grid to make sure we have found the true optimum.
Now with this cross-validated model, we can predict the labels for the test data, which the model has not yet seen:
model = grid.best_estimator_
yfit = model.predict(Xtest)
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[23], line 1 ----> 1 model = grid.best_estimator_ 2 yfit = model.predict(Xtest) NameError: name 'grid' is not defined
Let's take a look at a few of the test images along with their predicted values:
fig, ax = plt.subplots(4, 6)
for i, axi in enumerate(ax.flat):
axi.imshow(Xtest[i].reshape(62, 47), cmap='bone')
axi.set(xticks=[], yticks=[])
axi.set_ylabel(faces.target_names[yfit[i]].split()[-1],
color='black' if yfit[i] == ytest[i] else 'red')
fig.suptitle('Predicted Names; Incorrect Labels in Red', size=14);
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[24], line 3 1 fig, ax = plt.subplots(4, 6) 2 for i, axi in enumerate(ax.flat): ----> 3 axi.imshow(Xtest[i].reshape(62, 47), cmap='bone') 4 axi.set(xticks=[], yticks=[]) 5 axi.set_ylabel(faces.target_names[yfit[i]].split()[-1], 6 color='black' if yfit[i] == ytest[i] else 'red') NameError: name 'Xtest' is not defined
Out of this small sample, our optimal estimator mislabeled only a single face (Bush’s face in the bottom row was mislabeled as Blair). We can get a better sense of our estimator's performance using the classification report, which lists recovery statistics label by label:
from sklearn.metrics import classification_report
print(classification_report(ytest, yfit,
target_names=faces.target_names))
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[25], line 2 1 from sklearn.metrics import classification_report ----> 2 print(classification_report(ytest, yfit, 3 target_names=faces.target_names)) NameError: name 'ytest' is not defined
We might also display the confusion matrix between these classes:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, yfit)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels=faces.target_names,
yticklabels=faces.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[26], line 2 1 from sklearn.metrics import confusion_matrix ----> 2 mat = confusion_matrix(ytest, yfit) 3 sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, 4 xticklabels=faces.target_names, 5 yticklabels=faces.target_names) 6 plt.xlabel('true label') NameError: name 'ytest' is not defined
This helps us get a sense of which labels are likely to be confused by the estimator.
For a real-world facial recognition task, in which the photos do not come pre-cropped into nice grids, the only difference in the facial classification scheme is the feature selection: you would need to use a more sophisticated algorithm to find the faces, and extract features that are independent of the pixellation. For this kind of application, one good option is to make use of OpenCV, which, among other things, includes pre-trained implementations of state-of-the-art feature extraction tools for images in general and faces in particular.
Support Vector Machine Summary¶
We have seen here a brief intuitive introduction to the principals behind support vector machines. These methods are a powerful classification method for a number of reasons:
- Their dependence on relatively few support vectors means that they are very compact models, and take up very little memory.
- Once the model is trained, the prediction phase is very fast.
- Because they are affected only by points near the margin, they work well with high-dimensional data—even data with more dimensions than samples, which is a challenging regime for other algorithms.
- Their integration with kernel methods makes them very versatile, able to adapt to many types of data.
However, SVMs have several disadvantages as well:
- The scaling with the number of samples $N$ is $\mathcal{O}[N^3]$ at worst, or $\mathcal{O}[N^2]$ for efficient implementations. For large numbers of training samples, this computational cost can be prohibitive.
- The results are strongly dependent on a suitable choice for the softening parameter $C$. This must be carefully chosen via cross-validation, which can be expensive as datasets grow in size.
- The results do not have a direct probabilistic interpretation. This can be estimated via an internal cross-validation (see the
probability
parameter ofSVC
), but this extra estimation is costly.
With those traits in mind, I generally only turn to SVMs once other simpler, faster, and less tuning-intensive methods have been shown to be insufficient for my needs. Nevertheless, if you have the CPU cycles to commit to training and cross-validating an SVM on your data, the method can lead to excellent results.