This notebook contains an excerpt from the Python Data Science Handbook by Jake VanderPlas; the content is available on GitHub.
The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
Histograms, Binnings, and Density
A simple histogram can be a great first step in understanding a dataset. Earlier, we saw a preview of Matplotlib's histogram function (see Comparisons, Masks, and Boolean Logic), which creates a basic histogram in one line once the usual boilerplate imports are done:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-white')  # 'seaborn-white' was renamed in Matplotlib 3.6
data = np.random.randn(1000)
plt.hist(data);
The hist() function has many options to tune both the calculation and the display; here's an example of a more customized histogram:
# density=True replaces the removed normed=True keyword
plt.hist(data, bins=30, density=True, alpha=0.5,
         histtype='stepfilled', color='steelblue',
         edgecolor='none');
The plt.hist docstring has more information on other customization options available. I find this combination of histtype='stepfilled' along with some transparency alpha to be very useful when comparing histograms of several distributions:
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
# density=True replaces the removed normed=True keyword
kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
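When several histograms are overlaid, passing an integer bin count lets each dataset get its own bin edges, so the bars are not directly comparable bin by bin. The following is a minimal sketch of one way to share edges, assuming NumPy's np.histogram_bin_edges is available (NumPy 1.15 or later); the seeded generator and the variable names are illustrative choices, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
x1 = rng.normal(0, 0.8, 1000)
x2 = rng.normal(-2, 1, 1000)
x3 = rng.normal(3, 2, 1000)

# One set of 40 bins spanning the full range of all three samples
edges = np.histogram_bin_edges(np.concatenate([x1, x2, x3]), bins=40)

# Each density is evaluated on the same bins, so values line up bin by bin
densities = [np.histogram(x, bins=edges, density=True)[0] for x in (x1, x2, x3)]
widths = np.diff(edges)
areas = [float(np.sum(d * widths)) for d in densities]
print(len(edges), areas)
```

The same edges array can be passed as the bins argument of plt.hist, which accepts either a bin count or an explicit sequence of edges.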
If you would like to simply compute the histogram (that is, count the number of points in a given bin) and not display it, the np.histogram() function is available:
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
[ 14 211 510 252 13]
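To make the relationship between raw counts and the density=True normalization concrete, here is a small sketch. It assumes a seeded generator in place of the unseeded data above, so the counts will differ from the output shown.

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed, not from the original
data = rng.standard_normal(1000)

counts, bin_edges = np.histogram(data, bins=5)
# 5 bins are delimited by 6 edges, and every point falls in exactly one bin
print(counts.sum(), len(bin_edges))

density, _ = np.histogram(data, bins=5, density=True)
widths = np.diff(bin_edges)
# density = counts / (total count * bin width), so the total area is 1
print(np.allclose(density, counts / (counts.sum() * widths)))
print(np.sum(density * widths))
```

Because the normalization divides by bin width, a density histogram integrates to one even when the bins have unequal widths.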
Two-Dimensional Histograms and Binnings
Just as we create histograms in one dimension by dividing the number line into bins, we can also create histograms in two dimensions by dividing points among two-dimensional bins. We'll take a brief look at several ways to do this here. We'll start by defining some data: an x and y array drawn from a multivariate Gaussian distribution:
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
plt.hist2d: Two-dimensional histogram
One straightforward way to plot a two-dimensional histogram is to use Matplotlib's plt.hist2d function:
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
Just as with plt.hist, plt.hist2d has a number of extra options to fine-tune the plot and the binning, which are nicely outlined in the function docstring. Further, just as plt.hist has a counterpart in np.histogram, plt.hist2d has a counterpart in np.histogram2d, which can be used as follows:
counts, xedges, yedges = np.histogram2d(x, y, bins=30)
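As a quick check of what np.histogram2d returns, the sketch below inspects the shapes and verifies one interior bin by hand. The seeded generator and the bin indices i and j are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
x, y = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], 10000).T

counts, xedges, yedges = np.histogram2d(x, y, bins=30)
print(counts.shape, xedges.shape, yedges.shape)  # (30, 30) (31,) (31,)

# counts[i, j] is the number of points with x in bin i and y in bin j;
# interior bins are half-open intervals [left, right)
i, j = 10, 15
in_bin = ((x >= xedges[i]) & (x < xedges[i + 1]) &
          (y >= yedges[j]) & (y < yedges[j + 1]))
print(counts[i, j] == in_bin.sum())
```

Note that the first axis of counts corresponds to x, so the array usually needs to be transposed (counts.T) before being drawn with image-style functions that put x along the horizontal axis.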
For the generalization of this histogram binning to more than two dimensions, see the np.histogramdd function.
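A minimal sketch of np.histogramdd on three-dimensional data follows; the sample array and the choice of 8 bins per axis are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
points = rng.standard_normal((5000, 3))  # 5000 samples, 3 dimensions

# The sample is passed as an (N, D) array; bins may be given per dimension
counts, edges = np.histogramdd(points, bins=(8, 8, 8))

# counts is a D-dimensional array; edges is a list with one array per axis
print(counts.shape, len(edges), [len(e) for e in edges])
```

Unlike np.histogram2d, which takes x and y separately, np.histogramdd expects a single array with one column per dimension.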
plt.hexbin: Hexagonal binnings
The two-dimensional histogram creates a tessellation of squares across the axes. Another natural shape for such a tessellation is the regular hexagon. For this purpose, Matplotlib provides the plt.hexbin routine, which represents a two-dimensional dataset binned within a grid of hexagons:
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')
plt.hexbin has a number of interesting options, including the ability to specify weights for each point, and to change the output in each bin to any NumPy aggregate (mean of weights, standard deviation of weights, etc.).
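The per-point weights and the per-bin aggregate mentioned above are passed through the C and reduce_C_function arguments of hexbin. The sketch below is one possible usage, assuming a made-up per-point quantity w; the Agg backend line is only there so the code runs without a display and is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; unnecessary in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
x, y = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], 10000).T
w = x ** 2 + y ** 2  # hypothetical value attached to each point

fig, ax = plt.subplots()
# Color each hexagon by the standard deviation of w among its points
hb = ax.hexbin(x, y, C=w, reduce_C_function=np.std,
               gridsize=30, cmap='Blues')
fig.colorbar(hb, ax=ax, label='std of w in bin')

values = hb.get_array()  # one aggregated value per drawn hexagon
print(len(values))
```

When C is omitted, hexbin simply counts the points in each hexagon, as in the example above this paragraph.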
Kernel density estimation
Another common method of evaluating densities in multiple dimensions is kernel density estimation (KDE). This will be discussed more fully in In-Depth: Kernel Density Estimation, but for now we'll simply mention that KDE can be thought of as a way to "smear out" the points in space and add up the result to obtain a smooth function. One extremely quick and simple KDE implementation exists in the scipy.stats package. Here is a quick example of using the KDE on this data:
from scipy.stats import gaussian_kde
# fit an array of size [Ndim, Nsamples]
data = np.vstack([x, y])
kde = gaussian_kde(data)
# evaluate on a regular grid
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
# Plot the result as an image
plt.imshow(Z.reshape(Xgrid.shape),
           origin='lower', aspect='auto',
           extent=[-3.5, 3.5, -6, 6],
           cmap='Blues')
cb = plt.colorbar()
cb.set_label("density")
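The "smear out and add up" description can be checked directly in one dimension. The sketch below places a Gaussian bump on every sample, averages them, and compares the result against gaussian_kde. It assumes one-dimensional data and unweighted samples, in which case the kernel width is the estimator's factor attribute times the sample standard deviation; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)  # illustrative seed
samples = rng.standard_normal(200)
grid = np.linspace(-4, 4, 50)

kde = gaussian_kde(samples)
# Kernel width: rule-of-thumb factor times the sample standard deviation
h = kde.factor * samples.std(ddof=1)

# One Gaussian bump per sample, evaluated on the grid, then averaged
bumps = np.exp(-0.5 * ((grid[:, None] - samples[None, :]) / h) ** 2)
manual = bumps.sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

print(np.allclose(manual, kde(grid)))
```

In higher dimensions the same idea applies, with the scalar width replaced by a scaled covariance matrix of the data.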
KDE has a smoothing length that effectively slides the knob between detail and smoothness (one example of the ubiquitous bias–variance trade-off). The literature on choosing an appropriate smoothing length is vast; gaussian_kde uses a rule of thumb to attempt to find a nearly optimal smoothing length for the input data.
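The smoothing length can be inspected and overridden through the bw_method argument of gaussian_kde. The following sketch compares the default (Scott's rule), Silverman's rule, and a deliberately small fixed factor on one-dimensional data; the value 0.05 and the roughness measure are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)  # illustrative seed
samples = rng.standard_normal(500)
n = samples.size

scott = gaussian_kde(samples)                         # default: Scott's rule
silverman = gaussian_kde(samples, bw_method='silverman')
narrow = gaussian_kde(samples, bw_method=0.05)        # fixed scalar factor

# The factor multiplies the data's standard deviation to give the kernel width
print(scott.factor, silverman.factor, narrow.factor)

# A crude roughness measure: total variation of the estimate on a fine grid
grid = np.linspace(-4, 4, 801)
for name, kde in [('scott', scott), ('narrow', narrow)]:
    print(name, np.abs(np.diff(kde(grid))).sum())
```

A smaller factor follows individual samples more closely (lower bias, higher variance), while a larger one gives a smoother but blurrier estimate.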
Other KDE implementations are available within the SciPy ecosystem, each with its own strengths and weaknesses; see, for example, sklearn.neighbors.KernelDensity and statsmodels.nonparametric.kernel_density.KDEMultivariate. For visualizations based on KDE, using Matplotlib tends to be overly verbose. The Seaborn library, discussed in Visualization With Seaborn, provides a much more terse API for creating KDE-based visualizations.