This notebook contains an excerpt from the Python Data Science Handbook by Jake VanderPlas; the content is available on GitHub.
The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
Visualization with Seaborn
Matplotlib has proven to be an incredibly useful and popular visualization tool, but even avid users will admit it often leaves much to be desired. There are several valid complaints about Matplotlib that often come up:
- Prior to version 2.0, Matplotlib's defaults were not exactly the best choices. They were based on MATLAB circa 1999, and this often shows.
- Matplotlib's API is relatively low level. Doing sophisticated statistical visualization is possible, but often requires a lot of boilerplate code.
- Matplotlib predated Pandas by more than a decade, and thus is not designed for use with Pandas DataFrames. In order to visualize data from a Pandas DataFrame, you must extract each Series and often concatenate them together into the right format. It would be nicer to have a plotting library that can intelligently use the DataFrame labels in a plot, as the short sketch after this list illustrates.
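As a quick illustration of that last point, here is a minimal sketch (using a small hypothetical DataFrame, not data from this chapter) of the boilerplate that raw Matplotlib requires to make use of column labels:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# A small hypothetical DataFrame, just for illustration
df = pd.DataFrame(np.random.randn(100, 2).cumsum(axis=0), columns=['a', 'b'])

# With raw Matplotlib you extract each Series and supply the labels yourself
for col in df.columns:
    plt.plot(df.index, df[col], label=col)
plt.legend();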
An answer to these problems is Seaborn. Seaborn provides an API on top of Matplotlib that offers sane choices for plot style and color defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas DataFrames.
To be fair, the Matplotlib team is addressing this: it has added the plt.style tools discussed in Customizing Matplotlib: Configurations and Style Sheets, and is starting to handle Pandas data more seamlessly. The 2.0 release of the library includes a new default stylesheet that improves on the old status quo. But for all the reasons just discussed, Seaborn remains an extremely useful add-on.
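As a rough sketch of what those style tools look like in practice (nothing here is specific to this chapter's examples), styles can be switched either through Matplotlib itself or through Seaborn:
import matplotlib.pyplot as plt
import seaborn as sns

print(plt.style.available[:5])    # a few of Matplotlib's bundled style sheets
plt.style.use('ggplot')           # apply one of them globally
sns.set_theme(style='whitegrid')  # or let Seaborn configure Matplotlib's defaults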
Seaborn Versus Matplotlib
Here is an example of a simple random-walk plot in Matplotlib, using its classic plot formatting and colors. We start with the typical imports:
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
import numpy as np
import pandas as pd
Now we create some random walk data:
# Create some data
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 500)
y = np.cumsum(rng.randn(500, 6), 0)
And do a simple plot:
# Plot the data with Matplotlib defaults
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');
Although the result contains all the information we'd like it to convey, it does so in a way that is not all that aesthetically pleasing, and even looks a bit old-fashioned in the context of 21st-century data visualization.
Now let's take a look at how it works with Seaborn. As we will see, Seaborn has many of its own high-level plotting routines, but it can also override Matplotlib's default parameters and in turn get even simple Matplotlib scripts to produce vastly superior output. We can set the style by calling Seaborn's set_theme() function (sns.set() is an older alias for it). By convention, Seaborn is imported as sns:
import seaborn as sns
sns.set_theme()
Now let's rerun the same two lines as before:
# same plotting code as above!
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');
Ah, much better!
Exploring Seaborn Plots
The main idea of Seaborn is that it provides high-level commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting.
Let's take a look at a few of the datasets and plot types available in Seaborn. Note that all of the following could be done using raw Matplotlib commands (this is, in fact, what Seaborn does under the hood) but the Seaborn API is much more convenient.
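To give a rough sense of what "under the hood" means here, the following is a hand-rolled sketch of a density plot built from raw Matplotlib and SciPy; it is only an illustration, not Seaborn's actual implementation:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# A hand-rolled density plot: roughly what sns.kdeplot automates for us
rng = np.random.RandomState(42)
sample = rng.randn(1000)
kde = gaussian_kde(sample)                    # fit a Gaussian KDE to the sample
grid = np.linspace(sample.min() - 1, sample.max() + 1, 200)
plt.fill_between(grid, kde(grid), alpha=0.5)  # shade under the estimated density
plt.plot(grid, kde(grid));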
Histograms, KDE, and densities
Often in statistical data visualization, all you want is to plot histograms and joint distributions of variables. We have seen that this is relatively straightforward in Matplotlib:
data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])
for col in 'xy':
    plt.hist(data[col], density=True, alpha=0.5)
Rather than a histogram, we can get a smooth estimate of the distribution using kernel density estimation, which Seaborn does with sns.kdeplot:
for col in 'xy':
    sns.kdeplot(data[col], fill=True)
Histograms and KDE can be combined using histplot (the older distplot function is deprecated):
sns.histplot(data['x'], kde=True)
sns.histplot(data['y'], kde=True);
If we pass the full two-dimensional dataset to kdeplot, naming the two columns, we will get a two-dimensional visualization of the data:
sns.kdeplot(data=data, x='x', y='y');
We can see the joint distribution and the marginal distributions together using sns.jointplot. For this plot, we'll set the style to a white background:
with sns.axes_style('white'):
    sns.jointplot(x="x", y="y", data=data, kind='kde');
There are other parameters that can be passed to jointplot; for example, we can use a hexagonally based histogram instead:
with sns.axes_style('white'):
    sns.jointplot(x="x", y="y", data=data, kind='hex')
Pair plots
When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This is very useful for exploring correlations between multidimensional data, when you'd like to plot all pairs of values against each other.
We'll demo this with the well-known Iris dataset, which lists measurements of petals and sepals of three iris species:
iris = sns.load_dataset("iris")
iris.head()
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Visualizing the multidimensional relationships among the samples is as easy as calling sns.pairplot:
sns.pairplot(iris, hue='species', height=2.5);
Faceted histograms
Sometimes the best way to view data is via histograms of subsets. Seaborn's FacetGrid makes this extremely simple. We'll take a look at some data that shows the amount that restaurant staff receive in tips based on various indicator data:
tips = sns.load_dataset('tips')
tips.head()
|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']
grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15));
Factor plots
Factor plots can be useful for this kind of visualization as well. They let you view the distribution of a parameter within bins defined by any other parameter. (In recent Seaborn versions, factorplot has been renamed catplot.)
with sns.axes_style(style='ticks'):
    g = sns.catplot(x="day", y="total_bill", hue="sex", data=tips, kind="box")
    g.set_axis_labels("Day", "Total Bill");
Joint distributions
Similar to the pairplot we saw earlier, we can use sns.jointplot to show the joint distribution between different datasets, along with the associated marginal distributions:
with sns.axes_style('white'):
    sns.jointplot(x="total_bill", y="tip", data=tips, kind='hex')
The joint plot can even do some automatic kernel density estimation and regression:
sns.jointplot("total_bill", "tip", data=tips, kind='reg');
Bar plots
Time series can be plotted using sns.catplot (formerly sns.factorplot). In the following example, we'll use the Planets data that we first saw in Aggregation and Grouping:
planets = sns.load_dataset('planets')
planets.head()
|   | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
with sns.axes_style('white'):
    g = sns.catplot(x="year", data=planets, aspect=2,
                    kind="count", color='steelblue')
    g.set_xticklabels(step=5)
We can learn more by looking at the method of discovery of each of these planets:
with sns.axes_style('white'):
    g = sns.catplot(x="year", data=planets, aspect=4.0, kind='count',
                    hue='method', order=range(2001, 2015))
    g.set_ylabels('Number of Planets Discovered')
For more information on plotting with Seaborn, see the Seaborn documentation, a tutorial, and the Seaborn gallery.
Example: Exploring Marathon Finishing Times
Here we'll look at using Seaborn to help visualize and understand finishing results from a marathon. I've scraped the data from sources on the Web, aggregated it and removed any identifying information, and put it on GitHub where it can be downloaded (if you are interested in using Python for web scraping, I would recommend Web Scraping with Python by Ryan Mitchell). We will start by downloading the data from the Web, and loading it into Pandas:
!curl -O https://raw.githubusercontent.com/jakevdp/marathon-data/master/marathon-data.csv
data = pd.read_csv('marathon-data.csv')
data.head()
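As an aside, if running shell commands is inconvenient in your environment, Pandas can read the CSV directly from the same GitHub URL (this is just an alternative, assuming an internet connection):
# Alternative: let Pandas fetch the CSV straight from GitHub
url = 'https://raw.githubusercontent.com/jakevdp/marathon-data/master/marathon-data.csv'
data = pd.read_csv(url)
data.head()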
By default, Pandas loaded the time columns as Python strings (type object); we can see this by looking at the dtypes attribute of the DataFrame:
data.dtypes
Let's fix this by providing a converter for the times:
import datetime
def convert_time(s):
h, m, s = map(int, s.split(':'))
return datetime.timedelta(hours=h, minutes=m, seconds=s)
data = pd.read_csv('marathon-data.csv',
converters={'split':convert_time, 'final':convert_time})
data.head()
data.dtypes
That looks much better. For the purpose of our Seaborn plotting utilities, let's next add columns that give the times in seconds:
data['split_sec'] = pd.to_timedelta(data['split']).dt.total_seconds()
data['final_sec'] = pd.to_timedelta(data['final']).dt.total_seconds()
data.head()
To get an idea of what the data looks like, we can plot a jointplot over the data:
with sns.axes_style('white'):
    g = sns.jointplot(x="split_sec", y="final_sec", data=data, kind='hex')
    g.ax_joint.plot(np.linspace(4000, 16000),
                    np.linspace(8000, 32000), ':k')
The dotted line shows where someone's time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon. If you have run competitively, you'll know that those who do the opposite—run faster during the second half of the race—are said to have "negative-split" the race.
Let's create another column in the data, the split fraction, which measures the degree to which each runner negative-splits or positive-splits the race:
data['split_frac'] = 1 - 2 * data['split_sec'] / data['final_sec']
data.head()
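As a quick sanity check on this formula, consider a hypothetical runner (numbers made up for illustration) who reaches halfway in 5,000 seconds and finishes in 11,000 seconds; the split fraction should come out slightly positive:
# Hypothetical runner, for illustration only: 5,000 s at halfway, 11,000 s at the finish
split_sec, final_sec = 5000, 11000
print(1 - 2 * split_sec / final_sec)  # about +0.09: a positive split (0 would be perfectly even)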
Where this split difference is less than zero, the person negative-split the race by that fraction. Let's do a distribution plot of this split fraction:
sns.histplot(data['split_frac'], kde=False)
plt.axvline(0, color="k", linestyle="--");
sum(data.split_frac < 0)
Out of nearly 40,000 participants, there were only 250 people who negative-split their marathon.
Let's see whether there is any correlation between this split fraction and other variables. We'll do this using a PairGrid, which draws plots of all these correlations:
g = sns.PairGrid(data, vars=['age', 'split_sec', 'final_sec', 'split_frac'],
hue='gender', palette='RdBu_r')
g.map(plt.scatter, alpha=0.8)
g.add_legend();
It looks like the split fraction does not correlate particularly with age, but does correlate with the final time: faster runners tend to have closer to even splits on their marathon time. (We see here that Seaborn is no panacea for Matplotlib's ills when it comes to plot styles: in particular, the x-axis labels overlap. Because the output is a simple Matplotlib plot, however, the methods in Customizing Ticks can be used to adjust such things if desired.)
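For instance, one minimal way to tidy those labels on the PairGrid above uses standard Matplotlib tools (a sketch, not a specific recipe from that section):
from matplotlib.ticker import MaxNLocator

# Thin out and rotate the x tick labels on every panel of the PairGrid above
for ax in g.axes.flat:
    ax.xaxis.set_major_locator(MaxNLocator(4))   # cap the number of x ticks
    plt.setp(ax.get_xticklabels(), rotation=45)  # rotate what remains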
The difference between men and women here is interesting. Let's look at the distribution of split fractions for these two groups:
sns.kdeplot(data.split_frac[data.gender=='M'], label='men', fill=True)
sns.kdeplot(data.split_frac[data.gender=='W'], label='women', fill=True)
plt.xlabel('split_frac')
plt.legend();
The interesting thing here is that there are many more men than women who are running close to an even split! This almost looks like some kind of bimodal distribution among the men and women. Let's see if we can suss out what's going on by looking at the distributions as a function of age.
A nice way to compare distributions is to use a violin plot:
sns.violinplot(x="gender", y="split_frac", data=data,
               palette=["lightblue", "lightpink"]);
This is yet another way to compare the distributions between men and women.
Let's look a little deeper, and compare these violin plots as a function of age. We'll start by creating a new column in the DataFrame that specifies the decade of age that each person is in:
data['age_dec'] = data.age.map(lambda age: 10 * (age // 10))
data.head()
men = (data.gender == 'M')
women = (data.gender == 'W')
with sns.axes_style(style=None):
    sns.violinplot(x="age_dec", y="split_frac", hue="gender", data=data,
                   split=True, inner="quartile",
                   palette=["lightblue", "lightpink"]);
Looking at this, we can see where the distributions of men and women differ: the split distributions of men in their 20s to 50s show a pronounced over-density toward lower splits when compared to women of the same age (or of any age, for that matter).
Also surprisingly, the 80-year-old women seem to outperform everyone in terms of their split time. This is probably due to the fact that we're estimating the distribution from small numbers, as there are only a handful of runners in that range:
(data.age > 80).sum()
Back to the men with negative splits: who are these runners? Does this split fraction correlate with finishing quickly? We can plot this very easily. We'll use lmplot, which will automatically fit a linear regression to the data:
g = sns.lmplot(x='final_sec', y='split_frac', col='gender', data=data,
               markers=".", scatter_kws=dict(color='c'))
g.map(plt.axhline, y=0.1, color="k", ls=":");
Apparently the people with fast splits are the elite runners who are finishing within ~15,000 seconds, or about 4 hours. People slower than that are much less likely to have a fast second split.