This notebook contains an excerpt from the Python Data Science Handbook by Jake VanderPlas; the content is available on GitHub.
The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
Vectorized String Operations¶
One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a comprehensive set of vectorized string operations that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data. In this section, we'll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.
Introducing Pandas String Operations¶
We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
array([ 4, 6, 10, 14, 22, 26])
This vectorization of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done. For arrays of strings, NumPy does not provide such simple access, and thus you're stuck using a more verbose loop syntax:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
['Peter', 'Paul', 'Mary', 'Guido']
This is perhaps sufficient to work with some data, but it will break if there are any missing values. For example:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
AttributeError: 'NoneType' object has no attribute 'capitalize'
Pandas includes features to address both the need for vectorized string operations and the need for correctly handling missing data, via the str attribute of Pandas Series and Index objects containing strings. So, for example, suppose we create a Pandas Series with this data:
import pandas as pd
names = pd.Series(data)
names
0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object
We can now call a single method that will capitalize all the entries, while skipping over any missing values:
names.str.capitalize()
0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
Using tab completion on this str attribute will list all the vectorized string methods available to Pandas.
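As an aside not in the original text, the same list can be pulled up programmatically by inspecting the public attributes of the str accessor on any string Series (the exact names depend on the pandas version installed):

```python
import pandas as pd

# Inspect the .str accessor of a throwaway Series to enumerate
# the vectorized string methods (equivalent to tab completion).
s = pd.Series(['example'])
str_methods = [m for m in dir(s.str) if not m.startswith('_')]
print(str_methods[:5])
```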
Tables of Pandas String Methods¶
If you have a good understanding of string manipulation in Python, most of Pandas string syntax is intuitive enough that it's probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties. The examples in this section use the following series of names:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
Methods similar to Python string methods¶
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas str methods that mirror Python string methods:
len() | lower() | translate() | islower() |
ljust() | upper() | startswith() | isupper() |
rjust() | find() | endswith() | isnumeric() |
center() | rfind() | isalnum() | isdecimal() |
zfill() | index() | isalpha() | split() |
strip() | rindex() | isdigit() | rsplit() |
rstrip() | capitalize() | isspace() | partition() |
lstrip() | swapcase() | istitle() | rpartition() |
Notice that these have various return values. Some, like lower(), return a series of strings:
monte.str.lower()
0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object
But some others return numbers:
monte.str.len()
0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64
Or Boolean values:
monte.str.startswith('T')
0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
Still others return lists or other compound values for each element:
monte.str.split()
0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
We'll see further manipulations of this kind of series-of-lists object as we continue our discussion.
Methods using regular expressions¶
In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in re module:
Method | Description |
---|---|
match() | Call re.match() on each element, returning a Boolean. |
extract() | Call re.match() on each element, returning matched groups as strings. |
findall() | Call re.findall() on each element. |
replace() | Replace occurrences of pattern with some other string. |
contains() | Call re.search() on each element, returning a Boolean. |
count() | Count occurrences of pattern. |
split() | Equivalent to str.split(), but accepts regexps. |
rsplit() | Equivalent to str.rsplit(), but accepts regexps. |
With these, you can do a wide range of interesting operations. For example, we can extract the first name from each entry by asking for a contiguous group of characters at the beginning of each element:
monte.str.extract('([A-Za-z]+)', expand=False)
0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object
Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (^) and end-of-string ($) regular expression characters:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
The ability to concisely apply regular expressions across Series or DataFrame entries opens up many possibilities for analysis and cleaning of data.
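As one more illustration (a sketch not in the original text), str.replace with a regex can rewrite each element; note that recent pandas versions require regex=True to be passed explicitly for pattern-based replacement:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

# Abbreviate every capitalized word to its initial followed by a dot.
# regex=True is required in recent pandas for pattern replacement.
initials = monte.str.replace(r'([A-Z])\w+', r'\1.', regex=True)
print(initials.tolist())  # ['G. C.', 'J. C.', 'T. G.']
```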
Miscellaneous methods¶
Finally, there are some miscellaneous methods that enable other convenient operations:
Method | Description |
---|---|
get() | Index each element |
slice() | Slice each element |
slice_replace() | Replace slice in each element with passed value |
cat() | Concatenate strings |
repeat() | Repeat values |
normalize() | Return Unicode form of string |
pad() | Add whitespace to left, right, or both sides of strings |
wrap() | Split long strings into lines with length less than a given width |
join() | Join strings in each element of the Series with passed separator |
get_dummies() | Extract dummy variables as a DataFrame |
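A few of these methods have no example elsewhere in this section, so here is a brief sketch of their behavior (as assumed from the table above):

```python
import pandas as pd

s = pd.Series(['Graham', 'John', 'Terry'])

# pad(): fill each string out to a fixed width
padded = s.str.pad(10, side='left', fillchar='.')

# cat(): with no other Series passed, concatenate all entries into one string
joined = s.str.cat(sep=', ')

# repeat(): repeat each element
doubled = s.str.repeat(2)

print(padded.tolist())   # ['....Graham', '......John', '.....Terry']
print(joined)            # Graham, John, Terry
print(doubled.tolist())  # ['GrahamGraham', 'JohnJohn', 'TerryTerry']
```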
Vectorized item access and slicing¶
The get() and slice() operations, in particular, enable vectorized element access from each entry. For example, we can get a slice of the first three characters of each entry using str.slice(0, 3). Note that this behavior is also available through Python's normal indexing syntax; for example, df.str.slice(0, 3) is equivalent to df.str[0:3]:
monte.str[0:3]
0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object
Indexing via df.str.get(i) and df.str[i] is likewise similar.
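For instance, both spellings pull out the character at a given position of each entry:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese'])

# str.get(i) and str[i] both index into each element
first_via_get = monte.str.get(0)
first_via_index = monte.str[0]
print(first_via_get.tolist())    # ['G', 'J']
print(first_via_index.tolist())  # ['G', 'J']
```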
These get() and slice() methods also let you access elements of arrays returned by split(). For example, to extract the last name of each entry, we can combine split() and get():
monte.str.split().str.get(-1)
0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
Indicator variables¶
Another method that requires a bit of extra explanation is the get_dummies() method. This is useful when your data has a column containing some sort of coded indicator. For example, we might have a dataset that contains information in the form of codes, such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte
name | info | |
---|---|---|
0 | Graham Chapman | B|C|D |
1 | John Cleese | B|D |
2 | Terry Gilliam | A|C |
3 | Eric Idle | B|D |
4 | Terry Jones | B|C |
5 | Michael Palin | B|C|D |
The get_dummies() routine lets you quickly split out these indicator variables into a DataFrame:
full_monte['info'].str.get_dummies('|')
A | B | C | D | |
---|---|---|---|---|
0 | 0 | 1 | 1 | 1 |
1 | 0 | 1 | 0 | 1 |
2 | 1 | 0 | 1 | 0 |
3 | 0 | 1 | 0 | 1 |
4 | 0 | 1 | 1 | 0 |
5 | 0 | 1 | 1 | 1 |
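As a small follow-on sketch (not from the original text), the indicator columns can be joined back onto the original frame to keep the names alongside the decoded flags:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])
full = pd.DataFrame({'name': monte, 'info': ['B|C|D', 'B|D', 'A|C']})

# Expand the coded column and attach it to the names
dummies = full['info'].str.get_dummies('|')
decoded = full[['name']].join(dummies)
print(list(decoded.columns))  # ['name', 'A', 'B', 'C', 'D']
```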
With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.
We won't dive further into these methods here, but I encourage you to read through "Working with Text Data" in the Pandas online documentation, or to refer to the resources listed in Further Resources.
Example: Recipe Database¶
These vectorized string operations become most useful in the process of cleaning up messy, real-world data. Here I'll walk through an example of that, using an open recipe database compiled from various sources on the Web. Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.
The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well.
As of Spring 2016, this database is about 30 MB, and can be downloaded and unzipped with these commands:
# !curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz
# !gunzip recipeitems-latest.json.gz
The database is in JSON format, so we will try pd.read_json to read it:
try:
    recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
    print("ValueError:", e)
ValueError: Trailing data
Oops! We get a ValueError mentioning that there is "trailing data." Searching for the text of this error on the Internet, it seems that it's due to using a file in which each line is itself valid JSON, but the full file is not. Let's check if this interpretation is true:
with open('recipeitems-latest.json') as f:
    line = f.readline()
pd.read_json(line).shape
Yes, apparently each line is valid JSON, so we'll need to string them together. One way we can do this is to actually construct a string representation containing all these JSON entries, and then load the whole thing with pd.read_json:
# read the entire file into a single string
with open('recipeitems-latest.json', 'r') as f:
    # extract each line
    data = (line.strip() for line in f)
    # reformat so each line is an element of a list
    data_json = "[{0}]".format(','.join(data))
# read the result as JSON
recipes = pd.read_json(data_json)
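As an aside: recent versions of pd.read_json can parse such newline-delimited JSON directly via lines=True, avoiding the manual concatenation. A self-contained sketch using an in-memory buffer (the toy records here are made up):

```python
import io
import pandas as pd

# Two newline-delimited JSON records, standing in for the recipe file
buf = io.StringIO('{"name": "a", "n": 1}\n{"name": "b", "n": 2}\n')

# lines=True tells read_json that each line is its own JSON object
df = pd.read_json(buf, lines=True)
print(df.shape)  # (2, 2)
```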
recipes.shape
We see there are nearly 200,000 recipes, and 17 columns. Let's take a look at one row to see what we have:
recipes.iloc[0]
There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web. In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in. Let's start by taking a closer look at the ingredients:
recipes.ingredients.str.len().describe()
The ingredient lists average 250 characters long, with a minimum of 0 and a maximum of nearly 10,000 characters!
Just out of curiosity, let's see which recipe has the longest ingredient list:
recipes.name[np.argmax(recipes.ingredients.str.len())]
That certainly looks like an involved recipe.
We can do other aggregate explorations; for example, let's see how many of the recipes are for breakfast food:
recipes.description.str.contains('[Bb]reakfast').sum()
Or how many of the recipes list cinnamon as an ingredient:
recipes.ingredients.str.contains('[Cc]innamon').sum()
We could even look to see whether any recipes misspell the ingredient as "cinamon":
recipes.ingredients.str.contains('[Cc]inamon').sum()
This is the type of essential data exploration that is possible with Pandas string tools. It is data munging like this that Python really excels at.
A simple recipe recommender¶
Let's go a bit further, and start working on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients. While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row. So we will cheat a bit: we'll start with a list of common ingredients, and simply search to see whether they are in each recipe's ingredient list. For simplicity, let's just stick with herbs and spices for the time being:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
We can then build a Boolean DataFrame consisting of True and False values, indicating whether each ingredient appears in the list:
import re
spice_df = pd.DataFrame({spice: recipes.ingredients.str.contains(spice, flags=re.IGNORECASE)
                         for spice in spice_list})
spice_df.head()
Now, as an example, let's say we'd like to find a recipe that uses parsley, paprika, and tarragon. We can compute this very quickly using the query() method of DataFrames, discussed in High-Performance Pandas: eval() and query():
selection = spice_df.query('parsley & paprika & tarragon')
len(selection)
We find only 10 recipes with this combination; let's use the index returned by this selection to discover the names of the recipes that have this combination:
recipes.name[selection.index]
Now that we have narrowed down our recipe selection by a factor of almost 20,000, we are in a position to make a more informed decision about what we'd like to cook for dinner.
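The whole pattern can be reproduced end to end on a toy table (the names and ingredient strings below are invented, since the real dataset isn't bundled here):

```python
import pandas as pd

# Invented stand-in for the recipes DataFrame
toy = pd.DataFrame({
    'name': ['stew', 'soup', 'rub'],
    'ingredients': ['salt, paprika, parsley, tarragon',
                    'salt, pepper',
                    'paprika, cumin']})

spices = ['parsley', 'paprika', 'tarragon']

# Boolean indicator frame: does each spice appear in each recipe?
spice_df = pd.DataFrame({s: toy.ingredients.str.contains(s)
                         for s in spices})

# Select recipes containing all three spices
selection = spice_df.query('parsley & paprika & tarragon')
print(toy.name[selection.index].tolist())  # ['stew']
```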
Going further with recipes¶
Hopefully this example has given you a bit of a flavor (ba-dum!) for the types of data cleaning operations that are efficiently enabled by Pandas string methods. Of course, building a very robust recipe recommendation system would require a lot more work! Extracting full ingredient lists from each recipe would be an important piece of the task; unfortunately, the wide variety of formats used makes this a relatively time-consuming process. This points to the truism that in data science, cleaning and munging of real-world data often comprises the majority of the work, and Pandas provides the tools that can help you do this efficiently.