How to Use Pandas Crosstab: A Practical Introduction
Pandas' `crosstab` function is a powerful tool for data exploration and analysis. It allows you to quickly compute frequency distributions and cross-tabulations of two (or more) factors. This gives you a clear picture of the relationships within your data, often revealing insights you might miss with simple aggregations. This article provides a practical introduction, walking you through various use cases and customization options.
1. The Basics: Frequency Counts
At its core, `crosstab` computes a frequency table. Let's start with a simple example. Imagine we have data on passengers on the Titanic, including their class (`Pclass`), whether they survived (`Survived`), and their sex (`Sex`):
```python
import pandas as pd
import numpy as np  # used later to introduce missing values

# Create a sample DataFrame (you can load the Titanic dataset from Seaborn if you want)
data = {
    'Pclass': [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    'Survived': [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    'Sex': ['female', 'male', 'female', 'male', 'male', 'female', 'female', 'male', 'female', 'male']
}
df = pd.DataFrame(data)

# Basic crosstab: Pclass vs. Survived
cross_tab_basic = pd.crosstab(df['Pclass'], df['Survived'])
print(cross_tab_basic)
```
This produces:

    Survived  0  1
    Pclass
    1         1  2
    2         1  2
    3         3  1

This table shows the count of passengers in each combination of `Pclass` and `Survived`. For example, there were two passengers in first class (`Pclass=1`) who survived (`Survived=1`).
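One way to see what `crosstab` is doing is to rebuild the same table by hand with `groupby`. The sketch below assumes the sample data from this section; the name `by_hand` is only illustrative.

```python
import pandas as pd

# Same sample data as in the article
df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    "Survived": [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
})

# Count rows per (Pclass, Survived) pair, then move Survived into the columns
by_hand = df.groupby(["Pclass", "Survived"]).size().unstack(fill_value=0)

cross_tab_basic = pd.crosstab(df["Pclass"], df["Survived"])

print(by_hand)
# True when both approaches give the same counts
print((by_hand.values == cross_tab_basic.values).all())
```

For a plain frequency table the two approaches should agree; `crosstab` just saves the grouping and reshaping steps.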
2. Normalization (Percentages)
Often, you're more interested in proportions or percentages than raw counts. `crosstab` makes this easy with the `normalize` argument.
```python
# Normalize by rows (across each Pclass)
cross_tab_row_norm = pd.crosstab(df['Pclass'], df['Survived'], normalize='index')  # or 0
print(cross_tab_row_norm)

# Normalize by columns (across each Survived status)
cross_tab_col_norm = pd.crosstab(df['Pclass'], df['Survived'], normalize='columns')  # or 1
print(cross_tab_col_norm)

# Normalize by total
cross_tab_all_norm = pd.crosstab(df['Pclass'], df['Survived'], normalize='all')  # or True
print(cross_tab_all_norm)
```
- `normalize='index'` (or `0`): Calculates proportions within each row. This shows the proportion of passengers who survived or died given their class.
- `normalize='columns'` (or `1`): Calculates proportions within each column. This shows the proportion of passengers in each class given their survival status.
- `normalize='all'` (or `True`): Calculates proportions based on the grand total of all observations. This shows the overall share of the dataset in each combination.
The output for `normalize='index'` would be:

    Survived         0         1
    Pclass
    1         0.333333  0.666667
    2         0.333333  0.666667
    3         0.750000  0.250000

This tells us that 66.67% of first-class passengers in our sample survived.
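The normalized table holds fractions between 0 and 1, so a small extra step is needed for percentages. A minimal sketch, assuming the same sample data; the names `row_share` and `row_pct` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    "Survived": [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
})

# Row-normalized shares: each Pclass row sums to 1
row_share = pd.crosstab(df["Pclass"], df["Survived"], normalize="index")

# Convert to percentages and round for display
row_pct = (row_share * 100).round(1)

print(row_pct)
print(row_share.sum(axis=1))  # sanity check on the row sums
```

Checking that the rows (or columns) sum to 1 is a quick way to confirm you normalized along the axis you intended.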
3. Adding Margins (Totals)
To get row and column totals, use the `margins` argument:
```python
cross_tab_margins = pd.crosstab(df['Pclass'], df['Survived'], margins=True, margins_name="Total")
print(cross_tab_margins)
```
This adds a "Total" row and column, showing the overall counts:

    Survived  0  1  Total
    Pclass
    1         1  2      3
    2         1  2      3
    3         3  1      4
    Total     5  5     10

The `margins_name` argument sets the label of the margin row and column; if you omit it, the label defaults to "All".
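Because the margins are ordinary cells of the result, you can use them in further calculations. The sketch below, which assumes the same sample data, derives survival rates by dividing the survivor column by the "Total" column; `survival_rate` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    "Survived": [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
})

ct = pd.crosstab(df["Pclass"], df["Survived"], margins=True, margins_name="Total")

# Survivors divided by class size; the "Total" row gives the overall rate
survival_rate = ct[1] / ct["Total"]

print(survival_rate)
```

The per-class values should match the `Survived=1` column you get from `normalize='index'`.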
4. Aggregation with `values` and `aggfunc`
`crosstab` is not limited to counting. You can use it to aggregate other values using the `values` and `aggfunc` arguments, which must be supplied together. This is where it becomes similar to a pivot table.
```python
# Example: Average age of passengers by Pclass and Survived
# (Let's add an 'Age' column for this example)
df['Age'] = [29, 35, 22, 45, 18, 28, 30, 50, 25, 40]  # Hypothetical ages

cross_tab_agg = pd.crosstab(df['Pclass'], df['Survived'], values=df['Age'], aggfunc='mean')
print(cross_tab_agg)
```
This will output the average age of passengers for each `Pclass` and `Survived` combination.

    Survived     0     1
    Pclass
    1         45.0  29.5
    2         40.0  25.0
    3         26.0  50.0
You can use any valid aggregation function with `aggfunc`, including:
- `'mean'` (average)
- `'sum'` (total)
- `'min'` (minimum)
- `'max'` (maximum)
- `'std'` (standard deviation)
- `'median'` (median)
- `'count'` (number of non-null values, similar to the default behavior without `values`)
- A custom function (e.g., `lambda x: x.quantile(0.75)`)
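To make the custom-function option concrete, here is a small sketch using the hypothetical ages from above. The helper name `third_quartile` is made up for illustration; any callable that reduces a Series to a single value should work the same way.

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    "Survived": [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    "Age":      [29, 35, 22, 45, 18, 28, 30, 50, 25, 40],  # hypothetical ages
})

def third_quartile(ages):
    """Return the 75th percentile of a group of ages."""
    return ages.quantile(0.75)

# Custom function: 75th percentile of Age in each cell
age_q75 = pd.crosstab(df["Pclass"], df["Survived"],
                      values=df["Age"], aggfunc=third_quartile)

# Built-in aggregation by name: oldest passenger in each cell
age_max = pd.crosstab(df["Pclass"], df["Survived"],
                      values=df["Age"], aggfunc="max")

print(age_q75)
print(age_max)
```

A named function is used here instead of a lambda only so the intent is easier to read.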
5. Handling Multiple Factors
`crosstab` can handle more than two factors. Simply provide lists to the `index` and `columns` arguments.
```python
# Cross-tabulation of Pclass, Survived, and Sex
cross_tab_multi = pd.crosstab([df['Pclass'], df['Sex']], df['Survived'])
print(cross_tab_multi)
```
This produces:

    Survived       0  1
    Pclass Sex
    1      female  0  2
           male    1  0
    2      female  0  2
           male    1  0
    3      female  1  0
           male    2  1

The result now has a multi-level index (a `MultiIndex`) made up of `Pclass` and `Sex`.
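Once the table has a multi-level index, it helps to know how to read values back out of it. The sketch below assumes the same sample data and shows a single-cell lookup, a slice on one level, and the alternative of putting the extra factor in the columns; `female_only` and `wide` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    "Survived": [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    "Sex": ["female", "male", "female", "male", "male",
            "female", "female", "male", "female", "male"],
})

cross_tab_multi = pd.crosstab([df["Pclass"], df["Sex"]], df["Survived"])

# Single cell: third-class men who survived
third_class_men_survived = cross_tab_multi.loc[(3, "male"), 1]

# All rows for one level value, indexed by Pclass only
female_only = cross_tab_multi.xs("female", level="Sex")

# The extra factor can go in the columns instead of the index
wide = pd.crosstab(df["Pclass"], [df["Sex"], df["Survived"]])

print(third_class_men_survived)
print(female_only)
print(wide)
```

Whether the extra factor belongs in the rows or the columns is mostly a readability choice; the counts are the same either way.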
6. Handling Missing Values (NaN)
By default, `crosstab` ignores rows with missing values (NaN) in the specified columns. You can control this with the `dropna` argument:
```python
# Introduce some missing values in the 'Sex' column
df.loc[0, 'Sex'] = np.nan

# Default behavior (dropna=True): Missing values are excluded
cross_tab_nan_default = pd.crosstab(df['Sex'], df['Survived'])
print(cross_tab_nan_default)

# Include missing values (dropna=False): Creates a separate category for NaN
cross_tab_nan_include = pd.crosstab(df['Sex'], df['Survived'], dropna=False)
print(cross_tab_nan_include)
```
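`dropna=False` creates a separate row (or column) for NaN values in the index (or columns), which is useful for understanding the completeness of your data. The handling of NaN keys has changed across pandas releases, so confirm the output on your installed version.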
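If you want missing values to appear as their own category without depending on how `dropna` behaves, one option (not the only one) is to label them explicitly before tabulating. This is a sketch under that assumption; the label "Unknown" and the name `sex_labeled` are arbitrary choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    "Sex": ["female", "male", "female", "male", "male",
            "female", "female", "male", "female", "male"],
})
df.loc[0, "Sex"] = np.nan  # introduce one missing value

# Default behavior: the row with the missing key is left out of the counts
default_ct = pd.crosstab(df["Sex"], df["Survived"])

# Give missing values an explicit label so they are counted as a category
sex_labeled = df["Sex"].fillna("Unknown")
labeled_ct = pd.crosstab(sex_labeled, df["Survived"])

print(default_ct)
print(labeled_ct)
```

Comparing the grand totals of the two tables shows how many rows the default silently leaves out.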
7. Customizing with `rownames` and `colnames`
You can change the names of the index and column levels using `rownames` and `colnames`:
```python
cross_tab_rename = pd.crosstab(df['Pclass'], df['Survived'],
                               rownames=['Passenger Class'],
                               colnames=['Survival Status'])
print(cross_tab_rename)
```
This makes your output more readable.
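These arguments matter most when the inputs carry no names of their own. A small sketch, assuming the same sample values held in plain NumPy arrays rather than DataFrame columns: without names, pandas falls back to generic labels.

```python
import numpy as np
import pandas as pd

pclass = np.array([1, 3, 2, 1, 3, 2, 1, 3, 3, 2])
survived = np.array([1, 0, 1, 0, 0, 1, 1, 1, 0, 0])

# Arrays have no name, so the axes get generic labels
unnamed = pd.crosstab(pclass, survived)
print(unnamed.index.name, unnamed.columns.name)

# Supplying names explicitly gives a self-describing table
named = pd.crosstab(pclass, survived,
                    rownames=["Passenger Class"],
                    colnames=["Survival Status"])
print(named)
```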
8. Key Differences from pivot_table
While `crosstab` and `pivot_table` share similarities (especially when using `values` and `aggfunc`), there are important distinctions:
- Input: `crosstab` typically works directly with Series (columns of a DataFrame), while `pivot_table` operates on the entire DataFrame.
- Default behavior: `crosstab` defaults to frequency counts, while `pivot_table` defaults to averaging numeric values.
- Flexibility: `pivot_table` is generally more flexible for complex reshaping and aggregations, such as aggregating several value columns or applying multiple aggregation functions at once. `crosstab` is simpler and more focused on cross-tabulations.
- Dummy variables: `crosstab` does not create dummy variables itself; `pd.get_dummies(series)` does that, and the resulting indicator columns can then be passed to `pd.crosstab` like any other Series.
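The first two differences are easiest to see side by side. The sketch below assumes the sample data and hypothetical ages used earlier, and builds the same two tables with each function; counting the `Sex` column in `pivot_table` is just one convenient way to get row counts.

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    "Survived": [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    "Sex": ["female", "male", "female", "male", "male",
            "female", "female", "male", "female", "male"],
    "Age": [29, 35, 22, 45, 18, 28, 30, 50, 25, 40],  # hypothetical ages
})

# Frequency counts: crosstab takes Series, pivot_table takes column names
counts_ct = pd.crosstab(df["Pclass"], df["Survived"])
counts_pt = df.pivot_table(index="Pclass", columns="Survived",
                           values="Sex", aggfunc="count", fill_value=0)

# Mean age: crosstab needs values and aggfunc together
mean_ct = pd.crosstab(df["Pclass"], df["Survived"],
                      values=df["Age"], aggfunc="mean")
mean_pt = df.pivot_table(index="Pclass", columns="Survived",
                         values="Age", aggfunc="mean")

print(counts_pt)
print(mean_pt)
```

For simple cases like these the results match, so the choice comes down to which call reads more naturally for your data.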
In summary, `pd.crosstab` is a concise and powerful tool for exploring categorical data relationships. It's especially useful for generating frequency tables, calculating percentages, and performing simple aggregations. For more intricate reshaping and multiple aggregations, `pivot_table` might be a better choice, but `crosstab` excels in its simplicity and directness for cross-tabulation tasks.