How to Use Pandas Crosstab: A Practical Introduction

Pandas’ crosstab function is a powerful tool for data exploration and analysis. It computes frequency distributions and cross-tabulations of two (or more) factors, which gives you a clear picture of how those factors relate and can reveal patterns that simple aggregations hide. This article provides a practical introduction, walking you through common use cases and customization options.

1. The Basics: Frequency Counts

At its core, crosstab computes a frequency table. Let’s start with a simple example. Imagine we have data on passengers on the Titanic, including their class (Pclass), whether they survived (Survived), and their sex (Sex):

```python
import pandas as pd
import numpy as np  # used later to introduce missing values

# Create a sample DataFrame (you can load the Titanic dataset from Seaborn if you want)
data = {
    'Pclass': [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    'Survived': [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    'Sex': ['female', 'male', 'female', 'male', 'male', 'female', 'female', 'male', 'female', 'male']
}
df = pd.DataFrame(data)

# Basic crosstab: Pclass vs. Survived
cross_tab_basic = pd.crosstab(df['Pclass'], df['Survived'])
print(cross_tab_basic)
```

This produces:

    Survived  0  1
    Pclass
    1         1  2
    2         1  2
    3         3  1

This table shows the count of passengers in each combination of Pclass and Survived. For example, there were two passengers in first class (Pclass=1) who survived (Survived=1).
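To make the counting concrete, the same table can be rebuilt by grouping on both columns and counting rows. This is only a sketch for comparison, not a claim about how crosstab works internally; the name manual_counts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass':   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    'Survived': [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
})

cross_tab_basic = pd.crosstab(df['Pclass'], df['Survived'])

# Count rows per (Pclass, Survived) pair, then pivot Survived into columns
manual_counts = df.groupby(['Pclass', 'Survived']).size().unstack(fill_value=0)

# Both approaches should give the same counts
print((cross_tab_basic.values == manual_counts.values).all())

# Individual cells can be read with .loc[row_label, column_label]
print(cross_tab_basic.loc[1, 1])  # first-class passengers who survived
```

Reading single cells with .loc is handy when you want to quote one number from the table rather than print all of it.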

2. Normalization (Percentages)

Often, you’re more interested in proportions or percentages than raw counts. crosstab makes this incredibly easy with the normalize argument.

```python
# Normalize by rows (across each Pclass)
cross_tab_row_norm = pd.crosstab(df['Pclass'], df['Survived'], normalize='index')  # or 0
print(cross_tab_row_norm)

# Normalize by columns (across each Survived status)
cross_tab_col_norm = pd.crosstab(df['Pclass'], df['Survived'], normalize='columns')  # or 1
print(cross_tab_col_norm)

# Normalize by total
cross_tab_all_norm = pd.crosstab(df['Pclass'], df['Survived'], normalize='all')  # or True
print(cross_tab_all_norm)
```

  • normalize='index' (or 0): Calculates proportions within each row. This shows the proportion of passengers who survived or died given their class.
  • normalize='columns' (or 1): Calculates proportions within each column. This shows the proportion of passengers in each class given their survival status.
  • normalize='all' (or True): Calculates proportions based on the grand total of all observations. This shows the overall proportion of the dataset in each combination.

The output for normalize='index' would be:

    Survived         0         1
    Pclass
    1         0.333333  0.666667
    2         0.333333  0.666667
    3         0.750000  0.250000

This tells us that 66.67% of first-class passengers in our sample survived.
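The normalized table holds fractions, not percentages. If you prefer percentages, one option is to scale and round the result yourself. A minimal sketch on the same sample data; the variable name pct is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass':   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    'Survived': [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
})

# Row-normalized fractions, scaled to percentages and rounded to one decimal
pct = (pd.crosstab(df['Pclass'], df['Survived'], normalize='index') * 100).round(1)
print(pct)

# Each row of a row-normalized table should add up to roughly 100 percent
print(pct.sum(axis=1))
```

Rounding can leave row sums slightly off 100, so keep the unrounded table if you need to compute further with it.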

3. Adding Margins (Totals)

To get row and column totals, use the margins argument:

```python
cross_tab_margins = pd.crosstab(df['Pclass'], df['Survived'], margins=True, margins_name="Total")
print(cross_tab_margins)
```

This adds a “Total” row and column, showing the overall counts:

    Survived  0  1  Total
    Pclass
    1         1  2      3
    2         1  2      3
    3         3  1      4
    Total     5  5     10

The margins_name argument sets the label of the margin row and column; if you omit it, pandas uses “All”.
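Margins can also be combined with normalize. The sketch below assumes the behavior of current pandas releases, where row normalization with margins=True appends a single margin row holding the overall distribution; it keeps the default margin label “All”.

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass':   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    'Survived': [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
})

# Row-normalized table plus an 'All' row with the overall survival split
cross_tab_norm_margins = pd.crosstab(
    df['Pclass'], df['Survived'], normalize='index', margins=True
)
print(cross_tab_norm_margins)

# Compare one class against the overall proportion of survivors
print(cross_tab_norm_margins.loc[3, 1], cross_tab_norm_margins.loc['All', 1])
```

This makes it easy to see whether a class did better or worse than the sample as a whole.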

4. Aggregation with values and aggfunc

crosstab is not limited to counting. You can use it to aggregate other values using the values and aggfunc arguments. This is where it becomes similar to a pivot table.

```python
# Example: Average age of passengers by Pclass and Survived
# (Let's add an 'Age' column for this example)
df['Age'] = [29, 35, 22, 45, 18, 28, 30, 50, 25, 40]  # Hypothetical ages

cross_tab_agg = pd.crosstab(df['Pclass'], df['Survived'], values=df['Age'], aggfunc='mean')
print(cross_tab_agg)
```

This will output the average age of passengers for each Pclass and Survived combination.

    Survived     0     1
    Pclass
    1         45.0  29.5
    2         40.0  25.0
    3         26.0  50.0

You can use any valid aggregation function with aggfunc, including:

  • 'mean' (average)
  • 'sum' (total)
  • 'min' (minimum)
  • 'max' (maximum)
  • 'std' (standard deviation)
  • 'median' (median)
  • 'count' (number of non-null values – same as default behavior without values)
  • A custom function (e.g., lambda x: x.quantile(0.75))
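As an illustration of the custom-function option, the sketch below computes the 75th percentile of age per cell. It reuses the hypothetical ages from the example above; the name cross_tab_q75 is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass':   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    'Survived': [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    'Age':      [29, 35, 22, 45, 18, 28, 30, 50, 25, 40],  # hypothetical ages
})

# aggfunc receives the Age values of each (Pclass, Survived) cell as a Series
cross_tab_q75 = pd.crosstab(
    df['Pclass'], df['Survived'],
    values=df['Age'],
    aggfunc=lambda ages: ages.quantile(0.75),
)
print(cross_tab_q75)
```

Note that values and aggfunc must be passed together; supplying only one of them raises an error.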

5. Handling Multiple Factors

crosstab can handle more than two factors. Simply provide lists to the index and columns arguments.

```python
# Cross-tabulation of Pclass, Survived, and Sex
cross_tab_multi = pd.crosstab([df['Pclass'], df['Sex']], df['Survived'])
print(cross_tab_multi)
```

This produces:

    Survived       0  1
    Pclass Sex
    1      female  0  2
           male    1  0
    2      female  0  2
           male    1  0
    3      female  1  0
           male    2  1

The result now has a multi-level index (Pclass and Sex).
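A multi-level index takes a little care to read from. The sketch below shows two common ways to select from it, using standard pandas indexing; the names third_class_men and women_by_class are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass':   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    'Survived': [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    'Sex': ['female', 'male', 'female', 'male', 'male',
            'female', 'female', 'male', 'female', 'male'],
})

cross_tab_multi = pd.crosstab([df['Pclass'], df['Sex']], df['Survived'])

# One row: pass the full (Pclass, Sex) key as a tuple
third_class_men = cross_tab_multi.loc[(3, 'male')]
print(third_class_men)

# One level: keep only the 'female' rows, leaving Pclass as the index
women_by_class = cross_tab_multi.xs('female', level='Sex')
print(women_by_class)
```

The level names (Pclass and Sex) come from the names of the Series passed to crosstab.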

6. Handling Missing Values (NaN)

By default (dropna=True), crosstab leaves missing values (NaN) in the row or column factors out of the table. The dropna argument is the switch for this behavior:

```python
# Introduce a missing value in the 'Sex' column
df.loc[0, 'Sex'] = np.nan

# Default behavior (dropna=True): Missing values are excluded
cross_tab_nan_default = pd.crosstab(df['Sex'], df['Survived'])
print(cross_tab_nan_default)

# Include missing values (dropna=False): Creates a separate category for NaN
cross_tab_nan_include = pd.crosstab(df['Sex'], df['Survived'], dropna=False)
print(cross_tab_nan_include)
```

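With dropna=False, recent pandas versions add a separate row (or column) for NaN values in the index (or columns); older versions may still omit it, so check the output on your installation. Seeing the NaN category is important for understanding the completeness of your data.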
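If you want missing values counted regardless of how your pandas version treats dropna, one alternative is to replace NaN with an explicit label before tabulating. This is a sketch of that workaround, not part of crosstab itself; the label 'Missing' is an arbitrary choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Survived': [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    'Sex': ['female', 'male', 'female', 'male', 'male',
            'female', 'female', 'male', 'female', 'male'],
})
df.loc[0, 'Sex'] = np.nan  # same missing value as in the example above

# Give NaN an explicit label so it shows up as an ordinary category
sex_filled = df['Sex'].fillna('Missing')
cross_tab_missing = pd.crosstab(sex_filled, df['Survived'])
print(cross_tab_missing)
```

Because the label is an ordinary string, every observation lands in some row and the table total matches the number of rows in the DataFrame.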

7. Customizing with rownames and colnames

You can change the names of the index and column levels using rownames and colnames:

```python
cross_tab_rename = pd.crosstab(df['Pclass'], df['Survived'],
                               rownames=['Passenger Class'],
                               colnames=['Survival Status'])
print(cross_tab_rename)
```

This makes your output more readable.

8. Key Differences from pivot_table

While crosstab and pivot_table share similarities (especially when using values and aggfunc), there are important distinctions:

  • Input: crosstab typically works directly with Series (columns of a DataFrame), while pivot_table operates on the entire DataFrame.
  • Default behavior: crosstab defaults to frequency counts, while pivot_table defaults to averaging numeric values.
  • Flexibility: pivot_table is generally more flexible for complex reshaping and aggregations, for example applying multiple aggregation functions simultaneously. crosstab is simpler and more focused on cross-tabulations.
  • Dummy variables: crosstab does not create dummy (indicator) variables itself. Use pd.get_dummies(series) for that, and pd.crosstab when you want counts of category combinations.
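To see the overlap between the two functions, the sketch below builds the same table of average ages both ways. It reuses the hypothetical ages from section 4 and relies on pivot_table's default aggregation being the mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Pclass':   [1, 3, 2, 1, 3, 2, 1, 3, 3, 2],
    'Survived': [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    'Age':      [29, 35, 22, 45, 18, 28, 30, 50, 25, 40],  # hypothetical ages
})

# crosstab: pass the Series themselves and name the aggregation explicitly
ct_mean = pd.crosstab(df['Pclass'], df['Survived'], values=df['Age'], aggfunc='mean')

# pivot_table: pass column names; the mean is the default aggregation
pt_mean = df.pivot_table(index='Pclass', columns='Survived', values='Age')

# The two tables should hold the same averages
print(np.allclose(ct_mean.values, pt_mean.values))
```

Which one to use is mostly a matter of convenience: crosstab when you already have the Series in hand, pivot_table when you are working from a DataFrame by column name.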

In summary, pd.crosstab is a concise and powerful tool for exploring categorical data relationships. It’s especially useful for generating frequency tables, calculating percentages, and performing simple aggregations. For more intricate reshaping and multiple aggregations, pivot_table might be a better choice, but crosstab excels in its simplicity and directness for cross-tabulation tasks.
