Pandas: Identifying and Handling Duplicate Data (A Beginner’s Guide)
In the world of data analysis and data science, the quality of your data is paramount. Raw data collected from various sources is often messy, incomplete, and contains inconsistencies. One of the most common issues encountered during the data cleaning process is the presence of duplicate records. Duplicate data can skew analyses, lead to incorrect conclusions, inflate counts, waste storage space, and potentially cause errors in downstream processes like machine learning model training or reporting.
Fortunately, the Pandas library, the cornerstone of data manipulation in Python, provides powerful and flexible tools specifically designed to identify and handle duplicate data effectively. This guide is aimed at beginners who are starting their journey with Pandas and need a comprehensive understanding of how to tackle duplicate entries in their datasets (represented as Pandas DataFrames or Series).
We will cover:
- What Constitutes Duplicate Data? – Understanding different types of duplicates.
- Why Duplicate Data is Problematic – The impact on analysis and results.
- Setting Up Your Environment – Importing Pandas and creating sample data.
- Identifying Duplicates with `.duplicated()` – The fundamental method for detection: basic usage on Series and DataFrames, the `keep` parameter (`'first'`, `'last'`, `False`), and filtering to view duplicate rows.
- Removing Duplicates with `.drop_duplicates()` – The primary method for removal: basic usage, the `subset` parameter for targeted duplication checks, the `keep` parameter to control which duplicate to retain, and the `inplace` parameter for direct modification.
- Handling Nuances and Edge Cases – Case sensitivity, leading/trailing whitespace, data types, and how missing values (NaN) are handled.
- Counting Duplicates – Getting summaries of duplicate occurrences.
- Practical Workflow and Best Practices – A step-by-step approach.
- Putting It All Together: A More Complex Example – Applying the concepts.
- Conclusion – Key takeaways and next steps.
Let’s embark on this journey to master duplicate data handling in Pandas!
1. What Constitutes Duplicate Data?
At its core, a duplicate record is an entry in your dataset that is identical to another entry. However, the definition of “identical” can vary depending on the context:
- Full Row Duplicates: An entire row has the exact same values across all columns as another row. This is the simplest form of duplication.
- Partial Duplicates (Based on Key Columns): Sometimes, duplication is defined based on a subset of columns. For example, in a customer database, you might consider two rows duplicates if they have the same `CustomerID`, or the same combination of `FirstName`, `LastName`, and `DateOfBirth`, even if other columns like `LastPurchaseDate` differ. These "key" columns define the uniqueness of a record.
- Near Duplicates: These are trickier. Records might be almost identical due to typos, variations in formatting (e.g., "St." vs. "Street"), case differences ("apple" vs. "Apple"), or leading/trailing whitespace (" value " vs. "value"). While Pandas' basic duplication functions primarily handle exact matches, recognizing near duplicates often requires additional data cleaning steps before checking for exact duplicates.
This guide will focus primarily on handling exact duplicates, both full-row and partial (subset-based), as these are directly addressed by Pandas’ core functions. We’ll also touch upon preprocessing steps for handling common causes of near-duplicates like case and whitespace.
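As a tiny illustration of why near duplicates slip past an exact check, consider the hypothetical values below: the strings only register as duplicates after whitespace and case are normalized (the techniques covered in detail later).

```python
import pandas as pd

s = pd.Series(['Street', 'street ', ' STREET', 'Avenue'])

print(s.duplicated().sum())                           # 0 – all four strings differ exactly
print(s.str.strip().str.lower().duplicated().sum())   # 2 – the normalized 'street' repeats twice
```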
2. Why Duplicate Data is Problematic
Ignoring duplicate data can lead to significant issues:
- Skewed Statistical Analysis: Measures like mean, median, and counts will be distorted. If 10% of your sales data is duplicated, your total sales figures will be artificially inflated.
- Incorrect Reporting: Business reports based on duplicated data will present a false picture of reality, potentially leading to poor decision-making.
- Biased Machine Learning Models: If duplicates are prevalent in training data, models might overweight certain patterns, leading to poor generalization on new, unseen data.
- Wasted Resources: Storing and processing duplicate data consumes unnecessary storage space and computational power.
- Operational Issues: Sending duplicate emails to customers, double-billing, or maintaining conflicting records for the same entity can damage customer relationships and operational efficiency.
Therefore, identifying and appropriately handling duplicates is a crucial step in any data preparation pipeline.
3. Setting Up Your Environment
Before we dive into the methods, let’s ensure we have Pandas installed and import it. We’ll also create some sample DataFrames to illustrate the concepts.
If you don’t have Pandas installed, you can install it using pip:
```bash
pip install pandas
```
Now, let’s start our Python script or Jupyter Notebook session by importing Pandas:
```python
import pandas as pd
import numpy as np  # Often useful, especially for NaNs

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
```
Next, let’s create a few simple DataFrames that we will use throughout this guide.
DataFrame 1: Simple Duplicates (`df_simple`)
This DataFrame contains obvious, full-row duplicates.
```python
data_simple = {'col_a': ['A', 'B', 'C', 'A', 'B', 'D'],
               'col_b': [1, 2, 3, 1, 2, 4]}
df_simple = pd.DataFrame(data_simple)
print("DataFrame: df_simple")
print(df_simple)
print("-" * 30)
```
Output:
```
DataFrame: df_simple
  col_a  col_b
0     A      1
1     B      2
2     C      3
3     A      1   # Duplicate of row 0
4     B      2   # Duplicate of row 1
5     D      4
```
DataFrame 2: Duplicates based on Subset (`df_subset`)
Here, duplicates exist if we only consider `col_x`, but the rows are not identical overall.
```python
data_subset = {'col_x': ['X', 'Y', 'X', 'Z', 'Y', 'X'],
               'col_y': [10, 20, 30, 40, 50, 60],
               'col_z': [True, False, True, False, True, False]}
df_subset = pd.DataFrame(data_subset)
print("DataFrame: df_subset")
print(df_subset)
print("-" * 30)
```
Output:
```
DataFrame: df_subset
  col_x  col_y  col_z
0     X     10   True
1     Y     20  False
2     X     30   True   # col_x is 'X' (like row 0), but col_y differs
3     Z     40  False
4     Y     50   True   # col_x is 'Y' (like row 1), but col_y, col_z differ
5     X     60  False   # col_x is 'X' (like rows 0, 2), but others differ
```
DataFrame 3: Duplicates with Variations (`df_variations`)
This one includes case-sensitivity issues, whitespace issues, and missing values (NaN).
```python
data_variations = {'ID': [1, 2, 3, 1, 4, 2, 5, 3],
                   'Name': ['Alice', 'Bob', 'Charlie', 'alice', ' David ', 'Bob', 'Eve', 'charlie '],
                   'Value': [100, 200, np.nan, 100, 400, 200, 500, np.nan]}
df_variations = pd.DataFrame(data_variations)
print("DataFrame: df_variations")
print(df_variations)
print("-" * 30)
```
Output:
```
DataFrame: df_variations
   ID      Name  Value
0   1     Alice  100.0
1   2       Bob  200.0
2   3   Charlie    NaN
3   1     alice  100.0   # Different case in Name
4   4    David   400.0   # Whitespace in Name
5   2       Bob  200.0   # Full duplicate of row 1
6   5       Eve  500.0
7   3  charlie     NaN   # Whitespace and case difference
```
Now we have our sample data ready to explore Pandas’ duplication tools.
4. Identifying Duplicates with .duplicated()
The primary method for identifying duplicate rows without removing them is `.duplicated()`. It can be called on both Pandas Series (single columns) and DataFrames (multiple columns).
How it Works:
`.duplicated()` returns a boolean Series (a Series of `True`/`False` values) with the same index as the original Series or DataFrame.
* `True` indicates that the row (or value, for a Series) is a duplicate of a previous row/value.
* `False` indicates that the row/value is unique so far, or is the first occurrence of a value that appears again later.
Basic Usage on a Series
Let's apply it to a single column from `df_simple`:
```python
# Check for duplicates in 'col_a' of df_simple
duplicates_in_col_a = df_simple['col_a'].duplicated()
print("Duplicates in df_simple['col_a']:")
print(duplicates_in_col_a)
```
Output:
Duplicates in df_simple['col_a']:
0 False # First 'A'
1 False # First 'B'
2 False # First 'C'
3 True # Second 'A', marked as duplicate
4 True # Second 'B', marked as duplicate
5 False # First 'D'
Name: col_a, dtype: bool
As you can see, the first occurrences of 'A' (index 0) and 'B' (index 1) are marked `False`, while their subsequent occurrences (index 3 and 4) are marked `True`.
Basic Usage on a DataFrame
When called on a DataFrame without any arguments, `.duplicated()` checks for duplicates based on all columns. A row is marked `True` only if all of its values match all the values in a preceding row.
```python
# Check for full-row duplicates in df_simple
full_row_duplicates = df_simple.duplicated()
print("\nFull row duplicates in df_simple:")
print(full_row_duplicates)
```
Output:
Full row duplicates in df_simple:
0 False # ('A', 1) - First occurrence
1 False # ('B', 2) - First occurrence
2 False # ('C', 3) - First occurrence
3 True # ('A', 1) - Duplicate of row 0
4 True # ('B', 2) - Duplicate of row 1
5 False # ('D', 4) - First occurrence
dtype: bool
This confirms that rows at index 3 and 4 are exact duplicates of rows 0 and 1, respectively.
Now let's try it on `df_subset`, where no full rows are identical:
```python
# Check for full-row duplicates in df_subset
full_row_duplicates_subset = df_subset.duplicated()
print("\nFull row duplicates in df_subset:")
print(full_row_duplicates_subset)
```
Output:
Full row duplicates in df_subset:
0 False
1 False
2 False
3 False
4 False
5 False
dtype: bool
As expected, since no row is completely identical to a previous one, all values are `False`.
The `keep` Parameter
Which occurrence gets marked as the duplicate (`True`) is controlled by the `keep` parameter. It has three possible values:
- `keep='first'` (default): marks all occurrences except the first one as duplicates (`True`).
- `keep='last'`: marks all occurrences except the last one as duplicates (`True`).
- `keep=False`: marks every occurrence that is part of a duplicate set as `True`. This is extremely useful for identifying all rows involved in a duplication.
Let's see how `keep` affects the output on `df_simple`:
```python
# Using keep='first' (default)
print("\ndf_simple.duplicated(keep='first'):")
print(df_simple.duplicated(keep='first'))  # Same as df_simple.duplicated()

# Using keep='last'
print("\ndf_simple.duplicated(keep='last'):")
print(df_simple.duplicated(keep='last'))

# Using keep=False
print("\ndf_simple.duplicated(keep=False):")
print(df_simple.duplicated(keep=False))
```
Output:
```
df_simple.duplicated(keep='first'):
0    False
1    False
2    False
3     True   # Second ('A', 1) is duplicate
4     True   # Second ('B', 2) is duplicate
5    False
dtype: bool

df_simple.duplicated(keep='last'):
0     True   # First ('A', 1) is duplicate (because the last one is kept)
1     True   # First ('B', 2) is duplicate (because the last one is kept)
2    False
3    False   # Last ('A', 1) is kept
4    False   # Last ('B', 2) is kept
5    False
dtype: bool

df_simple.duplicated(keep=False):
0     True   # Part of ('A', 1) duplicate set
1     True   # Part of ('B', 2) duplicate set
2    False
3     True   # Part of ('A', 1) duplicate set
4     True   # Part of ('B', 2) duplicate set
5    False
dtype: bool
```
Notice the difference:
* `keep='first'` marks rows 3 and 4 as duplicates.
* `keep='last'` marks rows 0 and 1 as duplicates.
* `keep=False` marks rows 0, 1, 3, and 4 as duplicates, showing us every row that has a twin somewhere in the DataFrame.
Filtering to View Duplicate Rows
The boolean Series returned by `.duplicated()` is incredibly useful for filtering the original DataFrame to see the actual duplicate rows. This is done using boolean indexing.
```python
# Show rows marked as duplicates (keeping the first occurrence)
duplicate_rows_keep_first = df_simple[df_simple.duplicated(keep='first')]
print("\nDuplicate rows in df_simple (keeping first):")
print(duplicate_rows_keep_first)

# Show rows marked as duplicates (keeping the last occurrence)
duplicate_rows_keep_last = df_simple[df_simple.duplicated(keep='last')]
print("\nDuplicate rows in df_simple (keeping last):")
print(duplicate_rows_keep_last)

# Show ALL rows that are part of any duplicate set
all_involved_duplicates = df_simple[df_simple.duplicated(keep=False)]
print("\nAll rows involved in duplicates in df_simple:")
print(all_involved_duplicates)
```
Output:
```
Duplicate rows in df_simple (keeping first):
  col_a  col_b
3     A      1
4     B      2

Duplicate rows in df_simple (keeping last):
  col_a  col_b
0     A      1
1     B      2

All rows involved in duplicates in df_simple:
  col_a  col_b
0     A      1
1     B      2
3     A      1
4     B      2
```
This ability to easily view the duplicate records (especially using `keep=False`) is vital for understanding the nature of the duplication before deciding how to handle it.
Using .duplicated() with a subset
Just as we discussed for partial duplicates, `.duplicated()` can check for duplication based on a specific column or a list of columns via the `subset` parameter.
Let's revisit `df_subset` and check for duplicates based only on `col_x`:
```python
# Check for duplicates based ONLY on 'col_x' in df_subset
duplicates_in_col_x_subset = df_subset.duplicated(subset=['col_x'])  # Note: subset takes a list of column labels
print("\nDuplicates in df_subset based on 'col_x' (keep='first'):")
print(duplicates_in_col_x_subset)

# View the rows marked as duplicates based on 'col_x'
print("\nRows in df_subset where 'col_x' is duplicated (keep='first'):")
print(df_subset[duplicates_in_col_x_subset])

# View ALL rows involved in 'col_x' duplication
print("\nAll rows in df_subset involved in 'col_x' duplication (keep=False):")
print(df_subset[df_subset.duplicated(subset=['col_x'], keep=False)])
```
Output:
```
Duplicates in df_subset based on 'col_x' (keep='first'):
0    False   # First 'X'
1    False   # First 'Y'
2     True   # Second 'X'
3    False   # First 'Z'
4     True   # Second 'Y'
5     True   # Third 'X'
dtype: bool

Rows in df_subset where 'col_x' is duplicated (keep='first'):
  col_x  col_y  col_z
2     X     30   True
4     Y     50   True
5     X     60  False

All rows in df_subset involved in 'col_x' duplication (keep=False):
  col_x  col_y  col_z
0     X     10   True
1     Y     20  False
2     X     30   True
4     Y     50   True
5     X     60  False
```
Now, `.duplicated()` identifies rows 2, 4, and 5 as duplicates because their values in `col_x` ('X', 'Y', and 'X' respectively) have already appeared in earlier rows (index 0 for 'X', index 1 for 'Y'). Using `keep=False` shows all rows (0, 1, 2, 4, and 5) that share a `col_x` value with another row.
You can also specify multiple columns in the `subset` list:
```python
# Check for duplicates based on BOTH 'col_x' and 'col_z'
duplicates_xz = df_subset.duplicated(subset=['col_x', 'col_z'])
print("\nDuplicates in df_subset based on 'col_x' AND 'col_z' (keep='first'):")
print(duplicates_xz)

# View these duplicates
print("\nRows in df_subset where the ('col_x', 'col_z') pair is duplicated:")
print(df_subset[duplicates_xz])
```
Output:
```
Duplicates in df_subset based on 'col_x' AND 'col_z' (keep='first'):
0    False   # First ('X', True)
1    False   # First ('Y', False)
2     True   # Second ('X', True)
3    False   # First ('Z', False)
4    False   # First ('Y', True)
5    False   # First ('X', False)
dtype: bool

Rows in df_subset where the ('col_x', 'col_z') pair is duplicated:
  col_x  col_y  col_z
2     X     30   True
```
Here, only row 2 is marked as a duplicate, because the combination ('X', True) already occurred in row 0.
`.duplicated()` is your inspection tool. It lets you see whether and where duplicates exist according to your definition (all columns or a subset) without changing your data.
5. Removing Duplicates with .drop_duplicates()
Once you have identified duplicates with `.duplicated()` and decided on a strategy for handling them, the `.drop_duplicates()` method is used to remove them. By default it returns a new DataFrame with the duplicate rows removed.
How it Works:
`.drop_duplicates()` scans the DataFrame (or a subset of its columns) and removes rows that are duplicates of rows that are kept. Which rows are kept and which are removed depends on the `subset` and `keep` parameters.
Basic Usage
By default, `.drop_duplicates()` works like `.duplicated()`: it considers all columns and keeps the first occurrence (`keep='first'`).
```python
# Drop full-row duplicates from df_simple, keeping the first occurrence
df_simple_deduplicated = df_simple.drop_duplicates()  # keep='first' is the default
print("\ndf_simple after dropping full-row duplicates (keeping first):")
print(df_simple_deduplicated)

# Note that the original df_simple remains unchanged
print("\nOriginal df_simple (unchanged):")
print(df_simple)
```
Output:
```
df_simple after dropping full-row duplicates (keeping first):
  col_a  col_b
0     A      1
1     B      2
2     C      3
5     D      4

Original df_simple (unchanged):
  col_a  col_b
0     A      1
1     B      2
2     C      3
3     A      1
4     B      2
5     D      4
```
Rows 3 and 4, which were exact duplicates of rows 0 and 1, have been removed. The method returned a new DataFrame, `df_simple_deduplicated`.
Using the `subset` Parameter
As with `.duplicated()`, the `subset` parameter in `.drop_duplicates()` lets you define duplication based on specific columns. Rows will be dropped if their values in the `subset` columns match those of a row that is kept.
Let's use `df_subset` again. Suppose we want to keep only the first row for each unique value in `col_x`:
```python
# Drop duplicates based on 'col_x', keeping the first occurrence
df_subset_dedup_col_x = df_subset.drop_duplicates(subset=['col_x'], keep='first')
print("\ndf_subset after dropping duplicates based on 'col_x' (keeping first):")
print(df_subset_dedup_col_x)
```
Output:
df_subset after dropping duplicates based on 'col_x' (keeping first):
col_x col_y col_z
0 X 10 True # First 'X' kept
1 Y 20 False # First 'Y' kept
3 Z 40 False # First 'Z' kept
Rows 2, 4, and 5 were dropped because their `col_x` values ('X', 'Y', 'X') had already been seen in rows 0 and 1, which were kept.
Using the `keep` Parameter
The `keep` parameter works exactly as it does in `.duplicated()`, but here it determines which row to keep when duplicates are found based on the `subset` (or all columns):
- `keep='first'` (default): keep the first occurrence, drop subsequent duplicates.
- `keep='last'`: keep the last occurrence, drop preceding duplicates.
- `keep=False`: drop all occurrences that are part of a duplicate set. This is useful if you want to remove any record that ever had a duplicate counterpart based on your criteria.
Let's illustrate with `df_simple`:
```python
# Keep the 'last' occurrence of full-row duplicates
df_simple_keep_last = df_simple.drop_duplicates(keep='last')
print("\ndf_simple dropping duplicates, keeping last:")
print(df_simple_keep_last)

# keep=False – drop ALL rows that were ever duplicated
df_simple_keep_false = df_simple.drop_duplicates(keep=False)
print("\ndf_simple dropping duplicates, keeping none (keep=False):")
print(df_simple_keep_false)
```
Output:
```
df_simple dropping duplicates, keeping last:
  col_a  col_b
2     C      3
3     A      1   # Last ('A', 1) kept
4     B      2   # Last ('B', 2) kept
5     D      4

df_simple dropping duplicates, keeping none (keep=False):
  col_a  col_b
2     C      3
5     D      4
```
Now apply `keep` together with `subset` on `df_subset`, based on `col_x`:
```python
# Keep the 'last' row for each unique 'col_x' value
df_subset_keep_last_col_x = df_subset.drop_duplicates(subset=['col_x'], keep='last')
print("\ndf_subset dropping based on 'col_x', keeping last:")
print(df_subset_keep_last_col_x)

# keep=False – drop ALL rows whose 'col_x' value appeared more than once
df_subset_keep_false_col_x = df_subset.drop_duplicates(subset=['col_x'], keep=False)
print("\ndf_subset dropping based on 'col_x', keeping none (keep=False):")
print(df_subset_keep_false_col_x)
```
Output:
```
df_subset dropping based on 'col_x', keeping last:
  col_x  col_y  col_z
3     Z     40  False   # Only 'Z' appears once
4     Y     50   True   # Last 'Y' kept
5     X     60  False   # Last 'X' kept

df_subset dropping based on 'col_x', keeping none (keep=False):
  col_x  col_y  col_z
3     Z     40  False   # Only 'Z' occurred exactly once
```
The choice of `keep` depends entirely on your specific requirements. Do you trust the first entry more? The most recent (last) entry? Or do you want to discard any ambiguous entries entirely?
Using the `inplace` Parameter
By default, `.drop_duplicates()` (like most Pandas manipulation methods) returns a new DataFrame, leaving the original untouched. If you want to modify the original DataFrame directly, you can pass `inplace=True`.
```python
print("\nOriginal df_subset before inplace drop:")
print(df_subset)

# Create a copy to modify in place, preserving the original df_subset for later examples
df_subset_copy = df_subset.copy()

# Drop duplicates based on 'col_x', keeping first, modifying df_subset_copy directly
return_value = df_subset_copy.drop_duplicates(subset=['col_x'], keep='first', inplace=True)
print("\nReturn value of inplace operation:", return_value)  # Note: inplace returns None

print("\ndf_subset_copy after inplace drop:")
print(df_subset_copy)

# Verify the original df_subset is unchanged
print("\nOriginal df_subset (still unchanged):")
print(df_subset)
```
Output:
```
Original df_subset before inplace drop:
  col_x  col_y  col_z
0     X     10   True
1     Y     20  False
2     X     30   True
3     Z     40  False
4     Y     50   True
5     X     60  False

Return value of inplace operation: None

df_subset_copy after inplace drop:
  col_x  col_y  col_z
0     X     10   True
1     Y     20  False
3     Z     40  False

Original df_subset (still unchanged):
  col_x  col_y  col_z
0     X     10   True
1     Y     20  False
2     X     30   True
3     Z     40  False
4     Y     50   True
5     X     60  False
```
Caution with `inplace=True`: While `inplace=True` can seem convenient because it avoids creating a new variable, it is generally recommended that beginners (and often experienced users too) avoid it:
* It makes code harder to debug, because the state of the DataFrame changes silently.
* It breaks the flow of method chaining (applying sequential operations in one expression).
* It rarely offers a meaningful performance benefit.
It’s usually safer and clearer to assign the result back to the original variable or a new variable:
```python
# Safer alternative to inplace=True
df_subset = df_subset.drop_duplicates(subset=['col_x'], keep='first')
# Now df_subset holds the deduplicated result
```
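Returning new DataFrames also keeps method chaining available. Below is a minimal, illustrative sketch using `df_subset` from above; the particular chained steps are just examples, not a prescribed recipe.

```python
# A minimal, illustrative chain: deduplicate, tidy the index, then sort.
# inplace=True would return None at each step and break this pattern.
df_chained = (
    df_subset
    .drop_duplicates(subset=['col_x'], keep='first')
    .reset_index(drop=True)
    .sort_values(by='col_y')
)
print(df_chained)
```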
6. Handling Nuances and Edge Cases
Real-world data often requires preprocessing before applying `.duplicated()` or `.drop_duplicates()`, in order to catch duplicates that aren't immediately obvious due to formatting inconsistencies.
Let's use `df_variations` to explore these.
```python
print("\nDataFrame: df_variations (Recall)")
print(df_variations)

# Check initial duplicates based on 'ID'
print("\nDuplicates based on 'ID' initially:")
print(df_variations.duplicated(subset=['ID'], keep=False))

# Check initial duplicates based on 'Name'
print("\nDuplicates based on 'Name' initially:")
print(df_variations.duplicated(subset=['Name'], keep=False))

# Check initial full-row duplicates
print("\nFull-row duplicates initially:")
print(df_variations.duplicated(keep=False))
```
Output:
```
DataFrame: df_variations (Recall)
   ID      Name  Value
0   1     Alice  100.0
1   2       Bob  200.0
2   3   Charlie    NaN
3   1     alice  100.0
4   4    David   400.0
5   2       Bob  200.0
6   5       Eve  500.0
7   3  charlie     NaN

Duplicates based on 'ID' initially:
0     True   # ID 1
1     True   # ID 2
2     True   # ID 3
3     True   # ID 1
4    False   # ID 4
5     True   # ID 2
6    False   # ID 5
7     True   # ID 3
dtype: bool

Duplicates based on 'Name' initially:
0    False   # Alice
1     True   # Bob
2    False   # Charlie
3    False   # alice (different case)
4    False   #  David  (whitespace)
5     True   # Bob
6    False   # Eve
7    False   # charlie  (whitespace + case)
dtype: bool

Full-row duplicates initially:
0    False
1     True   # ('Bob', 200.0)
2    False   # Not a duplicate of row 7: the Name values differ ('Charlie' vs 'charlie ')
3    False
4    False
5     True   # ('Bob', 200.0) – duplicate of row 1
6    False
7    False
dtype: bool
```
Initially:
* `ID` shows duplicates for the values 1, 2, and 3.
* `Name` only shows 'Bob' as duplicated, because 'Alice'/'alice', 'Charlie'/'charlie ', and ' David ' are treated as distinct strings.
* Only rows 1 and 5 are identified as full-row duplicates.
Handling Case Sensitivity
Problem: ‘Alice’ and ‘alice’ are treated as different names.
Solution: Convert the relevant column(s) to a consistent case (e.g., lowercase) before checking for duplicates.
```python
# Create a temporary column (or modify the column) for checking
df_variations['Name_lower'] = df_variations['Name'].str.lower()
print("\ndf_variations with 'Name_lower':")
print(df_variations)

# Check duplicates based on 'ID' and 'Name_lower'
name_lower_duplicates = df_variations.duplicated(subset=['ID', 'Name_lower'], keep=False)
print("\nDuplicates based on 'ID' and 'Name_lower':")
print(name_lower_duplicates)
print("\nRows involved in ('ID', 'Name_lower') duplication:")
print(df_variations[name_lower_duplicates])

# Don't forget to drop the temporary column if it is no longer needed
df_variations = df_variations.drop(columns=['Name_lower'])
```
Output:
```
df_variations with 'Name_lower':
   ID      Name  Value Name_lower
0   1     Alice  100.0      alice
1   2       Bob  200.0        bob
2   3   Charlie    NaN    charlie
3   1     alice  100.0      alice   # Now matches row 0 based on Name_lower
4   4    David   400.0     david
5   2       Bob  200.0        bob   # Matches row 1
6   5       Eve  500.0        eve
7   3  charlie     NaN   charlie    # Name matches row 2, but whitespace is still an issue

Duplicates based on 'ID' and 'Name_lower':
0     True   # (1, 'alice')
1     True   # (2, 'bob')
2    False   # (3, 'charlie') – note: whitespace still matters!
3     True   # (1, 'alice')
4    False   # (4, ' david ')
5     True   # (2, 'bob')
6    False   # (5, 'eve')
7    False   # (3, 'charlie ') – whitespace difference
dtype: bool

Rows involved in ('ID', 'Name_lower') duplication:
   ID   Name  Value Name_lower
0   1  Alice  100.0      alice
1   2    Bob  200.0        bob
3   1  alice  100.0      alice
5   2    Bob  200.0        bob
```
By converting 'Name' to lowercase (`Name_lower`), we now correctly identify rows 0 and 3 as duplicates based on the combination of `ID` and the lowercased name. Rows 1 and 5 are also correctly identified. However, rows 2 and 7 are still not matched because of whitespace.
Handling Leading/Trailing Whitespace
Problem: ‘ David ‘ and ‘charlie ‘ have extra spaces.
Solution: Use the `.str.strip()` method to remove leading and trailing whitespace from string columns. This should typically be done along with case conversion.
```python
# Apply both strip and lower to the 'Name' column.
# We could do this in place on the original column, or create a new one:
df_variations['Name_clean'] = df_variations['Name'].str.strip().str.lower()
print("\ndf_variations with 'Name_clean':")
print(df_variations[['ID', 'Name', 'Name_clean', 'Value']])  # Show relevant cols

# Check duplicates based on 'ID' and 'Name_clean'
clean_duplicates = df_variations.duplicated(subset=['ID', 'Name_clean'], keep=False)
print("\nDuplicates based on 'ID' and 'Name_clean':")
print(clean_duplicates)
print("\nRows involved in ('ID', 'Name_clean') duplication:")
# Display original columns for clarity, but filter based on the check
print(df_variations[clean_duplicates][['ID', 'Name', 'Value']])

# Now let's drop duplicates based on the cleaned name and ID, keeping the first
df_variations_deduped = df_variations.drop_duplicates(subset=['ID', 'Name_clean'], keep='first')
print("\ndf_variations after dropping duplicates based on ('ID', 'Name_clean'), keeping first:")
print(df_variations_deduped[['ID', 'Name', 'Value']])  # Show original columns

# We keep the temporary 'Name_clean' column for the NaN examples below;
# it could be dropped afterwards with df_variations.drop(columns=['Name_clean'])
```
Output:
```
df_variations with 'Name_clean':
   ID      Name Name_clean  Value
0   1     Alice      alice  100.0
1   2       Bob        bob  200.0
2   3   Charlie    charlie    NaN
3   1     alice      alice  100.0
4   4    David       david  400.0   # Note: Name_clean is now 'david'
5   2       Bob        bob  200.0
6   5       Eve        eve  500.0
7   3  charlie     charlie    NaN   # Note: Name_clean is now 'charlie'

Duplicates based on 'ID' and 'Name_clean':
0     True   # (1, 'alice')
1     True   # (2, 'bob')
2     True   # (3, 'charlie')
3     True   # (1, 'alice')
4    False   # (4, 'david')
5     True   # (2, 'bob')
6    False   # (5, 'eve')
7     True   # (3, 'charlie')
dtype: bool

Rows involved in ('ID', 'Name_clean') duplication:
   ID     Name  Value
0   1    Alice  100.0
1   2      Bob  200.0
2   3  Charlie    NaN
3   1    alice  100.0
5   2      Bob  200.0
7   3  charlie    NaN

df_variations after dropping duplicates based on ('ID', 'Name_clean'), keeping first:
   ID     Name  Value
0   1    Alice  100.0   # Kept (1, 'alice')
1   2      Bob  200.0   # Kept (2, 'bob')
2   3  Charlie    NaN   # Kept (3, 'charlie')
4   4   David   400.0   # Kept (4, 'david')
6   5      Eve  500.0   # Kept (5, 'eve')
```
Success! By applying both `.str.strip()` and `.str.lower()` to the 'Name' column (creating `Name_clean`), we correctly identified all the logical duplicates based on `ID` and the cleaned name: (0, 3), (1, 5), and (2, 7). The subsequent `drop_duplicates` call then kept only the first occurrence of each pair.
Considering Data Types
Sometimes, data might look identical but have different underlying types (e.g., the number `10` vs. the string `'10'`). Pandas' duplication checks are type-sensitive.
```python
df_types = pd.DataFrame({'A': [1, '1', 2, 2], 'B': ['x', 'x', 'y', 'y']})
print("\nDataFrame with mixed types:")
print(df_types)
print(df_types.dtypes)

print("\nDuplicates based on 'A':")
print(df_types.duplicated(subset=['A'], keep=False))  # int 1 and str '1' are different

print("\nDuplicates based on 'B':")
print(df_types.duplicated(subset=['B'], keep=False))  # 'x' matches 'x', 'y' matches 'y'

print("\nFull row duplicates:")
print(df_types.duplicated(keep=False))  # Rows 0/1 don't match because of the types in column 'A'; rows 2/3 do
```
Output:
```
DataFrame with mixed types:
   A  B
0  1  x
1  1  x
2  2  y
3  2  y
A    object   # Because it contains both int and str
B    object
dtype: object

Duplicates based on 'A':
0    False   # int 1
1    False   # str '1'
2     True   # int 2
3     True   # int 2
dtype: bool

Duplicates based on 'B':
0     True   # 'x'
1     True   # 'x'
2     True   # 'y'
3     True   # 'y'
dtype: bool

Full row duplicates:
0    False   # (1, 'x') and ('1', 'x') differ because of the types in column 'A'
1    False
2     True   # (2, 'y') matches row 3 exactly
3     True
dtype: bool
```
As you can see, the integer `1` and the string `'1'` in column `A` are not considered duplicates. If your intention is to treat them as the same, you need to convert the column to a consistent type before checking for duplicates.
```python
# Convert column 'A' to string type (or to numeric, depending on context)
df_types['A_str'] = df_types['A'].astype(str)
print("\nDataFrame with 'A' converted to string:")
print(df_types)

print("\nDuplicates based on 'A_str':")
print(df_types.duplicated(subset=['A_str'], keep=False))  # Now '1' and '1' match

# Check full duplicates using the consistent-type column 'A_str' and 'B'
print("\nFull duplicates using 'A_str' and 'B':")
print(df_types.duplicated(subset=['A_str', 'B'], keep=False))
```
Output:
```
DataFrame with 'A' converted to string:
   A  B A_str
0  1  x     1
1  1  x     1
2  2  y     2
3  2  y     2

Duplicates based on 'A_str':
0     True   # '1'
1     True   # '1'
2     True   # '2'
3     True   # '2'
dtype: bool

Full duplicates using 'A_str' and 'B':
0     True   # ('1', 'x')
1     True   # ('1', 'x')
2     True   # ('2', 'y')
3     True   # ('2', 'y')
dtype: bool
```
After converting column `A` to string type (`A_str`), the duplication checks behave as expected if we intended `1` and `'1'` to be treated identically. Always inspect your data types with `df.info()` or `df.dtypes` and perform the necessary conversions.
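If the values were meant to be numbers rather than strings, converting in the numeric direction works too. A minimal sketch, reusing `df_types` from above; `errors='coerce'` is a standard `pd.to_numeric` option that turns unparsable values into `NaN`:

```python
# Convert column 'A' to numbers instead: 1 and '1' now both become the number 1
df_types['A_num'] = pd.to_numeric(df_types['A'], errors='coerce')

print(df_types.duplicated(subset=['A_num'], keep=False))
# Rows 0 and 1 now match (both 1), and rows 2 and 3 match (both 2)
```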
How Missing Values (NaN) are Handled
This is a subtle but important point. How do `.duplicated()` and `.drop_duplicates()` treat `np.nan` (Not a Number) or `None` values?
By default, in Pandas:
* When comparing values for a duplication check, `NaN` is considered equal to itself. So two rows with `NaN` in the same checked column(s) (and matching values in the other checked columns) can be considered duplicates.
* This is a special case: the standard comparison `np.nan == np.nan` returns `False` in Python, but Pandas handles `NaN` specially inside its duplication logic.
Let's look at `df_variations` again, focusing on rows 2 and 7 after cleaning the 'Name' column.
```python
# Recall df_variations after cleaning 'Name'
print("\ndf_variations with 'Name_clean' (rows 2 and 7):")
print(df_variations.loc[[2, 7], ['ID', 'Name_clean', 'Value']])

# Check duplication based on 'ID' and 'Name_clean' again
print("\nDuplicates based on ('ID', 'Name_clean') for rows 2, 7:")
print(df_variations.duplicated(subset=['ID', 'Name_clean'], keep=False).loc[[2, 7]])

# Check duplication on ('ID', 'Name_clean', 'Value') to see how the NaNs in 'Value' are matched
print("\nDuplicates based on ('ID', 'Name_clean', 'Value') (keep=False):")
print(df_variations.duplicated(subset=['ID', 'Name_clean', 'Value'], keep=False))

# Let's also create a specific NaN example
df_nan = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [np.nan, np.nan, 'x', 'x', np.nan]})
print("\nDataFrame with NaNs:")
print(df_nan)

print("\nDuplicates in df_nan based on 'A':")
print(df_nan.duplicated(subset=['A'], keep=False))

print("\nDuplicates in df_nan based on 'B':")
print(df_nan.duplicated(subset=['B'], keep=False))  # NaNs match each other, 'x' matches 'x'

print("\nFull row duplicates in df_nan:")
print(df_nan.duplicated(keep=False))  # (1, NaN) matches (1, NaN); (2, 'x') matches (2, 'x')
```
Output:
```
df_variations with 'Name_clean' (rows 2 and 7):
   ID Name_clean  Value
2   3    charlie    NaN
7   3    charlie    NaN

Duplicates based on ('ID', 'Name_clean') for rows 2, 7:
2    True
7    True
dtype: bool

Duplicates based on ('ID', 'Name_clean', 'Value') (keep=False):
0     True   # ID=1, Name_clean=alice, Value=100.0
1     True   # ID=2, Name_clean=bob, Value=200.0
2     True   # ID=3, Name_clean=charlie, Value=NaN  <-- matches row 7
3     True   # ID=1, Name_clean=alice, Value=100.0
4    False
5     True   # ID=2, Name_clean=bob, Value=200.0
6    False
7     True   # ID=3, Name_clean=charlie, Value=NaN  <-- matches row 2
dtype: bool

DataFrame with NaNs:
   A    B
0  1  NaN
1  1  NaN
2  2    x
3  2    x
4  3  NaN

Duplicates in df_nan based on 'A':
0     True
1     True
2     True
3     True
4    False
dtype: bool

Duplicates in df_nan based on 'B':
0     True   # NaN matches NaN
1     True   # NaN matches NaN
2     True   # 'x' matches 'x'
3     True   # 'x' matches 'x'
4     True   # NaN matches the NaNs at index 0 and 1
dtype: bool

Full row duplicates in df_nan:
0     True   # (1, NaN)
1     True   # (1, NaN)
2     True   # (2, 'x')
3     True   # (2, 'x')
4    False   # (3, NaN) is unique
dtype: bool
```
Key takeaway: When using `.duplicated()` or `.drop_duplicates()`, `NaN` values **in the columns being checked** are considered identical to other `NaN` values within that check. This means rows `(1, NaN)` and `(1, NaN)` are duplicates, and rows `(3, 'charlie', NaN)` and `(3, 'charlie', NaN)` in our cleaned `df_variations` are also duplicates.
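If you do not want rows with a missing key matched against each other, one option is simply to exclude them from the check first. A minimal sketch, reusing `df_nan` from above:

```python
# Only consider rows where the key column 'B' is actually present
valid_keys = df_nan.dropna(subset=['B'])
print(valid_keys.duplicated(subset=['B'], keep=False))
# Only the two 'x' rows (index 2 and 3) are flagged; the NaN rows were excluded
```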
7. Counting Duplicates
Often, you don’t just want to identify or remove duplicates, but also count how many there are.
Counting Total Duplicate Rows
The boolean Series returned by `.duplicated()` (with `keep='first'` or `keep='last'`) tells you directly which rows would be dropped. Since `True` evaluates to 1 and `False` to 0 in numerical contexts, you can simply call `.sum()` on the result.
```python
# Count how many rows are duplicates (excluding the first occurrence)
num_duplicates_simple = df_simple.duplicated(keep='first').sum()
print(f"\nNumber of duplicate rows in df_simple (excluding first): {num_duplicates_simple}")

# Count how many rows would be dropped if keeping the last
num_duplicates_simple_keep_last = df_simple.duplicated(keep='last').sum()
print(f"Number of duplicate rows in df_simple (excluding last): {num_duplicates_simple_keep_last}")

# Count total rows involved in ANY duplication
num_involved_in_duplicates = df_simple.duplicated(keep=False).sum()
print(f"Total number of rows involved in duplication in df_simple: {num_involved_in_duplicates}")
```
Output:
Number of duplicate rows in df_simple (excluding first): 2
Number of duplicate rows in df_simple (excluding last): 2
Total number of rows involved in duplication in df_simple: 4
Counting Duplicates per Group/Value
Sometimes you want to know which values are duplicated and how many times they appear. Pandas' `.value_counts()` method on a Series, or `.groupby().size()` on a DataFrame, are useful here.
```python
# Using value_counts() on a Series to see frequencies
print("\nValue counts for df_simple['col_a']:")
print(df_simple['col_a'].value_counts())

# Filter value_counts to show only duplicated values
col_a_counts = df_simple['col_a'].value_counts()
print("\nDuplicated values in df_simple['col_a'] (occur > 1 time):")
print(col_a_counts[col_a_counts > 1])

# Using groupby() and size() for combinations of columns
print("\nCounts for combinations of ('col_a', 'col_b') in df_simple:")
# .size() counts rows per group; .count() would count non-null values per column
group_counts = df_simple.groupby(['col_a', 'col_b']).size()
print(group_counts)
print("\nDuplicated combinations in df_simple (occur > 1 time):")
print(group_counts[group_counts > 1])
```
Output:
```
Value counts for df_simple['col_a']:
A    2
B    2
C    1
D    1
Name: col_a, dtype: int64

Duplicated values in df_simple['col_a'] (occur > 1 time):
A    2
B    2
Name: col_a, dtype: int64

Counts for combinations of ('col_a', 'col_b') in df_simple:
col_a  col_b
A      1        2
B      2        2
C      3        1
D      4        1
dtype: int64

Duplicated combinations in df_simple (occur > 1 time):
col_a  col_b
A      1        2
B      2        2
dtype: int64
```
These methods help quantify the extent and nature of duplication based on specific columns or combinations.
8. Practical Workflow and Best Practices
Handling duplicates is rarely a single command. It typically involves exploration, decision-making, and verification. Here’s a recommended workflow:
1. Understand Your Data & Uniqueness Constraints:
   - Load your data into a Pandas DataFrame.
   - Use `.info()` to check data types and non-null counts.
   - Use `.head()`, `.sample()`, and `.describe()` to get a feel for the values.
   - Crucially, determine which column(s) should uniquely identify a record. Is it an ID column? A combination of name, date, and location? This defines your `subset`.
2. Initial Check for Full Duplicates:
   - Run `df.duplicated().sum()` to quickly see whether any obvious full-row duplicates exist.
3. Preprocessing (If Necessary):
   - Identify columns prone to formatting issues (strings, dates, numbers read as strings).
   - Apply `.str.strip()`, `.str.lower()` (or `.upper()`), type conversions (`.astype()`), date parsing (`pd.to_datetime`), etc., to the relevant columns to standardize them. Create temporary cleaned columns if you want to preserve the originals initially.
4. Identify Duplicates Based on Key Columns (`subset`):
   - Use `df.duplicated(subset=key_columns, keep=False)` to get a boolean Series marking all rows involved in duplication based on your chosen keys.
   - Filter the DataFrame using this boolean Series: `df[df.duplicated(subset=key_columns, keep=False)]`.
   - Inspect these rows carefully. Why are they duplicates? Is it expected? Do they contain conflicting information in non-key columns? Sorting these rows (`.sort_values(by=key_columns)`) often helps visual inspection.
5. Decide on a Strategy (`keep`):
   - Based on your inspection, decide which version of the duplicate records to keep.
   - `keep='first'`: keep the one that appeared earliest in the DataFrame. Often the default choice if there is no better reason.
   - `keep='last'`: keep the most recent entry. Useful if newer data is presumed more accurate or relevant (e.g., based on a timestamp column you sorted by).
   - `keep=False`: discard all records that have duplicates. Use this if any ambiguity makes the record untrustworthy.
   - More complex logic: sometimes you need to group by the key columns and apply custom logic (e.g., keep the row with the fewest missing values, or the highest value in a specific column) before dropping duplicates. This goes beyond basic `.drop_duplicates()`; see the sketch after this list.
6. Apply `.drop_duplicates()`:
   - Use `df.drop_duplicates(subset=key_columns, keep=chosen_keep_strategy)` to remove the duplicates.
   - Assign the result to a new DataFrame, or overwrite the old one (carefully):
     ```python
     df_deduplicated = df.drop_duplicates(subset=key_columns, keep=chosen_keep_strategy)
     # OR (use with caution)
     # df = df.drop_duplicates(subset=key_columns, keep=chosen_keep_strategy)
     ```
7. Verify the Result:
   - Check the shape of the new DataFrame (`df_deduplicated.shape`) to see how many rows were removed.
   - Run the duplication check again on the result to confirm the duplicates are gone: `df_deduplicated.duplicated(subset=key_columns).sum()` should return 0.
   - Optionally, check value counts or unique counts on the key columns again.
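As an example of the "more complex logic" mentioned in step 5, here is a minimal sketch of one common pattern: sort so that the preferred row comes first within each key group, then let `keep='first'` do the work. `df` and `key_columns` are placeholders from the workflow above, and "fewest missing values" is just one possible preference.

```python
# Keep, for each key, the row with the fewest missing values
df = df.assign(_missing=df.isna().sum(axis=1))          # count NaNs per row
df_deduplicated = (
    df.sort_values('_missing')                          # most complete rows first
      .drop_duplicates(subset=key_columns, keep='first')
      .drop(columns='_missing')                         # remove the helper column
      .sort_index()                                     # restore the original order
)
```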
Best Practices:
- Don't delete blindly: always inspect duplicates before dropping, especially when using a `subset`. Use `df[df.duplicated(subset=..., keep=False)]` extensively.
- Standardize first: apply cleaning (case, whitespace, types) before checking for duplicates.
- Document your choices: record which columns you used as keys (`subset`) and why you chose a specific `keep` strategy. This is crucial for reproducibility.
- Consider the source: why did the duplicates occur? Can the data entry or collection process be improved?
- Prefer returning new DataFrames: avoid `inplace=True` unless you have a specific reason and understand the implications.
- Back up the original data: keep a copy of the raw data before performing cleaning operations.
9. Putting It All Together: A More Complex Example
Let’s simulate a slightly more realistic scenario. Imagine a dataset of user registrations where duplicates might arise from multiple sign-up attempts or data entry errors.
```python
# Sample user registration data (the email addresses are illustrative placeholders)
data_users = {
    'Timestamp': pd.to_datetime(['2023-01-10 09:00', '2023-01-10 09:05', '2023-01-11 10:00',
                                 '2023-01-11 10:05', '2023-01-12 11:30', '2023-01-12 11:31',
                                 '2023-01-13 14:00', '2023-01-13 14:05', '2023-01-14 15:00']),
    'UserID_Input': [' user1 ', 'User1', ' user2', 'user3 ', 'User2', 'user2 ', 'user4', 'user4', 'User5'],
    'Email': ['[email protected]', '[email protected]', ' [email protected] ', '[email protected]',
              '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'],
    'Plan': ['Free', 'Premium', 'Premium', 'Free', 'Premium', 'Premium', 'Pro', 'Pro', 'Free'],
    'Country': ['USA', 'USA', 'Canada', 'UK', ' CA ', 'Canada', 'USA', 'usa', 'UK']
}
df_users = pd.DataFrame(data_users)
print("--- Original User Data ---")
print(df_users)
print(f"\nShape: {df_users.shape}")
print("\nInfo:")
df_users.info()

# --- Step 1: Understand data & uniqueness ---
# Potential keys: UserID_Input, Email, or a combination.
# Timestamp can indicate the first/last entry.
# Country has formatting issues.

# --- Step 2: Initial check (full duplicates) ---
print(f"\nInitial full row duplicates: {df_users.duplicated().sum()}")  # Likely 0 due to Timestamp

# --- Step 3: Preprocessing ---
print("\n--- Preprocessing ---")
df_users['UserID'] = df_users['UserID_Input'].str.strip().str.lower()
df_users['Email_Clean'] = df_users['Email'].str.strip().str.lower()
df_users['Country_Clean'] = df_users['Country'].str.strip().str.upper()

# Display cleaned columns alongside the originals for clarity
print("\nData after cleaning UserID, Email, Country:")
print(df_users[['Timestamp', 'UserID_Input', 'UserID', 'Email', 'Email_Clean', 'Country', 'Country_Clean', 'Plan']])

# --- Step 4: Identify duplicates based on keys ---
# Let's assume Email should be unique.
key_columns = ['Email_Clean']
print(f"\n--- Identifying duplicates based on: {key_columns} ---")
duplicates_bool = df_users.duplicated(subset=key_columns, keep=False)
print("\nBoolean Series (keep=False):")
print(duplicates_bool)

print("\nRows involved in Email duplication:")
# Sort by email and timestamp to see related rows together
print(df_users[duplicates_bool].sort_values(by=['Email_Clean', 'Timestamp']))

# --- Step 5: Decide on a strategy ---
# Looking at the duplicate groups of cleaned emails (rows 0 & 1, rows 2 & 4 & 5, rows 6 & 7):
# - For rows 0 and 1, row 1 has Plan='Premium' vs. row 0 'Free'. Maybe keep the latest one?
# - For rows 2, 4, and 5, rows 2 and 5 are consistent except for Timestamp and a minor Country
#   variation; row 4 has a different UserID input but the same email. Keeping the latest timestamp
#   seems reasonable.
# - Rows 6 and 7 are almost identical except for timestamp and Country case. Keep the latest.
# Strategy: keep the 'last' entry, based on Timestamp, for each Email.
# We should sort by Timestamp before dropping duplicates so that 'last' reliably means the latest registration.

# --- Steps 6 & 7: Apply drop_duplicates & verify ---
print("\n--- Applying drop_duplicates (keep='last') after sorting by Timestamp ---")
# Sort by Timestamp (ascending) so 'last' keeps the latest entry for each email
df_users_sorted = df_users.sort_values(by='Timestamp')
df_users_deduped = df_users_sorted.drop_duplicates(subset=key_columns, keep='last')

print("\nDataFrame after deduplication:")
print(df_users_deduped[['Timestamp', 'UserID', 'Email_Clean', 'Plan', 'Country_Clean']])
print(f"\nShape after deduplication: {df_users_deduped.shape}")

# Verification
print(f"\nRemaining duplicates based on {key_columns}: {df_users_deduped.duplicated(subset=key_columns).sum()}")  # Should be 0

# Check UserID duplicates in the cleaned data (optional)
print(f"\nDuplicates based on UserID in final data: {df_users_deduped.duplicated(subset=['UserID']).sum()}")
# Note: we might still have UserID duplicates if different emails were used.

# Final clean-up (optional: drop the intermediate columns)
df_final = df_users_deduped.drop(columns=['UserID_Input', 'Email', 'Country'])
print("\n--- Final Cleaned Data ---")
print(df_final)
```
Explanation of the Example:
- Load & Inspect: We loaded the data and used `.info()` to see the types (noting object dtypes for the string columns).
- Preprocessing: We identified `UserID_Input`, `Email`, and `Country` as needing cleaning. We applied `.str.strip()` and case conversion (`.lower()` or `.upper()`) to create standardized `UserID`, `Email_Clean`, and `Country_Clean` columns.
- Identify: We decided `Email_Clean` should be the unique key. We used `df.duplicated(subset=['Email_Clean'], keep=False)` and filtered the DataFrame to inspect all rows associated with duplicate emails. Sorting by `Email_Clean` and `Timestamp` helped group them visually.
- Strategize: We observed different plans or user IDs associated with the same email. We decided that keeping the latest registration (based on `Timestamp`) for each email was a reasonable strategy (`keep='last'`). To ensure 'last' refers to the latest time, we first sorted the DataFrame by `Timestamp`.
- Apply & Verify: We applied `drop_duplicates(subset=['Email_Clean'], keep='last')` to the sorted DataFrame. We checked the shape before and after, and verified that the number of duplicates based on `Email_Clean` in the result was zero. We also performed an optional check for remaining `UserID` duplicates (which might be acceptable if a user can register multiple emails). Finally, we dropped the original messy columns.
This example demonstrates the iterative nature: clean -> identify -> inspect -> strategize -> apply -> verify.
10. Conclusion
Duplicate data is a common challenge in data analysis, but Pandas provides straightforward and powerful tools to manage it. We’ve explored the core functions:
- `.duplicated(subset=None, keep='first')`: identifies duplicate rows based on all columns or a specified `subset`, returning a boolean Series. The `keep` parameter (`'first'`, `'last'`, `False`) controls which occurrences are marked as `True`. Essential for inspection.
- `.drop_duplicates(subset=None, keep='first', inplace=False)`: removes duplicate rows, returning a new DataFrame (unless `inplace=True`). It uses the same `subset` and `keep` logic to determine which rows to remove and which to retain.
We also highlighted the critical importance of preprocessing steps such as handling case sensitivity (`.str.lower()`, `.str.upper()`), removing whitespace (`.str.strip()`), and ensuring consistent data types (`.astype()`) before applying duplication checks. Understanding how missing values (`NaN`) are treated (they match each other) is also key.
By following a structured workflow – understanding uniqueness, cleaning data, inspecting duplicates thoroughly, choosing a sensible `keep` strategy, applying `drop_duplicates`, and verifying the results – beginners can confidently tackle duplicate data issues and significantly improve the quality and reliability of their analyses.
Mastering duplicate handling is a fundamental skill in data cleaning, paving the way for more accurate insights and robust data-driven applications. Keep practicing with different datasets and scenarios!