Pandas: Identifying and Handling Duplicate Data (A Beginner’s Guide)

In the world of data analysis and data science, the quality of your data is paramount. Raw data collected from various sources is often messy, incomplete, and contains inconsistencies. One of the most common issues encountered during the data cleaning process is the presence of duplicate records. Duplicate data can skew analyses, lead to incorrect conclusions, inflate counts, waste storage space, and potentially cause errors in downstream processes like machine learning model training or reporting.

Fortunately, the Pandas library, the cornerstone of data manipulation in Python, provides powerful and flexible tools specifically designed to identify and handle duplicate data effectively. This guide is aimed at beginners who are starting their journey with Pandas and need a comprehensive understanding of how to tackle duplicate entries in their datasets (represented as Pandas DataFrames or Series).

We will cover:

  1. What Constitutes Duplicate Data? – Understanding different types of duplicates.
  2. Why Duplicate Data is Problematic – The impact on analysis and results.
  3. Setting Up Your Environment – Importing Pandas and creating sample data.
  4. Identifying Duplicates with .duplicated() – The fundamental method for detection.
    • Basic Usage on Series and DataFrames.
    • Understanding the keep parameter ('first', 'last', False).
    • Filtering to view duplicate rows.
  5. Removing Duplicates with .drop_duplicates() – The primary method for removal.
    • Basic Usage.
    • Using the subset parameter for targeted duplication checks.
    • Using the keep parameter to control which duplicate to retain.
    • Using the inplace parameter for direct modification.
  6. Handling Nuances and Edge Cases
    • Case Sensitivity.
    • Leading/Trailing Whitespace.
    • Considering Data Types.
    • How Missing Values (NaN) are Handled.
  7. Counting Duplicates – Getting summaries of duplicate occurrences.
  8. Practical Workflow and Best Practices – A step-by-step approach.
  9. Putting It All Together: A More Complex Example – Applying the concepts.
  10. Conclusion – Key takeaways and next steps.

Let’s embark on this journey to master duplicate data handling in Pandas!

1. What Constitutes Duplicate Data?

At its core, a duplicate record is an entry in your dataset that is identical to another entry. However, the definition of “identical” can vary depending on the context:

  • Full Row Duplicates: An entire row has the exact same values across all columns as another row. This is the simplest form of duplication.
  • Partial Duplicates (Based on Key Columns): Sometimes, duplication is defined based on a subset of columns. For example, in a customer database, you might consider two rows duplicates if they have the same CustomerID or the same combination of FirstName, LastName, and DateOfBirth, even if other columns like LastPurchaseDate differ. These “key” columns define the uniqueness of a record.
  • Near Duplicates: These are trickier. Records might be almost identical due to typos, variations in formatting (e.g., “St.” vs. “Street”), case differences (“apple” vs. “Apple”), or leading/trailing whitespace (” value ” vs. “value”). While Pandas’ basic duplication functions primarily handle exact matches, recognizing near duplicates often requires additional data cleaning steps before checking for exact duplicates.

This guide will focus primarily on handling exact duplicates, both full-row and partial (subset-based), as these are directly addressed by Pandas’ core functions. We’ll also touch upon preprocessing steps for handling common causes of near-duplicates like case and whitespace.

2. Why Duplicate Data is Problematic

Ignoring duplicate data can lead to significant issues:

  • Skewed Statistical Analysis: Measures like mean, median, and counts will be distorted. If 10% of your sales data is duplicated, your total sales figures will be artificially inflated.
  • Incorrect Reporting: Business reports based on duplicated data will present a false picture of reality, potentially leading to poor decision-making.
  • Biased Machine Learning Models: If duplicates are prevalent in training data, models might overweight certain patterns, leading to poor generalization on new, unseen data.
  • Wasted Resources: Storing and processing duplicate data consumes unnecessary storage space and computational power.
  • Operational Issues: Sending duplicate emails to customers, double-billing, or maintaining conflicting records for the same entity can damage customer relationships and operational efficiency.

Therefore, identifying and appropriately handling duplicates is a crucial step in any data preparation pipeline.

3. Setting Up Your Environment

Before we dive into the methods, let’s ensure we have Pandas installed and import it. We’ll also create some sample DataFrames to illustrate the concepts.

If you don’t have Pandas installed, you can install it using pip:

```bash
pip install pandas
```

Now, let’s start our Python script or Jupyter Notebook session by importing Pandas:

```python
import pandas as pd
import numpy as np  # Often useful, especially for NaNs

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
```

Next, let’s create a few simple DataFrames that we will use throughout this guide.

DataFrame 1: Simple Duplicates (df_simple)
This DataFrame contains obvious, full-row duplicates.

```python
data_simple = {'col_a': ['A', 'B', 'C', 'A', 'B', 'D'],
               'col_b': [1, 2, 3, 1, 2, 4]}
df_simple = pd.DataFrame(data_simple)

print("DataFrame: df_simple")
print(df_simple)
print("-" * 30)
```

Output:
```
DataFrame: df_simple
  col_a  col_b
0     A      1
1     B      2
2     C      3
3     A      1   # Duplicate of row 0
4     B      2   # Duplicate of row 1
5     D      4
```

DataFrame 2: Duplicates based on Subset (df_subset)
Here, duplicates exist if we only consider col_x, but the rows are not identical overall.

```python
data_subset = {'col_x': ['X', 'Y', 'X', 'Z', 'Y', 'X'],
               'col_y': [10, 20, 30, 40, 50, 60],
               'col_z': [True, False, True, False, True, False]}
df_subset = pd.DataFrame(data_subset)

print("DataFrame: df_subset")
print(df_subset)
print("-" * 30)
```

Output:
```
DataFrame: df_subset
  col_x  col_y  col_z
0     X     10   True
1     Y     20  False
2     X     30   True   # col_x is 'X' (like row 0), but col_y differs
3     Z     40  False
4     Y     50   True   # col_x is 'Y' (like row 1), but col_y, col_z differ
5     X     60  False   # col_x is 'X' (like rows 0, 2), but the other columns differ
```

DataFrame 3: Duplicates with Variations (df_variations)
This includes case sensitivity, whitespace issues, and missing values (NaN).

```python
data_variations = {'ID': [1, 2, 3, 1, 4, 2, 5, 3],
                   'Name': ['Alice', 'Bob', 'Charlie', 'alice', ' David ', 'Bob', 'Eve', 'charlie '],
                   'Value': [100, 200, np.nan, 100, 400, 200, 500, np.nan]}
df_variations = pd.DataFrame(data_variations)

print("DataFrame: df_variations")
print(df_variations)
print("-" * 30)
```

Output:
```
DataFrame: df_variations
   ID      Name  Value
0   1     Alice  100.0
1   2       Bob  200.0
2   3   Charlie    NaN
3   1     alice  100.0   # Different case in Name
4   4    David   400.0   # Whitespace in Name
5   2       Bob  200.0   # Full duplicate of row 1
6   5       Eve  500.0
7   3  charlie     NaN   # Whitespace and case difference
```

Now we have our sample data ready to explore Pandas’ duplication tools.

4. Identifying Duplicates with .duplicated()

The primary method for identifying duplicate rows without removing them is the .duplicated() method. It can be called on both Pandas Series (single columns) and DataFrames (multiple columns).

How it Works:
.duplicated() returns a boolean Series (a Series of True/False values) with the same index as the original Series or DataFrame.
* True indicates that the row (or value in a Series) is a duplicate of a previous row/value.
* False indicates that the row/value is unique so far or is the first occurrence of a value that might appear again later.

Basic Usage on a Series

Let’s apply it to a single column from df_simple:

```python
# Check for duplicates in 'col_a' of df_simple
duplicates_in_col_a = df_simple['col_a'].duplicated()

print("Duplicates in df_simple['col_a']:")
print(duplicates_in_col_a)
```

Output:
```
Duplicates in df_simple['col_a']:
0    False   # First 'A'
1    False   # First 'B'
2    False   # First 'C'
3     True   # Second 'A', marked as duplicate
4     True   # Second 'B', marked as duplicate
5    False   # First 'D'
Name: col_a, dtype: bool
```

As you can see, the first occurrences of ‘A’ (index 0) and ‘B’ (index 1) are marked False, while their subsequent occurrences (index 3 and 4) are marked True.

Basic Usage on a DataFrame

When called on a DataFrame without any arguments, .duplicated() checks for duplicates based on all columns. A row is marked True only if all its values match all the values in a preceding row.

```python
# Check for full-row duplicates in df_simple
full_row_duplicates = df_simple.duplicated()

print("\nFull row duplicates in df_simple:")
print(full_row_duplicates)
```

Output:
```
Full row duplicates in df_simple:
0    False   # ('A', 1) - first occurrence
1    False   # ('B', 2) - first occurrence
2    False   # ('C', 3) - first occurrence
3     True   # ('A', 1) - duplicate of row 0
4     True   # ('B', 2) - duplicate of row 1
5    False   # ('D', 4) - first occurrence
dtype: bool
```

This confirms that rows at index 3 and 4 are exact duplicates of rows 0 and 1, respectively.

Now let’s try it on df_subset, where no full rows are identical:

```python
# Check for full-row duplicates in df_subset
full_row_duplicates_subset = df_subset.duplicated()

print("\nFull row duplicates in df_subset:")
print(full_row_duplicates_subset)
```

Output:
```
Full row duplicates in df_subset:
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool
```

As expected, since no row is completely identical to a previous one, all values are False.

The keep Parameter

The behavior of which occurrence is marked as True (the duplicate) is controlled by the keep parameter. It has three possible values:

  • keep='first' (Default): Marks all occurrences except the first one as duplicates (True).
  • keep='last': Marks all occurrences except the last one as duplicates (True).
  • keep=False: Marks all occurrences that are part of a duplicate set as True. This is extremely useful for identifying all rows involved in a duplication.

Let’s see how keep affects the output on df_simple:

```python
# Using keep='first' (default)
print("\ndf_simple.duplicated(keep='first'):")
print(df_simple.duplicated(keep='first'))  # Same as df_simple.duplicated()

# Using keep='last'
print("\ndf_simple.duplicated(keep='last'):")
print(df_simple.duplicated(keep='last'))

# Using keep=False
print("\ndf_simple.duplicated(keep=False):")
print(df_simple.duplicated(keep=False))
```

Output:
```
df_simple.duplicated(keep='first'):
0    False
1    False
2    False
3     True   # Second ('A', 1) is a duplicate
4     True   # Second ('B', 2) is a duplicate
5    False
dtype: bool

df_simple.duplicated(keep='last'):
0     True   # First ('A', 1) is a duplicate (because the last one is kept)
1     True   # First ('B', 2) is a duplicate (because the last one is kept)
2    False
3    False   # Last ('A', 1) is kept
4    False   # Last ('B', 2) is kept
5    False
dtype: bool

df_simple.duplicated(keep=False):
0     True   # Part of the ('A', 1) duplicate set
1     True   # Part of the ('B', 2) duplicate set
2    False
3     True   # Part of the ('A', 1) duplicate set
4     True   # Part of the ('B', 2) duplicate set
5    False
dtype: bool
```

Notice the difference:
* keep='first' marks rows 3 and 4 as duplicates.
* keep='last' marks rows 0 and 1 as duplicates.
* keep=False marks rows 0, 1, 3, and 4 as duplicates, showing us all rows that have a twin somewhere in the DataFrame.

Filtering to View Duplicate Rows

The boolean Series returned by .duplicated() is incredibly useful for filtering the original DataFrame to see the actual duplicate rows. This is done using boolean indexing.

```python
# Show rows marked as duplicates (keeping the first occurrence)
duplicate_rows_keep_first = df_simple[df_simple.duplicated(keep='first')]
print("\nDuplicate rows in df_simple (keeping first):")
print(duplicate_rows_keep_first)

# Show rows marked as duplicates (keeping the last occurrence)
duplicate_rows_keep_last = df_simple[df_simple.duplicated(keep='last')]
print("\nDuplicate rows in df_simple (keeping last):")
print(duplicate_rows_keep_last)

# Show ALL rows that are part of any duplicate set
all_involved_duplicates = df_simple[df_simple.duplicated(keep=False)]
print("\nAll rows involved in duplicates in df_simple:")
print(all_involved_duplicates)
```

Output:
```
Duplicate rows in df_simple (keeping first):
  col_a  col_b
3     A      1
4     B      2

Duplicate rows in df_simple (keeping last):
  col_a  col_b
0     A      1
1     B      2

All rows involved in duplicates in df_simple:
  col_a  col_b
0     A      1
1     B      2
3     A      1
4     B      2
```

This ability to easily view the duplicate records (especially using keep=False) is vital for understanding the nature of the duplication before deciding how to handle it.
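When a DataFrame has many columns, it also helps to sort the rows flagged by keep=False so that matching records end up next to each other for side-by-side comparison. A minimal sketch using the df_simple frame from above:

```python
# View every row involved in duplication, sorted so identical rows sit together
involved = df_simple[df_simple.duplicated(keep=False)]
print(involved.sort_values(by=list(df_simple.columns)))
```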

Using .duplicated() with a subset

Just like we discussed partial duplicates, .duplicated() can check for duplication based on a specific column or a list of columns using the subset parameter.

Let’s revisit df_subset and check for duplicates based only on col_x:

```python
# Check for duplicates based ONLY on 'col_x' in df_subset
duplicates_in_col_x_subset = df_subset.duplicated(subset=['col_x'])  # Note: subset expects a list

print("\nDuplicates in df_subset based on 'col_x' (keep='first'):")
print(duplicates_in_col_x_subset)

# View the rows marked as duplicates based on 'col_x'
print("\nRows in df_subset where 'col_x' is duplicated (keep='first'):")
print(df_subset[duplicates_in_col_x_subset])

# View ALL rows involved in 'col_x' duplication
print("\nAll rows in df_subset involved in 'col_x' duplication (keep=False):")
print(df_subset[df_subset.duplicated(subset=['col_x'], keep=False)])
```

Output:
```
Duplicates in df_subset based on 'col_x' (keep='first'):
0    False   # First 'X'
1    False   # First 'Y'
2     True   # Second 'X'
3    False   # First 'Z'
4     True   # Second 'Y'
5     True   # Third 'X'
dtype: bool

Rows in df_subset where 'col_x' is duplicated (keep='first'):
  col_x  col_y  col_z
2     X     30   True
4     Y     50   True
5     X     60  False

All rows in df_subset involved in 'col_x' duplication (keep=False):
  col_x  col_y  col_z
0     X     10   True
1     Y     20  False
2     X     30   True
4     Y     50   True
5     X     60  False
```

Now .duplicated() identifies rows 2, 4, and 5 as duplicates because their values in col_x ('X', 'Y', and 'X' respectively) have already appeared in earlier rows (index 0 for 'X', index 1 for 'Y'). Using keep=False shows all rows (0, 1, 2, 4, 5) that share a col_x value with another row.

You can specify multiple columns in the subset list:

```python
# Check for duplicates based on BOTH 'col_x' and 'col_z'
duplicates_xz = df_subset.duplicated(subset=['col_x', 'col_z'])

print("\nDuplicates in df_subset based on 'col_x' AND 'col_z' (keep='first'):")
print(duplicates_xz)

# View these duplicates
print("\nRows in df_subset where the ('col_x', 'col_z') pair is duplicated:")
print(df_subset[duplicates_xz])
```

Output:
```
Duplicates in df_subset based on 'col_x' AND 'col_z' (keep='first'):
0    False   # First ('X', True)
1    False   # First ('Y', False)
2     True   # Second ('X', True)
3    False   # First ('Z', False)
4    False   # First ('Y', True)
5    False   # First ('X', False)
dtype: bool

Rows in df_subset where the ('col_x', 'col_z') pair is duplicated:
  col_x  col_y  col_z
2     X     30   True
```

Here, only row 2 is marked as a duplicate because the combination ('X', True) occurred earlier in row 0.

.duplicated() is your inspection tool. It lets you see if and where duplicates exist according to your definition (all columns or a subset) without changing your data.

5. Removing Duplicates with .drop_duplicates()

Once you have identified duplicates using .duplicated() and decided on a strategy for handling them, the .drop_duplicates() method is used to remove them. It returns a new DataFrame (by default) with the duplicate rows removed.

How it Works:
.drop_duplicates() scans the DataFrame (or a subset of columns) and removes rows that are duplicates of rows that are kept. Which rows are kept and which are removed depends on the subset and keep parameters.

Basic Usage

By default, .drop_duplicates() works like .duplicated(): it considers all columns and keeps the first occurrence (keep='first').

```python
# Drop full-row duplicates from df_simple, keeping the first occurrence
df_simple_deduplicated = df_simple.drop_duplicates()  # keep='first' is the default

print("\ndf_simple after dropping full-row duplicates (keeping first):")
print(df_simple_deduplicated)

# Note that the original df_simple remains unchanged
print("\nOriginal df_simple (unchanged):")
print(df_simple)
```

Output:
```
df_simple after dropping full-row duplicates (keeping first):
  col_a  col_b
0     A      1
1     B      2
2     C      3
5     D      4

Original df_simple (unchanged):
  col_a  col_b
0     A      1
1     B      2
2     C      3
3     A      1
4     B      2
5     D      4
```

Rows 3 and 4, which were exact duplicates of rows 0 and 1, have been removed. The method returned a new DataFrame, df_simple_deduplicated.

Using the subset Parameter

Similar to .duplicated(), the subset parameter in .drop_duplicates() allows you to define duplication based on specific columns. Rows will be dropped if their values in the subset columns match a row that is kept.

Let’s use df_subset again. If we want to keep only the first row for each unique value in col_x:

```python
# Drop duplicates based on 'col_x', keeping the first occurrence
df_subset_dedup_col_x = df_subset.drop_duplicates(subset=['col_x'], keep='first')

print("\ndf_subset after dropping duplicates based on 'col_x' (keeping first):")
print(df_subset_dedup_col_x)
```

Output:
```
df_subset after dropping duplicates based on 'col_x' (keeping first):
  col_x  col_y  col_z
0     X     10   True    # First 'X' kept
1     Y     20  False    # First 'Y' kept
3     Z     40  False    # First 'Z' kept
```

Rows 2, 4, and 5 were dropped because their col_x values (‘X’, ‘Y’, ‘X’) had already been seen in rows 0 and 1, which were kept.

Using the keep Parameter

The keep parameter works identically to how it does in .duplicated(), but here it determines which row to keep when duplicates are found based on the subset (or all columns).

  • keep='first' (Default): Keep the first occurrence, drop subsequent duplicates.
  • keep='last': Keep the last occurrence, drop preceding duplicates.
  • keep=False: Drop all occurrences that are part of a duplicate set. This is useful if you want to remove any record that ever had a duplicate counterpart based on your criteria.

Let’s illustrate with df_simple:

```python
# Keep the 'last' occurrence of full-row duplicates
df_simple_keep_last = df_simple.drop_duplicates(keep='last')
print("\ndf_simple dropping duplicates, keeping last:")
print(df_simple_keep_last)

# keep=False: drop ALL rows that were ever duplicated
df_simple_keep_false = df_simple.drop_duplicates(keep=False)
print("\ndf_simple dropping duplicates, keeping none (keep=False):")
print(df_simple_keep_false)
```

Output:
```
df_simple dropping duplicates, keeping last:
  col_a  col_b
2     C      3
3     A      1   # Last ('A', 1) kept
4     B      2   # Last ('B', 2) kept
5     D      4

df_simple dropping duplicates, keeping none (keep=False):
  col_a  col_b
2     C      3
5     D      4
```

Now apply keep with subset on df_subset based on col_x:

```python
# Keep the 'last' row for each unique 'col_x' value
df_subset_keep_last_col_x = df_subset.drop_duplicates(subset=['col_x'], keep='last')
print("\ndf_subset dropping based on 'col_x', keeping last:")
print(df_subset_keep_last_col_x)

# keep=False: drop ALL rows whose 'col_x' value appeared more than once
df_subset_keep_false_col_x = df_subset.drop_duplicates(subset=['col_x'], keep=False)
print("\ndf_subset dropping based on 'col_x', keeping none (keep=False):")
print(df_subset_keep_false_col_x)
```

Output:
```
df_subset dropping based on 'col_x', keeping last:
  col_x  col_y  col_z
3     Z     40  False   # Only 'Z' appears once
4     Y     50   True   # Last 'Y' kept
5     X     60  False   # Last 'X' kept

df_subset dropping based on 'col_x', keeping none (keep=False):
  col_x  col_y  col_z
3     Z     40  False   # Only 'Z' occurred exactly once
```

The choice of keep depends entirely on your specific requirements. Do you trust the first entry more? The most recent (last) entry? Or do you want to discard any ambiguous entries entirely?

Using the inplace Parameter

By default, .drop_duplicates() (like most Pandas manipulation methods) returns a new DataFrame, leaving the original DataFrame untouched. If you want to modify the original DataFrame directly, you can use the inplace=True argument.

```python
print("\nOriginal df_subset before inplace drop:")
print(df_subset)

# Create a copy to modify in place, preserving the original df_subset for later examples
df_subset_copy = df_subset.copy()

# Drop duplicates based on 'col_x', keeping first, modifying df_subset_copy directly
return_value = df_subset_copy.drop_duplicates(subset=['col_x'], keep='first', inplace=True)

print("\nReturn value of inplace operation:", return_value)  # Note: inplace returns None
print("\ndf_subset_copy after inplace drop:")
print(df_subset_copy)

# Verify the original df_subset is unchanged
print("\nOriginal df_subset (still unchanged):")
print(df_subset)
```

Output:
```
Original df_subset before inplace drop:
  col_x  col_y  col_z
0     X     10   True
1     Y     20  False
2     X     30   True
3     Z     40  False
4     Y     50   True
5     X     60  False

Return value of inplace operation: None

df_subset_copy after inplace drop:
  col_x  col_y  col_z
0     X     10   True
1     Y     20  False
3     Z     40  False

Original df_subset (still unchanged):
  col_x  col_y  col_z
0     X     10   True
1     Y     20  False
2     X     30   True
3     Z     40  False
4     Y     50   True
5     X     60  False
```

Caution with inplace=True: While inplace=True can seem convenient as it avoids creating a new variable, it’s generally recommended for beginners (and often for experienced users too) to avoid it.
* It makes code harder to debug, as the state of the DataFrame changes silently.
* It breaks the flow of method chaining (calling several operations in one expression; see the short example after the next code block).
* It doesn’t necessarily offer significant performance benefits in many cases.

It’s usually safer and clearer to assign the result back to the original variable or a new variable:

```python
# Safer alternative to inplace=True
df_subset = df_subset.drop_duplicates(subset=['col_x'], keep='first')
# Now df_subset holds the deduplicated result
```
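Because .drop_duplicates() returns a new DataFrame, it also chains cleanly with other operations, which is one more reason to prefer it over inplace=True. A small sketch using the columns from df_subset above:

```python
# Chain deduplication with other steps instead of mutating in place
df_subset_clean = (
    df_subset
    .drop_duplicates(subset=['col_x'], keep='first')
    .sort_values(by='col_y')
    .reset_index(drop=True)
)
print(df_subset_clean)
```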

6. Handling Nuances and Edge Cases

Real-world data often requires preprocessing before applying .duplicated() or .drop_duplicates() to catch duplicates that aren’t immediately obvious due to formatting inconsistencies.

Let’s use df_variations to explore these.

```python
print("\nDataFrame: df_variations (Recall)")
print(df_variations)

# Check initial duplicates based on 'ID'
print("\nDuplicates based on 'ID' initially:")
print(df_variations.duplicated(subset=['ID'], keep=False))

# Check initial duplicates based on 'Name'
print("\nDuplicates based on 'Name' initially:")
print(df_variations.duplicated(subset=['Name'], keep=False))

# Check initial full-row duplicates
print("\nFull-row duplicates initially:")
print(df_variations.duplicated(keep=False))
```

Output:
```
DataFrame: df_variations (Recall)
   ID      Name  Value
0   1     Alice  100.0
1   2       Bob  200.0
2   3   Charlie    NaN
3   1     alice  100.0
4   4    David   400.0
5   2       Bob  200.0
6   5       Eve  500.0
7   3  charlie     NaN

Duplicates based on 'ID' initially:
0     True   # ID 1
1     True   # ID 2
2     True   # ID 3
3     True   # ID 1
4    False   # ID 4
5     True   # ID 2
6    False   # ID 5
7     True   # ID 3
dtype: bool

Duplicates based on 'Name' initially:
0    False   # Alice
1     True   # Bob
2    False   # Charlie
3    False   # alice (different case)
4    False   # ' David ' (whitespace)
5     True   # Bob
6    False   # Eve
7    False   # 'charlie ' (whitespace + case)
dtype: bool

Full-row duplicates initially:
0    False
1     True   # (2, 'Bob', 200.0) - matches row 5 on every column
2    False   # Not a full duplicate of row 7: the Name values differ ('Charlie' vs 'charlie ')
3    False
4    False
5     True   # (2, 'Bob', 200.0) - duplicate of row 1
6    False
7    False
dtype: bool
```

Initially:
* ID shows duplicates for 1, 2, and 3.
* Name only shows 'Bob' as duplicated, because 'Alice'/'alice', 'Charlie'/'charlie ', and ' David ' are treated as distinct strings.
* Only rows 1 and 5 are identified as full-row duplicates.

Handling Case Sensitivity

Problem: ‘Alice’ and ‘alice’ are treated as different names.
Solution: Convert the relevant column(s) to a consistent case (e.g., lowercase) before checking for duplicates.

```python
# Create a temporary column (or modify the column) for checking
df_variations['Name_lower'] = df_variations['Name'].str.lower()

print("\ndf_variations with 'Name_lower':")
print(df_variations)

# Check duplicates based on 'ID' and 'Name_lower'
name_lower_duplicates = df_variations.duplicated(subset=['ID', 'Name_lower'], keep=False)
print("\nDuplicates based on 'ID' and 'Name_lower':")
print(name_lower_duplicates)

print("\nRows involved in ('ID', 'Name_lower') duplication:")
print(df_variations[name_lower_duplicates])

# Don't forget to drop the temporary column later if it is no longer needed, e.g.:
# df_variations = df_variations.drop(columns=['Name_lower'])
# (We keep it for now because later examples still refer to it.)
```

Output:
```
df_variations with 'Name_lower':
   ID      Name  Value Name_lower
0   1     Alice  100.0      alice
1   2       Bob  200.0        bob
2   3   Charlie    NaN    charlie
3   1     alice  100.0      alice    # Now matches row 0 based on Name_lower
4   4    David   400.0    david      # Whitespace still present
5   2       Bob  200.0        bob    # Matches row 1
6   5       Eve  500.0        eve
7   3  charlie     NaN   charlie     # Name matches row 2, but whitespace is still an issue

Duplicates based on 'ID' and 'Name_lower':
0     True   # (1, 'alice')
1     True   # (2, 'bob')
2    False   # (3, 'charlie') - note: whitespace still matters!
3     True   # (1, 'alice')
4    False   # (4, ' david ')
5     True   # (2, 'bob')
6    False   # (5, 'eve')
7    False   # (3, 'charlie ') - whitespace difference
dtype: bool

Rows involved in ('ID', 'Name_lower') duplication:
   ID   Name  Value Name_lower
0   1  Alice  100.0      alice
1   2    Bob  200.0        bob
3   1  alice  100.0      alice
5   2    Bob  200.0        bob
```

By converting 'Name' to lowercase (Name_lower), we now correctly identify rows 0 and 3 as duplicates based on the combination of ID and the lowercased name. Rows 1 and 5 are also correctly identified. However, rows 2 and 7 are still not matched due to whitespace.

Handling Leading/Trailing Whitespace

Problem: ‘ David ‘ and ‘charlie ‘ have extra spaces.
Solution: Use the .str.strip() method to remove leading and trailing whitespace from string columns. This should typically be done along with case conversion.

```python
# Apply both strip and lower to the 'Name' column.
# We could do this in place on the original column, or create a new one:
df_variations['Name_clean'] = df_variations['Name'].str.strip().str.lower()

print("\ndf_variations with 'Name_clean':")
print(df_variations[['ID', 'Name', 'Name_clean', 'Value']])  # Show relevant columns

# Check duplicates based on 'ID' and 'Name_clean'
clean_duplicates = df_variations.duplicated(subset=['ID', 'Name_clean'], keep=False)
print("\nDuplicates based on 'ID' and 'Name_clean':")
print(clean_duplicates)

print("\nRows involved in ('ID', 'Name_clean') duplication:")
# Display the original columns for clarity, but filter based on the check
print(df_variations[clean_duplicates][['ID', 'Name', 'Value']])

# Now drop duplicates based on ID and the cleaned name, keeping the first
df_variations_deduped = df_variations.drop_duplicates(subset=['ID', 'Name_clean'], keep='first')

print("\ndf_variations after dropping duplicates based on ('ID', 'Name_clean'), keeping first:")
print(df_variations_deduped[['ID', 'Name', 'Value']])  # Show original columns

# Clean up the temporary columns if desired (kept here because the NaN section below still uses them):
# df_variations = df_variations.drop(columns=['Name_lower', 'Name_clean'])
# df_variations_deduped = df_variations_deduped.drop(columns=['Name_lower', 'Name_clean'])
```

Output:
```
df_variations with 'Name_clean':
   ID      Name Name_clean  Value
0   1     Alice      alice  100.0
1   2       Bob        bob  200.0
2   3   Charlie    charlie    NaN
3   1     alice      alice  100.0
4   4    David       david  400.0   # Name_clean is now 'david'
5   2       Bob        bob  200.0
6   5       Eve        eve  500.0
7   3  charlie     charlie    NaN   # Name_clean is now 'charlie'

Duplicates based on 'ID' and 'Name_clean':
0     True   # (1, 'alice')
1     True   # (2, 'bob')
2     True   # (3, 'charlie')
3     True   # (1, 'alice')
4    False   # (4, 'david')
5     True   # (2, 'bob')
6    False   # (5, 'eve')
7     True   # (3, 'charlie')
dtype: bool

Rows involved in ('ID', 'Name_clean') duplication:
   ID     Name  Value
0   1    Alice  100.0
1   2      Bob  200.0
2   3  Charlie    NaN
3   1    alice  100.0
5   2      Bob  200.0
7   3  charlie    NaN

df_variations after dropping duplicates based on ('ID', 'Name_clean'), keeping first:
   ID     Name  Value
0   1    Alice  100.0   # Kept (1, 'alice')
1   2      Bob  200.0   # Kept (2, 'bob')
2   3  Charlie    NaN   # Kept (3, 'charlie')
4   4   David   400.0   # Kept (4, 'david')
6   5      Eve  500.0   # Kept (5, 'eve')
```

Success! By applying both .str.strip() and .str.lower() to the 'Name' column (creating Name_clean), we correctly identified all logical duplicates based on ID and the cleaned name: (0, 3), (1, 5), and (2, 7). The subsequent drop_duplicates call correctly kept only the first occurrence of each pair.

Considering Data Types

Sometimes, data might look similar but have different underlying types (e.g., the number 10 vs. the string '10'). Pandas’ duplication checks are type-sensitive.

```python
df_types = pd.DataFrame({'A': [1, '1', 2, 2], 'B': ['x', 'x', 'y', 'y']})
print("\nDataFrame with mixed types:")
print(df_types)
print(df_types.dtypes)

print("\nDuplicates based on 'A':")
print(df_types.duplicated(subset=['A'], keep=False))  # int 1 and str '1' are different

print("\nDuplicates based on 'B':")
print(df_types.duplicated(subset=['B'], keep=False))  # 'x' matches 'x', 'y' matches 'y'

print("\nFull row duplicates:")
print(df_types.duplicated(keep=False))  # Rows 0 and 1 differ because of the type mismatch in 'A'
```

Output:
```
DataFrame with mixed types:
   A  B
0  1  x
1  1  x
2  2  y
3  2  y
A    object   # Because it contains both int and str
B    object
dtype: object

Duplicates based on 'A':
0    False   # int 1
1    False   # str '1'
2     True   # int 2
3     True   # int 2
dtype: bool

Duplicates based on 'B':
0     True   # 'x'
1     True   # 'x'
2     True   # 'y'
3     True   # 'y'
dtype: bool

Full row duplicates:
0    False   # (1, 'x') vs ('1', 'x'): int 1 and str '1' differ
1    False
2     True   # (2, 'y') matches row 3 exactly
3     True
dtype: bool
```

As you can see, the integer 1 and the string '1' in column A are not considered duplicates. If your intention is to treat them as the same, you need to convert the column to a consistent type before checking for duplicates.

```python
# Convert column 'A' to string type (or to numeric, depending on context)
df_types['A_str'] = df_types['A'].astype(str)

print("\nDataFrame with 'A' converted to string:")
print(df_types)

print("\nDuplicates based on 'A_str':")
print(df_types.duplicated(subset=['A_str'], keep=False))  # Now '1' and '1' match

# Check full duplicates using the consistent-type column 'A_str' together with 'B'
print("\nFull duplicates using 'A_str' and 'B':")
print(df_types.duplicated(subset=['A_str', 'B'], keep=False))
```

Output:
```
DataFrame with 'A' converted to string:
   A  B A_str
0  1  x     1
1  1  x     1
2  2  y     2
3  2  y     2

Duplicates based on 'A_str':
0     True   # '1'
1     True   # '1'
2     True   # '2'
3     True   # '2'
dtype: bool

Full duplicates using 'A_str' and 'B':
0     True   # ('1', 'x')
1     True   # ('1', 'x')
2     True   # ('2', 'y')
3     True   # ('2', 'y')
dtype: bool
```

After converting column A to string type (A_str), the duplication checks behave as expected if we intended 1 and '1' to be treated identically. Always inspect your data types using df.info() or df.dtypes and perform any necessary conversions first.

How Missing Values (NaN) are Handled

This is a subtle but important point. How do .duplicated() and .drop_duplicates() treat np.nan (Not a Number) or None values?

By default, in Pandas:
* When comparing values within a column for duplication checks, NaN is considered equal to itself. So, two rows with NaN in the same checked column(s) (and matching values in other checked columns) can be considered duplicates.
* However, the plain comparison np.nan == np.nan returns False; Pandas handles this specially inside its duplication logic (a quick demonstration follows below).
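A quick way to see this special treatment, sketched on a small throwaway Series:

```python
# Plain float comparison: NaN is never equal to itself
print(np.nan == np.nan)   # False

# But .duplicated() treats repeated NaNs as matches
s = pd.Series([np.nan, np.nan, 1.0])
print(s.duplicated())     # index 1 is True: the second NaN counts as a duplicate
```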

Let’s look at df_variations again, focusing on rows 2 and 7 after cleaning the ‘Name’ column.

```python
# Recall df_variations after cleaning 'Name'
print("\ndf_variations with 'Name_clean' (rows 2 and 7):")
print(df_variations.loc[[2, 7], ['ID', 'Name_clean', 'Value']])

# Check duplication based on 'ID' and 'Name_clean' again
print("\nDuplicates based on ('ID', 'Name_clean') for rows 2, 7:")
print(df_variations.duplicated(subset=['ID', 'Name_clean'], keep=False).loc[[2, 7]])

# Check duplication based on the cleaned key plus 'Value', so we can see how NaN is matched
print("\nDuplicates based on ('ID', 'Name_clean', 'Value') considering NaNs (keep=False):")
print(df_variations.duplicated(subset=['ID', 'Name_clean', 'Value'], keep=False))

# Let's also create a dedicated NaN example
df_nan = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [np.nan, np.nan, 'x', 'x', np.nan]})
print("\nDataFrame with NaNs:")
print(df_nan)

print("\nDuplicates in df_nan based on 'A':")
print(df_nan.duplicated(subset=['A'], keep=False))

print("\nDuplicates in df_nan based on 'B':")
print(df_nan.duplicated(subset=['B'], keep=False))  # NaNs match each other, 'x' matches 'x'

print("\nFull row duplicates in df_nan:")
print(df_nan.duplicated(keep=False))  # (1, NaN) matches (1, NaN); (2, 'x') matches (2, 'x')
```

Output:
```
df_variations with 'Name_clean' (rows 2 and 7):
   ID Name_clean  Value
2   3    charlie    NaN
7   3    charlie    NaN

Duplicates based on ('ID', 'Name_clean') for rows 2, 7:
2    True
7    True
dtype: bool

Duplicates based on ('ID', 'Name_clean', 'Value') considering NaNs (keep=False):
0     True   # (1, 'alice', 100.0)
1     True   # (2, 'bob', 200.0)
2     True   # (3, 'charlie', NaN)  <-- matches row 7
3     True   # (1, 'alice', 100.0)
4    False
5     True   # (2, 'bob', 200.0)
6    False
7     True   # (3, 'charlie', NaN)  <-- matches row 2
dtype: bool

DataFrame with NaNs:
   A    B
0  1  NaN
1  1  NaN
2  2    x
3  2    x
4  3  NaN

Duplicates in df_nan based on 'A':
0     True
1     True
2     True
3     True
4    False
dtype: bool

Duplicates in df_nan based on 'B':
0     True   # NaN matches NaN
1     True   # NaN matches NaN
2     True   # 'x' matches 'x'
3     True   # 'x' matches 'x'
4     True   # NaN matches the NaNs at index 0 and 1
dtype: bool

Full row duplicates in df_nan:
0     True   # (1, NaN)
1     True   # (1, NaN)
2     True   # (2, 'x')
3     True   # (2, 'x')
4    False   # (3, NaN) is unique
dtype: bool
```

Key takeaway: when using .duplicated() or .drop_duplicates(), NaN values in the columns being checked are treated as equal to other NaN values within that check. This means the rows (1, NaN) and (1, NaN) are duplicates, and the rows (3, 'charlie', NaN) at index 2 and 7 in our cleaned df_variations are also duplicates.

7. Counting Duplicates

Often, you don’t just want to identify or remove duplicates, but also count how many there are.

Counting Total Duplicate Rows

The boolean Series returned by .duplicated() (with keep='first' or keep='last') directly tells you which rows would be dropped. Since True evaluates to 1 and False to 0 in numerical contexts, you can simply use .sum() on the result.

```python
# Count how many rows are duplicates (excluding the first occurrence)
num_duplicates_simple = df_simple.duplicated(keep='first').sum()
print(f"\nNumber of duplicate rows in df_simple (excluding first): {num_duplicates_simple}")

# Count how many rows would be dropped if keeping the last
num_duplicates_simple_keep_last = df_simple.duplicated(keep='last').sum()
print(f"Number of duplicate rows in df_simple (excluding last): {num_duplicates_simple_keep_last}")

# Count the total number of rows involved in ANY duplication
num_involved_in_duplicates = df_simple.duplicated(keep=False).sum()
print(f"Total number of rows involved in duplication in df_simple: {num_involved_in_duplicates}")
```

Output:
```
Number of duplicate rows in df_simple (excluding first): 2
Number of duplicate rows in df_simple (excluding last): 2
Total number of rows involved in duplication in df_simple: 4
```

Counting Duplicates per Group/Value

Sometimes you want to know which values are duplicated and how many times they appear. Pandas’ .value_counts() method on a Series or .groupby().size() on a DataFrame are useful here.

```python
# Using value_counts() on a Series to see frequencies
print("\nValue counts for df_simple['col_a']:")
print(df_simple['col_a'].value_counts())

# Filter value_counts to show only duplicated values
col_a_counts = df_simple['col_a'].value_counts()
print("\nDuplicated values in df_simple['col_a'] (occur > 1 time):")
print(col_a_counts[col_a_counts > 1])

# Using groupby() and size() for combinations of columns
print("\nCounts for combinations of ('col_a', 'col_b') in df_simple:")
# .size() counts all rows per group; .count() counts non-missing values per column
group_counts = df_simple.groupby(['col_a', 'col_b']).size()
print(group_counts)

print("\nDuplicated combinations in df_simple (occur > 1 time):")
print(group_counts[group_counts > 1])
```

Output:
```
Value counts for df_simple['col_a']:
A    2
B    2
C    1
D    1
Name: col_a, dtype: int64

Duplicated values in df_simple['col_a'] (occur > 1 time):
A    2
B    2
Name: col_a, dtype: int64

Counts for combinations of ('col_a', 'col_b') in df_simple:
col_a  col_b
A      1        2
B      2        2
C      3        1
D      4        1
dtype: int64

Duplicated combinations in df_simple (occur > 1 time):
col_a  col_b
A      1        2
B      2        2
dtype: int64
```

These methods help quantify the extent and nature of duplication based on specific columns or combinations.

8. Practical Workflow and Best Practices

Handling duplicates is rarely a single command. It typically involves exploration, decision-making, and verification. Here’s a recommended workflow:

  1. Understand Your Data & Uniqueness Constraints:

    • Load your data into a Pandas DataFrame.
    • Use .info() to check data types and non-null counts.
    • Use .head(), .sample(), and .describe() to get a feel for the values.
    • Crucially, determine which column(s) should uniquely identify a record. Is it an ID column? A combination of name, date, and location? This defines your subset.
  2. Initial Check for Full Duplicates:

    • Run df.duplicated().sum() to quickly see if any obvious full-row duplicates exist.
  3. Preprocessing (If Necessary):

    • Identify columns prone to formatting issues (strings, dates, potentially numbers read as strings).
    • Apply .str.strip(), .str.lower() (or .upper()), type conversions (.astype()), date parsing (pd.to_datetime) etc., to the relevant columns to standardize them. Create temporary cleaned columns if you want to preserve the originals initially.
  4. Identify Duplicates Based on Key Columns (subset):

    • Use df.duplicated(subset=key_columns, keep=False) to get a boolean Series marking all rows involved in duplication based on your chosen keys.
    • Filter the DataFrame using this boolean Series: df[df.duplicated(subset=key_columns, keep=False)].
    • Inspect these rows carefully. Why are they duplicates? Is it expected? Do they contain conflicting information in non-key columns? Sorting these rows (.sort_values(by=key_columns)) often helps visual inspection.
  5. Decide on a Strategy (keep):

    • Based on your inspection, decide which version of the duplicate records to keep.
      • keep='first': Keep the one that appeared earliest in the DataFrame. Often the default choice if there’s no better reason.
      • keep='last': Keep the most recent entry. Useful if newer data is presumed more accurate or relevant (e.g., based on a timestamp column you sorted by).
      • keep=False: Discard all records that have duplicates. Use if any ambiguity makes the record untrustworthy.
      • More complex logic: Sometimes you might need to group by the key columns and then apply custom logic (e.g., keep the row with the fewest missing values, or the highest value in a specific column) before dropping duplicates. This goes beyond basic .drop_duplicates(); see the sketch after this list.
  6. Apply .drop_duplicates():

    • Use df.drop_duplicates(subset=key_columns, keep=chosen_keep_strategy) to remove the duplicates.
    • Assign the result to a new DataFrame or overwrite the old one (carefully):
```python
df_deduplicated = df.drop_duplicates(subset=key_columns, keep=chosen_keep_strategy)
# OR (use with caution)
# df = df.drop_duplicates(subset=key_columns, keep=chosen_keep_strategy)
```
  7. Verify the Result:

    • Check the shape of the new DataFrame (df_deduplicated.shape) to see how many rows were removed.
    • Run the duplication check again on the resulting DataFrame to confirm duplicates are gone: df_deduplicated.duplicated(subset=key_columns).sum(). This should return 0.
    • Optionally, check value counts or unique counts on key columns again.
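For the “more complex logic” case mentioned in step 5, one common pattern is to rank the rows within each key group and keep the best candidate, for example the row with the fewest missing values. A minimal sketch, assuming a DataFrame df and a list key_columns as in the steps above:

```python
# Keep, within each key group, the row that has the fewest missing values.
# 'df' and 'key_columns' are assumed to be defined as in the workflow above.
df_deduplicated = (
    df.assign(_n_missing=df.isna().sum(axis=1))        # helper: NaN count per row
      .sort_values('_n_missing', kind='stable')        # best candidates first
      .drop_duplicates(subset=key_columns, keep='first')
      .drop(columns='_n_missing')
      .sort_index()                                    # restore the original row order
)
```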

Best Practices:

  • Don’t delete blindly: Always inspect duplicates before dropping, especially when using a subset. Use df[df.duplicated(subset=..., keep=False)] extensively.
  • Standardize first: Apply cleaning (case, whitespace, types) before checking for duplicates.
  • Document your choices: Record which columns you used as keys (subset) and why you chose a specific keep strategy. This is crucial for reproducibility.
  • Consider the source: Why did the duplicates occur? Can the data entry or collection process be improved?
  • Prefer returning new DataFrames: Avoid inplace=True unless you have a specific reason and understand the implications.
  • Backup original data: Keep a copy of the raw data before performing cleaning operations.

9. Putting It All Together: A More Complex Example

Let’s simulate a slightly more realistic scenario. Imagine a dataset of user registrations where duplicates might arise from multiple sign-up attempts or data entry errors.

```python
# Sample user registration data. Duplicate sign-ups may share an email address
# but differ in formatting, plan, or timestamp.
# (The email addresses are illustrative placeholders.)
data_users = {
    'Timestamp': pd.to_datetime(['2023-01-10 09:00', '2023-01-10 09:05', '2023-01-11 10:00',
                                 '2023-01-11 10:05', '2023-01-12 11:30', '2023-01-12 11:31',
                                 '2023-01-13 14:00', '2023-01-13 14:05', '2023-01-14 15:00']),
    'UserID_Input': [' user1 ', 'User1', ' user2', 'user3 ', 'User2', 'user2 ', 'user4', 'user4', 'User5'],
    'Email': ['user1@example.com', 'USER1@example.com', ' user2@example.com ', 'user3@example.com',
              'user2@example.com', 'user2@example.com', 'user4@example.com', 'User4@Example.com',
              'user5@example.com'],
    'Plan': ['Free', 'Premium', 'Premium', 'Free', 'Premium', 'Premium', 'Pro', 'Pro', 'Free'],
    'Country': ['USA', 'USA', 'Canada', 'UK', ' CA ', 'Canada', 'USA', 'usa', 'UK']
}
df_users = pd.DataFrame(data_users)

print("--- Original User Data ---")
print(df_users)
print(f"\nShape: {df_users.shape}")
print("\nInfo:")
df_users.info()

# --- Step 1: Understand Data & Uniqueness ---
# Potential keys: UserID_Input, Email, or maybe a combination.
# Timestamp might indicate the first/last entry.
# Country has formatting issues.

# --- Step 2: Initial Check (Full Duplicates) ---
print(f"\nInitial full row duplicates: {df_users.duplicated().sum()}")  # Likely 0 due to Timestamp

# --- Step 3: Preprocessing ---
print("\n--- Preprocessing ---")
df_users['UserID'] = df_users['UserID_Input'].str.strip().str.lower()
df_users['Email_Clean'] = df_users['Email'].str.strip().str.lower()
df_users['Country_Clean'] = df_users['Country'].str.strip().str.upper()

# Display the cleaned columns alongside the originals for clarity
print("\nData after cleaning UserID, Email, Country:")
print(df_users[['Timestamp', 'UserID_Input', 'UserID', 'Email', 'Email_Clean',
                'Country', 'Country_Clean', 'Plan']])

# --- Step 4: Identify Duplicates Based on Keys ---
# Let's assume Email should be unique.
key_columns = ['Email_Clean']
print(f"\n--- Identifying Duplicates Based On: {key_columns} ---")

duplicates_bool = df_users.duplicated(subset=key_columns, keep=False)
print("\nBoolean Series (keep=False):")
print(duplicates_bool)

print("\nRows involved in Email duplication:")
# Sort by email and timestamp to see related rows together
print(df_users[duplicates_bool].sort_values(by=['Email_Clean', 'Timestamp']))

# --- Step 5: Decide on Strategy ---
# Looking at the duplicates for 'user1@example.com' (rows 0, 1),
# 'user2@example.com' (rows 2, 4, 5) and 'user4@example.com' (rows 6, 7):
# - For user1, row 1 has Plan='Premium' vs 'Free' in row 0. Maybe keep the latest one?
# - For user2, rows 2 and 5 are essentially the same apart from Timestamp and whitespace;
#   row 4 has a different UserID input (and Country written as ' CA ') but the same email.
#   Keeping the latest timestamp seems reasonable.
# - For user4, rows 6 and 7 are almost identical, apart from timestamp and Country case. Keep the latest.
# Strategy: keep the 'last' entry based on Timestamp for each Email.
# We should sort by Timestamp before dropping duplicates so that 'last'
# reliably means the latest registration time.

# --- Step 6 & 7: Apply drop_duplicates & Verify ---
print("\n--- Applying drop_duplicates (keep='last') after sorting by Timestamp ---")

# Sort by Timestamp (ascending) so 'last' keeps the latest entry for each email
df_users_sorted = df_users.sort_values(by='Timestamp')

df_users_deduped = df_users_sorted.drop_duplicates(subset=key_columns, keep='last')

print("\nDataFrame after deduplication:")
print(df_users_deduped[['Timestamp', 'UserID', 'Email_Clean', 'Plan', 'Country_Clean']])
print(f"\nShape after deduplication: {df_users_deduped.shape}")

# Verification
print(f"\nRemaining duplicates based on {key_columns}: "
      f"{df_users_deduped.duplicated(subset=key_columns).sum()}")  # Should be 0

# Check UserID duplicates in the cleaned data (optional)
print(f"\nDuplicates based on UserID in final data: "
      f"{df_users_deduped.duplicated(subset=['UserID']).sum()}")
# Note: we might still have UserID duplicates if different emails were used.

# Final clean-up (optional: drop the intermediate columns)
df_final = df_users_deduped.drop(columns=['UserID_Input', 'Email', 'Country'])
print("\n--- Final Cleaned Data ---")
print(df_final)
```

Explanation of the Example:

  1. Load & Inspect: We loaded the data and used .info() to see types (noting object types for strings).
  2. Preprocessing: We identified UserID_Input, Email, and Country as needing cleaning. We applied .str.strip() and case conversion (.lower() or .upper()) to create standardized UserID, Email_Clean, and Country_Clean columns.
  3. Identify: We decided Email_Clean should be the unique key. We used df.duplicated(subset=['Email_Clean'], keep=False) and filtered the DataFrame to inspect all rows associated with duplicate emails. Sorting by Email_Clean and Timestamp helped group them visually.
  4. Strategize: We observed different plans or user IDs associated with the same email. We decided that keeping the latest registration (based on Timestamp) for each email was a reasonable strategy (keep='last'). To ensure ‘last’ refers to the latest time, we first sorted the DataFrame by Timestamp.
  5. Apply & Verify: We applied drop_duplicates(subset=['Email_Clean'], keep='last') to the sorted DataFrame. We checked the shape before and after, and verified that the number of duplicates based on Email_Clean in the resulting DataFrame was zero. We also performed an optional check for remaining UserID duplicates (which might be acceptable if a user can register multiple emails). Finally, we dropped the original messy columns.

This example demonstrates the iterative nature: clean -> identify -> inspect -> strategize -> apply -> verify.

10. Conclusion

Duplicate data is a common challenge in data analysis, but Pandas provides straightforward and powerful tools to manage it. We’ve explored the core functions:

  • .duplicated(subset=None, keep='first'): Identifies duplicate rows based on all columns or a specified subset, returning a boolean Series. The keep parameter ('first', 'last', False) controls which occurrences are marked as True. Essential for inspection.
  • .drop_duplicates(subset=None, keep='first', inplace=False): Removes duplicate rows, returning a new DataFrame (unless inplace=True). Uses the same subset and keep logic to determine which rows to remove and which to retain.

We also highlighted the critical importance of preprocessing steps like handling case sensitivity (.str.lower(), .str.upper()), removing whitespace (.str.strip()), and ensuring consistent data types (.astype()) before applying duplication checks. Understanding how missing values (NaN) are treated (they match each other) is also key.
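As a compact recap, the core pattern from this guide fits in a few lines; the DataFrame and column names below are placeholders for whatever identifies a record in your own data:

```python
# Standardize the text key, deduplicate on it, then verify (placeholder names)
df['name_clean'] = df['name'].str.strip().str.lower()
df = df.drop_duplicates(subset=['id', 'name_clean'], keep='first')
assert df.duplicated(subset=['id', 'name_clean']).sum() == 0
```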

By following a structured workflow – understanding uniqueness, cleaning data, inspecting duplicates thoroughly, choosing a sensible keep strategy, applying drop_duplicates, and verifying the results – beginners can confidently tackle duplicate data issues and significantly improve the quality and reliability of their analyses.

Mastering duplicate handling is a fundamental skill in data cleaning, paving the way for more accurate insights and robust data-driven applications. Keep practicing with different datasets and scenarios!
