Mastering Nested JSON in Pandas: A Deep Dive into json_normalize
In the modern data landscape, JSON (JavaScript Object Notation) has become the lingua franca for data exchange, especially for web APIs, configuration files, and NoSQL databases. Its human-readable text format and flexible hierarchical structure make it incredibly versatile. However, this very flexibility, particularly its ability to represent nested structures (dictionaries within dictionaries, lists within dictionaries, etc.), can pose a challenge when trying to load and analyze this data in a tabular format, like a Pandas DataFrame.
Pandas, the cornerstone library for data manipulation and analysis in Python, excels with two-dimensional, labeled data structures. Directly loading deeply nested JSON into a DataFrame often results in columns containing complex objects (lists or dictionaries) rather than the desired scalar values. This hinders straightforward analysis, filtering, and aggregation.
Fortunately, Pandas provides a powerful and elegant solution specifically designed for this problem: `pandas.json_normalize`. This function is purpose-built to “flatten” semi-structured JSON data into a flat table (a DataFrame), making it readily accessible for standard Pandas operations.
This article provides a comprehensive guide to understanding and utilizing `pd.json_normalize`. We will cover:
- The Challenge of Nested JSON: Understanding why nested structures are problematic for DataFrames.
- Introduction to `pd.json_normalize`: Basic usage and core concepts.
- Key Parameters: A detailed exploration of `record_path`, `meta`, `sep`, `errors`, and `max_level`.
- Handling Different Nesting Patterns: Examples covering various common JSON structures.
- Practical Use Cases: Applying `json_normalize` to real-world scenarios like API responses.
- Advanced Techniques and Considerations: Error handling, performance, preprocessing, and dealing with multiple record paths.
- Putting It All Together: A comprehensive, step-by-step example.
By the end of this article, you will have a thorough understanding of how to effectively tame nested JSON data and transform it into analysis-ready Pandas DataFrames using `json_normalize`.
1. The Challenge of Nested JSON
Before diving into the solution, let’s clearly define the problem. Consider a typical JSON response you might get from an API representing information about users and their posts:
```json
[
{
"userId": 1,
"userInfo": {
"name": "Alice",
"email": "[email protected]",
"address": {
"street": "123 Maple St",
"city": "Wonderland",
"zipcode": "12345"
}
},
"posts": [
{
"postId": 101,
"title": "Adventures in Data",
"tags": ["python", "pandas", "json"]
},
{
"postId": 102,
"title": "Flattening the World",
"tags": ["pandas", "json_normalize"]
}
]
},
{
"userId": 2,
"userInfo": {
"name": "Bob",
"email": "[email protected]",
"address": {
"street": "456 Oak Ave",
"city": "Codeville",
"zipcode": "67890"
}
},
"posts": [
{
"postId": 201,
"title": "API Integrations",
"tags": ["python", "requests", "api"]
}
]
}
]
```
This JSON structure contains multiple levels of nesting:
- The main structure is a list `[...]` of user objects.
- Each user object is a dictionary `{...}`.
- The `userInfo` key maps to another dictionary.
- Within `userInfo`, the `address` key maps to yet another dictionary.
- The `posts` key maps to a list of dictionaries, where each dictionary represents a post.
- Within each post dictionary, the `tags` key maps to a list of strings.
If we try to load this directly into a Pandas DataFrame:
```python
import pandas as pd
import json

data = [
    {
        "userId": 1,
        "userInfo": {
            "name": "Alice",
            "email": "[email protected]",
            "address": {
                "street": "123 Maple St",
                "city": "Wonderland",
                "zipcode": "12345"
            }
        },
        "posts": [
            {
                "postId": 101,
                "title": "Adventures in Data",
                "tags": ["python", "pandas", "json"]
            },
            {
                "postId": 102,
                "title": "Flattening the World",
                "tags": ["pandas", "json_normalize"]
            }
        ]
    },
    {
        "userId": 2,
        "userInfo": {
            "name": "Bob",
            "email": "[email protected]",
            "address": {
                "street": "456 Oak Ave",
                "city": "Codeville",
                "zipcode": "67890"
            }
        },
        "posts": [
            {
                "postId": 201,
                "title": "API Integrations",
                "tags": ["python", "requests", "api"]
            }
        ]
    }
]

# Load directly into DataFrame
df_naive = pd.DataFrame(data)
print(df_naive)
```
The output would look something like this:
userId userInfo posts
0 1 {'name': 'Alice', 'email': '[email protected]... [{'postId': 101, 'title': 'Adventures in Data...
1 2 {'name': 'Bob', 'email': '[email protected]', '... [{'postId': 201, 'title': 'API Integrations',...
As you can see, the `userInfo` and `posts` columns contain Python dictionaries and lists, respectively. While Pandas can store these objects, it’s not ideal for analysis. How would you easily filter users based on their city? How would you count the total number of posts across all users? How would you analyze post tags? These tasks become cumbersome: you’d need to apply custom functions or iterate through these object columns, which defeats the purpose of using a highly optimized library like Pandas.
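To make the pain concrete, here is roughly what those tasks look like against `df_naive`; this is ordinary Pandas on the DataFrame built above, just awkward:

```python
# Awkward: reach into the nested 'userInfo' dicts by hand just to filter by city
wonderland_users = df_naive[
    df_naive["userInfo"].apply(lambda info: info["address"]["city"] == "Wonderland")
]

# Equally awkward: count posts by iterating over an object column
total_posts = df_naive["posts"].apply(len).sum()
print(total_posts)  # 3
```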
This is precisely the problem `json_normalize` solves: it unpacks these nested structures into separate columns, creating a flat, tabular representation suitable for analysis.
2. Introduction to `pd.json_normalize`
`pandas.json_normalize` takes a dictionary or a list of dictionaries as input and normalizes the semi-structured JSON data into a flat table.
Basic Syntax:
```python
pd.json_normalize(
data,
record_path=None,
meta=None,
meta_prefix=None, # Less commonly used, adds prefix to meta columns
record_prefix=None, # Less commonly used, adds prefix to record columns
errors='raise',
sep='.',
max_level=None
)
```
Let’s start with the simplest cases.
Case 1: List of Flat Dictionaries
If your data is already a list of flat dictionaries (no nesting), `json_normalize` behaves similarly to the standard `pd.DataFrame` constructor.
```python
flat_data = [
    {'colA': 1, 'colB': 'apple'},
    {'colA': 2, 'colB': 'banana'},
    {'colA': 3, 'colB': 'cherry'}
]

df_flat = pd.json_normalize(flat_data)
print(df_flat)
```
Output:
colA colB
0 1 apple
1 2 banana
2 3 cherry
Case 2: List of Dictionaries with One Level of Nesting
Now, let’s introduce a single level of nesting, similar to the `userInfo` part of our earlier example.
```python
nested_data_simple = [
    {
        'id': 1,
        'details': {'name': 'Alice', 'age': 30}
    },
    {
        'id': 2,
        'details': {'name': 'Bob', 'age': 25}
    }
]

df_nested_simple = pd.json_normalize(nested_data_simple)
print(df_nested_simple)
```
Output:
id details.name details.age
0 1 Alice 30
1 2 Bob 25
Notice how `json_normalize` automatically handled the nested `details` dictionary. It flattened the structure by creating new column names using the parent key (`details`) followed by the nested key (`name` or `age`), separated by a dot (`.`). This `.` separator is the default, but it can be changed using the `sep` parameter.
This default behavior works well for dictionaries nested within dictionaries. However, it doesn’t automatically unpack lists within dictionaries, like the `posts` field in our original example. This is where the `record_path` and `meta` parameters become crucial.
3. Key Parameters of `json_normalize`
Let’s delve into the most important parameters that give `json_normalize` its power and flexibility.
3.1 `record_path`: Navigating to the List of Records
This is arguably the most critical parameter for handling common nested JSON structures, especially API responses where a list of results is embedded within a larger JSON object.
- Purpose: Specifies the path (key or sequence of keys) within each dictionary in the input `data` that leads to the list of records you want to unpack. Each element in this target list will become a row in the resulting DataFrame.
- Type: Can be a string (for a single key) or a list of strings (for nested keys).
Let’s revisit our initial complex example, focusing on unpacking the `posts` list. Suppose we want a DataFrame where each post is a row.
```python
data = [  # Same data as before
    {
        "userId": 1, "userInfo": { … },
        "posts": [ { "postId": 101, … }, { "postId": 102, … } ]
    },
    {
        "userId": 2, "userInfo": { … },
        "posts": [ { "postId": 201, … } ]
    }
]

# Normalize based on the 'posts' list
df_posts = pd.json_normalize(data, record_path='posts')
print(df_posts)
```
Output:
postId title tags
0 101 Adventures in Data [python, pandas, json]
1 102 Flattening the World [pandas, json_normalize]
2 201 API Integrations [python, requests, api]
Observe what happened:
1. `json_normalize` looked inside each dictionary in the top-level `data` list.
2. It accessed the key specified by `record_path='posts'`.
3. It found a list of dictionaries at that path.
4. It iterated through each dictionary within the `posts` lists (across all users) and created a new row in the DataFrame for each post dictionary.
5. The columns of the resulting DataFrame (`postId`, `title`, `tags`) are the keys from the dictionaries found inside the `posts` lists.

Nested `record_path`: If the list you want to flatten is deeper within the structure, provide a list of keys.
Imagine the JSON was structured like this:
```python
data_deep_records = [
{
"requestId": "req-001",
"results": {
"status": "success",
"items": [
{"itemId": "A", "value": 10},
{"itemId": "B", "value": 20}
]
}
},
{
"requestId": "req-002",
"results": {
"status": "success",
"items": [
{"itemId": "C", "value": 30}
]
}
}
]
```
To get a DataFrame where each item is a row, the `record_path` needs to navigate through `results` and then `items`:
```python
df_deep_items = pd.json_normalize(data_deep_records, record_path=['results', 'items'])
print(df_deep_items)
```
Output:
itemId value
0 A 10
1 B 20
2 C 30
3.2 `meta`: Including Metadata from Parent Levels
When using `record_path`, the default output only includes columns from the dictionaries within the specified list. Often, however, you also need information from the parent levels associated with those records. For instance, in our user/post example, when we flattened the `posts`, we lost the `userId` information. Which user made which post?
The `meta` parameter solves this.
- Purpose: Specifies keys from the parent levels (outside the `record_path`) whose values should be included in the final DataFrame. These values will be broadcast (repeated) across the rows generated from the corresponding `record_path` list.
- Type: A list of strings or lists of strings (for nested metadata keys).
Let’s combine `record_path` and `meta` to get a DataFrame of posts that also includes the `userId` and the user’s `name` and `email`.
```python
data = [  # Same data as before
    {
        "userId": 1,
        "userInfo": { "name": "Alice", "email": "[email protected]", "address": { … } },
        "posts": [ { "postId": 101, … }, { "postId": 102, … } ]
    },
    {
        "userId": 2,
        "userInfo": { "name": "Bob", "email": "[email protected]", "address": { … } },
        "posts": [ { "postId": 201, … } ]
    }
]

# Normalize posts, including userId and user details from parent levels
df_posts_with_meta = pd.json_normalize(
    data,
    record_path='posts',
    meta=[
        'userId',
        ['userInfo', 'name'],   # Nested meta key
        ['userInfo', 'email']   # Nested meta key
    ]
)
print(df_posts_with_meta)
```
Output:
postId title tags userId userInfo.name userInfo.email
0 101 Adventures in Data [python, pandas, json] 1 Alice [email protected]
1 102 Flattening the World [pandas, json_normalize] 1 Alice [email protected]
2 201 API Integrations [python, requests, api] 2 Bob [email protected]
Now, this DataFrame is much more useful!
1. Each row still represents a single post (due to `record_path='posts'`).
2. We have the columns from the post dictionaries (`postId`, `title`, `tags`).
3. We also have the columns specified by `meta`:
   * `userId`: The top-level user ID.
   * `userInfo.name`: The user’s name, accessed via the nested path `['userInfo', 'name']`.
   * `userInfo.email`: The user’s email, accessed via `['userInfo', 'email']`.
4. Notice how the `userId`, `userInfo.name`, and `userInfo.email` values are repeated for posts belonging to the same user (e.g., `userId` 1 for posts 101 and 102). This is the broadcasting effect of `meta`.
5. Also note the column naming for nested meta keys: `userInfo.name` and `userInfo.email`, using the default `.` separator.

Important: The paths provided in `meta` are relative to the top level of each dictionary in the input `data`, not relative to the `record_path`.
3.3 `sep`: Customizing the Column Name Separator
By default, `json_normalize` uses a dot (`.`) to separate keys when flattening nested dictionaries (e.g., `userInfo.name`, `userInfo.address.city`).
- Purpose: Allows you to specify a different separator for constructing flattened column names.
- Type: String.
Sometimes, dots in column names can be inconvenient (e.g., if you plan to use query methods that interpret dots, or if you prefer snake_case). You can change the separator using `sep`.
```python
# Using the previous example, but with '_' as separator
df_posts_with_meta_sep = pd.json_normalize(
    data,
    record_path='posts',
    meta=[
        'userId',
        ['userInfo', 'name'],
        ['userInfo', 'email'],
        ['userInfo', 'address', 'city']  # Adding city for demonstration
    ],
    sep='_'  # Use underscore as separator
)
print(df_posts_with_meta_sep)
```
Output:
postId title tags userId userInfo_name userInfo_email userInfo_address_city
0 101 Adventures in Data [python, pandas, json] 1 Alice [email protected] Wonderland
1 102 Flattening the World [pandas, json_normalize] 1 Alice [email protected] Wonderland
2 201 API Integrations [python, requests, api] 2 Bob [email protected] Codeville
Now the columns are named `userInfo_name`, `userInfo_email`, and `userInfo_address_city`, which might be preferable in some contexts.
3.4 `errors`: Handling Missing Keys
Real-world JSON data is often messy. Structures might be inconsistent; expected keys might be missing in some records. The `errors` parameter controls how `json_normalize` behaves when it encounters missing keys specified in `record_path` or `meta`.
- Purpose: Define the strategy for handling missing keys during normalization.
- Type: String, either `'raise'` (default) or `'ignore'`.
- `errors='raise'` (default): If a key specified in `record_path` or `meta` is missing in any of the input dictionaries, a `KeyError` (or sometimes a `TypeError`, if the path expects a dict/list but finds something else) is raised, and the normalization process stops. This is useful for ensuring data consistency and catching structural problems early.
- `errors='ignore'`: If a key is missing, `json_normalize` will skip that key for that specific record and proceed. The corresponding value in the resulting DataFrame will typically be `NaN` (or `None`). This is useful when you expect some optional fields and want the normalization to complete even with inconsistencies.
Let’s modify our data slightly to introduce missing keys:
```python
data_missing = [
    {  # Alice has user info and posts
        "userId": 1,
        "userInfo": { "name": "Alice", "email": "[email protected]" },
        "posts": [ { "postId": 101, "title": "Post A" } ]
    },
    {  # Bob is missing the 'userInfo' key entirely
        "userId": 2,
        # "userInfo": { ... },  # Missing!
        "posts": [ { "postId": 201, "title": "Post B" } ]
    },
    {  # Charlie has userInfo but is missing the 'posts' key
        "userId": 3,
        "userInfo": { "name": "Charlie", "email": "[email protected]" },
        # "posts": [ ... ]  # Missing!
    },
    {  # Dave has posts, but one post is missing 'title' (won't affect meta/record path errors)
        "userId": 4,
        "userInfo": { "name": "Dave", "email": "[email protected]" },
        "posts": [ { "postId": 401 } ]  # Missing title
    }
]

# Attempt 1: Default behavior (errors='raise')
try:
    df_missing_raise = pd.json_normalize(
        data_missing,
        record_path='posts',
        meta=['userId', ['userInfo', 'name']]
    )
    print(df_missing_raise)
except KeyError as e:
    print(f"\nCaught KeyError: {e}")  # Expecting error due to missing 'userInfo' for Bob

try:
    df_missing_raise_posts = pd.json_normalize(
        data_missing,
        record_path='posts',  # Will fail for Charlie who is missing 'posts'
        meta=['userId']
    )
    print(df_missing_raise_posts)
except TypeError as e:  # Often TypeError if path expects list but gets None
    print(f"\nCaught TypeError related to missing 'posts': {e}")

# Attempt 2: Ignore errors (errors='ignore')
df_missing_ignore = pd.json_normalize(
    data_missing,
    record_path='posts',
    meta=['userId', ['userInfo', 'name']],  # Path ['userInfo', 'name'] is missing for Bob
    errors='ignore'  # Set errors to 'ignore'
)
print("\nDataFrame with errors='ignore':")
print(df_missing_ignore)
```
Output:
```
Caught KeyError: 'userInfo'
Caught TypeError related to missing 'posts': 'NoneType' object is not iterable

DataFrame with errors='ignore':
  postId title userId userInfo.name
0 101 Post A 1 Alice
1 201 Post B 2 NaN    # Missing userInfo.name handled gracefully
2 401 NaN 4 Dave      # Missing 'title' within record results in NaN
```
Explanation:
- With `errors='raise'`, the first attempt fails because user Bob (`userId` 2) is missing the `userInfo` key, which is needed to extract `['userInfo', 'name']`. The second attempt fails because Charlie (`userId` 3) is missing the `posts` key specified in `record_path`.
- With `errors='ignore'`, the normalization proceeds:
  - For user Bob (`userId` 2), the `userInfo.name` column gets `NaN` because the path `['userInfo', 'name']` couldn’t be resolved.
  - User Charlie (`userId` 3) is skipped entirely in the output because the `record_path` (`'posts'`) didn’t exist for him, so that record simply doesn’t contribute any rows. Be aware that the pandas documentation describes `errors='ignore'` as applying to `meta` keys; depending on your pandas version, a record whose `record_path` key is missing may instead raise a `KeyError`, in which case you need to preprocess the data first (for example, default a missing `posts` field to an empty list).
  - For user Dave (`userId` 4), the post is missing the `title` key. This is handled within the record flattening itself (not by the `errors` parameter applied to `meta` or `record_path` lookups) – the resulting `title` column gets `NaN` for that row.

Choose `errors='ignore'` carefully, as it can mask underlying data quality issues. It’s often best to start with `errors='raise'` during development to understand the structure and inconsistencies, then switch to `'ignore'` if missing optional fields are expected and acceptable.
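If you want normalization to survive records where the `record_path` key is absent regardless of pandas version, a small preprocessing pass is the safest route. A minimal sketch against the `data_missing` list above; the assumption is simply that a missing `posts` field should behave like an empty list:

```python
# Ensure every record has a 'posts' list so record_path always resolves
for record in data_missing:
    record.setdefault("posts", [])

df_all_posts = pd.json_normalize(
    data_missing,
    record_path="posts",
    meta=["userId", ["userInfo", "name"]],
    errors="ignore",  # still needed for Bob's missing 'userInfo'
)
print(df_all_posts)
```

A record whose `posts` list is empty contributes no rows, but it no longer risks stopping the whole normalization.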
3.5 `max_level`: Limiting the Depth of Normalization
Sometimes, your JSON is deeply nested, but you only want to flatten it up to a certain level, leaving deeper structures intact within the DataFrame columns.
- Purpose: Control the maximum depth of nesting to flatten.
- Type: Integer.
Consider this deeply nested example:
```python
data_deep_levels = [
    {
        'level1': {
            'id': 'L1_A',
            'level2': {
                'id': 'L2_A',
                'level3': {
                    'value': 100,
                    'config': {'enabled': True, 'mode': 'auto'}
                }
            }
        }
    }
]

# Flatten completely (default behavior, equivalent to a large max_level)
df_deep_full = pd.json_normalize(data_deep_levels)
print("Full Flattening:")
print(df_deep_full)

# Flatten only up to level 1
# Note: max_level=0 means only the first-level keys are used;
# max_level=1 means flatten one level down from the top.
df_deep_level1 = pd.json_normalize(data_deep_levels, max_level=1)
print("\nFlattening with max_level=1:")
print(df_deep_level1)

# Flatten only up to level 2
df_deep_level2 = pd.json_normalize(data_deep_levels, max_level=2)
print("\nFlattening with max_level=2:")
print(df_deep_level2)
```
Output:
```
Full Flattening:
  level1.id level1.level2.id level1.level2.level3.value level1.level2.level3.config.enabled level1.level2.level3.config.mode
0 L1_A L2_A 100 True auto

Flattening with max_level=1:
  level1.id level1.level2
0 L1_A {'id': 'L2_A', 'level3': {'value': 100, 'conf...

Flattening with max_level=2:
  level1.id level1.level2.id level1.level2.level3
0 L1_A L2_A {'value': 100, 'config': {'enabled': True, 'm...
```
- With no `max_level` (or a sufficiently high value), everything is flattened down to scalar values, resulting in long column names like `level1.level2.level3.config.enabled`.
- With `max_level=1`, only the keys directly under `level1` are potentially expanded. Since `level1.level2` is still a dictionary, it remains as an object in the column `level1.level2`.
- With `max_level=2`, flattening proceeds one step further. `level1.level2.id` becomes a column, but `level1.level2.level3` remains a dictionary object because it’s at the third level of nesting relative to the top.

`max_level` is useful when you want a semi-flattened structure, perhaps for performance reasons or because further flattening isn’t necessary for your immediate analysis goals.
4. Handling Different Nesting Patterns with Examples
Let’s solidify understanding with more targeted examples covering common JSON structures.
Example 1: Simple Dictionary (Not a List)
What if your input `data` is a single JSON object, not a list?
```python
single_record = {
    "event_id": "evt_123",
    "timestamp": "2023-10-27T10:00:00Z",
    "payload": {
        "type": "user_login",
        "user": {"id": "u_abc", "role": "admin"},
        "ip_address": "192.168.1.100"
    },
    "metadata": {"source": "web", "region": "us-east-1"}
}

# Normalizing a single dictionary
df_single = pd.json_normalize(single_record)
print(df_single)
```
Output:
event_id timestamp payload.type payload.user.id payload.user.role payload.ip_address metadata.source metadata.region
0 evt_123 2023-10-27T10:00:00Z user_login u_abc admin 192.168.1.100 web us-east-1
`json_normalize` handles a single dictionary by creating a DataFrame with one row and flattening all nested dictionaries within it.
Example 2: List of Records Nested Under a Single Key
This is a very common API response pattern, where metadata surrounds a list of results.
```python
api_response = {
    "query": "pandas",
    "page": 1,
    "total_results": 150,
    "results": [
        { "id": 1, "title": "Pandas Intro", "author": {"name": "Alice", "org": "DataCo"} },
        { "id": 2, "title": "Advanced Pandas", "author": {"name": "Bob", "org": "Analytics Inc"} },
        { "id": 3, "title": "Pandas Visualization", "author": {"name": "Charlie", "org": "DataCo"} }
    ]
}

# Flatten the 'results' list, bringing 'query' and 'page' as metadata
df_api = pd.json_normalize(
    api_response,
    record_path='results',
    meta=['query', 'page'],
    sep='_'  # Use underscore separator
)
print(df_api)
```
Output:
id title author_name author_org query page
0 1 Pandas Intro Alice DataCo pandas 1
1 2 Advanced Pandas Bob Analytics Inc pandas 1
2 3 Pandas Visualization Charlie DataCo pandas 1
Here, `record_path='results'` targets the list we want to turn into rows. `meta=['query', 'page']` brings in top-level information relevant to all results on this page. The nested `author` dictionary within the records is automatically flattened to `author_name` and `author_org` using the `_` separator.
Example 3: JSON with Lists that Aren’t the Main Records
Sometimes you have lists within your data that you don’t necessarily want to explode into separate rows immediately, like the `tags` in our initial example.
```python
data_with_lists = [
    {
        "productId": "P100",
        "name": "Laptop",
        "specs": {"cpu": "i7", "ram_gb": 16},
        "stores": ["Store A", "Store B"],
        "ratings": [
            {"user": "User1", "score": 5},
            {"user": "User2", "score": 4}
        ]
    },
    {
        "productId": "P200",
        "name": "Keyboard",
        "specs": {"cpu": None, "ram_gb": None},  # Assuming specs irrelevant for keyboard
        "stores": ["Store A", "Store C"],
        "ratings": [
            {"user": "User1", "score": 4},
            {"user": "User3", "score": 5}
        ]
    }
]

# Simple normalization – lists remain as objects
df_lists_basic = pd.json_normalize(data_with_lists, sep='_')
print("Basic Normalization (Lists as Objects):")
print(df_lists_basic)
print("\nData type of 'stores' column:", df_lists_basic['stores'].dtype)
print("Data type of 'ratings' column:", df_lists_basic['ratings'].dtype)

# If we wanted each rating as a row (less common for this structure)
df_ratings = pd.json_normalize(
    data_with_lists,
    record_path='ratings',
    meta=['productId', 'name', ['specs', 'cpu'], 'stores'],
    sep='_'
)
print("\nRatings as Rows:")
print(df_ratings)
```
Output:
```
Basic Normalization (Lists as Objects):
  productId name stores ratings specs_cpu specs_ram_gb
0 P100 Laptop [Store A, Store B] [{'user': 'User1', 'score': 5}, {'user': 'User... i7 16.0
1 P200 Keyboard [Store A, Store C] [{'user': 'User1', 'score': 4}, {'user': 'User... None NaN

Data type of 'stores' column: object
Data type of 'ratings' column: object

Ratings as Rows:
  user score productId name specs_cpu stores
0 User1 5 P100 Laptop i7 [Store A, Store B]
1 User2 4 P100 Laptop i7 [Store A, Store B]
2 User1 4 P200 Keyboard None [Store A, Store C]
3 User3 5 P200 Keyboard None [Store A, Store C]
```
- The first normalization (`df_lists_basic`) flattens the `specs` dictionary but leaves the `stores` (list of strings) and `ratings` (list of dictionaries) columns as Python objects. This might be perfectly acceptable if you plan to process these columns later (e.g., using `.explode()` or `.apply()`, as shown in the snippet after this list).
- The second normalization (`df_ratings`) demonstrates using `record_path='ratings'` to make each individual rating a separate row, bringing product information along via `meta`. Notice how the `stores` list is repeated for each rating of the same product. Choose the approach based on your analytical goal.
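For instance, if you later decide each store should get its own row, `.explode()` on the object column does the job without re-normalizing:

```python
# Give each store its own row; the other columns are simply repeated
df_stores = df_lists_basic.explode("stores").rename(columns={"stores": "store"})
print(df_stores[["productId", "name", "store"]])
```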
5. Practical Use Cases
The primary driver for `json_normalize` is often dealing with data from external sources.
Use Case 1: Processing API Responses
Web APIs frequently return data in nested JSON format. Imagine querying a weather API:
```python
weather_api_response = {
    "query_location": "London, UK",
    "request_time": "2023-10-27T12:00:00Z",
    "forecast": [
        {
            "date": "2023-10-28",
            "day": { "maxtemp_c": 15.0, "mintemp_c": 8.0, "condition": {"text": "Partly cloudy", "code": 1003} },
            "astro": { "sunrise": "07:30", "sunset": "18:00" }
        },
        {
            "date": "2023-10-29",
            "day": { "maxtemp_c": 14.0, "mintemp_c": 7.5, "condition": {"text": "Sunny", "code": 1000} },
            "astro": { "sunrise": "07:32", "sunset": "17:58" }
        }
    ]
}

# Normalize the forecast data
df_weather = pd.json_normalize(
    weather_api_response,
    record_path='forecast',
    meta=['query_location', 'request_time'],
    sep='_'
)
print(df_weather)
```
Output:
date day_maxtemp_c day_mintemp_c day_condition_text day_condition_code astro_sunrise astro_sunset query_location request_time
0 2023-10-28 15.0 8.0 Partly cloudy 1003 07:30 18:00 London, UK 2023-10-27T12:00:00Z
1 2023-10-29 14.0 7.5 Sunny 1000 07:32 17:58 London, UK 2023-10-27T12:00:00Z
This instantly transforms the nested forecast into a clean table, ready for time series analysis or comparison, including the location and request time as context.
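In practice the JSON usually arrives straight from an HTTP call rather than a hard-coded dictionary. A minimal sketch using the `requests` library; the endpoint URL and query parameters here are hypothetical placeholders, and the response body is assumed to have the same shape as `weather_api_response` above:

```python
import requests

# Hypothetical endpoint; the real URL, parameters, and auth depend on your provider
url = "https://api.example.com/v1/forecast"
response = requests.get(url, params={"q": "London, UK", "days": 2}, timeout=10)
response.raise_for_status()

payload = response.json()  # a dict shaped like weather_api_response
df_weather = pd.json_normalize(
    payload,
    record_path='forecast',
    meta=['query_location', 'request_time'],
    sep='_'
)
```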
Use Case 2: Reading Nested Configuration Files
JSON is often used for application configuration. While usually read directly by applications, sometimes you might want to analyze or compare configurations across different environments stored in JSON files.
```json
// config_prod.json
{
    "environment": "production",
    "database": {
        "type": "postgres",
        "host": "prod-db.example.com",
        "port": 5432,
        "credentials": {"user": "prod_user"}  // Secret handled separately!
    },
    "features": {
        "new_dashboard": true,
        "email_alerts": {
            "enabled": true,
            "level": "critical"
        }
    }
}

// config_staging.json
{
    "environment": "staging",
    "database": {
        "type": "postgres",
        "host": "staging-db.example.com",
        "port": 5432,
        "credentials": {"user": "staging_user"}
    },
    "features": {
        "new_dashboard": true,
        "email_alerts": {
            "enabled": false,
            "level": "debug"  // Different level
        }
    }
}
```
```python
import json

# Assume files are loaded into dictionaries
with open('config_prod.json') as f:
    config_prod = json.load(f)
with open('config_staging.json') as f:
    config_staging = json.load(f)

configs = [config_prod, config_staging]

# Normalize the list of configurations
df_configs = pd.json_normalize(configs, sep='_')
print(df_configs)
```
Output:
environment database_type database_host database_port database_credentials_user features_new_dashboard features_email_alerts_enabled features_email_alerts_level
0 production postgres prod-db.example.com 5432 prod_user True True critical
1 staging postgres staging-db.example.com 5432 staging_user True False debug
This makes it easy to compare settings side-by-side, for instance, checking differences in feature flags or database hosts.
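A quick way to eyeball those differences is to index by environment and transpose, so each flattened setting becomes a row; this is ordinary Pandas applied to the `df_configs` frame built above:

```python
# One column per environment, one row per flattened setting
comparison = df_configs.set_index("environment").T
print(comparison)

# Keep only the settings that actually differ between environments
diffs = comparison[comparison.nunique(axis=1) > 1]
print(diffs)
```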
Use Case 3: Data from NoSQL Databases (e.g., MongoDB)
Documents retrieved from NoSQL databases like MongoDB are often complex, nested JSON-like structures. `json_normalize` is invaluable for preparing this data for analysis in Pandas.
```python
# Simulated data structure from a MongoDB collection
mongo_docs = [
    {
        "_id": "ObjectId('653bb...')",
        "orderId": "ORD1001",
        "customer": {"id": "CUST01", "name": "Alice"},
        "items": [
            {"sku": "SKU001", "qty": 2, "price": 10.50},
            {"sku": "SKU002", "qty": 1, "price": 25.00}
        ],
        "timestamp": "ISODate('2023-10-27T...')",
        "shipping": {"method": "standard", "address": {"city": "Wonderland"}}
    },
    {
        "_id": "ObjectId('653bc...')",
        "orderId": "ORD1002",
        "customer": {"id": "CUST02", "name": "Bob"},
        "items": [
            {"sku": "SKU003", "qty": 5, "price": 5.00}
        ],
        "timestamp": "ISODate('2023-10-27T...')",
        "shipping": {"method": "express", "address": {"city": "Codeville"}}
    }
]

# Normalize the items list, bringing order and customer info
df_order_items = pd.json_normalize(
    mongo_docs,
    record_path='items',
    meta=[
        'orderId',
        ['customer', 'id'],
        ['customer', 'name'],
        'timestamp',
        ['shipping', 'method'],
        ['shipping', 'address', 'city']
    ],
    meta_prefix='order_',  # Example: Add prefix to meta columns
    sep='_'
)
print(df_order_items)
```
Output:
sku qty price order_orderId order_customer_id order_customer_name order_timestamp order_shipping_method order_shipping_address_city
0 SKU001 2 10.5 ORD1001 CUST01 Alice ISODate('2023-10-27T...') standard Wonderland
1 SKU002 1 25.0 ORD1001 CUST01 Alice ISODate('2023-10-27T...') standard Wonderland
2 SKU003 5 5.0 ORD1002 CUST02 Bob ISODate('2023-10-27T...') express Codeville
This flattens the order items into individual rows, associating each item with its corresponding order details, customer information, and shipping details. The `meta_prefix` parameter was used here to demonstrate adding a prefix (`order_`) to all columns derived from the `meta` parameter, which can help organize columns if many metadata fields are included.
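With the items flattened, ordinary aggregations become one-liners. For example, computing line totals and revenue per order from the `df_order_items` frame above:

```python
# Revenue per order: quantity times unit price, summed per order id
df_order_items["line_total"] = df_order_items["qty"] * df_order_items["price"]
print(df_order_items.groupby("order_orderId")["line_total"].sum())
# order_orderId
# ORD1001    46.0
# ORD1002    25.0
```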
6. Advanced Techniques and Considerations
While `json_normalize` covers many scenarios, sometimes you need additional steps or awareness of certain limitations.
Preprocessing JSON:
- **JSON as Strings:** Occasionally, nested JSON might be stored as a string within a JSON field. You’ll need to parse this inner string first using `json.loads`.

  ```python
  data_str = [{'id': 1, 'payload': '{"value": 10, "status": "ok"}'}]

  # Preprocess the 'payload' column
  for record in data_str:
      if isinstance(record.get('payload'), str):
          record['payload'] = json.loads(record['payload'])

  df_preprocessed = pd.json_normalize(data_str)
  print(df_preprocessed)

  # Output:
  #    id  payload.value payload.status
  # 0   1             10             ok
  ```

- **Inconsistent Types:** If a field sometimes contains a dictionary and sometimes a scalar (or is missing), `json_normalize` might raise errors (if `errors='raise'`) or produce columns with mixed types or NaNs (if `errors='ignore'`). Preprocessing might involve standardizing these fields (e.g., ensuring a field is always a dictionary, even if empty `{}`); a sketch of this follows the list below.
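Here is a small sketch of that kind of standardization. The `extra` field and the sample records are hypothetical, standing in for any field that is sometimes a dict, sometimes a bare scalar, and sometimes absent:

```python
raw = [
    {"id": 1, "extra": {"note": "ok"}},
    {"id": 2, "extra": "legacy string"},
    {"id": 3},
]

# Standardize so 'extra' is always a dictionary before normalizing
for record in raw:
    value = record.get("extra")
    if not isinstance(value, dict):
        record["extra"] = {"note": value} if value is not None else {}

df_standardized = pd.json_normalize(raw)
print(df_standardized)  # columns: id, extra.note
```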
Handling Multiple, Independent Lists:
`json_normalize` with `record_path` is designed to flatten one primary list structure per input record into rows. What if a single JSON object contains multiple lists you want to flatten independently and then relate?
Example: A document with both `comments` and `revisions`.
```python
doc = {
"docId": "D1",
"content": "Some text.",
"comments": [ {"user": "Alice", "text": "Good point"}, {"user": "Bob", "text": "Needs citation"} ],
"revisions": [ {"revId": 1, "ts": "..."}, {"revId": 2, "ts": "..."} ]
}
```
You cannot flatten both `comments` and `revisions` into rows simultaneously in a single `json_normalize` call while keeping them correctly associated only with `docId` “D1”. `json_normalize` would try to create combinations if you somehow forced it.
The typical approach is:
1. Normalize each list separately, bringing in the common identifier (`docId`) using `meta`.
2. Optionally, merge or combine these resulting DataFrames if needed (see the summary example after the output below), though often analyzing them separately is sufficient.
```python
# Normalize comments
df_comments = pd.json_normalize(doc, record_path='comments', meta=['docId'])

# Normalize revisions
df_revisions = pd.json_normalize(doc, record_path='revisions', meta=['docId'])

print("Comments:")
print(df_comments)
print("\nRevisions:")
print(df_revisions)
```
Output:
```
Comments:
  user text docId
0 Alice Good point D1
1 Bob Needs citation D1

Revisions:
  revId ts docId
0 1 ... D1
1 2 ... D1
```
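If you later need a single per-document view, one option is to summarize each DataFrame and then combine the summaries, rather than joining the raw rows (which would produce a comment × revision cross-product):

```python
# Summarize each list per document, then combine the summaries
comment_counts = df_comments.groupby("docId").size().rename("n_comments")
revision_counts = df_revisions.groupby("docId").size().rename("n_revisions")

df_doc_summary = pd.concat([comment_counts, revision_counts], axis=1).reset_index()
print(df_doc_summary)
#   docId  n_comments  n_revisions
# 0    D1           2            2
```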
Memory Usage and Performance:
Flattening extremely large or deeply nested JSON objects can consume significant memory, as you are potentially creating many new columns and rows.
- `max_level`: Use `max_level` if you don’t need full flattening.
- Selective `meta`: Only include necessary metadata fields in the `meta` parameter.
- Chunking: If processing a very large JSON file (e.g., newline-delimited JSON), consider reading and normalizing it in chunks (a sketch follows this list).
- Alternative Libraries: For extremely large files or performance-critical ETL pipelines, libraries like `dask` (which can parallelize Pandas operations, including potentially custom JSON flattening) or specialized data processing engines might be necessary. However, for most common use cases involving API responses or moderately sized files, `json_normalize` is efficient enough.
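As a sketch of the chunking idea, here is one way to process a newline-delimited JSON file in batches; the file name `events.ndjson` and the batch size are hypothetical, and the only assumption about the file is that each line holds one JSON object:

```python
import json
import pandas as pd

def normalize_ndjson(path, batch_size=10_000):
    """Read newline-delimited JSON in batches and normalize each batch."""
    frames, batch = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                batch.append(json.loads(line))
            if len(batch) >= batch_size:
                frames.append(pd.json_normalize(batch))
                batch = []
    if batch:  # don't forget the final partial batch
        frames.append(pd.json_normalize(batch))
    return pd.concat(frames, ignore_index=True)

# df_events = normalize_ndjson("events.ndjson")
```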
Alternatives to `json_normalize`:
While powerful, `json_normalize` isn’t the only way:
- Manual Iteration and DataFrame Construction: You could write custom Python loops to traverse the JSON, extract the data you need, collect it into lists or dictionaries, and then create a DataFrame. This offers maximum flexibility but is significantly more verbose, error-prone, and usually less performant than `json_normalize`.
- `.apply(pd.Series)`: For columns that contain dictionaries after an initial `pd.DataFrame` load, you can sometimes use `df['col_with_dict'].apply(pd.Series)` to expand that dictionary into new columns (see the snippet after this list). This needs to be done column by column and can be less efficient than letting `json_normalize` handle the whole structure at once. It also doesn’t handle the `record_path` and `meta` concepts directly.
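For completeness, here is what the `.apply(pd.Series)` route looks like on the `nested_data_simple` example from Section 2. Note that, unlike `json_normalize`, the expanded columns are not prefixed with the parent key, so name collisions are possible:

```python
df_raw = pd.DataFrame(nested_data_simple)

# Expand the 'details' dict column into separate columns, then stitch the frame back together
details_expanded = df_raw["details"].apply(pd.Series)
df_expanded = pd.concat([df_raw.drop(columns="details"), details_expanded], axis=1)
print(df_expanded)
#    id   name  age
# 0   1  Alice   30
# 1   2    Bob   25
```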
In most scenarios involving flattening nested JSON, `pd.json_normalize` provides the best balance of power, conciseness, and performance within the Pandas ecosystem.
7. Putting It All Together: A Comprehensive Example
Let’s tackle a more complex JSON structure that combines several features we’ve discussed. Imagine an API response for a complex order system.
```python
order_data = [
{
"order_id": "ORD-2023-A001",
"customer_details": {
"id": "CUST_01",
"contact": {"name": "Alice", "email": "[email protected]"},
"address": {
"street": "1 Tech Park",
"city": "Dataville",
"country": "DV"
}
},
"items": [
{
"sku": "SKU001",
"description": "Widget Type A",
"quantity": 2,
"pricing": {"unit_price": 10.0, "currency": "USD"}
},
{
"sku": "SKU002",
"description": "Gadget Type B",
"quantity": 1,
"pricing": {"unit_price": 25.5, "currency": "USD"}
}
],
"shipments": [
{
"shipment_id": "SHP_A1",
"carrier": "FastShip",
"tracking_url": "http://track.fastship.co/SHP_A1",
"status": "delivered"
}
],
"order_notes": [{"ts": "t1", "note": "Urgent order"}],
"metadata": {"source": "web", "processed_by": "worker-01"}
},
{
"order_id": "ORD-2023-B002",
"customer_details": {
"id": "CUST_02",
"contact": {"name": "Bob", "email": "[email protected]"},
"address": {
"street": "2 Analysis Ave",
"city": "Analytic City",
"country": "AC"
}
},
"items": [
{
"sku": "SKU003",
"description": "Component X",
"quantity": 5,
"pricing": {"unit_price": 5.2, "currency": "EUR"}
}
],
"shipments": [], # Empty list
"order_notes": None, # Missing notes
"metadata": {"source": "api", "processed_by": "worker-02"}
}
]
```
Goal: Create a DataFrame where each item in an order is a row. Include the `order_id`, customer name, customer city, shipment count (as an example of handling another list), and metadata source.
Step 1: Identify `record_path`
We want each item to be a row, so the path points to the list of items: `record_path='items'`.
Step 2: Identify `meta`
We need information from outside the `items` list:
* `order_id` (top-level)
* Customer name: `['customer_details', 'contact', 'name']`
* Customer city: `['customer_details', 'address', 'city']`
* Shipment count: This isn’t directly available as a key. We’ll need to calculate it after normalization, or during preprocessing if it is absolutely needed in the `meta`. For simplicity here, let’s bring the `shipments` list itself into the metadata and process it later. Path: `shipments`.
* Metadata source: `['metadata', 'source']`
Step 3: Choose `sep` and `errors`
Let’s use `_` as the separator (`sep='_'`). Since `order_notes` might be missing or `shipments` might be empty, we should use `errors='ignore'` to prevent crashes if paths don’t resolve (though in this specific `meta` selection, only a missing `metadata` or `customer_details` would cause issues if `'raise'` were used).
Step 4: Execute `json_normalize`
```python
df_order_items_complex = pd.json_normalize(
    order_data,
    record_path='items',
    meta=[
        'order_id',
        ['customer_details', 'contact', 'name'],
        ['customer_details', 'address', 'city'],
        'shipments',  # Bring the whole list for now
        ['metadata', 'source']
    ],
    sep='_',
    errors='ignore'  # Handle potential missing paths like order_notes (though not used in meta here)
)
print("Initial Normalized DataFrame:")
print(df_order_items_complex)
```
Output:
Initial Normalized DataFrame:
sku description quantity pricing_unit_price pricing_currency order_id customer_details_contact_name customer_details_address_city shipments metadata_source
0 SKU001 Widget Type A 2 10.0 USD ORD-2023-A001 Alice Dataville [{'shipment_id': 'SHP_A1', 'carrier': 'FastSh... web
1 SKU002 Gadget Type B 1 25.5 USD ORD-2023-A001 Alice Dataville [{'shipment_id': 'SHP_A1', 'carrier': 'FastSh... web
2 SKU003 Component X 5 5.2 EUR ORD-2023-B002 Bob Analytic City [] api
Step 5: Post-processing (Optional but often needed)
The DataFrame is flattened, but the `shipments` column still contains lists (or NaNs if a record was missing the `shipments` key entirely). We can now easily process this using standard Pandas:
```python
# Calculate shipment count from the 'shipments' column
df_order_items_complex['shipment_count'] = df_order_items_complex['shipments'].apply(
    lambda x: len(x) if isinstance(x, list) else 0
)

# Drop the original list column if no longer needed
df_order_items_complex = df_order_items_complex.drop(columns=['shipments'])

# Display the final DataFrame
print("\nFinal Processed DataFrame:")
print(df_order_items_complex)
```
Output:
Final Processed DataFrame:
sku description quantity pricing_unit_price pricing_currency order_id customer_details_contact_name customer_details_address_city metadata_source shipment_count
0 SKU001 Widget Type A 2 10.0 USD ORD-2023-A001 Alice Dataville web 1
1 SKU002 Gadget Type B 1 25.5 USD ORD-2023-A001 Alice Dataville web 1
2 SKU003 Component X 5 5.2 EUR ORD-2023-B002 Bob Analytic City api 0
Now we have a clean, flat DataFrame where each row is an order item, linked to relevant order and customer details, and includes a calculated `shipment_count`. This complex transformation was made significantly easier by `json_normalize`.
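As a quick sanity check of what the flat shape buys you, standard aggregations now read naturally, for example the total item value per currency:

```python
# Total item value per currency across all orders
df_order_items_complex["line_value"] = (
    df_order_items_complex["quantity"] * df_order_items_complex["pricing_unit_price"]
)
print(df_order_items_complex.groupby("pricing_currency")["line_value"].sum())
# pricing_currency
# EUR    26.0
# USD    45.5
```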
8. Conclusion
Working with nested JSON data is a common task in data analysis and preparation, particularly when dealing with APIs and semi-structured data sources. While nested structures are flexible for data representation, they don’t map directly to the tabular format expected by Pandas DataFrames.
`pandas.json_normalize` emerges as a purpose-built, powerful, and efficient tool to bridge this gap. By understanding its core parameters – `record_path` for targeting lists to explode into rows, `meta` for including contextual data from parent levels, `sep` for controlling column naming, `errors` for managing inconsistencies, and `max_level` for limiting flattening depth – you can effectively transform complex, hierarchical JSON into analysis-ready, flat DataFrames.
We’ve explored its application from simple nested dictionaries to complex API-like responses and NoSQL document structures. We’ve also discussed practical considerations like preprocessing, handling multiple lists, performance, and alternatives.
Mastering `json_normalize` significantly streamlines the process of ingesting and preparing JSON data, allowing you to spend less time wrestling with data structures and more time deriving insights using the rich analytical capabilities of the Pandas library. It’s an essential function for any data professional working with Python and diverse data sources. The next time you encounter a challenging nested JSON, remember `pd.json_normalize` – your key to flattening the complexities.