Pandas: Working with Nested JSON using json_normalize


Mastering Nested JSON in Pandas: A Deep Dive into json_normalize

In the modern data landscape, JSON (JavaScript Object Notation) has become the lingua franca for data exchange, especially for web APIs, configuration files, and NoSQL databases. Its human-readable text format and flexible hierarchical structure make it incredibly versatile. However, this very flexibility, particularly its ability to represent nested structures (dictionaries within dictionaries, lists within dictionaries, etc.), can pose a challenge when trying to load and analyze this data in a tabular format, like a Pandas DataFrame.

Pandas, the cornerstone library for data manipulation and analysis in Python, excels with two-dimensional, labeled data structures. Directly loading deeply nested JSON into a DataFrame often results in columns containing complex objects (lists or dictionaries) rather than the desired scalar values. This hinders straightforward analysis, filtering, and aggregation.

Fortunately, Pandas provides a powerful and elegant solution specifically designed for this problem: pandas.json_normalize. This function is purpose-built to “flatten” semi-structured JSON data into a flat table (a DataFrame), making it readily accessible for standard Pandas operations.

This article provides a comprehensive guide to understanding and utilizing pd.json_normalize. We will cover:

  1. The Challenge of Nested JSON: Understanding why nested structures are problematic for DataFrames.
  2. Introduction to pd.json_normalize: Basic usage and core concepts.
  3. Key Parameters: A detailed exploration of record_path, meta, sep, errors, and max_level.
  4. Handling Different Nesting Patterns: Examples covering various common JSON structures.
  5. Practical Use Cases: Applying json_normalize to real-world scenarios like API responses.
  6. Advanced Techniques and Considerations: Error handling, performance, preprocessing, and dealing with multiple record paths.
  7. Putting It All Together: A comprehensive, step-by-step example.

By the end of this article, you will have a thorough understanding of how to effectively tame nested JSON data and transform it into analysis-ready Pandas DataFrames using json_normalize.

1. The Challenge of Nested JSON

Before diving into the solution, let’s clearly define the problem. Consider a typical JSON response you might get from an API representing information about users and their posts:

```json
[
  {
    "userId": 1,
    "userInfo": {
      "name": "Alice",
      "email": "alice@example.com",
      "address": {
        "street": "123 Maple St",
        "city": "Wonderland",
        "zipcode": "12345"
      }
    },
    "posts": [
      {
        "postId": 101,
        "title": "Adventures in Data",
        "tags": ["python", "pandas", "json"]
      },
      {
        "postId": 102,
        "title": "Flattening the World",
        "tags": ["pandas", "json_normalize"]
      }
    ]
  },
  {
    "userId": 2,
    "userInfo": {
      "name": "Bob",
      "email": "bob@example.com",
      "address": {
        "street": "456 Oak Ave",
        "city": "Codeville",
        "zipcode": "67890"
      }
    },
    "posts": [
      {
        "postId": 201,
        "title": "API Integrations",
        "tags": ["python", "requests", "api"]
      }
    ]
  }
]
```

This JSON structure contains multiple levels of nesting:

  • The main structure is a list [...] of user objects.
  • Each user object is a dictionary {...}.
  • The userInfo key maps to another dictionary.
  • Within userInfo, the address key maps to yet another dictionary.
  • The posts key maps to a list of dictionaries, where each dictionary represents a post.
  • Within each post dictionary, the tags key maps to a list of strings.

If we try to load this directly into a Pandas DataFrame:

```python
import pandas as pd
import json

data = [
    {
        "userId": 1,
        "userInfo": {
            "name": "Alice",
            "email": "alice@example.com",
            "address": {
                "street": "123 Maple St",
                "city": "Wonderland",
                "zipcode": "12345"
            }
        },
        "posts": [
            {
                "postId": 101,
                "title": "Adventures in Data",
                "tags": ["python", "pandas", "json"]
            },
            {
                "postId": 102,
                "title": "Flattening the World",
                "tags": ["pandas", "json_normalize"]
            }
        ]
    },
    {
        "userId": 2,
        "userInfo": {
            "name": "Bob",
            "email": "bob@example.com",
            "address": {
                "street": "456 Oak Ave",
                "city": "Codeville",
                "zipcode": "67890"
            }
        },
        "posts": [
            {
                "postId": 201,
                "title": "API Integrations",
                "tags": ["python", "requests", "api"]
            }
        ]
    }
]

# Load directly into a DataFrame
df_naive = pd.DataFrame(data)
print(df_naive)
```

The output would look something like this:

userId userInfo posts
0 1 {'name': 'Alice', 'email': 'alice@example.com... [{'postId': 101, 'title': 'Adventures in Data...
1 2 {'name': 'Bob', 'email': 'bob@example.com', '... [{'postId': 201, 'title': 'API Integrations',...

As you can see, the userInfo and posts columns contain Python dictionaries and lists, respectively. While Pandas can store these objects, it’s not ideal for analysis. How would you easily filter users based on their city? How would you count the total number of posts across all users? How would you analyze post tags? These tasks become cumbersome. You’d need to apply custom functions or iterate through these object columns, which defeats the purpose of using a highly optimized library like Pandas.
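For illustration, here is a rough sketch of the kind of manual unpacking this forces on you, using the df_naive DataFrame above. It works, but every question requires reaching into object columns by hand:

```python
# Which users live in "Wonderland"? Reach into each nested dict manually.
in_wonderland = df_naive[
    df_naive["userInfo"].apply(lambda u: u.get("address", {}).get("city") == "Wonderland")
]

# How many posts are there in total? Measure each embedded list.
total_posts = df_naive["posts"].apply(len).sum()
print(total_posts)  # 3 for the sample data above
```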

This is precisely the problem json_normalize solves: it unpacks these nested structures into separate columns, creating a flat, tabular representation suitable for analysis.

2. Introduction to pd.json_normalize

pandas.json_normalize takes a dictionary or a list of dictionaries as input and normalizes the semi-structured JSON data into a flat table.

Basic Syntax:

```python
pd.json_normalize(
    data,
    record_path=None,
    meta=None,
    meta_prefix=None,    # Less commonly used; adds a prefix to meta columns
    record_prefix=None,  # Less commonly used; adds a prefix to record columns
    errors='raise',
    sep='.',
    max_level=None
)
```

Let’s start with the simplest cases.

Case 1: List of Flat Dictionaries

If your data is already a list of flat dictionaries (no nesting), json_normalize behaves similarly to the standard pd.DataFrame constructor.

```python
flat_data = [
    {'colA': 1, 'colB': 'apple'},
    {'colA': 2, 'colB': 'banana'},
    {'colA': 3, 'colB': 'cherry'}
]

df_flat = pd.json_normalize(flat_data)
print(df_flat)
```

Output:

colA colB
0 1 apple
1 2 banana
2 3 cherry

Case 2: List of Dictionaries with One Level of Nesting

Now, let’s introduce a single level of nesting, similar to the userInfo part of our earlier example.

```python
nested_data_simple = [
    {
        'id': 1,
        'details': {'name': 'Alice', 'age': 30}
    },
    {
        'id': 2,
        'details': {'name': 'Bob', 'age': 25}
    }
]

df_nested_simple = pd.json_normalize(nested_data_simple)
print(df_nested_simple)
```

Output:

id details.name details.age
0 1 Alice 30
1 2 Bob 25

Notice how json_normalize automatically handled the nested details dictionary. It flattened the structure by creating new column names using the parent key (details) followed by the nested key (name or age), separated by a dot (.). This . separator is the default, but it can be changed using the sep parameter.

This default behavior works well for dictionaries nested within dictionaries. However, it doesn’t automatically unpack lists within dictionaries, like the posts field in our original example. This is where the record_path and meta parameters become crucial.

3. Key Parameters of json_normalize

Let’s delve into the most important parameters that give json_normalize its power and flexibility.

3.1 record_path: Navigating to the List of Records

This is arguably the most critical parameter for handling common nested JSON structures, especially API responses where a list of results is embedded within a larger JSON object.

  • Purpose: Specifies the path (key or sequence of keys) within each dictionary in the input data that leads to the list of records you want to unpack. Each element in this target list will become a row in the resulting DataFrame.
  • Type: Can be a string (for a single key) or a list of strings (for nested keys).

Let’s revisit our initial complex example, focusing on unpacking the posts list. Suppose we want a DataFrame where each post is a row.

```python
data = [  # Same data as before
    {
        "userId": 1, "userInfo": { ... },
        "posts": [ { "postId": 101, ... }, { "postId": 102, ... } ]
    },
    {
        "userId": 2, "userInfo": { ... },
        "posts": [ { "postId": 201, ... } ]
    }
]

# Normalize based on the 'posts' list
df_posts = pd.json_normalize(data, record_path='posts')
print(df_posts)
```

Output:

postId title tags
0 101 Adventures in Data [python, pandas, json]
1 102 Flattening the World [pandas, json_normalize]
2 201 API Integrations [python, requests, api]

Observe what happened:
1. json_normalize looked inside each dictionary in the top-level data list.
2. It accessed the key specified by record_path='posts'.
3. It found a list of dictionaries at that path.
4. It iterated through each dictionary within the posts lists (across all users) and created a new row in the DataFrame for each post dictionary.
5. The columns of the resulting DataFrame (postId, title, tags) are the keys from the dictionaries found inside the posts lists.

Nested record_path: If the list you want to flatten is deeper within the structure, provide a list of keys.

Imagine the JSON was structured like this:

```python
data_deep_records = [
    {
        "requestId": "req-001",
        "results": {
            "status": "success",
            "items": [
                {"itemId": "A", "value": 10},
                {"itemId": "B", "value": 20}
            ]
        }
    },
    {
        "requestId": "req-002",
        "results": {
            "status": "success",
            "items": [
                {"itemId": "C", "value": 30}
            ]
        }
    }
]
```

To get a DataFrame where each item is a row, the record_path needs to navigate through results and then items:

```python
df_deep_items = pd.json_normalize(data_deep_records, record_path=['results', 'items'])
print(df_deep_items)
```

Output:

itemId value
0 A 10
1 B 20
2 C 30

3.2 meta: Including Metadata from Parent Levels

When using record_path, the default output only includes columns from the dictionaries within the specified list. Often, however, you also need information from the parent levels associated with those records. For instance, in our user/post example, when we flattened the posts, we lost the userId information. Which user made which post?

The meta parameter solves this.

  • Purpose: Specifies keys from the parent levels (outside the record_path) whose values should be included in the final DataFrame. These values will be broadcast (repeated) across the rows generated from the corresponding record_path list.
  • Type: A list of strings or lists of strings (for nested metadata keys).

Let’s combine record_path and meta to get a DataFrame of posts that also includes the userId and the user’s name and email.

```python
data = [  # Same data as before
    {
        "userId": 1,
        "userInfo": { "name": "Alice", "email": "alice@example.com", "address": { ... } },
        "posts": [ { "postId": 101, ... }, { "postId": 102, ... } ]
    },
    {
        "userId": 2,
        "userInfo": { "name": "Bob", "email": "bob@example.com", "address": { ... } },
        "posts": [ { "postId": 201, ... } ]
    }
]

# Normalize posts, including userId and user details from parent levels
df_posts_with_meta = pd.json_normalize(
    data,
    record_path='posts',
    meta=[
        'userId',
        ['userInfo', 'name'],   # Nested meta key
        ['userInfo', 'email']   # Nested meta key
    ]
)
print(df_posts_with_meta)
```

Output:

postId title tags userId userInfo.name userInfo.email
0 101 Adventures in Data [python, pandas, json] 1 Alice alice@example.com
1 102 Flattening the World [pandas, json_normalize] 1 Alice alice@example.com
2 201 API Integrations [python, requests, api] 2 Bob bob@example.com

Now, this DataFrame is much more useful!
1. Each row still represents a single post (due to record_path='posts').
2. We have the columns from the post dictionaries (postId, title, tags).
3. We also have columns specified by meta:
* userId: The top-level user ID.
* userInfo.name: The user’s name, accessed via the nested path ['userInfo', 'name'].
* userInfo.email: The user’s email, accessed via ['userInfo', 'email'].
4. Notice how the userId, userInfo.name, and userInfo.email values are repeated for posts belonging to the same user (e.g., userId 1 for posts 101 and 102). This is the broadcasting effect of meta.
5. Also note the column naming for nested meta keys: userInfo.name and userInfo.email, using the default . separator.

Important: The paths provided in meta are relative to the top level of each dictionary in the input data, not relative to the record_path.

3.3 sep: Customizing the Column Name Separator

By default, json_normalize uses a dot (.) to separate keys when flattening nested dictionaries (e.g., userInfo.name, userInfo.address.city).

  • Purpose: Allows you to specify a different separator for constructing flattened column names.
  • Type: String.

Sometimes, dots in column names can be inconvenient (e.g., if you plan to use query methods that interpret dots, or if you prefer snake_case). You can change the separator using sep.

```python
# Using the previous example, but with '_' as the separator
df_posts_with_meta_sep = pd.json_normalize(
    data,
    record_path='posts',
    meta=[
        'userId',
        ['userInfo', 'name'],
        ['userInfo', 'email'],
        ['userInfo', 'address', 'city']   # Adding city for demonstration
    ],
    sep='_'   # Use underscore as separator
)
print(df_posts_with_meta_sep)
```

Output:

postId title tags userId userInfo_name userInfo_email userInfo_address_city
0 101 Adventures in Data [python, pandas, json] 1 Alice alice@example.com Wonderland
1 102 Flattening the World [pandas, json_normalize] 1 Alice alice@example.com Wonderland
2 201 API Integrations [python, requests, api] 2 Bob bob@example.com Codeville

Now the columns are named userInfo_name, userInfo_email, and userInfo_address_city, which might be preferable in some contexts.

3.4 errors: Handling Missing Keys

Real-world JSON data is often messy. Structures might be inconsistent; expected keys might be missing in some records. The errors parameter controls how json_normalize behaves when it encounters missing keys specified in record_path or meta.

  • Purpose: Define the strategy for handling missing keys during normalization.
  • Type: String, either 'raise' (default) or 'ignore'.

  • errors='raise' (Default): If a key specified in record_path or meta is missing in any of the input dictionaries, a KeyError (or sometimes TypeError if the path expects a dict/list but finds something else) is raised, and the normalization process stops. This is useful for ensuring data consistency and catching structural problems early.

  • errors='ignore': If a key is missing, json_normalize will skip that key for that specific record and proceed. The corresponding value in the resulting DataFrame will typically be NaN (or None). This is useful when you expect some optional fields and want the normalization to complete even with inconsistencies.

Let’s modify our data slightly to introduce missing keys:

```python
data_missing = [
    {   # Alice has user info and posts
        "userId": 1,
        "userInfo": { "name": "Alice", "email": "alice@example.com" },
        "posts": [ { "postId": 101, "title": "Post A" } ]
    },
    {   # Bob is missing the 'userInfo' key entirely
        "userId": 2,
        # "userInfo": { ... },  # Missing!
        "posts": [ { "postId": 201, "title": "Post B" } ]
    },
    {   # Charlie has userInfo but is missing the 'posts' key
        "userId": 3,
        "userInfo": { "name": "Charlie", "email": "charlie@example.com" },
        # "posts": [ ... ]  # Missing!
    },
    {   # Dave has posts, but one post is missing 'title' (won't affect meta/record_path errors)
        "userId": 4,
        "userInfo": { "name": "Dave", "email": "dave@example.com" },
        "posts": [ { "postId": 401 } ]  # Missing title
    }
]

# Attempt 1: Default behavior (errors='raise')
try:
    df_missing_raise = pd.json_normalize(
        data_missing,
        record_path='posts',
        meta=['userId', ['userInfo', 'name']]
    )
    print(df_missing_raise)
except KeyError as e:
    print(f"\nCaught KeyError: {e}")  # Expecting an error due to missing 'userInfo' for Bob

try:
    df_missing_raise_posts = pd.json_normalize(
        data_missing,
        record_path='posts',  # Will fail for Charlie, who is missing 'posts'
        meta=['userId']
    )
    print(df_missing_raise_posts)
except TypeError as e:  # Often a TypeError if the path expects a list but finds something else
    print(f"\nCaught TypeError related to missing 'posts': {e}")

# Attempt 2: Ignore errors (errors='ignore')
df_missing_ignore = pd.json_normalize(
    data_missing,
    record_path='posts',
    meta=['userId', ['userInfo', 'name']],  # Path ['userInfo', 'name'] is missing for Bob
    errors='ignore'
)
print("\nDataFrame with errors='ignore':")
print(df_missing_ignore)
```

Output:

```
Caught KeyError: 'userInfo'

Caught TypeError related to missing 'posts': 'NoneType' object is not iterable

DataFrame with errors='ignore':
  postId title userId userInfo.name
0 101 Post A 1 Alice
1 201 Post B 2 NaN    # Missing userInfo.name handled gracefully
2 401 NaN 4 Dave      # Missing 'title' within the record results in NaN
```

Explanation:

  • With errors='raise', the first attempt fails because user Bob (userId 2) is missing the userInfo key, which is needed to extract ['userInfo', 'name']. The second attempt fails because Charlie (userId 3) is missing the posts key specified in record_path.
  • With errors='ignore', the normalization proceeds:
    • For user Bob (userId 2), the userInfo.name column gets NaN because the path ['userInfo', 'name'] couldn’t be resolved.
    • User Charlie (userId 3) is skipped entirely in the output because the record_path (‘posts’) didn’t exist for him. If record_path is missing and errors='ignore', that record simply doesn’t contribute any rows.
    • For user Dave (userId 4), the post is missing the title key. This is handled within the record flattening itself (not by the errors parameter applied to meta or record_path lookups) – the resulting title column gets NaN for that row.

Choose errors='ignore' carefully, as it can mask underlying data quality issues. It’s often best to start with errors='raise' during development to understand the structure and inconsistencies, then switch to 'ignore' if missing optional fields are expected and acceptable.

3.5 max_level: Limiting the Depth of Normalization

Sometimes, your JSON is deeply nested, but you only want to flatten it up to a certain level, leaving deeper structures intact within the DataFrame columns.

  • Purpose: Control the maximum depth of nesting to flatten.
  • Type: Integer.

Consider this deeply nested example:

```python
data_deep_levels = [
    {
        'level1': {
            'id': 'L1_A',
            'level2': {
                'id': 'L2_A',
                'level3': {
                    'value': 100,
                    'config': {'enabled': True, 'mode': 'auto'}
                }
            }
        }
    }
]

# Flatten completely (default behavior, equivalent to a large max_level)
df_deep_full = pd.json_normalize(data_deep_levels)
print("Full Flattening:")
print(df_deep_full)

# Flatten only one level down from the top.
# Note: max_level=0 flattens only the first-level keys;
# max_level=1 flattens one level down from the top.
df_deep_level1 = pd.json_normalize(data_deep_levels, max_level=1)
print("\nFlattening with max_level=1:")
print(df_deep_level1)

# Flatten up to level 2
df_deep_level2 = pd.json_normalize(data_deep_levels, max_level=2)
print("\nFlattening with max_level=2:")
print(df_deep_level2)
```

Output:

```
Full Flattening:
  level1.id level1.level2.id level1.level2.level3.value level1.level2.level3.config.enabled level1.level2.level3.config.mode
0 L1_A L2_A 100 True auto

Flattening with max_level=1:
  level1.id level1.level2
0 L1_A {'id': 'L2_A', 'level3': {'value': 100, 'conf...

Flattening with max_level=2:
  level1.id level1.level2.id level1.level2.level3
0 L1_A L2_A {'value': 100, 'config': {'enabled': True, 'm...
```

  • With no max_level (or a sufficiently high value), everything is flattened down to scalar values, resulting in long column names like level1.level2.level3.config.enabled.
  • With max_level=1, only the keys directly under level1 are potentially expanded. Since level1.level2 is still a dictionary, it remains as an object in the column level1.level2.
  • With max_level=2, flattening proceeds one step further. level1.level2.id becomes a column, but level1.level2.level3 remains a dictionary object because it’s at the third level of nesting relative to the top.

max_level is useful when you want a semi-flattened structure, perhaps for performance reasons or because further flattening isn’t necessary for your immediate analysis goals.

4. Handling Different Nesting Patterns with Examples

Let’s solidify understanding with more targeted examples covering common JSON structures.

Example 1: Simple Dictionary (Not a List)

What if your input data is a single JSON object, not a list?

```python
single_record = {
    "event_id": "evt_123",
    "timestamp": "2023-10-27T10:00:00Z",
    "payload": {
        "type": "user_login",
        "user": {"id": "u_abc", "role": "admin"},
        "ip_address": "192.168.1.100"
    },
    "metadata": {"source": "web", "region": "us-east-1"}
}

# Normalizing a single dictionary
df_single = pd.json_normalize(single_record)
print(df_single)
```

Output:

event_id timestamp payload.type payload.user.id payload.user.role payload.ip_address metadata.source metadata.region
0 evt_123 2023-10-27T10:00:00Z user_login u_abc admin 192.168.1.100 web us-east-1

json_normalize handles a single dictionary by creating a DataFrame with one row and flattening all nested dictionaries within it.

Example 2: List of Records Nested Under a Single Key

This is a very common API response pattern, where metadata surrounds a list of results.

```python
api_response = {
    "query": "pandas",
    "page": 1,
    "total_results": 150,
    "results": [
        { "id": 1, "title": "Pandas Intro", "author": {"name": "Alice", "org": "DataCo"} },
        { "id": 2, "title": "Advanced Pandas", "author": {"name": "Bob", "org": "Analytics Inc"} },
        { "id": 3, "title": "Pandas Visualization", "author": {"name": "Charlie", "org": "DataCo"} }
    ]
}

# Flatten the 'results' list, bringing in 'query' and 'page' as metadata
df_api = pd.json_normalize(
    api_response,
    record_path='results',
    meta=['query', 'page'],
    sep='_'   # Use underscore separator
)
print(df_api)
```

Output:

id title author_name author_org query page
0 1 Pandas Intro Alice DataCo pandas 1
1 2 Advanced Pandas Bob Analytics Inc pandas 1
2 3 Pandas Visualization Charlie DataCo pandas 1

Here, record_path='results' targets the list we want to turn into rows, and meta=['query', 'page'] brings in top-level information relevant to all results on this page. The nested author dictionary within each record is automatically flattened to author_name and author_org using the _ separator.

Example 3: JSON with Lists that Aren’t the Main Records

Sometimes you have lists within your data that you don’t necessarily want to explode into separate rows immediately, like the tags in our initial example.

```python
data_with_lists = [
    {
        "productId": "P100",
        "name": "Laptop",
        "specs": {"cpu": "i7", "ram_gb": 16},
        "stores": ["Store A", "Store B"],
        "ratings": [
            {"user": "User1", "score": 5},
            {"user": "User2", "score": 4}
        ]
    },
    {
        "productId": "P200",
        "name": "Keyboard",
        "specs": {"cpu": None, "ram_gb": None},  # Assuming specs are irrelevant for a keyboard
        "stores": ["Store A", "Store C"],
        "ratings": [
            {"user": "User1", "score": 4},
            {"user": "User3", "score": 5}
        ]
    }
]

# Simple normalization: lists remain as objects
df_lists_basic = pd.json_normalize(data_with_lists, sep='_')
print("Basic Normalization (Lists as Objects):")
print(df_lists_basic)
print("\nData type of 'stores' column:", df_lists_basic['stores'].dtype)
print("Data type of 'ratings' column:", df_lists_basic['ratings'].dtype)

# If we wanted each rating as a row (less common for this structure)
df_ratings = pd.json_normalize(
    data_with_lists,
    record_path='ratings',
    meta=['productId', 'name', ['specs', 'cpu'], 'stores'],
    sep='_'
)
print("\nRatings as Rows:")
print(df_ratings)
```

Output:

```
Basic Normalization (Lists as Objects):
  productId name stores ratings specs_cpu specs_ram_gb
0 P100 Laptop [Store A, Store B] [{'user': 'User1', 'score': 5}, {'user': 'User... i7 16.0
1 P200 Keyboard [Store A, Store C] [{'user': 'User1', 'score': 4}, {'user': 'User... None NaN

Data type of 'stores' column: object
Data type of 'ratings' column: object

Ratings as Rows:
  user score productId name specs_cpu stores
0 User1 5 P100 Laptop i7 [Store A, Store B]
1 User2 4 P100 Laptop i7 [Store A, Store B]
2 User1 4 P200 Keyboard None [Store A, Store C]
3 User3 5 P200 Keyboard None [Store A, Store C]
```

  • The first normalization (df_lists_basic) flattens the specs dictionary but leaves the stores (list of strings) and ratings (list of dictionaries) columns as Python objects. This might be perfectly acceptable if you plan to process these columns later, for example with .explode() or .apply(); a short .explode() sketch follows this list.
  • The second normalization (df_ratings) demonstrates using record_path='ratings' to make each individual rating a separate row, bringing product information along via meta. Notice how the stores list is repeated for each rating of the same product. Choose the approach based on your analytical goal.
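As a quick illustration of that follow-up processing, the stores column from df_lists_basic can be expanded into one row per store with DataFrame.explode (a minimal sketch):

```python
# Expand the list column so each (product, store) pair gets its own row
df_stores = df_lists_basic[['productId', 'name', 'stores']].explode('stores')
print(df_stores)
#   productId      name   stores
# 0      P100    Laptop  Store A
# 0      P100    Laptop  Store B
# 1      P200  Keyboard  Store A
# 1      P200  Keyboard  Store C
```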

5. Practical Use Cases

The primary driver for json_normalize is often dealing with data from external sources.

Use Case 1: Processing API Responses

Web APIs frequently return data in nested JSON format. Imagine querying a weather API:

```python
weather_api_response = {
    "query_location": "London, UK",
    "request_time": "2023-10-27T12:00:00Z",
    "forecast": [
        {
            "date": "2023-10-28",
            "day": { "maxtemp_c": 15.0, "mintemp_c": 8.0, "condition": {"text": "Partly cloudy", "code": 1003} },
            "astro": { "sunrise": "07:30", "sunset": "18:00" }
        },
        {
            "date": "2023-10-29",
            "day": { "maxtemp_c": 14.0, "mintemp_c": 7.5, "condition": {"text": "Sunny", "code": 1000} },
            "astro": { "sunrise": "07:32", "sunset": "17:58" }
        }
    ]
}

# Normalize the forecast data
df_weather = pd.json_normalize(
    weather_api_response,
    record_path='forecast',
    meta=['query_location', 'request_time'],
    sep='_'
)
print(df_weather)
```

Output:

date day_maxtemp_c day_mintemp_c day_condition_text day_condition_code astro_sunrise astro_sunset query_location request_time
0 2023-10-28 15.0 8.0 Partly cloudy 1003 07:30 18:00 London, UK 2023-10-27T12:00:00Z
1 2023-10-29 14.0 7.5 Sunny 1000 07:32 17:58 London, UK 2023-10-27T12:00:00Z

This instantly transforms the nested forecast into a clean table, ready for time series analysis or comparison, including the location and request time as context.
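If you do want to treat it as a time series, a small follow-up step converts the date column and uses it as the index (a sketch using df_weather from above):

```python
# Parse the forecast dates and index the DataFrame by them
df_weather['date'] = pd.to_datetime(df_weather['date'])
df_weather = df_weather.set_index('date').sort_index()
print(df_weather[['day_maxtemp_c', 'day_mintemp_c', 'day_condition_text']])
```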

Use Case 2: Reading Nested Configuration Files

JSON is often used for application configuration. While usually read directly by applications, sometimes you might want to analyze or compare configurations across different environments stored in JSON files.

```json
// config_prod.json
{
  "environment": "production",
  "database": {
    "type": "postgres",
    "host": "prod-db.example.com",
    "port": 5432,
    "credentials": {"user": "prod_user"}   // Secret handled separately!
  },
  "features": {
    "new_dashboard": true,
    "email_alerts": {
      "enabled": true,
      "level": "critical"
    }
  }
}

// config_staging.json
{
  "environment": "staging",
  "database": {
    "type": "postgres",
    "host": "staging-db.example.com",
    "port": 5432,
    "credentials": {"user": "staging_user"}
  },
  "features": {
    "new_dashboard": true,
    "email_alerts": {
      "enabled": false,
      "level": "debug"    // Different level
    }
  }
}
```

```python
import json

# Assume the files are loaded into dictionaries
with open('config_prod.json') as f:
    config_prod = json.load(f)
with open('config_staging.json') as f:
    config_staging = json.load(f)

configs = [config_prod, config_staging]

# Normalize the list of configurations
df_configs = pd.json_normalize(configs, sep='_')
print(df_configs)
```

Output:

environment database_type database_host database_port database_credentials_user features_new_dashboard features_email_alerts_enabled features_email_alerts_level
0 production postgres prod-db.example.com 5432 prod_user True True critical
1 staging postgres staging-db.example.com 5432 staging_user True False debug

This makes it easy to compare settings side-by-side, for instance, checking differences in feature flags or database hosts.
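One convenient way to do that comparison is to index by environment and transpose, so each flattened setting becomes a row (a minimal sketch using df_configs from above):

```python
# Environments become columns, one row per flattened setting
comparison = df_configs.set_index('environment').T
print(comparison)

# Keep only the settings that actually differ between environments
print(comparison[comparison.nunique(axis=1) > 1])
```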

Use Case 3: Data from NoSQL Databases (e.g., MongoDB)

Documents retrieved from NoSQL databases like MongoDB are often complex, nested JSON-like structures. json_normalize is invaluable for preparing this data for analysis in Pandas.

```python
# Simulated data structure from a MongoDB collection
mongo_docs = [
    {
        "_id": "ObjectId('653bb...')",
        "orderId": "ORD1001",
        "customer": {"id": "CUST01", "name": "Alice"},
        "items": [
            {"sku": "SKU001", "qty": 2, "price": 10.50},
            {"sku": "SKU002", "qty": 1, "price": 25.00}
        ],
        "timestamp": "ISODate('2023-10-27T...')",
        "shipping": {"method": "standard", "address": {"city": "Wonderland"}}
    },
    {
        "_id": "ObjectId('653bc...')",
        "orderId": "ORD1002",
        "customer": {"id": "CUST02", "name": "Bob"},
        "items": [
            {"sku": "SKU003", "qty": 5, "price": 5.00}
        ],
        "timestamp": "ISODate('2023-10-27T...')",
        "shipping": {"method": "express", "address": {"city": "Codeville"}}
    }
]

# Normalize the items list, bringing in order and customer info
df_order_items = pd.json_normalize(
    mongo_docs,
    record_path='items',
    meta=[
        'orderId',
        ['customer', 'id'],
        ['customer', 'name'],
        'timestamp',
        ['shipping', 'method'],
        ['shipping', 'address', 'city']
    ],
    meta_prefix='order_',  # Example: add a prefix to meta columns
    sep='_'
)
print(df_order_items)
```

Output:

sku qty price order_orderId order_customer_id order_customer_name order_timestamp order_shipping_method order_shipping_address_city
0 SKU001 2 10.5 ORD1001 CUST01 Alice ISODate('2023-10-27T...') standard Wonderland
1 SKU002 1 25.0 ORD1001 CUST01 Alice ISODate('2023-10-27T...') standard Wonderland
2 SKU003 5 5.0 ORD1002 CUST02 Bob ISODate('2023-10-27T...') express Codeville

This flattens the order items into individual rows, associating each item with its corresponding order details, customer information, and shipping details. The meta_prefix parameter was used here to demonstrate adding a prefix (order_) to all columns derived from the meta parameter, which can help organize columns if many metadata fields are included.

6. Advanced Techniques and Considerations

While json_normalize covers many scenarios, sometimes you need additional steps or awareness of certain limitations.

Preprocessing JSON:

  • JSON as Strings: Occasionally, nested JSON might be stored as a string within a JSON field. You’ll need to parse this inner string first using json.loads.

```python
data_str = [{'id': 1, 'payload': '{"value": 10, "status": "ok"}'}]

# Preprocess the 'payload' field: parse the embedded JSON string
for record in data_str:
    if isinstance(record.get('payload'), str):
        record['payload'] = json.loads(record['payload'])

df_preprocessed = pd.json_normalize(data_str)
print(df_preprocessed)
```

    Output:

    id payload.value payload.status
    0 1 10 ok

  • Inconsistent Types: If a field sometimes contains a dictionary and sometimes a scalar (or is missing), json_normalize might raise errors (if errors='raise') or produce columns with mixed types or NaNs (if errors='ignore'). Preprocessing might involve standardizing these fields, for example ensuring a field is always a dictionary, even if an empty {}; a short sketch of this follows below.
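Here is a small sketch of that kind of standardization. The field name details and the raw records are hypothetical, invented only to show the pattern:

```python
raw_records = [
    {"id": 1, "details": {"score": 10}},   # dict, as expected
    {"id": 2, "details": None},            # null instead of a dict
    {"id": 3},                             # key missing entirely
    {"id": 4, "details": 7},               # scalar instead of a dict
]

# Ensure 'details' is always a dictionary so flattening produces consistent columns
for rec in raw_records:
    value = rec.get("details")
    if not isinstance(value, dict):
        rec["details"] = {"score": value} if value is not None else {}

df_clean = pd.json_normalize(raw_records)
print(df_clean)  # columns: id, details.score
```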

Handling Multiple, Independent Lists:

json_normalize with record_path is designed to flatten one primary list structure per input record into rows. What if a single JSON object contains multiple lists you want to flatten independently and then relate?

Example: A document with both comments and revisions.

```python
doc = {
    "docId": "D1",
    "content": "Some text.",
    "comments": [ {"user": "Alice", "text": "Good point"}, {"user": "Bob", "text": "Needs citation"} ],
    "revisions": [ {"revId": 1, "ts": "..."}, {"revId": 2, "ts": "..."} ]
}
```

You cannot flatten both comments and revisions into rows simultaneously in a single json_normalize call while keeping them correctly associated only with docId "D1"; forcing both into one table would effectively produce a cross-product of comments and revisions.

The typical approach is:
1. Normalize each list separately, bringing in the common identifier (docId) using meta.
2. Optionally, merge or join these resulting DataFrames if needed, though often analyzing them separately is sufficient (a brief sketch follows the output below).

```python
# Normalize comments
df_comments = pd.json_normalize(doc, record_path='comments', meta=['docId'])

# Normalize revisions
df_revisions = pd.json_normalize(doc, record_path='revisions', meta=['docId'])

print("Comments:")
print(df_comments)
print("\nRevisions:")
print(df_revisions)
```

Output:

```
Comments:
  user text docId
0 Alice Good point D1
1 Bob Needs citation D1

Revisions:
  revId ts docId
0 1 ... D1
1 2 ... D1
```
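If you do need the two views side by side, one option is to aggregate each DataFrame and relate the results through docId (a minimal sketch using the two DataFrames above):

```python
# Summarize each flattened list separately, then line the summaries up by docId
comment_counts = df_comments.groupby('docId').size().rename('n_comments')
revision_counts = df_revisions.groupby('docId').size().rename('n_revisions')

summary = pd.concat([comment_counts, revision_counts], axis=1)
print(summary)
#        n_comments  n_revisions
# docId
# D1              2            2
```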

Memory Usage and Performance:

Flattening extremely large or deeply nested JSON objects can consume significant memory, as you are potentially creating many new columns and rows.

  • max_level: Use max_level if you don’t need full flattening.
  • Selective meta: Only include necessary metadata fields in the meta parameter.
  • Chunking: If processing a very large JSON file (e.g., newline-delimited JSON), consider reading and normalizing it in chunks (see the sketch after this list).
  • Alternative Libraries: For extremely large files or performance-critical ETL pipelines, libraries like dask (which can parallelize Pandas operations, including potentially custom JSON flattening) or specialized data processing engines might be necessary. However, for most common use cases involving API responses or moderately sized files, json_normalize is efficient enough.
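For the chunking idea mentioned above, here is a minimal sketch for newline-delimited JSON. The file name events.ndjson, the batch size, and the helper function are hypothetical, shown only to illustrate the pattern:

```python
import json
import pandas as pd

def normalize_ndjson(path, batch_size=10_000, **normalize_kwargs):
    """Read a newline-delimited JSON file in batches and normalize each batch."""
    frames, batch = [], []
    with open(path) as fh:
        for line in fh:
            if line.strip():
                batch.append(json.loads(line))
            if len(batch) >= batch_size:
                frames.append(pd.json_normalize(batch, **normalize_kwargs))
                batch = []
    if batch:
        frames.append(pd.json_normalize(batch, **normalize_kwargs))
    return pd.concat(frames, ignore_index=True)

# Hypothetical usage on an NDJSON export:
# df = normalize_ndjson("events.ndjson", sep="_")
```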

Alternatives to json_normalize:

While powerful, json_normalize isn’t the only way:

  • Manual Iteration and DataFrame Construction: You could write custom Python loops to traverse the JSON, extract the data you need, collect it into lists or dictionaries, and then create a DataFrame. This offers maximum flexibility but is significantly more verbose, error-prone, and usually less performant than json_normalize.
  • .apply(pd.Series): For columns that contain dictionaries after an initial pd.DataFrame load, you can sometimes use df['col_with_dict'].apply(pd.Series) to expand that dictionary into new columns. This needs to be done column-by-column and can be less efficient than json_normalize handling the whole structure at once. It also doesn’t handle the record_path and meta concepts directly. A short sketch follows below.
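For completeness, here is roughly what that column-by-column expansion looks like; the small DataFrame is hypothetical and only meant to show the pattern:

```python
df = pd.DataFrame({
    'id': [1, 2],
    'specs': [{'cpu': 'i7', 'ram_gb': 16}, {'cpu': 'i5', 'ram_gb': 8}],
})

# Expand the dict column into new columns, then reattach them
expanded = df['specs'].apply(pd.Series)
df_expanded = pd.concat([df.drop(columns='specs'), expanded], axis=1)
print(df_expanded)
#    id cpu  ram_gb
# 0   1  i7      16
# 1   2  i5       8
```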

In most scenarios involving flattening nested JSON, pd.json_normalize provides the best balance of power, conciseness, and performance within the Pandas ecosystem.

7. Putting It All Together: A Comprehensive Example

Let’s tackle a more complex JSON structure that combines several features we’ve discussed. Imagine an API response for a complex order system.

```python
order_data = [
    {
        "order_id": "ORD-2023-A001",
        "customer_details": {
            "id": "CUST_01",
            "contact": {"name": "Alice", "email": "alice@example.com"},
            "address": {
                "street": "1 Tech Park",
                "city": "Dataville",
                "country": "DV"
            }
        },
        "items": [
            {
                "sku": "SKU001",
                "description": "Widget Type A",
                "quantity": 2,
                "pricing": {"unit_price": 10.0, "currency": "USD"}
            },
            {
                "sku": "SKU002",
                "description": "Gadget Type B",
                "quantity": 1,
                "pricing": {"unit_price": 25.5, "currency": "USD"}
            }
        ],
        "shipments": [
            {
                "shipment_id": "SHP_A1",
                "carrier": "FastShip",
                "tracking_url": "http://track.fastship.co/SHP_A1",
                "status": "delivered"
            }
        ],
        "order_notes": [{"ts": "t1", "note": "Urgent order"}],
        "metadata": {"source": "web", "processed_by": "worker-01"}
    },
    {
        "order_id": "ORD-2023-B002",
        "customer_details": {
            "id": "CUST_02",
            "contact": {"name": "Bob", "email": "bob@example.com"},
            "address": {
                "street": "2 Analysis Ave",
                "city": "Analytic City",
                "country": "AC"
            }
        },
        "items": [
            {
                "sku": "SKU003",
                "description": "Component X",
                "quantity": 5,
                "pricing": {"unit_price": 5.2, "currency": "EUR"}
            }
        ],
        "shipments": [],      # Empty list
        "order_notes": None,  # Missing notes
        "metadata": {"source": "api", "processed_by": "worker-02"}
    }
]
```

Goal: Create a DataFrame where each item in an order is a row. Include the order_id, customer name, customer city, shipment count (as an example of handling another list), and metadata source.

Step 1: Identify record_path
We want each item to be a row, so the path points to the list of items: record_path='items'.

Step 2: Identify meta
We need information from outside the items list:
* order_id (top-level)
* Customer name: ['customer_details', 'contact', 'name']
* Customer city: ['customer_details', 'address', 'city']
* Shipment count: This isn’t directly available as a key. We’ll need to calculate this after normalization or during preprocessing if absolutely needed in the meta. For simplicity here, let’s bring the shipments list itself into the metadata and process it later. Path: shipments.
* Metadata source: ['metadata', 'source']

Step 3: Choose sep, errors
Let’s use _ as the separator (sep='_'). Since order_notes might be missing or shipments might be empty, we should use errors='ignore' to prevent crashes if paths don’t resolve (though in this specific meta selection, only missing metadata or customer_details would cause issues if raise was used).

Step 4: Execute json_normalize

```python
df_order_items_complex = pd.json_normalize(
    order_data,
    record_path='items',
    meta=[
        'order_id',
        ['customer_details', 'contact', 'name'],
        ['customer_details', 'address', 'city'],
        'shipments',   # Bring in the whole list for now
        ['metadata', 'source']
    ],
    sep='_',
    errors='ignore'   # Handle potential missing paths like order_notes (though not used in meta here)
)

print("Initial Normalized DataFrame:")
print(df_order_items_complex)
```

Output:

Initial Normalized DataFrame:
sku description quantity pricing_unit_price pricing_currency order_id customer_details_contact_name customer_details_address_city shipments metadata_source
0 SKU001 Widget Type A 2 10.0 USD ORD-2023-A001 Alice Dataville [{'shipment_id': 'SHP_A1', 'carrier': 'FastSh... web
1 SKU002 Gadget Type B 1 25.5 USD ORD-2023-A001 Alice Dataville [{'shipment_id': 'SHP_A1', 'carrier': 'FastSh... web
2 SKU003 Component X 5 5.2 EUR ORD-2023-B002 Bob Analytic City [] api

Step 5: Post-processing (Optional but often needed)

The DataFrame is flattened, but the shipments column still contains lists (or NaNs if a record was missing the ‘shipments’ key entirely). We can now easily process this using standard Pandas:

```python
# Calculate a shipment count from the 'shipments' column
df_order_items_complex['shipment_count'] = df_order_items_complex['shipments'].apply(
    lambda x: len(x) if isinstance(x, list) else 0
)

# Drop the original list column if it is no longer needed
df_order_items_complex = df_order_items_complex.drop(columns=['shipments'])

# Display the final DataFrame
print("\nFinal Processed DataFrame:")
print(df_order_items_complex)
```

Output:

Final Processed DataFrame:
sku description quantity pricing_unit_price pricing_currency order_id customer_details_contact_name customer_details_address_city metadata_source shipment_count
0 SKU001 Widget Type A 2 10.0 USD ORD-2023-A001 Alice Dataville web 1
1 SKU002 Gadget Type B 1 25.5 USD ORD-2023-A001 Alice Dataville web 1
2 SKU003 Component X 5 5.2 EUR ORD-2023-B002 Bob Analytic City api 0

Now we have a clean, flat DataFrame where each row is an order item, linked to relevant order and customer details, and includes a calculated shipment_count. This complex transformation was made significantly easier by json_normalize.

8. Conclusion

Working with nested JSON data is a common task in data analysis and preparation, particularly when dealing with APIs and semi-structured data sources. While nested structures are flexible for data representation, they don’t map directly to the tabular format expected by Pandas DataFrames.

pandas.json_normalize emerges as a purpose-built, powerful, and efficient tool to bridge this gap. By understanding its core parameters – record_path for targeting lists to explode into rows, meta for including contextual data from parent levels, sep for controlling column naming, errors for managing inconsistencies, and max_level for limiting flattening depth – you can effectively transform complex, hierarchical JSON into analysis-ready, flat DataFrames.

We’ve explored its application from simple nested dictionaries to complex API-like responses and NoSQL document structures. We’ve also discussed practical considerations like preprocessing, handling multiple lists, performance, and alternatives.

Mastering json_normalize significantly streamlines the process of ingesting and preparing JSON data, allowing you to spend less time wrestling with data structures and more time deriving insights using the rich analytical capabilities of the Pandas library. It’s an essential function for any data professional working with Python and diverse data sources. The next time you encounter a challenging nested JSON, remember pd.json_normalize – your key to flattening the complexities.

