29 August 2023

Splitting Nested JSON Into Individual Rows in Redshift

Mastery of the subtleties of JSON data transformation across various data warehouse environments has evolved into an indispensable skill within data manipulation. While each contemporary platform boasts its advantages, consider Snowflake, which integrates native JSON functions like PARSE_JSON and TO_JSON, capable of significantly streamlining the transformation process. Amazon Redshift remains a favored option for those paying a premium on cost efficiency. Nonetheless, working with JSON manipulation on Redshift can sometimes be intricate and demanding when viewed in a broader context. Amazon Redshift might occasionally exude a less modern aura than its progressive counterparts like Snowflake and BigQuery, particularly concerning JSON transformations.

In the context of my recent endeavors in coding within the Redshift environment, a particular challenge came to the forefront, leading me to seek comprehensive resources. I’ve compiled this blog post as a way to present a solution for others who might be facing similar challenges.

The Predicament

A typical scenario involves dealing with nested JSON within a table, representing a one-to-many relationship. I’ve prepared executable SQL below that creates a sample “problem” table to provide context.

with nested_purchases as (
select
    45367 as customer_id,
    json_parse('[' ||
    '{"id": "3748", "date": "2023-08-25", "total_amt": 55.24}, ' ||
    '{"id": "4563", "date": "2023-08-25", "total_amt": 35.14}, ' ||
    '{"id": "5644", "date": "2023-08-29", "total_amt": 59.30}]'
    ) as purchases

union

select
    23563 as customer_id,
    json_parse('[' ||
    '{"id": "2298", "date": "2023-07-05", "total_amt": 101.54}, ' ||
    '{"id": "3331", "date": "2023-07-14", "total_amt": 71.11}, ' ||
    '{"id": "4420", "date": "2023-07-29", "total_amt": 4.25}]'
    ) as purchases

union

select
    27390 as customer_id,
    json_parse('[' ||
    '{"id": "9824", "date": "2023-08-03", "total_amt": 44.24}, ' ||
    '{"id": "3938", "date": "2023-08-22", "total_amt": 733.13}, ' ||
    '{"id": "4383", "date": "2023-08-30", "total_amt": 293.92}]'
    ) as purchases
)

select * from nested_purchases

This code generates a table structured as shown:

customer_id purchases
23563 [{"id":"2298","date":"2023-07-05","total_amt":101.54},{"id":"3331","date":"2023-07-14","total_amt":71.11},{"id":"4420","date":"2023-07-29","total_amt":4.25}]
27390 [{"id":"9824","date":"2023-08-03","total_amt":44.24},{"id":"3938","date":"2023-08-22","total_amt":733.13},{"id":"4383","date":"2023-08-30","total_amt":293.92}]
45367 [{"id":"3748","date":"2023-08-25","total_amt":55.24},{"id":"4563","date":"2023-08-25","total_amt":35.14},{"id":"5644","date":"2023-08-29","total_amt":59.30}]

The Objective

Imagine wanting to break down each customer purchase into its distinct rows while extracting relevant attributes such as purchase amount and date. The goal is to transform the data into a format resembling the following:

customer_id purchase_id purchase_date purchase_amount
23563 2298 2023-07-05 101.54
23563 3331 2023-07-14 71.11
23563 4420 2023-07-29 4.25
27390 9824 2023-08-03 44.24
27390 3938 2023-08-22 733.13
27390 4383 2023-08-30 293.92
45367 3748 2023-08-25 55.24
45367 4563 2023-08-25 35.14
45367 5644 2023-08-29 59.3

Unveiling the Solution

Navigating through various Redshift DDLs like UNPIVOT and PIVOT, along with exploring an array of JSON-specific functions, led me to discover the power of the at index operation! Despite the need for more comprehensive examples on platforms like Stack Overflow or in tutorials, unraveling the mechanism of at index required delving into AWS documentation. However, persistence paid off as I crafted a query that accomplishes the transformation we seek:

with nested_purchases as (
  select
    45367 as customer_id,
    json_parse('[' ||
       '{"id": "3748", "date": "2023-08-25", "total_amt": 55.24}, ' ||
       '{"id": "4563", "date": "2023-08-25", "total_amt": 35.14}, ' ||
       '{"id": "5644", "date": "2023-08-29", "total_amt": 59.30}]'
    ) as purchases

  union

  select
    23563 as customer_id,
    json_parse('[' ||
       '{"id": "2298", "date": "2023-07-05", "total_amt": 101.54}, ' ||
       '{"id": "3331", "date": "2023-07-14", "total_amt": 71.11}, ' ||
       '{"id": "4420", "date": "2023-07-29", "total_amt": 4.25}]'
    ) as purchases

  union

  select
    27390 as customer_id,
    json_parse('[' ||
       '{"id": "9824", "date": "2023-08-03", "total_amt": 44.24}, ' ||
       '{"id": "3938", "date": "2023-08-22", "total_amt": 733.13}, ' ||
       '{"id": "4383", "date": "2023-08-30", "total_amt": 293.92}]'
    ) as purchases
)

select
    a.customer_id,
    json_extract_path_text(json_serialize(purchase_details), 'id') as purchase_id,
    json_extract_path_text(json_serialize(purchase_details), 'date') as purchase_date,
    json_extract_path_text(json_serialize(purchase_details), 'total_amt') as purchase_amount
from nested_purchases as a, a.purchases as purchase_details at index

And there you have it—a purchase table meticulously derived from nested JSON data! I hope this blog post will spare you the time and effort I invested in uncovering this solution. Happy coding!