Mastery of the subtleties of JSON data transformation across various data warehouse environments has evolved into an indispensable skill within data manipulation. While each contemporary platform boasts its advantages, consider Snowflake, which integrates native JSON functions like PARSE_JSON and TO_JSON, capable of significantly streamlining the transformation process. Amazon Redshift remains a favored option for those paying a premium on cost efficiency. Nonetheless, working with JSON manipulation on Redshift can sometimes be intricate and demanding when viewed in a broader context. Amazon Redshift might occasionally exude a less modern aura than its progressive counterparts like Snowflake and BigQuery, particularly concerning JSON transformations.
In the context of my recent endeavors in coding within the Redshift environment, a particular challenge came to the forefront, leading me to seek comprehensive resources. I’ve compiled this blog post as a way to present a solution for others who might be facing similar challenges.
The Predicament
A typical scenario involves dealing with nested JSON within a table, representing a one-to-many relationship. I’ve prepared executable SQL below that creates a sample “problem” table to provide context.
with nested_purchases as (
select
45367 as customer_id,
json_parse('[' ||
'{"id": "3748", "date": "2023-08-25", "total_amt": 55.24}, ' ||
'{"id": "4563", "date": "2023-08-25", "total_amt": 35.14}, ' ||
'{"id": "5644", "date": "2023-08-29", "total_amt": 59.30}]'
) as purchases
union
select
23563 as customer_id,
json_parse('[' ||
'{"id": "2298", "date": "2023-07-05", "total_amt": 101.54}, ' ||
'{"id": "3331", "date": "2023-07-14", "total_amt": 71.11}, ' ||
'{"id": "4420", "date": "2023-07-29", "total_amt": 4.25}]'
) as purchases
union
select
27390 as customer_id,
json_parse('[' ||
'{"id": "9824", "date": "2023-08-03", "total_amt": 44.24}, ' ||
'{"id": "3938", "date": "2023-08-22", "total_amt": 733.13}, ' ||
'{"id": "4383", "date": "2023-08-30", "total_amt": 293.92}]'
) as purchases
)
select * from nested_purchases
This code generates a table structured as shown:
customer_id | purchases |
---|---|
23563 | [{"id":"2298","date":"2023-07-05","total_amt":101.54},{"id":"3331","date":"2023-07-14","total_amt":71.11},{"id":"4420","date":"2023-07-29","total_amt":4.25}] |
27390 | [{"id":"9824","date":"2023-08-03","total_amt":44.24},{"id":"3938","date":"2023-08-22","total_amt":733.13},{"id":"4383","date":"2023-08-30","total_amt":293.92}] |
45367 | [{"id":"3748","date":"2023-08-25","total_amt":55.24},{"id":"4563","date":"2023-08-25","total_amt":35.14},{"id":"5644","date":"2023-08-29","total_amt":59.30}] |
The Objective
Imagine wanting to break down each customer purchase into its distinct rows while extracting relevant attributes such as purchase amount and date. The goal is to transform the data into a format resembling the following:
customer_id | purchase_id | purchase_date | purchase_amount |
---|---|---|---|
23563 | 2298 | 2023-07-05 | 101.54 |
23563 | 3331 | 2023-07-14 | 71.11 |
23563 | 4420 | 2023-07-29 | 4.25 |
27390 | 9824 | 2023-08-03 | 44.24 |
27390 | 3938 | 2023-08-22 | 733.13 |
27390 | 4383 | 2023-08-30 | 293.92 |
45367 | 3748 | 2023-08-25 | 55.24 |
45367 | 4563 | 2023-08-25 | 35.14 |
45367 | 5644 | 2023-08-29 | 59.3 |
Unveiling the Solution
Navigating through various Redshift DDLs like UNPIVOT
and PIVOT
, along with exploring an array of JSON-specific functions, led me to discover the power of the at index
operation! Despite the need for more comprehensive examples on platforms like Stack Overflow or in tutorials, unraveling the mechanism of at index
required delving into AWS documentation. However, persistence paid off as I crafted a query that accomplishes the transformation we seek:
with nested_purchases as (
select
45367 as customer_id,
json_parse('[' ||
'{"id": "3748", "date": "2023-08-25", "total_amt": 55.24}, ' ||
'{"id": "4563", "date": "2023-08-25", "total_amt": 35.14}, ' ||
'{"id": "5644", "date": "2023-08-29", "total_amt": 59.30}]'
) as purchases
union
select
23563 as customer_id,
json_parse('[' ||
'{"id": "2298", "date": "2023-07-05", "total_amt": 101.54}, ' ||
'{"id": "3331", "date": "2023-07-14", "total_amt": 71.11}, ' ||
'{"id": "4420", "date": "2023-07-29", "total_amt": 4.25}]'
) as purchases
union
select
27390 as customer_id,
json_parse('[' ||
'{"id": "9824", "date": "2023-08-03", "total_amt": 44.24}, ' ||
'{"id": "3938", "date": "2023-08-22", "total_amt": 733.13}, ' ||
'{"id": "4383", "date": "2023-08-30", "total_amt": 293.92}]'
) as purchases
)
select
a.customer_id,
json_extract_path_text(json_serialize(purchase_details), 'id') as purchase_id,
json_extract_path_text(json_serialize(purchase_details), 'date') as purchase_date,
json_extract_path_text(json_serialize(purchase_details), 'total_amt') as purchase_amount
from nested_purchases as a, a.purchases as purchase_details at index
And there you have it—a purchase table meticulously derived from nested JSON data! I hope this blog post will spare you the time and effort I invested in uncovering this solution. Happy coding!