Unable to explode the nested list within a json using pd.json_normalize

102 Views Asked by At

I tried different ways to get this done, but unable to do so. Please help here. I am trying to explode a nested json list using python pd.json_normalize. After exploding the output is coming in two rows causing duplicate PKeys. My data is below and i am trying to get a single record with "id" as pkey and values in the nested list

{
"data": [
    {
        "id": "123456",
        "created_by": "[email protected]",
        "created_at": "2023-12-11",
        "modified_by": "[email protected]",
        "modified_at": "2023-12-11",
        "rg": {
            "id": "xyz234",
            "parent_id": "cpqrst",
            "account_id": "ntb1we"
        },
        "values": [
            {
                "value1": 100,
                "value2": 101,
                "value10": {
                    "valueid": "btqns"
                }
            },
            {
                "value1": 200,
                "value2": 201,
                "value10": {
                    "valueid": "apqns"
                }
            }
        ],
        "page_info": {
            "page": 1,
            "per_page": 100,
            "total_pages": 286,
            "total_count": 28580
        }
    }
]

}

my code as below to normalize and explode

response = requests.get(getURL, params={"page": page}, headers=headers)
    if response.status_code == 200:
        json_data = response.json()
        target_schema  = StructType([StructField(.................. StructField('version', LongType(), True)])

        normalized_data = pd.json_normalize(json_data,["data"],"page_info"]).explode("values").reset_index(drop=True)
        
        JSONdf = json.dumps(json.loads(normalized_data.to_json(orient="records")))
        results.append(JSONdf)

###########################modified code that worked################################
       data_df = json_normalize(json_data['data'],sep='_')
        data_df.columns = [f"{col}_data" for col in data_df.columns]

        df = 
        json_normalize(json_data['data'],'values',sep='_')
        df.columns = [f"{col}_tran" for col in df.columns]
        
        result_df = pd.concat([data_df,df],axis=1)
         ###############################################

This provides data in the format as below with id value 123456 repeating twice for values

Current output with id 123456 repeating in 2 rows for values.value1 and values.value2

current_output

desired output is to have a single row with id 123456 for values column having the entire list

desired_output

Any help greatly appreciated in getting the desired output. I tried all possible ways but could not get the desired output.

2

There are 2 best solutions below

1
Mahboob Nur On

You can achieve this by adjusting the column names inside the pd.json_normalize function according to the actual structure of your JSON data.

import pandas as pd

# Sample corrected JSON data
json_data = {
    "data": [
        {
            "id": "123456",
            "created_by": "[email protected]",
            "created_at": "2023-12-11",
            "modified_by": "[email protected]",
            "modified_at": "2023-12-11",
            "rg": {
                "id": "xyz234",
                "parent_id": "cpqrst",
                "account_id": "ntb1we",
                "values": [
                    {
                        "value1": 100,
                        "value2": 101,
                        "value10": {
                            "valueid": "btqns"
                        }
                    },
                    {
                        "value1": 200,
                        "value2": 201,
                        "value10": {
                            "valueid": "apqns"
                        }
                    }
                ]
            },
            "page_info": {
                "page": 1,
                "per_page": 100,
                "total_pages": 286,
                "total_count": 28580
            }
        }
    ]
}

# Flatten the JSON and explode the "values" column
df = pd.json_normalize(json_data['data'], 'rg.values', ['id', 'created_by', 'created_at', 'modified_by', 'modified_at', 'rg.id', 'rg.parent_id', 'rg.account_id'])

print(df)
0
jezrael On

IIUC use:

out = pd.json_normalize(json_data["data"])

print(out)
       id     created_by  created_at    modified_by modified_at   rg.id  \
0  123456  [email protected]  2023-12-11  [email protected]  2023-12-11  xyz234   

  rg.parent_id rg.account_id  \
0       cpqrst        ntb1we   

                                           rg.values  page_info.page  \
0  [{'value1': 100, 'value2': 101, 'value10': {'v...               1   

   page_info.per_page  page_info.total_pages  page_info.total_count  
0                 100                    286                  28580  

If need separator _:

out = pd.json_normalize(json_data["data"], sep='_')

print(out)
       id     created_by  created_at    modified_by modified_at   rg_id  \
0  123456  [email protected]  2023-12-11  [email protected]  2023-12-11  xyz234   

  rg_parent_id rg_account_id  \
0       cpqrst        ntb1we   

                                           rg_values  page_info_page  \
0  [{'value1': 100, 'value2': 101, 'value10': {'v...               1   

   page_info_per_page  page_info_total_pages  page_info_total_count  
0                 100                    286                  28580