Unpacking dictionary data to create a multiindex panda dataframe

33 Views Asked by At

I have a dataset including nested dictionaries that I would like to unpack and form a multiindex dataframe.

The dataframe should have columns for Years with sub columns for the Result Types in each year (yield and quality in this case).

Typical dataset:

datalist= [
    {
        'trial': 'efr1',
        'location': 'aberdeen',
        '2010': {'yield': '100', 'quality': '97'},
        '2011': {'yield': '90', 'quality': '87'},
        '2012': {'yield': '88', 'quality': '90'}
    
    },
    {
        'trial': 'efr2',
        'location': 'bristol',
        '2010': {'yield': '88', 'quality': '90'},
        '2011': {'yield': '75', 'quality': '82'},
        '2012': {'yield': '77', 'quality': '80'}
    
    },
    {
        'trial': 'axy1',
        'location': 'newcastle',
        '2010': {'yield': '91', 'quality': '95'},
        '2011': {'yield': '93', 'quality': '93'},
        '2012': {'yield': '75', 'quality': '97'}
    }   
]

Using Dataframe.from_dict() produces a table with embedded dictionaries elements which I cannot figure out how to unpack and split to subcolumns.

On the other hand using json_normalize() produces a flat table with compound headings, and I cannot figure out how to to convert that into a multiindex frame with the structure I need...

Results with Dataframe.from_dict() and json_normalize()

1

There are 1 best solutions below

1
mozway On BEST ANSWER

If I understand correctly, you could use pandas.json_normalize, then convert the year.type columns to a MultiIndex (with str.split):

out = (
 pd.json_normalize(datalist)
   .set_index(['trial', 'location'])
   .pipe(lambda d: d.set_axis(d.columns.str.split('.', expand=True), axis=1))
)

Variant:

out =  pd.json_normalize(datalist).set_index(['trial', 'location'])
out.columns = out.columns.str.split('.', expand=True)

Output:

                 2010          2011          2012        
                yield quality yield quality yield quality
trial location                                           
efr1  aberdeen    100      97    90      87    88      90
efr2  bristol      88      90    75      82    77      80
axy1  newcastle    91      95    93      93    75      97

Note that setting the first two columns as index is optional, but if you don't the MultiIndex will be:

  trial   location  2010          2011          2012        
    NaN        NaN yield quality yield quality yield quality
0  efr1   aberdeen   100      97    90      87    88      90
1  efr2    bristol    88      90    75      82    77      80
2  axy1  newcastle    91      95    93      93    75      97

Intermediate after pd.json_normalize(datalist):

  trial   location 2010.yield 2010.quality 2011.yield 2011.quality 2012.yield 2012.quality
0  efr1   aberdeen        100           97         90           87         88           90
1  efr2    bristol         88           90         75           82         77           80
2  axy1  newcastle         91           95         93           93         75           97