Explode Pandas dataframe with mismatched number of elements

39 Views Asked by At

I work on a dataset like this one and I want to explode it with the content of each list.

index    col1      col2      col3
    1   [A,B]     [,xx]     [1,2]
    2   [A,C]   [zz,xx]     [3,4]
    3    [D,]   [zz,yy]     [2,2]

However, the missing values in [,xx] and in [D,] generate errors and I am looking for a way to fill the empty position in these lists.

My idea is to fill the blank in the lists to get something like [NaN,xx] and [D,NaN], without specifying the index in the list because the blank is not always at the same position.

How can I achieve this please ?

2

There are 2 best solutions below

0
mozway On

[,xx] (or [,'xx']) could not be a valid python object.

[,'xx']
# SyntaxError: invalid syntax

Only ['xx'] would be, but in this case there is no way to differentiate something that would be converted to ['xx', NaN] or [NaN, 'xx'].

Thus, assuming that you have strings representations of list-like objects ('[,xx]'), you could strip the [/] and split on , before exploding:

cols = ['col1', 'col2', 'col3']

out = (df[cols]
       .applymap(lambda x: x.strip('[]').split(','))  # convert to lists
       .explode(cols)                                 # explode
       .join(df.drop(columns=cols))[df.columns]       # add other columns
      )

NB. in most recent versions of pandas, replace applymap by map.

Output:

   index col1 col2 col3
0      1    A         1
0      1    B   xx    2
1      2    A   zz    3
1      2    C   xx    4
2      3    D   zz    2
2      3        yy    2

Used input:

df = pd.DataFrame({'index': [1, 2, 3],
                   'col1': ['[A,B]', '[A,C]', '[D,]'],
                   'col2': ['[,xx]', '[zz,xx]', '[zz,yy]'],
                   'col3': ['[1,2]', '[3,4]', '[2,2]']})
0
udaykumar gajavalli On

If the data as below and if you want to replace your null values with some special character you can do below.

np.nan is nothing but the null in python.

import pandas as pd
import numpy as np

data = [
    [['A','B'], ['A','C'], ['D',np.nan]],
    [[np.nan,'xx'], ['zz','xx'], ['zz','yy']],
    [[1,2], [3,4],[2,2]]
]
df = pd.DataFrame(data)
special_char = 'NaN'

def replace_null(x):
  if isinstance(x, list):
    return [replace_null(item) for item in x]
  else:
    return x if not pd.isna(x) else special_char

df = df.applymap(replace_null)
print(df)