Change the type of a column multi-index level?

96 Views Asked by At

I am reading a .csv file that looks like this:

,1,1,2,2,3,3...
,'A','B','A','B',...
0,1,2,3,4,...
1,2,3,4,5,...

I read this with df = pd.read_csv(fname,header=[0,1],index_col=0)

Now, deplorably, pandas sets the dtype of the first level of the column multiindex to string, even though there are only integers in the first row of the file. Unfortunately, there is no way to tell pandas what dtype to use for each row (level) of the column index. (Something like header={0:'int64',1:'string'}).

Now, it seems that there should be a simple and easy way to convert one level of the multiindex columns to int, but I searched for a long time and could not come up with anything. Right now, I am re-generating the index from scratch, but that seems overkill.

Another solution that could possibly work would be to convert the multiindex to a DataFrame, change the dtypes, then set the index from the DataFramce. That also seems like an overcomplicated process.

Suggestions?

2

There are 2 best solutions below

1
Nick On

You could use set_levels:

csvstr = '''
,1,1,2,2,3,3
,'A','B','A','B','A','B'
0,1,2,3,4,5,6
1,2,3,4,5,6,7
'''

df = pd.read_csv(StringIO(csvstr),header=[0,1],index_col=0)

df.columns.dtypes
# level_0    object
# level_1    object
# dtype: object

df.columns
# MultiIndex([('1', ''A''),
#             ('1', ''B''),
#             ('2', ''A''),
#             ('2', ''B''),
#             ('3', ''A''),
#             ('3', ''B'')],
#            )

df.columns = df.columns.set_levels(df.columns.levels[0].astype('int64'), level=0)

df.columns.dtypes
# level_0     int64
# level_1    object
# dtype: object

df.columns
# MultiIndex([(1, ''A''),
#             (1, ''B''),
#             (2, ''A''),
#             (2, ''B''),
#             (3, ''A''),
#             (3, ''B'')],
#            )
3
Timeless On

it seems that there should be a simple and easy way to convert one level of the multiindex..

Unfortunately, not yet (see pandas/issues/54523).

A workaround would be to convert the header to_frame and map it with literal_eval :

from ast import literal_eval

conv_mux = df.columns.to_frame().map(literal_eval).convert_dtypes()

df.columns = pd.MultiIndex.from_frame(conv_mux) # ---------^ is optional

Output :

>>> df.columns

MultiIndex([(1, 'A'),
            (1, 'B'),
            (2, 'A'),
            (2, 'B'),
            (3, 'A')],
           names=[0, 1])

>>> df.columns.dtypes.to_dict()

{0: Int64Dtype(), 1: string[python]}

Used input :

op = """,1,1,2,2,3
,'A','B','A','B','A'
0,1,2,3,4
1,2,3,4,5"""

from io import StringIO

df = pd.read_csv(StringIO(op) ,header=[0,1], index_col=0)