How to drop a subset of MultiIndex "columns" from a Pandas series or dataframe?

38 Views Asked by At

Iterating over a pandas DataFrame using iterrows() produces a series of index Series pairs (tuples).

for timestamp, row in df.iterrows():

I am aware that iterrows() is slow. Ignoring that issue for now -

Some of the returned rows will contain None or NaN values. I want to remove these. (Not from the DataFrame but from a copy of each row returned by iterrows().)

I also want to remove a subset of "columns". Columns are named with a 2-level MultiIndex.

Here's an idea of what the DataFrame looks like:

                    AACT                  ABILF                ...
                    open high  low close  open high  low close ...
timestamp                                                                                                                   ...
2022-01-04 00:00:00  NaN  NaN  NaN   NaN   NaN  NaN  NaN   NaN ...
2022-01-04 00:01:00  NaN  NaN  NaN   NaN   NaN  NaN  NaN   NaN ...
2022-01-04 00:02:00  NaN  NaN  NaN   NaN   NaN  NaN  NaN   NaN ...
2022-01-04 00:03:00  NaN  NaN  NaN   NaN   NaN  NaN  NaN   NaN ...
2022-01-04 00:04:00  NaN  NaN  NaN   NaN   NaN  NaN  NaN   NaN ...

All the values happen to be NaN here, however in general that will not be the case.

Because I do not know how to approach this problem, here is some pseudocode:

for timestamp, row in df.iterrows():

    row.drop([('AACT', 'open'), ('AACT', 'high'), ('AACT', 'low')])
    row.drop([('ABILF', 'open'), ('ABILF', 'high'), ('ABILF', 'low')])
    row.dropna()

    # remaining data is `('AACT', 'close')` and `('ABILF', 'close')`
    # iff values in this `Series` are non-NaN
2

There are 2 best solutions below

0
Adrien Riaux On

There is the level=... argument in the pandas.DataFrame.drop() method, see the documentation here.

See the following reproductible example:

import pandas as pd
import numpy as np

arrays = [
    ["bar", "bar", "foo", "foo"],
    ["one", "two", "three", "four"],
]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])


df = pd.DataFrame(np.random.randn(3, 4), index=["A", "B", "C"], columns=index)
df

>>> 
first   bar foo
second  one two three   four
A   0.951324    1.439571    0.381357    -0.442386
B   1.664391    0.615796    -0.879859   -0.916176
C   -0.944747   -0.158825   0.615032    0.545176

df.drop('two', axis=1, level=1)

>>>
first   bar foo
second  one three   four
A   0.951324    0.381357    -0.442386
B   1.664391    -0.879859   -0.916176
C   -0.944747   0.615032    0.545176
0
FreelanceConsultant On

The solution is provided below. The drop() function which exists for both DataFrame and Series objects takes an argument level which specifies which level of a MultiIndex should be inspected.

Series also has the dropna() function which drops None/NaN.

for timestamp, row in df.iterrows():

    series = row.dropna()
    series = series.drop(labels=['open', 'high', 'low'], level=1)