Is there any efficient way to replace loc[[bla]] in pandas?

I have a pandas DataFrame (version 1.5.3) with a non-unique index, and I want to select the rows for each index value and process them in a loop. So far I have been using df_info = df.loc[[idx]], which returns a DataFrame with the selected rows. However, this runs many times, and I noticed that this line specifically takes a lot of time. Using cProfile, I saw that most of the time is spent in 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine'. How can I do this more efficiently?

An example of what the code looks like:

import pandas as pd
import cProfile
from tqdm import tqdm


def iterate_through_df():
    indexes = df.index.unique()
    for idx in tqdm(indexes):
        df_info = df.loc[[idx]]
        # The code continues...

df = pd.read_csv('random_data.csv', index_col='id')
cProfile.run('iterate_through_df()', sort='cumulative')

I generated the CSV for testing with this code:

import pandas as pd
import numpy as np

size = 100000
num_columns = 100

data = {}
for i in range(1, num_columns + 1):
    key = f'name{i}:'
    data[key] = np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eva'], size=size)
random_indexes = np.random.randint(1, 100, size=size)
df = pd.DataFrame(data, index=random_indexes)
df.to_csv('random_data.csv', index_label='id')

Most of the time is spent in this call (which is what I want to optimize):

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
99    1.046    0.011    1.046    0.011 {method 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine' objects}

I tried df_info = df.loc[idx]. The execution time was indeed shorter, but the return type is inconsistent: it is a pandas Series when there is just one record for that index and a DataFrame when there is more than one, and I need it to always be a DataFrame.
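To make the return-type issue concrete, here is a minimal sketch on a small hypothetical DataFrame (the name toy and its contents are made up for illustration and are not part of my real data). It also shows that iterating with groupby(level=0) yields a DataFrame for every index value. Whether that is actually faster than df.loc[[idx]] on the real data is an assumption I have not measured here.

```python
import pandas as pd

# Hypothetical toy data: index value 1 appears twice, index value 2 once.
toy = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"]}, index=[1, 1, 2])

single = toy.loc[2]     # scalar label with one matching row
multi = toy.loc[1]      # scalar label with several matching rows
wrapped = toy.loc[[2]]  # list of labels, as in df.loc[[idx]]

# Record the type returned by each kind of lookup.
kinds = (type(single).__name__, type(multi).__name__, type(wrapped).__name__)
print(kinds)

# Grouping on the index level yields one sub-DataFrame per index value,
# including values that occur only once.
groups = {idx: group for idx, group in toy.groupby(level=0)}
for idx, group in groups.items():
    print(idx, type(group).__name__, group.shape)
```

In the loop from my example, this would mean replacing the per-index df.loc[[idx]] lookup with a single pass such as for idx, df_info in df.groupby(level=0).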
