I have a large dataframe of 140,000 rows. One of its columns holds a value that I need to compare against two thresholds to assign a class index. I then find the index of the closest value in a reference array, where the reference array used depends on that class.
As an example, here's a similar task: I have a timeseries, ts, with temperature "temp" and humidity "h".
- Based on two temperature thresholds, select a "class" of 0, 1 or 2.
- There are then three humidity arrays, one per class. Find the index of the value closest to the humidity value in that row of ts. Note: in the example I don't pass the humidity arrays (h_ref) to the function; they are available globally.
I can do this via:
def temp_and_h(ts):
    temp_i = ts['temp']
    h_i = ts['h']
    if temp_i > 2:
        cls = 0
    elif temp_i < -2:
        cls = 1
    else:
        cls = 2
    # find the index of the closest value in the correct (class-specific) array in h_ref
    diffs = [abs(x - h_i) for x in h_ref[cls]]
    h_idx = diffs.index(min(diffs))
    return (cls, h_idx)
Which is simply called via:
ts[['temp_idx', 'h_idx']] = ts.apply(temp_and_h, axis=1, result_type='expand')
However, on my machine this takes ~670 seconds. I need to speed this up as much as possible. What other approaches can I try? I tried np.vectorize, but its memory usage is too large (>100 GB).
You can vectorize this using `pd.cut` and `groupby.apply`. Example with dummy data:
Reproducible input:
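A minimal sketch of the vectorized approach (the dummy data, array sizes, and names below are illustrative assumptions): `pd.cut` classifies every row in one pass, and `np.searchsorted` replaces the per-row linear scan with a binary search, which assumes each `h_ref` array is sorted — sort them once up front if they aren't.

```python
import numpy as np
import pandas as pd

# Dummy data: 1,000 rows and three sorted reference humidity arrays (one per class)
rng = np.random.default_rng(0)
ts = pd.DataFrame({
    'temp': rng.uniform(-5, 5, 1000),
    'h': rng.uniform(0, 100, 1000),
})
h_ref = [np.sort(rng.uniform(0, 100, 50)) for _ in range(3)]

# 1. Classify all rows at once: temp > 2 -> 0, temp < -2 -> 1, else 2.
#    Note pd.cut's right-closed bins put temp == -2 in class 1, where the
#    original if/elif put it in class 2; adjust the bins if that edge matters.
ts['class'] = pd.cut(ts['temp'],
                     bins=[-np.inf, -2, 2, np.inf],
                     labels=[1, 2, 0]).astype(int)

# 2. Per class, find the index of the nearest reference value via binary search
ts['h_idx'] = -1
for c, ref in enumerate(h_ref):
    mask = ts['class'] == c
    h = ts.loc[mask, 'h'].to_numpy()
    pos = np.searchsorted(ref, h)          # insertion points into sorted ref
    pos = np.clip(pos, 1, len(ref) - 1)    # keep both neighbours in range
    left, right = ref[pos - 1], ref[pos]
    # pick whichever neighbour is closer
    ts.loc[mask, 'h_idx'] = np.where(h - left <= right - h, pos - 1, pos)
```

This avoids Python-level work per row: one `pd.cut` call, then three array-sized NumPy operations (one per class), which should cut the runtime from minutes to well under a second for 140,000 rows.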