I am trying to filter a time-series Glue DynamicFrame with millions of rows of data like:

```
id  val  ts
a   1.3  2022-05-03T14:18:00.000Z
a   9.2  2022-05-03T12:18:00.000Z
c   8.2  2022-05-03T13:48:00.000Z
```
I have another pandas DataFrame with thousands of rows:

```
id  start_ts                  end_ts
a   2022-05-03T14:00:00.000Z  2022-05-03T14:18:00.000Z
a   2022-05-03T11:38:00.000Z  2022-05-03T12:18:00.000Z
c   2022-05-03T13:15:00.000Z  2022-05-03T13:48:00.000Z
```
I want to keep only the rows of the time-series DynamicFrame whose id matches a row of the pandas DataFrame and whose ts falls between that row's start_ts and end_ts.
My current approach is too slow. I am first iterating over pandas_df and storing the filtered DynamicFrames in a list:

```python
from awsglue.transforms import Filter

dfs = []
for index, row in pandas_df.iterrows():  # assumes `id` is the index of pandas_df
    # bind row/index as defaults; a plain closure is captured late under lazy evaluation
    df = Filter.apply(ts_dynamicframe, f=lambda x, row=row, index=index:
                      row["start_ts"] <= x["ts"] <= row["end_ts"] and x["id"] == index)
    dfs.append(df)
```
and then unioning all the DynamicFrames together:

```python
df = dfs.pop(0)
for _df in dfs:
    df = df.union(_df)
```
The materialization takes too long and never finishes:

```python
print("Count: ", df.count())
```
What would be a more efficient approach to solving this problem with Spark/Glue?
Use a range join
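The loop is slow because every Filter.apply adds another transformation, so by the time count() runs, Spark has to evaluate a plan with thousands of chained filters and unions. A range join expresses the whole condition at once: join the time series to the ranges table on id with ts between start_ts and end_ts, and let Spark execute it as a single distributed operation. Since the ranges table only has thousands of rows, it can also be broadcast to every executor.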
Data
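A minimal sketch of the setup, assuming the usual Glue job boilerplate provides spark and glueContext; ts_df stands in for ts_dynamicframe.toDF(), and ranges_df is built from the pandas_df in the question:

```python
from pyspark.sql import functions as F

# Time-series side: in the real job this would be ts_dynamicframe.toDF()
ts_df = spark.createDataFrame(
    [("a", 1.3, "2022-05-03T14:18:00.000Z"),
     ("a", 9.2, "2022-05-03T12:18:00.000Z"),
     ("c", 8.2, "2022-05-03T13:48:00.000Z")],
    ["id", "val", "ts"],
).withColumn("ts", F.to_timestamp("ts"))

# Range side: thousands of rows, small enough to broadcast later
ranges_df = (spark.createDataFrame(pandas_df)
             .withColumn("start_ts", F.to_timestamp("start_ts"))
             .withColumn("end_ts", F.to_timestamp("end_ts")))
```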
Solution
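One join then replaces the whole loop-and-union plan. The equality on id and the between() predicate go into a single join condition, and broadcast() ships the small ranges table to every executor so the millions of time-series rows are never shuffled (a sketch continuing from the DataFrames above):

```python
from pyspark.sql import functions as F
from awsglue.dynamicframe import DynamicFrame

filtered = (
    ts_df.alias("t")
    .join(
        F.broadcast(ranges_df.alias("r")),
        on=(F.col("t.id") == F.col("r.id"))
        & F.col("t.ts").between(F.col("r.start_ts"), F.col("r.end_ts")),
        how="inner",
    )
    .select("t.id", "t.val", "t.ts")
)

# Convert back to a DynamicFrame for the rest of the Glue job
filtered_dyf = DynamicFrame.fromDF(filtered, glueContext, "filtered")
```

Note that between() is inclusive on both bounds, and if two ranges for the same id overlap, a time-series row can match both and be duplicated; add .dropDuplicates() after the join if that matters.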
Outcome
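With the sample data every row survives, since each ts falls on or inside its id's window (expected output, assuming a UTC session timezone):

```python
filtered.show()
# +---+---+-------------------+
# | id|val|                 ts|
# +---+---+-------------------+
# |  a|1.3|2022-05-03 14:18:00|
# |  a|9.2|2022-05-03 12:18:00|
# |  c|8.2|2022-05-03 13:48:00|
# +---+---+-------------------+
```

The count that previously never finished now executes as a single Spark job over one join instead of thousands of chained filters and unions.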