I am receiving events continuously from Event Hubs and aggregating them with a group-by over a 2-second window. However, records that belong to the same 2-second window sometimes land in different micro-batches. How should I handle this scenario in Spark Structured Streaming?
Input:
This is the output I need:
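For context, I read the stream from Event Hubs roughly like this (connection details omitted; this assumes the Azure Event Hubs Spark connector, and event_hub_conf is just a placeholder for my real settings):

stream_df = (
    spark.readStream
         .format("eventhubs")            # Azure Event Hubs Spark connector
         .options(**event_hub_conf)      # placeholder for the real connection config
         .load()
)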
Here is my aggregation code:
from pyspark.sql.functions import window, max as agg_max

def aggregate(df, agg_by_window=True):   # wrapper name is just how I refer to it here
    # group by id, plus a 3-second window sliding every 2 seconds when enabled
    group_by_attributes = ["id"] + (
        [window("timestamp", "3 seconds", "2 seconds")] if agg_by_window else []
    )
    return (
        df.groupBy(*group_by_attributes)
          .agg(
              agg_max("col1").alias("col1"),
              agg_max("col2").alias("col2"),
          )
    )
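
The aggregation currently runs inside foreachBatch, roughly like this (process_batch and the sink are illustrative, not my exact code):

def process_batch(batch_df, batch_id):
    # each micro-batch is aggregated on its own here, so a window that spans
    # two micro-batches comes out as two separate partial results
    result = aggregate(batch_df, agg_by_window=True)
    result.write.mode("append").parquet("/tmp/agg_output")   # illustrative sink

query = (
    stream_df.writeStream
             .foreachBatch(process_batch)
             .start()
)
query.awaitTermination()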
In my case, agg_by_window is True.
With foreachBatch, how can I keep the previous batch's data in memory and use it together with the current batch, or do I need to remove foreachBatch altogether?
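
For reference, my understanding is that if I drop foreachBatch and let Structured Streaming run the aggregation directly on the stream, the engine keeps the window state across micro-batches itself, and a watermark controls how long a window stays open for late records. An untested sketch of what I mean (the 10-second watermark value is just a guess):

from pyspark.sql.functions import window, max as agg_max

windowed = (
    stream_df
    .withWatermark("timestamp", "10 seconds")             # late-data tolerance; value is a guess
    .groupBy("id", window("timestamp", "3 seconds", "2 seconds"))
    .agg(
        agg_max("col1").alias("col1"),
        agg_max("col2").alias("col2"),
    )
)

query = (
    windowed.writeStream
            .outputMode("update")     # or "append" to emit only windows closed by the watermark
            .format("console")
            .start()
)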

