Does Apache Beam processing time avoid late data?


Suppose we are using Beam in streaming mode with, for example, a Kafka source. If we use the default processing-time timestamps, is it possible to have late data?

For example, a pipeline like this:

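    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka
    from apache_beam.transforms.window import FixedWindows

    (
        pipeline
        | "ReadFromKafka" >> ReadFromKafka()  # arguments omitted
        | "WINDOW" >> beam.WindowInto(FixedWindows(5))  # 5-second fixed windows
        | "DESERIALIZE REQUESTS" >> beam.Map(deserialize).with_output_types(Something)
        | "YIELD MULTIPLE ITEMS" >> beam.ParDo(RandomLatency())
        | "GROUP" >> beam.GroupByKey()
        | "BIG PROCESSING" >> beam.ParDo(BigProcessing())
    )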

I understand that windows are not used until the GroupByKey, but for a given request read from Kafka, I have a ParDo that yields multiple items with the same key (I even split the pipeline and join the branches with CoGroupByKey).

From my understanding, because I am using processing-time timestamps, my pipeline can't have late data in my GroupByKey and CoGroupByKey. Is that correct?

Is it because the watermark will always be behind the data that enters the pipeline, even if my GroupByKey happens at a late stage of the pipeline?
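To make the reasoning concrete, here is a toy sketch in plain Python, not Beam itself. It assumes a simplified watermark equal to the largest timestamp seen so far, and it counts an element as late when that watermark has already passed the end of the element's 5-second window. The names `window_end`, `count_late`, and `arrivals` are made up for illustration. Real runners compute watermarks differently, so this only illustrates the intuition and does not show what Beam actually does.

```python
WINDOW_SIZE = 5  # seconds, mirroring FixedWindows(5)


def window_end(ts, size=WINDOW_SIZE):
    """Return the exclusive end of the fixed window that contains ts."""
    return (ts // size + 1) * size


def count_late(arrivals, use_processing_time):
    """Count late elements under a toy watermark model.

    arrivals: list of (arrival_time, event_time) pairs in arrival order.
    Assumption: the watermark is the largest timestamp seen so far.
    """
    watermark = float("-inf")
    late = 0
    for arrival_time, event_time in arrivals:
        # Processing-time stamping uses the arrival time as the timestamp.
        ts = arrival_time if use_processing_time else event_time
        # Late if the watermark already passed the end of this element's window.
        if watermark >= window_end(ts):
            late += 1
        watermark = max(watermark, ts)
    return late


# Hypothetical data: the last two elements carry old event times.
arrivals = [(0, 0), (3, 2), (6, 6), (7, 1), (12, 4)]

print(count_late(arrivals, use_processing_time=True))
print(count_late(arrivals, use_processing_time=False))
```

In this toy model, timestamps taken at arrival never decrease, so the watermark can never be ahead of a newly arriving element's window. What it does not model is the delay introduced by later stages such as the ParDo before the GroupByKey, which is the part I am unsure about.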
