How does combination of Starting Position and Checkpoints work in Kinesis Data Streams?

99 Views Asked by At

I am using Glue Streaming Job with Kinesis Data Stream. I want my glue job to always read from last unprocessed record (in case the job goes down and is restarted).

Relevant code -

df = glueContext.create_data_frame.from_option(
   connection_type="kinesis",
   connection_options={
       ... # other properties
       "startingPosition": "TRIM_HORIZON"
   }
)
glueContext.forEachBatch(
    frame=df,
    batch_function=batch_fn,
    options={
        "checkpointLocation": "s3://....",
        ...
    }
)

My understanding is TRIM_HORIZON will make the glue job read data from the very beginning.

AWS Glue documentation says -

AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that has been read.

Which makes me think the streaming glue job is keeping record of last read data.

My question is what does the above combination do? Will it start reading from the beginning but skip already processed data?

1

There are 1 best solutions below

2
SerhiiH On

TRIM_HORIZON there to indicate that you want to start reading from the oldest available record in the Kinesis stream. This means that your AWS Glue job will process all records in the stream, starting from the beginning.

However "checkpointLocation": "s3://....", is used to store checkpoints for the streaming ETL job and will be used to remember the lats read record from Kinesis for particular ETL job.

Whether you stop and restart the same Glue Job in the future or keep it running, with TRIM_HORIZON option, Kinesis data source will be using that checkpoints to start read data from your last checkpoint.

Note: if you start another application (new ETL job), you will read all data from the beginning again!

There are other options available too: see startingPosition at Kinesis connection option reference