I need to stream data from a Postgres DB to AWS S3 buckets via Kafka, for about 60 tables.
I have three S3 Buckets in my Data Lake:
- RawBucket - raw data. Tables and columns are the same as in Postgres. This data is deleted after e.g. 10 days (retention policy)
- ArchiveBucket - data from RawBucket (not transformed, no retention policy)
- CuratedBucket - data from RawBucket in Parquet format. 80% of the data keeps the same structure as in RawBucket, and 20% is transformed
Is it possible to read data from Kafka with AWS Glue and:
- move all data to ArchiveBucket
- move 80% of the data into CuratedBucket in its source structure, in Parquet format
- move the remaining 20% into CuratedBucket in a new, transformed structure, in Parquet format
- remove processed data from the Kafka topics
- run the AWS Glue job once a day and consume Kafka incrementally
- every record in Kafka has an "uploadedAt" property that holds the date the record was uploaded to Kafka. Based on that date, the record should be saved into a specific subfolder (Hive-style partitioning). For example, data from the "order" table should land in the curated bucket under the "order" folder, partitioned by date, so the final target is s3://curatedBucket/order/year=2024/month=1/day=15/ (see the sketch after this list)
- I would like to skip RawBucket if possible
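
For context, here is roughly what I imagine the daily Glue job could look like for one table. This is just a sketch with placeholder bucket, broker, topic, and schema names; it assumes Glue 4.0 (Spark 3.3) and that the Spark Kafka connector is available to the job (e.g. via an extra jar or a Glue Kafka connection):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Placeholder schema for the "order" topic payload; the real job would use
# the actual table columns (and loop over all ~60 topics).
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("uploadedAt", StringType()),
])

# Read the topic for one table from Kafka (broker and topic are placeholders).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "postgres.public.order")
    .option("startingOffsets", "earliest")
    .load()
)

# Parse the JSON payload and derive partition columns from "uploadedAt".
parsed = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", order_schema).alias("r"))
    .select("r.*")
    .withColumn("uploadedAt", F.to_timestamp("uploadedAt"))
    .withColumn("year", F.year("uploadedAt"))
    .withColumn("month", F.month("uploadedAt"))
    .withColumn("day", F.dayofmonth("uploadedAt"))
)

# "Run once a day, consume only new records": the checkpoint stores the
# committed Kafka offsets, so each daily run resumes where the last one
# stopped and shuts down once all currently available data is processed.
query = (
    parsed.writeStream.format("parquet")
    .option("path", "s3://curatedBucket/order/")
    .option("checkpointLocation", "s3://curatedBucket/_checkpoints/order/")
    .partitionBy("year", "month", "day")
    .trigger(availableNow=True)  # trigger(once=True) on older Spark versions
    .start()
)
query.awaitTermination()
job.commit()
```

The write to ArchiveBucket and the transformations for the ~20% of tables would be extra branches on top of this.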
I'm not sure about the quality of this solution. Is it feasible, and is it a good design? If not, could someone kindly give some advice (with as much detail as possible) on the design and how to achieve it?
Alternatively, what about streaming the data from Kafka to RawBucket first and then processing it incrementally with AWS Glue using job bookmarks (rough sketch below)?
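
If that is the better approach, I imagine the daily incremental part would look roughly like this. Again just a sketch with placeholder names; it assumes RawBucket has been crawled into the Glue Data Catalog and that job bookmarks are enabled on the job (--job-bookmark-option job-bookmark-enable):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Job bookmarks track which files in RawBucket were already processed, keyed
# by transformation_ctx, so each daily run only reads objects added since the
# previous run. "raw_db" / "order" are placeholder catalog names.
raw_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="order",
    transformation_ctx="raw_order",
)

df = raw_dyf.toDF()

# Derive Hive-style partition columns from "uploadedAt".
df = (
    df.withColumn("uploadedAt", F.to_timestamp("uploadedAt"))
    .withColumn("year", F.year("uploadedAt"))
    .withColumn("month", F.month("uploadedAt"))
    .withColumn("day", F.dayofmonth("uploadedAt"))
)

# The ~20% of tables that need a new structure would get their custom
# transformations here; the rest would be written through unchanged.

(
    df.write.mode("append")
    .partitionBy("year", "month", "day")
    .parquet("s3://curatedBucket/order/")
)

job.commit()  # commits the bookmark so the next run skips these files
```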