I want to scale a one-off pipeline, which I currently run locally, to the cloud.
- The script reads data from a large (30 TB), static S3 bucket made up of PDFs.
- I pass these PDFs through a ThreadPool to a Docker container, which produces an output.
- I save the output to a file.
I can only test it locally on a small fraction of this dataset. The whole pipeline would take a couple of days to run on a MacBook Pro.
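For reference, the local script is roughly the following shape (a minimal sketch rather than my actual code: the bucket name, Docker image, and output paths are placeholders, and it assumes the container takes a PDF path as its argument and writes its result to stdout):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

BUCKET = "my-pdf-bucket"        # placeholder bucket name
IMAGE = "my-processing-image"   # placeholder Docker image
OUT_DIR = Path("output")
OUT_DIR.mkdir(exist_ok=True)

s3 = boto3.client("s3")

def list_pdf_keys(bucket):
    """Yield every PDF key in the bucket, paginating through the listing."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".pdf"):
                yield obj["Key"]

def process(key):
    """Download one PDF, run it through the container, save the output."""
    local_pdf = Path("/tmp") / Path(key).name
    s3.download_file(BUCKET, key, str(local_pdf))
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{local_pdf.parent}:/data",
         IMAGE, f"/data/{local_pdf.name}"],
        capture_output=True, text=True, check=True,
    )
    (OUT_DIR / f"{local_pdf.stem}.txt").write_text(result.stdout)
    local_pdf.unlink()  # free disk space as we go

with ThreadPoolExecutor(max_workers=8) as pool:
    # Consume the iterator so any worker exception is raised here.
    list(pool.map(process, list_pdf_keys(BUCKET)))
```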
I've been trying to replicate this on GCP, which I'm still getting familiar with.
- Cloud Functions doesn't work well because of its maximum timeout.
- A full Cloud Composer architecture seems like overkill for a very straightforward pipeline that doesn't require Airflow.
- I'd like to avoid rewriting this in Apache Beam for Dataflow.
What is the best way to run such a Python data-processing pipeline with a container on GCP?
Thanks to the useful comments on the original post, I explored other options on GCP.
Using a VM on Compute Engine worked perfectly. The overhead is much less than I expected, and the setup went smoothly.
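The broad idea is simply to run the same script on a bigger machine. The one code tweak worth making is to size the thread pool to the VM instead of hard-coding the value that fit the laptop (a sketch, reusing the hypothetical `process` and `list_pdf_keys` helpers from the sketch in the question):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# On the VM, scale the worker count with the machine rather than keeping
# the value tuned for the MacBook. The 2x factor is a rough heuristic for
# work that mostly waits on S3 downloads and the Docker container.
workers = (os.cpu_count() or 4) * 2

with ThreadPoolExecutor(max_workers=workers) as pool:
    list(pool.map(process, list_pdf_keys(BUCKET)))
```

Since the job runs for days, it is also worth launching the script inside tmux (or with nohup) so it keeps running after the SSH session disconnects.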