I want to scale a one-off pipeline, which I currently run locally, to the cloud.
- The script reads data from a large (30 TB), static S3 bucket made up of PDFs.
- I pass these PDFs through a ThreadPool to a Docker container, which produces an output.
- I save the output to a file.
I can only test it locally on a small fraction of this dataset. The whole pipeline would take a couple of days to run on a MacBook Pro.
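For reference, the local script is roughly the following shape (a minimal sketch rather than my actual code: the bucket name, Docker image, and output paths are placeholders, and it assumes the container takes a PDF path as its argument and writes its result to stdout):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

BUCKET = "my-pdf-bucket"        # placeholder bucket name
IMAGE = "my-processing-image"   # placeholder Docker image
OUT_DIR = Path("output")
OUT_DIR.mkdir(exist_ok=True)

s3 = boto3.client("s3")

def list_pdf_keys(bucket):
    """Yield every PDF key in the bucket, paginating through the listing."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".pdf"):
                yield obj["Key"]

def process(key):
    """Download one PDF, run it through the container, save the output."""
    local_pdf = Path("/tmp") / Path(key).name
    s3.download_file(BUCKET, key, str(local_pdf))
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{local_pdf.parent}:/data",
         IMAGE, f"/data/{local_pdf.name}"],
        capture_output=True, text=True, check=True,
    )
    (OUT_DIR / f"{local_pdf.stem}.txt").write_text(result.stdout)
    local_pdf.unlink()  # free disk space as we go

with ThreadPoolExecutor(max_workers=8) as pool:
    # Consume the iterator so any worker exception is raised here.
    list(pool.map(process, list_pdf_keys(BUCKET)))
```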
I've been trying to replicate this on GCP, which I'm still getting familiar with.
- Cloud Functions doesn't work well because of its maximum timeout.
- A full Cloud Composer architecture seems like overkill for a very straightforward pipeline that doesn't require Airflow.
- I'd like to avoid rewriting this in Apache Beam for Dataflow.
What is the best way to run such a Python data-processing pipeline with a container on GCP?
Thanks to the useful comments on the original post, I explored other options on GCP.
Using a VM on Compute Engine worked perfectly. The overhead is much less than I expected, and the setup went smoothly.
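The broad idea is simply to run the same script on a bigger machine. The one code tweak worth making is to size the thread pool to the VM instead of hard-coding the value that fit the laptop (a sketch, reusing the hypothetical `process` and `list_pdf_keys` helpers from the sketch in the question):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# On the VM, scale the worker count with the machine rather than keeping
# the value tuned for the MacBook. The 2x factor is a rough heuristic for
# work that mostly waits on S3 downloads and the Docker container.
workers = (os.cpu_count() or 4) * 2

with ThreadPoolExecutor(max_workers=workers) as pool:
    list(pool.map(process, list_pdf_keys(BUCKET)))
```

Since the job runs for days, it is also worth launching the script inside tmux (or with nohup) so it keeps running after the SSH session disconnects.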