I have an Airflow / Spark architecture for ETL purposes.
Airflow orchestrates PySpark jobs on my Spark Connect cluster.
To share data between Airflow tasks, I'm using a Docker volume mounted on the tmp folder.
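Roughly, the pattern looks like this (the Spark Connect URL and the paths are illustrative, not my exact values):

```python
from pyspark.sql import SparkSession

# Task 1: write the intermediate dataset to the shared volume
spark = SparkSession.builder.remote("sc://spark-connect:15002").getOrCreate()
df = spark.read.csv("/opt/data/input.csv", header=True)
df.write.mode("overwrite").parquet("/tmp/intermediate_dataset.parquet")

# Task 2 (next Airflow task): read it back from the same volume
df2 = spark.read.parquet("/tmp/intermediate_dataset.parquet")
df2.show()
```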
The problem is that when I try to delete the .parquet folder to reclaim space from a PySpark job using shutil.rmtree, I get a permission error. The same thing happens even when I try to delete it through Docker directly.
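The cleanup step is basically this (the path is illustrative):

```python
import shutil

# Cleanup task, run after the downstream task has consumed the dataset
shutil.rmtree("/tmp/intermediate_dataset.parquet")
# -> fails with a permission denied error
```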
Is there a better way to share my datasets between tasks? Or is there a way to delete the .parquet folder inside my tmp folder?
Thank you!