I am trying to figure out the upper limit (practical estimate is fine) of characters/bytes passed on as args to the LivyOperator in Airflow.
The operator eventually uses args (typed as Sequence[str | int | float]) as part of spark_params in the POST request: self.get_hook().post_batch(**self.spark_params).
As there are a lot of steps involved before the Spark cluster receives the original args as arguments, I find it difficult to determine a practical limit when working in Airflow (e.g. when passing serialized JSON as part of args).
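To make the question concrete, here is a minimal sketch of how I would measure the size of the request body. It assumes the hook stringifies each arg and sends a JSON body with "file" and "args" keys, as the Livy batch endpoint expects. The helper name batch_body_size and the file path are made up for illustration, and the real body built by the hook may contain more fields.

```python
import json


def batch_body_size(file, args):
    """Return the UTF-8 byte length of a Livy-style batch request body.

    Assumption: each arg is converted with str() and the body is JSON-encoded.
    """
    body = {"file": file, "args": [str(a) for a in args]}
    return len(json.dumps(body).encode("utf-8"))


# Hypothetical example: one arg is itself a JSON string, the other an int.
payload = json.dumps({"key": "x" * 1000})
size = batch_body_size("local:///app.py", [payload, 42])
print(f"request body: {size} bytes")
```

Note that a JSON string nested inside the JSON body gets its quotes escaped, so the body is somewhat larger than the raw arg.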
Assumptions on available system memory: the server running Airflow has 1-3 GB, the server running Livy has 5-10 GB.
My thoughts on limits involved so far:
- Inside Python: the value whose limit I want to know is a single str, so the Python str limit applies. It depends on the Python installation, but in this case it is practically limited by memory.
- Server hosting Livy: size limit of the POST request (see "What is the size limit of an HTTP POST request?") -> best guess: kB range
- Spark-specific limits (Spark is invoked by Livy) -> ?
- Server-specific command-line argument limits -> best guess: low GB range
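The command-line limit in the last bullet can be read from the OS rather than guessed. This is a minimal sketch, assuming the Livy server runs on a POSIX system and that the args end up on the command line of an OS process. The helper name arg_limits is made up for illustration, and it would have to run on the Livy host, not the Airflow host.

```python
import os


def arg_limits():
    """Return (ARG_MAX, page size) in bytes as reported by the OS (POSIX only)."""
    return os.sysconf("SC_ARG_MAX"), os.sysconf("SC_PAGE_SIZE")


arg_max, page_size = arg_limits()
# ARG_MAX bounds the combined size of all arguments plus the environment.
print(f"ARG_MAX: {arg_max} bytes")
# On Linux, a single argument is additionally capped at 32 pages (MAX_ARG_STRLEN).
print(f"Linux per-argument cap: {32 * page_size} bytes")
```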
It seems like the POST request is the potential bottleneck, but maybe I'm missing something entirely... Any practical advice is highly appreciated.