My team and I are building a new big data engine that reads large amounts of data from Kafka (and other sources), performs some operations on the data, and finally writes to S3-compatible storage (e.g. MinIO) in Iceberg format (all of this in Apache Spark). From there the data can be queried by external services, e.g. using Trino.
My colleague came up with the idea to also WRITE all data through Trino (Spark -> Trino API -> Storage), because then if we ever had to switch, e.g. from Iceberg to Hudi (due to functionality or licensing problems), we could do it easily (all you have to do is change the Trino connector, I guess).
I'm a bit concerned/skeptical because it seems to me that Trino is advertised more as a "reading" solution (for quick data extraction queries) and not exactly for transforming/writing data. My two major concerns are:
- Not a lot of resources/examples online with this kind of architecture; it feels like everyone is joining Spark + Iceberg directly (these solutions seem almost designed to work DIRECTLY with each other), which might significantly extend our development time (which is very limited).
- Performance, which is a very important factor in big data solutions. It is hard to predict how much our performance will suffer after putting Trino on the "write side".
Do you guys see any real benefits of putting Trino on the "write side" of the solution?
Would you go for something like this in your own architecture, or stick with the standard, direct approach between Spark and Iceberg?
As you already know, both `trino` and `spark` are query engines. One can only write to `trino` using `spark-jdbc` from a Spark job. JDBC connections (as for many other databases) have a limit on the number of allowed connections. Writing a large DataFrame over JDBC may need a lot of optimization and tuning on both the Spark job side and the Trino engine side. Whereas using the Spark-provided connectors or the Spark engine itself would, in most cases, not hit these limits (up to the limits of the underlying storage/databases). Using the Spark engine to write also gives the developer the flexibility to play around with partitions, executor cores, and instances to get the desired performance.
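To make the contrast concrete, here is a minimal Scala sketch of the two write paths, assuming Spark 3.x with the Iceberg runtime and the Trino JDBC driver on the classpath; the coordinator host, catalog, schema, and table names are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object WritePaths {

  // Path A: write through Trino over JDBC. Every row funnels through the
  // coordinator, and parallelism is capped by numPartitions, i.e. the
  // number of concurrent JDBC connections the write will open.
  def writeViaTrinoJdbc(df: DataFrame): Unit = {
    df.write
      .format("jdbc")
      .option("url", "jdbc:trino://trino-coordinator:8080/iceberg/analytics") // placeholder
      .option("driver", "io.trino.jdbc.TrinoDriver")
      .option("dbtable", "events")   // placeholder table
      .option("numPartitions", "8")  // caps concurrent JDBC connections
      .mode(SaveMode.Append)
      .save()
  }

  // Path B: write directly with the Spark/Iceberg connector. Executors write
  // data files to object storage in parallel; only the metadata commit is
  // centralized, so partition and executor tuning translate into throughput.
  def writeDirectToIceberg(df: DataFrame): Unit = {
    df.writeTo("my_catalog.analytics.events") // placeholder catalog.schema.table
      .append()
  }
}
```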
From a standardization point of view, one might argue that using `trino` as the SQL interface to all the underlying databases is beneficial, as the code need not change when the underlying storage/database is changed, but the same can be said for Spark SQL and/or the DataFrame/Dataset Spark API. Changing from the existing storage to Hudi or Cassandra (another database example) only needs a set of database-specific properties to be specified, either as `options` or as `sparkConf`.
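As a rough illustration of that last point (not from the original answer), the sketch below shows how the format switch stays confined to configuration: the Iceberg catalog is declared via `sparkConf`, while a Hudi write only swaps in a different format string and a set of format-specific `options`; the bucket, table, and field names are made up:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object FormatSwitch {

  // Iceberg: the catalog is wired up once via sparkConf; the write itself
  // only references the table name.
  val spark: SparkSession = SparkSession.builder()
    .appName("format-switch-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse") // placeholder bucket
    .getOrCreate()

  def writeIceberg(df: DataFrame): Unit =
    df.writeTo("lake.analytics.events").append() // placeholder table

  // Hudi: the same DataFrame is written with a different format string and a
  // set of format-specific options; the surrounding pipeline code is unchanged.
  def writeHudi(df: DataFrame): Unit =
    df.write
      .format("hudi")
      .option("hoodie.table.name", "events")
      .option("hoodie.datasource.write.recordkey.field", "event_id")  // placeholder key column
      .option("hoodie.datasource.write.precombine.field", "event_ts") // placeholder ordering column
      .mode(SaveMode.Append)
      .save("s3a://my-bucket/warehouse/analytics/events")             // placeholder path
}
```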