I have multiple applications that write to Cassandra.
Each app has a backpressure mechanism configured, e.g. throughputMBPerSec=10.
Problems arise when multiple applications run at the same time, because a backpressure value that was set and successfully tested for one app in isolation becomes wrong under the combined load.
In a client-side backpressure scenario, how can I implement a mechanism that chooses a good backpressure value based on the overall pressure on the cluster, without losing too much performance?
How are these kinds of problems solved in large companies?
There are two approaches commonly used by large companies to mitigate write backpressure on Cassandra. I've suggested that application teams use both of these. And I'll suggest a third, as well:
Send each write as its own thread with a "listenable future." Once a certain number of them have been initiated (say 50; this will change per application), block to ensure they have all completed. Once complete, kick off another batch of threads. The main "tuneable" here is to raise or lower the active thread count.
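A minimal sketch of that pattern, assuming the DataStax Java driver 4.x (where `executeAsync` returns a `CompletionStage` rather than the older Guava `ListenableFuture`); the `ThrottledWriter` class, the row representation, and the `MAX_IN_FLIGHT` value of 50 are placeholders to adapt per application:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

public class ThrottledWriter {

    // The main tuneable: raise or lower the number of in-flight writes per application.
    private static final int MAX_IN_FLIGHT = 50;

    public static void writeAll(CqlSession session, PreparedStatement insert, List<Object[]> rows) {
        List<CompletableFuture<AsyncResultSet>> inFlight = new ArrayList<>();

        for (Object[] row : rows) {
            // Kick off the write asynchronously; the driver returns a future-like CompletionStage.
            CompletionStage<AsyncResultSet> stage = session.executeAsync(insert.bind(row));
            inFlight.add(stage.toCompletableFuture());

            // Once MAX_IN_FLIGHT writes have been initiated, block until they have all completed,
            // then kick off the next batch.
            if (inFlight.size() >= MAX_IN_FLIGHT) {
                CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
                inFlight.clear();
            }
        }
        // Wait for the final partial batch.
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }
}
```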
Send each write as a message to an Event Processor/Broker like Apache Pulsar or Apache Kafka. Build a consumer that processes the messages. The main tuneable here is adjusting the size of the consumer's receiver queue. I think the default for Pulsar is 1000 messages.
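For illustration, here is a sketch of a Pulsar consumer with that knob exposed; the service URL, topic name, subscription name, and the value 200 are assumptions to replace with your own:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class WriteConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // assumed broker address
                .build();

        // receiverQueueSize is the main tuneable here (Pulsar's default is 1000 messages).
        // A smaller queue forces more round trips to the broker, effectively throttling
        // how fast this consumer can push writes into Cassandra.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("cassandra-writes")               // assumed topic name
                .subscriptionName("cassandra-writer")    // assumed subscription name
                .receiverQueueSize(200)
                .subscribe();

        while (true) {
            Message<byte[]> msg = consumer.receive();
            // ... write msg.getData() to Cassandra (e.g. via the throttled writer above) ...
            consumer.acknowledge(msg);
        }
    }
}
```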
Build a Cassandra cluster for each application. Unfortunately, Cassandra doesn't do a great job of handling vastly different access patterns. At Target, we finally had enough of the applications with heavy write traffic creating a bottleneck for everyone else and built a cluster for every single new application.
Smaller applications only needed 3 to 6 nodes, while others required more than 200 nodes. When you look at disparities like that in required resources, it really doesn't make sense to co-locate those applications. If you're deploying in the cloud (public or private), this is MUCH easier to do than deploying on bare metal.
Edits
So yes, this is assuming that the thread control is happening in each individual ETL job. This might not be easy to do in your case.
I'm not exactly sure on this one, but I think that adjusting the queue size limits how many messages the consumer can pull off of the topic at any given time. So if that queue is smaller, it has to make more trips to the broker, effectively slowing down how fast the messages can be written into Cassandra.
You're not wrong on this one. It's a tradeoff for sure, but the idea is that some of the problem is solved by the message brokers effectively acting as a hand brake on write throughput. Of course, the overflow is handled by either the messaging servers or the consumers.