Reduce cassandra tombstones

897 Views Asked by At

I have a table to store messages which are failed to process and I am retrying to process messages every 5 minutes through scheduler.

When message gets processed successfully, respective row from table is deleted, so that same message should not get processed again.

To fetch rows from table query is SELECT * FROM <table_name> , due to which we are facing tombstone issues if large number of rows gets deleted. Table have timestamp as partition key and message_name(TEXT) as clustering key, TTL of 7 days and gc_grace_second of 2 days

As per my requirement, I need to delete records otherwise duplicate record will get processed. Is there any solution to avoid tombstone issues?

2

There are 2 best solutions below

2
Erick Ramirez On

Unfortunately, there isn't a quick fix to your problem.

The challenge for you is that you're using Cassandra as a queue and it isn't a good idea because you run exactly into that tombstone hell. I'm sure you've seen this blog post by now that talks queues and queue-like datasets being an anti-pattern for Cassandra.

It is possible to avoid generating lots of tombstones if you model your data differently in buckets with each bucket mapping to a table. When you're done processing all the items in the bucket, TRUNCATE the table. This idea came from Ryan Svihla in his blog post Understanding Deletes where he goes through the idea of "partitioning tables". Cheers!

3
Aaron On

So I see two problems here.

  1. Cassandra is being used as a queuing mechanism, which is an established anti-pattern.
  2. All partitions are being queried with SELECT * FROM <table_name>, because there isn't a WHERE clause.

So with Cassandra, some data models and use cases will generate tombstones. At that point, there's not a whole lot to be done, except to design the data model so as to not query them.

So my thought here, would be to partition the table differently.

CREATE TABLE messages (
    day TEXT,
    message_time TIMESTAMP,
    message_text TEXT,
    PRIMARY KEY ((day),message_time))
WITH CLUSTERING ORDER BY (message_time DESC);

With this model, you can query all messages for a particular day. You can also run a range query on day and message_time. Ex:

SELECT * FROM messages
WHERE day='20210827'
AND message_time > '2021-08-27 04:00';

This will build a result set of all messages since 2021-08-27 04:00. Any tombstones generated outside of the requested time range (in this case, before 04:00) will not be queried.

Note that (based on the delete pattern) you could still have tombstones within the given time range. But the idea here, is that the WHERE clause limits the "blast radius," so querying a smaller number of tombstones shouldn't be a problem.