Table Structure: testkeyspace.productInfo (productID, productName, timestamp, price)
- Partition key column in the above table is "productID".
- The table testkeyspace.productInfo consists of 10k records.
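For reference, a minimal CQL sketch of this table (the column types are assumptions; the question doesn't state them):

```
CREATE TABLE testkeyspace.productInfo (
    productID   text,       -- partition key
    productName text,
    timestamp   timestamp,
    price       decimal,
    PRIMARY KEY (productID)
);
```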
Test Case
- Test consists of two Java applications
- Application 1 -> Writes data to Cassandra table (testkeyspace.productInfo)
- Application 2 -> Reads from the Cassandra table (testkeyspace.productInfo) continuously at a 5-second interval
- The partition key (productID) is unique for all records; hence, the number of partitions equals the number of records in the table.
How is data read from the Cassandra table? (Read Pattern)
1. Initially, a "SELECT *" is issued to the Cassandra table with the help of Cassandra JDBC.
2. The latest timestamp (lastNotedTimestamp) of the fetched records is noted from the timestamp column of the table.
3. Subsequent "SELECT *" queries are issued with a WHERE condition of timestamp > lastNotedTimestamp.
4. Steps 2 and 3 are repeated until the application is killed.
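In CQL terms, the polling pattern above amounts to roughly the following (a sketch; `ALLOW FILTERING` is an assumption, since a predicate on a non-partition-key column won't execute without it):

```
-- Step 1: initial full-table read
SELECT * FROM testkeyspace.productInfo;

-- Step 3: incremental reads; the bound value is the lastNotedTimestamp
-- remembered by the application from the previous fetch
SELECT * FROM testkeyspace.productInfo
WHERE timestamp > ? ALLOW FILTERING;
```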
Issue
- While Step 4 is executed in parallel with the write application, a few random records are missing from the ResultSet returned by the JDBC call.
- While Step 4 is executed after the write application has completed its task, no records are missed and the fetching is successful.
Note: The above issue occurs even when using a numeric column in the WHERE clause of the SELECT query.
This was tested and observed on a single-node cluster.
Can someone explain this behaviour and let me know why this is happening?
Thanks in advance.
Sure. The problem is with the queries.

The initial `SELECT *` makes no mention of a `WHERE` clause, so I'll assume that there isn't one.

The subsequent `SELECT * ... WHERE timestamp > lastNotedTimestamp` does have a `WHERE` clause. However, I don't see the partition key inside of it.

When you query Cassandra without specifying a partition key, there are no guarantees that the query can be served by a single node. This means (for both queries) that one node is selected as a coordinator to query the other nodes exhaustively. Once that node has assembled the result set (in RAM), it then returns it to the application.
Non-partition-key queries cause Cassandra nodes to work very hard. Now add into the mix that the cluster is also being asked to handle the throughput of the write application, and I'm not surprised that you're seeing issues.
Essentially, this is happening because the current read/write patterns are too resource-intensive for your cluster to handle.
You can start by avoiding unbound queries: `SELECT *` (without a `WHERE` clause) or the `ALLOW FILTERING` directive.

Edit 20230518
Yes, because that query is still causing stress on all of the nodes. The partition key needs to be specified with an equality operator. Sure, it's getting a partition key, but all of the partition keys greater than `latestPartitionColumnObservedPosition` will still be spread across multiple nodes, so nothing is improving.

Given the table definition provided above, this table can support one performant query: an equality filter on the partition key.
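With `productID` as the partition key, that one performant query is an equality lookup on it (a sketch; the bound value is whichever product is being requested):

```
SELECT * FROM testkeyspace.productInfo
WHERE productID = ?;
```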
That's it. If there's any other query running on that table, it's going to be hitting multiple nodes and causing them to work hard.
If the concern is about the `timestamp`, then you might try building a new table with a "bucketing" approach. This will store products updated on a particular day in a single partition, meaning that they can be queried by day.
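A sketch of such a bucketed table and its query, assuming a day-granularity bucket (the names `productinfo_by_day` and `day_bucket` are illustrative, not from the original schema):

```
CREATE TABLE testkeyspace.productinfo_by_day (
    day_bucket       text,       -- e.g. '2023-05-18'; one partition per day
    producttimestamp timestamp,  -- clustering key, newest rows first
    productid        text,       -- final clustering key; tie-breaker for uniqueness
    productname      text,
    price            decimal,
    PRIMARY KEY ((day_bucket), producttimestamp, productid)
) WITH CLUSTERING ORDER BY (producttimestamp DESC, productid ASC);

-- Pull one day's updates, optionally narrowing by timestamp within the partition:
SELECT * FROM testkeyspace.productinfo_by_day
WHERE day_bucket = '2023-05-18'
  AND producttimestamp > '2023-05-18 09:00:00+0000';
```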
Depending on how many products are updated in a particular day, that "time bucket" may need to be a different unit of measure. But that's up to the business requirements. This also allows filtering on timestamp, as long as it's within a partition.
Note that `producttimestamp` is specified as a clustering key, sorting in descending order. This is because most time-based application queries tend to care about the most-recent data. Sorting in descending order allows queries to quickly pull records from the "top" of the partition, which will be faster. `productid` is the last clustering key, but it's really just there as a tie-breaker to ensure uniqueness.

You should check out DataStax Academy, especially DS220, which is our free class on Cassandra data modeling. It really does a good job of explaining how everything works under the hood, and gives you data modeling and query-building strategies to leverage that.