I have a question on Cassandra as a newbie. I was wondering if it's possible somehow to make a query with a contains statement (i found out that an in statement might not be the best for lot's of data) that accepts a list of things.
Something like this:
SELECT * FROM table WHERE column CONTAINS('valueA', 'valueB');
Searching online i read that it shouldn't be possible but if i work with too many values I would have to write 60 CONTAINS and the query would be too heavy i guess.
So i was wondering if somehow that is possible in Cassandra or there is a workaround. Thanks!
So while you did mention the
INstatement, that is basically what you're asking it to do here. And that will certainly work. If I have a table namedtimeswith a single PRIMARY KEY ofid, I can do this with theINoperator:The
CONTAINSoperator is a little different. I'm pretty sure that it only accepts one parameter. ButCONTAINSonly works on collection columns; columns of type List, Set, or Map. So if I have a tableplaylistswith a columntags LIST<TEXT>and an index on thetagscolumn, then this works:As for the part about "not the best for lots of data," you're right. Neither approach really fits with Casandra. The first is what we call a "multi-key query," and the second is a "secondary index query." Cassandra works well in large row scenarios because it can figure out which nodes have the data by running a hash on the partition key.
When Cassandra can't pinpoint the nodes which contain the data (as is the case with a both multi-key and secondary index queries). It picks a node to be something called a "coordinator node." That node is then responsible for contacting (in the case of the secondary index query) all of the other nodes in the cluster, assembling the result set, and returning it to the calling application. This introduces network time into the equation.
If you're running a 6 node cluster, that may not be too bad. However, a 60 node cluster will not be as forgiving. Also, if the query pulls too many results back, it may use up all of the available heap memory on the coordinator node, causing it to crash.
That's why running queries based on the PRIMARY KEY is the recommended way to go. Because results can be returned after contacting only 1 or 2 nodes, and not all nodes need to be polled. So stick with your PRIMARY KEY. If that doesn't work for your query, then create a new table with a different PRIMARY KEY as that will be A) faster and B) easier on the cluster.