Missing column that was just inserted in Cassandra column family


We are constantly running into a problem on our test cluster.

  1. Cassandra configuration:

    • Cassandra version: 2.2.12
    • node count: 6 (3 seed nodes, 3 non-seed nodes)
    • replication factor: 1 (of course, for prod we will use 3)
  2. Configuration of the table where we see the problem:

    CREATE TABLE "STATISTICS" (
        key timeuuid,
        column1 blob,
        column2 blob,
        column3 blob,
        column4 blob,
        value blob,
        PRIMARY KEY (key, column1, column2, column3, column4)
    ) WITH COMPACT STORAGE
        AND CLUSTERING ORDER BY (column1 ASC, column2 ASC, column3 ASC, column4 ASC)
        AND caching = {
            'keys':'ALL', 'rows_per_partition':'100'
        }
        AND compaction = {
            'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
        };
    
  3. Details of our Java code:

    • Java 8
    • Cassandra driver: Astyanax
    • app node count: 4

So, here is what happens:

Under high load our application does many inserts into Cassandra tables from all app nodes. As part of this we have one workflow that does the following with a single row in the STATISTICS table:

  1. insert 3 columns from app-node-1
  2. insert 1 column from app-node-2
  3. insert 1 column from app-node-3
  4. read all columns of the row on app-node-4

At the last step (4), when we read all columns, we are sure that all the inserts have completed (this is guaranteed by other checks that we have).

The problem is that sometimes (2-5 times out of 100,000) the read at step 4 returns 4 columns instead of 5, i.e. we are missing the column that was inserted at step 2 or 3.

We even started re-reading these columns every 100 ms in a loop, and we still do not get the expected result. During this time we also checked the columns using cqlsh, with the same result, i.e. 4 columns instead of 5.
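The retry loop described above can be sketched roughly as follows. This is only an illustration, not our actual code: the class name ColumnPoller, the method pollUntilCount and the IntSupplier that stands in for the real Astyanax read are all hypothetical.

```java
import java.util.function.IntSupplier;

public final class ColumnPoller {

    private ColumnPoller() {
    }

    /**
     * Calls readColumnCount up to maxAttempts times, sleeping intervalMillis
     * between attempts. Returns the 1-based attempt at which at least
     * expectedCount columns were seen, or -1 if that never happened
     * (or if the thread was interrupted while sleeping).
     */
    public static int pollUntilCount(IntSupplier readColumnCount,
                                     int expectedCount,
                                     int maxAttempts,
                                     long intervalMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // readColumnCount is a stand-in for "read the row and count its columns"
            if (readColumnCount.getAsInt() >= expectedCount) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

In the failing case a call such as pollUntilCount(reader, 5, maxAttempts, 100) would keep returning -1 no matter how many attempts are allowed, which is what distinguishes this from an ordinary short propagation delay.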

BUT, if we add any new column to this row, we immediately get the expected result, i.e. we then get 6 columns: the 5 columns from the workflow plus 1 dummy. So after inserting a dummy column, the missing column that was inserted at step 2 or 3 shows up.

Moreover, when we look at the timestamp of the missing (and later appeared) column, it is very close to the time when this column was actually written from our app node.

The inserts from app-node-2 and app-node-3 are done at nearly the same time, so these two columns always end up with nearly the same timestamp, even if we insert the dummy column a minute after the first read of all columns at step 4.

With replication factor 3 we cannot reproduce this problem.

So the open questions are:

  1. Could this be expected behavior for Cassandra when the replication factor is 1?
  2. If it is not expected, what could be the potential reason?

UPDATE 1:

The following code is used to insert a column:

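import com.netflix.astyanax.ColumnListMutation;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.Composite;
import com.netflix.astyanax.serializers.CompositeSerializer;
import com.netflix.astyanax.serializers.UUIDSerializer;
import java.util.UUID;

// Row key and the components of the composite column name
UUID uuid = <some uuid>;
short shortV = <some short>;
int intVal = <some int>;
String strVal = <some string>;
// Column family definition: UUID row key, composite column name
ColumnFamily<UUID, Composite> statisticsCF = ColumnFamily.newColumnFamily(
        "STATISTICS", 
        UUIDSerializer.get(), 
        CompositeSerializer.get()
);
// keyspace is our already initialized Astyanax Keyspace
MutationBatch mb = keyspace.prepareMutationBatch();
ColumnListMutation<Composite> clm = mb.withRow(statisticsCF, uuid);
// Single column: 4-component composite name, boolean value
clm.putColumn(new Composite(shortV, intVal, strVal, null), true);
mb.execute();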

UPDATE 2:

We continued testing and investigating.

When we caught this situation again, we immediately stopped (killed) our Java apps. After that we could consistently see in cqlsh that the particular row did not contain the inserted column.

To make it appear, we first tried nodetool flush on every Cassandra node:

pssh -h cnodes.txt /path-to-cassandra/bin/nodetool flush

The result was the same: the column did not appear.

Then we simply restarted the Cassandra cluster, and the column appeared.

UPDATE 3:

We tried disabling the Cassandra row cache by setting the row_cache_size_in_mb property to 0 (before that it was 2 GB):

row_cache_size_in_mb: 0

After that, the problem was gone.

So the problem is probably in OHCProvider, which is used as the default cache provider.
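To make the suspicion concrete, here is a toy model of how a stale row cache entry could produce exactly the symptoms above. It is purely an assumption for illustration: it is NOT Cassandra's or OHC's actual code, and the class StaleRowCacheModel and the method writeWithLostInvalidation are hypothetical names. It only shows that if a write ever reached storage without invalidating the cached copy of the row, reads would keep returning the old row until some later write invalidated it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only; not how Cassandra or OHC is implemented. */
public final class StaleRowCacheModel {

    // Authoritative data: row key -> column names
    private final Map<String, Set<String>> store = new HashMap<>();
    // Cached copies of whole rows
    private final Map<String, Set<String>> rowCache = new HashMap<>();

    /** Normal write: store the column and invalidate the cached row. */
    public void write(String key, String column) {
        store.computeIfAbsent(key, k -> new TreeSet<String>()).add(column);
        rowCache.remove(key);
    }

    /** Hypothetical faulty write: the column is stored, but the invalidation is lost. */
    public void writeWithLostInvalidation(String key, String column) {
        store.computeIfAbsent(key, k -> new TreeSet<String>()).add(column);
    }

    /** Read through the cache; on a miss, copy the row from the store into the cache. */
    public Set<String> read(String key) {
        Set<String> cached = rowCache.get(key);
        if (cached == null) {
            Set<String> stored = store.get(key);
            cached = stored == null ? new TreeSet<String>() : new TreeSet<String>(stored);
            rowCache.put(key, cached);
        }
        return cached;
    }
}
```

In this model, once the row is cached with 4 columns and the 5th is written with a lost invalidation, every read returns 4 columns; a later normal write of a dummy column invalidates the entry and the next read returns all 6. Whether anything like this actually happens inside Cassandra 2.2.12 is exactly what we do not know.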
