What are the performance implications of a sparsely populated frozen user-defined type (UDT)?

We have a frozen UDT with ~2000 fields as one of the columns in a table. We use this table to implement append-only writes so that the data is auditable and not overwritten.

We are seeing degradation in write performance when only 1 (out of 2000) field in the UDT is populated.

Trying to understand the performance implication of using sparsely populated frozen UDTs. How are UDTs serialized/deserialized internally? Any documentation of this will be highly appreciated.

We tried to gather some metrics from the driver's CassSession, but couldn't get much information.

Edit: We are using the C++ Cassandra driver with prepared statements for writes.

Cassandra version: 3.11.6

Data Model:

CREATE TYPE udt_xyx (
field1 bigint,
field2 ..
..
..
field2000
);

CREATE TABLE table_xyz(
    key_1 text,
    txn_id int,
    fields frozen<udt_xyx>,
    PRIMARY KEY ((key_1), txn_id)
);

Workflow:

  1. A request comes in from the caller to write n fields (out of 2000) for a given key_1.
  2. We assign a unique txn_id (transaction_id) to the request.
  3. Then we create a UDT object which has 2000 fields but only populate n of those fields and persist it in the table.
  4. The new request that comes in for the same key_1 with different (or same) fields will be assigned a new txn_id and written to the table as a new record.

That way we are not updating any currently written UDT, but always creating a new record in the table (associated with new txn_id).

When the UDT is sparsely populated, we are experiencing write performance degradation.

EDIT: After doing some analysis we narrowed down the slowness to this: https://github.com/datastax/cpp-driver/blob/master/src/data_type.hpp#L352-L380

Basically, every time we bind a UDT, the "check" method runs and compares the string names of every field in the UDT.

Since we have ~2000 fields and do over 100,000 binds, that works out to roughly 200 million (2,000 × 100,000) string comparisons.
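To make the arithmetic concrete, here is a minimal, self-contained model of that cost. It is not the driver's code: `FieldNames`, `same_fields`, and `comparisons_for_binds` are hypothetical names, and the only assumption taken from the analysis above is that each bind compares the name of every field in the type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a UDT definition: an ordered list of field names.
using FieldNames = std::vector<std::string>;

// Builds "field1" .. "fieldN", mirroring the naming used in the question.
FieldNames make_field_names(std::size_t n) {
  FieldNames names;
  names.reserve(n);
  for (std::size_t i = 1; i <= n; ++i) {
    names.push_back("field" + std::to_string(i));
  }
  return names;
}

// Compares two definitions name by name and counts each string comparison.
bool same_fields(const FieldNames& a, const FieldNames& b,
                 std::uint64_t& comparisons) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    ++comparisons;
    if (a[i] != b[i]) return false;
  }
  return true;
}

// Total string comparisons when every bind re-checks the full definition.
std::uint64_t comparisons_for_binds(std::size_t field_count, std::size_t binds) {
  const FieldNames expected = make_field_names(field_count);
  const FieldNames bound = make_field_names(field_count);
  std::uint64_t comparisons = 0;
  for (std::size_t i = 0; i < binds; ++i) {
    const bool ok = same_fields(expected, bound, comparisons);
    assert(ok);  // the two definitions are identical in this model
    (void)ok;
  }
  return comparisons;
}
```

In this model, `comparisons_for_binds(2000, 100000)` counts 200,000,000 comparisons, and the count depends only on the number of fields in the type, not on how many of them are populated.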

Answer by Madhavan:

What performance are you measuring here? Are you comparing inserts into a table with only non-UDT columns against inserts into a table with both non-UDT and UDT columns?

A column whose type is a frozen collection (set, map, or list) or a frozen UDT can only have its value replaced as a whole. In other words, we can't add, update, or delete individual elements as we can with non-frozen collection types. So the frozen keyword can be useful, for example, when we want to protect collections against single-value updates.

For example, in case of the below snippet,

CREATE TYPE IF NOT EXISTS race (
race_title text,
race_date date
);

CREATE TABLE IF NOT EXISTS race_data (
id INT PRIMARY KEY,
races frozen<list<race>>
...
);

the UDT nested in the list is frozen, so the entire list will be read when querying the table.

Since you did not describe how you're updating the frozen collection, it is hard to triage why there is a performance concern here.

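Essentially, you cannot append to or partially update a frozen value in place. Changing any single field means rewriting the whole value, so for any upsert that must preserve the other fields, the application has to read the existing value before writing it back.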
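On the serialization part of the question: as far as I understand the CQL native protocol specification, a UDT value is encoded as one length-prefixed value per field, in declaration order, with a null field written as a 4-byte length of -1 and no payload. The sketch below models that layout for a type whose fields are all bigint. `Field`, `append_int32`, and `encode_bigint_udt` are hypothetical names, not driver or server APIs, and the sketch assumes the writer emits every field (the spec also allows fewer values than the type has fields).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical representation of one bigint UDT field that may be unset (null).
struct Field {
  bool is_set;
  std::int64_t value;
};

// Appends a 32-bit signed integer in big-endian (network) byte order.
void append_int32(std::vector<std::uint8_t>& out, std::int32_t v) {
  const std::uint32_t u = static_cast<std::uint32_t>(v);
  for (int shift = 24; shift >= 0; shift -= 8) {
    out.push_back(static_cast<std::uint8_t>(u >> shift));
  }
}

// Encodes each field as [int32 length][payload]; a null field is length -1 only.
std::vector<std::uint8_t> encode_bigint_udt(const std::vector<Field>& fields) {
  std::vector<std::uint8_t> out;
  for (const Field& f : fields) {
    if (!f.is_set) {
      append_int32(out, -1);  // null marker, no payload bytes follow
      continue;
    }
    append_int32(out, 8);     // a bigint payload is 8 bytes
    const std::uint64_t u = static_cast<std::uint64_t>(f.value);
    for (int shift = 56; shift >= 0; shift -= 8) {
      out.push_back(static_cast<std::uint8_t>(u >> shift));
    }
  }
  return out;
}
```

Under these assumptions, a 2000-field value with only one bigint populated still occupies 1999 × 4 + 12 = 8008 bytes, so a sparse frozen UDT carries a fixed per-field overhead on every write regardless of how few fields are set.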