We have a frozen UDT with ~2000 fields as one of the columns in a table.
We use this table to implement append-only writes so that the data is auditable and not overwritten.
We are seeing degradation in write performance when only 1 (out of 2000) field in the UDT is populated.
Trying to understand the performance implication of using sparsely populated frozen UDTs. How are UDTs serialized/deserialized internally? Any documentation of this will be highly appreciated.
We tried to gather some metrics from cass session, but couldn't get much information.
edit: Using the C++ Cassandra driver with prepared statements for writes.
Cassandra version: 3.11.6
Data Model:
CREATE TYPE udt_xyx (
field1 bigint,
field2 ..
..
..
field2000
)
CREATE TABLE table_xyz(
key_1 text,
txn_id int,
fields frozen<udt_xyx>,
PRIMARY KEY ((key_1), txn_id)
)
Workflow:
- Request comes in from the caller to write n fields (out of 2000) for a given key_1.
- We assign a unique txn_id (transaction_id) to the request.
- Then we create a UDT object which has 2000 fields but only populate n of those fields and persist it in the table.
- The new request that comes in for the same key_1 with different (or same) fields will be assigned a new txn_id and written to the table as a new record.
That way we are not updating any currently written UDT, but always creating a new record in the table (associated with new txn_id).
When the UDT is sparsely populated, we are experiencing write performance degradation.
EDIT: After doing some analysis we narrowed down the slowness to this: https://github.com/datastax/cpp-driver/blob/master/src/data_type.hpp#L352-L380
Basically, every time we bind a UDT the "check" method runs and compares the string names for every field in the UDT.
Since we have ~2000 fields and we do over 100,000 binds, we're doing roughly 200 million string comparisons (2000 x 100,000).
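To make the arithmetic concrete, here is a minimal standalone sketch of that cost model. It does not use the driver: `UdtType`, `make_type`, `same_type`, and `comparisons_for_binds` are hypothetical names, and the name-by-name loop is only an assumption about what the linked "check" code does, based on the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a UDT definition: an ordered list of field names.
struct UdtType {
    std::vector<std::string> field_names;
};

// Builds a type with n fields named field1..fieldN.
UdtType make_type(std::size_t n) {
    UdtType t;
    t.field_names.reserve(n);
    for (std::size_t i = 1; i <= n; ++i) {
        t.field_names.push_back("field" + std::to_string(i));
    }
    return t;
}

// Name-by-name equality check. Every string comparison performed is
// added to `comparisons` so the total work can be inspected.
bool same_type(const UdtType& a, const UdtType& b, std::uint64_t& comparisons) {
    if (a.field_names.size() != b.field_names.size()) return false;
    for (std::size_t i = 0; i < a.field_names.size(); ++i) {
        ++comparisons;
        if (a.field_names[i] != b.field_names[i]) return false;
    }
    return true;
}

// Simulates `binds` bind calls, each validating the value's type against
// the expected column type, and returns the total comparison count.
std::uint64_t comparisons_for_binds(const UdtType& expected,
                                    const UdtType& actual,
                                    std::uint64_t binds) {
    std::uint64_t comparisons = 0;
    for (std::uint64_t i = 0; i < binds; ++i) {
        bool ok = same_type(expected, actual, comparisons);
        assert(ok);
        (void)ok;  // silence unused-variable warnings when NDEBUG is set
    }
    return comparisons;
}
```

In this model the work per bind is proportional to the number of fields in the type, not to how many of them are populated: 2000 fields x 100,000 binds gives 200,000,000 comparisons even when only one field carries a value.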
What performance are you measuring here? Comparing performance to inserting data using non-UDT columns into a table versus inserting data using both non-UDT columns and UDT-type columns?
A column whose type is a frozen collection (set, map, or list) or UDT can only have its value replaced as a whole. In other words, we can't add, update, or delete individual elements from the collection as we can in non-frozen collection types. So, the frozen keyword can be useful, for example, when we want to protect collections against single-value updates.
For example, when a UDT nested in a list is frozen, the entire list will be read when querying the table.
Since you did not explain how you're updating the frozen collection, it is hard to triage why there is a performance concern here.
Essentially, you cannot append to a frozen value in place: because the value can only be replaced as a whole, changing part of an existing value requires a read-before-write.
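On the question of how a frozen UDT value is laid out: as far as I understand the native protocol v4 specification (the version Cassandra 3.11 speaks), a UDT value is a sequence of fields in declaration order, each encoded as a 4-byte big-endian length followed by that many bytes, with a negative length marking a null field. The sketch below models that layout for a UDT whose fields are all `bigint`. The names `Field`, `put_int32`, `encode_bigint_udt`, and `expected_size` are hypothetical, and this is an illustration of the format under those assumptions, not the driver's or the server's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical field holder: is_set == false models a null (unpopulated) field.
struct Field {
    bool is_set = false;
    std::int64_t value = 0;
};

// Appends a 32-bit signed integer in big-endian byte order.
void put_int32(Bytes& out, std::int32_t v) {
    std::uint32_t u = static_cast<std::uint32_t>(v);
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>(u >> shift));
    }
}

// Encodes an all-bigint UDT: each field is [int32 length][bytes],
// and a null field is written as length -1 with no payload.
Bytes encode_bigint_udt(const std::vector<Field>& fields) {
    Bytes out;
    for (const Field& f : fields) {
        if (!f.is_set) {
            put_int32(out, -1);
            continue;
        }
        put_int32(out, 8);  // a bigint payload is 8 bytes
        std::uint64_t u = static_cast<std::uint64_t>(f.value);
        for (int shift = 56; shift >= 0; shift -= 8) {
            out.push_back(static_cast<std::uint8_t>(u >> shift));
        }
    }
    return out;
}

// Expected encoded size for `total` bigint fields of which `populated` are set.
std::size_t expected_size(std::size_t total, std::size_t populated) {
    assert(populated <= total);
    return populated * 12 + (total - populated) * 4;
}
```

Under this layout, a 2000-field UDT with a single populated `bigint` still encodes to 12 + 1999 x 4 = 8008 bytes, against 24,000 bytes when fully populated, so a sparse value is smaller but every declared field is still visited and written.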