Perfect data model for REST API

55 Views Asked by At

I created a REST API whose schema looks like this :

type Todo struct {
    ID          gocql.UUID    `json:"id"`
    User_ID     string        `json:"user_id"`
    Title       string        `json:"title"`
    Description string        `json:"description"`
    Status      string        `json:"status"`
    Created     time.Time     `json:"created"`
    Updated     time.Time     `json:"updated"`
}

So as I'm using CQL and majorly all the function queries are filtering on ID but in one query I have to filter using where clause on "Status" and Order By "Created" in DESC like below :

SELECT id, user_id, title, description, status, created, updated FROM todos WHERE status = ? ORDER BY created

Currently I'm using below model :

CREATE TABLE todos (
    id UUID,
    user_id TEXT,
    title TEXT,
    description TEXT,
    status TEXT,
    created TIMESTAMP,
    updated TIMESTAMP,
    PRIMARY KEY((status, created), id)
);

but using this I'm getting filtering errors. What should be the perfect table creation so I can perform the desired operations without any issue

Expecting some good responses to get my issues fixed

1

There are 1 best solutions below

2
Andrew On

The problem requires two tables to solve, because there are two usage patterns which are not compatible - the bulk of the queries are handled by the ID partition key, so the first table would be partitioned by ID - but the status query can not use that table design.

For the table schema you posted, the filtering errors are because the partition key is a composite key of the 2 values, status and created. As such from a partition perspective, they are hashed to a single value. By providing only 1 part of the key, it would be forced to scan all partitions.

Your initial reaction would then be to change the primary key to be:

PRIMARY KEY((status), created, id)

The double parens are not required, but I left them in to make it obvious what the change is. The likelihood though is that status is a very low cardinality value, which would result in a very high number of records within a single partition - which is unadvisable.

You may be tempted into a secondary index - but this suffers the same problem (as does a materialized view). The cardinality problem needs to be solved.

I would look to use time buckets in some form, such as:

  PRIMARY KEY((status, time_bucket), created, id)

Where time_bucket is a timestamp field which is a truncated version of the created field that your application provides, such as week, day, or even hour. You should know which time bucket you are searching within, or a number of buckets that you would search for in parallel, and construct the final ordered answer within the application code. The truncation level to use should be based on how much data per status you expect within a period of time.

This increases the cardinality of the partition key and distributes it across the nodes within the cluster.

If there were no updates, TWCS would then be a feasible option but the existence of the updated field in the struct, indicates that this is not the case.

Final result:

CREATE TABLE todos_by_id (
    id UUID,
    user_id TEXT,
    title TEXT,
    description TEXT,
    status TEXT,
    created TIMESTAMP,
    time_bucket TIMESTAMP,
    updated TIMESTAMP,
    PRIMARY KEY(id)
);

CREATE TABLE todos_by_status_time (
    id UUID,
    user_id TEXT,
    title TEXT,
    description TEXT,
    status TEXT,
    created TIMESTAMP,
    time_bucket TIMESTAMP,
    updated TIMESTAMP,
    PRIMARY KEY((status, time_bucket), created, id)
);