How is columnar storage in the context of a NoSQL database like Cassandra different from that in Redshift? If Cassandra is also a columnar store, then why isn't it used for OLAP applications the way Redshift is?
Columnar storage: Cassandra vs Redshift
Asked by p0712
Jun Yin
I encountered the same question today and found this resource on AWS: https://aws.amazon.com/nosql/columnar/
The storage engines of Cassandra and Redshift are very different and were created for different use cases. Cassandra's storage is not really "columnar" in the widely known sense used for databases like Redshift, Vertica, etc.; it is much closer to the key-value family in the NoSQL world. The query language used in Cassandra (CQL) is not ANSI SQL, and it supports only a very limited set of queries. Cassandra's engine is built for fast writing and reading of records based on a key, while Redshift's engine is built for fast aggregations (MPP), has wide support for analytical queries, and stores, encodes and compresses data at the column level.
This is easiest to understand with the following example:
Suppose we have a table with a user id and many metrics (for example weight, height, blood pressure, etc.). If we run an aggregate query in Redshift, such as the average weight, it will do the following (in the best-case scenario):
1. The leader node sends the query to the compute nodes.
2. Only the data for this specific column is fetched from storage.
3. The query is executed in parallel on all nodes.
4. The final result is returned to the leader node.
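The steps above can be sketched as a toy model in Python. This is only an illustration, not how Redshift is implemented: the `nodes` data, the per-node dictionaries of columns, and the function names are all hypothetical, and threads stand in for separate compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: each "node" holds a slice of the table,
# stored column by column (one list per column).
nodes = [
    {"user_id": [1, 2], "weight": [70.0, 80.0], "height": [175, 180]},
    {"user_id": [3, 4], "weight": [60.0, 90.0], "height": [165, 190]},
]

def partial_sum_count(node, column):
    """Runs on one node: reads only the requested column."""
    values = node[column]  # the other columns are never touched
    return sum(values), len(values)

def columnar_average(nodes, column):
    """Plays the leader: fans the query out, then merges partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda n: partial_sum_count(n, column), nodes))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

print(columnar_average(nodes, "weight"))  # 75.0
```

The point of the sketch is that the `height` and `user_id` lists are never read when averaging `weight`, and each node only returns a small partial result.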
Running the same query in Cassandra results in a scan of all "rows"; each "row" can have several versions, and only the latest one should be used in the aggregation. If you are familiar with any key-value store (Redis, Riak, DynamoDB, etc.), it is less efficient than scanning all keys there.
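For contrast, here is a minimal sketch of the scan described above, again as a toy model rather than Cassandra's actual storage engine. The `entries` list, its timestamps, and the function names are hypothetical, and versioning is simplified to whole records (Cassandra tracks write timestamps at a finer granularity).

```python
# Hypothetical toy data: (key, write_timestamp, full record).
# Key 1 was written twice, so two versions of that "row" exist.
entries = [
    (1, 100, {"weight": 70.0, "height": 175}),
    (1, 200, {"weight": 72.0, "height": 175}),
    (2, 150, {"weight": 80.0, "height": 180}),
]

def latest_rows(entries):
    """Scan every entry and keep only the newest version per key."""
    latest = {}
    for key, ts, row in entries:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, row)
    return {key: row for key, (_, row) in latest.items()}

def row_scan_average(entries, column):
    """Every full record is read, even though one field is needed."""
    rows = latest_rows(entries)
    values = [row[column] for row in rows.values()]
    return sum(values) / len(values)

print(row_scan_average(entries, "weight"))  # 76.0
```

Here every record, including the stale version of key 1 and the unused `height` field, has to be read and reconciled before a single average can be computed.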
Cassandra is often used in analytical workflows together with Spark, where it acts as the storage layer while Spark acts as the actual query engine; it basically shouldn't be used for analytical queries on its own. More aggregation capabilities are added with each release, but it is still very far from being a real analytical database.