How does Dremel or its implementation (say Drill) handle large columnar data layout in memory?


I am going through the Google Dremel white paper. I learned that it converts complex (nested) data into a columnar data layout.

At what location is this data stored?

As Drill has no central metadata repository, I assume it must be in-memory.

How, then, does Drill handle this data when I have billions of rows?


Accepted answer (catpaws):

To get complete, consistent query results from billions of rows, you can use a distributed file system connected to multiple Drillbits, simulate a distributed file system by copying files to each node, or use an NFS volume such as Amazon Elastic File System. Drill queries big data performantly using a number of techniques, including these:

  • Relies on the cluster nodes to handle failures (doesn't spend time on failure-related tasks).
  • Uses an in-memory data model that's hierarchical and columnar (doesn't access the disk for columns that are not involved in an analytic query, processing the columnar data without row materialization).
  • Uses columnar storage optimizations and execution (keeps memory footprint low).
  • Uses vectorization to work on arrays of values from different records rather than single values from one record at a time.
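To make the columnar and vectorization points concrete: Drill's actual in-memory format is built from Java "value vectors", but the idea can be sketched in a few lines of Python (this is an illustrative toy, not Drill's implementation). A query that aggregates one column only reads that column's array and never materializes full rows:

```python
# Hypothetical sketch of columnar layout (not Drill's Java value vectors).
rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.5},
    {"id": 3, "name": "c", "amount": 30.0},
]

# Columnar layout: one contiguous array per column.
columns = {
    "id":     [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# "Vectorized" aggregate: operates on the whole amount column at once,
# without touching the unused name column or rebuilding any row.
total = sum(columns["amount"])
print(total)  # 60.5
```

In a real engine the per-column arrays are fixed-width memory buffers, so the aggregate loop runs over contiguous memory and can use CPU SIMD instructions; that is what keeps the memory footprint low and avoids row materialization.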

For more information, see http://drill.apache.org/docs/performance/.
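For the distributed-file-system setup mentioned above, Drill reads the data through a file storage plugin rather than a central metadata repository. A plugin configuration pointing Drillbits at HDFS might look like the following sketch (the namenode host, port, and paths are hypothetical placeholders):

```json
{
  "type": "file",
  "connection": "hdfs://namenode:8020/",
  "workspaces": {
    "root": {
      "location": "/data",
      "writable": false,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "parquet": { "type": "parquet" }
  }
}
```

Each Drillbit uses this configuration to locate the files directly, which is why no central metadata store is needed: the schema is discovered from the data (e.g. Parquet footers) at query time.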