Is each column in Google Big Table indexed by index key?

202 Views Asked by At

From the google cloud documentation, I see:

Each table has only one index, the row key

However, I see someone mentioned (I can not find the source anymore) big table has index for every column. The column is indexed by index key. So, big table can locate a column very fast by the index key. Index Key = Row Key + ColumnFamily:ColumnQualifier + timestamp

Is above statement true?

2

There are 2 best solutions below

0
Sathi Aiswarya On

No, Bigtable does not have individual indexes for every column. It has only one primary index known as row key.

As mentioned in this document.

Each row is indexed by a single row key, and columns that are related to one another are typically grouped into a column family. Each column is identified by a combination of the column family and a column qualifier, which is a unique name within the column family.

0
Bora On

That is correct. It is a composite index in the form:

Index Key = Row Key + ColumnFamily + ColumnQualifier + timestamp

However as with any composite index, the ordering matters. Since the row key is the leftmost item in the index, you'd still need the row key (or row key prefix or a range of row keys) to avoid a full table scan.

If you're coming from systems like DynamoDB or Cassandra, one way to think about this could be to treat the row key like the partition key and column qualifier as dual-purpose for Cassandra columns or DynamoDB attributes and sort keys in both systems.You can extend this pattern to even create local secondary indexes in Bigtable since within a row, multiple column mutations will be done as an atomic transaction.

Assume a table where you keep Formula 1 race results. This is not the best example since data would be very small but imagine it were a much bigger dataset.

  • Rowkey: The circuit e.g. Monte Carlo, Le Mans
  • Column Family: a meaningful name to group columns by e.g. driver
  • Column: we'll get creative with these e.g. assume I want fast access by placement in each race e.g. 01 for 1st place, 02 for 2nd place driver then I also want quick access by the team e.g. Ferrari, McLaren, Red Bull
  • Timestamp: let's ignore timestamp for brevity

So my overall key could look like

 rowkey      +  family  + column
"MonteCarlo" + "driver" + "01" -> "Max Verstappen"
"MonteCarlo" + "driver" + "02" -> "Charles Leclerc" 

which allows me to quickly get e.g. top 5 drivers in ranked order since 01, 02... will be sorted. So we effectively used the columns as sort keys.

As I am writing this row, I can also insert some additional columns in the same mutation in a single transaction such as

"MonteCarlo" + "team" + "RedBull" -> ["Max Verstappen", "Sergio Pérez"]
"MonteCarlo" + "team" + "Ferrari" -> ["Charles Leclerc"]

Now within the MonteCarlo "partition" I can quickly pinpoint the data by team name instead of scanning through each driver to see which team they belong to and filter that way. This is essentially the way a local secondary index do on other systems.

Long story short, Bigtable columns are quite flexible. You can use them as column names as in a typical database but also to store data since you can define different set of columns at write-time for each row. And these would all be part of one large composite index but that index always starts with the rowkey.