Today we are using Hive as our data warehouse, mainly used for batch/bulk data processing - hive analytics queries/joins etc - ETL pipeline
Recently we are facing a problem where we are trying to expose our hive based ETL pipeline as a service. The problem is related to the fixed table schema nature of hive. We have a situation where the table schema is not fixed, it could change ex: new columns could be added (at any position in the schema not necessarily at the end), deleted, and renamed.
In Hive, once the partitions are created, I guess they can not be changed i.e. we can not add new column in the older partition and populate just that column with data. We have to re-create the partition with new schema and populate data in all columns. However new partitions can have new schema and would contain data for new column (not sure if new column can be inserted at any position in the schema?). Trying to read value of new column from older partition (un-modified) would return NULL.
I want to know if I can use HBase in this scenario and will it solve my above problems? 1. insert new columns at any position in the schema, delete column, rename column 2. backfill data in new column i.e. for older data (in older partitions) populate data only in new column without re-creating partition/re-populating data in other columns.
I understand that Hbase is schema-less (schema-free) i.e. each record/row can have different number of columns. Not sure if HBase has a concept of partitions?
You will be able to populate data only in new column without re-creating partition/re-populating data in other columns
but
Unfortunately, HBase does not support partitions (talking in Hive terms) you can see this discussion. That means if partition date will not be a part of row key, each query will do a full table scan
I hope it is helpful.