What is the meaning of the bitVectors field in Hive column statistics?


I read about Hive column statistics in the documentation. Most columns have a field called bitVectors. What does it mean, and what is the rule for calculating it?

I created some tables and used the ANALYZE statement to compute statistics. When I display the column statistics, most columns have this field (bitVectors) with a value of null or HL, and I don't know what that means.

leftjoin On

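This feature was introduced in Hive 3.0.0 by HIVE-16997 (Extend object store to store and use bit vectors) and is still not fully documented. Bit vectors stored in the statistics metadata are used to estimate the number of distinct values (NDV) with sketch algorithms such as the FM sketch (Flajolet–Martin) and HLL (HyperLogLog). The HL value you see most likely marks a HyperLogLog bit vector, and null presumably means that no bit vector is stored for that column.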

A corresponding parameter was also added that lets Hive use the stored bit vectors for NDV calculation:

hive.stats.fetch.bitvector

Default Value: false
Added In: Hive 3.0.0 with HIVE-16997

Whether Hive fetches bitvector when computing number of distinct values (ndv). Keep it set to false if you want to use the old schema without bitvectors.

See the hive.stats.fetch.bitvector entry in the Hive configuration properties documentation.

For more background, search for the sketch algorithms themselves: FM (Flajolet–Martin) and HLL (HyperLogLog). The Flajolet–Martin algorithm is a good place to start.
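To make the idea of a bit vector concrete, here is a minimal Flajolet–Martin sketch in Python. It is only an illustration and not Hive's implementation: the class name FMSketch, the SHA-1 hash, the 32-bit width and the use of a single bitmap are all assumptions chosen for brevity. It shows the two properties that make storing the bit vector useful: duplicates do not change it, and the vectors of two data sets (for example two partitions) can be combined with a bitwise OR to get the sketch of their union.

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin paper


class FMSketch:
    """Toy single-bitmap Flajolet-Martin sketch (hypothetical, not Hive's class)."""

    def __init__(self, num_bits=32):
        self.num_bits = num_bits
        # The "bit vector": bit i is set once a hash with exactly
        # i trailing zero bits has been seen.
        self.bitmap = 0

    def _hash(self, value):
        # Assumption: any well-mixed hash works; SHA-1 is used for convenience.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") & ((1 << self.num_bits) - 1)

    def add(self, value):
        h = self._hash(value)
        if h == 0:
            rank = self.num_bits - 1
        else:
            # Position of the lowest set bit = number of trailing zeros.
            rank = (h & -h).bit_length() - 1
        self.bitmap |= 1 << rank

    def merge(self, other):
        # The union of two data sets is just the OR of their bit vectors.
        self.bitmap |= other.bitmap

    def estimate(self):
        # r = index of the lowest bit that is still 0.
        r = 0
        while r < self.num_bits and (self.bitmap >> r) & 1:
            r += 1
        return (2 ** r) / PHI
```

A single bitmap gives a very rough estimate (typically only within a factor of two or so). Practical implementations average many bitmaps or use HyperLogLog registers instead, but the merge-by-OR idea is the same, which is why keeping the vector in the metastore allows NDV to be combined across partitions instead of recomputed.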