How to efficiently hash all the columns of a DataFrame


I work mostly with very large datasets. For our data-compare tool, I need to hash each column as a whole (not every individual value, but the column itself). I was wondering if there is a faster way to do the hashing, particularly for wide tables: it takes 10 minutes to hash 19M rows by 206 columns. This is what I have:

import hashlib
import pandas as pd

for c in compare_cols:
    try:
        # hash_pandas_object expects a Series, not .values() (which is a
        # property, not a method); .to_numpy() exposes the uint64 buffer
        h = hashlib.sha256(pd.util.hash_pandas_object(df[c]).to_numpy()).hexdigest()
    except Exception as e:
        <exception stuff>

Without pd.util.hash_pandas_object inside the sha256 call, I get garbage values for the hash which I can't use later.
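The per-column loop above can be wrapped into a small helper. This is a minimal sketch (the function name `hash_columns` is mine, not from the question): `hash_pandas_object` first maps every value to a `uint64`, so the resulting array always exposes a byte buffer that `sha256` can digest, regardless of the column's original dtype.

```python
import hashlib

import pandas as pd


def hash_columns(df: pd.DataFrame, cols) -> dict:
    """Return a stable sha256 hex digest for each named column.

    hash_pandas_object converts arbitrary dtypes to uint64 per row;
    sha256 then digests the raw bytes of that uint64 array.
    """
    digests = {}
    for c in cols:
        # index=False so the digest depends only on the column's values
        per_row = pd.util.hash_pandas_object(df[c], index=False)
        digests[c] = hashlib.sha256(per_row.to_numpy()).hexdigest()
    return digests


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(hash_columns(df, ["a", "b"]))
```

Re-running this on the same data yields the same digests, which is what a compare tool needs.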

There is 1 best solution below

Andrej Kesely

IIUC, you can call hashlib.sha256() on the Series' underlying NumPy array (.values) directly:

Example:

import hashlib

import pandas as pd

df = pd.DataFrame(list(map(str, range(20_000_000))), columns=['col1'])

print(hashlib.sha256(df['col1'].values).hexdigest())

Prints:

d0b9915c26ca6713c963e0de22554354298a5870e49e6b07488e2e39ea35e252

According to my benchmark, this takes ~70 ms on my machine.
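The direct call is fast because it skips per-value hashing entirely: sha256 digests the column's underlying buffer in one pass. A minimal sketch with a numeric column (the dtype and column name are my choices, not from the answer), where the buffer path is unambiguous; for object-dtype string columns, whether the raw array is accepted as a buffer can depend on your NumPy/pandas versions, so normalizing with pd.util.hash_pandas_object first is the safer route.

```python
import hashlib

import numpy as np
import pandas as pd

# An int64 column backs a contiguous buffer, so sha256 can digest
# its raw bytes wholesale -- no per-value hashing needed.
df = pd.DataFrame({"col1": np.arange(1_000_000, dtype=np.int64)})
digest = hashlib.sha256(df["col1"].to_numpy()).hexdigest()
print(digest)
```

The digest is just a hash of the array's bytes, so it is stable across runs as long as the values and dtype are unchanged.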