I work mostly with very large datasets. For our datacompare tool, I need to hash all the columns (not every value in the column, but the column as a whole). I was wondering if there is a faster way to do the hashing, particularly for wide tables. For example, it takes 10 minutes to hash 19M rows by 206 columns. This is what I have:
import hashlib
import pandas as pd

for c in compare_cols:
    try:
        # sha256 over the per-row uint64 hashes produced by hash_pandas_object;
        # .values exposes them as a NumPy buffer that sha256 can digest
        h = hashlib.sha256(pd.util.hash_pandas_object(df[c]).values).hexdigest()
    except Exception as e:
        <exception stuff>
Without the pd.util call inside the sha256, I get goofy values for the hash that I can't use later.
IIUC, you can call hashlib.sha256() on a pd.Series' values (the underlying NumPy array) directly. Example:
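(A minimal sketch of the idea; the DataFrame df and column a below are made up for illustration, and a numeric dtype is assumed. Hashing the raw buffer this way skips the per-row hashing step that pd.util.hash_pandas_object performs.)

import hashlib
import numpy as np
import pandas as pd

# made-up data, sized like the question (19M rows), just for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.integers(0, 1_000, size=19_000_000)})

# hash the column's underlying NumPy buffer directly
h = hashlib.sha256(df["a"].values).hexdigest()
print(h)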
Prints:
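A 64-character hexadecimal SHA-256 digest of the column's bytes (the exact value depends on the data).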
According to my benchmark, this takes ~70ms on my machine.
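Note that this only works when the column's underlying NumPy array supports the buffer protocol (i.e. plain numeric dtypes); object/string columns would still need pd.util.hash_pandas_object or some other serialization first.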