I have a class MyClass where each instance stores pixels' x- and y-coordinates as two 1D NumPy arrays of the same length. Two instances are considered equal if their coordinate arrays are identical, with NaNs in matching positions treated as equal.
I tried two methods of hashing: one by casting both arrays to tuples and hashing those, and the other by calling the tobytes() method for each array:
    class MyClass:
        # ... init, doA(), doB(), etc. ...
        def __eq__(self, other):
            if not type(self) == type(other):
                return False
            if not np.array_equal(self._x, other._x, equal_nan=True):
                return False
            if not np.array_equal(self._y, other._y, equal_nan=True):
                return False
            return True

        def hash1(self):
            return hash((tuple(self._x), tuple(self._y)))

        def hash2(self):
            return hash((self._x.tobytes(), self._y.tobytes()))
Calling hash1 repeatedly on the same instance yields a different hash each time, while calling hash2 returns the same value every time. Why do these behave so differently?
A NumPy array doesn't store its elements as Python objects (unless you're using dtype=object). It stores raw hardware numeric values. That means when you call tuple, the array has to create Python objects for all the elements. For example, if your array has dtype float64, the array has to generate instances of numpy.float64.
The array doesn't save these wrapper objects. Every time you call tuple, the array generates new wrapper objects. Two instances of numpy.float64 with NaN values aren't guaranteed to hash the same (on Python 3.10 and later, the hash of a NaN float depends on the object's identity rather than its value), so if your array contains NaNs, hashing the tuples isn't guaranteed to produce consistent results. tobytes(), by contrast, copies the raw buffer into a bytes object, whose hash depends only on its contents, so hash2 is stable within a process.
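To make this concrete, here is a minimal sketch. PixelSet is a hypothetical stand-in for MyClass, and its constructor and the float64 dtype are assumptions, since the original __init__ isn't shown. It shows that tuple() hands back distinct wrapper objects on each call, and that a bytes-based __hash__ stays consistent with the __eq__ from the question for arrays built the same way.

```python
import numpy as np


class PixelSet:
    """Hypothetical stand-in for MyClass (constructor is an assumption)."""

    def __init__(self, x, y):
        self._x = np.asarray(x, dtype=np.float64)
        self._y = np.asarray(y, dtype=np.float64)

    def __eq__(self, other):
        if type(self) is not type(other):
            return False
        return (np.array_equal(self._x, other._x, equal_nan=True)
                and np.array_equal(self._y, other._y, equal_nan=True))

    def __hash__(self):
        # Hash the raw buffers; no per-element Python objects are created.
        return hash((self._x.tobytes(), self._y.tobytes()))


p = PixelSet([0.0, np.nan, 2.0], [1.0, 1.0, np.nan])
q = PixelSet([0.0, np.nan, 2.0], [1.0, 1.0, np.nan])

# Each tuple() call builds fresh numpy.float64 wrappers for the same data.
first, second = tuple(p._x), tuple(p._x)
print(first[1] is second[1])       # the two NaN wrappers are distinct objects

# The bytes-based hash is repeatable and agrees for equal instances.
print(hash(p) == hash(p))
print(p == q, hash(p) == hash(q))
```

One caveat with this approach: tobytes() compares bit patterns, while np.array_equal compares values. Arrays that differ only in 0.0 versus -0.0, or in NaNs with different payload bits, compare equal under __eq__ but may hash differently, so you may want to normalize those values before hashing if they can occur. The arrays should also not be modified after an instance has been placed in a set or used as a dict key.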