Python Hashing of "tupled" numpy Array

I have a class MyClass where each instance stores pixels' x- and y-coordinates, represented as two 1D numpy arrays of the same length. Two instances are considered equal if their coordinate arrays are identical (with NaNs in matching positions counted as equal).
I tried two methods of hashing: one casts both arrays to tuples and hashes those, and the other calls the tobytes() method on each array:

import numpy as np

class MyClass:
  # ... init, doA(), doB(), etc. ...
  def __eq__(self, other):
    if type(self) is not type(other):
      return False
    if not np.array_equal(self._x, other._x, equal_nan=True):
      return False
    if not np.array_equal(self._y, other._y, equal_nan=True):
      return False
    return True

  def hash1(self):
    return hash((tuple(self._x), tuple(self._y)))

  def hash2(self):
    return hash((self._x.tobytes(), self._y.tobytes()))

Calling hash1 repeatedly on the same instance yields a different hash each time, whereas hash2 returns the same value every time. Why do these behave so differently?

user2357112 On BEST ANSWER

A NumPy array doesn't store its elements as Python objects (unless you're using dtype=object). It stores raw hardware numeric values. That means when you call tuple, the array has to create Python objects for all the elements. For example, if your array has dtype float64, the array has to generate instances of numpy.float64.
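To see the wrapper creation described above, here is a minimal sketch. The array contents and the variable names (first, second, element_type, and so on) are made up for illustration; only tuple() and the float64 dtype come from the discussion.

```python
import numpy as np

# Illustrative data; any float64 array behaves the same way.
x = np.array([1.5, 2.5], dtype=np.float64)

first = tuple(x)
second = tuple(x)

# The tuple elements are numpy.float64 scalars created during the conversion.
element_type = type(first[0])

# A second conversion yields equal values wrapped in new objects.
same_value = first[0] == second[0]
same_object = first[0] is second[0]

print(element_type, same_value, same_object)
```

The values compare equal, but the two conversions do not share wrapper objects, which matters as soon as a hash depends on object identity.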

The array doesn't save these wrapper objects. Every time you call tuple, the array generates new wrapper objects. Two instances of numpy.float64 with NaN values aren't guaranteed to hash the same, so if your array contains NaNs, hashing the tuples isn't guaranteed to produce consistent results.
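As far as I know, since Python 3.10 the hash of a NaN float is derived from the object's identity rather than being a fixed value, which would explain why fresh wrapper objects give fresh hashes. Hashing the raw bytes sidesteps the wrappers entirely. The sketch below assumes a hypothetical helper named bytes_hash and made-up sample arrays; it is one way to get a stable hash, not necessarily the best fit for every class.

```python
import numpy as np

# Illustrative coordinate arrays that contain NaNs.
x = np.array([1.0, np.nan, 3.0])
y = np.array([4.0, 5.0, np.nan])

def bytes_hash(a, b):
    # Hypothetical helper: hash the raw buffers, so no per-element
    # numpy.float64 wrapper objects are involved.
    return hash((a.tobytes(), b.tobytes()))

h1 = bytes_hash(x, y)
h2 = bytes_hash(x, y)
stable = h1 == h2

# Copies with the same contents produce the same bytes, hence the same hash
# within one interpreter process.
stable_across_copies = bytes_hash(x.copy(), y.copy()) == h1

print(stable, stable_across_copies)
```

One caveat if this is paired with the __eq__ from the question: byte-level hashing distinguishes values that compare equal numerically, such as 0.0 and -0.0, so arrays that np.array_equal treats as equal could still hash differently in such cases.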