Linear probing hash set - efficient way to know if value is already inserted, just not in its hashed index

71 Views Asked by At

I am implementing a hash set in C and am using linear probing to deal with collisions. My question is, what happens when you have something like this:

  1. a set with capacity of 32

  2. value a has hash 14, inserted at idx 14

  3. value b has hash 14, colliding with a, so it is inserted at idx 15

  4. value a is deleted, leaving idx 14 empty

  5. value c, which is equal to value b, has hash 14 and can be inserted at idx 14 since it is empty from the deletion of a

As you can see, value c should not be inserted into the set because it is equal to value b, but the slot in the table is empty, so my program thinks it can be inserted because of the lack of searching for b after the hash index.

My first thought is to linearly search the entire table to ensure that there are no duplicate values before inserting, but that removes the O(1) insertion, deletion, and lookup time for a set.

My second thought is to have a structure like this at each of the slots in the table:

typedef struct {
    bool has_value;
    bool has_collision
    void *value;
} set_entry;

But, not only does it require more memory, the state of has_collision would not really be able to tell the truth, since values can be deleted from the set.

Is there a more efficient way to do this?

1

There are 1 best solutions below

0
chqrlie On

To allow elements to be deleted from the hash set, you must be able to distinguish between empty slots that have never been used and freed slots that have been made available when their value has been deleted.

To probe if a key is present in the hash table, you must iterate over used and freed slots, only stopping on empty slots.

When you insert a key into the hash set, first test if it is present, and if not, use the first available (empty or freed) slot from its hash value.

When deleting a key from the hash set, set its entry to empty if the next entry is empty and to freed otherwise. If you can set the entry as empty, you can also set all previous consecutive freed entries as empty.

You do not need extra boolean fields to keep track of these states, if the value member is a void *, you can use 2 unused pointers to represent empty and freed slots, for example the address of the hash table and that of the second element, or NULL and another invalid pointer if the value pointer is required to be a valid pointer.