How are unordered_set and unordered_map implemented in the C++ STL?


I want to know how unordered_set and unordered_map are implemented in the C++ STL.

To be more precise:

  1. Are the individual elements part of nodes that are linked together as a list, or are they stored in an array of consecutive elements?

  2. Are all buckets allocated side-by-side in memory? If not, what is the logic for iterating over multiple buckets?

  3. How are collisions handled in unordered_set and unordered_map? Or do they not handle them at all, instead letting the user define the load factor and, based on that, creating a completely new set of buckets?

There are 2 best solutions below

Jerry Coffin (BEST ANSWER)

Based on its interface, the expectation for unordered_(set|map) was something at least reasonably close to a hash table with direct chaining.

That is, you'd start with an array of pointers. Each of those pointers is to a bucket, which is typically going to be a linked list of nodes.
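For concreteness, a minimal sketch of that layout might look like the following (ChainedTable and Node are purely illustrative names, not any library's actual internals):

    #include <cstddef>
    #include <vector>

    // Hypothetical sketch of "array of pointers to chains of nodes".
    template <typename T>
    struct ChainedTable {
        struct Node {
            T     value;
            Node* next;               // next node in this bucket's chain
        };
        std::vector<Node*> buckets;   // buckets[i] heads bucket i's chain (nullptr if empty)
        std::size_t        count = 0; // number of stored elements
    };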

We can deduce most of this from the complexity requirements. In particular, the fact that it degenerates from O(1) to O(N) as the table gets overfilled indicates that under normal circumstances (load factor < 1) the complexity is based on hashing the key to find the right bucket. To support that, indexing into the array of buckets has to be an O(1) operation.

As the table gets overfilled, complexity degenerates to O(N). This reflects the complexity of traversing the linked list of nodes that hashed to the same bucket.
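You can watch that degeneration happen with the real containers by plugging in a deliberately terrible hash function that sends every key to the same bucket (BadHash is just an illustrative name):

    #include <cstddef>
    #include <iostream>
    #include <unordered_set>

    struct BadHash {
        std::size_t operator()(int) const { return 0; }   // every key collides
    };

    int main() {
        std::unordered_set<int, BadHash> s;
        for (int i = 0; i < 1000; ++i)
            s.insert(i);
        // Every element ends up chained into the one bucket that 42 hashes to,
        // so each lookup degenerates to a linear scan of that chain.
        std::cout << "elements sharing 42's bucket: " << s.bucket_size(s.bucket(42)) << '\n';
    }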

Pointer invalidation rules are revealing as well. For example, once we've inserted an item into the table, a pointer to that item remains valid (and continues to point to that node) even if we do things like inserting or deleting other nodes. As such, we can rearrange the table as a whole by having a different arrangement of pointers to the nodes, but the node itself has to "stay put". This means we can't (for example) implement a bucket as something like a std::vector<T>--if we did so, when that vector got reallocated, existing pointers would be invalidated.
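That stability is easy to check against the standard containers themselves; a small sanity test (purely illustrative) would be:

    #include <cassert>
    #include <unordered_set>

    int main() {
        std::unordered_set<int> s{1, 2, 3};
        const int* p = &*s.find(2);        // pointer to the stored element

        // Force plenty of growth and rehashing: iterators may be invalidated,
        // but pointers and references to elements must remain valid.
        for (int i = 100; i < 100000; ++i)
            s.insert(i);
        s.rehash(1 << 18);

        assert(*p == 2);                   // still refers to the same, unmoved node
    }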

You might be able to deviate from this a little bit if you tried really hard, but doing so and continuing to meet all the requirements becomes quite difficult (as I originally wrote this, I speculated on a couple of minor variations maybe being possible, but after comments and more thought, I doubt either really meets the requirements).

Tony Delroy

Are the individual elements part of nodes that are linked together as a list, or are they stored in an array of consecutive elements?

(see also https://stackoverflow.com/a/31113618/410767 for an explanation of how the C++ Standard effectively mandates separate chaining)

The individual elements (i.e. key/value pairs for unordered_map, values for unordered_set) are indeed packed into nodes. In GCC's implementation, for example, each node also gets a next pointer/iterator and a cached copy of the key's hash; caching the hash is a recalculation-speed vs. memory-usage tradeoff, and not all implementations need to do it. GCC strings every element in the container onto a single singly-linked list, so there is no prev pointer/iterator, though other implementations might have one. Without prev, each bucket needs to point at the node before the first node belonging to it, so it can rewire the list when an element is erased from its bucket.
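A rough sketch of such a node might look like this (the names are mine and it is far simpler than libstdc++'s real node types, but it shows the shape being described):

    #include <cstddef>
    #include <utility>

    // Hypothetical node for an unordered_map-like container: one container-wide
    // singly linked list, with an optional cached hash per node. Each bucket
    // would store a pointer to the node *before* its first element, so the
    // list can be relinked on erase despite the lack of a prev pointer.
    template <typename Key, typename Value>
    struct MapNode {
        MapNode*                    next;        // next node in the container-wide list
        std::size_t                 cached_hash; // recalculation-speed vs memory tradeoff
        std::pair<const Key, Value> kv;          // the element (unordered_map's value_type)
    };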

Are all buckets allocated side-by-side in memory? If not, what is the logic for iterating over multiple buckets?

By definition, hash-table-based containers always have a contiguous, random-access array of buckets. But iteration need not happen "over multiple buckets": in GCC's case, as mentioned above, there's a singly linked list off to the side that can be iterated directly, independently of any buckets. A major advantage of this is that if your container has a large capacity but a small size, you can still iterate in O(size()) rather than O(bucket_count()) - I'm not sure the latter would even be Standard compliant (?).
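The difference is visible through the standard bucket interface (the exact bucket counts are implementation-defined, but both iteration styles below are standard):

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> m{{"a", 1}, {"b", 2}, {"c", 3}};
        m.reserve(1000);   // large capacity, tiny size

        // Container-wide iteration: with GCC's single element list this costs
        // O(size()), no matter how many buckets exist.
        for (const auto& [key, value] : m)
            std::cout << key << '=' << value << ' ';
        std::cout << '\n';

        // Per-bucket iteration visits every bucket, so it is O(bucket_count())
        // even though almost all of the buckets are empty.
        for (std::size_t b = 0; b < m.bucket_count(); ++b)
            for (auto it = m.begin(b); it != m.end(b); ++it)
                std::cout << it->first << " is in bucket " << b << '\n';
    }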

How are collisions handled in unordered_set and unordered_map?

Using separate chaining (i.e. by inserting extra nodes into the linked list).
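For example, with a deliberately collision-prone hash (illustrative only), two colliding keys simply share one bucket's chain and both remain findable:

    #include <cstddef>
    #include <iostream>
    #include <unordered_set>

    struct ModTenHash {
        std::size_t operator()(int x) const { return static_cast<std::size_t>(x % 10); }
    };

    int main() {
        std::unordered_set<int, ModTenHash> s;
        s.insert(7);
        s.insert(17);   // same hash as 7, so an extra node is chained into 7's bucket

        std::cout << std::boolalpha
                  << (s.bucket(7) == s.bucket(17)) << '\n'            // true: same bucket
                  << (s.count(7) == 1 && s.count(17) == 1) << '\n';   // both still found
    }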

Or do they not handle them at all, instead letting the user define the load factor and, based on that, creating a completely new set of buckets?

The load factor is only tangentially related to this. In the general case, where the number of distinct key values is much larger than the number of elements being stored and you have no particular insight into which keys you'll receive, even with a cryptographically strong hash function you need on the order of K^2 buckets to have even a 50/50 chance of placing K keys into the table without any of them colliding. It's just not practical to grow the hash table to avoid collisions. In C++, the load factor is capped by a user-settable max_load_factor (default 1.0), so the implementation grows the table when the load factor would otherwise be exceeded. But the load factor has only a probabilistic relationship with collisions.
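You can watch the standard containers do exactly this growth (the specific bucket counts printed are implementation-defined, but the max_load_factor/load_factor behaviour is standard):

    #include <cstddef>
    #include <iostream>
    #include <unordered_set>

    int main() {
        std::unordered_set<int> s;
        std::cout << "max_load_factor: " << s.max_load_factor() << '\n';   // 1.0 by default

        std::size_t buckets = s.bucket_count();
        for (int i = 0; i < 100; ++i) {
            s.insert(i);
            if (s.bucket_count() != buckets) {   // grew to keep load_factor() <= max_load_factor()
                std::cout << "rehash at size " << s.size() << ": "
                          << buckets << " -> " << s.bucket_count() << " buckets\n";
                buckets = s.bucket_count();
            }
        }
        std::cout << "final load_factor: " << s.load_factor() << '\n';
    }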

In case you're curious: if you place K keys into B buckets, the probability of not having a collision is the product of the probabilities of being able to place each successive key into an at-that-time empty bucket, which depends on the proportion of buckets still empty: B/B * (B-1)/B * (B-2)/B * (B-3)/B * ... * (B-K+1)/B. As a few examples, the chance of not having collisions is >= 50% for 2 keys in 2 buckets, 3 keys in 6 buckets, 4 in 10, 5 in 17, 6 in 24, 7 in 33, 8 in 43, 9 in 55, 10 in 69, 11 in 83, 12 in 100, 13 in 117, 14 in 136, 15 in 157, 16 in 179, 17 in 202, 18 in 227, 19 in 253, 20 in 281, 21 in 310, 22 in 341, 23 in 373, 24 in 407, 25 in 442, 26 in 478, 27 in 516, 28 in 555, 29 in 596, 30 in 638, 31 in 682, 32 in 727, 33 in 773, 34 in 821, 35 in 870, 36 in 921, 37 in 974, 38 in 1027, 39 in 1082, 40 in 1139, 41 in 1197, 42 in 1257, 43 in 1317, 44 in 1380, 45 in 1444, 46 in 1509, 47 in 1576, 48 in 1644, 49 in 1713, 50 in 1784, 51 in 1857, 52 in 1931, 53 in 2006, 54 in 2083, 55 in 2161, 56 in 2241, 57 in 2322, 58 in 2405, 59 in 2489, 60 in 2574, 61 in 2661, 62 in 2749, 63 in 2839, 64 in 2930, 65 in 3023, 66 in 3117, 67 in 3213, 68 in 3310, 69 in 3408, 70 in 3508, 71 in 3609, 72 in 3712, 73 in 3816, 74 in 3922, 75 in 4029, 76 in 4137, 77 in 4247, 78 in 4359, 79 in 4472, 80 in 4586, 81 in 4702, 82 in 4819, 83 in 4938, 84 in 5058, 85 in 5179, 86 in 5302, 87 in 5427, 88 in 5552, 89 in 5680, 90 in 5808, 91 in 5939, 92 in 6070, 93 in 6203, 94 in 6338, 95 in 6474, 96 in 6611, 97 in 6750, 98 in 6890, 99 in 7032, 100 in 7175....
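If you'd like to reproduce those figures, a brute-force evaluation of the formula above (under the same idealised uniform-hashing assumption) is straightforward:

    #include <cstdio>

    // Smallest bucket count B giving at least a 50% chance that K keys all
    // land in distinct buckets, for K = 2..100.
    int main() {
        long long b = 2;                       // the answer only grows with K
        for (int k = 2; k <= 100; ++k) {
            for (;; ++b) {
                double p = 1.0;                // B/B * (B-1)/B * ... * (B-K+1)/B
                for (int i = 0; i < k; ++i)
                    p *= static_cast<double>(b - i) / static_cast<double>(b);
                if (p >= 0.5)
                    break;
            }
            std::printf("%d keys: %lld buckets\n", k, b);
        }
    }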

Put another way, for 100 keys we'd need to have 70.75 empty buckets for every 1 in-use bucket just to have a >=50% chance of not having collisions, which would waste a lot of memory and ensure the in-use buckets were spread out across CPU cache lines, making bucket access (and general CPU operation) much slower. Trying to avoid collisions, or rehashing every time they occur, is simply not a net-desirable thing for the general hash-table usage case. That said, perfect hashing (where the keys hash to distinct buckets) is occasionally both practical and desirable for small sets of keys known at compile time, or when there are a very small number of possible keys.