Sharding, or say partitioning, is a technique widely used in distributed systems which logically splits data into partitions. Each node is assigned a set of partitions and hence the read/write throughput could be increased with parallelization.
Consistent hashing is a technique widely used in load balancing and routing service. To find the node responsible for serving a given key, the traditional way is to compute the hash of the key and then modulo the hash value with the number of nodes N. However, if the number of nodes changes, the keys have to rehashed and to be transferred between nodes. To minimize the amount of data being transferred after rehashing, the consistent hashing is used to replace the modulo N operations so that only a fraction of data needs to be transferred.
In order to do sharding, we need two mappings:
key_to_shard: the key range is segmented into a sequence of sub-ranges and each sub-range is assigned to a shard. The assignment of key ranges to shards definitely does not involve a modulo N operation.
shard_to_node: for a given shard, it's assigned to a node.
For static sharding, i.e. the number of shards never changes, key_to_shard is trivial. For dynamic sharding, there're shard splitting which splits a shard into two shards with adjacent key ranges, and shard coalescing which merges two shards with adjacent key ranges into a single shard. Neither the static sharding nor the dynamic sharding would incur a modulo N operation.
In many literatures or textbooks, there says that sharding is generally combined with consistent hashing.
So, if there's no modulo N operation, where shall the consistent hashing be used?
Is my understanding wrong?
In practice, usually there are many more shards than nodes. Technically N of shard will never change, as it would mean rehashing for every key.
For example, let's have 1000 shards and 10 nodes. With this approach we still can balance hot ranges, by assigning them to different nodes. In addition, we can add and remove nodes by reassigning shards.