Why does nodetool repair -pr need to be run on each node of each DC, when this is not needed if repair is run without -pr? What makes it different? As I understand it, the only difference is the number of token ranges: with -pr only the "primary" ranges are repaired, and without -pr the ranges that belong to other nodes but are replicated on this node are repaired as well. How does this affect repair propagation to other DCs? All DCs share the same token space (token ring), so if we run repair on all nodes of one DC, the entire token space (token ring) should be covered.
What I'm expecting is that nodetool repair -pr is enough to run on a single datacenter of a cluster. The Apache documentation has no requirement to run nodetool repair -pr on each node of each datacenter:
https://cassandra.apache.org/doc/3.11/cassandra/operating/repair.html "The -pr flag will only repair the "primary" ranges on a node, so you can repair your entire cluster by running nodetool repair -pr on each node in a single datacenter."
According to the following articles, when nodetool repair is run without -pr, it needs to be run on only one datacenter in the cluster, but when it is run with -pr, it must be run on each node of each datacenter.
https://www.datastax.com/blog/repair-cassandra "This is very important, so I’m going to say it again, if you are using “nodetool repair -pr” you must run it on EVERY node in EVERY data center, no skipping allowed...."
"If you have multiple data centers, by default when running repair all nodes in all data centers will sync with each other on the range being repaired. So for an RF of {DC1:3, DC2:3} for a given token range there will be 6 nodes all comparing data with each other and streaming any differences back and forth. If you have 4 data centers {DC1:3, DC2:3, DC3:3, DC4:3} you will have 12 nodes all comparing with each other and streaming data to each other at the same time for each token range [2]. This makes using “-pr” even more important, as if you don’t use it you repair a given token range 3+3+3+3+=12 times for the 4 DC case if you ran without using “-pr” on every node in the cluster."
and
https://www.datastax.com/blog/repair-cassandra "Note: If you use this option, you must run nodetool repair -pr on every node in the cluster to repair all data. Otherwise, some ranges of data will not be repaired..."
"Consider carefully before using nodetool repair across datacenters, instead of within a local datacenter. When you run repair locally on a node using -local or --in-local-dc, the command runs only on nodes within the same datacenter as the node that runs it. Otherwise, the command runs cluster-wide repair processes on all nodes that contain replicas, even those in different datacenters. For example, if you start nodetool repair over two datacenters, DC1 and DC2, each with a replication factor of 3, repair must build Merkle tables for 6 nodes..."
There is a further documentation inconsistency in the following: "The nodetool repair tool does not support the use of -local with the -pr option unless the datacenter's nodes have all the data for all ranges." This implies that repair with -pr also runs cluster-wide, just as it does without -pr.
The current behavior, when -pr is specified, is to treat a multi-DC setup as a single ring, because TokenMetadata.getPredecessor(Token) doesn't take the DC of a token into account and just searches for a predecessor across all tokens from all DCs.
So let's say we have this token range from 0 to 100 for simplicity.
You would expect "nodetool repair -pr" on DC1 node 1 to be the same as nodetool repair -st 0 -et 33, but it is actually -st 0 -et 25.
And repair -pr on node 2 would be the same as -st 33 -et 50.
Node 3 would be -st 66 -et 90.
So we skipped 25-33, 50-66, and 90-0.
So -pr isn't really "primary range", it is "partial range".
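The skipped ranges above can be reproduced with a small Python sketch. This is only an illustration, not Cassandra's code: the token assignment (DC1 nodes at 25, 50, 90 and DC2 nodes at 0, 33, 66) is an assumption chosen so that the results match the numbers in this answer, and the helper names primary_range and primary_ranges_for_dc are hypothetical. It assumes a node's primary range runs from the predecessor token (exclusive) to its own token (inclusive), with the predecessor looked up across the tokens of all DCs.

```python
# Hypothetical token assignment on a simplified 0..100 ring.
# These tokens are assumed for illustration; they are not from a real cluster.
RING = {
    25: ("DC1", "node1"),
    50: ("DC1", "node2"),
    90: ("DC1", "node3"),
    0: ("DC2", "node1"),
    33: ("DC2", "node2"),
    66: ("DC2", "node3"),
}


def primary_range(token, all_tokens):
    """Return (start, end) for the node owning `token`.

    The predecessor is searched across every token in the ring,
    regardless of which DC the neighbouring token belongs to.
    """
    tokens = sorted(all_tokens)
    idx = tokens.index(token)
    # Index -1 wraps around to the last token on the ring.
    return tokens[idx - 1], token


def primary_ranges_for_dc(dc):
    """Collect the primary ranges of all nodes in one DC."""
    return sorted(
        primary_range(token, RING)
        for token, (node_dc, _node) in RING.items()
        if node_dc == dc
    )


dc1_ranges = primary_ranges_for_dc("DC1")
dc2_ranges = primary_ranges_for_dc("DC2")

print("covered by -pr on DC1 nodes:", dc1_ranges)
print("left for -pr on DC2 nodes: ", dc2_ranges)
```

Under these assumptions, running -pr only on the DC1 nodes covers (0, 25], (33, 50] and (66, 90], while (25, 33], (50, 66] and (90, 0] are primary ranges of DC2 nodes and are only repaired when -pr is also run there.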