If 1 OSD crashes, does rook-ceph eventually try to replicate the missing data onto the still-working OSDs, or does it wait for all OSDs to be healthy again? Let's assume yes, so that I can explain how I calculated:
I started with 1.71 TB provisioned for Kubernetes PVCs and 3 nodes of 745 GB each (2.23 TB total). Rook has a replication factor of 2 (RF=2).
For the replication to work, I need 2 times 1.71 TB (3.42 TB), so I added 2 more nodes of 745 GB each (3.72 TB total). Let's say I use all of the 1.71 TB provisioned.
If I lose an OSD, my K8s cluster still runs because the data is replicated, but once the missing data gets re-replicated onto the still-working OSDs, other OSDs may crash, because (assuming data is always distributed equally across OSDs, which I know is not true in the long run):
- I have ~305 GB of unused space in the cluster (3,725 GB total − 3,420 GB of replicated PVC data)
- Which is ~61 GB per OSD (305 / 5)
- The crashed OSD held ~684 GB of data (745 GB disk − 61 GB unused)
- Ceph tries to re-replicate ~171 GB of missing data onto each remaining OSD (684 / 4)
- Which is way too much, because each surviving OSD only has ~61 GB free, so this should lead to cascading OSD failures (rough sketch of this math below)
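Here is the same back-of-the-envelope math as a small Python sketch (the `rebalance_after_loss` helper and the even-distribution assumption are mine, obviously):

```python
# Back-of-the-envelope model of losing one OSD, assuming data is spread
# perfectly evenly across OSDs (which I know it isn't in the long run).

def rebalance_after_loss(num_osds, osd_size_gb, pvc_gb, replicas=2):
    raw_used = pvc_gb * replicas                     # 1,710 GB * 2 = 3,420 GB
    total_raw = num_osds * osd_size_gb               # 5 * 745 GB = 3,725 GB
    free_per_osd = (total_raw - raw_used) / num_osds
    lost_data = osd_size_gb - free_per_osd           # data that sat on the dead OSD
    extra_per_survivor = lost_data / (num_osds - 1)  # what each remaining OSD must absorb
    return free_per_osd, extra_per_survivor

free, extra = rebalance_after_loss(num_osds=5, osd_size_gb=745, pvc_gb=1710)
print(f"free per OSD: {free:.0f} GB, extra per survivor: {extra:.0f} GB")
# -> free per OSD: 61 GB, extra per survivor: 171 GB  (171 >> 61, so OSDs fill up and cascade)
```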
If I had 6 nodes instead of 5, though, I could lose 1 OSD indefinitely:
- The new pool is ~4.5 TB (6 × 745 GB)
- I have 1+ TB of free space in the cluster (4.5 TB total − 3.42 TB of replicated PVC data)
- Which is 166+ GB per OSD (~1 TB / 6)
- The crashed OSD held at most ~579 GB of data (745 − 166)
- Ceph tries to re-replicate ~116 GB of missing data onto each of the 5 remaining OSDs (579 / 5)
- Which is less than the free space on each OSD (166+ GB), so replication works again with only 5 nodes left, but if another OSD crashes I'm doomed (same sketch below, re-run with 6 OSDs)
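Re-running the same sketch with 6 OSDs (using exact figures rather than the rounded ones above):

```python
# Same helper as above, now with 6 OSDs of 745 GB and the same 1,710 GB of PVCs at RF=2.
free, extra = rebalance_after_loss(num_osds=6, osd_size_gb=745, pvc_gb=1710)
print(f"free per OSD: {free:.0f} GB, extra per survivor: {extra:.0f} GB")
# -> free per OSD: 175 GB, extra per survivor: 114 GB  (114 < 175, so one loss is survivable)
```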
Is the initial assumption correct? If so, does the math sound right to you?
First: if you value your data, don't use replication with size 2! You will eventually have issues leading to data loss.
Regarding your calculation: Ceph doesn't distribute every MB of data evenly across all nodes; there will be differences between your OSDs. Because of that, the OSD with the most data will be your bottleneck regarding free space and the capacity to rebalance after a failure. Ceph also doesn't handle full or near-full clusters very well, and your calculation is very close to a full cluster, which will lead to new issues. Try to avoid going above 85-90% used capacity; plan ahead and use more disks to both avoid a full cluster and gain higher failure resilience. The more OSDs you have, the less impact a single disk failure has on the rest of the cluster.
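To put a rough number on "plan ahead", here is a crude back-of-the-envelope helper (my own sketch, not anything built into Ceph or Rook) that estimates how much PVC data you can carry if you want to lose one OSD and still keep the survivors under the usual 85% near-full threshold, assuming perfectly even distribution:

```python
# Crude capacity estimate (my own helper, not part of Ceph or Rook).
# Assumes perfectly even data distribution across OSDs.

def max_safe_pvc_capacity(num_osds, osd_size_gb, replicas, osds_lost=1, nearfull_ratio=0.85):
    """Largest amount of PVC data (before replication) that still fits after
    `osds_lost` OSDs fail, without pushing the survivors past the near-full ratio."""
    surviving_raw = (num_osds - osds_lost) * osd_size_gb
    return surviving_raw * nearfull_ratio / replicas

print(max_safe_pvc_capacity(5, 745, replicas=2))  # ~1266 GB -- well below the 1710 GB provisioned
print(max_safe_pvc_capacity(6, 745, replicas=2))  # ~1583 GB -- still below 1710 GB
```

By that crude measure even the 6-node layout is tight once you leave yourself some headroom, which is exactly why I'd add more or larger disks.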
And regarding recovery: Ceph usually tries to recover automatically, but it depends on your actual CRUSH map and the rules your pools are configured with. For example, say you have a CRUSH tree consisting of 3 racks and a pool configured with size 3 (so 3 replicas in total) spread across those 3 racks (failure-domain = rack), and then a whole rack fails. In that case Ceph won't be able to recover the third replica until the rack is online again. The data is still available to clients, but your cluster stays in a degraded state. This configuration has to be set up manually, so it probably doesn't apply to you; I just wanted to point out how it works. The default is usually a pool with size 3 and host as the failure domain.
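If it helps to picture the failure-domain part, here is a tiny toy model (nothing like CRUSH's real placement algorithm, just the constraint it enforces: each replica must land in a distinct, healthy failure domain):

```python
# Toy model of the failure-domain constraint (not real CRUSH placement).

def placeable_replicas(failure_domains, size):
    """How many replicas can be placed when each one must go to a
    different failure domain that is still up."""
    alive = [name for name, up in failure_domains.items() if up]
    return min(size, len(alive))

# Pool size 3, failure-domain = rack, 3 racks, one whole rack down:
racks = {"rack1": True, "rack2": True, "rack3": False}
print(placeable_replicas(racks, size=3))  # 2 -> degraded until rack3 comes back

# Same outage, but failure-domain = host (3 hosts per rack, rack3's hosts down):
hosts = {f"{rack}-host{i}": up for rack, up in racks.items() for i in (1, 2, 3)}
print(placeable_replicas(hosts, size=3))  # 3 -> Ceph can recover the third replica elsewhere
```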