How can I configure StatefulSets for zone-affinity pod scheduling with LRS disks on AKS?


I inherited an AKS cluster running in Switzerland North. This region doesn't provide ZRS-managed disks, only LRS. Switching to ReadWriteMany (Azure Files) is not an option.

I have one system node pool spanning all three availability zones, a custom storage class that allows for dynamic block storage provisioning, and a StatefulSet that defines a persistent volume claim template.

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: my-block-sc
parameters:
  cachingmode: ReadOnly
  diskEncryptionSetID: ...
  diskEncryptionType: EncryptionAtRestWithCustomerKey
  networkAccessPolicy: DenyAll
  skuName: StandardSSD_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
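
The StatefulSet consumes this through a volume claim template. A minimal sketch of that wiring (the names, image, and sizes below are simplified placeholders, not the real manifest):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app                             # placeholder name
spec:
  serviceName: my-app
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-registry/my-app:latest # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/my-app   # placeholder path
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]     # zonal LRS disks are block storage, so RWO
        storageClassName: my-block-sc      # the storage class above
        resources:
          requests:
            storage: 32Gi                  # placeholder size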

Now, from time to time, pods get stuck in the Pending state, because the default scheduler tries to place a pod on a node that is not in the same zone as its PV (the zonal LRS disk).
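
For reference, a dynamically provisioned PV backed by a zonal Azure disk carries a node affinity roughly like the excerpt below. Everything in it is illustrative (the exact topology key depends on the CSI driver version, and the values are made up); the real thing can be checked with kubectl get pv <name> -o yaml.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-0b1c2d3e                       # generated name, made up here
spec:
  capacity:
    storage: 32Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: my-block-sc
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/.../disks/pvc-0b1c2d3e  # truncated placeholder
  nodeAffinity:                            # this is what ties the volume (and thus the pod) to one zone
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.disk.csi.azure.com/zone      # may be topology.kubernetes.io/zone on newer drivers
              operator: In
              values:
                - switzerlandnorth-1                     # the zone the disk was created in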

Of course, I could configure a node affinity and pin all pods to a single zone (see the sketch below), but then I lose the benefit of HA and of pods being spread across zones.
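
That single-zone pinning would look something like this in the pod template spec of the StatefulSet (the zone value is just an example):

# Goes under spec.template.spec of the StatefulSet; pins every replica to one zone
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - switzerlandnorth-1       # example zone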

So, how can I configure a stateful set so that, after a crash or restart of a pod, the pod gets scheduled again in the same zone?

Is there some dynamic way of providing a node affinity to a pod template spec?


There is 1 answer below.

Answered by Xli

I am experiencing a similar issue and this post helped me. I hope it can help you. Link here

Essentially, you want to make sure the PVC is referenced correctly in the claimRef of your PV. Then, check that the PVC is referenced correctly in your StatefulSet (or whatever you use to deploy your pods). You can refer to the binding section of the Persistent Volumes documentation for more info on claimRef.
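
As a rough sketch (all names and the disk URI are placeholders), pre-binding means the PV's claimRef points at the exact PVC the StatefulSet will create. For a volumeClaimTemplates entry named data on a StatefulSet named my-app, the first replica's PVC is data-my-app-0:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-my-app-0-pv                   # placeholder name
spec:
  capacity:
    storage: 32Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: my-block-sc
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/.../disks/my-existing-disk  # placeholder disk URI
  claimRef:                                # binds this PV to one specific PVC
    namespace: default
    name: data-my-app-0                    # <claim template name>-<statefulset name>-<ordinal>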

This should ensure that the pod is deployed to the same availability zone on restarts. If you are still having issues after all that, it could be that the nodes in that zone simply don't have room left for your pod, so it stays Pending. If that is the case, you might want to consider implementing priority and a preemption policy that can evict lower-priority pods to make room. Another solution would be to vertically scale your nodes so they can accommodate more pods. See the reference for priority and preemption.
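
For the capacity case, a minimal sketch of a PriorityClass (name and value are just examples) that the StatefulSet pods could reference:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: stateful-high                      # example name
value: 100000                              # higher values win; lower-priority pods can be preempted
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "High priority for stateful workloads so they can make room on a full node."

Then set priorityClassName: stateful-high in the pod template spec of the StatefulSet.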

I hope this helps!