Scaling Kafka Pods in an AKS Cluster with Self-hosted Kafka Not Taking Effect

I'm facing an issue while attempting to scale Kafka pods in an AKS cluster with self-hosted Kafka. When I try to scale the Kafka pods to the desired number by editing the Kafka custom resource, the change does not take effect, whether scaling up or down. However, if I manually scale up the StatefulSet, the extra pods are created, but I encounter the following error in the subsequent pods (e.g., kafka-cluster-sc-kafka-1):

Error

STRIMZI_BROKER_ID=1
Preparing truststore for replication listener
Adding /opt/kafka/cluster-ca-certs/ca.crt to truststore /tmp/kafka/cluster.truststore.p12 with alias ca
Certificate was added to keystore
Preparing truststore for replication listener is complete
Looking for the right CA
No CA found. Thus exiting.

However, it's important to note that the pod kafka-cluster-sc-kafka-0 always works as expected.
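
For reference, the manual scale-up was done by changing the replica count on the StatefulSet directly, roughly like this (the StatefulSet name follows the Strimzi <cluster-name>-kafka convention; 3 is just an example target):

kubectl -n kafka scale statefulset kafka-cluster-sc-kafka --replicas=3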

Kafka custom resource:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  creationTimestamp: '2022-02-23T18:48:29Z'
  generation: 33
  labels:
    k8slens-edit-resource-version: v1beta2
  name: kafka-cluster-sc
  namespace: kafka
  resourceVersion: '439262033'
  uid: 8470c254-f4b9-406b-a4e2-a56162cd78db
  selfLink: /apis/kafka.strimzi.io/v1beta2/namespaces/kafka/kafkas/kafka-cluster-sc
status:
  conditions:
    - lastTransitionTime: '2023-09-27T19:00:15.174Z'
      message: >-
        Failure executing: POST at:
        https://10.0.0.1/apis/policy/v1beta1/namespaces/kafka/poddisruptionbudgets.
        Message: the server could not find the requested resource. Received
        status: Status(apiVersion=v1, code=404, details=StatusDetails(causes=[],
        group=null, kind=null, name=null, retryAfterSeconds=null, uid=null,
        additionalProperties={}), kind=Status, message=the server could not find
        the requested resource, metadata=ListMeta(_continue=null,
        remainingItemCount=null, resourceVersion=null, selfLink=null,
        additionalProperties={}), reason=NotFound, status=Failure,
        additionalProperties={}).
      reason: KubernetesClientException
      status: 'True'
      type: NotReady
  observedGeneration: 33
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    authorization:
      superUsers:
        - CN=temfluence-user
        - CN=temfluence-user-tls
        - CN=temfluence-mm-user-tls-sc
      type: simple
    config:
      auto.create.topics.enable: 'false'
      default.replication.factor: 2
      inter.broker.protocol.version: '3.1'
      min.insync.replicas: 2
      offsets.topic.replication.factor: 2
      ssl.cipher.suites: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
      ssl.enabled.protocols: TLSv1.2
      ssl.protocol: TLSv1.2
      transaction.state.log.min.isr: 1
      transaction.state.log.replication.factor: 1
    listeners:
      - authentication:
          type: scram-sha-512
        name: tls
        port: 9093
        tls: true
        type: internal
      - authentication:
          type: tls
        configuration:
          bootstrap:
            annotations:
              kubernetes.io/ingress.class: nginx
            host: kafka-bootstrap.temfluence.internal
          brokers:
            - annotations:
                kubernetes.io/ingress.class: nginx
              broker: 0
              host: kafka-broker-0.temfluence.internal
            - annotations:
                kubernetes.io/ingress.class: nginx
              broker: 1
              host: kafka-broker-1.temfluence.internal
            - annotations:
                kubernetes.io/ingress.class: nginx
              broker: 2
              host: kafka-broker-2.temfluence.internal
        name: external
        port: 9094
        tls: true
        type: ingress
    livenessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          key: kafka-metrics-config.yml
          name: kafka-metrics
    readinessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    replicas: 2
    resources:
      limits:
        memory: 1500Mi
      requests:
        cpu: 100m
        memory: 600Mi
    storage:
      type: jbod
      volumes:
        - deleteClaim: false
          id: 0
          size: 100Gi
          type: persistent-claim
        - deleteClaim: false
          id: 1
          size: 100Gi
          type: persistent-claim
    template:
      pod:
        metadata:
          annotations: {}
      podDisruptionBudget: {}
    version: 3.1.0
  kafkaExporter:
    groupRegex: .*
    template:
      pod:
        metadata:
          annotations: {}
    topicRegex: .*
  zookeeper:
    livenessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          key: zookeeper-metrics-config.yml
          name: kafka-metrics
    readinessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    replicas: 3
    resources:
      limits:
        memory: 1800Mi
      requests:
        cpu: 300m
        memory: 700Mi
    storage:
      deleteClaim: false
      size: 100Gi
      type: persistent-claim

I've researched this issue but haven't found a solution. While some sources mention using the entity-operator for Kafka pod scaling, I'm unsure of the correct procedure. Thank you for your assistance!

1 Answer

Answer by Jakub:

You should not touch the StatefulSet resources created by Strimzi (and not just them, but most of the other resources it creates as well). If you want to scale the Kafka cluster, edit the Kafka custom resource, change the number of replicas in .spec.kafka.replicas, and let the operator create or update the required resources.
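
For example, a minimal way to do that from the command line (3 replicas here is only an illustration; kubectl edit kafka kafka-cluster-sc -n kafka and changing the field by hand works just as well):

kubectl -n kafka patch kafka kafka-cluster-sc --type merge \
  -p '{"spec":{"kafka":{"replicas":3}}}'

The operator will then roll out (or scale down) the broker pods and the related Secrets, Services, and per-broker certificates itself.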

From the information you provided, I'm not sure which of the errors are related to you scaling the StatefulSets manually; there may well be other issues too. But scaling the Kafka cluster through the Kafka CR would be the place to start.
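
If a change to the Kafka CR still has no visible effect, the CR status and the Cluster Operator log are the places to look (this assumes the default deployment name strimzi-cluster-operator; adjust the namespace if the operator runs outside kafka):

kubectl -n kafka get kafka kafka-cluster-sc -o jsonpath='{.status.conditions}'
kubectl -n kafka logs deployment/strimzi-cluster-operator

The NotReady condition already visible in the status you posted (the 404 when creating the policy/v1beta1 PodDisruptionBudget) suggests the operator is failing its reconciliation, which would also explain why edits to the CR are not rolled out.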