I'm facing an issue while trying to scale Kafka pods in an AKS cluster running self-hosted (Strimzi-managed) Kafka. When I change the replica count through the Kafka custom resource, the change never takes effect, whether scaling up or down. If I instead scale the StatefulSet manually, the extra pods are created, but every pod after the first one (e.g., kafka-cluster-sc-kafka-1) fails with the following error:
Error
STRIMZI_BROKER_ID=1
Preparing truststore for replication listener
Adding /opt/kafka/cluster-ca-certs/ca.crt to truststore /tmp/kafka/cluster.truststore.p12 with alias ca
Certificate was added to keystore
Preparing truststore for replication listener is complete
Looking for the right CA
No CA found. Thus exiting.
However, it's important to note that the pod kafka-cluster-sc-kafka-0 always works as expected.
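For reference, the manual scaling was done directly against the StatefulSet with a command along these lines (a sketch; the StatefulSet name is inferred from the pod names above):

    # Scale the Strimzi-managed StatefulSet directly; this is what produces the error above
    kubectl scale statefulset kafka-cluster-sc-kafka -n kafka --replicas=3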
Kafka custom resource:
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  creationTimestamp: '2022-02-23T18:48:29Z'
  generation: 33
  labels:
    k8slens-edit-resource-version: v1beta2
  name: kafka-cluster-sc
  namespace: kafka
  resourceVersion: '439262033'
  uid: 8470c254-f4b9-406b-a4e2-a56162cd78db
  selfLink: /apis/kafka.strimzi.io/v1beta2/namespaces/kafka/kafkas/kafka-cluster-sc
status:
  conditions:
    - lastTransitionTime: '2023-09-27T19:00:15.174Z'
      message: >-
        Failure executing: POST at:
        https://10.0.0.1/apis/policy/v1beta1/namespaces/kafka/poddisruptionbudgets.
        Message: the server could not find the requested resource. Received
        status: Status(apiVersion=v1, code=404, details=StatusDetails(causes=[],
        group=null, kind=null, name=null, retryAfterSeconds=null, uid=null,
        additionalProperties={}), kind=Status, message=the server could not find
        the requested resource, metadata=ListMeta(_continue=null,
        remainingItemCount=null, resourceVersion=null, selfLink=null,
        additionalProperties={}), reason=NotFound, status=Failure,
        additionalProperties={}).
      reason: KubernetesClientException
      status: 'True'
      type: NotReady
  observedGeneration: 33
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    authorization:
      superUsers:
        - CN=temfluence-user
        - CN=temfluence-user-tls
        - CN=temfluence-mm-user-tls-sc
      type: simple
    config:
      auto.create.topics.enable: 'false'
      default.replication.factor: 2
      inter.broker.protocol.version: '3.1'
      min.insync.replicas: 2
      offsets.topic.replication.factor: 2
      ssl.cipher.suites: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
      ssl.enabled.protocols: TLSv1.2
      ssl.protocol: TLSv1.2
      transaction.state.log.min.isr: 1
      transaction.state.log.replication.factor: 1
    listeners:
      - authentication:
          type: scram-sha-512
        name: tls
        port: 9093
        tls: true
        type: internal
      - authentication:
          type: tls
        configuration:
          bootstrap:
            annotations:
              kubernetes.io/ingress.class: nginx
            host: kafka-bootstrap.temfluence.internal
          brokers:
            - annotations:
                kubernetes.io/ingress.class: nginx
              broker: 0
              host: kafka-broker-0.temfluence.internal
            - annotations:
                kubernetes.io/ingress.class: nginx
              broker: 1
              host: kafka-broker-1.temfluence.internal
            - annotations:
                kubernetes.io/ingress.class: nginx
              broker: 2
              host: kafka-broker-2.temfluence.internal
        name: external
        port: 9094
        tls: true
        type: ingress
    livenessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          key: kafka-metrics-config.yml
          name: kafka-metrics
    readinessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    replicas: 2
    resources:
      limits:
        memory: 1500Mi
      requests:
        cpu: 100m
        memory: 600Mi
    storage:
      type: jbod
      volumes:
        - deleteClaim: false
          id: 0
          size: 100Gi
          type: persistent-claim
        - deleteClaim: false
          id: 1
          size: 100Gi
          type: persistent-claim
    template:
      pod:
        metadata:
          annotations: {}
      podDisruptionBudget: {}
    version: 3.1.0
  kafkaExporter:
    groupRegex: .*
    template:
      pod:
        metadata:
          annotations: {}
    topicRegex: .*
  zookeeper:
    livenessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          key: zookeeper-metrics-config.yml
          name: kafka-metrics
    readinessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    replicas: 3
    resources:
      limits:
        memory: 1800Mi
      requests:
        cpu: 300m
        memory: 700Mi
    storage:
      deleteClaim: false
      size: 100Gi
      type: persistent-claim
I've researched this issue but haven't found a solution. While some sources mention using the entity-operator for Kafka pod scaling, I'm unsure of the correct procedure. Thank you for your assistance!
You should not touch the StatefulSet resources created by Strimzi (and not just those, but most of the other resources it creates as well). If you want to scale the Kafka cluster, edit the Kafka custom resource, change the number of replicas in .spec.kafka.replicas, and let the operator create/update the required resources.
I'm not sure from the information you provided which of the errors are caused by scaling the StatefulSets manually; there might be other issues as well. For example, the NotReady condition in your status shows the operator getting a 404 when it tries to create a PodDisruptionBudget through the policy/v1beta1 API, which was removed in Kubernetes 1.25. That likely means the operator version is too old for the Kubernetes version it runs on, and a reconciliation failure like that would also prevent your replica changes from being applied. But scaling the Kafka cluster through the Kafka CR is the place to start.
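For example, a minimal sketch of scaling this cluster through the Kafka CR (the resource name and namespace are taken from your manifest; the target of 3 replicas is just an illustration):

    # Change .spec.kafka.replicas on the Kafka custom resource; the operator
    # reconciles the StatefulSet and creates the supporting per-broker resources
    # (certificates, services, etc.) that manually scaled pods are missing.
    kubectl patch kafka kafka-cluster-sc -n kafka --type merge \
      -p '{"spec":{"kafka":{"replicas":3}}}'

    # Watch the operator roll out the change
    kubectl get pods -n kafka -w

Equivalently, you can kubectl edit kafka kafka-cluster-sc -n kafka and change replicas: 2 under spec.kafka to the desired count.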