Reboot MSK Broker Node via Lambda function

I am trying to reboot an MSK broker node using the boto3 API from a Lambda function, but the call keeps timing out without a useful error message. After realizing that the MSK API service I am trying to reach is in a different VPC from the invoking client (the Lambda function), we tried another approach: create an API Gateway, attach a VPC endpoint to it, and add the Lambda function as a trigger on the gateway endpoint, so that it runs the same boto3 code we had in the plain Lambda function. The call still times out.

What we have verified:

  • The permission required to perform the reboot broker action is present in the IAM policy attached to the Lambda function that makes the API call.
  • We can invoke the gateway's API endpoint via the VPC endpoint: it returns a 200 response if we comment out the reboot broker call to the MSK API service.

Overall, I am looking for an example where someone has used a Lambda function to call the boto3 API to reboot an MSK broker node. I do not see this scenario covered in the AWS documentation as of today.

Answer by EdbE:

Just for the context, this is my code:

import boto3

def lambda_handler(event, context):
    cluster_arn = event.get('arn')
    broker_id = event.get('brokerId')

    # Create an MSK client
    msk_client = boto3.client('kafka')

    # Reboot the broker
    response = msk_client.reboot_broker(
        ClusterArn=cluster_arn,
        BrokerIds=[broker_id]
    )

    # Check the response for confirmation
    if response['ResponseMetadata']['HTTPStatusCode'] == 200:
        return "Broker reboot initiated successfully"
    else:
        return "Broker reboot failed"

This code works properly, and I can see my broker being rebooted right now. The HTTP 200 status code indicates that the service accepted the request. The reboot itself is asynchronous and takes as long as it takes.
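Because the reboot is asynchronous, a caller that needs to know when it has finished has to poll. The following is a minimal sketch, not part of the code above: it assumes the `reboot_broker` response carries a `ClusterOperationArn` and that `describe_cluster_operation` reports an `OperationState` that ends in `UPDATE_COMPLETE` or `UPDATE_FAILED`. Check both assumptions against the MSK API reference. The helper name `wait_for_operation` and its parameters are hypothetical.

```python
import time

# Assumed terminal values of OperationState; confirm against the MSK API reference.
TERMINAL_STATES = {"UPDATE_COMPLETE", "UPDATE_FAILED"}


def wait_for_operation(msk_client, operation_arn, poll_seconds=15,
                       max_wait_seconds=600, sleep=time.sleep):
    """Poll a cluster operation until it reaches a terminal state or the wait budget runs out.

    Returns the last OperationState seen, which may still be non-terminal
    if max_wait_seconds elapsed first.
    """
    waited = 0
    while True:
        info = msk_client.describe_cluster_operation(
            ClusterOperationArn=operation_arn
        )
        state = info["ClusterOperationInfo"]["OperationState"]
        if state in TERMINAL_STATES or waited >= max_wait_seconds:
            return state
        sleep(poll_seconds)
        waited += poll_seconds


# Possible usage inside the handler, after the reboot call:
#   state = wait_for_operation(msk_client, response["ClusterOperationArn"])
```

Keep `max_wait_seconds` below the Lambda function's configured timeout, otherwise the function is killed before the helper returns. For long reboots it may be simpler to return the operation ARN and check it from a later invocation.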

As for the timeouts you are experiencing, I would like to point out the following:

  1. The boto3 reboot API is an MSK-level API, not a Kafka-level one.
  2. As a result of #1, you don't need to invoke it from the VPC where your cluster is running. It doesn't use the brokers, the bootstrap servers, or ZooKeeper to reboot a broker. It calls an AWS service endpoint.
  3. I'm not sure this is related to IAM, but just in case, check that you have all the permissions required to call the MSK APIs.
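Since the call goes to a regional service endpoint, a silent timeout often means the function simply cannot reach that endpoint, for example when the Lambda function is attached to a VPC with no outbound route. The sketch below is one way to check this from inside the function and to make the failure surface quickly. It assumes the standard-partition hostname pattern `kafka.<region>.amazonaws.com`; the helper names and the timeout values are hypothetical choices, not recommendations from AWS.

```python
import socket


def msk_api_host(region):
    """Hostname of the regional MSK API endpoint (assumes the standard aws partition)."""
    return f"kafka.{region}.amazonaws.com"


def can_reach(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def make_msk_client(region, connect_timeout=5, read_timeout=10):
    """Create a kafka client that fails fast instead of waiting for the Lambda timeout."""
    # Imported here so the reachability helpers above work without boto3 installed.
    import boto3
    from botocore.config import Config

    config = Config(
        connect_timeout=connect_timeout,
        read_timeout=read_timeout,
        retries={"max_attempts": 2},
    )
    return boto3.client("kafka", region_name=region, config=config)
```

Logging `can_reach(msk_api_host(region))` at the start of the handler shows whether the problem is network reachability rather than IAM. With the short timeouts, an unreachable endpoint should raise a botocore connection error in the logs rather than letting the invocation run out its time limit.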

IAM permissions for rebooting brokers:

        {
            "Sid": "AllowMskApis",
            "Effect": "Allow",
            "Action": [
                "kafka:ListClusters",
                "kafka:ListClustersV2",
                "kafka:ListNodes",
                "kafka:DescribeCluster",
                "kafka:GetBootstrapBrokers",
                "kafka:RebootBroker"
            ],
            "Resource": [
                "*"
            ]
        }