Getting EFSMountTimeoutException when triggering a Step Functions Distributed Map state with very high concurrency


I have a Step Functions state machine with a Distributed Map state. The map state has a Lambda function as its child, which does some file processing (read/write/delete operations) on EFS.

The issue: I set this map state's concurrency to 2000 (my Lambda function's concurrency limit is well above 2000). When the map state triggers 2000 instances of my Lambda function, EFS cannot handle the sudden burst and the Lambda function starts failing with EFSMountTimeoutException:

The function could not mount the EFS file system with access point arn:aws:elasticfilesystem:us-east-1:1234567889:access-point/fsap-05b0494f1746d72 due to mount time out. (Service: AWSLambda; Status Code: 408; Error Code: EFSMountTimeoutException; Request ID: ec2a7b2a-2f74-47c8-12839-204fd37fb042; Proxy: null)
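For reference, the setup described above can be sketched roughly as follows. This is a minimal illustration, not the actual definition: it builds an Amazon States Language Map state as a Python dict, and the function name `efs-file-processor` and state name `ProcessFile` are hypothetical placeholders.

```python
import json

# Hypothetical Lambda function name; the real one is not given in the question.
FUNCTION_NAME = "efs-file-processor"


def build_distributed_map(max_concurrency: int) -> dict:
    """Return an ASL Map state that runs one Lambda invocation per input item."""
    return {
        "Type": "Map",
        # Upper bound on child workflow executions running in parallel.
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
            "StartAt": "ProcessFile",
            "States": {
                "ProcessFile": {
                    "Type": "Task",
                    # Synchronous (request/response) Lambda invocation.
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": FUNCTION_NAME, "Payload.$": "$"},
                    "End": True,
                }
            },
        },
        "End": True,
    }


map_state = build_distributed_map(2000)
print(json.dumps(map_state, indent=2))
```

With `MaxConcurrency` at 2000, up to 2000 child executions can start close together, and each cold Lambda execution environment has to mount the EFS access point before the handler runs.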

I believe the issue is the sudden burst of Lambda invocations triggered synchronously by the map state.

What can be done to overcome this issue?

What I have tried so far, with no luck:

1 - Invoking 2000 Lambda instances through EventBridge events from code: they work fine, maybe because those invocations are asynchronous and Lambda can handle that. I cannot use this approach because I need control over these Lambda instances in my use case.

2 - Step Functions map state retry: the problem is that even if, for example, 1990 Lambda invocations have already finished successfully and some instance fails afterwards, the retry re-runs the whole map state. Another scenario: if all 2000 instances fail at once, retrying after some delay triggers 2000 invocations again, and all of them accessing EFS at once leads to the same EFS mount timeout.

3 - Provisioned concurrency for this Lambda function: it turns out to be too costly, considering how often and with how many instances I want to run it.
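A variant of attempt 2 that may be worth comparing: a retrier placed on the Lambda task inside the map's item processor, rather than on the Map state itself, re-runs only the failed item. The sketch below shows what that could look like with backoff and jitter so that retries do not all land at the same moment. It is an illustration under assumptions: the function name is hypothetical, the interval values are arbitrary, `States.TaskFailed` is assumed to be broad enough to match the mount timeout (the exact error name Step Functions reports for it is not confirmed here), and whether this actually avoids the timeout at this scale is untested.

```python
import json


def build_task_with_retry(function_name: str) -> dict:
    """Return an ASL Task state whose retrier applies to a single map item."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
        "Retry": [
            {
                # Assumption: broad match; narrow it once the reported
                # error name for the EFS mount timeout is known.
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,     # delay before the first retry (arbitrary)
                "BackoffRate": 2.0,       # multiply the delay on each attempt
                "MaxAttempts": 5,
                "MaxDelaySeconds": 120,   # cap on the backoff delay
                "JitterStrategy": "FULL", # randomize delays to spread retries out
            }
        ],
        "End": True,
    }


# Hypothetical function name, used only for illustration.
task_state = build_task_with_retry("efs-file-processor")
print(json.dumps(task_state, indent=2))
```

This state would take the place of the child task in the map's `ItemProcessor`; items that already succeeded are not re-run when another item retries.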
