Triggering PySpark script execution on EMR using an AWS Lambda function


I am trying to run a PySpark script (.ipynb) stored in S3 on an EMR cluster via a Lambda function.

import boto3

def lambda_handler(event, context):
    # Initialize AWS clients
    s3_client = boto3.client('s3')
    # emr_client = boto3.client('emr')
    region = 'us-east-1'
    emr_client = boto3.client('emr',
                              region_name=region,
                              aws_access_key_id='*****',
                              aws_secret_access_key='****')

    # S3 bucket and notebook path
    s3_bucket = 'XXXXX'
    notebook_key = 's3://XXX/code.ipynb'

    # EMR cluster ID
    cluster_id = 'XXX'

    # Specify the steps to run
    steps = [{
        'Name': 'Test',
        'ActionOnFailure': 'CONTINUE',  # Define action on failure
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit',
                     '--class', 'org.apache.spark.deploy.PythonRunner',
                     '--deploy-mode', 'cluster',
                     's3://XXX/pyspark_code.ipynb']
        }
    }]

    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
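For reference, the submitted step can be polled afterwards. This is only a minimal sketch: the polling loop, the 30-second interval, and indexing into StepIds (assuming a single step was submitted) are illustrative and not part of the original handler.

import time

# Sketch: poll the submitted step until it reaches a terminal state
# (assumes emr_client, cluster_id, and response from the handler above).
step_id = response['StepIds'][0]
while True:
    status = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
    state = status['Step']['Status']['State']
    print(f"Step {step_id} is {state}")
    if state in ('COMPLETED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(30)

Polling inside the Lambda itself is only practical for short-running steps, given the Lambda execution time limit.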

However, I get the error message below, even though the Lambda function is attached to a VPC whose security group has an outbound rule for port 443:

Connect timeout on endpoint URL: "https://elasticmapreduce.us-east-1.amazonaws.com/"

Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 34, in lambda_handler
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
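Note that elasticmapreduce.us-east-1.amazonaws.com is a public endpoint, so an outbound rule on port 443 alone is not enough for a VPC-attached Lambda; one common cause of this timeout is that the Lambda's subnet has no route to the EMR API, i.e. no NAT gateway and no interface VPC endpoint for EMR. Below is a minimal sketch of creating such an interface endpoint with boto3; all of the VPC, subnet, and security group IDs are placeholders, not values from the original setup.

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Sketch: create an interface VPC endpoint so the EMR API is reachable
# from inside the VPC without a NAT gateway. All IDs below are placeholders.
response = ec2.create_vpc_endpoint(
    VpcEndpointType='Interface',
    VpcId='vpc-xxxxxxxx',                    # the VPC the Lambda runs in
    ServiceName='com.amazonaws.us-east-1.elasticmapreduce',
    SubnetIds=['subnet-xxxxxxxx'],           # the Lambda's subnet(s)
    SecurityGroupIds=['sg-xxxxxxxx'],        # must allow inbound 443 from the Lambda
    PrivateDnsEnabled=True,                  # so the default EMR endpoint name resolves privately
)
print(response['VpcEndpoint']['VpcEndpointId'])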

