CreateClusterAndSubmitSteps
Create an EMR Cluster, submit steps to be processed, then get the cluster ID as an output.
type: "io.kestra.plugin.aws.emr.CreateClusterAndSubmitSteps"
Create an EMR Cluster, submit a Spark job, and wait until the job has terminated.
id: aws_emr_create_cluster
namespace: company.team

tasks:
  - id: create_cluster
    type: io.kestra.plugin.aws.emr.CreateClusterAndSubmitSteps
    accessKeyId: <access-key>
    secretKeyId: <secret-key>
    region: eu-west-3
    clusterName: "Spark job cluster"
    logUri: "s3://my-bucket/test-emr-logs"
    keepJobFlowAliveWhenNoSteps: true
    applications:
      - Spark
    masterInstanceType: m5.xlarge
    slaveInstanceType: m5.xlarge
    instanceCount: 3
    ec2KeyName: my-ec2-ssh-key-pair-name
    steps:
      - name: Spark_job_test
        jar: "command-runner.jar"
        actionOnFailure: CONTINUE
        commands:
          - spark-submit s3://mybucket/health_violations.py --data_source s3://mybucket/food_establishment_data.csv --output_uri s3://mybucket/test-emr-output
    wait: true
YES
Cluster Name.
YES
YES
Master Instance Type.
EC2 instance type for master instances.
YES
emr-5.20.0
Release Label.
Specifies the EMR release version label, following the pattern 'emr-x.x.x'.
YES
Slave Instance Type.
EC2 instance type for slave instances.
YES
Access Key Id in order to connect to AWS.
If no credentials are defined, we will use the default credentials provider chain to fetch credentials.
YES
Applications.
List of application names, e.g. "Hive", "Spark", "Ganglia".
YES
YES
PT10S
duration
Check interval duration.
The frequency with which the task checks whether the job is completed.
YES
EC2 Key name.
The name of the Amazon EC2 key pair that can be used to connect to the master node using SSH as the user called "hadoop".
YES
EC2 Subnet ID.
Applies to clusters that use the uniform instance group configuration. To launch the cluster in Amazon Virtual Private Cloud (Amazon VPC), set this parameter to the identifier of the Amazon VPC subnet where you want the cluster to launch. If you do not specify this value and your account supports EC2-Classic, the cluster launches in EC2-Classic.
YES
The endpoint with which the SDK should communicate.
This property allows you to use a different S3 compatible storage backend.
YES
EMR_EC2_DefaultRole
Job flow Role.
Also called instance profile and Amazon EC2 role. An IAM role for an Amazon EMR cluster. The Amazon EC2 instances of the cluster assume this role. The default role is EMR_EC2_DefaultRole. In order to use the default role, you must have already created it using the CLI or console.
YES
false
YES
Log URI.
The location in Amazon S3 to write the log files of the job flow. If a value is not provided, logs are not created.
YES
AWS region with which the SDK should communicate.
YES
Secret Key Id in order to connect to AWS.
If no credentials are defined, we will use the default credentials provider chain to fetch credentials.
YES
EMR_DefaultRole
Service Role.
The IAM role that Amazon EMR assumes in order to access Amazon Web Services resources on your behalf. If you've created a custom service role path, you must specify it for the service role when you launch your cluster.
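As a rough sketch of how the cluster-level settings described above might be combined, the task fragment below pins the release label, launches the cluster into a VPC subnet, and names both IAM roles explicitly. The property names `releaseLabel`, `ec2SubnetId`, `jobFlowRole`, and `serviceRole` are assumptions inferred from the property titles on this page and are not confirmed by it; `<subnet-id>` is a placeholder. The remaining properties are copied from the example at the top of the page.

```yaml
tasks:
  - id: create_cluster
    type: io.kestra.plugin.aws.emr.CreateClusterAndSubmitSteps
    region: eu-west-3
    clusterName: "Spark job cluster"
    # The four property names below are assumed from the titles on this page.
    releaseLabel: emr-5.20.0          # documented default; pattern 'emr-x.x.x'
    ec2SubnetId: <subnet-id>          # placeholder for the VPC subnet identifier
    jobFlowRole: EMR_EC2_DefaultRole  # documented default instance profile
    serviceRole: EMR_DefaultRole      # documented default service role
    masterInstanceType: m5.xlarge
    slaveInstanceType: m5.xlarge
    instanceCount: 3
    steps:
      - name: Spark_job_test
        jar: "command-runner.jar"
        actionOnFailure: CONTINUE
        commands:
          - spark-submit s3://mybucket/health_violations.py
```

Because the release label and both roles are shown with their documented defaults, omitting them should produce the same cluster; they are spelled out here only to show where custom values would go.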
YES
AWS session token, retrieved from an AWS token service, used for authenticating that this user has received temporary permissions to access a given resource.
If no credentials are defined, we will use the default credentials provider chain to fetch credentials.
YES
The AWS STS endpoint with which the SDKClient should communicate.
YES
AWS STS Role.
The Amazon Resource Name (ARN) of the role to assume. If set, the task will use the StsAssumeRoleCredentialsProvider. If no credentials are defined, we will use the default credentials provider chain to fetch credentials.
YES
AWS STS External Id.
A unique identifier that might be required when you assume a role in another account. This property is only used when an stsRoleArn is defined.
YES
PT15M
duration
AWS STS Session duration.
The duration of the role session (default: 15 minutes, i.e., PT15M). This property is only used when an stsRoleArn is defined.
YES
AWS STS Session name.
This property is only used when an stsRoleArn is defined.
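The STS properties above could be used together roughly as follows. Only `stsRoleArn` is named on this page; `stsRoleExternalId`, `stsRoleSessionName`, and `stsRoleSessionDuration` are assumed names derived from the property titles, and every value in angle brackets is a placeholder. Cluster and step properties are left out to keep the fragment focused on authentication, so this is not a complete task on its own.

```yaml
tasks:
  - id: create_cluster
    type: io.kestra.plugin.aws.emr.CreateClusterAndSubmitSteps
    region: eu-west-3
    # Role to assume; named on this page.
    stsRoleArn: "<role-arn>"
    # The three names below are assumptions; they are only relevant
    # when stsRoleArn is defined.
    stsRoleExternalId: "<external-id>"
    stsRoleSessionName: "<session-name>"
    stsRoleSessionDuration: PT15M   # documented default of 15 minutes
    # clusterName, instance types, steps, etc. omitted for brevity
```

With a role ARN in place, static `accessKeyId` and `secretKeyId` values would typically not be needed, since the credentials used to assume the role come from the default provider chain when none are defined.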
YES
true
YES
false
YES
PT1H
duration
Completion timeout.
Job flow ID.
YES
TERMINATE_CLUSTER
CANCEL_AND_WAIT
CONTINUE
TERMINATE_JOB_FLOW
Action on failure.
Possible values: TERMINATE_CLUSTER, CANCEL_AND_WAIT, CONTINUE, TERMINATE_JOB_FLOW.
YES
JAR path.
A path to a JAR file run during the step.
YES
Step configuration name.
Example: "Run Spark job"
YES
Commands.
A list of commands that will be passed to the JAR file's main function when executed.
YES
Main class.
The name of the main class in the specified Java file. If not specified, the JAR file should specify a Main-Class in its manifest file.
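To illustrate the step properties above, here is a hedged sketch of a `steps` list that runs a custom JAR with an explicit main class instead of `command-runner.jar`. The property name `mainClass` is an assumption based on the "Main class" title; the JAR path, class name, and arguments are hypothetical and only show the shape of the configuration.

```yaml
steps:
  - name: Run_custom_jar
    jar: "s3://mybucket/my-app.jar"     # hypothetical JAR location
    mainClass: com.example.Main         # assumed property name, hypothetical class
    actionOnFailure: TERMINATE_CLUSTER  # one of the documented values
    commands:                           # passed to the main function as arguments
      - "--data_source"
      - "s3://mybucket/food_establishment_data.csv"
```

If the JAR's manifest already declares a Main-Class, the main class entry can be dropped, as the property description notes.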