Create an AWS EMR cluster, submit steps, and retrieve the cluster ID.

yaml
type: "io.kestra.plugin.aws.emr.createclusterandsubmitsteps"

Create an EMR Cluster, submit a Spark job, wait until the job is terminated.

yaml
id: aws_emr_create_cluster
namespace: company.team

tasks:
  - id: create_cluster
    type: io.kestra.plugin.aws.emr.CreateClusterAndSubmitSteps
    accessKeyId: <access-key>
    secretKeyId: <secret-key>
    region: eu-west-3
    clusterName: "Spark job cluster"
    logUri: "s3://my-bucket/test-emr-logs"
    keepJobFlowAliveWhenNoSteps: true
    applications:
        - Spark
    masterInstanceType: m5.xlarge
    slaveInstanceType: m5.xlarge
    instanceCount: 3
    ec2KeyName: my-ec2-ssh-key-pair-name
    steps:
        - name: Spark_job_test
          jar: "command-runner.jar"
          actionOnFailure: CONTINUE
          commands:
            - spark-submit s3://mybucket/health_violations.py --data_source s3://mybucket/food_establishment_data.csv --output_uri s3://mybucket/test-emr-output
    wait: true

Cluster Name.

Instance count.

Master Instance Type.

EC2 instance type for master instances.

Default emr-5.20.0

Release Label.

It specifies the EMR release version label. Pattern is 'emr-x.x.x'.

Slave Instance Type.

EC2 instance type for slave instances.

Access Key Id in order to connect to AWS.

If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

SubType string

Applications.

List of applications name: Ex: "Hive", "Spark", "Ganglia"

Enable compatibility mode.

Use it to connect to S3 bucket with S3 compatible services that don't support the new transport client.

Default PT10S

Format duration

Check interval duration.

The frequency with which the task checks whether the job is completed.

EC2 Key name.

The name of the Amazon EC2 key pair that can be used to connect to the master node using SSH as the user called "hadoop".

EC2 Subnet ID.

Applies to clusters that use the uniform instance group configuration. To launch the cluster in Amazon Virtual Private Cloud (Amazon VPC), set this parameter to the identifier of the Amazon VPC subnet where you want the cluster to launch. If you do not specify this value and your account supports EC2-Classic, the cluster launches in EC2-Classic.

The endpoint with which the SDK should communicate.

This property allows you to use a different S3 compatible storage backend.

Force path style access.

Must only be used when compatibilityMode is enabled.

Default EMR_EC2_DefaultRole

Job flow Role.

Also called instance profile and Amazon EC2 role. An IAM role for an Amazon EMR cluster. The Amazon EC2 instances of the cluster assume this role. The default role is EMR_EC2_DefaultRole. In order to use the default role, you must have already created it using the CLI or console.

Default false

Keep job flow alive.

Specifies whether the cluster should remain available after completing all steps. Defaults to false.

Log URI.

The location in Amazon S3 to write the log files of the job flow. If a value is not provided, logs are not created.

AWS region with which the SDK should communicate.

Secret Key Id in order to connect to AWS.

If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

Default EMR_DefaultRole

Service Role.

The IAM role that Amazon EMR assumes in order to access Amazon Web Services resources on your behalf. If you've created a custom service role path, you must specify it for the service role when you launch your cluster.

AWS session token, retrieved from an AWS token service, used for authenticating that this user has received temporary permissions to access a given resource.

If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

SubType

Steps

List of steps to run.

The AWS STS endpoint with which the SDKClient should communicate.

AWS STS Role.

The Amazon Resource Name (ARN) of the role to assume. If set the task will use the StsAssumeRoleCredentialsProvider. If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

AWS STS External Id.

A unique identifier that might be required when you assume a role in another account. This property is only used when an stsRoleArn is defined.

Default PT15M

Format duration

AWS STS Session duration.

The duration of the role session (default: 15 minutes, i.e., PT15M). This property is only used when an stsRoleArn is defined.

AWS STS Session name.

This property is only used when an stsRoleArn is defined.

Default true

Visible to all users.

Set this value to true so that IAM principals in the Amazon Web Services account associated with the cluster can perform Amazon EMR actions on the cluster that their IAM policies allow.

Default false

Wait for the end of the run.

If set to true it will wait until the cluster has status TERMINATED or WAITING.

Default PT1H

Format duration

Completion timeout.

Job flow ID.

Possible Values

TERMINATE_CLUSTERCANCEL_AND_WAITCONTINUETERMINATE_JOB_FLOW

Action on failure.

Possible values : TERMINATE_CLUSTER, CANCEL_AND_WAIT, CONTINUE, TERMINATE_JOB_FLOW.

JAR path.

A path to a JAR file run during the step.

Step configuration name.

Ex: "Run Spark job"

SubType string

Commands.

A list of commands that will be passed to the JAR file's main function when executed.

Main class.

The name of the main class in the specified Java file. If not specified, the JAR file should specify a Main-Class in its manifest file.