
PySparkSubmit

```yaml
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
```

Submit an Apache PySpark batch workload.

Examples

```yaml
id: "py_spark_submit"
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
name: test-pyspark
```
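
In a complete flow, the task usually also sets the project and region. A sketch (the flow id, namespace, project, and region below are illustrative):

```yaml
id: dataproc_pyspark
namespace: company.team

tasks:
  - id: py_spark_submit
    type: io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit
    projectId: my-gcp-project   # illustrative project id
    region: europe-west1        # illustrative region
    mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
    name: test-pyspark
```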

Properties

mainPythonFileUri

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The HCFS URI of the main Python file to use as the Spark driver. Must be a .py file.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. They can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.

name

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The batch name.

region

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The Dataproc region to submit the batch to.

archiveUris

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. They can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.

args

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

The arguments to pass to the driver.

Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
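
For instance, positional arguments for the driver script can be passed as a plain list (the values below are illustrative); Spark settings such as `--conf` should instead go through the runtime `properties` map:

```yaml
args:
  - "--input"
  - "gs://my-bucket/input/"   # illustrative bucket path
  - "--iterations"
  - "1000"
```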

execution

Execution configuration for a workload.

fileUris

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

HCFS URIs of files to be placed in the working directory of each executor.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. They can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.

jarFileUris

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. They can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.

peripherals

Peripherals configuration for a workload.
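
A peripherals block might look like the following sketch (all resource names are illustrative; see the PeripheralsConfiguration definition below for the fields):

```yaml
peripherals:
  metastoreService: projects/my-project/locations/europe-west1/services/my-metastore
  sparkHistoryServer:
    dataprocCluster: projects/my-project/regions/europe-west1/clusters/my-history-cluster
```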

projectId

  • Type: string
  • Dynamic: ✔️
  • Required:

The GCP project ID.

runtime

Runtime configuration for a workload.
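
As a sketch, the runtime block can pin a batch runtime version and pass Spark properties (the values below are illustrative; see the RuntimeConfiguration definition below):

```yaml
runtime:
  version: "2.1"                      # illustrative batch runtime version
  properties:
    spark.executor.instances: "4"     # illustrative Spark property
```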

scopes

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:
  • Default: [https://www.googleapis.com/auth/cloud-platform]

The GCP scopes to use.

serviceAccount

  • Type: string
  • Dynamic: ✔️
  • Required:

The GCP service account key
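
The key is typically injected from a Kestra secret rather than inlined in the flow, for example:

```yaml
serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT') }}"
```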

Outputs

state

  • Type: string
  • Possible Values:
    • STATE_UNSPECIFIED
    • PENDING
    • RUNNING
    • CANCELLING
    • CANCELLED
    • SUCCEEDED
    • FAILED
    • UNRECOGNIZED

The state of the batch.

Definitions

PeripheralsConfiguration

metastoreService

  • Type: string
  • Dynamic: ✔️
  • Required:

Resource name of an existing Dataproc Metastore service.

Example: projects/[project_id]/locations/[region]/services/[service_id]

sparkHistoryServer

Spark History Server configuration for the workload.

RuntimeConfiguration

containerImage

  • Type: string
  • Dynamic: ✔️
  • Required:

Optional custom container image for the job runtime environment.

If not specified, a default container image will be used.

properties

  • Type: object
  • SubType: string
  • Dynamic: ✔️
  • Required:

Properties used to configure the workload execution (a map of key/value pairs).

version

  • Type: string
  • Dynamic: ✔️
  • Required:

Version of the batch runtime.

SparkHistoryServerConfiguration

dataprocCluster

  • Type: string
  • Dynamic: ✔️
  • Required:

Resource name of an existing Dataproc Cluster to act as a Spark History Server for the workload.

Example: projects/[project_id]/regions/[region]/clusters/[cluster_name]

ExecutionConfiguration

kmsKey

  • Type: string
  • Dynamic: ✔️
  • Required:

The Cloud KMS key to use for encryption.

networkTags

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

Tags used for network traffic control.

networkUri

  • Type: string
  • Dynamic: ✔️
  • Required:

Network URI to connect workload to.

serviceAccountEmail

  • Type: string
  • Dynamic: ✔️
  • Required:

Service account used to execute workload.

subnetworkUri

  • Type: string
  • Dynamic: ✔️
  • Required:

Subnetwork URI to connect workload to.
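
Putting the execution settings together, a block might look like this sketch (all resource names are illustrative):

```yaml
execution:
  serviceAccountEmail: batch-runner@my-project.iam.gserviceaccount.com
  subnetworkUri: projects/my-project/regions/europe-west1/subnetworks/my-subnet
  networkTags:
    - dataproc
  kmsKey: projects/my-project/locations/europe-west1/keyRings/my-ring/cryptoKeys/my-key
```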