PySparkSubmit

```yaml
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
```

Submit an Apache PySpark batch workload.

Examples

```yaml
id: gcp_dataproc_py_spark_submit
namespace: company.team
tasks:
  - id: py_spark_submit
    type: io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit
    mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
    name: test-pyspark
    region: europe-west3
```

Properties

mainPythonFileUri

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The HCFS URI of the main Python file to use as the Spark driver. Must be a .py file.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. Can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.
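For illustration, the same property written with each accepted scheme; the bucket and paths below are placeholders:

```yaml
# All three HCFS schemes are accepted; these paths are hypothetical.
mainPythonFileUri: 'gs://my-bucket/jobs/pi.py'    # object in Cloud Storage
# mainPythonFileUri: 'hdfs:///jobs/pi.py'         # file on the cluster's HDFS
# mainPythonFileUri: 'file:///opt/jobs/pi.py'     # local file on the cluster
```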

name

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The batch name.

region

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The GCP region in which to run the batch.

archiveUris

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required: ❌

HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. Can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.

args

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required: ❌

The arguments to pass to the driver.

Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
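As an illustrative sketch (the bucket, job file, and argument names are hypothetical): driver arguments go in args, while Spark configuration such as --conf settings belongs in runtime properties (described under runtime below):

```yaml
- id: py_spark_submit
  type: io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit
  mainPythonFileUri: 'gs://my-bucket/jobs/etl.py'   # hypothetical job file
  name: etl-batch
  region: europe-west3
  args:
    - "--input=gs://my-bucket/raw/"                 # passed to the driver as-is
    - "--output=gs://my-bucket/curated/"
  runtime:
    properties:
      spark.executor.instances: "2"                 # set Spark conf here, not via --conf in args
```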

execution

  • Type: io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-ExecutionConfiguration
  • Required: ❌

Execution configuration for a workload.
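The available fields are listed under AbstractBatch-ExecutionConfiguration in the Definitions section below. A minimal sketch, assuming a project with a default network and a pre-created service account (all values hypothetical):

```yaml
execution:
  networkUri: default                                                  # VPC network for the workload
  serviceAccountEmail: spark-batch@my-project.iam.gserviceaccount.com  # hypothetical account
  networkTags:
    - dataproc                                                         # tag used for firewall rules
```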

fileUris

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required: ❌

HCFS URIs of files to be placed in the working directory of each executor.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. Can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.

impersonatedServiceAccount

  • Type: string
  • Dynamic: ✔️
  • Required: ❌

The GCP service account to impersonate.

jarFileUris

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required: ❌

HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. Can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.
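Since archiveUris, fileUris, and jarFileUris all take lists of HCFS URIs, one hedged sketch can cover all three; the bucket and objects are hypothetical:

```yaml
jarFileUris:
  - 'gs://my-bucket/libs/spark-bigquery-connector.jar'  # added to the driver and task classpath
fileUris:
  - 'gs://my-bucket/conf/app.conf'                      # placed in each executor's working directory
archiveUris:
  - 'gs://my-bucket/envs/pyspark-venv.tar.gz'           # extracted into each executor's working directory
```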

peripherals

  • Type: io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-PeripheralsConfiguration
  • Required: ❌

Peripherals configuration for a workload.
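Both fields follow the resource-name formats shown in the Definitions section below; the project, region, and service names here are hypothetical:

```yaml
peripherals:
  metastoreService: projects/my-project/locations/europe-west3/services/my-metastore
  sparkHistoryServer:
    dataprocCluster: projects/my-project/regions/europe-west3/clusters/my-phs-cluster
```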

projectId

  • Type: string
  • Dynamic: ✔️
  • Required: ❌

The GCP project ID.

runtime

  • Type: io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-RuntimeConfiguration
  • Required: ❌

Runtime configuration for a workload.
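The fields come from AbstractBatch-RuntimeConfiguration in the Definitions section below; the image, version, and property values are illustrative assumptions:

```yaml
runtime:
  containerImage: gcr.io/my-project/custom-spark:latest  # optional custom image (hypothetical)
  version: "2.1"                                         # batch runtime version
  properties:
    spark.executor.instances: "4"                        # key/value map of workload properties
    spark.driver.memory: "4g"
```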

scopes

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required: ❌
  • Default: [https://www.googleapis.com/auth/cloud-platform]

The GCP scopes to be used.

serviceAccount

  • Type: string
  • Dynamic: ✔️
  • Required: ❌

The GCP service account.
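Putting the authentication-related properties together, a hedged sketch; the project ID, secret name, and service account addresses are placeholders:

```yaml
projectId: my-project
serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT') }}"   # service account key stored as a Kestra secret
impersonatedServiceAccount: spark-runner@my-project.iam.gserviceaccount.com
```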

Outputs

state

  • Type: string
  • Required: ❌
  • Possible Values:
    • STATE_UNSPECIFIED
    • PENDING
    • RUNNING
    • CANCELLING
    • CANCELLED
    • SUCCEEDED
    • FAILED
    • UNRECOGNIZED

The state of the batch.
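Downstream tasks can read this output through Kestra's templating. A minimal sketch using a core Log task (task IDs taken from the example above):

```yaml
tasks:
  - id: py_spark_submit
    type: io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit
    mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
    name: test-pyspark
    region: europe-west3

  - id: log_state
    type: io.kestra.plugin.core.log.Log
    message: "Batch finished in state {{ outputs.py_spark_submit.state }}"
```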

Definitions

io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-PeripheralsConfiguration

Properties

metastoreService
  • Type: string
  • Dynamic: ✔️
  • Required: ❌

Resource name of an existing Dataproc Metastore service.

Example: projects/[project_id]/locations/[region]/services/[service_id]

sparkHistoryServer
  • Type: io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-SparkHistoryServerConfiguration
  • Required: ❌

Spark History Server configuration for the workload. See the SparkHistoryServerConfiguration definition below.

io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-RuntimeConfiguration

Properties

containerImage
  • Type: string
  • Dynamic: ✔️
  • Required: ❌

Optional custom container image for the job runtime environment.

If not specified, a default container image will be used.

properties
  • Type: object
  • SubType: string
  • Dynamic: ✔️
  • Required: ❌

Properties used to configure the workload execution (map of key/value pairs).

version
  • Type: string
  • Dynamic: ✔️
  • Required: ❌

Version of the batch runtime.

io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-SparkHistoryServerConfiguration

Properties

dataprocCluster
  • Type: string
  • Dynamic: ✔️
  • Required: ❌

Resource name of an existing Dataproc Cluster to act as a Spark History Server for the workload.

Example: projects/[project_id]/regions/[region]/clusters/[cluster_name]

io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-ExecutionConfiguration

Properties

kmsKey
  • Type: string
  • Dynamic: ✔️
  • Required: ❌

The Cloud KMS key to use for encryption.

networkTags
  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required: ❌

Tags used for network traffic control.

networkUri
  • Type: string
  • Dynamic: ✔️
  • Required: ❌

Network URI to connect the workload to.

serviceAccountEmail
  • Type: string
  • Dynamic: ✔️
  • Required: ❌

Service account used to execute the workload.

subnetworkUri
  • Type: string
  • Dynamic: ✔️
  • Required: ❌

Subnetwork URI to connect the workload to.
