PySparkSubmit
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
Submit an Apache PySpark batch workload.
Examples
id: "py_spark_submit"
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
name: test-pyspark
region: europe-west3
Properties
mainPythonFileUri
- Type: string
- Dynamic: ✔️
- Required: ✔️
The HCFS URI of the main Python file to use as the Spark driver. Must be a .py file.
Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. The URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.
name
- Type: string
- Dynamic: ✔️
- Required: ✔️
The batch name.
region
- Type: string
- Dynamic: ✔️
- Required: ✔️
The region
archiveUris
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. Each URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.
args
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
The arguments to pass to the driver.
Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
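As an illustration of the note above, the following sketch passes a positional argument through args and sets a Spark option through runtime.properties instead of a --conf argument. The batch name, the argument value, and the property value are hypothetical, and the sketch assumes the driver script reads its first positional argument.

```yaml
id: "py_spark_submit_with_args"
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
name: test-pyspark-args
region: europe-west3
# Assumption: the driver script reads one positional argument
args:
  - "100"
# Spark settings go in runtime.properties, not in args as --conf
runtime:
  properties:
    spark.executor.instances: "2"
```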
execution
- Type: AbstractBatch-ExecutionConfiguration
- Dynamic: ✔️
- Required: ❌
Execution configuration for a workload.
fileUris
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
HCFS URIs of files to be placed in the working directory of each executor.
Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. Each URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.
jarFileUris
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.
Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. Each URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.
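The three dependency properties (archiveUris, fileUris, and jarFileUris) can be combined in one task. The sketch below is only an illustration: the bucket name my-bucket and every object path in it are hypothetical placeholders, not resources that ship with the plugin.

```yaml
id: "py_spark_submit_with_deps"
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
name: test-pyspark-deps
region: europe-west3
# Hypothetical archive, extracted into each executor's working directory
archiveUris:
  - gs://my-bucket/deps/pyenv.tar.gz
# Hypothetical plain file, copied into each executor's working directory
fileUris:
  - gs://my-bucket/config/settings.json
# Hypothetical jar, added to the driver and task classpath
jarFileUris:
  - gs://my-bucket/jars/my-connector.jar
```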
peripherals
- Type: AbstractBatch-PeripheralsConfiguration
- Dynamic: ✔️
- Required: ❌
Peripherals configuration for a workload.
projectId
- Type: string
- Dynamic: ✔️
- Required: ❌
The GCP project ID.
runtime
- Type: AbstractBatch-RuntimeConfiguration
- Dynamic: ✔️
- Required: ❌
Runtime configuration for a workload.
scopes
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
- Default: [https://www.googleapis.com/auth/cloud-platform]
The GCP scopes to be used.
serviceAccount
- Type: string
- Dynamic: ✔️
- Required: ❌
The GCP service account key.
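A possible way to supply the connection properties is sketched below. It assumes the service account key is stored as a Kestra secret and read with the secret() function; the secret name GCP_SERVICE_ACCOUNT_JSON and the project ID my-gcp-project are hypothetical.

```yaml
id: "py_spark_submit_with_credentials"
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
name: test-pyspark-credentials
region: europe-west3
# Hypothetical project ID
projectId: my-gcp-project
# Assumption: the key is stored as a secret named GCP_SERVICE_ACCOUNT_JSON
serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT_JSON') }}"
# Same value as the documented default, shown explicitly
scopes:
  - https://www.googleapis.com/auth/cloud-platform
```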
Outputs
state
- Type: string
- Dynamic: ❓
- Required: ❌
- Possible Values:
STATE_UNSPECIFIED
PENDING
RUNNING
CANCELLING
CANCELLED
SUCCEEDED
FAILED
UNRECOGNIZED
The state of the batch.
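The state output can be read by a later task in the same flow through an outputs expression. The sketch below assumes a flow-level tasks list and a core Log task; the Log task type shown depends on the Kestra version in use and is an assumption, so adjust it to match your installation.

```yaml
tasks:
  - id: py_spark_submit
    type: io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit
    mainPythonFileUri: gs://spark-jobs-kestra/pi.py
    name: test-pyspark
    region: europe-west3

  # Assumption: the Log task type name may differ between Kestra versions
  - id: log_state
    type: io.kestra.plugin.core.log.Log
    message: "Batch ended in state {{ outputs.py_spark_submit.state }}"
```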
Definitions
io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-PeripheralsConfiguration
Properties
metastoreService
- Type: string
- Dynamic: ✔️
- Required: ❌
Resource name of an existing Dataproc Metastore service.
Example:
projects/[project_id]/locations/[region]/services/[service_id]
sparkHistoryServer
- Type: AbstractBatch-SparkHistoryServerConfiguration
- Dynamic: ✔️
- Required: ❌
Spark History Server configuration for the workload. Its properties are listed under AbstractBatch-SparkHistoryServerConfiguration.
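A peripherals block using both properties might look like the sketch below. The resource names follow the formats given in the property examples on this page; the project, service, and cluster identifiers are hypothetical.

```yaml
id: "py_spark_submit_with_peripherals"
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
name: test-pyspark-peripherals
region: europe-west3
peripherals:
  # Hypothetical Dataproc Metastore service
  metastoreService: projects/my-gcp-project/locations/europe-west3/services/my-metastore
  sparkHistoryServer:
    # Hypothetical Dataproc cluster acting as the Spark History Server
    dataprocCluster: projects/my-gcp-project/regions/europe-west3/clusters/my-history-cluster
```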
io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-RuntimeConfiguration
Properties
containerImage
- Type: string
- Dynamic: ✔️
- Required: ❌
Optional custom container image for the job runtime environment.
If not specified, a default container image will be used.
properties
- Type: object
- SubType: string
- Dynamic: ✔️
- Required: ❌
Properties used to configure the workload execution (map of key/value pairs).
version
- Type: string
- Dynamic: ✔️
- Required: ❌
Version of the batch runtime.
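The runtime properties above could be set together as in this sketch. The runtime version, the container image path, and the Spark property value are illustrative assumptions; check which runtime versions and properties your Dataproc environment accepts.

```yaml
id: "py_spark_submit_with_runtime"
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
name: test-pyspark-runtime
region: europe-west3
runtime:
  # Assumption: an available batch runtime version
  version: "2.1"
  # Hypothetical custom image; omit to use the default container image
  containerImage: europe-west3-docker.pkg.dev/my-gcp-project/my-repo/pyspark-custom:latest
  # Map of string keys to string values
  properties:
    spark.executor.cores: "4"
```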
io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-SparkHistoryServerConfiguration
Properties
dataprocCluster
- Type: string
- Dynamic: ✔️
- Required: ❌
Resource name of an existing Dataproc Cluster to act as a Spark History Server for the workload.
Example:
projects/[project_id]/regions/[region]/clusters/[cluster_name]
io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-ExecutionConfiguration
Properties
kmsKey
- Type: string
- Dynamic: ✔️
- Required: ❌
The Cloud KMS key to use for encryption.
networkTags
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
Tags used for network traffic control.
networkUri
- Type: string
- Dynamic: ✔️
- Required: ❌
Network URI to connect the workload to.
serviceAccountEmail
- Type: string
- Dynamic: ✔️
- Required: ❌
Service account used to execute workload.
subnetworkUri
- Type: string
- Dynamic: ✔️
- Required: ❌
Subnetwork URI to connect workload to.
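An execution block could be written as in the sketch below. All values are hypothetical. The sketch sets only subnetworkUri and leaves networkUri out, on the assumption that a workload is attached through one or the other rather than both.

```yaml
id: "py_spark_submit_with_execution"
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
name: test-pyspark-execution
region: europe-west3
execution:
  # Hypothetical service account that runs the workload
  serviceAccountEmail: spark-runner@my-gcp-project.iam.gserviceaccount.com
  # Hypothetical subnetwork in the same region as the batch
  subnetworkUri: projects/my-gcp-project/regions/europe-west3/subnetworks/my-subnet
  # Hypothetical tag used by firewall rules
  networkTags:
    - spark-batch
```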