SparkSqlSubmit

```yaml
type: "io.kestra.plugin.gcp.dataproc.batches.SparkSqlSubmit"
```

Submit Apache Spark SQL queries as a batch workload.

Examples

```yaml
id: "spark_sql_submit"
type: "io.kestra.plugin.gcp.dataproc.batches.SparkSqlSubmit"
queryFileUri: 'gs://spark-jobs-kestra/foobar.sql'
name: test-sparksql
region: europe-west3
```
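
For context, the task would typically sit inside a flow and can combine the optional properties documented below; the namespace, project ID, and jar path here are illustrative placeholders:

```yaml
id: spark_sql_batch
namespace: company.team

tasks:
  - id: spark_sql_submit
    type: "io.kestra.plugin.gcp.dataproc.batches.SparkSqlSubmit"
    projectId: my-gcp-project                        # placeholder project ID
    region: europe-west3
    name: test-sparksql
    queryFileUri: 'gs://spark-jobs-kestra/foobar.sql'
    jarFileUris:
      - 'gs://spark-jobs-kestra/libs/my-udfs.jar'    # placeholder jar with custom UDFs
```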

Properties

name

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The batch name.

queryFileUri

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The HCFS URI of the script that contains Spark SQL queries to execute.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. The URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.
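
For illustration, each of the following forms would be accepted for a URI property such as queryFileUri; the paths themselves are hypothetical:

```yaml
queryFileUri: 'gs://my-bucket/queries/report.sql'      # object in Google Cloud Storage
# queryFileUri: 'hdfs:///queries/report.sql'           # file on the cluster's HDFS
# queryFileUri: 'file:///opt/queries/report.sql'       # local file on the cluster
```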

region

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The GCP region where the batch runs.

archiveUris

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. The URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.

args

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

The arguments to pass to the driver.

Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
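
A minimal sketch of passing driver arguments; the flag names below are hypothetical application arguments, not flags defined by this plugin:

```yaml
args:
  - "--date=2024-01-01"   # hypothetical application argument
  - "--dry-run"           # hypothetical application flag
```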

execution

Execution configuration for a workload.
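
The available fields are listed under the ExecutionConfiguration definition below. A minimal sketch, assuming placeholder project, network, and key names:

```yaml
execution:
  serviceAccountEmail: batch-runner@my-gcp-project.iam.gserviceaccount.com                    # placeholder
  networkTags:
    - dataproc-batch                                                                          # placeholder tag
  subnetworkUri: projects/my-gcp-project/regions/europe-west3/subnetworks/default            # placeholder
  kmsKey: projects/my-gcp-project/locations/europe-west3/keyRings/my-ring/cryptoKeys/my-key  # placeholder
```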

fileUris

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

HCFS URIs of files to be placed in the working directory of each executor.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. The URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.

jarFileUris

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.

Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. The URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.

peripherals

Peripherals configuration for a workload.
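
The available fields are listed under the PeripheralsConfiguration definition below. A minimal sketch with placeholder resource names:

```yaml
peripherals:
  metastoreService: projects/my-gcp-project/locations/europe-west3/services/my-metastore       # placeholder
  sparkHistoryServer:
    dataprocCluster: projects/my-gcp-project/regions/europe-west3/clusters/my-history-cluster  # placeholder
```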

projectId

  • Type: string
  • Dynamic: ✔️
  • Required:

The GCP project ID.

runtime

Runtime configuration for a workload.
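
The available fields are listed under the RuntimeConfiguration definition below. A minimal sketch; the image, version, and Spark property values are illustrative:

```yaml
runtime:
  containerImage: gcr.io/my-gcp-project/custom-spark:latest   # placeholder custom image
  version: "2.1"                                              # illustrative runtime version
  properties:
    spark.executor.instances: "4"                             # illustrative Spark property
```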

scopes

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:
  • Default: [https://www.googleapis.com/auth/cloud-platform]

The GCP scopes to be used.

serviceAccount

  • Type: string
  • Dynamic: ✔️
  • Required:

The GCP service account key.
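
In practice the key is usually injected rather than inlined. A minimal sketch, assuming the JSON key is stored in a Kestra secret named GCP_SERVICE_ACCOUNT (a hypothetical name):

```yaml
serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT') }}"
```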

Outputs

state

  • Type: string
  • Dynamic:
  • Required:
  • Possible Values:
    • STATE_UNSPECIFIED
    • PENDING
    • RUNNING
    • CANCELLING
    • CANCELLED
    • SUCCEEDED
    • FAILED
    • UNRECOGNIZED

The state of the batch.
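
Downstream tasks can reference this output through an expression. A minimal sketch, assuming the task ID spark_sql_submit from the example above and the core Log task (type path as in recent Kestra releases):

```yaml
- id: log_state
  type: io.kestra.plugin.core.log.Log
  message: "Batch finished in state {{ outputs.spark_sql_submit.state }}"
```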

Definitions

io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-PeripheralsConfiguration

Properties

metastoreService
  • Type: string
  • Dynamic: ✔️
  • Required:

Resource name of an existing Dataproc Metastore service.

Example: projects/[project_id]/locations/[region]/services/[service_id]

sparkHistoryServer

Spark History Server configuration for the workload. Its dataprocCluster field is described in the SparkHistoryServerConfiguration definition below.

io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-RuntimeConfiguration

Properties

containerImage
  • Type: string
  • Dynamic: ✔️
  • Required:

Optional custom container image for the job runtime environment.

If not specified, a default container image will be used.

properties
  • Type: object
  • SubType: string
  • Dynamic: ✔️
  • Required:

Properties used to configure the workload execution, as a map of key/value pairs.

version
  • Type: string
  • Dynamic: ✔️
  • Required:

Version of the batch runtime.

io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-SparkHistoryServerConfiguration

Properties

dataprocCluster
  • Type: string
  • Dynamic: ✔️
  • Required:

Resource name of an existing Dataproc Cluster to act as a Spark History Server for the workload.

Example: projects/[project_id]/regions/[region]/clusters/[cluster_name]

io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-ExecutionConfiguration

Properties

kmsKey
  • Type: string
  • Dynamic: ✔️
  • Required:

The Cloud KMS key to use for encryption.

networkTags
  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

Tags used for network traffic control.

networkUri
  • Type: string
  • Dynamic: ✔️
  • Required:

Network URI to connect the workload to.

serviceAccountEmail
  • Type: string
  • Dynamic: ✔️
  • Required:

Service account used to execute the workload.

subnetworkUri
  • Type: string
  • Dynamic: ✔️
  • Required:

Subnetwork URI to connect the workload to.