
PythonSubmit
PythonSubmit
yaml
type: "io.kestra.plugin.spark.PythonSubmit"Examples
yaml
id: spark_python_submit
namespace: company.team
tasks:
- id: python_submit
type: io.kestra.plugin.spark.PythonSubmit
taskRunner:
type: io.kestra.plugin.scripts.runner.docker.Docker
networkMode: host
user: root
master: spark://localhost:7077
args:
- "10"
mainScript: |
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession
if __name__ == "__main__":
spark = SparkSession.builder.appName("PythonPi").getOrCreate()
partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions
def f(_: int) -> float:
x = random() * 2 - 1
y = random() * 2 - 1
return 1 if x ** 2 + y ** 2 <= 1 else 0
count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()
Properties
mainScript*Requiredstring
master*Requiredstring
appFilesobject
SubTypestring
argsarray
SubTypestring
configurationsobject
SubTypestring
containerImagestring
Default
apache/spark:4.0.1-java21-rdeployModestring
Default
CLIENTPossible Values
CLIENTCLUSTERenvobject
SubTypestring
namestring
pythonFilesobject
SubTypestring
runnerstring
Possible Values
PROCESSDOCKERsparkSubmitPathstring
Default
/opt/spark/bin/spark-submittaskRunnerNon-dynamic
Definitions
Run a task in a Docker container.
type*Requiredobject
configstringobject
cpu
io.kestra.plugin.scripts.runner.docker.Cpu
cpusnumberstring
credentials
Credentials for a private container registry.
authstring
identityTokenstring
passwordstring
registrystring
registryTokenstring
usernamestring
deletebooleanstring
Default
truedeviceRequestsarray
A request for devices to be sent to device drivers.
capabilitiesarray
SubTypearray
countintegerstring
deviceIdsarray
SubTypestring
driverstring
optionsobject
SubTypestring
entryPointarray
SubTypestring
Default
[
""
]extraHostsarray
SubTypestring
fileHandlingStrategystring
Default
VOLUMEPossible Values
MOUNTVOLUMEhoststring
killGracePeriodstring
Default
PT0SFormat
durationmemory
io.kestra.plugin.scripts.runner.docker.Memory
kernelMemorystring
memorystring
memoryReservationstring
memorySwapstring
memorySwappinessstring
oomKillDisablebooleanstring
networkModestring
portBindingsarray
SubTypestring
privilegedbooleanstring
pullPolicyobject
resumebooleanstring
Default
trueshmSizestring
userstring
versionstring
volumesarray
SubTypestring
waitbooleanstring
Default
trueTask runner that executes a task as a subprocess on the Kestra host.
type*Requiredobject
versionstring
Task runner that executes a task inside a job in Google Cloud Batch.
region*Requiredstring
type*Requiredobject
bucketstring
completionCheckIntervalstring
Default
PT5SFormat
durationcomputeResource
io.kestra.plugin.ee.gcp.runner.Batch-ComputeResource
bootDiskstring
cpustring
memorystring
deletebooleanstring
Default
trueentryPointarray
SubTypestring
impersonatedServiceAccountstring
lifecyclePoliciesarray
io.kestra.plugin.ee.gcp.runner.Batch-LifecyclePolicy
actionstring
Possible Values
ACTION_UNSPECIFIEDRETRY_TASKFAIL_TASKUNRECOGNIZEDactionCondition
io.kestra.plugin.ee.gcp.runner.Batch-LifecyclePolicyAction
exitCodesarray
SubTypeinteger
machineTypestring
Default
e2-mediummaxCreateJobRetryCountintegerstring
Default
2maxRetryCountinteger
Minimum
>= 0Maximum
<= 10networkInterfacesarray
io.kestra.plugin.ee.gcp.runner.Batch-NetworkInterface
network*Requiredstring
subnetworkstring
projectIdstring
reservationstring
resumebooleanstring
Default
truescopesarray
SubTypestring
Default
["https://www.googleapis.com/auth/cloud-platform"]serviceAccountstring
syncWorkingDirectorybooleanstring
Default
falseversionstring
waitForLogIntervalstring
Default
PT5SFormat
durationwaitUntilCompletionstring
Default
PT1HFormat
durationTask runner that executes a task inside a job in Google Cloud Run.
region*Requiredstring
type*Requiredobject
bucketstring
completionCheckIntervalstring
Default
PT5SFormat
durationdeletebooleanstring
Default
trueimpersonatedServiceAccount