CustomJob

type: "io.kestra.plugin.gcp.vertexai.CustomJob"

Start a Vertex AI custom job.

# Examples

id: "custom_job"
type: "io.kestra.plugin.gcp.vertexai.CustomJob"
projectId: my-gcp-project
region: europe-west1
displayName: Start Custom Job
spec:
  workerPoolSpecs:
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
    machineSpec:
      machineType: n1-standard-4
    replicaCount: 1

# Properties

# delete

  • Type: boolean
  • Dynamic: ✔️
  • Required: ✔️
  • Default: true

Delete the job at the end.

# displayName

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The job display name

# projectId

  • Type: string
  • Dynamic: ✔️
  • Required:

The GCP project id

# region

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The region

# scopes

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:
  • Default: [https://www.googleapis.com/auth/cloud-platform]

The GCP scopes to use.

# serviceAccount

  • Type: string
  • Dynamic: ✔️
  • Required:

The GCP service account key
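As a rough sketch, the key can be injected from a secret rather than pasted inline. This assumes your Kestra version provides the `secret()` function and that a secret named `GCP_SERVICE_ACCOUNT_JSON` (a hypothetical name) holds the JSON key. The `scopes` value shown is simply the documented default.

```yaml
id: "custom_job"
type: "io.kestra.plugin.gcp.vertexai.CustomJob"
projectId: my-gcp-project
region: europe-west1
displayName: Start Custom Job
# Assumption: GCP_SERVICE_ACCOUNT_JSON is a hypothetical secret holding the JSON key
serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT_JSON') }}"
# Documented default, written out explicitly for illustration
scopes:
  - https://www.googleapis.com/auth/cloud-platform
spec:
  workerPoolSpecs:
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
    machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
```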

# spec

The job specification

# wait

  • Type: boolean
  • Dynamic: ✔️
  • Required: ✔️
  • Default: true

Wait for the job to finish.

Waiting allows the task to capture the job status and logs.
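The `wait` and `delete` flags can be combined. The sketch below reuses the values from the example at the top of this page and keeps the job in Vertex AI after the run, which may be useful when inspecting it afterwards. Both flags default to `true`, so only `delete` actually changes behavior here.

```yaml
id: "custom_job"
type: "io.kestra.plugin.gcp.vertexai.CustomJob"
projectId: my-gcp-project
region: europe-west1
displayName: Start Custom Job
wait: true     # block until the job ends so its status and logs are captured
delete: false  # keep the job in Vertex AI instead of deleting it at the end
spec:
  workerPoolSpecs:
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
    machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
```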

# Outputs

# createDate

  • Type: string

Time when the CustomJob was created.

# endDate

  • Type: string

Time when the CustomJob ended.

# name

  • Type: string

Resource name of a CustomJob.

# state

  • Type: string
  • Possible Values:
    • JOB_STATE_UNSPECIFIED
    • JOB_STATE_QUEUED
    • JOB_STATE_PENDING
    • JOB_STATE_RUNNING
    • JOB_STATE_SUCCEEDED
    • JOB_STATE_FAILED
    • JOB_STATE_CANCELLING
    • JOB_STATE_CANCELLED
    • JOB_STATE_PAUSED
    • JOB_STATE_EXPIRED
    • JOB_STATE_UPDATING
    • UNRECOGNIZED

The detailed state of the job.

# updateDate

  • Type: string

Time when the CustomJob was most recently updated.
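The outputs listed above can be read by a later task through Kestra's `outputs` expression. The flow below is a minimal sketch: the flow `id`, the `namespace`, and the `log_state` task are hypothetical, and the Log task type is an assumption, since its fully qualified name differs between Kestra versions.

```yaml
id: vertexai_custom_job_flow   # hypothetical flow id
namespace: dev                 # hypothetical namespace
tasks:
  - id: custom_job
    type: io.kestra.plugin.gcp.vertexai.CustomJob
    projectId: my-gcp-project
    region: europe-west1
    displayName: Start Custom Job
    spec:
      workerPoolSpecs:
        - containerSpec:
            imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
          machineSpec:
            machineType: n1-standard-4
          replicaCount: 1
  - id: log_state
    # Assumption: core Log task; check the type name for your Kestra version
    type: io.kestra.core.tasks.log.Log
    message: "Job {{ outputs.custom_job.name }} ended in state {{ outputs.custom_job.state }}"
```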

# Definitions

# ContainerSpec

# args

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

The arguments to be passed when starting the container.

# commands

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:

The command to be invoked when the container is started.

It overrides the entrypoint instruction in Dockerfile when provided.

# imageUri

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The URI of a container image in the Container Registry that is to be run on each worker replica.

The image must be hosted on Google Container Registry, for example: gcr.io/my-gcp-project/my-dir/my-image:latest
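A sketch of a `containerSpec` that overrides the image entrypoint and passes arguments. The script name `train.py` and the `--epochs` flag are hypothetical and depend entirely on what the image contains.

```yaml
spec:
  workerPoolSpecs:
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
      # Replaces the entrypoint defined in the Dockerfile
      commands:
        - python
        - train.py        # hypothetical script inside the image
      # Passed to the command above when the container starts
      args:
        - --epochs=10     # hypothetical argument
    machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
```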

# CustomJobSpec

# baseOutputDirectory

The Cloud Storage location to store the output of this job.

# enableWebAccess

  • Type: boolean
  • Dynamic:
  • Required:

Whether you want Vertex AI to enable interactive shell access to training containers.

# network

  • Type: string
  • Dynamic: ✔️
  • Required:

The full name of the Compute Engine network to which the Job should be peered.

For example, projects/12345/global/networks/myVPC.
Format is of the form projects/{project}/global/networks/{network}. Where {project} is a project number, as in 12345, and {network} is a network name.
To specify this field, you must have already configured VPC Network Peering for Vertex AI.
If this field is left unspecified, the job is not peered with any network.
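For illustration, a `spec` fragment that peers the job with the network named in the text above. It assumes VPC Network Peering has already been configured for that network.

```yaml
spec:
  # Format: projects/{project}/global/networks/{network}, with a project number
  network: projects/12345/global/networks/myVPC
  workerPoolSpecs:
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
    machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
```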

# scheduling

Scheduling options for a CustomJob.

# serviceAccount

  • Type: string
  • Dynamic: ✔️
  • Required:

The service account that the workload runs as.

Users submitting jobs must have act-as permission on this run-as account. If unspecified, the [Vertex AI Custom Code Service Agent](https://cloud.google.com/vertex-ai/docs/general/access-control#service-agents) for the CustomJob's project is used.

# tensorboard

  • Type: string
  • Dynamic: ✔️
  • Required:

The name of a Vertex AI Tensorboard resource to which this CustomJob will upload Tensorboard logs.

Format: projects/{project}/locations/{location}/tensorboards/{tensorboard}
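A sketch combining the run-as `serviceAccount` with a `tensorboard` resource. The service account email and the Tensorboard id `67890` are hypothetical placeholders; substitute values from your own project.

```yaml
spec:
  # Hypothetical run-as account for the workload
  serviceAccount: my-training-sa@my-gcp-project.iam.gserviceaccount.com
  # Hypothetical resource, following projects/{project}/locations/{location}/tensorboards/{tensorboard}
  tensorboard: projects/12345/locations/europe-west1/tensorboards/67890
  workerPoolSpecs:
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
    machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
```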

# workerPoolSpecs

  • Type: array
  • SubType: WorkerPoolSpec
  • Dynamic: ✔️
  • Required: ✔️
  • Min items: 1

The spec of the worker pools including machine type and Docker image.

All worker pools except the first one are optional and can be skipped.
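As a sketch, a job with a required first pool and one optional additional pool could look like this. The replica counts and the reuse of the same image in both pools are illustrative choices, not requirements.

```yaml
spec:
  workerPoolSpecs:
  # First pool: required
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
    machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
  # Second pool: optional, shown here with two replicas
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
    machineSpec:
      machineType: n1-standard-4
    replicaCount: 2
```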

# GcsDestination

# outputUriPrefix

  • Type: string
  • Dynamic:
  • Required: ✔️

Google Cloud Storage URI to output directory.

If the URI doesn't end with '/', a '/' will be automatically appended. The directory is created if it doesn't exist.
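A minimal sketch, assuming `baseOutputDirectory` in the job spec takes a GcsDestination as defined here. The bucket name and path are hypothetical.

```yaml
spec:
  baseOutputDirectory:
    # Hypothetical bucket and path; a trailing '/' is appended if missing
    outputUriPrefix: gs://my-bucket/custom-job-output/
  workerPoolSpecs:
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
    machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
```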

# WorkerPoolSpec

# containerSpec

The custom container task.

# discSpec

  • Type: DiscSpec
  • Dynamic:
  • Required:

The specification of the disk.

# machineSpec

The specification of a single machine.

# pythonPackageSpec

The Python package specification.

# replicaCount

  • Type: integer
  • Dynamic:
  • Required:

The number of worker replicas to use for this worker pool.

# PythonPackageSpec

# args

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required: ✔️

Command line arguments to be passed to the Python task.

# envs

  • Type: object
  • SubType: string
  • Dynamic: ✔️
  • Required: ✔️

Environment variables to be passed to the Python module.

Maximum limit is 100.

# packageUris

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required: ✔️

The Google Cloud Storage location of the Python package files which are the training program and its dependent packages.

The maximum number of package URIs is 100.
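A sketch of a worker pool that uses `pythonPackageSpec`, limited to the three properties documented here. The package URI, the argument, and the environment variable are hypothetical.

```yaml
spec:
  workerPoolSpecs:
  - pythonPackageSpec:
      # Hypothetical package location in Cloud Storage (at most 100 URIs)
      packageUris:
        - gs://my-bucket/packages/trainer-0.1.tar.gz
      # Hypothetical argument
      args:
        - --epochs=10
      # Hypothetical environment variable (at most 100 entries)
      envs:
        LOG_LEVEL: INFO
    machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
```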

# DiscSpec

# bootDiskSizeGb

  • Type: integer
  • Dynamic:
  • Required:
  • Default: 100

Size in GB of the boot disk.

# bootDiskType

  • Type: string
  • Dynamic:
  • Required:
  • Default: PD_SSD
  • Possible Values:
    • PD_SSD
    • PD_STANDARD

Type of the boot disk.
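For illustration, a worker pool that overrides both disk defaults. The size of 200 GB is an arbitrary example value.

```yaml
spec:
  workerPoolSpecs:
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
    machineSpec:
      machineType: n1-standard-4
    discSpec:
      bootDiskType: PD_STANDARD   # default is PD_SSD
      bootDiskSizeGb: 200         # default is 100
    replicaCount: 1
```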

# MachineSpec

# acceleratorCount

  • Type: integer
  • Dynamic:
  • Required:

The number of accelerators to attach to the machine.

# acceleratorType

  • Type: string
  • Dynamic: ✔️
  • Required:
  • Possible Values:
    • ACCELERATOR_TYPE_UNSPECIFIED
    • NVIDIA_TESLA_K80
    • NVIDIA_TESLA_P100
    • NVIDIA_TESLA_V100
    • NVIDIA_TESLA_P4
    • NVIDIA_TESLA_T4
    • NVIDIA_TESLA_A100
    • TPU_V2
    • TPU_V3
    • UNRECOGNIZED

The type of accelerator(s) that may be attached to the machine.

# machineType

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The type of the machine.

See the list of machine types supported for prediction.
See the list of machine types supported for custom training.
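A sketch of a `machineSpec` with one GPU attached. Whether a given accelerator is available for a machine type and region is something to check against the Vertex AI documentation; the pairing below is only an example.

```yaml
spec:
  workerPoolSpecs:
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
    machineSpec:
      machineType: n1-standard-4
      acceleratorType: NVIDIA_TESLA_T4   # one of the listed possible values
      acceleratorCount: 1
    replicaCount: 1
```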

# Scheduling

# restartJobOnWorkerRestart

  • Type: boolean
  • Dynamic:
  • Required: ✔️

Restarts the entire CustomJob if a worker gets restarted.

This feature can be used by distributed training jobs that are not resilient to workers leaving and joining a job.

# timeOut

  • Type: string
  • Dynamic:
  • Required: ✔️
  • Format: duration

The maximum job running time. The default is 7 days.
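A sketch of the `scheduling` block. It assumes the duration format is an ISO 8601 duration string, so `PT2H` would mean two hours; the chosen values are illustrative.

```yaml
spec:
  scheduling:
    # Assumption: ISO 8601 duration; the job is stopped after two hours
    timeOut: PT2H
    # Do not restart the whole job when a single worker restarts
    restartJobOnWorkerRestart: false
  workerPoolSpecs:
  - containerSpec:
      imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
    machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
```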