Error Handling

Handle errors with automatic retries and notifications.

Failure is inevitable. Kestra offers automatic retries and error handling to help you build resilient workflows.

Error handling

By default, if any task fails, the execution stops and is marked as failed. For more control over error handling, you can add errors tasks, AllowFailure tasks, or automatic retries.
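As a minimal sketch of the AllowFailure option mentioned above, the flow below wraps a failing task so that the tasks after it still run. The flow and task ids are hypothetical, and the sketch assumes the io.kestra.plugin.core.flow.AllowFailure and io.kestra.plugin.core.log.Log task types are available in your Kestra version.

```yaml
id: allow_failure_example   # hypothetical flow id
namespace: company.team

tasks:
  # Child tasks listed here may fail without stopping the execution
  - id: tolerated
    type: io.kestra.plugin.core.flow.AllowFailure
    tasks:
      - id: fail
        type: io.kestra.plugin.core.execution.Fail

  # Still runs, even though the task above failed
  - id: still_runs
    type: io.kestra.plugin.core.log.Log
    message: The previous failure was tolerated
```

With this setup, the execution is expected to finish with a warning rather than a failure, so it is worth confirming that behavior in your own instance before relying on it for alerting.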

The errors property allows you to execute one or more actions before terminating the flow (e.g., sending an email or a Slack message to your team). The property is named errors because it is triggered when errors occur within a flow.

You can implement error handling at the flow level or namespace level:

  1. Flow-level: Useful to implement custom alerting for a specific flow or task. This can be accomplished by adding errors tasks.
  2. Namespace-level: Useful to send a notification for any failed Execution within a given namespace. This approach allows you to implement centralized error handling for all flows within a given namespace.

Flow-level error handling using errors

The errors property of a flow accepts a list of tasks to execute when an error occurs. You can add as many tasks as you want, and they will be executed sequentially.

The following example flow always fails and, when it does, sends a Slack alert via the SlackIncomingWebhook task listed under errors.

yaml
id: unreliable_flow
namespace: company.team

tasks:
  - id: fail
    type: io.kestra.plugin.core.execution.Fail

errors:
  - id: alert_on_failure
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    url: "{{ secret('SLACK_WEBHOOK') }}" # https://hooks.slack.com/services/xyz/xyz/xyz
    messageText: "Failure alert for flow {{ flow.namespace }}.{{ flow.id }} with ID {{ execution.id }}"

Check the error handling page for more details.
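Because errors accepts a list, several actions can be chained. The sketch below is a hypothetical variation of the flow above that writes a log entry before sending the Slack alert. The flow id and the log_failure task are illustrative, and the sketch assumes the io.kestra.plugin.core.log.Log task type is available.

```yaml
id: unreliable_flow_with_log   # hypothetical flow id
namespace: company.team

tasks:
  - id: fail
    type: io.kestra.plugin.core.execution.Fail

errors:
  # Runs first: record the failure in the execution logs
  - id: log_failure
    type: io.kestra.plugin.core.log.Log
    message: "Flow {{ flow.namespace }}.{{ flow.id }} failed in execution {{ execution.id }}"

  # Runs second: notify the team on Slack
  - id: alert_on_failure
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    url: "{{ secret('SLACK_WEBHOOK') }}"
    messageText: "Failure alert for flow {{ flow.namespace }}.{{ flow.id }} with ID {{ execution.id }}"
```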


Namespace-level error handling using a Flow trigger

To get notified on a workflow failure, you can leverage Kestra's built-in notification tasks, such as the Slack tasks used on this page (the list keeps growing with new releases).

For centralized namespace-level alerting, we recommend adding a dedicated monitoring workflow with one of these notification tasks and a Flow trigger. Below is an example workflow that automatically sends a Slack alert as soon as any flow in the namespace company.team fails or finishes with warnings.

yaml
id: failure_alert
namespace: system

tasks:
  - id: send
    type: io.kestra.plugin.notifications.slack.SlackExecution
    url: "{{ secret('SLACK_WEBHOOK') }}"
    channel: "#general"
    executionId: "{{trigger.executionId}}"

triggers:
  - id: listen
    type: io.kestra.plugin.core.trigger.Flow
    conditions:
      - type: io.kestra.plugin.core.condition.ExecutionStatus
        in:
          - FAILED
          - WARNING
      - type: io.kestra.plugin.core.condition.ExecutionNamespace
        namespace: company.team
        prefix: true

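Adding this flow ensures you receive a Slack alert whenever an execution in the company.team namespace fails or finishes with warnings. Because prefix is set to true, namespaces whose names start with company.team are covered as well.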



Retries

When working with external systems, transient errors are common. For example, a file may not be available yet, an API might be temporarily unreachable, or a database can be under maintenance. In such cases, retries can often resolve the issue without human intervention.

Configuring retries

Each task can be retried a certain number of times and in a specific way. Use the retry property with the desired type of retry.

The following types of retries are currently supported:

  • Constant: The task is retried at a fixed interval (every X seconds/minutes/hours/days).
  • Exponential: The task is retried with exponential backoff, i.e., the interval between retry attempts grows exponentially after each failed attempt.
  • Random: The task is retried after a random delay, i.e., a random time interval between retry attempts.
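To make the difference between the retry types more concrete, here is a hedged sketch of an exponential and a random retry configuration. The flow id and task ids are hypothetical. The property names maxInterval, delayFactor, and minInterval are assumptions based on the retry schema as commonly documented, so check them against the retry reference for your Kestra version.

```yaml
id: retry_types   # hypothetical flow id
namespace: company.team

tasks:
  - id: exponential_backoff
    type: io.kestra.plugin.core.http.Request
    uri: https://dummyjson.com/products
    retry:
      type: exponential
      interval: PT1S       # delay before the first retry
      maxInterval: PT30S   # assumed property: upper bound on the delay
      delayFactor: 2.0     # assumed property: multiplier applied after each attempt
      maxAttempts: 5

  - id: random_delay
    type: io.kestra.plugin.core.http.Request
    uri: https://dummyjson.com/products
    retry:
      type: random
      minInterval: PT1S    # assumed property: shortest possible delay
      maxInterval: PT10S   # assumed property: longest possible delay
      maxAttempts: 5
```

Exponential backoff is usually the safer choice for rate-limited APIs, since it reduces pressure on the remote system with each failed attempt, while random delays help avoid many tasks retrying at the same moment.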

In the following example, the task is retried up to 5 times within a total duration of 1 minute, with a constant 2-second interval between attempts.

yaml
id: retries
namespace: company.team

tasks:
  - id: fail_four_times
    type: io.kestra.plugin.scripts.shell.Commands
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - 'if [ "{{ taskrun.attemptsCount }}" -eq 4 ]; then exit 0; else exit 1; fi'
    retry:
      type: constant
      interval: PT2S
      maxAttempts: 5
      maxDuration: PT1M
      warningOnRetry: false

errors:
  - id: will_never_run
    type: io.kestra.plugin.core.debug.Return
    format: This will never be executed as retries will fix the issue

Adding a retry configuration to our tutorial workflow

Let's return to the example from the Fundamentals section and add a retry configuration to the api task. API calls are prone to transient errors, so we will retry that task up to 10 times, for at most 1 hour in total, with a constant interval of 20 seconds between retry attempts.

yaml
id: getting_started
namespace: company.team

tasks:
  - id: api
    type: io.kestra.plugin.core.http.Request
    uri: https://dummyjson.com/products
    retry:
      type: constant
      interval: PT20S
      maxDuration: PT1H
      maxAttempts: 10
      warningOnRetry: true
