Errors and Retries​Errors and ​Retries

Handle errors with automatic retries and notifications.

Failure is inevitable. Kestra provides automatic retries and error handling to help you build resilient workflows.

Error handling

By default, a failure of any task will stop the execution and will mark it as failed. For more control over error handling, you can add errors tasks, AllowFailure tasks, or automatic retries.

The errors property allows you to execute one or more actions before terminating the flow, e.g. sending an email or a Slack message to your team. The property is named errors because they are triggered upon errors within your flow.

You can implement error handling at the flow-level or on a namespace-level.

  1. Flow-level: Useful to implement custom alerting for a specific flow or task. This can be accomplished by adding errors tasks.
  2. Namespace-level: Useful to send a notification for any failed Execution within a given namespace. This approach allows you to implement centralized error handling for all flows within a given namespace.

Flow-level error handling using errors

The errors is a property of a flow that accepts a list of tasks that will be executed when an error occurs. You can add as many tasks as you want, and they will be executed sequentially.

The example below sends a flow-level failure alert via Slack using the SlackIncomingWebhook task defined using the errors property.

yaml
id: unreliable_flow
namespace: prod

tasks:
  - id: fail
    type: io.kestra.plugin.scripts.shell.Commands
    runner: PROCESS
    commands:
      - exit 1

errors:
  - id: alert_on_failure
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    url: "{{ secret('SLACK_WEBHOOK') }}" # https://hooks.slack.com/services/xyz/xyz/xyz
    payload: |
      {
        "channel": "#alerts",
        "text": "Failure alert for flow {{ flow.namespace }}.{{ flow.id }} with ID {{ execution.id }}"
      }

Check the error handling page for more details.


Namespace-level error handling using a Flow trigger

To get notified on a workflow failure, you can leverage Kestra's built-in notification tasks, including among others (the list keeps growing with new releases):

For a centralized namespace-level alerting, we recommend adding a dedicated monitoring workflow with one of the above mentioned notification tasks and a Flow trigger. Below is an example workflow that automatically sends a Slack alert as soon as any flow in the namespace prod fails or finishes with warnings.

yaml
id: failure_alert
namespace: prod.monitoring

tasks:
  - id: send
    type: io.kestra.plugin.notifications.slack.SlackExecution
    url: "{{ secret('SLACK_WEBHOOK') }}"
    channel: "#general"
    executionId: "{{trigger.executionId}}"

triggers:
  - id: listen
    type: io.kestra.core.models.triggers.types.Flow
    conditions:
      - type: io.kestra.core.models.conditions.types.ExecutionStatusCondition
        in:
          - FAILED
          - WARNING
      - type: io.kestra.core.models.conditions.types.ExecutionNamespaceCondition
        namespace: prod
        prefix: true

Adding this single flow will ensure that you receive a Slack alert on any flow failure in the prod namespace. Here is an example alert notification:

alert notification


Retries

When working with external systems, transient errors are common. For example, a file may not be available yet, an API might be temporarily unreachable, or a database can be under maintenance. In such cases, retries can automatically fix the issue without human intervention.

Configuring retries

Each task can be retried a certain number of times and in a specific way. Use the retry property with the desired type of retry.

The following types of retries are currently supported:

  • Constant: The task will be retried every X seconds/minutes/hours/days.
  • Exponential: The task will also be retried every X seconds/minutes/hours/days, but with an exponential backoff, i.e. an exponential time interval in between each retry attempt.
  • Random: The task will be retried every X seconds/minutes/hours/days with a random delay i.e. a random time interval in between each retry attempt.

In this example, we will retry the task 5 times up to 1 minute of a total task run duration, with a constant interval of 2 seconds between each retry attempt.

yaml
id: retries
namespace: dev

tasks:
  - id: fail_four_times
    type: io.kestra.plugin.scripts.shell.Commands
    runner: PROCESS
    commands:
      - 'if [ "{{ taskrun.attemptsCount }}" -eq 4 ]; then exit 0; else exit 1; fi'
    retry:
      type: constant
      interval: PT2S
      maxAttempt: 5
      maxDuration: PT1M
      warningOnRetry: false

errors:
  - id: will_never_run
    type: io.kestra.core.tasks.debugs.Return
    format: This will never be executed as retries will fix the issue

Adding a retry configuration to our tutorial workflow

Let's get back to our example from the Fundamentals section. We will add a retry configuration to the api task. API calls are prone to transient errors, so we will retry that task up to 10 times, for at most 1 hour of total duration, every 10 seconds (i.e. with a constant interval of 10 seconds in between retry attempts).

yaml
id: getting_started
namespace: dev

tasks:
  - id: api
    type: io.kestra.plugin.fs.http.Request
    uri: https://dummyjson.com/products
    retry:
      type: constant # type: string
      interval: PT20S # type: Duration
      maxDuration: PT1H # type: Duration
      maxAttempt: 10 # type: int
      warningOnRetry: true # type: boolean, default is false