Retries

Retries handle transient failures in your workflows.

They are defined at the task level and can be configured to retry a task a set number of times, with a set delay between retries, or both.

What are retries

Kestra provides a task retry functionality. This makes it possible to add retry behavior for any failed task run based on configurations in the flow description.

A retry on a task run creates a new task run attempt.

Example

The following example defines a retry for the retry_sample task with a maximum of 5 attempts, one every 15 minutes:

yaml

- id: retry_sample
  type: io.kestra.plugin.core.log.Log
  message: my output for task {{task.id}}
  timeout: PT10M
  retry:
    type: constant
    maxAttempts: 5
    interval: PT15M

In the next example, the task is retried up to four times at 0.25-second (250 ms) intervals, and maxDuration caps the total duration of all attempts at 1 minute.

The shell command exits successfully only when {{ taskrun.attemptsCount }} equals 4, so the task fails four times and succeeds on the 5th attempt. The same expression lets you track how many times the task has been retried.

yaml

id: retry
namespace: company.team
description:
  This flow will be retried 4 times and will succeed at the 5th attempt

tasks:
- id: failed
  type: io.kestra.plugin.scripts.shell.Commands
  taskRunner:
    type: io.kestra.plugin.core.runner.Process
  commands:
  - 'if [ "{{taskrun.attemptsCount}}" -eq 4 ]; then exit 0; else exit 1; fi'
  retry:
    type: constant
    interval: PT0.25S
    maxAttempts: 5
    maxDuration: PT1M
    warningOnRetry: true

errors:
  - id: never_happen
    type: io.kestra.plugin.core.debug.Return
    format: Never happened {{task.id}}

Timeout vs. Max Retry Duration

It's important to distinguish between the timeout and retry.maxDuration settings:

timeout: This setting applies to each individual task run attempt, including the initial attempt and any subsequent retries. If a single attempt exceeds the specified timeout duration, it is immediately marked as failed, and retries are initiated if configured.
retry.maxDuration: This sets the total maximum duration allowed for the entire task execution. It includes the time taken by the initial attempt, all retry attempts, and the intervals between retries. If the cumulative duration exceeds maxDuration, the task is marked as failed, and no further retries are attempted.

Example: Consider a task with a timeout of 10 minutes and a retry.maxDuration of 30 minutes:

Any single attempt (initial or retry) that runs longer than 10 minutes will be terminated and marked as failed.
The entire task execution (from the start of the initial attempt through all retries) will be aborted if it exceeds 30 minutes in total, even if individual attempts are under their 10-minute timeout. Ensure that retry.interval is smaller than maxDuration, otherwise retries may not be executed as expected.
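
The scenario above could be written as the following task definition. This is only a sketch: the task id long_running_job and the sleep command are hypothetical placeholders, and the interval and maxAttempts values are assumptions chosen so that the retries fit within the 30-minute budget.

```yaml
- id: long_running_job  # hypothetical task id
  type: io.kestra.plugin.scripts.shell.Commands
  taskRunner:
    type: io.kestra.plugin.core.runner.Process
  commands:
    - sleep 5  # placeholder command
  timeout: PT10M  # limit for each individual attempt
  retry:
    type: constant
    interval: PT5M  # assumed value, kept well below maxDuration
    maxAttempts: 3  # assumed value
    maxDuration: PT30M  # limit for all attempts and intervals combined
```

With these values, a single attempt that hangs is cut off after 10 minutes, while the 30-minute maxDuration bounds the task as a whole.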

Retry options for all retry types

name	type	description
`type`	`string`	Retry behavior to apply. Can be one of `constant`, `exponential`, `random`.
`maxAttempts`	`integer`	Maximum number of attempts before the system stops retrying.
`maxDuration`	`Duration`	Maximum total duration allowed for the task, including the initial attempt and all retries. Once exceeded, the task is marked as failed and no further retries are attempted.
`warningOnRetry`	`Boolean`	Flags the execution as a warning if this task was retried at least once. Default: `false`.

Some of the options above take a duration. Durations are expressed in the ISO 8601 duration format, although the week, month, and year designators are not supported. Here are some examples:

duration	description
PT0.250S	250 milliseconds delay
PT2S	2 seconds delay
PT1M	1 minute delay
PT3H30M	3 and a half hours delay
P6DT4H	6 days and 4 hours delay

Retry types

`constant`

This establishes constant retry times: if the interval property is set to 10 minutes, it retries every 10 minutes.

name	type	description
`interval`	`Duration`	Duration between each retry.

`exponential`

This establishes retry behavior that waits longer between each successive retry, e.g., 1s, 5s, 15s, ...

name	type	description
`interval`	`Duration`	Duration between each retry
`maxInterval`	`Duration`	Maximum duration between each retry
`delayFactor`	`Double`	Multiplier applied to the `interval` between retries; the default is 2. For example, with an interval of 30s and a delay factor of 2, the successive delays are 30s, 1m, 2m, ..., so retries happen at 30s, 1m30s, 3m30s, ...
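
The options in this table could be combined as in the sketch below, which follows the 30-second, factor-2 example from the delayFactor row. The task id is hypothetical, and the maxInterval and maxAttempts values are assumptions chosen for illustration.

```yaml
- id: exponential_retry_sample  # hypothetical task id
  type: io.kestra.plugin.core.log.Log
  message: my output for task {{task.id}}
  retry:
    type: exponential
    interval: PT30S    # delay before the first retry
    delayFactor: 2     # each following delay is doubled: 30s, 1m, 2m, ...
    maxInterval: PT5M  # assumed upper bound on the delay between retries
    maxAttempts: 5     # assumed value
```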

`random`

This establishes retries with a random delay between minimum and maximum limits.

name	type	description
`minInterval`	`Duration`	Minimum duration between each retry
`maxInterval`	`Duration`	Maximum duration between each retry
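
A random retry could look like the following sketch. The task id and all three values are assumptions for illustration; each delay is expected to fall somewhere between minInterval and maxInterval.

```yaml
- id: random_retry_sample  # hypothetical task id
  type: io.kestra.plugin.core.log.Log
  message: my output for task {{task.id}}
  retry:
    type: random
    minInterval: PT1S   # assumed lower bound for the random delay
    maxInterval: PT10S  # assumed upper bound for the random delay
    maxAttempts: 5      # assumed value
```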

Configuring retries globally

You can also configure retries globally for all tasks across all flows by using the plugins configuration:

yaml

kestra:
  plugins:
    configurations:
      - type: io.kestra
        values:
          retry:
            type: constant
            maxAttempts: 3
            interval: PT30S

This configuration applies a constant retry with a maximum of 3 attempts every 30 seconds to all tasks across all flows.

Flow-level retries

You can also set a flow-level retry policy to restart the execution if any task fails. The retry behavior is customizable; you can choose to:

Create a new execution: CREATE_NEW_EXECUTION
Retry the failed task only: RETRY_FAILED_TASK

Flow-level retries are particularly useful when you want to retry the entire flow if any task fails. This way, you don't need to configure retries for each task individually.

Here's an example of how you can set a flow-level retry policy:

yaml

id: flow_level_retry
namespace: company.team

retry:
  maxAttempts: 3
  behavior: CREATE_NEW_EXECUTION # RETRY_FAILED_TASK
  type: constant
  interval: PT1S

tasks:
  - id: fail_1
    type: io.kestra.plugin.core.execution.Fail
    allowFailure: true

  - id: fail_2
    type: io.kestra.plugin.core.execution.Fail
    allowFailure: false

The behavior property can be set to CREATE_NEW_EXECUTION or RETRY_FAILED_TASK. The attempt number of the execution is incremented only with the CREATE_NEW_EXECUTION behavior. With RETRY_FAILED_TASK, only the failed task run is restarted, which increments the attempt number of the task run rather than that of the execution.

Apart from the behavior property, the retry policy is identical to the one already known from task retries.
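
For comparison, a flow that uses the RETRY_FAILED_TASK behavior might look like the sketch below. The flow id and task id are hypothetical; the retry values are copied from the example above.

```yaml
id: flow_level_retry_failed_task  # hypothetical flow id
namespace: company.team

retry:
  maxAttempts: 3
  behavior: RETRY_FAILED_TASK  # restart only the failed task run
  type: constant
  interval: PT1S

tasks:
  - id: fail  # hypothetical task id
    type: io.kestra.plugin.core.execution.Fail
```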

Note: Flow-level retries also restart Subflows as new executions.

Retry vs. Restart vs. Replay

Automatic vs. manual

Retries ensure that failed task runs are automatically rerun within the same Execution. Apart from retries, defined within your flow code, you can also manually rerun the flow from the Flow Execution page in the UI using the Restart or Replay buttons.

Restart vs. Replay

While Restart reruns failed tasks within the current Execution (i.e., without creating a new execution), a Replay results in a completely new run with a different Execution ID than the initial run.

When you replay an Execution, a new execution gets created for the same flow. However, you can still track which Execution triggered this new run thanks to the Original Execution field.

Replay can be executed from any task, even if that task executed successfully. Note, however, that when you trigger a replay from a specific failed task, it still results in a new Execution that runs all tasks downstream of your chosen start task.

When you want to rerun only failed tasks, use Restart.

Summary: Retries vs. Restart vs. Replay

The table below summarizes the differences between a retry, restart, and replay.

Concept	Flow or task level	Automatic or manual	Does it create a new execution?
Retry	Task level	Automatic	No, it only reruns a given task within the same Execution. Each retry results in a new attempt number, allowing you to see how many times a given task run was retried.
Restart	Flow level	Manual	No, it only reruns all failed tasks within the same Execution. It's meant to handle unanticipated, transient failures. The UI shows a new attempt number for all task runs that were restarted.
Replay	Either flow or task level	Manual	Yes. You can pick an arbitrary step from which a new execution should be triggered. If you select a task in the middle that needs outputs from a previous task, its output is retrieved from cache.
