Available on: >= 0.20.0

Assert that your workflows meet SLAs.

What is an SLA

A Service Level Agreement (SLA) is a core property of a flow. It defines a condition that every execution must meet and the behavior to trigger when an execution runs too long or fails the defined assertion.

SLA types

Currently, Kestra supports the following SLA types:

  1. MAX_DURATION — the maximum allowed execution duration before the SLA is breached
  2. EXECUTION_ASSERTION — an assertion defined by a Pebble expression that must be met during the execution. If the assertion doesn't hold true, the SLA is breached.

How to use SLAs

SLAs are defined using the sla property at the root of a flow, and they declare the desired state that must be met during executions of the flow.

MAX_DURATION

If a workflow execution exceeds the expected duration, an SLA can trigger corrective actions, such as cancelling the execution.

The following SLA cancels an execution if it takes more than 8 hours:

```yaml
id: sla_example
namespace: company.team

sla:
  - id: maxDuration
    type: MAX_DURATION
    duration: PT8H
    behavior: CANCEL
    labels:
      sla: miss
      reason: durationExceeded

tasks:
  - id: punctual
    type: io.kestra.plugin.core.log.Log
    message: Workflow started, monitoring SLA compliance

  - id: sleepyhead
    type: io.kestra.plugin.core.flow.Sleep
    duration: PT9H

  - id: never_executed_task
    type: io.kestra.plugin.core.log.Log
    message: This task will never start because the SLA was breached
```

EXECUTION_ASSERTION

An SLA can also be based on an assertion that must hold true during execution. If the assertion fails, the SLA is breached.

The following SLA fails the execution if the output of mytask is not equal to the string 'expected output':

```yaml
id: sla_demo
namespace: company.team

sla:
  - id: assert_output
    type: EXECUTION_ASSERTION
    assert: "{{ outputs.mytask.value == 'expected output' }}"
    behavior: FAIL
    labels:
      sla: miss
      reason: outputMismatch

tasks:
  - id: mytask
    type: io.kestra.plugin.core.debug.Return
    format: expected output
```
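
To see the breach path, here is a minimal sketch of the same flow with a task output that does not match the assertion. The flow id sla_demo_breach and the changed format value are illustrative assumptions, not part of the original example. With behavior FAIL, the execution should end as failed and carry the two labels.

```yaml
id: sla_demo_breach   # hypothetical flow id, for illustration only
namespace: company.team

sla:
  - id: assert_output
    type: EXECUTION_ASSERTION
    # Same assertion as in sla_demo: the output of mytask must equal 'expected output'
    assert: "{{ outputs.mytask.value == 'expected output' }}"
    behavior: FAIL
    labels:
      sla: miss
      reason: outputMismatch

tasks:
  - id: mytask
    type: io.kestra.plugin.core.debug.Return
    # Deliberately different from the asserted value, so the SLA should be breached
    format: unexpected output
```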

SLA behavior

The behavior property of an SLA defines the action to take when the SLA is breached. The following behaviors are supported:

  1. CANCEL — cancels the execution
  2. FAIL — fails the execution
  3. NONE — logs a message

In addition, each breached SLA can set labels that can be used to filter executions or trigger follow-up actions.
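
As a sketch of the NONE behavior combined with labels, the flow below is meant to keep running past its duration limit and only be marked as a miss. The flow id, the short durations, and the label values are assumptions chosen so that a breach can be observed quickly. They are not recommended settings.

```yaml
id: sla_none_example   # hypothetical flow id, for illustration only
namespace: company.team

sla:
  - id: softDeadline
    type: MAX_DURATION
    duration: PT1M          # assumed short limit so the breach is easy to observe
    behavior: NONE          # log the breach without cancelling or failing
    labels:
      sla: miss
      reason: softDeadlineExceeded

tasks:
  - id: slow_step
    type: io.kestra.plugin.core.flow.Sleep
    duration: PT3M          # longer than the SLA limit above

  - id: after_breach
    type: io.kestra.plugin.core.log.Log
    message: Still running after the SLA limit was exceeded
```

Executions of this flow that exceeded the limit can then be found by filtering on the label sla: miss.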

Alerts on SLA breaches

To receive a Slack alert when an SLA is breached, use a Flow trigger that reacts to executions that ended in the FAILED, WARNING, or CANCELLED state and carry the label sla: miss:

```yaml
id: sla_miss_alert
namespace: system

tasks:
  - id: send_alert
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    url: "{{secret('SLACK_WEBHOOK')}}"
    messageText: "SLA breached for flow `{{trigger.namespace}}.{{trigger.flowId}}` with ID `{{trigger.executionId}}`"

triggers:
  - id: alert_on_failure
    type: io.kestra.plugin.core.trigger.Flow
    labels:
      sla: miss
    states:
      - FAILED
      - WARNING
      - CANCELLED
```
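
To try the trigger without a Slack webhook, the following variant logs the breach instead of sending a message. It reuses the trigger variables and trigger properties from the alert flow. The flow id, the task and trigger ids, and the narrowing to CANCELLED executions are assumptions for illustration.

```yaml
id: sla_miss_log   # hypothetical flow id, for illustration only
namespace: system

tasks:
  - id: log_breach
    type: io.kestra.plugin.core.log.Log
    # Same trigger variables as in the Slack alert flow
    message: "SLA breached for flow {{ trigger.namespace }}.{{ trigger.flowId }} with ID {{ trigger.executionId }}"

triggers:
  - id: on_sla_cancel
    type: io.kestra.plugin.core.trigger.Flow
    labels:
      sla: miss
    states:
      - CANCELLED   # assumed narrowing: react only to executions cancelled by an SLA
```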
