Available on: Beta>= 0.20.0

Assert that your workflows meet SLAs.

What is an SLA

An SLA (Service Level Agreement) is a core property of a flow that defines a behavior that should be triggered if the flow runs for too long or doesn't satisfy the assertion defined in the SLA.

SLA types

Currently, Kestra supports the following SLA types:

  1. MAX_DURATION — the maximum allowed execution duration before the SLA is breached
  2. EXECUTION_ASSERTION — an assertion defined by a Pebble expression that must be met during the execution. If the assertion doesn't hold true, the SLA is breached.

How to use SLAs

SLAs are defined usig the sla property at the root of a flow and they declare the desired state that must be met during executions of the flow.

MAX_DURATION

If a workflow execution exceeds the expected duration, an SLA can trigger corrective actions e.g. cancelling such execution.

The following SLA cancels an execution if it takes more than 8 hours:

yaml
id: sla_example
namespace: company.team

sla:
  - id: maxDuration
    type: MAX_DURATION
    duration: PT8H
    behavior: CANCEL
    labels:
      sla: miss
      reason: durationExceeded

tasks:
  - id: punctual
    type: io.kestra.plugin.core.log.Log
    message: Workflow started, monitoring SLA compliance

  - id: sleepyhead
    type: io.kestra.plugin.core.flow.Sleep
    duration: PT9H

  - id: never_executed_task
    type: io.kestra.plugin.core.log.Log
    message: This task will never start because the SLA was breached

EXECUTION_ASSERTION

An SLA can also be defined by an assertion that must be met during the execution. If the assertion doesn't hold true, the SLA is breached.

The following SLA fails if the output of mytask is not equal to expected output:

yaml
id: sla_demo
namespace: company.team

sla:
  - id: assert_output
    type: EXECUTION_ASSERTION
    assert: "{{ outputs.mytask.value == 'expected output' }}"
    behavior: FAIL
    labels:
      sla: miss
      reason: outputMismatch

tasks:
  - id: mytask
    type: io.kestra.plugin.core.debug.Return
    format: expected output

SLA behavior

The behavior property of an SLA defines the action to take when the SLA is breached. The following behaviors are supported:

  1. CANCEL — cancels the execution
  2. FAIL — fails the execution
  3. NONE — logs a message.

On top of that behavior, each breached SLA can set labels that can be used to filter executions or to trigger specific actions.

Alerts on SLA breaches

Let's say that you want to receive a Slack alert anytime an SLA is breached. Using the Flow trigger, you can react to cancelled or failed Executions that were marked with the sla: miss label:

yaml
id: sla_miss_alert
namespace: system

tasks:
  - id: send_alert
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    url: "{{secret('SLACK_WEBHOOK')}}"
    payload: |
      {
        "text": "SLA breached for flow `{{trigger.namespace}}.{{trigger.flowId}}` with ID `{{trigger.executionId}}`"
      }

triggers:
  - id: alert_on_failure
    type: io.kestra.plugin.core.trigger.Flow
    labels:
      sla: miss
    states:
      - FAILED
      - WARNING
      - CANCELLED

Was this page helpful?