Available on: Beta>= 0.20.0
Assert that your workflows meet SLAs.
What is an SLA
An SLA (Service Level Agreement) is a core property of a flow that defines a behavior
that should be triggered if the flow runs for too long or doesn't satisfy the assertion defined in the SLA.
Note that SLA is in Beta so some properties might change in the next release or two. Please be aware that its API could change in ways that are not compatible with earlier versions in future releases, or it might become unsupported. If you have any questions or suggestions, please let us know via Slack or GitHub.
SLA types
Currently, Kestra supports the following SLA types:
- MAX_DURATION — the maximum allowed execution duration before the SLA is breached
- EXECUTION_ASSERTION — an assertion defined by a Pebble expression that must be met during the execution. If the assertion doesn't hold true, the SLA is breached.
How to use SLAs
SLAs are defined usig the sla
property at the root of a flow and they declare the desired state that must be met during executions of the flow.
MAX_DURATION
If a workflow execution exceeds the expected duration, an SLA can trigger corrective actions e.g. cancelling such execution.
The following SLA cancels an execution if it takes more than 8 hours:
id: sla_example
namespace: company.team
sla:
- id: maxDuration
type: MAX_DURATION
duration: PT8H
behavior: CANCEL
labels:
sla: miss
reason: durationExceeded
tasks:
- id: punctual
type: io.kestra.plugin.core.log.Log
message: Workflow started, monitoring SLA compliance
- id: sleepyhead
type: io.kestra.plugin.core.flow.Sleep
duration: PT9H
- id: never_executed_task
type: io.kestra.plugin.core.log.Log
message: This task will never start because the SLA was breached
EXECUTION_ASSERTION
An SLA can also be defined by an assertion that must be met during the execution. If the assertion doesn't hold true, the SLA is breached.
The following SLA fails if the output of mytask
is not equal to expected output
:
id: sla_demo
namespace: company.team
sla:
- id: assert_output
type: EXECUTION_ASSERTION
assert: "{{ outputs.mytask.value == 'expected output' }}"
behavior: FAIL
labels:
sla: miss
reason: outputMismatch
tasks:
- id: mytask
type: io.kestra.plugin.core.debug.Return
format: expected output
SLA behavior
The behavior
property of an SLA defines the action to take when the SLA is breached. The following behaviors are supported:
- CANCEL — cancels the execution
- FAIL — fails the execution
- NONE — logs a message.
On top of that behavior, each breached SLA can set labels that can be used to filter executions or to trigger specific actions.
Alerts on SLA breaches
Let's say that you want to receive a Slack alert anytime an SLA is breached. Using the Flow trigger, you can react to cancelled or failed Executions that were marked with the sla: miss
label:
id: sla_miss_alert
namespace: system
tasks:
- id: send_alert
type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
url: "{{secret('SLACK_WEBHOOK')}}"
payload: |
{
"text": "SLA breached for flow `{{trigger.namespace}}.{{trigger.flowId}}` with ID `{{trigger.executionId}}`"
}
triggers:
- id: alert_on_failure
type: io.kestra.plugin.core.trigger.Flow
labels:
sla: miss
states:
- FAILED
- WARNING
- CANCELLED
Was this page helpful?