Alerting & Monitoring

This page provides best practices for setting up alerting and monitoring in your Kestra instance.

Alerting

Failure alerts are essential. When a production workflow fails, you should be notified immediately. To implement failure alerting, you can use Kestra’s built-in notification tasks, such as:

Technically, you can add custom failure alerts to each flow separately using the errors tasks:

yaml

id: onFailureAlert
namespace: company.team

tasks:
  - id: fail
    type: io.kestra.plugin.core.execution.Fail

errors:
  - id: slack
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    url: "{{ secret('SLACK_WEBHOOK') }}"
    messageText: "Failure alert for flow `{{ flow.namespace }}.{{ flow.id }}` with ID `{{ execution.id }}`. Here is a bit more context about why the execution failed: `{{ errorLogs() }}`"

However, this can lead to some boilerplate code if you start copy-pasting this errors configuration to multiple flows.

For centralized namespace-level alerting, we recommend creating a dedicated monitoring workflow with a notification task and a Flow trigger. Below is an example workflow that automatically sends a Slack alert as soon as any flow in a namespace company.analytics fails or finishes with warnings.

yaml

id: failureAlertToSlack
namespace: company.monitoring

tasks:
  - id: send
    type: io.kestra.plugin.notifications.slack.SlackExecution
    url: "{{ secret('SLACK_WEBHOOK') }}"
    channel: "#general"
    executionId: "{{trigger.executionId}}"

triggers:
  - id: listen
    type: io.kestra.plugin.core.trigger.Flow
    conditions:
      - type: io.kestra.plugin.core.condition.ExecutionStatus
        in:
          - FAILED
          - WARNING
      - type: io.kestra.plugin.core.condition.ExecutionNamespace
        namespace: company.analytics
        prefix: true

Adding this single flow will ensure that you receive a Slack alert on any flow failure in the company.analytics namespace. Here is an example alert notification:

alert notification

Note that if you want this alert to be sent on failure across multiple namespaces, you will need to add an OrCondition to the conditions list. See the example below:

yaml

id: alert
namespace: company.system

tasks:
  - id: send
    type: io.kestra.plugin.notifications.slack.SlackExecution
    url: "{{ secret('SLACK_WEBHOOK') }}"
    channel: "#general"
    executionId: "{{trigger.executionId}}"

triggers:
  - id: listen
    type: io.kestra.plugin.core.trigger.Flow
    conditions:
      - type: io.kestra.plugin.core.condition.ExecutionStatus
        in:
          - FAILED
          - WARNING
      - type: io.kestra.plugin.core.condition.Or
        conditions:
          - type: io.kestra.plugin.core.condition.ExecutionNamespace
            namespace: company.product
            prefix: true
          - type: io.kestra.plugin.core.condition.ExecutionFlow
            flowId: cleanup
            namespace: company.system

The example above works correctly. However, if you list the conditions without using OrCondition, no alerts will be sent because Kestra will try to match all conditions simultaneously. Since there’s no overlap between them, the conditions cancel each other out. See the example below:

yaml

id: bad_example
namespace: company.monitoring
description: This example will not work

tasks:
  - id: send
    type: io.kestra.plugin.notifications.slack.SlackExecution
    url: "{{ secret('SLACK_WEBHOOK') }}"
    channel: "#general"
    executionId: "{{trigger.executionId}}"

triggers:
  - id: listen
    type: io.kestra.plugin.core.trigger.Flow
    conditions:
      - type: io.kestra.plugin.core.condition.ExecutionStatus
        in:
          - FAILED
          - WARNING
      - type: io.kestra.plugin.core.condition.ExecutionNamespace
        namespace: company.product
        prefix: true
      - type: io.kestra.plugin.core.condition.ExecutionFlow
        flowId: cleanup
        namespace: company.system

Here, there's no overlap between the two conditions. The first condition will only match executions in the company.product namespace, while the second condition will only match executions from the cleanup flow in the company.system namespace. If you want to match executions from the cleanup flow in the company.system namespace or any execution in the product namespace, make sure to add the OrCondition.

Monitoring

Kestra exposes a monitoring endpoint on port 8081 by default. You can change this port using the endpoints.all.port property in the configuration options.

This monitoring endpoint provides invaluable information for troubleshooting and monitoring, including Prometheus metrics and several Kestra's internal routes. For instance, the /health endpoint exposed by default on port 8081 (e.g., http://localhost:8081/health) generates a similar response as shown below as long as your Kestra instance is healthy:

json

{
  "name": "kestra",
  "status": "UP",
  "details": {
    "jdbc": {
      "name": "kestra",
      "status": "UP",
      "details": {
        "jdbc:postgresql://postgres:5432/kestra": {
          "name": "kestra",
          "status": "UP",
          "details": {
            "database": "PostgreSQL",
            "version": "15.3 (Debian 15.3-1.pgdg110+1)"
          }
        }
      }
    },
    "compositeDiscoveryClient()": {
      "name": "kestra",
      "status": "UP",
      "details": {
        "services": {

        }
      }
    },
    "service": {
      "name": "kestra",
      "status": "UP"
    },
    "diskSpace": {
      "name": "kestra",
      "status": "UP",
      "details": {
        "total": 204403494912,
        "free": 13187035136,
        "threshold": 10485760
      }
    }
  }
}

Prometheus

Kestra exposes Prometheus metrics on the endpoint /prometheus. This endpoint is compatible with Prometheus and can be scraped by any Prometheus-based monitoring system.

For more details about Prometheus setup, refer to the Monitoring with Grafana & Prometheus article.

For a complete list of available metrics, refer to the Prometheus metrics page.

Kestra's metrics

You can leverage Kestra's internal metrics to configure custom alerts. Each metric provides multiple time series with tags allowing to track at least namespace & flow but also other tags depending on available tasks.

Kestra metrics use the prefix kestra. This prefix can be changed using the kestra.metrics.prefix property in the configuration options.

Each task type can expose custom metrics that will be also exposed on Prometheus.

Worker

Metrics	Type	Description
worker.running.count	`GAUGE`	Number of tasks currently running
worker.started.count	`COUNTER`	Count of tasks started
worker.retried.count	`COUNTER`	Count of tasks retried
worker.ended.count	`COUNTER`	Count of tasks completed
worker.ended.duration	`TIMER`	Duration of tasks completed
worker.job.running	`GAUGE`	Count of currently running worker jobs
worker.job.pending	`GAUGE`	Count of currently pending worker jobs
worker.job.thread	`GAUGE`	Total worker job thread count

The worker.job.pending, worker.job.running, and worker.job.thread metrics are intended for autoscaling worker servers.

Executor

Metrics	Type	Description
executor.taskrun.next.count	`COUNTER`	Count of tasks found
executor.execution.end.count	`COUNTER`	Count of completed executions
executor.taskrun.ended.duration	`TIMER`	Duration of tasks completed
executor.workertaskresult.count	`COUNTER`	Count of task results sent by a worker
executor.execution.started.count	`COUNTER`	Count of executions started
executor.execution.end.count	`COUNTER`	Count of executions completed
executor.execution.duration	`TIMER`	Duration of executions completed
executor.flowable.execution.count	`COUNTER`	Count of flowable tasks executed
executor.execution.popped.count	`COUNTER`	Count of executions popped
executor.execution.queued.count	`COUNTER`	Count of executions queued
executor.thread.count	`COUNTER`	Count of executor threads

Indexer

Metrics	Type	Description
indexer.count	`COUNTER`	Count of index requests sent to a repository
indexer.duration	`DURATION`	Duration of index requests sent to a repository

Scheduler

Metrics	Type	Description
scheduler.trigger.count	`COUNTER`	Count of triggers
scheduler.evaluate.running.count	`COUNTER`	Evaluation of triggers actually running
scheduler.evaluate.duration	`TIMER`	Duration of trigger evaluation

JDBC Queue

Metrics	Type	Description
queue.big_message.count	`COUNTER`	Count of big messages
queue.produce.count	`COUNTER`	Count of produced messages
queue.receive.duration	`TIMER`	Duration to receive and consume a batch of messages
queue.poll.size	`GAUGE`	Size of a poll to the queue (message batch size)

Other metrics

Kestra also exposes all internal metrics from the following sources:

Micronaut
Kafka
Thread pools of the application
JVM

Check out the Micronaut documentation for more information.

Grafana and Kibana

Kestra uses Elasticsearch to store all executions and metrics. Therefore, you can easily create a dashboard with Grafana or Kibana to monitor the health of your Kestra instance.

We'd love to see what dashboards you will build. Feel free to share a screenshot or a template of your dashboard with the community. Meanwhile, here is an example of a Grafana dashboard that we use internally to monitor Kestra - feel free to use it as a starting point for your own dashboard:

grafana

Grafana Dashboard JSON

Kestra endpoints

Kestra exposes internal endpoints on the management port (8081 by default) to provide status corresponding to the server type:

/worker: will expose all currently running tasks on this worker.
/scheduler: will expose all currently scheduled flows on this scheduler with the next date.
/kafkastreams: will expose all Kafka Streams states and aggregated store lag.
/kafkastreams/{clientId}/lag: will expose details lag for a clientId.
/kafkastreams/{clientId}/metrics: will expose details metrics for a clientId.

Other Micronaut default endpoints

Since Kestra is based on Micronaut, the default Micronaut endpoints are enabled by default on port 8081:

/info Info Endpoint with git status information.
/health Health Endpoint usable as an external heathcheck for the application.
/loggers Loggers Endpoint allows changing logger level at runtime.
/metrics Metrics Endpoint metrics in JSON format.
/env Environment Endpoint to debug configuration files.

You can disable some endpoints following the above Micronaut configuration.

Debugging techniques

Here are several debugging techniques administrators can use to investigate issues:

Enable verbose log

Kestra has some management endpoints including one that allows changing logging verbosity at run time.

Inside the container (or locally if standalone jar is used), send this command to enable very verbose logging:

shell

curl -i -X POST -H "Content-Type: application/json" \
  -d '{ "configuredLevel": "TRACE" }' \
  http://localhost:8081/loggers/io.kestra

Alternatively, you can change logging levels on configuration files:

yaml

logger:
  levels:
    io.kestra.core.runners: TRACE

Capture some java dump

As we run a JRE not a JVM, there is no monitoring tools available, so first you need to install Jattach:

As of version 0.22, Jattach is included in the Kestra image, so there is no need to separately install. If running an older version, continue to follow the steps below.

shell

curl -L -o jattach https://github.com/jattach/jattach/releases/download/v2.2/jattach
chmod +x jattach

You need to find the pid of the Kestra process, it's usually 1 on docker installation.
You can get JVM information with jattach <pid> jcmd VM.info > vminfo
You can get a heap history via jattach <pid> inspectheap > inspectheap
You can get a heap dump via jattach <pid> dumpheap > dumpheap
You can get a thread dump via jattach <pid> threaddump > threaddump

Alternatively, you can request a thread dump via the /threaddump endpoint available on the management port (8081 if not configured otherwise).

Was this page helpful?

Administrator GuideSoftware and Hardware Requirements

Administrator GuideTroubleshooting

​Alerting & ​Monitoring

Alerting & Monitoring