Optimizing Performance in Kestra 0.22
Performance is a critical aspect of an orchestrator. Discover how Kestra 0.22 significantly enhances execution speed, reduces resource consumption, and improves overall system performance.
The engineering team focused on improving Kestra’s performance in version 0.22. Here’s a clear overview of the optimizations we’ve made:
- Smarter output processing for better CPU and memory efficiency
- Parallelization of execution queues in the JDBC backend
- More efficient execution processing
- Reduced latency in the Kafka backend
- Improved database table cleaning for long-running systems
Smarter Output Processing
The Kestra Executor merges task outputs so that subsequent tasks can access previous task outputs via our expression language. Previously, this operation cloned the entire output map for every task run, incurring high CPU and memory costs.
With 0.22, we optimized this by merging outputs only for tasks that require it (e.g., ForEach tasks, whose iterations produce outputs under the same task identifier), among other enhancements.
Preliminary benchmark tests conducted on a flow with 2,100 task runs using a 3-level ForEach task showed both CPU and memory improvements:
- CPU usage for merging outputs dropped from 15.3% to 11.8%
- Memory allocation for merging dropped from 44% to 38%
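For context, here is a minimal sketch of a flow with the same shape as the benchmark (not the exact benchmark flow; the values and the inner task are illustrative):

```yaml
id: nested-foreach
namespace: company.team

tasks:
  - id: level_1
    type: io.kestra.plugin.core.flow.ForEach
    values: ["a", "b", "c"]
    tasks:
      - id: level_2
        type: io.kestra.plugin.core.flow.ForEach
        values: ["1", "2", "3"]
        tasks:
          - id: level_3
            type: io.kestra.plugin.core.flow.ForEach
            values: ["x", "y", "z"]
            tasks:
              # Illustrative inner task; each iteration produces a task run
              - id: log_value
                type: io.kestra.plugin.core.log.Log
                message: "{{ taskrun.value }}"
```

Each ForEach iteration produces outputs under the same task identifier (here, log_value), which is precisely the case where the Executor still needs to merge outputs.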
For more details, you can have a look at these two pull requests:
Parallelized JDBC Backend Queues
The Kestra Executor listens to multiple internal queues, including execution updates, task results, and killing events. Previously, when using a JDBC backend, each queue was processed by a single thread, creating a bottleneck.
Of these queues, two are the most important and receive the most messages:
- Execution updates (a message on every execution update)
- Worker task results (a message whenever the Worker finishes a task run)
In 0.22, we’ve enabled parallel processing for these two most critical queues. By default, both queues now use half of the available CPU cores for the Executor (minimum of 2 threads per queue).
This can be customized by the configuration property kestra.jdbc.executor.thread-count.
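For example, a minimal configuration sketch (the value of 8 threads is illustrative; adjust it to your workload):

```yaml
kestra:
  jdbc:
    executor:
      # Threads used to process each of the two parallelized queues;
      # defaults to half the available CPU cores, with a minimum of 2.
      thread-count: 8
```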
This change does not affect Executor performance when the number of concurrent executions and tasks is low: at low throughput (executions or tasks processed per second), execution latency stays the same. At high throughput, however, it significantly improves execution latency!
In a benchmark running 10,000 executions over 5 minutes (33 executions/s throughput):
- Before: Execution latency peaked at 9m 50s (this includes the time needed to process the last executions)
- After: Execution latency dropped to 100ms (executions processed in real-time!)
Even though a single execution took only 100ms, the Kestra Executor previously could not handle such a high load, and execution latency kept increasing. Executions are now processed at the same speed the benchmark injects them!
We plan to run more tests later to show how many executions per second Kestra can process; stay tuned!
For more details on the work to achieve these results, have a look at this pull request: PR #7382.
Optimized Execution Processing
When you start an execution from the UI, Kestra’s web interface updates in real time over a Server-Sent Events (SSE) connection. This connection goes through our API, which consumes the execution queue and filters the messages so that only those related to the current execution are sent to the UI. As a result, each connection spawned a new consumer on the execution queue, adding unnecessary load.
In 0.22, we switched to a shared consumer with a fan-out mechanism, reducing queue backend (database or Kafka) stress when multiple users trigger executions simultaneously.
For more details, you can have a look at this pull request: PR #6777.
Improvements to Flowable Task Executions
When the Kestra Executor processes an execution, it handles two types of tasks:
- runnable tasks that are sent to the Worker for processing.
- flowable tasks that are executed directly inside the Executor (e.g., loops, conditionals), as illustrated in the sketch below.
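To make the distinction concrete, here is a minimal flow sketch combining both kinds of tasks (the condition and the shell task are illustrative):

```yaml
id: flowable-vs-runnable
namespace: company.team

inputs:
  - id: run_heavy_job
    type: BOOLEAN
    defaults: true

tasks:
  # Flowable task: a conditional evaluated directly inside the Executor
  - id: branch
    type: io.kestra.plugin.core.flow.If
    condition: "{{ inputs.run_heavy_job }}"
    then:
      # Runnable task: sent to the Worker for processing
      - id: heavy_job
        type: io.kestra.plugin.scripts.shell.Commands
        commands:
          - echo "running on the Worker"
```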
Previously, processing a flowable task created a Worker Task Result (to mimic a task execution from the Worker) and sent it to the Worker Task Result queue, to be re-processed later by the Executor.
With Kestra 0.22, flowable tasks are no longer queued for later processing but handled immediately, eliminating unnecessary queue operations and reducing execution latency.
Further details in this pull request: PR #7250.
Reduced Kafka Backend Latency
During benchmarking, we discovered that our default Kafka configuration was tuned for throughput, not latency: out of the box, Kafka is optimized to process messages in batches.
By fine-tuning our configuration (poll.ms and commit.interval.ms reduced from 100ms → 25ms), we significantly improved execution speed.
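For reference, this is roughly what the tuning looks like as configuration: a hedged sketch, assuming the Kafka Streams properties are overridden under kestra.kafka.defaults.stream.properties (in 0.22, the lower values are already the defaults):

```yaml
kestra:
  kafka:
    defaults:
      stream:
        properties:
          # Favor latency over batching (the previous default was 100ms)
          poll.ms: 25
          commit.interval.ms: 25
```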
In a benchmark (a flow with two tasks, one processing JSON), this configuration update significantly reduces latency:
- At 10 executions/s, latency dropped from 1s → 200ms
- At 100 executions/s, latency dropped from 8s → 250ms
JDBC Queues Table Cleaning
Last but not least, we reviewed how we clean the JDBC queues table.
The JDBC queues table stores internal queue messages. Previously, the JDBC Cleaner cleaned this table only periodically; by default, it ran every hour and kept 7 days of messages.
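For reference, a hedged sketch of what that default roughly looks like, assuming the cleaner is configured under kestra.jdbc.cleaner:

```yaml
kestra:
  jdbc:
    cleaner:
      # How often the cleaner runs
      fixed-delay: 1h
      # How long queue messages are kept before deletion
      retention: 7d
```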
Our internal queue is a multiple producer / multiple consumer queue; this means that the JDBC cleaner cannot know if all consumers have read a message, as not all components read the same message. In our Kafka backend, we rely on the Kafka topic and consumer group, so it doesn’t suffer from the same issue.
We received feedback from users processing a large number of executions per day that this queues table can grow very large (tens or even hundreds of gigabytes) and sometimes put a very high load on the database.
To mitigate this user issue, we decided to improve our queues table cleaning in two ways:
- On-execution purge: all execution-related messages are now purged at the end of an execution, keeping only the last execution message so that late consumers can still learn the execution’s terminal state.
- Early cleanup of high-volume logs: High cardinality messages such as logs, metrics, and audit logs (consumed by a single consumer) are now purged after 1 hour instead of 7 days.
With these two changes, the number of records in the queues table dropped by a staggering 95% in our synthetic benchmarks!
For more details, you can have a look at these two pull requests:
Sneak Peek of 0.22 vs. 0.21
We plan to discuss Kestra’s performance further in a later blog post, but here is a performance comparison of 0.22 versus 0.21. What’s important here is not the raw numbers but the difference between the two versions.
The benchmark scenario is a flow triggered by a Kafka Realtime Trigger that performs a JSON transformation for each message and returns the output in a second task. We generate 1000 executions by publishing messages to a Kafka topic at 10, 25, 50, 75, and 100 messages per second, then check the execution latency by looking at the last execution of the scenario run.
The benchmark flow:

```yaml
id: realtime-kafka-json
namespace: company.team

triggers:
  - id: kafka-logs
    type: io.kestra.plugin.kafka.RealtimeTrigger
    topic: test_kestra
    properties:
      bootstrap.servers: localhost:9092
    groupId: myGroup

tasks:
  - id: transform
    type: io.kestra.plugin.transform.jsonata.TransformValue
    from: "{{trigger.value}}"
    expression: |
      {
        "order_id": order_id,
        "customer_name": first_name & ' ' & last_name,
        "address": address.city & ', ' & address.country,
        "total_price": $sum(items.(quantity * price_per_unit))
      }
  - id: hello
    type: io.kestra.plugin.core.output.OutputValues
    values:
      log: "{{outputs.transform.value}}"
```

JDBC backend
| Throughput (exec/s) | Latency in 0.21 | Latency in 0.22 | Improvement |
|---|---|---|---|
| 10 | 400ms | 150ms | 62% faster |
| 25 | 26s | 200ms | 99% faster |
| 50 | 43s | 5s | 88% faster |
| 75 | 49s | 10s | 80% faster |
| 100 | 59s | 12s | 80% faster |
Key takeaways:
- Performance has improved dramatically in 0.22, even when executions barely overlap (which is roughly the case at 10 executions/s).
- Performance starts to degrade significantly at around 50 executions/s.
Kafka backend
| Throughput (exec/s) | Latency in 0.21 | Latency in 0.22 | Improvement |
|---|---|---|---|
| 10 | 800ms | 200ms | 75% faster |
| 25 | 800ms | 225ms | 72% faster |
| 50 | 900ms | 225ms | 75% faster |
| 75 | 1s | 300ms | 70% faster |
| 100 | 1s | 300ms | 70% faster |
| 150 | 1.2s | 750ms | 38% faster |
| 200 | 2s | 1.9s | 5% faster |
Key takeaways:
- In 0.21, our Kafka backend can sustain higher throughput than our JDBC backend, but at low throughput its latency is higher than the JDBC backend’s.
- In 0.22, our Kafka backend achieves almost the same latency at low throughput as our JDBC backend. At up to 100 executions per second, latency didn’t increase much, and in all cases, it stayed under the latency seen in 0.21.
Conclusion
Version 0.22 brings major efficiency improvements, making Kestra faster and more scalable. As we continue to optimize performance, stay tuned for more updates on how far we can push Kestra’s execution capabilities in upcoming versions.
If you have any questions, reach out via Slack or open a GitHub issue.
If you like the project, give us a GitHub star and join the community.