Data lakehouse orchestration with Kestra, Dremio, dbt and Python
Recently, we've released the Dremio and Arrow Flight SQL plugins for Kestra, which allow you to seamlessly integrate Dremio with Kestra workflows. This blog post will dive into how you can leverage Dremio, dbt, Python, and Kestra to orchestrate workflows built on top of a data lakehouse.
What is Dremio
Dremio is a data lakehouse platform that simplifies big data analytics. It allows you to directly query data from various sources, such as Postgres, S3 and Azure Data Lake Storage, without needing to copy or move the data. Its key features include a fast query engine, a semantic layer to help manage and share data, a catalog for Iceberg tables, and reflections, a query acceleration technology that can deliver sub-second query response times. Designed to work with SQL and common BI tools, Dremio provides self-service analytics and data management for BI workloads at a competitive price-performance ratio.
How Kestra Helps to Manage Data Lakehouse Workflows
Kestra is a universal orchestration platform for business-critical operations. It allows you to automate data workflows across multiple systems, including Dremio. It’s designed to reliably manage complex workflows, be it event-driven pipelines, scheduled batch data transformations, or API-driven automations. Kestra is built with a modular architecture that allows you to easily extend its functionality with plugins, including plugins for dbt, Python, Dremio and Arrow Flight SQL. These plugins allow you to execute custom scripts and SQL queries, trigger actions based on data changes, and much more.
Dremio Plugin
With the Dremio plugin, you can interact with Dremio data sources via JDBC. This plugin includes:
- Trigger: trigger workflows based on conditions in the Dremio database. For example, you can execute a workflow when there is an anomaly in the data, or when certain KPIs fall outside of an expected range.
- Task: run SQL queries against Dremio databases to fetch data for downstream processing in Python, R, or other languages supported by Kestra.
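As a rough illustration, the sketch below combines both: a trigger that periodically runs a KPI query against Dremio, and a task that fetches detail rows when the trigger fires. The connection settings mirror the tutorial later in this post; the trigger type, its interval property, and the stg_orders table are illustrative assumptions, so verify them against the plugin documentation before relying on them.

```yaml
id: dremio_kpi_alert
namespace: dremio.analytics

variables:
  user: rick.astley@kestra.io
  project_id: ead79cc0-9e93-4d50-b364-77639a56d4a6

tasks:
  - id: fetch_details
    type: io.kestra.plugin.jdbc.dremio.Query
    url: "jdbc:dremio:direct=sql.dremio.cloud:443;ssl=true;PROJECT_ID={{ vars.project_id }};"
    username: $token
    password: "{{ secret('DREMIO_TOKEN') }}"
    # Hypothetical detail query executed whenever the trigger below fires
    sql: SELECT * FROM "@{{ vars.user }}"."stg_orders" WHERE status = 'returned'
    store: true

triggers:
  - id: returned_orders_check
    # Assumed trigger type following the plugin's naming; verify in the plugin docs
    type: io.kestra.plugin.jdbc.dremio.Trigger
    url: "jdbc:dremio:direct=sql.dremio.cloud:443;ssl=true;PROJECT_ID={{ vars.project_id }};"
    username: $token
    password: "{{ secret('DREMIO_TOKEN') }}"
    sql: SELECT count(*) AS returned_orders FROM "@{{ vars.user }}"."stg_orders" WHERE status = 'returned'
    interval: PT30M
```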
Arrow Flight SQL Plugin
For broader compatibility with databases using Apache Arrow, Kestra offers the Arrow Flight SQL plugin. It includes:
- Trigger: trigger workflows in response to data changes in Arrow Flight SQL-compatible databases.
- Task: use SQL to fetch columnar data from Arrow Flight SQL-compatible databases for further processing.
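The usage pattern mirrors the Dremio Query task. Below is a minimal sketch; note that the fully qualified task type and the jdbc:arrow-flight-sql URL format shown here are assumptions, so double-check them against the Arrow Flight SQL plugin documentation for your Kestra version.

```yaml
id: arrow_flight_sql_query
namespace: dremio.analytics

tasks:
  - id: fetch
    # Assumed task type and URL scheme; verify both in the plugin documentation
    type: io.kestra.plugin.jdbc.arrowflight.Query
    url: jdbc:arrow-flight-sql://data.dremio.cloud:443/?useEncryption=true
    username: $token
    password: "{{ secret('DREMIO_TOKEN') }}"
    sql: SELECT * FROM "@rick.astley@kestra.io"."stg_customers"
    store: true
```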
Tutorial: How to Automate Data Lakehouse Workflows with Kestra and Dremio
Let’s look at a practical application of Kestra and Dremio for data lakehouse orchestration. The workflow involves data transformation with dbt, then fetching transformed data from a Dremio lakehouse using SQL, and finally processing it with Polars in a Python task.
Prerequisites
- Kestra Setup: start Kestra — by default, it includes all plugins that you need to follow this tutorial. See the Getting Started documentation for more installation details.
- Dremio Account: if you are new to Dremio, the easiest way to familiarize yourself with the platform is to create a Dremio Test Drive account, which provides you with a Dremio instance and sample datasets to query without requiring any subscription or installation.
- Dremio Token: create a Personal Access Token (PAT) in your Dremio account. To do that, go to your Dremio account settings and open the “Personal Access Token” section. From there, you can create the token, copy it, and store it as a Kestra secret to avoid exposing it directly in your flow code (see the sketch after this list for one way to do that). Check the Dremio documentation for more details.
- Dremio Project ID: you can find the project ID in your Dremio URL. For example, if your Dremio URL is https://app.dremio.cloud/sonar/ead79cc0-9e93-4d50-b364-77639a56d4a6, then your project ID is the last path segment: ead79cc0-9e93-4d50-b364-77639a56d4a6.
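If you run open-source Kestra with Docker Compose, one simple way to register the PAT as a secret is to pass it to the Kestra container as a base64-encoded environment variable prefixed with SECRET_, which then becomes available in flows as {{ secret('DREMIO_TOKEN') }}. The snippet below is only a sketch of that setup; adjust it to your own deployment and secrets backend.

```yaml
# docker-compose.yml excerpt (sketch)
services:
  kestra:
    image: kestra/kestra:latest
    environment:
      # Value must be the base64-encoded PAT, e.g. the output of: echo -n "<your PAT>" | base64
      SECRET_DREMIO_TOKEN: "<base64-encoded PAT>"
```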
Workflow Breakdown
- Transforming Data with dbt:
  - The io.kestra.plugin.git.Clone task clones the Jaffle Shop repository, which contains a sample dataset in the seeds directory, and dbt models in the models directory.
  - The io.kestra.plugin.dbt.cli.DbtCLI task runs a dbt command to build models and run tests. Make sure to follow the dbt Dremio installation instructions to set up your dbt profile for Dremio.
- Querying Data from Dremio:
  - The io.kestra.plugin.jdbc.dremio.Query task queries the Dremio data source using JDBC.
  - The query results are fetched for downstream use into Kestra’s internal storage as a file serialized in the Amazon Ion format.
- Converting Ion to CSV:
  - The io.kestra.plugin.serdes.csv.IonToCsv task converts the Ion file into a CSV file. This task saves us many lines of code to transform between data formats because the Polars library, used in the next task, does not support Amazon Ion files as input yet.
- Processing Data with Python:
  - The io.kestra.plugin.scripts.python.Script task utilizes a Docker image with Polars.
  - The script reads the query result into a Polars DataFrame.
  - This DataFrame is then available for further analysis or reporting.

Workflow Code
Here is the full workflow code for this example:
```yaml
id: dremio_sql_python
namespace: dremio.analytics

variables:
  user: rick.astley@kestra.io
  project_id: ead79cc0-9e93-4d50-b364-77639a56d4a6

tasks:
  - id: wdir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: cloneRepository
        type: io.kestra.plugin.git.Clone
        url: https://github.com/dbt-labs/jaffle_shop
        branch: main

      - id: dbt
        type: io.kestra.plugin.dbt.cli.DbtCLI
        docker:
          image: ghcr.io/kestra-io/dbt-dremio:latest
        profiles: |
          jaffle_shop:
            outputs:
              dev:
                type: dremio
                threads: 16
                cloud_host: api.dremio.cloud
                cloud_project_id: "{{ vars.project_id }}"
                user: "{{ vars.user }}"
                pat: "{{ secret('DREMIO_TOKEN') }}"
                use_ssl: true
            target: dev
        commands:
          - dbt build

  - id: query
    type: io.kestra.plugin.jdbc.dremio.Query
    disabled: false
    url: "jdbc:dremio:direct=sql.dremio.cloud:443;ssl=true;PROJECT_ID={{vars.project_id}};"
    username: $token
    password: "{{ secret('DREMIO_TOKEN') }}"
    sql: SELECT * FROM "@{{ vars.user }}"."stg_customers";
    store: true

  - id: to_csv
    type: io.kestra.plugin.serdes.csv.IonToCsv
    from: "{{ outputs.query.uri }}"

  - id: python
    type: io.kestra.plugin.scripts.python.Script
    warningOnStdErr: false
    docker:
      image: ghcr.io/kestra-io/polars:latest
    inputFiles:
      data.csv: "{{ outputs.to_csv.uri }}"
    script: |
      import polars as pl
      df = pl.read_csv("data.csv")
      print(df.glimpse())
```

This flow clones a Git repository with dbt code, and runs dbt models and tests. Then, it fetches that transformed data from Dremio and passes the query results to a downstream Python task that further processes the fetched data using Polars.

How to Extend this Workflow
This use case is deliberately simple to demonstrate the basic usage of these plugins. However, you can easily extend it to accommodate more complex data processing requirements. For example, if you need to automate business-critical operations for each row of a large dataset, add a ForEachItem task. That task allows you to iterate over a list of items fetched from your data lakehouse and reliably execute downstream processing in parallel for each row or batch of rows, as sketched below.
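A rough sketch of that extension is shown below. It would be appended to the tasks list of the flow above and assumes a hypothetical subflow dremio.analytics.process_customer that accepts the current batch as an input; verify the exact ForEachItem properties against the Kestra documentation for your version.

```yaml
  - id: each_customer
    type: io.kestra.plugin.core.flow.ForEachItem
    items: "{{ outputs.query.uri }}"     # rows fetched by the Dremio Query task above
    batch:
      rows: 1                            # one subflow execution per row; raise this to batch rows
    namespace: dremio.analytics
    flowId: process_customer             # hypothetical subflow processing a single batch
    inputs:
      customers: "{{ taskrun.items }}"   # file containing the current batch, passed as a subflow input
```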
Next Steps
Integrating Dremio with Kestra offers a powerful solution for analytical workflows and scenarios involving complex data transformations and business-critical operations. The simple example covered in this post can be easily customized based on your requirements. If you like Kestra, give us a star on GitHub and join the community. If you want to learn more about Dremio, visit https://www.dremio.com for more information.