Orchestrate your LakeHouse with Kestra and DuckDB

Combining Kestra and DuckDB to create a lakehouse architecture offers a modern approach to managing data by merging the strengths of data lakes and warehouses. This significantly reduces costs and complexities.

This blog summarizes a talk I gave at the first DuckDB Meetup in Paris where I explored how DuckDB’s in-memory columnar database integrates well with the orchestration capabilities of Kestra. It also covers the transition from traditional data storage methods to the more efficient and flexible lakehouse model, and discusses the use of DuckDB within Kestra environments through best practices for building scalable, cost-effective data ecosystems.

Data Lake, Data Warehouse or Data Lakehouse?

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. Meanwhile, a data warehouse is a system used for structuring data and support use cases such as reporting and data analysis. It is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.

The lakehouse architecture merges the flexibility of data lakes with the structured query capabilities of data warehouses. It consists of three layers: a query engine, a transaction layer, and a storage layer.

The transaction layer introduces an abstraction over raw storage, enabling direct table-like access to raw data, facilitated by Open Table Formats such as Apache Iceberg or Apache Hudi. This design enhances analytical and operational workflows, providing ACID transactions and efficient data management within a unified platform.

We Need a Control Plane

To manage various data sources, applications, and infrastructure, a robust control plane is essential. Kestra aligns with the concept of a control plane by providing a centralized, comprehensive orchestration layer for managing data workflows, infrastructure, and applications. It simplifies complex operations, enabling users to define, automate, and monitor processes across different environments. This orchestration layer acts as a control plane by offering visibility and control over various components, streamlining execution, and enhancing efficiency in data-driven ecosystems.

About DuckDB

DuckDB is considered SQLite for Analytics. It’s an open-source embedded OLAP database that can run in-process rather than relying on the traditional client-server architecture. As with SQLite, there’s no need to install a database server to get started. Installing DuckDB instantly turns your laptop into an OLAP engine capable of aggregating large volumes of data at an impressive speed using just CLI and SQL. It’s especially useful for reading data from local files or objects stored in cloud storage buckets (e.g., parquet files on S3).

This lightweight embedded database allows fast queries in virtually any environment with almost no setup. However, it doesn’t offer high concurrency or user management, and it doesn’t scale horizontally. That’s where MotherDuck can help. MotherDuck

Orchestrate your Lakehouse with kestra and DuckDB

In our exploration of DuckDB’s integration within Kestra environments, we’ve identified several layers of complexity that showcase the applications of DuckDB as the query engine for lakehouse architectures. From automating basic queries to implementing advanced data management and analytics, these levels reflect the practical scenarios and solutions our user community is actively deploying:

Level 1: focuses on automating queries and managing dependencies within the stack, utilizing DuckDB for query execution and an S3 bucket for storage, with Kestra orchestrating the workflow. This level is straightforward and fits very well in projects that need to be efficient and rely on simple technologies. However, this level does not address SQL query management, metadata management, or “big data” capabilities as it relies only on DuckDB engine (could be hard to scale horizontally and distribute compute)

Level 2: advances by incorporating SQL query management through tools like SQLMesh or DBT, still orchestrated by Kestra. It enhances the structure from Level 1 by uncoupling the business logic (the SQL queries) from the orchestration logic. For example, dbt project can be stored in a dedicated GitHub repository and pulled at runtime in Kestra. This way, dependencies between upstream tasks (data ingestion or data wrangling) and downstream tasks (alerting, refreshing a BI tool, etc.) is very easy to manage and monitor. It also simplifies backfilling and retry strategy management. Still, it remains limited by the absence of metadata management and “big data” support.

id: dbt_duckdb
namespace: company.team

tasks:
  - id: dbt
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
    - id: cloneRepository
      type: io.kestra.plugin.git.Clone
      url: https://github.com/kestra-io/dbt-example
      branch: main

    - id: dbt-run
      type: io.kestra.plugin.dbt.cli.DbtCLI
      runner: DOCKER
      docker:
        image: ghcr.io/kestra-io/dbt-duckdb:latest
      commands:
        - dbt run

Level 3: represents a fully-fledged lakehouse model, adding metadata management, ACID properties, and time travel features through the integration of Apache Iceberg for the transaction layer. This level maintains the advancements of Level 2 and addresses its limitations, offering a comprehensive solution for data management and analytics. Note that this integration is still in its early phase: Iceberg writing disposition with DuckDB is still in development and several challenges could arise. Still, it’s something open-source developers are working on and it could be a very interesting and mature solution soon.

Each level progressively builds upon the last, offering more sophisticated features and capabilities, illustrating Kestra’s flexibility and power in orchestrating complex data ecosystems with DuckDB as the query engine.

The End of Big Data?

This article from MotherDuck highlights a shift towards more manageable data sizes in many organizations, suggesting that large-scale distributed computing might not be necessary for all. Instead, leveraging single-node databases like DuckDB for most data scenarios—where volumes are under 10GB—offers a cost-effective, simplified approach. For tasks that demand distributed computing, tools like Databricks or BigQuery provide the necessary scale. This perspective encourages a balanced approach to data analytics, prioritizing efficiency and practicality over the pursuit of handling ever-increasing data sizes.

Having a control-plane like Kestra is key here: it allows one to manage different projects with the corresponding tools. For example having some pipelines using DuckDB as the backbone for fast and efficient small data processing while running in parallel workflows using Spark to distribute heavy data processing on several nodes. It bridges the gap between different solutions by adding dependencies management and branching opportunities, making pipelines connection a seamless task for every engineer.

Engineers Is Your First Budget Cost not Compute

While the lakehouse architecture reduces computing costs by a great factor, an equally crucial aspect is the human element of development.

Kestra’s intuitive, declarative syntax enhances productivity, enabling teams to focus more on innovation and less on the intricacies of orchestration. It facilitates the separation of business and orchestration logic through a fully managed control plane - either using a full web user interface or using code assets like Terraform resources or following GitOps patterns.

To go further between the integration of DuckDB and Kestra you can check our article on how to automate your data analyse using kestra and DuckDB or check out our article about when you need to move from DuckDB to MotherDuck

Join the Slack community if you have any questions or need assistance. Follow us on Twitter for the latest news. Check the code in our GitHub repository and give us a star if you like the project.

Contribute

Share this news

Get Kestra updates to your inbox

Stay up to date with the latest features and changes to Kestra

Orchestrate your LakeHouse with Kestra and DuckDB

Authors

Category

Last Updated

Table of contents

Contribute

Share this news

Data Lake, Data Warehouse or Data Lakehouse?

We Need a Control Plane

About DuckDB

Orchestrate your Lakehouse with kestra and DuckDB

The End of Big Data?

Engineers Is Your First Budget Cost not Compute

Contribute

Share this news

Get Kestra updates to your inbox