Authors
Martin-Pierre Roset
The round-trip to disk is one of the most persistent bottlenecks in data work. Embedded databases eliminate it. Tools like DuckDB, Polars, and SQLite handle the majority of embedded workloads in 2026, with chDB filling a specific niche for ClickHouse users.
This post breaks down the embedded database landscape in 2026: what’s worth using now, what’s overhyped, and what to watch.
Embedded Databases have become indispensable because they are fast. Well, it’s a bit more complex than that; what they do best is:
Those efficiency gains translate into a few concrete benefits:
Developer Pain Points: Managing complex joins, handling multi-gigabyte datasets, and avoiding disk bottlenecks. Embedded Databases help address these by bringing data as close to computation as possible.
DuckDB is often called the SQLite of OLAP due to its simplicity and performance. With it, you can run complex SQL queries on local data without spinning up clusters or servers.
Why DuckDB still dominates in 2026:
DuckDB executes SQL directly inside Jupyter notebooks, which means data scientists can query Parquet files or run ad-hoc joins during exploratory analysis without switching tools.
Under the hood, DuckDB’s vectorized execution engine processes data in batches using SIMD (Single Instruction, Multiple Data). It supports lazy loading, so large Parquet or CSV files can be queried without pulling the full dataset into memory.
Code Example:
import duckdb

# In-memory connection; the Parquet file is queried in place
conn = duckdb.connect()
df = conn.sql("SELECT product_name, SUM(total) AS total FROM 'data/sales.parquet' GROUP BY product_name ORDER BY total DESC LIMIT 10")
# Convert the result to a pandas DataFrame and export it as JSON
df.to_df().to_json("top_products.json", orient="records")

Use case: Fast prototyping of data transformations and interactive analysis on datasets stored locally, all within a single-node environment.
chDB embeds the ClickHouse SQL engine in-process, letting you run ClickHouse-grade OLAP queries without a server. v4 landed in March 2026.
It’s narrower than DuckDB or Polars: Python-only (3.9+), macOS and Linux only. But for teams already running ClickHouse in production, chDB eliminates the server entirely for local development and testing.
import chdb

data = chdb.query("""
SELECT sum(total) as total, avg(quantity) as avg_quantity
FROM url('https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv');
""", 'PrettyCompact')
print(data)

Use case: Python developers in ClickHouse-native environments who need the same SQL dialect locally without spinning up a server instance.
SQLite is a lightweight, self-contained database for embedded systems and applications that need local storage, and it is still essential to a modern stack.
Why Developers Still Choose SQLite:
Performance Insight: SQLite’s B-tree indexing keeps reads and writes fast, but only one connection can write at a time. For read-heavy concurrent workloads, developers can enable write-ahead logging (WAL) mode, which lets readers proceed while a write is in progress.
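As a rough illustration of WAL mode, the sketch below uses only Python's standard-library `sqlite3` module. The table name `cache` and the helper `read_during_write` are hypothetical. It switches a database file to WAL, opens a write transaction on one connection, and reads from a second connection before and after the commit.

```python
import os
import sqlite3
import tempfile


def read_during_write(path):
    """Show that a reader sees a consistent snapshot while a write is open."""
    # isolation_level=None means autocommit; BEGIN/COMMIT are issued by hand.
    writer = sqlite3.connect(path, isolation_level=None)
    # The pragma returns the journal mode now in effect for this file.
    mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    writer.execute("CREATE TABLE cache (key TEXT PRIMARY KEY, value TEXT)")
    writer.execute("INSERT INTO cache VALUES ('a', 'committed')")

    # Open a write transaction and leave it uncommitted for now.
    writer.execute("BEGIN IMMEDIATE")
    writer.execute("INSERT INTO cache VALUES ('b', 'pending')")

    # A second connection can still read; it sees only committed rows.
    reader = sqlite3.connect(path, timeout=0)
    during = reader.execute("SELECT key FROM cache ORDER BY key").fetchall()

    writer.execute("COMMIT")
    after = reader.execute("SELECT key FROM cache ORDER BY key").fetchall()

    reader.close()
    writer.close()
    return mode, during, after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(read_during_write(os.path.join(tmp, "demo.db")))
```

WAL applies to file-backed databases, which is why the sketch uses a temporary file rather than `:memory:`.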
Limitations: While great for single-user scenarios, SQLite may not be suitable for highly concurrent write operations due to the lack of native parallel write support.
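The single-writer limit is easy to observe with the standard-library `sqlite3` module. In this sketch (the `events` table and the function name are made up for illustration), a second connection tries to start a write transaction while the first one holds the write lock. With the busy timeout set to zero it fails immediately instead of waiting.

```python
import os
import sqlite3
import tempfile


def second_writer_is_rejected(path):
    """Return (rejected, row_count) for two connections competing to write."""
    first = sqlite3.connect(path, isolation_level=None)
    first.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, body TEXT)")

    # timeout=0 makes the second connection give up at once when locked out.
    second = sqlite3.connect(path, isolation_level=None, timeout=0)

    first.execute("BEGIN IMMEDIATE")  # first connection takes the write lock
    first.execute("INSERT INTO events (body) VALUES ('first writer')")
    try:
        second.execute("BEGIN IMMEDIATE")  # a second writer cannot start
        second.execute("ROLLBACK")
        rejected = False
    except sqlite3.OperationalError:
        rejected = True  # SQLite reports that the database is locked
    first.execute("COMMIT")

    # Once the first transaction commits, the second writer can proceed.
    second.execute("BEGIN IMMEDIATE")
    second.execute("INSERT INTO events (body) VALUES ('second writer')")
    second.execute("COMMIT")
    count = second.execute("SELECT COUNT(*) FROM events").fetchone()[0]

    first.close()
    second.close()
    return rejected, count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(second_writer_is_rejected(os.path.join(tmp, "demo.db")))
```

In practice a non-zero `timeout` lets writers queue briefly instead of failing, which is usually enough for low write volumes but does not turn SQLite into a multi-writer database.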
Use Case: Offline-first mobile applications, local testing environments, and lightweight caching for microservices.
Polars graduated from “pandas replacement” to serious analytics contender. Its $21M Series A in September 2025 was a signal: Polars isn’t a hobby project anymore.
The core advantage is its streaming engine. Where pandas loads entire datasets into memory, Polars processes datasets larger than RAM by streaming chunks through an optimized query plan. In practice it consistently outpaces pandas on aggregation-heavy work, often by a significant margin, and holds its own against DuckDB on DataFrame-heavy workloads.
What makes Polars different:
scan_parquet() builds a query plan without reading data. .collect() executes it once, fully optimized.

Code example:

import polars as pl

# Lazy scan: nothing executes yet
df = pl.scan_parquet("data/sales.parquet")

result = (
    df.filter(pl.col("region") == "EU")
    .group_by("product_name")
    .agg(pl.col("total").sum().alias("total_revenue"))
    .sort("total_revenue", descending=True)
    .limit(10)
    .collect()  # execute the full plan here
)
print(result)

The scan_parquet + collect() pattern means Polars sees the full query before executing any of it. For large Parquet files with selective filters, this eliminates unnecessary reads entirely.
Use case: Data transformation pipelines where you mix Python-native DataFrame operations with SQL aggregations. Polars handles the reshaping; DuckDB handles the heavy SQL joins.
Limbo: one to watch. Turso’s Rust-based SQLite reimplementation brings async I/O (io_uring on Linux), WASM-first design, and a modern testing approach via Deterministic Simulation Testing. Still in beta as of May 2026 and not production-ready, but it’s worth bookmarking if you want to see where SQLite-compatible databases are heading.
Kestra is an event-driven orchestration platform that coordinates these databases as steps in a larger pipeline.
Why Kestra is Essential:
Extended Example Kestra Workflow:
id: embedded_databases
namespace: company.team

tasks:
  - id: chDB
    type: io.kestra.plugin.scripts.python.Script
    allowWarning: true
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    beforeCommands:
      - pip install chdb
    script: |
      import chdb

      data = chdb.query("""
      SELECT sum(total) as total, avg(quantity) as avg_quantity
      FROM url('https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv');
      """, 'PrettyCompact')
      print(data)

  - id: duckDB
    type: io.kestra.plugin.jdbc.duckdb.Query
    sql: |
      INSTALL httpfs;
      LOAD httpfs;

      SELECT sum(total) as total, avg(quantity) as avg_quantity
      FROM read_csv_auto('https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv', header=True);
    fetchType: FETCH

Advanced Configuration: Kestra also supports retries, error handling, and parallel task execution, making it easy to build robust data pipelines.
Use Case: Building a real-time recommendation system pipeline that processes raw sales data, aggregates insights, and exports outputs for downstream APIs.
Kestra’s ability to mix batch and event-driven tasks in one pipeline means developers can easily adapt to complex data processing needs.
If you have any questions, reach out via Slack or open a GitHub issue.
If you like the project, give us a GitHub star and join the community.