Authors
Martin-Pierre Roset
Whether you’re building complex ETL pipelines, conducting exploratory data analysis, or powering real-time APIs, these databases are usually in your stack. Why? They run inside your application’s process, eliminating network round-trips and server management overhead. Tools like DuckDB, chDB, and SQLite, alongside the rise of Limbo, are more relevant than ever for 2025.
This post breaks down embedded database tool choices in 2025.
Embedded databases have become indispensable because they are fast. Well, it’s a bit more complex than that: because they run in-process, they avoid client-server round-trips, need no separate service to deploy or operate, and keep data right next to the computation that uses it.
Developer Pain Points: Managing complex joins, handling multi-gigabyte datasets, and avoiding disk bottlenecks. Embedded Databases help address these by bringing data as close to computation as possible.
DuckDB is often called the SQLite of OLAP due to its simplicity and performance. With it, you can run complex SQL queries on local data without spinning up clusters or servers.
Why DuckDB still dominates in 2025:
Jupyter Notebook and Embedded Analytics: DuckDB’s ability to execute SQL directly within Jupyter notebooks makes it an attractive option for data scientists working with Parquet files or performing ad-hoc joins during exploratory analysis. It allows interactive workflows where developers can visualize results without moving between different systems.
Deep Dive: DuckDB’s vectorized execution engine processes data in batches, leveraging SIMD (Single Instruction, Multiple Data) to maximize CPU efficiency. It supports lazy loading, meaning large files like Parquet or CSV can be queried without loading the full dataset into memory.
Code Example:

```python
import duckdb

conn = duckdb.connect()
df = conn.sql("""
    SELECT product_name, SUM(total) AS total
    FROM 'data/sales.parquet'
    GROUP BY product_name
    ORDER BY total DESC
    LIMIT 10
""")
df.to_df().to_json("top_products.json", orient="records")
```

This example showcases how DuckDB simplifies querying local Parquet files, avoiding the need for preprocessing or external storage.
Use Case: Fast prototyping of data transformations and interactive analysis on datasets stored locally, all within a single-node environment.
chDB is an in-process SQL OLAP engine built on top of ClickHouse. It allows developers to run high-performance analytical queries directly within their applications without needing an external database server. By embedding the ClickHouse SQL engine, chDB enables fast, local data processing while minimizing the complexity of traditional OLAP deployments.
chDB is designed for in-process queries, making it well-suited for analytical workloads. It can process structured data from formats such as Parquet, Arrow, CSV, and JSON. The queries operate directly on data files without requiring a full database instance.
Key technical features:

- Runs in-process: the full ClickHouse SQL engine is embedded in your application, with no server to deploy or manage.
- Queries files directly: structured data in Parquet, Arrow, CSV, and JSON formats can be read in place, without a full database instance.
- Columnar, vectorized execution: only the columns a query references are accessed, and data is processed in batches to exploit CPU parallelism.
Below is an example of how to use chDB to query a Parquet file:
```python
import chdb

data = chdb.query("""
    SELECT *
    FROM url('https://huggingface.co/datasets/kestra/datasets/resolve/main/json/products.json');
""", 'PrettyCompact')
print(data)
```

This snippet demonstrates how chDB performs SQL queries directly on a file, providing immediate access to results without requiring an external service.
chDB leverages vectorized query execution to process data in batches, making full use of CPU parallelism. Unlike traditional databases that may read entire rows of data, chDB’s columnar format ensures that only the necessary columns are accessed during query execution. This reduces memory consumption and improves speed, especially for large datasets.
By scanning data directly without loading full tables into memory, chDB offers a significant performance advantage for ad-hoc queries and local processing tasks.
As demand grows for tools that simplify in-process analytics without requiring additional infrastructure, chDB stands out for its simplicity and power. By embedding an OLAP engine within applications, it bridges the gap between full database deployments and lightweight data exploration tools.
For developers building machine learning pipelines, internal dashboards, or analytical workflows, chDB provides a way to execute high-speed queries with minimal setup. Its design makes it a valuable option for local-first processing and in-process SQL analytics in modern development workflows.
As a lightweight, self-contained database for embedded systems and applications requiring local storage, SQLite is still essential to a modern stack.
Why Developers Still Choose SQLite:
Performance Insight: SQLite’s B-tree indexing ensures fast read/write access, though writes are serialized to a single writer at a time. For higher concurrency, developers can enable write-ahead logging (WAL) mode, which lets readers proceed while a write is in progress.
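The sketch below, using Python’s standard-library `sqlite3` bindings, shows how WAL mode is enabled and why it helps: after the pragma, a second connection can read committed data without being blocked by the writer (the file path, table name, and values are illustrative):

```python
import os
import sqlite3
import tempfile

# WAL requires a file-backed database; it has no effect on :memory: databases
path = os.path.join(tempfile.mkdtemp(), "app.db")

writer = sqlite3.connect(path)
# Switch the journal to write-ahead logging; the pragma returns the active mode
mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]

writer.execute("CREATE TABLE cache (key TEXT PRIMARY KEY, value TEXT)")
writer.execute("INSERT INTO cache VALUES ('session:42', 'alice')")
writer.commit()

# In WAL mode, a separate connection can read while the writer stays open
reader = sqlite3.connect(path)
value = reader.execute(
    "SELECT value FROM cache WHERE key = 'session:42'"
).fetchone()[0]
print(mode, value)
```

Note that WAL still allows only one writer at a time; it improves read concurrency, not parallel writes.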
Limitations: While great for single-user scenarios, SQLite may not be suitable for highly concurrent write operations due to the lack of native parallel write support.
Use Case: Offline-first mobile applications, local testing environments, and lightweight caching for microservices.
If you’re a developer looking for something fresh in embedded databases, Limbo is worth your attention. It’s a reimagining of SQLite, built from scratch in Rust for modern workloads. Limbo isn’t trying to replace SQLite’s simplicity; it amplifies it with memory safety, asynchronous operations, and performance built for cloud-native and serverless environments.
Traditional SQLite queries run synchronously, making them fast but limited when facing slow storage or network requests. Limbo rewrites the rules by embracing asynchronous I/O from the start. Instead of waiting for large reads or remote requests to finish, Limbo hands back control, letting your app stay responsive.
On Linux, it leverages io_uring, a high-performance API for asynchronous system calls, making it ideal for distributed apps where latency matters.
Limbo also prioritizes browser-friendly workflows with WASM support. This means you can run a full database in the browser or in a serverless function—without hacks or wrappers. Tools like Drizzle ORM already work seamlessly, making in-browser queries a first-class experience.
Instead of inheriting SQLite’s C-based testing suite, Limbo leans on Deterministic Simulation Testing (DST). DST simulates years of database operations within minutes, throwing thousands of edge cases at the system in controlled, repeatable environments. When bugs appear, they can be reproduced exactly—no more “works on my machine.”
The partnership with Antithesis takes this further by simulating system-level failures—like partial writes and disk interruptions—to ensure Limbo behaves predictably under real-world stress. This approach lets Limbo aim for the same ironclad reliability SQLite is known for, with the benefits of modern testing techniques.
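This is not Limbo’s actual test harness, but the idea behind deterministic simulation testing can be sketched in a few lines: drive a database with a pseudo-random workload from a fixed seed, check it against a simple in-memory model, and rely on the seed to replay any failure exactly (here SQLite’s standard-library bindings stand in for the system under test):

```python
import random
import sqlite3

def run_simulation(seed: int) -> list:
    """Replay a pseudo-random workload; the same seed always yields the same ops."""
    rng = random.Random(seed)  # all "randomness" flows from one seeded source
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v INTEGER)")
    model = {}  # trivial reference model the database must agree with

    for _ in range(1000):
        k, v = rng.randrange(50), rng.randrange(1000)
        if rng.random() < 0.7:
            conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (k, v))
            model[k] = v
        else:
            conn.execute("DELETE FROM kv WHERE k = ?", (k,))
            model.pop(k, None)

    rows = sorted(conn.execute("SELECT k, v FROM kv").fetchall())
    # Any divergence is reproducible: rerun with the same seed to see it again
    assert rows == sorted(model.items()), f"divergence at seed={seed}"
    return rows

# The same seed reproduces the exact same operation sequence and final state
assert run_simulation(42) == run_simulation(42)
```

Real DST harnesses go much further, simulating time, I/O, and fault injection, but the core property is the same: every run is a pure function of its seed.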
It’s faster where it matters. In benchmarks, Limbo has shown roughly 20% faster read performance compared to SQLite: a simple `SELECT * FROM users LIMIT 1` runs in 506 nanoseconds on an M2 MacBook Air, compared to 620 nanoseconds for SQLite.
Unlike SQLite, which often needs configuration tweaks (WAL mode, advisory locks) for optimal performance, Limbo delivers speed out of the box. By removing outdated or non-essential features, it stays lightweight while offering a more intuitive developer experience.
Whether you’re deploying cloud-native apps, serverless functions, or building browser-based tools, Limbo aligns with the demands of distributed systems.
Kestra empowers developers with an event-driven, declarative orchestration platform.

Why Kestra is Essential: workflows are declared in YAML, can mix batch and event-driven tasks, and come with retries, error handling, and parallel execution built in, so embedded-database queries slot naturally into larger pipelines.

Extended Example Kestra Workflow:
```yaml
id: embedded_databases
namespace: company.team

tasks:
  - id: chDB
    type: io.kestra.plugin.scripts.python.Script
    allowWarning: true
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    beforeCommands:
      - pip install chdb
    script: |
      import chdb

      data = chdb.query("""
          SELECT sum(total) as total, avg(quantity) as avg_quantity
          FROM url('https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv');
      """, 'PrettyCompact')
      print(data)

  - id: duckDB
    type: io.kestra.plugin.jdbc.duckdb.Query
    sql: |
      INSTALL httpfs;
      LOAD httpfs;

      SELECT sum(total) as total, avg(quantity) as avg_quantity
      FROM read_csv_auto('https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv', header=True);
    fetchType: FETCH
```

Advanced Configuration: Kestra also supports retries, error handling, and parallel task execution, making it easy to build robust data pipelines.
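To illustrate that retry support, the fragment below is a sketch of a constant-backoff retry policy attached to a query task (the interval and attempt count are illustrative values, not recommendations):

```yaml
  - id: duckDB
    type: io.kestra.plugin.jdbc.duckdb.Query
    retry:
      type: constant    # retry at a fixed interval
      interval: PT30S   # wait 30 seconds between attempts
      maxAttempt: 3     # give up after three tries
    sql: |
      SELECT 1;
```

Exponential and random backoff behaviors are also available, which helps when the failing dependency is a remote service rather than a local file.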
Use Case: Building a real-time recommendation system pipeline that processes raw sales data, aggregates insights, and exports outputs for downstream APIs.
Kestra’s ability to mix batch and event-driven tasks in one pipeline means developers can easily adapt to complex data processing needs.
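As a sketch of what that mix looks like in practice, the fragment below attaches two triggers to a flow: a cron schedule for batch runs and a webhook for event-driven runs (the trigger ids, cron expression, and key are illustrative):

```yaml
triggers:
  # Batch: run the flow every hour on a cron schedule
  - id: hourly
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 * * * *"

  # Event-driven: run the flow when an external system calls the webhook URL
  - id: on_event
    type: io.kestra.plugin.core.trigger.Webhook
    key: replace_with_secret_key
```

The same tasks run in both cases, so one pipeline definition serves scheduled backfills and real-time events alike.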
Future Trends: Expect continued convergence of OLAP and OLTP, improved support for multi-cloud, advancements in distributed computing, and open-source OLAP engines gaining even more traction. The rise of data mesh architectures may also influence how developers design workflows, emphasizing decentralized data ownership and interoperability.
If you have any questions, reach out via Slack or open a GitHub issue.
If you like the project, give us a GitHub star and join the community.
Stay up to date with the latest features and changes to Kestra.