ETL Workflow: How to Design, Orchestrate, and Scale Data Integration
ETL workflows are more than the three-letter acronym. Learn how to design, orchestrate, and operate production-grade ETL with modern declarative tools.
“ETL” is a three-letter acronym that underpins nearly every data stack, but a robust ETL workflow is far more complex than just “extract, transform, load.” It involves intricate scheduling, robust error handling, version control, and comprehensive observability. Many teams grapple with legacy ETL systems that are brittle, hard to debug, and difficult to scale.
This article cuts through the marketing fluff to explore how senior data engineers design, orchestrate, and operate production-grade ETL in 2026. We’ll look at the architectural patterns that endure, the trade-offs between ETL and ELT, and how declarative orchestration can simplify the most challenging aspects of data integration.
ETL Beyond the Acronym: What a Workflow Truly Means
The term “ETL” is often used loosely, leading to confusion. To build reliable systems, it’s critical to distinguish between three related concepts:
- ETL (The Concept): This is the data integration pattern of Extracting data from a source, Transforming it into a desired format or structure, and Loading it into a destination system. It’s a strategic approach.
- ETL Pipeline (The Implementation): This is the code that executes the ETL pattern. It could be a set of Python scripts, SQL queries, or Spark jobs. This is the “what”—the specific logic that moves and changes data.
- ETL Workflow (The Orchestration): This is the system that manages the execution of one or more ETL pipelines. It handles scheduling, dependencies, retries, monitoring, and alerting. The workflow is the “how”—the operational layer that turns a simple script into a reliable, production-grade process.
Focusing only on the pipeline code is a common mistake. A robust ETL strategy depends on a well-designed workflow that provides reliability and visibility, moving beyond simple data orchestration for analytics to encompass the full operational lifecycle.
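To make the distinction concrete, here is a minimal sketch of what "pipeline code" on its own typically looks like (the table, file names, and SQLite source are assumptions for illustration, not any specific system). Everything the workflow layer adds, such as scheduling, retries, and alerting, sits outside this function:

```python
# A minimal, illustrative ETL pipeline: just the "what".
# Scheduling, retries, monitoring, and alerting (the workflow layer)
# are deliberately absent; an orchestrator supplies them.
import csv
import sqlite3


def run_pipeline(source_db: str, target_csv: str) -> None:
    # Extract: read rows from a source table (SQLite here for simplicity)
    conn = sqlite3.connect(source_db)
    rows = conn.execute("SELECT user_id, email, signup_date FROM users").fetchall()
    conn.close()

    # Transform: derive a signup_year column from the signup date string
    transformed = [
        (user_id, email, signup_date, signup_date[:4])
        for user_id, email, signup_date in rows
    ]

    # Load: write the result to a CSV file
    with open(target_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "email", "signup_date", "signup_year"])
        writer.writerows(transformed)


if __name__ == "__main__":
    run_pipeline("users.db", "users_transformed.csv")
```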
The Stages of a Production-Grade ETL Workflow
While “Extract, Transform, Load” provides a neat summary, production workflows include several more implicit stages that are essential for data quality and operational stability.
Extract: Sourcing Data Reliably
The first step is to pull data from its source. This isn’t just a simple read operation. Production systems must handle:
- Connectivity: Securely connecting to a variety of sources, from relational databases and REST APIs to file systems like S3 or FTP servers.
- Loading Strategy: Deciding between full loads (copying the entire dataset) and incremental loads (capturing only changes since the last run). Incremental loads, often using Change Data Capture (CDC) techniques, are more efficient for large datasets.
- Data Contracts: Ensuring the source data schema hasn’t changed unexpectedly, which could break downstream transformations.
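As a rough sketch of how an incremental extract with a lightweight schema check might look (the table, columns, and watermark file are assumptions made for this example):

```python
# Illustrative incremental extract: pull only rows newer than the last
# recorded watermark, and fail fast if the source schema has drifted.
import json
import sqlite3
from pathlib import Path

EXPECTED_COLUMNS = {"user_id", "email", "signup_date"}  # assumed data contract
WATERMARK_FILE = Path("last_watermark.json")


def extract_incremental(source_db: str) -> list[tuple]:
    conn = sqlite3.connect(source_db)

    # Data contract check: verify the expected columns still exist
    actual = {row[1] for row in conn.execute("PRAGMA table_info(users)")}
    missing = EXPECTED_COLUMNS - actual
    if missing:
        raise RuntimeError(f"Source schema changed, missing columns: {missing}")

    # Incremental load: only fetch rows past the stored watermark
    last = "1970-01-01"
    if WATERMARK_FILE.exists():
        last = json.loads(WATERMARK_FILE.read_text())["signup_date"]

    rows = conn.execute(
        "SELECT user_id, email, signup_date FROM users WHERE signup_date > ?",
        (last,),
    ).fetchall()
    conn.close()

    # Advance the watermark only if new rows were actually pulled
    if rows:
        newest = max(r[2] for r in rows)
        WATERMARK_FILE.write_text(json.dumps({"signup_date": newest}))
    return rows
```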
Transform & Cleanse: Shaping Data for Value
This is where the raw data becomes useful. The transformation stage is often the most complex and can involve multiple sub-steps:
- Cleansing: Handling null values, correcting data types, and removing duplicates.
- Enrichment: Joining data with other datasets to add context or new attributes.
- Business Logic: Applying calculations, aggregations, or custom rules to meet business requirements.
- Structuring: Reshaping the data, such as pivoting or unpivoting, to fit the schema of the target data warehouse.
Tools like dbt are frequently used for SQL-based transformations, while Apache Spark is a common choice for processing very large datasets in a distributed manner.
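As a minimal pandas sketch of the cleansing, enrichment, and business-logic steps described above (the column names and the reference dataset are assumptions for illustration):

```python
# Illustrative transform step: cleanse, enrich, and apply business logic
# with pandas. Column names and the country reference data are assumed.
import pandas as pd


def transform(users: pd.DataFrame, countries: pd.DataFrame) -> pd.DataFrame:
    # Cleansing: drop duplicates, fix types, handle nulls
    users = users.drop_duplicates(subset="user_id")
    users["signup_date"] = pd.to_datetime(users["signup_date"], errors="coerce")
    users = users.dropna(subset=["signup_date"])

    # Enrichment: join a reference dataset to add regional context
    users = users.merge(countries, on="country_code", how="left")

    # Business logic: derive a cohort column used by downstream reports
    users["signup_cohort"] = users["signup_date"].dt.to_period("M").astype(str)
    return users
```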
Load: Delivering to the Destination
Once transformed, the data is loaded into its final destination. Key considerations here include:
- Target System: This could be a cloud data warehouse like Snowflake or BigQuery, a data lake, or another operational application.
- Loading Pattern: Data can be appended, overwritten, or upserted (updating existing records and inserting new ones).
- Idempotency: Ensuring that running the same load process multiple times with the same data produces the exact same result. This is critical for recovery after a failure.
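A minimal sketch of an idempotent upsert, using SQLite's `INSERT ... ON CONFLICT` as a stand-in for whatever MERGE or upsert syntax your target warehouse provides (table and key names are assumptions):

```python
# Illustrative idempotent load: re-running the same batch produces the
# same final state, because each row is keyed and upserted, not appended.
import sqlite3


def load(target_db: str, rows: list[tuple]) -> None:
    conn = sqlite3.connect(target_db)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS dim_users (
               user_id INTEGER PRIMARY KEY,
               email TEXT,
               signup_date TEXT
           )"""
    )
    # Upsert: update existing records, insert new ones
    conn.executemany(
        """INSERT INTO dim_users (user_id, email, signup_date)
           VALUES (?, ?, ?)
           ON CONFLICT(user_id) DO UPDATE SET
               email = excluded.email,
               signup_date = excluded.signup_date""",
        rows,
    )
    conn.commit()
    conn.close()
```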
Validate & Monitor: Ensuring Data Quality and Health
These stages run throughout the workflow and are what separate hobby projects from enterprise systems.
- Validation: Running checks before and after loading to ensure data quality. This could be as simple as checking row counts or as complex as running statistical tests for anomalies.
- Monitoring: Tracking the health of the workflow itself. Are jobs running on time? Are there performance bottlenecks? Are failure rates increasing? This is where observability and alerting become crucial.
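As a rough illustration, post-load validation can start with cheap assertions like the ones below (the thresholds and column names are assumptions; dedicated testing frameworks go much further):

```python
# Illustrative post-load validation: simple checks that catch the most
# common failure modes before bad data reaches consumers.
import pandas as pd


def validate(df: pd.DataFrame, expected_min_rows: int = 1) -> None:
    # Row-count check: an empty or suspiciously small load is a red flag
    if len(df) < expected_min_rows:
        raise ValueError(f"Expected at least {expected_min_rows} rows, got {len(df)}")

    # Null check on a business-critical column
    null_emails = df["email"].isna().sum()
    if null_emails > 0:
        raise ValueError(f"{null_emails} rows have a null email")

    # Uniqueness check on the primary key
    if df["user_id"].duplicated().any():
        raise ValueError("Duplicate user_id values found after load")
```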
ETL vs. ELT: Choosing the Right Data Integration Pattern
A common alternative to ETL is ELT (Extract, Load, Transform). In this pattern, raw data is loaded directly into the data warehouse first, and all transformations happen inside the warehouse using its powerful SQL engine.
Neither pattern is universally better; they serve different needs.
| Aspect | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
|---|---|---|
| Transformation Location | In a separate processing engine (e.g., Spark, Python script) before loading. | Inside the target data warehouse using its compute resources. |
| Data in Warehouse | Only clean, transformed data is stored. | Raw, untransformed data is available alongside transformed models. |
| Use Cases | Good for sensitive data (PII can be removed before loading), complex pre-processing, or when the target system has limited compute power. | Ideal for modern cloud data warehouses, agile development (analysts can transform raw data themselves), and situations where future use cases for raw data are unknown. |
| Tooling | Requires a distinct transformation tool or custom code. | Leverages the data warehouse’s SQL engine, often managed with a tool like dbt. |
The choice is strategic. ETL is still highly relevant for compliance and performance reasons, while ELT offers greater flexibility for modern analytics. For a deeper dive, see our comparison of ETL vs. ELT.
Operational Challenges of Scaling ETL Workflows
Building an ETL script that works once is easy. Building an ETL workflow that runs reliably every day for years is hard. As you scale, you’ll inevitably face these operational hurdles:
- Scheduling & Triggers: Simple cron schedules are a starting point, but production systems need more. Event-driven triggers—such as a new file arriving in S3, an API webhook, or the completion of an upstream workflow—are essential for creating responsive and efficient data systems.
- Error Handling & Retries: What happens when a network connection drops or an API returns a
503error? A robust workflow needs automatic retries with configurable backoff strategies and dead-letter queues to handle unrecoverable failures gracefully. - Backfills & Idempotency: You will eventually need to re-process historical data. Your workflow must be idempotent to handle this without creating duplicate records or corrupting state.
- Observability & Alerting: When a workflow fails at 3 AM, you need more than just a failure status. You need detailed logs, performance metrics, data lineage, and proactive alerts that tell you why it failed.
- Versioning & Rollbacks: ETL logic changes. Workflows must be version-controlled (ideally in Git) so that you can see a history of changes, review new proposals, and quickly roll back to a previous version if a deployment goes wrong.
- Resource Management: A transformation job might need significant CPU and memory for a short period. A scalable workflow system must manage these compute resources efficiently, provisioning them on-demand and releasing them when done.
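As referenced in the error-handling item above, here is a minimal sketch of retries with exponential backoff in plain Python. A production orchestrator typically provides this behavior declaratively, so you rarely hand-roll it, but the underlying logic looks roughly like this:

```python
# Illustrative retry helper with exponential backoff. Orchestrators
# usually expose this as configuration; this sketch shows the mechanics.
import time


def with_retries(fn, max_attempts: int = 5, base_delay: float = 2.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: surface the failure (or route the payload
                # to a dead-letter queue for later inspection)
                raise
            delay = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```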
Modern ETL with Declarative Orchestration
Addressing these operational challenges is the core job of an orchestrator. The modern approach favors declarative orchestration, where you define the what (the desired state of your workflow) in a configuration file, and the engine handles the how.
Why Declarative YAML for ETL Workflows?
Defining workflows in a declarative format like YAML, as opposed to a general-purpose programming language like Python, offers significant operational advantages:
- Reviewability: A YAML file is a simple data structure. A `git diff` is unambiguous, making code reviews faster and safer. You can see exactly what changed—a timeout, a new task, a different schedule—without having to mentally execute code.
- Simplified Rollbacks: A bad deployment is a `git revert` away. There’s no complex application state to manage.
- Language-Agnosticism: The orchestration layer is separate from the execution layer. Your ETL workflow can seamlessly orchestrate tasks written in SQL, Python, R, Shell, or any other language, allowing teams to use the best tool for each job.
- Separation of Concerns: Business logic (the Python or SQL script) is kept separate from orchestration logic (the YAML file). This makes both easier to manage, test, and maintain.
Integrating with the Modern Data Stack
A declarative orchestrator acts as the central nervous system for the modern data stack. Instead of building a monolithic ETL application, you can compose a workflow from best-in-class tools:
- Extract & Load: Use a connector-based tool like Airbyte or Fivetran.
- Transform: Use dbt to manage SQL-based transformations.
- Orchestrate: Use Kestra to trigger the Airbyte sync, wait for it to complete, run the dbt models, and then run data quality checks, all in one coordinated workflow.
This modular approach, exemplified in blueprints like the Airbyte Cloud and dbt Core integration, allows each component to be updated independently and makes the entire system more resilient.
Kestra for Production ETL Workflows
Kestra is an open-source, declarative orchestration platform designed to solve these modern ETL challenges. It allows data engineers to define complex, resilient workflows as simple YAML files.
- YAML-Defined Flows: Kestra’s core is declarative. Every workflow is a YAML file, enabling GitOps practices for all your data processes.
- Polyglot Tasks: With a vast library of plugins, you can run anything from a shell command or Python script to a dbt model or Spark job as a native task.
- Event-Driven Triggers: Kestra has built-in triggers for schedules, webhooks, file detection, and more, making it easy to build reactive ETL workflows.
By combining a declarative interface with a powerful, language-agnostic execution engine, Kestra provides a robust foundation for building and scaling ETL pipelines across your entire organization. To explore more patterns and solutions, you can browse our full collection of data resources or learn how Kestra can power your data platform.
Example: An Opinionated Kestra ETL Workflow
Here’s what a simple ETL workflow—extracting from Postgres, transforming with Python, and loading to S3—looks like in Kestra:
```yaml
id: postgres-etl-to-s3
namespace: company.team.production

description: Extracts new user data from Postgres, performs a simple transformation, and uploads it as a CSV to S3.

tasks:
  - id: extract_from_postgres
    type: io.kestra.plugin.jdbc.postgresql.Query
    sql: "SELECT user_id, email, signup_date FROM users WHERE signup_date >= '{{ trigger.date }}';"
    store: true # Stores the result in Kestra's internal storage

  - id: transform_with_python
    type: io.kestra.plugin.scripts.python.Script
    inputFiles:
      data.json: "{{ outputs.extract_from_postgres.uri }}"
    script: |
      import pandas as pd
      import json

      with open('data.json', 'r') as f:
          data = [json.loads(line) for line in f]

      df = pd.DataFrame(data)
      df['signup_year'] = pd.to_datetime(df['signup_date']).dt.year
      df.to_csv('transformed_users.csv', index=False)

  - id: load_to_s3
    type: io.kestra.plugin.aws.s3.Upload
    from: "{{ outputs.transform_with_python.outputFiles['transformed_users.csv'] }}"
    bucket: my-data-bucket
    key: "processed/users/{{ execution.startDate | date('yyyy-MM-dd') }}.csv"

triggers:
  - id: daily_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 5 * * *" # Runs daily at 5 AM UTC
```

This single YAML file defines the entire workflow, including its schedule, dependencies between tasks, and how data flows from one step to the next.