How to Create a Data Pipeline: A Step-by-Step Guide
Creating a data pipeline takes more than stitching scripts together. Learn the five-step process, the three core stages (ingestion, transformation, loading), the main tools, and how to build a pipeline that survives production.
Every analytics dashboard, machine learning model, and operational report depends on a data pipeline somewhere upstream. The question most teams face isn’t whether to build one — it’s how to build one that doesn’t break at 3 a.m. and require an engineer to restart it manually.
Creating a data pipeline is more than writing a script that copies data from A to B. A production pipeline needs scheduling, retries, monitoring, lineage, and a way to recover when sources change shape. This guide covers the full process: what a data pipeline is, the three stages it goes through, the five-step build process, the tools you’ll choose between, and a concrete YAML example to anchor the theory.
What Is a Data Pipeline?
A data pipeline is an automated sequence of steps that moves data from source systems, transforms it according to business rules, and loads it into a destination. The sources can be transactional databases, SaaS APIs, event streams, or files. The destination can be a warehouse, a lake, a feature store, or an operational system.
What separates a data pipeline from a one-off script is repeatability. A pipeline runs on a trigger (schedule, event, or dependency), handles failures gracefully, and produces consistent output. The system wrapping the pipeline — the orchestration layer — is what turns a collection of steps into infrastructure that can be trusted in production.
For a deeper look at pipeline architecture and patterns, see What Is a Data Pipeline? Definition, Architecture, and Tools.
The Three Main Stages of a Data Pipeline
The three main stages are ingestion (extracting from sources), transformation (cleaning, joining, applying business logic), and loading (writing to the destination). ELT pipelines reverse the last two — load raw data first, transform inside the warehouse. ETL transforms before loading. Both are valid depending on compliance and cost constraints.
Ingestion — Getting Data In
Ingestion pulls data out of source systems. The method depends on how fresh downstream consumers need the data:
- Batch — pull data on a schedule (hourly, daily). Good enough for most analytics, easy to reason about.
- Change Data Capture (CDC) — replicate database changes continuously using tools like Debezium or native connectors.
- Event streaming — consume from Kafka, Kinesis, or Pub/Sub when latency requirements are measured in seconds, not hours.
- API polling — pull from SaaS APIs on a schedule, handling pagination and rate limits.
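The API-polling pattern above can be sketched in a few lines. This is a minimal, hypothetical example: the response shape (`items` plus a `next_page` cursor) and the injected `fetch` callable are assumptions, not a real API, which also makes the pagination logic easy to test without a network.

```python
import time
from typing import Callable, Iterator


def poll_paginated(fetch: Callable[[int], dict], max_rps: float = 2.0) -> Iterator[dict]:
    """Pull every page from a paginated API while respecting a simple rate limit.

    `fetch(page)` is assumed to return {"items": [...], "next_page": int | None};
    adapt the cursor handling to whatever your real API returns.
    """
    page: int | None = 1
    delay = 1.0 / max_rps
    while page is not None:
        resp = fetch(page)
        yield from resp["items"]
        page = resp.get("next_page")
        if page is not None:
            time.sleep(delay)  # crude rate limiting between requests
```

In production the `fetch` callable would wrap `requests.get` with auth headers and retry handling; injecting it keeps pagination logic separate from transport concerns.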
Managed tools like Airbyte and Fivetran handle most ingestion patterns out of the box. Custom Python is still common for APIs with unusual auth or non-standard pagination.
Transformation — Making Data Useful
Transformation reshapes raw data into something analytics-ready. Typical operations: deduplication, type casting, joining across sources, aggregating metrics, applying business rules, redacting sensitive fields.
Where transformation happens is a design choice. Warehouse-native SQL (via dbt) is the dominant pattern in 2026 — cheap compute, version-controlled logic, native testing. Python and Spark handle the cases SQL can’t: unstructured data, ML preprocessing, complex window logic.
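To make the typical operations concrete, here is a plain-Python sketch of deduplication, type casting, and redaction on a batch of records. The field names (`order_id`, `amount`, `updated_at`, `email`) are illustrative assumptions; in a warehouse-native setup the same logic would live in a dbt SQL model.

```python
def transform(rows: list[dict]) -> list[dict]:
    """Deduplicate by key, cast types, and redact a sensitive field."""
    latest: dict = {}
    for row in rows:
        # Deduplicate: keep the most recent record per order_id
        key = row["order_id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return [
        {
            "order_id": int(r["order_id"]),          # type casting
            "amount": round(float(r["amount"]), 2),  # normalize to 2 decimals
            "updated_at": r["updated_at"],
            "email": "REDACTED",                     # redact sensitive field
        }
        for r in latest.values()
    ]
```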
Loading — Delivering to the Destination
Loading writes the final data where consumers can use it. Warehouses (Snowflake, BigQuery, Redshift) dominate analytics. Lakehouses (Databricks, Iceberg on S3) are growing for mixed SQL + ML workloads. Operational systems (CRMs, marketing tools) receive data through reverse ETL when the pipeline’s output needs to drive action, not just reporting.
Loading strategy matters: full refresh, incremental loads, and upserts each have different trade-offs around cost, latency, and consistency.
How to Create a Data Pipeline in Five Steps
Building a data pipeline follows a repeatable sequence. Most teams that skip steps end up rebuilding the pipeline within a year.
Step 1 — Define Goals and Requirements
Start with the consumer. Who uses the data, how often do they need it, and what’s the cost of it being late or wrong? These answers set the non-negotiables: latency target, freshness SLA, acceptable failure modes, observability requirements. Without this step, every downstream decision is guesswork.
Step 2 — Select Data Sources and Ingestion Patterns
List every source and classify each by access pattern: batch-pullable API, CDC-capable database, streaming feed, file drop. For each, decide between full refresh and incremental ingestion. Incremental wins on cost and speed at scale, but requires a reliable high-watermark or primary key on the source side.
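The high-watermark mechanic mentioned above can be sketched as follows. It assumes a monotonically increasing `updated_at` column on the source (a common but not universal property; late-arriving updates need a lookback window in practice).

```python
def incremental_pull(source_rows: list[dict], last_watermark: int) -> tuple[list[dict], int]:
    """Return only rows changed since the last run, plus the new watermark
    to persist for the next run."""
    fresh = [r for r in source_rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=last_watermark)
    return fresh, new_watermark
```

Each run stores `new_watermark` somewhere durable (a state table, the orchestrator's KV store) and passes it back in on the next run.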
Step 3 — Design Processing and Storage
Decide where transformations run. In-warehouse with SQL and dbt is the default for structured data. Out-of-warehouse with Python, Polars, or Spark is right for unstructured data, complex ML preprocessing, or when keeping compute off the warehouse matters for cost. Either way, structure the storage layer in distinct zones — raw, staging, curated — so lineage stays clear and debugging stays possible.
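One lightweight way to enforce the zone structure is to derive every storage path from the zone name, as in this sketch. The layout (zone / source / date partition) is an assumed convention, not a standard; the point is that a dataset's lineage stage is visible from its location alone.

```python
from pathlib import PurePosixPath

ZONES = ("raw", "staging", "curated")


def zone_path(zone: str, source: str, run_date: str) -> PurePosixPath:
    """Build a predictable, date-partitioned storage path per zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return PurePosixPath(zone) / source / f"dt={run_date}" / "data.parquet"
```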
Step 4 — Choose the Destination
Match the destination to the consumer. BI dashboards → warehouse. ML models → feature store or warehouse with point-in-time correctness. Operational systems → reverse ETL to CRM or marketing platform. Many pipelines have multiple destinations and fan out from a single transformation layer.
Step 5 — Orchestrate the Flow
This is where most first pipelines fail. Without orchestration, the pipeline is a collection of scripts glued together with cron. Orchestration handles triggers (schedule, event, webhook, file-drop), dependencies between tasks, retry logic on failure, alerting, and execution history. It’s the difference between a pipeline that runs in production and a pipeline that breaks in production.
A Concrete Example — Building a Pipeline End to End
Here’s what a production pipeline looks like in Kestra: ingest from an API, transform with dbt, load into BigQuery, notify Slack. Everything in one YAML file, triggered on a schedule, with retries and error handling baked in.
```yaml
id: create_data_pipeline_example
namespace: company.data

triggers:
  - id: daily_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 6 * * *"

tasks:
  - id: ingest_from_api
    type: io.kestra.plugin.scripts.python.Script
    script: |
      import requests, pandas as pd
      r = requests.get(
          "https://api.example.com/orders",
          headers={"Authorization": f"Bearer {{ secret('API_TOKEN') }}"}
      )
      df = pd.DataFrame(r.json())
      df.to_parquet("orders.parquet")
    retry:
      type: exponential
      maxAttempts: 3
      interval: PT1M

  - id: load_to_warehouse
    type: io.kestra.plugin.gcp.bigquery.Load
    from: "{{ outputs.ingest_from_api.outputFiles['orders.parquet'] }}"
    destinationTable: "raw.orders"
    format: PARQUET

  - id: transform_with_dbt
    type: io.kestra.plugin.dbt.cli.DbtCLI
    commands:
      - dbt deps
      - dbt build --select tag:daily

  - id: notify_team
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    url: "{{ secret('SLACK_WEBHOOK') }}"
    payload: '{"text": "Daily pipeline completed ✅"}'

errors:
  - id: alert_on_failure
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    url: "{{ secret('SLACK_WEBHOOK') }}"
    payload: '{"text": "⚠️ Pipeline failed — check logs"}'
```

Five tasks, one trigger, error handling, and secrets management — in one declarative file that any engineer can read on day one. That’s what “orchestrated” looks like in practice.
Tools for Creating Data Pipelines
The pipeline tool landscape splits into two categories: orchestrators (what runs your pipelines) and managed ETL/ELT vendors (what moves data through them). Most production setups use both.
| Tool | Category | Interface | Cloud-agnostic | Best for |
|---|---|---|---|---|
| Kestra | Orchestrator | YAML + UI | ✅ | Multi-language teams, event-driven, hybrid stacks |
| Airflow / Astronomer | Orchestrator | Python DAGs | ✅ | Python-native data teams |
| Dagster | Orchestrator | Python + assets | ✅ | Asset-centric modeling, Python-first |
| AWS Glue / Step Functions | Orchestrator | Visual / JSON | ❌ (AWS) | AWS-native stacks |
| Airbyte / Fivetran | Managed ingestion | Connectors | ✅ | Offloading connector maintenance |
| dbt | Transformation | SQL + Jinja | ✅ | Warehouse-native transforms |
For most teams building a pipeline today, the pragmatic stack is: Airbyte or Fivetran for ingestion, dbt for transformation, and a general orchestrator to tie them together with event triggers, retries, and observability.
Types of Data Pipelines — Batch, Streaming, Event-Driven
Three patterns dominate in 2026. Most teams run a mix.
- Batch — fixed schedule, bounded data per run. Easiest to reason about. Default for analytics where freshness within hours is acceptable.
- Streaming — continuous processing, latency in seconds or milliseconds. Used for fraud detection, real-time personalization, operational dashboards. Higher complexity around ordering, state, and exactly-once guarantees.
- Event-driven — triggers on events (file lands in S3, webhook fires, message on a queue) rather than a schedule. Eliminates the “wait until next run” tax of batch without the full complexity of streaming.
Event-driven is increasingly the default pattern for modern data teams. See the deeper breakdown in What Is a Data Pipeline? for when each pattern fits.
Best Practices for Production Pipelines
Six practices separate pipelines that scale from pipelines that rot:
- Idempotent tasks — the same pipeline run twice should produce the same output. Re-runs are the most common recovery mechanism; non-idempotent tasks turn recovery into a minefield.
- Incremental processing — ingest and transform only what’s changed since the last run. Full-refresh pipelines don’t scale past a few million rows.
- Data quality tests — assertion tests (dbt tests, Great Expectations) catch regressions before they reach dashboards. Cheaper to fix upstream than to explain to a CFO why Q3 revenue was wrong.
- Version control — every pipeline definition lives in Git, reviewed like application code. The pipeline is infrastructure; treat it that way.
- Observability — logs alone aren’t enough. Per-task metrics, data volumes, latency, and SLA tracking are what let you answer “what broke?” in minutes instead of hours.
- Secrets management — credentials in a vault, not in the pipeline code. Rotatable, auditable, and not committed to Git.
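The first practice above, idempotency, deserves a concrete sketch. The overwrite-the-partition pattern below models storage as a dict for illustration (partition keys and row shapes are assumptions); the real-world equivalent is `INSERT OVERWRITE` or a partition-scoped delete-then-write, never a blind append.

```python
def load_partition(store: dict, partition: str, rows: list[dict]) -> None:
    """Idempotent load: replace the target partition wholesale instead of
    appending, so re-running the same task for the same partition always
    produces the same state."""
    store[partition] = list(rows)
```

Because the write is keyed to the partition, retrying a failed run (or backfilling an old date) is safe by construction: run it twice and the table looks exactly as if it ran once.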
Getting Started
Creating a data pipeline is less about the specific tools and more about the system around them. The orchestrator you pick, the way you structure storage, and the observability you bake in from day one shape whether the pipeline is still running reliably a year later.
For teams evaluating orchestration, Kestra is open-source, self-hostable, and ships with ready-to-use blueprints for the most common pipeline patterns. Start with a ready-made data pipeline blueprint, explore the data orchestration guide, or read about the broader declarative orchestration approach.