Pandas ETL - passing data between Python script tasks running in separate containers

Source

yaml

id: wdir-pandas-python-outputs
namespace: company.team

tasks:
  - id: etl
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: extract_csv
        type: io.kestra.plugin.scripts.python.Script
        taskRunner:
          type: io.kestra.plugin.scripts.runner.docker.Docker
        containerImage: ghcr.io/kestra-io/pydata:latest
        script: |
          import pandas as pd
          data = {
              'Column1': ['A', 'B', 'C', 'D'],
              'Column2': [1, 2, 3, 4],
              'Column3': [5, 6, 7, 8]
          }
          df = pd.DataFrame(data)
          print(df.head())
          df.to_csv("raw_data.csv", index=False)

      - id: transform_and_load_csv
        type: io.kestra.plugin.scripts.python.Script
        taskRunner:
          type: io.kestra.plugin.scripts.runner.docker.Docker
        containerImage: ghcr.io/kestra-io/pydata:latest
        outputFiles:
          - final.csv
        script: |
          import pandas as pd
          df = pd.read_csv("raw_data.csv")
          df['Column4'] = df['Column2'] + df['Column3']
          print(df.head())
          df.to_csv("final.csv", index=False)

About this blueprint

Python

This flow demonstrates how to use the WorkingDirectory task to persist data between multiple Python script tasks running in separate containers.

The first task stores the data as a CSV file called "raw_data.csv". The second task loads the CSV file, transforms it and outputs the final CSV file to Kestra's internal storage. You can download that file from the Outputs tab on the Execution's page.

Kestra's internal storage allows you to use the output in other tasks in the flow, even if those tasks are processed in different containers. The final result file will also be available as a downloadable artifact, making it easier to share the results of your data workflow with various business stakeholders.

Working Directory

Script

Docker

More Related Blueprints

PythonCLIGit

Clone a GitHub repository and run a Python ETL script

PythonCLIDevOpsGitAWS

Run multiple Python scripts in parallel on AWS ECS Fargate with AWS Batch

Python

Run a SQL query on Databricks, output the result to a CSV file and read that CSV file in a Python script with Pandas

New to Kestra?

Use blueprints to kickstart your first workflows.

Get started with Kestra