Process partitions in parallel in Python, report outputs and metrics about number of rows and processing time for all partitioned files

Source

yaml

id: python-partitions-metrics
namespace: company.team
description: Process partitions in parallel

tasks:
  - id: get_partitions
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    script: |
      from kestra import Kestra
      partitions = [f"file_{nr}.parquet" for nr in range(1, 10)]
      Kestra.outputs({'partitions': partitions})

  - id: process_partitions
    type: io.kestra.plugin.core.flow.ForEach
    concurrencyLimit: 0
    values: "{{ outputs.get_partitions.vars.partitions }}"
    tasks:
      - id: partition
        type: io.kestra.plugin.scripts.python.Script
        taskRunner:
          type: io.kestra.plugin.scripts.runner.docker.Docker
        containerImage: ghcr.io/kestra-io/pydata:latest
        script: |
          import random
          import time
          from kestra import Kestra

          filename = '{{ taskrun.value }}'
          print(f"Reading and processing partition {filename}")
          nr_rows = random.randint(1, 1000)
          processing_time = random.randint(1, 20)
          time.sleep(processing_time)
          Kestra.counter('nr_rows', nr_rows, {'partition': filename})

          Kestra.timer('processing_time', processing_time, {'partition':
          filename})

About this blueprint

Python Kestra

This flow extracts a list of partitions and then processes each partition in parallel in isolated Python scripts, each running in a separate Docker container. The flow will then track the number of rows and the processing time for each partition, which you can inspect in the Metrics tab.

Script

Docker

For Each

More Related Blueprints

PythonKestra

Run specific tasks only on business days for a specific country

PythonKestra

Extract a zip file, decompress it, and convert CSV to parquet format in Python

PythonKestra

Run a Python script and generate outputs, metrics and files specified with a variable

New to Kestra?

Use blueprints to kickstart your first workflows.

Get started with Kestra