Managing pip Package Dependencies

Learn how to manage pip package dependencies in your flows.

Motivation

Your Python code may require pip package dependencies. How you manage these dependencies can affect the execution time of your flows.

If you install pip packages within beforeCommands, the packages will be downloaded and installed each time the task runs. This can significantly increase the duration of workflow executions. The following sections describe several ways to manage pip package dependencies efficiently in your flows.
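
For reference, the pattern to avoid looks roughly like the sketch below. The flow id is illustrative; the task type and the beforeCommands property are the same ones used in the examples on this page:

```yaml
id: install_at_runtime
namespace: company.team

tasks:
  - id: code
    type: io.kestra.plugin.scripts.python.Script
    beforeCommands:
      # Runs on every execution, so pandas is downloaded and installed each time
      - pip install pandas
    script: |
      import pandas as pd
      print(pd.__version__)
```

Each run of this flow pays the installation cost again before the script starts.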

Using a custom Docker image

Instead of using the base Python Docker image and installing dependencies through beforeCommands, you can create a custom Docker image that includes Python and all required pip packages. Since the dependencies are built into the image, they do not need to be downloaded and installed at runtime. This removes the installation step from every run, so execution time is spent on your Python code rather than on package setup.

For example, if your Python script depends on pandas, you can use a container image that already includes it, such as ghcr.io/kestra-io/pydata:latest. This eliminates the need to install dependencies using beforeCommands:

yaml
id: docker_dependencies
namespace: company.team

tasks:
  - id: code
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    script: |
      import pandas as pd

      df = pd.read_csv('https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv')
      total_revenue = df['total'].sum()
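
Before relying on a prebuilt image, it can help to confirm which packages it actually ships. The following is a minimal sketch that reuses the Commands task and the pip list command shown elsewhere on this page, pointed at the Docker Task Runner. The flow id is illustrative, and the sketch assumes pip is available on the image's PATH:

```yaml
id: list_image_dependencies
namespace: company.team

tasks:
  - id: check
    type: io.kestra.plugin.scripts.python.Commands
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    commands:
      # Lists the packages baked into the image; nothing is installed at runtime
      - pip list
```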

Installing pip package dependencies at server startup

Another way to avoid installing dependencies during every execution is to preinstall them before starting the Kestra server. For a standalone Kestra server, you can run:

bash
pip install requests pandas polars && ./kestra server standalone --worker-thread=16

If you are running Kestra with Docker, create a Dockerfile and install dependencies using the RUN command. Set the USER to root to allow package installation:

dockerfile
FROM kestra/kestra:latest

USER root
RUN pip install requests pandas polars

CMD ["server", "standalone"]

In your Docker Compose configuration, replace the image property with build: . to use your custom Dockerfile instead of the official image from Docker Hub. Also, remove the command property, since the CMD instruction in your Dockerfile now handles it:

yaml
services:
  ...
  kestra:
    build: .
    ...

When you start Kestra using Docker Compose, the Python dependencies will already be included in the container.

With either installation method, you must run Python tasks using the Process Task Runner so that the code can access the dependencies installed on the Kestra server. Tasks that use the Docker Task Runner execute in a separate container and do not see those packages.

You can verify that the dependencies are installed with the following example:

yaml
id: list_dependencies
namespace: company.team

tasks:
  - id: check
    type: io.kestra.plugin.scripts.python.Commands
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - pip list

Using cache files

In a WorkingDirectory task, you can create a virtual environment using the Process Task Runner, install all required pip dependencies, and cache the venv folder. This ensures the dependencies are reused in subsequent executions, eliminating the need for repeated installations. For more details, see the caching page.

The example below caches the venv folder for 24 hours (ttl: PT24H):

yaml
id: python_cached_dependencies
namespace: company.team

tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: python_script
        type: io.kestra.plugin.scripts.python.Script
        taskRunner:
          type: io.kestra.plugin.core.runner.Process
        beforeCommands:
          - python -m venv venv
          - . venv/bin/activate
          - pip install pandas
        script: |
          import pandas as pd
          print(pd.__version__)
    cache:
      patterns:
        - venv/**
      ttl: PT24H

By using one of these techniques, you can avoid reinstalling dependencies for each execution and reduce overall execution time.
