Managing pip Package Dependencies

Learn how to manage pip package dependencies in your flows.

Motivation

Your Python code may require some pip package dependencies. The way you manage these dependencies can have an impact on the execution time of your flows.

If you install pip packages within beforeCommands, these packages are downloaded and installed every time the task runs, which lengthens your workflow executions. The following sections describe several ways to manage pip package dependencies in your flows without paying that cost on each run.

Using a custom Docker image

Instead of using the default Python Docker image and installing pip package dependencies with beforeCommands, you can create a custom Docker image that contains Python and the required pip packages. Because the packages are part of the image, they are not downloaded and installed during each execution. This removes the installation overhead, so the execution time is spent only on running the Python code.

For example, the script below depends on pandas. We can specify a Python container image that has pandas pre-installed, such as ghcr.io/kestra-io/pydata:latest, so we don't need beforeCommands:

yaml
id: docker_dependencies
namespace: company.team

tasks:
  - id: code
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    script: |
      import pandas as pd

      # Load the orders dataset and sum the "total" column.
      df = pd.read_csv('https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv')
      total_revenue = df['total'].sum()
      print(f"Total revenue: {total_revenue}")

Install pip package dependencies at server startup

This is another way to avoid the overhead of downloading and installing pip package dependencies in each execution. You can install all the pip package dependencies first and then start the Kestra server. For a Kestra standalone server, run the command below:

bash
pip install requests pandas polars && ./kestra server standalone --worker-thread=16

If you run Kestra using Docker, create a Dockerfile if you haven't already and install your dependencies using RUN inside of your Dockerfile. You will also need to set the USER to root for this to work.

dockerfile
FROM kestra/kestra:latest

USER root
RUN pip install requests pandas polars

CMD ["server", "standalone"]

In your Docker Compose file, replace the image property with build: . to use this Dockerfile instead of the kestra image pulled directly from Docker Hub. Also remove the command property, as this is now handled by CMD in the Dockerfile:

yaml
services:
  ...
  kestra:
    build: .
    ...

When you run Kestra using Docker Compose, the Python dependencies listed in the Dockerfile will now be installed in the Kestra container.

In either of these Kestra server installations, you will need to run the Python tasks using the Process Task Runner so that the Python code has access to the pip package dependencies installed in the Kestra server process.

We can check that our dependencies are installed with the example below:

yaml
id: list_dependencies
namespace: company.team

tasks:
  - id: check
    type: io.kestra.plugin.scripts.python.Commands
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - pip list
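
If you would rather check for specific packages than read through the full pip list output, a small Python script can do it. The following is a minimal sketch, not part of Kestra: the helper names installed_versions and missing_packages are hypothetical, and the package list is simply the one used in the install commands above. It relies only on the standard library's importlib.metadata module.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not visible to this Python interpreter.
            versions[name] = None
    return versions


def missing_packages(packages):
    """Return the names from `packages` that are not installed."""
    return [name for name, version in installed_versions(packages).items() if version is None]


if __name__ == "__main__":
    # Assumed package list, taken from the pip install commands in this section.
    required = ["requests", "pandas", "polars"]
    for name, version in installed_versions(required).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

A script like this could run in a Python Script task that uses the Process Task Runner, so that it inspects the same interpreter the Kestra server uses.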

Using cache files

In a WorkingDirectory task, you can set up a virtual environment with the Process Task Runner, install all the pip package dependencies, and cache the venv folder. The pip package dependencies are then cached as part of the virtual environment folder, so you need not install them on every execution of the flow. This is explained in detail in the caching page.

Here is a sample flow demonstrating how the venv folder can be cached:

yaml
id: python_cached_dependencies
namespace: company.team

tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: python_script
        type: io.kestra.plugin.scripts.python.Script
        taskRunner:
          type: io.kestra.plugin.core.runner.Process
        warningOnStdErr: false
        beforeCommands:
          - python -m venv venv
          - . venv/bin/activate
          - pip install pandas
        script: |
          import pandas as pd
          print(pd.__version__)
    cache:
      patterns:
        - venv/**
      ttl: PT24H
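
To confirm that a script actually runs inside the cached virtual environment rather than the system interpreter, you can compare sys.prefix with sys.base_prefix, which differ when a venv-created environment is active. This is a minimal sketch: the helper name in_virtualenv is hypothetical, and it assumes the activation done in beforeCommands carries over to the interpreter that runs the script.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter belongs to a venv-created virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"interpreter: {sys.executable}")
    print(f"virtual environment active: {in_virtualenv()}")
```

Placing these lines in the script body of the flow above would show in the task logs which interpreter was used.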

By using one of the techniques above, you can avoid installing the pip package dependencies on every execution and reduce your execution time.
