Managing pip Package Dependencies
Learn how to manage pip package dependencies in your flows.
Motivation
Your Python code may require pip package dependencies. How you manage these dependencies can affect the execution time of your flows.
If you install pip packages within beforeCommands, the packages will be downloaded and installed each time the task runs. This can significantly increase the duration of workflow executions. The following sections describe several ways to manage pip package dependencies efficiently in your flows.
Using a custom Docker image
Instead of using the base Python Docker image and installing dependencies through beforeCommands, you can create a custom Docker image that includes Python and all required pip packages. Because the dependencies are built into the image, they do not need to be downloaded and installed at runtime. This reduces overhead and ensures that execution time is dedicated solely to running your Python code.
For example, if your Python script depends on pandas, you can use a container image that already includes it, such as ghcr.io/kestra-io/pydata:latest. This eliminates the need to install dependencies using beforeCommands:
id: docker_dependencies
namespace: company.team

tasks:
  - id: code
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    script: |
      import pandas as pd

      df = pd.read_csv('https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv')
      total_revenue = df['total'].sum()
Installing pip package dependencies at server startup
Another way to avoid installing dependencies during every execution is to preinstall them before starting the Kestra server. For a standalone Kestra server, you can run:
pip install requests pandas polars && ./kestra server standalone --worker-thread=16
If you are running Kestra with Docker, create a Dockerfile and install dependencies using the RUN instruction. Set the USER to root to allow package installation:
FROM kestra/kestra:latest
USER root
RUN pip install requests pandas polars
CMD ["server", "standalone"]
In your Docker Compose configuration, replace the image property with build: . to use your custom Dockerfile instead of the official image from Docker Hub. Also, remove the command property, since the CMD instruction in your Dockerfile now handles it:
services:
  ...
  kestra:
    build: .
    ...
When you start Kestra using Docker Compose, the Python dependencies will already be included in the container.
With either installation method, you must run Python tasks using the Process Task Runner so that the code can access the dependencies installed in the same environment as the Kestra server.
You can verify that the dependencies are installed with the following example:
id: list_dependencies
namespace: company.team

tasks:
  - id: check
    type: io.kestra.plugin.scripts.python.Commands
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - pip list
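pip list prints every installed package. To check only the packages a flow depends on, a script along the following lines could be used as the body of a Python task that runs with the Process Task Runner. This is a minimal sketch: installed_versions and missing_packages are hypothetical helpers, not part of Kestra, and the package names are assumed to be the ones from the pip install command shown earlier in this section.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def missing_packages(packages):
    """Return the names from `packages` that are not installed."""
    found = installed_versions(packages)
    return [name for name, version in found.items() if version is None]


# Assumed package list, taken from the pip install command in this section.
required = ["requests", "pandas", "polars"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a missing package should fail the task, the script could raise an exception when missing_packages(required) returns a non-empty list.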
Using cache files
In a WorkingDirectory task, you can create a virtual environment using the Process Task Runner, install all required pip dependencies, and cache the venv folder. This ensures the dependencies are reused in subsequent executions, eliminating the need for repeated installations. For more details, see the caching page.
The example below demonstrates how to cache the venv folder:
id: python_cached_dependencies
namespace: company.team

tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: python_script
        type: io.kestra.plugin.scripts.python.Script
        taskRunner:
          type: io.kestra.plugin.core.runner.Process
        beforeCommands:
          - python -m venv venv
          - . venv/bin/activate
          - pip install pandas
        script: |
          import pandas as pd
          print(pd.__version__)
    cache:
      patterns:
        - venv/**
      ttl: PT24H
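To confirm that the script really runs from the cached virtual environment rather than the system interpreter, a few lines like the following could be added to the script. This is a minimal sketch: in_virtualenv is a hypothetical helper, and it relies on the standard-library behavior that sys.prefix differs from sys.base_prefix when Python runs inside an environment created by python -m venv.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Running inside a virtual environment: {in_virtualenv()}")
```

When the venv folder is activated as in the flow above, the interpreter path is expected to point inside that folder.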
By using one of these techniques, you can avoid reinstalling dependencies for each execution and reduce overall execution time.