Managing pip Package Dependencies
Learn how to manage pip package dependencies in your flows.
Motivation
Your Python code may require some pip package dependencies. How you manage these dependencies affects the execution time of your flows.
If you install pip packages within beforeCommands, they are downloaded and installed each time the task runs, which lengthens your workflow executions. The following sections describe several ways to manage pip package dependencies in your flows.
Using a custom Docker image
Instead of using the Python Docker image and installing pip package dependencies with beforeCommands, you can create a custom Docker image that contains Python and the required pip packages. Because the packages are part of the custom image, they do not need to be downloaded and installed on each execution, so the execution time is spent only on running the Python code.
For example, the Python script below has pandas as a dependency. We can specify a Python container image that has it pre-installed, such as ghcr.io/kestra-io/pydata:latest, meaning we don't need to use beforeCommands:
id: docker_dependencies
namespace: company.team
tasks:
  - id: code
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    script: |
      import pandas as pd
      df = pd.read_csv('https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv')
      total_revenue = df['total'].sum()
Install pip package dependencies at server startup
This is another way to avoid the overhead of downloading and installing pip package dependencies in each execution. You can install all the pip package dependencies first and then start the Kestra server. For a Kestra standalone server, you can do this by running the command below:
pip install requests pandas polars && ./kestra server standalone --worker-thread=16
If you run Kestra using Docker, create a Dockerfile if you haven't already and install your dependencies using RUN inside your Dockerfile. You will also need to set the USER to root for this to work.
FROM kestra/kestra:latest
USER root
RUN pip install requests pandas polars
CMD ["server", "standalone"]
In your Docker Compose file, replace the image property with build: . to use this Dockerfile instead of the kestra image pulled directly from Docker Hub. Also remove the command property, as this is now handled in the Dockerfile with CMD:
services:
  ...
  kestra:
    build: .
    ...
When you run Kestra using Docker Compose, the image is now built from the Dockerfile, so it includes the Python dependencies installed there.
In either of these Kestra server installations, you will need to run the Python tasks using the Process Task Runner so that the Python code has access to the pip package dependencies installed in the Kestra server process.
We can check that our dependencies are installed with the example below:
id: list_dependencies
namespace: company.team
tasks:
  - id: check
    type: io.kestra.plugin.scripts.python.Commands
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - pip list
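The pip list command prints every installed package. If you would rather check a specific set of packages, a minimal sketch using the standard library's importlib.metadata might look like the following. The function name installed_versions is hypothetical and not part of Kestra; it only assumes a Python 3.8+ interpreter.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper for illustration; not part of Kestra.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not visible to this interpreter.
            found[name] = None
    return found
```

Calling installed_versions(["requests", "pandas", "polars"]) from a Script task that uses the Process Task Runner would show which of the packages installed at server startup are visible to the task, with None marking any that are missing.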
Using cache files
In a WorkingDirectory task, you can set up a virtual environment with the Process Task Runner, install all the pip package dependencies, and cache the venv folder. The pip packages are then cached as part of the virtual environment folder, so you need not install them on every execution of the flow. This is explained in detail in the caching page.
Here is a sample flow demonstrating how the venv folder can be cached:
id: python_cached_dependencies
namespace: company.team
tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: python_script
        type: io.kestra.plugin.scripts.python.Script
        taskRunner:
          type: io.kestra.plugin.core.runner.Process
        warningOnStdErr: false
        beforeCommands:
          - python -m venv venv
          - . venv/bin/activate
          - pip install pandas
        script: |
          import pandas as pd
          print(pd.__version__)
    cache:
      patterns:
        - venv/**
      ttl: PT24H
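To confirm that a script really runs from the virtual environment rather than the system interpreter, one option is to compare sys.prefix with sys.base_prefix, which differ when Python runs inside a venv. This is a sketch only; the helper name in_virtualenv is hypothetical and not part of Kestra.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment.

    Inside a venv, sys.prefix points at the venv folder while
    sys.base_prefix still points at the base Python installation.
    Hypothetical helper for illustration; not part of Kestra.
    """
    return sys.prefix != sys.base_prefix


# Print the result so it shows up in the task logs.
print(f"in venv: {in_virtualenv()} (prefix: {sys.prefix})")
```

Placed in the script property of a task like the one above, this would log whether the activated venv is the environment in use.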
Using one of the techniques above, you can avoid installing pip package dependencies on every execution and reduce your execution time.