Managing pip package dependencies
When a task runs Python commands or scripts, the pip packages that the code depends on must be installed first. If you use the Python Docker image and install pip package dependencies as part of your beforeCommands, these packages are downloaded and installed every time the task runs. This is time consuming and increases the duration of your executions. This page describes several ways to manage pip package dependencies for your flows.
Using a custom Docker image
Instead of using the Python Docker image and installing pip package dependencies with beforeCommands, you can create a custom Docker image that contains Python and the required pip packages. Because the packages are already part of the custom image, they do not need to be downloaded and installed during each execution. This removes the installation overhead, so the execution time is spent only on running the Python code.
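As a minimal sketch of this approach, suppose a custom image with pandas preinstalled has already been built and pushed under the hypothetical name my-registry/python-deps:latest. The flow below assumes that the Python script task in your Kestra version accepts the image through a docker.image property and runs with the DOCKER runner; the flow id and the image name are placeholders, not values from this page.

```yaml
id: python_custom_image
namespace: dev
tasks:
  - id: python_script
    type: io.kestra.plugin.scripts.python.Script
    # Assumption: DOCKER runner with the image set under docker.image.
    runner: DOCKER
    docker:
      # Hypothetical image name; replace it with your own image.
      image: my-registry/python-deps:latest
    # No beforeCommands: pandas is expected to be baked into the image.
    script: |
      import pandas as pd
      print(pd.__version__)
```

With no pip install step in beforeCommands, the only recurring cost is pulling the image when it is not already present on the worker.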
Install pip package dependencies at server startup
This is another way to avoid the overhead of downloading and installing pip package dependencies in each execution. You can install all the pip package dependencies first and then start the Kestra server. For a Kestra standalone server, you can do this by chaining the two commands:
pip install requests pandas polars && ./kestra server standalone --worker-thread=16
If you run Kestra with Docker, you can change the docker-compose.yml file so that it installs the pip package dependencies and then starts the Kestra server.
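One possible shape for that change is sketched below. It is only a fragment of the kestra service and rests on several assumptions that you should check against your own setup: that the image tag kestra/kestra:latest-full is the one you use, that the image ships with Python and pip, that the Kestra executable is located at /app/kestra inside the container, and that pip needs root to install packages there. Volumes, ports, environment, and the other services are omitted.

```yaml
services:
  kestra:
    # Assumption: replace with the image tag you already use.
    image: kestra/kestra:latest-full
    # Assumption: root is needed so pip can install into the container's Python.
    user: "root"
    # Override the entrypoint so that pip install can be chained with the server start.
    entrypoint: /bin/bash
    command:
      - -c
      # Assumption: the Kestra executable lives at /app/kestra in this image.
      - pip install requests pandas polars && /app/kestra server standalone --worker-thread=16
```

The packages are installed once per container start, not once per execution.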
With either of these Kestra server installations, you need to run the Python tasks with the PROCESS runner (runner: PROCESS) so that the Python code has access to the pip package dependencies installed alongside the Kestra server process.
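For illustration, a task that relies on packages installed at server startup could look like the following sketch. The flow id is a placeholder, and the example assumes that pandas was installed in the same environment as the Kestra server.

```yaml
id: python_server_dependencies
namespace: dev
tasks:
  - id: python_script
    type: io.kestra.plugin.scripts.python.Script
    # PROCESS runs the script as a process on the Kestra host,
    # so it can see the packages installed at server startup.
    runner: PROCESS
    # No beforeCommands: pandas is assumed to be installed already.
    script: |
      import pandas as pd
      print(pd.__version__)
```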
Using cache files
In a WorkingDirectory task, you can set up a virtual environment, install all the pip package dependencies, and cache the venv folder. The dependencies are then cached as part of the virtual environment folder, so you do not need to install them on every execution of the flow. This is explained in detail in the caching page.
Here is a sample flow demonstrating how the venv folder can be cached:
id: python_cached_dependencies
namespace: dev
tasks:
  - id: working_dir
    type: io.kestra.core.tasks.flows.WorkingDirectory
    tasks:
      - id: python_script
        type: io.kestra.plugin.scripts.python.Script
        runner: PROCESS
        warningOnStdErr: false
        beforeCommands:
          - python -m venv venv
          - source venv/bin/activate
          - pip install pandas
        script: |
          import pandas as pd
          print(pd.__version__)
    cache:
      patterns:
        - venv/**
      ttl: PT24H
With any of the techniques above, you avoid installing the pip package dependencies on every execution and reduce your execution time.