Manage file caching inside of Kestra.

Kestra provides a file caching, which is especially useful when you work with sizeable package dependencies that don't change often.

Cache files in a WorkingDirectory task

The file caching functionality on the WorkingDirectory task allows you to cache a subset of files to speed up your workflow execution. This is especially useful when you work with sizeable package dependencies that don't change often.

Use cases for file caching

The file caching is useful if you want to install some pip or npm packages before running your script. You can cache the node_modules or Python venv folder to avoid re-installing the dependencies on each run.

To do that, add a cache to your WorkingDirectory task. The cache property accepts a list of glob patterns to match files to cache. The cache will be automatically invalidated after a specified time-to-live using the ttl property accepting a duration.

yaml
id: caching_files
namespace: company.team

tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    cache:
      patterns:
        - some_directory/**
      ttl: PT1H

How does it work under the hood

Kestra packages the files that need to be cached and stores them in the internal storage. When the task is executed again, the cached files are retrieved, initializing the working directory with their contents.

Node.js example

Here's an example of a flow that installs the colors package before running a Node.js script. The node_modules folder is cached for one hour.

yaml
id: node_cached_dependencies
namespace: company.team

tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    cache:
      patterns:
        - node_modules/**
      ttl: PT1H
    tasks:
    - id: node_script
      type: io.kestra.plugin.scripts.node.Script
      beforeCommands:
        - npm install colors
      script: |
        const colors = require("colors");
        console.log(colors.red("Hello"));

Python example

Here's an example of a flow that installs the pandas package before running a Python script. The venv folder is cached for one day.

yaml
id: python_cached_dependencies
namespace: company.team

tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: python_script
        type: io.kestra.plugin.scripts.python.Script
        taskRunner:
          type: io.kestra.plugin.core.runner.Process
        warningOnStdErr: false
        beforeCommands:
          - python -m venv venv
          - source venv/bin/activate
          - pip install pandas
        script: |
          import pandas as pd
          print(pd.__version__)
    cache:
      patterns:
        - venv/**
      ttl: PT24H

How to invalidate the cache

Here's how that works:

  • After the first run, the files are cached
  • The next time the task is executed:
    • if the ttl didn't pass, the files are retrieved from cache
    • If the ttl passed, the cache is invalidated and no files will be retrieved from cache; because cache is no longer present, the npm install command from the beforeCommands property will take a bit longer to execute
  • If you edit the task and change the ttl to:
    • a longer duration e.g. PT5H โ€” the files will be cached for five hours using the new ttl duration
    • a shorter duration e.g. PT5M โ€” the cache will be invalidated after five minutes using the new ttl duration.

The ttl is evaluated at runtime. If the most recently set ttl duration has passed as compared to the last task run execution date, the cache is invalidated and the files are no longer retrieved from cache.

Was this page helpful?