```yaml
id: download-parquet-from-databricks
namespace: company.team

description: >
  This flow will download a Parquet file from Databricks File System (DBFS) to
  Kestra's internal storage.

tasks:
  - id: download
    type: io.kestra.plugin.databricks.dbfs.Download
    authentication:
      token: "{{ secret('DATABRICKS_TOKEN') }}"
    host: "{{ secret('DATABRICKS_HOST') }}"
    from: /Shared/myFile.parquet

  - id: process_downloaded_file
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    dependencies:
      - pandas
    script: |
      import pandas as pd
      df = pd.read_parquet("{{ outputs.download.uri }}")
      df.head()
```
About this blueprint
This flow retrieves a Parquet file stored in Databricks File System (DBFS) and makes it available inside Kestra for downstream processing.
It performs two main steps:
- Downloads the Parquet file from DBFS into Kestra’s internal storage.
- Loads and inspects the dataset using a Python script running in a Docker container, making it easy to validate or transform the data with Pandas.
This pattern is useful when you need to reuse Databricks-generated datasets outside of Databricks, for example in data quality checks, exploratory analysis, or additional ETL steps orchestrated by Kestra.
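As a concrete illustration of the data quality checks mentioned above, the `process_downloaded_file` script could run a few simple validations with Pandas before any downstream step uses the data. The sketch below is not part of the blueprint; the `basic_quality_checks` helper, the column names, and the stand-in DataFrame are all hypothetical examples of what such a check might look like.

```python
import pandas as pd

def basic_quality_checks(df: pd.DataFrame, required_columns: list[str]) -> list[str]:
    """Return a list of human-readable problems found in the DataFrame."""
    problems = []
    if df.empty:
        problems.append("dataset is empty")
    # Columns that the downstream steps expect but the file does not provide.
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    # Null counts per expected column that is actually present.
    for col in (c for c in required_columns if c in df.columns):
        nulls = int(df[col].isna().sum())
        if nulls:
            problems.append(f"column {col!r} has {nulls} null values")
    return problems

# Inside the flow, the DataFrame would come from
# pd.read_parquet("{{ outputs.download.uri }}"); here we use a stand-in.
df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.5, None, 12.0]})
print(basic_quality_checks(df, ["order_id", "amount", "customer_id"]))
```

In a real flow you would typically raise an exception when `problems` is non-empty, so the Kestra task fails and halts downstream processing.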