# Run a SQL query on Databricks, output the result to a CSV file, and read that CSV file in a Python script with Pandas

## Source
```yaml
id: databricks-sql
namespace: company.team

description: |
  This flow demonstrates how to run a SQL query on a Databricks lakehouse.

tasks:
  - id: sql_query
    type: io.kestra.plugin.databricks.sql.Query
    accessToken: "{{ secret('DATABRICKS_TOKEN') }}"
    host: "{{ secret('DATABRICKS_HOST') }}"
    httpPath: sql/protocolv1/o/3497567377305430/0713-130754-9oa8zceh
    sql: SELECT * FROM samples.nyctaxi.trips LIMIT 100;

  - id: csv
    type: io.kestra.plugin.serdes.csv.IonToCsv
    from: "{{ outputs.sql_query.uri }}"

  - id: pandas
    type: io.kestra.plugin.scripts.python.Script
    beforeCommands:
      - pip install kestra pandas > /dev/null
    script: |
      import pandas as pd

      df = pd.read_csv("{{ outputs.csv.uri }}")
      print(df.head())
```
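One note on the final task: `beforeCommands` installs the `kestra` package alongside `pandas`, but the inline script only uses Pandas. If you want the script to pass values back to the flow, a minimal sketch, assuming the `Kestra.outputs` helper from the `kestra` pip package, could look like this:

```python
import pandas as pd
from kestra import Kestra

# Kestra renders "{{ outputs.csv.uri }}" before the script runs,
# so inside the flow this resolves to the CSV file from the csv task.
df = pd.read_csv("{{ outputs.csv.uri }}")

# Publish the row count as a task output; following Kestra's convention
# for script outputs, it should then be reachable by downstream tasks
# as {{ outputs.pandas.vars.row_count }}.
Kestra.outputs({"row_count": len(df)})
```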
## About this blueprint

Tags: pip, Python, Databricks
This flow runs a SQL query on a Databricks lakehouse, stores the results in a CSV file, and then uses Pandas to read the CSV file and further process the data. The Databricks SQL Connector executes SQL queries on a Databricks lakehouse and handles their results; in simple terms, it lets you query data stored in Databricks and then manipulate or analyze that data. The process involves three main steps:
- SQL Query Execution: the `sql_query` task executes a SQL query on the Databricks lakehouse.
- CSV Conversion: the `csv` task converts the query results into CSV format.
- Data Processing with Pandas: the `pandas` task reads the CSV data with the Pandas library in Python for further analysis or manipulation (see the sketch after this list for one way to extend this step).

This flow is particularly useful for extracting and transforming data stored in Databricks for a range of data analysis and data science applications.
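As an illustration of the last step, the sketch below goes one step beyond `df.head()` and computes a simple aggregate. It is a minimal example, not part of the blueprint: the file name `trips.csv` stands in for the rendered `{{ outputs.csv.uri }}`, and the `trip_distance` and `fare_amount` columns are assumed from the `samples.nyctaxi.trips` sample table and may differ in your environment.

```python
import pandas as pd

# Stand-in path for "{{ outputs.csv.uri }}", which Kestra renders inside the flow.
df = pd.read_csv("trips.csv")

# Drop zero-distance trips, then compute fare per mile.
# trip_distance and fare_amount are assumed column names from the sample table.
trips = df[df["trip_distance"] > 0].copy()
trips["fare_per_mile"] = trips["fare_amount"] / trips["trip_distance"]

# Summarize the derived metric for a quick sanity check.
print(trips["fare_per_mile"].describe())
```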