# Run a SQL query on Databricks, output the result to a CSV file, and read that CSV file in a Python script with Pandas

## Source
```yaml
id: databricks-sql
namespace: company.team

description: |
  This flow demonstrates how to run a SQL query on a Databricks lakehouse.

tasks:
  - id: sql_query
    type: io.kestra.plugin.databricks.sql.Query
    accessToken: "{{ secret('DATABRICKS_TOKEN') }}"
    host: "{{ secret('DATABRICKS_HOST') }}"
    httpPath: sql/protocolv1/o/3497567377305430/0713-130754-9oa8zceh
    sql: SELECT * FROM samples.nyctaxi.trips LIMIT 100;

  - id: csv
    type: io.kestra.plugin.serdes.csv.IonToCsv
    from: "{{ outputs.sql_query.uri }}"

  - id: pandas
    type: io.kestra.plugin.scripts.python.Script
    beforeCommands:
      - pip install kestra pandas > /dev/null
    script: |
      import pandas as pd

      df = pd.read_csv("{{ outputs.csv.uri }}")
      print(df.head())
```
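One note on the final task: `beforeCommands` installs the `kestra` package alongside `pandas`, but the inline script only uses Pandas. If you want the script to pass values back to the flow, a minimal sketch, assuming the `Kestra.outputs` helper from the `kestra` pip package, could look like this:

```python
import pandas as pd
from kestra import Kestra

# Kestra renders "{{ outputs.csv.uri }}" before the script runs,
# so inside the flow this resolves to the CSV file from the csv task.
df = pd.read_csv("{{ outputs.csv.uri }}")

# Publish the row count as a task output; following Kestra's convention
# for script outputs, it should then be reachable by downstream tasks
# as {{ outputs.pandas.vars.row_count }}.
Kestra.outputs({"row_count": len(df)})
```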
## About this blueprint

Tags: pip, Python, Databricks
This flow runs a SQL query on a Databricks lakehouse, stores the results in a CSV file, and then uses Pandas to read the CSV file and further process the data. The Databricks SQL Connector executes SQL queries on a Databricks lakehouse and handles their results; in simple terms, it lets you query data stored in Databricks and then manipulate or analyze that data. The process involves three main steps:
- SQL Query Execution: the `sql_query` task executes a SQL query on the Databricks lakehouse.
- CSV Conversion: the `csv` task converts the query results into CSV format.
- Data Processing with Pandas: the `pandas` task reads the CSV data with the Pandas library in Python for further analysis or manipulation (see the sketch after this list for one way to extend this step).

This flow is particularly useful for extracting and transforming data stored in Databricks for a range of data analysis and data science applications.
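As an illustration of the last step, the sketch below goes one step beyond `df.head()` and computes a simple aggregate. It is a minimal example, not part of the blueprint: the file name `trips.csv` stands in for the rendered `{{ outputs.csv.uri }}`, and the `trip_distance` and `fare_amount` columns are assumed from the `samples.nyctaxi.trips` sample table and may differ in your environment.

```python
import pandas as pd

# Stand-in path for "{{ outputs.csv.uri }}", which Kestra renders inside the flow.
df = pd.read_csv("trips.csv")

# Drop zero-distance trips, then compute fare per mile.
# trip_distance and fare_amount are assumed column names from the sample table.
trips = df[df["trip_distance"] > 0].copy()
trips["fare_per_mile"] = trips["fare_amount"] / trips["trip_distance"]

# Summarize the derived metric for a quick sanity check.
print(trips["fare_per_mile"].describe())
```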