Snowflake data pipeline with Kestra
The 21st century is the golden age of data. Today, modern data stacks offer numerous tools and workloads for data warehousing, data lakes, data analytics, and data engineering to manage massive amounts of data inflow. One such platform is Snowflake, a robust data cloud tool that streamlines many operations in a modern data stack.
In this article, we'll discuss how Snowflake integrates with Kestra (opens new window) (an open-source, scalable orchestration and scheduling platform) using Snowflake plugin to create a seamless and scalable data pipeline.
First, let's discuss Snowflake briefly.
# What is Snowflake?
Leveraged by thousands of organizations worldwide, Snowflake (opens new window) is a top-of-the-line SaaS cloud data warehousing and data lake platform offering reliable data pipelines with high performance, concurrency, and scale across multiple cloud providers like AWS, Azure, and GCP.
Snowflake is a unified cloud-native platform that enables teams to avoid data silos by sharing and collaborating on data pipelines securely. It offers a flexible architecture (opens new window) with three components, i.e., robust cloud storage, query processing engine, and a layer of cloud services with numerous third-party integrations.
Snowflake is a multi-cloud, multi-cluster warehouse platform that can auto-scale to manage virtual warehouses dynamically. In the auto-scale mode (opens new window), it starts and stops warehouses to govern cloud resource usage and their corresponding cost credits.
# Why is Snowflake Popular?
In the last few years, Snowflake has had huge traction. Today more than 5900+ companies (as of March 2022) are running their critical data pipelines on Snowflake.
With Snowflake, companies don't have to pay for huge and expensive licenses for using data warehousing tools. Unlike conventional data warehousing tools, Snowflake charges on a pay-as-you-basis for storage and compute, enabling small and medium-scale companies to leverage cloud data pipelines for improved data management.
Now, companies can build scalable data workloads that can perform strong data analysis on structured, unstructured, and semi-structured data to derive valuable business insights and make data-driven decisions. Additionally, Snowflake Data Marketplace (opens new window) allows customers to access numerous ready-to-query datasets, further reducing integration costs.
# Where does kestra fit in?
Typically, data warehouse workloads are part of a bigger technology stack within an organization or across organizations. To streamlines various operations, data pipelines should be orchestrated and scheduled. Data orchestration unifies data from disparate sources and transforms it to enable analytics. This is where Kestra jumps in.
Kestra is a state-of-the-art, open-source, cloud-native platform built to orchestrate & schedule scalable data workflows, aiming to improve the productivity of DataOps teams. Kestra's orchestration platform can build, run, manage, and monitor all kinds of complex workflows sequentially and in parallel. It offers numerous plugin integrations to build advanced data workflows, including our Snowflake plugin, discussed below.
Moreover, Kestra can execute workflows based on event-based, time-based, and API-based scheduling, giving complete control to the data teams. With a unified end-to-end platform, companies save time and money on storage, compute, and development costs.
Snowflake already offers many cost optimization processes like data compression and auto-scaling. However, Kestra makes it simpler to download, upload, and query data by integrating with Snowflake's storage and compute resources. Let's discuss Kestra's Snowflake plugin in detail.
# Dissecting Kestra's Snowflake Plugin.
Kestra's Snowflake plugin can create complex data pipelines, whether Snowflake is a target or simply storing data as a part of a larger workflow for downstream applications.
Besides the Snowflake plugin, Kestra offers numerous JDBC plugin integrations, including ClickHouse, DuckDb, MySQL, Oracle, Apache Pinot, PostgreSQL, Redshift, Rockset, SQL Server, Trino, Vectorwise, and Vertica. These plugins can effectively process and transform tabular data within relational databases, reducing the processing cost of platforms like Snowflake.
Using Kestra, the data teams are at liberty to decide where certain data, plugins, and connectors would fit. As mentioned above, the Snowflake plugin can perform the download, upload, and query tasks. Let's discuss each of them below.
# Query Snowflake table
Kestra can query the Snowflake server using this task to insert, update, and delete data. The Query task offers numerous properties, including auto-committing SQL statements, different fetching operations, specifying access-control roles, and storing fetch results. When the
true, Kestra allows storage of large results as an output of the Query task.
The plugin allows the usage of multi-SQL statements in the same transaction as a full SQL script with isolation support. It allows simple queries and fetches results with
fetchOne properties, enabling teams to reuse the output on the next tasks from tools like Kafka Consume, Elastic Search, Mongo Find, and more. Some Query task instances are:
- Fetch a row from the database, and define multiple flows depending on the output
- Fetch the count of a store and iterate through the list. If an item doesn't exist, perform a particular task.
The following code snippet executes a query to fetch results from one table into Kestra internal storage.
id: select type: io.kestra.plugin.jdbc.snowflake.Query url: jdbc:snowflake://<account_identifier>.snowflakecomputing.com username: snowflake password: snowflake_passwd sql: select * from source fetch: true
# Download from Snowflake stage
As the name suggests, this task downloads data from the Snowflake server to an internal Kestra stage, which is based on Amazon ION (opens new window). The Download task offers properties such as data compression and access control role to streamline the download process of the connected database. The Download task outputs the URL of the downloaded file available on the Kestra storage server.
The following code snippet downloads default database to specified
fileName location on internal Kestra server.
id: "download" type: "io.kestra.plugin.jdbc.snowflake.Download" stageName: MYSTAGE fileName: prefix/destFile.csv
# Upload to Snowflake stage
This task uploads data to an internal Snowflake stage (opens new window). Similar to Download task, Upload can perform data compression and set access control role. Snowflake also support data transformation (opens new window) while loading data, which simplifies the ETL process.
The following code snippet uploads data to the specified
id: "upload" type: "io.kestra.plugin.jdbc.snowflake.Upload" stageName: MYSTAGE prefix: testUploadStream fileName: destFile.csv
# Building Dynamic Data Workflows With Kestra
Kestra has quickly become an essential piece of the modern data stack. Kestra provides unprecedented flexibility and control to data teams. It can orchestrate any kind of workflow with ease using a rich UI that monitors all flows.
Kestra's Snowflake plugin makes data warehousing simple. Even non-developers can write relevant flows using descriptive YAML. Your Snowflake storage pipeline accommodates raw data from multiple sources and transforms it using ETL operations. Additionally, you can skip the transformation and directly load data into the warehouse using the ELT pipeline. Kestra can manage both workflows simultaneously. In any case, Kestra ensures that the data is readily available to perform analysis and learn valuable patterns.
If you need any guidance with your Snowflake deployment, our experts at Kestra would love to hear from you. Let us know (opens new window) if you would like us to add more plugins to the list. Or start building your custom Kestra plugin today and send it our way. We always welcome contributions (opens new window)!