img
Ludovic Dehon
Published 

A conversation about data pipelines is never complete without discussing ingestion practices, such as ETL and ELT (extract, transform, load, or extract, load, and transform). ETL transforms the data before loading it into a data warehouse, while ELT loads the data and allows the transformation to be handled within the data warehouse.

Each practice is rooted in strong business needs, and are necessary parts of modern data flow practices. However, discussion of these practices is often couched in a competitive narrative, asking which one is better. You'll find any number of blogs out there with a title including "ETL vs ELT" in some way, shape or form. However, there are strong reasons why both are seen in use today, and neither one is going away anytime soon. So in this article, we'll cover these two methods, the reasons they are so often pitted against one another, the situations in which one or the other thrive, and why, with Kestra (opens new window)'s unique capabilities, you might want to consider a hybrid solution. Let's get started, shall we?

# ETL

ETL

First, let's look at the stages involved in both processes. As you might guess from the acronyms, both use the same stages, just in a different order.

  • Extract: This refers to pulling the source data from the original database or data source. With ETL, the data goes into a temporary staging area. With ELT, it goes immediately into a data lake storage system.
  • Transform: This refers to the process of changing the format/structure of the information so that it can integrate with the target system and the rest of the data within that system.
  • Load: This refers to the process of inserting the information into a data storage system. In ELT scenarios, raw/unstructured data is loaded, then transformed within the target system. In ETL, the raw data is transformed into structured data prior to reaching the target.

ETL emerged as early as the 1970s, in a response to the high cost of storage in what used to pass for datacenters in those days. Co-location facilities were rare, the cloud was not even a glimmer in the developer's eye, and most large organizations maintained their own centralized data for many years. Storage was, compared to today, highly expensive, and moving large amounts of data between systems was both costly and time-consuming, so simply storing raw unprocessed data was not very useful. For this reason, it was important to transform raw data before it was delivered to any system that would analyse the data, ensuring it was in a usable format for the system it was delivered to.

To this day, ETL is primarily used for on-premise solutions in which there is finite processing power and/or memory. Rather than wasting resources by moving or storing large amounts of data that is not useful, the data is sorted and transformed prior to being ingested by a storage destination where it can be analyzed by a business intelligence tool.

Originally, the ETL process was often coded by hand to collect data from relational databases. Now there are specialized tools for ETL that can gather and process data from multiple sources and automate these operations. Some common tools used in this space include Oracle Data Integrator, Talend, and Pentaho Data Integration, among many others. Because of the prevalence of tools specifically designed to simplify the ETL process, Many organizations end up with centralized data engineer teams specialized in a particular tool or toolset, rather than being trained to create an ETL pipeline. This can be a challenge because:

  1. A new employee not trained on the platform can become a bottle-neck, slowing or stopping the creation of new pipelines.
  2. The organization is often locked into a particular product or method, and may not be able to easily integrate newer best practices.

There are many reasons to implement ETL processes in the modern era, reasons that the availability of cheaper cloud storage, streaming processes and other advances have not superseded. Some of these include:

  • Technology Debt/Investment --- Your organization may be invested in solutions that require a particular format of data, or the raw data your organization gathers may be for a very specific purpose, and the format in which it is received may be important to internal processes. Modern ETL solutions are faster and easier than in the past, and there may be little impetus to change.
  • Continuous processes with a well-defined workflow --- ETL first extracts data from homogeneous or heterogeneous data sources. Next, it deposits the data into a staging area. From there, the data goes through a cleansing process, gets enriched and transformed, and is finally stored in a data warehouse. Cleansed and formatted data is essential to a defined workflow.
  • Compliance --- Some regulations prohibit storing information (particularly unencrypted information) in servers outside of a specific region or country's borders. Companies governed by GDPR (opens new window), HIPAA (opens new window), or CCPA (opens new window)protocols often need to remove, mask or encrypt data fields. It is more secure to perform such operations via an ETL process, rather than ELT. ELT requires the upload of potentially sensitive data first, prior to transformation to secure it. This means it appears in logs, where sysadmins can access it. In addition, GDPR compliance may be violated because non-compliant data might leave the EU prior to reaching a data lake or other storage medium.

# ELT

ELT

With the advent of much cheaper storage, and cloud advances such as data lakes, ELT has gained a great deal of popularity in the past decade or so. It is often considered the more modern solution. There is no need for data staging, and because the target storage can essentially be anything, including data lakes, it can make use of any or multiple data types, unstructured, structured, semi-structured, or even raw data. This flexibility is key to its strong adoption in modern solutions, particularly when large datasets and streaming are involved.

The ELT process is particularly suited to data lakes. Data Lakes are large repositories that are data agnostic --- in other words, capable of storing data in any and all formats. Because of this, it can be far more efficient to avoid the bottleneck of a staging area capable of handling a comparatively limited dataset.

The key advantages are:

  • Low maintenance: Database resources can be used to transform the data, and there are many different options ready and available.
  • Quicker loading: No previous transformation means that the data can simply be ingested without bottlenecks.
  • As Needed Transformation: In ELT, the data is transformed after loading, but it does not have to be immediately transformed, only when it is needed for a particular purpose.
  • High Data Availability: Once loaded, the data is always available. Tools that can use unstructured data can be leveraged as well as those that rely on structured data.

ELT democratizes data gathering processes, because the source can equally be anything. Just as the target (usually a data lake or large scale database) should be able to handle any type of data, the sources can be equally disparate. This is important for modern workloads, because modern use cases stream data from IoT sources, raw data, video, documents, files, and more. This is not to mention multi-cloud processes. Sometimes data integrations occur across organizations, across different cloud platforms, across regions. Flexibility is key to making these modern processes work. ETL processes simply cannot handle the volume, nor the variances involved. Certainly not in a real-time scenario.

# Drawbacks

ELT

ETL processes, while a more modern and flexible solution, is not suitable in every situation, for some of the reasons we already covered. The most important of which is the previously mentioned compliance issue. Because no transformation is performed upon the data, raw, unencrypted data is available at any time prior to reaching the database. Records that can be used to reconstruct the data or at least the metadata can be retrieved from logs by administrators for the system. Because data can traverse international boundaries, encryption is often the least requirement with regards to compliance.

Deletion of data is not enough of a precaution in these scenarios, because delete processes can often crash, particularly when dealing with large chunks of information. Often, the data can be personal in nature, ie hospital records, financial information, and more, which is the reason these regulations are so strict. Many data warehouses are not equipped with adequate transaction support to ensure the query required to delete the data is successful.

# Kestra Handles Complexity

Whether you need ETL or ELT processes in your solution, Kestra (opens new window) can handle it. Indeed, Kestra (opens new window) can manage both in the same solution, handling even the most complex workflows. ETL processes can be used to scrub sensitive data, ensuring compliance, loading the transformed data within a temporary table. With Kestra's capacity for parallel flows, the rest of the data can be handled by ELT.

Kestra (opens new window) is able to perform ELT workloads on its own or with integrations to many popular solutions. It can handle loading data from BigQuery (opens new window), CopyIn, Postgres (opens new window), and more. A simple query can be performed to move the data, for example, SQL INSERT INTO SELECT statements. Dependencies between flows can be handled with Kestra's trigger mechanisms (opens new window) to transform the data within the database (cloud or physical).

ETL is just as easily managed by Kestra's flexible workflows. FileTransform plugins (opens new window) are one possible method. You can write a simple Python (opens new window)/Javascript (opens new window)/Groovy (opens new window) script to transform an extracted dataset data row per row. For example, you can remove columns with personal data, clean columns by removing dates, and more. Integrating a custom docker image into your workflow is another method that can be used to transform the data. Not only can you transform data row per row, you can potentially handle conversion of data between formats, for example, transforming AVRO (opens new window) data to JSON (opens new window) or CSV (opens new window), or vice versa.

id: my-first-flow
namespace: my.company.teams
inputs:
  - type: FILE
    name: uploaded
tasks:
  - id: csvReader
    type: io.kestra.plugin.serdes.csv.CsvReader
    from: "{{ inputs.uploaded }}"
  - id: fileTransform
    type: io.kestra.plugin.scripts.nashorn.FileTransform
    description: This task will anonymize the contactName with a custom nashorn script (javascript over jvm). This show that you able to handle custom transformation or remapping in the ETL way
    from: "{{ outputs.csvReader.uri }}"
    script: |
      if (row['contactName']) {
        row['contactName'] = "*".repeat(row['contactName'].length);
      }

This is not usually possible with most solutions. Most ELT tools often prevent ETL processes by design because they cannot handle heavy transform operations. Kestra (opens new window) is able to handle both because all transformations are considered to be row per row, and therefore do not use any memory to perform the function, only CPU.

Kestraz User Interface

With Kestra (opens new window)'s innate flexibility, and many integrations, you are not locked into the choice of one ingestion method or the other. Complex workflows can be developed, whether in parallel or sequentially, to deliver both ELT and ETL processes. Simple descriptive yaml (opens new window) is used to connect plugins, and to create flows. Because workflows created in Kestra are represented visually (opens new window), and issues can be seen in relation to individual tasks, there is no need to fear complexity. Trouble can be traced to its source in an instant, allowing you to try new things and come up with a new solution without fear. Give it a try, and let us know what you come up with!

Perhaps your unique solution can become our next showcase!