Aug 16, 2023 Rahul Sharma


I. Modern Data Stack

The modern data stack in 2023 is filled with bespoke tools for each component of the data engineering journey, from the “E(xtract)” to the “L(oad)” to the “T(ransform)”. While tooling in the “T” space is relatively new (hello, dbt!), “EL” products have enjoyed consistent user adoption over the years, from SSIS to Airbyte.

Courtesy: LakeFS

While the landscape has certainly become a bit overblown (see above), the modern data stack still boils down to plain old ELT. As disk storage becomes cheaper, enterprises are happy to move historical data into data lakes, preserve it there, and transform it into useful data products for downstream BI and AI.

In this post, we will focus on the “EL” of “ELT” and how different tools come together to accomplish it.


II. Extract and Load


The first step of the modern data stack is to ingest data from different sources (e.g., an on-prem SQL Server, third-party APIs, flat files). The ingestion itself can be divided into two steps:

  1. Extract (from the source)
  2. Load (to the destination)
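The two steps can be sketched in a few lines of Python. This is a minimal illustration only, with in-memory SQLite databases standing in for a real source and destination; the table and function names are made up for the example.

```python
import sqlite3

def extract(source: sqlite3.Connection) -> list[tuple]:
    # Extract: pull raw rows from the source system.
    return source.execute("SELECT id, name FROM customers").fetchall()

def load(destination: sqlite3.Connection, rows: list[tuple]) -> None:
    # Load: write the raw rows to the destination as-is;
    # transformation happens later, downstream (the "T" of ELT).
    destination.execute(
        "CREATE TABLE IF NOT EXISTS raw_customers (id INTEGER, name TEXT)"
    )
    destination.executemany("INSERT INTO raw_customers VALUES (?, ?)", rows)
    destination.commit()

# Demo: in-memory databases stand in for real systems.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
src.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

dst = sqlite3.connect(":memory:")
load(dst, extract(src))
print(dst.execute("SELECT COUNT(*) FROM raw_customers").fetchone()[0])  # → 2
```

Note that `load` deliberately does no cleaning or reshaping: in ELT, raw data lands first and is transformed in the warehouse afterward.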

As a developer, your first instinct might be to use the corresponding SDKs of the source and destination to write the ingestion code yourself. However, this quickly becomes inefficient for a few reasons:

  1. The need to develop, manage, and maintain codebases for multiple sources (n > 10)
  2. The inability to quickly ramp up a new source without going through a full SDLC
  3. A lack of collaboration between technical and business team members
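The first pain point is easy to see in code. In the hand-rolled approach, every source needs its own bespoke extraction logic; the sketch below (with hypothetical class names and inline data standing in for real files and API responses) shows how the per-source code multiplies.

```python
import csv
import io
import json

# Each source needs bespoke extraction logic; with n > 10 sources,
# every one of these classes must be developed, tested, and maintained.
class CsvFileSource:
    def __init__(self, text: str):
        self.text = text

    def extract(self) -> list[dict]:
        return list(csv.DictReader(io.StringIO(self.text)))

class JsonApiSource:
    def __init__(self, payload: str):  # stands in for an HTTP response body
        self.payload = payload

    def extract(self) -> list[dict]:
        return json.loads(self.payload)["records"]

# Adding SQL Server, Zendesk, Google Sheets, ... means another class each,
# plus per-source auth, pagination, rate limits, and schema-drift handling.
sources = [
    CsvFileSource("id,name\n1,Ada\n2,Grace"),
    JsonApiSource('{"records": [{"id": 3, "name": "Alan"}]}'),
]
rows = [row for s in sources for row in s.extract()]
print(len(rows))  # → 3
```

Two toy sources already require two classes; a realistic fleet of sources turns this into a maintenance burden of its own.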

This created a market for low/no-code ingestion tools that allow both technical and non-technical users to easily ingest data from a multitude of sources. Prominent tools in this market include Fivetran, Airbyte, and Stitch. As of this writing, both Airbyte and Fivetran support 350+ connectors (e.g., Azure SQL, AWS S3, Zendesk, Google Sheets) [1][2]. Below is a detailed comparison of the two.
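The appeal of these tools is that a source-to-destination pipeline collapses into declarative configuration rather than per-source code. The sketch below is purely illustrative of that style; the field names are hypothetical and do not match any vendor's actual connection schema.

```python
# Hypothetical declarative connection spec, in the spirit of low/no-code
# ingestion tools; keys and values are illustrative, not a vendor schema.
connection = {
    "source": {
        "connector": "azure-sql",
        "config": {"host": "myserver.example.com", "database": "sales"},
    },
    "destination": {
        "connector": "s3",
        "config": {"bucket": "raw-zone", "format": "parquet"},
    },
    "schedule": {"cron": "0 2 * * *"},  # nightly sync
    "sync_mode": "incremental",
}
print(connection["source"]["connector"])  # → azure-sql
```

Swapping in a new source becomes a configuration change handled by the tool's prebuilt connector, rather than a new codebase to write and maintain.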