Embrace the Data Lake Architecture

Data engineers often build data pipelines to extract data from external sources, transform it, and enable other parts of the organization to query the resulting datasets. While it is easier in the short term to build all of this as a single-stage pipeline, a more thoughtful data architecture is needed to scale this model to thousands of datasets spanning multiple terabytes or petabytes.

Vinoth Chandar
3 min read · Jun 8, 2019

Common Pitfalls

Let’s understand some common pitfalls of the single-stage approach. First, it limits scalability: the input data for such a pipeline is obtained by scanning upstream databases (RDBMSes or NoSQL stores), which ultimately stresses these systems and can even result in outages. Further, accessing source data directly allows for very little standardization across pipelines (e.g., standard timestamp and key fields) and increases the risk of data breakages due to the lack of schemas/data contracts. Finally, not all data or columns are available in a single place where they can be freely cross-correlated for insights or used to build machine learning models.
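
To make these pitfalls concrete, here is a minimal sketch of a single-stage pipeline, assuming PySpark and a JDBC source; the database URL, table, and column names are hypothetical. Every run scans the production table, and the business logic is coupled directly to the source schema.

```python
# Hypothetical single-stage pipeline: scans the operational database on every
# run and writes a report directly, with no raw layer or data contract in between.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("single_stage_pipeline").getOrCreate()

# Full scan of the production 'orders' table on each run -- this is the extra
# load (and outage risk) on the source system described above.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://prod-db:3306/shop")
          .option("dbtable", "orders")
          .option("user", "reporting")
          .option("password", "REDACTED")
          .load())

# Business logic is tied directly to the source schema: an upstream column
# rename or type change silently breaks this, since there is no schema contract.
daily_revenue = (orders
                 .groupBy(F.to_date("created_at").alias("order_date"))
                 .agg(F.sum("amount").alias("revenue")))

daily_revenue.write.mode("overwrite").parquet("/reports/daily_revenue")
```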

Data Lakes

In recent years, the Data Lake architecture has grown in popularity. In this model, source data is first extracted, with little to no transformation, into a set of raw datasets. The goal of these raw datasets is to faithfully model an upstream source system and its data, but in a way that scales for OLAP workloads (e.g., using columnar file formats). All data pipelines that express business-specific transformations are then executed on top of these raw datasets instead.
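
Here is a minimal sketch of this two-layer layout, restructuring the single-stage pipeline above, assuming PySpark with Parquet as the columnar format; the lake paths and column names are hypothetical. The business transformation now reads only from the raw dataset, never from the operational database.

```python
# Hypothetical two-layer pipeline: raw extraction first, business logic second.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data_lake_layers").getOrCreate()

# Layer 1: raw dataset -- an extract-only mirror of the source table, written
# in a columnar, OLAP-friendly format. This is the only step touching the source.
orders_extract = (spark.read.format("jdbc")
                  .option("url", "jdbc:mysql://prod-db:3306/shop")
                  .option("dbtable", "orders")
                  .option("user", "ingest")
                  .option("password", "REDACTED")
                  .load())
orders_extract.write.mode("overwrite").parquet("s3://lake/raw/orders")

# Layer 2: business-specific transformations run against the raw layer only,
# so they can be developed and re-run without touching the source database.
raw_orders = spark.read.parquet("s3://lake/raw/orders")
daily_revenue = (raw_orders
                 .groupBy(F.to_date("created_at").alias("order_date"))
                 .agg(F.sum("amount").alias("revenue")))
daily_revenue.write.mode("overwrite").parquet("s3://lake/derived/daily_revenue")
```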

Advantages

This approach has several advantages over the single-stage pipeline and avoids all the pitfalls listed above.

  • Scalable Design: Since the data is extracted once from each source system, the additional load on the source systems is drastically reduced. Further, the extracted data is stored in optimized file formats on petabyte-scale storage systems (e.g., HDFS, cloud stores) that are specifically built for OLAP workloads.
  • Standardized & Schematized: During ingestion of the raw datasets, a number of standardization steps can be performed and a schema can be enforced on the data, validating both structural integrity and semantics. This prevents bad data from ever making its way into the data lake (see the sketch after this list).
  • Nimble: Data engineers can develop, test, and deploy changes to transformation business logic independently, with access to large-scale parallel processing across thousands of compute cores.
  • Unlock data: Such a data lake houses all source data side by side, with rich access via SQL and other tools to explore it, join datasets, derive business insights, and even produce further derived datasets. Machine learning models have unfettered access to all of the data.
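
As a concrete illustration of the "Standardized & Schematized" point, here is a minimal sketch of schema enforcement at ingestion time, assuming PySpark and JSON events landing on the lake; the event schema, field names, and paths are hypothetical.

```python
# Hypothetical schema-enforced ingestion of raw event data.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schematized_ingest").getOrCreate()

# The declared schema acts as the data contract for this raw dataset:
# structural checks come from Spark, simple semantic checks are applied below.
event_schema = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("user_id", StringType(), nullable=False),
    StructField("event_ts", LongType(), nullable=False),  # standardized epoch millis
    StructField("payload", StringType(), nullable=True),
])

events = (spark.read
          .schema(event_schema)
          .json("s3://lake/landing/events/"))

# Reject records that violate the contract before they reach the raw dataset.
valid_events = events.filter(
    F.col("event_id").isNotNull()
    & F.col("user_id").isNotNull()
    & (F.col("event_ts") > 0)
)

valid_events.write.mode("append").parquet("s3://lake/raw/events")
```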

Implementation

In this section, we will share a few tips for implementing the data lake architecture, based on lessons learnt from actually building large data lakes.

  • Consider ingesting databases incrementally using change-capture systems or even JDBC-based approaches (e.g., Apache Sqoop) to improve data freshness and further reduce load on the databases.
  • Standardize both event streams from applications and these database change streams onto a single event bus (e.g., Apache Kafka, Apache Pulsar).
  • Ingest the raw datasets off the event bus using technologies that support upserts, so that database changes are compacted into a snapshot (e.g., Apache Kudu, Apache Hudi, Apache Hive ACID); see the sketch after this list.
  • Design the raw datasets to support efficient retrieval of new records (similar to change capture), either by partitioning the data (e.g., using the Apache Hive metastore) or by using a system that supports change streams (e.g., Apache Hudi).
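
To tie the last two points together, here is a minimal sketch of compacting database change records into a queryable snapshot with Apache Hudi's Spark datasource, then pulling only the records changed after a given commit. The table name, record key, and paths are hypothetical, and the exact option names and format alias can vary across Hudi versions, so treat this as a sketch rather than a reference.

```python
# Hypothetical Hudi-based ingestion: upsert change records into a snapshot
# table, then read changes incrementally downstream.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi_upsert_ingest").getOrCreate()

# Change records pulled off the event bus and staged as a DataFrame,
# one row per changed source record.
changes = spark.read.parquet("s3://lake/staging/users_changelog/")

# Upsert into the raw dataset: records sharing a key are merged, with the
# precombine field deciding which version of a record wins.
(changes.write.format("hudi")  # older releases use "org.apache.hudi"
    .option("hoodie.table.name", "users_raw")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.partitionpath.field", "country")
    .mode("append")
    .save("s3://lake/raw/users"))

# Downstream pipelines can retrieve only records changed since a given commit
# time, instead of rescanning the whole raw dataset.
incremental = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20190601000000")
    .load("s3://lake/raw/users"))
incremental.createOrReplaceTempView("users_changes")
```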

Disclaimer: I am a co-creator of Apache Hudi and part of its PPMC.
