The Enterprise Big Data Lake

Feb 27, 2022 Rahul Sharma

<aside> 💡 This is my book review of The Enterprise Big Data Lake by Alex Gorelik. All images belong to O’Reilly.

</aside>

I. Introduction

There is plenty of material on the internet about how to set up a data lake (e.g., the storage layer, ingestion pipeline, and Spark cluster), but very little on why and when you actually need one.

In this book, Alex Gorelik summarizes why organizations need a data lake, how their needs evolved over time, why the data warehouse fell short for big data analytics, and more.

Alex also touches on less-discussed aspects of a data lake, such as data quality, data catalogs, and data governance.

II. Key takeaways

Value prop of a data lake
Enterprises trying to introduce a data lake and big data analytics commonly take one of three approaches:
  1. Offload some existing functions to the data lake and gradually expand from there, e.g., lift-and-shift existing on-prem ETL jobs to the data lake
  2. Start with a data science initiative, show strong ROI (to get business buy-in), and gradually expand to a lake
  3. Build the data lake from a data governance perspective
1. When using operational data for analytics, it can be useful to transform it for several reasons:
  1. Harmonization: Data from different sources is converted to a common format, i.e., disparate attributes from disparate sources are studied and mapped to a common, unified set.
  2. Entity resolution: Different instances of the same entity (e.g., customer) from different sources need to be identified and designated as one single entity.
  3. Performance optimization: Some schemas (e.g., star) are better suited for OLAP
2. Best practices for finding and documenting data (and the enterprise)
  1. Crowdsource the tribal knowledge and make it available to everyone
  2. Identify and encourage SMEs to share their domain expertise and tie that to an incentive
  3. Automate annotation of data sets
3. Three pillars to establish trust in data
  1. Data quality: how complete and clean the data is
  2. Lineage: where the data came from
  3. Stewardship: who created the dataset, and why
4. Advantages of merging multiple data lakes
  1. Optimized resource usage
  2. Optimized admin and operational cost
  3. Data redundancy reduction
  4. Reuse in enterprise projects
5. Advantages of keeping data lakes separate
  1. Regulatory constraints
  2. Organizational barriers
  3. Predictability
6. Data lake maturity
7. Breakdown of data lake into workspaces or zones
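
To make the harmonization and entity resolution ideas from takeaway 1 concrete, here is a minimal Python sketch. It is my own illustration, not code from the book. The source names (`crm`, `billing`), the field mappings, and the sample records are all hypothetical, and matching customers on a normalized email alone is a simplifying assumption.

```python
# Hypothetical field mappings: each source system's attribute names
# mapped onto one common schema (harmonization).
FIELD_MAPS = {
    "crm": {"cust_name": "name", "email_addr": "email"},
    "billing": {"full_name": "name", "contact_email": "email"},
}


def harmonize(source: str, record: dict) -> dict:
    """Rename a source record's attributes to the common schema."""
    mapping = FIELD_MAPS[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def resolve_entities(records: list[dict]) -> list[dict]:
    """Merge records that share a normalized email into one entity.

    Matching on email alone is a simplifying assumption; real entity
    resolution usually needs fuzzy matching across several attributes.
    """
    merged: dict[str, dict] = {}
    for record in records:
        key = record["email"].strip().lower()
        entity = merged.setdefault(key, {"email": key})
        # Keep the first non-empty value seen for each attribute.
        for field, value in record.items():
            if field != "email" and value:
                entity.setdefault(field, value)
    return list(merged.values())


# Made-up sample rows describing the same customer in two systems.
crm_rows = [{"cust_name": "Ada Lovelace", "email_addr": "Ada@Example.com"}]
billing_rows = [{"full_name": "A. Lovelace", "contact_email": "ada@example.com "}]

unified = [harmonize("crm", r) for r in crm_rows] + [
    harmonize("billing", r) for r in billing_rows
]
customers = resolve_entities(unified)
print(customers)
```

Here the two source rows collapse into a single customer record because their emails match after trimming and lowercasing. In practice the matching rule is the hard part, and it would likely need to consider names, addresses, and other attributes.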

III. Fancy words

Data puddle: A database; an attempt at a data lake
Data swamp: A data lake gone awry; garbage in, garbage out
Shadow IT: Business units (BUs) running their own tech, separate from central IT