By Julien Le Dem

As I looked back at my experience working in data engineering for this post, I realized I never really consciously decided to specialize in data. It just kind of happened. The company I was working for was acquired by Yahoo!, where Hadoop was emerging as the next industry leap in data processing. As I dug deeper into the platforms I was using and became interested in open source software, I inadvertently found a career and a community there.
Now I’m going to talk about the evolution of the industry towards what I like to call the open data lake, also referred to as the Lakehouse pattern. To get there, we’ll take a little trip down memory lane to better understand how we arrived where we are.

Before the cloud

The Origin of Times

Back when “Big Data” was in full bloom in the late 2000s, an open source project called Hadoop freed us from the shackles of convenient but expensive distributed transactional databases. Based on the Google MapReduce and Google File System papers, it turned our assumptions around and forced us to think about solving data processing problems differently. One key design principle of that system was to move processing to the data. The mapper would read locally and try to minimize how much data was sent over the network to the reducer, and the reducer would write back locally.
Data is stored as files in a datacenter-sized distributed file system. Files can only be appended to, and everything written is immutable until deleted. Who needs transactions when data is immutable?
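
To make that mapper/reducer pattern concrete, here is a minimal word-count sketch written for Hadoop Streaming (the script names are illustrative, not taken from any particular project): the mapper emits small (word, 1) pairs from the block it reads locally, and the reducer sums them and writes the totals back to the file system.

```python
#!/usr/bin/env python3
# mapper.py -- runs on a node holding a block of the input file, reads it
# locally, and emits small (word, 1) pairs so that only those pairs travel
# over the network to the reducers.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- receives the pairs sorted by word, sums the counts for each
# word, and writes the totals back to the distributed file system.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Submitted through the Hadoop Streaming jar, each mapper reads its block of the input where it sits, which is the “move processing to the data” principle in action.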

Better file formats

While Hadoop brought a lot of flexibility, it did so through low-level APIs that didn’t define any efficient mechanism for querying data. Learning from MPP databases, we created Parquet as a columnar format, enabling the flurry of SQL-on-Hadoop projects to query data in a distributed file system faster and more efficiently. At Twitter in particular, we were trying to make Hadoop a little bit more like Vertica.
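
To illustrate the point, here is a small sketch using the pyarrow library (one convenient way to produce and read Parquet files; the table contents and file name are made up for the example). The part that matters is the read: because values are laid out column by column, an engine that needs only one column can skip the bytes of all the others.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to Parquet. On disk, each column's values are stored
# (and compressed) contiguously, instead of row by row.
events = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "FR", "US", "BR"],
    "latency_ms": [120, 87, 430, 55],
})
pq.write_table(events, "events.parquet")

# A query that needs a single column reads only that column's pages,
# which is what makes columnar scans over a distributed file system
# so much cheaper than reading whole rows.
countries = pq.read_table("events.parquet", columns=["country"])
print(countries.to_pydict())
```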

Tying Storage and Compute on prem

Around that time, the running joke was “What’s the cloud? Somebody else’s computer!” Docker didn’t even exist, and the jokes about “serverless” hadn’t been invented yet.

Procuring a Hadoop cluster required ordering many machines and installing them. This would be a lengthy process, with eyebrows raised and questions that were hard to answer.
In particular, since we are “moving compute to the storage”, it is important to size the machines appropriately so that each node has a ratio of compute to storage that minimizes wasted resources. The right ratio depends on the workload:
- Are we going to be CPU intensive (need more CPU per node)?
- Are we going to be IO intensive (more disks per node)?
The short answer is: we don’t know, and it will change in ways that are very hard to predict anyway.
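
A hypothetical back-of-envelope calculation shows why this is such a guessing game (every number below is invented for illustration): with a fixed node spec, an IO-heavy workload and a CPU-heavy workload call for clusters of very different sizes, and either way one resource sits idle.

```python
# Hypothetical sizing exercise; all numbers are assumptions for illustration.
cores_per_node = 16      # fixed node spec chosen at procurement time
disk_tb_per_node = 24

total_data_tb = 1200     # data to store, including replication
cores_needed = {
    "io_heavy_scans": 400,    # mostly reading off disk
    "cpu_heavy_joins": 2400,  # mostly crunching in CPU
}

nodes_for_storage = total_data_tb / disk_tb_per_node
for workload, cores in cores_needed.items():
    nodes_for_cpu = cores / cores_per_node
    # Buy enough nodes for whichever resource is the bottleneck;
    # the other resource is over-provisioned on every node.
    nodes = max(nodes_for_storage, nodes_for_cpu)
    idle = "CPU" if nodes_for_storage > nodes_for_cpu else "disk"
    print(f"{workload}: {nodes:.0f} nodes, spare {idle} on each one")
```

Whatever ratio we commit to at purchase time, the workload mix eventually drifts away from it.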