The advent of the Open Data Lake
As I looked back at my experience working in data engineering for this post, I realized I never really consciously decided to specialize in data. It just kind of happened. The company I was working for was acquired by Yahoo!, where Hadoop was emerging as the next industry leap in data processing. As I dug deeper into the platforms I was using and became interested in open source software, I specialized without quite meaning to, and I found there a career and a community.
Now I’m going to talk about the evolution of the industry towards what I like to call the open data lake, also referred to as the Lakehouse pattern. To get there, we’ll take a little trip down memory lane to understand better how we got here.
Before the cloud
The Origin of Times
Back when “Big Data” was in full bloom in the late 2000s, an open source project called Hadoop freed us from the shackles of convenient but expensive distributed transactional databases. Based on the Google MapReduce and Google File System papers, it turned our assumptions around and forced us to think about solving data processing problems differently. One key design principle of that system was to move processing to the data: the mapper would read locally, try to minimize how much data would be sent over the network to the reducer, and the reducer would write back locally.
Data is stored as files in a datacenter-size distributed file system. Files can only be appended to and everything written is immutable until deleted. Who needs transactions when data is immutable?
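To give a flavor of that programming model, here is a tiny word-count sketch in the Hadoop Streaming spirit (a local simulation, not an actual Hadoop job): the framework would run the mapper next to the data blocks and only ship the much smaller (word, count) pairs over the network to the reducers.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Runs where the data lives: reads a local block of text, emits (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Receives pairs grouped by key after the shuffle, writes aggregated counts back locally.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Simulate the shuffle in-process: map, sort/group by key, reduce.
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```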
Better file formats
While Hadoop brought a lot of flexibility, it did so by providing low-level APIs that didn’t define any efficient mechanism to query data. Learning from MPP databases, we created Parquet as a columnar format, enabling the flurry of SQL-on-Hadoop projects to be faster and more efficient at querying data in a distributed file system. At Twitter in particular, we were trying to make Hadoop a little bit more like Vertica.
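As a quick illustration of what a columnar file looks like in practice, here is a minimal sketch of writing Parquet with pyarrow (the table and column names are made up for the example); each column is stored, encoded, and compressed separately, along with statistics that become useful later in this post.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small table with hypothetical columns; Parquet stores each column contiguously
# with its own encoding, compression, and per-row-group statistics.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["FR", "US", "US", "DE"],
    "latency_ms": [120, 85, 240, 97],
})
pq.write_table(table, "events.parquet", compression="zstd")

print(pq.ParquetFile("events.parquet").metadata)  # row groups, column chunks, statistics
```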
Tying Storage and Compute on prem
Around that time, the running joke was “What’s the cloud? Somebody else’s computer!”. Docker didn’t even exist, and the jokes about “serverless” hadn’t been invented yet.
Procuring a Hadoop cluster required ordering many machines and installing them. This would be a lengthy process with eyebrows raised and questions that would be hard to answer.
In particular, since we are “moving compute to the storage”, it is important to size the machines appropriately so that each node has a ratio of CPU, memory, and disk that minimizes wasted resources. This depends on the workload:
- Are we going to be CPU intensive (need more CPU per node)?
- Are we going to be IO intensive (more disks per node)?
The short answer is: we don’t know and it will change in ways that are very hard to predict anyway.
Invariably, we would either run out of storage, and adding more nodes would underutilize CPUs, or run out of compute, and adding more nodes would underutilize disks. Scaling would take months anyway, so users would deal with this as best they could. Prioritizing workloads and storage was a headache. A lot of friction every way you look at it.
This problem wasn’t limited to Hadoop and MapReduce; early MPP databases like Vertica and Redshift had the same constraint, tying together compute and storage.
Taking the cloud seriously
Elasticity and separation of Storage and Compute
Over time, the jokes about the cloud went stale and it became a thing. As network bandwidth was increasing 10 times faster than CPU/memory bandwidth, moving compute to storage became less necessary. S3 was born and things started changing.
When you scale compute and storage separately and resources are almost infinitely and instantly available (I’ll assume a spherical cloud in a vacuum for this paragraph) a lot of friction can go away.
Storage is on-demand and doesn’t require knowing how much compute you will need to process the data. It doesn’t even require knowing if you’ll need to read the data at all. It is even cheaper if you don’t read it. Compute resources are allocated on-demand and you pay only for what you use. Nobody has to compete for scarce resources anymore. If you have the budget to use more, then you can just use more.
As this happened, Netflix was a pioneer in leveraging the cloud for their infrastructure. They innovated in their use of Hadoop by making blob storage (in this case S3) the persistent store for their data using Parquet as the file format and bringing Hadoop clusters up and down in the cloud to allow for more flexible workloads.
Newer MPP Databases like Snowflake and BigQuery were designed from the ground up as cloud native, storing their data in blob storage with decoupled compute. (And Redshift also evolved in that direction).
New problems
As this ecosystem evolved, we saw the rise of vertically integrated cloud data warehouses. After importing your data into your cloud data warehouse provider of choice, you could query it and generate derived datasets. Compute and storage were decoupled and scaled independently in the modern data warehouse (Snowflake, BigQuery, …), but it was very much not expected, nor even considered, that another engine could access the storage layer: the engine expected data to be in the exact proprietary format it was optimized for. In the Hadoop ecosystem, by contrast, the norm was that all data lived in one place to be accessed by various tools.
Importing data into the warehouse would come at a cost, as it would be transformed into the vendor’s native representation, and any derived data would also live in that walled garden. Exporting data to process it with a tool external to the vendor’s stack would also come with a fee.
This caused the appearance of silos. Which data you could use with each tool would depend on where it was produced and stored. The volume of data, and the time and cost it would take to copy it, made syncing all data across tools prohibitively expensive. Data transfer became a complex network of one-off ETL jobs. Again, a lot of friction.
Meanwhile, the Hadoop ecosystem followed Netflix’s lead in using blob storage for data persistence. There was less and less Hadoop in it. Spark became the tool of choice for distributed processing on top of that storage. The term data lake became a thing and new jokes just started writing themselves. Suddenly competitors all had data swamps.
A proper table abstraction
The idea of connecting the Warehouse to the data lake wasn’t far-fetched. People didn’t like having to deal with silos.
Remember how I started by saying that we were storing a bunch of immutable files in a big bucket? It turned out that being able to mutate data was actually needed and therefore transactions would prove useful. Who would have thought?
Having the ability to accept late data, to fix bad data already ingested by reprocessing it, and to run a whole bunch of maintenance operations required some level of transactionality to make updates atomic. This is the only way we can trust that data won’t get corrupted at scale.
For data updates
In this data lake world, and in OLAP in general, we avoid processing small transactions: most processing is performed in batch and updates are executed in bulk. This gives some leeway to implement atomicity (if not full transactions) in a way that is not outrageously costly.
In particular, we can provide snapshot isolation: data can be updated while another process is reading it, without affecting that reader. The reader keeps seeing the same snapshot of the data as of when it started reading; the next run of that job will see the newer version. In a world where processing is orchestrated as Directed Acyclic Graphs (and even cyclic ones, since there is no cycle at the snapshot level), updates propagate asynchronously and we maintain eventual consistency.
When sharing data through immutable files without centralized coordination of data access, you cannot really update anything without potentially breaking the data consumers. In the Hadoop era, as files were created in a distributed system, we needed an atomic mechanism to know that a dataset made of multiple files generated in parallel was complete. Usually this happened either by atomically renaming a temporary folder to its destination, or by creating a _SUCCESS file at the end that the consumer would know to wait for. The centralized Hadoop namenode guaranteed the atomicity of those operations; unfortunately, that also made it a single point of failure that did not scale horizontally. Blob storage implementations chose different trade-offs.
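As an illustration, here is a minimal sketch of that marker-file convention (hypothetical paths, with a local filesystem standing in for HDFS): the producer writes all of its part files first and only then publishes a _SUCCESS marker, while the consumer refuses to read until the marker exists.

```python
import os

def publish_dataset(output_dir: str, parts: dict) -> None:
    # Write every part file first; the dataset is not considered visible yet.
    os.makedirs(output_dir, exist_ok=True)
    for name, payload in parts.items():
        with open(os.path.join(output_dir, name), "wb") as f:
            f.write(payload)
    # Only once everything is written, drop the marker that signals completeness.
    open(os.path.join(output_dir, "_SUCCESS"), "w").close()

def read_dataset_if_complete(output_dir: str):
    # Consumers wait for the marker instead of guessing whether the job finished.
    if not os.path.exists(os.path.join(output_dir, "_SUCCESS")):
        return None
    return sorted(f for f in os.listdir(output_dir) if not f.startswith("_"))
```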
While these mechanisms enabled atomic data creation in limited patterns, they didn’t allow modifying data at all. In many ways, data engineering then relied on everything always working correctly the first time around. Fixing problems required extreme care and extensive knowledge of the interdependencies of various jobs and how they were scheduled.
This led to the creation of Iceberg at Netflix, with the goal of defining a better table abstraction on top of blob storage.
The general idea is to leverage the bulk update nature of the data lake to keep track of each snapshot of a table. One can atomically update the pointer to the current snapshot of the table and keep track of previous snapshots without affecting any process currently reading the table.
Suddenly, many operations become a lot easier. You can rewrite a table atomically without any need for coordination. Partitioning is abstracted and we can decouple data access from the actual layout of the table, enabling storage optimization independently of the jobs accessing it.
Yes, we also re-invented the DBA role.
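To make the snapshot idea concrete, here is a toy sketch of the pointer mechanism (not Iceberg’s actual file layout or API): every commit writes a new, immutable snapshot manifest listing its data files, then atomically swaps a small pointer to make it current, so readers that already resolved the previous pointer keep seeing a consistent snapshot.

```python
import json
import os
import uuid

TABLE_DIR = "warehouse/events"  # hypothetical table location

def commit_snapshot(data_files: list) -> str:
    """Write an immutable snapshot manifest, then atomically repoint the table to it."""
    os.makedirs(os.path.join(TABLE_DIR, "snapshots"), exist_ok=True)
    snapshot_id = uuid.uuid4().hex
    manifest_path = os.path.join(TABLE_DIR, "snapshots", f"{snapshot_id}.json")
    with open(manifest_path, "w") as f:
        json.dump({"snapshot_id": snapshot_id, "data_files": data_files}, f)

    # Atomic swap of the "current snapshot" pointer (os.replace is atomic on POSIX).
    # Real table formats do this with a compare-and-swap in a catalog or on the object store.
    tmp_pointer = os.path.join(TABLE_DIR, f".current.{snapshot_id}.tmp")
    with open(tmp_pointer, "w") as f:
        f.write(manifest_path)
    os.replace(tmp_pointer, os.path.join(TABLE_DIR, "current"))
    return snapshot_id

def read_current_snapshot() -> dict:
    """Readers resolve the pointer once, then see a stable, immutable snapshot."""
    with open(os.path.join(TABLE_DIR, "current")) as f:
        manifest_path = f.read()
    with open(manifest_path) as f:
        return json.load(f)
```

Previous snapshots stay on disk untouched, which is also what makes time travel and rollback cheap.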
For efficient data access
The combination of blob storage, a columnar format, sorting, partitioning and table metadata enables efficient data access for a large array of OLAP use cases.
When a query engine accesses data, it wants to minimize how much data needs to actually be scanned or deserialized from the storage format. Iceberg facilitates this by consolidating table metadata from the files that make up a snapshot.
A query engine reduces the cost of scanning columnar data in a few ways (see the sketch after this list):
- Projection push down:
- By reading only the columns it needs.
- Predicate push down:
- By skipping the rows that it doesn’t need to look at. This typically leverages embedded statistics.
- By skipping ahead more efficiently while decoding values, simply by understanding the underlying encodings. Skipping in run-length encoding or fixed-width encodings is really cheap.
- Vectorization:
- By using vectorized conversion from the on-disk Parquet columnar representation to the in-memory Arrow representation.
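Here is a minimal sketch of the first two of these with pyarrow, reusing the hypothetical events.parquet file and column names from earlier: the engine only materializes the columns it asks for, and the file statistics let it skip data that cannot match the predicate.

```python
import pyarrow.parquet as pq

table = pq.read_table(
    "events.parquet",
    columns=["user_id", "latency_ms"],   # projection push down: other columns are never read
    filters=[("latency_ms", ">", 200)],  # predicate push down: skips data that cannot match, using statistics
)
print(table.num_rows)
```

The vectorized part comes for free in this sketch: read_table decodes the Parquet pages straight into Arrow’s in-memory columnar representation.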
Breaking up the silos
With this new setup, you can own your data, store it once in your own bucket and connect various processing engines, open source or proprietary (Spark, Trino, Data warehouses, ML, etc).
There is no need to import or export data, trying out new tools is easy and data can be mutated without requiring additional coordination.
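As a small illustration (assuming the hypothetical events.parquet from earlier, with pyarrow and DuckDB installed), the very same file can be scanned in place by two unrelated engines, with no import or export step in between:

```python
import duckdb
import pyarrow.parquet as pq

# One copy of the data, two different engines reading it where it sits.
arrow_table = pq.read_table("events.parquet")
print(arrow_table.num_rows)

print(duckdb.sql("SELECT country, avg(latency_ms) FROM 'events.parquet' GROUP BY country"))
```

In a real setup the path would point to a bucket and the table would be resolved through an Iceberg catalog rather than a raw file path, but the pattern is the same: the engines come to the data.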
One very nice property of the whole setup is that all the distributed data access, whether it’s to deal with atomicity of snapshot creation or efficient access for queries, is entirely served by your blob storage of choice. There are very few concerns in terms of scalability or elasticity. Do you need to run a very large job that’s going to read a lot of data as a one-off? No problem. Do you need more storage at very short notice? Also no problem. S3 is probably one of the largest storage services on the planet; it is pretty cheap overall, and it provides many storage tiers if you need it to become cheaper. You don’t need to read that data any time soon? It can move to a cheaper storage tier.
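For the storage-tier point, a minimal sketch with boto3 (hypothetical bucket name and prefix; the exact delay and storage class are up to you) would be a lifecycle rule that transitions cold data automatically:

```python
import boto3

s3 = boto3.client("s3")
# Move anything under the archive/ prefix to a colder, cheaper tier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-cold-data",
                "Filter": {"Prefix": "archive/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```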
Owning your data in cheap scalable storage in your own cloud account allows you to avoid vendor lock-in and removes silos.
You are now free to optimize your cloud costs better than if storage were abstracted behind several vendors.
Why now?
If you’re wondering what all the fuss is about lately, this is what’s driving the adoption of Iceberg with Parquet to implement the Open Data Lake. Vendors are adopting this pattern not only because their customers do not appreciate vendor lock-in (they never did) but because there is enough momentum that there is an alternative and they will lose market share if they don’t.
It’s all columnar data in blob storage anyways. You may as well be the one taking advantage of it.