Posts
Bottom up Architecture
The topic of software architecture has become a bit cringe. Some people will roll their eyes at the mere mention of it.
The advent of the Open Data Lake
As I looked back at my experience working in data engineering for this post, I realized I never really consciously decided to specialize in data. It just kind of happened. The company I was working for was acquired by Yahoo!, where Hadoop was emerging as the next industry leap in data processing. As I dug deeper into the platforms I was using and became interested in open source software, I inadvertently ended up specializing anyway, and found a career and a community there.
Now I’m going to talk about the evolution of the industry towards what I like to call the open data lake, also referred to as the Lakehouse pattern. To get there, we’ll take a little trip down memory lane to better understand how we got here.
The Future of Lineage
Lineage has long been a requirement for anyone processing data - whether for complying with regulations, ensuring data reliability, or, to quote Marvin Gaye, plainly just knowing what’s going on, from provenance to impact analysis.
However, our industry has historically had difficulties collecting data lineage reliably.
From the early days of lineage powered by spreadsheets, we’ve come a long way towards standardizing lineage. We have evolved from painful, manual approaches to automated operational lineage extraction across batch and stream processing.
Now, we’re on the brink of a new era in which lineage will be built into every data processing layer - whether ETL, data warehouse, or AI - and not bolted on as an afterthought.
The Deconstructed Database
In 2018, I wanted to describe how the components of databases, distributed or not, were being commoditized as individual parts that anyone could recombine into use-case specific engines. Given your own constraints, you can leverage those components to build a query engine that solves your problem much faster than building everything from the ground up. I liked the idea of calling it “the Deconstructed Database” and gave a few talks about it.
Improv for Engineers
Recently I signed up for an improv class and it’s been a lot of fun. It had been way too long since the last time I had taken classes, back when I was at Twitter, and I wish I had done this earlier. That class I took ten years ago was part of “Twitter University”, a program designed to help employees develop their skills. There, you could learn about many topics, from programming in Scala to improv. Employees were also encouraged to teach classes. For example, I taught the analytics onboarding class for a while, along with a few other one-off classes.
Chapter III: Onwards, OpenLineage
Just as there was a common need for a columnar file format and a columnar in-memory representation, there's a common need for lineage across the data ecosystem. In this chapter, I'm telling the story of how OpenLineage came to be and filled that need.
Chapter II: From Parquet to Arrow
In 2015, a discussion started in the Parquet community around the need for an in-memory columnar format. The goal was to enable vectorization of query engines and interoperability of data exchange. The requirements were different enough from Parquet to warrant the creation of a different format, one focused on in-memory processing.
Chapter I: The birth of Parquet
15 years ago (2007-2011) I was at Yahoo! working with Map/Reduce and Apache Pig, which was the better Map/Reduce at the time. The Dremel paper had just come out and, since everything I worked with seemed to be inspired by Google papers, I read it. I could see it applying to what we were doing at Yahoo!, and it was to become a big inspiration for my future work.
Ten years of Building Open Source Standards: From Parquet to Arrow to OpenLineage
Over the last decade, I have been lucky enough to contribute to a few successful open source projects in the data ecosystem.
Dremel made simple with Parquet
Columnar storage is a popular technique to optimize analytical workloads in parallel RDBMSs. The performance and compression benefits for storing and processing large amounts of data are well documented in academic literature as well as in several commercial analytical databases.