By Julien Le Dem

Lineage has long been a requirement for anyone processing data - whether for complying with regulations, ensuring data reliability, or, to quote Marvin Gaye, simply knowing what's going on - from provenance to impact analysis.
However, our industry has historically had difficulties collecting data lineage reliably.
From the early days of lineage powered by spreadsheets, we’ve come a long way towards standardizing lineage. We have evolved from painful, manual approaches to automated operational lineage extraction across batch and stream processing.
Now, we’re on the brink of a new era when lineage will be built into every data processing layer - whether ETL, data warehouse or ai - and not an afterthought.

Why lineage?

Lineage is not a feature; it is a means to an end. You need lineage to achieve a specific goal, but that goal is often obscured behind the oversimplification that "we just need lineage". When someone asks for lineage, they can mean many different things. Depending on the context, the lineage requirement is often multi-faceted, like the proverbial three kids in a trenchcoat.
Let’s break it down.

Data reliability

We want to make sure our data is available, updated on time, and correct. Those attributes are not just intrinsic qualities of a dataset; they derive directly from the process of producing it from other upstream datasets. Problems with data almost always come from an upstream dependency. Whether it's a delay in ingestion, a bad data update, or a change in how we collect data, anything that changes upstream can affect quality. This is why lineage is key to measuring and troubleshooting data quality issues. Guaranteeing data quality also means that quality commitments must be consistent across dependencies. For example, if a dashboard is critical, we must ensure that all datasets upstream of that dashboard - directly or indirectly - meet the same level of quality requirement. A production-level dataset should not depend on an experimental dataset with no on-call rotation or data quality expectations.

Impact Analysis for change management

For the same reason that we investigate sources of data unreliability by looking upstream, we also need to check that changes to the datasets we are responsible for don't adversely impact downstream consumers. Modern data engineering means emancipating ourselves from the uncontrolled flow of upstream changes that hinders our ability to deliver quality data. Through lineage, we can analyze the impact of a change on the reliability of other datasets and avoid unwittingly breaking them. We can define best practices based on explicit dependencies and contracts to work better together.

Data discovery, data governance

Data tends to accumulate and increase in complexity. It goes through several layers of cleaning and modeling before it becomes a reliable source of insights. For example, you might have many tables named "customers": it is critical to use the right data source and join on the correct ID. You cannot base your analysis on raw data from which spam and internal usage have not been removed. You cannot copy data containing PII into another table. For all those reasons, you need a way to discover everything that exists, clarify what is usable, and understand how datasets are joined together. You need to be able to verify that the data you rely on is derived from the correct layer of modeling. Lineage enables our understanding of where data is coming from and where it is going.

Compliance

In addition to expecting basic levels of correctness and understanding of our data, we are also often simply held accountable by regulators for our practices. Whether it's guaranteeing our users' privacy by making sure their data is stored and used appropriately, or tracking the flow of transactions to guarantee correctness, there are financial repercussions to not meeting the accepted bar for understanding how data flows from one dataset to the next. In a word: lineage.

A brief history of lineage collection

The Antiquity

In the olden days, we used to collect lineage manually. There would be a program organizing the collection of lineage. Someone would define a template (most likely a spreadsheet) and then ask the various people responsible for collecting, transforming, or using data to fill in the document. There would be extended back-and-forth communication to ensure complete coverage. Verifying integrity was difficult, but since the process was defined and we were following it, from a compliance perspective this was a success. Unfortunately, by the time the process was over, we had to start again for the next iteration. With more people and more data, this quickly becomes untenable: collecting lineage takes more time than the frequency of audits allows, creating a significant burden on the organization.

The middle ages

As people suffered the manual toil of collecting lineage, the next step to improve the situation was clear: automation. Any good engineer would look at our data system of choice and figure out that most of the time, the lineage information is already there, latent. It is implicitly encoded in all the SQL queries and other programs accessing and transforming data.

The thinking goes: we can reverse engineer all those transformation layers, write a SQL parser here and there, instrument the libraries, and automatically audit data access. This is way better than doing it by hand across an entire organization.
However, we haven’t yet found our silver bullet. This solution has a couple drawbacks. First, if open source solutions are easier to reverse engineer, proprietary databases are more opaque. Second, we create a dependency on a vast surface area of internal apis that have no guarantee of stability. Every vendor/system has their own way to process data which produces a solution with vast amounts of complexity. This also makes it brittle as it requires constant fixing as those internal APIs change over time.
Maintaining those integrations is expensive and the few lineage vendors in this field have been acquired and disappeared over time, making it difficult to rely on as a solution for lineage collection.

The renaissance

As we progressed in our quest for lineage, and met others sharing the same goal along the way, it became obvious that we could all benefit from uniting behind a common solution.
By standardizing how lineage is collected, we solve several of the problems burdening our reverse-engineering approach. We share the cost with others, creating more value for everyone. But sharing solutions alone doesn't eliminate the ongoing maintenance burden of keeping up with a complex and fast-moving data ecosystem.

The revolution

The final step of standardization is to move the responsibility of producing lineage metadata to the producer of the data itself. As lineage emerges as a common need for all data practitioners, it becomes a requirement for all data tools, open source or proprietary. Now that OpenLineage has standardized how lineage is represented, there is an easy path for every data processor to expose lineage as a built-in feature rather than an afterthought.
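
To make this concrete, here is a minimal sketch of what emitting a standardized lineage event looks like with the openlineage-python client. The backend URL, job, and dataset names are made up for illustration, and exact class locations can vary slightly between client versions.

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

# Hypothetical lineage backend (e.g. a local Marquez instance).
client = OpenLineageClient(url="http://localhost:5000")

event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="example_pipeline", name="load_orders"),
    producer="https://example.com/jobs/load_orders",  # identifies the system emitting the event
    inputs=[Dataset(namespace="warehouse", name="raw.orders")],
    outputs=[Dataset(namespace="warehouse", name="analytics.orders")],
)

# One call describes the run, its job, and the datasets it read and wrote.
client.emit(event)
```

Because the event format is standardized, any compliant backend or catalog can consume it without knowing anything about the tool that produced it.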

Where we are and the future

The dream of standardized lineage is steadily becoming reality. We can now see the day coming when lineage will be a table-stakes feature of every data pipeline.
In particular, let’s review support for lineage in key open source projects.

Airflow

Lineage support in Airflow started as manual annotations. Soon after, OpenLineage provided automated lineage extraction with its Airflow integration. However, since it was external to the Airflow project, it would occasionally get broken by changes in internal APIs. As of Airflow 2.7, this is no longer the case: Airflow provides built-in support for OpenLineage, and it is now the responsibility of each operator to expose its lineage.
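
As an illustration, here is a rough sketch of what that looks like from an operator author's point of view, using the OperatorLineage interface exposed by the OpenLineage provider. The operator and dataset names are invented, and exact import paths vary across provider versions.

```python
from airflow.models.baseoperator import BaseOperator
from airflow.providers.openlineage.extractors import OperatorLineage
from openlineage.client.run import Dataset


class CopyOrdersOperator(BaseOperator):
    """Hypothetical operator that copies raw orders into an analytics table."""

    def execute(self, context):
        ...  # the actual copy logic would live here

    def get_openlineage_facets_on_complete(self, task_instance) -> OperatorLineage:
        # The operator itself declares what it read and wrote; the OpenLineage
        # provider picks this up and emits the corresponding lineage events.
        return OperatorLineage(
            inputs=[Dataset(namespace="warehouse", name="raw.orders")],
            outputs=[Dataset(namespace="warehouse", name="analytics.orders")],
        )
```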

Flink

Flink is a great example of OpenLineage support for streaming jobs. Since a streaming job runs continuously until stopped, the integration emits events on checkpoints in addition to the start and complete events. It started as an external integration, which prompted discussions about a more native implementation of lineage in Flink. That effort paves the way for native OpenLineage support.

Spark

The OpenLineage Spark integration has become the most popular way to collect lineage from Spark jobs. It covers column-level lineage and provides up-to-date support as new versions of Spark are released. It is the recommended mechanism for collecting lineage in Dataproc. As with Airflow, we are working on making OpenLineage support native in Spark by enabling each DataSource to identify its datasets following the OpenLineage standard.
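
In practice, enabling the integration is mostly a matter of Spark configuration. The PySpark sketch below attaches the OpenLineage listener and points it at a lineage backend; the package version, URL, and namespace are placeholders to adapt to your environment.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orders_daily")
    # Pull in the OpenLineage Spark agent (pick the artifact and version matching your Spark build).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.19.0")
    # Register the listener that observes jobs and emits OpenLineage events.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Where to send the events (hypothetical local backend) and under which namespace.
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "spark_jobs")
    .getOrCreate()
)

# Any reads and writes performed in this session are now reported as lineage
# events to the configured backend.
```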

Trino

One of the most popular open source distributed SQL engines, Trino now has built-in support for OpenLineage.

Data Catalogs

Many major data catalog providers, open source or proprietary, accept OpenLineage as a source of lineage:

  • Atlan
  • AWS Glue
  • Datahub
  • GCP Dataplex
  • OpenMetadata
  • Marquez
  • Metaphor
  • Microsoft Purview
  • Select Star

Code-based Lineage

Companies like Foundational leverage static and dynamic analysis of source code to determine lineage at build time, enabling a better, more streamlined, and more reliable data engineering practice. These methods can provide lineage across multiple code repositories, giving visibility into lineage changes that result from code modifications. Accessing the code can also sometimes simplify lineage extraction, from both an operational and a cost perspective.

What’s next

At this point, it is clear that, in the future, lineage will be built into everything. Just as we don't really think about how electricity gets to the outlet, we won't even think about how lineage gets into our data catalog. We will just plug into the OpenLineage outlet.