In the last blog post we talked about data’s trustworthiness and using data lineage as one control to help establish the accuracy and veracity of data. Being able to track and ‘show’ a dataset’s origin and any transformations it has undergone helps to authenticate its accuracy for use in analytics. At the same time, deciphering data lineage is a complex task with its own set of challenges.
I think it’s helpful to frame the problems of data lineage in terms of different kinds or experiences of data lineage rather than in terms of a lineage monolith that applies in the same way (with the same constraints) to all users at all times.
Take, for example, the controls and data requirements around the practice of quarterly financial reporting, which, in order to ensure the validity of the information and the exactness of accounting, must have a clearly visible audit trail. To impose those same lineage rules (with all of their constraints) on a business analyst who works in a financial institution’s consumer fraud department is to hamstring that analyst.
In the first case, the attributes of lineage that matter most in financial reporting – e.g., detailed information about upstream source systems, with their specific tables, columns, and values, along with the specific transformations this data undergoes as it is engineered and persisted for downstream reporting and analysis – are of far less importance to the consumer fraud analyst.
Now imagine that a marketing analyst wants to harvest certain fields – e.g., date, time, and ZIP code information – from a sales transaction dataset. The analyst wants to enrich this data with data from Google Analytics, from social media, and from one or more subscription sources (e.g., a television advertising information service) so she can begin to study the effectiveness of her company’s recent media spend campaign. From her perspective, what matters is that the specific fields she depends on are accurate; if they are, she can trust the data for her analysis. This demand for access to data irrespective of its canonical completeness and veracity is a common one in analysis.
A New Way to Frame the Problem of Data Lineage
With these examples in mind, a new way to frame data lineage is to distinguish between three basic types of data lineage: first-party, second-party, and third-party lineage. The default perspective for this frame is that of the self-service user. This is actually a radical departure from the way we usually think about (and frame) lineage. In the self-service experience, data flows aren’t the only sources of data for analysis. The workbook itself is increasingly a source of data, too. The emergence of the workbook-as-source complicates – in point of fact, explodes – our traditional understanding of lineage. The frame I outline below gives us a new, workbook-friendly way of thinking about lineage.
Third-party lineage is a data lineage record that is produced by applications, services, processes, etc. that are “outside” of (e.g., upstream from) the self-service context. For example, an ETL tool generates what data management practitioners call “technical lineage:” a highly detailed record that (in its granularity) is mostly of interest to auditors, data stewards, etc. Ideally, this record comprises a complete account of the provenance of data, from source to target, beginning with mappings of upstream systems – with their constitutive tables, rows, columns, etc. – to tables, rows, columns, etc. on downstream targets. The canonical definition of technical lineage also specifies a no less complete record of all of the transformations data undergoes as part of its extraction, movement, and persistence in a downstream repository.
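To make the shape of a technical lineage record concrete, here is a minimal sketch in Python. The table names, column names, and transformation labels are hypothetical, invented for illustration; real ETL tools emit far richer metadata, but the core idea is the same: a source-to-target column mapping plus an ordered list of the transformations applied along the way.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnMapping:
    """One entry in a technical lineage record: maps an upstream
    column to its downstream target, with the transformations applied."""
    source_table: str
    source_column: str
    target_table: str
    target_column: str
    transformations: list = field(default_factory=list)  # applied in order

# Hypothetical fragment: one engineered column's provenance, source to target.
record = ColumnMapping(
    source_table="oltp.orders",
    source_column="order_ts",
    target_table="dw.fact_sales",
    target_column="order_date",
    transformations=["parse_timestamp", "truncate_to_date"],
)

print(f"{record.source_table}.{record.source_column} -> "
      f"{record.target_table}.{record.target_column} "
      f"via {record.transformations}")
```

A complete technical lineage record is, in effect, a large collection of entries like this one – which is exactly why, in its full granularity, it is mostly of interest to auditors and data stewards rather than to self-service users.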
Data integration processes are rarely closed-loop or self-contained, however. A Tableau workbook might be compounded of data from multiple data flows, some of which are the products of two or more data engineering steps. One data flow might begin with a bulk ETL process in which data is extracted from core OLTP systems and moved to a staging area or operational data store. This would trigger another ETL process that’s designed to produce a smaller (derived) data flow – e.g., related product, customer, and location data. At the same time, a separate ETL process is used to process a clickstream data feed, extract certain fields, validate them, and write the output to a file. In the self-service context, a business analyst imports this data into Tableau with the intent of enriching it with data from external sources – e.g., census data and social media data she’s procured from the Web. At this point, the analyst faces the unenviable task of establishing the lineage of all of these “third-party” data flows: where the data comes from, what is done to it, by whom it is done, along with where it “touches” prior to its being accessed by the analyst. Sometimes (as with an ETL tool) the process generates a detailed lineage record; in other cases, however, the lineage of a dataset might be less clear: rife with lacunae. This is the challenge of third-party lineage.
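The flows just described form a graph, and establishing third-party lineage amounts to walking that graph upstream from the workbook. The sketch below models the example as a simple adjacency structure (all node names are invented labels for the stages in the scenario above) and recovers every ancestor of the workbook – including the externally sourced datasets whose provenance is unclear.

```python
# Each key is a dataset or process stage; edges point downstream.
# Node names are hypothetical labels for the scenario in the text.
lineage = {
    "oltp_core":        ["staging_ods"],
    "staging_ods":      ["derived_pcl"],   # product/customer/location flow
    "clickstream_feed": ["validated_file"],
    "derived_pcl":      ["workbook"],
    "validated_file":   ["workbook"],
    "census_data":      ["workbook"],      # provenance unclear: a gap
    "social_media":     ["workbook"],      # provenance unclear: a gap
}

def upstream(node, graph):
    """Return every ancestor of `node` -- the set of third-party
    flows whose lineage the analyst must establish."""
    parents = {src for src, dsts in graph.items() if node in dsts}
    for p in set(parents):
        parents |= upstream(p, graph)
    return parents

print(sorted(upstream("workbook", lineage)))
```

Walking the graph is the easy part; the hard part, as the example shows, is that for nodes like the census and social media data there may be no upstream record to walk at all.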
First-party lineage is this process in miniature. If third-party lineage ideally comprises a genealogy of all of the antecedent steps in the production of a data flow, first-party lineage has to do with what is done to data when it is manipulated in the self-service context, be it Tableau, Power BI, or – with a nod to the folks sponsoring this blog – Unifi’s data catalog. First-party lineage can be highly granular in detail: analogous to or identical with traditional technical lineage. Ideally, however, it also gives the self-service user the option of hiding or abstracting most of this granularity. (I’ll explain what abstraction of this kind actually entails in a follow-up post.)
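A rough sketch of the idea: a dataset wrapper that records each self-service operation as it happens, and can then present that record either in full detail or as an abstracted summary. The class, its methods, and the sample data are all hypothetical, invented to illustrate the two views; no real tool’s API is implied.

```python
class TrackedDataset:
    """Sketch: wraps a dataset and records each self-service
    operation as a first-party lineage entry."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows
        self.log = []  # the first-party lineage record

    def filter(self, predicate, description):
        """Apply a row filter and log it as a lineage step."""
        self.rows = [r for r in self.rows if predicate(r)]
        self.log.append(("filter", description))
        return self

    def lineage(self, detailed=True):
        """Full step-by-step record, or an abstracted summary."""
        if detailed:
            return self.log
        return f"{self.name}: {len(self.log)} transformation(s) applied"

ds = TrackedDataset("sales", [{"zip": "94107"}, {"zip": "10001"}])
ds.filter(lambda r: r["zip"].startswith("9"), "keep West Coast ZIP codes")
print(ds.lineage())                 # full detail
print(ds.lineage(detailed=False))   # abstracted view
```

The `detailed` flag is the crux: the same record serves the auditor’s need for granularity and the analyst’s preference for a readable summary.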
Second-party lineage is like first-party lineage, albeit applied to datasets we receive from trusted sources – such as our coworkers. Think of second-party lineage as someone else’s first-party lineage. A trusted (internal) Tableau workbook is a great example of this.
This is just an outline of what I have in mind. I’ll say more about the particulars in the final blog post in the series.