When we say we “trust” the data in a dataset, what do we really mean?
This is actually a surprisingly complicated question. In data and analytic discovery, we think of “trust” as a function of the authenticity and veracity of a dataset. If, by applying certain processes and controls, we can “show” that the data in a dataset is complete, authentic, and accurate, we say it’s trustworthy. If the requisite processes or controls have failed or are lacking, however, we can’t vouch for the data’s accuracy or reliability, and so can’t confidently use it in our analysis. In this case, we say it’s untrustworthy.
Data lineage is one control we use to establish the accuracy and veracity of data.
To know the lineage of the data in a dataset is, notionally, not only to know where it came from (its provenance) but also to account for its permutations and peregrinations: who did something to it, what they did to it, when they did it, where they did it, and so on.
But lineage is a tricky thing. The “who” that’s encoded in a lineage record doesn’t have to be a person, per se, but can, instead, be any authenticated “user” – a person, a daemon or process, a RESTful service, a device. The thing that’s done to the dataset (the “what”) could take the form of any conceivable lineage-worthy event, such as an ETL process that manipulates a dataset’s contents, an update by an RDBMS that alters its metadata, a business analyst who appends a comment of some kind to it, or a person or process that … makes a copy of it.
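As a minimal sketch of what such a record might look like – the field names and values here are invented for illustration, not any tool’s or standard’s actual schema – a single lineage-worthy event could be modeled as:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    """One lineage-worthy event: who did what to which dataset, when, and where.

    All field names are illustrative, not drawn from any real lineage standard.
    """
    who: str         # any authenticated "user": a person, daemon, service, or device
    what: str        # e.g. "etl-transform", "metadata-update", "comment", "copy"
    when: datetime   # when the event occurred
    where: str       # the system or context in which the event occurred
    dataset_id: str  # the dataset the event applies to

# A daemon (not a person) transforming a dataset -- hypothetical values throughout.
event = LineageEvent(
    who="svc-etl-01",
    what="etl-transform",
    when=datetime.now(timezone.utc),
    where="spark-cluster-prod",
    dataset_id="customer_product_location_v2",
)
```

The point of the sketch is only that “who” and “what” are open-ended fields: any authenticated actor, any conceivable event.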
For the most part, determining the “who” and the “what” components of a dataset’s lineage seems like a relatively straightforward proposition – except when it isn’t. After all, datasets are rarely the products of a closed-loop data integration process. Imagine, for example, that the data in a Tableau workbook is the product of multiple data preparation or data engineering processes:

- a bulk ETL process that extracts data from core OLTP systems;
- another ETL process that engineers and transforms this OLTP data, producing a smaller (derived) dataset – e.g., related product, customer, and location data;
- an ETL process that profiles and parses web clickstream data, capturing specific product, customer, and location fields and writing them in CSV format to a text file;
- a user-collated data preparation step in which an analyst harvests census data with the intent of correlating it (as part of still another data preparation step) with customer, product, and location data obtained from prior ETL processes;
- and so on, and so forth.

Some of this ETL (e.g., extracts from OLTP systems) occurs as part of a batch process and is performed by a commercial ETL tool. Some (web clickstream capture) is coded in Python and runs in Spark. And some, again, is performed by a human being using a spreadsheet and/or data visualization tool.
What does it mean to speak of the “lineage” of the resultant dataset product? Ideally, this lineage would consist of a record of all of the manipulations that have been performed on the dataset. But creating a record of this kind is a hard problem that cuts across the people, process, and technology axes. On the plus side, ETL tools are lineage-creation mechanisms par excellence. By default, however, the lineage they record is maintained by the ETL tool itself. The same is true of the data engineering performed in Spark or by human analysts: even if each of these steps produces a record of the output dataset’s lineage – a big “if” – that record is distributed across multiple contexts. In other words, once a derived dataset leaves the ETL, Spark, or (to cite a common example) Tableau domain, there’s a gap in its lineage record. The upshot is that most final or working datasets have holes in their genealogies.
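To make the gap concrete, here is a toy sketch of stitching per-tool lineage records into one ancestry chain – every step, dataset ID, and tool name is hypothetical, and real tools keep their lineage in their own silos rather than in a shared list like this:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageStep:
    """One recorded manipulation; 'tool' marks the context that recorded it."""
    input_id: str
    output_id: str
    tool: str  # e.g. "commercial-etl", "spark", "tableau" -- illustrative labels

def trace(steps, final_id, known_sources):
    """Walk a dataset's ancestry backwards from final_id.

    Returns (chain, gap): chain is the recorded steps found, and gap is the
    first input that has no recorded producer and is not a known source system
    -- i.e., a hole in the genealogy.
    """
    by_output = {s.output_id: s for s in steps}
    chain = []
    current = final_id
    while current in by_output:
        step = by_output[current]
        chain.append(step)
        current = step.input_id
    gap = None if current in known_sources else current
    return chain, gap

# The Spark job that produced clickstream_csv never exported its lineage,
# so the workbook's ancestry dead-ends there.
steps = [
    LineageStep("oltp_orders", "staged_orders", "commercial-etl"),
    LineageStep("staged_orders", "derived_cpl", "commercial-etl"),
    LineageStep("clickstream_csv", "workbook_dataset", "tableau"),
]

chain, gap = trace(steps, "workbook_dataset", known_sources={"oltp_orders"})
# gap is "clickstream_csv": an input with no recorded producer
```

The commercial ETL tool’s two steps and the Tableau step each recorded their own lineage, but because the Spark step’s record never left the Spark context, the stitched-together chain dead-ends at an input nobody can account for – which is exactly the “hole in the genealogy” described above.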
When – for Whom, in What Contexts, and Why – Does Lineage Matter (Most)?
Does this matter? Yes and no. In some contexts, it can matter to the utmost degree; in others, it doesn’t matter quite so much. I once attended a panel discussion in which TDWI’s Philip Russom, that master of wry understatement, shared a story from his days as a product manager with a data integration vendor. Russom had been called in to help one of his company’s customers troubleshoot some upstream data lineage issues preparatory to a routine regulatory audit. These issues had been a matter of studied indifference to the customer; now, with regulators descending upon it, the customer was in a state of acute distress, its executives desperate to produce a complete account of the upstream systems, tables, and columns from which it had sourced its financial data, along with an exhaustive record of which transformations, if any, this data had undergone as part of the data integration process. As Russom recounted, its known lineage issues just weren’t a source of concern for this customer until they became a source of concern for this customer – and for Russom himself.

Of course, once it becomes a problem, lineage has the potential to become the most pressing problem in the world. In this worst-case scenario, then, lineage matters. Quite a lot. The worst-case scenario is, by definition, an outlier. But the way we think about data lineage is still to some extent dominated by this outlier scenario.
For a long time, we imposed a set of criteria appropriate for a worst-case scenario (e.g., an audit of core financial systems) on all aspects of data provisioning and access. Then, for a time, we went the other way. In the early days of self-service, data lineage, metadata management, and most other elements of data management orthodoxy – e.g., reuse, repeatability, governance itself – were viewed with suspicion by users. Metadata and lineage were dismissed as IT-centric priorities: bugs, not features. In the last half-decade, we’ve again come full circle. We’ve accepted that a pragmatic conceptualization of lineage might not be such a bad thing. The upshot is that it’s time to rethink what we mean by lineage.
But where do we start? I’ve got a few ideas, and I’m going to outline them in a series of follow-up blog posts.