In a previous blog post, I introduced a different frame to discuss data lineage, distinguishing between first-party, second-party, and third-party lineage. These aren’t just arbitrary distinctions.

In data management, we’re used to distinguishing between what we call “technical lineage” and “business lineage.” The former describes a highly granular record of data lineage, the (technical) details of which are primarily of interest to auditors, data stewards, architects, compliance officers, and the like. “Business lineage,” by contrast, elides the technical stuff and represents lineage as (for example) a sequence of mappings between sources and targets, substituting familiar business terms and business rules for technical names and/or details.

The problem with this traditional frame is that it forces us to adopt an IT-centric perspective with respect to the question of lineage. The canonical distinction – i.e., between “technical” and “business” lineage – arises out of enterprise IT’s use of ETL as a primary “tool” or mechanism for data integration. (My use of scare quotes is intentional. ETL is as much a method of integrating data as a discrete product or tool. Thirty years ago, all ETL was coded by hand, consisting in most cases of scripted data transformations, scripted data validation routines, scripted – usually FTP-powered – data movement between repositories, etc.)

What we call technical lineage evolved as the output of the ETL-driven data integration processes that to some extent still dominate enterprise IT. It was as much a diagnostic and regulatory tool as a useful rubric for assessing the completeness and trustworthiness of data.

Business lineage, by contrast, evolved as a kind of afterthought: a record that the ETL product vendors basically derived from the technical lineage generated by their tools. This is harsh, but not overly so. The point is that business lineage is a product of the same IT-centric perspective that valorized ETL. But the self-service revolution evolved as a business-driven challenge to IT’s traditional role as gatekeeper of data. Some, including yours truly, have made the argument that IT should behave more like a shopkeeper than a gatekeeper, but that isn’t what I’m talking about here. I’m talking, instead, about the need to reimagine the concept of data lineage, using as our perspectival frame the experience of the self-service user. What does lineage look like from her perspective? How – why, when, in what way – is it important?


Facts Are Stubborn Things, but Statistics Are [More] Pliable[1]

In the vast majority of cases, a business analyst or self-service user doesn’t need to know anything about the technical lineage of a data flow to do her job. She doesn’t need the granularity of statistics: she just needs to know whether she can trust the data in the dataset.

By now, self-service users are accustomed to collecting and curating their own datasets for analysis. It is no longer uncommon for them to design their own data flows, too – even to the point of constructing complex data integration pipelines that (e.g.) consume data from upstream sources, sequence and process it in one or more steps, and persist the output to one or more downstream targets. Sometimes these pipelines consume the output of ETL-driven, IT-managed data flows; in many cases, however, they don’t.
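To make the shape of such a pipeline concrete, here is a minimal sketch in Python. It is not modeled on any particular self-service tool: the `Flow` class, its `step` and `run` methods, and the file names are all hypothetical, invented for illustration. The sketch only records the name of the downstream target rather than writing to it. The point is simply that a flow which knows its source, its steps, and its target can emit a readable lineage record as a by-product of running.

```python
from dataclasses import dataclass, field
from typing import Callable

# A dataset is modeled here as a plain list of row dictionaries (an assumption).
Rows = list[dict]


@dataclass
class Flow:
    """Hypothetical self-service data flow: one source, ordered steps, one target."""
    source: str
    steps: list[tuple[str, Callable[[Rows], Rows]]] = field(default_factory=list)

    def step(self, name: str, fn: Callable[[Rows], Rows]) -> "Flow":
        # Register a named transformation; return self so calls can be chained.
        self.steps.append((name, fn))
        return self

    def run(self, rows: Rows, target: str) -> tuple[Rows, list[str]]:
        # Apply each step in order and note it in the lineage record.
        lineage = [f"source: {self.source}"]
        for name, fn in self.steps:
            rows = fn(rows)
            lineage.append(f"step: {name}")
        # Only the target's name is recorded; nothing is actually written.
        lineage.append(f"target: {target}")
        return rows, lineage


flow = (
    Flow("orders.csv")
    .step("drop_refunds", lambda rows: [r for r in rows if r["amount"] > 0])
    .step("double_amount", lambda rows: [{**r, "total": r["amount"] * 2} for r in rows])
)
output, lineage = flow.run([{"amount": 10}, {"amount": -5}], "orders_clean.xlsx")
```

The resulting `lineage` list is closer to what a business analyst would want to read than to technical lineage: it names the source, the steps, and the target, and nothing more.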

Self-service users aren’t building data flows for the thrill of it, of course: they’re doing so in the context of discovery and analysis. And they’re persisting the output of these data flows as datasets – i.e., workbooks, sheets within workbooks – and not to a central, managed repository, such as the data warehouse. Increasingly, self-service workbooks – along with similar artifacts (spreadsheets, text and CSV files) – are potential sources of data for analysis.

What is the “technical” lineage of a Tableau workbook? Until recently, Tableau itself couldn’t offer a very good answer to that question. As I’ve argued, however, this is beside the point. In most cases, for most potential users, the question of technical lineage is largely a moot one: what matters is whether or not (or to what degree) a dataset or analysis is trustworthy. The answer to this question is a binary one – data is either trustworthy or untrustworthy; usable or unusable – although it can and will vary from user to user depending on their requirements.

This is why I think distinguishing between first-, second-, and third-party lineage is useful. We vouch for the first-party datasets that we create ourselves; our colleagues vouch for the second-party datasets they create. (And we can make our own decision about whether or not we’re willing to trust the lineage of a problem colleague’s second-party dataset.) Our tools are also doing a much better job of keeping track of what we do with data – along with, less commonly, where that data came from – as part of the self-service user experience. So, our colleagues don’t have to take our word (or vice versa) that a dataset is trustworthy and usable: the tools we use help to ensure this, too. They provide enough detail and context to permit us to make that binary determination. In this way, even second-party lineage can be transmuted into first-party gold. Third-party lineage is more of a problem, however. For data that is the product of a regular, repeatable process (such as an ETL data flow), there’s detailed technical lineage, along with (in most cases) business lineage. There’s plenty we can use to assess its trustworthiness. For data that is produced by other means, this often isn’t the case, however.
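One way to picture this frame is as a small decision procedure. The Python sketch below is an illustration under my own assumptions, not a description of how any tool works: the `Dataset` fields, the `classify` and `is_trustworthy` functions, and the idea of a per-user set of colleagues whose work we vouch for are all hypothetical. The rules encode the paragraph above: we trust what we made ourselves; we trust a colleague's dataset if we vouch for the colleague or if a tool recorded its lineage; and we trust third-party data only when some recorded lineage exists to assess.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Party(Enum):
    FIRST = "first-party"    # datasets we created ourselves
    SECOND = "second-party"  # datasets a colleague created
    THIRD = "third-party"    # datasets from anywhere else


@dataclass(frozen=True)
class Dataset:
    name: str
    creator: Optional[str]      # None when the origin is unknown
    has_recorded_lineage: bool  # did a tool or process capture lineage?


def classify(ds: Dataset, me: str, colleagues: set[str]) -> Party:
    """Lineage is relative to the person asking, so 'me' is an input."""
    if ds.creator == me:
        return Party.FIRST
    if ds.creator in colleagues:
        return Party.SECOND
    return Party.THIRD


def is_trustworthy(ds: Dataset, me: str, colleagues: set[str],
                   vouched_for: set[str]) -> bool:
    """Binary determination; the answer varies with who is asking."""
    party = classify(ds, me, colleagues)
    if party is Party.FIRST:
        return True
    if party is Party.SECOND:
        # Either we take the colleague's word, or the tool backs them up.
        return ds.creator in vouched_for or ds.has_recorded_lineage
    # Third party: nothing to go on unless lineage was recorded somewhere.
    return ds.has_recorded_lineage


me, colleagues, vouched_for = "ana", {"ben", "cy"}, {"ben"}
workbook = Dataset("q3_pipeline.twb", creator="cy", has_recorded_lineage=False)
vendor_file = Dataset("vendor_export.csv", creator=None, has_recorded_lineage=False)
```

Here `is_trustworthy(workbook, me, colleagues, vouched_for)` is false until either Ana decides to vouch for Cy or the tool records the workbook's lineage, and `vendor_file` stays untrusted for want of any lineage at all.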


A Pragmatic Perspective on Data Lineage

The ongoing challenge for self-service is to demystify the problem of third-party lineage. Traditionally, third-party lineage wasn’t captured (i.e., imported from other tools) as part of the self-service user experience: most tools had a difficult enough time generating and managing first-party lineage in their own environments, after all. Things have improved significantly on this last point, but third-party lineage is still something of a black hole in the context of the self-service user experience. The good news is that this is beginning to change. Self-service tools are getting much better about capturing and managing third-party lineage.

It’s incumbent upon us to get better, too: better, that is, or more nuanced, with respect to how we think about data lineage as a whole. Most of us still conceive of lineage from an IT-centric perspective. Viewed from the perspective of self-service, however, data lineage looks very different – less a function of technical granularity and exhaustive detail than of abstraction and utility. Both frames – i.e., IT/technical and self-service – are nominally concerned with the same questions: first, is a data set trustworthy, and, second, what does it mean for a data set to be trustworthy? The IT/technical view conflates these two questions into a single formulation: a data set is trustworthy if it can be shown to be complete and consistent. QED.

For the self-service practitioner, however, both questions are very much alive. Not only is there no single commutative formula – no QED – but self-service views trustworthiness as less a function of the completeness and consistency of all of the data in a dataset than of its accuracy and potential usefulness: its value – particularly when combined with data from other sources – for analysis. Traditionally, in fact, self-service users rejected the complete and consistent (hence “trustworthy”) data that had been provisioned for them by IT in favor of datasets that they collected and curated on their own. The IT-provisioned data might have been “complete” and “consistent,” within very strict limits, but it wasn’t the “right” data. (Historically, in fact, IT was unable to provision the “right” data rapidly enough to suit the requirements of a sizeable minority of business people. This is what gave rise to self-service in the first place.)

A self-service perspective on lineage is a subtler, a suppler – a more pragmatic thing: a long-overdue reimagining of a foundational concept in data management. Let’s not lose sight of this even as self-service technologies begin to ape the language and methods of IT.

[1] I’d like to give a shout-out to Mark Twain, but I can’t find an authoritative source for his authorship of this quote!