At a recent CDO conference in Boston attended by one of our executives, 67% of attendees rated data quality as their most important challenge. As analytics and data science move toward self-service, where business users can search for data sources and then prep data for analysis themselves, data quality becomes even more critical.
Data quality spans many aspects. To some, it means the accuracy of address information as verified against USPS or other countries’ postal data. For others, it means removing duplicates or identifying “Golden Records”. Still others think of the normalization of data types such as phone numbers, credit card numbers, and social security numbers. Regardless of what you consider data quality, it is clear that methods must exist in any data pipeline to address overall data quality.
Come One, Come All to Data Quality Tasks
Historically, it has been the function of the data engineer or ETL programmer to perform data cleansing tasks as part of fulfilling requests for insights or reports from the business. This has led to “institutional lock-in” on data quality issues. Individuals who deal with certain data sets know the issues and how to apply scripts to remove them. Often that knowledge is undocumented and can “walk out the door” when someone leaves the company, requiring those who remain to reinvent the wheel.
In an environment where data cleansing tasks can be associated with a data set, those same cleansing tasks can be recommended to other users. If you know a data set has duplicates and a complex SQL or Python script has been written to de-duplicate the data, shouldn’t everyone in the organization derive benefit from that function? Of course, they should, and that goes for every other cleansing, enriching, normalizing or parsing function associated with each data set.
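As a concrete illustration of the kind of reusable cleansing function described above, here is a minimal Python sketch of a de-duplication step that could be registered against a data set and recommended to every user of it. The function name, record structure, and matching rule (case- and whitespace-insensitive key comparison) are all illustrative assumptions, not the API of any particular catalog product.

```python
def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key fields.

    Matching is deliberately forgiving: key values are stripped of
    surrounding whitespace and lower-cased before comparison, so
    near-duplicates like "Ada " and "ADA" collapse to one record.
    """
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(f, "").strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical customer data with one near-duplicate row.
customers = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "Ada Lovelace ", "email": "ADA@example.com"},  # near-duplicate
    {"name": "Grace Hopper", "email": "grace@example.com"},
]

golden = deduplicate(customers, key_fields=["name", "email"])
```

Once a function like this is attached to the data set in a shared catalog, the next analyst who opens that data set can apply it with one click instead of rediscovering the duplicate problem from scratch.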
In a classic crowdsourced community model, simple comments, annotations, or tagging also lead to significant enhancements in shared learning and, ultimately, data quality. If someone has deduced that an attribute called “Ext_1_absolute” actually describes the play function on a video streaming service, this might be useful information for every other user of that data set. Making those descriptions searchable dramatically improves data discoverability and reduces unnecessary calls to IT for support.
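To make the annotation idea concrete, the sketch below shows one simple way a catalog might store crowd-sourced descriptions against attribute names and search them by keyword. The in-memory structure, function names, and the “Ext_1_absolute” example are illustrative assumptions; a real catalog would persist this and index it properly.

```python
from collections import defaultdict

# attribute name -> list of community annotations
annotations = defaultdict(list)


def annotate(attribute, description, author):
    """Record a user-contributed description of an attribute."""
    annotations[attribute].append({"description": description, "author": author})


def search(term):
    """Return attributes whose annotations mention the search term."""
    term = term.lower()
    return [
        attr
        for attr, notes in annotations.items()
        if any(term in note["description"].lower() for note in notes)
    ]


# One analyst shares what they figured out; everyone else can now find it.
annotate(
    "Ext_1_absolute",
    "Describes the play function on the video streaming service",
    "analyst_1",
)
results = search("play function")
```

The design point is less the search algorithm than the fact that the knowledge lives with the data set, searchable by all, rather than in one analyst's head.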
This Data Set Rated R for Reliable
Simply understanding community sentiment about a data set helps the business user know how much to trust that data in their analytics or data science results. A system of ratings, such as Yelp uses for restaurant reviews, helps users determine which data sets are more accurate or of higher quality than others. That, in turn, helps business users gauge the reliability of any data transform or business insight derived from that data.
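The rating idea above can be sketched in a few lines: users submit star scores against a data set, and consumers see the average and the number of raters before deciding how far to trust it. Everything here, from the 1–5 scale to the data set name, is an illustrative assumption.

```python
# data set name -> {user: stars}; a user's latest rating wins
ratings = {}


def rate(dataset, user, stars):
    """Record a 1-5 star rating of a data set by a user."""
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    ratings.setdefault(dataset, {})[user] = stars


def summary(dataset):
    """Return the average rating and rater count, or None if unrated."""
    scores = list(ratings.get(dataset, {}).values())
    if not scores:
        return None
    return {"avg": round(sum(scores) / len(scores), 2), "count": len(scores)}


rate("sales_2023", "alice", 5)
rate("sales_2023", "bob", 3)
```

Surfacing the rater count alongside the average matters: a 4.0 from two colleagues tells a consumer much less than a 4.0 from two hundred.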
The Cost of Not Collaborating
At another recent CDO conference, an IT manager at a global pharmaceutical company shared how he supports multiple development teams and was approached by a scientist to join four separate data sets together for an insight. The IT manager put it this way: “This drug has been in development for nine years! It’s inconceivable that this analysis has not been performed in the past.” But because the group has no shared knowledge repository, teams find themselves redoing past work, which costs both time and money.
This scenario plays out in nearly every enterprise. Imagine how many billions of dollars are wasted annually on work being redone because organizations lack collaboration tools tied to the data that runs their business.
Broadening the Definition of “Data Quality”
As organizations find themselves needing to make data accessible to an ever-expanding set of users, we must expand our definitions of musty IT-centric terms such as “Data Governance” and “Data Quality”. There is no doubt that, as the audience of data consumers grows and self-service takes hold, governance and quality are more critical than ever, especially in light of regulatory initiatives such as GDPR. However, Data Quality that doesn’t expand beyond legacy systems and processes simply won’t be able to keep up, or ultimately deliver, either quality or governance.
Leveraging the community effect, the “wisdom of crowds” among your data consumers, is the best way to ensure that quality is maximized and that the data your users consume and the analyses they build become not just outputs of quality data, but inputs as well.