On April 16, Unifi sponsored a webinar with host Eric Kavanagh of the Bloor Group and presenter David Wells of Eckerson. There was significant audience interest and many more questions than we could answer during the allotted time. If we did not answer your question live, we apologize. We have provided written responses to all of the questions here:
Q. Can you describe what factors make a data set “trusted”?
A. This is very much a company-specific decision, generally agreed on by a cross-section of technical and business stakeholders who know the specific data set. A number of factors should be considered, however: how frequently the data changes; whether any attributes are derived from external processing and, if so, what that process is; and which other data sets are joined in that process, and thus how trustworthy they are in turn. Data cleansing functions may be necessary to normalize data so that reports and analysis make sense. An example is transactional data that spans both U.S. and European date formats (mm/dd/yyyy and dd/mm/yyyy): to make sense of transactions on a given day, week, month or quarter, the dates must be normalized. Once this process has run, it produces a trusted transactional data set.
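As a rough illustration of the date example above, the following Python sketch normalizes mixed-format dates to ISO 8601. It assumes each record carries a `region` field that tells us which format its date uses; the field names, the `REGION_FORMATS` mapping and the sample records are hypothetical and are not part of any Unifi product.

```python
from datetime import date, datetime

# Hypothetical mapping from a record's source region to its date format.
REGION_FORMATS = {
    "US": "%m/%d/%Y",  # mm/dd/yyyy
    "EU": "%d/%m/%Y",  # dd/mm/yyyy
}


def normalize_date(raw: str, region: str) -> date:
    """Parse a date string using the format of its source region."""
    return datetime.strptime(raw, REGION_FORMATS[region]).date()


# Two made-up transactions with the same raw string but different meanings.
transactions = [
    {"id": 1, "region": "US", "date": "04/03/2019"},  # April 3, 2019
    {"id": 2, "region": "EU", "date": "04/03/2019"},  # 4 March 2019
]

# Rewrite every date in ISO 8601 (yyyy-mm-dd) so the records are comparable.
normalized = [
    {**t, "date": normalize_date(t["date"], t["region"]).isoformat()}
    for t in transactions
]
```

Without a reliable indicator of the source format, a string such as 04/03/2019 is ambiguous, which is why the sketch keys the format off the record's origin rather than guessing from the value.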
Q. Can Unifi clean up data to make it trustworthy?
A. This is where the seamless integration with our Data Platform product makes the Unifi Data Catalog stand out. The OneMind AI technology that auto-detects data types and profiles data to determine PII data and other User Defined Data Types passes that knowledge seamlessly to the self-service data prep functions within our Data Platform. This means that recommendations can be made on how to auto-cleanse, enrich, parse, join, filter and format data to derive an output transformation job. Every time someone uses the tool it learns and updates the algorithm to make more and more accurate recommendations.
Q. Can you describe some of the AI technologies or methods you employ?
A. Unifi offers automation across almost every capability within the application. Automation and ML are at their best when the user does not even notice they are there. Below are just a few examples of where these capabilities have been applied:
- Enterprise Knowledge Graph (EKG): drives the catalog user experience as well as Unifi’s NLP suite. This is serviced by a Recurrent Convolutional Neural Network and OpenNLP.
- Semi-Structured Data Parsing: enables Unifi’s parsing support for web logs / log files, Parquet, Avro, XML, JSON, etc. This is serviced by a Hidden Markov Model and gene sequencing algorithms.
- Similar Datasets: a data discovery feature that displays datasets similar in nature to the dataset in focus. This uses K-Means clustering against the Unifi Knowledge Graph and additional metadata-based features.
- Dataset & Attribute Classification: Unifi’s ability to classify both a dataset and its attributes, e.g. a contact-based dataset, or SSN as an attribute. This provides the ability to auto-mask, auto-tag and auto-authorize. It is serviced by a series of Parsing Expression Grammar (PEG) algorithms plus an ML regular expression framework.
- Auto-Completion of Search / Sentences: Unifi’s recommendation of the next character or series of characters for both relevance-based and NLP grammar-based questions and searches. LSTM (Long Short-Term Memory) plus beam search enables this functionality.
- Recommendation Engine: enables the recommendations during Prep creation, tag recommendations and some additional user experience enhancements. This is serviced by Logistic Regression and K-Means.
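To make the attribute classification and auto-masking idea in the list above more concrete, here is a deliberately simplified Python sketch. It uses a single plain regular expression in place of the PEG and ML framework described above, so it is not Unifi’s implementation; the pattern, the 80% threshold and the function names are all assumptions chosen for illustration.

```python
import re

# Illustrative pattern only: matches values written as ddd-dd-dddd.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def classify_attribute(values, threshold=0.8):
    """Return "SSN" if enough of the non-empty values match, else None."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    matches = sum(1 for v in non_empty if SSN_PATTERN.fullmatch(v))
    return "SSN" if matches / len(non_empty) >= threshold else None


def mask_ssn(value):
    """Hide all but the last four digits of a value that matches the pattern."""
    if SSN_PATTERN.fullmatch(value):
        return "***-**-" + value[-4:]
    return value


# Made-up sample column, including one empty value.
column = ["123-45-6789", "987-65-4321", "", "555-12-3456"]
label = classify_attribute(column)
masked = [mask_ssn(v) for v in column] if label == "SSN" else column
```

Classifying the column as a whole, rather than value by value, is what lets a label such as "SSN" drive follow-on actions like tagging or access control for the entire attribute.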
Q. It seems like different vendors describe lineage in different ways. Can you describe your definition of data lineage?
A. This is a complex topic. There are two types of data lineage:
- First-party lineage: lineage derived from the processing environment that created a data transformation
- Third-party lineage: an exercise in connecting to external data transformation solutions and capturing the data transformations they processed
When someone uses the Unifi Data Platform, we provide comprehensive end-to-end first-party data lineage that shows exactly how a derived data set was produced, including the Spark expression(s) used and the original data sources. Third-party lineage is much harder: not only must you integrate with every transformation environment employed for the cataloged data sets, you must also have insight into the stored procedures, workflow schedules and other operationalized functions that drive a derived output. This is why Unifi works with each customer to understand their environment and their requirements for determining and displaying data lineage. Some third-party tools also specialize in this capability; Manta and Octopai are two examples. However, no vendor currently supports every transformation environment, so you have to be selective about which data set lineage journeys you wish to examine.
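One way to picture first-party lineage is as a graph in which each derived data set records its direct inputs and the expression that produced it. The Python sketch below walks such a graph back to the original sources. The data set names, the `LINEAGE` structure and the expressions are invented for illustration and do not reflect how the Unifi Data Platform stores lineage.

```python
# Hypothetical lineage records: each derived data set lists its direct inputs
# and a description of the expression that produced it.
LINEAGE = {
    "trusted_transactions": {
        "inputs": ["raw_transactions_us", "raw_transactions_eu"],
        "expression": "union, then normalize dates",
    },
    "quarterly_report": {
        "inputs": ["trusted_transactions", "customers"],
        "expression": "join on customer_id, group by quarter",
    },
}


def upstream_sources(dataset, lineage):
    """Return the original (non-derived) data sets that feed `dataset`."""
    sources = set()
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        record = lineage.get(current)
        if record is None:
            # No lineage record, so treat it as an original source.
            sources.add(current)
        else:
            stack.extend(record["inputs"])
    return sources


report_sources = upstream_sources("quarterly_report", LINEAGE)
```

The same traversal shows why third-party lineage is harder: any transformation performed outside the recorded environment simply has no entry in the graph, so the walk stops early at what only looks like an original source.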
Q. We have a data lake, but not all of the data I am interested in has been transferred there. Can I still catalog it, and how would we factor in its trustworthiness?
A. Unifi is a connect-in-place catalog solution, which means you can leave your data in source systems and we will connect to it to form a central data catalog. After all, few people care where the data is actually stored as long as they can find it. If your data lake is 100% trusted data, which is an excellent goal, then you’ll need policies in place to determine the trustworthiness of data outside the data lake. One advantage of cataloging data wherever it resides in the enterprise is that you gain insight into which data sets the organization wants to use. This can help you prioritize which data sets to assess for trustworthiness first.
Q. Who documents and describes the data in the catalog, how do they do it, and what do they base it on?
A. This is a role that either an individual or a group can and must undertake on behalf of the organization if the catalog is to be of value. Sometimes (around 50% of the time in our experience) some kind of data dictionary or business glossary already exists in a shareable form, such as SharePoint, O365 or a Google Sheet. The organization uses it as a reference lookup tool, and in some cases the data set contains a field with a link to this document. The fact that these documents are not integrated into the searchable catalog is an impediment to good operational workflow and to maintaining and extending the shared knowledge or value of the catalog. To alleviate this and create immediate value, we can import an existing glossary or dictionary into the Catalog as a starting point.
We use a permissions-based user role to determine who can add descriptions or glossary terms. Typically this is defined by the “owner” of the data, i.e. the person responsible for ensuring that the dataset is up to date and for determining who is allowed access to it and what attribute sample data they can see (classic data governance functionality). Other users may be able to rate or score the accuracy of descriptions and suggest or recommend improvements. The intent here, as discussed, is to extract “tribal knowledge” about your data from the organization as rapidly as possible and make that knowledge searchable and available to everyone.
Q. You talk about data scientists a fair amount. Can you work out data lineage when transformations are written in Python?
A. This is a thorny issue; the short answer is no. The challenge with lineage, which I have to say is a little-understood subject, is not tracing an individual transformation but, to do it right, building a comprehensive view of all the transforms that affect the creation of a data set. This means knowing the pedigree and lineage of every data set, which often means combining SQL stored procedures with third-party transform engines like Informatica, Talend, Alteryx etc. Our Data Platform product will give you a comprehensive lineage story and can trace stored procedures, but it does not currently take third-party lineage into consideration. We are working with Manta and Octopai to deliver this.
Q. What is the pricing model?
A. The cost is $50 per user per month, or $600 per user per year. The minimum number of seats is 25, so the initial seat license is $15K per year. Additional seats can be added at any time. In addition, we charge a one-time implementation fee of $10K. Your first-year investment is therefore $25K, with renewal at $15K per year assuming you add no new seats. You can also upgrade a catalog seat to our Platform product, which goes beyond the discovery of data into the use of that data through self-service data prep.
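The arithmetic behind these figures can be checked with a short Python sketch. The prices, seat minimum and implementation fee come from the answer above; the function names are hypothetical, and the sketch assumes every seat is licensed for the full year, since the answer does not say how seats added mid-year are billed.

```python
PRICE_PER_SEAT_PER_MONTH = 50   # USD, from the answer above
MIN_SEATS = 25                  # minimum seat count, from the answer above
IMPLEMENTATION_FEE = 10_000     # USD, one-time, from the answer above


def annual_license(seats):
    """Yearly seat cost, assuming all seats are held for the full 12 months."""
    if seats < MIN_SEATS:
        raise ValueError(f"at least {MIN_SEATS} seats are required")
    return seats * PRICE_PER_SEAT_PER_MONTH * 12


def first_year_cost(seats):
    """Seat licenses plus the one-time implementation fee."""
    return annual_license(seats) + IMPLEMENTATION_FEE


renewal = annual_license(25)      # 25 seats x $50 x 12 months
first_year = first_year_cost(25)  # renewal cost plus the $10K fee
```

With the 25-seat minimum this reproduces the $15K renewal and $25K first-year figures quoted above.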
Q. Can we evaluate your catalog before we buy?
A. Yes. We offer cloud and on-premises evaluations. You can learn more about that here