On March 20, Unifi and TDWI spoke together on a webinar discussing how a data catalog powered by AI can enable your organization to stem the tide of data chaos and better manage and govern self-service analytics, BI, and data preparation.
You can watch a recording of the full webinar here. Continue reading for answers to questions asked during this live event.
I have a data warehouse in AWS and already have taken care of data curation. What other tools do I need to enable Self-Service?
The Unifi data catalog will crawl file systems and unstructured data; we support this today with S3, Google Cloud Storage, and Azure Blob Storage. You'll have to check with the BI vendors on what they can discover from these data sources, but generally they need structured sources.
Where would we start when we don’t have any data catalog?
A great place to start is with our free trial on Microsoft Azure. Our onboarding documentation and videos will help you connect to your data sources and start building your data catalog.
Can you give an example of how AI and ML would be used to catalog data?
When Unifi connects to a data source, our AI starts to profile the data. From this profiling we can automatically identify attributes such as PII (for example, credit card numbers, FICO scores, Social Security numbers, and driver's license numbers). Unifi's governance and security pillar will then automatically tag these fields and can auto-mask the data before exposing it through the catalog to users.
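To make the idea concrete, here is a simplified sketch of the kind of rule-based profiling and masking a catalog engine performs. The patterns, threshold, and function names below are illustrative assumptions, not Unifi's actual implementation (a real profiler would also use checksums such as Luhn validation and ML classifiers):

```python
import re

# Illustrative PII patterns only; a production profiler uses many more
# signals (Luhn checks, dictionaries, ML classifiers over samples).
PII_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "credit_card": re.compile(r"^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$"),
}

def classify_column(values):
    """Tag a column as PII if most sampled values match a known pattern."""
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(bool(pattern.match(v)) for v in values)
        if hits / len(values) > 0.8:
            return tag
    return None

def mask(value):
    """Mask all but the last four characters before exposing the value."""
    return "*" * (len(value) - 4) + value[-4:]

sample = ["123-45-6789", "987-65-4321"]
print(classify_column(sample))  # ssn
print(mask(sample[0]))          # *******6789
```

Once a column is tagged this way, the governance layer can apply the masking rule automatically whenever the field is surfaced through the catalog.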
Will you talk more about how NLP enhances a data catalog?
Russ covered this a bit in the demo, but in essence, this adds Google-like queries to your data. For example, you could ask the question "What is the median price of single-family homes in San Mateo County right now?" and Unifi OneMind NLP will parse the query into its component elements, run it against the cataloged data, and show you the answer.
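The parsing step can be illustrated with a toy example that breaks a question like the one above into an aggregation, a measure, and a filter. The keyword lists, regex, and field names here are hypothetical assumptions for illustration; they are not Unifi's grammar or API:

```python
import re

# Toy mapping of natural-language aggregation words to query operations.
AGGREGATIONS = {"median": "MEDIAN", "average": "AVG", "total": "SUM"}

def parse_question(question):
    """Split a question into illustrative query components."""
    q = question.lower()
    agg = next((op for word, op in AGGREGATIONS.items() if word in q), None)
    measure = "price" if "price" in q else None
    m = re.search(r"in ([\w\s]+? county)", q)
    location = m.group(1).title() if m else None
    return {"aggregation": agg, "measure": measure, "filter": location}

print(parse_question(
    "What is the median price of single-family homes "
    "in San Mateo County right now?"
))
# {'aggregation': 'MEDIAN', 'measure': 'price', 'filter': 'San Mateo County'}
```

A real NLP engine resolves these components against the catalog's metadata (which dataset holds home prices, which column encodes county) before executing the query.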
It seems like the AI algorithms would need access to the metadata repositories and the data sources themselves. How do we keep up with our data security requirements to protect the MDR and Data Sources?
When Unifi is connected to a source we automatically pull metadata from that source as a starting point for the catalog. We can also import business glossaries from a product such as Informatica. That is exposed in the interface as Russ demonstrated during the demo. Our AI-based profiling looks at the actual data in the connected sources and starts deriving insights about the data.
The Unifi governance & security tools play their role here in allowing the data steward to set up and manage compliance, and grant or revoke access to data as needed. Unifi can read Active Directory or Kerberos to onboard groups, and the data governor can also manage permissions at different levels for different roles. All this is tracked by the audit features of the product.
Data catalogs ideally support both the warehouse environments and the transactional environment. Do you see these two very different use cases aligning in a centralized data catalog?
We are seeing a stratification of the data catalog tools on the market: there are data lake catalog vendors, which work well if your data is all in a lake; there are data warehouse or data mart catalogs for legacy transactional and relational databases; and there are cloud catalogs for data in a specific environment such as Azure or Cloudera. Most organizations need an Enterprise Data Catalog (this is what Unifi delivers), which can catalog data regardless of where it is or its structure.
If you have a wide range of data sources and types, you will want to consider an Enterprise Data Catalog, as this will give you the most flexibility. Going with a more dedicated environment catalog may require a "catalog of catalogs," whereby the user must know where the data is before they can use the appropriate catalog.
What is the difference between a Business Glossary, Data Catalog and Data Dictionary?
The data catalog is a central location to help you search for disparate data across the enterprise. A data dictionary is often a description of that data, or the translation of an obscure attribute name into a business-friendly one. A business glossary defines the organization's business terms and ties them to the underlying data. The Unifi catalog can import business glossaries from legacy systems such as Informatica, and we also provide the ability for users not only to rate a description but to recommend improvements to the metadata to make it more useful to other users.
Can the Unifi data platform integrate with other catalog tools like Collibra?
We're working on it, and we also have a very capable governance and security feature set of our own. An advantage of the Unifi governance and security tools over a stand-alone product is the ability to govern the end-to-end data pipeline. We didn't have time on today's webinar to show our data prep tools, but if we had, you might have seen that when an analyst transforms or blends multiple disparate data sources, it is possible to inadvertently create and expose PII. For this reason, the Unifi governance and security tools span both the data catalog and data preparation.
How does Unifi access all the metadata and data in order to catalog it? What if there are multiple sources in the cloud and in data warehouses etc.?
We support over 100 different native data connectors today. We can connect to most data sources, including relational databases, SQL and NoSQL stores, structured and unstructured data, SaaS applications, IoT streams (via Flume or Kafka), Snowflake, Redshift, file systems, and more. You do not move the data to have it cataloged in Unifi. You can leave it on the source and only move it if and when you decide to continue with a process or transformation of the data.
Is there GraphDB for the data and its relationships?
We demonstrated this to some extent. The Unifi catalog employs KnowledgeGraph on top of our OneMind AI engine to show the relationships between data sources, attributes, workflow automation jobs, data prep transformations, Tableau TWBX files, etc. We can also support graph database systems such as Neo4j through XML.
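The core idea of such a graph, catalog assets as nodes and their relationships as edges, can be sketched in a few lines. The class, relation names, and asset names below are illustrative assumptions, not Unifi's internal model:

```python
from collections import defaultdict

class CatalogGraph:
    """Minimal knowledge-graph sketch: assets as nodes, typed edges."""

    def __init__(self):
        self.edges = defaultdict(list)

    def relate(self, source, relation, target):
        self.edges[source].append((relation, target))

    def downstream(self, node):
        """All assets reachable from a node (depth-first traversal)."""
        seen, stack = set(), [node]
        while stack:
            for _relation, nxt in self.edges[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

g = CatalogGraph()
g.relate("sales_db.orders", "feeds", "prep_job_42")
g.relate("prep_job_42", "produces", "clean_orders")
g.relate("clean_orders", "visualized_in", "revenue.twbx")
print(g.downstream("sales_db.orders"))
# contains prep_job_42, clean_orders, and revenue.twbx
```

Traversals like `downstream` are what let a catalog answer questions such as "which dashboards are affected if this source table changes?"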
Are there prerequisites before a company can or should start with a data catalog?
The prerequisite is a "yes" to the following question: "Do you have disparate data distributed across the enterprise, and do users throughout the organization need to find and use the corporate information that is most important to them?" If the answer is yes, then the key starting point is to know where all your data is located before you can make it available to a wider audience. That is why you'd need a data catalog. Another very real use for a data catalog is to identify and secure data sources so the organization can operate in compliance with regulations. For example, GDPR will affect most business analysts and data scientists. An Enterprise Data Catalog will have governance and security tools to assist with compliance.
Is it possible to get to know who are the most frequent users of a particular dataset? It might be a good way to know who to ask complex questions about the dataset.
Yes. This is managed through the admin interface and is part of the data audit solution. When a user searches for data, the results returned can be filtered, for example, by popularity or by most frequently selected.
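Conceptually, both "most popular dataset" and "most frequent users of a dataset" fall out of a simple aggregation over the audit log. The log structure and field names below are illustrative assumptions, not Unifi's audit schema:

```python
from collections import Counter

# Hypothetical flat audit log of dataset accesses.
audit_log = [
    {"user": "ana", "dataset": "orders"},
    {"user": "ben", "dataset": "orders"},
    {"user": "ana", "dataset": "orders"},
    {"user": "ana", "dataset": "customers"},
]

def top_users(log, dataset, n=3):
    """Rank users by how often they accessed the given dataset."""
    users = Counter(e["user"] for e in log if e["dataset"] == dataset)
    return users.most_common(n)

print(top_users(audit_log, "orders"))  # [('ana', 2), ('ben', 1)]
```

The same counting approach, grouped by dataset instead of by user, drives the popularity sort order in search results.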
What kind of plumbing work is needed to get data ingested into the tool? How much of an implementation time is needed for a mid-size organization?
We offer both hosted and on-prem deployment options. You can try Unifi for free on Azure Marketplace here. For hosted deployment, we will need to set up a VPN for you to connect to your on-prem data. (This is very quick and only requires you to complete a port questionnaire.) For data where ports are already open, there is no issue. For on-prem deployments, we charge a modest implementation fee to set up Unifi on a Linux server running PostgreSQL and connect to some data sources. We can implement our catalog product on-prem within a day, and our catalog & prep platform, which has a Hadoop dependency, can be deployed within 2-3 days (depending on the state of the Hadoop environment).
Do you have data lineage and impact analysis capabilities at the file and attribute level for the data sources being considered for analysis? This would include the system of record for a particular file and any transformations it’s gone through before it gets to the target being considered.
Yes, we support data lineage and touched on that briefly in the demo. We connect natively to source data and employ the bulk offload APIs to ensure minimal impact on the environment. It is important to note that no data has to be moved in Unifi in order to create a comprehensive data catalog.