On August 16, Unifi sponsored a TDWI webinar where attendees learned about Data Lakes and how to get more business value from them. At the end of the webinar, there was a lively Q&A panel, but we didn’t have enough time to answer all the questions. In this 2-part blog series, you’ll find the questions that were asked, each followed by a short response. If you’re looking for a more detailed answer, we welcome you to contact us. If you haven’t already seen Part 1, you can find it here.

When starting a cataloging process, are you referring to a Data Dictionary of the information you are capturing? Are there specific tools you would recommend to accommodate this?

A data dictionary, or meta-data definition, is an important component of the cataloging process. Other components of a catalog include business rules (data prep/transformation jobs), how datasets are connected to each other, statistics about the data, and much more. Unifi is a comprehensive Data as a Service platform that can be used for cataloging, and it integrates with the catalogs of databases, Hadoop systems, SaaS applications, etc.

Do you catalog the data inside Hadoop? Or do you transfer it to another database / Unifi to be cataloged?

The data is cataloged in a database so that it’s easy to index, search and query the information in the catalog. The meta-data is small compared to the data itself, so a database is a convenient and easy place to keep it.
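As a rough illustration of why a small database works well for this, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the function names (`build_catalog`, `register`, `search`) and the `hdfs:///lake/...` paths are all hypothetical and chosen for the example; this is not Unifi's schema or API.

```python
import sqlite3


def build_catalog(conn):
    """Create a tiny meta-data catalog table (hypothetical schema)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS catalog (
               dataset      TEXT,
               column_name  TEXT,
               data_type    TEXT,
               description  TEXT,
               location     TEXT
           )"""
    )


def register(conn, dataset, column_name, data_type, description, location):
    """Record one column of a dataset; the data itself stays where it is."""
    conn.execute(
        "INSERT INTO catalog VALUES (?, ?, ?, ?, ?)",
        (dataset, column_name, data_type, description, location),
    )


def search(conn, term):
    """Find datasets whose column names or descriptions mention `term`."""
    pattern = f"%{term.lower()}%"
    return conn.execute(
        """SELECT DISTINCT dataset, location FROM catalog
           WHERE lower(column_name) LIKE ? OR lower(description) LIKE ?
           ORDER BY dataset""",
        (pattern, pattern),
    ).fetchall()


# Example entries (made up); only meta-data is stored in the database.
conn = sqlite3.connect(":memory:")
build_catalog(conn)
register(conn, "orders", "customer_email", "string",
         "Email of the buyer", "hdfs:///lake/orders")
register(conn, "clicks", "session_id", "string",
         "Web session identifier", "hdfs:///lake/clicks")

results = search(conn, "email")
```

Because only descriptions and locations are stored, the catalog stays small and searchable even when the datasets it points to are very large.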

What is a good approach for data lake governance?

To govern a data lake effectively, it’s important to capture meta-data, data-type information (example: PII), statistics, usage and lineage for the data assets and data pipelines that get created. That way you have a 360-degree view of the data.
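To make the list above concrete, here is a minimal sketch of what one governance record per asset might hold, plus a lineage walk that answers "does this asset depend on PII anywhere upstream?". The `AssetRecord` fields, the helper names and the sample assets are assumptions made for illustration, not a description of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class AssetRecord:
    """Governance meta-data for one data asset (hypothetical layout)."""
    name: str
    columns: dict                                      # column name -> data type
    pii_columns: set = field(default_factory=set)      # columns tagged as PII
    row_count: int = 0                                 # a basic statistic
    read_count: int = 0                                # a basic usage counter
    upstream: list = field(default_factory=list)       # direct parent assets


def full_lineage(name, registry):
    """Return every upstream asset of `name`, nearest parents first."""
    seen, queue = [], list(registry[name].upstream)
    while queue:
        parent = queue.pop(0)
        if parent in seen:
            continue
        seen.append(parent)
        if parent in registry:
            queue.extend(registry[parent].upstream)
    return seen


def touches_pii(name, registry):
    """True if the asset or anything upstream of it carries PII columns."""
    assets = [name] + full_lineage(name, registry)
    return any(registry[a].pii_columns for a in assets if a in registry)


# Made-up assets forming a small pipeline.
registry = {
    "raw_orders": AssetRecord("raw_orders", {"order_id": "int"}),
    "raw_customers": AssetRecord(
        "raw_customers", {"email": "string"}, pii_columns={"email"}
    ),
    "clean_orders": AssetRecord(
        "clean_orders", {"order_id": "int"}, upstream=["raw_orders"]
    ),
    "revenue_report": AssetRecord(
        "revenue_report", {"total": "float"},
        upstream=["clean_orders", "raw_customers"],
    ),
}
```

With records like these, `touches_pii("revenue_report", registry)` flags the report because one of its sources contains an email column, even though the report itself does not.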

If similar data exists in multiple locations within a lake – can you set up rules to determine from which source the data is pulled?

It’s easy to figure out in Unifi which datasets are similar through multiple mechanisms, and our lineage feature lets you understand the details of data flow from source to target. Rules can also be set up to achieve the same result.
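One generic way to think about such rules is as an ordering over the available copies of a dataset. The sketch below is not how Unifi expresses rules; the zone names, the `Copy` class and `pick_source` are hypothetical. It prefers the most trusted zone and breaks ties by the most recent update.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Copy:
    """One physical copy of a dataset somewhere in the lake (hypothetical)."""
    location: str
    zone: str
    last_updated: date


# Lower number = more trusted. These zone names are an assumption.
ZONE_PRIORITY = {"curated": 0, "staging": 1, "raw": 2}


def pick_source(copies):
    """Prefer the most trusted zone; break ties by the most recent update."""
    if not copies:
        raise ValueError("no copies to choose from")
    return min(
        copies,
        key=lambda c: (
            ZONE_PRIORITY.get(c.zone, len(ZONE_PRIORITY)),  # unknown zones rank last
            -c.last_updated.toordinal(),                    # newer copies rank first
        ),
    )


copies = [
    Copy("/lake/raw/customers", "raw", date(2018, 8, 20)),
    Copy("/lake/curated/customers_v1", "curated", date(2018, 8, 1)),
    Copy("/lake/curated/customers_v2", "curated", date(2018, 8, 15)),
]
chosen = pick_source(copies)
```

Here the newer curated copy wins over both the older curated copy and the more recently updated raw copy, because zone trust is checked before freshness.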

In a logical data lake integrating data from multiple sources, is the data virtualized via a tool such as Denodo or how is it brought together to be a unified “single” source?

Data virtualization capabilities are embedded in Unifi’s data as a service platform: query optimization for data coming from different sources, data acquisition, advanced caching, self-service data discovery, and complete abstraction for the business user of the source type (example: database, file-system or a SaaS application) and format type (example: CSV, JSON, etc.).

Can you describe the makeup (number and skill sets) of the team that is needed to support a Data lake? Also, how much data is added to a lake on a daily/monthly time frame?

From an administration standpoint, there are two components of data lake system administration: administration of the physical machines or virtualized infrastructure, and administration of the software pieces (for example, if the lake is composed of Hadoop and Unifi). On the software side, 1 to 2 people are enough for administration.

The amount of data that has to be added depends entirely on the use case and the data being generated in the enterprise. Use cases can vary from 10s of GBs per day to multiple TBs per day.
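To see what that range means over time, a quick back-of-the-envelope projection helps. The daily rates below (50 GB and 2 TB) are illustrative picks from within the range mentioned above, not measured figures, and the helper name is hypothetical. Decimal units are assumed (1 TB = 1000 GB).

```python
def projected_volume_tb(gb_per_day, days):
    """Total data added over `days` at a steady daily rate, in decimal TB."""
    return gb_per_day * days / 1000


# Illustrative rates only: a small and a large use case.
small_monthly = projected_volume_tb(50, 30)      # 50 GB/day for 30 days
large_yearly = projected_volume_tb(2000, 365)    # 2 TB/day for a year

print(f"50 GB/day adds about {small_monthly} TB per month")
print(f"2 TB/day adds about {large_yearly} TB per year")
```

The gap between roughly 1.5 TB a month and roughly 730 TB a year is why storage and team sizing should follow from the use case.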

What does value mean when discussing a Data lake? How do you show the value of the data lake to the enterprise?

Agility in making business decisions is becoming critical for today’s data-driven enterprises. A data lake enables more business decisions and democratizes the use of data, which ultimately leads to better insights or business decisions that either generate value or save money for the enterprise. Hence the simplest way of showing value from a data lake is to highlight the number of questions it’s answering and how that differs from scenarios where there is no lake. The next level of highlighting value is working out the ROI of the business decisions that are being made through the lake.
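For the ROI step, the standard formula is net benefit divided by cost. The sketch below applies it with made-up numbers; the function name and the dollar figures are purely illustrative, and in practice the hard part is attributing benefits to specific decisions made through the lake.

```python
def roi(total_benefit, total_cost):
    """Return ROI as a fraction: (benefit - cost) / cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (total_benefit - total_cost) / total_cost


# Hypothetical figures: value generated or money saved vs. cost of the lake.
example_roi = roi(total_benefit=300_000, total_cost=200_000)
print(f"ROI: {example_roi:.0%}")
```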