On August 16, Unifi sponsored a TDWI webinar where attendees learned about Data Lakes and how to get more business value from them. At the end of the webinar there was a lively Q&A panel, but we didn’t have enough time to answer all the questions. In this two-part blog series, you’ll find the questions that were asked, each followed by a short response. If you’re looking for a more detailed answer, we welcome you to contact us.

What is Data as a Service?  Can’t all Cloud Platform providers call themselves Data as a Service?

Data as a Service (DaaS) builds on the concept that data can be provided on demand to users across an organization, regardless of data source, geographic location, or department, and provisioned in a self-service manner. To achieve a true DaaS model, the cloud platform must allow self-service data access and provide the tools to support Cataloging & Discovery, AI-Assisted Data Prep, and Community Collaboration for business users, while providing Governance & Security for IT organizations.

When you use the term “Data as a Service” are you referring to a platform/product that sits on top of the Data Lake?

The term DaaS usually refers to a data strategy for data democratization, as explained above. That said, many platforms or products can help facilitate this journey by providing utilities for the several components needed to achieve DaaS. Unifi is the only platform in the market that includes all the components of a true DaaS platform (Governance & Security, Cataloging & Discovery, AI-Assisted Data Prep, Community Collaboration, and Cloud Optimization). Other solutions require connecting multiple platforms together, which brings a slew of challenges of its own.

What tools are available for DaaS?

There is a spectrum of tools that cover the individual functions required for DaaS (as outlined above). Most tools in the market focus on a single pillar of the stack: some are very strong in data cataloging, for example, while others are excellent data prep tools. An organization could combine and integrate these tools, but doing so poses real development challenges for the IT org. To achieve true DaaS, organizations should identify a platform that offers all of the DaaS capabilities together in one.

“Data Lake” is an abstraction or a method of thinking about storing data sources, correct? What is a good definition for a Data Lake?

A Data Lake is a method for organizing large volumes of highly diverse data from disparate sources to be used for broad data exploration, discovery, and advanced analytics. Depending on the platform, a data lake may handle many data structures. It is designed for fast data ingest, so that data is available immediately and can be prepared and analyzed later. Data lakes store the most granular, raw data so that analytics can be built on top of it, and they are flexible enough to let the same data serve many different analytic needs and use cases.
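As a minimal sketch of that “land now, model later” ingest pattern, here is what writing raw events straight into a landing zone might look like. The paths and field names are hypothetical, and a real lake would typically land data in an object store such as S3 or ADLS rather than local disk:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical landing zone on local disk; in practice this would be
# an object store (S3, ADLS, GCS) behind the same path convention.
LANDING_ZONE = Path("/data/lake/raw/web_events")

def ingest_raw(event: dict) -> Path:
    """Land an event exactly as received: no modeling, no transformation."""
    now = datetime.now(timezone.utc)
    # Partitioning by ingest date keeps the raw zone cheap to scan later.
    target_dir = LANDING_ZONE / now.strftime("year=%Y/month=%m/day=%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / f"{now.strftime('%H%M%S%f')}.json"
    target_file.write_text(json.dumps(event))
    return target_file

ingest_raw({"user_id": 42, "action": "page_view", "url": "/pricing"})
```

Nothing about the event is validated or reshaped at write time; that work is deferred to whoever reads the data later.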

Can you provide a visualization of what a Data Lake is?  What does the implementation of a Data Lake look like?

What architectures on top of warehouse technology, besides Hadoop, make a Data Lake? And what then makes it a DaaS implementation?

Most current data lake architectures are composed of several technologies, depending on the use cases. The main pillars of a data lake are ingest/data acquisition, storage, processing/preparation, data access, security, and governance. For a DaaS implementation to be successful, the most important aspect is data democratization: incorporating self-service utilities that empower users with easy, on-demand access to all the data. Self-service capabilities that include discovery, catalog, prep, and governance will deliver an enterprise-grade DaaS implementation.

How can an organization make sure that there is no duplication of data in the data lake?

There are several methods and tools to ensure no data duplication. Foremost, a good governance and cataloging tool, together with good data prep that includes quality routines, will ease this pain and help ensure you don’t end up with a data swamp.
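As one illustrative approach (plain Python, hypothetical records), a content hash can serve as a fingerprint for spotting duplicate records before they land in the lake:

```python
import hashlib
import json

def record_fingerprint(record: dict) -> str:
    """Stable hash of a record's content, independent of key order."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def deduplicate(records):
    """Yield each distinct record once, keeping the first occurrence."""
    seen = set()
    for record in records:
        fp = record_fingerprint(record)
        if fp not in seen:
            seen.add(fp)
            yield record

rows = [
    {"id": 1, "name": "Ada"},
    {"name": "Ada", "id": 1},   # same content, different key order
    {"id": 2, "name": "Grace"},
]
print(list(deduplicate(rows)))  # only two distinct records survive
```

In practice the fingerprints would be tracked by the catalog or quality layer rather than an in-memory set, but the idea is the same.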

Does the advantage of a data lake change when using a Logical Data Lake?  (i.e., not having everything reside in Hadoop, but some data in Hadoop and other data in the traditional Data Warehouse.)

Not necessarily, although it might be harder to enforce governance across multiple systems. Tools like Unifi help counter these pain points by allowing you to catalog and run ‘federated’ queries across multiple sources, and by providing a single platform to manage governance, security, and data access for the logical data lake.
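For illustration only, here is what such a federated query can look like in generic Spark. This is not Unifi’s API; the connection details, tables, and column names are all hypothetical. It joins raw lake data with a dimension table still living in the traditional warehouse:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logical-data-lake").getOrCreate()

# Raw clickstream landed on the Hadoop/object-store side of the lake.
clicks = spark.read.parquet("/data/lake/raw/web_events")

# Curated customer dimension still in the warehouse, reached over JDBC
# (hypothetical connection details).
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse:5432/analytics")
    .option("dbtable", "dim_customer")
    .option("user", "reporting")
    .option("password", "...")
    .load()
)

# One query spanning both systems: the 'logical' data lake in action.
(clicks.join(customers, clicks.user_id == customers.customer_id)
       .groupBy("segment")
       .count()
       .show())
```

The point is that the analyst writes one query; where each table physically lives becomes an implementation detail.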

How does data modeling come into the picture with data lakes?

Data Lakes usually require very little data modeling up front. Most data lake use cases operate with a ‘schema-on-read’ approach, which is meant for rapidly landing and storing large amounts of data in the lake. Schema-on-read is very flexible and allows rapid access to raw, unprocessed data. Schema-on-write, on the other hand, requires much more upfront modeling, because the data structures must be defined before the data is loaded into the lake.

The beauty of the data lake is that it allows for both ‘schema-on-read’ and ‘schema-on-write’ approaches. This lets data scientists discover hidden insights in the raw data, while more traditional business analysts benefit from predefined, structured data for reporting and repeatable analysis.
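A short PySpark sketch of the contrast (paths and field names are hypothetical): schema-on-read infers structure from the raw files at query time, while schema-on-write declares the structure before the curated data is written:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Schema-on-read: structure is discovered from the raw JSON at read time.
raw = spark.read.json("/data/lake/raw/web_events")
raw.printSchema()  # whatever shape the landed data happens to have

# Schema-on-write: declare the structure up front; rows that don't fit
# surface as nulls or errors instead of silently polluting the curated zone.
events_schema = StructType([
    StructField("user_id", LongType(), nullable=False),
    StructField("action",  StringType(), nullable=False),
    StructField("url",     StringType(), nullable=True),
])
curated = spark.read.schema(events_schema).json("/data/lake/raw/web_events")
curated.write.mode("overwrite").parquet("/data/lake/curated/web_events")
```

The same raw files feed both paths: exploratory work reads them directly, while the curated parquet output serves the repeatable reporting use cases.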

Continue on to Part 2