Every data set in use today requires some degree of data cleansing. Anomalies often lie buried deep inside nested tables and can go undiscovered for years, if they are discovered at all. Legacy data has its own set of challenges: header information in the tables, which obviously meant something to the person who created the database in the first place, can be complete gibberish to current users and data stewards.

So begins the painful exercise of extracting clean, useful data from each data source. Only then can it be made available to the business user. The problem is compounded a hundred-fold when multiple data sets must be combined to gain business insights, the traditional role of the overworked ETL programmer supporting the business. Often the institutional knowledge about a specific data set, its quirks, foibles, and anomalies, is maintained only in the heads of these IT programmers. Leaving such critical information to chance is a dangerous strategy for an organization that has staked its future on becoming data driven.

One of the biggest challenges facing technical staff today is the number, size, and complexity of the data sources required to run the business. Every Fortune 500 company is awash with data, and there is more on the way. Many of the new data sources are cloud-based, unstructured, or arrive in real time. Gaining the required business insights from all this data places tremendous pressure on the analysts who support the business and, in turn, on the IT staff who support them. Manually extracting, transforming, and loading this data into BI tools for analysis is not only a thankless task; with the traditional approach, the resources cannot possibly scale to meet the demands of the organization.

Machine Learning (ML) and Artificial Intelligence (AI) are two powerful ways to augment the human intelligence that manages and uses data within the organization. Until recently the use of ML and AI has been confined to specific data tasks, and the results have been very promising. Unifi is a pioneer in applying both ML and AI to the task of traditional ETL.

In the emerging big data category of Self-Service Data Preparation there is already confusion around the definitions of both ML and AI. This confusion is being introduced by vendors who categorize data cleansing tasks as ML when their product is really just running a series of SQL functions on the data to derive the desired results. To be clear, then, the following is not only the industry definition of ML and AI but also the definition that Unifi employs in this aspect of the product. There are two distinctly different uses of the technology:

  • Supervised ML/AI – Supervised learning is used when the outcome variable is known. For example, if you know the format of a phone number, you can use ML to find such patterns and correct incorrectly formatted values within the same data column.
  • Unsupervised ML/AI – With unsupervised learning the outcome is unknown. Using the above example, the machine must distinguish phone numbers from any other sequence of numbers.

In both supervised and unsupervised ML and AI, the resulting outcomes and learning must be fed back into the algorithm so that new tasks can benefit from previous learning. Teaching the algorithms for both supervised and unsupervised tasks is the function of the programmers and input from users. For example, a user may define their desired phone number format, e.g. (123) 456 7890 or 123.456.7890, and feed that into the ML as the desired outcome. Unifi does the rest.
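To make the supervised example concrete, here is a minimal sketch of the kind of rule a user-defined target format implies. This is illustrative only, not Unifi's actual implementation: the function name and the choice of the (123) 456 7890 target are assumptions.

```python
import re

def normalize_phone(value):
    """Normalize a US phone number to the format (123) 456 7890.

    Returns None when the value does not look like a 10-digit phone
    number, so a learning step can flag it for review rather than guess.
    """
    digits = re.sub(r"\D", "", value)          # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                    # strip US country code
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]} {digits[6:]}"
```

In practice the learned part is discovering which columns hold phone numbers and which variant formats appear in them; the reformatting itself, as above, is mechanical.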

Unifi uses supervised and unsupervised learned models to infer human and non-human personalization at every step of the data integration process. As soon as a data source is connected to Unifi, the ML and AI technologies start generating insights about that data and providing information to the user. Unifi automatically creates a comprehensive data catalog from the data, so the information contained within that data set becomes searchable in a natural-language, Google-like way. As the data set is used, the ML and AI begin providing informed prompts that help the user derive more value from it.

Simple AI functions such as auto-data formatting and cleansing are provided and are used in practically every data set connected to the platform: from address, phone number, date, and time normalization to the discovery and cleansing (and optional obfuscation) of social security numbers, credit card numbers, and the like. Even the parsing of semi-structured or unstructured data sets is entirely automated within the Unifi Platform. And while such tasks are relative child's play for Unifi's ML and AI algorithms, the architecture of the product can produce some remarkable insights as more and more data is connected and used on the platform.
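The obfuscation step mentioned above can be sketched as pattern-based masking. The patterns and the keep-last-four policy below are assumptions for illustration; a production system would also validate candidates (for example with the Luhn checksum for card numbers) before treating a match as sensitive data.

```python
import re

# Hypothetical patterns: dashed SSNs and 13-16 digit card numbers.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def obfuscate(text):
    """Mask sensitive values in free text, keeping only the last four digits."""
    def mask(match):
        digits = re.sub(r"\D", "", match.group())
        return "*" * (len(digits) - 4) + digits[-4:]
    text = SSN_RE.sub(mask, text)
    return CARD_RE.sub(mask, text)
```

For example, `obfuscate("SSN 123-45-6789")` yields `"SSN *****6789"`, preserving enough of the value for matching while hiding the rest.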

At one customer, 32 legacy databases provided telemetry from different services delivered to the consumer. Every data set used different metadata and column headers to describe identical fields across the services. Unifi's ML and AI automatically identified the relationship between the headers and the data in each table and prompted the users to combine the data from all 32 tables under a common header description. This saved months of error-prone manual data manipulation and unearthed anomalies in some data sets that would never have been discovered using traditional ETL processes and would have perpetuated incorrect analysis.
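The header-matching idea can be illustrated with a toy name-similarity pass. This is a deliberately simplified stand-in, not Unifi's method: real schema matching also compares the data inside each column (types, value distributions), and the similarity cutoff here is an arbitrary assumption.

```python
import difflib

def match_headers(headers, cutoff=0.8):
    """Cluster column headers that likely describe the same field.

    Headers are normalized (lowercased, underscores as spaces), then
    grouped by string similarity against each group's canonical key.
    """
    groups = []  # list of (canonical_key, [original headers])
    for h in headers:
        key = h.lower().replace("_", " ").strip()
        for canon, members in groups:
            if difflib.SequenceMatcher(None, key, canon).ratio() >= cutoff:
                members.append(h)
                break
        else:
            groups.append((key, [h]))
    return dict(groups)
```

With headers like `cust_id` and `CustID` from two legacy tables, this groups them under one canonical key while leaving `customer_name` separate, mirroring on a small scale what the 32-table consolidation required.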

As individual data sets are used within Unifi, the system learns about the data: its lineage, the number of times it is selected, the quality score users give the data set, and other statistics. This constantly growing intelligence within the Unifi platform feeds its core differentiator: its recommendation engine.

When a user wishes to combine two or more data sets and view the results in their BI tool, the ML and AI engines within Unifi recommend each step. The platform's uncanny ability to predict what users want before they have even thought of it is what makes Unifi so easy for any business user to use.

Selecting one data set presents the user with the data set they are most likely to want to join. If the platform gets this right, it presents a stack-ranked list of the insights the user is most likely seeking by combining these data sets. If the user selects more than two data sets, the prediction of the required insight becomes even more accurate. The user is prompted in natural language, in stack-ranked order: the most likely desired insight, followed by the next, and so on. For example, joining online sales to retail sales data might offer: highest SKU depletions by outlet type, same-SKU depletions by outlet, common customers by channel, sales by region by outlet type, and so on.
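One simple signal behind a "most likely join" suggestion is value overlap between candidate key columns. The sketch below scores column pairs by set containment; it is an assumed, simplified heuristic, since Unifi's actual engine also weighs usage statistics and learned user behavior, and the column dictionaries are illustrative.

```python
def score_join_keys(left, right):
    """Rank candidate join-column pairs by how much their values overlap.

    `left` and `right` map column names to lists of sample values.
    Returns (score, left_column, right_column) tuples, best first.
    """
    scores = []
    for lcol, lvals in left.items():
        for rcol, rvals in right.items():
            ls, rs = set(lvals), set(rvals)
            if not ls or not rs:
                continue
            # Containment: share of the smaller column found in the other.
            scores.append((len(ls & rs) / min(len(ls), len(rs)), lcol, rcol))
    return sorted(scores, reverse=True)
```

Given an online-sales table with a `sku` column and a retail table with `sku_code`, the overlapping SKU values push that pair to the top of the list, which is the kind of ranked prompt described above.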

As Unifi learns the operational function of the user (e.g., marketing analyst, finance analyst, sales analyst), it becomes even more accurate in its predictions and thus its recommendations.


Insights at the Speed of Thought

Not only do the Unifi AI and ML predict and recommend what an individual user will want to do with the data they have selected; they also learn how the user wants to consume the data. Does this user use Qlik, Tableau, or Excel? Unifi knows this from selections previously made, and so delivers the data transformation to the user ready to view. As a user starts to create a transform, the AI in the platform recognizes whether another user has previously created it and prompts this user to see if they can reuse the same job. If the recommendation turns out to be valid, and the underlying data has been updated since the job last ran, the platform's ML may set it up as a workflow job so it is constantly refreshed for future consumption.

Every time a user selects one or more of the recommendations, or creates their own completely ad hoc transform job, the system learns for the next user. This self-perpetuating, community-learning tool set at the heart of the enterprise data environment promises the panacea of the data-driven organization: true data democratization. Ready to learn more about how Unifi can democratize your enterprise data and help grow your business? Download the complimentary Gartner report and visit us online at UnifiSoftware.com to contact our team of experts today!