In a previous blog, I wrote that natural language processing (NLP) technology “permits an interactive question-driven experience that mimics in a sense, the human experience of discovery.” I argued that the process of interacting with NLP mimics the process of analysis itself, whereby one interrogates oneself, one’s peers, and the surrounding world.
The benefit, as I see it, is twofold. A person can ask a simple question (“What is the rate of violent crime in the Seattle metropolitan area?”) and get a simple answer: 6.33 incidents of violent crime per 1,000 residents. (Data is valid for 2017; figures for 2018 are unavailable.) Perhaps that’s all she needs; maybe she needs more, however. Maybe she wants to isolate or control for one or more explanatory variables: say, education, employment, and income, along with wild cards, such as environment-specific factors. She might find that neighborhoods with high rates of violent crime also tend to have high rates of unemployment, low incomes, and low rates of educational attainment.
But she might discover something else: namely, that a fourth variable (higher-than-normal blood lead levels in children) is in the mix, too. It isn’t just that low-income neighborhoods with high rates of unemployment and low rates of educational attainment tend to have high rates of violent crime; it’s that they also tend to have high rates of lead contamination. This doesn’t so much lead her to a conclusion, e.g., (low income) + (lack of education) + (high unemployment) = violent crime, as lead her to another inceptive question: is there a causal relationship between lead exposure and violent crime? The answer, she’ll discover, is an unqualified maybe. Neighborhoods or communities in which children have (or have had) abnormally high blood lead levels also tend to have abnormally high rates of violent crime.
I’ve cheated with this example. NLP query is less a tool for discovering unknown-unknowns—stuff like the lead-crime link, which wasn’t identified until relatively recently—than for surfacing formal knowledge: i.e., stuff we already know and can relate to other stuff we already know. If the answer to the question we want to ask is already “there” in the very structure of the data, it can be surfaced via natural-language query. A question such as “What is the sum total of sales for California for 2018?” is analogous, in its way, to a SQL query. Provided the requisite data is integrated and accessible (e.g., in a sales data mart or Salesforce.com data connector to the search tool), NLP query will surface an answer.
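To make the SQL analogy concrete, here is a minimal sketch of how one fixed question template could be turned into a parameterized query. Everything in it is an assumption made for illustration: the `sales` table, its columns, the sample rows, and the single regular-expression template are all hypothetical, and a real NLP query tool parses questions far more flexibly than this.

```python
import re
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (state TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("California", 2018, 100.0),
        ("California", 2018, 50.0),
        ("California", 2017, 75.0),
        ("Oregon", 2018, 20.0),
    ],
)

# One hard-coded question template; this stands in for real language parsing.
PATTERN = re.compile(
    r"what is the sum total of sales for (?P<state>[a-z ]+?) for (?P<year>\d{4})\??$",
    re.IGNORECASE,
)


def answer(question: str) -> float:
    """Map a question that fits the template onto a SUM query."""
    match = PATTERN.match(question.strip())
    if match is None:
        # Open-ended questions have no structured counterpart to fall back on.
        raise ValueError("question does not fit the one template this sketch knows")
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE state = ? AND year = ?",
        (match["state"].title(), int(match["year"])),
    ).fetchone()
    return row[0]


total = answer("What is the sum total of sales for California for 2018?")
```

The point of the sketch is the limitation as much as the capability: the question can be answered only because the state and the year already exist as columns in the data.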
If our question is open-ended, such that we don’t even know how to express what we want to ask, NLP query is much less helpful. And that’s why I say I’m cheating. In point of fact, lead-crime is the kind of causal connection for which visual data discovery was basically invented.
I like to imagine a hypothetical researcher preparing her data set from scratch, mixing data from the Census Bureau, the Bureau of Labor Statistics, the Bureau of Justice Statistics, and, not least, the Centers for Disease Control. Her operating plan is to study the relationship between violent crime and a select few explanatory variables: say, income, education, and employment. The thing is, as soon as she runs a cluster analysis on her data set in Tableau, she literally sees something new: the lead-crime link. It’s right there in the visualization, plain as day. NLP query can’t do this, at least not yet. It’s great for surfacing facts (or relationships between facts) in a data set. But the facts and relationships it surfaces are in some way structured as facts and relationships, if not in a tabular format then via grammar and syntax.
One reason for this difference is that, in the example above, the analyst had to construct a join condition in order to surface the relationship between the data tables, using a common key (in this case, ZIP code) to link the tables and gain the insight. Handling this step is around the corner for tools like Unifi Data Platform. Using NLP techniques such as recurrent and convolutional neural networks, hidden Markov models, and long short-term memory (LSTM) networks with beam search, Unifi will be able to combine data sets via NLP to enable Boolean searches.
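The join condition described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two tables, their column names, the placeholder ZIP codes, and every figure in them are invented, and it says nothing about how Unifi or Tableau actually perform the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables from different sources that share only a ZIP code.
conn.executescript(
    """
    CREATE TABLE crime (zip TEXT PRIMARY KEY, violent_per_1000 REAL);
    CREATE TABLE lead_levels (zip TEXT PRIMARY KEY, pct_children_elevated REAL);
    """
)
conn.executemany(
    "INSERT INTO crime VALUES (?, ?)",
    [("00001", 2.0), ("00002", 9.0), ("00003", 5.0)],
)
conn.executemany(
    "INSERT INTO lead_levels VALUES (?, ?)",
    [("00001", 1.0), ("00002", 8.0), ("00004", 3.0)],
)

# The join condition: rows are paired only where the ZIP codes match.
joined = conn.execute(
    """
    SELECT c.zip, c.violent_per_1000, l.pct_children_elevated
    FROM crime AS c
    JOIN lead_levels AS l ON l.zip = c.zip
    ORDER BY c.zip
    """
).fetchall()
```

Only ZIP codes present in both tables survive the inner join, which is why someone has to know that the common key exists before the two measures can be compared side by side.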
The upshot is that NLP is a technology par excellence for surfacing knowledge. In this regard, it is suitable for a wide range of business users, from the executive who simply requires a top-line answer to a question (How many W2 employees do we have in California?) to the data scientist or researcher who is charged with using machine learning, data visualization, and other advanced technologies to test and analyze all relevant business determinations.
On top of this, NLP has a killer feature that, I think, is unique among all knowledge-surfacing technologies: namely, its potential to promote a dialectical question-driven experience, not unlike what happens when you bring a bunch of smart people together and start asking them questions about a particular subject. The process as a whole isn’t exactly open-ended, but, by the same token, no one knows quite where it’s going to end up, either.
In this way, NLP leads a questioner not just to answers, but, just as important, to new questions. Google does something similar with most searches, offering other questions that might be related to what you searched for. This often leads the questioner on a journey that ends a long way from the original question. Unifi Data Catalog searches do this by (for example) surfacing assets that are related in some way to a question or topic: be they discrete facts and figures; analyses, reports, and tables; documents, notes, and comments; etc. A person can interrogate these related assets, too, in the process surfacing additional answers and context. In my experience, once a person has a chance to work with NLP query, she is usually hooked. She can easily ask questions, pose follow-up questions, etc. In this way, questions beget answers that tend to beget new questions.
Don’t get me wrong: NLP is still very much a work in progress. Some implementations are almost Alexa-like in their richness, permitting a person to pose comparatively complex natural-language questions (e.g., “What was the sum total of all sales in California in Q2 2018?”) and get valid answers. (In point of fact, Unifi Labs is developing a voice interface in support of NLP.)
On the one hand, NLP technology is evolving quickly. On the other hand, thanks to its commodification and to the AI-ification of almost every aspect of workaday life, NLP has become a less distinct technology category, too. For example, the use of voice recognition technology, hinted at here, is a scheme in which NLP is used along with an ensemble of other machine learning (ML) techniques. In the popular imagination, however, speech recognition is increasingly identified with so-called “AI” (really, task-specific ML) in the form of Amazon Alexa, Apple Siri, Google Assistant, etc. When NLP is used in this larger context, it has the potential to wholly transform the experience whereby one consumes analytics.
Imagine, for example, that a person is a participant in a voice-activated analytical experience. If for some reason she’s confused or has questions about a result, she can easily pose follow-up questions, basically as naturally as if she were speaking with a human interlocutor. But that’s just the beginning. She could use voice activation to invoke a data visualization tool, to load a data set, to trigger different kinds of visualizations, etc. She could use it to retrieve and display reports, presentations, and documents, too. By itself, NLP isn’t an “answer,” per se. It is when it’s used to complement, enhance, and, in some cases, to drive other technologies and techniques (data visualization, visual data discovery, data scientific research, etc.) that it is at its most powerful. We’re beginning to get a taste of that, with much, much more to come.
More important, we’ve reached a point where NLP technology has already become so useful as to be invaluable. A wide range of BI, analytic, visual data discovery, etc. products now incorporate it. Given NLP’s visceral appeal to the imagination (it has what psychologist William James once called “sensational tang”), it’s destined to become an integral part of the analytical experience. Once this happens, NLP will effectively disappear as a differentiator, much like pixel-perfect reporting, ad hoc query, and, even, to a degree, ETL before it….