If you knew how much your competitor was investing in certain businesses or technologies, you could piece together a picture of their strategy. This could give you early warning of the moves your competitor was making—possibly into your market segment or against your investments—and time to plan for them, or it could give you the corporate insight needed to hone your own strategy.

It turns out there is a tremendous amount of public data available to be mined for these insights. Sources where such insights might be uncovered include USPTO patent filings, social media commentary, job postings and new-employee listings on LinkedIn, university recruiting fairs, trade event speaking engagements, guest lecturing—and the list goes on.

The challenge is that you need two key elements to unearth the intelligence in this data:

1) Key words and phrases: identify the words and phrases that are likely to lead to insights

2) Word tokenization: score the key words by their contextual use and their association with other key words


Let’s say you’re an energy company trying to get a clear picture of how much your competitors are investing in new forms of renewable energy. You might start with a list of key words like solar, photovoltaic, photochemistry, electrochemistry, panels, wind turbine, hydroelectric, WEC (Wave Energy Conversion), etc. On their own these words might describe millions of articles; the trick is to understand the context. The word “panel” could be anything from a sheet of plywood to a collection of experts—it’s only when it is preceded by the word “solar” that the phrase “solar panel” becomes interesting to this search.
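The context rule above—"panel" only matters when "solar" precedes it—can be sketched with a simple phrase matcher. This is an illustrative example, not Unifi's implementation; the phrase list and function name are assumptions for the renewable-energy scenario:

```python
import re

# Hypothetical key-phrase list for the renewable-energy example.
# Multi-word phrases like "solar panel" encode the context rule:
# "panel" alone never matches, only "solar panel" does.
KEY_PHRASES = [
    r"solar panels?",
    r"photovoltaic",
    r"wind turbines?",
    r"hydroelectric",
    r"wave energy conversion",
]
PATTERN = re.compile("|".join(KEY_PHRASES), re.IGNORECASE)

def find_key_phrases(text):
    """Return every key phrase found in a document, lowercased."""
    return [m.group(0).lower() for m in PATTERN.finditer(text)]

doc = ("The firm is hiring engineers to install solar panels; "
       "a panel of experts will review wind turbine sites.")
print(find_key_phrases(doc))  # ['solar panels', 'wind turbine']
```

Note that the bare "panel of experts" produces no match—only the contextual phrase "solar panels" does.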

Normalizing Unstructured Data

The problem with these words and phrases is that they are contained in billions of webpages and electronic documents like PDFs, Word documents, and PowerPoint presentations. All of this data is unstructured, meaning that before you can assess the relevance of these words and phrases and begin the tokenization process, you must first normalize the data. Unifi’s unique OneParse™ algorithm manages this process automatically. Connect Unifi to unstructured or semi-structured data sources, such as web logs or JSON files, and the software creates a structured tabular view of that data source.
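To make the idea of normalization concrete, here is a minimal sketch of turning a semi-structured JSON record into a flat, tabular row. This is not Unifi's OneParse algorithm—just an illustration of the kind of structure-creating step it performs automatically:

```python
import json

def flatten(record, prefix=""):
    """Flatten one nested JSON object into a single-level dict
    whose keys read like column names (e.g. 'source.site')."""
    row = {}
    for key, value in record.items():
        col = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=col + "."))
        else:
            row[col] = value
    return row

raw = ('{"title": "Solar hiring up", '
       '"source": {"site": "example.com", "type": "job_posting"}}')
print(flatten(json.loads(raw)))
# {'title': 'Solar hiring up', 'source.site': 'example.com',
#  'source.type': 'job_posting'}
```

Once every record is a uniform row like this, downstream tokenization and scoring can treat the data as an ordinary table.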

Tokenization can occur by running a Natural Language Processing (NLP) routine in conjunction with Unifi. NLP libraries are available for Python and Apache Spark, among others, and can run in Hadoop to parallel-process the NLP parsing alongside the transform job. The NLP task can be triggered in Unifi as part of an automated workflow like this:

1) Connect to unstructured data sources and pull the data into Hadoop

2) Run OneParse normalization to create a structured view

3) Combine data sets to derive a unified view

4) Derive attributes for tokenization

5) Execute the neural network model

6) Gather results in Hadoop and format them in Unifi for export and analysis
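Steps 4 through 6 can be sketched in miniature. The snippet below is a toy, single-machine stand-in for the real workflow (which would run in parallel on Hadoop/Spark); the key-word list, function name, and scoring scheme are illustrative assumptions, not Unifi's API:

```python
from collections import Counter

# Hypothetical key words for the renewable-energy scenario.
KEY_WORDS = {"solar", "photovoltaic", "wind", "turbine", "hydroelectric"}

def derive_attributes(doc_text):
    """Tokenize a normalized document and derive simple attributes:
    token count, per-key-word hit counts, and a crude overall score."""
    tokens = [t.strip(".,;:()").lower() for t in doc_text.split()]
    hits = Counter(t for t in tokens if t in KEY_WORDS)
    return {
        "tokens": len(tokens),
        "key_word_hits": dict(hits),
        "score": sum(hits.values()),
    }

docs = [
    "Competitor posts new solar photovoltaic research roles.",
    "Quarterly earnings call transcript, no energy mentions.",
]
for d in docs:
    print(derive_attributes(d))
```

A real deployment would replace the whitespace tokenizer with a proper NLP library and feed the derived attributes to the model in step 5, but the shape of the data flow is the same.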

The true value of insights is often locked away in some of the most challenging data to process. The Unifi platform helps IT organizations and business analysts unearth these unique insights to create new business opportunities.