Data-as-a-Service (DaaS) platforms handle much of the profiling, cleansing and formatting necessary to prepare data for analysis. So where does that leave data scientists? 

As more companies integrate self-service data preparation into their operations, data scientists will focus more on:

  • Developing statistical models for the layman.
  • Identifying new data sources by referencing the business's objectives.
  • Verifying that business analysts are answering questions effectively.
  • Engaging in trial-and-error statistical modeling.

What do all of these responsibilities look like in practice? 

The Statistical Modeling Consultant

Most business professionals know what sort of questions they're trying to answer. However, very few know how to use statistics and mathematics to find those answers. They may not even be asking the right questions, given what they're really trying to discover.

For example, suppose an HR manager working for an advertising agency wants to figure out why the company frequently churns design talent. What's causing these designers to leave? The manager wants to reference data from exit interviews, quarterly employee satisfaction surveys and email communications between designers and their managers. 

The HR manager uses a DaaS solution to extract, transform and normalize the data sets he wants. Once the data is ready for analysis, he's not completely sure how he should weigh one variable against another. In addition, he may not know how to validate the relationships between those different data sets before he starts crunching the numbers.

This is where the data scientist will enter the picture. She sits down with the HR manager, assesses the data sets he's using to answer his question and formulates a statistical model that defines mathematical relationships between the variables within those data sets. The model will determine how the HR manager will analyze the data. 

Data scientists will spend more time modeling and less time retrieving data.

Independent Statistical Research 

In the event data scientists do become consultants to both professional analysts and typical business users, it's easy to imagine them conducting research on which statistical methods are most helpful to their organizations. Data scientists may take a pseudo-academic approach to this endeavor, pulling seemingly "random" data sets and testing statistical models on a regular basis.

What will these statistical models look like? It all depends on what sort of information data scientists are trying to reveal. 

Let's revisit the example of the HR manager trying to analyze the graphics department's poor retention rate. He wanted to pull survey data, email communications and exit interview transcripts.

The data scientist may recommend extracting survey responses from both current and previous graphic designers. In addition, she may advise matching individual survey responses with relevant exit interviews. Such measures are not statistical models in themselves, but they do guide data preparation.
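To make the matching step concrete, here is a minimal sketch in Python. The records and field names (employee_id, satisfaction, stated_reason) are hypothetical and chosen only for illustration; a real DaaS export would have its own schema.

```python
# Hypothetical survey and exit-interview records; all field names are assumptions.
surveys = [
    {"employee_id": 101, "quarter": "Q1", "satisfaction": 4},
    {"employee_id": 101, "quarter": "Q2", "satisfaction": 2},
    {"employee_id": 102, "quarter": "Q1", "satisfaction": 5},
    {"employee_id": 103, "quarter": "Q2", "satisfaction": 1},
]
exit_interviews = [
    {"employee_id": 101, "stated_reason": "workload"},
    {"employee_id": 103, "stated_reason": "compensation"},
]


def match_surveys_to_exits(surveys, exit_interviews):
    """Attach each designer's exit interview, if one exists, to every survey response."""
    exits_by_id = {record["employee_id"]: record for record in exit_interviews}
    matched = []
    for response in surveys:
        exit_record = exits_by_id.get(response["employee_id"])
        matched.append({
            **response,
            # Current designers have no exit interview, so they are kept and flagged.
            "left_company": exit_record is not None,
            "stated_reason": exit_record["stated_reason"] if exit_record else None,
        })
    return matched


matched = match_surveys_to_exits(surveys, exit_interviews)
```

Keeping current designers in the result, rather than dropping them, preserves the comparison group the data scientist asked for.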

After preparation concludes, the data scientist could develop a regression model that describes the relationship between the number of clients a designer had and stress indicators such as the number of sick days taken within a two-week period.
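A regression of this kind could be sketched as a simple least-squares line. The numbers below are invented purely for illustration, and a single-predictor linear fit is an assumption; a real analysis might call for more variables or a model better suited to count data.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: one entry per designer.
clients_per_designer = [2, 3, 4, 5, 6, 7, 8]
sick_days_in_two_weeks = [0, 1, 0, 1, 2, 2, 3]

slope, intercept = fit_simple_regression(clients_per_designer, sick_days_in_two_weeks)
print(f"Each additional client is associated with {slope:.2f} more sick days")
```

A positive slope would only show an association in this sample. Whether client load actually causes the absences is a question the data scientist and HR manager would still need to examine.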

Overall, data scientists will focus more on honing their mathematical capabilities than on managing data analysis projects. In the example above, the data scientist not only acts as a consultant but also assumes the role of teacher. All of the mathematical knowledge she uses to create statistical models will be passed on to analysts during the consultation process.

Data scientists will position themselves as consultants for analysts.

Communicating with DaaS Platforms 

A select few Data-as-a-Service platforms utilize machine learning to combine data sets based on associated characteristics. For example, if one data set lists a warehouse's daily power consumption rate and another data set details that facility's HVAC runtime, an ML program would recognize the relationship between the two data sets: HVAC runtime affects how much power the warehouse consumes. 
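How a given platform detects such relationships is not specified here. As one simple illustration, a program could flag two numeric series as related when their Pearson correlation is strong. The readings, the function names and the 0.8 threshold below are all assumptions made for this sketch.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length numeric series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("each series must contain at least two distinct values")
    return sxy / math.sqrt(sxx * syy)


def looks_related(x, y, threshold=0.8):
    """Flag two series as related when |r| meets a hypothetical threshold."""
    return abs(pearson_correlation(x, y)) >= threshold


# Hypothetical daily readings for one warehouse.
hvac_runtime_hours = [4, 6, 8, 10, 12]
power_consumption_kwh = [210, 260, 330, 390, 450]

r = pearson_correlation(hvac_runtime_hours, power_consumption_kwh)
related = looks_related(hvac_runtime_hours, power_consumption_kwh)
```

Correlation only captures linear co-movement between series that are already aligned day by day, which is one reason less obvious relationships can slip past an automated check.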

However, there may be cases when two data sets are related, but the relationship isn't obvious to an ML program.

For instance, an auto manufacturer may want to know whether it produces more spare parts than it needs. An analyst within the company could pull two data sets: one that describes factory production and another that details parts sales. A data scientist could recognize the connection between the two and communicate it to the platform.
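Once the connection is spelled out, the comparison itself is straightforward. The sketch below assumes both data sets can be keyed by a shared part name, which is exactly the kind of link a data scientist might have to point out. The part names and quantities are hypothetical.

```python
# Hypothetical unit counts for one period, keyed by part name.
units_produced = {"brake-pad": 1200, "air-filter": 800, "wiper-blade": 500}
units_sold = {"brake-pad": 900, "air-filter": 790}


def surplus_by_part(produced, sold):
    """Units produced minus units sold for every part the factory made."""
    # Parts with no recorded sales are treated as zero sold.
    return {part: made - sold.get(part, 0) for part, made in produced.items()}


surplus = surplus_by_part(units_produced, units_sold)
for part, extra in sorted(surplus.items(), key=lambda item: item[1], reverse=True):
    print(f"{part}: {extra} units produced beyond sales")
```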

Here's the exciting thing: Data scientists may be prompted to justify their reasoning to the ML programs so that the latter can learn how to draw conclusions in a similar fashion, and therefore identify relationships between seemingly unrelated data sets. This is an example of how data scientists could become mediators between business analysts and artificially intelligent applications.