The data community will have plenty of opportunities in 2017—and a few gnarly challenges. Here's a look at what lies ahead.
1. More data scientists will begin using deep learning.
2016 saw major advancements in deep learning and the release of new tools to make deep learning simpler, as well as tools that integrate directly with existing big data platforms and frameworks. And there are simply too many useful things you can do with deep learning, many of which are becoming mission critical, for data scientists to avoid it in 2017. Think time series and event data (including anomaly detection), IoT and sensor-related data analysis, speech recognition, and text mining recommenders, to name a few.
2. Data engineering skills will be in increasing demand.
In 2012, Harvard Business Review named data scientist the "sexiest job of the 21st century." In 2017, we expect demand to continue for data scientists, but the talent gap will be framed in terms of data engineers (more than data scientists). Companies are looking for data scientists who can code. We’ll need more data scientists who can touch production systems. Yes, those are unicorn skills, but they’ll command happily-ever-after salaries as well.
3. More companies will be using managed services in the cloud.
A recent O’Reilly survey found that “after an organization has gained some experience using big data in the cloud, it’s more likely to expand its use of similar big data services. In other words, once they’ve tested the waters, they’re more likely to jump into the pool.”
Companies now have access to a wide array of managed services for storage, data processing, visualization, analytics, and AI. While popular open source components are available in this manner, proprietary managed services are proving to be popular options. Because the tools will be managed by the service providers, in-house data professionals will be able to focus on the problem at hand rather than their tools—though they’ll have to learn how to design, build, and manage applications that run in the cloud.
4. But not everything will move to the public cloud.
Legacy systems, sensitive data, security, compliance, and privacy issues will require a mix of cloud, on-premises, and hybrid applications. There will also be applications that utilize specialized or even private cloud providers, like Predix for the industrial IoT or the Amazon Web Services-built CIA cloud. Organizations will need solutions architects who understand how to leverage the best of both worlds.
5. The democratization of data: Simpler tools will simplify many tasks.
New tools for self-service analytics have made it easier to do a variety of data analysis tasks. Some require no programming, while other tools make it easier to combine code, visuals, and text in the same workflow. This empowers users who aren’t statisticians or data geeks to do routine data analysis, freeing up the data experts for more complex projects or to focus on optimizing end-to-end pipelines and applications.
This has been happening for several years now, but we’ve recently seen tools emerging that democratize more advanced analytics (Microsoft Azure, for example), allow the ingestion of large-scale streaming data sources, and enable advanced machine learning (Google Cloud Platform and Amazon Machine Learning, for example).
6. The decoupling of storage and computation will accelerate.
The UC Berkeley AMPLab project ended in November 2016, but the teams behind Apache Spark and Alluxio are far from the only ones to highlight the separation of storage and computation. As noted above, popular object stores in the cloud and even some recent deep learning architectures emphasize this pattern.
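To make the pattern concrete, here is a minimal sketch in plain Python. It is an illustration only, not how Spark, Alluxio, or any cloud provider implements it. The names ObjectStore, InMemoryStore, and word_count_job are hypothetical, and the in-memory dictionary stands in for a real object store. The point is that the compute step holds no data of its own: it reads from and writes to storage through a narrow interface, so either side can be swapped or scaled without touching the other.

```python
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Minimal storage interface; compute code depends only on this."""

    def get(self, key: str) -> bytes: ...

    def put(self, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Hypothetical stand-in for a cloud object store, for local use."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data


def word_count_job(store: ObjectStore, input_key: str, output_key: str) -> int:
    """Stateless compute step: read input, process it, write the result back."""
    text = store.get(input_key).decode("utf-8")
    count = len(text.split())
    store.put(output_key, str(count).encode("utf-8"))
    return count


# The job keeps no local state, so any worker with access to the
# store could run it.
store = InMemoryStore()
store.put("raw/doc.txt", b"storage and compute scale separately")
result = word_count_job(store, "raw/doc.txt", "derived/doc.count")
```

Because word_count_job only needs something that offers get and put, the same function would work unchanged against a different backend that exposes those two methods.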
7. Notebooks and workflow tools will continue to evolve.
Jupyter Notebook is widely used by data scientists because it offers a rich architecture of elements that can be used and recomposed for a broad range of problems, including data cleaning and transformation, numerical simulation, statistical modeling, and machine learning. (O’Reilly uses Jupyter Notebook as the basis for Oriole Interactive Tutorials, for example.) It’s useful for data teams because you can create and share documents that contain live code, equations, visualizations, and explanatory text. And by connecting Jupyter to Spark, you can write Python code with Spark from an easy-to-use interface instead of using the Linux command line or Spark shell.
Data professionals continue to use a variety of tools. Beaker notebooks support many programming languages, and there are now multiple notebooks that target the Spark community (Spark Notebook, Apache Zeppelin, and Databricks Cloud). However, not all data professionals are using notebooks: they aren't suited for managing complex data pipelines, and workflow tools are a better fit for that job. And data engineers favor tools used by software developers. With deep learning and other new techniques entering the data science and big data communities, we anticipate that existing tools will evolve even more.
8. The data community will continue to hammer out best practices for things like privacy and ethics.
As machine learning becomes more common, data sources more varied, and algorithms more complex, transparency becomes much more difficult to achieve. Achieving fairness in data applications is more challenging than ever. Over 2017 we expect to see more discussion about public policy that addresses these concerns, best practices for testing for bias, and a growing awareness that biased assumptions lead to biased results.
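As one illustration of what a basic bias test can look like, the sketch below compares how often a model returns a positive decision for each group and reports the ratio of the lowest rate to the highest. This is only one of many possible checks, and it is our example rather than an established community standard. The function names and the data are made up for illustration. A ratio far below 1 is a signal to investigate, not proof of unfairness.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def selection_rates(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return the share of positive predictions (1) for each group."""
    totals: Dict[str, int] = defaultdict(int)
    positives: Dict[str, int] = defaultdict(int)
    for group, prediction in records:
        totals[group] += 1
        positives[group] += prediction
    return {group: positives[group] / totals[group] for group in totals}


def disparate_impact_ratio(rates: Dict[str, float]) -> float:
    """Lowest group selection rate divided by the highest."""
    highest = max(rates.values())
    # If no group receives any positive decisions, the rates are equal.
    return min(rates.values()) / highest if highest > 0 else 1.0


# Made-up predictions: (group label, model decision)
predictions = [
    ("a", 1), ("a", 1), ("a", 1), ("a", 0),
    ("b", 1), ("b", 0), ("b", 0), ("b", 0),
]
rates = selection_rates(predictions)
ratio = disparate_impact_ratio(rates)
```

In practice you would run a check like this on held-out data, alongside other metrics, since a single number cannot capture every notion of fairness.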