10 data trends on our radar for 2016
Promising topics in data that we'll be watching closely in the year ahead.
The following piece was first published in the Data newsletter.
As O’Reilly’s chief data scientist and program development director of the Strata+Hadoop World events, I spend a lot of my time thinking about (and anticipating) trends in data. We’re not just interested in what’s hot now. We’re interested in those sleeper techniques and technologies that emerge and change our world. The mundane and unsexy techniques that are, nonetheless, immensely useful. And the already-hot technologies that have unexpected uses in an entirely different arena. Here are 10 promising topics in data that we’ll be watching closely in 2016.
Joe Hellerstein recently sketched out a vision for open, vendor-neutral metadata services, which can give rise to many novel data products and applications, as well as lead to data-governance policies.
2. Systems optimization
Textbook machine learning problems can be boiled down to problems in mathematical optimization, for which there are many libraries and packages to choose from. In practice, though, applying an algorithm is hardly the only thing a data scientist needs to do. Data scientists need to optimize entire data pipelines (acquire, wrangle, featurize, fit a function), which means chaining together primitives and systems and optimizing them across interdependent steps. With recent successes in computer vision, speech, and machine translation, deep neural networks provide an approach for optimizing such pipelines via gradient descent. But there’s no reason to believe that other algorithms can’t become competitive with deep learning. For alternative strategies to emerge, frameworks and platforms for comparing and optimizing data pipelines need to get better.
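To make "optimizing across interdependent steps" concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the toy data, the word lists, and the step names (`wrangle_clean`, `feat_sentiment`, and so on) are illustrations, not part of any real framework. It also uses a plain exhaustive search over step combinations rather than gradient descent, simply to show that the best featurizer can depend on which wrangling step ran before it.

```python
from itertools import product

# Hypothetical word lists and toy data, for illustration only.
POSITIVE = {"great", "works"}
NEGATIVE = {"terrible", "awful", "broke"}

TRAIN = [
    ("  Great product, works well ", 1),
    ("terrible, broke in a day", 0),
    ("GREAT value and great support", 1),
    ("Awful. Terrible experience", 0),
]
TEST = [
    ("Works GREAT", 1),
    ("Broke immediately, awful", 0),
]

# Wrangling choices.
def wrangle_raw(text):
    return text

def wrangle_clean(text):
    return text.strip().lower()

# Featurizing choices; each maps text to a single number.
def feat_length(text):
    return float(len(text))

def feat_sentiment(text):
    words = text.replace(",", " ").replace(".", " ").split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return float(pos - neg)

def accuracy(xs, ys, threshold):
    # Predict 1 when the feature exceeds the threshold.
    hits = sum(int(x > threshold) == y for x, y in zip(xs, ys))
    return hits / len(ys)

def fit_threshold(xs, ys):
    # "Fit a function": pick the threshold with the best training accuracy.
    candidates = [min(xs) - 1.0] + sorted(set(xs))
    return max(candidates, key=lambda t: accuracy(xs, ys, t))

def run_pipeline(wrangle, featurize, train, test):
    train_x = [featurize(wrangle(text)) for text, _ in train]
    train_y = [label for _, label in train]
    threshold = fit_threshold(train_x, train_y)
    test_x = [featurize(wrangle(text)) for text, _ in test]
    test_y = [label for _, label in test]
    return accuracy(test_x, test_y, threshold)

WRANGLERS = {"raw": wrangle_raw, "clean": wrangle_clean}
FEATURIZERS = {"length": feat_length, "sentiment": feat_sentiment}

def search_pipelines(train, test):
    # Score every combination of steps jointly, not one step at a time.
    scores = {}
    for (w_name, w), (f_name, f) in product(WRANGLERS.items(), FEATURIZERS.items()):
        scores[(w_name, f_name)] = run_pipeline(w, f, train, test)
    best = max(scores, key=scores.get)
    return best, scores

if __name__ == "__main__":
    best, scores = search_pipelines(TRAIN, TEST)
    for config, score in sorted(scores.items()):
        print(config, score)
    print("best:", best)
```

In this toy setup the sentiment feature only pays off when the text has been lowercased first, which is the kind of cross-step dependency that makes pipeline-level comparison necessary. A real comparison would select on a separate validation split rather than on the held-out test data used here.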
3. Structured data extraction
Companies and researchers are automatically converting many data sources (web pages, the Dark Web, documents) into structured information, which can then be used as features in machine learning models. These automatic information extraction systems can match the accuracy achieved by human domain experts.
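As a rough illustration of the idea, the sketch below turns free-text listings into structured records and then into model features. It is a deliberately simple stand-in: the systems described above typically rely on statistical or learned extractors, whereas this example uses a single hand-written regular expression. The listing format, the `Listing` record, and the feature names are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical pattern for listings such as "Used bike for $120 in Austin".
LISTING_PATTERN = re.compile(
    r"(?P<item>[\w ]+?) for \$(?P<price>\d+(?:\.\d+)?) in (?P<city>[A-Z][a-z]+)"
)

@dataclass
class Listing:
    item: str
    price: float
    city: str

def extract_listing(text: str) -> Optional[Listing]:
    """Return a structured record, or None if the text does not match."""
    match = LISTING_PATTERN.search(text)
    if match is None:
        return None
    return Listing(
        item=match.group("item").strip(),
        price=float(match.group("price")),
        city=match.group("city"),
    )

def to_features(listing: Listing) -> Dict[str, float]:
    """Encode a record as numeric features (one-hot for the city)."""
    return {"price": listing.price, f"city={listing.city}": 1.0}

if __name__ == "__main__":
    documents = [
        "Used bike for $120 in Austin. Call soon.",
        "vintage lamp for $45.50 in Denver",
        "no structured information here",
    ]
    for doc in documents:
        record = extract_listing(doc)
        print(record, to_features(record) if record else None)
```

The useful property is the separation of concerns: extraction produces a typed record, and feature encoding is a separate step, so the extractor can later be swapped for a learned model without touching downstream code.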
4. Cloud computing
Companies are increasingly comfortable putting some of their data and analytic tools in cloud environments, and I expect this trend to accelerate in the years to come. As a sign that competition is alive and well, cloud providers Google and Microsoft have been releasing interesting big data tools at a steady pace.
5. Intelligent, real-time data applications
Massive-scale real-time processing is a hot topic at Strata+Hadoop World. While much of the focus is often on tools and architectures, applications are what I’m most excited about. From smart cities to industrial applications to health and other consumer applications, the application of AI to big data (volume, variety, velocity) is becoming more common. Researchers and startups are developing algorithms for mining massive time series and event data.
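One small example of what mining a stream of events can look like: the sketch below flags anomalous readings in a time series using an online z-score, with the running mean and variance maintained by Welford's algorithm so that no history needs to be stored. This is only one simple technique chosen for illustration; the threshold, the warm-up length, and the sample readings are assumptions, not values from any particular system.

```python
import math

class RunningStats:
    """Online mean and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.n < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.n - 1))

def detect_anomalies(stream, z_threshold=3.0, warmup=5):
    """Yield (index, value) for readings far from the running baseline."""
    stats = RunningStats()
    for i, x in enumerate(stream):
        if stats.n >= warmup and stats.std > 0:
            z = abs(x - stats.mean) / stats.std
            if z > z_threshold:
                yield i, x
                continue  # keep flagged readings out of the baseline
        stats.update(x)

if __name__ == "__main__":
    # Hypothetical sensor readings with one obvious spike.
    readings = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 25.0, 10.1, 9.9]
    print(list(detect_anomalies(readings)))
```

Because each reading is processed once with constant memory, the same logic can run per key inside a stream processor. The trade-off is that a global running baseline adapts slowly to genuine shifts in the data, which is where windowed or decayed statistics come in.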
6. Ethics and data for social good
Fairness, transparency, and privacy are some of the issues that data professionals are starting to engage with. There are also many organizations, such as our partner DataKind, that match data scientists with non-profit agencies in need of experts to use data for social good.
Spark remains the most popular big data framework, and in 2016 there will be many important developments in Spark and its core libraries (Streaming, MLlib, SQL). This year I look forward to seeing breakout contributions in the growing ecosystem of packages. Here are two recent research projects out of AMPLab that caught my eye: Succinct Spark (for search) and SparkNet (deep learning).
AMPLab, the UC Berkeley lab that originated Apache Spark, Tachyon, and (in an earlier incarnation) Apache Mesos, is scheduled to reboot into another lab in 2016. Given that it has produced a software stack that has become very popular in industry, I’m looking forward to what comes next.
Hardware topics don’t get enough coverage in the data community. GPUs, SSDs, and CPU caches are just some of the things that have impacted data systems in the past year. There will be many more interesting developments in the next 18 months.
10. Human-in-the-loop, visualization, and interfaces
Massive-scale data visualization is one area where GPUs are game changers. The importance of interfaces can be traced to the realization that in many settings augmented intelligence still beats AI. I’m also interested in seeing potential applications of augmented/virtual reality to visualizing big data.