Big Data Now: 2015 Edition

Book Description

Now in its fifth year, O’Reilly’s annual Big Data Now report recaps the trends, tools, applications, and forecasts we’ve talked about over the past year. For 2015, we’ve included a collection of blog posts, authored by leading thinkers and experts in the field, that reflect a unique set of themes we’ve identified as gaining significant attention and traction.

Our list of 2015 topics includes:

  • Data-driven cultures
  • Data science
  • Data pipelines
  • Big data architecture and infrastructure
  • The Internet of Things and real time
  • Applications of big data
  • Security, ethics, and governance

Is your organization on the right track? Get a hold of this free report now and stay in tune with the latest significant developments in big data.

Table of Contents

  1. Introduction
  2. Chapter 1: Data-Driven Cultures
    1. How an Enterprise Begins Its Data Journey
      1. Where to Begin?
      2. Using Tools to Offload ETL Workloads
      3. The Path Forward
    2. Improving Corporate Planning Through Insight Generation
    3. On Leadership
      1. Bridging Two Worlds
      2. Someone Else Will Handle the Details
      3. Here to Serve
      4. Thinking on Your Feet
      5. Showing the Way
      6. Be the Leader You Would Follow
    4. Embracing Failure and Learning from the Impostor Syndrome
    5. The Key to Agile Data Science: Experimentation
      1. An Example Using the Stack Overflow Data Explorer
      2. Lessons Learned from a Minimum Viable Experiment
  3. Chapter 2: Data Science
    1. What It Means to “Go Pro” in Data Science
      1. Going Pro
      2. Think Like a Pro
      3. Design Like a Pro
      4. Build Like a Pro
      5. Tools of the Pro
      6. Epilogue: How This Article Came About
    2. Graphs in the World: Modeling Systems as Networks
      1. Networks and Markets
      2. LinkedIn InMaps
      3. Inbox Networks
      4. Customer Relationship Management Analytics
      5. Conclusion
    3. Let’s Build Open Source Tensor Libraries for Data Science
      1. Tensor Methods Are Accurate and Embarrassingly Parallel
      2. Hierarchical Decomposition Models
      3. Why Aren’t Tensors More Popular?
  4. Chapter 3: Data Pipelines
    1. Building and Deploying Large-Scale Machine Learning Pipelines
      1. Identify and Build Primitives
      2. Make Machine Learning Modular: Simplifying Pipeline Synthesis
      3. Do Some Error Analysis
    2. Three Best Practices for Building Successful Data Pipelines
      1. Ensuring Reproducibility by Providing a Reliable Audit Trail
      2. Establishing Consistency in Data
      3. Productionizability: Developing a Common ETL
      4. Focusing on the Science
    3. The Log: The Lifeblood of Your Data Pipeline
      1. Changing the Way We Think About Log Data
      2. The Need for the Unified Logging Layer
      3. Fluentd
      4. Appendix: The Duality of Kafka and Fluentd
    4. Validating Data Models with Kafka-Based Pipelines
      1. A/B Testing Multiple Data Stores and Data Models in Parallel
      2. Kafka’s Place in the “Which Data Store Do We Choose” Debate
  5. Chapter 4: Big Data Architecture and Infrastructure
    1. Lessons from Next-Generation Data-Wrangling Tools
      1. Scalability ~ Data Variety and Size
      2. Empower Domain Experts
      3. Consider DSLs and Visual Interfaces
      4. Intelligence and Automation
      5. Don’t Forget About Replication
    2. Why the Data Center Needs an Operating System
      1. Machines Are the Wrong Abstraction
      2. If My Laptop Were a Data Center
      3. It’s Time for the Data Center OS
      4. An API for the Data Center
      5. Example Primitives
      6. A New Way to Deploy Applications
      7. The “Cloud” Is Not an Operating System
      8. Apache Mesos: The Distributed Systems Kernel
    3. A Tale of Two Clusters: Mesos and YARN
      1. Brief Explanation of Mesos and YARN
      2. Mesos Scheduling
      3. YARN Scheduling
      4. Is It YARN Versus Mesos?
      5. Introducing Project Myriad
      6. Final Thoughts
    4. The Truth About MapReduce Performance on SSDs
      1. SSDs Versus HDDs of Equal Aggregate Bandwidth
      2. Configuring a Hybrid HDD-SSD Cluster
      3. Price Per Performance Versus Price Per Capacity
      4. SSD Economics—Exploring the Trade-Offs
    5. Accelerating Big Data Analytics Workloads with Tachyon
      1. Creating an Ad-Hoc Query Engine
      2. From Hive, to Spark SQL, to Tachyon
      3. A High-Performance, Reliable Cache Layer
      4. Tachyon and Spark SQL
      5. How to Get Data from Tachyon
      6. Performance and Deployment
      7. Problems Encountered in Practice
      8. Time-to-Live Feature and What’s Next for Tachyon
  6. Chapter 5: The Internet of Things and Real Time
    1. A Real-Time Processing Revival
    2. Improving on the Lambda Architecture for Streaming Analysis
      1. What Is Lambda?
      2. Common Lambda Applications
      3. Limitations of the Lambda Architecture
      4. Simplifying the Lambda Architecture
      5. Applications That Make Use of In-Memory Capabilities
      6. Conclusion
    3. How Intelligent Data Platforms Are Powering Smart Cities
      1. Data Collection and Transport
      2. Data Processing, Storage, and Real-Time Reports
      3. Intelligent Data Applications
    4. The Internet of Things Has Four Big Data Problems
      1. Problem #1: Nobody Will Wear 50 Devices
      2. Problem #2: More Inference, Less Sensing
      3. Problem #3: Datamandering
      4. Problem #4: Context Is Everything
  7. Chapter 6: Applications of Big Data
    1. How Trains Are Becoming Data Driven
      1. Deriving Insight from the Data of Trains
      2. Requirements of Industrial Data
      3. How Machine Learning Fits In
    2. Multimodel Database Case Study: Aircraft Fleet Maintenance
      1. Aircraft Fleet Maintenance: A Case Study
      2. Queries for Aircraft Fleet Maintenance
      3. Lessons Learned for Data Modeling
      4. Additional Use Cases for Multimodel Databases
      5. The Future of Multimodel Databases
    3. Big Data Is Changing the Face of Fashion
      1. Solutions for a Unique Business Problem
      2. Using Data to Drive Big Sales
    4. The Original Big Data Industry
  8. Chapter 7: Security, Ethics, and Governance
    1. The Security Infusion
      1. Combining Metadata into Policies
      2. Storing Policies with the Data
      3. The Policy Enforcement Engine
      4. Benefits of Centralizing Policy
    2. We Need Open and Vendor-Neutral Metadata Services
      1. Improved Data Analysis: Metadata on Use
      2. Enhanced Interoperability: Standards on Use
      3. Comprehensive Interpretation of Results
      4. Reproducibility
      5. Data Governance Policies by the People, for the People
      6. Time Travel and Simulations
    3. What the IoT Can Learn from the Healthcare Industry
    4. There Is Room for Global Thinking in IoT Data Privacy Matters
    5. Five Principles for Applying Data Science for Social Good
      1. “Statistics” Is So Much More Than “Percentages”
      2. Finding Problems Can Be Harder Than Finding Solutions
      3. Communication Is More Important Than Technology
      4. We Need Diverse Viewpoints
      5. We Must Design for People
      6. What’s Next