Big Data Now: 2016 Edition

Book Description

Now in its sixth edition, O’Reilly’s annual Big Data Now report recaps the trends, tools, applications, and forecasts we’ve examined throughout 2016. This collection of blog posts, authored by leading thinkers and experts in the field, reflects a unique set of themes we’ve identified as gaining significant attention and traction.

Our list of topics for 2016 includes:

  • Careers in data
  • Tools and architecture for big data
  • Intelligent real-time applications
  • Cloud infrastructure
  • Machine learning: models and training
  • Deep learning and artificial intelligence

Table of Contents

  1. Introduction
  2. 1. Careers in Data
    1. Five Secrets for Writing the Perfect Data Science Resume
    2. There’s Nothing Magical About Learning Data Science
      1. Put Aside the Technology Stack
      2. Keep Data Lying Around
      3. Have a Strategy
      4. Hack
      5. Experiment
    3. Data Scientists: Generalists or Specialists?
      1. Early Days
      2. Later Stage
      3. Conclusion
  3. 2. Tools and Architecture for Big Data
    1. Apache Cassandra for Analytics: A Performance and Storage Analysis
      1. Wide Spectrum of Storage Costs and Query Speeds
      2. Summary of Methodology for Analysis
      3. Scan Speeds Are Dominated by Storage Format
      4. Storage Efficiency Generally Correlates with Scan Speed
      5. A Formula for Modeling Query Performance
      6. Can Caching Help? A Little Bit.
      7. The Future: Optimizing for CPU, Not I/O
      8. Filtering and Data Modeling
      9. Cassandra’s Secondary Indices Usually Not Worth It
      10. Predicting Your Own Data’s Query Performance
      11. Conclusions
    2. Scalable Data Science with R
    3. Data Science Gophers
      1. Go, a Cure for Common Data Science Pains
      2. The Go Data Science Ecosystem
      3. Data Gathering, Organization, and Parsing
      4. Arithmetic and Statistics
      5. Exploratory Analysis and Visualization
      6. Machine Learning
      7. Get Started with Go for Data Science
    4. Applying the Kappa Architecture to the Telco Industry
      1. What Is Kappa Architecture?
      2. Building the Analytics Pipeline
      3. Incorporating a Bayesian Model to Do Advanced Analytics
      4. Conclusion
  4. 3. Intelligent Real-Time Applications
    1. The World Beyond Batch Streaming
      1. Streaming 102
    2. Extend Structured Streaming for Spark ML
    3. Semi-Supervised, Unsupervised, and Adaptive Algorithms for Large-Scale Time Series
      1. Surfacing Anomalies
      2. Adaptive, Online, Unsupervised Algorithms at Scale
      3. Discovering Relationships Among KPIs and Semi-Supervised Learning
    4. Related Resources
    5. Uber’s Case for Incremental Processing on Hadoop
      1. Near-Real-Time Use Cases
      2. Incremental Processing via “Mini” Batches
      3. Challenges of Incremental Processing
      4. Takeaways
  5. 4. Cloud Infrastructure
    1. Where Should You Manage a Cloud-Based Hadoop Cluster?
      1. High-Level Differentiators
      2. Cloud Ecosystem Integration
      3. Big Data Is More Than Just Hadoop
      4. Key Takeaways
    2. Spark Comparison: AWS Versus GCP
      1. Submitting Spark Jobs to the Cloud
      2. Configuring Cloud Services
      3. You Get What You Pay For
      4. Performance Comparison
      5. Conclusion
    3. Time-Series Analysis on Cloud Infrastructure Metrics
      1. Infrastructure Usage Data
      2. Scheduled Auto Scaling
      3. Dynamic Auto Scaling
      4. Assess Cost Savings First
  6. 5. Machine Learning: Models and Training
    1. What Is Hardcore Data Science—in Practice?
      1. Computing Recommendations
      2. Bringing Mathematical Approaches into Industry
      3. Understanding Data Science Versus Production
      4. Why Start Small?
      5. Distinguishing a Production System from Data Science
      6. Data Scientists and Developers: Modes of Collaboration
      7. Constantly Adapt and Improve
    2. Training and Serving NLP Models Using Spark MLlib
      1. Constructing Predictive Models with Spark
      2. The Process of Building a Machine-Learning Product
      3. Operationalization
      4. Spark’s Role
      5. Fitting It Into Our Existing Platform with IdiML
      6. Faster, Flexible Performant Systems
    3. Three Ideas to Add to Your Data Science Toolkit
      1. Use a Reusable Holdout Method to Avoid Overfitting During Interactive Data Analysis
      2. Use Random Search for Black-Box Parameter Tuning
      3. Explain Your Black-Box Models Using Local Approximations
    4. Related Resources
    5. Introduction to Local Interpretable Model-Agnostic Explanations (LIME)
      1. Intuition Behind LIME
      2. Examples
      3. Conclusion
  7. 6. Deep Learning and AI
    1. The Current State of Machine Intelligence 3.0
      1. Ready Player World
      2. Why Even Bot-Her?
      3. On to 11111000001
      4. Peter Pan’s Never-Never Land
      5. Inspirational Machine Intelligence
      6. Looking Forward
    2. Hello, TensorFlow!
      1. Names and Execution in Python and TensorFlow
      2. The Simplest TensorFlow Graph
      3. The Simplest TensorFlow Neuron
      4. See Your Graph in TensorBoard
      5. Making the Neuron Learn
      6. Flowing Onward
    3. Compressing and Regularizing Deep Neural Networks
      1. Current Training Methods Are Inadequate
      2. Deep Compression
      3. DSD Training
      4. Generating Image Descriptions
      5. Advantages of Sparsity