Creating a Data-Driven Enterprise with DataOps

Book Description

Many companies are busy collecting massive amounts of data, but few are taking advantage of this treasure hoard to build a truly insights-driven organization. To do so, the data team must democratize both the data and the insights drawn from it, giving all employees in the organization real-time access. This report explores DataOps: the processes, culture, tools, and people required to scale big data pervasively across the enterprise.

Just as DevOps has enabled organizations to improve coordination between developers and the operations team, DataOps closely connects everyone who handles data, including engineers, data scientists, analysts, and business users. Democratizing data with this approach requires removing barriers typical of siloed data, teams, and systems.

In this report, Apache Hive creators Ashish Thusoo and Joydeep Sen Sarma examine the characteristics of a data-driven organization that supports a self-service model.

  • Explore related topics such as data lakes, metadata, cloud architecture, and data-infrastructure-as-a-service
  • Examine conclusions from a survey of more than 400 senior executives whose companies are in various stages of data maturity
  • Learn how data pioneers at Facebook, Uber, LinkedIn, Twitter, and eBay created data-driven cultures and self-service data infrastructures for their organizations

Table of Contents

  1. Acknowledgments
  2. I. Foundations of a Data-Driven Enterprise
  3. 1. Introduction
    1. The Journey Begins
    2. The Emergence of the Data-Driven Organization
    3. Moving to Self-Service Data Access
    4. The Emergence of DataOps
    5. In This Book
  4. 2. Data and Data Infrastructure
    1. A Brief History of Data
    2. The Evolution of Data to “Big Data”
    3. Challenges with Big Data
    4. The Evolution of Analytics
    5. Components of a Big Data Infrastructure
      1. The Data “Supply Chain”
      2. Different Types of Analyses (and Related Tools)
    6. How Companies Adopt Data: The Maturity Model
      1. Stage 1: Aspiration
      2. Stage 2: Experiment
      3. Stage 3: Expansion
      4. Stage 4: Inversion
      5. Stage 5: Nirvana
    7. How Facebook Moved Through the Stages of Data Maturity
    8. Summary
  5. 3. Data Warehouses Versus Data Lakes: A Primer
    1. Data Warehouse: A Definition
    2. What Is a Data Lake?
    3. Key Differences Between Data Lakes and Data Warehouses
    4. When Facebook’s Data Warehouse Ran Out of Steam
    5. Is Using Either/Or a Possible Strategy?
    6. Common Misconceptions
      1. Data Warehouses are Dead
      2. Data Warehouses Will Become Data Lakes
    7. Difficulty Finding Qualified Personnel
    8. Summary
  6. 4. Building a Data-Driven Organization
    1. Creating a Self-Service Culture
      1. Fostering a Culture of Data-Driven Decision-Making
      2. Tips on Building a Data-Driven Culture
      3. Potential Roadblocks to Becoming a Self-Service Data Culture
    2. Organizational Structure That Supports a Self-Service Culture
      1. How the Hub-and-Spoke Model Works
      2. Training Is Essential for Data Analysts
    3. Roles and Responsibilities
      1. A Central Forum for Coming Together
    4. Summary
  7. 5. Putting Together the Infrastructure to Make Data Self-Service
    1. Technology That Supports the Self-Service Model
    2. Tools Used by Producers and Consumers of Data
    3. The Importance of a Complete and Integrated Data Infrastructure
    4. The Importance of Resource Sharing in a Self-Service World
      1. Multitenancy and Fair Sharing
      2. Protecting Against Inadvertent Misuse
    5. Security and Governance
    6. Self Help Support for Users
      1. Authoring and Tuning of Queries
    7. Monitoring Resources and Chargebacks
    8. The “Big Compute Crunch”: How Facebook Allocates Data Infrastructure Resources
    9. Using the Cloud to Make Data Self Service
    10. Summary
  8. 6. Cloud Architecture and Data Infrastructure-as-a-Service
    1. Five Properties of the Cloud
      1. Scalability
      2. Elasticity
      3. Self-Service and Collaboration
      4. Cost Effectiveness
      5. Monitoring and Usage Tracking
    2. Cloud Architecture
      1. Separation of Compute and Storage
      2. Multitenancy and Security
      3. Why “Lift and Shift” to the Cloud Is Not Possible
    3. Objections About the Cloud Refuted
      1. “The Cloud Isn’t Secure”
      2. “The Cloud Isn’t Compliant”
      3. “The Cloud Is More Expensive”
    4. What About a Private Cloud?
    5. Data Platforms for Data 2.0
    6. Summary
  9. 7. Metadata and Big Data
    1. The Three Types of Metadata
      1. Descriptive Metadata
      2. Structural Metadata
      3. Administrative Metadata
    2. The Challenges of Metadata
    3. Effectively Managing Metadata
    4. Summary
  10. 8. A Maturity-Model “Reality Check” for Organizations
    1. Organizations Understand the Need for Big Data, But Reach Is Still Limited
    2. Significant Challenges Remain
    3. Summary
  11. II. Case Studies
  12. 9. LinkedIn: The Road to Data Craftsmanship
    1. Tracking and DALI
    2. Faster Access to Data and Insights
    3. Organizational Structure of the Data Team
    4. The Move to Self-Service
  13. 10. Uber: Driven to Democratize Data
    1. Uber’s First Data Challenge: Too Popular
    2. Uber’s Second Data Challenge: Scalability
      1. Hiring to Scale: Six Roles to Fill
      2. Technical Scalability
      3. Critical Open Source Technologies Used by Uber
    3. Making Data Democratic
      1. Self-Serve Platforms
      2. Supporting Users
  14. 11. Twitter: When Everything Happens in Real Time
    1. Twitter Develops Heron
    2. Seven Different Use Cases for Real-Time Streaming Analytics
    3. Advice to Companies Seeking to Be Data-Driven
    4. Looking Ahead
  15. 12. Capture All Data, Decide What to Do with It Later: My Experience at eBay
    1. Ensuring “CAP-R” in Your Data Infrastructure
      1. Organizational Structure
      2. Governing and Democratizing Data
    2. Personalization: A Key Benefit of Data-Driven Culture
    3. Building Data Tools and Giving Back to the Open Source Community
    4. The Importance of Machine Learning
    5. Looking Ahead
  16. A. A Podcast Interview Transcript