Web Operations

Book Description

A web application involves many specialists, but it takes people in web ops to ensure that everything works together throughout an application's lifetime. It's the expertise you need when your start-up gets an unexpected spike in web traffic, or when a new feature causes your mature application to fail. In this collection of essays and interviews, web veterans such as Theo Schlossnagle, Baron Schwartz, and Alistair Croll offer insights into this evolving field. You'll hear stories from the trenches, told by builders of some of the biggest sites on the Web, about what it takes to help a site thrive.

  • Learn the skills needed in web operations, and why they're gained through experience rather than schooling
  • Understand why it's important to gather metrics from both your application and infrastructure
  • Consider common approaches to database architectures and the pitfalls that come with increasing scale
  • Learn how to handle the human side of outages and degradations
  • Find out how one company avoided disaster after a huge traffic deluge
  • Discover how to work out what went wrong after a problem occurs, and how to prevent it from happening again

Contributors include:

John Allspaw

Heather Champ

Michael Christian

Richard Cook

Alistair Croll

Patrick Debois

Eric Florenzano

Paul Hammond

Justin Huff

Adam Jacob

Jacob Loomis

Matt Massie

Brian Moon

Anoop Nagwani

Sean Power

Eric Ries

Theo Schlossnagle

Baron Schwartz

Andrew Shafer

Table of Contents

  1. Dedication
  2. Foreword
  3. Preface
    1. How This Book Is Organized
    2. Who This Book Is For
    3. Conventions Used in This Book
    4. Using Code Examples
    5. How to Contact Us
    6. Safari® Books Online
    7. Acknowledgments
  4. 1. Web Operations: The Career
    1. Why Does Web Operations Have It Tough?
      1. A Strong Background in Computing
      2. Practiced Decisiveness
      3. A Calm Disposition
    2. From Apprentice to Master
      1. Knowledge
      2. Tools
      3. Experience
        1. The organizational challenge of inexperience
        2. The concept of “senior operations”
      4. Discipline
    3. Conclusion
  5. 2. How Picnik Uses Cloud Computing: Lessons Learned
    1. Where the Cloud Fits (and Why!)
      1. Storage
      2. Hybrid Computing with EC2
    2. Where the Cloud Doesn’t Fit (for Picnik)
    3. Conclusion
  6. 3. Infrastructure and Application Metrics
    1. Time Resolution and Retention Concerns
    2. Locality of Metrics Collection and Storage
    3. Layers of Metrics
      1. High-Level Business or Feature-Specific Metrics
      2. System- and Service-Level Metrics
    4. Providing Context for Anomaly Detection and Alerts
    5. Log Lines Are Metrics, Too
    6. Correlation with Change Management and Incident Timelines
    7. Making Metrics Available to Your Alerting Mechanisms
    8. Using Metrics to Guide Load-Feedback Mechanisms
    9. A Metrics Collection System, Illustrated: Ganglia
      1. Background
      2. A Quick Introduction to Ganglia
        1. The need to keep collection and aggregation costs low
        2. The need to automatically discover new nodes and metrics
        3. The need to match network transport with your metrics collection task
        4. The need to implicitly prioritize cluster metrics
        5. The need to aggregate and organize metrics once they’re collected
        6. The need to provide convenient interfaces for creating new metrics and pulling out existing metrics for correlation against other data
    10. Conclusion
  7. 4. Continuous Deployment
    1. Small Batches Mean Faster Feedback
    2. Small Batches Mean Problems Are Instantly Localized
    3. Small Batches Reduce Risk
    4. Small Batches Reduce Overhead
    5. The Quality Defenders’ Lament
      1. Why Does It Work?
    6. Getting Started
      1. Step 1: Continuous Integration Server
      2. Step 2: Source Control Commit Check
      3. Step 3: Simple Deployment Script
      4. Step 4: Real-Time Alerting
      5. Step 5: Root-Cause Analysis (Five Whys)
    7. Continuous Deployment Is for Mission-Critical Applications
      1. Another Release? Do I Have To?
      2. The QA Dilemma
    8. Conclusion
  8. 5. Infrastructure As Code
    1. Service-Oriented Architecture
      1. Configuration Management
        1. Configuration management is policy driven
        2. System automation is configuration management policy made into code
        3. Configuration management in system administration
      2. System Integration
        1. Step 1: Break the infrastructure down into reusable, network-accessible services
          1. The bootstrapping service.
          2. The configuration service.
        2. Step 2: Integrate the services together
    2. Conclusion
  9. 6. Monitoring
    1. Story: “The Start of a Journey”
    2. Step 1: Understand What You Are Monitoring
    3. Step 2: Understand Normal Behavior
    4. Step 3: Be Prepared and Learn
    5. Conclusion
  10. 7. How Complex Systems Fail
    1. How Complex Systems Fail
      1. (Being a Short Treatise on the Nature of Failure; How Failure Is Evaluated; How Failure Is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)
        1. Complex systems are intrinsically hazardous systems
        2. Complex systems are heavily and successfully defended against failure
        3. Catastrophe requires multiple failures–single-point failures are not enough
        4. Complex systems contain changing mixtures of failures latent within them
        5. Complex systems run in degraded mode
        6. Catastrophe is always just around the corner
        7. Post-accident attribution to a “root cause” is fundamentally wrong
        8. Hindsight biases post-accident assessments of human performance
        9. Human operators have dual roles: as producers and as defenders against failure
        10. All practitioner actions are gambles
        11. Actions at the sharp end resolve all ambiguity
        12. Human practitioners are the adaptable element of complex systems
        13. Human expertise in complex systems is constantly changing
        14. Change introduces new forms of failure
        15. Views of “cause” limit the effectiveness of defenses against future events
        16. Safety is a characteristic of systems and not of their components
        17. People continuously create safety
        18. Failure-free operations require experience with failure
      2. As It Pertains Specifically to Web Operations
        1. It will be difficult to tell that the system has failed
        2. It will be difficult to tell what has failed
        3. Meaningful response will be delayed
        4. Communications will be strained and tempers will flare
        5. Maintenance will be a major source of new failures
        6. Recovery from backup is itself difficult and potentially dangerous
        7. Create test procedures that front-line people can use to verify system status
        8. Manage operations on a daily basis
        9. Control maintenance
        10. Assess performance at regular intervals
        11. Be a (unique) customer
    2. Further Reading
  11. 8. Community Management and Web Operations
  12. 9. Dealing with Unexpected Traffic Spikes
    1. How It All Started
    2. Alarms Abound
    3. Putting Out the Fire
    4. Surviving the Weekend
    5. Preparing for the Future
    6. CDN to the Rescue
    7. Proxy Servers
    8. Corralling the Stampede
    9. Streamlining the Codebase
    10. How Do We Know It Works?
    11. The Real Test
    12. Lessons Learned
    13. Improvements Since Then
  13. 10. Dev and Ops Collaboration and Cooperation
    1. Deployment
    2. Shared, Open Infrastructure
    3. Trust
    4. On-call Developers
      1. Live Debugging Tools
      2. Feature Flags
    5. Avoiding Blame
    6. Conclusion
  14. 11. How Your Visitors Feel: User-Facing Metrics
    1. Why Collect User-Facing Metrics?
      1. Successful Start-ups Learn and Adapt
      2. Performance Matters
      3. Recent Research Quantifies the Relationship
    2. What Makes a Site Slow?
      1. Service Discovery
      2. Sending the Request
      3. Thinking About the Response
      4. Delivering the Response
      5. Asynchronous Traffic and Refresh
      6. Rendering Time
    3. Measuring Delay
      1. Synthetic Monitoring
        1. When to use synthetic monitoring
        2. Limitations of synthetic monitoring
        3. Configuring synthetic monitoring
      2. Real User Monitoring
        1. When to use RUM
        2. Limitations of RUM
        3. Configuring RUM
    4. Building an SLA
      1. Apdex
    5. Visitor Outcomes: Analytics
      1. How Marketing Defines Success
      2. The Four Kinds of Sites
      3. A (Very) Basic Model of Analytics
      4. Correlating Performance and Analytics by Time
      5. Correlating Performance and Analytics by Visits
    6. Other Metrics Marketing Cares About
      1. Web Interaction Analytics
      2. Voice of the Customer
    7. How User Experience Affects Web Ops
      1. Many More Stakeholders
      2. Monitoring As Part of the Life Cycle, Not Just QA
    8. The Future of Web Monitoring
      1. Moving from Parts to Users
      2. Service-Centric Architectures
      3. Clouds and Monitoring
      4. APIs and RSS Feeds
        1. Delivering an API to others
        2. Consuming an API from someone else
      5. Rich Internet Applications
      6. HTML5: Server-Sent Events and WebSockets
      7. Online Communities and the Long Funnel
      8. Tying Together Mail and Conversion Loops
      9. The Capacity/Cost/Revenue Equation
    9. Conclusion
  15. 12. Relational Database Strategy and Tactics for the Web
    1. Requirements for Web Databases
      1. Always On
      2. Mostly Transactional Workload
      3. Simple Data, Simple Queries
      4. Availability Trumps Consistency
      5. Rapid Development
      6. Online Deployment
      7. Built by Developers
    2. How Typical Web Databases Grow
      1. Single Server
      2. Master and Replication Slaves
      3. Functional Partitioning
      4. Sharding, or Horizontal Partitioning
      5. Caching Layer
    3. The Yearning for a Cluster
      1. The CAP Theorem and ACID Versus BASE
      2. State of MySQL Clustering
        1. DRBD and Heartbeat
        2. Master-Master Replication Manager (MMM)
        3. Heartbeat with replication
        4. Proxy-based solutions
        5. InfiniDB, Galera, Tungsten, and ScaleDB
        6. Summary
    4. Database Strategy
      1. Architecture Requirements
        1. Easy wins
      2. Safe-Bet Architectures
      3. Risky Architectures
        1. Sharding
        2. Writing to more than one master
        3. Multilevel replication
        4. Ring replication (beyond two nodes)
        5. Reliance on DNS
        6. The so-called Entity-Attribute-Value (EAV) design pattern
    5. Database Tactics
      1. Taking Backups on a Slave
      2. Online Schema Changes
      3. Monitoring, Graphing, and Instrumentation
      4. Analyzing Performance
      5. Archiving and Purging Data
    6. Conclusion
  16. 13. How to Make Failure Beautiful: The Art and Science of Postmortems
    1. The Worst Postmortem
    2. What Is a Postmortem?
    3. When to Conduct a Postmortem
    4. Who to Invite to a Postmortem
    5. Running a Postmortem
    6. Postmortem Follow-Up
    7. Conclusion
  17. 14. Storage
    1. Data Asset Inventory
    2. Data Protection
    3. Capacity Planning
    4. Storage Sizing
    5. Operations
    6. Conclusion
  18. 15. Nonrelational Databases
    1. NoSQL Database Overview
      1. Pure Key/Value
      2. Data Structure
      3. Graph
      4. Document Oriented
      5. Highly Distributed
    2. Some Systems in Detail
      1. Cassandra
      2. HBase
      3. Riak
      4. CouchDB
      5. MongoDB
      6. Redis
    3. Conclusion
  19. 16. Agile Infrastructure
    1. Agile Infrastructure
      1. But Agile Is Not the Only Thing That Has Evolved
      2. Some People Are Born to Web Operations, Some People Have Web Operations Thrust upon Them...
      3. Working Software Is the Primary Measure of Progress
      4. The Application Is the Infrastructure, the Infrastructure Is the Application
    2. So, What’s the Problem?
      1. Talk Does Not Cook Rice
        1. The infrastructure is an application
        2. Version control: The foundation of sanity
        3. Configuration management and automated deployments
        4. Monitoring
        5. Dev-test-prod life cycle, continuous integration, and disaster recovery
        6. Radiate information
        7. Reflective process improvement
        8. Incremental changes and refactoring
        9. The simplest thing that could work
        10. Separation of concerns
        11. Technical debt
        12. Continuous deployment
        13. Pairing
        14. Managing flow
    3. Communities of Interest and Practice
    4. Trading Zones and Apologies
      1. What to Do?
    5. Conclusion
  20. 17. Things That Go Bump in the Night (and How to Sleep Through Them)
    1. Definitions
    2. How Many 9s?
    3. Impact Duration Versus Incident Duration
    4. Datacenter Footprint
    5. Gradual Failures
    6. Trust Nobody
    7. Failover Testing
    8. Monitoring and History of Patterns
    9. Getting a Good Night’s Sleep
  21. A. Contributors
  22. Index
  23. About the Authors
  24. Colophon
  25. Copyright