SOFTWARE SHOULD BE DESIGNED, WRITTEN, AND DEPLOYED IN SMALL BATCHES. Doing so is good for developers, the product, and operations, too.
The batch size is the unit at which work products move between stages in a development process. For software, the easiest batch to see is code. Every time an engineer checks in code, he is batching up a certain amount of work. There are many techniques for controlling these batches, ranging from the tiny batches needed for continuous deployment to more traditional branch-based development, where all of the code from multiple developers working for weeks or months is batched up and integrated together.
It turns out that there are tremendous benefits from working with a batch size radically smaller than traditional practice suggests. In my experience, a few hours of coding is enough to produce a viable batch and is worth checking in and deploying. Similar results apply in product management, design, testing, and even operations. This is actually a hard case to make, because most of the benefits of small batches are counterintuitive.
The sooner you pass your work on to a later stage, the sooner you can find out how that next stage will receive it. If you’re not used to working in this way, it may seem annoying to get interrupted so soon after you were “done” with something, instead of just working it all out by yourself. But these interruptions are actually much more efficient when you get them soon, because you’re that much more likely to remember what you were working on. And as we’ll see in a moment, you may also be busy building subsequent parts that depend on mistakes you made in earlier steps. The sooner you find out about these dependencies, the less time you’ll waste having to unwind them.
This is easiest to see in deployment. When something goes wrong with production software, it’s almost always because of an unintended side effect of some piece of code. Think about the last time you were called upon to debug a problem like that. How much of the time you spent debugging was actually dedicated to fixing the problem, compared to the time it took to track down where the bug originated?
Amongst many Yahoo! properties, including the largest ones, which have quite dialed-in ops teams, Flickr’s MTTD (Mean Time To Detect) was insanely low because of this. Since only a handful of lines change in any given deploy, changes that do cause regressions or unexpected performance issues are quickly identified and fixed. And of course, the MTTR (Mean Time To Resolve) is much lower as well, because the number of changes needed to fix or roll back is not only finite, but also small.
An example of this is integration risk, which we use continuous integration (http://startuplessonslearned.com/2008/12/continuous-integration-step-by-step.html) to mitigate. Integration problems happen when two people make incompatible changes to some part of the system. These come in all shapes and sizes. You can have code that depends on a certain configuration that’s deployed on production. If that configuration changes before the code is deployed, the person who changes it won’t know he’s introduced a problem. That code is now a ticking time bomb, waiting to cause trouble when it’s deployed.
When the explosion comes, it’s usually operations that bears the brunt. After all, it would never have happened without that change in configuration (never mind that it also wouldn’t have happened without that new code being written, either). New code is generally perceived as valuable forward progress. Configuration changes are a necessary overhead. Reducing the odds of them colliding makes everyone’s life better. This is counterintuitive. It seems like having more releases will lead to increased odds of things going wrong. As we’ll see, that’s not actually correct. Slowing down the release process doesn’t actually reduce the total number of changes—it just combines them into ever-larger batches.
In my experience, this is the most counterintuitive effect of small batches. Most organizations have their batch size tuned so as to reduce their overhead. For example, if QA takes a week to certify a release, it’s likely that the company does releases no more than once every 30 or 60 days. Telling such a company that it should work in a two-week batch size sounds absurd—the company would spend 50% of its time waiting for QA to certify the release! But this argument is not quite right. This is something so surprising that I didn’t really believe it the first few times I saw it in action. It turns out that organizations get better at the things they do very often. So, when we start checking in code more often, releasing more often, or conducting more frequent design reviews, we can actually do a lot to make those steps dramatically more efficient.
Of course, that doesn’t necessarily mean we will make those steps more efficient. A common line of argument is: if we have the power to make a step more efficient, why don’t we invest in that infrastructure first, and then reduce the batch size as we lower the overhead? This makes sense, and yet it rarely works. The bottlenecks that large batches cause are often hidden; it takes work to make them evident, and even more work to invest in fixing them. When the existing system is working “good enough,” these projects inevitably languish.
These changes pay increasing dividends, because each improvement now directly frees up somebody in QA or operations while also reducing the total time required for the certification step. Those freed-up resources might be able to spend some of that time helping the development team actually prevent bugs in the first place, or just take on some of their routine work. That frees up even more development resources, and so on. Pretty soon, the team can be developing and testing in a continuous feedback loop, addressing micro-bottlenecks the moment they appear. If you’ve never had the chance to work in an environment like this, I highly recommend you try it. I doubt you’ll go back.
Let me show you what this looked like for the operations and engineering teams at IMVU (http://www.imvu.com/). We had made so many improvements to our tools and processes for deployment that it was pretty hard to take the site down. We had five strong levels of defense:
Each engineer had his own sandbox that mimicked production as closely as possible (whenever it diverged, we’d inevitably find out in a “Five Whys” [http://startuplessonslearned.com/2008/11/five-whys.html] shortly thereafter).
We had a comprehensive set of unit, acceptance, functional, and performance tests, and practiced test-driven development (TDD) across the whole team. Our engineers built a series of test tags, so you could quickly run a subset of tests in your sandbox that you thought were relevant to your current project or feature.
One hundred percent of those tests ran, via a continuous integration cluster, after every check-in. When a test failed, it would prevent that revision from being deployed.
When someone wanted to do a deployment, we had a completely automated system that we called the cluster immune system. This would deploy the change incrementally, one machine at a time. That process would continually monitor the health of those machines, as well as the cluster as a whole, to see if the change was causing problems. If it didn’t like what was going on, it would reject the change, do a fast revert, and lock deployments until someone investigated what went wrong.
We had a comprehensive set of Nagios alerts that would trigger a pager in operations if anything went wrong. Because Five Whys kept turning up a few key metrics that were hard to set static thresholds for, we even had a dynamic prediction algorithm that would make forecasts based on past data and fire alerts if the metric ever went out of its normal bounds.
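One simple way to build a dynamic bound of this kind (a sketch of the general idea, not the specific algorithm we used) is to flag a metric whenever it strays more than a few standard deviations from its recent history:

```python
from statistics import mean, stdev

def out_of_bounds(history, current, k=3.0):
    """Return True if `current` is more than k standard deviations
    away from the mean of the recent `history` samples."""
    if len(history) < 2:
        return False  # not enough data to form a baseline yet
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu  # flat history: any change is anomalous
    return abs(current - mu) > k * sigma
```

In practice you would feed this a sliding window of recent samples per metric, so the "normal bounds" track daily and weekly rhythms instead of a single static threshold.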
So, if you had been able to sneak over to the desk of one of our engineers, log in to his machine, and secretly check in an infinite loop on some highly trafficked page, here’s what would have happened. Somewhere between 10 and 20 minutes later, he would have received an email with a message that read something like this:
Thank you so much for attempting to check in revision 1234. Unfortunately, that is a terrible idea, and your change has been reverted. We’ve also alerted the whole team to what’s happened and look forward to you figuring out what went wrong.
Best of luck,
(OK, that’s not exactly what it said, but you get the idea.)
The goal of continuous deployment is to help development teams drive waste out of their process by simultaneously reducing the batch size (http://startuplessonslearned.com/2009/02/work-in-small-batches.html) and increasing the tempo of their work. This makes it possible for teams to get—and stay—in a condition of flow for sustained periods. This condition makes it much easier for teams to innovate, experiment, and achieve sustained productivity, and it nicely complements other continuous improvement systems, such as Five Whys, which we’ll discuss later in this chapter.
One large source of waste in development is double-checking. For example, imagine a team operating in a traditional waterfall development system, without continuous deployment, TDD, or continuous integration. When a developer wants to check in code, or an ops staff member thinks he’s ready to push a release, this is a very scary moment. He has a choice: do it now, or double-check to make sure everything still works and looks good. Both options are attractive. If he proceeds now, he can claim the rewards of being done sooner. On the other hand, if he causes a problem, his previous speed will be counted against him. Why didn’t he spend just another five minutes making sure he didn’t cause that problem? In practice, how people respond to this dilemma is determined by their incentives, which are driven by the culture of their team. How severely is failure punished? Who will ultimately bear the cost of their mistakes? How important are schedules? Does the team value finishing early?
But the thing to notice in this situation is that there is really no right answer. People who agonize over the choice reap the worst of both worlds. As a result, people will tend toward two extremes: those who believe in getting things done as fast as possible, and those who believe that work should be carefully checked. Any intermediate position is untenable over the long term. When things go wrong, any nuanced explanation of the trade-offs involved is going to sound unsatisfying. After all, you could have acted a little sooner or a little more carefully—if only you’d known what the problem was going to be in advance. Viewed through the lens of hindsight, most of those judgments look bad. On the other hand, an extreme position is much easier to defend. Both have built-in excuses: “Sure there were a few bugs, but I consistently overdeliver on an intense schedule, and it’s well worth it,” or “I know you wanted this done sooner, but you know I only ever deliver when it’s absolutely ready and it’s well worth it.”
These two extreme positions lead to factional strife, which is extremely unpleasant. Managers start to make a note of who’s part of which faction and then assign projects accordingly. Got a crazy last-minute feature? Get the Cowboys to take care of it—and then let the Quality Defenders clean it up in the next release. Both sides start to think of their point of view in moralistic terms: “Those guys don’t see the economic value of fast action, they only care about their precious architecture diagrams,” or “Those guys are sloppy and have no professional pride.” Having been called upon to mediate these disagreements many times in my career, I can attest to just how wasteful they are.
However, they are completely logical outgrowths of a large-batch-size development process that forces developers to make trade-offs between time and quality, using the old “time, quality, money: pick two” fallacy (http://startuplessonslearned.com/2008/10/engineering-managers-lament.html). Because feedback is slow in coming, the damage caused by a mistake is felt long after the decisions that caused it were made, making learning difficult. Because everyone gets ready to integrate with the release batch around the same time (there being no incentive to integrate early), conflicts are resolved under extreme time pressure. Features are chronically on the bubble, about to get deferred to the next release. But when they do get deferred, they tend to have their scope increased (“After all, we have a whole release cycle, and it’s almost done...”), which leads to yet another time crunch, and so on. And of course, the code rarely performs in production the way it does in the testing or staging environment, which leads to a series of hotfixes immediately following each release. These come at the expense of the next release batch, meaning that each release cycle starts off behind.
You can’t change the underlying incentives of this situation by getting better at any one activity. Better release planning, estimating, architecting, or integrating will only mitigate the symptoms. The only traditional technique for solving this problem is to add in massive queues in the form of schedule padding, extra time for integration, code freezes, and the like. In fact, most organizations don’t realize just how much of this padding is already going on in the estimates that individual contributors learn to generate. But padding doesn’t help, because it serves to slow down the whole process. And as all development teams will tell you, time is always short. In fact, excess time pressure is exactly why they think they have these problems in the first place.
So, we need to find solutions that operate at the system level to break teams out of this pincer action. The Agile software movement has made numerous contributions: continuous integration, which helps accelerate feedback about defects; story cards and kanban boards, which reduce batch size; and daily stand-ups, which increase tempo. Continuous deployment is another such technique, one with a unique power to change development team dynamics for the better.
First, continuous deployment separates two different definitions of the term release. One is used by engineers to refer to the process of getting code fully integrated into production. Another is used by marketing to refer to what customers see. In traditional batch-and-queue development, these two concepts are linked. All customers will see the new software as soon as it’s deployed. This requires that all of the testing of the release happens before it is deployed to production, in special staging or testing environments. And this leaves the release vulnerable to unanticipated problems during this window of time: after the code is written but before it’s running in production. On top of that overhead, by conflating the marketing release with the technical release, the amount of coordination overhead required to ship something is also dramatically increased.
Under continuous deployment, as soon as code is written it’s on its way to production. That means we are often deploying just 1% of a feature—long before customers would want to see it. In fact, most of the work involved with a new feature is not the user-visible parts of the feature itself. Instead, it’s the millions of tiny touch points that integrate the feature with all the other features that were built before. Think of the dozens of little API changes that are required when we want to pass new values through the system. These changes are generally supposed to be “side-effect free,” meaning they don’t affect the behavior of the system at the point of insertion—emphasis on supposed. In fact, many bugs are caused by unusual or unnoticed side effects of these deep changes. The same is true of small changes that only conflict with configuration parameters in the production environment. It’s much better to get this feedback as soon as possible, which continuous deployment offers.
Continuous deployment also acts as a speed regulator. Every time the deployment process encounters a problem, a human being needs to get involved to diagnose it. During this time, it’s intentionally impossible for anyone else to deploy. When teams are ready to deploy, but the process is locked, they become immediately available to help diagnose and fix the deployment problem (the alternative—that they continue to generate, but not deploy, new code—just serves to increase batch sizes to everyone’s detriment). This speed regulation is a tricky adjustment for teams that are accustomed to measuring their progress via individual efficiency. In such a system, the primary goal of each engineer is to stay busy, using as close to 100% of his time for coding as possible. Unfortunately, this view ignores the team’s overall throughput. Even if you don’t adopt a radical definition of progress, such as the “validated learning about customers” definition (http://startuplessonslearned.com/2009/04/validated-learning-about-customers.html) that I advocate, it’s still suboptimal to keep everyone busy. When you’re in the midst of integration problems, any code that someone is writing is likely to have to be revised as a result of conflicts. The same is true with configuration mismatches or multiple teams stepping on one another’s toes. In such circumstances, it’s much better for overall productivity for people to stop coding and start talking. Once they figure out how to coordinate their actions so that the work they are doing doesn’t have to be reworked, it’s productive to start coding again.
Returning to our development team divided into Cowboy and Quality factions, let’s take a look at how continuous deployment can change the calculus of their situation. For one, continuous deployment fosters learning and professional development—on both sides of the divide. Instead of having to argue with each other about the right way to code, each individual has an opportunity to learn directly from the production environment. This is the meaning of the axiom to “let your defects be your teacher.”
If an engineer has a tendency to ship too soon, he will tend to find himself grappling with the cluster immune system (http://startuplessonslearned.com/2008/09/just-in-time-scalability.html), continuous integration server, and Five Whys master more often. These encounters, far from being the high-stakes arguments inherent in traditional teams, are actually low-risk, mostly private or small-group affairs. Because the feedback is rapid, Cowboys will start to learn what kinds of testing, preparation, and checking really do let them work faster. They’ll be learning the key truth that there is such a thing as “too fast”—many quality problems actually slow you down.
Engineers who have a tendency to wait too long before shipping also have lessons to learn. For one, the larger the batch size of their work, the harder it will be to get it integrated. At IMVU, we would occasionally hire someone from a more traditional organization who had a hard time letting go of his “best practices” and habits. Sometimes he’d advocate for doing his work on a separate branch and integrating only at the end. Although I’d always do my best to convince such people otherwise, if they were insistent I would encourage them to give it a try. Inevitably, a week or two later I’d enjoy the spectacle of watching them engage in something I called “code bouncing.” It’s like throwing a rubber ball against a wall. In a code bounce, someone tries to check in a huge batch. First he has integration conflicts, which requires talking to various people on the team to know how to resolve them properly. Of course, while he is resolving the conflicts, new changes are being checked in. So, new conflicts appear. This cycle repeats for a while, until he either catches up to all the conflicts or just asks the rest of the team for a general check-in freeze. Then the fun part begins. Getting a large batch through the continuous integration server, incremental deploy system, and real-time monitoring system almost never works on the first try. Thus, the large batch gets reverted. While the problems are being fixed, more changes are being checked in. Unless we freeze the work of the whole team, this can go on for days. But if we do engage in a general check-in freeze, we’re driving up the batch size of everyone else—which will lead to future episodes of code bouncing. In my experience, just one or two episodes is enough to cure anyone of his desire to work in large batches.
Because continuous deployment encourages learning, teams that practice it are able to get faster over time. That’s because each individual’s incentives are aligned with the goals of the whole team. Each person works to drive down waste in his own work, and this true efficiency gain more than offsets the incremental overhead of having to build and maintain the infrastructure required to do continuous deployment. In fact, if you practice Five Whys too, you can build this entire infrastructure in a completely incremental fashion. It’s really a lot of fun.
Continuous deployment is controversial. When most people first hear about continuous deployment, they think I’m advocating low-quality code (http://www.developsense.com/2009/03/50-deployments-day-and-perpetual-beta.html) or an undisciplined Cowboy-coding development process (http://lastinfirstout.blogspot.com/2009/03/continuous-deployment-debate.html). On the contrary, I believe that continuous deployment requires tremendous discipline and can greatly enhance software quality, by applying a rigorous set of standards to every change to prevent regressions, outages, or harm to key business metrics. Another common reaction I hear to continuous deployment is that it’s too complicated, it’s time-consuming, or it’s hard to prioritize. It’s this latter fear that I’d like to address head-on in this chapter. Although it is true that the full system we use to support deploying 50 times a day at IMVU is elaborate, it certainly didn’t start that way. By making a few simple investments and process changes, any development team can be on their way to continuous deployment. It’s the journey, not the destination, which counts. Here’s the why and how, in five steps.
This is the backbone of continuous deployment. We need a centralized place where all automated tests (unit tests, functional tests, integration tests, everything) can be run and monitored upon every commit. Many fine, free software tools are available to make this easy—I have had success with Buildbot (http://buildbot.net). Whatever tool you use, it’s important that it can run all the tests your organization writes, in all languages and frameworks.
If you have only a few tests (or even none at all), don’t despair. Simply set up the continuous integration server and agree to one simple rule: we’ll add a new automated test every time we fix a bug. If you follow that rule, you’ll start to immediately get testing where it’s needed most: in the parts of your code that have the most bugs and therefore drive the most waste for your developers. Even better, these tests will start to pay immediate dividends by propping up that most-unstable code and freeing up a lot of time that used to be devoted to finding and fixing regressions (a.k.a. firefighting).
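To make the rule concrete, suppose a bug report shows that a (hypothetical) parse_price helper crashes on comma-grouped inputs like “$1,200”. Under this rule, the fix ships together with a regression test that pins the behavior down:

```python
def parse_price(text):
    """Parse a price string like '$1,200.50' into cents.

    The fix: strip the thousands separators that previously
    caused float() to raise a ValueError.
    """
    cleaned = text.strip().lstrip("$").replace(",", "")
    return round(float(cleaned) * 100)

# The regression test added alongside the fix, so the continuous
# integration server guards this code path forever after:
def test_parse_price_handles_thousands_separator():
    assert parse_price("$1,200") == 120000
    assert parse_price("$1,200.50") == 120050
    assert parse_price("3.99") == 399
```

The names here are illustrative; the point is the pairing: every bug fix arrives with a test that would have caught it.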
If you already have a lot of tests, make sure the continuous integration server spends only a small amount of time on a full run: 10 to 30 minutes at most. If that’s not possible, simply partition the tests across multiple machines until you get the time down to something reasonable.
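The partitioning doesn’t need to be clever at first; even round-robin assignment of test files to machines (a minimal sketch) divides the wall-clock time roughly evenly:

```python
def partition_tests(test_files, n_machines):
    """Assign test files to machines round-robin.

    Sorting first makes the assignment deterministic, so every
    build distributes the same tests to the same machines.
    """
    buckets = [[] for _ in range(n_machines)]
    for i, test in enumerate(sorted(test_files)):
        buckets[i % n_machines].append(test)
    return buckets
```

A natural refinement, once you record test timings, is to assign files greedily by historical duration instead of by count.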
For more on the nuts and bolts of setting up continuous integration, see “Continuous integration step-by-step” (http://startuplessonslearned.com/2008/12/continuous-integration-step-by-step.html).
The next piece of infrastructure we need is a source control server with a commit-check script. I’ve seen this implemented with CVS (http://www.nongnu.org/cvs), Subversion, or Perforce and have no reason to believe it isn’t possible in any source control system. The most important thing is that you have the opportunity to run custom code at the moment a new commit is submitted but before the server accepts it. Your script should have the power to reject a change and report a message back to the person attempting to check in. This is a very handy place to enforce coding standards, especially those of the mechanical variety.
But its role in continuous deployment is much more important. This is the place you can control what I like to call “the production line,” to borrow a metaphor from manufacturing. When something is going wrong with our systems at any place along the line, this script should halt new commits. So, if the continuous integration server runs a build and even one test breaks, the commit script should prohibit new code from being added to the repository. In subsequent steps, we’ll add additional rules that also “stop the line,” and therefore halt new commits.
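A minimal version of such a commit check might look like this sketch, where a halt file records why the line is stopped (the file path and messages are assumptions of the example; your version might instead query the continuous integration server directly):

```python
import os
import sys

# Written by the CI server or deploy tooling when something breaks.
HALT_FILE = "/var/run/production_line_halted"

def check_commit(halt_file=HALT_FILE):
    """Return (allowed, message). Wire this into your SCM's pre-commit hook."""
    if os.path.exists(halt_file):
        with open(halt_file) as f:
            reason = f.read().strip() or "unknown failure"
        return False, "Commit rejected: production line is halted (%s)" % reason
    return True, "ok"

if __name__ == "__main__":
    allowed, message = check_commit()
    if not allowed:
        sys.stderr.write(message + "\n")
        sys.exit(1)  # a nonzero exit makes the SCM reject the commit
```

The same check function can be reused by the deployment script in a later step, so commits and deploys obey one shared notion of whether the line is open.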
This sets up the first important feedback loop that you need for continuous deployment. Our goal as a team is to work as fast as we can reliably produce high-quality code—and no faster. Going any “faster” is actually just creating delayed waste that will slow us down later. (This feedback loop is also discussed in detail at http://startuplessonslearned.com/2008/12/continuous-integration-step-by-step.html.)
At IMVU, we built a serious deployment script that incrementally deploys software machine by machine and monitors the health of the cluster and the business along the way so that it can do a fast revert if something looks amiss. We call it a cluster immune system (http://www.slideshare.net/olragon/just-in-time-scalability-agile-methods-to-support-massive-growth-presentation-presentation-925519). But we didn’t start out that way. In fact, attempting to build a complex deployment system like that from scratch is a bad idea.
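For concreteness, the essence of such an immune system is a small loop (an illustrative sketch, not our production code; deploy_to, is_healthy, and revert stand in for whatever deploy and monitoring commands your environment provides):

```python
import time

def immune_system_deploy(revision, machines, deploy_to, is_healthy, revert,
                         settle_seconds=60):
    """Deploy `revision` one machine at a time, watching cluster health.

    Rolls everything back and returns False the moment any health
    check fails; returns True only if every machine takes the change.
    """
    deployed = []
    for machine in machines:
        deploy_to(machine, revision)
        deployed.append(machine)
        time.sleep(settle_seconds)  # let metrics settle before judging
        if not all(is_healthy(m) for m in deployed):
            for m in reversed(deployed):  # fast revert, newest first
                revert(m)
            return False  # caller locks deployments until someone investigates
    return True
```

The loop itself is trivial; the real investment is in the health checks it consults, which is why it should be grown incrementally rather than built up front.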
Instead, start simple. It’s not even important that you have an automated process, although as you practice you will get more automated over time. Rather, it’s important that you do every deployment the same way and have a clear and published process for how to do it that you can evolve over time.
For most websites, I recommend starting with a simple script that just rsyncs code to a version-specific directory on each target machine. If you are facile with Unix symlinks (http://www.mikerubel.org/computers/rsync_snapshots/), you can pretty easily set this up so that advancing to a new version (and hence, rolling back) is as easy as switching a single symlink on each server. But even if that’s not appropriate for your setup, have a single script that does a deployment directly from source control.
When you want to push new code to production, require that everyone uses this one mechanism. Keep it manual, but simple, so that everyone knows how to use it. And most importantly, have it obey the same “production line” halting rules as the commit script. That is, make it impossible to do a deployment for a given revision if the continuous integration server hasn’t yet run and had all tests pass for that revision.
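Putting those pieces together, a first-cut deploy step along these lines might look like the following sketch (the releases/ directory layout and the ci_passed callback are assumptions of the example, not a prescribed structure):

```python
import os

def activate_revision(deploy_root, revision, ci_passed):
    """Point the 'current' symlink at an already-rsynced revision directory.

    Refuses to deploy unless continuous integration has passed for that
    revision; rolling back is just calling this again with the old one.
    """
    if not ci_passed(revision):
        raise RuntimeError("CI has not passed revision %s; deploy refused"
                           % revision)
    target = os.path.join(deploy_root, "releases", revision)
    if not os.path.isdir(target):
        raise RuntimeError("revision %s is not synced to %s"
                           % (revision, target))
    tmp_link = os.path.join(deploy_root, "current.tmp")
    current = os.path.join(deploy_root, "current")
    os.symlink(target, tmp_link)
    os.replace(tmp_link, current)  # rename is atomic on POSIX filesystems
```

Because the symlink swap is a single atomic rename, servers never see a half-deployed tree, and the same call serves as both deploy and rollback.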
No matter how good your deployment process is, bugs can still get through. The most annoying variety are bugs that don’t manifest until hours or days after the code that caused them is deployed. To catch those nasty bugs, you need a monitoring platform that can let you know when things have gone awry, and get a human being involved in debugging them.
To start, I recommend a system such as the open source Nagios (http://www.nagios.org/). Out of the box, it can monitor basic system stats such as load average and disk utilization. For continuous deployment purposes, we want to be able to have it monitor business metrics such as simultaneous users or revenue per unit time. At the beginning, simply pick one or two of these metrics to use. Anything is fine to start, and it’s important not to choose too many. The goal should be to wire the Nagios alerts up to a pager, cell phone, or high-priority email list that will wake someone up in the middle of the night if one of these metrics goes out of bounds. If the pager goes off too often, it won’t get the attention it deserves, so start simple.
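Nagios treats any executable as a check plugin: it runs the program and reads the exit status, where 0 means OK, 1 means warning, and 2 means critical. A business-metric check can therefore be very small; in this sketch, the metric source is a callback you would wire up to your own data store (the thresholds and metric name are illustrative):

```python
# Nagios plugin contract: the exit status reports the state.
OK, WARNING, CRITICAL = 0, 1, 2

def check_metric(value, warn_below, crit_below):
    """Return (exit_code, status_line) for a 'bigger is better' metric."""
    if value < crit_below:
        return CRITICAL, "CRITICAL - simultaneous users: %d" % value
    if value < warn_below:
        return WARNING, "WARNING - simultaneous users: %d" % value
    return OK, "OK - simultaneous users: %d" % value

def main(get_simultaneous_users):
    """Call sys.exit(main(...)) from a script Nagios executes."""
    code, line = check_metric(get_simultaneous_users(),
                              warn_below=500, crit_below=100)
    print(line)  # Nagios shows the first line of output in its UI
    return code
```

Static thresholds like these are the right starting point; the dynamic, history-based bounds described earlier are a refinement to add only once a metric proves hard to threshold.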
Follow this simple rule: every time the pager goes off, halt the production line (which will prevent check-ins and deployments). Fix the urgent problem, and don’t resume the production line until you’ve had a chance to schedule a Five Whys meeting for root-cause analysis (RCA), which we’ll discuss next.
So far, we’ve talked about making modest investments in tools and infrastructure and adding a couple of simple rules to our development process. Most teams should be able to do everything we’ve talked about in a week or two, at the most, because most of the work involves installing and configuring off-the-shelf software.
Five Whys gets its name from the process of asking “why” recursively to uncover the true source of a given problem. Five Whys enables continuous deployment when you add this rule: every time you do an RCA, make a proportional investment in prevention at each of the five levels you uncover. Proportional means the solution shouldn’t be more expensive than the problem you’re analyzing; a minor inconvenience for only a few customers should merit a much smaller investment than a multihour outage.
But no matter how small the problem is, always make some investments, and always make them at each level. Because our focus in this chapter is deployment, this means always asking the question, “Why was this problem not caught earlier in our deployment pipeline?” So, if a customer experienced a bug, why didn’t Nagios alert us? Why didn’t our deployment process catch it? Why didn’t our continuous integration server catch it? For each question, make a small improvement.
Over months and years, these small improvements add up, much like compounding interest. But there is a reason this approach is superior to making a large upfront investment in a complex continuous deployment system modeled on IMVU’s (or anyone else’s). The payoff is that your solution will be uniquely adapted to your particular system and circumstances. If most of your headaches come from performance problems in production, you’ll naturally be forced to invest in prevention at the deployment/alerting stage. If your problems stem from badly factored code, which causes collateral damage for even small features or fixes, you’ll naturally find yourself adding a lot of automated tests to your continuous integration server. Each problem drives investments in that category of solution. Thankfully, there’s an 80/20 rule at work: 20% of your code and architecture probably drives 80% of your headaches. Investing in that 20% frees up incredible time and energy that can be invested in more productive things.
Following these five steps will not give you continuous deployment overnight. In its initial stages, most of your RCAs will come back to the same problem: “We haven’t invested in preventing that yet.” But with patience and hard work, anyone can use these techniques to inexorably drive waste out of his development process.
Having evangelized the concept of continuous deployment for the past few years, I’ve come into contact with almost every conceivable question, objection, or concern that people have about it. The most common reaction I get is something like “That sounds great—for your business—but that could never work for my application.” Or, phrased more hopefully, “I see how you can use continuous deployment to run an online consumer service, but how can it be used for B2B software?” Or variations thereof.
I understand why people would think that a consumer Internet service such as IMVU isn’t really mission critical. I would posit that those same people have never been on the receiving end of a phone call from a 16-year-old girl complaining that your new release ruined her birthday party. That’s where I learned a whole new appreciation for the idea that mission critical is in the eye of the beholder. But even so, there are key concerns that lead people to conclude that continuous deployment can’t be used in mission-critical situations.
Implicit in these concerns are two beliefs:
Mission-critical customers won’t accept new releases on a continuous basis.
Continuous deployment leads to lower-quality software than software built in large batches.
These beliefs are rooted in fears that make sense. But as is often the case, the right thing to do is to address the underlying cause of the fear (http://www.startuplessonslearned.com/2009/05/fear-is-mind-killer.html) instead of avoiding improving the process. Let’s take each in turn.
Most customers of most products hate new releases. That’s a perfectly reasonable reaction, given that most releases of most products are bad news. It’s likely that the new release will contain new bugs. Even worse, the sad state of product development generally means the new “features” are as likely to be ones that make the product worse, not better. So, asking customers if they’d like to receive new releases more often usually leads to a consistent answer: “No, thank you.” On the other hand, you’ll get a very different reaction if you say to customers, “The next time you report an urgent bug, would you prefer to have it fixed immediately or wait for a future arbitrary release milestone?”
Most enterprise customers of mission-critical software mitigate these problems by insisting on releases on a regular, slow schedule. This gives them plenty of time to do stress testing, training, and their own internal deployment. Smaller customers and regular consumers rely on their vendors to do this for them and are otherwise at their mercy. Switching these customers directly to continuous deployment sounds harder than it really is. That’s because of the anatomy of a release. A typical “new feature” release is, in my experience, about 80% changes to underlying APIs or architecture. That is, the vast majority of the release is not actually visible to the end user. Most of these changes are supposed to be “side-effect free,” although few traditional development teams actually achieve that level of quality. So, the first shift in mindset required for continuous deployment is this: if a change is supposedly “side-effect free,” release it immediately. Don’t wait to bundle it up with a bunch of other related changes. If you do that, it will be much harder to figure out which change caused the unexpected side effects.
The second shift in mindset required is to separate the concept of a marketing release from the concept of an engineering release. Just because a feature is built, tested, integrated, and deployed doesn’t mean any customers should necessarily see it. When deploying end-user-visible changes, most continuous deployment teams keep them hidden behind “flags” that allow for a gradual rollout of the feature when it’s ready. (See the Flickr blog post at http://code.flickr.com/blog/2009/12/02/flipping-out/ for how that company does this.) This allows the concept of “ready” to be much more all-encompassing than the traditional “developers threw it over the wall to QA, and QA approved of it.” You might have the interaction designer who designed it take a look to see if it really conforms to his design. You might have the marketing folks who are going to promote it double-check that it does what they expect. You can train your operations or customer service staff on how it works—all live in the production environment. Although this sounds similar to a staging server, it’s actually much more powerful. Because the feature is live in the real production environment, all kinds of integration risks are mitigated. For example, many features have decent performance themselves but interact badly when sharing resources with other features. Those kinds of features can be immediately detected and reverted by continuous deployment. Most importantly, the feature will look, feel, and behave exactly like it does in production. Bugs that are found in production are real, not staging artifacts.
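The flag mechanism described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not Flickr’s or IMVU’s actual implementation: the flag names and percentages are invented, and the only real technique shown is deterministic bucketing, so a user keeps seeing the same version as the rollout percentage is widened.

```python
import hashlib

# Hypothetical flag table: each feature is deployed to production but
# shown only to a configurable percentage of users. Names are invented.
FLAGS = {
    "new_checkout_flow": 5,   # visible to 5% of users
    "redesigned_profile": 0,  # deployed, but hidden from everyone
}

def is_enabled(flag_name: str, user_id: str) -> bool:
    """Deterministically bucket a user into 0-99 and compare to the rollout %."""
    rollout = FLAGS.get(flag_name, 0)
    digest = hashlib.sha1(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout

# Because bucketing is deterministic, widening the rollout is just an edit
# to the percentage -- no redeploy, and no user flickers between versions.
```

The key design property is that the code for the feature ships continuously, while visibility is a runtime decision that can be changed (or reverted) instantly.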
Plus, you want to get good at selectively hiding features from customers. That skill set is essential for gradual rollouts and, most importantly, A/B split-testing (http://www.startuplessonslearned.com/2008/12/getting-started-with-split-testing.html). In traditional large batch deployment systems, split-testing a new feature seems like considerably more work than just throwing it over the wall. Continuous deployment changes that calculus, making split-tests nearly free. As a result, the amount of validated learning (http://www.startuplessonslearned.com/2009/04/validated-learning-about-customers.html) a continuous deployment team achieves per unit time is much higher.
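Why split-tests become “nearly free” is easy to see once flag infrastructure exists: an experiment is just a flag with more than one “on” state plus a tally. The sketch below is illustrative only; the experiment and variant names are made up.

```python
import hashlib
from collections import Counter

def assign_variant(experiment: str, user_id: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically assign each user to one variant of an experiment."""
    digest = hashlib.sha1(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Measuring a feature is then barely more work than shipping it:
# count exposures and conversions per variant and compare the rates.
exposures = Counter()
conversions = Counter()

def record(experiment: str, user_id: str, converted: bool) -> None:
    variant = assign_variant(experiment, user_id)
    exposures[variant] += 1
    if converted:
        conversions[variant] += 1
```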
A traditional QA process works through a checklist of key features, making sure each feature works as specified before allowing the release to go forward. This makes sense, especially given how many bugs in software involve “action at a distance” or unexpected side effects. Thus, even if a release is focused on changing Feature X, there’s every reason to be concerned that it will accidentally break Feature Y. Over time, the overhead of this approach to QA becomes very expensive. As the product grows, the checklist has to grow proportionally. Thus, to get the same level of coverage for each release, the QA team has to grow (or, equivalently, the amount of time the product spends in QA has to grow). Unfortunately, it gets worse. In a successful start-up, the development team is also growing. That means more changes are being implemented per unit time as well, which means either the number of releases per unit time is growing or, more likely, the number of changes in each release is growing. So, for a growing team working on a growing product, the QA overhead is increasing quadratically—the product of two linearly growing factors—even though the team is expanding only linearly.
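The arithmetic behind that claim can be made concrete with a toy model. The numbers here are invented for illustration; the only assumption is the one in the text, that every release must re-run the whole checklist.

```python
# Toy model of manual-QA cost: the checklist grows with product size,
# and every release re-runs the entire checklist.
def qa_hours_per_release(features: int, minutes_per_check: float = 30) -> float:
    return features * minutes_per_check / 60

def qa_hours_per_month(features: int, releases_per_month: int) -> float:
    return qa_hours_per_release(features) * releases_per_month

# Year 1: 200 features, 2 releases/month  -> 200 QA hours per month.
# Year 2: product and release rate both double ->
#         400 features, 4 releases/month -> 800 QA hours per month.
# Doubling both linear factors quadruples the cost.
```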
For organizations that have the highest quality standards, and the budget to do it, full coverage can work. In fact, that’s what happens for organizations such as the U.S. Army, which has to do a massive amount of integration testing of products built by its vendors. Having those products fail in the field would be unacceptable. To achieve full coverage, the Army has a process for certifying these products. The whole process takes a massive amount of manpower and requires a cycle time that would be lethal for most start-ups (the major certifications take approximately two years). And even the Army recognizes that improving this cycle time would have major benefits.
Very few start-ups can afford this overhead, and so they simply accept a reduction in coverage instead. That solves the problem in the short term, but not in the long term—because the extra bugs that get through the QA process wind up slowing the team down over time, imposing extra “firefighting” overhead, too.
I want to directly challenge the belief that continuous deployment leads to lower-quality software. I just don’t believe it. Continuous deployment offers significant advantages over large batch development systems. Some of these benefits are shared by Agile systems that have continuous integration but large batch releases, but others are unique to continuous deployment.
Engineers working in a continuous deployment environment are much more likely to get individually tailored feedback about their work. When they introduce a bug, performance problem, or scalability bottleneck, they are likely to know about it immediately. They’ll be much less likely to hide behind the work of others, as happens with large batch releases—when a release has a bug, it tends to be attributed to the major contributor to that release, even when that attribution is unfair.
Continuous deployment requires living the mantra: “Have every problem only once.” This requires a commitment to realistic prevention and learning from past mistakes. That necessarily means an awful lot of automation. That’s good for QA and for engineers. QA’s job gets a lot more interesting when we use machines for what machines are good for: routine repetitive detailed work, such as finding bug regressions.
To make continuous deployment work, teams have to get good at automated monitoring and reacting to business and customer-centric metrics, not just technical metrics. That’s a simple consequence of the automation principle I just mentioned. Huge classes of bugs “work as designed” but cause catastrophic changes in customer behavior. My favorite: changing the checkout button in an e-commerce flow to appear white on a white background. No automated test is going to catch that, but it still will drive revenue to zero. That class of bug will burn continuous deployment teams only once.
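A business-metric check of the kind described above might look like the following sketch. The thresholds and metric names are invented; the point is that instead of asserting on code behavior, it asserts that customer behavior after a deploy still looks like customer behavior before it.

```python
def checkout_rate(completed: int, started: int) -> float:
    """Fraction of started checkouts that complete."""
    return completed / started if started else 0.0

def should_roll_back(before_rate: float, after_rate: float,
                     max_relative_drop: float = 0.5) -> bool:
    """Flag a deploy if the checkout rate fell by more than the threshold."""
    if before_rate == 0:
        return False
    return (before_rate - after_rate) / before_rate > max_relative_drop

# A white-on-white checkout button passes every unit test, but it drives
# after_rate toward zero, so this check fires and the deploy is reverted.
```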
Most QA teams are organized around finding reproduction paths for bugs that affect customers. This made sense in eras where successful products tended to be used by a small number of customers. These days, even niche products—or even big enterprise products—tend to have a lot of man-hours logged by end users. And that, in turn, means that rare bugs are actually quite exasperating. For example, consider a bug that happens only one time in a million uses. Traditional QA teams are never going to find a reproduction path for that bug. It will never show up in the lab. But for a product with millions of customers, it’s happening (and it’s being reported to customer service) multiple times a day! Continuous deployment teams are much better able to find and fix these bugs.
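A back-of-the-envelope calculation shows why a one-in-a-million bug is invisible in the lab but routine in production. The usage numbers below are illustrative, not from the text.

```python
# Illustrative figures for a one-in-a-million bug at consumer scale.
failure_rate = 1 / 1_000_000       # bug triggers once per million uses
uses_per_user_per_day = 5
users = 2_000_000

incidents_per_day = users * uses_per_user_per_day * failure_rate
# Roughly 10 production incidents per day -- while a QA lab running a few
# thousand checks per release has effectively no chance of ever seeing one.
```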
Continuous deployment tends to drive the batch size of work down to an optimal level, whereas traditional deployment systems tend to drive it up. For more details on this phenomenon, see “Work in Small Batches” (http://www.startuplessonslearned.com/2009/02/work-in-small-batches.html) and the section on the “batch size death spiral” in “The Principles of Product Development Flow” (http://www.startuplessonslearned.com/2009/07/principles-of-product-development-flow.html).
I want to mention one last benefit of continuous deployment: morale. At a recent talk, an audience member asked me about the impact of continuous deployment on morale. This manager was worried that moving his engineers to a more rapid release cycle would stress them out, making them feel like they were always firefighting and releasing, and never had time for “real work.” As luck would have it, one of IMVU’s engineers happened to be in the audience at the time. He provided a better answer than I ever could. He explained that by reducing the overhead of doing a release, each engineer gets to work to his own release schedule. That means that as soon as an engineer is ready to deploy, he can. So, even if it’s midnight, if your feature is ready to go, you can check in, deploy, and start talking to customers about it right away. No extra approvals, meetings, or coordination is required. Just you, your code, and your customers. It’s pretty satisfying.