Architecting for Scale

Book Description

Every day, companies struggle to scale critical applications. As traffic volume and data demands increase, these applications become more complicated and brittle, exposing risks and compromising availability. This practical guide shows IT, DevOps, and system reliability managers how to prevent an application from becoming slow, inconsistent, or downright unavailable as it grows.

Scaling isn’t just about handling more users; it’s also about managing risk and ensuring availability. Author Lee Atchison provides basic techniques for building applications that can handle huge quantities of traffic, data, and demand without affecting the quality your customers expect.

In five parts, this book explores:

  • Availability: learn techniques for building highly available applications, and for tracking and improving availability going forward
  • Risk management: identify, mitigate, and manage risks in your application, test your recovery/disaster plans, and build out systems that contain fewer risks
  • Services and microservices: understand the value of services for building complicated applications that need to operate at higher scale
  • Scaling applications: assign services to specific teams, label the criticality of each service, and devise failure scenarios and recovery plans
  • Cloud services: understand the structure of cloud-based services, resource allocation, and service distribution

Table of Contents

  1. Foreword
  2. Preface
    1. Who Should Read This Book
    2. Why I Wrote This Book
    3. A Word on Scale Today
    4. Navigating This Book
      1. Part I, “Availability”
      2. Part II, “Risk Management”
      3. Part III, “Services and Microservices”
      4. Part IV, “Scaling Applications”
      5. Part V, “Cloud Services”
      6. Part VI, “Conclusion”
    5. Online Resources
    6. Conventions Used in This Book
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments
  3. I. Availability
  4. 1. What Is Availability?
    1. Availability Versus Reliability
    2. What Causes Poor Availability?
  5. 2. Five Focuses to Improve Application Availability
    1. Focus #1: Build with Failure in Mind
    2. Focus #2: Always Think About Scaling
    3. Focus #3: Mitigate Risk
    4. Focus #4: Monitor Availability
    5. Focus #5: Respond to Availability Issues in a Predictable and Defined Way
    6. Being Prepared
  6. 3. Measuring Availability
    1. The Nines
      1. What’s Reasonable?
    2. Don’t Be Fooled
    3. Availability by the Numbers
  7. 4. Improving Your Availability When It Slips
    1. Measure and Track Your Current Availability
    2. Automate Your Manual Processes
      1. Automated Deploys
      2. Configuration Management
      3. Change Experiments and High Frequency Changes
      4. Automated Change Sanity Testing
    3. Improve Your Systems
    4. Your Changing and Growing Application
    5. Keeping on Top of Availability
  8. II. Risk Management
  9. 5. What Is Risk Management?
    1. Managing Risk
    2. Identify Risk
    3. Remove Worst Offenders
    4. Mitigate
    5. Review Regularly
    6. Managing Risk Summary
  10. 6. Likelihood Versus Severity
    1. The Top 10 List: Low Likelihood, Low Severity Risk
    2. The Order Database: Low Likelihood, High Severity Risk
    3. Custom Fonts: High Likelihood, Low Severity Risk
    4. T-Shirt Photos: High Likelihood, High Severity Risk
  11. 7. The Risk Matrix
    1. Scope of the Risk Matrix
    2. Creating the Risk Matrix
      1. Brainstorming the List
      2. Set the Likelihood and Severity Fields
      3. Risk Item Details
      4. Mitigation Plan
      5. Triggered Plan
    3. Using the Risk Matrix for Planning
    4. Maintaining the Risk Matrix
  12. 8. Risk Mitigation
    1. Recovery Plans
    2. Disaster Recovery Plans
    3. Improving Our Risk Situation
  13. 9. Game Days
    1. Staging Versus Production Environments
    2. Concerns with Running Game Days in Production
    3. Game Day Testing
  14. 10. Building Systems with Reduced Risk
    1. Redundancy
    2. Examples of Idempotent Interfaces
    3. Redundancy Improvements That Increase Complexity
    4. Independence
    5. Security
    6. Simplicity
    7. Self-Repair
    8. Operational Processes
  15. III. Services and Microservices
  16. 11. Why Use Services?
    1. The Monolith Application
    2. The Service-Based Application
    3. The Ownership Benefit
    4. The Scaling Benefit
  17. 12. Using Microservices
    1. What Should Be a Service?
      1. Dividing into Services
      2. Guideline #1: Specific Business Requirements
      3. Guideline #2: Distinct and Separable Team Ownership
      4. Guideline #3: Naturally Separable Data
      5. Guideline #4: Shared Capabilities/Data
      6. Mixed Reasons
    2. Going Too Far
    3. The Right Balance
  18. 13. Dealing with Service Failures
    1. Cascading Service Failures
    2. Responding to a Service Failure
      1. Predictable Response
      2. Understandable Response
      3. Reasonable Response
    3. Determining Failures
    4. Appropriate Action
      1. Graceful Degradation
      2. Graceful Backoff
      3. Fail as Early as Possible
      4. Customer-Caused Problems
  19. IV. Scaling Applications
  20. 14. Two Mistakes High
    1. What Is “Two Mistakes High”?
    2. “Two Mistakes High” in Practice
      1. Losing a Node
      2. Problems During Upgrades
      3. Data Center Resiliency
      4. Hidden Shared Failure Types
      5. Failure Loops
    3. Managing Your Applications
    4. The Space Shuttle
  21. 15. Service Ownership
    1. Single Team Owned Service Architecture
    2. Advantages of a STOSA Application and Organization
    3. What Does it Mean to Be a Service Owner?
  22. 16. Service Tiers
    1. Application Complexity
    2. What Are Service Tiers?
    3. Assigning Service Tier Labels to Services
      1. Tier 1
      2. Tier 2
      3. Tier 3
      4. Tier 4
    4. Example: Online Store
    5. What’s Next?
  23. 17. Using Service Tiers
    1. Expectations
    2. Responsiveness
    3. Dependencies
      1. Critical Dependency
      2. Noncritical Dependency
    4. Summary
  24. 18. Service-Level Agreements
    1. What are Service-Level Agreements?
    2. External Versus Internal SLAs
    3. Why Are Internal SLAs Important?
    4. SLAs as Trust
    5. SLAs for Problem Diagnosis
    6. Performance Measurements for SLAs
      1. Limit SLAs
      2. Top Percentile SLAs
      3. Latency Groups
    7. How Many and Which Internal SLAs?
    8. Additional Comments on SLAs
  25. 19. Continuous Improvement
    1. Examine Your Application Regularly
    2. Microservices
    3. Service Ownership
    4. Stateless Services
    5. Where’s the Data?
    6. Data Partitioning
    7. The Importance of Continuous Improvement
  26. V. Cloud Services
  27. 20. Change and the Cloud
    1. What Has Changed in the Cloud?
      1. Acceptance of Microservice-Based Architectures
      2. Smaller, More Specialized Services
      3. Greater Focus on the Application
      4. The Micro Startup
      5. Security and Compliance Has Matured
    2. Change Continues
  28. 21. Distributing the Cloud
    1. AWS Architecture
      1. AWS Region
      2. AWS Availability Zone
      3. Data Center
    2. Architecture Overview
    3. Availability Zones Are Not Data Centers
    4. Maintaining Location Diversity for Availability Reasons
  29. 22. Managed Infrastructure
    1. Structure of Cloud-Based Services
      1. Raw Resource
      2. Managed Resource (Server-Based)
      3. Managed Resource (Non-server-based)
    2. Implications of Using Managed Resources
    3. Implications of Using Non-Managed Resources
    4. Monitoring and CloudWatch
  30. 23. Cloud Resource Allocation
    1. Allocated-Capacity Resource Allocation
      1. Changing Allocations
      2. Reserved Capacity
    2. Usage-Based Resource Allocation
      1. The “Magic” of Usage-Based Resource Allocation
    3. The Pros and Cons of Resource Allocation Techniques
  31. 24. Scalable Computing Options
    1. Cloud-Based Servers
      1. Advantages
      2. Disadvantages
      3. Optimized Use Cases
    2. Compute Slices
      1. Advantages
      2. Disadvantages
      3. Optimized Use Cases
    3. Dynamic Containers
      1. Advantages
      2. Disadvantages
      3. Optimized Use Cases
    4. Microcompute
      1. Advantages
      2. Disadvantages
      3. Optimized Use Cases
    5. Now What?
  32. 25. AWS Lambda
    1. Using Lambda
      1. Event Processing
      2. Mobile Backend
      3. Internet of Things Data Intake
    2. Advantages and Disadvantages of Lambda
  33. VI. Conclusion
  34. 26. Putting It All Together
    1. Availability
    2. Risk Management
    3. Services
    4. Scaling
    5. Cloud
    6. Architecting for Scale
  35. Index

Product Information

  • Title: Architecting for Scale
  • Author(s): Lee Atchison
  • Release date: July 2016
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491943397