Engineering Resilient Systems on AWS

Book description

To ensure that applications are reliable and always available, more businesses today are moving applications to AWS. But many companies still struggle to design and build these cloud applications effectively, thinking that because the cloud is resilient, their applications will be too. With this practical guide, software, DevOps, and cloud engineers will learn how to implement resilient designs and configurations in the cloud through independent, hands-on labs.

Authors Kevin Schwarz, Jennifer Moran, and Dr. Nate Bachmeier from AWS teach you how to build cloud applications that demonstrate resilience with patterns like backoff and retry, multi-Region failover, data protection, and circuit breaker, using common configuration, tooling, and deployment scenarios. Labs are organized into categories based on complexity and topic, making it easy for you to focus on the parts most relevant to your business.

You'll learn how to:

  • Configure and deploy AWS services using resilience patterns
  • Implement stateless microservices for high availability
  • Consider multi-Region designs to meet business requirements
  • Implement backup and restore, pilot light, warm standby, and active-active strategies
  • Build applications that withstand AWS Region and Availability Zone impairments
  • Use chaos engineering experiments for fault injection to test for resilience
  • Assess the trade-offs when building resilient systems, including cost, complexity, and operational burden

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
      1. from Kevin
      2. from Jennifer Moran
      3. from Dr. Nate
  2. I. Foundations
  3. 1. Introduction
    1. People, Process and Technology
      1. The Role of People
      2. The Role of Processes
      3. Integrating People, Processes, and Technology
    2. Shared Responsibility Model
    3. AWS Responsibility
      1. AWS Global Infrastructure
    4. Customer Responsibility
      1. Setting Objectives
      2. Workload Architecture
      3. Networking
      4. Quotas
      5. Change Management
      6. Failure Management
      7. Observability
      8. Continuous Testing
      9. CI/CD and Automation
      10. Continuous Resilience
    5. Summary
  4. 2. Prepare Your Working Environment
    1. Hands-on Learning with Microservices
    2. AWS Account and Permissions
    3. Choosing a Development OS and IDE
    4. The Cloud 9 Environment
    5. Git and Code Samples Repository
    6. Python Environment
    7. NPM and Node.js
    8. AWS CDK
    9. Additional Software
      1. AWS CLI
      2. Python Packages
      3. Vue.js and Vite
      4. Bootstrap CSS
      5. Artillery.io
      6. curl and watch
      7. Boto3
      8. PostgreSQL
      9. Lambda Powertools
      10. Docker Desktop
    10. Custom Domain and Route53 Hosted Zone
    11. Security
      1. Encryption in transit
      2. Encryption at rest
      3. Authentication and Authorization for API endpoints
      4. Tokenization
      5. Code scanning
    12. Cleaning Up
    13. Summary
  5. II. Reliable Trading Portal
  6. 3. Frontend Web Application
    1. Technical Requirements
    2. Architecture Overview
    3. Deploying the AWS CDK Application
      1. Using an Amazon CloudFront Domain
      2. Amazon CloudFront
      3. Amazon Simple Storage Service (Amazon S3)
      4. Amazon Route 53
    4. Implementing Observability
      1. Synthetic Monitoring: Proactive Insight into User Experience
    5. Injecting Failure Modes
      1. Introducing Excessive Load
      2. Introducing Excessive Latency
      3. Addressing Single Points of Failure
    6. Cleaning Up
    7. Summary
  7. 4. Serverless Account Open API
    1. Technical Requirements
    2. Architecture Overview
      1. An AWS Serverless Approach
    3. Deploy the AWS CDK Application
    4. Sunny Day Scenario
    5. Strongly Typed Service Contracts
    6. Idempotent Responses
    7. Self-Healing with Message Queue Retries
    8. Rate Limiting: Throttle Unanticipated Load
    9. Surviving a Poison Pill
    10. STOP: Business Continuity Regional Switchover
    11. Returning to Business as Usual
    12. Blue-Green testing
    13. Cleaning up
    14. Summary
  8. 5. Containerized Trade Stock API
    1. Technical Requirements
    2. Architecture Overview
    3. Deploy the AWS CDK Application
      1. VpcStack
      2. TradeDatabaseStack
      3. TradeOrderStack
      4. TradeConfirmsStack
      5. Prepare the Database
    4. Container deployment failures
    5. Database connection exhaustion
    6. Database password rotation login failures
    7. Database primary writer failures
    8. Dependency intermittent failures
    9. Detecting and handling Availability Zone issues
    10. Dependency outages
    11. Cleaning up
    12. Summary
  9. 6. Integrated Stock Wise Frontend with APIs
    1. Technical Requirements
    2. Architecture Overview
    3. Deploy the AWS CDK Application
    4. Automating Stock Wise Endpoint Configuration
    5. Integrating Stock Wise Microservices
    6. Configure Client Timeouts
    7. Gracefully Degrade Features
    8. Real User Monitoring
    9. X-Ray for end-to-end tracing
    10. Cleaning up
    11. Summary
  10. 7. When Recovery Is Required
    1. Architecture Overview
    2. Deploy the AWS CDK Application
      1. Deploy the AWS CDK Orchestration Stack
      2. Integrate Backend API to Frontend
    3. Validating Region
    4. Database Failover
    5. Scaling Compute
    6. Routing at the Lambda Layer
    7. DNS Failover
    8. Importance of Backups
    9. Avoiding Configuration Drift
    10. Failover Verification
    11. Cleaning up
    12. Summary
  11. III. Discovering Trading Opportunities
  12. 8. Real-time Market Data Analytics
    1. Technical Requirements
    2. Designing a Reliable Data Ingestion Layer
      1. Role of Apache Kafka in Data Ingestion
      2. Designing the Kafka Topic Structure
      3. Securing the Kafka Cluster
    3. Implementing Reliable Consumers
      1. Ensuring Fault Tolerance and Scalability
      2. Consumer Groups and Record Processing
      3. Handling Invalid Messages
      4. Dealing with Downstream Dependencies
    4. Integrating Consumers and APIs
      1. Creating the Connection
      2. Designing Consumer State
      3. Implementing State Management
      4. Handling Concurrency
      5. Using Restartability
    5. Storing and Querying Processed Market Data
      1. Handling Firehose Failure Modes
      2. Querying Athena
      3. Optimizing Data Storage and Querying Performance
    6. Monitoring and Observability
    7. Testing Resiliency
    8. Cleanup
    9. Summary
  13. 9. Building Reliable News Feed Ingestion and Search APIs
    1. Technical Requirements
    2. Fetching and Processing News Articles
      1. Scheduler and Worker Node Architecture
      2. Leader Election for Scheduler High Availability
      3. Scheduler Configuration Failure Modes
      4. Additional Resiliency Strategies
    3. Storing Articles and Metadata
      1. Fallback and Caching
    4. Syncing Articles to OpenSearch
      1. Handling Indexing Failures
      2. Index Optimization Techniques
    5. Serving Search Traffic
      1. Connection Draining
      2. Progressive Degradation
      3. Fallback Mechanisms
    6. Testing Resiliency
    7. Security Considerations
    8. Monitoring and Observability
    9. Cleanup
    10. Summary
  14. 10. Building Resilient Multi-Region Architectures
    1. The Business Case for Multi-Region Architectures
    2. Architecting for Multi-Region Resiliency
    3. Multi-Region Streaming Architectures
      1. Replicating Kafka Data Across Regions
      2. Handling Active-Active Kafka Deployments
      3. Streaming Data to Other Destinations
    4. Multi-Region Search Architectures with OpenSearch
      1. Cross-Region Data Replication with OpenSearch
      2. Index Design and Shard Allocation
      3. Searching Across Regions
      4. Other Data Replication Options
    5. Caching in Multi-Region Architectures
    6. Best Practices for Multi-Region Architectures
    7. Summary
  15. 11. Putting It All Together
    1. Reviewing Core Concepts
      1. Reliability Frameworks
      2. Failure Modes with Reliability Patterns
      3. Connecting the Key Learnings
    2. Leading Resiliency Initiatives: Cultivating a Culture of Resilience
      1. Nurturing the Seeds of Resilience
      2. Becoming the Go-To Resilience Guru
      3. Sharpening Your Resilience Radar
      4. Embracing Continuous Resilience
      5. Making Resilience a Daily Habit
    3. Looking to the Future
      1. Navigating the multi-cloud and hybrid cloud landscape
      2. Harnessing AI for Resilience
      3. Embracing chaos engineering
      4. Leveraging Observability
    4. Summary
  16. A. AWS Services
    1. Amazon API Gateway
    2. Amazon Aurora
    3. Amazon DynamoDB
    4. Amazon Elastic Compute Cloud (EC2)
    5. Amazon ElastiCache
    6. Amazon Elastic Container Service (ECS)
    7. Amazon Managed Streaming for Apache Kafka (MSK)
    8. Amazon MemoryDB for Redis
    9. Amazon OpenSearch Service
    10. Amazon Simple Notification Service (SNS)
    11. Amazon Simple Queue Service (SQS)
    12. Amazon Simple Storage Service (S3)
    13. AWS CloudWatch
    14. AWS Lambda
    15. AWS Secrets Manager
    16. Amazon Virtual Private Cloud (VPC)
  17. About the Authors

Product information

  • Title: Engineering Resilient Systems on AWS
  • Author(s): Kevin Schwarz, Jennifer Moran, Nate Bachmeier
  • Release date: October 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098162429