Site Reliability Engineering on AWS

Video description

Reliability in AWS includes the ability of a system to recover from infrastructure or service disruptions. It is essential to acquire computing resources to meet demand and to mitigate disruptions such as configuration issues or transient network problems.

In this course, you will first explore the key concepts and core services of AWS and Site Reliability Engineering (SRE). We then show you, step by step, how to implement a real-world application built on the reliability principles defined in the AWS Well-Architected Framework using the SRE approach, so that you can increase the reliability of application architectures on AWS by building resilience into both infrastructure and applications.

You will cover common architectural patterns that AWS solution architects use every day to build reliable systems and implement fault tolerance in application architectures running on AWS. You will also learn how to further increase the reliability of these architectures by implementing multi-region solutions for disaster recovery on a global scale.

By the end of this course, you will have gained a variety of AWS architecture skills that you can then apply to the real world.

What You Will Learn

  • Understand the core principles of Site Reliability Engineering and how cloud computing enables them
  • Design applications for fault tolerance, auto-healing, resilience, and reliability
  • Examine a simple Python microservice ecosystem and understand its limitations
  • Identify critical stack components and redesign them so they're resilient and reliable
  • Map design changes to native AWS services with ease
  • Deploy redesigned applications in a globally accessible, resilient, and reliable way


Java developers, software engineers, students, or anyone who needs a thorough, reliable, and easy-to-understand resource to help them move ahead in their career will find this course useful.

Prior experience with coding in Java is assumed.

About The Author

Malcolm Orr: Malcolm Orr is a Principal Architect in AWS Professional Services. He holds 7 AWS certifications along with CKAD and spends his time working with AWS customers to build, deploy, and manage cloud-native applications and microservices. Before AWS, Malcolm worked in a number of roles, including author, contractor, chief startup dogsbody, and advisory practice lead, and he enjoys solving technical challenges.

Table of contents

  1. Chapter 1 : The Basics of Site Reliability Engineering
    1. Course Overview
    2. Reliability in Modern Applications
    3. The Impact of Failure and Determining Your Reliability Objectives
    4. Accepting Failure and Making It Part of the Design Process
    5. SRE is a Mindset
  2. Chapter 2 : Gaining Resilience and Reliability On AWS
    1. AWS Global, Regional, and Zonal Architecture Design
    2. Amazon's Global Storage Services - S3
    3. Running Resilient Databases On AWS - RDS and DynamoDB
    4. Fault Tolerant Computation On AWS - Lambda and EC2
    5. Core Resilience Principles for AWS - Load Balancing and Auto Scaling
    6. Using Kubernetes and ECS On AWS
  3. Chapter 3 : Accepting Failure In Multi-Tier Applications
    1. Typical Three-Tier Application Resilience and Why It Fails in Cloud
    2. Designing In Resilience With Microservices
    3. Managing State
    4. Typical Application Reliability Patterns
    5. The Architecture of Our Example Microservices
  4. Chapter 4 : Deploying Py-Simple On AWS
    1. Optimizing and Migrating Our Code
    2. Creating Our Container with CodeBuild
    3. Deploying ECS and RDS
    4. Deploying and Testing Our Py-Simple Application
    5. The Problem with What We've Just Built
  5. Chapter 5 : Designing Py-Global
    1. The Architecture of Py-Global and Failure Mode Analysis
    2. Multi-Regional Support
    3. Microservices Design
    4. Authentication and Authorization
    5. Code Deployment with CodePipeline
    6. Application Telemetry and Tracing
    7. Application Analytics
    8. Aurora and its Advantages Over MySQL
  6. Chapter 6 : Deploying a Resilient, Fault Tolerant Py-Global Application
    1. Running/Scaling Our Application On EKS
    2. Creating a Resilient and Reliable Data Store for Python with Amazon Aurora
    3. Deploying App-Mesh
  7. Chapter 7 : Surviving Failure of a Global Scale
    1. Review: AWS Global Architecture and What We Have Just Built
    2. Global Tools: Route 53, CloudFront
    3. Going Global: What Does This Mean For Your Users/Developers
    4. Operational Changes Required For a Global Application
    5. Course Summary

Product information

  • Title: Site Reliability Engineering on AWS
  • Author(s): Malcolm Orr
  • Release date: June 2020
  • Publisher(s): Packt Publishing
  • ISBN: 9781800205970