Building Secure and Reliable Systems

by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield

March 2020

Intermediate to advanced

555 pages

16h 29m

English

O'Reilly Media, Inc.

Foreword by Royal Hansen
Foreword by Michael Wildpaner
Preface
Why We Wrote This Book · Who This Book Is For · A Note About Culture · How to Read This Book · Conventions Used in This Book · O’Reilly Online Learning · How to Contact Us · Acknowledgments
I. Introductory Material
1. The Intersection of Security and Reliability
On Passwords and Power Drills · Reliability Versus Security: Design Considerations · Confidentiality, Integrity, Availability · Confidentiality · Integrity · Availability · Reliability and Security: Commonalities · Invisibility · Assessment · Simplicity · Evolution · Resilience · From Design to Production · Investigating Systems and Logging · Crisis Response · Recovery · Conclusion
2. Understanding Adversaries
Attacker Motivations · Attacker Profiles · Hobbyists · Vulnerability Researchers · Governments and Law Enforcement · Activists · Criminal Actors · Automation and Artificial Intelligence · Insiders · Attacker Methods · Threat Intelligence · Cyber Kill Chains™ · Tactics, Techniques, and Procedures · Risk Assessment Considerations · Conclusion
II. Designing Systems
3. Case Study: Safe Proxies
Safe Proxies in Production Environments · Google Tool Proxy · Conclusion
4. Design Tradeoffs
Design Objectives and Requirements · Feature Requirements · Nonfunctional Requirements · Features Versus Emergent Properties · Example: Google Design Document · Balancing Requirements · Example: Payment Processing · Managing Tensions and Aligning Goals · Example: Microservices and the Google Web Application Framework · Aligning Emergent-Property Requirements · Initial Velocity Versus Sustained Velocity · Conclusion
5. Design for Least Privilege
Concepts and Terminology · Least Privilege · Zero Trust Networking · Zero Touch · Classifying Access Based on Risk · Best Practices · Small Functional APIs · Breakglass · Auditing · Testing and Least Privilege · Diagnosing Access Denials · Graceful Failure and Breakglass Mechanisms · Worked Example: Configuration Distribution · POSIX API via OpenSSH · Software Update API · Custom OpenSSH ForceCommand · Custom HTTP Receiver (Sidecar) · Custom HTTP Receiver (In-Process) · Tradeoffs · A Policy Framework for Authentication and Authorization Decisions · Using Advanced Authorization Controls · Investing in a Widely Used Authorization Framework · Avoiding Potential Pitfalls · Advanced Controls · Multi-Party Authorization (MPA) · Three-Factor Authorization (3FA) · Business Justifications · Temporary Access · Proxies · Tradeoffs and Tensions · Increased Security Complexity · Impact on Collaboration and Company Culture · Quality Data and Systems That Impact Security · Impact on User Productivity · Impact on Developer Complexity · Conclusion

6. Design for Understandability
Why Is Understandability Important? · System Invariants · Analyzing Invariants · Mental Models · Designing Understandable Systems · Complexity Versus Understandability · Breaking Down Complexity · Centralized Responsibility for Security and Reliability Requirements · System Architecture · Understandable Interface Specifications · Understandable Identities, Authentication, and Access Control · Security Boundaries · Software Design · Using Application Frameworks for Service-Wide Requirements · Understanding Complex Data Flows · Considering API Usability · Conclusion
7. Design for a Changing Landscape
Types of Security Changes · Designing Your Change · Architecture Decisions to Make Changes Easier · Keep Dependencies Up to Date and Rebuild Frequently · Release Frequently Using Automated Testing · Use Containers · Use Microservices · Different Changes: Different Speeds, Different Timelines · Short-Term Change: Zero-Day Vulnerability · Medium-Term Change: Improvement to Security Posture · Long-Term Change: External Demand · Complications: When Plans Change · Example: Growing Scope—Heartbleed · Conclusion
8. Design for Resilience
Design Principles for Resilience · Defense in Depth · The Trojan Horse · Google App Engine Analysis · Controlling Degradation · Differentiate Costs of Failures · Deploy Response Mechanisms · Automate Responsibly · Controlling the Blast Radius · Role Separation · Location Separation · Time Separation · Failure Domains and Redundancies · Failure Domains · Component Types · Controlling Redundancies · Continuous Validation · Validation Focus Areas · Validation in Practice · Practical Advice: Where to Begin · Conclusion
9. Design for Recovery
What Are We Recovering From? · Random Errors · Accidental Errors · Software Errors · Malicious Actions · Design Principles for Recovery · Design to Go as Quickly as Possible (Guarded by Policy) · Limit Your Dependencies on External Notions of Time · Rollbacks Represent a Tradeoff Between Security and Reliability · Use an Explicit Revocation Mechanism · Know Your Intended State, Down to the Bytes · Design for Testing and Continuous Validation · Emergency Access · Access Controls · Communications · Responder Habits · Unexpected Benefits · Conclusion
10. Mitigating Denial-of-Service Attacks
Strategies for Attack and Defense · Attacker’s Strategy · Defender’s Strategy · Designing for Defense · Defendable Architecture · Defendable Services · Mitigating Attacks · Monitoring and Alerting · Graceful Degradation · A DoS Mitigation System · Strategic Response · Dealing with Self-Inflicted Attacks · User Behavior · Client Retry Behavior · Conclusion
III. Implementing Systems
11. Case Study: Designing, Implementing, and Maintaining a Publicly Trusted CA
Background on Publicly Trusted Certificate Authorities · Why Did We Need a Publicly Trusted CA? · The Build or Buy Decision · Design, Implementation, and Maintenance Considerations · Programming Language Choice · Complexity Versus Understandability · Securing Third-Party and Open Source Components · Testing · Resiliency for the CA Key Material · Data Validation · Conclusion
12. Writing Code
Frameworks to Enforce Security and Reliability · Benefits of Using Frameworks · Example: Framework for RPC Backends · Common Security Vulnerabilities · SQL Injection Vulnerabilities: TrustedSqlString · Preventing XSS: SafeHtml · Lessons for Evaluating and Building Frameworks · Simple, Safe, Reliable Libraries for Common Tasks · Rollout Strategy · Simplicity Leads to Secure and Reliable Code · Avoid Multilevel Nesting · Eliminate YAGNI Smells · Repay Technical Debt · Refactoring · Security and Reliability by Default · Choose the Right Tools · Use Strong Types · Sanitize Your Code · Conclusion
13. Testing Code
Unit Testing · Writing Effective Unit Tests · When to Write Unit Tests · How Unit Testing Affects Code · Integration Testing · Writing Effective Integration Tests · Dynamic Program Analysis · Fuzz Testing · How Fuzz Engines Work · Writing Effective Fuzz Drivers · An Example Fuzzer · Continuous Fuzzing · Static Program Analysis · Automated Code Inspection Tools · Integration of Static Analysis in the Developer Workflow · Abstract Interpretation · Formal Methods · Conclusion
14. Deploying Code
Concepts and Terminology · Threat Model · Best Practices · Require Code Reviews · Rely on Automation · Verify Artifacts, Not Just People · Treat Configuration as Code · Securing Against the Threat Model · Advanced Mitigation Strategies · Binary Provenance · Provenance-Based Deployment Policies · Verifiable Builds · Deployment Choke Points · Post-Deployment Verification · Practical Advice · Take It One Step at a Time · Provide Actionable Error Messages · Ensure Unambiguous Provenance · Create Unambiguous Policies · Include a Deployment Breakglass · Securing Against the Threat Model, Revisited · Conclusion
15. Investigating Systems
From Debugging to Investigation · Example: Temporary Files · Debugging Techniques · What to Do When You’re Stuck · Collaborative Debugging: A Way to Teach · How Security Investigations and Debugging Differ · Collect Appropriate and Useful Logs · Design Your Logging to Be Immutable · Take Privacy into Consideration · Determine Which Security Logs to Retain · Budget for Logging · Robust, Secure Debugging Access · Reliability · Security · Conclusion
IV. Maintaining Systems
16. Disaster Planning
Defining “Disaster” · Dynamic Disaster Response Strategies · Disaster Risk Analysis · Setting Up an Incident Response Team · Identify Team Members and Roles · Establish a Team Charter · Establish Severity and Priority Models · Define Operating Parameters for Engaging the IR Team · Develop Response Plans · Create Detailed Playbooks · Ensure Access and Update Mechanisms Are in Place · Prestaging Systems and People Before an Incident · Configuring Systems · Training · Processes and Procedures · Testing Systems and Response Plans · Auditing Automated Systems · Conducting Nonintrusive Tabletops · Testing Response in Production Environments · Red Team Testing · Evaluating Responses · Google Examples · Test with Global Impact · DiRT Exercise Testing Emergency Access · Industry-Wide Vulnerabilities · Conclusion
17. Crisis Management
Is It a Crisis or Not? · Triaging the Incident · Compromises Versus Bugs · Taking Command of Your Incident · The First Step: Don’t Panic! · Beginning Your Response · Establishing Your Incident Team · Operational Security · Trading Good OpSec for the Greater Good · The Investigative Process · Keeping Control of the Incident · Parallelizing the Incident · Handovers · Morale · Communications · Misunderstandings · Hedging · Meetings · Keeping the Right People Informed with the Right Levels of Detail · Putting It All Together · Triage · Declaring an Incident · Communications and Operational Security · Beginning the Incident · Handover · Handing Back the Incident · Preparing Communications and Remediation · Closure · Conclusion
18. Recovery and Aftermath
Recovery Logistics · Recovery Timeline · Planning the Recovery · Scoping the Recovery · Recovery Considerations · Recovery Checklists · Initiating the Recovery · Isolating Assets (Quarantine) · System Rebuilds and Software Upgrades · Data Sanitization · Recovery Data · Credential and Secret Rotation · After the Recovery · Postmortems · Examples · Compromised Cloud Instances · Large-Scale Phishing Attack · Targeted Attack Requiring Complex Recovery · Conclusion
V. Organization and Culture
19. Case Study: Chrome Security Team
Background and Team Evolution · Security Is a Team Responsibility · Help Users Safely Navigate the Web · Speed Matters · Design for Defense in Depth · Be Transparent and Engage the Community · Conclusion
20. Understanding Roles and Responsibilities
Who Is Responsible for Security and Reliability? · The Roles of Specialists · Understanding Security Expertise · Certifications and Academia · Integrating Security into the Organization · Embedding Security Specialists and Security Teams · Example: Embedding Security at Google · Special Teams: Blue and Red Teams · External Researchers · Conclusion
21. Building a Culture of Security and Reliability
Defining a Healthy Security and Reliability Culture · Culture of Security and Reliability by Default · Culture of Review · Culture of Awareness · Culture of Yes · Culture of Inevitably · Culture of Sustainability · Changing Culture Through Good Practice · Align Project Goals and Participant Incentives · Reduce Fear with Risk-Reduction Mechanisms · Make Safety Nets the Norm · Increase Productivity and Usability · Overcommunicate and Be Transparent · Build Empathy · Convincing Leadership · Understand the Decision-Making Process · Build a Case for Change · Pick Your Battles · Escalations and Problem Resolution · Conclusion
Conclusion
A. A Disaster Risk Assessment Matrix
Index

Overview

Can a system be considered truly reliable if it isn't fundamentally secure? Or can it be considered secure if it's unreliable? Security is crucial to the design and operation of scalable systems in production, as it plays an important part in product quality, performance, and availability. In this book, experts from Google share best practices to help your organization design scalable and reliable systems that are fundamentally secure.

Two previous O’Reilly books from Google—Site Reliability Engineering and The Site Reliability Workbook—demonstrated how and why a commitment to the entire service lifecycle enables organizations to successfully build, deploy, monitor, and maintain software systems. In this latest guide, the authors offer insights into system design, implementation, and maintenance from practitioners who specialize in security and reliability. They also discuss how building and adopting their recommended best practices requires a culture that’s supportive of such change.

You’ll learn about secure and reliable systems through:

Design strategies
Recommendations for coding, testing, and debugging practices
Strategies to prepare for, respond to, and recover from incidents
Cultural best practices that help teams across your organization collaborate effectively

Publisher Resources

ISBN: 9781492083115