Acing the System Design Interview

Book description

The system design interview is one of the hardest challenges you’ll face in the software engineering hiring process. This practical book gives you the insights, the skills, and the hands-on practice you need to ace the toughest system design interview questions and land the job and salary you want.

In Acing the System Design Interview you will master a structured and organized approach to present system design ideas like:

  • Scaling applications to support heavy traffic
  • Distributed transaction techniques to ensure data consistency
  • Services for functional partitioning such as API gateway and service mesh
  • Common API paradigms including REST, RPC, and GraphQL
  • Caching strategies, including their tradeoffs
  • Logging, monitoring, and alerting concepts that are critical in any system design
  • Communication skills that demonstrate your engineering maturity

Don’t be daunted by the complex, open-ended nature of system design interviews! In this in-depth guide, author Zhiyong Tan shares what he’s learned on both sides of the interview table. You’ll dive deep into the common technical topics that arise during interviews and learn how to apply them to mentally perfect different kinds of systems.

About the Technology
The system design interview is daunting even for seasoned software engineers. Fortunately, with a little careful prep work you can turn those open-ended questions and whiteboard sessions into your competitive advantage! In this powerful book, Zhiyong Tan reveals practical interview techniques and insights about system design that have earned developers job offers from Amazon, Apple, ByteDance, PayPal, and Uber.

About the Book
Acing the System Design Interview is a masterclass in how to confidently nail your next interview. Following these easy-to-remember techniques, you’ll learn to quickly assess a question, identify an advantageous approach, and then communicate your ideas clearly to an interviewer. As you work through this book, you’ll gain the skills not only to interview successfully, but also to do the actual work of great system design.

What's Inside
  • Insights on scaling, transactions, logging, and more
  • Practice questions for core system design concepts
  • How to demonstrate your engineering maturity
  • Great questions to ask your interviewer


About the Reader
For software engineers, software architects, and engineering managers looking to advance their careers.

About the Author
Zhiyong Tan is a manager at PayPal. He has worked at Uber, Teradata, and small startups. Over the years, he has been in many system design interviews, on both sides of the table.

The technical editor on this book was Mohit Kumar.

Quotes
Deconstructs the system design interview and presents each component in an accessible manner for new job seekers as well as seasoned engineers. The attention to detail makes this book a must have.
- Mohammad Shafkat Amin, Meta

Comprehensively covers the most common topics, along with helpful tips and advice. It gives you all the tools you need to ace your next system design interview.
- Rajesh Kumar, Google

The practical advice and real world examples in this book will help you master the art of system design and succeed in your next interview.
- Kevin Goh, PayPal

Table of contents

  1. Inside front cover
  2. Acing the System Design Interview
  3. Copyright
  4. dedication
  5. contents
  6. front matter
    1. foreword
    2. preface
    3. acknowledgments
    4. about this book
      1. Who should read this book
      2. How this book is organized: A roadmap
      3. liveBook discussion forum
      4. Other online resources
    5. about the author
    6. about the cover illustration
  7. Part 1.
  8. 1 A walkthrough of system design concepts
    1. 1.1 A discussion about tradeoffs
    2. 1.2 Should you read this book?
    3. 1.3 Overview of this book
    4. 1.4 Prelude: A brief discussion of scaling the various services of a system
      1. 1.4.1 The beginning: A small initial deployment of our app
      2. 1.4.2 Scaling with GeoDNS
      3. 1.4.3 Adding a caching service
      4. 1.4.4 Content distribution network
      5. 1.4.5 A brief discussion of horizontal scalability and cluster management, continuous integration, and continuous deployment
      6. 1.4.6 Functional partitioning and centralization of cross-cutting concerns
      7. 1.4.7 Batch and streaming extract, transform, and load (ETL)
      8. 1.4.8 Other common services
      9. 1.4.9 Cloud vs. bare metal
      10. 1.4.10 Serverless: Function as a Service (FaaS)
      11. 1.4.11 Conclusion: Scaling backend services
    5. Summary
  9. 2 A typical system design interview flow
    1. 2.1 Clarify requirements and discuss tradeoffs
    2. 2.2 Draft the API specification
      1. 2.2.1 Common API endpoints
    3. 2.3 Connections and processing between users and data
    4. 2.4 Design the data model
      1. 2.4.1 Example of the disadvantages of multiple services sharing databases
      2. 2.4.2 A possible technique to prevent concurrent user update conflicts
    5. 2.5 Logging, monitoring, and alerting
      1. 2.5.1 The importance of monitoring
      2. 2.5.2 Observability
      3. 2.5.3 Responding to alerts
      4. 2.5.4 Application-level logging tools
      5. 2.5.5 Streaming and batch audit of data quality
      6. 2.5.6 Anomaly detection to detect data anomalies
      7. 2.5.7 Silent errors and auditing
      8. 2.5.8 Further reading on observability
    6. 2.6 Search bar
      1. 2.6.1 Introduction
      2. 2.6.2 Search bar implementation with Elasticsearch
      3. 2.6.3 Elasticsearch index and ingestion
      4. 2.6.4 Using Elasticsearch in place of SQL
      5. 2.6.5 Implementing search in our services
      6. 2.6.6 Further reading on search
    7. 2.7 Other discussions
      1. 2.7.1 Maintaining and extending the application
      2. 2.7.2 Supporting other types of users
      3. 2.7.3 Alternative architectural decisions
      4. 2.7.4 Usability and feedback
      5. 2.7.5 Edge cases and new constraints
      6. 2.7.6 Cloud-native concepts
    8. 2.8 Post-interview reflection and assessment
      1. 2.8.1 Write your reflection as soon as possible after the interview
      2. 2.8.2 Writing your assessment
      3. 2.8.3 Details you didn’t mention
      4. 2.8.4 Interview feedback
    9. 2.9 Interviewing the company
    10. Summary
  10. 3 Non-functional requirements
    1. 3.1 Scalability
      1. 3.1.1 Stateless and stateful services
      2. 3.1.2 Basic load balancer concepts
    2. 3.2 Availability
    3. 3.3 Fault-tolerance
      1. 3.3.1 Replication and redundancy
      2. 3.3.2 Forward error correction and error correction code
      3. 3.3.3 Circuit breaker
      4. 3.3.4 Exponential backoff and retry
      5. 3.3.5 Caching responses of other services
      6. 3.3.6 Checkpointing
      7. 3.3.7 Dead letter queue
      8. 3.3.8 Logging and periodic auditing
      9. 3.3.9 Bulkhead
      10. 3.3.10 Fallback pattern
    4. 3.4 Performance/latency and throughput
    5. 3.5 Consistency
      1. 3.5.1 Full mesh
      2. 3.5.2 Coordination service
      3. 3.5.3 Distributed cache
      4. 3.5.4 Gossip protocol
      5. 3.5.5 Random Leader Selection
    6. 3.6 Accuracy
    7. 3.7 Complexity and maintainability
      1. 3.7.1 Continuous deployment (CD)
    8. 3.8 Cost
    9. 3.9 Security
    10. 3.10 Privacy
      1. 3.10.1 External vs. internal services
    11. 3.11 Cloud native
    12. 3.12 Further reading
    13. Summary
  11. 4 Scaling databases
    1. 4.1 Brief prelude on storage services
    2. 4.2 When to use vs. avoid databases
    3. 4.3 Replication
      1. 4.3.1 Distributing replicas
      2. 4.3.2 Single-leader replication
      3. 4.3.3 Multi-leader replication
      4. 4.3.4 Leaderless replication
      5. 4.3.5 HDFS replication
      6. 4.3.6 Further reading
    4. 4.4 Scaling storage capacity with sharded databases
      1. 4.4.1 Sharded RDBMS
    5. 4.5 Aggregating events
      1. 4.5.1 Single-tier aggregation
      2. 4.5.2 Multi-tier aggregation
      3. 4.5.3 Partitioning
      4. 4.5.4 Handling a large key space
      5. 4.5.5 Replication and fault-tolerance
    6. 4.6 Batch and streaming ETL
      1. 4.6.1 A simple batch ETL pipeline
      2. 4.6.2 Messaging terminology
      3. 4.6.3 Kafka vs. RabbitMQ
      4. 4.6.4 Lambda architecture
    7. 4.7 Denormalization
    8. 4.8 Caching
      1. 4.8.1 Read strategies
      2. 4.8.2 Write strategies
    9. 4.9 Caching as a separate service
    10. 4.10 Examples of different kinds of data to cache and how to cache them
    11. 4.11 Cache invalidation
      1. 4.11.1 Browser cache invalidation
      2. 4.11.2 Cache invalidation in caching services
    12. 4.12 Cache warming
    13. 4.13 Further reading
      1. 4.13.1 Caching references
    14. Summary
  12. 5 Distributed transactions
    1. 5.1 Event Driven Architecture (EDA)
    2. 5.2 Event sourcing
    3. 5.3 Change Data Capture (CDC)
    4. 5.4 Comparison of event sourcing and CDC
    5. 5.5 Transaction supervisor
    6. 5.6 Saga
      1. 5.6.1 Choreography
      2. 5.6.2 Orchestration
      3. 5.6.3 Comparison
    7. 5.7 Other transaction types
    8. 5.8 Further reading
    9. Summary
  13. 6 Common services for functional partitioning
    1. 6.1 Common functionalities of various services
      1. 6.1.1 Security
      2. 6.1.2 Error-checking
      3. 6.1.3 Performance and availability
      4. 6.1.4 Logging and analytics
    2. 6.2 Service mesh/sidecar pattern
    3. 6.3 Metadata service
    4. 6.4 Service discovery
    5. 6.5 Functional partitioning and various frameworks
      1. 6.5.1 Basic system design of an app
      2. 6.5.2 Purposes of a web server app
      3. 6.5.3 Web and mobile frameworks
    6. 6.6 Library vs. service
      1. 6.6.1 Language specific vs. technology-agnostic
      2. 6.6.2 Predictability of latency
      3. 6.6.3 Predictability and reproducibility of behavior
      4. 6.6.4 Scaling considerations for libraries
      5. 6.6.5 Other considerations
    7. 6.7 Common API paradigms
      1. 6.7.1 The Open Systems Interconnection (OSI) model
      2. 6.7.2 REST
      3. 6.7.3 RPC (Remote Procedure Call)
      4. 6.7.4 GraphQL
      5. 6.7.5 WebSocket
      6. 6.7.6 Comparison
    8. Summary
  14. Part 2.
  15. 7 Design Craigslist
    1. 7.1 User stories and requirements
    2. 7.2 API
    3. 7.3 SQL database schema
    4. 7.4 Initial high-level architecture
    5. 7.5 A monolith architecture
    6. 7.6 Using an SQL database and object store
    7. 7.7 Migrations are troublesome
    8. 7.8 Writing and reading posts
    9. 7.9 Functional partitioning
    10. 7.10 Caching
    11. 7.11 CDN
    12. 7.12 Scaling reads with a SQL cluster
    13. 7.13 Scaling write throughput
    14. 7.14 Email service
    15. 7.15 Search
    16. 7.16 Removing old posts
    17. 7.17 Monitoring and alerting
    18. 7.18 Summary of our architecture discussion so far
    19. 7.19 Other possible discussion topics
      1. 7.19.1 Reporting posts
      2. 7.19.2 Graceful degradation
      3. 7.19.3 Complexity
      4. 7.19.4 Item categories/tags
      5. 7.19.5 Analytics and recommendations
      6. 7.19.6 A/B testing
      7. 7.19.7 Subscriptions and saved searches
      8. 7.19.8 Allow duplicate requests to the search service
      9. 7.19.9 Avoid duplicate requests to the search service
      10. 7.19.10 Rate limiting
      11. 7.19.11 Large number of posts
      12. 7.19.12 Local regulations
    20. Summary
  16. 8 Design a rate-limiting service
    1. 8.1 Alternatives to a rate-limiting service and why they are infeasible
    2. 8.2 When not to do rate limiting
    3. 8.3 Functional requirements
    4. 8.4 Non-functional requirements
      1. 8.4.1 Scalability
      2. 8.4.2 Performance
      3. 8.4.3 Complexity
      4. 8.4.4 Security and privacy
      5. 8.4.5 Availability and fault-tolerance
      6. 8.4.6 Accuracy
      7. 8.4.7 Consistency
    5. 8.5 Discuss user stories and required service components
    6. 8.6 High-level architecture
    7. 8.7 Stateful approach/sharding
    8. 8.8 Storing all counts in every host
      1. 8.8.1 High-level architecture
      2. 8.8.2 Synchronizing counts
    9. 8.9 Rate-limiting algorithms
      1. 8.9.1 Token bucket
      2. 8.9.2 Leaky bucket
      3. 8.9.3 Fixed window counter
      4. 8.9.4 Sliding window log
      5. 8.9.5 Sliding window counter
    10. 8.10 Employing a sidecar pattern
    11. 8.11 Logging, monitoring, and alerting
    12. 8.12 Providing functionality in a client library
    13. 8.13 Further reading
    14. Summary
  17. 9 Design a notification/alerting service
    1. 9.1 Functional requirements
      1. 9.1.1 Not for uptime monitoring
      2. 9.1.2 Users and data
      3. 9.1.3 Recipient channels
      4. 9.1.4 Templates
      5. 9.1.5 Trigger conditions
      6. 9.1.6 Manage subscribers, sender groups, and recipient groups
      7. 9.1.7 User features
      8. 9.1.8 Analytics
    2. 9.2 Non-functional requirements
    3. 9.3 Initial high-level architecture
    4. 9.4 Object store: Configuring and sending notifications
    5. 9.5 Notification templates
      1. 9.5.1 Notification template service
      2. 9.5.2 Additional features
    6. 9.6 Scheduled notifications
    7. 9.7 Notification addressee groups
    8. 9.8 Unsubscribe requests
    9. 9.9 Handling failed deliveries
    10. 9.10 Client-side considerations regarding duplicate notifications
    11. 9.11 Priority
    12. 9.12 Search
    13. 9.13 Monitoring and alerting
    14. 9.14 Availability monitoring and alerting on the notification/alerting service
    15. 9.15 Other possible discussion topics
    16. 9.16 Final notes
    17. Summary
  18. 10 Design a database batch auditing service
    1. 10.1 Why is auditing necessary?
    2. 10.2 Defining a validation with a conditional statement on a SQL query’s result
    3. 10.3 A simple SQL batch auditing service
      1. 10.3.1 An audit script
      2. 10.3.2 An audit service
    4. 10.4 Requirements
    5. 10.5 High-level architecture
      1. 10.5.1 Running a batch auditing job
      2. 10.5.2 Handling alerts
    6. 10.6 Constraints on database queries
      1. 10.6.1 Limit query execution time
      2. 10.6.2 Check the query strings before submission
      3. 10.6.3 Users should be trained early
    7. 10.7 Prevent too many simultaneous queries
    8. 10.8 Other users of database schema metadata
    9. 10.9 Auditing a data pipeline
    10. 10.10 Logging, monitoring, and alerting
    11. 10.11 Other possible types of audits
      1. 10.11.1 Cross data center consistency audits
      2. 10.11.2 Compare upstream and downstream data
    12. 10.12 Other possible discussion topics
    13. 10.13 References
    14. Summary
  19. 11 Autocomplete/typeahead
    1. 11.1 Possible uses of autocomplete
    2. 11.2 Search vs. autocomplete
    3. 11.3 Functional requirements
      1. 11.3.1 Scope of our autocomplete service
      2. 11.3.2 Some UX details
      3. 11.3.3 Considering search history
      4. 11.3.4 Content moderation and fairness
    4. 11.4 Non-functional requirements
    5. 11.5 Planning the high-level architecture
    6. 11.6 Weighted trie approach and initial high-level architecture
    7. 11.7 Detailed implementation
      1. 11.7.1 Each step should be an independent task
      2. 11.7.2 Fetch relevant logs from Elasticsearch to HDFS
      3. 11.7.3 Split the search strings into words and other simple operations
      4. 11.7.4 Filter out inappropriate words
      5. 11.7.5 Fuzzy matching and spelling correction
      6. 11.7.6 Count the words
      7. 11.7.7 Filter for appropriate words
      8. 11.7.8 Managing new popular unknown words
      9. 11.7.9 Generate and deliver the weighted trie
    8. 11.8 Sampling approach
    9. 11.9 Handling storage requirements
    10. 11.10 Handling phrases instead of single words
      1. 11.10.1 Maximum length of autocomplete suggestions
      2. 11.10.2 Preventing inappropriate suggestions
    11. 11.11 Logging, monitoring, and alerting
    12. 11.12 Other considerations and further discussion
    13. Summary
  20. 12 Design Flickr
    1. 12.1 User stories and functional requirements
    2. 12.2 Non-functional requirements
    3. 12.3 High-level architecture
    4. 12.4 SQL schema
    5. 12.5 Organizing directories and files on the CDN
    6. 12.6 Uploading a photo
      1. 12.6.1 Generate thumbnails on the client
      2. 12.6.2 Generate thumbnails on the backend
      3. 12.6.3 Implementing both server-side and client-side generation
    7. 12.7 Downloading images and data
      1. 12.7.1 Downloading pages of thumbnails
    8. 12.8 Monitoring and alerting
    9. 12.9 Some other services
      1. 12.9.1 Premium features
      2. 12.9.2 Payments and taxes service
      3. 12.9.3 Censorship/content moderation
      4. 12.9.4 Advertising
      5. 12.9.5 Personalization
    10. 12.10 Other possible discussion topics
    11. Summary
  21. 13 Design a Content Distribution Network
    1. 13.1 Advantages and disadvantages of a CDN
      1. 13.1.1 Advantages of using a CDN
      2. 13.1.2 Disadvantages of using a CDN
      3. 13.1.3 Example of an unexpected problem from using a CDN to serve images
    2. 13.2 Requirements
    3. 13.3 CDN authentication and authorization
      1. 13.3.1 Steps in CDN authentication and authorization
      2. 13.3.2 Key rotation
    4. 13.4 High-level architecture
    5. 13.5 Storage service
      1. 13.5.1 In-cluster
      2. 13.5.2 Out-cluster
      3. 13.5.3 Evaluation
    6. 13.6 Common operations
      1. 13.6.1 Reads: Downloads
      2. 13.6.2 Writes: Directory creation, file upload, and file deletion
    7. 13.7 Cache invalidation
    8. 13.8 Logging, monitoring, and alerting
    9. 13.9 Other possible discussions on downloading media files
    10. Summary
  22. 14 Design a text messaging app
    1. 14.1 Requirements
    2. 14.2 Initial thoughts
    3. 14.3 Initial high-level design
    4. 14.4 Connection service
      1. 14.4.1 Making connections
      2. 14.4.2 Sender blocking
    5. 14.5 Sender service
      1. 14.5.1 Sending a message
      2. 14.5.2 Other discussions
    6. 14.6 Message service
    7. 14.7 Message-sending service
      1. 14.7.1 Introduction
      2. 14.7.2 High-level architecture
      3. 14.7.3 Steps in sending a message
      4. 14.7.4 Some questions
      5. 14.7.5 Improving availability
    8. 14.8 Search
    9. 14.9 Logging, monitoring, and alerting
    10. 14.10 Other possible discussion topics
    11. Summary
  23. 15 Design Airbnb
    1. 15.1 Requirements
    2. 15.2 Design decisions
      1. 15.2.1 Replication
      2. 15.2.2 Data models for room availability
      3. 15.2.3 Handling overlapping bookings
      4. 15.2.4 Randomize search results
      5. 15.2.5 Lock rooms during booking flow
    3. 15.3 High-level architecture
    4. 15.4 Functional partitioning
    5. 15.5 Create or update a listing
    6. 15.6 Approval service
    7. 15.7 Booking service
    8. 15.8 Availability service
    9. 15.9 Logging, monitoring, and alerting
    10. 15.10 Other possible discussion topics
      1. 15.10.1 Handling regulations
    11. Summary
  24. 16 Design a news feed
    1. 16.1 Requirements
    2. 16.2 High-level architecture
    3. 16.3 Prepare feed in advance
    4. 16.4 Validation and content moderation
      1. 16.4.1 Changing posts on users’ devices
      2. 16.4.2 Tagging posts
      3. 16.4.3 Moderation service
    5. 16.5 Logging, monitoring, and alerting
      1. 16.5.1 Serving images as well as text
      2. 16.5.2 High-level architecture
    6. 16.6 Other possible discussion topics
    7. Summary
  25. 17 Design a dashboard of top 10 products on Amazon by sales volume
    1. 17.1 Requirements
    2. 17.2 Initial thoughts
    3. 17.3 Initial high-level architecture
    4. 17.4 Aggregation service
      1. 17.4.1 Aggregating by product ID
      2. 17.4.2 Matching host IDs and product IDs
      3. 17.4.3 Storing timestamps
      4. 17.4.4 Aggregation process on a host
    5. 17.5 Batch pipeline
    6. 17.6 Streaming pipeline
      1. 17.6.1 Hash table and max-heap with a single host
      2. 17.6.2 Horizontal scaling to multiple hosts and multi-tier aggregation
    7. 17.7 Approximation
      1. 17.7.1 Count-min sketch
    8. 17.8 Dashboard with Lambda architecture
    9. 17.9 Kappa architecture approach
      1. 17.9.1 Lambda vs. Kappa architecture
      2. 17.9.2 Kappa architecture for our dashboard
    10. 17.10 Logging, monitoring, and alerting
    11. 17.11 Other possible discussion topics
    12. 17.12 References
    13. Summary
  26. Appendix A. Monoliths vs. microservices
    1. A.1 Advantages of monoliths
    2. A.2 Disadvantages of monoliths
    3. A.3 Advantages of services
      1. A.3.1 Agile and rapid development and scaling of product requirements and business functionalities
      2. A.3.2 Modularity and replaceability
      3. A.3.3 Failure isolation and fault-tolerance
      4. A.3.4 Ownership and organizational structure
    4. A.4 Disadvantages of services
      1. A.4.1 Duplicate components
      2. A.4.2 Development and maintenance costs of additional components
      3. A.4.3 Distributed transactions
      4. A.4.4 Referential integrity
      5. A.4.5 Coordinating feature development and deployments that span multiple services
      6. A.4.6 Interfaces
    5. A.5 References
  27. Appendix B. OAuth 2.0 authorization and OpenID Connect authentication
    1. B.1 Authorization vs. authentication
    2. B.2 Prelude: Simple login, cookie-based authentication
    3. B.3 Single sign-on
    4. B.4 Disadvantages of simple login
      1. B.4.1 Complexity and lack of maintainability
      2. B.4.2 No partial authorization
    5. B.5 OAuth 2.0 flow
      1. B.5.1 OAuth 2.0 terminology
      2. B.5.2 Initial client setup
      3. B.5.3 Back channel and front channel
    6. B.6 Other OAuth 2.0 flows
    7. B.7 OpenID Connect authentication
  28. Appendix C. C4 Model
  29. Appendix D. Two-phase commit (2PC)
  30. index

Product information

  • Title: Acing the System Design Interview
  • Author(s): Zhiyong Tan
  • Release date: January 2024
  • Publisher(s): Manning Publications
  • ISBN: 9781633439108