Get a basic understanding of site reliability engineering (SRE) and then go deeper with recommended resources.
Achieve high-impact systems monitoring by focusing on latency, errors, throughput, utilization, and blackbox monitoring.
Get advice and insight from speakers who have tackled the challenges you face.
O’Reilly Media Podcast: George Miranda discusses the benefits and challenges of a service mesh, and the best ways to get started using one.
Learn why this new tool is a critical component in microservice-based architectures.
Dave Andrews explains how to wield the power of a global 50 Tbps application delivery network to ensure maximum availability during and after a change.
David Hayes explains why adding a manageable dose of actionable intelligence to your operations management workflow can save you time and aggravation.
Julia Grace shares how she learned to rapidly scale herself and her leadership team during a period of hypergrowth at Slack.
Bryan Liles explains how to evaluate and integrate new declarative application management practices into continuous integration pipelines.
Oracle’s Kyle York and Netra’s Richard Lee discuss Netra’s high-performance computing environment.
Kyle Kingsbury explores anomalies in three distributed systems and shares strategies for correctness testing using Jepsen.
Nicole Forsgren shares results and stories behind high-performing technology-driven teams and organizations.
Javier Garza details the ingredients you need to build and deliver an app your users will love.
Watch highlights covering infrastructure, DevOps, security, and more from the O'Reilly Velocity Conference in San Jose 2018.
Martin Woodward shares key data points from Microsoft's journey to DevOps.
Renee Orser explains how to monitor the human networks within your engineering teams using models similar to your distributed technology systems.
Kris Nova looks at the four metrics that help you decide if running stateful applications in Kubernetes is worth the risk.
Astrid Atkinson discusses techniques for building systems that are resilient by design.
Kyle York explores the scale, complexity, and volatility of the internet and the risk it poses to your applications and infrastructure.
Tamar Bercovici details how the team at Box has constructed its database stack to handle an ever-growing query load and data set.
Natalie Silvanovich discusses the link between feature complexity, developer error, and security vulnerabilities.
Recipes that deal with various aspects of troubleshooting, from debugging pods and containers, to testing service connectivity, interpreting a resource’s status, and node maintenance.
The O'Reilly Velocity Conference in San Jose will cover what you need to know to build high-performance, resilient, and secure systems.
The O’Reilly Fluent and Velocity conferences are teaming up to create a unique learning opportunity that addresses the full web experience.
An outside-the-box exploration of how containers can be used to provide novel solutions.
Systems and site reliability engineers, architects, and application developers must create new strategies to meet industry shifts and their constraints.
This collection of DevOps resources will get you up to speed on the basics, best practices, and latest techniques.
The O’Reilly Podcast: Modern-day DNS for hybrid cloud, intelligent traffic steering, and DevOps.
How edge networks, Kubernetes, serverless, and other trends will shape systems engineering and operations.
Lessons learned from building engineering teams under pressure.
Catherine Mulligan discusses the implications of blockchain on distributed systems and what needs to be addressed to build and maintain these systems effectively.
Mike Strickland says a new approach to data analytics acceleration is delivering benchmarked performance increases of 3X to 10X+ at the system level for traditional relational and NoSQL databases.
Kolton Andrus explores the evolution of chaos engineering and explains why it’s becoming the go-to approach for building resilient systems.
Edge computing is a hot topic, but Tyler McMullen says major hurdles need to be overcome before it reaches its full potential.
Kavya Joshi explores strategies for preparing systems for flux and scale.
Laura Hackney looks at the pitfalls and successes of the movement to bring social justice work into the technology landscape.
Sara-Jane Dunn discusses an entirely different paradigm of computing: the information processing carried out by cells.
Miriah Meyer explores how interactive visualizations can help us find meaning in mounds of data.
Liz Rice considers the questions organizations must answer before going cloud native.
Christopher Meiklejohn is building his startup with Martinelli, a new programming language that provides fault-tolerant, high-scalability operation.
Watch highlights covering DevOps and systems engineering from the O'Reilly Velocity Conference in London 2017.
Guy Podjarny on why open source security is a community responsibility.
Learn how Netflix scales microservices with application data caching.
Learn the core principles of Google site reliability engineering.
For most of us, the best approach to scaling complex distributed systems is to not do that. So, Nick Rockwell asks, why isn’t serverless a bigger deal?
Robert Castley explores the relevance of real-user data if real users are blocking RUM tags, and he shares some solutions.
Craig Adams explores the traditional DevOps pipeline, addresses how to think about CDN automation, and explains how Akamai is baking automation into its CDN.
David Woods and Richard Cook offer a glimpse at the SNAFUcatchers Stella Report.
Lara Hogan walks through tactics you can employ to be a sponsor for those around you.
Jessica Frazelle and Dino Dai Zovi discuss how to be effective at open source in your company.
Bitcoin showed us a new way of moving value around the internet. Neha Narula considers how this paradigm might apply to databases that cross organizational boundaries.
Joe Goldberg explores jobs as code, which looks at batch application automation from a systems development life cycle perspective.
"Do no harm" is a core principle in medicine. Cynthia Savard Saucier challenges the tech industry to come up with its own fundamental principle.
Matt Cutts discusses how better technology can improve not just software systems but also trust in government itself.
Developers spend huge amounts of time fixing bugs in their programs, but what about automatically fixing them? Claire Le Goues shares recent advances that aim to make that dream a reality.
Kristopher Beevers explains how to augment Incident Command with simple tools and processes, such as basic checklists or regular fire drills.
Watch highlights covering complex distributed systems, systems engineering, DevOps, and more from the O'Reilly Velocity Conference in New York 2017.
Carin Meier explores new ways to approach systems and tame complexity.
Rob Claire introduces the monitoring tools Pinterest uses and offers real-world examples of problem solving with data monitoring.
Building confidence in system behavior through experiments