Chapter 4. Building Great Platform Teams

It’s not DNS.

There’s no way it’s DNS.

It was DNS.

SSBroski

We start with a quote about DNS because it’s a fundamental system most platforms depend on, and at close to 40 years old it seems it should be well understood by now. Yet, the point of the quote is that DNS still causes complex failures regularly, and it takes expertise not just to debug them, but also to avoid more DNS issues in the future.1 This is also the challenge with staffing platform teams: by developing abstractions over complex systems, these teams enable productivity for your users, but if they aren’t staffed with experts who understand those systems, they will create operational problems down the road.

We laid out the systems engineer (administrator) versus software engineer (developer) dichotomy in Chapter 1, as we think it is essential to understanding the challenges of not just building a platform engineering team, but creating the right team culture. It’s tempting to insist that building a good team is just a matter of finding people who are skilled at both. And yes, as much as possible, you should seek out engineers who are strong software developers as well as capable of understanding complex systems they didn’t develop. But no one is good at everything, so the way you build great platform teams is by hiring people with diverse strengths and creating a culture where each is enabled to succeed.

In this chapter, we give you the tools to do so. First, we’ll explore the behaviors and challenges of teams that are focused on only one side of the systems versus software divide. Then we’ll introduce the four major roles for engineers on a platform team and cover how you build processes for hiring and recognition to accommodate all four roles. We’ll look at the characteristics of great platform-engineering managers and at some special-purpose roles that collaborate closely with platform teams. Finally, we’ll finish with some guidance on how to use all this to create a great team culture.

Getting your team set up right is as important as getting DNS configured correctly; while one bad hire probably won’t take down your entire platform, you need to spend time implementing this foundation in order to achieve long-term success.

The Risks of Single-Focus Platform Teams

As we covered in Chapter 1, when the responsibility for platforms is assigned to teams with a narrow focus or skill set, the result for a company tends to be that not enough platforms get created, leading to the “over-general swamp.” Before we present a solution, we want to paint a better picture of the problem. The types of teams we describe here embody outcomes at the extreme; we’re not talking about all engineers or teams, but showing the consequences of how staffing around a single focus leads to a culture that doesn’t deliver great platforms, and will also struggle to change.

Too Much Systems Focus

In this scenario, you have a team that is heavily populated by people who came up through infrastructure, DevOps, SRE, and systems engineering roles. Your team members usually have computer science or software engineering degrees, but few have written significant amounts of code within large software systems.

What they do well

These teams are great operationally. They know that the platform powers the business and take pride in that. They know the operation of their systems inside and out, including the underlying systems. When the Asia region experiences downtime at 2 a.m., not only is the US-based team on top of it, but their leadership team is awake and ready to help too.

The next day, not only will they be done with the incident review, but they will have a quick mitigation in production already and will be planning more hands-on work to make the longer-term fixes. They’ll grumble and blame leadership for letting the “wrong” thing get built, but they’ll still get it done. You can set your clock by them: they are reliable and hardworking, and they pay great attention to operational detail.

What they do badly

The code this team actively develops is mostly automation, templating, and one-off tools. They aren’t doing much to build better platform abstractions to manage complexity, or working on a better architecture to solve operational problems for good.

Faced with the flaws of a system they can’t change, they reach for rules and processes, often cataloged in meticulous wikis. Of course, users constantly run afoul of these rules, to the eternal frustration of both sides. To make any progress, the team management heavily leverages project managers, harassing customers’ engineers to do one-off work that the platform team is incapable of streamlining.

Why they are stuck

Among both the leadership and the engineers, these teams tend to have a strong bias toward hiring experienced systems people who are already strong operators, steeped in directly relevant system knowledge. Their interviews, especially for senior-level candidates, emphasize the kinds of details found deep in books and manual pages. They might as well hang a sign over the interview room door saying “no software engineers need apply.”

They justify this by arguing “that’s what we need operationally,” “we can’t afford to train anyone with this workload,” and sometimes, “what type of software engineer would be happy on this team?” The problem is that their technical filter becomes a cultural filter, and generalist senior software engineers—the type who could build better abstractions to shift the burden of operational load—stay away. Often the only software engineers such a team can hire are recent graduates, who tend to move on after a year or two due to a lack of mentorship. This personnel churn reinforces the team’s belief that their culture isn’t the problem; the problem is that their systems are just not appealing to software engineers.

Too Much Development Focus

In this scenario, the team is full of people who have spent their careers as software engineers and software engineering managers. They like to write code—lots of code. They probably have degrees in computer science and have been writing software for a long time, but usually very little of that experience involved developing platforms or infrastructure.

What they do well

These teams are builders. They nerd out on their platform architectures and technologies and think big about what comes next. They love to talk about “golden paths,” “vNext,” and “next-gen” platforms that will surely fix all the flaws of the current platform. And they’re not wrong—at least, not in a reality where they had infinite time to deliver it.

What they do badly

This type of team reflects an adage among software engineers: that technical debt is any code some other engineers wrote. They get frustrated by any work that isn’t building a newer, better system, and they view any effort expended on improving the old system as “throwaway” work, as it distracts from building the new system faster. What they overlook is the frustration and hampered productivity of the current system’s users. This is made worse when, like most software engineers, they are too optimistic in their project estimates.

In the meantime, they treat the current in-production platform as what systems consultant Carla Geisser calls “haunted graveyards”: curiosities to be carefully poked, not systems to be understood. This leads to operational problems, eventually causing a negative business impact. That 2 a.m. downtime issue? You’ll probably have to page the on-call engineer a couple of times before they respond, and if you try the manager, they’ll likely be upset that you thought waking them would solve any problem.

Why they are stuck

In the last 20 years, the industry has come to view delivering new code as the most valuable thing software engineers can spend time on, and has heavily tied it to compensation and promotions. This bias has spread to software engineering managers, since they got where they are by writing lots of code. So, for instance, they cannot imagine hiring anyone with the title “software engineer” who can’t solve toy algorithm problems on a whiteboard in 30 minutes, regardless of their other, more practical abilities.

Some managers may grant that such an engineer does actually have value, but even in that case, they’ll probably find it hard to believe they should be on the same team as their “real” software engineers. They will insist that there must be some other title for the work, some other team they should be on where they aren’t distracting those undertaking the important work of developing software. In fact, ideally they would just be doing whatever is possible to make that software development go faster—per the split SRE/DevOps model, “taking on all the operations load, support, toil, and automation grunt work that distracts my software engineers from creating code.” The problem is that unless the platform is massive, nobody wants that job.

The Different Roles of Platform Engineers

To unstick teams that are too focused on software or systems, you need to equally value both types of work, which usually means adding new roles to the team. This requires understanding what value each role brings. That’s our goal in this section: even if you’re not a manager, we want to help you better appreciate how your role relates to those of your coworkers.

The first step is realizing that the old software versus systems split, which does a good job of explaining how individuals’ focuses may differ, does a poor job of illustrating how their roles relate. On the software side, the problem is that the term “software engineer” is used to describe a broad variety of roles outside of platform teams, and so misses the fact that various aspects of this role are different on a platform team—and these are differences you need to be aware of, hire for, and recognize.

Things are more complex on the systems side, where there are a plethora of roles and titles. While a lot of the work between them is duplicative, there are also specializations not just in skill set but also in culture. To simplify, we believe that three major systems-focused roles are needed in a platform team:

Systems engineer

A true systems generalist, which many would call a DevOps engineer, although many other names are used across the industry

Reliability engineer

Someone who has deeply focused their role on reliability, ignoring other facets of systems engineering

Systems specialist

This could encompass many specific roles based on specific deep expertise—for example, Linux engineer, performance engineer, network development engineer

We show how these all relate in terms of team composition in Figure 4-1. In the following subsections, we’ll take a closer look at each of these four roles in turn.

Figure 4-1. Breaking down the major engineering roles in a platform engineering team

Software Engineers

Your software engineers are going to be people who can and will write a lot of code. On most platform teams, these will be “backend” engineers—people who are used to writing server-side code—although you may have some “frontend” engineers as well. In successful platform teams, most of the software engineers, particularly the more senior ones, are a bit different from the generalist “backend” software engineers found in application teams:

They are drawn to understanding systems.

They have a strong desire to understand the interaction of their code and the systems the code runs on top of. They’re not just interested in completing a feature for an end user, but think carefully about how their code fits into the ecosystem of software, hardware, and networking that it runs within, seeking a deeper understanding of the browser, the operating system, distributed systems, databases/storage, or whatever else is relevant.

You can spot these engineers because they’re the ones who want to read the code for the libraries they depend on, and who are curious about the failure patterns that happen at the edges of applications. They want to think about the system more broadly than the feature they are implementing, and they are willing to figure out not only how to code the system but how to operate it and support it, because how can they know if the code makes sense if they don’t understand how it actually runs in production?

They are comfortable being on call for business-critical systems.

This matters because, as we will cover in Chapter 6, most platforms are staffed at a level where they need every expert to be part of an on-call rotation, handling events of business-critical impact with ambiguous causes. None of us loves being on call, but the important thing here is not just a willingness to do it, but also the ability to be on call and respond effectively. We have found that many software engineers love to pore over the details of systems, but as an intellectual matter at design and coding time, not as something they need to draw on practically at 2 a.m. In our experience, there are a variety of causes for this—not strong on Unix skills, not strong on communication, uncomfortable working under time pressure—but whatever the reason, those people are going to struggle on most platform teams.

Within your company, you will find the people you’re looking for by watching what happens during incidents: even when they’re not paged, they will engage and make the incident remediation much faster. In the behavioral interview, you should ask about the largest incident the candidate has been involved in remediating and probe to be sure they were a significant part of it being resolved.

They are comfortable shipping at a deliberate pace.

You can find plenty of brilliant people who love system details and want free rein to write code to fix user problems ASAP. But while your platform software engineers will be writing code, they will also spend time on operations, integrations, and experimentation. Furthermore, there are large costs to mistakes in platforms, both in terms of operational risk and the risk of getting stuck supporting features that are much more expensive to maintain than they were to create.

An engineer who is strongly motivated by novelty and fast-turnaround feature building likely will not be a great fit for a mature platform team. (In Chapter 8, we will talk about how such “pioneers” are a much better fit in early-stage platforms, when your platform team needs to partner with product teams and iterate quickly.)

Systems Engineers

Roles around “systems” can be specialized in terms of breadth or depth. We’ll cover the depth form in the following sections, on reliability engineers and systems specialists. In our experience, however, a broad systems engineer is the more common role on successful platform teams, so we’ll start with that.

Almost all platforms benefit from the presence on the team of someone who, while more focused on understanding systems than writing software, uses that focus broadly, to understand more than one speciality. True, they won’t be world experts in performance, Linux, or networking. But because of their motivation to understand the intricacies of how different types of systems work and come together, they will know a lot more than most of your software engineers, and they’ll be a lot more motivated to do work that involves manipulating systems despite those intricacies.

But what do they do? Well, mainly stuff covered in the SRE and DevOps literature: lots of automation, particularly for infrastructure integration, scaling, reliability, and observability configuration. But it doesn’t stop at automation and configuration. Their broad systems knowledge can be put to use for building platform features as well, specifically those arcane aspects that take a lot of knowledge to get right. This is why, even though there is a lot of crossover, we prefer not to use the name “DevOps engineer” or “SRE” for this role. When you have adopted a platform engineering culture, everyone on the team should be thinking of their work in terms of features of the platform product, not just its automation or reliability.

Where systems engineers tend to shine the most is in using their knowledge to resolve deep systems issues that involve both the platform codebase and the underlying dependencies. A lot of these issues are operational—we have seen cases where software engineers left issues languishing for months, lacking the knowledge to make progress, until a systems engineer came along to help. However, focusing on operational debugging is seeing the value of the role too narrowly. We’ve also seen a systems engineer rescue a launch deadline by spotting an easy optimization in OSS configuration, whereas our software engineers were telling us that fixing the issue would take months of rewriting code.

Why recognize and hire for a broad systems engineer role rather than pushing for specialists in particular elements of the system (say, the OSS or cloud vendor you’re using)? There are three reasons:

  • Specialization tends to take a long time to achieve, making it hard to hire systems engineers who aren’t already at “senior” levels of experience.

  • Strong systems engineers will feel the need to specialize in order to get promoted, and the team will lose the breadth that is so important to their contributions.

  • You don’t want too many specialists, but you need broad systems engineers.

While systems engineers’ expertise should increase in depth over time in certain areas, based on what they have worked on, it is a mistake to push them into that specialization as a career unless your company really needs that. So, with that in mind, let’s turn to the specialists.

Reliability Engineers

Given its popularity in some parts of the industry today, we expect some readers to believe the “SRE” role captures all the systems work that is not feature-based “software development.” There are a few problems with that thinking. First, Google itself split the SRE job into two, differentiating between software engineers and systems engineers, with both sharing the same culture but filling different roles. Second, as we highlighted in Chapter 1, the role of “DevOps engineer” covers a lot of the same skills, although with a very different culture. This comes back to the fact that naming matters. A lot of people, including some SREs, think that “reliability” is all the job should be about, whereas the general systems engineer role also encompasses support, efficiency, security, performance, and even adding features.

So, when we use the term “reliability engineer,” we mean those who want their role to be focused on that, versus more general responsibilities. Now, that’s not to say the role is unimportant. Many great SRE practices work much better when led by passionate practicing engineers. These engineers excel at high-impact incident management, consulting on service level objectives (SLOs) and chaos engineering, and running game days, production readiness reviews, rigorous incident postmortems, and even weekly operational meetings. They have the technical depth to know what matters and the skill set to implement a lot of the technical stuff behind it, and so they drive reliability across the organization’s systems.

In theory, other engineers could do all this on a part-time basis. But in our experience, few are willing. The type of people who are motivated to do this are systems thinkers, those for whom the social dimension of a solution makes it even more motivating. They tend to ask, “How do we make everyone a little bit better?”

Take, for example, incident management. In an organization with teams doing their own on-call rotations, SREs as incident managers can ensure that thematic issues don’t fall through the cracks. An incident can span multiple systems, and even if your team owns all of those systems, it’s very easy for each person or subteam to think only about their portion of the issue. You need someone to track incidents, alert senior management about unresolved challenges, and plan and implement ways to remediate those challenges. This is best done by someone who loves to focus on big broken things and has the patience to see a project through to completion.

People like this can often be found on platform engineering teams, working with highly complex technical systems. Usually they start doing this work part-time, for their own teams. If you want to scale their work up and give them a broader mandate, we have found they typically need to be in a focused team that stays close to the platform engineering team—something like a core reliability team. Coworkers sometimes see people in this role as “talkers who have never done it,” which takes away from their impact. We recommend rotating reliability specialists in and out of platform teams to keep their skills current.

Systems Specialists

Cloud networking engineer. Kernel engineer. Performance engineer. Storage engineer. At a certain scale, an engineer with depth in any of these roles can be game-changing. The best specialist can level up an entire organization’s practices around their speciality, educating even as they do hands-on work. But it’s a mistake to think that your platform team can just be a combination of software engineers and systems specialists, and we encourage you to wait to hire them until the need is clear. When you do have a need, keep the bar high, and avoid hiring more than a few specialists until you can clearly see the positive impact of your first round of new hires.

It takes a fairly large organization to not just want to employ such specialists full time, but also to give them problems that fully interest and utilize them. If a big part of your platform’s offering revolves around networking management, for example, all of your engineering team should understand the network. But it’s easy to take this too far and end up with a bunch of people who are too focused on implementing the state-of-the-art ideas of their specialty and not focused enough on building the thing you need. We once witnessed a developer tools team made up of version control experts: instead of focusing on the lack of user-friendly tooling, they spent all their time refining interfaces to the version control system. Such work can be important, but when it isn’t the current problem, an overspecialized team can end up ignoring the bigger picture of what’s needed.

Another way we have seen this play out is specialists refusing more general work. Instead, they want a role we call “specialist as internal evangelist.” They imagine spending their time contributing to open source projects of no immediate value to the company, speaking on the conference circuit, researching obscure new offerings, and perhaps running nice-to-have internal programs aligned with their speciality. We encourage these activities for all engineers in moderation, but evangelism is a full-time role for SaaS vendors—when an engineer tries to make a full-time role of it internally without having much to show for their expertise, they tend to struggle for credibility, and so usually undermine the ideas they are trying to spread.

Hiring and Recognizing Engineers in All Roles

We understand if you’re feeling a little confused after the last section. Are you looking for one role, two roles, four roles, or even more? There is no single answer, because it depends not just on your needs as a platform engineering team, but also your company’s job families and hiring processes—what they are today, and how much flexibility there is to change them.

These days, most of the tech industry’s hiring and promotion processes greatly favor software engineers who ship a lot of new code to production, because their impact on the organization is seen as easy to evaluate. Since their systems are slower to change, software engineers working on platforms can struggle to get recognized in company-wide processes, and the three other roles fare even worse. The organization ends up rejecting talented individuals the way a body rejects a transplanted organ, usually by hiring them into too-junior positions or refusing to promote them, all because their impact is harder to evaluate than shipped software.

We’ve had mixed success in dealing with this challenge, for reasons that involve company culture, our positions within the hierarchy, and the CTO’s appetite for making and communicating process changes. What we’ve learned is that success is about positive marginal change—making a case for incremental changes based on individual cases, while building evidence for bigger changes in the long run.

With this experience in mind, the next sections lay out our best practices, which we summarize in Table 4-1.

Table 4-1. Engineering roles and best practices in platform engineering teams

| Role | Title | Interview process | Job family/level matrix |
|---|---|---|---|
| Software engineer | Prefer “software engineer.” Allow “platform software engineer” only if unavoidable. | Custom behavioral interview to cover fit with the platform engineering team. | Common to the company-wide role. |
| Systems engineer | Allow specialized, such as DevOps engineer. | Same as for software engineer, but more flexibility in the coding interview. Design questions should cover the candidate’s systems breadth. | Common across the three roles. Net impact is the same as software engineer. Differences emphasize impact created less by writing code, and more by exercising distinct knowledge, skills, and practices. |
| Reliability engineer | Allow specialized, such as SRE. | Same as for systems engineer, but design questions should cover the candidate’s depth in SRE. | Shared with systems engineer (see that row). |
| Systems specialist (splits into many roles) | Allow specialized and per-role, such as kernel engineer, performance engineer, and storage engineer. | Same as for software engineer, but design questions should cover the candidate’s depth in their system speciality. | Shared with systems engineer (see that row). |

Allow Role-Specific Titles

The preceding table breaks out three different facets of a role: its title, its level matrix (usually called a “job family”), and its interview process. We have seen people with a systemizing bent want to systemically link these together, insisting that everyone in positions that use a certain level matrix must have the same interview process and must have the same job title. What could be more simple? And it’s so rational!

The problem is that a job title indicates someone’s specific role, both to fellow employees and to external stakeholders. There is a personal aspect to it, especially when someone has built their career around a specialization. Forcing, say, your first kernel engineer to be called an SRE because that’s the level matrix you will use not only won’t make sense to their peers but also demeans their depth of experience, all while introducing a feeling of bureaucratic rigidity.

While we believe it’s fine to allow role-specific titles, we definitely don’t embrace the other extreme of everyone getting to choose their own title, as that will confuse people too. Creating a new title should be done only for good reasons, in recognition of substantial differences in the new role. Crucially, this does not need to be coupled with immediately creating a new level matrix or interview process (we’ll look at when this might be required in the following sections).

Avoid Creating a New Software Engineer Level Matrix

Standard software engineering job descriptions heavily emphasize creating new code, systems, and architectures. Platform software engineers are often ill-served by these definitions, both in the interview process and in the criteria used to evaluate their performance and readiness for promotion. They do all these things, but the business criticality of the mature systems they work on means they do them more slowly. Many organizations respond to this issue by seeking to create a “platform software engineer” level matrix, recognizing a platform as a substantially different type of system whose successful development requires different skills, and so whose practitioners should be evaluated differently.

However, this problem exists for other specialized software development roles as well. For example, data engineers, mobile engineers, and frontend engineers all write software and create new systems (just different types of systems). So, should they get their own set of job levels, too? It’s tempting to say yes, but once you’ve written and launched a few job ladders, you realize that they are expensive to create and even more expensive to maintain. There’s a technical analogy in the trade-offs of forking code to support a new use case versus generalizing it to support both. What looks cheap initially (forking the code/ladder) becomes a long-term maintenance burden, particularly in the presence of many similar forks.
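For readers who want the code side of that analogy spelled out, here is a minimal sketch. Every name in it is hypothetical and invented purely for illustration; it is not drawn from any real codebase. The two forked helpers start out nearly identical and must each be patched whenever the shared logic changes, while the generalized helper keeps a single code path and expresses the difference as a parameter.

```python
# Hypothetical illustration of "fork versus generalize"; all names are invented.

# Forked approach: each use case gets its own copy. Any later fix to the
# shared logic (for example, the validation) must be repeated in every copy.
def format_service_name_web(name: str) -> str:
    if not name:
        raise ValueError("name must not be empty")
    return f"web-{name.strip().lower()}"


def format_service_name_batch(name: str) -> str:
    if not name:
        raise ValueError("name must not be empty")
    return f"batch-{name.strip().lower()}"


# Generalized approach: one code path, with the use case passed as a parameter.
def format_service_name(name: str, kind: str) -> str:
    if not name:
        raise ValueError("name must not be empty")
    return f"{kind}-{name.strip().lower()}"
```

With two forks the duplication looks harmless; the cost shows up as the number of forks grows and each one accumulates its own small differences.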

Because all of these roles are primarily about software development, we have found the sanest path is keeping all of them together on a single ladder. To make this work, you will need to specify level criteria in terms of outcomes achieved, as opposed to overly relying on methods used. This can take time and iteration to get right.

In the meantime, what should you do when you have a great platform software engineer but can’t get them promoted? We’ve found that you’ll be much better off stretching within the system, by finding people outside of platform engineering, at the next level up, who can attest that “this person’s impact is just as high as mine.” In fact, bringing such cases forward usually spurs the organization to adjust its level criteria. To support a case for promotion, Diego Quiroga, principal software engineering manager at Microsoft, suggests providing evidence of some of the following:

  • Tools, dashboards, or wikis the engineer has created (particularly those that are widely adopted within the team or organization)

  • The quality of their customer interactions, including clarity, technical depth, and responsiveness

  • Their contribution to handling and resolving tickets efficiently, considering volume and complexity

  • Their involvement in postmortems, ability to coach other teams in analyzing incidents, and ability to propose solutions

Back up these and any other artifacts with feedback from those on the receiving end of the engineer’s work products, and prompt for feedback that speaks to both the impact and the technical expertise needed to do the work well.

Have, at Most, One Level Matrix for the Systems Roles

In smaller companies, the practice we’ve outlined in the last few sections works well to hire and reward great software and systems platform engineers, without needing a second level matrix. However, at scale, we have seen challenges in getting organizations to recognize the commensurate value of people in all the “systems” roles (reliability engineers, systems engineers, and all the variants of systems specialists), because these people don’t write as much code.

A common example: we’ve seen a team interview a great systems engineer with 10 years of experience on planet-scale systems, then propose to hire them at a non-senior level “since they can’t solve coding problems like a senior engineer does.” Similarly, we’ve seen company-wide promotion committees struggle with how to evaluate senior-level engineers who had written only a thousand lines of code in the last year, or staff-level engineers who hadn’t led the building of a new system. The panel was biased to think of leverage only in terms of new code and systems being created.

With that in mind, we believe most organizations should eventually create a second level matrix for the “systems” roles that don’t churn out lots of code. The key is to create only one, rather than three, or else you’ll again run into the problem of confusing everyone by codifying subtle differences in how to evaluate impact in similar roles. Since these job titles have seen so much renaming in recent years, we won’t suggest a new name, but a couple of examples we’ve seen are Meta’s production engineer level matrix and Amazon’s systems development engineer level matrix.

Finally, if your company already has a level matrix for DevOps engineers or SREs, it’s fine to use that. The name won’t be a perfect match, since these are more specific roles, but as there’s no such thing as a perfect name anyway, it’s OK to skip the work of renaming. Just make sure the level criteria accommodate all three roles, since otherwise you’ll be limiting who can be successful on your platform teams.

If Needed, Create a New Software Engineer Interview Process

Assessing a candidate during an interview and evaluating an employee’s job performance are totally different beasts. In the case of performance and promotion, the employee has been doing the job for thousands of hours, with clear business impact in deliverables and other specific role information that can put the evaluation in context. Interviews, on the other hand, provide only a few hours of information, none of which involves actually doing the job. Thus, “forking” the software engineering interview process may be the right thing to do.

Platform teams at companies that use “company-wide” software engineering pipelines can get stuck on evaluating an “application software engineer” profile and miss the differences we covered earlier in the chapter. For instance, we have seen coding questions that are less about practical coding skills (creating a solution with high attention to detail, particularly around assumptions and edge cases) and more about computer science knowledge that rarely comes up in day-to-day platform programming, such as data structure manipulation or first-order algorithms. We’ve also seen bias in design questions, which may be focused on choosing the right platforms to combine as part of an application, as opposed to designing platforms themselves. None of these map well to platform software engineering.

For platform engineering teams, we prefer an interview process that looks something like this:

  • One traditional coding interview, typically with an algorithm that has a working naive/brute-force approach that can then be optimized through more advanced algorithms or data structures. The candidate is evaluated not only on whether they can find the optimized solution, but also on whether they can implement the “bookkeeping” details of the answer, including error handling and testing.

  • One coding interview that shows the breadth of the candidate’s understanding of systems detail. It may take them 20 minutes to get the code correct, but such questions should generate 30 minutes or more of discussion about the underlying assumptions. During this discussion, you can test their methodology and their assumptions around real-world factors like testing, observability, and scale-up; for example, you might ask how the candidate’s answer would change if the inputs were larger than what a single computer could handle.

  • One traditional design interview, but focused on designing a platform, as opposed to an application.

  • One inverted design interview, in which you ask the candidate to dive deep into the technical trade-offs of something real they have designed and built.

  • One behavioral/values interview, with a particular focus on operational experience, ability to lead in the face of conflict, and empathy with customers.

If any of these types of interviews are new to your company’s process, you will need to be “hands on” in managing the rollout process to ensure your early interviewers are calibrated. You’ll want to set up a small working group to create a set of standard questions to be covered, with a common rubric or a set of green and red flags. As the process is initially rolled out, you will need to collect interviewers’ feedback on how well they think the question evaluated the candidate, and present any trends to the working group to make further corrections. In our experience, such a hands-on rollout can take six months before we are confident that the early interviewers are well calibrated.

Vary the Interview Only Slightly for Systems Roles

As indicated in Table 4-1, for systems roles, we like to use the same interview outline as for platform software engineers. However, we suggest three main changes. The first is more flexibility on the design question—the more you focus on the candidate’s specialization, the more information you have to evaluate whether they pass the interview and what level they should be at. The second is, of course, in the inverted design interview: you want to dive into their depth of systems knowledge in their specific role, ideally having someone in that role asking the questions.

The third change is the largest point of contention: keeping the coding interviews. People with systems backgrounds often argue that “Whiteboard coding isn’t real coding, so I shouldn’t need to do it.” Sometimes they follow that with “You should trust from my resume that I know how to code.” Unfortunately, there are many people in the world who can’t actually create code in existing production systems, including a lot of people whose resumes would make you think otherwise. This has to be covered in the interview.

The question, though, is how. People with systems backgrounds often note that the whiteboard interview process is unnatural and requires practice that is divorced from normal working conditions. It’s hard to produce production-quality code for a new problem with the time pressure of a live interview, not to mention the distractions from an interviewer looking over your shoulder and interrupting you with well-meaning suggestions. The result is that the candidate’s performance gives no legitimate indication of how well they will write code once in the job.

We suggest that, instead, you take the question offline, offering the candidate a time-boxed take-home coding problem. Then, use the interview to discuss their submission. This allows you to validate that they didn’t cheat and to go deep on systems questions. This style of interview takes more time and gets pushback from some candidates, but again, we think it’s important not to backslide on this point if you really want to build a team with a platform culture that will create substantial new software, as opposed to a traditional infrastructure or operational culture.

Interview for Customer Empathy

In our time managing and working with platform engineering organizations, we’ve seen some organizations develop abrasive relationships with some of their largest user groups. In certain cases, the engineers treated users’ thoughts and opinions with contempt, even as those users struggled with problems caused by the platform itself. It’s tempting to dismiss this with “Don’t hire jerks.” There are two reasons we think this is simplistic.

First, the word jerk implies behavior like belittling, agitating, and ignoring. That is certainly some of what you see, but it doesn’t cover more defensive behavior, like touchiness around criticism. Here, rather than focusing on helping the user the best they can, the platform engineers sigh, shrug, and point fingers at the past. When the team is under a lot of pressure, this can cause outbursts targeted at users with a clear message of “you users are lucky to have us.”

Second, the user might be the jerk. There is a reason for the old help-desk meme of “the problem exists between the keyboard and the chair.” Some users just cannot accept the facts. Most application engineers are protected by their support or product organizations, which deal with the worst of such people directly. However, as we’ll discuss in Chapter 6, when you’re working on a platform the line between support cases and new features can be blurred, so user support is part of the job.

Handling difficult users while supporting a system you did not create takes maturity and empathy to hold your temper, build bridges, educate the user, and solve the problem. Not everyone can manage it. Unfortunately, this means there are a lot of passionate platform engineers who aren’t cut out to work on platforms. Not only does their behavior affect the reputation of the team, but because they are somewhat “right” in their grievances, they can easily affect the culture of the entire team.

While there are many skills you might try to interview for to avoid this problem (negotiation, communication, influence without authority), we recommend that, at a minimum, you screen for a basic level of empathy and ability to put oneself in the user’s shoes. We have used questions like:

  • Tell me about a time when you helped one of your users understand the system.

  • Tell me about a time when you used customer feedback to change the direction of what you were building.

  • How do you understand your users in order to figure out whether a new feature or system is interesting or applicable to them?

These questions are not meant to see if the engineers would make good product managers; they make sure engineers appreciate that they are building things for other humans to consume. Camille prefers to frame this as customer empathy instead of user empathy because, to paraphrase a friend, “Customer implies obligations; users are just some schmucks.”

This doesn’t mean that engineers need to spend all their time thinking about their customers. But when engineers have some empathy for other people who might need to read their code and a general commitment not just to the most interesting technical problem but also to the larger health of the system, the engineers themselves and the team as a whole tend to be stronger.

What Makes a Great Platform Engineering Manager?

It’s great to have a balanced set of platform software engineers and systems engineers, but at some point you need to add engineering managers into the mix. (After all, who’s going to make sure all these interview processes happen?) Furthermore, the manager is often the leader who has the most influence on the culture of a team, impacting who feels heard, who feels enabled, and who feels their ideas are being treated equally to those of others.

While there are skills all good managers share, we’ve found that some skills, tendencies, and experiences make for the most successful platform engineering managers. In this section, we’ll cover the main ones.

Experience Operating Platforms

Platform engineering involves operational complexity that many managers with application software engineering backgrounds do not appreciate. Most at least understand the breadth of the underlying systems, but they may miss that these systems tend to have ill-defined boundaries. A problem in one area can cause the whole thing to fall apart in surprising ways. Thus, it requires a lot of humility and patience to manage a team as they slowly yet diligently address systemic issues. When you bring a software engineering leader without operational skills into a platform management role, they can compound problems by encouraging a mindset that the solution is just one “brilliant” engineering fix away. Yes, there are sometimes quick system fixes that buy time, but these are much more likely to be “simple” than “brilliant.” For every “brilliant” engineering fix we’ve seen, we’ve seen 10 failed ones that compounded operational problems and slowed the team down.

A different failure mode we have observed involves hiring a good manager from a customer organization. This approach has clear upsides—you get an established manager, a fast path to customer empathy, and organizational relationship building. However, be cautious, because this hire is likely coming from a place of less operational complexity. As you’ll see in Chapter 6, they can struggle to see the value in routine operational practices, letting things fall through the gaps. Further, they might think that the underlying system problems are easy to solve and the engineers working on them were mismanaged and doing things wrong. That’s a great way to end up not just misdiagnosing problems but also alienating the strongest members of an existing team.

Experience on Big, Long-Running Projects

Managers who are used to the “move fast and break things” pace of application engineering may get frustrated by the slower delivery pace of platform engineering teams. When a lot of people depend on your platform, it is necessarily more critical, and that means you need to make changes slowly and operate with careful thought.

This doesn’t mean that platform teams shouldn’t aim for frequent delivery. A good platform engineering leader will help their team figure out how to deliver their work quickly and safely in the same way that a good application engineering leader does. But there is a difference between leading a team that ships new code to production every day, and a team that has to think through the multi-month process of safely migrating several customers off of one critical platform and onto another one without downtime, data loss, or disruption.

Great platform leaders are able both to take criticism related to the team’s inability to ship improvements faster and to justify why the team’s delivery pace is right, as they are managing high levels of business criticality, complexity, and risk. In the face of constant pressure and criticism from stakeholders, even the most confident leaders can start questioning their strategy and churn their team’s focus by looking for quick fixes to regain face. Instead, they must set aside any emotional reaction to the one-way nature of this feedback, and be willing to spend time and effort on handling tough discussions with technical stakeholders. We’ll discuss techniques for building these relationships in depth in Chapter 10.

Attention to Detail

The most successful managers that we’ve seen transition from “application engineering” leadership to “platform engineering” leadership were detail-oriented sticklers who found motivation in doing project and process management personally.

Managers who spent their early careers as engineers working in infrastructure and platform teams tend to know which details matter and who can be trusted to make the right trade-offs, so they can lead their teams without using too much management process. But for managers from other backgrounds, until you have built these instincts, you’ll need to track a lot of details. This can make teams feel “micromanaged,” which can be annoying, particularly if former management used less process. However, if the options are “my leader asks micromanaging questions to understand trade-offs” and “my leader misses crucial details in making decisions that impact me,” most engineers will grudgingly admit that they prefer the former.

Good managers should eventually build the instincts for when to trust the team and when to probe more deeply, and so become able to put much of the process aside. This is something actual micromanagers struggle with.

Other Roles on a Platform Team

Of course, a good platform team includes more roles than just various kinds of engineers. This section briefly covers some other roles you may encounter.

Product Managers

Good platform teams focus on the product and the customer. As we lay out in Chapter 5, building products that reflect this mindset takes ongoing detailed and focused work, particularly around communicating with customers and building that into a strategy. At scale, adding dedicated product managers to your platform group is the only way to ensure this work will be done.

It’s a challenge to hire good product managers anywhere, but it’s especially hard for platform teams. Most product management organizations see value in the role as closely tied to revenue and to delivering to external customers, so few PMs are experienced in the challenges of platform teams. This means they can sometimes take the short-term mindset of “business obsession” too far. Jordan West, a staff engineer on a data platform team at one of the FAANG companies, captures his negative experience with PMs who think this way: “Why is it so bad if engineers get interrupted every 10 minutes and can’t deliver on anything else, as long as the customer is happy?” That can be a useful mindset for a startup, but it’s dysfunctional on a platform team.

To address this challenge, platform teams can turn to product-minded people from other technical backgrounds to fill some of these roles. Our experience is that, for every PM we’ve worked with who came in with formal product management experience, we’ve worked with two others who moved into product management from engineering or, occasionally, technical program roles. While hiring people without formal product management experience is a gamble, it’s one that often pays off.

However, while it can be tempting to fill your product teams with only these folks, you also need some experienced product managers who already understand the role. They can help to train the newbies and calibrate whether they are doing product management or just glorified program management, scrum mastering, or tech leading. You can bootstrap this through external coaching and training, but don’t skip this step entirely!

Product Owners

Thanks to the Scaled Agile Framework, a lot of companies today are hiring a role called product owner. There’s some ambiguity about how this differs from a product manager role. It’s sometimes defined as a complement to a marketing-focused product management role, emphasizing the work of backlog grooming and defining user stories. In platform engineering organizations, where all the “customers” are internal and so marketing needs are small, there is no reason to split the roles. Look for people who can make strategic decisions and handle the mechanics of action.

Project Managers/Technical Program Managers

The project or technical program manager (TPM)2 role is often the most vilified role in engineering organizations. Critics argue that the “technical” part of the title is a misnomer, since every decision involves big meetings of stakeholders, and otherwise the job seems to mainly involve harassing overworked managers and tech leads for updates. We think this is less about the role and more about the conditions they are often asked to deliver in—when executives don’t prioritize the right projects ahead of time, it creates a situation where the only way a project will succeed is by a TPM driving brute-force cross-organization execution. To avoid these situations, we recommend hiring product managers first and using engineering managers and technical leads to manage all small- to medium-sized projects.

But no matter how hard your executives and PMs try, no one can predict the future, so broad, execution-focused projects are going to happen. Managing those projects well will require 100% of someone’s time, and it’s a good idea to make it someone who has built their career on doing that.

Finding good platform TPMs is almost as difficult as finding good PMs. We have seen a lot of candidates who had succeeded at their last company totally fail at their new one. The lesson we’ve learned is to find TPMs who are comfortable delivering on projects using the organization-wide processes you have today, rather than blaming those processes for why they can’t. Thus, at a relatively small company, you will want people who are good at making things happen by building bottom-up relationships with engineers and delivering without needing authority. On the other hand, at a big company with tens of thousands of engineers, you likely will want someone who makes things happen by bringing hard decisions to misaligned leadership, collecting the details and communicating them upwards in the style your company’s executives prefer.

Developer Advocates, Technical Writers, and Support Engineers

Developer advocates, technical writers, and support engineers are highly specialized roles seen in really big platform engineering organizations (generally, those with more than a thousand engineers). Both product managers and engineers can perform these functions to some degree (though not as well as a specialist), so avoid hiring until your team really needs a full-time specialist. This usually means pushing back against “not my job” attitudes from your team. It also means ensuring such work is recognized and rewarded, even when it is not directly mentioned in the level matrix for the person doing it.

Creating a Platform Engineering Team Culture

We started this chapter by talking about how some teams get stuck with one type of engineer (systems or software) and struggle to create the balanced culture that platform engineering needs. Now, we want to talk about a case where we brought together two teams with divergent cultures and instilled the culture needed to do platform engineering.

A Platform Split Between a Development and an SRE Team

This was a compute platform whose core was a complex OSS system, with the platform cobbled together by a development team of software engineers who were mostly recent PhDs, although in systems-related fields. All of them had been hired according to a company-wide “software engineer” interviewing standard, which meant none were screened for systems operational skills or customer empathy. The platform team had recently added a few systems engineers, but they’d been hired into a separate “SRE” organization, which was the by-the-book approach of the time. Their closest common manager, three levels up, knew he had a problem, but he was hearing very different proposals about what needed to be addressed.

Strengths and Weaknesses of the Development Team

The development team’s culture held that every problem could be solved by growth—building new functionality in collaboration with customers to allow them to move to the new platform, which would later allow the team to hire more engineers. They didn’t worry about broad customer understanding, migration plans (other than “build it and they will come”), improving the operational stability of some of the systems, or solving problems through evolving the existing offerings. Instead, at every opportunity, the team immediately went into “new build” mode, thinking this was going to be the “brilliant” solution that all future customers would use.

This had certain advantages. The team created some very innovative solutions. They weren’t afraid to tackle problems with the OSS that no one in the community had solved yet, instead of deciding that the OSS wasn’t fit for purpose due to the lack of support for their needs. Their fearlessness in the face of unsolved problems helped them knock down barriers and deliver some really big advances, and they didn’t get bogged down in the formalities of process while delivering.

As you can probably guess, however, it had some big drawbacks too. There were many stability problems with these new systems, which the engineers often preferred to solve via a new build instead of taking the time to understand and fix the existing infrastructure. The innovative solutions were great for getting the ball rolling, but the team seemed stuck in “pioneer” mode. Perhaps due to the influence of the researchers in the group, they were less interested in the day-to-day grind of stability, reliability, and iterative improvements. Eventually, project delivery and broad customer communication suffered. The customers weren’t clear on when they could expect things to get done, so they weren’t happy.

The creation of the separate SRE team, which was undersized and mostly staffed with newcomers to the company, just made things worse. The development team started to view reliability as “SRE’s problem” and a license to stay on the same path, despite significant technical debt. The finger-pointing was more covert than overt, but the thinking on both sides was “Why is that other team not hiring better people who meet the needs of the business?”

Merging the Teams and Adding Product Management

Our first move was to merge the teams under someone from the SRE side with great managerial traits and strong experience operating platforms, executing on long-running projects, and managing stakeholders. The new team balanced people who wanted to build new things with those who were happy to scale and operate existing things and those who would work more closely with customers. As a result, over a period of about six months, they stabilized operationally and consolidated the prioritization around new features, partly by moving well away from the “one engineer, one feature” model toward a roadmap model.

These changes didn’t come for free. We lost some of the more innovative research-focused developers: they had preferred the model with a separate SRE team, as it let them focus on playing with new technologies and building new things. Mostly, we saw their unhappiness growing and were able to move them to more appropriate roles in other parts of the company. But for a couple, we didn’t move fast enough, and we lost some good engineers who just ended up stuck on the wrong team.

The next step was finding a PM to take over product management for the team. This too created some tension; we had to ensure that the technical leads still felt heard and that the engineering managers wouldn’t be (or feel) undermined by the product manager making all the big decisions. Bringing such a disparate group together while ensuring everyone feels respected and able to do their best work is not easy.

Instilling a Platform Engineering Culture

As these moves were made, an important aspect of our role as organization leaders was reinforcing the new culture. That was harder than we expected, as cultural challenges also came from outside the team. Part of the reason the development team had been empowered was their strong alignment with the broader company culture, which was largely about collaboratively building innovative things on the cutting edge. This culture was great for data scientists whose work had no human users. But the business had other application teams with much higher reliability needs, and even the data scientists were unhappy with an unreliable platform.

Thus, we needed a platform team that balanced building new things with thinking about stability, reliability, and usability. We needed to create a new platform engineering culture that respected the overall company values of innovation and collaboration while adding our balanced focus on stability and scale.

In truth, most subteams at larger companies have their own distinct cultures, and the larger the company, the more these cultures tend to diverge. Teams develop cultures that reflect where they focus their attention and how they are punished or rewarded, and platform teams’ cultures tend to be a bit more conservative than those of their product engineering counterparts. Platform engineering leaders should pay close attention to any culture drift and ensure that it doesn’t lead to an “us versus them” mindset. It’s OK for your team to have a slightly distinct culture, but when your pride in running highly reliable systems turns into scorn at the product engineering teams who keep shipping broken code, you risk destroying the customer empathy that is so important in building great platforms in the first place.

To create a healthy organization, spend time recognizing and rewarding different roles and skill sets. Talk about how they contribute to the stronger, better whole. Take the time to appreciate your partner teams and their work. This cultural investment will go a long way toward supporting your team and ensuring their (and your) continued success. The next step forward is creating a product culture, which is the topic of the next chapter.

Wrapping Up

In the first part of this book, we emphasized that platform engineering requires a cultural change in how you staff teams, bringing together engineers with a mix of focus areas to collaborate on customer-focused platforms. Mixed-mode management is not easy, though, so it’s tempting for platform leaders early in their leadership careers, or early in their migration to platform engineering, to fall back on what they know—which results in teams with either a software or a systems focus. Unfortunately, that usually means the platforms they produce lack either the complex analysis that systems engineers bring or the code development productivity that software engineers bring. The platform product ends up being defined by what the team can produce, rather than what the customers need.

In this chapter, we introduced you not only to the breadth of roles that a platform engineering team needs to have at scale, but also to how to think about your processes of hiring and recognition so that the people in each role can flourish. We also covered the characteristics leaders need to successfully manage platform teams and gave an example of how to bridge the cultural gaps, not just within the team itself, but also in the expectations customer engineers have in interacting with the team. Because culture is such a major contributing factor to your success, it runs through nearly every topic in this book, and we’ll talk about it quite a bit more in Chapter 5.

This change is not easy to implement. But if you want to move past operating over-general OSS and vendor primitives held together with in-house tools and glue, you need to start your platform engineering journey by ensuring your teams are staffed to actually build platforms.

1 See for instance Laurent Bernaille and Elijah Andrews’s 2022 talk on this.

2 Like “product owner” versus “product manager,” the difference in these titles in terms of role is subtle, especially as usage varies across the industry. We will use the terms interchangeably.
