How the TensorFlow team handles open source support
Improving software with the help of a community takes patience and organization.
Open-sourcing is more than throwing code over the wall and hoping somebody uses it. I knew this in theory, but being part of the TensorFlow team at Google has opened my eyes to how many different elements you need to build a community around a piece of software.
When a new project is released into the world, the sole experts on that project are the people who wrote it. They’re the only ones who can write documentation and answer questions, and they’re the most effective at improving the software. As a result, those of us on the core team of TensorFlow became the bottleneck to growing our project: we couldn’t do everything at once. We did know how to make time for writing code and documentation, since those tasks were part of our daily jobs at Google. Answering questions from a large community of developers, on the other hand, wasn’t something we were used to, although we knew it was important for the project’s success.
To make sure users got the answers they needed, everyone on the core engineering team joined a rotation. Team members could choose to address Stack Overflow questions with the tensorflow tag, review pull requests on GitHub, triage GitHub issues, handle syncing the external and internal code, or chase down the causes of failing tests.
Individuals in these areas chose how to divide up the work; typically, each engineer took responsibility for a particular area for a week at a time, rotating through the people available in round-robin fashion. As a result, the on-duty engineer was a lot less productive on their normal tasks that week, but at least the disruption was limited to once every couple of months per person.
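A weekly round-robin like this is simple to sketch. The following is a hypothetical illustration (the names and start date are invented), not our actual tooling:

```python
from datetime import date, timedelta
from itertools import cycle

# Hypothetical engineers -- not the real TensorFlow roster.
ENGINEERS = ["alice", "bob", "carol", "dave"]

def weekly_rotation(start_monday, weeks):
    """Assign one engineer per week, cycling through the list round-robin."""
    names = cycle(ENGINEERS)
    return [((start_monday + timedelta(weeks=w)).isoformat(), next(names))
            for w in range(weeks)]

for monday, name in weekly_rotation(date(2017, 1, 2), 6):
    print(monday, name)
```

With a larger pool of engineers sharing several duty areas, any individual only loses a week to duty work every couple of months.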
We open sourced TensorFlow in part to allow the community to improve it with contributions. So far, more than 400 external contributors have added to the code, with everything from small documentation fixes to large additions like OS X GPU support, the OpenCL implementation, or InfiniBand RDMA. Every contribution is first reviewed by the core engineer on rotation, who decides whether it makes sense for the project. If it passes that initial review, a set of Jenkins tests is triggered to ensure it doesn’t cause any failures. Once those have passed, the duty engineer may want another core engineer who knows the area better to take a look, and will pass it to that specialist for review.
GitHub’s new detailed code review tools have been a great help in this process; before they came along, it was painful to deal with all the individual comments. Often, larger PRs are kept as work in progress for some time, while a core engineer and one or more external contributors work on it collaboratively. Once everyone’s happy, the PR will be merged with the top of tree on GitHub, and then merged into our internal code base the next time a sync is run.
Contributor license agreement
As part of our automated pull request process, we make sure any external contribution is covered by a contributor license agreement (CLA) by matching the contributor’s GitHub account name with our records at cla.developers.google.com. Our goal is to leave no doubt that the whole code base can be distributed under the Apache 2.0 license. Things can get tricky if different email addresses are associated with check-ins inside a pull request, or if the contributor needs to sign as a corporation, but the engineer on pull request duty is there to sort out any problems that arise.
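The account-matching step can be sketched like this; the signer set and the helper function are hypothetical stand-ins for the real lookup against cla.developers.google.com:

```python
# Hypothetical stand-in for the CLA records at cla.developers.google.com.
SIGNED_CLAS = {"octocat", "some-contributor"}

def cla_status(pr_author, commit_accounts):
    """Flag a pull request for manual follow-up if any associated
    account hasn't signed the CLA."""
    accounts = {pr_author, *commit_accounts}
    unsigned = sorted(a for a in accounts if a not in SIGNED_CLAS)
    if not unsigned:
        return "cla: yes"
    # Mismatched commit emails or corporate signers need a human to sort out.
    return "cla: no (needs review: " + ", ".join(unsigned) + ")"
```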
There have been more than 5,000 issues filed against TensorFlow, which might seem depressing to some, but it’s my favorite metric because it illustrates that people are really using the software! In order to make sure we have a response for every issue filed, the engineer on duty looks at the messages as they come in and tries to categorize them using labels. If it’s a feature we’re unlikely to get to in the short term internally, we mark it as “Contributions Welcome,” or, for bugs, we try to prioritize it. These days, we’re increasingly seeing issues resolved without our help, as external users become experts themselves, especially on platforms like Windows that we’re not all using day to day.
If there isn’t an answer or fix from the community, and it’s a high enough priority, the person on duty assigns it to one of our engineers who knows the area. The entire TensorFlow team has GitHub accounts so we can assign problems using the normal GitHub issue tracker. We did consider shadowing bugs in our internal systems, but the cost of synchronizing two copies of the same information was too high. As a result, we ask our engineers to turn on email notifications for bugs on GitHub so they see when they’ve been assigned, in addition to keeping an eye on our internal tracker.
Derek Murray is the head of the Stack Overflow rotation, and I’m in awe of his ability to answer questions. According to his profile page, he’s reached more than 1.3 million people with his posts. He’s also managed to set up an automated spreadsheet driven by an RSS feed so we can track all of the questions on the site with the tensorflow tag. We started off with a weekly rotation, but found that the volume became too large for a single person to deal with. Instead, we’re now splitting up questions automatically as they come in, on a round-robin basis.
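Splitting questions works the same round-robin way, but per question rather than per week. A hypothetical sketch (the names are invented, and the real pipeline is a spreadsheet fed by the Stack Overflow RSS feed):

```python
from itertools import cycle

# Hypothetical answerers -- the real rotation is driven by a spreadsheet
# fed from the Stack Overflow RSS feed for the tag.
ANSWERERS = cycle(["derek", "pete", "rajat"])

def assign(question_titles):
    """Hand each incoming question to the next person in the cycle."""
    return [(title, next(ANSWERERS)) for title in question_titles]
```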
I’m on this rotation, so after going through my email every morning, I look at the spreadsheet to see which questions I’ve been assigned. Unfortunately, we aren’t able to answer every question, but we do review each one that comes in, and if a question is relatively simple, we’ll answer it ourselves.
The on-duty engineer is the front line for incoming queries, but sometimes the answer needs more time or expertise. If the questions seem like they might be answerable but nobody in the community has jumped in, we’ll do some detective work in the code (often using `git blame`) to figure out who in the team might have some ideas. Then the duty engineer sends an email asking the internal expert we’ve identified if they can help.
We have a mailing list set up, but at first we weren’t too clear on what it should be used for. It became obvious pretty quickly that it was a terrible way to track issues or answer general questions.
Instead, we keep it available for discussions that don’t fit anywhere else. In practice, however, we’ve discovered that even for things like architectural questions, GitHub issues are a better fit. Now we use the mailing list to send information and share announcements, and it’s worth subscribing.
A lot of people I talk to find it surprising that we use almost exactly the same code base inside Google that we make available on GitHub. There are a few differences: for example, support for Google-only infrastructure is kept separate, and the include paths are different, but the syncing process is otherwise entirely mechanical. We push our internal changes out at least once a week, and pull from GitHub even more often.
The tricky part is that we’re syncing in both directions. There are a lot of simultaneous changes on both the GitHub public project and our internal version, and we need to merge all of them as we go back and forth. There was no existing infrastructure we could use, so we handle this with a set of Python scripts we’ve created. The scripts pull the GitHub changes into our internal source repository, convert all the header paths and other minor changes, merge them with the latest internal code, and create an internal copy. We can then go in the other direction, converting all the internal code to the external format, and merging the result with the latest on GitHub using the same scripts.
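The header-path conversion at the heart of those scripts is conceptually a mechanical rewrite. Here is a minimal sketch, assuming a hypothetical internal `third_party/tensorflow/` prefix (the real scripts handle many more cases than a single string substitution):

```python
# Assumed prefixes for illustration; the actual internal layout differs.
INTERNAL_PREFIX = "third_party/tensorflow/"
EXTERNAL_PREFIX = "tensorflow/"

def to_external(source):
    """Rewrite internal include paths into their GitHub form."""
    return source.replace(INTERNAL_PREFIX, EXTERNAL_PREFIX)

def to_internal(source):
    """Rewrite GitHub include paths back into the internal form."""
    return source.replace(EXTERNAL_PREFIX, INTERNAL_PREFIX)
```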
For internal changes, we also do our best to make sure each check-in appears as a single git commit that includes the author’s GitHub account and a comment explaining the change. We have a special “tensorflow-gardener” account on GitHub that is scripted to manage this process, and you can browse its commits to see what an internal change looks like once it’s been migrated to GitHub.
It’s challenging to make sure the conversion process continues to work as the code changes. To verify it does, we run every internal change through the scripts to produce an external version, then back in the other direction to the internal form, and check that the result is identical to the original internal version. This test runs on every internal change that touches the TensorFlow code base and blocks submission on anything that doesn’t pass. If we ask contributors for seemingly strange changes to their pull requests, it’s often because we have to make sure their code survives this synchronization round trip.
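That round-trip check amounts to asserting that internal → external → internal is the identity. A sketch, with trivial stand-ins for the real conversion scripts:

```python
# Trivial stand-ins for the real sync scripts, for illustration only.
def to_external(source):
    return source.replace("third_party/tensorflow/", "tensorflow/")

def to_internal(source):
    return source.replace("tensorflow/", "third_party/tensorflow/")

def round_trip_ok(internal_source):
    """A change may only be submitted if converting out and back
    reproduces the original internal code exactly."""
    return to_internal(to_external(internal_source)) == internal_source
```

A change written in a form the scripts can’t map cleanly fails this check, which is the kind of situation where a contributor gets asked for an odd-looking edit.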
We wanted to have a testing infrastructure with a broad reach because we have to support a lot of platforms. TensorFlow runs on Linux, Windows, and OS X on the desktop, and iOS, Android, Android Things, and Raspberry Pi for mobile and embedded systems. We also have different code paths for GPUs, including CUDA and OpenCL support, along with Bazel, cmake, and plain makefile build processes.
It’s impossible for every developer to test all these combinations manually when they make a change, so we have a suite of automated tests running on most of the supported platforms, all controlled by the Jenkins automation system. Keeping this working takes a lot of time and effort because there are always operating system updates, hardware problems, and other issues unrelated to TensorFlow that can cause the tests to fail. There’s a team of engineers devoted to making the whole testing process work. That team has saved us from a lot of breakages we’d have suffered otherwise, so the investment has been worth it.
We’re not alone in working on open source at Google, and we’ve learned a lot from other projects like Kubernetes, as well as from Google’s Open Source Programs Office (which has a great set of documentation, too). We have an incredibly hard-working team of developer relations experts assisting us as well, and they handle a lot of the heavy lifting around documentation, code samples, and other essential parts of the developer experience. Our long-term goal is to spread critical expertise beyond the core developers so more people inside and outside Google can help the community.
One good thing about having core engineers assigned to customer service part-time is that it gives us first-hand insight into the problems our users are having. Participating in customer service also motivates us to fix common bugs and add documentation, since we can see a direct payoff in a reduced support workload.
Looking to the future, we’re hoping to spread the workload more widely as more people become familiar with the internal details of the framework, the documentation improves, and we create more “playbooks” for dealing with common tasks, such as bug triage. Until then, I feel lucky to have the chance to interact with so many external developers, and hopefully have a positive impact by helping some of them create amazing new applications with machine learning.