Chapter 9. LinkedIn: The Road to Data Craftsmanship

I’ve been with LinkedIn for only 18 months, yet what I’ve seen in data operations has amazed me. As at any consumer web company, an enormous amount of data has always flowed through LinkedIn. But LinkedIn was relatively early to realize the importance of this data.

At LinkedIn, it wasn’t just about getting the analytics right. The company realized early on that infrastructure had to go hand in hand with analytics to support the data ecosystem. Many open source projects, most famously Apache Kafka, were born at LinkedIn to support this ecosystem. Today at LinkedIn, we rely heavily on the scalability and reliability of Kafka, Hadoop, and a surrounding ecosystem of open source and internally developed tools to serve our analytic needs.

Early on, the company found that different teams, such as the Email Team and the Homepage Team, were using disparate tools when building data pipelines, as illustrated in Figure 9-1.

Figure 9-1. Different teams built and operated different pipelines (source: LinkedIn)1

LinkedIn knew, of course, that it shouldn’t have multiple pipelines for moving and ingesting data, or for computing metrics. It’s inefficient and difficult to manage, and, most important, it leads to inconsistent and unpredictable ...
