In this episode of the Data Show, I spoke with Neha Narkhede, co-founder and CTO of Confluent. As I noted in a recent post on “the age of machine learning,” data integration and data enrichment are non-trivial and ongoing challenges for most companies. Getting data ready for analytics, including machine learning, remains an area of focus. It turns out that “data lakes” have become staging grounds for data; more refinement usually needs to be done before data is ready for analytics. Tools that make it easier to create and productionize data refinement pipelines on both batch and streaming data sources let analysts and data scientists focus on analytics that can unlock value from data.
On the open source side, Apache Kafka continues to be a popular framework for data ingestion and integration. Narkhede was part of the team that created Kafka, and I wanted to get her thoughts on where this popular framework is headed.
Here are some highlights from our conversation:
The first engineering project that made use of Apache Kafka
If I remember correctly, we were putting Hadoop into place at LinkedIn for the first time, and I was on the team that was responsible for that. The problem was that all our scripts were actually built for another data warehousing solution. The question was, are we going to rewrite all of those scripts and now sort of make them Hadoop specific? And what happens when a third and a fourth and a fifth system is put into place?
So, the initial motivating use case was: ‘we are putting this Hadoop thing into place. That's the new-age data warehousing solution. It needs access to the same data that is coming from all our applications. So, that is the thing we need to put into practice.’ This became Kafka's very first use case at LinkedIn. From there, because that was very easy and I actually helped move one of the very first workloads to Kafka, it was hardly difficult to convince the rest of the LinkedIn engineering team to start moving over to Kafka.
So from there, Kafka adoption became pretty viral. Now, I think years down the line, all of LinkedIn runs on Kafka. It's essentially the central nervous system for the whole company.
Microservices and Kafka
My own opinion of microservices is that it lets you add more money and turn it into software at a more constant rate by allowing engineers to focus on various parts of the application, by essentially decoupling a big monolith so that a lot of the development of real applications can happen in parallel.
... The upside is that it lets you move fast. It adds a certain amount of agility to an engineering organization. But it comes with its own set of challenges. And these were not very obvious back then. How are all these microservices deployed? How are they monitored? And, most importantly, how do they communicate with each other? The communication bit is where Kafka comes in. When you break a monolith, you break state. And you distribute that state across different machines that run all those different applications.
So now the problem is, ‘well, how do these microservices share that state? How do they talk to each other?’ Frequently, the expectation is that things happen in real time. In the context of microservices, where streams or Kafka comes in is the communication model for those microservices. I should just say that there isn't a one-size-fits-all answer when it comes to communication patterns for microservices.
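To make the communication model concrete, here is a minimal sketch of services sharing state through an append-only log. It is a toy, in-memory stand-in and not the Kafka client API: the names `TopicLog`, `InventoryService`, and the `order_placed` event shape are hypothetical and chosen only for illustration. The point is that the producer never calls the consumer directly; each consumer tracks its own offset and rebuilds its view of state by reading the log.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for a topic: an append-only list of events."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Append an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset):
        """Return every event at or after the given offset."""
        return self._events[offset:]


class InventoryService:
    """Hypothetical consumer that derives its own state from the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # position of the next unread event
        self.reserved = defaultdict(int)

    def poll(self):
        """Read new events, update local state, and advance the offset."""
        events = self.log.read_from(self.offset)
        for event in events:
            if event["type"] == "order_placed":
                self.reserved[event["sku"]] += event["qty"]
        self.offset += len(events)


# A hypothetical order service publishes events; it knows nothing about readers.
orders = TopicLog()
inventory = InventoryService(orders)
orders.append({"type": "order_placed", "sku": "widget", "qty": 2})
orders.append({"type": "order_placed", "sku": "widget", "qty": 3})

inventory.poll()
print(inventory.reserved["widget"])  # prints 5
```

A second service could read the same log from offset 0 and build a completely different view without any change to the producer. That is one pattern among several, in line with the caveat that no single communication model fits every microservice.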
"Architecting and building end-to-end streaming applications": Karthik Ramasamy on Heron, DistributedLog, and designing real-time applications.
"Semi-supervised, unsupervised, and adaptive algorithms for large-scale time series": Ira Cohen on developing machine learning tools for a broad range of real-time applications.
"Building Apache Kafka from scratch": Jay Kreps on data integration, event data, and the Internet of Things.
"How companies can navigate the age of machine learning": to become a “machine learning company,” you need tools and processes to overcome challenges in data, engineering, and models.