The current state of Apache Kafka

The O’Reilly Data Show Podcast: Neha Narkhede on data integration, microservices, and Kafka’s roadmap.

By Ben Lorica

November 22, 2017

Laser light show (source: Pixabay)

In this episode of the Data Show, I spoke with Neha Narkhede, co-founder and CTO of Confluent. As I noted in a recent post on “the age of machine learning,” data integration and data enrichment are non-trivial and ongoing challenges for most companies. Getting data ready for analytics, including machine learning, remains an area of focus. It turns out that “data lakes” have become staging grounds for data; more refinement usually needs to be done before data is ready for analytics. Making it easier to create and productionize data refinement pipelines on both batch and streaming data sources lets analysts and data scientists focus on analytics that can unlock value from data.

On the open source side, Apache Kafka continues to be a popular framework for data ingestion and integration. Narkhede was part of the team that created Kafka, and I wanted to get her thoughts on where the project is headed.

Here are some highlights from our conversation:

The first engineering project that made use of Apache Kafka

If I remember correctly, we were putting Hadoop into place at LinkedIn for the first time, and I was on the team that was responsible for that. The problem was that all our scripts were actually built for another data warehousing solution. The question was, are we going to rewrite all of those scripts and now sort of make them Hadoop specific? And what happens when a third and a fourth and a fifth system is put into place?

So, the initial motivating use case was: ‘we are putting this Hadoop thing into place. That’s the new-age data warehousing solution. It needs access to the same data that is coming from all our applications. So, that is the thing we need to put into practice.’ This became Kafka’s very first use case at LinkedIn. From there, because that was very easy and I actually helped move one of the very first workloads to Kafka, it was hardly difficult to convince the rest of the LinkedIn engineering team to start moving over to Kafka.

So from there, Kafka adoption became pretty viral. Now, I think years down the line, all of LinkedIn runs on Kafka. It’s essentially the central nervous system for the whole company.

Microservices and Kafka

My own opinion of microservices is that they let you add more money and turn it into software at a more constant rate. Decoupling a big monolith lets engineers focus on different parts of the application, so that a lot of the development of real applications can happen in parallel.

… The upside is that it lets you move fast. It adds a certain amount of agility to an engineering organization. But it comes with its own set of challenges. And these were not very obvious back then. How are all these microservices deployed? How are they monitored? And, most importantly, how do they communicate with each other? The communication bit is where Kafka comes in. When you break a monolith, you break state. And you distribute that state across different machines that run all those different applications.

So now the problem is, ‘well, how do these microservices share that state? How do they talk to each other?’ Frequently, the expectation is that things happen in real time. The context of microservices where streams or Kafka comes in is in the communication model for those microservices. I should just say that there isn’t a one-size-fits-all answer when it comes to communication patterns for microservices.
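
To make the communication model concrete, here is a minimal sketch of the idea of services sharing state through a stream. It is a toy in-memory log, not Kafka's client API: the class name InMemoryLog, its publish and poll methods, the "orders" topic, and the "billing" and "shipping" services are all hypothetical names chosen for illustration. The assumption is simply that each service reads the same ordered stream of events at its own pace, tracked by its own offset.

```python
from collections import defaultdict


class InMemoryLog:
    """Toy append-only log with per-group read offsets (illustrative only, not Kafka's API)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> ordered list of events
        self._offsets = defaultdict(int)   # (group, topic) -> next index to read

    def publish(self, topic, event):
        """Append an event to a topic and return its position in that topic."""
        self._topics[topic].append(event)
        return len(self._topics[topic]) - 1

    def poll(self, group, topic):
        """Return the events this group has not seen yet, then advance its offset."""
        events = self._topics[topic]
        start = self._offsets[(group, topic)]
        self._offsets[(group, topic)] = len(events)
        return events[start:]


log = InMemoryLog()

# A hypothetical "orders" service publishes state changes as events.
log.publish("orders", {"order_id": 1, "status": "created"})
log.publish("orders", {"order_id": 1, "status": "paid"})

# Two hypothetical downstream services read the same stream independently.
billing_view = log.poll("billing", "orders")
shipping_view = log.poll("shipping", "orders")
```

In this sketch the publishing service never calls billing or shipping directly; each consumer rebuilds the state it needs from the shared stream. A real deployment would also have to handle persistence, partitioning, and failures, which this example leaves out.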

Related resources:

Kafka: The Definitive Guide
“Architecting and building end-to-end streaming applications”: Karthik Ramasamy on Heron, DistributedLog, and designing real-time applications.
“Semi-supervised, unsupervised, and adaptive algorithms for large-scale time series”: Ira Cohen on developing machine learning tools for a broad range of real-time applications.
“Building Apache Kafka from scratch”: Jay Kreps on data integration, event data, and the Internet of Things.
I ❤️ Logs
“How companies can navigate the age of machine learning”: to become a “machine learning company,” you need tools and processes to overcome challenges in data, engineering, and models.

Post topics: AI & ML, Data, O'Reilly Data Show Podcast

Post tags: Podcast
