Validating data models with Kafka-based pipelines

A case for back-end A/B testing.

By Gwen Shapira

May 27, 2015

The McNary Dam on the Lower Columbia River, along border of Oregon and Washington states (source: Flickr/Department of Energy)

Start the O’Reilly “Introduction to Apache Kafka” training video for free. In this video, Gwen Shapira shows developers and administrators how to integrate Kafka into a data processing pipeline.

A/B testing is a popular method of using business intelligence data to assess possible changes to websites. In the past, when a business wanted to update its website in an attempt to drive more sales, decisions on the specific changes to make were driven by guesses; intuition; focus groups; and ultimately, which executive yelled louder. These days, the data-driven solution is to set up multiple copies of the website, direct users randomly to the different variations and measure which design improves sales the most. There are a lot of details to get right, but this is the gist of things.

Learn faster. Dig deeper. See farther.

Join the O'Reilly online learning platform. Get a free trial today and find answers on the fly, or master something new and useful.

Learn more

When it comes to back-end systems, however, we are still living in the stone age. Suppose your business grew significantly and you notice that your existing MySQL database is becoming less responsive as the load increases. Suppose you consider moving to a NoSQL system, you need to decide which NoSQL solution to pick — there are a lot of options: Cassandra, MongoDB, Couchbase, or even Hadoop. There are also many possible data models: normalized, wide tables, narrow tables, nested data structures, etc.

A/B testing multiple data stores and data models in parallel

It is surprising how often a company will pick a solution based on intuition or even which architect yelled louder. Rather than making a decision based on facts and numbers regarding capacity, scale, throughput, and data-processing patterns, the back-end architecture decisions are made with fuzzy reasoning. In that scenario, what usually happens is that a data store and a data model are somehow chosen, and the entire development team will dive into a six-month project to move their entire back-end system to the new thing. This project will inevitably take 12 months, and about 9 months in, everyone will suspect that this was a bad idea, but it’s way too late to do anything about it.

Note how this approach is anti-agile. Even though the effort is often done by scrum teams with stories, points, sprints, and all the other agile trapping, a methodology in which you are taking on a project without knowing if early design decisions were correct or not for six months is inherently a waterfall methodology. You can’t correct course based on data because by the time you have data, you’ve invested too much in the project already. This model is far too inflexible and too risky compared to a model of choosing few reasonable options, testing them out for few weeks, collecting data, and proceeding based on the results.

The reason smart companies that should know better still develop data back ends using this waterfall model is that the back-end system is useless without data in it. Migrating data and the associated data pipelines is by far the most challenging component in testing out a new back-end system. Companies do six-month back-end migration projects because they have no way of testing a back-end system in a two-week spike.

But what if you could? What if you could easily “split” all of your data pipelines to go through two back ends instead of one? This will allow you to kick the new system a bit with your real data and check out how to generate reports, how to integrate existing software, and to find out how stable the new database is under various failure conditions. For a growing number of organizations, this is not just an interesting possibility; this is a reality.

Kafka’s place in the “which datastore do we chose” debate

Data bus is a central component of all modern data architectures: in both Lambda or Kappa architectures, the first step in the data processing pipeline is collecting all events in Apache Kafka. From Kafka, data can be processed in streams or batches, and because Kafka scales so well when you store more data or add more brokers, most organizations store months of data within Kafka.

If you follow the principles of agile data architectures, you will have most of your critical data organized in different topics in Kafka — not just the raw input data, but data in different stages of processing — cleaned, validated, aggregated and enriched. Populating a new back end with the data for a proof-of-concept is no longer a question of connecting many existing systems to the new back end and re-creating huge number of pipelines. With Kafka, populating a back-end system is a relatively simple matter of choosing which topics should be used and writing a consumer that reads data from these topics and inserts them into the new back end.

This means the new back end will receive data without changing a single thing about the existing systems. The data sources and the old back ends simply continue on from the existing pipeline, unaware that there is a new consumer somewhere populating a new back-end data store. The data architecture is truly decoupled in a way that allows you to experiment without risking existing systems.

Once the new data store and its data model is populated, you can validate things like throughput, performance, ease of querying, and anything else that will come up in the “which datastore do we chose” debate. You can even A/B test multiple data stores and multiple data models in parallel, so the whole decision process can take less time.

Next time your team debates between three different data models, there will still be opinions, intuition, and possibly even yelling, but there will also be data to help guide the decision.

Post topics: Data