Using microservices to evolve beyond the data lake

The better prepared you are to utilize all the data in your data lake, the more likely you are to be successful​.

By Jim Scott
February 21, 2017
Mountains. Mountains. (source: Pixabay)

Big data tools and technologies started out by meeting the needs of the analytics community, but they have been evolving ever since. These tools and technologies were born out of the necessity to support large-scale analytics that wouldn’t break the bank. They have since morphed into a set of technologies to support the live operational aspects of a business. Starting with Hadoop, Pig and Hive, HBase and other NoSQL point solutions onto Spark, Flink, Drill, and Kafka—a plethora of technologies that were built to each handle individual aspects of the three V’s of big data (volume, variety, and velocity).

Let’s take a moment to level set. These technologies started out by displacing workloads previously reserved for the traditional data warehouse. Those data warehouses are now actively being augmented or displaced by data lakes. Fundamentally, this is predominantly driven by the cost savings when handling large volumes of data. Does this mean that traditional RDBMS or data warehouse are obsolete? Of course not! At this point in time, this is still primarily a story of co-existence.

Learn faster. Dig deeper. See farther.

Join the O'Reilly online learning platform. Get a free trial today and find answers on the fly, or master something new and useful.

Learn more

Putting big data aside for a just a moment, let’s peer into the applications that create data. These applications or services that have been built are still mostly monolithic in nature. This has been due in part to the cost of scaling the messaging layer to build properly decoupled components. The data these applications are generating has been growing by magnitudes in recent years, and making services smaller and more micro in nature means more data moving and more data being stored. Businesses have begun the move to microservices, so the better prepared you are to leverage all of this data with your data lake, the more likely you are to be successful.

In the meantime, companies are putting more and more data in their data lakes. System “A” generating data, process “B” transforming that data, process “C” moving the data to store it in data lake “D” to be analyzed, by analytics application “E.” Just because the only thing that was really changed between the data warehouse and the data lake is the final storage location doesn’t mean we shouldn’t think further into the future.

Time to evolve beyond the data lake

Generally speaking, moving from the data warehouse to the data lake enables new ways to look at the business opportunities available. The processes don’t change a lot by just swapping in the data lake for the data warehouse. It still takes a substantial amount of time to get the data moved from the source to the destination, and the transformations are likely to still be complex. While this technology change may help move the business in the right direction, it likely doesn’t take it far enough.

Companies use software to drive their business. The faster software can be written, tested, and deployed to production, the more nimble and agile a company can be to respond to its business needs. The same goes for the speed at which data can be stored at the final destination and made available to all downstream processes. Instead of just swapping technologies, we should think bigger.

Let’s consider for a moment that we built our business application and deployed it directly on top of our data lake. I’m not talking about just some analytics application. I’m talking about applications that require data persistence like a database—maybe key-value, wide-column oriented, or even a document database. I’m also talking about message-driven applications that are not based on legacy message queueing models, applications that can scale with the underlying platform. If the storage platform can scale linearly, then shouldn’t we consider putting the applications right on top of the storage platform? If the application runs where the data is stored, then we don’t have to worry about moving the data later to perform analytics.

While we could start putting our existing applications on this technology stack and leveraging the shared storage platform, I think we can still go a step further.

Leverage a microservices model

A service-oriented architecture (SOA) is not a new concept, but in practice, it can be an expensive proposition for delivering the scaling sought from this architecture. The primary problem in the past with SOA was that the smaller and more plentiful the number of components, the more messages would be created—meaning that pushing the messaging platforms to their limits was very easy. In turn, the number of components in an architecture was dialed back so as to not worry about the scaling issues of the messaging platform and limiting the real value of SOA.

With the newer publish-subscribe model of messaging delivered via technologies like Kafka and MapR Streams, we can achieve rates surpassing one million events per second with a minor investment. Because we no longer have to worry about scaling the messaging platform, we are free to take the SOA concepts forward to the level originally intended by the architecture. The core intent of the microservice is to limit the service to a single purpose, enabling it to be easily scaled and swapped out.


Microservices should be fully decoupled by leveraging a solid publish-subscribe technology because fully decoupling services makes them more scalable. Scaling on either side of the event stream means that as a data source grows, there is full support for backpressure. It also means that the service doesn’t need to know how or where the event is being generated—only that events can be read from some location. That service can then be scaled to handle whatever service level must be met, based on the volume of the events. Deploying a service in this way, services can easily be composed into different workflows.

Backpressure is a very important topic that should not be pushed into the background of a decoupled message-driven architecture. Backpressure refers to the ability to prevent a service from being overwhelmed by requests. If a service takes longer to respond to a request than the rate at which requests come in, there will be backpressure and services will begin failing. By the way, this is not usually a graceful series of events.

Using an event stream in front of a service is paramount to maintaining service levels and enabling scalability of a message-driven architecture. Consider this a warning that any platform that requires you set to manual high water marks for components is not backpressure support. Systems that do this can easily bottleneck and cause a backup to the ingress point and fall over causing catastrophic application failure, which is what backpressure is meant to prevent.

Web fraud as a use case

It may be difficult to imagine how or why a microservices approach is a great fit with big data technologies, so let’s walk through a use case. Looking at a use case like web fraud, we can begin to understand where the pieces fit.

Imagine a banking website where there are a lot of users. Most are valid users, but some are fraudsters. The bank’s job is to identify as many cases of fraudsters as possible as early as possible in order to prevent loss.

Every event on a website generates logs, and those logs must be processed in as close to real-time as possible in order to make this use case functional. We can’t wait for data to get moved in batch for processing later. That data must stream in and be acted upon immediately, otherwise action can not be taken immediately. Waiting five minutes means the fraudster may make off with your customers’ money.

The click stream can be processed by a series of different microservices as the events are created. Those services could be leveraging databases containing past history for the logged in user, or a blacklist of activities from known fraudulent activities or even a white list of activities where it is known that a fraudster never reads the end user license agreement page (as an example). Each of those microservices could then generate their own output, which can further be evaluated. The actions from there may allow the user to continue their session as-is, or to perhaps change the available functionality (e.g., cannot transfer funds to other accounts). Those events may also be used to trigger intervention by a security team to evaluate the heuristics of traffic in real time, perhaps even involving the location where fraudsters are based to help trigger law enforcement activities, for example.

Those services can be easily scaled, altered, new services deployed or validated against live data. Things that are rather difficult in a non-message driven architecture. Since the services are micro in nature, they can be rapidly evolved.

Microservices take on even more importance when you start to consider alternative use cases. In order to really complete the web fraud use case, for example, a user profile needs to be created. That user profile could then be used to provide customer 360 services, allowing for recommendations for other cross-selling opportunities, or to continue building brand loyalty.

Use cases tend to look similar

After building microservices, consider managing them across your big data platform, directly on top of your data lake, leveraging a resource manager like Apache Mesos. With it you can leverage Apache Myriad to dynamically launch your YARN clusters on Mesos. This allows you the ability to leverage business applications as well as analytics applications on the same hardware.

While isolated clusters, like those of Kafka or HBase, can deliver some of the needs to support these microservices, data movement and duplication becomes a major burden. Ideally, you want to gain the ability to meet the needs of all your business use cases—not just analytics or business applications, but both together for all your required data access needs via standard APIs (files, database, and streaming). MapR has what’s called a Converged Data Platform, which converges files, NoSQL databases, streaming, and other enterprise databases into a single global namespace.

Following this approach enables a linearly scalable and flexible infrastructure, and developers and administrators can stop worrying about how to get dedicated hardware resources for single purpose clusters. The idea is to also prevent re-architecting your applications when you have customer expansion beyond what you originally planned, which typically comes with customer success. These technologies take a little time to understand and get comfortable with, but may be worth the investment.

To learn more about working with Kafka, check out Kafka: The Definitive Guide.

Post topics: Data science