In this episode of the O’Reilly Podcast, O’Reilly’s Ben Lorica sat down with Ben Sharma, CEO and co-founder of Zaloni, an organization that provides enterprise data lake management solutions. They discussed real-time data processing, the changing nature of data, sentiment analysis, and the trend toward combining cloud with on-prem infrastructures.
Real-time data for near-real-time decision making
Being able to consume data in near real time, and to process it to drive insights in near real time, is a real use case. At Zaloni, we have seen multiple use cases for this. … At the same time, we're thinking a lot about self-service and bringing in business users, and it's important to realize that they have certain expectations about latency, based on their experience with the data warehouse or the traditional database world.
As you bring in those consumers, with their favorite BI visualization tools, there is a need for a low-latency serving layer that can run on top of the data lake environment in a converged manner. This avoids shuffling data back and forth between the serving layer and the processing layer, and that converged design becomes the reference architecture for the data lake going forward.
Evolving nature of data stored in the data lake
One of the trends we have seen is that the driver for the data lake will be one or two use cases for which the data may be very enterprise-oriented. With data coming into the data lake from existing sources, the enterprise can break down those silos and create a rich data model that it couldn't build easily in its previous data environments. Once one or two initial use cases are successful, that opens up a variety of other use cases, where you may be bringing in completely unstructured data, such as raw text. One very good use case that we're seeing is in a health care setting.
In this example, you can do feature extraction, map it to a taxonomy model, and associate it with electronic medical records, so that you know what is in the physician's notes: unstructured data in the form of blocks of text captured from the clinical information system. Being able to create a rich data model based on some of this unstructured data, along with structured data such as EMR data coming from Epic or another EHR system, helps you create something new that you couldn't do before.
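The pipeline described above can be sketched in a few lines. This is a minimal, illustrative version using plain Python and a small hypothetical taxonomy; a production system would use an NLP library and a standard clinical vocabulary (e.g., SNOMED CT or ICD codes) rather than keyword matching.

```python
# Hypothetical taxonomy: surface terms found in notes -> canonical concept.
TAXONOMY = {
    "shortness of breath": "symptom/dyspnea",
    "chest pain": "symptom/chest_pain",
    "hypertension": "condition/hypertension",
}

def extract_features(note_text):
    """Naive feature extraction: find taxonomy terms in free-text notes."""
    text = note_text.lower()
    return sorted({concept for term, concept in TAXONOMY.items() if term in text})

def enrich_record(emr_record, note_text):
    """Associate extracted concepts with a structured EMR record."""
    enriched = dict(emr_record)
    enriched["note_concepts"] = extract_features(note_text)
    return enriched

note = "Patient reports chest pain and shortness of breath. History of hypertension."
record = {"patient_id": "12345", "source": "Epic"}
print(enrich_record(record, note))
```

The key idea is the join at the end: concepts extracted from free text land in the same record as structured EMR fields, which is the "rich data model" the quote describes.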
Deriving business value from sentiment analysis
I think working with unstructured text without a sense of the specific business insights you can use constructively for particular use cases can be very challenging. But having very specific outcomes that you need to get out of the unstructured data is becoming more and more common. One thing we have seen in the past is that some customers would mine a lot of unstructured data for sentiment analysis, but in the end found that it did not deliver enough business value to actually drive business outcomes. In those cases, adoption becomes challenging, as does making sentiment analysis part of the core set of use cases you're building on the platform.
You always need to think from a business perspective: what would you do if you had sentiment analysis? First, how good are the results, and how reliable are they? Then ask: how would you take that outcome and drive change in the organization, or change how your customers experience your service, so that they get better service or a better customer experience end to end? Those things need to be thought through in terms of the end-to-end business impact of a use case for it to be successful in a Hadoop environment.
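The "how reliable are the results?" question above can be made concrete with a cheap evaluation step before committing to sentiment analysis as a core use case. Below is a minimal sketch: a deliberately naive lexicon-based scorer checked against a small hand-labeled sample. The lexicon and sample are hypothetical; a real evaluation would use a proper model and far more labeled data.

```python
# Hypothetical sentiment lexicons.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "awful", "rude"}

def sentiment(text):
    """Naive lexicon-based sentiment: count positive vs. negative words."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "pos" if score > 0 else "neg" if score < 0 else "neutral"

# Small hand-labeled sample used to estimate whether results are reliable
# enough to act on before making this a core platform use case.
labeled = [
    ("The support rep was helpful and fast", "pos"),
    ("Checkout is broken and the app is slow", "neg"),
    ("Delivery arrived on Tuesday", "neutral"),
]

accuracy = sum(sentiment(t) == y for t, y in labeled) / len(labeled)
print(f"accuracy on sample: {accuracy:.0%}")
```

If the measured accuracy is too low to drive the intended business action, that is a signal to stop before building the use case out, which is exactly the failure mode the quote describes.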
From cloud to ground: mixing data models
At Zaloni, we are seeing a couple of different trends with regard to mixing cloud and on-premises data models. One trend is that a lot of data is now generated from outside the organization. Think about car manufacturers, for example: they have their own enterprise data sources to identify their customers, their manufacturing data, and so forth, but as cars get smarter and gain more sensors, they are starting to send data back to the manufacturer. That is not a problem car manufacturers had to deal with before, but now each car they sell generates so much data that they need to consume and analyze it in order to be proactive about service, maintenance, and the various other things they provide to consumers as a service.
In order to consume all of that data, it makes sense to have cloud-based infrastructures because this is all data that has been generated outside of the core enterprise environment. At the same time, you still want to have a view of what's going on across the fleet of vehicles that you're selling or what’s going on across the different product lines. For example, you might ask: is one year of a specific model having more problems than other years of that same model? Related to that, should we become more proactive in terms of recalls and the like? Can we do recalls more selectively so that we don't need to recall all 60,000 cars?
In these types of scenarios, we see a very good fit for what could be a cloud-to-ground model: you have the on-prem big data platforms with all of their core enterprise data sets, but you also have the cloud-based platforms. Many of these cloud platforms separate the compute layer from the storage layer, so the data being fed from these vehicles, devices, or sensors can be written into the storage layer. Then, when you do the computation on that data, you can spin up clusters on demand and rely on the persistent metadata model to process the data sets that persist in storage.
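The compute/storage separation described above can be sketched as follows. This is a simulation only: the object store and metadata catalog are in-memory dicts with hypothetical names and paths, standing in for real object storage (e.g., S3) and a real catalog, and the "cluster" is just a short-lived function call that keeps no state between runs.

```python
# Storage layer: persists between compute runs (simulated).
object_store = {
    "telemetry/2016/vin123.csv": "vin,engine_temp\nVIN123,98\nVIN123,121\n",
}

# Persistent metadata model: what data exists and where it lives.
catalog = {
    "vehicle_telemetry": {"path": "telemetry/2016/vin123.csv", "format": "csv"},
}

def run_ephemeral_job(dataset, transform):
    """Simulate spinning up compute on demand: resolve the dataset via the
    catalog, read it from storage, process it, and tear down. No state is
    kept in the compute layer itself."""
    entry = catalog[dataset]
    raw = object_store[entry["path"]]
    return transform(raw)

def count_overheats(csv_text, threshold=110):
    """Example transform: count sensor readings above a temperature threshold."""
    rows = [line.split(",") for line in csv_text.strip().splitlines()[1:]]
    return sum(1 for _, temp in rows if int(temp) > threshold)

print(run_ephemeral_job("vehicle_telemetry", count_overheats))  # → 1
```

Because the catalog and the storage layer persist while compute does not, a new cluster can be spun up for each analysis (fleet-wide queries, model-year comparisons, selective-recall analysis) without moving the data.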