Damon Feldman on combining a data hub and data lake
The O’Reilly Podcast: Achieving greater reliability and security when integrating data.
In this podcast episode, O’Reilly’s Shannon Cutt talks with Damon Feldman, solutions director at MarkLogic, a company that has developed an operational and transactional NoSQL database that integrates data silos to give customers a single view of their data. They discuss data lakes, data hubs, integrating data in a lake format, data governance related to security, and more.
Here are some highlights from the conversation:
Why data lakes can easily turn into data swamps
The core, or underlying foundation, of these lakes is the Hadoop Distributed File System (HDFS); it really is a big file system, just like the file system on our laptops or home computers. You can put anything there, make as many copies as you like, name things wrong, lose track of them, and not know who put them there. There’s not inherently a lot of governance to it. That flexibility is a little bit like the flexibility of the web and HTML. Anybody can throw up any kind of crazy web page, and you can’t necessarily tell what information in it is accurate. So, that flexibility does lead, in some cases, to chaos.
How a data hub can help clear up a data lake
A data hub is an extension of the data lake pattern. The primary thing that characterizes the data lake is that you’re moving data into one area. When you move data to one place, you gain a huge amount of power in your ability to access and control the data. Analytics, reporting, and other downstream uses of that data can become much easier because you can access the data any time you want. A data hub goes beyond moving the data. It starts to organize, harmonize, add a little common structure, and secure the data.
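To make "harmonize" and "add a little common structure" more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the source names (`crm`, `billing`), the field names, and the `harmonize` function are all hypothetical, and it does not represent MarkLogic's API or any particular data hub product.

```python
from typing import Any

# Hypothetical mappings from source-specific field names to common names.
FIELD_MAPS = {
    "crm": {"cust_name": "name", "mail": "email"},
    "billing": {"customerName": "name", "emailAddress": "email"},
}


def harmonize(source: str, record: dict[str, Any]) -> dict[str, Any]:
    """Wrap a raw record in an envelope that adds a small common structure.

    The original record is kept untouched alongside the harmonized fields.
    """
    mapping = FIELD_MAPS[source]
    # Copy only the fields we know how to map; everything else stays
    # available in the untouched original.
    common = {mapping[key]: value for key, value in record.items() if key in mapping}
    # Normalize one field so records from different sources compare equal.
    if "email" in common:
        common["email"] = common["email"].strip().lower()
    return {"source": source, "common": common, "original": record}


crm_doc = harmonize("crm", {"cust_name": "Ada", "mail": " ADA@Example.com "})
billing_doc = harmonize("billing", {"customerName": "Ada", "emailAddress": "ada@example.com"})
```

In this sketch, both documents end up with the same `common` section even though the source systems used different field names, while each `original` section still holds the record as it arrived.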
Tracking and governing data
As you copy data around, whether you’re doing it through a well-defined pipeline like an ETL or data processing pipeline or if you’re doing it in a more ad hoc way, it’s important to keep track of the original version of the data—which all data lakes are pretty good at. Data lakes are very flexible, and they have the ability to take in an as-is copy of data that originated somewhere else and keep it in its native form.
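One way to picture keeping track of the original version is to store each as-is copy with a little provenance metadata and have every derived copy point back to its parent. The sketch below assumes an in-memory dictionary as a stand-in for the lake; the function names (`ingest_as_is`, `derive`) and the metadata fields are hypothetical and are not taken from the podcast or from any specific product.

```python
import hashlib
from datetime import datetime, timezone


def ingest_as_is(store: dict, raw: bytes, source: str) -> str:
    """Store an unmodified copy of the data, keyed by its content hash."""
    record_id = hashlib.sha256(raw).hexdigest()
    # setdefault keeps the first copy, so re-ingesting identical bytes
    # does not overwrite the original provenance.
    store.setdefault(record_id, {
        "raw": raw,
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "derived_from": None,
        "step": None,
    })
    return record_id


def derive(store: dict, parent_id: str, step: str, transform) -> str:
    """Apply a transform and record which copy the result came from."""
    parent = store[parent_id]
    new_raw = transform(parent["raw"])
    new_id = ingest_as_is(store, new_raw, parent["source"])
    # Only annotate genuinely new content that has no lineage yet; a
    # transform that changes nothing yields the parent's own ID.
    if new_id != parent_id and store[new_id]["derived_from"] is None:
        store[new_id]["derived_from"] = parent_id
        store[new_id]["step"] = step
    return new_id


lake = {}
original_id = ingest_as_is(lake, b"Name,Email\n", source="crm-export")
cleaned_id = derive(lake, original_id, "lowercase-header", bytes.lower)
```

Here the original bytes stay in the store unchanged, and the cleaned copy carries a `derived_from` link and a `step` label, so anyone looking at it can trace it back to what first arrived.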
How the cloud works with data lakes and data hubs
Historically, data lakes, coming from the Hadoop world, have been able to run on physical hardware in an inexpensive way. However, the cloud has become so powerful, and makes it so convenient to stand up large groups of servers, that all of the data lake and Hadoop vendors are now able to run on the cloud, even though you lose some of the cost benefit of cheap disks. It’s not a huge difference in terms of setting up and running the systems. The main advantage, and the main difference, comes when you want fast elasticity. Other than that, it makes little difference whether you run in the cloud or on physical hardware.
This post and podcast are a collaboration between O’Reilly and MarkLogic. See our statement of editorial independence.