9 The Gravity Principle in Data Lakes

The previous chapter showed that a data lake is complex from an architecture point of view and is more than a simple storage management system. Apache Hadoop, the technology most often used to store data for the data lake, is no longer the only solution proposed: hybrid solutions involving NoSQL and RDBMS technologies are now also implemented. As a result, data lake solutions are harder to design and require exploring several technologies and approaches. In this chapter, we explore some factors that, from an architecture angle, can force alternatives to the “physical” movement of data from data sources to the data lake. Based on the work in [ALR 15, MCC 14, MCC 10], the data gravity concept offers an interesting perspective for the data lake. We investigate what influence data gravity could have on data lake architecture design, and which parameters within the data gravity concept could exert that influence.

9.1. Applying the notion of gravitation to information systems

9.1.1. Universal gravitation

In physics, universal gravitation refers to the mutual attraction between any two bodies with nonzero mass. According to Newton, the force F between two point bodies of respective masses m1 and m2 separated by a distance d is as follows:

$$F = G \frac{m_1 m_2}{d^2}$$

where G is the universal gravitational constant. Gravitation is the cause of orbital motions ...
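
As a minimal numerical sketch of Newton's law, F = G · m1 · m2 / d², the Python snippet below computes the attraction between two point masses. The function name `gravitational_force` and the rounded Earth and Moon figures are illustrative assumptions, not values taken from the text.

```python
# Universal gravitational constant in N·m²/kg² (CODATA 2018 value).
G = 6.67430e-11


def gravitational_force(m1: float, m2: float, d: float) -> float:
    """Return the magnitude, in newtons, of the attraction between two point masses.

    m1 and m2 are masses in kilograms; d is their separation in metres.
    """
    if d <= 0:
        raise ValueError("distance must be positive")
    return G * m1 * m2 / d**2


# Rounded, illustrative values (assumptions, not from the text).
EARTH_MASS_KG = 5.972e24
MOON_MASS_KG = 7.342e22
EARTH_MOON_DISTANCE_M = 3.844e8

force = gravitational_force(EARTH_MASS_KG, MOON_MASS_KG, EARTH_MOON_DISTANCE_M)
print(f"Earth-Moon attraction: {force:.2e} N")
```

With these rounded inputs the result is on the order of 2 × 10^20 N. Two properties of the formula are worth noting: the force grows in proportion to each mass, and it falls off with the square of the distance, so doubling d divides the force by four.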
