Chapter 4. Scalable Data Lakes
If you change the way you look at things, the things you look at change.
Wayne Dyer
After reading the first three chapters, you should have all you need to get your data lake architecture up and running on the cloud, with a reasonable cost profile for your organization. In theory, you also have the first set of use cases and scenarios running successfully in production. Your data lake is so successful that the demand for more scenarios has grown, and you are busy serving the needs of your new customers. Your business is booming, and your data estate is growing rapidly. As they say in business, going from zero to one is a different challenge than going from one to one hundred or from one hundred to one thousand. To ensure your design is also scalable and continues to perform as your data and use cases grow, it’s important to understand the various factors that affect the scale and performance of your data lake. Contrary to popular opinion, scale and performance do not always have to be traded off against cost; in many cases they go hand in hand. In this chapter, we will take a closer look at these considerations, as well as strategies to optimize your data lake for scale while continuing to optimize for costs. Once again, we will use Klodars Corporation, a fictitious organization, to illustrate our strategies. We will build on these fundamentals to focus on performance in Chapter 5.
A Sneak Peek into Scalability
Scale and performance are terms ...