Chapter 4. Scalable Data Lakes
If you change the way you look at things, the things you look at change.
Wayne Dyer
After reading the first three chapters, you should have all you need to get your data lake architecture up and running on the cloud, at a reasonable cost profile for your organization. Theoretically, you also have the first set of use cases and scenarios successfully running in production. Your data lake is so successful that the demand for more scenarios is now higher, and you are busy serving the needs of your new customers. Your business is booming, and your data estate is growing rapidly. As they say in business, going from zero to one is a different challenge than going from one to one hundred or from one hundred to one thousand. To ensure your design is also scalable and continues to perform as your data and the use cases grow, it's important to understand the various factors that affect the scale and performance of your data lake. Contrary to popular opinion, scale and performance are not always a trade-off with costs; rather, they very much go hand in hand. In this chapter, we will take a closer look at these considerations as well as strategies to optimize your data lake for scale while continuing to optimize for costs. Once again, we will be using Klodars Corporation, a fictitious organization, to illustrate our strategies. We will build on these fundamentals to focus on performance in Chapter 5.
A Sneak Peek into Scalability
Scale and performance are terms you have likely seen sprinkled generously into product pitches and marketing materials. What do they actually mean, and why are they important? To understand this better, let's first look at the definition of scalability. In Chapter 5, we will dive deep into the performance aspects.
What Is Scalability?
The best definition of scalability that I have ever come across is from Werner Vogels's blog. Vogels is the CTO of Amazon, which hosts one of the largest hyperscale systems on the planet. According to his blog, the definition of scalability goes like this:
A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added. An always-on service is said to be scalable if adding resources to facilitate redundancy does not result in a loss of performance.
This concept of scale is very important because as your needs and usage grow, it is important to have an architecture that can guarantee the same experience to your customers without degradation in performance. To illustrate this better, we will apply the principles of scale to something all of us can relate to: making sandwiches.
Scale in Our Day-to-Day Life
Let's take an example of scalability in action. Say it takes you a total of five minutes to pack one peanut butter and jelly sandwich for lunch, which consists of the following steps, as shown in Figure 4-1:

1. Toast two pieces of bread.
2. Spread peanut butter on one side.
3. Spread jelly on the other side.
4. Assemble the sandwich.
5. Bag the sandwich.
Simple enough and no sweat, right? Now, let's say you want to make 100 peanut butter and jelly sandwiches. The obvious next step is to invite more people to help. Now, if one sandwich takes five minutes to make, and you have a five-member team to make the 100 sandwiches, it's natural to think that it would take a total of 100 minutes to make these 100 sandwiches, assuming an equal distribution of labor and that each person makes 20 sandwiches.
Note
In this particular example, performance is measured by the output in terms of the unit of work done (one sandwich) and the time taken for that output (average time to make one sandwich). Scalability is understanding how well this average time is preserved as the amount of work increases.
However, reality could be very different in terms of how you choose to implement this with five people. Let's take a look at two approaches:

- End-to-end execution approach: Each person follows the steps to make a single sandwich, then proceeds to make the next sandwich, until they complete 20 sandwiches. This is illustrated in Figure 4-2.
- Assembly-line approach: You have a division of labor where you distribute the steps to different people: the first person toasts the bread, the second person applies the peanut butter, the third person applies the jelly, and so on, as shown in Figure 4-3.
As you have probably guessed by now, the assembly-line approach is more efficient than the end-to-end execution approach. But what exactly makes it more efficient? To understand this, we first need to understand the fundamental concepts of what affects scaling:
- Resources: The materials that are available to make the sandwich; in this case, the bread, peanut butter, and jelly are the more obvious resources. The toaster and knives are other resources required to make the sandwich.
- Task: The set of steps that are followed to produce the output; in this case, the five steps required to make the sandwich.
- Workers: The components that perform the actual work; in this case, the people who execute the task of making the sandwich.
- Output: The outcome of the job that signals that the work is complete; in this case, the sandwich.
The workers utilize the resources to execute the task that produces the output. How effectively these come together affects the performance and scalability of the process. Table 4-1 shows how the assembly-line approach becomes more efficient than the end-to-end execution approach.
| Area | End-to-end execution approach | Assembly-line approach |
|---|---|---|
| Contention for resources | All the workers end up contending for the same set of resources (toaster, jar of peanut butter, etc.). | The contention is minimal since the different workers need different resources. |
| Flexibility in worker-to-task mapping | Low: since the workers perform all the tasks, the allocation is uniform across all tasks. | High: if some tasks need more workers than others, a quick shift is possible. |
| Impact of adding/removing resources | Adding resources might not make a big difference, depending on where the bottleneck is in the system. However, adding the right resources would speed up the execution. For example, with five toasters instead of one, the workers can toast the bread more quickly. | The flexibility of resource allocation allows for an increase in performance when more resources are available. |
Also, note that in the end-to-end execution approach, some sandwiches will take less time to complete than others. For example, when five people reach for the toaster and one person gets it, that particular sandwich will be done sooner than the ones whose makers have to wait for the toaster to free up. So if you were to measure performance, you would see that the time to make a sandwich at the 50th percentile might be acceptable, but the time taken at the 75th and 99th percentiles might be a lot higher.
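To make the percentile effect concrete, here is a toy simulation of the end-to-end approach, written in Python; the timings and the greedy scheduling policy are illustrative assumptions, not something taken from the example above.

```python
# Toy model of the end-to-end approach: 5 workers, 20 sandwiches each,
# one shared toaster. Illustrative assumptions: 2 minutes of exclusive
# toaster time plus 3 minutes of uncontended work per sandwich.
import statistics

WORKERS, SANDWICHES_PER_WORKER = 5, 20
TOAST_MIN, OTHER_MIN = 2, 3

toaster_free_at = 0.0
worker_free_at = [0.0] * WORKERS
latencies = []  # minutes from starting a sandwich to bagging it

for _ in range(WORKERS * SANDWICHES_PER_WORKER):
    # The worker who is free earliest starts their next sandwich.
    w = min(range(WORKERS), key=lambda i: worker_free_at[i])
    started = worker_free_at[w]
    toast_start = max(started, toaster_free_at)  # may wait for the toaster
    toaster_free_at = toast_start + TOAST_MIN
    finished = toaster_free_at + OTHER_MIN
    worker_free_at[w] = finished
    latencies.append(finished - started)

cuts = statistics.quantiles(latencies, n=100)
print(f"p50={cuts[49]:.0f} min, p75={cuts[74]:.0f} min, p99={cuts[98]:.0f} min")
```

Because the single toaster cannot keep up with five workers, later sandwiches queue behind it, and the tail percentiles come out far worse than the median, which is exactly the shape of latency distribution you see in a contended data system.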
It would be fair to conclude that the assembly-line approach is more scalable than the end-to-end execution approach in making the sandwich. While the benefits of this scalability are not conspicuous in your normal day, such as when you may pack three or four sandwiches, the difference is really visible when the job to be done drastically increases, to making 3,000 or 4,000 sandwiches, for instance.
Scalability in Data Lake Architectures
In data lake architectures, as we saw in the previous chapters, resources are available to us from the cloud: the compute, storage, and networking resources. Additionally, there are two key factors that we own and control to best leverage the resources available to us: the processing job, which is the code we write, and the data itself, in terms of how we get to store and organize it for processing. This is illustrated in Figure 4-4.
The key resources available to us in a cloud data lake architecture are as follows:
- Compute resources: The compute resources available to you from the cloud are primarily CPU cores and memory for your data processing needs. Additionally, there is software running on these cores: primarily the software you install in the case of IaaS, and the software available to you as part of PaaS and SaaS offerings, which is designed to manipulate and optimize the utilization of the CPU and memory resources. A key understanding of how this works is critical to building your scalable solution. Cloud service providers offer capabilities like autoscaling, where, as the compute needs of your solution increase, your cloud services can automatically add more compute without you having to manage the resources (see the sketch after this list). Additionally, there are serverless offerings like Google's BigQuery that completely abstract the resourcing aspects of compute and storage from the user, allowing the user to focus solely on their core business scenarios. Serverless components tend to cost more compared to tunable IaaS but offer built-in optimizations and tuning that let you focus on your core scenarios.
- Networking resources: Think of your networking resource as a metaphorical cable that can send data back and forth. This is implemented on networking hardware or, in other cases, as software-defined networking.
- Storage resources: Cloud data lake storage systems provide an elastic distributed storage system that gives you space to store data (on disks, flash storage, or tape drives, depending on the tier of data you operate on), along with computational power to perform storage transactions and networking power to transfer data over the wire to other resources inside and outside your cloud provider.
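As an illustration of elasticity at the processing-engine level, the following is a minimal PySpark sketch that enables Spark's dynamic allocation so executors are added and removed with the workload. The minimum and maximum values are illustrative assumptions, and managed PaaS offerings typically layer their own cluster-level autoscaling on top of (or in place of) these settings.

```python
# Minimal sketch: let Spark grow and shrink its executor pool with demand.
# The min/max values here are illustrative, not recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("elastic-etl-sketch")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # Lets dynamic allocation work without an external shuffle service (Spark 3.x).
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```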
These are the key pieces that you control:

- The code you write
- The way your data is stored and organized

These two largely influence how effectively the cloud resources are utilized, and hence how scalable and performant your solution is.
In the next section, we'll take a deeper look at how big data processing works on a data lake architecture and the factors that affect the scalability and performance of your data lake solution.
Understanding and learning the factors that help scale the system is important for two key reasons:
- The traffic patterns on the data lake tend to be highly variable and bursty in most cases, so scaling up and down are key capabilities that need to be factored into your data lake architecture.
- Understanding the bottlenecks gives you insights into which resources need to be scaled up or down; otherwise, you run the risk of adding the wrong resources without moving the needle on the scalability of the solution.
Why Do I Need to Worry About Scalability? My Solution Runs Just Fine Today.
Speaking of 10× growth, I often see customers underestimate the value and opportunities that a data lake architecture helps unlock, and optimize their time and efforts for the short-term maxima. If you are thinking along the lines of statements such as "I just need to move my data warehouse now; I don't see any other data being important as yet," "My first priority is to get whatever we are doing on the cloud; let me worry about the rest later," or "I have a one-year timeline to move my data to the cloud; I can't afford to think about anything else other than my current workloads," let me assure you of a few things:
- As with any software development effort, thinking of future scenarios helps you avoid the technical debt of having to rearchitect your solutions completely when new scenarios are enabled.
- Future-proofing your design is not as hard as you think. It's more about diligence than effort, which will set you up for success.
- According to a study published by the World Economic Forum, digital transformation is expected to add $100 trillion to the world economy by 2025, and platform-driven interactions will drive at least two-thirds of this $100 trillion growth. It is only a matter of time before these scenarios are unlocked.
Internals of Data Lake Processing Systems
As seen in Figure 4-4, the key operations involved in big data processing are the following:
- Ingest: Getting raw data from various sources into the data lake storage.
- Prep: Preparing the raw data to generate enriched data by applying a schema on demand, removing or fixing erroneous data, and optimizing data formats.
- Curate: Generating curated data with high value density by aggregating, filtering, and otherwise processing the enriched data.
- Consume: Consuming your data through dashboards, queries, or data-science-related explorations, to name a few, and using these insights to modify your application behavior, which falls under consumption as well.
In this chapter, we'll focus on the most common use case of the big data lake, which is batch processing. In batch processing, data is ingested into the data lake in a scheduled, periodic fashion via data copies. This raw data is prepared and enriched and then further curated with ELT/ETL processing engines. Apache Spark and Apache Hadoop are the most common processing engines leveraged for the prep and curate phases. These Spark jobs run in a scheduled fashion after the ingestion or data copy is completed.
There are other use cases, such as real-time processing engines, where data is continuously ingested into the data lake and further prepared and curated. While the principles of scale that we will discuss in batch processing largely apply to real-time processing pipelines as well, there are additional constraints on the design given the continuous nature of the processing. We will not get into depth on the non-batch-processing systems in this book since the most common use cases are pertinent to batch processing. In this chapter, we also won't dive deep into the consumption use cases for BI queries and data science. There are plenty of resources available on querying patterns and data science.
We'll look at the two specific aspects of big data processing that are unique to the cloud data lake architecture:
- Data copy: This involves moving data as is from one system to another: for example, ingesting data into the data lake from external sources, or copying data from one storage system to another within the cloud, such as loading data from a data lake into a data warehouse. This is the most common form of ingestion used in big data processing systems.
- ETL/ELT processing: This involves an input and an output dataset, where input data is read and transformed via filtering, aggregation, and other operations to generate the output datasets. Generating data in the enriched and curated datasets falls into this category. Hadoop and Spark are the most popular processing engines, with Spark leading the area by supporting batch, real-time, and streaming operations with the same architecture. These are the most common kinds of operations in the data prep and data curation stages of big data processing.
Data Copy Internals
There are many ways to perform data copy operations. Cloud providers and ISVs offer PaaS solutions that copy data from one system to another in an optimized fashion. You can also leverage tools and software development kits (SDKs) to copy data, such as using the cloud provider's portal to upload or download data. In this section, we'll examine the internals of a copy job where data is read from a source and copied as is into the destination. Please note that in real life, you can also extend this simple data copy scenario to more advanced topics, such as copying a constantly changing dataset, as well as cleansing datasets periodically to comply with regulatory requirements like GDPR. For the purposes of understanding scalability, we will focus on the simple copy process.
Components of a data copy solution
The very simplified components involved in data copy are presented in Figure 4-5.
In simplified form, the data copy tool has two main components:

- Data copy orchestrator: Understands the work to be done (e.g., how many files need to be copied from the source) and leverages the compute resources available to distribute the copy job across different workers. The orchestrator also preserves the state of the copy job itself; that is, it knows what data has been copied and what hasn't yet been copied.
- Data copy workers: Units of compute that accept work from the orchestrator and perform the copy operations from source to destination.
A configuration setting specifies the number of data copy workers available to the copy job. This is a knob you can dial up or down to control the number of workers you need, either directly by setting the maximum number of workers or by specifying a proxy value that the data copy orchestrator has defined as a configurable setting.
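To make the orchestrator/worker split concrete, here is a minimal sketch, not a real cloud copy tool: it lists the source files, hands them to a configurable pool of workers, and tracks which copies have completed. Local directories stand in for the cloud source and destination, and the paths are placeholders.

```python
# Minimal orchestrator/worker sketch of a data copy job.
# Local directories stand in for the cloud source and destination.
import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

MAX_WORKERS = 8  # the knob: number of copy workers


def copy_one(src: Path, src_root: Path, dst_root: Path) -> Path:
    """Worker: copy a single file, preserving its relative path."""
    dst = dst_root / src.relative_to(src_root)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return src


def run_copy_job(src_root: Path, dst_root: Path) -> None:
    """Orchestrator: list the work, fan it out to workers, track completion."""
    pending = [p for p in src_root.rglob("*") if p.is_file()]
    completed = set()
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(copy_one, p, src_root, dst_root): p for p in pending}
        for fut in as_completed(futures):
            completed.add(fut.result())
    print(f"Copied {len(completed)} of {len(pending)} files")


if __name__ == "__main__":
    run_copy_job(Path("./source_data"), Path("./destination_data"))
```

Dialing MAX_WORKERS up or down is the same trade-off a managed copy service exposes: more workers means more simultaneous transfers, as long as compute and network capacity can keep up.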
Understanding resource utilization of a data copy job
The bottlenecks that affect the scalability and performance of your data copy are as follows:

- The number and size of the files/objects that need to be copied: The granularity of the data copy job is at the file/object level in your storage system. A large file can be chunked up for a parallel copy, but you cannot combine multiple files into a single copy work unit. If you have a lot of small files to copy, you can expect the job to take a long time: the listing operation at the orchestrator takes longer, and each copy work unit transfers only a small amount of data, leaving your available network bandwidth underutilized (see the worked example after this list).
- Compute capacity of your data copy tool: If you configure your data copy tool with enough compute resources, you can launch more workers, effectively running many simultaneous copy operations. Conversely, insufficient compute resources make the number of available workers a bottleneck in your system.
- Network capacity available for the copy: The amount of networking capacity you have controls the size of the pipe used for the data transfer, especially when copying data across the cloud boundary. Ensure that you have provisioned a high-bandwidth network. Please note that when you are copying or transacting between cloud services of the same provider, you don't need to provision or leverage your own network; the cloud services have their own network to accomplish this.
- Cross-region data copy: When you copy data across regions, it has to travel a longer distance over the network, which makes the data copy much slower and may even cause timeouts in some cases, causing jobs to fail.
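As a back-of-the-envelope illustration of the small-files effect, consider copying the same 1 TB either as a million 1 MB objects or as a thousand 1 GB objects. All numbers below are illustrative assumptions, and the per-object overhead is treated as serialized to keep the arithmetic simple.

```python
# Back-of-the-envelope: per-object overhead dominates when files are small.
# All numbers are illustrative assumptions.
BANDWIDTH_MB_PER_S = 1_000      # effective throughput across all copy workers
PER_OBJECT_OVERHEAD_S = 0.05    # listing, auth, and metadata cost per object


def copy_time_hours(total_mb: float, object_size_mb: float) -> float:
    objects = total_mb / object_size_mb
    transfer_s = total_mb / BANDWIDTH_MB_PER_S
    overhead_s = objects * PER_OBJECT_OVERHEAD_S
    return (transfer_s + overhead_s) / 3600


ONE_TB_MB = 1_000_000
print(f"1,000,000 x 1 MB objects: {copy_time_hours(ONE_TB_MB, 1):.1f} hours")
print(f"1,000 x 1 GB objects:     {copy_time_hours(ONE_TB_MB, 1_000):.2f} hours")
```

With these assumptions, the raw transfer time is identical in both cases; the difference comes entirely from paying the per-object overhead a million times instead of a thousand times.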
ELT/ETL Processing Internals
If you need a refresher on how big data analytics engines work, I recommend revisiting "Big Data Analytics Engines", specifically the section on Spark in "Apache Spark".
ELT/ETL processes primarily work on unstructured, structured, or semistructured data, apply a schema on demand, and transform this data via filtering, aggregations, and other processing to generate structured data in a tabular format. Apache Hadoop and Apache Spark are the most common processing engines that fall in this category. We will take a deeper look at the internals of Apache Spark in this section, but the concepts largely apply to Apache Hadoop with subtle nuances. Apache Hadoop, while it can run on the cloud, was designed to run on premises on HDFS; Apache Spark is much closer to the cloud architecture and is also largely the de facto processing engine, given its consistency across batch, streaming, and interactive data pipelines, so we will focus on Spark in this section.
You would run an Apache Spark job in a cluster that you can create in the following ways:
- Provision IaaS compute resources and install an Apache Spark distribution, either open source Apache Spark or a distribution from an ISV like Cloudera.
- Provision a PaaS cluster that comes with Spark already installed and ready for you to use, from vendors like Databricks or from cloud provider services like Amazon EMR or Azure Synapse Analytics.
Apache Spark leverages a distributed computing architecture, where there is a central controller/coordinator, also called the driver, that orchestrates the execution, and multiple executors, or worker units, that perform a specific task that contributes to the application. Drawing an analogy to home construction, you can think of the Apache Spark driver as the general contractor and the executors as skilled workers like plumbers and electricians. We will now go over a few key concepts that are fundamental to Apache Spark.
Components of an Apache Spark application
Data developers write Spark code and submit that code to a Spark cluster, then get results back when the execution is done. Behind the scenes, the user code is executed as a Spark application and is divided into the following components:
- Driver: The driver is the central coordinator of the Spark process and is the only component that understands the user code. The driver has two main responsibilities: defining the breakdown of the jobs, tasks, and executors needed to execute the program, and coordinating the assignment of these across the available parts of the Spark cluster. The cluster manager helps the driver find the resources.
- Executors: The executors are the components that actually perform the computation. The driver communicates to each executor the code to run as well as the portion of the dataset it needs to work on.
- Jobs, stages, and tasks: An Apache Spark application is internally translated into an execution plan inside your Spark cluster. This execution plan is represented as a directed acyclic graph (DAG) with a set of nodes representing jobs, which in turn consist of stages that could have dependencies on one another. These stages are then broken down into tasks, which are the actual units of execution where the work is done. This is shown in Figure 4-6. The number of executors assigned to the work depends on the amount of data that needs to be crunched (a minimal example follows this list).
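The following minimal PySpark sketch shows how these pieces map to the code you write: the script itself runs on the driver, the action at the end triggers a job, the wide groupBy aggregation introduces a stage boundary, and the per-partition work becomes the tasks that the executors run. The column names and data are made up for illustration.

```python
# Minimal sketch: one application, one action, and the resulting jobs/stages/tasks.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("driver-executor-sketch").getOrCreate()

# This code runs on the driver; the DataFrame operations only build a plan (DAG).
sales = spark.createDataFrame(
    [("north", 120.0), ("south", 75.5), ("north", 33.2)],
    ["region", "amount"],
)
totals = sales.groupBy("region").agg(F.sum("amount").alias("total"))

# The action below triggers a job. The shuffle required by groupBy splits the
# job into stages, and each partition of each stage becomes a task that an
# executor runs.
totals.show()
print("shuffle partitions (upper bound on post-shuffle tasks):",
      spark.conf.get("spark.sql.shuffle.partitions"))
```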
Understanding resource utilization of a Spark job
As you can see, two factors contribute to the resource utilization of a Spark job:

- The code: The complexity of the operations that need to be performed
- The data: The volume and organization of the data that needs to be crunched
These bottlenecks affect the scalability of your Spark job:

- Cluster form factor and memory: The amount of compute and memory that you provision for your Spark job heavily affects its performance. When you have a compute-intensive application with a lot of data transformations, increasing the number of compute cores provides more executors to work on the tasks, resulting in overall improvement of the job. Similarly, when you have a complex transformation, having more memory available means the intermediate datasets (RDDs) can be persisted in memory, minimizing retrieval times from slower persistent storage, such as the object store. Apache Spark vendors also offer caches that help store frequently used datasets in memory.
- The number and size of the files/objects that need to be operated on: The granularity of the job execution is at the file/object level in your storage system. Similar to the data copy job, if your Apache Spark job has to read a lot of small files, you can expect it to take a long time: the listing operation at the driver takes longer, and the fixed overhead of reading each file (access checks and other metadata operations) yields only a small return in terms of actual data read. Writes pull in the opposite direction: if you crank up the number of Spark executors, you can parallelize the writes across many files and complete the job sooner, but the many small files that result make the subsequent reads expensive. To minimize this downstream impact, Apache Spark offers compaction utilities that can be used to combine these small files into larger files after the job completes.
- Data organization: Processing in Apache Spark essentially involves a lot of filtering or selective data retrieval operations for reads. Organizing your data in a way that enables faster retrieval of the data in question can make a huge difference in your job performance. Columnar data formats like Parquet offer a huge benefit here; we will take a look at this in detail in the next section. In addition, effectively partitioning your data so that files/objects with similar content are organized together immensely helps optimize for quick access, requiring fewer compute resources and hence improving the scalability of your solution (see the sketch after this list).
- Network capacity and regional boundaries: As in the data copy scenario, the network capacity and cross-region boundaries heavily affect your performance and scalability.
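Here is a minimal sketch of the data-organization ideas above: writing Parquet partitioned on a column that downstream queries filter on, so reads can prune partitions, and caching a dataset that is reused across several computations. The paths and column names are illustrative placeholders.

```python
# Minimal sketch: partitioned Parquet writes, partition-pruned reads, and caching.
# Paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-organization-sketch").getOrCreate()

raw = spark.read.json("/datalake/raw/orders/")

# Organize enriched data by a column that downstream queries filter on,
# so the engine can skip (prune) irrelevant partitions on read.
(raw.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("/datalake/enriched/orders/"))

enriched = spark.read.parquet("/datalake/enriched/orders/")

# Only the matching partition directories are read for this filter.
todays = enriched.filter(enriched.order_date == "2023-06-01")

# Cache a dataset that several downstream aggregations will reuse,
# trading executor memory for repeated reads from object storage.
todays.cache()
print(todays.count())
```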
A Note on Other Interactive Queries
Apache Spark is an open source technology that largely addresses batch, interactive, and real-time interactions with the data lake. Cloud data warehouses offer other interactive query technologies that are optimized for certain data formats. These formats are proprietary, and both the compute and storage systems are optimized for the formats. We are not doing a deep dive into them in this book. However, they conceptually follow the model of Spark internals at a high level.
Considerations for Scalable Data Lake Solutions
Let me start with a big disclaimer: there is no 12-step process that you can follow to the T to make your data lake performant, reliable, or scalable. However, there are several factors that contribute to the scalability and performance of your solution that you can rely on to have a robust implementation of your data lake. Think of these factors as knobs that you can tweak to understand what exactly drives your optimal performance.
If you have historical data from previous years or from your analogous on-premises implementation, you could use it as a proxy for your peak scale characteristics. However, no worries if that is not available: you could run a scaled PoC (proof of concept), which is essentially a copy of your workload in a simulated environment that helps you understand the various factors affecting your performance and how they change as the load on your system increases. Take your most complex job or your largest dataset to run your PoC on the data lake, and double that complexity or data size or both to analyze the impact on your resources.
In this section, we will go over some of the key factors that affect the scale and performance of your system.
Pick the Right Cloud Offerings
As we saw in the earlier sections in this chapter, you have plenty of choices of cloud offerings when it comes to your big data solutions. You can decide to compose your big data solution with IaaS, PaaS, or SaaS, either all on one cloud provider, across cloud providers (multicloud solution), or with a mix of on-premises and cloud environments (hybrid cloud solution), and in one region or multiple regions. Let's take a look at the impact of some of these choices on the overall performance and scalability of your solution.
Hybrid and multicloud solutions
Most organizations today leverage a multicloud approach, where they have invested in architectures that span two or more cloud providers. Most organizations also have hybrid cloud architectures, where they have investments across private clouds and in on-premises systems as well as with public cloud providers.
Many motivations drive a multicloud or hybrid cloud architecture:
- Migrating an on-premises platform to the cloud in phases
- Leveraging the cloud for newer scenarios and bringing the insights back to the legacy platform on premises for backward-compatibility reasons
- Minimizing vendor lock-in on a single cloud provider, the equivalent of not putting all of your eggs in one basket
- Mergers and acquisitions where different parts of the organization have their infrastructure on different clouds
- Specific requirements, such as data privacy or compliance, that require part of the data assets to stay on premises and not on the cloud
- A data mesh architecture, where teams or business units within the organization have the flexibility of picking their cloud providers
There are also advantages to a multicloud architecture:
- Flexibility of choice for the business units
- Lower cost: some services could be cheaper on one cloud provider compared to others
However, when it comes to performance and scale, as well as the cost of the solution, there are some traps you could fall into when you have a multicloud or hybrid cloud architecture:
- The operational cost of managing multiple clouds could be an overhead or hidden cost that you might have overlooked. You may need multicloud management software, which could add to your costs.
- Moving data out of a cloud is not optimal in terms of performance and is also expensive. If you have a scenario where you are moving data back and forth across different cloud solutions, that movement will degrade the overall performance, and hence the scalability, of the solution.
- While the fundamental concepts are similar across the services offered by different cloud providers, deep skill sets in the nuances and best practices of each implementation are also required. Lacking these skill sets could leave you without an optimal solution across all your environments.
- When you have scenarios where you need low-latency, secure, direct connections to the cloud, you need to provision specialized offerings like Azure ExpressRoute or AWS Direct Connect. With multiple clouds, you would have to provision multiple such connections to move data from your on-premises systems, thereby increasing your costs as well as your data transfers.
If you need to have a hybrid or multicloud solution, pay attention to the data transfers, and ideally ensure that the data transfers across these multiple environments are minimal and carefully thought through.
IaaS versus PaaS versus SaaS solutions
Your big data solution can be composed of a mix of IaaS, PaaS, and SaaS solutions. Here are the differences between the choices for one scenario of running a Spark notebook for your data scientists:
- IaaS solution: In this solution, you would first provision virtual machine (VM) resources from the cloud provider, then install the software distribution (either open source from the Apache Software Foundation or from an ISV like Cloudera) and enable notebook access for your data scientists. You have end-to-end control of the solution; however, you also need the relevant skill sets to optimize for performance and scale. Enterprises typically follow this approach when they have engineers who can tune the open source software for their needs by building on it, or when they maintain their own custom versions of open source tools such as Apache Hadoop or Apache Spark.
- PaaS solution: In this solution, you would provision clusters from a cloud provider that offers you the right software environment (along with managing updates). You are able to specify the resources you need in terms of CPU, memory, and storage without having to understand the deep mechanics of how big data processing engines work. Most organizations follow this approach.
- SaaS solution: In this solution, you would subscribe to SaaS offerings, such as a data warehouse and a notebook service, connect them, and start using them right away. While this works great as a getting-started solution, its scalability and performance are capped by the scalability of the SaaS solution itself. SaaS solutions are getting better and better, so you need to understand how much you need to scale, verify that with the SaaS provider, and confirm with a PoC as well.
Use the summary in Table 4-2 to help make the best choice.
| Type of service | Getting started | Flexibility to customize solution | Control over resources |
|---|---|---|---|
| IaaS | High effort: need to manage software, updates, etc. | High flexibility: you own the software stack | High control: you have control over the infrastructure-level details of the service |
| PaaS | Medium effort: lower than IaaS | Medium flexibility: as much as the PaaS service provider exposes the controls | Medium control: lower than IaaS solutions; higher than SaaS solutions |
| SaaS | Low effort: you can get started with your business problem almost right away | Low flexibility: ease of use comes from limited flexibility; leverage extensibility models to build on top of the SaaS | Low control: SaaS solutions are usually multitenant (resources shared by different customers), and the resource-level details are not exposed to the customer |
Cloud offerings for Klodars Corporation
Klodars Corporation planned to implement its solution as a hybrid solution, with its legacy component running on premises and its data analytics components running on the cloud. It picked PaaS for its big data cluster and data lake storage, running Apache Spark, and leveraged a SaaS solution for its data warehouse and dashboarding component. Klodars understood the impact of the networking resources between its on-premises solution and its cloud provider on the performance of its solution and planned for the right capacity with the cloud provider. It also ran a PoC of the data processing workload for one of its data- and compute-intensive scenarios, product recommendations and sales projections, and ensured that it had picked a big data cluster with the right set of resources.
Klodars also segmented its clusters (one for sales scenarios, one for marketing scenarios, and one for product scenarios) to ensure that a peak workload on one did not affect the performance of the others. To promote sharing of data and insights, the data scientists had access to all of the data from product, sales, and marketing for their exploratory analysis. However, Klodars also provisioned separate clusters for the data scientists and set a limit on their resources so that there were guardrails against spurious jobs that could hog resources. You can find an overview of this implementation in Figure 4-7.
Plan for Peak Capacity
Regardless of the type of solution you choose, planning for capacity and understanding the path to acquiring more capacity when you have additional demand are key to the cloud data lake solution. Capacity planning refers to the ability to predict your demand over time, ensure that you are equipped with the right resources to meet that demand, and make the right business decisions if that is not the case.
The very first step is to forecast the demand. This can be accomplished in the following ways:
- Understand your business need and the SLAs you need to offer to your customers. For example, if the last batch of your data arrives at 10 P.M. and your promise to your customers is that they will see a refreshed dashboard by 8 A.M., then you have a window of 10 hours to do your processing, maybe 8 hours if you would like to leave a buffer.
- Understand your resource utilization on the cloud. Most cloud providers offer monitoring and observability solutions that you can leverage to understand how many resources you are utilizing. For example, if you have a cluster with 4 virtual CPUs (vCPUs) and 1 GB of memory, and you observe that your workload utilizes 80% of the CPU and 20% of the memory overall, then you know you could move to a different SKU or cluster type that has higher CPU and lower memory, or you could take advantage of the spare memory to cache some results so you can reduce the load on the CPU.
- Plan for peak demand and peak utilization. The big advantage of moving to the cloud is the elasticity it offers. At the same time, it is always better to have a plan for exactly how you will scale your resources at peak demand. For example, say your workloads today are supported by a cluster that has 4 vCPUs and 1 GB of memory. When you anticipate a sudden increase in load, either because it's the budget-closing season if you are in financial services or because you are preparing for holiday demand if you are in the retail industry, what would your plan be? Would you increase the resources on your existing clusters, or are your jobs segmented enough that you would add additional clusters on demand? If you need to bring in more data during this time from your on-premises systems, have you planned for the corresponding burst in networking capacity as well? A back-of-the-envelope sketch of this kind of estimation follows this list.
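Here is that sketch in Python, with purely illustrative numbers: the processing window, data volume, and per-worker throughput are assumptions, not benchmarks.

```python
# Back-of-the-envelope capacity estimate; every number is an illustrative assumption.
WINDOW_HOURS = 8                 # 10 P.M. data arrival to 8 A.M. dashboard, minus buffer
DAILY_VOLUME_GB = 2_000          # expected peak-season daily volume
GB_PER_WORKER_PER_HOUR = 40      # throughput observed in a scaled PoC

required_worker_hours = DAILY_VOLUME_GB / GB_PER_WORKER_PER_HOUR
workers_needed = -(-required_worker_hours // WINDOW_HOURS)  # ceiling division

print(f"Need about {int(workers_needed)} workers running in parallel "
      f"to finish {DAILY_VOLUME_GB} GB within {WINDOW_HOURS} hours")
```

Whether you then add those workers by growing existing clusters or by adding clusters is the horizontal-versus-vertical scaling choice described next.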
You can utilize one of two scaling strategies: horizontal scaling or vertical scaling. Horizontal scaling, also referred to as scaling out, involves scoping a unit of scale (e.g., VM or cluster) and adding more of those units of scale. Your application would need to be aware of this scale unit as well. Vertical scaling, also referred to as scaling up, involves keeping the unit of scale as is and adding more resources (e.g., CPUs or memory) on demand. Either strategy works well as long as you are aware of the impact on your business need, your SLAs, and your technical implementation.
Table 4-3 shows the set of factors to consider when monitoring and evaluating for capacity planning. Also look at the reservation models that are available, which enable critical production workloads to run without interfering with other workloads.
| Component | Factors to consider |
|---|---|
| IaaS compute | vCPUs (cores), memory, SKU (type of VM), caching available, disk size (GB/TB), and transactions (TPS) |
| PaaS compute | Cluster size, vCPUs (cores) when available, and billable units published by the PaaS provider |
| Storage | Data size (TB/PB), transactions (TPS), and tier of storage (flash, disks, tape drives, in order of high to low performance) |
| Data warehouse | SKUs: watch for compute, storage, and transactions |
| Networking | Ingress and egress, tier (standard/premium), and gateways with private networks |
To make a best guess of the capacity needs, I recommend that you run a scaled PoC that represents the workload you are running. For the PoC, you need to factor in the dataset distribution that best represents your production workload. If you are running an on-premises implementation, a great method is to run an existing workload on the cloud. If you leverage autoscaling capabilities on your cluster or serverless offerings from your cloud provider, a lot of these are automatically handled for you.
Data Formats and Job Profile
The data format you choose plays a critical role in the performance and scalability of your data lake. This is often ignored because in structured data storage systems (databases or data warehouses), this was taken for granted: the data was stored in a format that was optimal for the transaction patterns of the database/data warehouse service. However, given the myriad data processing applications that run on top of the data lake storage, and the premise that the same data could be used across multiple engines, the onus falls on the big data architects and developers to pick the right format for their scenarios. The best part is that once you find the optimal format, your solution offers high performance and scale, along with a lower total cost. Given the rise of the data lakehouse and the ubiquity of technologies like Apache Spark for batch, streaming, and interactive computing, Apache Parquet, and formats based on Parquet such as Delta Lake and Apache Iceberg, are being adopted widely as optimal formats for the data lake solution. We will dive deep into this in Chapter 6.
Another aspect that influences your scalability needs is the construction of your big data processing job. Typically, the job can be broken down into various stages of ingestion and processing. However, not all stages are the same. As an example, let's say that your job comprises the following steps:

1. Read a dataset that has 150 million records.
2. Filter that down to the records of interest, which are 50 million records.
3. Apply transformations on the 50 million records to generate an output dataset.
In this scenario, if your input dataset comprises many small files, reading them becomes the long pole for your total job. By introducing a new stage in your job that compacts the many small files into a smaller number of larger files, you can make your solution more scalable.
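A minimal Spark sketch of adding such a compaction step follows; the paths, the filter condition, and the target file count are illustrative placeholders, not values from the scenario above.

```python
# Minimal sketch: read, filter, then compact the output into fewer, larger files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction-sketch").getOrCreate()

records = spark.read.parquet("/datalake/raw/events/")            # ~150 million records
of_interest = records.filter(records.event_type == "purchase")   # ~50 million records

# Without this step, the write could produce as many small files as there are
# tasks; repartitioning caps the output at a manageable number of larger files.
(of_interest
    .repartition(50)
    .write
    .mode("overwrite")
    .parquet("/datalake/enriched/purchases/"))
```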
Summary
In this chapter, we dove deep into the scalability characteristics of the cloud data lake architecture and how closely scalability is tied to performance. We compared a big data architecture with a disaggregated compute and storage model to a colocated, tightly coupled architecture and looked at the ramifications of this disaggregation on scale. We also examined the various considerations for the cloud data lake architecture that affect scale: picking the right cloud offerings, planning for capacity, and tuning the data formats and data organization to match the query patterns. These are considerations that you need to understand to effectively tune your data lake implementation to scale 10×. In Chapter 5, we'll build on these fundamental concepts of scale to optimize for performance.