Chapter 4. Cloud Infrastructure

In this chapter, Rich Morrow outlines the differentiators between the major cloud service providers—Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP)—as a guide for choosing where to launch a managed or unmanaged Hadoop cluster. Then Michael Li and Ariel M’ndange-Pfupfu compare AWS and GCP in terms of cost, performance, and runtime of a typical Spark workload. Finally, Arti Garg and Parviz Deyhim explore how tools like AWS Auto Scaling enable customers to automatically provision resources to meet real-time demand (i.e., scale up or scale down), leading to significant cost savings.

Where Should You Manage a Cloud-Based Hadoop Cluster?

By Rich Morrow

You can read this post on oreilly.com here.

It’s no secret that Hadoop and public cloud play very nicely with each other. Rather than having to provision and maintain a set number of servers and expensive networking equipment in house, Hadoop clusters can be spun up in the cloud as a managed service, letting users pay only for what they use, only when they use it.

The scalability and per-workload customizability of public cloud is also unmatched. Rather than having one predefined set of servers (with a set amount of RAM, CPU, and network capability) in-house, public cloud offers the ability to stand up workload-specific clusters with varying amounts of those resources tailored for each workload. The access to “infinite” amounts of hardware that public cloud offers is also a natural fit for Hadoop, as running 100 nodes for 10 hours is the same cost and complexity level as running 1,000 nodes for one hour.

But among cloud providers, the similarities largely end there. Although Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) each has its own offerings for both managed and VM-based clusters, there are many differentiators that may drive you to one provider over another.

High-Level Differentiators

When comparing the “Big 3” providers in the context of Hadoop operations, several important factors come into play. The high-level ones being:

Network isolation: this refers to the ability to create “private” networks and control routing, IP address spaces, subnetting, and additional security. In this area, AWS, Azure, and GCP each provides roughly equal offerings in the way of VPC, Azure Virtual Networks, and Google Subnetworks, respectively.
Type and number of underlying VMs: for workload customizability, the more VM types, the better. Although all providers have “general,” “high CPU,” and “high RAM” instance types, AWS takes the ball further with “high storage,” GPU, and “high IO” instance types. AWS also has the largest raw number of instance types (currently 55), while both GCP and Azure offer only 18 each.
Cost granularity: for short-term workloads (those completing in just a few hours), costs can vary greatly, with Azure offering the most granular model (minute-by-minute granularity), GCP offering the next best model (pre-billing the first 10 minutes of usage, then billing for each minute), and AWS offering the least flexibility (each full hour billed ahead).
Cost flexibility: how you pay for your compute nodes makes an even bigger difference with regard to cost. AWS wins here with multiple models like Spot Instances & Reservations, which can save up to 90% of the cost of the “on-demand” models, which all three support. Azure and GCP both offer cost-saving mechanisms, with Azure using reservations (but only up to 12 months) and GCP using “sustained-use discounts,” which are automatically applied for heavily utilized instances. AWS’s reservations can go up to three years, and therefore offer deeper discounts.
Hadoop support: each provider offers a managed, hosted version of Hadoop. AWS’s is called Elastic MapReduce or EMR, Azure’s is called HDInsight, and GCP’s is called DataProc. EMR and DataProc both use core Apache Hadoop (EMR also supports MapR distributions), while Azure uses Hortonworks. Outside of the managed product, each provider also offers the ability to use raw instance capacity to build Hadoop clusters, removing the convenience of the managed service but allowing for much more customizability, including the ability to choose alternate distributions like Cloudera.

Cloud Ecosystem Integration

In addition to the high-level differentiators, one of the public cloud’s biggest impacts for Hadoop operations is the integration to other cloud-based services like object stores, archival systems, and the like. Each provider is roughly equivalent with regard to integration and support of:

Object storage and data archival: each provider has near parity here for both cost and functionality, with their respective object stores (S3 for AWS, Blob Storage for Azure, and Google Cloud Storage for GCP) being capable of acting as a data sink or source.
NoSQL integrations: each provider has different, but comparable, managed NoSQL offerings (DynamoDB for AWS, DocumentDB and Managed MongoDB for Azure, and BigTable and BigQuery for GCP), which again can act as data sinks or sources for Hadoop.
Dedicated point-to-point fiber interconnects: each provider offers comparable capability to stretch dedicated, secured fiber connections between on-premise data centers and their respective clouds. AWS’s is DirectConnect, Azure’s is ExpressRoute, and GCP’s is Google Cloud Interconnect.
High-speed networking: AWS and Azure each offer the ability to launch clusters in physically grouped hardware (ideally all machines in the same rack if possible), allowing the often bandwidth-hungry Hadoop clusters to take advantage of 10 Gbps network interconnects. AWS offers Placement Groups, and Azure offers Affinity Groups. DataProc offers no such capability, but GCP’s cloud network is already well known as the most performant of the three.

Big Data Is More Than Just Hadoop

Although the immediate Hadoop-related ecosystems discussed above have few differentiators, the access to the provider’s other services and features can give Hadoop administrators many other tools to either perform analytics elsewhere (off the physical cluster) or make Hadoop operations easier to perform.

AWS really shines here with a richer service offering than any of the three. Some big services that come into play for larger systems are Kinesis (which provides near-real-time analytics and stream ingestion), Lambda (for event-driven analytics architectures), Import/Export Snowball (for secure, large-scale data import/export), and AWS IoT (for ingestion and processing of IoT device data)—all services either completely absent at Azure or GCP, or much less mature and not as rich in features.

Key Takeaways

While one could argue any number of additions or edits to the comparisons just described, it represents a good checklist to use when comparing where to launch a managed or unmanaged cloud-based Hadoop cluster. One of the great things about using Hadoop in the cloud is that it’s nearly the exact same regardless of distribution or cloud provider. Each of the big three has a mature offering with regards to Hadoop, so whichever partner you choose, you can bet that your cluster will work well, provide cost-saving options, strong security features, and all the flexibility that public cloud provides.

This post was a collaboration between O’Reilly and Pepperdata. See our statement of editorial independence.

Spark Comparison: AWS Versus GCP

By Michael Li and Ariel M’ndange-Pfupfu

You can read this post on oreilly.com here.

There’s little doubt that cloud computing will play an important role in data science for the foreseeable future. The flexible, scalable, on-demand computing power available is an important resource, and as a result, there’s a lot of competition between the providers of this service. Two of the biggest players in the space are Amazon Web Services (AWS) and Google Cloud Platform (GCP).

This article includes a short comparison of distributed Spark workloads in AWS and GCP—both in terms of setup time and operating cost. We ran this experiment with our students at The Data Incubator, a big data training organization that helps companies hire top-notch data scientists and train their employees on the latest data science skills. Even with the efficiencies built into Spark, the cost and time of distributed workloads can be substantial, and we are always looking for the most efficient technologies so our students are learning the best and fastest tools.

Submitting Spark Jobs to the Cloud

Spark is a popular distributed computation engine that incorporates MapReduce-like aggregations into a more flexible, abstract framework. There are APIs for Python and Java, but writing applications in Spark’s native Scala is preferable. That makes job submission simple, as you can package your application and all its dependencies into one JAR file.

It’s common to use Spark in conjunction with HDFS for distributed data storage, and YARN for cluster management; this makes Spark a perfect fit for AWS’s Elastic MapReduce (EMR) clusters and GCP’s Dataproc clusters. Both EMR and Dataproc clusters have HDFS and YARN preconfigured, with no extra work required.

Configuring Cloud Services

Managing data, clusters, and jobs from the command line is more scalable than using the web interface. For AWS, this means installing and using the command-line interface (CLI). You’ll have to set up your credentials beforehand as well as make a separate keypair for the EC2 instances that are used under the hood. You’ll also need to set up roles—basically permissions—for both users (making sure they have sufficient rights) and EMR itself (usually, running aws emr create-default-roles in the CLI is good enough to get started).

For GCP, the process is more straightforward. If you install the Google Cloud SDK and sign in with your Google account, you should be able to do most things right off the bat. The thing to remember here is to enable the relevant APIs in the API Manager: Compute Engine, Dataproc, and Cloud Storage JSON.

Once you have things set up to your liking, the fun part begins! Using commands like aws s3 cp or gsutil cp, you can copy your data into the cloud. Once you have buckets set up for your inputs, outputs, and anything else you might need, running your app is as easy as starting up a cluster and submitting the JAR file. Make sure you know where the logs are kept—it can be tricky to track down problems or bugs in a cloud environment.

You Get What You Pay For

When it comes to cost, Google’s service is more affordable in several ways. First, the raw cost of purchasing computing power is cheaper. Running a Google Compute Engine machine with four vCPUs and 15 GB of RAM will run you $0.20 every hour, or $0.24 with Dataproc. An identically-specced AWS instance will cost you $0.336 per hour running EMR.

The second factor to consider is the granularity of the billing. AWS charges by the hour, so you pay the full rate even if your job takes 15 minutes. GCP charges by the minute, with a 10-minute minimum charge. This ends up being a huge difference in cost in a lot of use cases.

Both services have various other discounts. You can effectively bid on spare cloud capacity with AWS’s spot instances or GCP’s preemptible instances. These will be cheaper than dedicated, on-demand instances, but they’re not guaranteed to be available. Discounted rates are available on GCP if your instances live for long periods of time (25% to 100% of the month). On AWS, paying some of the costs upfront or buying in bulk can save you some money. The bottom line is, if you’re a power user and you use cloud computing on a regular or even constant basis, you’ll need to delve deeper and perform your own calculations.

Lastly, the costs for new users wanting to try out these services are lower for GCP. They offer a 60-day free trial with $300 in credit to use however you want. AWS only offers a free tier where certain services are free to a certain point or discounted, so you will end up paying to run Spark jobs. This means that if you want to test out Spark for the first time, you’ll have more freedom to do what you want on GCP without worrying about price.

Performance Comparison

We set up a trial to compare the performance and cost of a typical Spark workload. The trial used clusters with one master and five core instances of AWS’s m3.xlarge and GCP’s n1-standard-4. They differ slightly in specification, but the number of virtual cores and amount of memory is the same. In fact, they behaved almost identically when it came to job execution time.

The job itself involved parsing, filtering, joining, and aggregating data from the publicly available Stack Exchange Data Dump. We ran the same JAR on a ~50M subset of the data (Cross Validated) and then on the full ~9.5G data set (Figures 4-1 and 4-2).

The short job clearly benefited from GCP’s by-the-minute billing, being charged only for 10 minutes of cluster time, whereas AWS charged for a full hour. But even the longer job was cheaper on GPS both because of fractional-hour billing and a lower per-unit time cost for comparable performance. It’s also worth noting that storage costs weren’t included in this comparison.

Conclusion

AWS was the first mover in the space, and this shows in the API. Its ecosystem is vast, but its permissions model is a little dated, and its configuration is a little arcane. By contrast, Google is the shiny new entrant in this space and has polished off some of the rough edges. It is missing some features on our wishlist, like an easy way to auto-terminate clusters and detailed billing information broken down by job. Also, for managing tasks programmatically in Python, the API client library isn’t as full-featured as AWS’s Boto.

If you’re new to cloud computing, GCP is easier to get up and running, and the credits make it a tempting platform. Even if you are already used to AWS, you may still find the cost savings make switching worth it, although the switching costs may not make moving to GCP worth it.

Ultimately, it’s difficult to make sweeping statements about these services because they’re not just one entity; they’re entire ecosystems of integrated parts, and both have pros and cons. The real winners are the users. As an example, at The Data Incubator, our PhD data science fellows really appreciate the cost reduction as they learn about distributed workloads. And while our big data corporate training clients may be less price-sensitive, they appreciate being able to crunch enterprise data faster while holding price constant. Data scientists can now enjoy the multitude of options available and the benefits of having a competitive cloud computing market.

Time-Series Analysis on Cloud Infrastructure Metrics

By Arti Garg and Parviz Deyhim

You can read this post on oreilly.com here.

Many businesses are choosing to migrate to, or natively build their infrastructure in the cloud; doing so helps them realize a myriad of benefits. Among these benefits is the ability to lower costs by “right-sizing” infrastructure to adequately meet demand without under- or over-provisioning. For businesses with time-varying resource needs, the ability to “spin-up” and “spin-down” resources based on real-time demand can lead to significant cost savings.

Major cloud-hosting providers like Amazon Web Services (AWS) offer management tools to enable customers to scale their infrastructure to current demand. However, fully embracing such capabilities, such as AWS Auto Scaling, typically requires:

Optimized Auto Scaling configuration that can match customers’ application resource demands
Potential cost saving and business ROI

Attempting to understand potential savings from the use of dynamic infrastructure sizing is not a trivial task. AWS’s Auto Scaling capability offers a myriad of options, including resource scheduling and usage-based changes in infrastructure. Businesses must undertake detailed analyses of their applications to understand how best to utilize Auto Scaling, and further analysis to estimate cost savings.

In this article, we will discuss the approach we use at Datapipe to help customers customize Auto Scaling, including the analyses we’ve done, and to estimate potential savings. In addition to that, we aim to demonstrate the benefits of applying data science skills to the infrastructure operational metrics. We believe what we’re demonstrating here can also be applied to other operational metrics, and hope our readers can apply the same approach to their own infrastructure data.

Infrastructure Usage Data

We approach Auto Scaling configuration optimization by considering a recent client project, where we helped our client realize potential cost savings by finding the most optimized configuration. When we initially engaged with the client, their existing web-application infrastructure consisted of a static and fixed number of AWS instances running at all times. However, after analyzing their historical resource usage patterns, we observed that the application had time-varying CPU usage where, at times, the AWS instances were barely utilized. In this article, we will analyze simulated data that closely matches the customers’ usage patterns, but preserves their privacy.

In Figure 4-3, we show two weeks’ worth of usage data, similar to that available from Amazon’s CloudWatch reporting/monitoring service, which allows you to collect infrastructure-related metrics.

usage collected from the Amazon CloudWatch reporting/monitoring service

A quick visual inspection reveals two key findings:

Demand for the application is significantly higher during late evenings and nights. During other parts of the day, it remains constant.
There is a substantial increase in demand over the weekend.

A bit more analysis will allow us to better understand these findings. Let’s look at the weekend usage (Saturday–Sunday) and the weekday usage (Monday–Friday), independently. To get a better sense of the uniformity of the daily cycle within each of these two groups, we can aggregate the data to compare the pattern on each day. To do so, we binned the data into regular five-minute intervals throughout the 24-hour day (e.g., 0:00, 0:05, etc.) and determined the minimum, maximum, and average for each of these intervals.

Note that for this example, since the peak period extends slightly past midnight, we defined a “day” as spanning from noon to noon across calendar dates. The difference between the weekday group (red) and the weekend group (blue) is seen quite starkly in the plot below. The dotted lines show the minimum and maximum usage envelopes around the average, which is shown with the solid line in Figure 4-4.

difference between the weekday group (red) and the weekend group (blue)

In this example, it is also visually apparent that the minimum and maximum envelopes hew very closely to the average usage cycles for both the weekend and weekday groups—indicating that, over this two-week period, the daily cycles are very consistent. If the envelope were wider, we could examine additional metrics, such as standard deviation, 1st and 3rd quartiles, or other percentiles (e.g., 10th and 90th or 1st and 99th), to get a sense for how consistent the usage cycle is from day to day.

Although not evident in this example, another frequent consideration when examining infrastructure usage is assessing whether there is an overall increase or decrease in usage over time. For a web-based software application, such changes could indicate growth or contraction of its user base, or reveal issues with the software implementation, such as memory leaks. The lack of such trends in this data is apparent upon visual inspection, but there are some simple quantitative techniques we can use to verify this theory.

One approach is to find the average usage for each day in the data set and determine whether there is a trend for these values within either the weekday or the weekend groupings. These daily averages are plotted in green in Figure 4-3. In this example, it is obvious to the eye that there is no trend; however, this can also be verified by fitting a line to the values in each set. We find that for both groupings, the slope is consistent with zero, indicating no change in the average daily usage over this two-week period. However, because of the cyclical nature of the usage pattern, we may be concerned that the long periods of low, constant usage might overwhelm any trends during the peak periods. To test this, we can calculate the average daily usage only during the peak periods, shown in Figure 4-3, in red. Once again, we find the slopes for each of the groupings to be consistent with zero, suggesting no obvious trend over this two-week period.

In a real-world scenario, we would urge some caution in interpreting these results. Two weeks represents a relatively short period over which to observe trends, particularly those associated with growth or contraction of a user base. Growth on the order of months or annual usage cycles, such as those that may be associated with e-commerce applications, may not be detectable in this short of a time span. To fully assess whether a business’s CPU usage demonstrates long-term trends, we recommend continuing to collect a longer usage history. For this data, however, our analyses indicate that the relevant patterns are (1) a distinct peak in usage during the late-night period and (2) differing usage patterns on weekends versus weekdays.

Scheduled Auto Scaling

Based on the two findings about this business’ usage, we can immediately determine that there may be cost-savings achieved by scheduling resources to coincide with demand. Let’s assume that the business wants to have sufficient resources available so that at any given time, its usage does not exceed 60% of available capacity (i.e., CPU). Let’s further assume that this customer does not want fewer than two instances available at any time to provide high availability when there are unforeseen instance failures.

Over this two-week period, this business’s maximum CPU usage tops out at 24 cores. If the business does not use any of Auto Scaling’s scheduling capabilities, it would have to run 20 t2.medium instances, each having two CPU/instance, on AWS at all times to ensure it will not exceed its 60% threshold. Priced as an hourly on-demand resource in Northern California, this would lead to a weekly cost of about $230. With the use of Auto Scaling, however, we can potentially reduce the cost signficantly.

First, let’s consider our finding that usage experienced a high peak at nighttime. Because of the reliably cyclical nature of the usage pattern, we can create a schedule wherein the business can toggle between a “high” and “low” usage setting of 20 and 6 instances, respectively, where the “low” setting is determined by the number of CPUs necessary to not exceed the 60% threshold during the constant daytime periods. By determining the typical start and end times for the peak, we created a single daily schedule that indicates whether to use either the “high” or the “low” setting for each hour of the day. We found that by implementing such a schedule, the business could achieve a weekly cost of around $150—a savings of more than a third. A schedule with even more settings could potentially achieve even further savings.

In the previous example, we use the same schedule for each day of the week. As we noted however, this business has significantly different usage patterns on weekdays than on weekends. By creating two different schedules (weekend versus weekday) the business can realize even further savings by utilizing fewer resources during the slower weekday periods. For this particular usage pattern, the “low” setting would be the same for both groupings, while the “high” setting for the weekday grouping is 10 instances—half that of the weekend grouping.

Figure 4-5 illustrates how this setting would be implemented. The red line shows the binary schedule, including the difference between the weekday and weekend schedules. The blue line shows a more granular, multilevel schedule. The black line shows the actual usage.

two different schedules (weekend versus weekday)

It may be tempting to create even more detailed schedules, perhaps one for each day of the week, but we emphasize caution before proceeding. As discussed above, these analyses are based on only two weeks of usage data, and we lack sufficient information to assess whether these patterns are unique to this particular time of year. However, if we can determine that the observed pattern is consistent with what might be expected from the company’s business model, we can feel more confident in basing resource decisions upon it. The table below summarizes weekly cost estimates for a variety of schedules and instance types. It also includes pricing using dynamic Auto Scaling, which we’ll explore next.

Dynamic Auto Scaling

As we can see, using AWS’s Auto Scaling feature to schedule resources can lead to significant savings. At the same time, by using a multilevel schedule that hews closely to the observed usage pattern, a business also runs the risk that out-of-normal traffic can exceed the scheduled capacity. To avoid this, AWS offers a dynamic Auto Scaling capability that automatically adds or subtracts resources based upon predefined rules. For this usage pattern, we will consider a single scaling rule, though we note that AWS allows for multiple rules.

Let’s consider a rule where at any given time, if the usage exceeds 70% of available capacity, AWS should add 10% of the existing capacity. As usage falls off, AWS should subtract 20% of existing capacity when current usage falls below 55%. When setting this scaling rule, we must also account for the finite amount of time needed for a new instance to “warm up” before becoming operational.

For this scenario, we use AWS’s default setting of five minutes. Using this rule, we can step through our historical usage data to determine how many instances would be in use, launched, or deleted at any given time. Based on that output, we find that the average weekly cost for the period would be about $82, similar to the multilevel weekend + weekday schedule. This is not surprising when looking at historical data; our multilevel approach, which is optimized to the actual usage pattern, should produce similar results as dynamic, rules-based scaling.

This can be seen in Figure 4-6, which shows the number of CPUs that would be made available from a multilevel schedule (blue line) and from dynamic Auto Scaling (green line). Notably, the highest resource level launched by dynamic Auto Scaling is lower than what is made available by the multilevel schedule, but the cost impact is not significant since the peak lasts for only a short duration. The main advantage of dynamic Auto Scaling compared to the multilevel is that resources will still be added as needed even if the usage patterns deviate from historical behavior. For this usage pattern, this single rule is sufficient to provide substantial savings, though the optimal dynamic Auto Scaling setting will vary by each application’s web traffic. For more complex usage patterns, we could consider and analyze a more complex sets of rules.

Assess Cost Savings First

Third-party-hosted, cloud-based infrastructure can offer businesses unprecedented advantages over private, on-site infrastructure. Cloud-based infrastructure can be deployed very quickly—saving months of procurement and set-up time. The use of dynamic resource scheduling, such as the capability enabled by AWS’s Auto Scaling tool, can also help significantly reduce costs by right-sizing infrastructure. As we have seen, however, determining the optimal settings for realizing the most savings requires detailed analyses of historical infrastructure usage patterns. Since re-engineering is often required to make applications work with changing numbers of resources, it is important that businesses assess potential cost savings prior to implementing Auto Scaling.

Note: this example was put together by Datapipe’s Data and Analytics Team.

Get Big Data Now: 2016 Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Big Data Now: 2016 Edition by O'Reilly Media, Inc.

Chapter 4. Cloud Infrastructure

Where Should You Manage a Cloud-Based Hadoop Cluster?

High-Level Differentiators

Cloud Ecosystem Integration

Big Data Is More Than Just Hadoop

Key Takeaways

Spark Comparison: AWS Versus GCP

Submitting Spark Jobs to the Cloud

Configuring Cloud Services

You Get What You Pay For

Performance Comparison

Figure 4-1. Job performance comparison. Credit: Michael Li and Ariel M’ndange-Pfupfu.

Figure 4-2. Job cost comparison. Credit: Michael Li and Ariel M’ndange-Pfupfu.

Conclusion

Time-Series Analysis on Cloud Infrastructure Metrics

Infrastructure Usage Data

Figure 4-3. Usage collected from the Amazon CloudWatch reporting/monitoring service. Credit: Arti Garg.

Figure 4-4. Difference between the weekday group (red) and the weekend group (blue). Credit: Arti Garg.

Scheduled Auto Scaling

Figure 4-5. Comparison of binary and multilevel schedules, and actual usage. Credit: Arti Garg.

Dynamic Auto Scaling

Figure 4-6. Comparison of dynamic and multilevel autoscaling. Credit: Arti Garg.

Assess Cost Savings First

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly