I’m assuming you’ve made a few passes through Chapter 3 and have just deployed a super-awesome, totally amazing monitoring, trending, graphing, and measurement system. You’re graphing everything you can get your hands on, as often as you can. You probably didn’t gain anything from graphing the peak barking periods of your neighbor’s dog—but hey, you did it, and I’m proud of you.
Now you’ll be able to use this data (excluding the barking statistics) like a crystal ball, and predict the future like Nostradamus. But let’s stop here for a moment to remember an irritating little detail: it’s impossible to accurately predict the future.
Forecasting capacity needs is part intuition, and part math. It’s also the art of slicing and dicing up your historical data, and making educated guesses about the future. Outside of those rare bursts and spikes of load on your system, the long-term view is hopefully one of steadily increasing usage. By putting all of this historical data into perspective, you can generate estimates for what you’ll need to sustain the growth of your website. As we’ll see later, the key to making accurate predictions is having an adjustable forecasting process.
A good capacity plan depends on knowing your needs for your most important resources, and how those needs change over time. Once you have gathered historical data on capacity, you can begin analyzing it with an eye toward recognizing any trends and recurring patterns.
For example, in the last chapter I recounted how at Flickr, we discovered Sunday has been historically the highest photo upload day of the week. This is interesting for many reasons. It may also lead us to other questions: has that Sunday peak changed over time, and if so, how has it changed with respect to the other days of the week? Has the highest upload day always been Sunday? Does that change as we add new members residing on the other side of the International Date Line? Is Sunday still the highest upload day on holiday weekends? These questions can all be answered once you have the data, and the answers in turn could provide a wealth of insight with respect to planning new feature launches, operational outages, or maintenance windows.
Recognizing trends is valuable for many reasons, not just for capacity planning. When we looked at disk space consumption in Chapter 3, we stumbled upon some weekly upload patterns. Being aware of any recurring patterns can be invaluable when making decisions later on. Trends can also inform community management, customer care and support, product management, and finance. Some examples of how metrics measurement can be useful include:
Your operations group can avoid scheduling maintenance that could affect image processing machines on a Sunday, opting for a Friday instead, to minimize any adverse effects on users.
If you deploy any new code that touches the upload processing infrastructure, you might want to pay particular attention the following Sunday to see whether everything is holding up well when the system experiences its highest load.
Making customer support aware of these peak patterns allows them to gauge the effect of any user feedback regarding uploads.
Product management might want to launch new features based on the low or high traffic periods of the day. A good practice is to make sure everyone on your team knows where these metrics are located and what they mean.
Let’s take a look back at the daily storage consumption data we collected in the last chapter and apply it to make a forecast of future storage needs. We already know the defining metric: total available disk space. Graphing the cumulative total of this data provides the right perspective from which to predict future needs. Taking a look at Figure 4-1, we can see where we’re headed with consumption, how it’s changing over time, and when we’re likely to run out of space.
Now, let’s add our constraint: the total currently available disk space. Let’s assume for this example we have a total of 20 TB (or 20,480 GB) installed capacity. From the graph, we see we’ve consumed about 16 TB. Adding a solid line extending into the future to represent the total space we have installed, we obtain a graph that looks like Figure 4-2. This illustration demonstrates a fundamental principle of capacity planning: predictions require two essential bits of information, your ceilings and your historical data.
Determining when we’re going to reach our space limitation is our next step. As I just suggested, we could simply draw a straight line that extends from our measured data to the point at which it intersects our current limit line. But is our growth actually linear? It may not be.
Excel calls this next step “adding a trend line,” but some readers might know this process as curve fitting. This is the process by which you attempt to find a mathematical equation that mimics the data you’re looking at. You can then use that equation to make educated guesses about missing values within the data. In this case, since our data is on a time line, the missing values in which we’re interested are in the future. Finding a good equation to fit the data can be just as much art as science. Fortunately, Excel is one of many programs that feature curve fitting.
First, change the chart type to XY (Scatter), which converts the date values to single data points. We can then use the trending feature of Excel to show us how this trend looks at some point in the future. Right-click the data on the graph to display a drop-down menu. From that menu, select Add Trendline. A dialog box will open, as shown in Figure 4-3.
Next, select a trend line type. For the time being, let’s choose Polynomial, and set Order to 2. There may be good reasons to choose another trend type, depending on how variable your data is, how much data you have, and how far into the future you want to extrapolate. For more information, see the upcoming sidebar, “Fitting Curves.”
In this example, the data appears about as linear as can be, but since I already know this data isn’t linear over a longer period of time (it’s accelerating), I’ll pick a trend type that can capture some of the acceleration we know will occur.
After selecting a trend type, click the Options tab to bring up the Add Trendline options dialog box, as shown in Figure 4-4.
To show the equation that will be used to mimic our disk space data, click the checkbox for “Display equation on chart.” We can also look at the R2 value for this equation by clicking the “Display R-squared value on chart” checkbox.
The R2 value is known in the world of statistics as the coefficient of determination. Without going into the details of how it’s calculated, it’s basically an indicator of how well an equation matches a certain set of data. An R2 value of 1 indicates a mathematically perfect fit. With the data we’re using for this example, any value above 0.85 should be sufficient. The important thing to know is that as your R2 value decreases, so too should your confidence in the forecasts. Changing the trend type in the previous step affects the R2 value—sometimes for better, sometimes for worse—so some experimentation is needed when looking at different sets of data.
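If you’d rather compute R2 yourself instead of reading it off the chart, the arithmetic is simple: one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch (the sample values below are illustrative, not Flickr’s actual data):

```python
# Sketch: computing the coefficient of determination (R^2) for a fitted
# curve, the same statistic Excel displays on the chart.
def r_squared(observed, predicted):
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical daily storage readings (GB) and the trend line's values:
observed  = [14321.8, 14452.6, 14586.5, 14700.9, 14845.7]
predicted = [14295.0, 14443.5, 14593.5, 14745.1, 14898.2]
print(round(r_squared(observed, predicted), 4))
```

A perfect fit returns exactly 1; anything below roughly 0.85 for data like ours should lower your confidence in the forecast.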
We’ll want to extend our trend line into the future, of course. We want to extend it far enough into the future such that it intersects the line corresponding to our total available space. This is the point at which we can predict we’ll run out of space. Under the Forecast portion of the dialog box, enter 25 units for a value. Our units in this case are days. After you hit OK, you’ll see our forecast looks similar to Figure 4-5.
The graph indicates that somewhere around day 37, we run out of disk space. Luckily, we don’t need to squint at the graph to see the actual values; we have the equation used to plot that trend line. As detailed in Table 4-1, plugging the equation into Excel, and using the day units for the values of X, we find the last day we’re below our disk space limit is 8/30/05.
Disk available (GB):
y = 0.7675x^2 + 146.96x + 14147
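The same lookup can be done without a spreadsheet by solving the trend-line equation directly for the day consumption crosses the 20 TB ceiling. A quick sketch using the coefficients above:

```python
import math

# Sketch: find the day the fitted consumption curve crosses the
# installed-capacity ceiling, using the trend-line coefficients above.
a, b, c = 0.7675, 146.96, 14147   # y = a*x^2 + b*x + c (GB, x in days)
ceiling = 20480                   # 20 TB installed capacity, in GB

# Solve a*x^2 + b*x + (c - ceiling) = 0 for the positive root.
disc = b ** 2 - 4 * a * (c - ceiling)
runout_day = (-b + math.sqrt(disc)) / (2 * a)
print(runout_day)   # about 36.2, i.e., we run out during day 37
```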
Now we know when we’ll need more disk space, and we can get on with ordering and deploying it.
This example of increasing disk space is about as simple as they come. But as the metric is consumption-driven, every day has a new value that contributes to the definition of our curve. We also need to factor in the peak-driven metrics that drive our capacity needs in other parts of our site. Peak-driven metrics involve resources that are continually regenerated, such as CPU time and network bandwidth. They fluctuate more dramatically and thus are more difficult to predict, so curve fitting requires more care.
In Chapter 3, we went through the exercise of establishing our database ceiling values. We discovered (through observing our system metrics) that 40 percent disk I/O wait was a critical value to avoid, because it’s the threshold at which database replication begins experiencing disruptive lags.
How do we know when we’ll reach this threshold? We need some indication of when we’re approaching our ceiling. The graphs don’t show a clear, smooth line gradually rising past the 40 percent threshold; instead, our disk I/O wait graph shows the database doing fine until a 40 percent spike occurs. We might deem occasional (and recoverable) spikes acceptable, but we need to track how our average values change over time so the spikes aren’t so close to our ceiling. We also need to somehow tie I/O wait times to our database usage, and ultimately, to what that means in terms of actual application usage.
To establish some control over this unruly data, let’s take a step back from the system statistics and look at the purpose this database is actually serving. In this example, we’re looking at a user database. This is a server in our main database cluster, wherein a segment of Flickr users store the metadata associated with their user account: their photos, their tags, the groups they belong to, and more. The two main drivers of load on the databases are, of course, the number of photos and the number of users.
This particular database has roughly 256,000 users and 23 million photos. Over time, we realized that neither the number of users nor the number of photos is singularly responsible for how much work the database does. Taking only one of those variables into account meant ignoring the effect of the other. Indeed, there may be many users who have few or no photos; queries for their data are quite fast and not at all taxing. On the flip side, there are a handful of users who maintain enormous collections of photos.
We can look at our metrics for clues on our critical values. We have all our system metrics, our application metrics, and the historical growth of each.
We then set out to find the single most important metric that can define the ceiling for each database server. After looking at the disk I/O wait metric for each one, we were unable to distinguish a good correlation between I/O wait and the number of users on the database. We had some servers with over 450,000 users that were seeing healthy, but not dangerous, levels of I/O wait. Meanwhile, other servers with only 300,000 users were experiencing much higher levels of I/O wait. Looking at the number of photos wasn’t helpful either—disk I/O wait didn’t appear to be tied to photo population.
As it turns out, the metric that directly indicates disk I/O wait is the ratio of photos-to-users on each of the databases.
As part of our application-level dashboard, we measure on a daily basis (collected each night) how many users are stored on each database along with the number of photos associated with each user. The photos-to-user ratio is simply the total number of photos divided by the number of users. While this could be thought of as an average photos per user, the range can be quite large, with some “power” Flickr users having many thousands of photos while a majority have only tens or hundreds. By looking at how the peak disk I/O wait changes with respect to this photos per user ratio, we can get an idea of what sort of application-level metrics can be used to predict and control the use of our capacity (see Figure 4-6).
This graph was compiled from a number of our databases, and displays the peak disk I/O wait values against their current photos-to-users ratios. With this graph, we can ascertain where disk I/O wait begins to jump up. There’s an elbow in our data around a ratio of 85–90, where disk I/O wait climbs above 30 percent. Since our ceiling value is 40 percent, we’ll want to ensure we keep our photos-to-users ratio in the 80–100 range. We can control this ratio within our application by distributing photos for high-volume users across many databases.
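This kind of check is easy to automate against your application-level metrics. A sketch of the idea, with hypothetical shard names and counts; only the 85 elbow and the ratio of 110 (at which we reach the 40 percent I/O wait ceiling) come from the text:

```python
# Sketch: flag federated user databases whose photos-to-users ratio is
# approaching the elbow seen in Figure 4-6. Shard names and counts are
# hypothetical illustrations.
ELBOW_RATIO = 85       # disk I/O wait starts climbing around here
CEILING_RATIO = 110    # ratio at which we hit the 40% I/O wait ceiling

shards = {
    "db-a": {"photos": 23_000_000, "users": 256_000},
    "db-b": {"photos": 41_000_000, "users": 450_000},
    "db-c": {"photos": 33_000_000, "users": 300_000},
}

for name, s in shards.items():
    ratio = s["photos"] / s["users"]
    if ratio >= CEILING_RATIO:
        status = "over ceiling -- rebalance now"
    elif ratio >= ELBOW_RATIO:
        status = "approaching elbow -- plan a rebalance"
    else:
        status = "healthy"
    print(f"{name}: {ratio:.0f} photos/user ({status})")
```

Because the data is federated, “rebalance” here simply means assigning new high-volume users (or migrating existing ones) to less loaded shards.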
I want to stop here for a moment to talk a bit about Flickr’s database architecture. After reaching the limits of the more traditional master-slave MySQL replication architecture (in which all writes go to the master and all reads go to the slaves), we redesigned our database layout to be federated, or sharded. This evolution in architecture is becoming increasingly common as sites grow to handle ever-higher rates of changing data. I won’t go into how that architectural migration came about, but it’s a good example of how architecture decisions can have a positive effect on capacity planning and deployment. By federating our data across many servers, our growth is limited only by the amount of hardware we can deploy, not by the limits imposed by any single machine.
Because we’re federated, we can control how users (and their photos) are spread across many databases. This essentially means each server (or pair of servers, for redundancy) contains a unique set of data. This is in contrast to the more traditional monolithic database that contains every record on a single server. More information about federated database architectures can be found in Cal Henderson’s book, Building Scalable Web Sites (O’Reilly).
OK, enough diversions—let’s get back to our database capacity example and summarize where we are to this point. Database replication lag is bad and we want to avoid it. We hit replication lag when we see 40 percent disk I/O wait, and we reach that threshold when we’ve installed enough users and photos to produce a photos-to-user ratio of 110. We know how our photo uploads and user registrations grow, because we capture that on a daily basis (Figure 4-7). We are now armed with all the information we need to make informed decisions regarding how much database hardware to buy, and when.
We can extrapolate a trend based on this data to predict how many users and photos we’ll have on Flickr for the foreseeable future, then use that to gauge how our photos/user ratio will look on our databases, and whether we need to adjust the maximum amounts of users and photos to ensure an even balance across those databases.
We’ve found where the elbow in our performance (Figure 4-6) exists for these databases—and therefore our capacity—but what is so special about this photos/users ratio for our databases? Why does this particular value trigger performance degradation? It could be for many reasons, such as specific hardware configurations, or the types of queries that result from having that much data during peak traffic. Investigating the answers to these questions could be a worthwhile exercise, but here again I’ll emphasize that we should simply expect this effect will continue and not count on any potential future optimizations.
When we forecast the capacity of a peak-driven resource, we need to track how the peaks change over time. From there, we can extrapolate from that data to predict future needs. Our web server example is a good opportunity to illustrate this process.
In Chapter 3, we identified our web server ceilings as 85 percent CPU usage for this particular hardware platform. We also confirmed CPU usage is directly correlated to the amount of work Apache is doing to serve web pages. Also as a result of our work in Chapter 3, we should be familiar with what a typical week looks like across Flickr’s entire web server cluster. Figure 4-8 illustrates the peaks and valleys over the course of one week.
This data is extracted from a time in Flickr’s history when we had 15 web servers. Let’s suppose this data was taken today, and we have no idea how our activity will look in the future. We can assume the observations we made in the previous chapter are accurate with respect to how CPU usage and the number of busy Apache processes relate—which turns out to be a simple multiplier: 1.1. If for some reason this assumption changes, we’ll know quickly, as we’re tracking these metrics on a per-minute basis. According to the graph in Figure 4-8, we’re seeing about 900 busy concurrent Apache processes during peak periods, load balanced across 15 web servers. That works out to about 60 processes per web server. Thus, each web server is using approximately 66 percent total CPU (we can look at our CPU graphs to confirm this).
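That arithmetic is worth making explicit, since we’ll reuse it for the forecast. A sketch using the numbers above:

```python
# Sketch: translate busy Apache processes into per-server CPU usage,
# using the 1.1 multiplier observed in Chapter 3.
peak_busy_processes = 900
web_servers = 15
cpu_per_process = 1.1     # percent CPU per busy Apache process

per_server_processes = peak_busy_processes / web_servers
per_server_cpu = per_server_processes * cpu_per_process
print(per_server_processes, per_server_cpu)   # 60 processes, 66% CPU
```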
The peaks for this sample data are what we’re interested in the most. Figure 4-9 presents this data over a longer time frame, in which we see these patterns repeat.
It’s these weekly peaks that we want to track and use to predict our future needs. As it turns out, for Flickr, those weekly peaks almost always fall on a Monday. If we isolate those peak values and pull a trend line into the future as we did with our disk storage example above, we’ll see something similar to Figure 4-10.
If our traffic continues to increase at the current pace, this graph predicts in another eight weeks, we can expect to experience roughly 1,300 busy Apache processes running at peak. With our 1.1 processes-to-CPU ratio, this translates to around 1,430 percent total CPU usage across our cluster. If we have defined 85 percent on each server as our upper limit, we would need 16.8 servers to handle the load. Of course, manufacturers are reluctant to sell servers in increments of tenths, so we’ll round that up to 17 servers. We currently have 15 servers, so we’ll need to add 2 more.
Fortunately, we already have enough data to calculate when we’ll run out of web server capacity. We have 15 servers, each currently operating at 66 percent CPU usage at peak. Our upper limit on web servers is set at 85 percent, which would mean 1,275% CPU usage across the cluster. Applying our 1.1 multiplier factor, this in turn would mean 1,160 busy Apache processes at peak. If we trust the trend line shown in Figure 4-11, we can expect to run out of capacity sometime between the 9th and 10th week.
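Both of these calculations can be sketched in a few lines; the numbers are the ones from the example:

```python
import math

# Sketch: the two forecast calculations from the web server example.
cpu_per_process = 1.1     # percent CPU per busy Apache process
ceiling_pct = 85          # per-server CPU ceiling
servers_now = 15

# 1) Servers needed for the forecast peak of 1,300 busy processes.
forecast_processes = 1300
total_cpu_needed = forecast_processes * cpu_per_process        # ~1430%
servers_needed = math.ceil(total_cpu_needed / ceiling_pct)     # 17
print(servers_needed, servers_needed - servers_now)            # 17, add 2

# 2) How many busy processes the current cluster can absorb before
#    every server hits its ceiling.
cluster_cpu_limit = servers_now * ceiling_pct                  # 1275%
process_capacity = cluster_cpu_limit / cpu_per_process         # roughly 1,160
print(round(process_capacity))
```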
Therefore, the summary of our forecast can be presented succinctly:
We’ll run out of web server capacity three to four weeks from now.
We’ll need two more web servers to handle the load we expect to see in eight weeks.
Now we can begin our procurement process with detailed justifications based on hardware usage trends, not simply a wild guess. We’ll want to ensure the new servers are in place before we need them, so we’ll need to find out how long it will take to purchase, deliver, and install them.
This is a simplified example. Adding two web servers in three to four weeks shouldn’t be too difficult or stressful. Ideally, you should have more than six data points upon which to base your forecast, and you likely won’t be so close to your cluster’s ceiling as in our example. But no matter how much capacity you’ll need to add, or how long the timeframe actually is, the process should be the same.
When you’re forecasting with peak values as we’ve done, it’s important to remember that the more data you have to fit a curve, the more accurate your forecast will be. In our example, we based our hardware justifications on six weeks’ worth of data. Is that enough data to constitute a trend? Possibly, but the time period on which you’re basing your forecasts is of great importance as well. Maybe there is a seasonal lull or peak in traffic, and you’re on the cusp of one. Maybe you’re about to launch a new feature that will add extra load to the web servers within the timeframe of this forecast. These are only a few of the considerations you may need to compensate for when making justifications to buy new hardware. A lot of variables come into play when predicting the future, and as a result, we have to remember to treat our forecasts as what they really are: educated guesses that need constant refinement.
Our use of Excel in the previous examples was pretty straightforward. But you can automate that process by using Excel macros. And since you’ll most likely be doing the same process repeatedly as your metric collection system churns out new usage data, you can benefit greatly by introducing some automation into this curve-fitting business. Other benefits can include the ability to integrate these forecasts into a dashboard, plug them into other spreadsheets, or put them into a database.
An open source program called fityk (http://fityk.sourceforge.net) does a great job of fitting equations to arbitrary data, and can handle the same range of equation types as Excel. For our purposes, the full curve-fitting abilities of fityk are overkill; it was created for analyzing scientific data that can represent wildly dynamic datasets, not just growing and decaying data. While fityk is primarily a GUI-based application (see Figure 4-12), a command-line version is also available, called cfityk. This version accepts commands that mimic what would have been done with the GUI, so it can be used to automate the curve fitting and forecasting.
The command file used by cfityk is nothing more than a script of actions you can write using the GUI version. Once you have the procedure choreographed in the GUI, you’ll be able to replay the sequence with different data via the command-line tool.
If you have a carriage return–delimited file of x-y data, you can feed it into a command script that can be processed by cfityk. The syntax of the command file is relatively straightforward, particularly for our simple case. Let’s go back to our storage consumption data for an example.
In the code example that follows, we have disk consumption data for a 15-day period, presented in increments of one data point per day. This data is in a file called storage-consumption.xy, and appears as displayed here:
1   14321.83119
2   14452.60193
3   14586.54003
4   14700.89417
5   14845.72223
6   15063.99681
7   15250.21164
8   15403.82607
9   15558.81815
10  15702.35007
11  15835.76298
12  15986.55395
13  16189.27423
14  16367.88211
15  16519.57105
# Fityk script. Fityk version: 0.8.2
@0 < '/home/jallspaw/storage-consumption.xy'
guess Quadratic
fit
info formula in @0
quit
This script imports our x-y data file, sets the equation type to a second-order polynomial (quadratic equation), fits the data, and then returns back information about the fit, such as the formula used. Running the script gives us these results:
[jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values:  lambda=0.001  WSSR=464.564
#1:  WSSR=0.90162  lambda=0.0001  d(WSSR)=-463.663  (99.8059%)
#2:  WSSR=0.736787  lambda=1e-05  d(WSSR)=-0.164833  (18.2818%)
#3:  WSSR=0.736763  lambda=1e-06  d(WSSR)=-2.45151e-05  (0.00332729%)
#4:  WSSR=0.736763  lambda=1e-07  d(WSSR)=-3.84524e-11  (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...
We now have our formula to fit the data:
y = 0.786854x^2 + 146.657x + 14147.4
Note how the result looks almost exactly like Excel’s for the same type of curve. Treating the values for x as days and those for y as our increasing disk space, we can plug in our 25-day forecast, which yields the same results as the Excel exercise. Table 4-2 lists the results generated by cfityk.
Disk available (GB):
y = 0.786854x^2 + 146.657x + 14147.4
Being able to perform curve-fitting with a cfityk script allows you to carry out forecasting on a daily or weekly basis within a cron job, and can be an essential building block for a capacity planning dashboard.
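If you’d rather not depend on fityk at all, the same quadratic fit can be reproduced with numpy. A sketch using the values from the sample data file; note that numpy’s least squares is unweighted, while cfityk weights each point by an estimated standard deviation of sqrt(y), so the coefficients will differ slightly:

```python
import numpy as np

# Sketch: the storage-consumption quadratic fit, done with numpy
# instead of cfityk.
days = np.arange(1, 16)
gb = np.array([14321.83119, 14452.60193, 14586.54003, 14700.89417,
               14845.72223, 15063.99681, 15250.21164, 15403.82607,
               15558.81815, 15702.35007, 15835.76298, 15986.55395,
               16189.27423, 16367.88211, 16519.57105])

coeffs = np.polyfit(days, gb, 2)    # [a, b, c] for a*x^2 + b*x + c
forecast = np.polyval(coeffs, 25)   # projected consumption on day 25 (GB)
print(coeffs, forecast)
```

Wrapped in a small script fed by your metric collection system, this drops straight into a cron job or dashboard, just as the cfityk approach does.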
Web capacity planning can borrow a few useful strategies from the older and better-researched work of mechanical, manufacturing, and structural engineering. These disciplines also need to base design and management considerations around resources and immutable limits. The design and construction of buildings, bridges, and automobiles obviously requires some intimate knowledge of the strength and durability of materials, the loads each component is expected to bear, and what their ultimate failure points are. Does this sound familiar? It should, because capacity planning for web operations shares many of those same considerations and concepts.
Under load, materials such as steel and concrete undergo physical stresses. Some have elastic properties that allow them to recover under light amounts of load, but fail under higher strains. The same concerns exist in your servers, network, or storage. When their resources reach certain critical levels—100 percent CPU or disk usage, for example—they fail. To pre-empt this failure, engineers apply what is known as a factor of safety to their design. Defined briefly, a factor of safety indicates some margin of resource allocated beyond the theoretical capacity of that resource, to allow for uncertainty in the usage.
While safety factors in the case of mechanical or structural engineering are usually part of the design phase, in web operations they should be considered as an amount of available resources that you leave aside, with respect to the ceilings you’ve established for each class of resource. This will enable those resources to absorb some amount of unexpected increased usage. Resources for which you should calculate safety factors include all those discussed in Chapter 3: CPU, disk, memory, network bandwidth, even entire hosts (if you run a very large site).
For example, in Chapter 3 we stipulated 85 percent CPU usage as our upper limit for web servers, in order to reserve “enough headroom to handle occasional spikes.” In this case, we’re allowing a 15 percent margin of “safety.” When making forecasts, we need to take these safety factors into account and adjust the ceiling values appropriately.
Why a 15 percent margin? Why not 10 or 20 percent? Your safety factor is going to be somewhat of a slippery number or educated guess. Some resources, such as caching systems, can also tolerate spikes better than others, so you may want to be less conservative with a margin of safety. You should base your safety margins on “spikes” of usage that you’ve seen in the past. See Figure 4-13.
Figure 4-13 displays the effect of a typical traffic spike of the sort Flickr experiences on a regular basis. It’s by no means the largest. Spikes such as this one almost always occur when the front page of http://www.yahoo.com posts a prominent link to a group, a photo, or a tag search page on Flickr. This particular spike was fleeting; it lasted only about two hours while the link was up. It caused an eight percent bump in traffic to our photo servers. Seeing a 5–15 percent increase in traffic like this is quite common, and confirms that our 15 percent margin of safety is adequate.
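Applied to our web server numbers, the margin-of-safety check is only a line or two per resource. A sketch using the figures from the text:

```python
# Sketch: apply a factor of safety to a resource ceiling, then check
# whether a typical spike still fits. The 15% margin, 66% peak, and
# 8% spike are the figures from the Flickr example.
raw_ceiling_pct = 100      # theoretical resource limit (percent CPU)
safety_margin = 0.15       # headroom reserved for unexpected spikes
working_ceiling = raw_ceiling_pct * (1 - safety_margin)   # plan to 85%

peak_usage_pct = 66        # current peak CPU per web server
spike = 0.08               # typical front-page spike: +8% traffic
spiked_usage = peak_usage_pct * (1 + spike)

print(working_ceiling, spiked_usage, spiked_usage <= working_ceiling)
```

If the spiked usage ever approaches the working ceiling, either the margin is too thin for that resource or it’s time to add capacity.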
As we’ve demonstrated, with our resource ceilings pinpointed, we can predict when we’ll need more of a particular resource. When we complete the task of predicting when we’ll need more, we can use that timeline to gauge when to trigger the procurement process.
Your procurement pipeline is the process by which you obtain new capacity. It’s usually the time it takes to justify, order, purchase, install, test, and deploy any new capacity. Figure 4-14 illustrates the procurement pipeline.
The tasks outlined in Figure 4-14 vary from one organization to another. In some large organizations, it can take a long time to gain approvals to buy hardware, but delivery can happen quickly. In a startup, approvals may come quickly, but the installation likely proceeds more slowly. Each situation will be different, but the challenge will remain the same: estimate how long the entire process will take, and add some amount of comfortable buffer to account for unforeseen problems. Once you have an idea of what that buffer timeline is, you can then work backward to plan capacity.
In our disk storage consumption example, we have current data on our disk consumption up to 8/15/05, and we estimate we’ll run out of space on 8/30/05. That gives us exactly two weeks to justify, order, receive, install, and deploy new storage. If we don’t, we’ll run out of space and be forced to trim consumption in some way. Ideally, this two-week deadline will be long enough to bring new capacity online.
Obviously, the when of ordering equipment is just as important as the what and how much. The procurement timelines outlined above hint at how critical it is to keep your eye on how long it will take to get what you need into production. Sometimes external influences, such as vendor delivery times and physical installation at the data center, can ruin what started out as a perfectly timed integration of new capacity.
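One way to keep the when honest is to work backward from the forecast runout date through each stage of the pipeline. A sketch with hypothetical lead times; only the 8/30/05 runout date comes from the storage example:

```python
from datetime import date, timedelta

# Sketch: work backward from a predicted runout date to an order-by
# date. Stage durations are illustrative estimates, not prescriptions.
runout = date(2005, 8, 30)
lead_times = {
    "justify and approve": 3,
    "vendor delivery": 5,
    "rack and cable": 2,
    "install, test, deploy": 3,
}
buffer_days = 2            # comfort buffer for unforeseen problems

total = sum(lead_times.values()) + buffer_days
order_by = runout - timedelta(days=total)
print(total, order_by)     # 15 days of pipeline; order by 2005-08-15
```

Revisit the stage estimates every cycle; vendors and data centers change faster than the forecast math does.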
Startups routinely order servers purely out of the fear they’ll be needed. Most newly launched companies have developers to work on the product and don’t need to waste money on operations-focused engineers. The developers writing the code are most likely the same people setting up network switches, managing user accounts, installing software, and wearing whatever other hats are necessary to get their company rolling. The last thing they want to worry about is running out of servers when they launch their new, awesome website. Ordering more servers as needed can be rightly justified in these cases, because the hardware costs are more than offset by the costs of preparing a more streamlined and detailed capacity plan.
But as companies mature, optimizations begin to creep in. Code becomes more refined. The product becomes more defined. Marketing starts to realize who their users are. The same holds true for the capacity management process; it becomes more polished and accurate over time.
Toyota Motors developed the first implementations of a just-in-time inventory practice. It knew there were large costs involved to organize, store, and track excess inventory of automobile parts, so it decided to reduce that “holding” inventory and determine exactly when it needed parts. Having inventory meant wasting money. Instead of maintaining a massive warehouse filled with the thousands of parts to make its cars, Toyota would only order and stock those parts as they were needed. This reduced costs tremendously and gave Toyota a competitive advantage in the 1950s. Just-in-time inventory practice is now part of any modern manufacturing effort.
The costs associated with having auto parts lying around in a warehouse can be seen as analogous to having servers installed before you really need them. Rack space and power consumption in a data center cost money, as does the time spent installing and deploying code on the servers. More important, you risk suffering economically as a result of the aforementioned Moore’s Law, which, if your forecasts allow it, should motivate you to buy equipment later rather than sooner.
Once you know when your current capacity will top out, and how much capacity you’ll need to get through to the next cycle of procurement, you should take a few lessons from the just-in-time inventory playbook, whose sole purpose is to eliminate wasted time and money in the process.
Determine your needs
You know how much load your current capacity can handle, because you’ve followed the advice in Chapter 3 to find their ceilings and are measuring their usage constantly. Take these numbers to the curve-fitting table and start making crystal ball predictions. This is fundamental to the capacity planning process.
Add some color and use attention-grabbing fonts on the graphs you just made in the previous step, because you’re going to show them to the people who will approve the hardware purchases you’re about to make. Spend as much time as you need to ensure your money-handling audience understands why you’re asking for capacity, why you’re asking for it now, and why you’ll be coming back later asking for more. Be very clear in your presentations about the downsides of insufficient capacity.
Solicit quotes from vendors
Vendors want to sell you servers and storage; you want to buy servers and storage—all is balanced in the universe. Why would you choose vendor A over vendor B? Because vendor A might help alleviate some of the fear normally associated with ordering servers, through such practices as quick turnarounds on quotes and replacements, discounts on servers ordered later, or discounts tied to delivery times.
Can you track your order online? Do you have the phone number (gasp!) of a reliable human who can tell you where your equipment is at all times? Does the data center know the machines are coming, and have they factored that into their schedule?
How long will it take for the machines to make the journey from a loading dock into a rack, and cabled up to a working switch? Does the data center staff need to get involved, or are you racking machines yourself? Are there enough rack screws? Power drill batteries? Crossover cables? How long is this entire process going to take?
In the next chapter, we’ll talk about deployment scenarios that involve automatic OS installation, software deployment, and configuration management. However, just because the process is automated doesn’t mean it takes no time, or that you can ignore the problems that may arise along the way.
Do you have a QA team? Do you have a QA environment? Testing your application means having some process by which you can functionally test all the bits you need to make sure everything is in its right place. Entire books are written on this topic; I’ll just remind you that it’s a necessary step in the journey toward production life as a server.
Deploy your new equipment
It’s not over until the fat server sings. Putting a machine into production should be straightforward. When doing so, you should use the same process to measure the capacity of your new servers as outlined in Chapter 3. You may want to ramp up the production traffic the machine receives by gradually increasing its weight in the load-balanced pool. If you know this new capacity relieves a bottleneck, you’ll want to watch the effect that has on your traffic.
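If your load balancer supports per-server weights, the ramp-up might look something like this sketch. The `set_weight` callback is a hypothetical stand-in for whatever your load balancer actually exposes (an API call, a config push, and so on); the step size and pause are illustrative.

```python
import time

def ramp_up(server, set_weight, max_weight=100, step=10, pause=300):
    """Gradually raise a new server's weight in the load-balanced pool.

    `set_weight` is a hypothetical stand-in for your load balancer's
    real interface; `pause` leaves time to watch the graphs between steps.
    """
    for weight in range(step, max_weight + 1, step):
        set_weight(server, weight)
        time.sleep(pause)  # observe the effect on your metrics before continuing
```

Between steps, compare the new server’s metrics against the ceiling you established for its hardware class, and back the weight off if anything looks wrong.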
All of the segments within your infrastructure interact in various ways. Clients make requests to the web servers, which in turn make requests to databases, caching servers, storage, and all sorts of miscellaneous components. Layers of infrastructure work together to respond to users by providing web pages, pieces of web pages, or confirmations that they’ve performed some action, such as uploading a photo.
When one or more of those layers encounters a bottleneck, you bring your attention to bear, figure out how much more capacity you need, and then deploy it. Depending on how bottlenecked that layer or cluster is, you may find you’ll see second-order effects of that new deployment, and end up simply moving the traffic jam to yet another part of your architecture.
For example, let’s assume your website involves a web server and a database. One way organizations can help scale their applications is to cache computationally expensive database results, which is what deploying something like memcached allows you to do. In a nutshell, for certain database queries you choose, you consult an in-memory cache before hitting the database. This serves the dual purpose of speeding up the query and reducing load on the database server for results that are frequently requested.
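A minimal sketch of that cache-aside flow follows. A plain Python dict stands in for a real memcached client here, and the query and key names are illustrative; with an actual client library the get/set pattern is the same.

```python
cache = {}  # stand-in for a memcached client

def expensive_query(user_id):
    # Stand-in for a slow database query.
    return {"user_id": user_id, "photo_count": 42}

def get_user_stats(user_id):
    key = "user_stats:%d" % user_id
    result = cache.get(key)                  # 1. consult the in-memory cache first
    if result is None:
        result = expensive_query(user_id)    # 2. cache miss: hit the database
        cache[key] = result                  # 3. store for subsequent requests
    return result
```

Only the first request for a given user pays the database cost; subsequent requests are served from memory until the entry is evicted or invalidated.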
The most noticeable benefit is that queries that used to take seconds to process might take as little as a few milliseconds, which means your web server can send responses to clients more quickly. Ironically, there’s a side effect: when users aren’t waiting as long for pages, they tend to click links faster, causing more load on the web servers. It’s not uncommon to see memcached deployments turn into web server capacity issues rather quickly.
Now you know how to apply the statistics collected in Chapter 3 to immediate needs. But you may also want to view your site from a more global perspective—both in the literal sense (as your site becomes popular internationally), and in a figurative sense, as you look at the issues surrounding the product and the site’s strategy.
As mentioned earlier, getting to know the peaks and valleys of your various resources and application usage is paramount to predicting the future. As you gain more and more history with your metrics, you may be able to perceive more subtle trends that will inform your long-term decisions.
For example, let’s take a look at Figure 4-15, which illustrates a typical traffic pattern for a web server.
Figure 4-15 shows a pretty typical U.S. daily traffic pattern. The load rises slowly in the morning, East Coast time, as users begin browsing. These users go to lunch as West Coast users come online, keeping up the load, which finally drops off as people leave work. At this point, the load drops to only those users browsing overnight.
As your usage grows, you can expect this graph to grow vertically as more users visit your site during the same peaks and valleys. But if your audience becomes more international, the bump you see every day will widen as the number of active user time zones increases. As seen in Figure 4-16, you may even see distinct bumps after the U.S. drop-off if your site’s popularity grows in a region farther away than Europe.
Figure 4-16 displays two daily traffic patterns, taken one year apart, and superimposed one on top of the other. What once was a smooth bump and decline has become a two-peak bump, due to the global effect of popularity.
Of course, your product and marketing people are probably very aware of the demographics and geographic distribution of your audience, but tying this data to your system’s resources can help you predict your capacity needs.
Figure 4-16 also shows that your web servers must sustain their peak traffic for longer periods of time, which should inform when you schedule maintenance windows to minimize the effect of downtime or degraded service on users. Notice the ratio between your peak and your low period has changed as well. This affects how many servers you can stand to lose to failure during those periods, which effectively determines the ceiling of your cluster.
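The arithmetic behind “how many servers can we lose?” is simple enough to sketch. The server counts, per-server ceiling, and traffic figures below are illustrative.

```python
import math

def tolerable_failures(n_servers, per_server_ceiling, load):
    """How many servers the cluster can lose and still carry `load`.

    `per_server_ceiling` is each node's safe working limit (found via
    the methods in Chapter 3), not its absolute maximum.
    """
    servers_needed = math.ceil(load / per_server_ceiling)
    return n_servers - servers_needed

# Illustrative numbers: ten web servers, each comfortable at 400 req/sec.
at_peak = tolerable_failures(10, 400, 3600)  # at peak traffic
at_low = tolerable_failures(10, 400, 1200)   # at the overnight low
```

On these numbers, the cluster can lose only one server at peak but seven at the overnight low; as the peak-to-low ratio flattens, the peak figure shrinks, which is exactly the maintenance-window concern described above.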
It’s important to watch the change in your application’s traffic pattern, not only for operational issues, but to drive capacity decisions, such as whether to deploy any capacity into international data centers.
A good capacity plan not only relies on system statistics such as peaks and valleys, but user behavior as well. How your users interact with your site is yet another valuable vein of data you should mine for information to help keep your crystal ball as clear as possible.
If you run an online community, you might have discussion boards in which users create new topics, make comments, and upload media such as video and photos. In addition to the previously discussed system-related metrics, such as storage consumption, video and photo processing CPU usage, and processing time, some other metrics you might want to track are:
Discussion posts per minute
Posts per day, per user
Video uploads per day, per user
Photo uploads per day, per user
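Deriving these per-user, per-day counts from a raw event log might look like the following sketch. The field names and sample events are illustrative, not from any particular logging system.

```python
from collections import defaultdict

# Illustrative raw events; in practice these would come from your logs.
events = [
    {"user": "alice", "day": "2005-08-07", "type": "photo_upload"},
    {"user": "alice", "day": "2005-08-07", "type": "photo_upload"},
    {"user": "bob",   "day": "2005-08-07", "type": "video_upload"},
    {"user": "alice", "day": "2005-08-08", "type": "photo_upload"},
]

def per_user_per_day(events, event_type):
    """Count events of one type, keyed by (user, day)."""
    counts = defaultdict(int)
    for e in events:
        if e["type"] == event_type:
            counts[(e["user"], e["day"])] += 1
    return dict(counts)

photo_counts = per_user_per_day(events, "photo_upload")
```

Aggregates like these are what you graph and trend alongside the system metrics, so you can relate user behavior to resource consumption.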
Recall our database-planning example. In that example, we found our database ceiling by measuring our hardware’s resources (CPU, disk I/O, memory, and so on), relating them to the database’s resources (queries per second, replication lag), and tying those ceilings to something we can measure from the user-interaction perspective (how many photos per user are on each database).
This is where capacity planning and product management tie together. Using your system and application statistics histories, you can now predict with some (hopefully increasing) degree of accuracy what you’ll need to meet future demand. But your history is only part of the picture. If your product team is planning new features, you can bet they’ll affect your capacity plan in some way.
Historically, corporate culture has isolated product development from engineering. Product people develop ideas and plans for the product, while engineering develops and maintains the product once it’s on the market. Both groups make forecasts for different ends, but the data used in those forecasts should tie together.
One of the best practices for a capacity planner is to develop an ongoing conversation with product management. Understanding the timeline for new features is critical to guaranteeing capacity needs don’t interfere with product improvements. Having enough capacity is an engineering requirement, in the same way development time and resources are.
Producing forecasts by curve-fitting your system and application data isn’t the end of your capacity planning. In order to make it accurate, you need to revisit your plan, re-fit the data, and adjust accordingly.
Ideally, you should have periodic reviews of your forecasts. You should check how your capacity is doing against your predictions on a weekly, or even daily, basis. If you know you’re nearing capacity on one of your resources and are awaiting delivery of new hardware, you might keep a much closer eye on it. The important thing to remember is your plan is going to be accurate only if you consistently re-examine your trends and question your past predictions.
As an example, we can revisit our simple storage consumption data. We made a forecast based on data gleaned over a 15-day period, from 7/26/05 to 8/09/05, and discovered that on 8/30/05 (roughly three weeks later), we expected to run out of space if we didn’t deploy more storage. More precisely, we were slated to reach 20,446.81 GB, essentially exhausting our total available space of 20,480 GB.
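The projection itself is a straight-line extrapolation, sketched below. The two consumption figures are illustrative (not the book’s raw data), chosen so the arithmetic lands on the 8/30 date; the 20,480 GB total comes from the example.

```python
from datetime import date, timedelta

def projected_full_date(day0, gb0, day1, gb1, total_gb):
    """Straight-line projection of the date consumption reaches total_gb."""
    rate = (gb1 - gb0) / (day1 - day0).days   # GB consumed per day
    days_left = (total_gb - gb1) / rate
    return day1 + timedelta(days=round(days_left))

# Illustrative figures spanning the 15-day measurement window:
full = projected_full_date(date(2005, 7, 26), 19584.0,
                           date(2005, 8, 9), 19942.4,
                           20480.0)
# → date(2005, 8, 30) on these figures
```

The projection is only as good as the assumption that the recent consumption rate holds, which is precisely why the forecast gets revisited below.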
How accurate was that prediction? Figure 4-17 shows what actually happened.
As it turned out, we had a little more time than we thought—about four days more. We made a guess based on the trend at the time, which ended up being inaccurate but at least in favor of allowing more time to integrate new capacity. Sometimes, forecasts can either widen the window of time (as in this case) or tighten that window.
This is why the process of revisiting your forecasts is critical; it’s the only way to adjust your capacity plan over time. Every time you update your capacity plan, you should go back and evaluate how your previous forecasts fared.
Since your curve-fitting and trending results tend to improve as you add more data points, you should have a moving window with which you make your forecasts. The width of that forecasting window will vary depending on how long your procurement process takes.
For example, if you know that it’s going to take three months on average to order, install, and deploy capacity, then you’d want your forecast goal to be three months out, each time. As the months pass, you’ll want to add the influence of most recent events to your past data and recalculate your predictions, as is illustrated in Figure 4-18.
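One way to sketch that moving window: refit a line over only the most recent samples and project out to the procurement horizon, repeating the fit as new data arrives. The data here is made up for illustration.

```python
def forecast_with_window(samples, window, horizon):
    """Refit a straight line over the most recent `window` samples and
    project the value `horizon` steps past the last one.

    `samples` is a time-ordered list of (x, y) measurements.
    """
    recent = samples[-window:]
    n = len(recent)
    mean_x = sum(x for x, _ in recent) / n
    mean_y = sum(y for _, y in recent) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in recent) / \
            sum((x - mean_x) ** 2 for x, _ in recent)
    intercept = mean_y - slope * mean_x
    last_x = recent[-1][0]
    return slope * (last_x + horizon) + intercept

# Illustrative daily data; each month you'd drop the oldest points,
# append the newest, and refit.
history = [(d, 100 + 2.0 * d) for d in range(120)]
estimate = forecast_with_window(history, window=90, horizon=90)
```

Because only the most recent points inform the fit, the forecast automatically picks up changes in the growth rate, which is the point of recalculating as the months pass.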
This process of plotting, prediction, and iteration can provide a lot of confidence in how you manage your capacity. You’ll have accumulated a lot of data about how your current infrastructure is performing, and how close each piece is to its respective ceiling, taking into account comfortable margins of safety. This confidence is important because the capacity planning process (as we’ve seen) is just as much about educated guessing and luck as it is about hard science and math. Hopefully, the iterations in your planning process will point out any flawed assumptions in the working data, but it should also be said that the ceilings you’re using could become flawed or obsolete over time as well.
Just as your ceilings can change depending on the hardware specifications of a server, so too can the actual metric you’re assuming is your ceiling. For example, the defining metric of a database might be disk I/O, but after upgrading to a newer and faster disk subsystem, you might find the limiting factor isn’t disk I/O anymore, but the single gigabit network card you’re using. It bears mentioning that picking the right metric to follow can be difficult, as not all bottlenecks are obvious, and the metric you choose can change as the architecture and hardware limitations change.
During this process you might notice seasonal variations. College starts in the fall, so there might be increased usage as students browse your site for materials related to their studies (or just to avoid going to class). As another example, the holiday season in November and December almost always witnesses a bump in traffic, especially for sites involving retail sales. At Flickr, we see both of those seasonal effects.
Taking into account these seasonal or holiday variations should be yet another influence on how wide or narrow your forecasting window might be. Obviously, the more often you recalculate your forecast, the better prepared you’ll be, and the sooner you’ll notice variations you didn’t expect.
As I pointed out near the beginning of the chapter, predicting capacity requires two essential bits of information: your ceilings and your historical data. Your historical data is etched in stone. Your ceilings are not, since each ceiling you have is indicative of a particular hardware configuration. Performance tuning can hopefully change those ceilings for the better, but upgrading the hardware to newer and better technology is also an option.
As I mentioned at the beginning of the book, new technology (such as multicore processors) can dramatically change how much horsepower you can squeeze from a single server. The forecasting process shown in this chapter allows you to not only track where you’re headed on a per-node basis, but also to think about which segments of the architecture you might possibly move to new hardware options.
As discussed, due to the random access patterns of Flickr’s application, our database hardware is currently bound by the amount of disk I/O the servers can manage. Each database machine currently comprises six disks in a RAID10 configuration with 16 GB of RAM. A majority of that physical RAM is given to MySQL, and the rest is used for a filesystem cache to help with disk I/O. This is a hardware bottleneck that can be mitigated in a variety of ways, including at least:
Spreading the load horizontally across many six-disk servers (current plan)
Replacing each six-disk server with hardware containing more disk spindles
Adding more physical RAM to assist both the filesystem and MySQL
Using faster I/O options such as Solid-State Disks (SSDs)
Which of these options is most likely to help our capacity footprint? Unless we have to grow the number of nodes very quickly, we might not care right now. If we have accurate forecasts for how many servers we’ll need in our current configuration, we’re in a good place to evaluate the alternatives.
Another long-term option may be to take advantage of the bottlenecks we have on those machines. Since our disk-bound boxes are not using many CPU cycles, we could put those mostly idle CPUs to use for other tasks, making more efficient use of the hardware. We’ll talk more about efficiency and virtualization in Appendix A.
Predicting capacity is an ongoing process that requires as much intuition as math to make accurate forecasts. Even simple web applications need to be attended to, and some of this crystal ball work can be tedious. Automating as much of the process as you can will help you stay ahead of the procurement process. Taking the time to connect your metric collection systems to trending software, such as cfityk, will prove invaluable as you develop a capacity plan that is easily adaptable. Ideally, you’ll want some sort of capacity dashboard that can be consulted at any point in time to inform purchasing, development, and operational decisions.
The overall process of making capacity forecasts is pretty simple:

1. Determine, measure, and graph your defining metric for each of your resources (example: disk consumption).
2. Apply the constraints you have for those resources (example: total available disk space).
3. Use trending analysis (curve fitting) to illustrate when your usage will exceed your constraints (example: the day you’ll run out of disk space).