BUY THIS BOOK
Add to Cart

Print Book $39.99


Add to Cart

Print+PDF $51.99

Add to Cart

PDF $31.99

Safari Books Online

What is this?

Add to UK Cart

Print Book £28.50

What is this?

Looking to Reprint or License this content?


Analyzing Business Data with Excel
Analyzing Business Data with Excel

By Gerald Knight
Book Price: $39.99 USD
£28.50 GBP
PDF Price: $31.99

Cover | Table of Contents


Table of Contents

Chapter 1: Excel and Statistics
Several chapters of this book solve common business problems by creating complete applications using Excel features like VBA, forms, and array formulas . Before we start building applications, we need to cover some key basics. In this first chapter we look at array formulas and indexed addresses, two key Excel features used throughout the book. Next we will look at Excel's ability to handle common statistical calculations. These features will be the building blocks for the later applications. This is not a statistics textbook and we will not examine all of Excel's 50+ statistical functions; instead, we will look at the most commonly used functions with particular attention to those used in other parts of the book.
Excel array formulas give you the ability to work on ranges of cells all at once. Suppose we have a list of numbers, and we want to know the average amount of change from one number to the next. This situation is illustrated in Figure 1-1.
The normal way to approach a problem like this is to add a new column with an intermediate calculation. In cell B2 we calculate the difference between A2 and A1. Then we fill this formula down to cell B10. In cell B12 we simply take the average.
With an array formula we can get the same answer using only one cell. The formula in B15 is:
=AVERAGE(A2:A10-A1:A9)
This makes sense. We want the average of the differences, and that is what the formula is asking for. However, this returns a value error! The error appears because we need to enter the formula in a special way that tells Excel the formula applies to the ranges and not to individual cells.
This is done by pressing Ctrl-Shift-Enter, all at the same time while entering the formula. Excel then displays the formula in brackets, as shown in cell B17:
{=AVERAGE(A2:A10-A1:A9)}
Figure 1-1: Using an array formula
The values returned in cells B12 and B17 are the same. Excel does all of the intermediate calculations and returns the result.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Array Formulas
Excel array formulas give you the ability to work on ranges of cells all at once. Suppose we have a list of numbers, and we want to know the average amount of change from one number to the next. This situation is illustrated in Figure 1-1.
The normal way to approach a problem like this is to add a new column with an intermediate calculation. In cell B2 we calculate the difference between A2 and A1. Then we fill this formula down to cell B10. In cell B12 we simply take the average.
With an array formula we can get the same answer using only one cell. The formula in B15 is:
=AVERAGE(A2:A10-A1:A9)
This makes sense. We want the average of the differences, and that is what the formula is asking for. However, this returns a value error! The error appears because we need to enter the formula in a special way that tells Excel the formula applies to the ranges and not to individual cells.
This is done by pressing Ctrl-Shift-Enter, all at the same time while entering the formula. Excel then displays the formula in brackets, as shown in cell B17:
{=AVERAGE(A2:A10-A1:A9)}
Figure 1-1: Using an array formula
The values returned in cells B12 and B17 are the same. Excel does all of the intermediate calculations and returns the result.
Array formulas save space on the worksheet; often whole columns of formulas can be eliminated. By reducing visual complexity, they make things easier to understand, and can speed up sheet recalculation.
Array formulas can also contain truth values . Excel considers true to be a binary 1 and false to be a binary 0, and sometimes this can be helpful.
In Figure 1-2 we need to find the average of the odd numbers in the list.
In the non-array approach, we would start by identifying the odd values. If a number divided by two is not equal to the integer value of itself divided by two, it is odd. The formula in cell B1 is:
=A1/2<>INT(A1/2)
This fills down to B10
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Addressing Cells Indirectly
The INDEX, INDIRECT, ADDRESS, and OFFSET functions are used extensively in this book, providing formulas with much greater flexibility than cell references can provide.
All cells on a worksheet have a unique address (e.g, A1) and Excel lets you refer to any cell by its address. But what if you don't know what cell you are going to want? This can happen if the address of the required piece of information changes based on other values on the worksheet.
If there is a list of items you need to be able to select from, the INDEX function will do the job. The list can be in a column or a row. An example is shown in Figure 1-3.
Figure 1-3: Using the index function
We have a list of seven colors. In cell C4, the formula is:
=INDEX(A1:A7,C2)
The first entry in the formula is the range where the names of the colors are. The number in cell C2 tells the formula which color is required. In this case it is color number five. The fifth color, the one in cell A5, is Blue. The formula in cell C4 returns a value of Blue as text.
The INDEX function works in most cases but sometimes you may not know the row or column of the required value. It could be anywhere on the worksheet, or even on another worksheet. Figure 1-4 contains just such an example.
This time the data extends over several columns. The value in cell D3 is 373. D3 is the fourth column and the third row. The row and column numbers are in cells D12 and D11. The ADDRESS function in cell D14 uses them to build the address.
If you need an address to include the sheet name, and you will if you are referencing data on a different sheet, use the version in cell D17. It lets you include the sheet name and builds it into the returned value.
The address alone does not help much, but the INDIRECT function returns the value corresponding to an address. In cell D20 the INDIRECT function uses the address built in cell D14 to retrieve the value in cell D3.
The formula in cell D20
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Statistical Functions
Statistical functions are used to describe data. When working with a list of numbers, we need to be able to find the average, to understand any trends, and to describe the distribution. In this section we look at Excel's most commonly used statistical functions .
The average is the most common measure of the center of a group of numbers, and Excel makes taking an average easy. The basic approach is shown in Figure 1-6.
Figure 1-6: Taking an average
Five is the average of the numbers from one to nine (in column A, rows 1 through 9). Usually numbers will be in a column, as in item 1, but it really doesn't make any difference as long as you know where the numbers are. In item 2 the average formula is looking at a three by three range, and in item 3 the numbers are mixed with non-numeric information.
Excel ignores cells that do not contain valid numbers, and item 4 uses this behavior to simplify taking the average of scattered numbers. The AVERAGE function finds the numbers in the range. I don't have to tell Excel where they are.

Section 1.3.1.1: AVERAGEA

The AVERAGE function is usually convenient, but fails in Figure 1-7.
Figure 1-7: Averaging text as a number
On Wednesday we did not get any orders. The data contains the word "None," but the value is really zero. In item 1, the AVERAGE function gets the wrong answer because it ignores the cell with "None" in it. The AVERAGEA function is made for this situation. AVERAGEA works just like AVERAGE, except it considers non-numeric cells to have a value of zero. In item 1 formula 2, the AVERAGEA function calculates the average correctly.
AVERAGEA ignores cells that are not used. In item 2, the Orders cell for Wednesday is empty and the AVERAGEA function does not consider the cell to have a value of zero.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 2: Pivot Tables and Problem Solving
Business data analysis is a search for relationships. How does advertising impact sales? Does an increase in returns mean we have a quality problem? Whether we are looking for problems or identifying best practices, it all comes down to relationships. Business analysis requires a complex, multidimensional approach. Businesses capture and store large amounts of data. As companies try to become more efficient, the job of sifting through this data looking for valuable insight is becoming more common and important.
Excel is ideal for this kind of work. It can import data from most databases, it can handle almost any statistical or formatting problem, and it has a great pivot table feature. Pivot tables were designed for researching relationships in data. They allow us to try different combinations by dragging and dropping, making it easy to check a large number of relationships quickly. They create interactive tables and charts and can quickly filter the data or change point of view.
In this chapter we use pivot tables to analyze a business process. This is an ad hoc activity and the end product is information, not an application. So, formatting and appearance are not the main concerns. The real goal is to find specific problems or opportunities.
We work with two kinds of data. First, there are categorical items. These elements separate data into well-defined groups. If you look at a customer file, one of the fields might be the customer's ZIP code. It is a number, but it has no real numeric value. Its only purpose is to identify a group or category of customers.
The second type is scalar items. Scalars have a numeric value, like cost or square feet. They can be added up or averaged. In pivot tables we use categoricals and scalars differently.
The sample data for this chapter consists of 10,000 rows of data from an order processing operation. The layout is shown in Figure 2-1.
Figure 2-1: Order data
It takes too long to handle orders, and we want to find specific problem areas. We also want to find examples of good performance to establish best practices. Pivot tables will provide an excellent tool for exploring this kind of data.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Pivot Table Basics
Pivot tables require data to be arranged in columns with headings. We start by selecting the range of cells or the columns that we want in the pivot table. The pivot table wizard will attempt to find the range for you, if you select a single cell containing data.
Be careful when letting the wizard find the range. The pivot table wizard will stop at an empty row. So, if you are not sure all the rows in your range are used, it is best to select the range you want manually.
Next we click on the Data → Pivot Table PivotChart Report menu option. This starts the pivot table wizard and brings up the dialog in Figure 2-2.
For this example we simply click the Finish button. This accepts all of the defaults, and this is often all you will need. Several options are available at this point, including linking our pivot table to external data or making a PivotChart. For now, just click Finish, which brings us to the display in Figure 2-3.
A newly created pivot table has a few parts. Item 1 is a list of data items in the table. We can drag and drop these items onto the pivot table.
Item 2 is the Pivot Table toolbar. On your system it may show up in a different location. Item 3 is the area where we drag-and-drop row fields. Our selection in this area establishes the order of the rows in the finished table.
Item 4 is the data area. Items dragged to this area are summarized by both row and column. Item 5 is the area for column fields. Selections here determine the horizontal arrangement of the table. The page field area in item 6 is a filtering option, allowing you to add a fourth dimension to the table.
Figure 2-2: The PivotTable Wizard
Figure 2-3: The parts of a pivot table
We start by looking at the average age of orders by location. First, drag the Office data item into the row field area. The results are shown in Figure 2-4. Office is a categorical item. The row, column, and page areas can only use categorical data.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Changing the Data
You are not limited to the data you get. Sometimes it is helpful to change the form of a data item. You cannot create new information, but you can redefine existing data in ways that make it more useful.
One of the data items we started with is Action Date. This is categorical information, but there are too many dates. If I use it as the row or column field I get too many categories. I can make this item more useful by changing it into a day of the week. We might find orders have different characteristics based on the day of the week they are received. I return to the data sheet and add a new column named Weekday, filling it out as shown in Figure 2-24.
Figure 2-21: Using the page field area
Figure 2-22: Drilling down
Figure 2-23: The details
Figure 2-24: Adding a new data item
I then go back to the pivot table, click on the pivot table menu and select PivotTable wizard . Clicking on the Back button brings up the dialog in Figure 2-25.
Figure 2-25: Changing the data range for a pivot table
Change the data range to DATA!$A:$L to add the new column, then click Finish, and the new data item appears in the field list.
There are times when you might need to change a scalar item into a categorical. I can convert Order Amount into a categorical item by adding a column that gives its quartile. The formula is shown in Figure 2-26.
The QUARTILE function has two parameters. First is a range that contains a list of numbers. Second is the quartile to be returned. It returns the maximum value of the quartile and the formula in Figure 2-26 returns the quartile number.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Pivot Table Options
In this chapter I have tried to show how pivot tables and charts can be used to analyze business problems, but there are many more features and options. In Step 3 of the PivotTable Wizard there is an Options button. You can launch the wizard by right-clicking anywhere in the pivot table area. The Options button brings up the dialog shown in Figure 2-31.
This gives you control over the general formatting and layout of the table, and the Data source options (bottom left) can be helpful if pivot tables are the end product for your users and you don't want them to have the original data.
Figure 2-31: The options dialog
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 3: Workload Forecasting
Forecasting takes information available in the present and uses it to predict the future . Everyone forecasts things. We use our knowledge of the past to decide what time to get up in the morning, how much food to buy at the grocery, and where to go on vacation. In daily life we don't usually think about how we make these decisions or how we evaluate the outcome of our choices. Business forecasting is more structured. There are specific techniques to model the relationships between present information and a future value.
Few forecasts are exactly right. Some error is expected—some difference between the forecast and the actual value. In most cases, then, it is not enough to make a forecast. We also need to know how accurate the forecast will be. This means that the prediction is not an exact value but a range with a known probability.
This chapter looks at predicting the workload for a typical business process. The example forecasts the volume of calls coming into a call center, but the techniques can be applied to many other problems. This type of forecasting uses only the past values of the item being predicted. Future call volumes are predicted using past call volumes.
This chapter also demonstrates techniques for creating a complete application in Excel. The application uses a combination of organization, workbook functions, formatting, and a little VBA. At the end of the chapter we will have a complete Excel application for predicting workload.
Since this is the first application in the book, we will take a detailed look at the entire process, beginning with the Excel functions and features used to create the application.
The application uses the Excel functions listed in Table 3-1. Most of these functions are discussed as we encounter them in the application. But the INDEX, INDIRECT, and ADDRESS functions are used in several of the applications in the rest of this book and have a unique job. Understanding how they work is critical to understanding this chapter's application, so make sure you have read the explanations of how they work in Chapter 1.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Procedure
First we look at the calculations. We need to make the best possible prediction, measure the accuracy, and manage anomalies. Excel provides the tools, but before we start entering formulas we need to understand the data.
This chapter uses data from a call center. It is a five day a week operation, and the data is simply the date and number of incoming calls for each day.
The techniques used can be applied to most workload situations. The goal is to predict a periodic workload that could have an overall trend, and is subject to short-term ups and downs. The accuracy of predictions will be measured and used to set a prediction range with a known probability.
A workload model needs to handle both expected and unexpected shifts in volume. Holidays are expected, but December volumes are not a good predictor for January. Real world workloads are subject to all kinds of unpredictable outside forces. The competition can raise their price, a server can go down, or a snow storm can shut down part of the country. As a result, workloads can go up or down with no warning.
The sample data in this chapter comes from an actual call center and is subject to all the uncertainty of the business environment.
Time creates uncertainty. The further into the future we predict, the less accurate we are. This chapter starts with a weekly prediction. The weekly forecast is adjusted as more information is available and becomes an adjusted daily forecast. An hourly forecast is also made, and the daily forecast is adjusted further as actual hourly values are entered.

Section 3.1.2.1: Find the lag

Lag is the number of observations in a cycle. Many business systems run on a weekly cycle. Mondays look like Mondays; Fridays look like Fridays. If the process runs five days a week, the lag is five. Of course, everything doesn't run on a weekly cycle. Sometimes it is monthly, hourly, or some exotic period. In all cases it is critical to know how many observations there are in a cycle, and usually this is known without examining the data. But if you are not sure, it's best to check.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Building an Application
We now understand how this application will work in theory. But, we need to take the theory and turn it into a solution—something we can give to the user. In this section we will build a complete Excel application based on the processes described in the last section.
As we develop an application, things can easily get out of control. Therefore, it is important to start with a plan and some structure. The example has several worksheets, each with its own function.
The Data sheet holds information. It has no formulas. The user is not going to be looking at this sheet, so there is no formatting. This sheet stores the information that drives the application. When new information comes in, it goes on the Data sheet. Keeping the data separate has advantages. If the data source changes, only the interface between the data sheet and the data source needs changing.
All of the logic in the application is on the Workarea sheet. This is the only sheet with workbook functions.
The Display sheet is only concerned with presentation. It is attached to the Workarea sheet by named cells and ranges. There are no formulas, only formatting and information organization.
The Settings sheet contains application options, and there are two specialized sheets: AccuracyChart contains a chart and HourlyForecasts contains a table of hourly predictions. These sheets have no formulas and only hold or display information.
Prediction is the core of the application. But, to make a complete solution there needs to be a structure around it. What will the user see? How will the user interact with the application? These questions are answered by design.
This is important, and should be shared with users. It forces the developer and the user to think through how the application will be used. In this case there are seven basic requirements:
  1. The system will display a full week of predictions.
  2. The adjusted prediction for the next day will be shown.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 4: Modeling
In the last chapter we used the past to predict the future. This works well for simple situations, but things are often more complicated. Most things depend on other things. Forecasting stock prices, credit scoring, predicting the weather, and designing a direct mail campaign all depend on independent data that influences the thing being predicted. If you want to predict tomorrow's weather in Chicago, you have to consider today's weather further west. They are connected.
In this chapter we look at using Excel to model a complex situation. We consider selecting independent data items and preparing them for use. It is not always easy to decide what value to predict, so we examine this process. Finally, we go through the steps and techniques needed to build a working model.
For more complex kinds of problems, a technique called regression is used. Excel has a regression tool from Tools → Data Analysis → Regression. If Data Analysis is not showing up on the Tools menu, select Add-Ins and check Analysis ToolPak.
The first example predicts a stock price. We have 223 days of technical data for a stock including the opening price, the high, the low, the closing price, and the volume for each day. We predict tomorrow's closing price using this information.
We build a model using regression. But we will need a way to know if our model is any good. So, we start by making a simple prediction. Then we can compare the accuracy of our model to the simple prediction. If our model is not more accurate than the simple prediction, it does not add any value and we might as well just use the simple prediction. For the simple prediction, we assume the closing stock price tomorrow will be the same as today's closing price. Figure 4-1 shows the setup.
The array formula in cell I7 gives the average error amount for the prediction. On average we are off by about $0.46 everyday. But we have six pieces of information about the stock, not just the closing price, so next we make the prediction using all six.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Regression
For more complex kinds of problems, a technique called regression is used. Excel has a regression tool from Tools → Data Analysis → Regression. If Data Analysis is not showing up on the Tools menu, select Add-Ins and check Analysis ToolPak.
The first example predicts a stock price. We have 223 days of technical data for a stock including the opening price, the high, the low, the closing price, and the volume for each day. We predict tomorrow's closing price using this information.
We build a model using regression. But we will need a way to know if our model is any good. So, we start by making a simple prediction. Then we can compare the accuracy of our model to the simple prediction. If our model is not more accurate than the simple prediction, it does not add any value and we might as well just use the simple prediction. For the simple prediction, we assume the closing stock price tomorrow will be the same as today's closing price. Figure 4-1 shows the setup.
The array formula in cell I7 gives the average error amount for the prediction. On average we are off by about $0.46 everyday. But we have six pieces of information about the stock, not just the closing price, so next we make the prediction using all six.
Figure 4-1: A simple prediction
We assume all six metrics add some value to the prediction. Each metric is multiplied by a weight, and then they are added up. An additional value, called the intercept, is added to the sum to get the final prediction. Figure 4-2 below shows how the problem is set up in Excel.
Figure 4-2: Setup for a regression model
The formula in F3 multiplies the opening price in column A by the weight in cell F1. We start in row 3 because that is where we started in the calculations in Figure 4-1. This way we can compare the accuracy of the regression to the simpler method for exactly the same days. This formula fills right to column J, and down to the end of the data at row 224.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Defining the Problem
We start with a question. Can we predict the results of a dog race? The first challenge is to figure out what the question means. We could predict which dog is most likely to win, or finish in the top two or three positions. We could predict the first and second dogs in a race. But predicting which dog will win may not be the point. The real issue is probably money. If we are looking at dog races, we want to know which bets are most likely to be profitable, so we need to predict how much a dog will pay along with its chances of winning.
We can build a model to predict this, but how will we know if the model is any good? In this case it's easy. If we can make a profit using the model, then it is good; otherwise, it is useless. If we build a credit scoring model, we have the same problem. It is not enough to identify accounts that are most risky. As a group these accounts may still be profitable, and a model would need to consider the impact to the bottom line, not just the level of risk. The same problem occurs when modeling stock prices. What do we really need to know? If we are trading options, we don't need to know the future price of the stock. All we need is the probability that it will trade above or below a price in a given period of time.
There is another important consideration here. Some things are easier to model than others. For example, if we try to build a model that predicts which dog will win in a race, we are trying to identify one winner out of eight dogs. When we look at the data there will be seven times more losers than winners. This makes modeling difficult. It is easier to get a good result when there is an even mix of outcomes in the data.
Next we consider the data used to build the model. What data is available? In most business situations there will be historical data. If we are modeling collections calls to increase dollars collected per call, we will need data on past collections calls and their outcomes. For stocks there is plenty of historical data available. With dog races, the data is on the racing form.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Refining Metrics
BestSpeed tells us how fast a dog can run, but not how that speed compares to the speeds of the other dogs in a race. There are eight dogs in each race and one is the fastest. We need a way to rank the dogs in each race by speed. We start by sorting the data by Race and BestSpeed, as in Figure 4-8.
We insert a column between BestSpeed and PlacePay and label it SpeedRank. For the first race we enter the numbers 1–8. Then in cell G9 we enter the formula =G2, and fill this formula down to the bottom of the data, as in the Figure 4-9. Next, Copy and Paste Special (Values) on the G column.
Figure 4-8: Sort each race by BestSpeed
We also need a ranking by how often the dogs win. For this we use the Races and Wins columns. We create a new column called WinRatio. The value is Wins divided by Races for each dog. If a dog has no races, we set this value to zero since we cannot divide by zero. We then sort by Races and WinRatio and build a WinRank column just as we built SpeedRank. The setup is shown in Figure 4-10.
Once the WinRank column is filled down and columns H and I are copied and pasted as values, the data is ready to use.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Analysis
We still do not know if the metrics can predict the payout. We hope the data can make the prediction, but we need more information about the relationships in the data. The Pivot Table tool makes it easy to explore these relationships. We select all rows for columns B thru J. Then we select PivotTable and PivotChart Report from the Data menu. The PivotTable dialog opens up, and we select Pivot Chart Report as in Figure 4-11.
Figure 4-9: Adding the SpeedRank column
Figure 4-10: Adding the WinRank column
After Finish is clicked, the Pivot Chart is displayed. We are interested in the relationships between the payout and the other metrics. So, we drag PlacePay to the Data area in the center of the chart, labeled as Item 1 in Figure 4-12. By default the count of the data item, PlacePay, is displayed. We change to average by double-clicking on the Count of PlacePay button and selecting Average (Item 3). Next we drag Dog to the Category area at the bottom (Item 2).
Figure 4-11: Setting up a Pivot Chart
Figure 4-12: Configuring the Pivot Chart
This results in the chart in Figure 4-13, showing how post position relates to payout. On average, the dog in position 1 pays more. So, if you bet dogs running in position 1, you will lose less money. The minimum bet is $2.00, thus there is a profit if the average of PlacePay is more than $2.00.
Figure 4-13: Post position and Payout
We use this chart to check the relationship between our metrics and the payout to find ones with the greatest predictive power. The metric Dog is dragged back to the list of metrics and the other metrics are dragged to the category box one by one. Races, Wins, and BestSpeed look odd because they have a large number of possible values. The metric that gives the best result is WinCnt, the number of wins out of the last six races.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Building the Model
We need to limit the number of metrics. If we use too many, the model will over-train. If this happens, the model will be overly influenced by unusual or isolated data. There might be one or two dogs that win with a very high payout. Two out of seven hundred doesn't mean much. But regression is not a magic formula, and it is not guaranteed to find the relationships in the data. It is just a mathematical technique that draws a bunch of lines based on the best fit to the data. With too much flexibility it will, in effect, memorize the data rather than learn how to solve the problem.
To guard against this we test the results. The data is separated into two groups. One group is used to build the model and the other is used for testing. If we get good results when we build the model but worse results when we test, the model is over-trained and useless.
We have 710 data items. We will use 449 to build the model and the remaining 261 will be reserved for testing. We have already used WinCnt to limit the data. We now build a worksheet with just the columns needed for the model. We know that Dog (running position), SpeedRank, and WinRank are the metrics to use. But there is a problem with Dog. In Figure 4-16 we see that Dog produces two distinct profitable regions. This seems to mean that the middle positions are less profitable than one, two, seven, and eight. Since we know that this situation exists, we should make a change in the Dog metric. Figure 4-19 shows the resulting sheet.
Figure 4-19: Model data
We insert a new column named Dog-4 containing the absolute value of the difference between the dog's running position and 4.
We need to be sure the rows are assigned to the model and test groups randomly. Therefore, we add a new column called Random and fill it with random numbers. Next we sort the data on Random. This will ensure that each row has the same chance of being assigned to the model group or the test group. After sorting on the Random column, it is deleted.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Analyzing the Results
To see if the model is working, we rank the dogs by their scores and only consider half of them, the ones with the highest scores. The mid-point is not likely to be the best threshold, and we will want to experiment with different splits. So, the sheet shown in Figure 4-23 is set up to allow different percentages to be tested.
The number 50 is entered in cell L1. This indicates that we are going to test a 50% split. In cell L2, we calculate how many of the 449 model dogs are in the top half. The formula is =INT(450*(L1/100)). The result is 225.
Figure 4-22: Model with scores
Figure 4-23: Test setup
The threshold for this test is the 225th highest score. The formula in L3 is =LARGE(I2:I450,L2). This gives the value we need. With these formulas in place we can easily test any percentage. If we enter 15 in cell L1, the value in L3 gives us the value to test the top 15%.
In the range L6:L12 are details of the model's performance for the top 449 rows. The first formula is =COUNTIF(I2:I450,">=" & L3). This counts the number of scores equal to or greater than the threshold. It tells how many bets we make if we bet the top half of the scores.
In column J, the formula =IF(I2>=L$3,E2,0) gives the payout for each dog. If the dog's score is less than the threshold, it is not bet and the payout is zero. To get the total payout for the model group, we use the formula =SUM(J2:J450) in cell L7.
We get the average payout by dividing L7 by L6. In this case it is $1.87 in cell L8. The number of wining bets is calculated in cell L9 using =COUNTIF(J2:J450,">0"). The win rate, the percentage of bets that win, is in L10. The formula is =L9/L6.
We also want to know the average amount of a winning payout. We get it with =L7/L9 in L11. Finally, the formula =(L8-2)*L6 in cell L12 gives the total profit or loss for the 449 model dogs. We subtract 2 because the bet is $2.00 and we are interested in the profit.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Testing Non-Linear Relationships
Regression assumes that the relationships in the data are linear. This is usually a safe assumption, but sometimes you can get a more accurate model if you allow for non-linear relationships . We can use the solver to test the potential value of using non-linear terms in the model.
We start by inserting a row at the top of the worksheet, setting it up as shown in Figure 4-26.
Figure 4-26: Setup to test non-linear relationships
This is a classification problem . We are dividing the population of dogs into two groups. As long as we can set a good threshold, we do not care how well the model predicts the exact value.
This means the intercept is not adding any value. The model will do just as well without it because we are only interested the correlation between the score and the payout. If we substitute a zero for the intercept in cell I2, the performance of the model does not change.
The results in Figure 4-26 are exactly the same as in Figure 4-23. Only the threshold is different. In row one of columns F, G, and H we enter 1. We change the formula in F3 from =B3*F$1 to =(B3^F$1)*F$2), and fill this new formula across to column H and down to row 712.
At the top of the L column we add the formula =CORREL(I3:I451,E3:E451) to measure the correlation between the scores in column I and the payouts in column E. The value is 0.103. This means the scores are positively correlated with the payouts, but the correlation is not especially strong.
We want to see if changing the values in F1:H1 can increase the correlation. For this we use the Solver, which is on the Tools menu. If the Solver does not appear as one of the items on the Tools menu, it may be necessary to select Add-Ins and make sure the Solver Add-In is checked.
The Solver dialog is filled out as shown in Figure 4-27.
Figure 4-27: The Solver dialog
The target cell is L1. This is the cell with the correlation formula and is the value we want to improve. Equal to Max is selected because we want the highest value possible for correlation.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 5: Measuring Quality
In the last two chapters we looked at measuring quantity. Now we turn to quality. Sampling and statistical analysis have been used to measure quality in manufacturing since the 1920s. Years ago it relied on small samples and simple calculations, but today we can do more. With modern technology we can look at thousands of pieces of work and the complexity of calculations is no longer an issue.
In this chapter we use Statistical Process Control techniques to build an application to measure quality. We will go through the techniques and calculations, and then use them to build an application. We will also look at the data requirements for this kind of project.
In earlier projects we used VBA to make the application interact with the user, but Excel can interact without VBA. This project uses controls to bring the application to life. Controls are easy to use and we will include four of them in the project.
The basic problem is always the same. We start with data that contains the answer to a question. We find the answer and give it to the user.
This chapter uses the following Excel functions and features, shown in Tables 5-1 and 5-2.
Table 5-1: Excel functions used in this chapter
INDEX( )
INDIRECT( )
ADDRESS( )
INT( )
ROW( )
MAX( )
STDEV( )
SUM( )
IF( )
AVERAGE( )
LN( )
EXP( )
The INDEX, INDIRECT, and ADDRESS functions are vital to understanding the application, and they are explained in Chapter 1.
Table 5-2: Excel features used in this chapter
Controls
Named Cells
Named Ranges
Array Formulas
Charting
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Statistical Process Control
This chapter's example is a check processing operation. The number of checks varies by day of the week as does the amount of money deposited. These are measures of quantity and can be forecasted and monitored using techniques in Chapter 3. But quality is as important as quantity. If something is going wrong in the operation (e.g., if payments are being misapplied, or check numbers are being recorded incorrectly), we need to know.
When monitoring a manufacturing process we can measure the diameter of a bolt, the weight of a bottle of shampoo, or the percent of electrical components failing a test. These are things that do not vary by day of the week, and a significant change in any of them can mean trouble. In our check processing operation we need to use metrics that behave this way.
First, we consider potential problem areas. Checks received for payment need to be processed quickly, so we measure the average age of the checks. Customers are supposed to send a remittance slip with their check, and we will measure the percentage of payments received that contain only a check and the average number of pages of remittance information per check. Money is important, so we measure both the average check amount and the average amount per remittance page. Finally, we need to monitor the accuracy of our data capture process. For this we look at the percentage of checks that have a valid invoice number, the average number of digits in the check number, and the average number of digits in the check amount.
If any of these metrics shows a significant change we need to find the reason. Avoiding metrics based on volume or day of the week keeps the focus on quality.
This concept can be applied to almost any operation. In a call center you might look at average talk time, percentage of calls abandoned, and percentage of calls transferred. In an invoicing area it could be average value, average lines, and product mix.
The process, like forecasting, is simply predicting what each metric should be, knowing how accurate the prediction is, and using this to set control limits for each metric. The prediction is the recent average. We don't consider lag since these metrics are not cyclic. We don't correct for the trend. If there is a trend, we want to know. We are looking for trends.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Running the Application
The main display of the application is shown in Figure 5-2.
Figure 5-2: Application Main Display
The application has only four options. First, the sample data contains 11 metrics. Use the combo box control (Item 1) to select the one you want. The chart and data area in the upper left update automatically.
All the dates and values for the selected metric appear in the scroll area. Use the scroll bar (Item 2) to move the data up or down.
The application is set to a sigma of three. The control limits are set at three standard deviations from the average. If you want to change the sigma, use the spinner control (Item 3).
Finally, if you want the control limits to be set using a log normal distribution, use the radio buttons (Item 4).
If the metric is out of control limits, it will be flagged as shown by Item 1 in Figure 5-3. The metric is the standard deviation of the number of digits in the check amount. The current value for this metric is 1.63. That is above the upper control limit of 1.54. The chart (Item 2) shows the metric is outside the control limits.
Figure 5-3: Main display with out of control item
What does it mean? The dollar amount of a check is a number. Here we are measuring the number of digits in the amounts of all the checks processed today. The high standard deviation tells us there is an unexpectedly high amount of variation in the metric today. We need to look at the information capture process for this metric. Perhaps a data entry person is skipping digits or an OCR process is having a problem. We are probably getting the check amount wrong some of the time. We can either look into it now, or wait until the monthly close and see if the CFO asks about it.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Application Design
The application contains three worksheets. A sheet named Data holds the data. It has no formulas and no real formatting. It is just a place for the information. All the calculations are done on the Workarea sheet. It references the Data sheet, does all of the calculations, and builds named display ranges. The Display sheet uses the named ranges on Workarea, formatting, and a chart to present the results to the user. This sheet also uses controls to allow the user to interact with the application.
The Data sheet is shown in Figure 5-4.
Figure 5-4: The Data sheet
Headings are in row one. Dates go in column A and the metrics fill out the columns to the right. The sample application will handle up to 25 metrics and up to 1,000 days of data. These limits are arbitrary and can easily be increased.
Changing the data is also simple. The new data must be arranged the same way, with headings and dates. Just select and clear the entire sheet. Don't delete the columns, just clear them. Then paste your data onto the sheet starting in cell A1. The formulas on Workarea do the rest. Everything updates automatically.
All the logic and calculations are on the Workarea sheet. A few conventions are used to make this sheet easier to understand. Some ranges on this sheet are linked to the Data sheet, and they are in blue font. Named ranges used on the Display sheet have a grey background. Ranges that are used together, but not named, have a border. An example is the cells used to populate the chart. All the intermediate calculations are in column A and the results are named. Values set by controls on the Display sheet are bold and in red font.
The layout of Workarea is shown in Figure 5-5.
Item 1 is the range that populates the chart on the Display sheet. The first two columns reference the Data sheet. Item 2 is a named range called DisplayData. This data appears in the upper-left part of the Display sheet. Item 3 is also a named range. It is called ScrollArea and feeds the scrolling area on Display.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Customizing the Application
The design of the application makes it easy to change. Using separate worksheets to isolate the data, the logic, and the display means that they can be modified without disturbing each other. First we will rearrange the Display sheet.
Rearranging the elements on the Display sheet is just a matter of dragging the controls and chart to new locations. The areas that reference named ranges can be cut and pasted without interfering with their formulas. By using these techniques and changing patterns and borders, the Display sheet can be rearranged to look like Figure 5-16 in a couple of minutes.
Figure 5-16: A new Display sheet
In Statistical Process Control, it is considered significant if the metric being measured has been on the same side of the average for three consecutive days. This could mean that a trend has started. A serious problem can start slowly. So, we want to be alerted if the last three days were higher or lower than average even if we are still within the control limits.
We start on the Workarea sheet, as in Figure 5-17.
Figure 5-17: Adding new Logic to Workarea
The last three days are in rows 19 through 21 in column D. In cell H19 we enter =IF(D19>G19,1,(IF(D19<G19,-1,0))). This is a nested IF function that returns 1 if the metric is above average, -1 if it is below, or 0 if it is equal to the average. The formula is filled down to cell H21. If the sum of H19:H21 is 3 or -3, the metric has been on the same side of the average for three days in a row. In cell A60 another nested IF function builds the alert. A result of blank is returned if there is no alert. This cell is named Alert.
On the Display sheet in M8, just above the upper-right corner of the chart, the formula =Alert is entered. The cell is formatted bold and red. If there is no alert, the value is blank so the user sees nothing.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 6: Monitoring Complex Systems
Content preview·Buy PDF of this chapter|