6.2. Strip Mining the S&P 500

Regression is the main statistical technique used to quantify the relationship between two or more variables. Its mathematical foundation, the method of least squares, was published by Adrien-Marie Legendre in 1805. A regression analysis would show a positive relationship between height and weight, for example. If we threw in waistline along with height, we'd get an even better regression to predict weight.

The standard measure of the accuracy of a regression is called R-squared: the fraction of the variation in the quantity being predicted that is explained by the predictors. A perfect relationship, with no error, would have an R-squared of 1.00, or 100 percent. A strong relationship, like height and weight, would have an R-squared of around 70 percent. A meaningless relationship, like zip code and weight, would have an R-squared of zero.
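To make the arithmetic concrete, here is a minimal sketch of fitting a regression line and computing its R-squared. The height and weight numbers are invented purely for illustration, and the use of NumPy's polyfit for the least-squares line is this sketch's choice, not anything from the original analysis:

    import numpy as np

    # Hypothetical height (inches) and weight (pounds) observations,
    # made up solely to illustrate the calculation.
    height = np.array([63, 65, 66, 68, 69, 70, 71, 72, 74, 75])
    weight = np.array([127, 148, 139, 155, 152, 168, 160, 175, 188, 182])

    # Ordinary least-squares fit: weight = slope * height + intercept.
    slope, intercept = np.polyfit(height, weight, deg=1)
    predicted = slope * height + intercept

    # R-squared = 1 - (unexplained variation / total variation).
    ss_residual = np.sum((weight - predicted) ** 2)
    ss_total = np.sum((weight - weight.mean()) ** 2)
    r_squared = 1 - ss_residual / ss_total

    print(f"R-squared: {r_squared:.2f}")

With data this tidy the fit comes out high; real height-and-weight samples are noisier, which is why the figure quoted above is closer to 70 percent.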

With this background, we can get down to some serious data mining. First, we need some data to mine. We'll use the annual closing price of the S&P 500 index for the 10 years from 1983 to 1993, shown in Figure 6.1.

This is the raw data: the S&P 500 over the period, the series we are going to predict in the sense of "maximizing predictability" discussed at the end of the previous chapter. Now we want to go into the data mine and find some data to use to predict the stock index. If we included other U.S. stock market indexes such as the Dow Jones Industrial Average or the Russell 1000, we would see very good fits, with R-squared values close to 1.00. That would be an uninspired choice, though, and useless for making the point about the hazards of data mining.
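A quick illustration of why that choice would be uninspired: broad U.S. indexes move almost in lockstep, so regressing one on another yields an R-squared near 1.00 without telling us anything we didn't already know. The sketch below fakes this with two synthetic series built from a shared random walk; the series, the random seed, and the 120-point length are all assumptions made for the example:

    import numpy as np

    rng = np.random.default_rng(0)

    # Two synthetic "index" levels sharing one common driver, standing in
    # for near-duplicates like the S&P 500 and the Dow. Illustrative only.
    common = np.cumsum(rng.normal(0.0, 1.0, 120))
    index_a = 100 + common + rng.normal(0.0, 0.2, 120)
    index_b = 100 + common + rng.normal(0.0, 0.2, 120)

    # Regress one index on the other and compute R-squared.
    slope, intercept = np.polyfit(index_b, index_a, deg=1)
    fitted = slope * index_b + intercept
    ss_res = np.sum((index_a - fitted) ** 2)
    ss_tot = np.sum((index_a - index_a.mean()) ** 2)
    print(f"R-squared: {1 - ss_res / ss_tot:.3f}")  # very close to 1.00

A near-perfect fit, and a perfectly useless one: the second index is essentially a relabeled copy of the first.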

Now we need some ...
