The previous chapter introduced data mining ideas using techniques well suited to databases, such as look-alike models, lookup tables, and naïve Bayesian models. This chapter extends these ideas to the most traditional statistical modeling technique: linear regression and best-fit lines.
Unlike the techniques in the previous chapter, linear regression requires that the input and target variables all be numeric. The results of the regression are coefficients in a mathematical formula. A formal treatment of linear regression involves a good deal of mathematics and many proofs, but this chapter steers away from an overly theoretical approach.
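To make "coefficients in a mathematical formula" concrete, here is a minimal sketch of the one-input case, where the formula is y = slope * x + intercept. It is written in Python purely for illustration and is not the method the chapter itself uses. The function name `best_fit_line` and the sample values are hypothetical, and the calculation is the standard least-squares solution for a single input.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points.

    Hypothetical helper for illustration; assumes numeric inputs only.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up sample data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = best_fit_line(xs, ys)
print(slope, intercept)
```

The two returned numbers are the whole model: applying it to a new input value is just a multiplication and an addition.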
In addition to providing a basis for statistical modeling, linear regression has many applications. Regressions—especially best-fit lines—are a great way to investigate relationships between different numeric quantities. The examples in this chapter include estimating potential product penetration in zip codes, studying price elasticity (investigating the relationship between product prices and sales volumes), and quantifying the effect of the initial monthly fee on yearly stop rates.
The simplest linear regression models are best-fit lines that have one input and one target. Such two-variable models are readily understood visually, using scatter plots. In fact, Excel builds linear regression models into charts via the best-fit trend line, one of six built-in types of trend ...