In Chapter 1, where we briefly explored the realms of machine learning, we began with linear regression because it is probably something you have come across at some point in your mathematical training. The process is fairly intuitive and easier to explain as a first concept than some other machine learning models. Additionally, many areas of data analysis rely on regression modeling, ranging from a business trying to forecast its profits to the frontiers of science trying to uncover the laws governing the universe. We can find regression in any scenario in which a numeric prediction is needed. In this chapter, we examine how to use regression modeling in R in depth, but we also explore some caveats and pitfalls to be aware of in the process.
The main motivation behind regression is to build an equation by which we can learn more about our data. There is no hard-and-fast rule about which type of regression model to fit to your data, however. Choosing between a logistic regression, linear regression, or multivariate regression model depends on the problem and the data that you have. You could fit a straight line to a given series of data points, but is that always the best choice? Ideally, we are after a balance of simplicity and explanatory power. A straight line fit to a complex series of data might be simple, but might not describe the whole picture. On the other hand, taking a very simple set of data that is basically a straight line and fitting a model with all sorts of wacky curves to it might give you a very high degree of accuracy, but leave very little room for new data points to fit the model well.
You might recall from your high school mathematics education having a couple of points of data and fitting a line through them. This fit to data is the easiest form of machine learning and is often used without our realizing it is a type of machine learning. Although fitting a line to two data points is relatively easy to learn, fitting a line to three or more data points becomes a task better suited for computers from a computational standpoint. Simply adding one more data point (or, as we'll see, several dozen more) makes the problem much more difficult to solve by hand. But through the mathematical techniques that power mainstream machine learning models, we can solve those kinds of problems very easily. R makes a lot of these steps quite simple to compute, and this chapter provides a foundation for assessing questions about where we draw the line between model complexity and accuracy.
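In fact, for a single predictor, the slope and intercept that a fitting routine finds can be computed directly from the classic least-squares formulas. Here is a minimal sketch in base R (the helper name fit_line is ours, not part of any package), checked against what lm() reports:

```r
# Ordinary least squares for a single predictor:
# slope m = cov(x, y) / var(x), intercept b = mean(y) - m * mean(x)
fit_line <- function(x, y) {
  m <- cov(x, y) / var(x)
  b <- mean(y) - m * mean(x)
  c(intercept = b, slope = m)
}

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit_line(x, y)

# The same numbers R's built-in fitting function would report:
coef(lm(y ~ x))
```

The (n - 1) denominators inside cov() and var() cancel in the ratio, so this matches the textbook sums-of-squares formulas exactly.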
In Chapter 1, we briefly encountered linear regression with an example of the mtcars dataset. In that example, we determined a linear relationship of fuel efficiency as a function of vehicle weight and saw the trend go downward. We extracted coefficients for a linear mathematical equation and dusted our hands. Yet there is a lot more beyond simply slapping an equation onto a bunch of data and calling it a day. Let's revisit our mtcars example (Figure 4-1):
model <- lm(mtcars$mpg ~ mtcars$disp)

plot(y = mtcars$mpg, x = mtcars$disp,
     xlab = "Engine Size (cubic inches)",
     ylab = "Fuel Efficiency (Miles per Gallon)",
     main = "Fuel Efficiency From the `mtcars` Dataset")

abline(a = coef(model)[1], b = coef(model)[2], lty = 2)
Let’s revisit our mtcars example (Figure 4-1), where we model the fuel efficiency (mpg) as a function of engine size (disp), and then look at the outputs of the model with the summary() function:
summary(model)
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$disp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.8922 -2.2022 -0.9631  1.6272  7.2305
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## mtcars$disp -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10
There is a wealth of information dumped out by the summary() function call on this linear model object. Generally, the one number people look at first for a baseline accuracy assessment is the multiple R-squared value: the closer that value is to 1, the more accurate the linear regression model is. There are a lot of other terms in this output, though, so let's walk through each element to gain a solid understanding:
Call: This displays the formulaic function call we used. In this case, we used one response variable, mpg, modeled as a function of one independent variable, disp, both taken from the mtcars data frame.
Residuals: Residuals are a measure of the vertical distance from each data point to the fitted line in our model. Here, we have summary statistics of those vertical distances for all of our points relative to the fitted line. The smaller these values are, the better the fit.
Estimate: These are the estimates for the coefficients of our linear equation. Our equation in this case would be y = -0.04x + 29.59.
Std. Error: With those coefficients come error estimates, given by the Std. Error column of the coefficients table. In reality, our equation would be something like $y=(-0.04\pm 0.005)x+(29.59\pm 1.23)$.
t value: This measures the size of the coefficient relative to the variation in our data. This value is linked with p-values, but p-values are used far more frequently.
Pr(>|t|): p-values are statistical assessments of significance. The inner workings of p-values are more complicated than this, but for our purposes a p-value of less than 0.05 means that we can take the number as being statistically significant. If the number in question has a p-value greater than 0.05, we should err on the side of it not being statistically significant. The star ratings next to the p-values are explained by the significance codes that follow.
Residual standard error: This error estimate pertains to the standard deviation of the residuals.
Multiple R-squared: This is the R-squared value for when we have multiple predictors. It isn't totally relevant for our single-variable example, but when we add more predictors to the model, the multiple R-squared will invariably go up, because any feature we add will explain some part of the variance, whether the effect is real or not.
Adjusted R-squared: To counteract the bias introduced by a constantly increasing R-squared as predictors are added, the adjusted R-squared penalizes for the number of features and tends to be a better representation of a model's accuracy when there are multiple features.
F-statistic: Finally, the F-statistic is the ratio of the variance explained by the parameters in the model to the unexplained variance.
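All of these quantities can also be pulled out of the summary object programmatically rather than read off the printout. A short sketch; the adjusted R-squared check at the end uses the standard formula $1-(1-R^2)(n-1)/(n-p-1)$, where n is the number of observations and p the number of predictors:

```r
model <- lm(mpg ~ disp, data = mtcars)
s <- summary(model)

s$coefficients   # the Estimate / Std. Error / t value / Pr(>|t|) table
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
s$sigma          # residual standard error

# Adjusted R-squared is derived from R-squared, the number of
# observations n, and the number of predictors p:
n <- nrow(mtcars)
p <- 1
1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)  # matches s$adj.r.squared
```

Extracting these values directly is handy when you want to compare many models in a loop rather than eyeballing printouts.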
This simple linear example has some decent explanatory power. We have determined a relationship between fuel efficiency and engine size. Oftentimes, this is where simple linear regression examples exhaust their usefulness. The things we are most after in this specific case are the slope and intercept. If this example were applied to sales over time, for example, our output from this modeling exercise would be a starting value for the intercept, and a growth rate for the coefficient.
Suppose that you want to build a more robust model of fuel efficiency with more variables built into it. The fuel efficiency of a vehicle is a complex phenomenon with many contributing factors beyond engine size, so to find the features that most accurately drive the model's behavior, you want to use regression as you have been, but in a multivariate context.
Recall that our simple linear regression example was based on the form:

y = mx + b

where the coefficients are the intercept, b, and the slope, m, tied to the one variable we had in the model. If you want to bring in more factors that contribute to the model, change the mathematical form to:

y = m_{1}x_{1} + m_{2}x_{2} + m_{3}x_{3} + (…) + b

where x_{1}, x_{2}, x_{3}, and so forth, are different features in the model, such as a vehicle's weight, engine size, number of cylinders, and so on. Because the new objective is to find coefficients for a model of the form y = f(x_{1}, x_{2}, x_{3}, (…)), you need to revisit the call to the lm() function in R:
lm.wt <- lm(mpg ~ disp + wt, data = mtcars)

summary(lm.wt)
##
## Call:
## lm(formula = mpg ~ disp + wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4087 -2.3243 -0.7683  1.7721  6.3484
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.96055    2.16454  16.151 4.91e-16 ***
## disp        -0.01773    0.00919  -1.929  0.06362 .
## wt          -3.35082    1.16413  -2.878  0.00743 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.917 on 29 degrees of freedom
## Multiple R-squared:  0.7809, Adjusted R-squared:  0.7658
## F-statistic: 51.69 on 2 and 29 DF,  p-value: 2.744e-10
This code extends the linear modeling from earlier to include the vehicle's weight in the model-fitting procedure. In this case, what you see is that the adjusted R-squared has gone up slightly, from 0.709 when you fit a model of just the engine size, to 0.7658 after including the vehicle's weight in the fit. However, notice that the statistical relevance of the previous feature has gone down considerably. Before, the p-value of the disp feature was far below the 0.05 threshold for a p-value to be significant; now it's 0.06. This might be due to the vehicle's fuel efficiency being more sensitive to changes in vehicle weight than to engine size.
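A likely culprit for disp's drop in significance is that the two predictors overlap: heavier cars tend to have bigger engines, so the two features carry much of the same information. A quick sketch of how you might check this correlation in base R:

```r
# disp and wt are strongly correlated, so once wt is in the model,
# disp has little independent variance left to explain.
cor(mtcars$disp, mtcars$wt)
```

A correlation near 0.9 between two predictors is a classic sign of multicollinearity, which inflates coefficient standard errors and can make individually informative features look insignificant together.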
If you want to extend this analysis further, you can bring in another feature from the dataset and see how the R-squared of the model and the p-values of the coefficients change accordingly:
lm.cyl <- lm(mpg ~ disp + wt + cyl, data = mtcars)

summary(lm.cyl)
##
## Call:
## lm(formula = mpg ~ disp + wt + cyl, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.4035 -1.4028 -0.4955  1.3387  6.0722
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.107678   2.842426  14.462 1.62e-14 ***
## disp         0.007473   0.011845   0.631  0.53322
## wt          -3.635677   1.040138  -3.495  0.00160 **
## cyl         -1.784944   0.607110  -2.940  0.00651 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.595 on 28 degrees of freedom
## Multiple R-squared:  0.8326, Adjusted R-squared:  0.8147
## F-statistic: 46.42 on 3 and 28 DF,  p-value: 5.399e-11
This code takes the same approach as before, but adds the engine's cylinder count to the model. Notice that the adjusted R-squared has increased yet again, from 0.7658 to 0.8147. However, the statistical relevance of the displacement feature is now basically defunct, with a p-value of 0.53322, more than 10 times the 0.05 threshold. This tells us that fuel efficiency is tied more to the combined feature set of vehicle weight and number of cylinders than it is to engine size. You can rerun this analysis with just the statistically relevant features:
lm.cyl.wt <- lm(mpg ~ wt + cyl, data = mtcars)

summary(lm.cyl.wt)
##
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.2893 -1.5512 -0.4684  1.5743  6.1004
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8185
## F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12
By removing the statistically irrelevant feature from the model, you have more or less preserved the accuracy (an adjusted R-squared of 0.8185 versus 0.8147), while keeping only the features that are relevant to the data.
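Dropping disp can also be justified more formally. R's anova() function compares two nested models with an F-test; a sketch (not from the original text) comparing the two- and three-feature fits, refit here so the chunk stands alone:

```r
lm.cyl.wt <- lm(mpg ~ wt + cyl, data = mtcars)         # reduced model
lm.cyl    <- lm(mpg ~ wt + cyl + disp, data = mtcars)  # full model

# A large p-value here means disp adds no significant explanatory
# power once wt and cyl are already in the model.
anova(lm.cyl.wt, lm.cyl)
```

Because only one parameter is added, the F-test p-value matches the 0.53322 seen for disp in the three-feature summary, confirming that the simpler model loses essentially nothing.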
You should take care when adding features to the data, however. In R, you can easily model a response against all of the features in the data by calling the lm() function with the following form:
lm.all <- lm(mpg ~ ., data = mtcars)

summary(lm.all)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4506 -1.6044 -0.1196  1.2193  4.6271
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337   18.71788   0.657   0.5181
## cyl         -0.11144    1.04502  -0.107   0.9161
## disp         0.01334    0.01786   0.747   0.4635
## hp          -0.02148    0.02177  -0.987   0.3350
## drat         0.78711    1.63537   0.481   0.6353
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739
## vs           0.31776    2.10451   0.151   0.8814
## am           2.52023    2.05665   1.225   0.2340
## gear         0.65541    1.49326   0.439   0.6652
## carb        -0.19942    0.82875  -0.241   0.8122
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869, Adjusted R-squared:  0.8066
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
This syntax creates a linear model with the dependent variable mpg modeled against everything in the dataset, as denoted by the . mark in the function call. The problem with this approach, however, is that you see very little statistical value in the coefficients of the model. Likewise, the standard error for each of the coefficients is very high, so pinning down an exact value for each coefficient is very difficult. Instead of this top-down approach to seeing which features are the most important in the dataset, it is better to approach it from the bottom up, as we have done thus far.
Although the theme of feature selection itself is a very broad topic—one
which we explore in depth with other machine learning algorithms—we can mitigate some of these problems in a couple of ways:
Pick features to add to the model one at a time and cut the ones that are statistically insignificant. We accomplished this in the preceding code chunks by adding one parameter at a time and checking whether the p-value of that parameter in the model output is statistically significant.
Keep all of the features but mathematically reduce the coefficients of the less important ones to minimize their impact on the model.
Regularization can be a tough concept mathematically, but in principle it’s fairly straightforward. The idea is that you want to include as many of the features in your data as you can squeeze into the final model. The more features, the better you can explain all the intricacies of the dataset. The catch here is that the degree to which each feature explains part of the model, after regularization is applied, can be quite different.
Through the use of regularization, you can make your model more succinct and reduce the noise in the dataset that might be coming from features that have little impact on what you are trying to model against.
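At the heart of lasso-style regularization is a simple "soft-thresholding" operation: each coefficient is shrunk toward zero by a penalty amount, and any coefficient whose magnitude is below the penalty is set to exactly zero. A minimal sketch of that operator (our own illustration, not code from any package):

```r
# Soft-thresholding: shrink z toward zero by lambda, zeroing out
# anything whose magnitude is below lambda. This is the update rule
# at the core of coordinate-descent lasso solvers.
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

soft_threshold(c(3, -0.5, 1.2, -4), lambda = 1)
# Large coefficients survive (shrunk toward zero); small ones drop out.
```

This is why lasso regression produces exact zeros for unimportant features, rather than merely small values as ridge-style penalties do.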
Let’s see what the linear model for the mtcars dataset would look like if we included all the features. From the lm.all fit, we would have an equation like this:

mpg = 12.30 - 0.11cyl + 0.01disp - 0.02hp + 0.79drat - 3.72wt + 0.82qsec + 0.32vs + 2.52am + 0.66gear - 0.20carb
According to this linear equation, fuel efficiency is most sensitive to the weight of the vehicle (the -3.72wt term), given that this coefficient has the largest magnitude. However, most of the coefficients are within an order of magnitude or so of one another. Regularization would keep all of the features, but the less important ones would have their coefficients scaled down much further.
To utilize this regularization technique, you call a particular type of regression modeling, known as a lasso regression, as shown here:
library(lasso2)

lm.lasso <- l1ce(mpg ~ ., data = mtcars)

summary(lm.lasso)$coefficients
##                   Value  Std. Error     Z score   Pr(>|Z|)
## (Intercept) 36.01809203 18.92587647  1.90311355 0.05702573
## cyl         -0.86225790  1.12177221 -0.76865686 0.44209704
## disp         0.00000000  0.01912781  0.00000000 1.00000000
## hp          -0.01399880  0.02384398 -0.58709992 0.55713660
## drat         0.05501092  1.78394922  0.03083659 0.97539986
## wt          -2.68868427  2.05683876 -1.30719254 0.19114733
## qsec         0.00000000  0.75361628  0.00000000 1.00000000
## vs           0.00000000  2.31605743  0.00000000 1.00000000
## am           0.44530641  2.14959278  0.20715850 0.83588608
## gear         0.00000000  1.62955841  0.00000000 1.00000000
## carb        -0.09506985  0.91237207 -0.10420075 0.91701004
This code calls the l1ce() function from the lasso2 package on the mtcars dataset. It uses the same formula notation as before: we want the fuel efficiency variable mpg modeled as a function of all the other variables in the dataset. Built into lasso regression is the regularization technique, which is applied during the heavy mathematical lifting part of the algorithm. The regularization part of the regression scales the coefficients according to how much actual impact they have on the model in a more statistical fashion. In some cases, this can result in some features being scaled down to such a low value that they are approximated as zero. As a result of this regression modeling, you now have a different equation:

mpg = 36.02 - 0.86cyl + 0.00disp - 0.01hp + 0.06drat - 2.69wt + 0.00qsec + 0.00vs + 0.45am + 0.00gear - 0.10carb
Or, more simply, dropping the zeroed-out terms:

mpg = 36.02 - 0.86cyl - 0.01hp + 0.06drat - 2.69wt + 0.45am - 0.10carb
The most important feature before the change to a lasso regression was the vehicle's weight, wt, which has remained unchanged as far as its relative importance. Even though the coefficient has changed somewhat, the fact that it has the highest-magnitude coefficient remains the same. The less useful features that have been scaled down—in this case to zero—are features that you would probably think have little impact on fuel efficiency to begin with. Quarter-mile drag race time (qsec), engine configuration in terms of a V shape or a straight-line shape (vs), and number of forward gears (gear) have all been rescaled down to zero.
However, the displacement variable showed a clear relationship to fuel efficiency earlier. Its being scaled down to zero does not mean there is no relationship whatsoever between that one variable and our response; rather, when taken together with all the other variables in the dataset, its importance is negligible. Remember, in this case we are interested in a model of all the features, not necessarily the importance of just one feature.
Notice from the new lasso regression model that some of the coefficients have been more or less mathematically eliminated from the model. To further refine the model and reduce the number of features in it, you can rerun the regression without those features and see what changes:
lm.lasso2 <- l1ce(mpg ~ cyl + hp + wt + am + carb, data = mtcars)

summary(lm.lasso2)$coefficients
##                     Value Std. Error     Z score     Pr(>|Z|)
## (Intercept) 31.2819166926 4.51160542  6.93365527 4.100942e-12
## cyl         -0.7864202230 0.86107128 -0.91330444 3.610824e-01
## hp          -0.0009037003 0.02343634 -0.03855979 9.692414e-01
## wt          -1.9248597501 1.38749433 -1.38729198 1.653527e-01
## am           0.0000000000 2.22143917  0.00000000 1.000000e+00
## carb         0.0000000000 0.67947216  0.00000000 1.000000e+00
With the reduced dataset then passed into another lasso regression, you can see that the transmission type of the car, am, and the number of carburetors, carb, have both dropped to zero. By removing these features and rerunning, you can see whether any more drop out:
lm.lasso3 <- l1ce(mpg ~ cyl + hp + wt, data = mtcars)

summary(lm.lasso3)$coefficients
##                  Value Std. Error    Z score  Pr(>|Z|)
## (Intercept) 30.2106931 1.97117597 15.3262284 0.0000000
## cyl         -0.7220771 0.82941877 -0.8705821 0.3839824
## hp           0.0000000 0.01748364  0.0000000 1.0000000
## wt          -1.7568469 1.07478525 -1.6346028 0.1021324
In this case, the horsepower of the car, hp, has now dropped out. You can continue this procedure as long as you have multiple features to test against:
lm.lasso4 <- l1ce(mpg ~ cyl + wt, data = mtcars)

summary(lm.lasso4)$coefficients
##                  Value Std. Error   Z score  Pr(>|Z|)
## (Intercept) 29.8694933  1.4029760 21.290096 0.0000000
## cyl         -0.6937847  0.5873288 -1.181254 0.2375017
## wt          -1.7052064  1.0720172 -1.590652 0.1116879
The final result is a model that has only two features instead of the 10 you started with:

mpg = 29.87 - 0.69cyl - 1.71wt
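To sanity-check the final two-feature model, you can plug values into the fitted equation by hand. A sketch using the coefficients from the lm.lasso4 output above (the helper name predict_mpg is ours):

```r
# Fitted lasso model: mpg = 29.87 - 0.69*cyl - 1.71*wt
# (coefficients taken from the lm.lasso4 summary above)
predict_mpg <- function(cyl, wt) {
  29.8694933 - 0.6937847 * cyl - 1.7052064 * wt
}

# A 6-cylinder car weighing 3,000 lbs (wt is in units of 1,000 lbs):
predict_mpg(cyl = 6, wt = 3)
```

This kind of hand evaluation is a useful habit: if the prediction for a typical car lands far outside the observed mpg range, something went wrong in the fit or in your units.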
Polynomial regression is simply fitting a higher-degree function to the data. Previously, we've seen fits to our data of the following form:

y = m_{1}x_{1} + m_{2}x_{2} + (…) + b

Polynomial regression differs from the simple linear case by allowing multiple degrees of each feature in the dataset, a form that could be represented as follows:

y = m_{1}x + m_{2}x^{2} + m_{3}x^{3} + (…) + b
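In R, the poly() function builds these higher-degree terms for you. By default it uses orthogonal polynomials (which are numerically stabler), but with raw = TRUE the columns are literally x, x^2, x^3, and so on; a quick sketch:

```r
x <- 1:5

# raw = TRUE gives the plain powers of x:
raw <- poly(x, 2, raw = TRUE)
all(raw[, 1] == x)    # first column is x itself
all(raw[, 2] == x^2)  # second column is x squared

# The default orthogonal columns are instead rescaled so that each
# is uncorrelated with the lower-degree terms:
orth <- poly(x, 2)
sum(orth[, 1] * orth[, 2])  # dot product is (numerically) zero
```

Either basis produces the same fitted curve; the difference shows up only in how the individual coefficients read.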
The following example will help with our reasoning (Figure 4-2):
pop <- data.frame(uspop)
pop$uspop <- as.numeric(pop$uspop)
pop$year <- seq(from = 1790, to = 1970, by = 10)

plot(y = pop$uspop, x = pop$year,
     main = "United States Population From 1790 to 1970",
     xlab = "Year", ylab = "Population")
Here, we have a built-in dataset in R that we've tweaked slightly for demonstration purposes. Normally, uspop is a time-series object that has its own plotting criteria, but here we've tuned it to plot just the data points. This data is the population of the United States, in millions, in 10-year periods from 1790 to 1970. Let's begin by fitting a linear model to the data:
lm1 <- lm(pop$uspop ~ pop$year)

summary(lm1)
##
## Call:
## lm(formula = pop$uspop ~ pop$year)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -19.569 -14.776  -2.933   9.501  36.345
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.958e+03  1.428e+02  -13.71 1.27e-10 ***
## pop$year     1.079e+00  7.592e-02   14.21 7.29e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.12 on 17 degrees of freedom
## Multiple R-squared:  0.9223, Adjusted R-squared:  0.9178
## F-statistic: 201.9 on 1 and 17 DF,  p-value: 7.286e-11
This simple linear fit of the data seems to work pretty well. The p-values of the estimates are very low, indicating good statistical significance. Likewise, the R-squared values are both very good. However, the residuals show a pretty wide degree of variability, off by as much as 36, as demonstrated in Figure 4-3:
plot(y = pop$uspop, x = pop$year,
     main = "United States Population From 1790 to 1970",
     xlab = "Year", ylab = "Population")

abline(a = coef(lm1)[1], b = coef(lm1)[2], lty = 2, col = "red")
The dotted-line fit from the linear model seems to do OK. It fits some of the data better than other parts, but it's pretty clear from the data that the relationship isn't exactly linear. Moreover, we know from intuition that population growth over time tends to be more exponential in shape than a straight line. What you want to do next is see how a model of a higher degree stacks up against the linear case, which is the lowest-order polynomial that you can fit:
lm2 <- lm(pop$uspop ~ poly(pop$year, 2))

summary(lm2)
##
## Call:
## lm(formula = pop$uspop ~ poly(pop$year, 2))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.5997 -0.7105  0.2669  1.4065  3.9879
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         69.7695     0.6377  109.40  < 2e-16 ***
## poly(pop$year, 2)1 257.5420     2.7798   92.65  < 2e-16 ***
## poly(pop$year, 2)2  73.8974     2.7798   26.58 1.14e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.78 on 16 degrees of freedom
## Multiple R-squared:  0.9983, Adjusted R-squared:  0.9981
## F-statistic:  4645 on 2 and 16 DF,  p-value: < 2.2e-16
This code calls the lm() function again, but this time with an extra wrapper around the independent variable: the poly() function. This function takes the date data and computes an orthogonal basis from it, scaled appropriately. By default, the poly() function doesn't change the underlying values of the date data, but you can use it to see whether a higher-degree fit yields any better results than the lower-order fit you did previously. Recall that the linear fit is technically a polynomial too, but of degree 1. As an equation, here is the resultant model fit (where z is the orthogonalized year variable):

y = 69.77 + 257.54z + 73.90z^{2}
Let’s slowly walk through the summary() output. Looking at the residuals gives us a bit of relief: no residuals in the range of 30! Smaller residuals are always better in terms of model fit. The coefficients table now has three entries: one for the intercept, one for the first-degree term, and now one for the second-degree term. When you called poly(pop$year, 2), you instructed R that you wanted a polynomial of the date data with the highest degree being 2. Going back to the coefficients table, you can see that all of the p-values are statistically significant, which is also a good indication that this is a solid model fit to your data (see Figure 4-4):
plot(y = pop$uspop, x = pop$year,
     main = "United States Population From 1790 to 1970",
     xlab = "Year", ylab = "Population")

pop$lm2.predict = predict(lm2, newdata = pop)

lines(sort(pop$year), fitted(lm2)[order(pop$year)], col = "blue", lty = 2)
From Figure 4-4, it looks pretty obvious that the higher-degree polynomial (in this case a quadratic equation) fits the data better. Clearly, using higher-degree polynomials works better than lower-degree ones, right? What happens if you fit a third-degree polynomial? Or something higher still? You might bet that a sixth-degree polynomial would give a very accurate model indeed. The simple linear fit from earlier fit the data as best it could, but the second-degree polynomial (i.e., a simple quadratic) fit better. A better way to distinguish among higher-order polynomial fits is to look at plots of each model's residuals.
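The comparison that follows relies on higher-degree models lm3 through lm6, whose fitting code isn't shown in this excerpt; a sketch, assuming each is fit exactly like lm2 but with a higher degree passed to poly():

```r
# pop as constructed earlier in the chapter
pop <- data.frame(uspop)
pop$uspop <- as.numeric(pop$uspop)
pop$year <- seq(from = 1790, to = 1970, by = 10)

# Fit third- through sixth-degree polynomials of year to the
# population data, mirroring the lm2 fit above.
lm3 <- lm(pop$uspop ~ poly(pop$year, 3))
lm4 <- lm(pop$uspop ~ poly(pop$year, 4))
lm5 <- lm(pop$uspop ~ poly(pop$year, 5))
lm6 <- lm(pop$uspop ~ poly(pop$year, 6))
```

Each model simply adds one more orthogonal polynomial term to the previous one, so their fits are directly comparable.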
Figure 4-5 shows the linear fit compared to polynomials of increasing degree. The polynomial fits are difficult to separate in terms of how well they fit, but all seem to fit better than the linear case. To compare models that are visually this close, you need to dive into plots of their residuals instead, as demonstrated in Figure 4-6:
par(mfrow = c(2, 3))

plot(resid(lm1), main = "Degree 1", xlab = "Sequential Year", ylab = "Fit Residual")
plot(resid(lm2), main = "Degree 2", xlab = "Sequential Year", ylab = "Fit Residual")
plot(resid(lm3), main = "Degree 3", xlab = "Sequential Year", ylab = "Fit Residual")
plot(resid(lm4), main = "Degree 4", xlab = "Sequential Year", ylab = "Fit Residual")
plot(resid(lm5), main = "Degree 5", xlab = "Sequential Year", ylab = "Fit Residual")
plot(resid(lm6), main = "Degree 6", xlab = "Sequential Year", ylab = "Fit Residual")
Recall that a residual is the vertical distance between a data point and the fitted model line. A model that fits the data points exactly would have a residuals plot that is as close to a flat line as possible. In the case of the linear fit, the scale of the residuals plot is much larger than the rest, and you can see that the linear fit has some pretty bad residual distances at its start, halfway point, and end. This is not an ideal model. On the other hand, the higher-degree polynomials seem to do pretty well. The scale of their residual plots is much nicer, and the one that really stands out is the sixth-degree polynomial fit at the end. Its residuals plot is pretty much zero to start, and then it becomes a little more error-prone.
This is all well and good, but it might be easier to rank the model fits by looking at their residuals numerically:
c(sum(abs(resid(lm1))), sum(abs(resid(lm2))), sum(abs(resid(lm3))),
  sum(abs(resid(lm4))), sum(abs(resid(lm5))), sum(abs(resid(lm6))))
## [1] 272.51432 33.77224 34.54039 36.95125 25.45242 19.59938
This code sums the absolute values of each model's residuals. If you took the raw sum of the residuals instead, you would get an inaccurate picture, because some residuals are negative and would cancel out positive ones. The total residual for the linear fit is quantitatively bad compared to the rest of the models, with the sixth-degree polynomial being the clear winner in terms of the closest fit to the data points.
But is the best fit to the data points actually the best model? We must take into account the ideas of overfitting and underfitting the data. The linear model fit to the data in the previous case would be a good example of an underfit scenario. Clearly there’s some structure in the data that isn’t being explained by a simple linear fit. On the other hand, a model can be overfit if it is too specific to the data presented and offers little explanatory power for any new data that might come into the system. This is the risk you run by increasing the degree of polynomial models.
We have just run an example of trying to get a model that has the best fit to our data. This is a good goal to have, but you need to be careful not to go too far in fitting perfectly to the data. We have seen so far that the linear fit to a population curve probably isn’t the best model for the job. A quadratic or a cubic polynomial fit seems to do much better by comparison. Yet, is it worth it to keep increasing the degree of the model fit? Is the minimization of the residual the only goal in terms of selecting the best model for the job?
In statistics, the root-mean-square error (RMSE) is a quantifiable way to see how our model's predicted values stack up against actual values. Mathematically, the RMSE is given as follows:

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$

where $\hat{y}_i$ are the model's predicted values and $y_i$ are the actual values.
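That formula translates directly into a couple of lines of R; a quick sketch (the helper name rmse is ours):

```r
# Root-mean-square error: average the squared prediction errors,
# then take the square root to get back to the original units.
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

rmse(predicted = c(1, 2, 4), actual = c(1, 2, 3))
```

Because the errors are squared before averaging, a single large miss is penalized much more heavily than several small ones.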
To assess the polynomial fits, you can perform an RMSE analysis on each one, compare the resulting errors, and select the model with the lowest result. To do so, you need new data that isn't in your model. For that, let's use US population census data from 1980 to 2010:
uspop.2020 <- data.frame(year = c(1980, 1990, 2000, 2010),
                         uspop = c(226.5, 249.6, 282.2, 309.3))

uspop.2020.predict <- uspop.2020

pop2 <- data.frame(uspop)
pop2$uspop <- as.numeric(pop2$uspop)
pop2$year <- seq(from = 1790, to = 1970, by = 10)
This code also reinitializes the old population data for prediction purposes as a general cleanup measure. From there, you can do your usual prediction routine, and then calculate the RMSE for each polynomial:
uspop.2020.predict$lm1 <- predict(lm(uspop ~ poly(year, 1), data = pop2), uspop.2020)

uspop.2020.predict$lm2 <- predict(lm(uspop ~ poly(year, 2), data = pop2), uspop.2020)

uspop.2020.predict$lm3 <- predict(lm(uspop ~ poly(year, 3), data = pop2), uspop.2020)

uspop.2020.predict$lm4 <- predict(lm(uspop ~ poly(year, 4), data = pop2), uspop.2020)

uspop.2020.predict$lm5 <- predict(lm(uspop ~ poly(year, 5), data = pop2), uspop.2020)

uspop.2020.predict$lm6 <- predict(lm(uspop ~ poly(year, 6), data = pop2), uspop.2020)
And, finally, calculate the RMSE:
c(sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm1)^2)),
  sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm2)^2)),
  sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm3)^2)),
  sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm4)^2)),
  sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm5)^2)),
  sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm6)^2)))
## [1] 75.622445 8.192311 5.070814 9.153189 73.632318 124.429798
From these results, you can see that the simple linear fit has an RMSE of about 75, the second-degree polynomial about 8, and the third-degree polynomial about 5. The errors blow up after the third-degree polynomial, which is another indication that the higher-degree models were overfit to the training data. In this case, you would select the model with the lowest RMSE on the new data: the polynomial of degree 3.
If you recall, each model coefficient has an attached p-value of statistical significance from the lm() model-fitting procedure. If a coefficient's p-value is less than 0.05, it's safe to assume that it is statistically important to your model.
To help you decide which model to use, identify where the trade-off is between model accuracy and model complexity. The more complex your model is—that is, the higher the degree of polynomial used—the tighter it will fit your data, but you run the risk of some coefficients becoming less statistically valid as the model grows more complex. To see this, first look at both the R-squared and the number of statistically valid coefficients for each of your models:
table(summary(lm1)$coefficients[, 4] < 0.05)
##
## TRUE
##    2
summary(lm1)$r.squared
## [1] 0.9223434
This example takes the coefficients from the simple linear fit, lm1, and then extracts the p-values tied to those coefficients. It then tabulates how many of them are statistically valid (that is, below 0.05). The result from the simple linear case is that there are two coefficients, the slope and the intercept, and that both are statistically valid. The R-squared value also confirms that the fit is pretty good, but let's use that as a baseline for the sake of comparison.
Instead of computing this for each model and looking back and forth at the results, you can dump all this information into a handy data frame for easier readability. Let's define model.order as the highest degree of the polynomial fit (this is simply the number you pass into the poly() function during the lm() function call). You then define coef.true as the number of coefficients that are statistically valid in the model. In this case, you are looking only at the coefficients related to the dependent variables and not the model's intercept, which is statistically valid in all cases, hence you subtract 1 from the coef.true value. Next, you define a coef.false term as the opposite case: how many of the model's coefficients on the dependent variables are not statistically meaningful. Finally, you define a model.rsq value, which is the extracted R-squared model accuracy. You then put it all together in a data frame and define a final metric, goodness, which compares the ratio of statistically meaningful coefficients to the model's order:
model.order <- c(1, 2, 3, 4, 5, 6)

coef.true <- c(
  table(summary(lm1)$coefficients[, 4] < 0.05) - 1,
  table(summary(lm2)$coefficients[, 4] < 0.05) - 1,
  table(summary(lm3)$coefficients[, 4] < 0.05)[2] - 1,
  table(summary(lm4)$coefficients[, 4] < 0.05)[2] - 1,
  table(summary(lm5)$coefficients[, 4] < 0.05)[2] - 1,
  table(summary(lm6)$coefficients[, 4] < 0.05)[2] - 1)

coef.false <- c(0, 0,
  table(summary(lm3)$coefficients[, 4] < 0.05)[1],
  table(summary(lm4)$coefficients[, 4] < 0.05)[1],
  table(summary(lm5)$coefficients[, 4] < 0.05)[1],
  table(summary(lm6)$coefficients[, 4] < 0.05)[1])

model.rsq <- c(summary(lm1)$r.squared, summary(lm2)$r.squared,
               summary(lm3)$r.squared, summary(lm4)$r.squared,
               summary(lm5)$r.squared, summary(lm6)$r.squared)

model.comparison <- data.frame(model.order, model.rsq, coef.true, coef.false)
model.comparison$goodness <- (model.comparison$coef.true /
                              model.comparison$model.order)
model.comparison
## model.order model.rsq coef.true coef.false goodness
## 1 1 0.9223434 1 0 1.0000000
## 2 2 0.9982808 2 0 1.0000000
## 3 3 0.9983235 2 1 0.6666667
## 4 4 0.9984910 2 2 0.5000000
## 5 5 0.9992208 3 2 0.6000000
## 6 6 0.9993027 3 3 0.5000000
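As an aside, the repeated summary() extractions above can be condensed with sapply(). This sketch fits its own toy polynomial models (standing in for lm1 through lm6, which depend on the earlier population data) just to show the pattern:

```r
# Toy stand-ins for lm1 through lm6: polynomial fits of degree 1 to 6
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 + 3 * d$x + rnorm(20)
models <- lapply(1:6, function(k) lm(y ~ poly(x, k), data = d))

# Extract R-squared and count significant non-intercept coefficients at once
model.rsq <- sapply(models, function(m) summary(m)$r.squared)
coef.true <- sapply(models, function(m)
  sum(summary(m)$coefficients[-1, 4] < 0.05))  # drop the intercept row
data.frame(model.order = 1:6, model.rsq, coef.true)
```

Counting with sum() over a logical vector also sidesteps the table()[2] indexing, which breaks when every coefficient is significant and the table has no FALSE entry.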
The result demonstrates that, although the model's R-squared accuracy might be increasing as the fit becomes more complex, the goodness of that fit goes down over time, because the proportion of statistically meaningful coefficients among the total coefficients tends to go down. One way that you can statistically quantify this is to rank the associated elements you're interested in optimizing with the following:
model.comparison$rank <- sqrt(model.comparison$goodness^2 +
                              model.comparison$model.rsq^2)
model.comparison
## model.order model.rsq coef.true coef.false goodness rank
## 1 1 0.9223434 1 0 1.0000000 1.360411
## 2 2 0.9982808 2 0 1.0000000 1.412998
## 3 3 0.9983235 2 1 0.6666667 1.200456
## 4 4 0.9984910 2 2 0.5000000 1.116685
## 5 5 0.9992208 3 2 0.6000000 1.165522
## 6 6 0.9993027 3 3 0.5000000 1.117410
Now you can see where the trade-off is best between model accuracy and
goodness of fit. The model order with the highest rank in this case is
a quadratic fit, for which all of the coefficients are statistically
valid. Although the model fit for a third-degree polynomial is marginally
better (almost unmeasurably so), its goodness of fit isn't great because
it has a coefficient that is not statistically meaningful.
What we can say about this procedure is that the optimal model to choose is the one with the highest rank value. A model with a lower R-squared and a lower rank is underfit to the data, whereas a model with a higher R-squared but a lower rank is overfit to it.
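That selection rule can be applied programmatically with which.max(). Here the model.comparison values are hard-coded from the table printed above so the snippet stands alone:

```r
# Rebuild the comparison table from the printed values above
model.comparison <- data.frame(
  model.order = 1:6,
  model.rsq = c(0.9223434, 0.9982808, 0.9983235,
                0.9984910, 0.9992208, 0.9993027),
  goodness = c(1, 1, 2/3, 0.5, 0.6, 0.5))
model.comparison$rank <- sqrt(model.comparison$goodness^2 +
                              model.comparison$model.rsq^2)

# The best trade-off is the order with the highest rank
model.comparison$model.order[which.max(model.comparison$rank)]
## [1] 2
```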
Thus far we’ve considered regression models in terms of taking some kind of numeric data to which we want to fit some kind of curve so that we can use it for predictive purposes. Linear regression takes numeric data and renders an equation like y = mx + b. Linear regression can also have multiple inputs, giving an equation like y = b + m_{1}x_{1} + m_{2}x_{2} + (…). Further, these types of numerical regression models can be extended to nonlinear cases such as y = b + m_{1}x_{1} + m_{2}x_{1}^{2} + m_{3}x_{1}^{3} + (…). All of these have their own use cases and are totally dependent on the data we’re working with and how we strategize about the kind of accuracy for which we want to optimize.
All of these so far have ingested some numeric input and given us a numeric output. What if, instead, we wanted a “yes” or “no” outcome from our data? What if we were trying to do something like determine whether our input data was of a positive or negative result? In this case, we would be taking in continuous numeric data and getting some kind of discrete output. This is the basis for the classification end of our regression modeling. Logistic regression is a particular type of classification and relatively simple enough to be used as an introductory example. Logistic regression, in contrast to linear regression, finds the point at which the data goes from one kind of classification to another instead of trying to fit all the individual data points themselves.
Suppose that you are trying to diagnose patients to determine whether they have a malignant tumor. Let’s look at the code and the resulting plot in Figure 4-7:
data <- data.frame(tumor.size = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 20),
                   malignant = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1))
tumor.lm <- lm(malignant ~ tumor.size, data = data)
plot(y = data$malignant, x = data$tumor.size,
     main = "Tumor Malignancy by Size",
     ylab = "Type (0 = benign, 1 = cancerous)", xlab = "Tumor Size")
abline(a = coef(tumor.lm)[1], b = coef(tumor.lm)[2])
coef(tumor.lm)
## (Intercept)  tumor.size
##  0.20380952  0.06095238
summary(tumor.lm)$r.squared
## [1] 0.4063492
This code creates a dataset of tumor sizes from 1 to 20 and classifies whether they are malignant, with a benign or noncancerous tumor classified as 0, and a malignant or cancerous tumor labeled as 1. A naive instinct might be to slap a regression model on this data and see what the equation output is. With this approach, you would have an equation such as the following (reading the intercept and slope off the coefficients just shown):

malignant = 0.204 + 0.061 × tumor.size
The poor fit, with an R-squared of 0.406, suggests that we could obtain a
more accurate model. Additionally, you should question the logical
assessment of what it means to have a tumor that is 0.2 malignant when
tumors are logged in the data as being either malignant or not, with no
room in between. This would also not make much sense with the mtcars
dataset if we had something modeled against transmission type: what
would a 0.2 transmission be if 0 was manual and 1 was automatic?
We need to rethink this approach. Instead of fitting a normal mathematical function, we need to fit something called a decision boundary to the data.
The decision boundary is simply a line in the sand of our data that says “anything on this side is classified as X and anything on the other side is classified as Y.” Figure 4-8 revisits the plot of tumor sizes and whether they’re malignant. You can clearly see that any tumor of size 5 or greater always seems to be malignant:
plot(y = data$malignant, x = data$tumor.size,
     main = "Tumor Malignancy by Size",
     ylab = "Type (0 = benign, 1 = cancerous)", xlab = "Tumor Size")
abline(v = 4.5)
Logistic regression establishes the boundary against which you can classify data. The boundary in Figure 4-8 shows that any tumor size greater than 4.5 is malignant, whereas anything less than that is benign.
The way logistic regression (as well as many other types of classification algorithms) works is based on the mathematical underpinnings of the sigmoid function. The sigmoid function takes the following mathematical form:

$g\left(z\right) = \frac{1}{1 + e^{-z}}$
Figure 4-9 shows what the plot looks like:
e <- exp(1)
curve(1 / (1 + e^-x), -10, 10,
      main = "The Sigmoid Function", xlab = "Input", ylab = "Probability")
This function is used in logistic regression to classify data. On its own, the function takes in some numeric value that we are interested in and maps it to a probability between 0 and 1. We might be tempted to just plug some of the values from our earlier example into the sigmoid function and see what the output is. If we did, setting x = 1, for example, we would get h(1) = 0.73, or about a 73% chance that a tumor is malignant if our input is 1. Yet our classification system is 0 for benign and 1 for malignant, so a result of 0.73 for a length-1 tumor, which we know to be benign, is incorrect.
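That quick check is easy to reproduce; a one-line sigmoid (our own definition, matching the curve plotted above) makes the point:

```r
# Sigmoid: maps any real number to a probability between 0 and 1
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(1)  # about 0.73: too high for a tumor we know is benign
sigmoid(0)  # exactly 0.5: the midpoint of the curve
```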
Instead, we need to pass a set of weighted parameters to the logistic regressor. Because we have only one dependent variable at the moment (keeping in mind that the y-axis for our classification output is not an input variable), we should expect to pass a function to our logistic regressor that has a form similar to the following:

$g\left(x\right) = {\theta}_{0} + {\theta}_{1}x$
A priori, we don’t know what the weights are just yet. What we do want is for them to be chosen such that our g(x) function, when passed to our sigmoid function, will give us a classification that looks reasonable to what we see in our data:
lengths <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
t1 <- -4.5
t2 <- 1
g <- t1 + t2 * lengths
s <- 1 / (1 + e^-g)
data.frame(lengths, g, s)
##    lengths    g          s
## 1        1 -3.5 0.02931223
## 2        2 -2.5 0.07585818
## 3        3 -1.5 0.18242552
## 4        4 -0.5 0.37754067
## 5        5  0.5 0.62245933
## 6        6  1.5 0.81757448
## 7        7  2.5 0.92414182
## 8        8  3.5 0.97068777
## 9        9  4.5 0.98901306
## 10      10  5.5 0.99592986
This code chunk takes the input tumor lengths, which range from 1 to 10, and picks two weights of ${\theta}_{0}$ = -4.5 and ${\theta}_{1}$ = 1. In practice, you would either need to experiment with picking values for the weights and seeing how the outputs react, or crunch them through an algorithm that gives you the answer. The preceding code provides the answer as an end result. The weights are used in the function g(x), which is then passed to the sigmoid. The table in the code presents the resultant classification of the data as s. A tumor of length 1, when passed through the input function g(x), gives a result of -3.5. This value, when passed to the sigmoid function, yields a result that’s pretty close to zero. This means that a tumor of length 1 has a very low probability of being malignant, as demonstrated in Figure 4-10:
plot(y = s, x = lengths, pch = 1,
     main = "Sigmoid Function Inputs and Rounding Estimates",
     xlab = "Tumor Lengths", ylab = "Probability of Class 1 Typification")
points(y = round(s), x = lengths, pch = 3)
Figure 4-10 presents the probability of each tumor length being classified as malignant (probability 1.0) versus benign (probability 0.0). The result is pretty close, but there’s some error in it. You get a much better picture if you simply round the values to the nearest whole number. The final result is a classification that looks exactly like the starting data.
We originally started with the input data being tumor length. The output of tumor type between benign, y = 0, and malignant, y = 1, was already given to us. The objective was to design a model that calculates the probability that a tumor is benign or malignant based on its length. We did this by starting with the equation $g\left(x\right)={\theta}_{0}+{\theta}_{1}x$ and then finding the weights ${\theta}_{0}$ and ${\theta}_{1}$ that, when passed through the sigmoid function, produced outputs that look about right for our data. What we get at the end of the day is a decision boundary at length = 4.5: any values above it are classified as 1, and any values below it are classified as 0.
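Putting the pieces together, the whole classifier fits in a few lines. This is a sketch; the function name is our own, and its default weights are the ones picked in the example above:

```r
# Sigmoid of g(x) = t0 + t1 * x, rounded to a 0/1 class label
classify.tumor <- function(len, t0 = -4.5, t1 = 1) {
  round(1 / (1 + exp(-(t0 + t1 * len))))
}

classify.tumor(c(2, 4, 5, 9))
## [1] 0 0 1 1
```

Lengths below the 4.5 boundary round down to benign (0); lengths above it round up to malignant (1).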
The mechanisms by which classification algorithms like logistic regression work to determine those modeling weights are somewhat similar in scope to how simple linear regression weights are calculated. However, given that the goal of this text is to be introductory in nature, I’ll refer you to the statistical appendix for linear regression. Logistic regression and many other machine learning algorithms work in a similar fashion, but diving too deep into the realm of algorithm optimization can get overly mathy and we would lose focus on the understanding and application of the machine learning ecosystem as a whole.
Everything we’ve done so far in terms of classification has been on binary data: the tumor is either malignant or benign. Figure 4-11 looks at another example in which we determine the classes based on the data’s distribution:
plot(iris$Sepal.Length ~ iris$Sepal.Width,
     main = "Iris Flower Sepal Length vs Sepal Width",
     xlab = "Sepal Width", ylab = "Sepal Length")
In Figure 4-11, there are a bunch of data points and what appears to be two different classes of plants. There looks to be a grouping of data points at the bottom of the plot that seem to be more separated than the others. You can fit a logistic regression model to this data and find the equation for the line that makes your decision boundary. Any points below that threshold will be classified as one type, and all the points above the line will be classified as another type.
This exercise uses a generalized linear model, fitted with the
function glm(). Its usage is more flexible than that of the standard
linear model function lm() in that you can use it for classification
purposes:
iris.binary <- iris
iris.binary$binary <- as.numeric(iris[, 5] == "setosa")
iris.logistic <- glm(binary ~ Sepal.Width + Sepal.Length,
                     data = iris.binary, family = "binomial")
iris.logistic
##
## Call:  glm(formula = binary ~ Sepal.Width + Sepal.Length,
##     family = "binomial", data = iris.binary)
##
## Coefficients:
##  (Intercept)   Sepal.Width  Sepal.Length
##        437.2         137.9        -163.4
##
## Degrees of Freedom: 149 Total (i.e. Null);  147 Residual
## Null Deviance:       191
## Residual Deviance: 2.706e-08    AIC: 6
The output from this method provides some coefficients and an intercept that don’t look totally right. You need one extra step to calculate the slope and intercept of the decision boundary this way. Recall for a moment the sigmoid function $g\left(z\right) = 1/\left(1 + e^{-z}\right)$.

Here, z is a function with the following form:

$z = {\theta}_{0} + {\theta}_{1}x_{1} + {\theta}_{2}x_{2}$

Because you want a value between 0 and 1 for binary classification, the classification is 1 when your sigmoid function satisfies g(z) ≥ 0.5. That’s only true when the function z that you pass it is itself greater than or equal to 0:

${\theta}_{0} + {\theta}_{1}x_{1} + {\theta}_{2}x_{2} \geq 0$

You can rewrite this equation at the boundary (z = 0) and solve for the x_{2} value accordingly:

$x_{2} = -\frac{{\theta}_{0}}{{\theta}_{2}} - \frac{{\theta}_{1}}{{\theta}_{2}}x_{1}$

This equation has the same form as a y = b + mx line, with intercept $b = -{\theta}_{0}/{\theta}_{2}$ and slope $m = -{\theta}_{1}/{\theta}_{2}$, so we can solve computationally for the slope and intercept that determine the decision boundary.
You can calculate this directly from the logistic model object:
slope.iris <- coef(iris.logistic)[2] / (-coef(iris.logistic)[3])
int.iris <- coef(iris.logistic)[1] / (-coef(iris.logistic)[3])
slope.iris
## Sepal.Width
##   0.8440957
int.iris
## (Intercept)
##    2.675511
You then can plot this over your data and see how the classes shake out, as illustrated in Figure 4-12:
iris.binary$binary[iris.binary$binary == 0] <- 2
plot(Sepal.Length ~ Sepal.Width, data = iris.binary, pch = binary,
     main = "Iris Flower Sepal Length vs Sepal Width",
     xlab = "Sepal Width", ylab = "Sepal Length")
abline(a = int.iris, b = slope.iris)
You now have an equation that helps determine how to separate the species of iris flowers we have. Any flowers in the dataset that have a value below the line of y = 2.68 + 0.844 × (Sepal.Width) will be classified accordingly.
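Deriving the boundary by hand is instructive, but note that you can also get class probabilities straight from the fitted glm object with predict(). This sketch repeats the fit so it stands alone; expect a warning about fitted probabilities of 0 or 1, since these classes are perfectly separable (the residual deviance above was essentially zero):

```r
iris.binary <- iris
iris.binary$binary <- as.numeric(iris[, 5] == "setosa")
iris.logistic <- glm(binary ~ Sepal.Width + Sepal.Length,
                     data = iris.binary, family = "binomial")

# Probabilities on the 0-1 scale; rounding recovers the class labels
probs <- predict(iris.logistic, type = "response")
table(round(probs), iris.binary$binary)
```

Because the separation is perfect, every rounded probability matches its true label: 100 non-setosa flowers land in the 0 cell and 50 setosa flowers in the 1 cell.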
If you want to find the splits in your data that define multiple classes and not just a binary classification, you need to use multiclass classification. This approach is slightly different in that you are basically applying the same binary classification scheme that you have been doing thus far, but you are comparing the class you’re interested in versus everything else.
Figure 4-13 illustrates what a multiclass classification exercise might look like:
multi <- data.frame(
    x1 = c(0.03, 0.24, 0.21, 0, 0, 0.23, 0.6, 0.64, 0.86, 0.77),
    x2 = c(0.07, 0.06, 0.19, 1.15, 0.95, 1, 0.81, 0.64, 0.44, 0.74),
    lab = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

plot(x2 ~ x1, pch = lab, cex = 2, data = multi,
     main = "Multi-Class Classification", xlab = "x", ylab = "y")
There are three distinct classes of data, and you want to find some kind of lines that split them into their own categories, much as you did for the binary case. What this essentially boils down to is our simple binary test, but you change which group you’re comparing against. This is called a “one-versus-all” or “one-versus-many” test, in which you test three cases: triangles-versus-rest, circles-versus-rest, and crosses-versus-rest, as depicted in Figure 4-14:
par(mfrow = c(1, 3))

multi$lab2 <- c(1, 1, 1, 4, 4, 4, 4, 4, 4, 4)
plot(x2 ~ x1, pch = lab2, cex = 2, data = multi,
     main = "Multi-Class Classification", xlab = "x", ylab = "y")

multi$lab3 <- c(4, 4, 4, 2, 2, 2, 4, 4, 4, 4)
plot(x2 ~ x1, pch = lab3, cex = 2, data = multi,
     main = "Multi-Class Classification", xlab = "x", ylab = "y")

multi$lab4 <- c(4, 4, 4, 4, 4, 4, 3, 3, 3, 3)
plot(x2 ~ x1, pch = lab4, cex = 2, data = multi,
     main = "Multi-Class Classification", xlab = "x", ylab = "y")
In a one-versus-many classification approach, you use one decision boundary to classify data for one type or class versus all the other types or classes. You then do the same for the rest of the types or classes in the data until you have a number of decision boundaries that you can use to typify your data accordingly. So, for the example in Figure 4-14, you’re computing (on the left plot) the circles versus the rest of the data, then the triangles versus the rest of the data (middle plot), and, finally, the crosses versus the rest of the data (right plot). By splitting a three-class problem into three two-class problems, you can more easily find a single decision boundary for each plot and then combine those decision boundaries for a final model.
Here, you call upon the nnet library’s multinom() function. You use it to fit a multinomial case that’s basically the same as the simple binary case, but with three values instead of two. This methodology can be applied to more than three categories as well:
library(nnet)
multi.model <- multinom(lab ~ x2 + x1, data = multi, trace = F)
Notice that you have two lines to separate the three categories:
multi.model
## Call:
## multinom(formula = lab ~ x2 + x1, data = multi, trace = F)
##
## Coefficients:
##   (Intercept)       x2       x1
## 2    12.47452 28.50805 17.97523
## 3    19.82927 12.95949 33.39610
##
## Residual Deviance: 0.0004050319
## AIC: 12.00041
Again, however, you need to do the special calculation for the slopes and intercepts of the decision boundaries based on the output of this model. You can apply the same math as earlier, but you need to apply it to each of the equations from the model. Then, you can plot the decision boundary lines, as illustrated in Figure 4-15:
multi.int.1 <- -coef(multi.model)[1] / coef(multi.model)[3]
multi.slope.1 <- -coef(multi.model)[5] / coef(multi.model)[3]
multi.int.2 <- -coef(multi.model)[2] / coef(multi.model)[4]
multi.slope.2 <- -coef(multi.model)[6] / coef(multi.model)[4]

plot(x2 ~ x1, pch = lab, cex = 2, data = multi,
     main = "Multi-Class Classification", xlab = "x", ylab = "y")
abline(multi.int.1, multi.slope.1)
abline(multi.int.2, multi.slope.2)
The caret package makes logistic regression very easy for more complex
examples than what we have been doing thus far. Using caret is fairly
straightforward, though for some particular machine learning methods,
some optimizations and tunings might be warranted to achieve the best
results possible. Following is an example of how you can perform an
operation with caret:
library(caret)
data("GermanCredit")

Train <- createDataPartition(GermanCredit$Class, p = 0.6, list = FALSE)
training <- GermanCredit[Train, ]
testing <- GermanCredit[-Train, ]

mod_fit <- train(Class ~ Age + ForeignWorker + Property.RealEstate +
                   Housing.Own + CreditHistory.Critical,
                 data = training, method = "glm", family = "binomial")

predictions <- predict(mod_fit, testing[, -10])
table(predictions, testing[, 10])
##
## predictions Bad Good
##        Bad    9    8
##        Good 111  272
This simple example uses data from the GermanCredit dataset and shows
how you can build a confusion matrix from a caret training object. In
this case, the fit doesn’t seem super great: most of the truly Bad cases
(111 of 120) are classified incorrectly as Good. Although caret offers
some great ways to tune whatever particular machine learning method you
are interested in, it’s also quite flexible at changing machine learning
methods. By simply editing the method option, you can specify one of the
other logistic regression algorithms to pass to the model, as shown here:
mod_fit <- train(Class ~ Age + ForeignWorker + Property.RealEstate +
                   Housing.Own + CreditHistory.Critical,
                 data = training, method = "LogitBoost", family = "binomial")

predictions <- predict(mod_fit, testing[, -10])
table(predictions, testing[, 10])
##
## predictions Bad Good
##        Bad    7   15
##        Good 113  265
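For a single summary number, overall accuracy is the diagonal of the confusion matrix divided by its total. Hard-coding the two tables printed above (your own data partition will give slightly different counts):

```r
# glm: (9 + 272) correct out of 400; LogitBoost: (7 + 265) out of 400
acc.glm <- (9 + 272) / (9 + 8 + 111 + 272)
acc.boost <- (7 + 265) / (7 + 15 + 113 + 265)
c(acc.glm, acc.boost)
## [1] 0.7025 0.6800
```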
In this chapter, we looked at a couple of different ways to build basic models, spanning simple linear regression and logistic regression.
Regression comes in two forms: standard linear regression, which you might
have encountered early on in your mathematics classes, and logistic
regression, which is very different. R can create a linear model with
ease by using the lm() function. In tandem with R’s formula operator,
~, you can build a simple y = mx + b regression equation by doing something like lm(y ~ x). From this linear model object, you
can extract a wealth of information regarding coefficients, statistical
validity, and accuracy. You can do this by using the summary()
function, which can tell you how statistically valid each coefficient in
your model is. Coefficients that aren’t statistically useful can safely be removed from your model.
Regression models can be more advanced by having more features. You can
model behavior like fuel efficiency as a function of a vehicle’s weight,
but you can also incorporate more things into your model, such as a vehicle’s
transmission type, how many cylinders its engine might have, and so
forth. Multivariate regression modeling in R follows the same practice
as single-feature regression. The only difference is that there are more
features listed in the summary() view of your model.
However, simply adding more features to a model doesn’t make it more accurate by default. You might need to employ techniques like regularization, which takes a dataset that has lots of features and reduces the impact of those that aren’t statistically as important as others. This can help you to simplify your model drastically and boost accuracy assessments, as well.
Sometimes, there might be nonlinear relationships in the data that
require polynomial fits. A regular linear model is of the form
y = b + m_{1}x_{1} + m_{2}x_{2} + (…), whereas a polynomial model
might have the form
y = m_{1}x_{1}^{2} + m_{2}x_{1} + m_{3}x_{2}^{2} + m_{4}x_{2} + (…).
You can fit polynomial behavior in your models by passing a poly()
function to the lm() function; for example, lm(y ~ poly(x, 2)) creates a quadratic relationship in the data. It’s important not to go
too crazy with polynomial degrees, however, because you run the risk of
fitting your data so tightly that any new data that comes in might have high error estimates that aren’t true to form.
In machine learning, there are standard regression techniques that estimate continuous values, such as numbers, and classification techniques that estimate discrete values, such as categories. In a lot of cases, the data has discrete values, like a flower’s species. If you try the standard linear regression techniques on these datasets, you’ll end up with disingenuous relationships in your data that are better handled by classification schemes.
Logistic regression is a classification method that finds a boundary separating data into discrete classes. It does this by passing a weighted combination of the inputs through a sigmoid function, which maps the result to a probability between 0 and 1. You can use this to determine how likely it is that a given data point belongs to a certain class.
You can also use logistic regression to draw decision boundaries in your data. A decision boundary is a line that doesn’t necessarily fit the data in a standard (x, y) plot, but fits gaps in the data to separate them into specific classes. In the case of data for which you have two classes, you would have one line that splits your data into class 1 and class 2.
In the case of multiple classes, you treat each class with a one-versus-many approach. If you have three classes, you focus on one, group the other two together, and find the decision boundary that separates them; then you move on to the next class. At the end, you should have a series of decision boundaries that separate the data into zones. Any data in a particular zone is classified the same as the data points in that zone.