Errata

Machine Learning for Hackers

Errata for Machine Learning for Hackers

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version Location Description Submitted by Date submitted
Printed Page 82
middle



TermDocumentMatrix(doc.corpus,control)

Error in tolower(txt) : invalid multibyte string 1

huangqiyou  Nov 10, 2013 
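The "invalid multibyte string" message usually appears when tolower() meets bytes that are not valid in the session's encoding. A minimal sketch of one possible workaround is to convert the text to UTF-8 before building the corpus. It assumes the offending messages are Latin-1 encoded, which the erratum does not state, and the sample string is hypothetical.

```r
# Hypothetical input: "caf\xe9" holds a Latin-1 byte that is not valid UTF-8
raw.text <- "caf\xe9"

# Convert from the assumed source encoding (latin1) to UTF-8
clean.text <- iconv(raw.text, from = "latin1", to = "UTF-8")

# tolower() can now process the string
lower.text <- tolower(clean.text)
```

The same iconv() call could be applied to the whole document vector before it is passed to Corpus().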
Printed Page 6
1

Contrary to what is stated in the book, R does not appear to come pre-installed on Mac OS X.

At least, I can't find it, and the instructions as given do not work. (i.e. typing R in terminal gives 'command not found')

Perhaps the author had already installed R?

Checking a dozen or so of the top sites for "r mac os x" shows they all refer to installing R. I didn't find any references to R being preinstalled.

Anonymous  Mar 12, 2012 
PDF Page 11
table 1-2

The URLs for ggplot2 and glmnet are reversed.

John Cook  Apr 14, 2012 
Mobi Page 11
table 1-2

The location for the tm package is shown as http://www.spatstat.org/spatstat/, this however is the location for the SpatStat package. The location should be: http://tm.r-forge.r-project.org/index.html

David Clark  Oct 28, 2012 
PDF Page 11
Table 1-2

Links in the location column for ggplot2 and glmnet packages (rows 2 and 3) are reversed.

Richard Smith  Dec 04, 2014 
Printed Page 12
para 2

The package_installer.R script is not in "the code folder for this chapter". It is in the root directory of the example code. There is no "code" directory in chapter 1.

Also, there does not appear to be a link to the example code in the book. It's easily found on the website, but it should be in the book. (and it should be in the index of the book).

And the script has no error checking, so even if it fails on the first package, it keeps running, and failing, all the way through.

And it fails as it requires the mac developer tools to run (for make, etc). These are mentioned as optional on page 7. They are apparently required.

In addition, page 7 also says "requires both the C and Fortran compilers... you can install these compilers using the mac os x developers tools DVD". I don't have the DVD, but the downloaded version should be identical, and it does not contain Fortran, which must be installed separately (http://cran.r-project.org/bin/macosx/tools/). This is fairly severe: the install continues to the end, but the error messages scroll past unless you are watching, leaving you with a broken install. This seems likely to confuse new R users.

And still some errors. Will report them later, after the build finishes.

Most or all of this could be avoided if the install script used binary packages, not source.

Anonymous  Mar 14, 2012 
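The missing error checking described above could be addressed along these lines. This is a sketch only, not the book's package_installer.R: the helper name install.with.check and its installer argument are hypothetical, and the installer is injectable so the logic can be tried without network access.

```r
# Hypothetical helper: install each package and collect the failures
# instead of silently carrying on.
install.with.check <- function(pkgs, installer = utils::install.packages) {
  failed <- character(0)
  for (pkg in pkgs) {
    ok <- tryCatch({
      installer(pkg)
      TRUE
    },
    # install.packages() reports many failures as warnings, so this sketch
    # treats any warning as a failure (possibly too strict in practice)
    warning = function(w) FALSE,
    error = function(e) FALSE)
    if (!ok) failed <- c(failed, pkg)
  }
  failed
}

# Usage with a stand-in installer that fails for one package
fake.installer <- function(pkg) {
  if (pkg == "broken") stop("build failed")
  invisible(NULL)
}
failed.pkgs <- install.with.check(c("ok1", "broken", "ok2"),
                                  installer = fake.installer)
```

On platforms where CRAN provides binaries, passing type = "binary" to install.packages() is one way to avoid the compiler requirement altogether.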
PDF Page 12-13
loading libraries and the data

When working through this book, assume that libraries and packages may install and run differently from what is described.

I found this error the hard way. When the book was published, ggplot2 may have loaded two other required packages automatically: plyr and reshape.


ggplot2 now uses a NAMESPACE, and only exports functions that should be user visible - this should make it play considerably more nicely with other packages in the R ecosystem.

From version 0.9.0 of ggplot2, the implementation was changed to avoid possible conflicts when multiple packages are loaded.

Anonymous  Dec 06, 2012 
PDF, Mobi Page 12
1st paragraph

Some packages installed by the script require gfortran. However, they require gfortran 4.2.3. gfortran is currently upwards of version 4.9, and the command line options have changed.

Where the command line used by the packages says: -arch x86_64, it should say: -march=native

This allows gfortran to select the best architecture for the machine it's running on.

Without 4.2.3, these packages will fail to build. This point should be clarified in the book. It took me some digging to figure out what was wrong, and someone with less experience will have a very hard time making it work.

I am using a Macbook Pro, running the latest MacOS.

wackyvorlon  Jul 24, 2013 
PDF Page 14
last paragraph

YYY should be YYYY

John Cook  Apr 16, 2012 
Printed Page 14
1st paragraph

The data file ufo_awesome.tsv is way too big for quick processing. I'm trying out the statements in R while reading the book. The statement "ufo<-read.delim(...)" practically froze my Mac because of this large data file. Maybe provide a smaller file for quick programming along.

Anonymous  May 14, 2012 
PDF Page 15
Second to last paragraph

Extraneous ">":

"good.rows<-ifelse(nchar(ufo$DateOccurred)>!=8 | nchar(ufo$DateReported)!=8,FALSE,
TRUE)"

should be:

"good.rows<-ifelse(nchar(ufo$DateOccurred)!=8 | nchar(ufo$DateReported)!=8,FALSE,
TRUE)"

Lorien Pratt  Apr 22, 2012 
Printed Page 15
Bottom Code sample

The number of bad rows reads "371" instead of "731". Here is the printout when the code is run:

> good.rows<-ifelse(nchar(ufo$DateOccurred)!=8 | nchar(ufo$DateReported)!=8,FALSE,TRUE)
> length(which(!good.rows))
[1] 731

Megan Squire  Sep 21, 2012 
PDF Page 15
2

In the data I downloaded from GitHub (https://github.com/johnmyleswhite/ML_for_Hackers/blob/master/01-Introduction/data/ufo/ufo_awesome.tsv), I don't think it is appropriate to use string length alone to filter out malformed data. I found "19940000" in DateOccurred, and it is turned into NA by "ufo$DateOccurred<-as.Date(ufo$DateOccurred, format="%Y%m%d")" when the date strings are converted.

kaiwang  Apr 03, 2016 
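One way to catch values such as "19940000" is to combine the length check with a check on the parsed result. The sketch below uses a made-up vector in place of ufo$DateOccurred.

```r
# Hypothetical sample of date strings, including the "19940000" case
date.strings <- c("19950704", "19940000", "0000", "19951231")

# Parse first; anything that cannot form a real date becomes NA
parsed <- as.Date(date.strings, format = "%Y%m%d")

# Keep rows that have eight characters AND parse to a valid date
good.rows <- nchar(date.strings) == 8 & !is.na(parsed)
```

The length check is kept because as.Date() ignores trailing characters, so an over-long string could still parse.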
PDF Page 16
second half (function and explanation)

The strsplit function doesn't throw an error when the split character isn't matched -- it just returns the string, so the [[1]] reference apparently will always return something.

Perhaps the solution here is to reference [[2]] rather than [[1]], to check if a split occurred?

Andrew Broman  Feb 16, 2012 
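The behavior described above can be checked directly. In this sketch the location strings and the helper name split.city.state are hypothetical. Testing the length of the result is one way to tell whether a split occurred.

```r
# strsplit() returns the whole string when the separator is absent
no.comma <- strsplit("Iowa City", ",")[[1]]
with.comma <- strsplit("Iowa City, IA", ",")[[1]]

# Hypothetical helper: accept only locations that split into exactly two parts
split.city.state <- function(location) {
  parts <- gsub("^ ", "", strsplit(location, ",")[[1]])
  if (length(parts) == 2) parts else c(NA, NA)
}
```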
PDF Page 16
middle

The strsplit function doesn't give an error when the split character isn't matched, so the 'get.location' function should be changed as shown below.


get.location <- function(l)
{
  split.location <- tryCatch(strsplit(l, ",")[[1]],
                             error = function(e) return(c(NA, NA)))
  clean.location <- gsub("^ ","",split.location)
  if (length(clean.location) > 2 | length(clean.location) == 1)
  {
    return(c(NA,NA))
  }
  else
  {
    return(clean.location)
  }
}


By the way, thank you for this great book.

Jeong-ho park  Jan 03, 2014 
PDF Page 18
2nd Paragraph

Text and example diverge. Text states "We then use the is.na function to find which entries are not US states and reset them to NA in the USState column." In the example above the paragraph however the USState values are set to NA using a reverse lookup of the state name in us.states (due to the return of NA by match I presume). The USCity on the other hand uses the is.na function in the example.

Lukas  May 21, 2012 
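The mechanism this erratum describes, match() returning NA for values it cannot find, can be seen in a small sketch. The vectors below are hypothetical stand-ins for us.states and the USState column.

```r
# Hypothetical stand-ins for us.states and the USState column
valid.states <- c("ia", "wi", "wa")
state.column <- c("ia", "on", "wa")

# match() gives NA for entries not in valid.states, so indexing yields NA there
cleaned <- valid.states[match(state.column, valid.states)]

# is.na() can then flag the non-US rows, e.g. to blank out the city as well
not.us <- is.na(cleaned)
```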
PDF Page 18
example code

seems like the assignment to ufo$USState should be
ufo$USState <- ufo$USState[...]

Brian Drye  Jul 04, 2014 
19
Second code snippet

Creating the histogram using the code given does not work.

Running:

quick.hist<-ggplot(ufo.us, aes(x=DateOccurred))+geom_histogram()+
scale_x_date(major="50 years")

Generates an error.


> quick.hist<-ggplot(ufo.us, aes(x=DateOccurred))+geom_histogram()+
+ scale_x_date(major="50 years")
Error in continuous_scale(aesthetics, "date", identity, breaks = breaks, :
unused argument(s) (major = "50 years")

This error occurs when running the code from the book as well as the download code samples.

Anonymous  Mar 17, 2012 
Printed, PDF Page 19
2

The examples to plot the data using ggplot2 refers to an outdated version of ggplot2.
When using the code in the book with the new version of ggplot2, the following errors are produced:
"Error in continuous_scale" and "error in inherits"

Anonymous  Mar 24, 2012 
PDF Page 19
Second block of code

I think you have a ggplot version issue. When trying to generate the histogram for UFO sightings, I get the following error:

"Error in continuous_scale(aesthetics, "date", identity, breaks = breaks, :
unused argument(s) (major = "50 years")

I am using R 2.14.2 and ggplot 0.9.0.

It would appear someone on Stack Exchange is having the same error: http://stackoverflow.com/questions/9857123/error-in-continuous-scale-and-error-in-inherits-ggplot2-r-2-14-2

Trey Causey  Mar 25, 2012 
PDF, Mobi Page 19
code after second paragraph

The line in the book, which doesn't work properly, is

quick.hist <- ggplot(ufo.us, aes(x=DateOccured)+geom_histogram)+scale_x_date(major="50 years")

while in the code provided by the authors

major="50 years"

is replaced by

breaks="50 years"

Paulo Nuin  Oct 22, 2013 
PDF Page 24
3rd paragraph from the bottom

"To check this, run the first line of code from the preceding block,..."

This should have "first two lines of code" instead. Otherwise, it wouldn't include the "geom_line" call.

Anonymous  Feb 18, 2012 
PDF Page 25
last paragraph

Montana is listed as having a spike around mid-1997, but I believe you mean Missouri. (Missouri's abbreviation is MO, while Montana's is MT.)

Anonymous  Feb 18, 2012 
PDF Page 29
end of first paragraph

Is "gization" supposed to be "visualization"?

Anonymous  Feb 18, 2012 
PDF Page 32
Figure 2-2

"MxN" at bottom of vector should read "Mx1", as written above the graph.

John Sandall  Apr 25, 2012 
37
first paragraph

"by liberally by" should be "by liberal"

Lorien Pratt  Apr 24, 2012 
41
First paragraph under "Standard Deviations and Variances"

"center of list" should be "center of a list"

Lorien Pratt  Apr 25, 2012 
Printed Page 45
last paragraph

bindwidths
should be replaced by
binwidths

Martin Schader  Feb 02, 2013 
Printed Page 45
First Sentence of Last Paragraph

"Because setting bindwidths" has erroneously included an extra letter "d" and should state, "Because setting binwidths".

Joe Nolan  Mar 27, 2016 
PDF Page 49
bottom part - in the text

The example you write about is using the weight

ggplot(heights.weights, aes(x = Weight, fill = Gender)) + geom_density() +
facet_grid(Gender ~ .)

but in the text you are writing about the height:
"Once we've done this, we clearly see one bell curve centered at 64" for women and another bell curve centered at 69" for men."

You should change the code to:
ggplot(heights.weights, aes(x = Height, fill = Gender)) + geom_density() +
facet_grid(Gender ~ .)

Marco Pashkov  Jul 11, 2012 
Printed Page 49
last paragraph

Here you discuss Fig. 2-11 and the weights of women and men.
Therefore, instead of the curve centers 64" and 69" (inches), the means of the weights in pounds should be given.

Martin Schader  Feb 02, 2013 
54
Last paragraph

"in word." should be "in words."

Lorien Pratt  Apr 25, 2012 
ePub Page 54
code

For the scale_x_date function, the code in the book uses "major" as the parameter name, when it should be "breaks" e.g. scale_x_date(breaks = "5 years", ...
The same applies for the other uses of scale_x_date in this chapter. The example code is accurate.

Roy C  Feb 05, 2014 
Printed Page 71
United States

When I create the plot on page 71 with the code on page 70, the "Height" and "Weight" axes are switched.

Dan Williams  Apr 16, 2012 
73
First two paragraphs

The last line of the first paragraph references "Example 3-1", and the following paragraph refers to this example as containing black lines, blue dots, and red dots. Example 3-1 does not have these elements, rather it is an example of a candidate for spam email. Also, I cannot find a figure with black lines and blue and red dots. I believe that Example 3-1 should be changed to "Figure 3-1" and the text should be updated to refer to the horizontal dashed lines, and black triangles, rectangles, and circles.

Lorien Pratt  Apr 25, 2012 
Printed Page 73
second paragraph

You discuss blue dots and red dots in Fig. 3-1.
This figure is monochrome and displays circles and triangles.

Martin Schader  Feb 02, 2013 
Printed Page 74
third paragraph

the code/data folder...
should be replaced by
the data folder...

Martin Schader  Feb 02, 2013 
75
Last paragraph and figure above it

The last paragraph refers to Figure 3-1, but I think this is incorrect. If this is a correct figure reference, then it is not clear what "X", and "Y" refer to here, nor what triangles and circles refer to (spam or ham?), and these should be clarified. If this is not the intended figure, then I think that the correct reference is Figure 3-2.

Lorien Pratt  Apr 25, 2012 
Printed Page 75
Figure 3-1 and last paragraph

It is not clear what Fig. 3-1 displays.
What is the x axis, what the y axis?

In the text, when you say Figure 3-1, you might mean Figure 3-2.

Martin Schader  Feb 03, 2013 
76
first paragraph

This paragraph references Figure 3-2 as containing jittered data, but it does not. I do not believe that a jittered data picture is present.

Lorien Pratt  Apr 25, 2012 
76
Last paragraph

The code to generate the picture that is distributed in the book should be updated to correspond to the latest version of ggplot. Specifically, in the file email_classify.R,

ex1 <- ggplot(val, aes(x, V2)) +
geom_jitter(aes(shape = as.factor(V3)),
position = position_jitter(height = 2)) +
scale_shape_discrete(legend = FALSE, solid = FALSE) +
geom_hline(aes(yintercept = c(10,30), linetype = 2)) +
theme_bw() +
xlab("X") +
ylab("Y")

should be:
ex1 <- ggplot(val, aes(x, V2)) +
geom_jitter(aes(shape = as.factor(V3)),
position = position_jitter(height = 2)) +
scale_shape_discrete(guide = "none", solid = FALSE) +
geom_hline(aes(yintercept = c(10,30) )) +
theme_bw() +
xlab("X") +
ylab("Y")

This repairs two errors: "legend" is deprecated, and a new error: "A continuous variable can not be mapped to linetype" that is generated by "linetype = 2". Note that removing linetype = 2 produces solid, not dashed lines. This is consistent with the book text, but no longer matches the corresponding figure in the book.

Lorien Pratt  Apr 26, 2012 
Printed Page 76
Last paragraph

The code to generate Figure 3-2 needs to be updated to correspond to the latest version of ggplot. In the file email_classify.R, the plot command

ex1 <- ggplot(val, aes(x, V2)) + geom_jitter(aes(shape = as.factor(V3)), position = position_jitter(height = 2)) + scale_shape_discrete(legend = FALSE, solid = FALSE) + geom_hline(aes(yintercept = c(10,30), linetype = 2)) + theme_bw() + xlab("X") + ylab("Y")

produces one error, which prevents it from producing a graph, and one warning.

Warning message:
In discrete_scale("shape", "shape_d", shape_pal(solid), ...) :
"legend" argument in scale_XXX is deprecated. Use guide="none" for suppress the guide display.

Error: A continuous variable can not be mapped to linetype

The version below fixes both the error by moving the "linetype = 2" out of the aes() function (keeping the dashed lines that are removed in the alternative solution suggested by Lorien Pratt) and fixes the warning by using guide="none" instead of the deprecated legend=FALSE argument.

ex1 <- ggplot(val, aes(x, V2)) + geom_jitter(aes(shape = as.factor(V3)), position = position_jitter(height = 2)) + scale_shape_discrete(guide = "none", solid = FALSE) + geom_hline(aes(yintercept = c(10,30)), linetype = 2) + theme_bw() + xlab("X") + ylab("Y")

Anonymous  Jun 22, 2012 
Printed Page 76
1st paragraph

Instead of Figure 3-2
you mean Figure 3-3

Martin Schader  Feb 03, 2013 
Printed Page 80
para 1

Two errors.

The url for rfc 822 is https://tools.ietf.org/id/rfc822
The book transposes the f and r

But RFC 822 was replaced in 2001, and its replacement was updated again in 2008.

The correct URL for this is https://tools.ietf.org/html/rfc5322



Anonymous  Mar 20, 2012 
80
Middle of page

This function generates the indicated error message:

all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path,p,sep="/")))
Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
wrong sign in 'by' argument

it also generates warning messages on several spam files, the first such file being: "data/spam/00006.5ab5620d3d7c6c0db76234556a16f6c1". The error comes from the line:

?We've received 8,000 in 1 day and we are doing

and is generated because of the first character, which appears to be outside the usual ascii range.

The culprit is in the following function:

get.msg <- function(path) {
con <- file(path, open="rt", encoding="latin1")
text <- readLines(con)
# The message always begins after the first full line break
msg <- text[seq(which(text=="")[1]+1,length(text),1)]
close(con)
return(paste(msg, collapse="\n"))
}

If "latin1" is changed to "native.enc", then this error stops, and the all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path,p,sep="/")))
works.

Note that there is a second typo in the above line as well, reported in a separate errata. (sep = "/" not "")

Lorien Pratt  Apr 26, 2012 
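Whatever the encoding setting, the "wrong sign in 'by' argument" message is what seq() reports when its start exceeds its end, which happens when the first blank line is the last line read. A defensive variant might look like the sketch below. The name get.msg.safe and the choice to return an empty body are assumptions, not the authors' fix.

```r
# Hypothetical variant of get.msg that tolerates messages without a body
get.msg.safe <- function(path, encoding = "latin1") {
  con <- file(path, open = "rt", encoding = encoding)
  on.exit(close(con))
  text <- readLines(con, warn = FALSE)
  first.blank <- which(text == "")[1]
  # No blank line, or nothing after it: return an empty body instead of failing
  if (is.na(first.blank) || first.blank >= length(text)) {
    return("")
  }
  paste(text[(first.blank + 1):length(text)], collapse = "\n")
}

# Usage on two temporary files (contents are made up)
full.msg <- tempfile()
writeLines(c("Subject: hello", "", "line one", "line two"), full.msg)
header.only <- tempfile()
writeLines(c("Subject: hello", ""), header.only)
body.full <- get.msg.safe(full.msg)
body.empty <- get.msg.safe(header.only)
```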
PDF Page 80
United States

This line consistently throws error:

> con <- file("datasets/spam/", open="rt", encoding="native.enc")
Error in file("datasets/spam/", open = "rt", encoding = "native.enc") :
cannot open the connection
In addition: Warning message:
In file("datasets/spam/", open = "rt", encoding = "native.enc") :
cannot open file 'datasets/spam/': Permission denied

But if I try to read the individual files from the same folder, it works:

> con <- file("datasets/spam/spam.email.txt", open="rt", encoding="native.enc")

The above command works indicating that there are no permission issues on the folder or the file.

Please help me understand what is going on here.

Kingshuk Chatterjee  Oct 31, 2012 
Printed Page 80 ff
-

Some of the files in the spam folder you provided (no. 263, 320, 323, and 324) contain characters like \202, \203, etc.
If I don't remove these, get.msg will crash.

Martin Schader  Feb 02, 2013 
Printed Page 81
1st paragraph

When I run the following code:

spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
all.spam <- sapply(spam.docs,
function(p) get.msg(file.path(spam.path, p)))

I get the following error:

Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
wrong sign in 'by' argument

Paul Reiners  Feb 24, 2012 
81
second paragraph

all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path,p,sep="")))

should read:

all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path,p,sep="/")))

Lorien Pratt  Apr 26, 2012 
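The effect of the separator is easy to see in isolation. The sketch assumes spam.path has no trailing slash; if the book's path ends in "/", sep="" would already produce a valid path. The path and file name below are hypothetical.

```r
spam.path <- "data/spam"        # hypothetical directory, no trailing slash
doc.name <- "00001.example"     # hypothetical file name

joined.empty <- paste(spam.path, doc.name, sep = "")   # "data/spam00001.example"
joined.slash <- paste(spam.path, doc.name, sep = "/")  # "data/spam/00001.example"
joined.fp <- file.path(spam.path, doc.name)            # same as joined.slash
```

file.path() inserts the separator itself, which avoids the question of which sep to pass.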
Printed Page 82
top of page

To create the TDM, the options stopwords=TRUE and minDocFreq=2 are used. But the resulting TDM includes stopwords and terms with frequency of 1. The options for removePunctuation and removeNumbers appear to work properly.

It also happens with the code supplied with the book, not just the code printed in the book.

Is this an error in the package or the book?

Anonymous  Dec 25, 2012 
Printed Page 82
Top of page, 3rd and 4th line of code

The stopwords are still showing up in spam.df. Someone else also posted this in December. Any news?

Dave Gilsdorf  Mar 21, 2013 
Printed Page 82
get.tdm

In recent versions of the tm package minDocFreq=2 has been replaced by bounds = list(global = c(2,Inf)). See https://stackoverflow.com/questions/16287546/trying-to-remove-words-from-a-documenttermmatrix-in-order-to-use-topicmodels

ifernando  Feb 21, 2018 
Printed Page 83
4th par

When I use the data you provide on this website and compute spam.df, the result is

term frequency density occurrence
7135 email 741 0.006365378 0.530
17371 please 388 0.003333018 0.476
13596 list 392 0.003367379 0.424
2765 body 362 0.003109672 0.402
10623 html 392 0.003367379 0.380
8666 free 495 0.004252175 0.360

Martin Schader  Feb 03, 2013 
Printed Page 84
4th par

When I compute easyham.df with the data you provided, the result is

term frequency density occurrence
12731 wrote 237 0.004275894 0.378
6835 list 246 0.004438270 0.364
4888 group 196 0.003536183 0.348
11092 subject 155 0.002796471 0.256
11603 time 175 0.003157306 0.252
3550 email 174 0.003139264 0.250

Martin Schader  Feb 03, 2013 
87
First paragraph of text

"grey shaded area of Figure 3-3" should read "dark blue (center) shaded area of Figure 3-4".

Lorien Pratt  Apr 28, 2012 
87
First paragraph of text

"as depicted in Figure 3-3." should read "as depicted in Figure 3-4."

Lorien Pratt  Apr 28, 2012 
Printed Page 87
2nd par and 4th par

par 2:
Figure 3-3
should be replaced by
Figure 3-4
(twice)

and

par 4:
less than zero
should be replaced by
less than one

Martin Schader  Feb 03, 2013 
Printed Page 87
First (only) code block

The constant c is exponentiated (^) in two places where it should be multiplied (*). The narrative below the code (the 3rd and 4th paragraphs) indicates that a product is being obtained, and multiplication seems more logical than exponentiation.

Anonymous  Feb 28, 2013 
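When weighing this erratum, it may help to note that exponentiation and a repeated product can describe the same quantity: if each unseen term contributes a factor of c, the product over k unseen terms equals c^k. Whether that is what the book intends is for the authors to confirm. The values below are hypothetical.

```r
c.small <- 1e-6   # a small constant standing in for c
n.unseen <- 3     # hypothetical count of terms not found in the training data

by.power <- c.small ^ n.unseen
by.product <- prod(rep(c.small, n.unseen))
```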
PDF Page 87
Code

In classify.email, R rounds the product of the probabilities to zero when the term list is long.

I solved this issue using log transformation:

ClasifyEmail <- function(path, training.df, prior=0.5, c=1e-6) {
  msg <- GetMsg(path)
  msg.tdm <- GetTDM(msg)
  msg.matrix <- as.matrix(msg.tdm)
  msg.freq <- rowSums(msg.matrix)
  # Find intersection of words
  msg.match <- intersect(names(msg.freq), training.df$term)
  # Compute the log probability of the unseen terms
  unseen.probs.log <- log(prior)+(length(msg.freq)-length(msg.match))*log(c)
  if (length(msg.match) < 1) {
    # Return the log value here too, so both branches use the same scale
    return(unseen.probs.log)
  } else {
    # Look up the probabilities of the matched terms
    match.probs <- training.df$occurrence[match(msg.match,training.df$term)]
    # Sum the logs instead of multiplying the probabilities
    prob.log <- unseen.probs.log + sum(log(match.probs))
    return(prob.log)
  }
}

ifernando  Feb 26, 2018 
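The underflow this erratum describes can be reproduced without any email data. The probabilities below are hypothetical.

```r
# Hypothetical per-term probabilities for a long message
term.probs <- rep(1e-6, 100)

direct.product <- prod(term.probs)   # underflows to 0 in double precision
log.score <- sum(log(term.probs))    # stays finite (about -1381.6)
```

Because log() is monotonic, comparing log scores for spam and ham gives the same ranking as comparing the raw products would, if they could be represented.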
88
First code block

replace sep="" in two places in this code with sep="/"

Lorien Pratt  Apr 28, 2012 
Printed Page 88
first code snippet

hardham.res <- ifelse(hardham.spamtest > hardham.hamtest, TRUE, FALSE)
should be replaced by
hardham.res <- ifelse(hardham.spamtest > hardham.hamtest, FALSE, TRUE)

Martin Schader  Feb 03, 2013 
100
footnote

"are not acting" should be "are not more likely to act"

Lorien Pratt  Apr 30, 2012 
PDF Page 105
code

date <- msg.vec[date.grep[1]]
should be grepl

Anonymous  Mar 01, 2012 
105
code block at bottom of page

easyham.parse <- lapply(easyham.docs, function(p) parse.email(paste(easyham.path, p, sep="")))

should read:

easyham.parse <- lapply(easyham.docs, function(p) parse.email(paste(easyham.path, p, sep="/")))

Lorien Pratt  May 04, 2012 
PDF Page 105
First code sample - get.date

The get.date function fails because the second line of the function says (note the 'l' after 'grep'):

date.grepl <- which(date.grep == TRUE)

and the third line says (note the missing 'l'):

date <- msg.vec[date.grep[1]]


The sample code uses date.grep for both the second and third line.

Maymount  Jun 25, 2013 
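The difference between indexing with the logical vector and indexing with the positions from which() can be seen in a small sketch. The header lines are hypothetical.

```r
# Hypothetical message header lines
msg.vec <- c("From: someone@example.com",
             "Date: Thu, 31 Jan 2002 22:44:14",
             "Subject: hi")

date.grep <- grepl("^Date: ", msg.vec)   # logical vector: FALSE TRUE FALSE
date.idx <- which(date.grep)             # integer position: 2

by.logical <- msg.vec[date.grep[1]]      # date.grep[1] is FALSE, so character(0)
by.position <- msg.vec[date.idx[1]]      # the first "Date: " line
```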
Mobi Page 106
3rd paragraph

When defining the parameters for the strptime function, it would be helpful to point out that these return information like abbreviated Weekdays or Months in the current locale of the machine. This means if you are not a native speaker of English and you have configured your machine to talk in your native language, you are running the risk of getting lots of NAs when running that function on English emails.

Sys.setlocale('LC_TIME', 'en_US') did it for me.

Thomas Prosser  Dec 17, 2013 
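A sketch of the locale workaround that also restores the previous setting is shown below. It uses the "C" locale rather than 'en_US' because "C" is always available and uses English day and month names. The date string and format are hypothetical examples, not the book's patterns.

```r
# Remember the current LC_TIME setting so it can be restored afterwards
old.locale <- Sys.getlocale("LC_TIME")

# Switch to the "C" locale, which uses English day and month abbreviations
invisible(Sys.setlocale("LC_TIME", "C"))
parsed <- strptime("Thu, 31 Jan 2002 22:44:14",
                   "%a, %d %b %Y %H:%M:%S", tz = "UTC")

# Restore the original setting
invisible(Sys.setlocale("LC_TIME", old.locale))
```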
108
First code block

As with the error reported for Page 80, the encoding should be "native.enc", not "latin1", otherwise this generates an error message.

It occurs to me that this errata, as with the page 80 one, may only apply on one type of machine, as other machines may not generate this error, and "latin1" may be correct there. I generated this error on a Windows 7 64 bit computer.

Lorien Pratt  May 04, 2012 
PDF Page 108
2nd paragraph

>from.weight <- ddply(priority.train, .(From.EMail),summarise, Freq = length(Subject))

The above code gives the error below:

Error in attributes(out) <- attributes(col) :
'names' attribute [9] must be the same length as the vector [1]

After checking online on stackoverflow, I found that converting the Date feature in the priority.train from a POSIXlt object to a POSIXct object before running the above code solves the problem. i.e.

priority.train$Date <- as.POSIXct(priority.train$Date)

David  Mar 21, 2013 
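The conversion suggested above can be sketched as follows. strptime() returns POSIXlt, which stores each timestamp as a list of components, while POSIXct stores one number per timestamp and behaves like an ordinary data frame column. The timestamps and the emails data frame are hypothetical.

```r
# Hypothetical timestamps parsed with strptime(), which returns POSIXlt
dates.lt <- strptime(c("2002-01-31 22:44:14", "2002-02-01 00:53:41"),
                     "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Convert to POSIXct before storing the dates in a data frame
dates.ct <- as.POSIXct(dates.lt)
emails <- data.frame(Subject = c("a", "b"), Date = dates.ct)
```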
Printed Page 114
1st code block

R 3.2.5 user here. In the first line of the `thread.counts` function, the call to the `paste` function uses the default argument `sep=" "` because the `sep` argument is not supplied, so an unwanted space is introduced between the string "re: " and the subject line during comparison. The result is that most threads will not be found.

The solution is to supply `sep=""` to the `paste` call. So the corrected line of code should be:

thread.times <- email.df$Date[which(email.df$Subject == thread | email.df$Subject == paste("re:", thread, sep=""))]

Anonymous  Apr 18, 2016 
Printed Page 114
1st code block

Sorry, I submitted the errata above but missed out a space in the corrected line of code above for the "re: " string. Correct code should be:

thread.times <- email.df$Date[which(email.df$Subject == thread | email.df$Subject == paste("re: ", thread, sep=""))]

Anonymous  Apr 18, 2016 
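The effect of the default separator can be checked on its own. The sketch assumes, as the erratum implies, that the first argument already ends in a space. The subject line is hypothetical.

```r
thread <- "meeting notes"   # hypothetical subject line

default.sep <- paste("re: ", thread)           # "re:  meeting notes" (two spaces)
empty.sep <- paste("re: ", thread, sep = "")   # "re: meeting notes"
with.paste0 <- paste0("re: ", thread)          # same as sep = ""
```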
Printed Page 141
Code block half way down

R^2 is calculated as 1 - (model.rmse / mean.rmse), but these values should be MSEs rather than RMSEs.

Source: http://en.wikipedia.org/wiki/Coefficient_of_determination defines R^2 using MSEs (it doesn't actually take means, but the divisions cancel); and the R^2 reported by `summary(fitted.regression)` gives the same value as calculating it using MSEs but not RMSEs.

Phil Hazelden  Dec 12, 2012 
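The claim can be checked numerically. This sketch uses R's built-in cars data rather than the book's data set, so the numbers are illustrative only.

```r
# Built-in `cars` data stands in for the book's data set
fit <- lm(dist ~ speed, data = cars)

model.mse <- mean(residuals(fit)^2)
mean.mse <- mean((cars$dist - mean(cars$dist))^2)

r2.from.mse <- 1 - model.mse / mean.mse
r2.from.rmse <- 1 - sqrt(model.mse) / sqrt(mean.mse)
r2.reported <- summary(fit)$r.squared
```

Here r2.from.mse agrees with the value reported by summary(), while r2.from.rmse does not.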
ePub Page 142
First paragraph of Chapter 3

The first paragraph of Chapter 3 refers to Example 3-1 as a dataset on health and ailments, but Example 3-1 is an email header for spam classification. The second paragraph also mentions blue and red dots, but there are no blue or red dots. Figure/Example 3-1 references are mismatched/missing.

Roy C  Feb 05, 2014 
Printed Page 150
2nd paragraph

The code in the 2nd sentence of the 2nd paragraph "sqrt(mean(residuals(lm.fit) ^ 2))" should be replaced by "sqrt((sum(residuals(lm.fit) ^ 2)) / 998)". The Residual Standard Error does not strictly use the mean of the squared residuals but rather the sum of the squared residuals divided by n - p (in this case 998), where p is the number of predictors in your model including intercept.

Clay Ford  Nov 24, 2012 
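The distinction between the two quantities can be sketched with the built-in cars data, used here as a stand-in for the book's data. df.residual() supplies n - p, so the hard-coded 998 is not needed.

```r
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

rmse <- sqrt(mean(res^2))                    # divides by n
rse <- sqrt(sum(res^2) / df.residual(fit))   # divides by n - p
rse.reported <- summary(fit)$sigma           # residual standard error from summary()
```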
Printed Page 152
1st code block

For `summary(lm.fit)$r.squared` done on `lm.fit <- lm(log(PageViews) ~ InEnglish, data=top.1000.sites)`, I got 0.3043425 instead of 0.03122206

Anonymous  Apr 21, 2016 
156
figure at top of the page

figure should include labels (a), (b), (c), and (d) to match caption and text.

Lorien Pratt  May 05, 2012 
Mobi Page 169
code block

This function:

get.tdm <- function(doc.vec) {
doc.corpus <- Corpus(VectorSource(doc.vec))
control <- list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE,
minDocFreq=2)
doc.dtm <- TermDocumentMatrix(doc.corpus, control)
return(doc.dtm)
}

throws this error: "Error in x$nrow : $ operator is invalid for atomic vectors"; specifically when calling TermDocumentMatrix.


rocjoe  Sep 15, 2017 
Printed Page 170
Last code block

I'm using R 3.2.5 with glmnet 2.0-5

Doing:

x <- matrix(x)
library(glmnet)
glmnet(x, y)

gives the following error:

Error in glmnet(x, y) : x should be a matrix with 2 or more columns

In this case, `x` should be a matrix with 2 columns. A matrix with the first and second column both being the original `x` vector works:

x <- as.matrix(cbind(x, x))
library(glmnet)
glmnet(x, y)

Anonymous  Apr 24, 2016 
Printed Page 175
1st codeblock

I'm using R 3.2.5 with tm 0.6-2

This line of code:

corpus <- tm_map(corpus, tolower)

will cause the following error:

Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code

when this line of code is run:

dtm <- DocumentTermMatrix(corpus)

It turns out that we have to use this instead:

corpus <- tm_map(corpus, content_transformer(tolower))

More details can be found here: http://stackoverflow.com/a/24771621

Anonymous  Apr 25, 2016 
Printed Page 183
Last paragraph and top of following page

"In this example, the a parameter is the slope of the line and the b parameter is the intercept" disagrees with the preceding code snippet and following paragraphs. "a" and "b" need swapping for it to be correct.

Jonathan Hammler  Sep 14, 2012 
186
first paragraph

"another a second" should be "a second"

Lorien Pratt  May 07, 2012 
Printed Page 200
Second paragraph

"That value turns out to be not to be numerically unstable." should read "That value turns out not to be numerically unstable" or "That value turns out to be numerically stable."

Jonathan Hammler  Sep 14, 2012 
Printed Page 207
.

While reading ch.8, page 207, I wondered why the percentages of variance added up to more than 100%.

I later found time to check with some experts:
http://stats.stackexchange.com/q/32901/5503

and got confirmation that the text is incorrect. The quoted paragraph
could be changed to:

In this summary, the standard deviations tell us how much of the
variance in the data set is accounted for by the different principal
components. Use summary(pca) to see the proportions of variance. The
first component, called Comp.1, accounts for 46% of the variance, while
the next component accounts for another 22.7%. By the end, the last
component, Comp.24, accounts for a mere 0.01% of the variance. This
suggests that we can learn a lot about our data by just looking at the
first principal component.

Anonymous  Jul 26, 2012 
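The proportions of variance can also be computed by hand from the standard deviations. The sketch uses the built-in USArrests data as a stand-in for the book's data.

```r
# Built-in USArrests data stands in for the book's data set
pca <- princomp(USArrests, cor = TRUE)

# Variance is the square of the standard deviation; shares must sum to 1
variance.share <- pca$sdev^2 / sum(pca$sdev^2)
total.share <- sum(variance.share)
```

These values correspond to the "Proportion of Variance" row printed by summary(pca).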
Printed Page 207
2nd code block

R 3.2.5 user here.

This line of code:

opts(legend.position="none")

results in the following error:

Error in eval(expr, envir, enclos) : could not find function "opts"

From what I read here: http://mfcovington.github.io/r_club/errata/2013/03/05/ch5-errata/ the `opts` function is deprecated. Changing that line of code to:

theme(legend.position="none")

works. Source: http://stackoverflow.com/a/19821839

Anonymous  Apr 27, 2016 
212
Last code section

First two lines of code should read:

comparison <- transform(comparison, MarketIndex = scale(MarketIndex))
comparison <- transform(comparison, DJI = scale(DJI))

Lorien Pratt  May 07, 2012 
216
Last paragraph

"products 2 and 3" should read "products 2 and 4"

Lorien Pratt  May 07, 2012 
218
bottom of figure

The table at the bottom of figure 9-1 should have rows titled A, B, C, D, not P1, P2, P3, P4

Lorien Pratt  May 07, 2012 
Printed Page 219
p. 219 ff.

What's the reason for invoking dist() on ex.mult and not on ex.matrix?

This will distort your scaling results.

You do the same with the Roll Call data, p.227.

Martin Schader  Feb 21, 2013 
Other Digital Version 223
3rd line

prices <- transform(prices, Date = ymd(Date)

Should be

prices <- transform(prices, Date = ymd(as.character(Date))

Anonymous  Jun 19, 2013 
224
Code at top of page

sep="" in the second line on the page should be sep="/" (at least on the Windows 7 machine on which I am testing)

Lorien Pratt  May 11, 2012 
224
Final text paragraph

"column are" should read "column names are"

Lorien Pratt  May 11, 2012 
Printed Page 242
p. 242 source code

Interesting that you recommend ten packages that the user (no. 1) has already installed.
Perhaps you should first remove the installed packages from the vector "listing".

Martin Schader  Mar 02, 2013 
Printed Page 250
Google SocialGraph API box

"supplemental files of the book that were generated by this code before the SocialGraph API occurred." should be "supplemental files of the book that were generated by this code before the change to SocialGraph API occurred."

Jonathan Hammler  Sep 14, 2012 
Printed Page 252
First paragraph (after code)

URLs should be split by a slash, not a backslash as stated. The code listing is correct, but the text is not.

Jonathan Hammler  Sep 14, 2012 
Printed Page 258
Bottom of page

Closer nodes are described as having "less hops between them". They should, of course, have "fewer hops".

Anonymous  Sep 14, 2012 
Printed Page 280
Graph

The authors apparently thought they were developing graphs for a color medium. The printed books, however, are black and white.

Many of the graphs, such as the ones on pages 276, 280, 281, etc., have two types of circles. One is rendered in grey, and the other in... grey.

This makes the graphs useless. And it's not just one graph, it's all through the book. Some graphs use symbols, and are ok. Others are grey vs. grey, and of no use at all.

Another example is on page 265. The graph is labelled "Drew Conway ego-network colored by local community structure."

The graph is monochromatic.

This is severe enough that I think I'm going to return the book, something I've never done with an O'Reilly title.

Anonymous  Mar 20, 2012