Programming Collective Intelligence

Errata for Programming Collective Intelligence

Submit your own errata for this product.

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Color Key: Serious Technical Mistake Minor Technical Mistake Language or formatting error Typo Question Note Update

Version Location Description Submitted By Date Submitted
Printed Page xix
3rd code snippet

print [v*10 for v in l1 if if v1>4] should read print [v*10 for v in l1 if if v>4]

Anonymous  May 29, 2008 
Printed Page xviii
Under List Comprehensions

print [v*10 for v in l1 if v1 > 4] should read: print [v*10 for v in l1 if v > 4]

Anonymous  Jul 19, 2008 
Printed Page xvii
Third example under the List and dictionary constructors heading

The example list is given as: string_list = ['a', 'b', 'c', 'd',] The third example under the List and dictionary heading states the following: string_list[2] # returns 'b' Actually, since [2] is the offset, and list offsets start with [0], string_list[2] should actually return 'c' not 'b' .

Anonymous  Aug 06, 2008 
Printed Page xx
line 13

v1 should be v

Anonymous  Nov 26, 2008 
Printed Page xvii
9th Paragraph

Under the Heading "Overview of the Chapters", sub-heading "Chapter 2..." The final clause of the paragraph contains the word "move" which should be "movie".

Anonymous  Dec 01, 2008 
Printed Page xviii
List comprehensions

[xviii] Python Tips: List comprehensions; The following statement is NOT valid python code. l1=[1,2,3,4,5,6,7,8,9] print [v*10 for v in l1 if v1>4] There is no variable called v1 in this scope. The code should probably look more like this: l1=[1,2,3,4,5,6,7,8,9] print [v*10 for v in l1 if v>4] Notice that the if statement was changed.

Ryan  Jan 01, 2009 
Safari Books Online NA
Section 6.4.1

c1 = classifier.classifier() c1.sampletrain() c1.fprob( 'money', 'bad' ) = 0.5 c1.weightedprob( 'money', 'bad' ) = 0.5 This is the result if we take the examples from his sampletrain(). Yet he lists a hypothetical example where there is only 1 document in the 'bad' category and comes up with these answers: weight = 1, ap = 0.5, total = 1, raw = 1 c1.weightedprob( 'money', 'bad' ) = 0.75 But his sampletrain() will yield: weight = 1, ap = 0.5, total = 1, raw = 0.5 c1.weightedprob( 'money', 'bad' ) = 0.5 If this is true, then his recurring example of c1.weightedprob( 'money', 'bad' ) = 0.75 is very misleading.

Anonymous  Jun 19, 2009 
Safari Books Online NA
Section 6.6.1

I quote: """ * clf = Pr(feature | category) for this category * freqsum = Sum of Pr(feature | category) for all the categories * cprob = clf / (clf+nclf) """ Where is 'nclf' anywhere here? The code itself says: cprob = clf / freqsum I might be mistaken but I am beginning to think that this book has been poorly edited. Some of the code have very bad names for variables, which reduces its readability. The editor probably did not take the time to properly peruse and understand this book, let alone peer review it. It appears to be a case of, "Oh, what is your book called? 'Programming Artificial Intelligence for the Web'? Oh, dear, I'm afraid that is not going to sell. How about 'Programming Collective Intelligence'? I say, that should sell it to the Twitterverse, chaps."

Anonymous  Jun 19, 2009 
Safari Books Online NA
Section 6.7.1

Persisting the classifier using SQLite is a good idea, but its implementation is terribly naive for the simple reason that primary keys for the tables are not specified. This leads to simply atrocious performance for anything but the most trivial of applications. Makes me think that the author wrote pretty much untested Python code for the book.

Anonymous  Jul 15, 2009 
Safari Books Online
addlinkref function

In the addlinkref function, the call to the separatewords function appears as separateWords rather than in all lowercase, as it defined within the text of the book. Mixing the two 'spellings' causes an error.

Marisano James  Aug 05, 2009 
Safari Books Online
chapter11, gridgame function, about halfway through

# Board wraps Should read: # Board limits as listed in the code on p. 270 of the printed version of the book (1st ed.), and described in the first paragraph on p. 269.

Marisano James  Oct 30, 2009 
Printed Page

I wanted to reaffirm the following post: Printed Page 21 1st piece of code I think the API might have changed and doesn't include the 'href' key anymore. When I ran the sample code, it choked on this line in the first piece of code on the page: for p2 in get_urlposts( p1[ 'href' ] ) The Python interpreter complained about the 'href' key: >>> delusers = initializeUserDict( 'programming' ) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "", line 8, in initializeUserDict # find all users who posted this KeyError: 'href' Lina Faller Jun 16, 2009

Freemon Johnson  Mar 25, 2011 
Printed Page xvii
5th paragraph

string_list[2] # returns 'b' should be string_list[2] # returns 'c'

Anonymous  May 26, 2011 
Printed Page xviii
3rd paragraph

example code: print [v*10 for v in l1 if v1 > 4] should be print [v*10 for v in l1 if v > 4]

Anonymous  May 26, 2011 
Other Digital Version chapter6/
Example Code - chapter6/ - Lines 22 and 26

In the example code, there's a bug in 'chapter6/'. I think you should change the code like this. Line 22: -- print 'Guess: '+str(classifier.classify(entry)) ++ print 'Guess: '+str(classifier.classify(fulltext)) Line 26: -- classifier.train(entry,cl) ++ classifier.train(fulltext,cl) (submitted by a customer via email)

Anonymous  Aug 01, 2011 
PDF Page xvi

print [v*10 for v in l1 if v1>4] should be print [v*10 for v in l1 if v>4]

Kenny Darrell  Oct 09, 2011 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version Page xv
Python Tips

I the Programming Collective Intelligence Book, in the Preface in Python Tips section - there is a mistake: ---------------- Python has a good set of primitive types and two that are used heavily throughout this book are list and dictionary. A list is an ordered list of any type of value, and it is constructed with square brackets: number_list=[1,2,3,4] string_list=['a', 'b', 'c', 'd'] mixed_list=['a', 3, 'c', 8] A dictionary is an unordered set of key/value pairs, similar to a hash map in other languages. It is constructed with curly braces: ages={'John':24,'Sarah':28,'Mike':31} The elements of lists and dictionaries can be accessed using square brackets after the list name: string_list[2] # returns 'b' ages['Sarah'] # returns 28 ---------------- In reality it will return c , as in python an array starts from 0. Have a nice day. Best regards, PhD Student Pavel Kucherbaev University of Trento, Italy

AdamZ  Aug 01, 2013 
Other Digital Version Fig 2-3
Dupree data point

Error was found in Kindle edition, copyright 2008 The Dupree data point in Figure 2-3 is incorrect. It is shown at [1,2] in the publication; however, according to the values in the "critics" dictionary (both in the book and online), it should be displayed at [3.5, 2.5].

Bill Lampeter  Jan 12, 2015 
Other Digital Version m9h05uls

<a href=>lipitor</a> <a href=>find out more</a> <a href=>albuterol</a> <a href=>how much does benicar cost</a> <a href=>generic cialis for sale</a> <a href=>cleocin gel online</a> <a href=>here</a> <a href=>buy neurontin</a>

PhyllisPag  Jan 02, 2017 
Safari Books Online 8.1
Building a Sample Dataset

Text says: "The following function generates 200 bottles of wine and calculates their prices from the model." However, the wineset1 code generates 300 bottles of wine. Fix: "The following function generates 300 bottles of wine and calculates their prices from the model."

Anonymous  May 04, 2009 
Safari Books Online 8.1
Building a Sample Dataset

The wineprice function in text does not match online source. In text: # Past its peak, goes bad in 5 years price=price*(5-(age-peak_age)) Online: # Past its peak, goes bad in 10 years price=price*(5-(age-peak_age)/2)

Anonymous  May 04, 2009 
Safari Books Online 8.1
Building a Sample Dataset

The noise calculation in the wineset1 function in the text does not match the online source. In text: "It then randomly adds or subtracts 20 percent..." # Add some noise price*=(random()*0.4+0.8) Online: # Add some noise price*=(random()*0.2+0.9)

Anonymous  May 04, 2009 
Safari Books Online 8.3.3
Gaussian Function definition

The default value for sigma is printed as 10.0, but this should be 1.0 in order to match the graph and the printed results. Wrong: def gaussian(dist,sigma=10.0): return math.e**(-dist**2/(2*sigma**2)) Right: def gaussian(dist,sigma=1.0): return math.e**(-dist**2/(2*sigma**2))

Anonymous  May 04, 2009 
Safari Books Online 8.4

Typo in text and difference between crossvalidate function source in text and online source. 1. Typo: "Typically, the test set will be a small portion, perhaps 5 percent of [the] all the data..." 2. Crossvalidate source change: In crossvalidate function source, note that default value of test parameter is 0.5 in the text, but is 0.1 in the online source.

Anonymous  May 06, 2009 
Safari Books Online 8.5
Heterogeneous Variables

Difference in bottle size between source code and text: In text: "Unlike the variables you've used so far, which were between 0 and 100, its range would be up to 1,500." In source code in text, however, the bottle size ranges up to 3000. (However, in the source code online, the maximum bottle size is 1500.)

Anonymous  May 06, 2009 
Safari Books Online 8.6
Optimizing the Scale

Two differences between text and online source: 1. In createcostfunction source in text, trails=10, whereas trails=20 in online source. 2. In weightdomain definition in text, 20 is the maximum weight: "...let's restrict them to 20 for now." whereas the maximum weight in online source is 10.

Anonymous  May 06, 2009 
Safari Books Online 8.6
Optimizing the Scale

The text refers to the geneticoptimize function, but this function is not defined in (online). "You can also try the slower but often more accurate geneticoptimize function and see if it returns similar results..." However, there is a swarmoptimize function defined in

Anonymous  May 06, 2009 
Safari Books Online 8.7
Uneven Distributions

There is no createhiddendataset function but there is a wineset3 function: "The createhiddendataset function creates a dataset that simulates these properties." Should be: "The wineset3 function creates a dataset that simulates these properties."

Anonymous  May 06, 2009 
Safari Books Online 8.7.2
Graphing the Probabilities

The input vector in the example below is wacky: >>> numpredict.cumulativegraph(data, (1,1), 120) To match the graph in Figure 8-10, try the following: >>> numpredict.cumulativegraph(data,[99,20],120) (Page 186 in printed version?)

Anonymous  May 06, 2009 
Safari Books Online 8.7.2
Graphing the Probabilities

The input vector and the high value are wacky: >>> numpredict.probabilitygraph(data,(1,1),6) To match Figure 8-11, try the following: >>> numpredict.probabilitygraph(data,[99,20],120) (Page 186?)

Anonymous  May 06, 2009 
Safari Books Online 8.9
When to Use k-Nearest Neighbors

Typo: s/observation/observations/ "It's also easy to interpret exactly what's happening because you know it's using the weighted value of other observation[s] to make its predictions."

Anonymous  May 07, 2009 
Printed Page 10
after 3rd and 4th paragraph

The examples read: >> from math import sqrt >> sqrt(pow(5-4,2)+pow(4-1,2)) 3.1622776601683795 >> 1/(1+sqrt(pow(5-4,2)+pow(4-1,2))) 0.2402530733520421 but the values are wrong, they should be: >> from math import sqrt >> sqrt(pow(1-2,2)+pow(4.5-4.0,2)) 1.1180339887498949 >> 1/(1+sqrt(pow(1-2,2)+pow(4.5-4.0,2))) 0.47213595499957939

paulo  Aug 26, 2009 
Printed Page 10
code just below 3rd paragraph

Pythagoras' theorem: length^2 = (y2-y1)^2 + (x2-x1)^2 length = sqrt((y2-y1)^2 + (x2-x1)^2) If each x,y pair coordinate is (value of Dupree, value of Snakes) Toby's x,y pair coordinate is (1,4.5) LaSalle's x,y pair coordinate is (2,4) The distance between Toby and LaSalle is: >>> sqrt(pow(4.5-4,2)+pow(1-2,2)) 1.1180339887498949 >>> You can just visually inspect the chart and see the distance between Toby and LaSalle is not even close to 3.16227....

Edward Welling  Mar 15, 2010 
PDF Page 10
Between 3rd and 4th paragraph (code example)

The author is taking the hypotenuse between two (a,b) coordinates. The formula written out is: sqrt(pow(5-4,2)+pow(4-1,2)) Toby's coordinates are (4,5, 1) ~ His score of films "Snakes on a Plane" and "Me and Dupree". Lasalle's corresponding coordinates are (4, 2). Additionally, the difference in each axis should be expressed as an absolute value so: sqrt(pow(fabs(4.5-4),2)+pow(fabs(1-2),2))

Daniel Sikar  Apr 13, 2010 
PDF Page 10
Between 3rd and 4th paragraph (code example)

In addition to my previous submission, the fabs function does not need to be used since the difference is being squared, so my suggested edit: sqrt(pow(4.5-4,2)+pow(1-2,2))

Daniel Sikar  Apr 13, 2010 
Printed Page 11
2nd code block

reload(recommendations) is performed before the recommendations module has ever been loaded.

Anonymous  Jun 20, 2008 
Printed Page 11
last line of code example

The last line of the sim_distance function: "return 1(1+*sqr*t(sum_of_squares))" should be "return 1(1+sqrt(sum_of_squares))"

Anonymous  Jun 21, 2008 
Printed Page 11
1st example function

The 'confirmed' errata for this mistake was incorrect. shows return 1/(1+sum_of_squares) should be return 1/(1+*sqr*t(sum_of_squares)) when in fact return 1/(1+sum_of_squares) should be return 1/(1+*sqrt(sum_of_squares)) And as a result of that change, the output of the function when run shown in the next code block on the page should have a result of 0.294298055085549 not 0.148148148148148

Anonymous  Sep 26, 2008 
Other Digital Version 11
refer to the online error report 'Changes made in the 3/08 printing'

In the error fix code on your page called: Changes made in the 3/08 printing at this link: The error report reads like this: {11} last line of code sample; return 1/(1+sum_of_squares) should be return 1/(1+*sqr*t(sum_of_squares)) and the last line should be this: return 1/(1+sqrt(sum_of_squares))

Anonymous  Nov 28, 2008 
Printed Page 11
Result of execution of code example (sim_dstance btw 'Lisa Rose' and 'Gene Seymour')

Book gives sim_distance between Lisa and Gene of 0.148148148148. My result was 0.29429805508554946; verified with calculator.

Justin Middleton  Apr 05, 2009 
Printed Page 11
last line of code AND python output

last line of code reads: return 1/(1+*sqr*t(sum_of_squares)) Should read return 1/(1+sqrt(sum_of_squares)) AND Result is given below as 0.148148148148 This is equivalent to 1/(1+5.75) 5.75 is sum_of_squares NOT sqrt(sum_of_squares) Result should be 0.294298055086

Ian Ford  May 06, 2009 
Printed Page 11
In the 4th paragraph from the bottom on page 11

It seems that the Euclidean distance-based similarity score between 'Lisa Rose' and 'Gene Seymour' should be 0.294298

Daqing Chen  Jul 18, 2009 
Printed Page 11
numerical result (0.148148...)

This is (supposedly) the result of using the function 'sim_distance' as it appears on this page, but it actually uses the statement return 1/(1 + sum_of_squares), rather than the (correct) statement return 1/(1 + sqrt(sum_of_squares)). Apparently, the error is the result of *using* the wrong version of the function, while the correct version is *printed* .

Yehiel Milman  Sep 08, 2009 
Printed Page 11
Bottom line of python code section describing the function sim_distance

The sim_distance function should return 1/(1+sqrt(sum_of_squares)) rather than 1/(1+sum_of_squares) when inverting the Euclidean distance. The formula for Euclidean distance includes a square root, but the square root is never taken of the sum of all the squares in this segment of code.

Thea  Sep 10, 2009 
Printed Page 11
Euclidian Distance Score code snippet

In the code snippet for sim_distance: return 1/(1+sqrt(sum_of_squares)) Shouldn't this be return 1/(1+sqrt(sum_of_squares/len(si))) Also, in the example code zip on the companion website, this is return 1/(1+sum_of_squares) which should also be return 1/(1+sqrt(sum_of_squares/len(si))) I guess this error has no effect when getting recommendations for a single movie or critic; the resulting similarity values deviate, but keep the same order. However, this error has serious effect when ordering similarities over movies/critics. This is done in getting recommendations item based, on page 24. Am I correct? (If I'm not, please excuse me for this sunday morning observation *yawn* )

Koen Mannaerts  Sep 19, 2009 
Safari Books Online 11
Last code example

The correct similarity score is 0,294298055085 instead of 0.148148148148.

Anonymous  Dec 05, 2009 
Printed Page 11
after second paragraph

The numerical output values of some examples, in chapter 2, including the above seems to be due to an error in the Examples (PCI_Folder), chapter2,, line 38. Wrong: return 1/(1+sum_of_squares) Right : return 1/(1+sqrt(sum_of_squares))

Anonymous  Mar 30, 2010 
PDF Page 11
Second code block

# One would have had to execute and 'import' prior to a 'reload' ... >>> import recommendations >>> reload(recommendations)

Anonymous  Sep 01, 2010 
Printed Page 11
1st code snippet

>>> recommendations.sim_distance(recommendations.critics, 'Lisa Rose', ... 'Gene Seymour') 0.148148148148 should be: >>> recommendations.sim_distance(recommendations.critics, 'Lisa Rose', ... 'Gene Seymour') 0.294298055085 The output is wrong, but the sim_distance function is actually correct. My guess is the author forgot to update his code to reflect the code in the book. This error is also present in the downloadable version of the source code. It can easily be fixed by replacing the return statement in the sim_distance function in Replace: return 1/(1+sum_of_squares) with: return 1/(1+sqrt(sum_of_squares))

Roshan Patel  Oct 31, 2010 
Printed Page 11
second para

recommendations.sim_distance(recommendations.critics, 'Lisa Rose', 'Gene Seymour') results in a similarity score of: 0.29429805508554946. >>> 1/(1+sqrt(5.75)) #where 5.75 is the 'sum_of_squares' 0.29429805508554946 Not taking the square root of the 'sum_of_squares' will result in the value printed in the book: >>> 1/(1+5.75) 0.14814814814814814

Anonymous  Nov 11, 2010 
PDF Page 11
1st code block

Earlier on in the chapter, the Euclidian Distance score is defined as: 1/(1+sqrt(sum_of_squares)) However, the code example on page 11 leaves out the square root. ---- snip ----- from math import sqrt # Returns a distance-based similarity score for person1 and person2 def sim_distance(prefs,person1,person2): # Get the list of shared_items si={} for item in prefs[person1]: if item in prefs[person2]: si[item]=1 # if they have no ratings in common, return 0 if len(si)==0: return 0 # Add up the squares of all the differences sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in prefs[person1] if item in prefs[person2]]) return 1/(1+sum_of_squares) ---- end snip ---- The last line of this snippet should be: return 1/(1 + sqrt(sum_of_squares)) In addition, the result in the interpreter output for the code example contains the incorrect value, since the formula was wrong. By the way, great book so far, I'm really enjoying it.

Abraham Lee  Jun 09, 2011 
Printed Page 11
sim_distance function

The square root is not applied in the sim_distance function as is described on the previous page (page 10) in the section titled Euclidean Distance Score. E.g. The last line of the function is: return 1/(1+sum_of_squares) but possibly could/should be: return 1/(1+sqrt(sum_of_squares)) If this is an error/omission it also follows through to the example usage of the function, also on page 11 directly beneath the function code and the example code download accompanying the book. If it is not an error/omission then I think an explanation should be included explaining why the square root is not applied; As it appears to contradict the explanation of how to calculate Euclidean Distance on page 10.

Anonymous  Oct 18, 2011 
Other Digital Version 11
end of sim_distance function

It says: "return 1/(1+sqrt(sum_of_squares))" If you run this, you get a different result than stated in the book. The downloaded book code samples say: "return 1/(1+sum_of_squares)" Which gives the proper results, and the right results as shown in the book.

Anonymous  Dec 29, 2011 
Printed Page 11
2nd Para

On page 11, the author computes recommendations.sim_distance(recommendations.critics,'Lisa Rose', 'Gene Seymour') as 0.148148148148. When I programmed his algorithm into Python 2.7.3 as well as implemented it in MS Excel I get 0.294298055086. I can send the author a spreadsheet, as well as my Python implementation if needed.

Salil Athalye  Nov 08, 2012 
Printed, PDF Page 11
Top half of the page

I'm seeing three different versions of the code on this page. Most of them are giving an output of : "Out[3]: 0.29429805508554946" The sample code (which differs from the code given in the book) is the only code that gives the published output of: "0.14814814814814814" Which is correct?

Anonymous  Mar 17, 2013 
Printed Page 11
Euclidean Distance Score code

sum_of_squares in downloadable code (correct) sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in prefs[person1] if item in prefs[person2]]) sum_of_squares in printed code (incorrect) sum_of_squares = sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in si])

Curie  Jul 01, 2015 
Printed Page 11
Euclidean Distance Score code

sum_of_squares in downloadable code-- yields (0.14814814814814814) sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in prefs[person1] if item in prefs[person2]]) return 1/(1+sum_squares) sum_of_squares in printed code -- yields (0.29429805508554946) sum_of_squares = sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in si]) return 1/(1+sqrt(sum_squares))

Curie  Jul 01, 2015 
Printed Page 11
12th line in code

If there is nothing found, it is conventional to return -1 meaning not found. Like that it can be filtered out and otherwise it could tamper the results.

Orion  Jul 06, 2015 
Printed Page 11
last line in code

The return statement reads: return 1/(1+sum_of_squares) It should read: return 1/(1+sqrt(sum_of_squares)) Reason: After calculating the sum of each squared value, the square root of it should be calculated. Conclusion: The 'from math import sqrt' is not used and should be used as i have shown in the corrected version. Therefore the given output is also incorrect.

Orion  Jul 06, 2015 
PDF Page 12
United States

In Figure 2-3 on page 12, Dupree is plotted with a score of 1 from Jack Matthews, which doesn't match the score specified earlier (3.5).

Stephen Dewey  Nov 08, 2013 
Printed Page 13
sim_pearson code fragment

Definition of sim_pearson is subject to integer math errors, at least in Python 2.5.1. The number of overlapping elements is saved as "n = len(is)". This is used later to determine the numerator of the formula. It turns out this can allow the function to return values > 1, for this reason: 13 * 11 / 3 ==> 47 whereas: 13 * 11 / 3.0 ==> 47.666666666666664 Changing the initial line to be "n = float(len(si))" is enough to prevent this. As someone in a separate errata noted, the function should also return 1.0 if the denominator is 0.

Anonymous  Nov 14, 2008 
PDF Page 13
in the sample code

# if they are no ratings in common, return 0 if n==0: return 0 must be: # if they are no ratings in common, return 0 if n==0: return -1 because sim_pearson returns -1 if no match found, not 0 - zero means something about 50/50 matching, and this will give false results in all futher functions what use it, like topMatch e.g. at page [14]

Anonymous  Mar 20, 2009 
PDF Page 13
Code block (3rd or 4th paragraph depending on definition)

There were multiple problems with the code on this page. I answered this question: which provides working code for the example.

Raphael  Nov 25, 2012 
PDF Page 13
Second Half

The Pearson Algorithm doesn't work properly. I have pasted the working code for this section here at

Siddharth Menon  Nov 26, 2012 
Printed Page 13
11th line in code (in the comment)

It reads: #if they are no ratings in common, return 0 It should read: #if they have no ratings in common, return ... Reason: It is grammatically incorrect and not consistent with the code for the Euclidean Distance Score on page 11 as there it reads: #if they have no ratings in common, return 0 Conclusion: "are" should be replaced with "have" so it is consistent and grammatically correct

Orion  Jul 06, 2015 
Printed Page 14
first paragraph of "Ranking the Critics"

"learning which movie critics have tastes simliar to mine" ^^^^^^^

Anonymous  Jun 09, 2008 
Printed Page 15
Table 2-2

FinalScore = Total/Sim.Sum This will remove (or reduce) the weight of similarity. Let's imagine this scenario, I'm X A,B,C is very similar to X, let's say ABC has 0.9 similarity to X D,E,F is quite different from X, let's say DEF has 0.1 similarity to X A,B,C all rated Movie1 4.5 D,E,F all rated Movie2 4.6 So for Movie1 we get (4.5*0.9+4.5*0.9+4.5*0.9)/(0.9+0.9+0.9)=4.5 for Movie2, we get 4.6 which > Movie1 But I think this is a improper recommendation, as X should listen more to ABC, and ABC recommend Movie1. I think it may be better to calculate FinalScore = Total/sqrt(Sim.Sum) With this, we also can get consistent recommendation for the two examples in this book (Table 2-2 @page 15, and Table 2-3 page 24) Thanks,

Fuchen Ying  Nov 08, 2009 
PDF Page 16

As person is not in the list, the second condition in the original code will never be fulfilled: # only score movies I haven't seen yet if item not in prefs[person] or prefs[person][item]==0: The code may ignore the zero scored items by the "other" person using this code: # only score movies I haven't seen yet that contribute to the recommendation if item not in prefs[person] or prefs[other][item]!=0.0:

ifernando  Jan 06, 2017 
Printed Page 20

In the module, an exception is raised when "feedparser" is imported.

Anonymous  Oct 28, 2008 
PDF Page 20
5rd paragraph

New API uses "url" field instead of "href" field. All the text referring to "href" field should be updated.

Jingguo Yao  Nov 14, 2010 
Safari Books Online 20
United States

Upon running the install script for version 0.6, I get 'ImportError: No module named etree.ElementTree' How do you get the API to work??? There is absolutely no documentation for it. The documentation folder is empty and there is no documentation on the code download site. It doesn't even say which version of Python is needed! You should add the version of the API that works for you. "To make things even easier for you, there is a Python API that you can down-load from or" I clicked on the book link to but did not find the API code there. I downloaded the book code and did not find it there either!

George Ulmer  Nov 26, 2011 
ePub Page 20
sim_distance result for 'Lisa Rose' and 'Gene Seymour'

The similarity score of 0.148148148148 given for Lisa Rose and Gene Seymour is wrong. It should be 0.29429805508554946.

Bob Felts  Oct 25, 2012 
Printed Page 21
1st piece of code

I think the API might have changed and doesn't include the 'href' key anymore. When I ran the sample code, it choked on this line in the first piece of code on the page: for p2 in get_urlposts( p1[ 'href' ] ) The Python interpreter complained about the 'href' key: >>> delusers = initializeUserDict( 'programming' ) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "", line 8, in initializeUserDict # find all users who posted this KeyError: 'href'

Lina Faller  Jun 16, 2009 
PDF, Other Digital Version Page 21
Fourth line

The code: url=post['href'] Is no longer correct and generates a key error. It looks like href has been changed to url. Changing the line to this: url=post['url'] resolves the issue.

Paul Anderson  Mar 27, 2011 
Printed Page 21
top of page

I've read that the 'href' key was renamed to 'url' by the Delicious people, but even when I change the code in line 8 from for p2 in get_urlposts(p1['href']) to: for p2 in get_urlposts(p1['url']) I still got the same error: File "", line 8, in initializeUserDict for p2 in get_urlposts(p1['url']): TypeError: string indices must be integers, not str So... not sure where we stand on this, but any help would be greatly appreciated.

Morgan Thompson  May 16, 2013 
Printed Page 25
2nd code sample

The noted mistake on pg 11 with regards to calculating the similarity based on the distance metric propagates to this code sample as well. The numbers are slightly incorrect. They should really be: [(3.1667425234070894, 'The Night Listener'), (2.936629402844435, 'Just My Luck'), (2.868767392626467, 'Lady in the Water')]

radoshi  Oct 21, 2011 
Printed Page 31
2nd code sample

The code to split text into words uses this regular expression: [^A-Z^a-z] the above includes caret ('^') in the set with the alphabetic characters, so caret will not be used to split words. For example: >>> re.compile("[^A-Z^a-z]").split("foo^bar") ['foo^bar'] The correct regex is [^A-Za-z]: >>> re.compile("[^A-Za-z]").split("foo^bar") ['foo', 'bar']

Anonymous  Nov 17, 2009 
Printed Page 31
first code snippet

In the first for loop of the first code snippet: Instead of: for e in d.entries: if 'summary' in e: summary=e.summary

Roshan  Nov 02, 2010 
Printed Page 31
First codeblock

The return d.feed.title,wc will throw an AttributeError if any of the feed urls in feedlist.txt no longer exist. This can be catered for by catching the exception as follows: try: return d.feed.title,wc except AttributeError: return "nofeed",wc

Anonymous  Dec 16, 2010 
Printed Page 32
code listing after 3rd paragraph and another code listing following it

1. In the main loop the first line with function call title,wc=getwordcounts(feedurl) generates an uncaught exception: TypeError: 'NoneType' object is not iterable Adding the try..except fixes the problem: 2. In the second code listing feedlist variable is not defined

Dmitry Kan  Feb 13, 2010 
Printed Page 32
1st code block, 1st line in the for loop

The line feedlist.add(feedurl) throws the following error: AttributeError: 'list' object has no attribute 'add' This can be resolved by replacing the code above by: feedlist.append(feedurl)

Peter Tegelaar  Jan 07, 2011 
Printed Page 32
1st code snippet

feedlist.add(feedurl) add method doesn't exist. should probably read feedlist.append(feedurl)

Anders Jansson  Nov 17, 2012 
Printed Page 32
3rd code snippet

blog = blog.encode(`ascii`, `ignore) it's not single quotes, and the terminating quote on ignore is missing.

Anders Jansson  Nov 17, 2012 
Printed Page 32
7th line of last code block on page

Missing closing single quote for 'ignore' parameter on line that reads: blog = blog.encode('ascii', 'ignore)

Alan Esenther  Dec 20, 2012 
Printed Page 32
Top code snippet for ""

The apcount map is supposed to contain "..the number of blogs each word appeared in", yet the code contains the following statement: for word,count in wc.items(): apcount.setdefault(word,0) if (count>1): apcount[word]+=1 in other words, apcount actually contains the number of blogs each word appeared in _more than once_. This is probably not what the author wants, and I propose the following change to the last two lines of code: if (count>0): apcount[word]+=1

Jens Roland  Mar 08, 2015 
Printed Page 35
Near end of def pearson, about 2/5 down the page

The Pearson correlation coefficient error that anonymous first identified on p. 13 also appears here. The line reading, num=pSum-(sum1*sum2/len(v1)) should read num=pSum-(sum1*sum2/float(len(v1))) to avoid integer roundoff errors.

Marisano James  Apr 26, 2010 
Printed Page 35
Near end of def pearson, about 2/5 down the page

This Pearson correlation coefficient function contains another slight error: den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1))) should read den=sqrt((sum1Sq-pow(sum1,2)/float(len(v1)))*(sum2Sq-pow(sum2,2)/float(len(v1)))) to avoid integer roundoff errors.

Marisano James  Apr 26, 2010 
Printed Page 35
last line of pearson(v1, v2)

The Pearson correlation function on pg. 35 should end in "return num/den", not "return 1.0-num/den". On pg. 13, it ends in: r=num/den return r which is correct.

Scott Vokes  Nov 27, 2010 
Printed Page 44
Inset commands after 3rd paragraph, about 3/5 down the page

[rownames[r] for r in k[0]] should read, [blognames[r] for r in kclust[0]] Likewise, [rownames[r] for r in k[1]] should read, [blognames[r] for r in kclust[1]]

Marisano James  Apr 26, 2010 
PDF Page 44
4th paragraph code

A try catch block can be put in the code, as when the program fails to load data for a user posts will become undefined in the ensuing portion of code. def fillItems(user_dict): all_items={} # Find links posted by all users for user in user_dict: print user; for i in range(3): try: posts=get_userposts(user) break except: print "Failed user "+user+", retrying" time.sleep(4) try: // put this try-catch block additionally for post in posts: url=post['href'] user_dict[user][url]=1.0 all_items[url]=1 except: print 'no posts'

Anonymous  May 16, 2013 
Printed Page 46
First line of second paragraph of text.

In this line: "Once this is done, the code first has to create a list of items than more than five people want, ..." change "five" to "ten". Alternatively, change 10 to 5 in the code listing following this paragraph.

Alan Esenther  Dec 20, 2012 
Other Digital Version 49
chapter 2, def sim_distance(prefs,person1,person2)

I have a electronic copy (Kindle) so I don't have the page number. In chapter 2: >>> recommendations.sim_distance(recommendations.critics,'Lisa Rose','Gene Seymour') 0.148148148148 The code in the book is different from the code that is available in the Examples file I just downloaded. The value presented in the book is 0.148148148148 and is the result of the buggy version in the example code that lacks the final call to sqrt. The value in the book should be 0.294298055086 which is the result of using the code in the book. Please tell me which code is correct, the one in the book or the one in the Examples (

Gabriel Becedillas  Jan 18, 2010 
Printed Page 50
Very bottom of page

The variable "outersum" is defined but never used.

Anonymous  Mar 17, 2009 
Printed Page 58
3rd paragraph code

The code on page 58 (for did not work for me (using python 2.7 and beautifulsoup 4). I had to change the code to: links=soup.find_all('a') for link in links: if (link.has_attr('href')): hope that helps!

Ahmad Al Ben Ali  May 18, 2013 
Printed Page 62
2nd paragraph (suggested), just after def getentryid

The first edition of the book does not include the working definition for the addlinkref function. That is, as printed, the function only contains the pass statement, and unfortunately is never updated. This causes the code in the Inbound Links section to fail (particularly the PageRank algorithm). I suggest that the completed code be placed on p. 62, just after the getentryid definition. [The full function definition does appear in the file available via the Examples link, however.]

Marisano James  Aug 05, 2009 
Printed Page 62
second block of code

words=self.separateWords(linkText) should read words=self.separatewords(linkText) with a lowercase w

Quentin Groom  Dec 07, 2010 
PDF Page 63

When using a query not in the database I receive the following error >>> e.query('gorgonzola') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "", line 223, in query rows,wordids=self.getmatchrows(q) File "", line 201, in getmatchrows cur=self.con.execute(fullquery) pysqlite2.dbapi2.OperationalError: near "where": syntax error I intend to find a solution.

Lola Montes  Jun 25, 2013 
PDF Page 67
First Paragraph in sub-topic Cluster of Preferences

The website seems to be no longer operational. Please take note of this.

Anonymous  May 16, 2013 
Printed Page 68
1st code segment of the "Word Distance" section; 3rd from last line

The line dist=sum([abs(row[i]-row[i-1]) for i in range(2,len(row))]) contains an error that renders the code substantially less useful - often almost useless. Recall that Python indexing begins at zero. Thus for the line to work appropriately, it should read: dist=sum([abs(row[i]-row[i-1]) for i in range(1,len(row))]) Note that the lower bound of the range has been decremented by one. Summing across the word distances is correct, however (as explained in the text), unlike what was anonymously reported as an error on p. 69 below.

Marisano James  Aug 08, 2009 
Printed Page 68
1st code segment of the "Word Distance" section

Saw this a little while back. Ilya Verbitskiy writes that my errata and submitted fix are incorrect. I am no longer able to download the data to re-check my submitted error and fix (and I don't recall enough to make my own test case), but what I was saying is that although urlid is the first element, indexing begins at zero, so the element in the first (rather than the "zeroeth" position) is a word with some distance to other words the user queried. At the time I tried the code with and without the updated offset and surveyed the output of each as well.

Marisano James  May 22, 2016 
Printed Page 69
1st code segment (4th line down)

dist = sum([abs(rows[i] - row[i-1) for row in range(2, len(ow))]) This makes no sense. Presumably sum should be min. This is a lovely book, but I agree with an earlier submitter, very poorly proofread. This is especially the case as the code isn't commented. Even if the code were error free, it is quite terse and hard for a non Python expert to read. The errors really detract from readability.

Anonymous  Jul 28, 2009 
Printed Page 69
1st code segment of the "Word Distance" section

What Marisano James said is not right. Remember, the first element in the row tuple is the urlid. So it makes no sense to do the following: dist=sum([abs(row[i]-row[i-1]) for i in range(1,len(row))]). The code in the book is right, because the urlid of the row tuple must be ignored and the beginning is the second element in the row tuple (index 1).

Ilya Verbitskiy  Feb 17, 2012 
Printed Page 70
Figure 4-3

Page C has only three links to other pages. It should have four links to other pages.

Jarno Mielikainen  Dec 09, 2008 
Printed Page 70
Figure 4-3

The text states "C links to four other pages", but in the figure are only 3 additional arrows besides the one pointing to A. In the computation the total number of links on C is 5. Just a inconsistency.

Anonymous  Dec 19, 2008 
Printed Page 70
Last Paragraph

Looking at Figure 4-3, Page B and page C both have three links going to pages other than A. The text in the final paragraph states "B also has links to three other pages and C links to four other pages". Essentially, it looks like there should be an extra arrow coming out of C in the diagram to bring the total number of arrows to four.

Bryce Thomas  Apr 11, 2009 
Printed Page 70
Figure 4-3

Figure 4-3. Calculating the PageRank of A Page C links to A and four other pages the text says. This is confirmed by the calculation for PR(A) on page 71 where PR(C)/links(C) is defined as 0.7/5. The figure, however, only displays C's link to A and THREE other pages, which is correct for B, not C.

Maurice Jutte  Jan 31, 2011 
Printed Page 70
Figure 4-3

The outgoing links (arrows) from page C are wrong. In the text you read that "... and C links to four other pages.". Thus, the arrows going out in the drawing from page C should be 4 and not 3. This is also confirmed by the formula to compute PR(A) few lines below, where for page C you can read: PR(C)/links(C) and this is 0.7/5. Thanks, Christian

pezzin  Nov 08, 2012 
Printed Page 71
Beginning of code section, calculatepagerank

To be most effective, a call to self.dbcommit() should be inserted after the self.con.execute('drop table if exists pagerank') call (i.e. just after the first command of the calculatepagerank function). This permits calculatepagerank to be called multiple times in a single python session without the database returning an error stating that it already contains a pagerank table, and subsequently remaining locked due to the error.

Marisano James  Aug 05, 2009 
Printed Page 72
Second example down

the statement: for i in range(3): print does nothing, why have a for loop if you are not going to use the index? Actually this can be fixed, and there are a lot of other typos in this book, to anyone looking for solutions here is what I did, I hope this saves you some time 1. look at the example code available to download, it is sometimes different from the code in the book 2. check here to see if anyone else had the same problem as you

Peter Harrington  Feb 10, 2010 
Printed Page 79
#node outputs block and #hidden activations block

The list variable ai is a list of 1's. The value of ai never changes and it is only used to multiply against another list. That multiplication is redundant as it just results in the original list. Also, is created twice. Once in the middle of setupnetwork and at the start of feedforward. It is never changed and should result in the same value since setupnetwork is always run right before feedforward.

Anonymous  Jun 27, 2011 
Printed Page 80
5th Paragraph (Python console)

For the output of .getresult() (which comes from .feedforward()), it states: [0.76,0.76,0.76] However, I got: [0.076,0.076,0.076] Since these values are produced by TANH and the values are off by exactly one order-of-magnitude, I find it unlikely to be a decimal problem somewhere in the code, and more likely that it's a simple misprint.

Dustin Oprea  Aug 24, 2010 
PDF Page 81
last lines of last two paragraphs of the codes

The formulas to calculate output_deltas and hidden_deltas are wrong. According to the Delta Rule (, either these formulas should be error divided by dtanh, or the dtanh itself should be 1/(1-y*y).

Eric Wang  Mar 21, 2009 
Printed Page 81
1st paragraph

The derivative of tanh(y) is defined as 1-y*y. It is unclear how the author arrives at this result. Taking the Taylor expansion of tanh(y) around 0 results in y-y^3/3+(2 y^5)/15-(17 y^7)/315+(62 y^9)/2835-(1382 y^11)/155925+O(y^12). Throwing away the higher order terms and differentiating, we arrive at tanh'(y)=1-y^2, a result which holds around y=0.

Peter Tegelaar  May 05, 2011 
Other Digital Version 87
2nd paragraph - code

When I type code below into my Notepad (Win32) from, I get the following error? flights={} # for line in file('schedule.txt'): origin,dest,depart,arrive,price=line.strip().split(',') flights.setdefault((origin,dest),[]) Error message: 'ValueError: need more than 1 value to unpack'. I've tried everything and I can't seem to get around this error. I'm new so maybe I'm doing something wrong??

Anonymous  Jul 01, 2012 
Printed, Safari Books Online Page 87
3rf Paragraph (ignoring code sample)

The paragraph references a link to retrieve the sample data: But this is a dead link. It looks like all the data (for this and other chapters) has been moved to the sample code zip file but there's no reference to this on the support page for the book.

Russell  Oct 26, 2014 
Printed Page 88
Code definition for printschedule method

for d in range(0, len(r), 2): should be for d in range(0, len(people)): Iterating over the schedule is incorrect, the list of people should be iterated over. This is why the out and ret variables are retrieved via r[2*d] and r[2*d+1], respectively. As is, this causes an index out of bounds exception since the length of the schedule is twice the length of the number of people.

Jakob Homan  Jun 15, 2009 
Printed Page 88,90
Throughout schedulecost and printschedule method definitions

All instances of determining the return flight (returnf and ret variables) should have their destination and origin assignments switched, in order to find a flight back from LGA. As printed, the code finds the same flight for both origin and return.

Jakob Homan  Jun 15, 2009 
Printed Page 88
bottom of page

The index for the line beginning with out= should be [int(r[2*d])] rather than [r[d]], and the index for the line beginning with ret= should be [int(r[2*d+1])] rather than [r[d+1]]. There is no need to change the range of the for loop in the printschedule function, however, or to reverse any of the destination and origin assignments, as was erroneously reported elsewhere in this errata list.

Anonymous  Nov 10, 2009 
PDF Page 88-94 def printschedule(r): for d in range(len(r)/2): name=people[d][0] origin=people[d][1] out=flights[(origin,destination)][int(r[2*d])] # corr ret=flights[(destination,origin)][int(r[2*d+1])] # corr print '%10s%10s %5s-%5s $%3s %5s-%5s $%3s' % (name,origin, out[0],out[1],out[2], ret[0],ret[1],ret[2]) def schedulecost(sol): totalprice=0 latestarrival=0 earliestdep=24*60 for d in range(len(sol)/2): # Get the inbound and outbound flights origin=people[d][1] outbound=flights[(origin,destination)][sol[2*d]] # corr returnf=flights[(destination,origin)][sol[2*d+1]] # corr # Total price is the price of all outbound and return flights totalprice+=outbound[2] totalprice+=returnf[2] # Track the latest arrival and earliest departure if latestarrival<getminutes(outbound[1]): latestarrival=getminutes(outbound[1]) if earliestdep>getminutes(returnf[0]): earliestdep=getminutes(returnf[0]) # Every person must wait at the airport until the latest person arrives. # They also must arrive at the same time and wait for their flights. totalwait=0 for d in range(len(sol)/2): origin=people[d][1] outbound=flights[(origin,destination)][sol[2*d]] # corr returnf=flights[(destination,origin)][sol[2*d+1]] # corr totalwait+=latestarrival-getminutes(outbound[1]) totalwait+=getminutes(returnf[0])-earliestdep # Does this solution require an extra day of car rental? That'll be $50! if latestarrival<earliestdep: totalprice+=50 # corr return totalprice+totalwait def randomoptimize(domain,costf): best=999999999 bestr=None for i in range(0,1000): # Create a random solution r=[random.randint(domain[i][0],domain[i][1]) # corr for i in range(len(domain))] # Get the cost cost=costf(r) # Compare it to the best one so far if cost<best: best=cost bestr=r return bestr # corr def hillclimb(domain,costf): # Create a random solution sol=[random.randint(domain[i][0],domain[i][1]) for i in range(len(domain))] # Main loop while 1: # Create list of neighboring solutions neighbors=[] for j in range(len(domain)): # One away in each direction if sol[j]>domain[j][0] and sol[j]<domain[j][1]: # corr neighbors.append(sol[0:j]+[sol[j]+1]+sol[j+1:]) neighbors.append(sol[0:j]+[sol[j]-1]+sol[j+1:]) if sol[j]==domain[j][0]: # corr neighbors.append(sol[0:j]+[sol[j]+1]+sol[j+1:]) if sol[j]==domain[j][1]: # corr neighbors.append(sol[0:j]+[sol[j]-1]+sol[j+1:]) # See what the best solution amongst the neighbors is current=costf(sol) best=current for j in range(len(neighbors)): cost=costf(neighbors[j]) if cost<best: best=cost sol=neighbors[j] # If there's no improvement, then we've reached the top if best==current: break return sol def annealingoptimize(domain,costf,T=10000.0,cool=0.95,step=1): # Initialize the values randomly vec=[random.randint(domain[i][0],domain[i][1]) # corr for i in range(len(domain))] ----------------------------------------------------------- I haven't looked at the geneticoptimize() function, but it also seems problematic. Executing it I get 4616 4282 3893 3610 Traceback (most recent call last): File "glass_gen", line 3, in <module> s=optimization.geneticoptimize(domain,optimization.schedulecost) File "/root/Collective_Intelligence/", line 184, in geneticoptimize scores=[(costf(v),v) for v in pop] File "/root/Collective_Intelligence/", line 43, in schedulecost for d in range(len(sol)/2): TypeError: object of type 'NoneType' has no len() ----------------------------------------------------- p. 91 quotation: In theory, you could try every possible combination, but in this example there are 16 flights, all with 9 possibilities, giving a total of 9**16 (around 300 billion) combinations. Testing every combination would guarantee you'd get the best answer. However, the total cost of the outbount flights is independent of the cost of the return flights, so the minimization of the outbound and return costs can be done separately. Six outbound and return flights yield 2*6**9 (20155392) combinations, which makes testing every combination an easy task. Thus,the optimal solution is [4, 3, 3, 3, 4, 3, 3, 4, 4, 3, 4, 3] cost: 2306, [1102, 81, 942, 181, 0] Seymour BOS 12:34-15:02 $109 10:33-12:03 $ 74 Franny DAL 10:30-14:57 $290 10:51-14:16 $256 Zooey CAK 12:08-14:59 $149 10:32-13:16 $139 Walt MIA 11:28-14:40 $248 12:37-15:05 $170 Buddy ORD 12:44-14:17 $134 10:33-13:11 $132 Les OMA 12:18-14:56 $172 11:07-13:24 $171

Olov Einarsson  Sep 12, 2010 
Printed Page 88
function def printschedule(r)

The function defprintschedule(r) uses d to key into both the people and r arrays. This results in an error in logic when executing the function. # Algorithm from book did not work, added d = 0 and personIndex variable in for loop. Made personIndex as key index into the people array and used d as out and ret key index. This is necessary because each consecutive location in the r array contains a pair of origin, destination locations, so for each person, array index i and index i+1 refer to the flight number for a round trip. Solution: d = 0 for personIndex in range(len(r)/2): name = people[personIndex][0] origin = people[personIndex][1] out = flights[(origin, destination)][r[d]] ret = flights[(destination, origin)][r[d+1]] d = d + 2 print '%10s%10s %5s-%5s $%3s %5s-%5s $%3s' % (name, origin, out[0], out[1], out[2], ret[0], ret[1], ret[2])

Anonymous  Nov 15, 2011 
Printed Page 88
4th paragraph

In this case, each number can represent which flight a person chooses to take, where 0 is the first flight of the day, 1 is the second, and so on. Since each person needs an outbound flight and a return flight, the length of this list is twice the number of people. For example, the list: [1,4,3,2,7,3,6,3,2,4,5,3] Represents a solution in which Seymour takes the second flight of the day from Boston to New York, and the fifth flight back to Boston on the day he returns. Franny takes the fourth flight from Dallas to New York, and the third flight back. Should be : Represents a solution in which Seymour takes the second flight of the day from Boston to New York, and the (seventh) flight back to Boston on the day he returns. Franny takes the (fifth) flight from Dallas to New York, and the (fourth) flight back.

Arie Schlesinger  Mar 26, 2012 
Printed Page 89
code example output

there has been previous report on some other issues on the example, e.g airport name is nowhere in schedule.txt and not programmed, also, the discussion and code are a bit inconsistent. Now, when running example code, the output is very different from the printout in the book as well. In summary, the illustration here is very confusing.

Anonymous  Jun 10, 2008 
PDF Page 90
the last 3 and 4 lines

int(sol[d]) should be int(sol[2*d]) in(sol[d+1]) should be int(sol[2*d+1]) and therefore the calculate result at the next page should be 5469

Eric Wang  Mar 23, 2009 
Printed Page 90
3rd paragraph, first sentence

The text states that, "There are a huge number of possibilities for the getcost function defined here." There is no getcost function, however; "getcost" should be replaced with "schedulecost".

Marisano James  Nov 06, 2009 
Printed Page 91
3rd paragraph

In theory, you could try every possible combination, but in this example there are 10 flights, all with 6 possibilities, giving a total of 6^10 combinations. instead of In theory, you could try every possible combination, but in this example there are 16 flights, all with 9 possibilities, giving a total of 9^16 combinations. b'cause 6 family members having 10 flights to LGA

Anonymous  Jun 21, 2008 
Printed Page 91
3rd paragraph

I believe that calculation should be: (10 flights to LGA * 10 flights from LGA) ^ (# of family members) = (10^2)^6 = 10^12 or in Python, (10**2)**6 = 10**12

Anonymous  Nov 06, 2009 
Printed Page 91
3rd paragraph

Oh, and 9**16 is much closer to 300 trillion than it is to 300 billion

Anonymous  Nov 06, 2009 
Printed Page 92
1st code snippet

> return r should be > return bestr

Anonymous  Oct 23, 2008 
Printed Page 92
first code snippet

The randomoptimize function should return bestr, not r.

Andy Young  Jan 20, 2009 
Printed Page 92
2nd line of Python session section (toward top of page)

The domain goes from 0-10, i.e. 0..9 rather than 0..7. (See schedule.txt for confirmation; also on p. 97, 8 could not be included as part of the solution if the highest available value were 7.) This means the line specifying the domain should read: domain=[(0,10)]*(len(optimization.people)*2)

Marisano James  Nov 07, 2009 
Printed Page 93
hillclimb function code sample

The spelling of neighbors is inconsistent in the code sample. All instances of the word neighbors/neighbours in the code sample should be changed either to "neighbors" or "neighbours", but not a mix of both.

Bryce Thomas  May 06, 2009 
Printed Page 93
bottom third of page, about halfway through the code section

The code as printed in the first edition of the book, allows for negative index references (in Python these will not cause index out of range errors, but will instead read from the opposite end of the list) and does not explore the full domain. One way to address these shortcomings is to change the domain to take on actual values from 0 up to 9 inclusive, and to make the following changes: if sol[j]>domain[j][0]: should be changed to read, if sol[j] > domain[j][0] and sol[j] < domain[j][1] - 1: Likewise, if sol[j]<domain[j][1]: should read, if sol[j] < domain[j][1] and sol[j] > domain[j][0]: Another solution is to adjust the range as stated, but then to enforce a wrap-around at the ends of the range (this approach allows the algorithm to test all of the neighboring schedules that comprise the local minimum), for j in range(len(domain)): # One way in each direction. # # Test modified index again the upper bound. if sol[j] > domain[j][0]: if sol[j]+1 < domain[j][1] - 1: value = sol[j]+1 else: value = domain[j][0] neighbors.append(sol[0:j] + [value] + sol[j+1:]) # The lower upper bound. if sol[j] < domain[j][1]: # Use the mod function to wrap around. neighbors.append(sol[0:j] + [(sol[j]-1)%domain[j][1]] + sol[j+1:])

Marisano James  Nov 07, 2009 
Printed Page 95

The probability of a higher-cost solution being accepted should read: p = e^(-(highcost-lowcost)/temperature)

Sercan Taha Ahi  Dec 13, 2009 
Printed Page 96
Chapter 5

calculation of probability is wrong. Is: p=pow(math.e,-eb-ea/T) must be: p=pow(math.e,-(eb-ea)/T) reason: if eb<ea, eb is accepted anyway if eb>ea, then eb-ea is how much the solution got worse. If this is almost 0, then why not accepting (p becomes almost 1)? if this is very big, it should be less likely to accept the change, at least at low temperatures! -eb-ea also "somehow" works, because the temperature decreases enough. However, since p is quite small, there is not much annealing and we have to start from very high temperatures to have some p's close to 1.

Anonymous  Dec 29, 2009 
Printed Page 98
mutate function definition

The mutate function is missing an else clause (or some other mechanism for returning a default value). A possible solution would be to remove the elif conditional and guarantee that the function always returns a value: # Mutate operation def mutate(vec): i=random.randint(0,len(domain)-1) if random.random()<0.5 and vec[i]>domain[i][0]: return vec[0:i]+[vec[i]-step]+vec[i+1:] else: return vec[0:i]+[vec[i]+step]+vec[i+1:]

Jeremy Mason  Dec 08, 2008 
Printed Page 98
code example in function def mutate

the mutate function returns None if neither the if or elif are true. This results in None being appended to the pop list which causes badness in the cost function. suggest adding the following to the end of the mutate function else: return vec

Matt Mercer  Mar 01, 2009 
PDF Page 98
the mutate function of code example

the mutate function should has a else clause which return vec. Otherwise None will be added into pop occasionally

Eric Wang  Mar 24, 2009 
PDF Page 98
In the geneticoptimize function

Firstly in this function correct the mistyped 'mutprod' to 'mutprob' The mutate function goes out of bounds for different step values plus the previously reported error so here is my corrected version def mutate(vec): i=random.randint(0,len(domain)-1) # Corrected so different step values can be added if random.random()<0.5 and vec[i]-step>domain[i][0]: return vec[0:i]+[vec[i]-step]+vec[i+1:] elif vec[i]+step<domain[i][1]: return vec[0:i]+[vec[i]+step]+vec[i+1:] else: return vec Also during the mainloop mutations/crossovers are made but not scored, so my solution is to make maxiter +1 and in the last iteration just score it without adding mutations; # Main loop for i in range(maxiter): scores=[(costf(v),v) for v in pop] scores.sort() if i == maxiter-1: pass # Corrected from here for last iteration else: ranked=[v for (s,v) in scores] # Start with the pure winers pop=ranked[0:topelite] # Add mutated and bred forms of the winners while len(pop)<popsize: if random.random()<mutprob: # Mutation c=random.randint(0,topelite) pop.append(mutate(ranked[c])) else: # Crossover c1=random.randint(0,topelite) c2=random.randint(0,topelite) pop.append(crossover(ranked[c1],ranked[c2])) # Print current best score print scores[0][0] return scores[0][1]

Lola Montes  Jul 02, 2013 
Printed Page 101

When you follow the Kayak API link (, you get the following error message: 'Kayak Search API is no longer supported. Because of costly misuse of the search API, KAYAK no longer can support it. Sorry.'

Peter Tegelaar  Apr 11, 2011 
Printed Page 108
second code example

for i in range(len(dorms): slots += [i,i] should be: for i in range(len(dorms)): slots += [i,i]

Anonymous  Jun 12, 2008 
Printed Page 109

The output of dorm.printsolution(s) uses the value of s as generated by s=optimize.randomoptimize(), for a solution cost of 18 (the original generated solution.) optimize.geneticoptimize() is then called, but the results are not reassigned to s, and the output is confusing. It might be more clear if this session were changed to: reload(dorm) s=optimize.randomoptimize(dorm.domain,dorm.dormcost) dorm.printsolution(s) dorm.dormcost(s) r=optimize.geneticoptimize(dorm.domain,dorm.dormcost) dorm.printsolution(r) so that the before and after can be seen.

Anonymous  Oct 24, 2009 
Printed Page 112
code, bottom

when calculating ua and ub, the book contains this code ua = ((x4 - x3) * (y1 - y3) - (y4 - y3) * (x1 - x3)) / den ub = ((x2 - x1) * (y1 - y3) - (y2 - y1) * (x1 - x3)) / den this works well for python 3.x but not for python 2.x if you are using python 2.x you should instead write: ua = ((x4 - x3) * (y1 - y3) - (y4 - y3) * (x1 - x3)) / float(den) ub = ((x2 - x1) * (y1 - y3) - (y2 - y1) * (x1 - x3)) / float(den) (by default python 2.x uses integer division when using the '/' operator together with two integers)

Ilya Verbitskiy  Apr 02, 2012 
PDF Page 113

The PIL wasn't showing me any results so I print it out as a a jpg instead from PIL import Image, ImageDraw def drawnetwork(sol): jpeg='network_diagram.jpg' # Create the image'RGB',(400,400),(255,255,255)) draw=ImageDraw.Draw(img) # Create the position dict pos=dict([(people[i],(sol[i*2],sol[i*2+1])) for i in range(0,len(people))]) # Draw links for (a,b) in links: draw.line((pos[a],pos[b]),fill=(255,0,0)) # draw people for n,p in pos.items(): draw.text(p,n,(0,0,0)),'JPEG')

Lola Montes  Jul 05, 2013 
Printed Page 121
2nd sample code block

After importing docclass, the db should be initialized. cl.setdb('test.db') Or this line can be added to

Anonymous  May 21, 2008 
PDF Page 121
Calculating Probabilities

"In this case, you can calculate the probability that a word is in a particular category by dividing the number of times the word appears in a document in that category by the total number of documents in that category." in this sentence, it indicated that the Probability of a word = total number times the word appears / total count of document But I think it is wrong , right one must as follow two case: A. Probability of a word = total number times the word appears/ total count of the word in all document B. Probability of a word = total number times the word appears / total count of the word in the document.

Ricky  Jan 05, 2013 
Printed Page 123
1st paragraph

The function for the calculation of the weighted probabilities (weightedprob) is wrong. In more details, the variable "total" is calculated in a wrong way. "total" is a weighting factor for "basicprob" (probability to find a document with a given feature in a given category). In the book "total" is calculated as "the number of times this feature has appeared in all categories". However, the weighting factor "total" should be equal to the number of items in a considered category.

Anonymous  Oct 09, 2009 
PDF Page 126
4th Paragraph

The line states that a new instance variable should be added to classifier when in fact it should be added to naivebayes, as should the other methods on the page.

Lola Montes  Jul 09, 2013 
Printed Page 129
very last line of code

I believe the last argument to the invchi2 function should *not* be multiplied by 2. That is, it should be: return self.invchi2(fscore, len(features)) rather than: return self.invchi2(fscore, len(features)*2)

Roy Pardee  Mar 12, 2009 
PDF Page 129
United States

Fisher's method is described as: "The Fisher method involves multiplying all the probabilities together, then taking the natural log (math.log in Python), and then multiplying the result by 2. Add this method to fisherclassifier to do this calculation:" And indeed, this is how it is implemented. However, as far as I can tell, Fisher's method, according to other sources, is the summation of the natural logarithm of each observation, not the natural logarithm of the product sum of each observation. In practice, it works out because log(xy) = log(x) + log(y) and summation is distributive, but this explanation caused me some serious confusion at first when I tried to find more information about this technique.

Anonymous  Mar 26, 2013 
Safari Books Online 132-133
Do not know

The SQL classification SQLite example should be coded using the SQL ? substitution model rather than using Python's method. The functions incf() and fcount() generate program exceptions when the feature input contains a quotation mark (and perhaps other characters as well). For example, the SQLite statement: self.con.execute("insert into fc values ('%s','%s',1)" % (f,cat)) should instead be coded: self.con.execute("insert into fc values (?,?,1)", (f,cat)) The latter (improved) format allows SQL to handle input that would otherwise cause problems.

Thornton Martin  Dec 06, 2009 
Printed Page 137
code sample

In the entryfeatures function, summarywords is produced by calling the lower() method on each string produced by splitting the summary text. The function later tries to count uppercase words by calling the isupper() method on each string in summarywords. Since these strings have already been converted to lower case, isupper() will never return true.

Anonymous  Dec 29, 2009 
Printed Page 137
At the code example

I found a minor issue with the code example at page 137. The code tries to detect overuse o uppercase use but it fails because it converts all the words to lowercase in the second list comprehension of the function. summarywords = [s.lower() for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20] The code to detect the uppercase abuse correctly is (see that I called lower() in f[w.lower()] and in f[towowords.lower()]: def entryfeatures(entry): splitter = re.compile('\\W*') f = {} # Extract the title words and annotate titlewords = [s.lower() for s in splitter.split(entry['title']) if len(s) > 2 and len(s) < 20] # Extract the summary words summarywords = [s for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20] # Count uppercase words uc = 0 for i in range(len(summarywords)): w = summarywords[i] if (w.isupper()): uc += 1 f[w.lower()] = 1 # Get word pairs in summary as features if i < len(summarywords) - 1: twowords = ' ' . join(summarywords[i:i+1]) f[twowords.lower()] = 1 # Keep creator and publisher whole f['Publisher:'+ entry['publisher']] = 1 # UPPERCASE is a virtual word flagging too much shouting if (float(uc) / len(summarywords) > 0.3): f['UPPERCASE'] = 1 return f

Eriksen Costa  Mar 21, 2010 
PDF Page 138
Top of the page

Just a mild typo but had me confused for a couple of minutes :) What it says: If you want to use this new version with filterfeed, you'll have to change the function to pass the entries to the classifier rather than to fulltext. Just change the end to: ....... What it means: Change the ending of;

Lola Montes  Jul 11, 2013 
PDF Page 144
2nd Paragraph

Importing decision_tree_example.txt keeps "\n"s in the data, so remove them or this taints the data example from uniquecounts() {'Basic\n': 4, 'Premium\n': 1, 'Basic': 1} This is how I felt like doing it today my_data=[line.split('\t') for line in file('decision_tree_example.txt')] for a in range(len(my_data)): for b in range(len(my_data[a])): my_data[a][b] = my_data[a][b].replace('\n','')

Lola Montes  Jul 22, 2013 
PDF Page 146
Top of Page

divideset() was splitting the numerical-data as string-data because that was is it was stored in my_data. So, instead of splitting >=value it was splitting ==value The corrected version produces the same tree, but with different datasets who knows? def divideset(rows,column,value): try: value = int(value) # Convert to int here except: pass split_function=None if isinstance(value,int) or isinstance(value,float): split_function=lambda row: int(row[column])>=value # Convert here too else: split_function=lambda row:row[column]==value etc......................

Lola Montes  Aug 01, 2013 
PDF Page 148
In the second indented block

In the line Entropy = sum of p(i) x log(p(i)) for all outcomes By what the author mean, Entropy should be the minus of that. It's probably a typo, since in the implementation below, the computation didn't forget the minus sign.

jvc  Jul 25, 2014 
Printed Page 150
buildtree function

The recursive calls at the end of the buildtree function should propagate the scoref parameter. Otherwise if you use a scoref function besides the default "entropy" function it will only be used on the first call. So instead of trueBranch=buildtree(best_sets[0]) it should be trueBranch=buildtree(best_sets[0], scoref)

Stan Dyck  May 05, 2009 
PDF Page 155
last line

The line delta=entropy(tb+fb)-(entropy(tb)+entropy(fb)/2) should be delta=entropy(tb+fb)-(entropy(tb)+entropy(fb))/2

Xiaochen Wang  Jun 04, 2012 
Printed Page 157
bottom of page, second last line of code

mdclassify function is called qualified by treepredict2, but treepredict2 is neither imported (doesn't exist anyway) or defined anywhere. Changing treepredict2.mdclassify(['google', 'France',None,None],tree) to simply treepredict.mdclassify(['google', 'France',None,None],tree) appears to generate the intended results.

Bryce Thomas  May 09, 2009 
Printed Page 157
code sample at bottom of page

The reported results for the first test of the mdclassify function are different than those produced by the code in the downloadable code samples. Here's what I get with Python 2.6.4: >>> tree = treepredict.buildtree(treepredict.my_data) >>> treepredict.mdclassify(['google', None, 'yes', None], tree) {'Premium': 2.25, 'Basic': 0.25}

Anonymous  Jan 05, 2010 
Printed Page 160
getaddressdata function code sample top half of page

It appears as though the Zillow API does not return "totalRooms" anymore (assuming it once did). Furthermore, some of the houses that it searches for appear to return no actual values from Zillow, or still cause the exception block to be hit. Later on, when the code asks for len(row) in the variance method of, this will cause "TypeError: object of type 'NoneType' has no len()". As a work around: rooms=doc.getElementsByTagName('totalRooms')[0] should be commented out. Also, in getpricelist() function, change l1.append(data) to if data != None l1.append(data) this should prevent null rows getting added and therefore prevent the TypeError exception described above.

Bryce Thomas  May 09, 2009 
Printed Page 161
Modelling "Hotness"

Hot or Not API is no longer available, so Hot or Not stuff will not work.

Bryce Thomas  May 09, 2009 
Printed Page 175
top half of page, gaussian function and interactive interpreter sample

The gaussian function shown at the top of the page does not produce the results shown in the interactive interpreter sample code. E.g. numpredict.gaussian(1.0) produces 0.99501247... not 0.606530659... as shown. To get results which match that of the sample interactive interpreter, change the gaussian function to: def gaussian(dist,sigma=10.0): return math.e**(-(dist*10)**2/(2*sigma**2)) This way, results match what's shown on the page and are then also consistent with what's illustrated in figure 8-5.

Bryce Thomas  May 10, 2009 
PDF Page 182
5th Paragraph

The pdf says to call geneticoptimize using; optimization.geneticoptimize(numpredict.weightdomain,costf,popsize=5,lrate=1,maxv=4,iters=20) I assume this was an different version of geneticoptimize(), as the values accepted by geneticoptimize are geneticoptimize(domain,costf,popsize=50,step=1,mutprob=0.2,elite=0.2,maxiter=100) I tried this instead, guessing that 'maxv' related to 'elite'. optimization.geneticoptimize(numpredict.weightdomain,costf,popsize=5,step=1,elite=0.2,maxiter=20)

Lola Montes  Aug 05, 2013 
Printed Page 186
code example

numpredict.cumulativegraph(data, (1,1), 6) the "high" price 6 doesn't make sense. according to the figure 8-10, the high should be at least 120.

Anonymous  Jun 19, 2008 
Printed Page 186
interactive interpreter code sample

the cumulativegraph function is being passed the vector (1,1), which means it will draw the cumulative probability for the price of bottles of wine which have a rating of 1 and are 1 year old. These wine bottles would be so terrible that the cumulative probability would reach 1 at a price of 0, meaning there's nothing to see on the graph generated. A better example would be to use say a wine bottle rating of 99 that's 10 years old, with a call like: numpredict.cumulativegraph(data,(99,10),120) This way, they'll actually be something useful displayed on the graph when its printed and it should look more like Figure 8-10 does.

Bryce Thomas  May 10, 2009 
Printed Page 188
interactive interpreter code sample

Like the error on page 186, using the vector (1,1) - a wine bottle with a rating of 1 and 1 year old, means that the wine would be so bad that it would all cost essentially 0. Instead, use a vector like say (99,10). Also, the value of 6 used for high is quite small. To get something that looks similar to figure 8-11, the value 120 should be used instead. So, the line which reads: numpredict.probabilitygraph(data,(1,1),6) should become something like numpredict.probabilitygraph(data,(99,10),120)

Bryce Thomas  May 10, 2009 
Printed Page 199
code sample

The paragraph after the code sample states "The points will be O if the people are a match and X if they are not." The code sample however draws a scatterplot where people that are a match are represented with a green O and people that are not a match are represented with a red O (not an X). To get a scatterplot that uses red X's instead of red O's, change the line: plot(xdn,ydn,'ro') to plot(xdn,ydn,'rx') Alternatively, to get a scatterplot that looks like that in figure 9-1 (which uses neither O's nor X's), change the same line: plot(xdn,ydn,'ro') to plot(xdn,ydn,'r+')

Bryce Thomas  Oct 13, 2009 
Printed Page 199
middle of page (code section)

If the plotagematches function is really to be called from one's Python session (rather than being added to the file) then one does not need to reload advancedclassify, and should only type "plotagematches(agesonly)" instead of "advancedclassify.plotagematches(agesonly)" into the Python session.

Marisano James  Nov 01, 2009 
Printed Page 204
First Paragraph

There is a sentence which states "There are two other points, X0 and X1, which are examples that have to be classified". The diagram on the other hand uses points X1 and X2, but no X0. I believe that the sentence should be changed to "There are two other points, X1 and X2, which are examples that have to be classified."

Bryce Thomas  Oct 15, 2009 
Printed Page 208
top of page, 2nd line of getlocation function body

The URL for Yahoo! Maps Web Services has changed. The line reading, 'http:\\'+\ should now read, ''+\

Marisano James  Nov 02, 2009 
Printed Page 209
halfway down the page

Apparently there are two 824 3rd Avenues in New York city: one that's approximately 0.9 miles from 220 W 42nd St, and another that's about 6.6 miles away. It might be good to include zip codes with some, or all, of the addresses from the matchmaker.csv file to avoid confusion. [The new Yahoo! Maps Services returns the latitude and longitude for the second address - the one that's circa 6.6 miles away - as the default rather than those of the address that's approx. 0.9 miles away, as listed in the book.]

Marisano James  Nov 02, 2009 
Printed Page 210
In the scaledata function

In the book, scaleinput should be defined as: # Create a function that scales data def scaleinput(d): return [(d[i] - low[i]) / (high[i] - low[i]) for i in range(len(low))] In the printed copy, the return statement tries to access[i], even though the data member was passed in to scaleinput, in both scaledata and in the interpreter example below the code.

Anonymous  Aug 15, 2008 
Printed Page 213

The printed first edition does not include the source code for the veclength function, def veclength(v): return sum([p**2 for p in v]) Perhaps it should be placed toward the top of page 213. [The function does appear in the file, available via the Examples link <>.]

Marisano James  Nov 03, 2009 
Printed Page 213

Just above the definition of the rbf function there should be an import math statement to ensure that the constant math.e is defined.

Marisano James  Nov 03, 2009 
Printed Page 213
top of page, in rbf definition

To be in accordance with the other gammas, and with the code listed in, the first line of the rbf function definition should read, def rbf(v1, v2, gamma=10): For the purposes of the book, I believe that the gamma parameter is always explicitly passed, however.

Marisano James  Nov 03, 2009 
Printed Page 213
bottom third of page (2nd from last line of the nlclassify function)

The line, if y<0: return 0 should read, if y>0: return 0 This is in accordance with the code included in and also allows the offset used on page 214 to be meaningfully calculated. (See below.) Without this change, each result returned by nlclassify will be the opposite of its intended value.

Marisano James  Nov 03, 2009 
Printed Page 214
Interactive interpreter sample at top of page

Before you can execute any of the statements, you first need to reload advancedclassify using: reload(advancedclassify) The book also never defines the variable "offset" which is being parsed into the nlclassify function. I'm not sure what a reasonable value for this would be.

Bryce Thomas  Oct 15, 2009 
Printed Page 214
top of page, just before the first call to nlclassify

Just after the missing reload(advancedclassify) statement (See Bryce's comment for more), one should execute offset=advancedclassify.getoffset(agesonly) To properly define the otherwise undefined offset. This should produce a value of some -0.0076450020098023288.

Marisano James  Nov 03, 2009 
PDF Page 217
Beginning 3rd paragraph

The current version of the Python Wrapper for LIBSVM,(3.17), doesn't support the commands in the PDF. Two solutions: 1: Download version 2.91 from and make from the python_old directory. 2: Using the following commands found on Stack Overflow Note I couldn't get cross-validation to work though : from svm import * from svmutil import * svm_model.predict = lambda self, x: svm_predict([0], [x], self)[0][0] prob = svm_problem([1,-1], [[1,0,1], [-1,0,-1]]) param = svm_parameter() param.kernel_type = LINEAR param.C = 10 m=svm_train(prob, param) m.predict([1,1,1]) svm_save_model('test.model',m) m = svm_load_model('test.model') answers,inputs=[r.match for r in scaledset],[ for r in scaledset] param = svm_parameter() param.kernel_type = RBF prob = svm_problem(answers,inputs) m=svm_train(prob, param) newrow=[28.0,-1,-1,26.0,-1,1,2,0.8] # Man doesn't want children, woman does m.predict(scalef(newrow)) newrow=[28.0,-1,1,26.0,-1,1,2,0.8] # Both want children m.predict(scalef(newrow))

Lola Montes  Aug 23, 2013 
Printed Page 218

The lines reading, m=svm_model(test.model) should read,'test.model') m=svm_model('test.model') Also, in order for the svmc.pyd file to work with your Python environment, it must be of the correct version. The svmc.pyd file included in is for Python 2.4, the one available from the LIBSVM website (as of November 3, 2009) is for Python 2.6. I used this website to get a version that works with Python 2.5: <>. [It's also possible to build the svmc.pyd with C.]

Marisano James  Nov 03, 2009 
Printed Page 219
Facebook section

The code for facebook has change dramatically since the publishing and no longer functions. Use of the book code returns a "This API version is deprecated" <?xml version="1.0" encoding="UTF-8"?> <error_response xmlns="" xmlns:xsi="" xsi:schemaLocation=""> <error_code>12</error_code> <error_msg>This API version is deprecated</error_msg> <request_args list="true"> <arg>

Jeffery Shipman  Apr 08, 2009 
Printed Page 230
makematrix() function

Non-Negative Matrix Factorization is well defined only when there all no all-zero rows and columns. The makeMatrix() function, by excluding "words that are common but not too common," can eliminate all the word in some documents yielding all-zero rows. The evidence that this has occurred is error messages and NaNs in the output. It can be fixed two ways. A quick-and-dirty fix is to eliminate fewer words in makematrix(); this can be accomplished with simple changes to the if statement. A more robust fix is removing the resulting all-zero rows.

Scott Ainsworth  May 03, 2009 
PDF Page 238
factorize(v, pc=10, iter = 50) function

As of D. Shrestha, "Document Clustering Through Non-Negative Matrix Factorization: A Case Study of Hadoop for Computational Time Reduction of Large Scale Documents.," pp. 1-10. The parameter epsilon << 1 is added to the multiplicative update rule to avoid division by zero, thus avoiding a NaN in the algorithm. This is added to: wd=(w*h*transpose(h))+0.000000001 and hd=(transpose(w)*w*h)+0.000000001

Pavel Soriano  Mar 15, 2010 
Printed Page 238
Factorize function

Hi there, When I run the code of chapter 10 that comes with the book Programming Collective Intelligence I get the following error: RuntimeWarning: invalid value encountered in divide w=matrix(array(w)*array(wn)/array(wd)) RuntimeWarning: invalid value encountered in double_scalars dif += pow ( a [ i, j ] - b [ i, j ], 2 ) nan I have been looking for possible explanations today, but I can't find any. Could you please help? Kind regards, Thomas van Pelt the Netherlands

Anonymous  Dec 27, 2012 
PDF Page 238
Last paragraph

retuI found this version of factorize() that deals with the appearance of NAN's on GitHub Enjoy :) def factorize(v, pc=10, iter=50): ic, fc = shape(v) w = matrix([[random.random() for j in range(pc)] for i in range(ic)]) h = matrix([[random.random() for i in range(fc)] for i in range(pc)]) for i in range(iter): wh = w * h cost = difcost(v, wh) if i % 10 == 0: print("%d: %f" % (i, cost)) if cost == 0 or math.isnan(cost): print("%d: %f" % (i, cost)) break hn = transpose(w) * v hd = transpose(w) * w * h h = matrix(array(h) * array(hn) / array(hd)) wn = v * transpose(h) wd = w * h * transpose(h) # RuntimeWarning: invalid value encountered in divide wd = [[1e-20 if (x - 1e-20 < 0) else x for x in lst] for lst in wd.tolist()] w = matrix(array(w) * array(wn) / array(wd)) return w, hrn w, h

Lola Montes  Aug 30, 2013 
PDF Page 238
Last paragraph

""" Try this version of NMF that is about 15/20 times faster and in my opinion more accurate in the results it gives - lolamontes69 # NMF by alternative non-negative least squares using projected gradients # Author: Chih-Jen Lin, National Taiwan University # Python/numpy translation: Anthony Di Franco # Adapted for the Chapter 10 Programming Collective Intelligence by lolamontes69 *see end for usage """ import random as random from numpy import * from numpy.linalg import norm from time import time from sys import stdout def factorize(v1, pc, maxiter): ic, fc = shape(v1) temp1=v1.tolist() v=array(temp1) w = array([[random.random() for j in range(pc)] for i in range(ic)]) h = array([[random.random() for i in range(fc)] for i in range(pc)]) wo,ho = nmf(v, w, h, 0.00000001, 900, maxiter) temp2=wo.tolist() wo1=matrix(temp2) temp3=ho.tolist() ho1=matrix(temp3) return wo1, ho1 def nmf(V,Winit,Hinit,tol,timelimit,maxiter): """ (W,H) = nmf(V,Winit,Hinit,tol,timelimit,maxiter) W,H: output solution Winit,Hinit: initial solution tol: tolerance for a relative stopping condition timelimit, maxiter: limit of time and iterations """ W = Winit; H = Hinit; initt = time(); gradW = dot(W, dot(H, H.T)) - dot(V, H.T) gradH = dot(dot(W.T, W), H) - dot(W.T, V) initgrad = norm(r_[gradW, gradH.T]) print 'Init gradient norm %f' % initgrad tolW = max(0.001,tol)*initgrad tolH = tolW for iter in xrange(1,maxiter): # stopping condition projnorm = norm(r_[gradW[logical_or(gradW<0, W>0)], gradH[logical_or(gradH<0, H>0)]]) if projnorm < tol*initgrad or time() - initt > timelimit: break (W, gradW, iterW) = nlssubprob(V.T,H.T,W.T,tolW,1000) W = W.T gradW = gradW.T if iterW==1: tolW = 0.1 * tolW (H,gradH,iterH) = nlssubprob(V,W,H,tolH,1000) if iterH==1: tolH = 0.1 * tolH if iter % 10 == 0: stdout.write('.') print '\nIter = %d Final proj-grad norm %f' % (iter, projnorm) return (W,H) def nlssubprob(V,W,Hinit,tol,maxiter): """ H, grad: output solution and gradient iter: #iterations used V, W: constant matrices Hinit: initial solution tol: stopping tolerance maxiter: limit of iterations """ H = Hinit WtV = dot(W.T, V) WtW = dot(W.T, W) alpha = 1; beta = 0.1; for iter in xrange(1, maxiter): grad = dot(WtW, H) - WtV projgrad = norm(grad[logical_or(grad < 0, H >0)]) if projgrad < tol: break # search step size for inner_iter in xrange(1,20): Hn = H - alpha*grad Hn = where(Hn > 0, Hn, 0) d = Hn-H gradd = sum(grad * d) dQd = sum(dot(WtW,d) * d) suff_decr = 0.99*gradd + 0.5*dQd < 0; if inner_iter == 1: decr_alpha = not suff_decr; Hp = H; if decr_alpha: if suff_decr: H = Hn; break; else: alpha = alpha * beta; else: if not suff_decr or (Hp == Hn).all(): H = Hp; break; else: alpha = alpha/beta; Hp = Hn; if iter == maxiter: print 'Max iter in nlssubprob' return (H, grad, iter) """ ************* Example Usage ************* import newsfeatures as newsfeatures import welcome_to_the_nmf as new_nmf from numpy import * allw,artw,artt=newsfeatures.getarticlewords() wordmatrix,wordvec=newsfeatures.makematrix(allw,artw) v=matrix(wordmatrix) weights,feat=new_nmf.factorize(v,pc=20,maxiter=5000) topp,pn=newsfeatures.showfeatures(weights,feat,artt,wordvec,out='todays_features.txt') newsfeatures.showarticles(artt,topp,pn,out='todays_articles.txt') ------------------------------------------------------------------------------- ******************************************* Variables in line 20: have fun with them :) ******************************************* wo,ho = alt_nmf.nmf(v, w, h, tol, timelimit, iter) v is the matrix converted to an array w and h are the two randomly generated factors of v converted to arrays tol is the degree of error allowed before stopping timelimit is the time allowed before stopping iter the iterations to perform before stopping The usage of factorize() is the same as using factorize in Chapter 10 but about 15/20 times faster and it appears more accurate too :D Note: can be optimized further """

Lola Montes  Sep 04, 2013 
Printed Page 261
3rd line from bottom of page

isinstance(t, node): should read hasattr(t, 'children'): The former does not function correctly (under Python 2.5.2 at any rate).

Marisano James  Oct 30, 2009 
Printed, PDF, ePub, Mobi Page 261
def selectindex():

As some wrote, function selectindex() can return value that is greater or equal to popsize. This results an error in caller function. I think this solves the problem: # Add uniform function to import from random from random import random, randint, choice, uniform # Compute constants once xmin = pexp**(popsize-2) den = log(pexp) def selectindex(): """ xmin prevents functionf from risisng error by restricting rand. value """ return int(log(uniform(xmin, 1.0)) / den)

Karol Olszanowski  Aug 19, 2016 
Printed Page 263
4th line from the bottom of the page

The line if isinstance(t1, node) and isinstance(t2, node): should read if hasattr(t1, 'children') and hasattr(t2, 'children'): The former does not allow the program to converge to the correct solution. [The latter code snippet also appears in the book's source code as distributed in]

Marisano James  Oct 30, 2009 
Printed Page 265
def selectindex():

Function selectindex() can return value that is greater or equal to popsize. This results an error in caller function.

Vladimir Aksenov  Dec 21, 2011 
Printed Page 271
toward the bottom of the tournament function

As currently presented, the when there is a tie between players i and j it is as if player i lost. Presently the code reads, elif winner==-1: losses[i]+=1 losses[i]+=1 pass However, it should read, elif winner==-1: losses[i]+=1 losses[j]+=1 pass Applying this change (replacing the second losses[i] with losses[j]) will assist in evolving improved game-playing AI. [The above error appears in the online code,, as well.]

Marisano James  Oct 30, 2009 
Printed Page 313
9th line

The command should be $ tar xvf numpy-1.0.2.tar instead of $ tar xvf numpy-1.0.2.tar.gz

Anonymous  Oct 08, 2008 
Printed Page 317
In def pearson, top 1/3 of page

The Pearson correlation coefficient error that anonymous first identified on p. 13 also appears here. The line reading, n=len(x) should read n=float(len(x)) to avoid integer roundoff errors in a subsequent division [i.e. in num=pSum-(sumx*sumy/n)].

Marisano James  Apr 26, 2010 
Safari Books Online 322
in the function veclength

In veclength, the coefficients a[i] need to be squared: def veclength(a): return sum([a[i]**2 for i in range(len(a))])**.5 to properly compute the length.

Anonymous  Sep 03, 2012 
ePub Page 2425 (Kindle)
United States

In the gettextonly method in the FTS chapter: the subtext is bound to None sometimes raising exceptions and causing the indexing to fail. Handle that via a None check for subtext: if subtext!=None: resulttext+=subtext+'\n' return resulttext

H kelkar  Apr 08, 2013 
ePub Page 2434 (kindle)
United States

In the FTS chapter there is a separatewords function [s.lower() for s in splitter.split(text) if s!=''] encounters non strings sometimes Handle that by placing it within a try except block try: statement except: print "statement" continue Not hanadling that case will cause indexing to fail.

Anonymous  Apr 08, 2013