Programming Collective Intelligence

Name: Programming Collective Intelligence
Author: Toby Segaran
ISBN: 9780596550684

August 2007

Beginner to intermediate

362 pages

10h 11m

English

O'Reilly Media, Inc.

Content preview from Programming Collective Intelligence

Appendix B. Mathematical Formulas

Throughout the book I have introduced a number of mathematical concepts. This appendix covers selected concepts and gives a description, relevant formulas, and code for each of them.

Euclidean Distance

Euclidean distance is the straight-line distance between two points in multidimensional space, the kind of distance you would measure with a ruler. If the points are written as (p₁, p₂, p₃, p₄, ...) and (q₁, q₂, q₃, q₄, ...), then the formula for Euclidean distance can be expressed as shown in Figure B-1.

Figure B-1. Euclidean distance
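
For reference, the standard Euclidean distance formula can be written with the same p and q notation as follows. The dimension count n and the index i are symbols introduced here for illustration.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```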

A clear implementation of this formula is shown here:

def euclidean(p,q):
  sumSq=0.0

  # add up the squared differences
  for i in range(len(p)):
    sumSq+=(p[i]-q[i])**2

  # take the square root
  return (sumSq**0.5)

Euclidean distance is used in several places in this book to determine how similar two items are.
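
As a quick check of the function above, the following sketch repeats the definition so that it runs on its own and applies it to two three-dimensional points. The sample points and the `similarity` helper are illustrative assumptions, not part of the original listing; mapping a distance d to 1/(1+d) is one common way to turn a distance into a score where larger means more similar.

```python
def euclidean(p, q):
    """Straight-line distance between two equal-length sequences of numbers."""
    sumSq = 0.0

    # add up the squared differences
    for i in range(len(p)):
        sumSq += (p[i] - q[i]) ** 2

    # take the square root
    return sumSq ** 0.5


def similarity(p, q):
    """Hypothetical helper: map a distance to a score in (0, 1]."""
    return 1.0 / (1.0 + euclidean(p, q))


# Squared differences are 9, 16 and 0, so the distance is sqrt(25).
point_a = [1, 2, 3]
point_b = [4, 6, 3]
print(euclidean(point_a, point_b))   # 5.0
print(similarity(point_a, point_a))  # identical points give 1.0
```

Identical points have distance 0 and therefore the maximum similarity of 1, while the score shrinks toward 0 as the points move apart.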

Pearson Correlation Coefficient

The Pearson correlation coefficient is a measure of how highly correlated two variables are. It is a value between 1 and −1, where 1 indicates that the variables are perfectly correlated, 0 indicates no correlation, and −1 means they are perfectly inversely correlated.

Figure B-2 shows the formula for the Pearson correlation coefficient.

Figure B-2. Pearson correlation coefficient
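
For reference, one common computational form of the Pearson correlation coefficient is given below. The symbols are assumptions introduced here: x and y are the two variables, each with N paired observations, and every sum runs over those N observations.

```latex
r = \frac{\sum x y - \dfrac{\sum x \sum y}{N}}
         {\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{N}\right)
                \left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{N}\right)}}
```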

This can be implemented with the following code: ...
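
The original listing is cut off in this preview. As a minimal sketch, and not the book's own code, the coefficient could be computed from simple sums along these lines. The function name `pearson`, its argument names, and the choice to return 0.0 when either variable has no variance are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")

    # simple sums
    sum_x = float(sum(x))
    sum_y = float(sum(y))

    # sums of squares and sum of products
    sum_x_sq = sum(a * a for a in x)
    sum_y_sq = sum(b * b for b in y)
    sum_xy = sum(a * b for a, b in zip(x, y))

    # numerator and denominator of the coefficient
    num = sum_xy - (sum_x * sum_y / n)
    var_term = (sum_x_sq - sum_x ** 2 / n) * (sum_y_sq - sum_y ** 2 / n)
    den = sqrt(max(var_term, 0.0))  # guard against tiny negative rounding error

    # Assumption: treat a constant variable as uncorrelated.
    if den == 0:
        return 0.0
    return num / den


print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly correlated: 1.0
print(pearson([1, 2, 3], [6, 4, 2]))  # perfectly inversely correlated: -1.0
```

The two calls exercise the extremes described above: values that rise together give 1, and values that move in exactly opposite directions give −1.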

Publisher Resources

ISBN: 9780596529321