Programming Collective Intelligence

Name: Programming Collective Intelligence
Author: Toby Segaran
ISBN: 9780596550684

by Toby Segaran

August 2007

Beginner to intermediate

362 pages

10h 11m

English

O'Reilly Media, Inc.

Programming Collective Intelligence
A Note Regarding Supplemental Files
Praise for Programming Collective Intelligence
Preface
Prerequisites; Style of Examples; Why Python?; Python Tips; List and dictionary constructors; Significant Whitespace; List comprehensions; Open APIs; Overview of the Chapters; Conventions; Using Code Examples; How to Contact Us; Safari® Books Online; Acknowledgments
1. Introduction to Collective Intelligence
What Is Collective Intelligence?; What Is Machine Learning?; Limits of Machine Learning; Real-Life Examples; Other Uses for Learning Algorithms
2. Making Recommendations
Collaborative Filtering; Collecting Preferences; Finding Similar Users; Euclidean Distance Score; Pearson Correlation Score; Which Similarity Metric Should You Use?; Ranking the Critics; Recommending Items; Matching Products; Building a del.icio.us Link Recommender; The del.icio.us API; Building the Dataset; Recommending Neighbors and Links; Item-Based Filtering; Building the Item Comparison Dataset; Getting Recommendations; Using the MovieLens Dataset; User-Based or Item-Based Filtering?; Exercises
3. Discovering Groups
Supervised versus Unsupervised Learning; Word Vectors; Pigeonholing the Bloggers; Counting the Words in a Feed; Hierarchical Clustering; Drawing the Dendrogram; Column Clustering; K-Means Clustering; Clusters of Preferences; Getting and Preparing the Data; Beautiful Soup; Scraping the Zebo Results; Defining a Distance Metric; Clustering Results; Viewing Data in Two Dimensions; Other Things to Cluster; Exercises
4. Searching and Ranking
What’s in a Search Engine?; A Simple Crawler; Using urllib2; Crawler Code; Building the Index; Setting Up the Schema; Finding the Words on a Page; Adding to the Index; Querying; Content-Based Ranking; Normalization Function; Word Frequency; Document Location; Word Distance; Using Inbound Links; Simple Count; The PageRank Algorithm; Using the Link Text; Learning from Clicks; Design of a Click-Tracking Network; Setting Up the Database; Feeding Forward; Training with Backpropagation; Training Test; Connecting to the Search Engine; Exercises
5. Optimization
Group Travel; Representing Solutions; The Cost Function; Random Searching; Hill Climbing; Simulated Annealing; Genetic Algorithms; Real Flight Searches; The Kayak API; The minidom Package; Flight Searches; Optimizing for Preferences; Student Dorm Optimization; The Cost Function; Running the Optimization; Network Visualization; The Layout Problem; Counting Crossed Lines; Drawing the Network; Other Possibilities; Exercises
6. Document Filtering
Filtering Spam; Documents and Words; Training the Classifier; Calculating Probabilities; Starting with a Reasonable Guess; A Naïve Classifier; Probability of a Whole Document; A Quick Introduction to Bayes’ Theorem; Choosing a Category; The Fisher Method; Category Probabilities for Features; Combining the Probabilities; Classifying Items; Persisting the Trained Classifiers; Using SQLite; Filtering Blog Feeds; Improving Feature Detection; Using Akismet; Alternative Methods; Exercises

7. Modeling with Decision Trees
Predicting Signups; Introducing Decision Trees; Training the Tree; Choosing the Best Split; Gini Impurity; Entropy; Recursive Tree Building; Displaying the Tree; Graphical Display; Classifying New Observations; Pruning the Tree; Dealing with Missing Data; Dealing with Numerical Outcomes; Modeling Home Prices; The Zillow API; Modeling “Hotness”; When to Use Decision Trees; Exercises
8. Building Price Models
Building a Sample Dataset; k-Nearest Neighbors; Number of Neighbors; Defining Similarity; Code for k-Nearest Neighbors; Weighted Neighbors; Inverse Function; Subtraction Function; Gaussian Function; Weighted kNN; Cross-Validation; Heterogeneous Variables; Adding to the Dataset; Scaling Dimensions; Optimizing the Scale; Uneven Distributions; Estimating the Probability Density; Graphing the Probabilities; Using Real Data—the eBay API; Getting a Developer Key; Setting Up a Connection; Performing a Search; Getting Details for an Item; Building a Price Predictor; When to Use k-Nearest Neighbors; Exercises
9. Advanced Classification: Kernel Methods and SVMs
Matchmaker Dataset; Difficulties with the Data; Decision Tree Classifier; Basic Linear Classification; Categorical Features; Yes/No Questions; Lists of Interests; Determining Distances Using Yahoo! Maps; Getting a Yahoo! Application Key; Using the Geocoding API; Calculating the Distance; Creating the New Dataset; Scaling the Data; Understanding Kernel Methods; The Kernel Trick; Support-Vector Machines; Using LIBSVM; Getting LIBSVM; A Sample Session; Applying SVM to the Matchmaker Dataset; Matching on Facebook; Getting a Developer Key; Creating a Session; Download Friend Data; Building a Match Dataset; Creating an SVM Model; Exercises
10. Finding Independent Features
A Corpus of News; Selecting Sources; Downloading Sources; Converting to a Matrix; Previous Approaches; Bayesian Classification; Clustering; Non-Negative Matrix Factorization; A Quick Introduction to Matrix Math; What Does This Have to Do with the Articles Matrix?; Using NumPy; The Algorithm; Displaying the Results; Displaying by Article; Using Stock Market Data; What Is Trading Volume?; Downloading Data from Yahoo! Finance; Preparing a Matrix; Running NMF; Displaying the Results; Exercises
11. Evolving Intelligence
What Is Genetic Programming?; Genetic Programming Versus Genetic Algorithms; Programs As Trees; Representing Trees in Python; Building and Evaluating Trees; Displaying the Program; Creating the Initial Population; Testing a Solution; A Simple Mathematical Test; Measuring Success; Mutating Programs; Crossover; Building the Environment; The Importance of Diversity; A Simple Game; A Round-Robin Tournament; Playing Against Real People; Further Possibilities; More Numerical Functions; Memory; Different Datatypes; Exercises
12. Algorithm Summary
Bayesian Classifier; Training; Classifying; Using Your Code; Strengths and Weaknesses; Decision Tree Classifier; Training; Using Your Decision Tree Classifier; Strengths and Weaknesses; Neural Networks; Training a Neural Network; Using Your Neural Network Code; Strengths and Weaknesses; Support-Vector Machines; The Kernel Trick; Using LIBSVM; Strengths and Weaknesses; k-Nearest Neighbors; Scaling and Superfluous Variables; Using Your kNN Code; Strengths and Weaknesses; Clustering; Hierarchical Clustering; K-Means Clustering; Using Your Clustering Code; Multidimensional Scaling; Using Your Multidimensional Scaling Code; Non-Negative Matrix Factorization; Using Your NMF Code; Optimization; The Cost Function; Simulated Annealing; Genetic Algorithms; Using Your Optimization Code
A. Third-Party Libraries
Universal Feed Parser; Installation for All Platforms; Python Imaging Library; Installation on Windows; Installation on Other Platforms; Simple Usage Example; Beautiful Soup; Installation on All Platforms; Simple Usage Example; pysqlite; Installation on Windows; Installation on Other Platforms; Simple Usage Example; NumPy; Installation on Windows; Installation on Other Platforms; Simple Usage Example; matplotlib; Installation; Simple Usage Example; pydelicious; Installation for All Platforms; Simple Usage Example
B. Mathematical Formulas
Euclidean Distance; Pearson Correlation Coefficient; Weighted Mean; Tanimoto Coefficient; Conditional Probability; Gini Impurity; Entropy; Variance; Gaussian Function; Dot-Products
Index
About the Author
Colophon
Copyright

Content preview from Programming Collective Intelligence

Chapter 9. Advanced Classification: Kernel Methods and SVMs

Previous chapters have considered several classifiers, including decision trees, Bayesian classifiers, and neural networks. This chapter will introduce the concept of linear classifiers and kernel methods as a prelude to covering one of the most advanced classifiers, and one that remains an active area of research, called support-vector machines (SVMs).

The dataset used throughout much of the chapter pertains to matching people on a dating site. Given information about two people, can we predict whether they will be a good match? This is an interesting problem because there are many variables, both numerical and nominal, and many nonlinear relationships. This dataset will be used to demonstrate some of the weaknesses of the previously described classifiers, and to show how the dataset can be tweaked to work better with these algorithms. An important thing to take away from this chapter is that it’s rarely possible to throw a complex dataset at an algorithm and expect it to learn how to classify things accurately. Choosing the right algorithm and preprocessing the data appropriately are often required to get good results. I hope that going through the process of tweaking this dataset will give you ideas for how to modify others in the future.

At the end of the chapter, you’ll learn how to build a dataset of real people from Facebook, a popular social networking site, and you’ll use the algorithms to predict whether people with ...

Publisher Resources

ISBN: 9780596529321