Data Science from Scratch

by Joel Grus

April 2015

Beginner

328 pages

7h 18m

English

O'Reilly Media, Inc.

Content preview from Data Science from Scratch

Chapter 17. Decision Trees

A tree is an incomprehensible mystery.

Jim Woodring

DataSciencester’s VP of Talent has interviewed a number of job candidates from the site, with varying degrees of success. He’s collected a data set consisting of several (qualitative) attributes of each candidate, as well as whether that candidate interviewed well or poorly. Could you, he asks, use this data to build a model identifying which candidates will interview well, so that he doesn’t have to waste time conducting interviews?

This seems like a good fit for a decision tree, another predictive modeling tool in the data scientist’s kit.

What Is a Decision Tree?

A decision tree uses a tree structure to represent a number of possible decision paths and an outcome for each path.

If you have ever played the game Twenty Questions, then it turns out you are familiar with decision trees. For example:

“I am thinking of an animal.”
“Does it have more than five legs?”
“No.”
“Is it delicious?”
“No.”
“Does it appear on the back of the Australian five-cent coin?”
“Yes.”
“Is it an echidna?”
“Yes, it is!”

This corresponds to the path:

“Not more than 5 legs” → “Not delicious” → “On the 5-cent coin” → “Echidna!”

in an idiosyncratic (and not very comprehensive) “guess the animal” decision tree (Figure 17-1).
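
As a minimal sketch of the idea, the path above can be written down as a small nested structure and followed answer by answer. The representation here is an assumption made for illustration: an internal node is a hypothetical `(question, branches)` tuple, a leaf is a plain string, and `follow_path` is a made-up helper name. Only the branches that appear in the dialogue are filled in, since the rest of the tree is not given in the text.

```python
# Hypothetical representation (not taken from the book):
#   internal node = (question, {answer: subtree})
#   leaf          = a string naming the guess
animal_tree = (
    "Does it have more than five legs?",
    {
        "No": (
            "Is it delicious?",
            {
                "No": (
                    "Does it appear on the back of the Australian five-cent coin?",
                    {"Yes": "Echidna!"},
                ),
            },
        ),
    },
)


def follow_path(tree, answers):
    """Walk the tree using a sequence of answers.

    Returns whatever is reached: a leaf (string) if the answers lead to an
    outcome, or a subtree (tuple) if questions remain.
    """
    node = tree
    for answer in answers:
        if isinstance(node, str):
            # Already at an outcome, so there is no question left to answer.
            raise ValueError(f"reached leaf {node!r} with answers left over")
        question, branches = node
        if answer not in branches:
            raise KeyError(f"no branch for answer {answer!r} to {question!r}")
        node = branches[answer]
    return node


# The dialogue above: "No" -> "No" -> "Yes"
print(follow_path(animal_tree, ["No", "No", "Yes"]))  # Echidna!
```

Each call to `follow_path` traces exactly one decision path, and the leaf it lands on is the outcome for that path. Stopping early, for example with `["No"]`, returns the remaining subtree, which is itself a smaller decision tree.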

Decision trees have a lot to recommend ...

Publisher Resources

ISBN: 9781491901410

Figure 17-1. A “guess the animal” decision tree
