Data Science from Scratch, 2nd Edition

by Joel Grus

May 2019

Beginner

403 pages

9h 18m

English

O'Reilly Media, Inc.

Content preview from Data Science from Scratch, 2nd Edition

Chapter 13. Naive Bayes

It is well for the heart to be naive and for the mind not to be.

Anatole France

A social network isn’t much good if people can’t network. Accordingly, DataSciencester has a popular feature that allows members to send messages to other members. And while most members are responsible citizens who send only well-received “how’s it going?” messages, a few miscreants persistently spam other members about get-rich schemes, no-prescription-required pharmaceuticals, and for-profit data science credentialing programs. Your users have begun to complain, and so the VP of Messaging has asked you to use data science to figure out a way to filter out these spam messages.

A Really Dumb Spam Filter

Imagine a “universe” that consists of receiving a message chosen randomly from all possible messages. Let S be the event “the message is spam” and B be the event “the message contains the word bitcoin.” Bayes’s theorem tells us that the probability that the message is spam conditional on containing the word bitcoin is:

$$P(S \mid B) = \frac{P(B \mid S)\,P(S)}{P(B \mid S)\,P(S) + P(B \mid \neg S)\,P(\neg S)}$$

The numerator is the probability that a message is spam and contains bitcoin, while the denominator is just the probability that a message contains bitcoin. Hence, you can think of this calculation as simply representing the proportion of bitcoin messages that are spam.
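
As a rough illustration of the formula above, here is a minimal Python sketch. The function name `p_spam_given_word` and the input probabilities are hypothetical, chosen only to show the arithmetic; they are not taken from any real message data.

```python
def p_spam_given_word(p_word_given_spam: float,
                      p_word_given_ham: float,
                      p_spam: float) -> float:
    """Apply Bayes's theorem to get P(S | B).

    p_word_given_spam: P(B | S), chance a spam message contains the word
    p_word_given_ham:  P(B | not S), chance a non-spam message contains it
    p_spam:            P(S), prior chance that a message is spam
    """
    p_ham = 1.0 - p_spam                                   # P(not S)
    numerator = p_word_given_spam * p_spam                 # P(B | S) P(S)
    denominator = numerator + p_word_given_ham * p_ham     # P(B)
    return numerator / denominator


# Hypothetical inputs: half of all messages are spam, 50% of spam
# messages contain "bitcoin", and 1% of non-spam messages do.
posterior = p_spam_given_word(0.5, 0.01, 0.5)
print(round(posterior, 4))  # 0.9804
```

Under these assumed numbers, about 98% of messages containing bitcoin are spam. A word that is equally likely in spam and non-spam messages leaves the posterior equal to the prior P(S).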

If we have a large collection of messages we know are spam, and a large collection of messages ...

ISBN: 9781492041122
