Data Science from Scratch

Name: Data Science from Scratch
Author: Joel Grus
ISBN: 9781491901427

April 2015

Beginner

328 pages

7h 18m

English

O'Reilly Media, Inc.

Preface
Data Science; From Scratch; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
1. Introduction
The Ascendance of Data; What Is Data Science?; Motivating Hypothetical: DataSciencester; Finding Key Connectors; Data Scientists You May Know; Salaries and Experience; Paid Accounts; Topics of Interest; Onward
2. A Crash Course in Python
The Basics; Getting Python; The Zen of Python; Whitespace Formatting; Modules; Arithmetic; Functions; Strings; Exceptions; Lists; Tuples; Dictionaries; Sets; Control Flow; Truthiness; The Not-So-Basics; Sorting; List Comprehensions; Generators and Iterators; Randomness; Regular Expressions; Object-Oriented Programming; Functional Tools; enumerate; zip and Argument Unpacking; args and kwargs; Welcome to DataSciencester!; For Further Exploration
3. Visualizing Data
matplotlib; Bar Charts; Line Charts; Scatterplots; For Further Exploration
4. Linear Algebra
Vectors; Matrices; For Further Exploration
5. Statistics
Describing a Single Set of Data; Central Tendencies; Dispersion; Correlation; Simpson’s Paradox; Some Other Correlational Caveats; Correlation and Causation; For Further Exploration
6. Probability
Dependence and Independence; Conditional Probability; Bayes’s Theorem; Random Variables; Continuous Distributions; The Normal Distribution; The Central Limit Theorem; For Further Exploration
7. Hypothesis and Inference
Statistical Hypothesis Testing; Example: Flipping a Coin; Confidence Intervals; P-hacking; Example: Running an A/B Test; Bayesian Inference; For Further Exploration
8. Gradient Descent
The Idea Behind Gradient Descent; Estimating the Gradient; Using the Gradient; Choosing the Right Step Size; Putting It All Together; Stochastic Gradient Descent; For Further Exploration
9. Getting Data
stdin and stdout; Reading Files; The Basics of Text Files; Delimited Files; Scraping the Web; HTML and the Parsing Thereof; Example: O’Reilly Books About Data; Using APIs; JSON (and XML); Using an Unauthenticated API; Finding APIs; Example: Using the Twitter APIs; Getting Credentials; For Further Exploration
10. Working with Data
Exploring Your Data; Exploring One-Dimensional Data; Two Dimensions; Many Dimensions; Cleaning and Munging; Manipulating Data; Rescaling; Dimensionality Reduction; For Further Exploration
11. Machine Learning
Modeling; What Is Machine Learning?; Overfitting and Underfitting; Correctness; The Bias-Variance Trade-off; Feature Extraction and Selection; For Further Exploration
12. k-Nearest Neighbors
The Model; Example: Favorite Languages; The Curse of Dimensionality; For Further Exploration
13. Naive Bayes
A Really Dumb Spam Filter; A More Sophisticated Spam Filter; Implementation; Testing Our Model; For Further Exploration
14. Simple Linear Regression
The Model; Using Gradient Descent; Maximum Likelihood Estimation; For Further Exploration
15. Multiple Regression
The Model; Further Assumptions of the Least Squares Model; Fitting the Model; Interpreting the Model; Goodness of Fit; Digression: The Bootstrap; Standard Errors of Regression Coefficients; Regularization; For Further Exploration
16. Logistic Regression
The Problem; The Logistic Function; Applying the Model; Goodness of Fit; Support Vector Machines; For Further Investigation
17. Decision Trees
What Is a Decision Tree?; Entropy; The Entropy of a Partition; Creating a Decision Tree; Putting It All Together; Random Forests; For Further Exploration
18. Neural Networks
Perceptrons; Feed-Forward Neural Networks; Backpropagation; Example: Defeating a CAPTCHA; For Further Exploration
19. Clustering
The Idea; The Model; Example: Meetups; Choosing k; Example: Clustering Colors; Bottom-up Hierarchical Clustering; For Further Exploration
20. Natural Language Processing
Word Clouds; n-gram Models; Grammars; An Aside: Gibbs Sampling; Topic Modeling; For Further Exploration
21. Network Analysis
Betweenness Centrality; Eigenvector Centrality; Matrix Multiplication; Centrality; Directed Graphs and PageRank; For Further Exploration
22. Recommender Systems
Manual Curation; Recommending What’s Popular; User-Based Collaborative Filtering; Item-Based Collaborative Filtering; For Further Exploration
23. Databases and SQL
CREATE TABLE and INSERT; UPDATE; DELETE; SELECT; GROUP BY; ORDER BY; JOIN; Subqueries; Indexes; Query Optimization; NoSQL; For Further Exploration
24. MapReduce
Example: Word Count; Why MapReduce?; MapReduce More Generally; Example: Analyzing Status Updates; Example: Matrix Multiplication; An Aside: Combiners; For Further Exploration
25. Go Forth and Do Data Science
IPython; Mathematics; Not from Scratch; NumPy; pandas; scikit-learn; Visualization; R; Find Data; Do Data Science; Hacker News; Fire Trucks; T-shirts; And You?
Index

Content preview from Data Science from Scratch

Preface

Data Science

Data scientist has been called “the sexiest job of the 21st century,” presumably by someone who has never visited a fire station. Nonetheless, data science is a hot and growing field, and it doesn’t take a great deal of sleuthing to find analysts breathlessly prognosticating that over the next 10 years, we’ll need billions and billions more data scientists than we currently have.

But what is data science? After all, we can’t produce data scientists if we don’t know what data science is. According to a Venn diagram that is somewhat famous in the industry, data science lies at the intersection of:

Hacking skills
Math and statistics knowledge
Substantive expertise

Although I originally intended to write a book covering all three, I quickly realized that a thorough treatment of “substantive expertise” would require tens of thousands of pages. At that point, I decided to focus on the first two. My goal is to help you develop the hacking skills that you’ll need to get started doing data science. And my goal is to help you get comfortable with the mathematics and statistics that are at the core of data science.

This is a somewhat heavy aspiration for a book. The best way to learn hacking skills is by hacking on things. By reading this book, you will get a good understanding of the way I hack on things, which may not necessarily be the best way for you to hack on things. You will get a good understanding of some of the tools I use, which will not necessarily ...


Publisher Resources

ISBN: 9781491901410
