Essential Math for Data Science

by Thomas Nield

May 2022

Intermediate to advanced

352 pages

9h 15m

English

O'Reilly Media, Inc.

Content preview from Essential Math for Data Science

Chapter 6. Logistic Regression and Classification

This chapter covers logistic regression, a type of regression that predicts the probability of an outcome given one or more independent variables. That probability can in turn be used for classification, which means predicting categories rather than real numbers as we did with linear regression.

We are not always interested in representing variables as continuous, where they can take any of an infinite number of real decimal values. In some situations we would rather have variables be discrete: whole numbers, integers, or booleans (1/0, true/false). Logistic regression is trained on an output variable that is discrete (a binary 1 or 0) or categorical (encoded as a whole number). It does output a continuous variable in the form of a probability, but that can be converted into a discrete value with a threshold.
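As a minimal sketch of that last point, the Python snippet below turns a continuous probability into a discrete 1 or 0 with a threshold. It assumes the standard one-variable logistic function; the names `logistic` and `classify`, the coefficients `b0` and `b1`, and the 0.5 threshold are illustrative assumptions, not values taken from the book or fitted to any data.

```python
import math


def logistic(x, b0, b1):
    """Return the predicted probability of the positive class (1) for input x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def classify(probability, threshold=0.5):
    """Convert a continuous probability into a discrete 1 or 0."""
    return 1 if probability >= threshold else 0


# Hypothetical coefficients chosen only for illustration, not fitted to data
b0, b1 = -3.0, 0.7

for x in (1.0, 4.0, 8.0):
    p = logistic(x, b0, b1)
    print(f"x={x}: probability={p:.3f}, class={classify(p)}")
```

Raising or lowering `threshold` trades one kind of misclassification for the other, so 0.5 is only a common default, not a requirement.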

Logistic regression is easy to implement and fairly resilient against outliers and other data challenges. For many machine learning problems it is the best tool available, offering more practicality and performance than other types of supervised machine learning.

Just as we did in Chapter 5 when we covered linear regression, we will attempt to walk the line between statistics and machine learning, using tools and analysis from both disciplines. Logistic regression integrates many concepts we have learned in this book, from probability to linear regression.