Chapter 4. Optimization for Neural Networks
I have lived each and every day of my life optimizing…. My first aha moment was when I learned that our brain, too, learns a model of the world.
H.
Many artificial neural network architectures include fully connected layers. In this chapter, we explain the mathematics behind a fully connected neural network, and we design and experiment with various training and loss functions. We also explain how the optimization and backpropagation steps used to train neural networks resemble the way learning happens in our brains: the brain learns by strengthening the connections between neurons when it encounters a concept it has seen before, and by weakening them when new information contradicts previously learned concepts. Machines, however, only understand numbers. Mathematically, stronger connections correspond to larger numbers, and weaker connections correspond to smaller numbers.
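To make the idea of connection strengths as numbers concrete, here is a minimal NumPy sketch of a single fully connected layer; the layer sizes, random initialization, and ReLU activation are illustrative assumptions, not a prescription from this chapter:

```python
import numpy as np

# A single fully connected layer: each entry of W is the numerical
# "strength" of the connection between one input neuron and one
# output neuron. Training nudges these numbers up or down.
rng = np.random.default_rng(0)
n_inputs, n_outputs = 4, 3                   # illustrative sizes
W = rng.normal(size=(n_outputs, n_inputs))   # connection strengths (weights)
b = np.zeros(n_outputs)                      # biases

def relu(z):
    return np.maximum(0.0, z)

x = rng.normal(size=n_inputs)                # one input example
h = relu(W @ x + b)                          # the layer's output
print(h)
```

A larger entry of `W` means the corresponding input neuron influences the output neuron more strongly; training adjusts these entries to reduce the loss function.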
Finally, we walk through various regularization techniques, explaining their advantages, disadvantages, and use cases.
The Brain Cortex and Artificial Neural Networks
Neural networks are modeled after the brain's cortex, which contains billions of neurons arranged in layers. Figure 4-1 shows three vertical cross sections of the brain's neocortex, and Figure 4-2 shows a diagram of a fully connected artificial neural network.
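A fully connected network like the one sketched in Figure 4-2 can be written in a few lines: each layer feeds every one of its outputs into every neuron of the next layer. The sketch below assumes layer widths and a ReLU activation chosen purely for illustration:

```python
import numpy as np

# A small fully connected (feedforward) network: input layer of width 5,
# two hidden layers of width 8, and an output layer of width 2.
rng = np.random.default_rng(1)
widths = [5, 8, 8, 2]   # illustrative layer widths

# One (weights, biases) pair for each pair of consecutive layers
params = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(widths[:-1], widths[1:])]

def forward(x):
    h = x
    for W, b in params[:-1]:
        h = np.maximum(0.0, W @ h + b)   # ReLU on the hidden layers
    W_out, b_out = params[-1]
    return W_out @ h + b_out             # linear output layer

print(forward(rng.normal(size=widths[0])))
```

Each weight matrix plays the role of the bundle of connections between two adjacent layers of neurons, in loose analogy with the layered structure of the cortex.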