The Shape of Data

Book description

Whether you’re a mathematician, seasoned data scientist, or marketing professional, you’ll find The Shape of Data to be the perfect introduction to the critical interplay between the geometry of data structures and machine learning.

This book’s extensive collection of case studies (drawn from medicine, education, sociology, linguistics, and more) and gentle explanations of the math behind dozens of algorithms provide a comprehensive yet accessible look at how geometry shapes the algorithms that drive data analysis.

In addition to gaining a deeper understanding of how to implement geometry-based algorithms with code, you’ll explore:

  • Supervised and unsupervised learning algorithms and their application to network data analysis
  • The way distance metrics and dimensionality reduction impact machine learning
  • How to visualize, embed, and analyze survey and text data with topology-based algorithms
  • New approaches to computational solutions, including distributed computing and quantum algorithms

Table of contents

  1. Praise for The Shape of Data
  2. Title Page
  3. Copyright
  4. Dedication
  5. About the Authors
  6. Foreword
  7. Acknowledgments
  8. Introduction
    1. Who Is This Book For?
    2. About This Book
    3. Downloading and Installing R
    4. Installing R Packages
    5. Getting Help with R
    6. Support for Python Users
    7. Summary
  9. Chapter 1: The Geometric Structure of Data
    1. Machine Learning Categories
      1. Supervised Learning
      2. Unsupervised Learning
      3. Matching Algorithms and Other Machine Learning
    2. Structured Data
      1. The Geometry of Dummy Variables
      2. The Geometry of Numerical Spreadsheets
      3. The Geometry of Supervised Learning
    3. Unstructured Data
      1. Network Data
      2. Image Data
      3. Text Data
    4. Summary
  10. Chapter 2: The Geometric Structure of Networks
    1. The Basics of Network Theory
      1. Directed Networks
      2. Networks in R
      3. Paths and Distance in a Network
    2. Network Centrality Metrics
      1. The Degree of a Vertex
      2. The Closeness of a Vertex
      3. The Betweenness of a Vertex
      4. Eigenvector Centrality
      5. PageRank Centrality
      6. Katz Centrality
      7. Hub and Authority
    3. Measuring Centrality in an Example Social Network
    4. Additional Quantities of a Network
      1. The Diversity of a Vertex
      2. Triadic Closure
      3. The Efficiency and Eccentricity of a Vertex
      4. Forman–Ricci Curvature
    5. Global Network Metrics
      1. The Interconnectivity of a Network
      2. Spreading Processes on a Network
      3. Spectral Measures of a Network
    6. Network Models for Real-World Behavior
      1. Erdős–Rényi Graphs
      2. Scale-Free Graphs
      3. Watts–Strogatz Graphs
    7. Summary
  11. Chapter 3: Network Analysis
    1. Using Network Data for Supervised Learning
      1. Making Predictions with Social Media Network Metrics
      2. Predicting Network Links in Social Media
    2. Using Network Data for Unsupervised Learning
      1. Applying Clustering to the Social Media Dataset
      2. Community Mining in a Network
    3. Comparing Networks
    4. Analyzing Spread Through Networks
      1. Tracking Disease Spread Between Towns
      2. Tracking Disease Spread Between Windsurfers
      3. Disrupting Communication and Disease Spread
    5. Summary
  12. Chapter 4: Network Filtration
    1. Graph Filtration
    2. From Graphs to Simplicial Complexes
      1. Examples of Betti Numbers
      2. The Euler Characteristic
      3. Persistent Homology
    3. Comparison of Networks with Persistent Homology
    4. Summary
  13. Chapter 5: Geometry in Data Science
    1. Common Distance Metrics
      1. Simulating a Small Dataset
      2. Using Norm-Based Distance Metrics
      3. Comparing Diagrams, Shapes, and Probability Distributions
    2. K-Nearest Neighbors with Metric Geometry
    3. Manifold Learning
      1. Using Multidimensional Scaling
      2. Extending Multidimensional Scaling with Isomap
      3. Capturing Local Properties with Locally Linear Embedding
      4. Visualizing with t-Distributed Stochastic Neighbor Embedding
    4. Fractals
    5. Summary
  14. Chapter 6: Newer Applications of Geometry in Machine Learning
    1. Working with Nonlinear Spaces
      1. Introducing dgLARS
      2. Predicting Depression with dgLARS
      3. Predicting Credit Default with dgLARS
    2. Applying Discrete Exterior Derivatives
    3. Nonlinear Algebra in Machine Learning Algorithms
    4. Comparing Choice Rankings with HodgeRank
    5. Summary
  15. Chapter 7: Tools for Topological Data Analysis
    1. Finding Distinctive Groups with Unique Behavior
    2. Validating Measurement Tools
    3. Using the Mapper Algorithm for Subgroup Mining
      1. Stepping Through the Mapper Algorithm
      2. Using TDAmapper to Find Cluster Structures in Data
    4. Summary
  16. Chapter 8: Homotopy Algorithms
    1. Introducing Homotopy
    2. Introducing Homotopy-Based Regression
    3. Comparing Results on a Sample Dataset
    4. Summary
  17. Chapter 9: Final Project: Analyzing Text Data
    1. Building a Natural Language Processing Pipeline
    2. The Project: Analyzing Language in Poetry
      1. Tokenizing Text Data
      2. Tagging Parts of Speech
      3. Normalizing Vectors
    3. Analyzing the Poem Dataset in R
    4. Using Topology-Based NLP Tools
    5. Summary
  18. Chapter 10: Multicore and Quantum Computing
    1. Multicore Approaches to Topological Data Analysis
    2. Quantum Computing Approaches
      1. Using the Qubit-Based Model
      2. Using the Qumodes-Based Model
      3. Using Quantum Network Algorithms
      4. Speeding Up Algorithms with Quantum Computing
      5. Using Image Classifiers on Quantum Computers
    3. Summary
  19. References
  20. Index

Product information

  • Title: The Shape of Data
  • Author(s): Colleen M. Farrelly, Yaé Ulrich Gaba
  • Release date: September 2023
  • Publisher(s): No Starch Press
  • ISBN: 9781718503083