Statistics, Data Mining, and Machine Learning in Astronomy

Book Description

As telescopes, detectors, and computers grow ever more powerful, the volume of data at the disposal of astronomers and astrophysicists will enter the petabyte domain, providing accurate measurements for billions of celestial objects. This book provides a comprehensive and accessible introduction to the cutting-edge statistical methods needed to efficiently analyze complex data sets from astronomical surveys such as the Panoramic Survey Telescope and Rapid Response System, the Dark Energy Survey, and the upcoming Large Synoptic Survey Telescope. It serves as a practical handbook for graduate students and advanced undergraduates in physics and astronomy, and as an indispensable reference for researchers.

Statistics, Data Mining, and Machine Learning in Astronomy presents a wealth of practical analysis problems, evaluates techniques for solving them, and explains how to use various approaches for different types and sizes of data sets. For all applications described in the book, Python code and example data sets are provided. The supporting data sets have been carefully selected from contemporary astronomical surveys (for example, the Sloan Digital Sky Survey) and are easy to download and use. The accompanying Python code is publicly available, well documented, and follows uniform coding standards. Together, the data sets and code enable readers to reproduce all the figures and examples, evaluate the methods, and adapt them to their own fields of interest.

  • Describes the most useful statistical and data-mining methods for extracting knowledge from huge and complex astronomical data sets
  • Features real-world data sets from contemporary astronomical surveys
  • Uses a freely available Python codebase throughout
  • Ideal for students and working astronomers
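
To give a flavor of the analysis workflows the book walks through, here is a minimal sketch that fits a two-component Gaussian mixture to simulated "color" measurements with scikit-learn (one of the libraries the AstroML codebase builds on). The synthetic data and parameter values are illustrative only and are not drawn from the book's examples or data sets.

    # Minimal illustrative sketch: fit a Gaussian mixture model to a
    # simulated one-dimensional "color" sample. The data are synthetic;
    # the book's own examples use real survey data loaded through AstroML.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(42)

    # Simulate two overlapping populations (e.g., two stellar classes).
    colors = np.concatenate([
        rng.normal(loc=0.3, scale=0.10, size=500),
        rng.normal(loc=0.9, scale=0.15, size=300),
    ]).reshape(-1, 1)

    # Fit a two-component mixture via expectation-maximization and report
    # the recovered component means and weights.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(colors)
    print("component means:", gmm.means_.ravel())
    print("component weights:", gmm.weights_)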

Table of Contents

  1. Cover
  2. Title
  3. Copyright
  4. Contents
  5. Preface
  6. I Introduction
    1. 1 About the Book and Supporting Material
      1. 1.1 What Do Data Mining, Machine Learning, and Knowledge Discovery Mean?
      2. 1.2 What Is This Book About?
      3. 1.3 An Incomplete Survey of the Relevant Literature
      4. 1.4 Introduction to the Python Language and the Git Code Management Tool
      5. 1.5 Description of Surveys and Data Sets Used in Examples
      6. 1.6 Plotting and Visualizing the Data in This Book
      7. 1.7 How to Efficiently Use This Book
      8. References
    2. 2 Fast Computation on Massive Data Sets
      1. 2.1 Data Types and Data Management Systems
      2. 2.2 Analysis of Algorithmic Efficiency
      3. 2.3 Seven Types of Computational Problem
      4. 2.4 Seven Strategies for Speeding Things Up
      5. 2.5 Case Studies: Speedup Strategies in Practice
      6. References
  7. II Statistical Frameworks and Exploratory Data Analysis
    1. 3 Probability and Statistical Distributions
      1. 3.1 Brief Overview of Probability and Random Variables
      2. 3.2 Descriptive Statistics
      3. 3.3 Common Univariate Distribution Functions
      4. 3.4 The Central Limit Theorem
      5. 3.5 Bivariate and Multivariate Distribution Functions
      6. 3.6 Correlation Coefficients
      7. 3.7 Random Number Generation for Arbitrary Distributions
      8. References
    2. 4 Classical Statistical Inference
      1. 4.1 Classical vs. Bayesian Statistical Inference
      2. 4.2 Maximum Likelihood Estimation (MLE)
      3. 4.3 The Goodness of Fit and Model Selection
      4. 4.4 ML Applied to Gaussian Mixtures: The Expectation Maximization Algorithm
      5. 4.5 Confidence Estimates: The Bootstrap and the Jackknife
      6. 4.6 Hypothesis Testing
      7. 4.7 Comparison of Distributions
      8. 4.8 Nonparametric Modeling and Histograms
      9. 4.9 Selection Effects and Luminosity Function Estimation
      10. 4.10 Summary
      11. References
    3. 5 Bayesian Statistical Inference
      1. 5.1 Introduction to the Bayesian Method
      2. 5.2 Bayesian Priors
      3. 5.3 Bayesian Parameter Uncertainty Quantification
      4. 5.4 Bayesian Model Selection
      5. 5.5 Nonuniform Priors: Eddington, Malmquist, and Lutz–Kelker Biases
      6. 5.6 Simple Examples of Bayesian Analysis: Parameter Estimation
      7. 5.7 Simple Examples of Bayesian Analysis: Model Selection
      8. 5.8 Numerical Methods for Complex Problems (MCMC)
      9. 5.9 Summary of Pros and Cons for Classical and Bayesian Methods
      10. References
  8. III Data Mining and Machine Learning
    1. 6 Searching for Structure in Point Data
      1. 6.1 Nonparametric Density Estimation
      2. 6.2 Nearest-Neighbor Density Estimation
      3. 6.3 Parametric Density Estimation
      4. 6.4 Finding Clusters in Data
      5. 6.5 Correlation Functions
      6. 6.6 Which Density Estimation and Clustering Algorithms Should I Use?
      7. References
    2. 7 Dimensionality and Its Reduction
      1. 7.1 The Curse of Dimensionality
      2. 7.2 The Data Sets Used in This Chapter
      3. 7.3 Principal Component Analysis
      4. 7.4 Nonnegative Matrix Factorization
      5. 7.5 Manifold Learning
      6. 7.6 Independent Component Analysis and Projection Pursuit
      7. 7.7 Which Dimensionality Reduction Technique Should I Use?
      8. References
    3. 8 Regression and Model Fitting
      1. 8.1 Formulation of the Regression Problem
      2. 8.2 Regression for Linear Models
      3. 8.3 Regularization and Penalizing the Likelihood
      4. 8.4 Principal Component Regression
      5. 8.5 Kernel Regression
      6. 8.6 Locally Linear Regression
      7. 8.7 Nonlinear Regression
      8. 8.8 Uncertainties in the Data
      9. 8.9 Regression That Is Robust to Outliers
      10. 8.10 Gaussian Process Regression
      11. 8.11 Overfitting, Underfitting, and Cross-Validation
      12. 8.12 Which Regression Method Should I Use?
      13. References
    4. 9 Classification
      1. 9.1 Data Sets Used in This Chapter
      2. 9.2 Assigning Categories: Classification
      3. 9.3 Generative Classification
      4. 9.4 K-Nearest-Neighbor Classifier
      5. 9.5 Discriminative Classification
      6. 9.6 Support Vector Machines
      7. 9.7 Decision Trees
      8. 9.8 Evaluating Classifiers: ROC Curves
      9. 9.9 Which Classifier Should I Use?
      10. References
    5. 10 Time Series Analysis
      1. 10.1 Main Concepts for Time Series Analysis
      2. 10.2 Modeling Toolkit for Time Series Analysis
      3. 10.3 Analysis of Periodic Time Series
      4. 10.4 Temporally Localized Signals
      5. 10.5 Analysis of Stochastic Processes
      6. 10.6 Which Method Should I Use for Time Series Analysis?
      7. References
  9. IV Appendices
    1. A An Introduction to Scientific Computing with Python
      1. A.1 A Brief History of Python
      2. A.2 The SciPy Universe
      3. A.3 Getting Started with Python
      4. A.4 IPython: The Basics of Interactive Computing
      5. A.5 Introduction to NumPy
      6. A.6 Visualization with Matplotlib
      7. A.7 Overview of Useful NumPy/SciPy Modules
      8. A.8 Efficient Coding with Python and NumPy
      9. A.9 Wrapping Existing Code in Python
      10. A.10 Other Resources
    2. B AstroML: Machine Learning for Astronomy
      1. B.1 Introduction
      2. B.2 Dependencies
      3. B.3 Tools Included in AstroML v0.1
    3. C Astronomical Flux Measurements and Magnitudes
      1. C.1 The Definition of the Specific Flux
      2. C.2 Wavelength Window Function for Astronomical Measurements
      3. C.3 The Astronomical Magnitude Systems
    4. D SQL Query for Downloading SDSS Data
    5. E Approximating the Fourier Transform with the FFT
      1. References
  10. Visual Figure Index
  11. Index