Data Science Using Python and R

Book description

Learn data science by doing data science! 

Data Science Using Python and R will get you plugged into the world’s two most widespread open-source platforms for data science: Python and R.

Data science is hot. Bloomberg called data scientist “the hottest job in America.” Python and R are the top two open-source data science tools in the world. In Data Science Using Python and R, you will learn, step by step, how to produce hands-on solutions to real-world business problems using state-of-the-art techniques.

Data Science Using Python and R is written for the general reader with no previous analytics or programming experience. An entire chapter is dedicated to learning the basics of Python and R. Then, each chapter presents step-by-step instructions and walkthroughs for solving data science problems using Python and R.

Those with analytics experience will appreciate having a one-stop shop for learning how to do data science using Python and R. Topics covered include data preparation, exploratory data analysis, preparing to model the data, decision trees, model evaluation, misclassification costs, naïve Bayes classification, neural networks, clustering, regression modeling, dimension reduction, and association rule mining.

Exciting new topics such as random forests and generalized linear models are also included. The book emphasizes data-driven error costs to enhance profitability, helping readers avoid common pitfalls that may cost a company millions of dollars.

Data Science Using Python and R provides exercises at the end of every chapter, more than 500 in all, so readers have plenty of opportunity to test their newfound data science skills and expertise. In the Hands-on Analysis exercises, readers are challenged to solve interesting business problems using real-world data sets.

Table of contents

  1. COVER
  2. PREFACE
    1. DATA SCIENCE USING PYTHON AND R
  3. ABOUT THE AUTHORS
  4. ACKNOWLEDGMENTS
    1. CHANTAL'S ACKNOWLEDGMENTS
    2. DANIEL'S ACKNOWLEDGMENTS
  5. Chapter 1: INTRODUCTION TO DATA SCIENCE
    1. 1.1 WHY DATA SCIENCE?
    2. 1.2 WHAT IS DATA SCIENCE?
    3. 1.3 THE DATA SCIENCE METHODOLOGY
    4. 1.4 DATA SCIENCE TASKS
    5. EXERCISES
  6. Chapter 2: THE BASICS OF PYTHON AND R
    1. 2.1 DOWNLOADING PYTHON
    2. 2.2 BASICS OF CODING IN PYTHON
    3. 2.3 DOWNLOADING R AND RSTUDIO
    4. 2.4 BASICS OF CODING IN R
    5. REFERENCES
    6. EXERCISES
  7. Chapter 3: DATA PREPARATION
    1. 3.1 THE BANK MARKETING DATA SET
    2. 3.2 THE PROBLEM UNDERSTANDING PHASE
    3. 3.3 DATA PREPARATION PHASE
    4. 3.4 ADDING AN INDEX FIELD
    5. 3.5 CHANGING MISLEADING FIELD VALUES
    6. 3.6 REEXPRESSION OF CATEGORICAL DATA AS NUMERIC
    7. 3.7 STANDARDIZING THE NUMERIC FIELDS
    8. 3.8 IDENTIFYING OUTLIERS
    9. REFERENCES
    10. EXERCISES
  8. Chapter 4: EXPLORATORY DATA ANALYSIS
    1. 4.1 EDA VERSUS HT
    2. 4.2 BAR GRAPHS WITH RESPONSE OVERLAY
    3. 4.3 CONTINGENCY TABLES
    4. 4.4 HISTOGRAMS WITH RESPONSE OVERLAY
    5. 4.5 BINNING BASED ON PREDICTIVE VALUE
    6. REFERENCES
    7. EXERCISES
  9. Chapter 5: PREPARING TO MODEL THE DATA
    1. 5.1 THE STORY SO FAR
    2. 5.2 PARTITIONING THE DATA
    3. 5.3 VALIDATING YOUR PARTITION
    4. 5.4 BALANCING THE TRAINING DATA SET
    5. 5.5 ESTABLISHING BASELINE MODEL PERFORMANCE
    6. REFERENCES
    7. EXERCISES
  10. Chapter 6: DECISION TREES
    1. 6.1 INTRODUCTION TO DECISION TREES
    2. 6.2 CLASSIFICATION AND REGRESSION TREES
    3. 6.3 THE C5.0 ALGORITHM FOR BUILDING DECISION TREES
    4. 6.4 RANDOM FORESTS
    5. REFERENCES
    6. EXERCISES
  11. Chapter 7: MODEL EVALUATION
    1. 7.1 INTRODUCTION TO MODEL EVALUATION
    2. 7.2 CLASSIFICATION EVALUATION MEASURES
    3. 7.3 SENSITIVITY AND SPECIFICITY
    4. 7.4 PRECISION, RECALL, AND Fβ SCORES
    5. 7.5 METHOD FOR MODEL EVALUATION
    6. 7.6 AN APPLICATION OF MODEL EVALUATION
    7. 7.7 ACCOUNTING FOR UNEQUAL ERROR COSTS
    8. 7.8 COMPARING MODELS WITH AND WITHOUT UNEQUAL ERROR COSTS
    9. 7.9 DATA‐DRIVEN ERROR COSTS
    10. EXERCISES
  12. Chapter 8: NAÏVE BAYES CLASSIFICATION
    1. 8.1 INTRODUCTION TO NAÏVE BAYES
    2. 8.2 BAYES THEOREM
    3. 8.3 MAXIMUM A POSTERIORI HYPOTHESIS
    4. 8.4 CLASS CONDITIONAL INDEPENDENCE
    5. 8.5 APPLICATION OF NAÏVE BAYES CLASSIFICATION
    6. REFERENCES
    7. EXERCISES
  13. Chapter 9: NEURAL NETWORKS
    1. 9.1 INTRODUCTION TO NEURAL NETWORKS
    2. 9.2 THE NEURAL NETWORK STRUCTURE
    3. 9.3 CONNECTION WEIGHTS AND THE COMBINATION FUNCTION
    4. 9.4 THE SIGMOID ACTIVATION FUNCTION
    5. 9.5 BACKPROPAGATION
    6. 9.6 AN APPLICATION OF A NEURAL NETWORK MODEL
    7. 9.7 INTERPRETING THE WEIGHTS IN A NEURAL NETWORK MODEL
    8. 9.8 HOW TO USE NEURAL NETWORKS IN R
    9. REFERENCES
    10. EXERCISES
  14. Chapter 10: CLUSTERING
    1. 10.1 WHAT IS CLUSTERING?
    2. 10.2 INTRODUCTION TO THE k‐MEANS CLUSTERING ALGORITHM
    3. 10.3 AN APPLICATION OF k‐MEANS CLUSTERING
    4. 10.4 CLUSTER VALIDATION
    5. 10.5 HOW TO PERFORM k‐MEANS CLUSTERING USING PYTHON
    6. 10.6 HOW TO PERFORM k‐MEANS CLUSTERING USING R
    7. EXERCISES
  15. Chapter 11: REGRESSION MODELING
    1. 11.1 THE ESTIMATION TASK
    2. 11.2 DESCRIPTIVE REGRESSION MODELING
    3. 11.3 AN APPLICATION OF MULTIPLE REGRESSION MODELING
    4. 11.4 HOW TO PERFORM MULTIPLE REGRESSION MODELING USING PYTHON
    5. 11.5 HOW TO PERFORM MULTIPLE REGRESSION MODELING USING R
    6. 11.6 MODEL EVALUATION FOR ESTIMATION
    7. 11.7 STEPWISE REGRESSION
    8. 11.8 BASELINE MODELS FOR REGRESSION
    9. REFERENCES
    10. EXERCISES
  16. Chapter 12: DIMENSION REDUCTION
    1. 12.1 THE NEED FOR DIMENSION REDUCTION
    2. 12.2 MULTICOLLINEARITY
    3. 12.3 IDENTIFYING MULTICOLLINEARITY USING VARIANCE INFLATION FACTORS
    4. 12.4 PRINCIPAL COMPONENTS ANALYSIS
    5. 12.5 AN APPLICATION OF PRINCIPAL COMPONENTS ANALYSIS
    6. 12.6 HOW MANY COMPONENTS SHOULD WE EXTRACT?
    7. 12.7 PERFORMING PCA WITH k = 4
    8. 12.8 VALIDATION OF THE PRINCIPAL COMPONENTS
    9. 12.9 HOW TO PERFORM PRINCIPAL COMPONENTS ANALYSIS USING PYTHON
    10. 12.10 HOW TO PERFORM PRINCIPAL COMPONENTS ANALYSIS USING R
    11. 12.11 WHEN IS MULTICOLLINEARITY NOT A PROBLEM?
    12. REFERENCES
    13. EXERCISES
  17. Chapter 13: GENERALIZED LINEAR MODELS
    1. 13.1 AN OVERVIEW OF GENERAL LINEAR MODELS
    2. 13.2 LINEAR REGRESSION AS A GENERAL LINEAR MODEL
    3. 13.3 LOGISTIC REGRESSION AS A GENERAL LINEAR MODEL
    4. 13.4 AN APPLICATION OF LOGISTIC REGRESSION MODELING
    5. 13.5 POISSON REGRESSION
    6. 13.6 AN APPLICATION OF POISSON REGRESSION MODELING
    7. REFERENCE
    8. EXERCISES
  18. Chapter 14: ASSOCIATION RULES
    1. 14.1 INTRODUCTION TO ASSOCIATION RULES
    2. 14.2 A SIMPLE EXAMPLE OF ASSOCIATION RULE MINING
    3. 14.3 SUPPORT, CONFIDENCE, AND LIFT
    4. 14.4 MINING ASSOCIATION RULES
    5. 14.5 CONFIRMING OUR METRICS
    6. 14.6 THE CONFIDENCE DIFFERENCE CRITERION
    7. 14.7 THE CONFIDENCE QUOTIENT CRITERION
    8. REFERENCES
    9. EXERCISES
  19. APPENDIX DATA SUMMARIZATION AND VISUALIZATION
    1. PART 1: SUMMARIZATION 1: BUILDING BLOCKS OF DATA ANALYSIS
    2. PART 2: VISUALIZATION: GRAPHS AND TABLES FOR SUMMARIZING AND ORGANIZING DATA
    3. PART 3: SUMMARIZATION 2: MEASURES OF CENTER, VARIABILITY, AND POSITION
    4. PART 4: SUMMARIZATION AND VISUALIZATION OF BIVARIATE RELATIONSHIPS
  20. INDEX
  21. END USER LICENSE AGREEMENT

Product information

  • Title: Data Science Using Python and R
  • Author(s): Chantal D. Larose, Daniel T. Larose
  • Release date: April 2019
  • Publisher(s): Wiley
  • ISBN: 9781119526810