Data Science: Mindset, Methodologies, and Misconceptions

Book Description

Master the concepts and strategies underlying success and progress in data science.

From the author of the bestsellers Data Scientist and Julia for Data Science, this book covers four foundational areas of data science. The first is the data science pipeline, including methodologies and the data scientist's toolbox. The second is the set of essential practices needed to understand the data, including questions and hypotheses. The third is the set of pitfalls to avoid in the data science process. The fourth is an awareness of future trends and of how modern technologies like Artificial Intelligence (AI) fit into the data science framework.

The following chapters cover these four foundational areas:
  • Chapter 1 - What Is Data Science?
  • Chapter 2 - The Data Science Pipeline
  • Chapter 3 - Data Science Methodologies
  • Chapter 4 - The Data Scientist's Toolbox
  • Chapter 5 - Questions to Ask and the Hypotheses They Are Based On
  • Chapter 6 - Data Science Experiments and Evaluation of Their Results
  • Chapter 7 - Sensitivity Analysis of Experiment Conclusions
  • Chapter 8 - Programming Bugs
  • Chapter 9 - Mistakes Through the Data Science Process
  • Chapter 10 - Dealing with Bugs and Mistakes Effectively and Efficiently
  • Chapter 11 - The Role of Heuristics in Data Science
  • Chapter 12 - The Role of AI in Data Science
  • Chapter 13 - Data Science Ethics
  • Chapter 14 - Future Trends and How to Remain Relevant
Aimed at data science learners of all levels, this book helps the reader go beyond data science techniques and gain a deeper, more holistic understanding of what data science entails. With a focus on the problems data science tries to solve, it challenges the reader to become a self-sufficient player in the field.

Table of Contents

  1. Introduction
  2. Part 1 Overview of Data Science and the Data Scientist’s Work
  3. Chapter 1 What is Data Science?
    1. Data Science vs. Business Intelligence vs. Statistics
      1. Data Science
      2. Business Intelligence
      3. Statistics
    2. Big Data, Machine Learning, and AI
      1. Big Data
      2. Machine Learning
      3. AI – The Scientific Field, Not the Sci-fi Movie!
    3. The Need for Data Scientists and the Products/Services Provided
      1. What Does a Data Scientist Actually Do?
      2. What Does a Data Scientist Not Do?
      3. The Ever-growing Need for Data Science Professionals
    4. Summary
  4. Chapter 2 The Data Science Pipeline
    1. Data Engineering
      1. Data Preparation
      2. Data Exploration
      3. Data Representation
    2. Data Modeling
      1. Data Discovery
      2. Data Learning
    3. Information Distillation
      1. Data Product Creation
      2. Insight, Deliverance, and Visualization
    4. Putting It All Together
    5. Summary
  5. Chapter 3 Data Science Methodologies
    1. Predictive Analytics
      1. Classification
      2. Regression
      3. Time-series Analysis
      4. Anomaly Detection
      5. Text Prediction
    2. Recommender Systems
      1. Content-based Systems
      2. Collaborative Filtering
      3. Non-negative Matrix Factorization (NMF or NNMF)
    3. Automated Data Exploration Methods
      1. Data Mining
      2. Association Rules
      3. Clustering
    4. Graph Analytics
      1. Dimensionless Space
      2. Graph Algorithms
      3. Other Graph-related Topics
    5. Natural Language Processing (NLP)
      1. Sentiment Analysis
      2. Topic Extraction/Modeling
      3. Text Summarization
      4. Other NLP Methods
    6. Other Methodologies
      1. Chatbots
      2. Artificial Creativity
      3. Other AI-based Methods
    7. Summary
  6. Chapter 4 The Data Scientist’s Toolbox
    1. Database Platforms
      1. SQL-based Databases
      2. NoSQL Databases
      3. Graph-based Databases
    2. Programming Languages for Data Science
      1. Julia
      2. Python
      3. R
      4. Scala
      5. Which Language is Best for You?
    3. The Most Useful Packages for Julia and Python
    4. Other Data Analytics Software
      1. MATLAB
      2. Analytica
      3. Mathematica
    5. Visualization Software
      1. Plot.ly
      2. D3.js
      3. WolframAlpha
      4. Tableau
    6. Data Governance Software
      1. Spark
      2. Hadoop
      3. Storm
    7. Version Control Systems (VCS)
      1. Git
      2. GitHub
      3. CVS
    8. Summary
  7. Part 2 Setting the Stage for Data Analytics
  8. Chapter 5 Data Science Questions and Hypotheses
    1. Importance of Asking (the Right) Questions
      1. Formulating a Hypothesis
    2. Questions Related to Most Common Use Cases
      1. Is Feature X Related to Feature Y?
      2. Is Subset X Significantly Different to Subset Y?
      3. Do Features X and Y Collaborate Well with Each Other for Predicting Variable Z?
      4. Should We Remove X from the Feature Set?
      5. How Similar are Variables X and Y?
      6. Does Variable X Cause Variable Y?
      7. Other Question Types
    3. Questions Not to Ask
    4. Summary
  9. Chapter 6 Data Science Experiments and Evaluation of Their Results
    1. The Importance of Experiments
    2. How to Construct an Experiment
    3. Experiments for Assessing the Performance of a Predictive Analytics System
    4. A Matter of Confidence
    5. Evaluating the Results of an Experiment
    6. Summary
  10. Chapter 7 Sensitivity Analysis of Experiment Conclusions
    1. The Importance of Sensitivity Analysis
    2. The Butterfly Effect
    3. Global Sensitivity Analysis Using Resampling Methods
      1. Bootstrapping
      2. Permutation Methods
      3. Jackknife
      4. Monte Carlo
    4. Local Sensitivity Analysis Employing “What If?” Questions
    5. Some Useful Considerations on Sensitivity Analysis
    6. Summary
  11. Part 3 Common Errors in Data Science
  12. Chapter 8 Programming Bugs
    1. The Importance of Understanding and Dealing with Programming Bugs
    2. Places You Usually Find Bugs
    3. Types of Bugs Commonly Encountered
    4. Some Useful Considerations on Programming Bugs
    5. Summary
  13. Chapter 9 Mistakes Through the Data Science Process
    1. How Mistakes Differ From Bugs
    2. Most Common Types of Mistakes
    3. Choosing the Right Model
    4. Value of a Mentor
    5. Some Useful Considerations on Mistakes
    6. Summary
  14. Chapter 10 Handling Bugs and Mistakes
    1. Strategies for Coping with Bugs
    2. Strategies for Coping with High-level Mistakes
    3. Preventing Erroneous Situations in the Pipeline
      1. Types of Models
      2. Evaluating the Data at Hand and Pairing It with a Model
      3. Choosing the Right Model for a Classification Methodology
      4. Combining Different Options in an Ensemble Setting
      5. Other Considerations for Choosing the Right Model
    4. Summary
  15. Part 4 Other Aspects of Data Science
  16. Chapter 11 The Role of Heuristics in Data Science
    1. Heuristics as Information in the Making
    2. Problems that Require Heuristics
    3. Why Heuristics are Essential for an AI System
    4. Applications of Heuristics in Data Science
      1. Heuristics and Machine Learning Processes
      2. Custom Heuristics and Data Engineering
      3. Heuristics for Feature Evaluation
      4. Other Applications of Heuristics
      5. Anatomy of a Good Heuristic
    5. Some Final Considerations on Heuristics
    6. Summary
  17. Chapter 12 The Role of AI in Data Science
    1. Problems AI Solves
    2. Types of AI Systems Used in Data Science
      1. Deep Learning Networks
      2. Autoencoders
      3. Other Types of AI Systems
    3. AI Systems Using Data Science
      1. Computer Vision
      2. Chatbots
      3. Artificial Creativity
      4. Other AI Systems Using Data Science
    4. Some Final Considerations on AI
    5. Summary
  18. Chapter 13 Data Science Ethics
    1. The Importance of Ethics in Data Science
    2. Confidentiality Matters
      1. Privacy
      2. Data Anonymization
      3. Data Security
    3. Licensing Matters
    4. Other Ethical Matters
    5. Some Final Considerations on Ethics
    6. Summary
  19. Chapter 14 Future Trends and How to Remain Relevant
    1. General Trends in Data Science
      1. The Role of AI in the Years to Come
      2. Big Data: Getting Bigger and More Quantitative
      3. New Programming Paradigms
      4. The Rise of Hadoop Alternatives
      5. Other Trends
    2. Remaining Relevant in the Field
      1. The Versatilist Data Scientist
      2. Data Science Research
      3. The Need to Educate Oneself Continuously
      4. Collaborative Projects
      5. Mentoring
    3. Summary
  20. Final Words
  21. Glossary
  22. Index