Software Engineering for Data Scientists

Book description

Data science happens in code. The ability to write reproducible, robust, scalable code is key to a data science project's success, and is absolutely essential for those working with production code. This practical book bridges the gap between data science and software engineering, and clearly explains how to apply the best practices from software engineering to data science.

Examples are provided in Python, drawn from popular packages such as NumPy and pandas. If you want to write better data science code, this guide covers the essential topics that are often missing from introductory data science or coding classes, including how to:

  • Understand data structures and object-oriented programming
  • Clearly and skillfully document your code
  • Package and share your code
  • Integrate data science code with a larger code base
  • Write APIs
  • Create secure code
  • Apply best practices to common tasks such as testing, error handling, and logging
  • Work more effectively with software engineers
  • Write more efficient, maintainable, and robust code in Python
  • Put your data science projects into production
  • And more

Table of contents

  1. Preface
    1. Who Is This Book For?
    2. Why Python?
    3. What Is Not in This Book
    4. Guide to This Book
      1. Reading Order
    5. Conventions Used in This Book
    6. Using Code Examples
    7. O’Reilly Online Learning
    8. How to Contact Us
    9. Acknowledgments
  2. 1. What Is Good Code?
    1. Why Good Code Matters
    2. Adapting to Changing Requirements
    3. Simplicity
      1. Don’t Repeat Yourself (DRY)
      2. Avoid Verbose Code
    4. Modularity
    5. Readability
      1. Standards and Conventions
      2. Names
      3. Cleaning up
      4. Documentation
    6. Performance
    7. Robustness
      1. Errors and Logging
      2. Testing
    8. Key Takeaways
  3. 2. Analyzing Code Performance
    1. Methods to Improve Performance
    2. Timing Your Code
    3. Profiling Your Code
      1. cProfile
      2. line_profiler
      3. Memory Profiling with Memray
    4. Time Complexity
      1. How to Estimate Time Complexity
      2. Big O Notation
    5. Key Takeaways
  4. 3. Using Data Structures Effectively
    1. Native Python Data Structures
      1. Lists
      2. Tuples
      3. Dictionaries
      4. Sets
    2. NumPy Arrays
      1. NumPy Array Functionality
      2. NumPy Array Performance Considerations
      3. Array Operations Using Dask
      4. Arrays in Machine Learning
    3. pandas DataFrames
      1. DataFrame Functionality
      2. DataFrame Performance Considerations
    4. Key Takeaways
  5. 4. Object-Oriented Programming and Functional Programming
    1. Object-Oriented Programming
      1. Classes, Methods, and Attributes
      2. Defining Your Own Classes
      3. OOP Principles
    2. Functional Programming
      1. Lambda Functions and map()
      2. Applying Functions to DataFrames
    3. Which Paradigm Should I Use?
    4. Key Takeaways
  6. 5. Errors, Logging, and Debugging
    1. Errors in Python
      1. Reading Python Error Messages
      2. Handling Errors
      3. Raising Errors
    2. Logging
      1. What to Log
      2. Logging Configuration
      3. How to Log
    3. Debugging
      1. Strategies for Debugging
      2. Tools for Debugging
    4. Key Takeaways
  7. 6. Code Formatting, Linting, and Type Checking
    1. Code Formatting and Style Guides
      1. PEP8
      2. Import Formatting
      3. Automatic Code Formatting with Black
    2. Linting
      1. Linting Tools
      2. Linting in Your IDE
    3. Type Checking
      1. Type Annotations
      2. Type Checking with mypy
    4. Key Takeaways
  8. 7. Testing Your Code
    1. Why You Should Write Tests
    2. When to Test
    3. How to Write and Run Tests
      1. A Basic Test
      2. Testing Unexpected Inputs
      3. Running Automated Tests with Pytest
    4. Types of Tests
      1. Unit Tests
      2. Integration Tests
    5. Data Validation
      1. Data Validation Examples
      2. Using Pandera for Data Validation
      3. Data Validation with Pydantic
    6. Testing for Machine Learning
      1. Testing Model Training
      2. Testing Model Inference
    7. Key Takeaways
  9. 8. Design and Refactoring
    1. Project Design and Structure
      1. Project Design Considerations
      2. An Example Machine Learning Project
    2. Code Design
      1. Modular Code
      2. A Code Design Framework
      3. Interfaces and Contracts
      4. Coupling
    3. From Notebooks to Scalable Scripts
      1. Why Use Scripts Instead of Notebooks?
      2. Creating Scripts from Notebooks
    4. Refactoring
      1. Strategies for Refactoring
      2. An Example Refactoring Workflow
    5. Key Takeaways
  10. 9. Documentation
    1. Documentation Within the Codebase
      1. Names
      2. Comments
      3. Docstrings
      4. Readmes, Tutorials, and Other Longer Documents
    2. Documentation in Jupyter Notebooks
    3. Documenting Machine Learning Experiments
    4. Key Takeaways
  11. 10. Sharing Your Code: Version Control, Dependencies, and Packaging
    1. Version Control Using Git
      1. How Does Git Work?
      2. Tracking Changes and Committing
      3. Remote and Local
      4. Branches and Pull Requests
    2. Dependencies and Virtual Environments
      1. Virtual Environments
      2. Managing Dependencies with pip
      3. Managing Dependencies with Poetry
    3. Python Packaging
      1. Packaging Basics
      2. pyproject.toml
      3. Building and Uploading Packages
    4. Key Takeaways
  12. 11. APIs
    1. Calling an API
      1. HTTP Methods and Status Codes
      2. Getting Data from the SDG API
    2. Creating Your Own API Using FastAPI
      1. Setting Up the API
      2. Adding Functionality to Your API
      3. Making Requests to Your API
    3. Key Takeaways
  13. 12. Automation and Deployment
    1. Deploying Code
    2. Automation Examples
      1. Pre-Commit Hooks
      2. GitHub Actions
    3. Cloud Deployments
      1. Containers and Docker
      2. Building a Docker Container
      3. Deploying an API on Google Cloud
      4. Deploying an API on Other Cloud Providers
    4. Key Takeaways
  14. 13. Security
    1. What Is Security?
    2. Security Risks
      1. Credentials, Physical Security, and Social Engineering
      2. Third-Party Packages
      3. The Python Pickle Module
      4. Version Control Risks
      5. API Security Risks
    3. Security Practices
      1. Security Reviews and Policies
      2. Secure Coding Tools
      3. Simple Code Scanning
    4. Security for Machine Learning
      1. Attacks on ML Systems
      2. Security Practices for ML Systems
    5. Key Takeaways
  15. 14. Working in Software
    1. Development Principles and Practices
      1. The Software Development Lifecycle
      2. Waterfall Software Development
      3. Agile Software Development
      4. Agile Data Science
    2. Roles in the Software Industry
      1. Software Engineer
      2. QA or Test Engineer
      3. Data Engineer
      4. Data Analyst
      5. Product Manager
      6. UX Researcher
      7. Designer
    3. Community
      1. Open Source
      2. Speaking at Events
      3. The Python Community
    4. Key Takeaways
  16. 15. Next Steps
    1. The Future of Code
    2. Your Future in Code
    3. Thank You
  17. Index
  18. About the Author

Product information

  • Title: Software Engineering for Data Scientists
  • Author(s): Catherine Nelson
  • Release date: April 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098136208