Software Engineering for Data Scientists

Name: Software Engineering for Data Scientists
Author: Catherine Nelson
ISBN: 9781098136208

April 2024

Intermediate to advanced

260 pages

6h 22m

English

O'Reilly Media, Inc.

Preface
Who Is This Book For?; Why Python?; What Is Not in This Book; Guide to This Book; Reading Order; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. What Is Good Code?
Why Good Code Matters; Adapting to Changing Requirements; Simplicity; Don’t Repeat Yourself (DRY); Avoid Verbose Code; Modularity; Readability; Standards and Conventions; Names; Cleaning up; Documentation; Performance; Robustness; Errors and Logging; Testing; Key Takeaways
2. Analyzing Code Performance
Methods to Improve Performance; Timing Your Code; Profiling Your Code; cProfile; line_profiler; Memory Profiling with Memray; Time Complexity; How to Estimate Time Complexity; Big O Notation; Key Takeaways
3. Using Data Structures Effectively
Native Python Data Structures; Lists; Tuples; Dictionaries; Sets; NumPy Arrays; NumPy Array Functionality; NumPy Array Performance Considerations; Array Operations Using Dask; Arrays in Machine Learning; pandas DataFrames; DataFrame Functionality; DataFrame Performance Considerations; Key Takeaways
4. Object-Oriented Programming and Functional Programming
Object-Oriented Programming; Classes, Methods, and Attributes; Defining Your Own Classes; OOP Principles; Functional Programming; Lambda Functions and map(); Applying Functions to DataFrames; Which Paradigm Should I Use?; Key Takeaways
5. Errors, Logging, and Debugging
Errors in Python; Reading Python Error Messages; Handling Errors; Raising Errors; Logging; What to Log; Logging Configuration; How to Log; Debugging; Strategies for Debugging; Tools for Debugging; Key Takeaways
6. Code Formatting, Linting, and Type Checking
Code Formatting and Style Guides; PEP8; Import Formatting; Automatic Code Formatting with Black; Linting; Linting Tools; Linting in Your IDE; Type Checking; Type Annotations; Type Checking with mypy; Key Takeaways
7. Testing Your Code
Why You Should Write Tests; When to Test; How to Write and Run Tests; A Basic Test; Testing Unexpected Inputs; Running Automated Tests with Pytest; Types of Tests; Unit Tests; Integration Tests; Data Validation; Data Validation Examples; Using Pandera for Data Validation; Data Validation with Pydantic; Testing for Machine Learning; Testing Model Training; Testing Model Inference; Key Takeaways
8. Design and Refactoring
Project Design and Structure; Project Design Considerations; An Example Machine Learning Project; Code Design; Modular Code; A Code Design Framework; Interfaces and Contracts; Coupling; From Notebooks to Scalable Scripts; Why Use Scripts Instead of Notebooks?; Creating Scripts from Notebooks; Refactoring; Strategies for Refactoring; An Example Refactoring Workflow; Key Takeaways
9. Documentation
Documentation Within the Codebase; Names; Comments; Docstrings; Readmes, Tutorials, and Other Longer Documents; Documentation in Jupyter Notebooks; Documenting Machine Learning Experiments; Key Takeaways
10. Sharing Your Code: Version Control, Dependencies, and Packaging
Version Control Using Git; How Does Git Work?; Tracking Changes and Committing; Remote and Local; Branches and Pull Requests; Dependencies and Virtual Environments; Virtual Environments; Managing Dependencies with pip; Managing Dependencies with Poetry; Python Packaging; Packaging Basics; pyproject.toml; Building and Uploading Packages; Key Takeaways
11. APIs
Calling an API; HTTP Methods and Status Codes; Getting Data from the SDG API; Creating Your Own API Using FastAPI; Setting Up the API; Adding Functionality to Your API; Making Requests to Your API; Key Takeaways
12. Automation and Deployment
Deploying Code; Automation Examples; Pre-Commit Hooks; GitHub Actions; Cloud Deployments; Containers and Docker; Building a Docker Container; Deploying an API on Google Cloud; Deploying an API on Other Cloud Providers; Key Takeaways
13. Security
What Is Security?; Security Risks; Credentials, Physical Security, and Social Engineering; Third-Party Packages; The Python Pickle Module; Version Control Risks; API Security Risks; Security Practices; Security Reviews and Policies; Secure Coding Tools; Simple Code Scanning; Security for Machine Learning; Attacks on ML Systems; Security Practices for ML Systems; Key Takeaways
14. Working in Software
Development Principles and Practices; The Software Development Lifecycle; Waterfall Software Development; Agile Software Development; Agile Data Science; Roles in the Software Industry; Software Engineer; QA or Test Engineer; Data Engineer; Data Analyst; Product Manager; UX Researcher; Designer; Community; Open Source; Speaking at Events; The Python Community; Key Takeaways
15. Next Steps
The Future of Code; Your Future in Code; Thank You
Index
About the Author

Content preview from Software Engineering for Data Scientists

Chapter 1. What Is Good Code?

This book aims to help you write better code. But first, what makes code “good”? There are a number of ways to think about this: the best code could be the code that runs fastest. Or it could be easiest to read. Another possible definition is that good code is easy to maintain. That is, if the project changes, it should be easy to go back to the code and change it to reflect the new requirements. The requirements for your code will change frequently because of updates to the business problem you’re solving, new research directions, or updates elsewhere in the codebase.

In addition, your code shouldn’t be complex, and it shouldn’t break if it gets an unexpected input. It should be easy to add a simple new feature to your code; if this is hard, it suggests your code is not well written. In this chapter, I’ll introduce aspects of good code and show examples for each. I’ll divide these into five categories: simplicity, modularity, readability, performance, and robustness.

Why Good Code Matters

Good code is especially important when your data science code integrates with a larger system. This could be putting a machine learning model into production, writing packages for wider distribution, or building tools for other data scientists. It’s most useful for larger codebases that will be run repeatedly. As your project grows in size and complexity, the value of good code will increase.

Sometimes, the code you write will be a one-off, a prototype that needs ...


Publisher Resources

ISBN: 9781098136192
