End-to-End Data Science Workflows in Jupyter Notebooks: An Introduction
The Jupyter Notebook is a popular tool for learning and performing data science in Python (and other languages used in data science). What is the Jupyter Notebook and how do you get started?
This tutorial will get you up and running with the Jupyter Notebook, walk you through how to build a data product in Python, and show you how to share your analyses via multiple outputs including presentation slides and web documents.
We’ll also review several other tools in the Jupyter ecosystem, learn how to install and use R (or other programming languages) in the notebook, and learn how to share notebooks with colleagues who do not have access to a Jupyter installation.
What you'll learn and how you can apply it
- A start-to-finish Jupyter Notebook workflow: from installing Jupyter, to creating your data analysis, to ultimately sharing your results
- Additional tools within the Jupyter ecosystem that facilitate collaboration and sharing
- How to incorporate other programming languages (e.g., R) in Jupyter Notebook analyses
This training course is for you because...
- You recently started working with Jupyter Notebooks and want to use the full range of tools available within the Jupyter environment.
- You want a repeatable process for conducting, sharing, and presenting your data science projects.
- You want to share your data science work with friends and colleagues who do not use or do not have access to a Jupyter installation.
Prerequisites
- Basic knowledge of Python
- Comfort with command line basics like navigating directories
- Basic knowledge of data science methodologies and frameworks
Participants enrolled in this course need to have the following software installed on their computers:
- Download and install the Anaconda distribution of Python. You can install either version 2.7 or 3.x, whichever you prefer.
- Create a GitHub account (strongly recommended but not required).
- If you are unable to install software on your computer, you can access a hosted version via the Project Jupyter website (click on “try it in your browser”) or through Microsoft’s Azure Notebooks.
About your instructor
Jamie Whitacre has more than 10 years of experience in scientific computing systems, informatics, data science, and data analysis. Her specialties include integrating research data and systems, streamlining data pipelines, and educating users about data workflows and tools.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Project Jupyter & the Jupyter Ecosystem [30 min]
- “Human in the loop computing”; facilitating collaboration and sharing in data science
- Jupyter’s history and roots in IPython and IPython Notebooks
- Jupyter & NumFOCUS
- Finding Resources on Jupyter.org
- Hosted notebooks: nbviewer / GitHub
- Gallery of Interesting Jupyter Notebooks
- Current Development Work
- Real Time Collaboration
- Contributing to the Jupyter ecosystem via GitHub & enhancement proposals
Notebook [30 min]
- Installing the Anaconda Distribution of Python
- Navigating the Jupyter Notebook
- Quantitative and visual exploratory data analysis in Python
- Connecting to datasets
- Data Visualization packages: matplotlib, seaborn, plotly, Bokeh, Altair
Kernels [15 min]
- Which Python?
- Language-specific notebooks
- Installing the R kernel
- Installing other data science kernels: Scala, Julia
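As a rough illustration of what "installing a kernel" means in practice: each kernel is registered through a small `kernel.json` file that names the language and the command used to launch it. The sketch below is not course material. It writes two illustrative specs into a hypothetical temporary directory and lists them, using only the Python standard library. The helper names (`write_example_spec`, `list_kernelspecs`) and the `argv` values are assumptions chosen for illustration; on a real installation, `jupyter kernelspec list` reports the actual locations.

```python
import json
import tempfile
from pathlib import Path


def write_example_spec(kernels_dir, name, display_name, language, argv):
    """Write an illustrative kernel.json under kernels_dir/name (hypothetical helper)."""
    spec_dir = Path(kernels_dir) / name
    spec_dir.mkdir(parents=True, exist_ok=True)
    spec = {"argv": argv, "display_name": display_name, "language": language}
    (spec_dir / "kernel.json").write_text(json.dumps(spec, indent=1), encoding="utf-8")


def list_kernelspecs(kernels_dir):
    """Return {kernel_name: (display_name, language)} for each kernel.json found."""
    specs = {}
    for spec_file in sorted(Path(kernels_dir).glob("*/kernel.json")):
        spec = json.loads(spec_file.read_text(encoding="utf-8"))
        specs[spec_file.parent.name] = (spec.get("display_name"), spec.get("language"))
    return specs


# Demo in a throwaway directory, so no real Jupyter installation is touched.
with tempfile.TemporaryDirectory() as tmp:
    write_example_spec(
        tmp, "python3", "Python 3", "python",
        ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    )
    write_example_spec(
        tmp, "ir", "R", "R",
        ["R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
    )
    demo_specs = list_kernelspecs(tmp)

for name, (display_name, language) in demo_specs.items():
    print(f"{name}: {display_name} ({language})")
```

The notebook's kernel menu is populated from files like these, which is why one Jupyter installation can offer Python, R, Scala, and Julia side by side.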
nbconvert [30 min]
- Converting your Jupyter Notebook to a slide presentation or HTML document
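For orientation, a conversion of this kind is typically run through the `jupyter nbconvert` command with a `--to` target such as `html` or `slides`. The minimal sketch below only assembles the command line and prints it, so it does not require Jupyter to be installed. The notebook name `analysis.ipynb` and the helper `nbconvert_command` are hypothetical, and the list of formats is a deliberately small subset matching the two outputs named above.

```python
# Subset of nbconvert targets relevant to this section (not the full list).
SUPPORTED_FORMATS = ("html", "slides")


def nbconvert_command(notebook_path, to="html"):
    """Build the argument list for a `jupyter nbconvert` call (hypothetical helper)."""
    if to not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format for this sketch: {to!r}")
    return ["jupyter", "nbconvert", "--to", to, notebook_path]


html_cmd = nbconvert_command("analysis.ipynb", to="html")
slides_cmd = nbconvert_command("analysis.ipynb", to="slides")

print(" ".join(html_cmd))
print(" ".join(slides_cmd))

# To actually run a conversion (requires Jupyter and an existing notebook):
#   import subprocess
#   subprocess.run(html_cmd, check=True)
```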
Sharing notebooks [30 min]
- Working with .ipynb files
- Using Jupyter from the command line
- Azure Notebooks
- Using Anaconda console
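A point that helps when working with `.ipynb` files: a notebook is a plain JSON document, so it can be inspected with ordinary tools even without a Jupyter installation. The sketch below builds a tiny in-memory notebook in the version 4 format, round-trips it through JSON, and pulls out the code cells. The notebook contents and the helper name `code_sources` are made up for illustration.

```python
import json

# A minimal, hypothetical notebook in the nbformat 4 layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Hypothetical analysis\n"],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["total = 1 + 2\n", "print(total)"],
        },
    ],
}


def code_sources(nb):
    """Return the source text of every code cell as a list of strings."""
    sources = []
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            src = cell["source"]
            # The format allows either one string or a list of line strings.
            sources.append(src if isinstance(src, str) else "".join(src))
    return sources


# Serialize and parse again, as if the notebook had been saved to disk and reopened.
serialized = json.dumps(notebook, indent=1)
round_tripped = json.loads(serialized)
extracted = code_sources(round_tripped)

for source in extracted:
    print(source)
```

Because outputs and metadata live in the same JSON file as the code, ordinary line-based diffs of notebooks tend to be noisy, which is the problem that notebook-aware diffing tools address.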
nbdime for diffing notebooks [20 min]
Learn more about other projects in the ecosystem [25 min]
- Real Time Collaboration